Yet Another Introduction to MapReduce (part 2)

I’m sorry for the long delay since the first part; I’ve been pretty busy lately. In this part, I write about the idea of MapReduce, how it works, and how it distributes data and processing. This article draws heavily on the MapReduce paper by Google; I am writing it out again to deepen my own understanding of the concept. Enjoy!

What is MapReduce?

According to Wikipedia, MapReduce is a software framework patented by Google to support distributed computing on large data sets across clusters of computers. The framework was presented by Jeffrey Dean and Sanjay Ghemawat at OSDI ’04: Sixth Symposium on Operating System Design and Implementation, in December 2004. The main idea is to use functional programming techniques to simplify processing in a distributed environment.

MapReduce processes data using the list concept common in functional programming. The processing consists of two functions: map and reduce. Each function takes a list of input elements and produces a list of outputs. The map function takes the input and produces intermediate key-value pairs. These pairs are then sent to the reduce function, which takes them as its input and, for each intermediate key, merges the associated values together to produce the output. According to the paper, each reduce invocation typically produces zero or one output value.
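
To make that functional-programming idea concrete, here is a tiny Python sketch. This is not MapReduce itself, just the list-based map and reduce operations that the framework borrows its names from; the numbers are made up for illustration:

from functools import reduce

# map transforms each element of a list; reduce merges a list of
# values into a single result.
numbers = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16]
total = reduce(lambda acc, x: acc + x, squares)  # 1 + 4 + 9 + 16 = 30
print(squares, total)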

Let’s take a look at the most popular example of a MapReduce algorithm: word count. This algorithm takes some text documents and returns every word together with the number of times it appears across the documents. The map function takes the text of the documents as input and produces intermediate key-value pairs: each word becomes a key and “1” becomes its value, meaning one occurrence of that word was found in the documents. These key-value pairs are then sent to the reduce function. The reduce function merges the values for each intermediate key; that is, for each word it sums up the “1” values. When the reduce function finishes, the value is the total count of the corresponding word.

This is how it looks in pseudocode, as given in the Hadoop Tutorial by Yahoo!:

mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
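
For readers who want to run something, here is a minimal single-machine Python sketch of the same algorithm. The grouping step in the middle stands in for the shuffle that the framework performs between the map and reduce phases, and the file names and documents are made up for illustration:

from collections import defaultdict

def mapper(filename, file_contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    # Merge all the "1" values emitted for the same word.
    return (word, sum(values))

def word_count(documents):
    # Group intermediate values by key, standing in for the shuffle the
    # framework performs between the map and reduce phases.
    groups = defaultdict(list)
    for filename, contents in documents.items():
        for word, count in mapper(filename, contents):
            groups[word].append(count)
    return dict(reducer(word, values) for word, values in groups.items())

docs = {"a.txt": "to be or not to be", "b.txt": "to do or not"}
print(word_count(docs))  # {'to': 3, 'be': 2, 'or': 2, 'not': 2, 'do': 1}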

Process and Data Distribution

The main idea behind parallel programming is the distribution of processing and data. In the MapReduce framework there are two kinds of tasks: map and reduce. These tasks are assigned to worker machines by a single master machine (possibly with a backup). The workers work on their own tasks, while the master periodically pings every worker to make sure everything is still working.

The input data is processed by many workers. The input for the map tasks is partitioned into M splits, and each split is then distributed to and processed by a worker. For the reduce tasks, the intermediate key space is partitioned into R pieces that are processed separately. The master machine picks idle workers to work on the map or reduce tasks.

A worker assigned a map split reads the input data and applies the map function to produce intermediate key-value pairs. These pairs are buffered in memory and periodically written to local disk, partitioned into R regions. The worker then notifies the master of the location of the pairs.
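
As an illustration of how a map worker might decide which of the R regions a pair belongs to, here is a sketch of the paper’s default partitioning function, hash(key) mod R. The value of R and the use of CRC32 as a stable hash are my own assumptions for the example:

import zlib

R = 10  # number of reduce partitions (an assumed value for illustration)

def partition(key, r=R):
    # A stable hash keeps the same key in the same region on every run;
    # the paper's default partitioning function is hash(key) mod R.
    return zlib.crc32(key.encode("utf-8")) % r

print(partition("mapreduce"))  # a region index between 0 and R - 1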

The master, once notified by a map worker, forwards the location information to the reduce workers. A reduce worker uses remote procedure calls to read the data from the local disks of the map workers. After all the data has been read, the reduce worker sorts it by intermediate key. The paper notes that this sorting is needed because typically many different keys map to the same reduce task. After sorting, the reduce worker iterates over the intermediate keys and passes each key, along with its values, to the reduce function. The output of the reduce function is appended to the final output file for the corresponding partition. After all the map and reduce tasks have finished, the master machine returns control to the user program.
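
Here is a sketch of that reduce-side step, with made-up pairs standing in for what a reduce worker would have fetched over RPC: sort by key so equal keys become adjacent, then feed each key and its run of values to the reduce function:

from itertools import groupby
from operator import itemgetter

def run_reduce(pairs, reduce_fn):
    # Sort by intermediate key so occurrences of the same key are
    # adjacent, then pass each key and its values to the reduce function.
    pairs.sort(key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield reduce_fn(key, [value for _, value in group])

fetched = [("be", 1), ("to", 1), ("be", 1), ("to", 1), ("to", 1)]
print(list(run_reduce(fetched, lambda word, values: (word, sum(values)))))
# [('be', 2), ('to', 3)]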

Next, in part 3: failure handling, plus a list of references and resources…

Image: a computer lab at The International School Bangalore, by Kprateek 88, licensed under the GNU Free Documentation License, version 1.2 or any later version.

One thought on “Yet Another Introduction to MapReduce (part 2)”

  1. I’d like to ask a question about the MapReduce process:

    If we have 10 computer nodes, with Map = 1000 and Reduce = 10,
    roughly how many disk accesses via RPC do you think there would be?

    I’m a bit confused when trying to count it, in relation to how the roles of those 10 computer nodes are divided up to serve as the Master, Map Workers, or Reduce Workers.

    Could you perhaps help me find a way to solve it?

    Thanks in advance.
