I’m sorry for the long delay since the first part; I’ve been pretty busy lately. In this part, I write about the idea of MapReduce: how it works, and how it distributes the data and the processing. This article is heavily based on the MapReduce paper by Google. I am writing it up again to deepen my own understanding of the concept. Enjoy!
What is MapReduce?
According to Wikipedia, MapReduce is a software framework patented by Google to support distributed computing on large data sets across clusters of computers. The framework was presented by Jeffrey Dean and Sanjay Ghemawat at OSDI’04: Sixth Symposium on Operating System Design and Implementation, in December 2004. The main idea is to use functional programming techniques to simplify processing in a distributed environment.
MapReduce processes data using the list concepts commonly found in functional programming. The process consists of two functions: map and reduce. Each function takes a list of input elements and produces a list of outputs. The map function takes the inputs and produces intermediate key-value pairs. These pairs are then sent to the reduce function, which takes the intermediate key-value pairs as its input and, for each intermediate key, merges the values together to produce the output. According to the paper, each reduce invocation typically produces zero or one output value.
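To make the two functions concrete, here is a minimal single-machine sketch of the model using the word-count example from the paper. The function names (`map_fn`, `reduce_fn`, `map_reduce`) and the grouping step are my own illustration of the concept, not code from the paper or from Google’s actual framework:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: merge together all values for one intermediate key."""
    return (word, sum(counts))

def map_reduce(documents):
    # Map phase: apply map_fn to every input element and
    # group the intermediate pairs by their key.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    # Reduce phase: one reduce_fn invocation per distinct key,
    # each producing a single output value.
    return dict(reduce_fn(key, values)
                for key, values in intermediate.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(docs))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In the real framework this grouping-by-key step is the distributed shuffle: the map and reduce calls run in parallel on many machines, but the programmer only writes the two functions above.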