There are so many articles out there about what MapReduce is, the basic concepts behind it, how it works, and many other things. Even so, I still want to write a little introduction to MapReduce. For me, at least, writing about something is a necessary part of understanding it. In this post I challenge my own understanding of MapReduce. I'll use some resources available online, as I mentioned earlier. This is just another introduction to MapReduce.
Data, Data, Data
We are living in the cloud era. The Internet provides us with great resources that help our lives. Along the way, we have created a lot of data. Consider a search engine like Google or Bing. They index sites all across the network, and the number of sites these days is big: Netcraft reported that there are more than 200 million sites in the world. That means a search engine must process and analyze a lot of data.
We all know that hardware, especially the CPU, is developing fast. Multi-core processors are easy to find nowadays. Still, processing large-scale data on a single computer is time consuming. This happens because there is a speed gap between the CPU and storage devices: reading from and writing to storage is much slower than the CPU can process the data. So, for large data, distributing the work across several computers working together can be faster than running it on a single computer. This is where distributed computing comes in.
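To get a feel for the storage bottleneck, here is a small back-of-the-envelope sketch in Python. The numbers are my own assumptions, not measurements: 1 TB of data and a disk that reads about 100 MB per second. The function name `scan_time_seconds` is made up for this post, and the model ignores network and coordination costs.

```python
def scan_time_seconds(data_bytes, throughput_bytes_per_sec, machines=1):
    """Time to read the data once when it is split evenly over `machines` disks.

    Simplified model: every machine reads its share in parallel and
    nothing else (network, coordination) costs any time.
    """
    return data_bytes / (throughput_bytes_per_sec * machines)


TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)

# Assumed figures for illustration only: 1 TB of data, 100 MB/s per disk.
single_machine = scan_time_seconds(1 * TB, 100 * MB)
hundred_machines = scan_time_seconds(1 * TB, 100 * MB, machines=100)

print(single_machine)    # 10000.0 seconds, close to three hours
print(hundred_machines)  # 100.0 seconds
```

Under these assumptions, just reading the data takes hours on one machine and under two minutes on a hundred, which is the main reason to spread the work out.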
Distributed computing is the use of multiple autonomous computers that work together and communicate over a computer network. Each computer processes its part of the given task concurrently, in its own memory. This method is good at handling large-scale data, but there are some challenges that must be considered. One of the common problems in distributed environments is network failure, which can be caused by a broken link or a router error. It can affect data transmission and disrupt the entire process.
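The idea of splitting a task into parts that are processed concurrently can be sketched in a few lines of Python. This is only a simulation on one machine: threads stand in for the separate computers, and the helper names (`split_into_chunks`, `count_words`, `distributed_word_count`) are hypothetical names I picked for this post.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_workers):
    """Split items into num_workers nearly equal, contiguous chunks."""
    chunk_size, remainder = divmod(len(items), num_workers)
    chunks = []
    start = 0
    for i in range(num_workers):
        # The first `remainder` chunks get one extra item each.
        end = start + chunk_size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Work done by one simulated node: count the words in its own chunk."""
    return sum(len(line.split()) for line in lines)


def distributed_word_count(lines, num_workers=3):
    """Give each simulated node one chunk, then combine the partial counts."""
    chunks = split_into_chunks(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


lines = ["a b c", "d e", "f", "g h i j"]
print(distributed_word_count(lines))  # 10
```

Each worker only ever sees its own chunk, and the partial results are combined at the end. In a real cluster the chunks would travel over the network, which is exactly where the failures described above can hurt.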
The most challenging problem in distributed computing is synchronization between multiple machines. This covers how to prevent deadlocks and race conditions in the transmission process. Another thing to consider is how to handle node failures. If one node among hundreds fails, how do we move its jobs to the other running nodes, or restart the computation? This is not a trivial task. This is where MapReduce comes in handy.
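To make the node failure question a bit more concrete, here is a toy sketch of moving jobs away from failed nodes. It is my own simplification and not how MapReduce actually does it: the node names, the `failed_nodes` set, and the function `run_with_reassignment` are all made up, and a real system would have to detect failures over the network instead of being handed a list.

```python
def run_with_reassignment(tasks, nodes, failed_nodes):
    """Run each task on a healthy node, skipping nodes that are down.

    tasks: dict mapping a task name to a zero-argument callable
    nodes: list of (hypothetical) node names
    failed_nodes: set of node names treated as failed
    Returns two dicts keyed by task name: the results and the node used.
    """
    results = {}
    placement = {}
    for index, (name, task) in enumerate(tasks.items()):
        for attempt in range(len(nodes)):
            # Start from the task's preferred node, then try the next ones.
            node = nodes[(index + attempt) % len(nodes)]
            if node in failed_nodes:
                continue  # this node is down: move the job to another node
            results[name] = task()
            placement[name] = node
            break
        else:
            raise RuntimeError(f"no healthy node left for task {name!r}")
    return results, placement


tasks = {"t0": lambda: 1, "t1": lambda: 2, "t2": lambda: 3}
results, placement = run_with_reassignment(
    tasks, nodes=["n0", "n1", "n2"], failed_nodes={"n1"}
)
print(placement)  # {'t0': 'n0', 't1': 'n2', 't2': 'n2'}
```

Even this toy version has to decide where a job goes next and what to do when no node is left. Doing the same thing across hundreds of real machines, with partial results and jobs that die halfway, is much harder.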
To be continued in part 2.