Hello there? S’up?
On my previous post, we’ve learned how to develop Hadoop MapReduce application in Netbeans. After our application run well on the Netbeans, now it’s the time to deploy it on cluster of computers. Well, it supposed to be multi node cluster, but for now, let’s try it on a single node cluster. This article will give a step-by-step guide on how to deploy MapReduce application on a single node cluster.
In this tutorial, I’m using Ubuntu 9.10 Karmic Koala. For the Hadoop MapReduce application, I’ll use the code from my previous post. You can try it by yourself or you can just download the jar file. Are you ready? Let’s go then..
Preparing the Environment
First time first, we must preparing the deploying environment. We must install and configure all the software required. For this process, I followed a great tutorial by Michael Noll about how to run Hadoop on single node cluster. For simplicity, I’ll write a summary of all the steps mentioned on Michael’s post. I do recommend you to read it for the details. Continue reading
What? NoSQL? Yeah, you read it correctly. NoSQL. I forgot when and where I heard about this for the first time. But I noticed about this data store technology again when I was attending the second Bancakan 2.0 meet up in last March. When I listened to the speaker, lynxluna, I remember about HBase, a scalable distributed database that becomes part of Apache Hadoop project. For your own sake, Apache Hadoop is just one implementation of MapReduce framework.
What is NoSQL?
So, what the hell is NoSQL? Here is the definition of NoSQL in Wikipedia:
NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage.
Continue reading
Some days ago, there’s a vacancy offer in my undergraduate department mailing list. A company is looking for a programmer. I didn’t pay much attention to this email. Okay, here’s the email:
Mr. XXX, my office needs a programmer with this qualification:
- Have knowledge in VB, Java, and PHP
- Have any experiences as a programmer/developer for at least 1 year in IT division or in IT company or software house
- Have an ability to give product presentation to potential clients
- Have knowledge in CorelDRAW and Photoshop
- Have knowledge in Linux
- Have knowledge in building computer networks
- Have knowledge in hardware
Continue reading
I’m sorry for the long delay from the first part. I’ve been pretty busy lately. On this part, I write about the idea of MapReduce, how is it work, and how it distributes the data and process. This article is heavily referenced from MapReduce paper by Google. I write it again to deepen my knowledge about the concept. Enjoy!
What is MapReduce?
According to Wikipedia, MapReduce is a software framework patented by Google to support distributed computing on large data sets on clusters of computers. This framework is presented by Jeffery Dean and Sanjay Ghemawat in OSDI’04: Sixth Symposium on Operating System Design and Implementation on December 2004. The main idea is to utilize functional programming techniques, to obtain processing simplification in distributed environment.
MapReduce processing data using list concept that usually used in functional programming. The process consists of two function, map and reduce function. Each function take list of input elements and produce list of output. Map function take inputs and produce intermediate key-value pairs. These pairs then sent to the reduce function. The reduce function take these intermediate key-value pairs as a input. Then, for the same intermediate key, the function merges together the values to produce output. According to the paper, for every reduce invocation typically produces zero or one output value. Continue reading
There are so many article outside about what is MapReduce, the basic concepts behind it, how it works, and many other things. Even that, I still wanna write a little introduction to MapReduce. It’s mandatory, at least for me, to write about “something” in order to understand the “something”. I challenge my understanding about MapReduce in this post. I’ll use some resources available on the clouds like I mentioned earlier. This is just another introduction to MapReduce.
Data, Data, Data
We are living in the clouds era. Internet provide us with such a great resource to help our lives. In the progress, we created a lot of data. Consider a search engine like Google or Bing. They indexed all of sites across the network. If we are talking about sites these days, that’s a big number we are talking about. Netcraft reported that there are more than 200 Millions sites in the world. It means the search engine must process and analysis a lot of data. Continue reading