About Me

It's me, Arif N. In this blog I'll write about my adventures with computers, programming, and anything else that I find interesting. I wish you happy reading.. :D

Category: MapReduce

During the development of a Hadoop MapReduce program, you will need to test your program on a real cluster with a small data set to make sure that it works correctly. To do that, you must package your application into a jar file, then run it with the hadoop jar command in the terminal. Then you check the output directory of your program: are the outputs correct? If not, you must delete the output directory in HDFS, check and repair your program, then start the build jar, run Hadoop, check output cycle again. Once or twice, that's okay. But during development we will surely make a hell of a lot of mistakes in our program. Repeating the build jar, run Hadoop, check output, delete output directory cycle could take a lot of time, not to mention the typos you make when you type Hadoop shell commands. To make this testing process easier, we can use Karmasphere, a Hadoop plugin for the NetBeans IDE. This article is about how to test your Hadoop program on a real cluster easily using NetBeans.
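To give a feel for the manual cycle described above, here is a minimal sketch that builds the two shell commands involved (delete the old output directory, then run the jar). The jar name, driver class, and directory names are hypothetical, and I'm assuming the job takes its input and output directories as its two arguments and that your Hadoop version still uses the older `hadoop fs -rmr` syntax. By default it only prints the commands; it runs them only if you pass `--execute`.

```java
import java.util.Arrays;
import java.util.List;

/** Builds the two Hadoop shell commands used in the manual test cycle. */
class HadoopTestCycle {

    /** Command that deletes the old output directory in HDFS (older "-rmr" syntax). */
    static List<String> deleteOutputCommand(String outputDir) {
        return Arrays.asList("hadoop", "fs", "-rmr", outputDir);
    }

    /** Command that runs the packaged job; assumes the job takes input and output dirs. */
    static List<String> runJobCommand(String jar, String mainClass,
                                      String inputDir, String outputDir) {
        return Arrays.asList("hadoop", "jar", jar, mainClass, inputDir, outputDir);
    }

    /** Runs one command, streams its output to the console, and returns the exit code. */
    static int run(List<String> command) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical names: replace with your own jar, driver class, and HDFS paths.
        String jar = "wordcount.jar";
        String mainClass = "WordCount";
        String input = "input";
        String output = "output";

        List<String> delete = deleteOutputCommand(output);
        List<String> job = runJobCommand(jar, mainClass, input, output);
        System.out.println(String.join(" ", delete));
        System.out.println(String.join(" ", job));

        // Pass "--execute" to really run the commands (needs hadoop on the PATH).
        if (args.length > 0 && args[0].equals("--execute")) {
            run(delete); // exit code ignored: the directory may not exist yet
            System.exit(run(job));
        }
    }
}
```

Even with a helper like this you still have to rebuild the jar and inspect the output yourself, which is why an IDE plugin is more comfortable.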

Learn more ...

The Three Modes of Hadoop

As you may already know, we can configure and use Hadoop in three modes. These modes are:

Standalone mode

This is the default mode that you get when you download and extract Hadoop for the first time. In this mode, Hadoop doesn't use HDFS to store input and output files. It just uses the local filesystem. This mode is very useful for debugging your MapReduce code before you deploy it on a large cluster and handle huge amounts of data. In this mode, Hadoop's configuration file triplet (mapred-site.xml, core-site.xml, hdfs-site.xml) is still free from custom configuration.

Pseudo distributed mode (or single node cluster)

In this mode, we configure the configuration triplet to run a cluster on a single node. The replication factor of HDFS is one, because we only use one node as Master Node, Data Node, Job Tracker, and Task Tracker. We can use this mode to test our code on a real HDFS without the complexity of a fully distributed cluster. I've already covered the configuration process in my previous post.

Fully distributed mode (or multiple node cluster)

In this mode, we use Hadoop at its full scale. We can use a cluster that consists of a thousand nodes working together. This is the production phase, where your code and data are distributed across many nodes. You use this mode when your code is ready and works properly in the previous mode.
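As a rough illustration, the difference between the three modes mostly comes down to a few properties in the configuration triplet. The sketch below just lists typical values in plain Java (it doesn't use the Hadoop API). The property names are the ones from the Hadoop 0.20 era, the localhost ports are only a common choice, and the host names in the fully distributed case are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Lists typical (assumed) property values for each of the three Hadoop modes. */
class HadoopModes {

    static Map<String, String> settingsFor(String mode) {
        Map<String, String> settings = new LinkedHashMap<>();
        switch (mode) {
            case "standalone":
                // Defaults: local filesystem, jobs run in a single local JVM.
                settings.put("fs.default.name", "file:///");        // core-site.xml
                settings.put("mapred.job.tracker", "local");        // mapred-site.xml
                break;
            case "pseudo-distributed":
                // All daemons on one machine; ports are a common choice, not mandatory.
                settings.put("fs.default.name", "hdfs://localhost:9000");
                settings.put("mapred.job.tracker", "localhost:9001");
                settings.put("dfs.replication", "1");               // hdfs-site.xml
                break;
            case "fully-distributed":
                // Hypothetical host names, replace with your own master nodes.
                settings.put("fs.default.name", "hdfs://namenode-host:9000");
                settings.put("mapred.job.tracker", "jobtracker-host:9001");
                settings.put("dfs.replication", "3");               // HDFS default
                break;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
        return settings;
    }

    public static void main(String[] args) {
        for (String mode : new String[] {"standalone", "pseudo-distributed", "fully-distributed"}) {
            System.out.println(mode + " -> " + settingsFor(mode));
        }
    }
}
```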

Learn more ...

Hello, it has been a while since I updated this blog. I've been a little busy with college stuff and things like that. And finally, I have come to the last year of my graduate study. After some consultations with professors at my college, I got something as my research focus. Actually, it's still at the proposal stage, but I hope this will work, because so many people are counting on me.

So, I wanna implement MapReduce to optimize processing in automatic part-of-speech tagging (POS tagging). POS tagging is the process of assigning a word class (part of speech) to each word in a collection of text documents. To make the process automatic, we can use approaches that involve natural language processing techniques. Some of these approaches use supervised learning, which means we need to train the models on a tagged corpus before we use them to tag real-world text documents. We can use MapReduce to optimize both the learning and the actual tagging process.

Since this is my first time dealing with (yeah) MapReduce and natural language processing, I feel a little bit anxious. My anxiety is even taking over my excitement already. Hearing this, maybe you'll ask how come I feel more anxiety than excitement. The answer is "I don't know", but I hope this will work out and I can finish the research on time. Oh, maybe it's because of the time constraint. Well, if we didn't have a time constraint, when would we ever start doing the work?

Well, this is just me rambling. Thank you to all the readers who have left questions, comments, and anything else on this blog. I hope we can keep in touch. Wish me luck. I'll write about my research little by little in this blog. So, stay tuned.. And let's get started!!

Hello there? S’up?

In my previous post, we learned how to develop a Hadoop MapReduce application in NetBeans. Now that our application runs well in NetBeans, it's time to deploy it on a cluster of computers. Well, it's supposed to be a multi-node cluster, but for now, let's try it on a single node cluster. This article gives a step-by-step guide on how to deploy a MapReduce application on a single node cluster.

In this tutorial, I'm using Ubuntu 9.10 Karmic Koala. For the Hadoop MapReduce application, I'll use the code from my previous post. You can try it yourself or you can just download the jar file. Are you ready? Let's go then..

Preparing the Environment

First things first, we must prepare the deployment environment. We must install and configure all the required software. For this process, I followed a great tutorial by Michael Noll about how to run Hadoop on a single node cluster. For simplicity, I'll write a summary of all the steps mentioned in Michael's post. I do recommend that you read it for the details.

Learn more ...

I'm sorry for the long delay since the first part. I've been pretty busy lately. In this part, I write about the idea of MapReduce, how it works, and how it distributes data and processing. This article draws heavily on the MapReduce paper by Google. I'm writing it up again to deepen my knowledge of the concept. Enjoy!

What is MapReduce?

According to Wikipedia, MapReduce is a software framework patented by Google to support distributed computing on large data sets on clusters of computers. This framework was presented by Jeffrey Dean and Sanjay Ghemawat at OSDI'04: Sixth Symposium on Operating System Design and Implementation in December 2004. The main idea is to use functional programming techniques to simplify processing in a distributed environment.

MapReduce processes data using the list concept that is commonly used in functional programming. The process consists of two functions, map and reduce. Each function takes a list of input elements and produces a list of outputs. The map function takes the inputs and produces intermediate key-value pairs. These pairs are then sent to the reduce function, which takes them as its input. Then, for each intermediate key, the reduce function merges the values together to produce the output. According to the paper, each reduce invocation typically produces zero or one output value.
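To make the flow above more concrete, here is a small word count sketch in plain Java. It runs in a single process and doesn't use the Hadoop API at all. The class and method names are my own, and the grouping step in the middle stands in for what the framework does between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of the map -> group by key -> reduce flow. */
class WordCountSketch {

    /** Map: one input line becomes a list of intermediate (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce: all values of one intermediate key are merged into one output value. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Drives the whole flow: map every line, group pairs by key, reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        // Grouping by key is normally done by the framework between map and reduce.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, mapreduce=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello mapreduce")));
    }
}
```

Here each reduce call produces exactly one value (the count) for its key, which matches the "zero or one output value" remark from the paper.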

Learn more ...