Hello there! ’Sup?
In my previous post, we learned how to develop a Hadoop MapReduce application in Netbeans. Now that our application runs well in Netbeans, it’s time to deploy it on a cluster of computers. It’s supposed to be a multi node cluster, but for now, let’s try it on a single node. This article gives a step-by-step guide on how to deploy a MapReduce application on a single node cluster.
In this tutorial, I’m using Ubuntu 9.10 Karmic Koala. For the Hadoop MapReduce application, I’ll use the code from my previous post. You can build it yourself or just download the jar file. Are you ready? Let’s go then..
Preparing the Environment
First things first, we must prepare the deployment environment by installing and configuring all the required software. For this process, I followed a great tutorial by Michael Noll on how to run Hadoop on a single node cluster. For simplicity, I’ll write a summary of all the steps mentioned in Michael’s post, but I do recommend reading it for the details.
First, we will create a dedicated user account for running Hadoop. Keeping it separate from other accounts helps for many reasons, including security concerns. We will also create a dedicated group for this account. Let’s call both the user and the group hadoop. This is how to do it.
- Open your terminal.
- On the terminal, type these commands:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
$ sudo adduser hadoop admin
The first command creates a new group called hadoop. The second command creates a new account called hadoop and assigns it to the hadoop group. The third command adds the hadoop account to the admin group, which makes the hadoop account able to use the sudo command. This is just for practical reasons.
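If you want to double check the new account, the standard id command shows its group memberships; hadoop should appear as the primary group and admin among the supplementary groups:
$ id hadoop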
Next, we will configure SSH access for Hadoop. As Noll explains, Hadoop uses SSH to control its nodes. Because we are configuring Hadoop as a single node cluster, we’ll configure SSH access to localhost for the hadoop account we created earlier. This is how to do it.
- We will generate an SSH key for the hadoop user we created.
user@computer:~$ su - hadoop
hadoop@computer:~$ ssh-keygen -t rsa -P ""
The first command switches from your user account to the hadoop account. The second command generates an RSA key pair for the hadoop account with an empty password. An empty password is not recommended and is only used here for practical purposes.
- Next, we will enable local access using the newly generated RSA key pair:
hadoop@computer:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- The last step is to test the SSH setup by connecting to localhost as the hadoop user. This step also saves the host machine’s fingerprint to the list of known hosts.
hadoop@computer:~$ ssh localhost
For more information, or if the SSH setup fails, you can refer to Noll’s post.
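A quick aside from my own experience (not from Noll’s post): if ssh localhost still asks for a password, the usual culprit is permissions on the .ssh directory, and these standard fixes are worth a try:
hadoop@computer:~$ chmod 700 $HOME/.ssh
hadoop@computer:~$ chmod 600 $HOME/.ssh/authorized_keys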
Next, we need to disable IPv6. To do so, open the file /etc/sysctl.conf and add these lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Restart your computer for this setting to take effect. To test whether IPv6 is still enabled or not, we can use this command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
If IPv6 is disabled, the command will return 1. Now, we’re ready to install Hadoop.
Hadoop Installation and Configuration
In order to make Hadoop work, you must have Java 1.6.x installed; in this article, I assume you already do. After that, you must obtain Hadoop MapReduce. In this tutorial, I used Hadoop version 0.20.1 from the Apache project, which you can download from its official website. This is how to set it up.
- Place the Hadoop archive file in your preferred location. In this article, I followed Noll’s post and placed it in /usr/local.
- Open the terminal, and type these commands:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.1.tar.gz
$ sudo mv hadoop-0.20.1 hadoop
$ sudo chown -R hadoop:hadoop hadoop
The first command changes to the /usr/local directory, the place where you want to install Hadoop. The second command extracts the Hadoop archive file in this directory. The third command renames the extracted directory from hadoop-0.20.1 to hadoop, to make it easier when we are executing Hadoop’s commands later. The last command changes the owner of the hadoop directory to the hadoop user and hadoop group.
After we successfully installed Hadoop MapReduce, we must configure it. To configure Hadoop, we must edit some of its configuration files. You can refer to Hadoop’s Getting Started page for the details of these configuration files. For now, we will configure them as follows.
First, we must set the Java path we intend Hadoop to use. Locate your Java installation directory; typically it is /usr/lib/jvm/java-6-sun. Add this location to the JAVA_HOME parameter in the hadoop-env.sh file, located (in this case) in /usr/local/hadoop/conf:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
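If you’re not sure where your Java installation lives, a trick that usually works on Ubuntu (my own addition, not from Noll’s post):
$ readlink -f $(which java)
# prints something like /usr/lib/jvm/java-6-sun-1.6.0.15/jre/bin/java;
# strip the trailing /jre/bin/java and you have your JAVA_HOME
Once JAVA_HOME is set, you can verify that the hadoop script runs at all:
$ cd /usr/local/hadoop
$ bin/hadoop version
It should report Hadoop 0.20.1.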
Next, we will set all of Hadoop’s XML configuration files, which are located in the conf/ directory. The function of each file is described below:
- core-site.xml: configuration settings for the Hadoop core.
- mapred-site.xml: configuration settings for the MapReduce daemons, i.e. the jobtracker and the tasktrackers.
- hdfs-site.xml: configuration settings for the HDFS daemons, such as data replication, the name node, the secondary name node, and the data nodes.
The configurable properties of each file are documented on the Hadoop website.
First, we configure the core-site.xml file:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-datastore/hadoop</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the host,
    port, etc. for a filesystem.</description>
  </property>
</configuration>
In this configuration, I set /usr/local/hadoop-datastore/hadoop as the base for Hadoop’s temporary directories. Because this directory doesn’t exist yet, I’ll create it with these commands:
$ sudo mkdir -p /usr/local/hadoop-datastore/hadoop
$ sudo chown hadoop:hadoop /usr/local/hadoop-datastore/hadoop
$ sudo chmod 750 /usr/local/hadoop-datastore/hadoop
The first command creates the directory (the -p flag also creates the missing parent directory). The second command changes the ownership of the directory to hadoop:hadoop, meaning the hadoop user from the hadoop group. The third command gives the hadoop user the necessary permissions to read, modify, and enter the directory.
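To confirm the permissions came out as intended, we can inspect the directory; mode 750 should show up as drwxr-x---:
$ ls -ld /usr/local/hadoop-datastore/hadoop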
The second property sets the name of the default file system. Hadoop has many file system types implemented from the abstract class org.apache.hadoop.fs.FileSystem. The implementation we’re going to use here is the local HDFS on port 54310. Actually, you can set the port to anything you like, as long as you remember it for later use.
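One consequence worth knowing (to the best of my understanding of Hadoop’s path resolution): relative HDFS paths we type later are resolved against this fs.default.name plus the user’s HDFS home directory, so once the cluster is up these two commands should be equivalent:
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -ls input
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -ls hdfs://localhost:54310/user/hadoop/input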
Next, we configure the mapred-site.xml file:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>
</configuration>
In this configuration file, we set the host and port of the job tracker, in our case localhost:54311. (If the value were "local" instead, jobs would run in-process as a single map and reduce task.)
Then, we configure the hdfs-site.xml file:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified when the
    file is created. The default is used if replication is not specified in
    create time.</description>
  </property>
</configuration>
In this configuration file, we set the dfs replication to one, since we’re deploying on a single node. The default replication is 3, which suits a multi node cluster.
Formatting and Running Hadoop
After configuring the Hadoop installation, we need to format the Hadoop file system. We only need to format it the first time we set up the installation. This is how to do it.
- Open your terminal.
- On your terminal, type these commands:
hadoop@computer:~$ cd /usr/local/hadoop
hadoop@computer:/usr/local/hadoop$ bin/hadoop namenode -format
The first command changes from the hadoop home directory to the directory where we installed Hadoop. The second command formats the name node’s file system.
Finally, we’re ready to run Hadoop. To start the MapReduce and DFS daemons, type this command on the terminal:
hadoop@computer:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-computer.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-computer.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-computer.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-computer.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-computer.out
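For completeness: when you’re done working with the cluster, the matching script stops all five daemons again:
hadoop@computer:/usr/local/hadoop$ bin/stop-all.sh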
To check whether the Hadoop daemons are running, you can use the jps tool that ships with the Sun JDK; it should list the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker processes.
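The daemons also serve web interfaces: in Hadoop 0.20.x the name node UI is at http://localhost:50070 and the job tracker UI at http://localhost:50030. You can open them in a browser, or probe them from the terminal (assuming curl is installed):
hadoop@computer:/usr/local/hadoop$ curl -s http://localhost:50070/ > /dev/null && echo "namenode UI is up"
hadoop@computer:/usr/local/hadoop$ curl -s http://localhost:50030/ > /dev/null && echo "jobtracker UI is up"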
Now that Hadoop is running, we can continue to deploy the WordCount application we wrote earlier. If you followed my last tutorial (as I mentioned at the beginning of this post), you can get WordCount.jar from the /your-path-to-NetbeansProject/WordCount/dist directory. After that, copy it to the Hadoop installation directory (/usr/local/hadoop in our case).
After that, we need to copy the input directory to HDFS. Following the last tutorial, the input directory is /home/username/input; since we’re working as the hadoop user here, that is /home/hadoop/input. This is how to copy the input directory and its contents to HDFS:
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /home/hadoop/input input
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -ls input
Found 2 items
-rw-r--r--   3 hadoop supergroup   22 2010-07-29 20:52 /user/hadoop/input/file01
-rw-r--r--   3 hadoop supergroup   24 2010-07-29 20:52 /user/hadoop/input/file02
The first command copies the local folder /home/hadoop/input to the input directory on the running HDFS. The second command lists the contents of the input directory on HDFS.
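If you want to peek at what actually landed on HDFS, dfs -cat prints a file straight to the terminal; for example:
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -cat input/file01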
Now, let’s run the WordCount application on our Hadoop installation. This is how to do it:
hadoop@computer:/usr/local/hadoop$ bin/hadoop jar WordCount.jar input output
10/07/29 21:06:16 INFO mapred.FileInputFormat: Total input paths to process : 2
10/07/29 21:06:17 INFO mapred.JobClient: Running job: job_201007292027_0002
10/07/29 21:06:18 INFO mapred.JobClient: map 0% reduce 0%
10/07/29 21:06:27 INFO mapred.JobClient: map 100% reduce 0%
10/07/29 21:06:39 INFO mapred.JobClient: map 100% reduce 100%
10/07/29 21:06:41 INFO mapred.JobClient: Job complete: job_201007292027_0002
10/07/29 21:06:41 INFO mapred.JobClient: Counters: 18
10/07/29 21:06:41 INFO mapred.JobClient:   Job Counters
10/07/29 21:06:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/07/29 21:06:41 INFO mapred.JobClient:     Launched map tasks=2
10/07/29 21:06:41 INFO mapred.JobClient:     Data-local map tasks=2
10/07/29 21:06:41 INFO mapred.JobClient:   FileSystemCounters
10/07/29 21:06:41 INFO mapred.JobClient:     FILE_BYTES_READ=100
10/07/29 21:06:41 INFO mapred.JobClient:     HDFS_BYTES_READ=46
10/07/29 21:06:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=270
10/07/29 21:06:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=31
10/07/29 21:06:41 INFO mapred.JobClient:   Map-Reduce Framework
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce input groups=4
10/07/29 21:06:41 INFO mapred.JobClient:     Combine output records=0
10/07/29 21:06:41 INFO mapred.JobClient:     Map input records=2
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce shuffle bytes=54
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce output records=4
10/07/29 21:06:41 INFO mapred.JobClient:     Spilled Records=16
10/07/29 21:06:41 INFO mapred.JobClient:     Map output bytes=78
10/07/29 21:06:41 INFO mapred.JobClient:     Map input bytes=46
10/07/29 21:06:41 INFO mapred.JobClient:     Combine input records=0
10/07/29 21:06:41 INFO mapred.JobClient:     Map output records=8
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce input records=8
The command executes WordCount.jar with two arguments: the input and output directories. If you remember my last article about Hadoop in Netbeans, the WordCount application takes those two arguments. The MapReduce job writes its results into the output directory. To read the output, let’s type some commands on the terminal:
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -ls output
Found 2 items
drwxr-xr-x   - hadoop supergroup    0 2010-07-29 21:06 /user/hadoop/output/_logs
-rw-r--r--   3 hadoop supergroup   31 2010-07-29 21:06 /user/hadoop/output/part-00000
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -cat output/part-00000
Bye 2
Hello 2
hadoop 2
world 2
The first command lists the files inside the output directory; apparently the output of our WordCount application is in the part-00000 file. The second command prints the content of the part-00000 file to the terminal. Now, if we want to take the file out of HDFS, we can type these commands:
hadoop@computer:/usr/local/hadoop$ mkdir /home/hadoop/output
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -getmerge output /home/hadoop/output
hadoop@computer:/usr/local/hadoop$ cd /home/hadoop/output
hadoop@computer:~/output$ ls
output
hadoop@computer:~/output$ cat output
Bye 2
Hello 2
hadoop 2
world 2
The first command creates a local directory for our output file: /home/hadoop/output. The second command fetches the output from HDFS into that local directory, merging the part files into a single file. The third command changes into the output directory we just created. When we list the directory with the fourth command, we see that there’s a file named output there, so we display it with the fifth command.
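By the way, if you’d rather copy the output directory as-is (raw part files plus the _logs directory) instead of merging it into one file, dfs -get does that; for example, into a hypothetical /home/hadoop/output-raw directory:
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -get output /home/hadoop/output-raw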
In this article, we set up a Hadoop installation on a single node cluster and tried the WordCount application on it. This post taught me a lot about how to use Hadoop. In the next post, we’ll try a multi node cluster!! Yay..