Hadoop on Single Node Cluster

Hello there! S’up?

In my previous post, we learned how to develop a Hadoop MapReduce application in NetBeans. Now that our application runs well in NetBeans, it’s time to deploy it on a cluster of computers. It is supposed to run on a multi node cluster, but for now, let’s try it on a single node cluster. This article gives a step-by-step guide on how to deploy a MapReduce application on a single node cluster.

In this tutorial, I’m using Ubuntu 9.10 Karmic Koala. For the Hadoop MapReduce application, I’ll use the code from my previous post. You can try it yourself, or you can just download the jar file. Are you ready? Let’s go then..

Preparing the Environment

First things first, we must prepare the deployment environment by installing and configuring all the required software. For this process, I followed a great tutorial by Michael Noll about how to run Hadoop on a single node cluster. For simplicity, I’ll write a summary of all the steps mentioned in Michael’s post. I do recommend you read it for the details.

First, we will create a dedicated user account for running Hadoop. It helps us separate Hadoop from other accounts for many reasons, including security concerns. We will also create a dedicated group for this account. Let’s call both the user and the group: hadoop. This is how to do it.

  1. Open your terminal.
  2. On the terminal, type this command:
    $ sudo addgroup hadoop
    $ sudo adduser --ingroup hadoop hadoop
    $ sudo adduser hadoop admin
    

    The first command creates a new group called hadoop. The second command creates a new account called hadoop and assigns it to the hadoop group. The third command adds the hadoop account to the admin group, which allows it to use the sudo command. This is just for practical reasons. You can verify the new account as shown below.
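
    To double-check that the account and group memberships came out as intended, you can inspect the new user. This is just a quick sanity check; the exact user and group IDs on your machine will differ.

    $ id hadoop
    # the output should list the groups hadoop and admin
    $ groups hadoop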

Next, we will configure SSH access for Hadoop. As Noll explains, Hadoop uses SSH to control its nodes. Because we are configuring Hadoop for a single node cluster, we’ll configure SSH access to localhost from the hadoop account we created earlier. This is how to do it.

  1. We will generate an SSH key for the user hadoop we created.
    user@computer:~$ su - hadoop
    hadoop@computer:~$  ssh-keygen -t rsa -P ""
    

    The first command switches from your regular account to the hadoop account. The second command generates an RSA key pair for the hadoop account with an empty passphrase. An empty passphrase is not recommended in general and is only used here for practical purposes.

  2. Next, we will enable local access using the newly generated RSA key pair.
    hadoop@computer:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    
  3. As the last step, we will test SSH access to localhost from the hadoop account. This step also saves the host machine’s fingerprint to the list of known hosts.
    hadoop@computer:~$ ssh localhost
    

    For more information, or if the SSH setup fails, you can refer to Noll’s post. A quick non-interactive check is shown below.
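
    Hadoop’s start scripts rely on passwordless logins, so it is worth confirming that SSH really works without prompting. Forcing SSH into non-interactive mode should simply print the hostname; if it asks for a password instead, revisit the key setup above.

    hadoop@computer:~$ ssh -o BatchMode=yes localhost hostname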

Next, we need to disable IPv6. To do it, open the file /etc/sysctl.conf and add these lines to the end of the file:

 #disable ipv6
 net.ipv6.conf.all.disable_ipv6 = 1
 net.ipv6.conf.default.disable_ipv6 = 1
 net.ipv6.conf.lo.disable_ipv6 = 1

Restart your computer for this setting to take effect. To check whether IPv6 is still enabled, we can use this command:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If IPv6 is disabled, the command will return 1.
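
If you prefer not to reboot right away, on most Ubuntu systems you can reload /etc/sysctl.conf in place; this is just an optional shortcut, and a reboot is still the safest way to be sure the setting sticks:

$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6    # should print 1

Now, we’re ready to install Hadoop.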

Hadoop Installation and Configuration

In order to make Hadoop work, you must already have Java 1.6.x installed. In this article, I assume you’ve already installed Java. After that, you must obtain Hadoop MapReduce. In this tutorial, I used Hadoop version 0.20.1 from the Apache project. You can download it from its official website. This is how to do it.

  1. Place the Hadoop archive file in your preferred location. In this article, I followed Noll’s post and placed it in /usr/local.
  2. Open the terminal, and type this command
    $ cd /usr/local
    $ sudo tar  xzf hadoop-0.20.1.tar.gz
    $ sudo mv hadoop-0.20.1 hadoop
    $ sudo chown -R hadoop:hadoop hadoop
    

    The first command changes to the /usr/local directory, the place where you want to install Hadoop. The second command extracts the Hadoop archive in this directory. The third command renames the directory from hadoop-0.20.1 to hadoop, to make things easier when we execute Hadoop’s commands later. The last command changes the owner of the hadoop folder to the hadoop account from the hadoop group. An optional convenience step is shown below.
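
    Optionally, you can make the Hadoop scripts easier to call from anywhere by exporting HADOOP_HOME and extending PATH in the hadoop account’s ~/.bashrc. This is just a convenience; the rest of this post keeps calling the scripts with their bin/ prefix, so you can safely skip it.

    # append to /home/hadoop/.bashrc (as the hadoop user)
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$HADOOP_HOME/bin:$PATH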

After we have successfully installed Hadoop MapReduce, we must configure it. To configure Hadoop, we must edit some of its configuration files. You can refer to Hadoop’s Getting Started page for the details of these configuration files. For now, we will configure it as follows.

First, we must set the Java path Hadoop will use. Locate your Java installation directory. Typically, it is /usr/lib/jvm/java-6-sun. Set this location as the JAVA_HOME parameter in the hadoop-env.sh file located (in this case) in /usr/local/hadoop/conf/.

# The java implementation to use.  Required.
 export JAVA_HOME=/usr/lib/jvm/java-6-sun
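
If you are not sure which path to use, you can list the JVMs installed on your machine and follow the java binary back to its home. The java-6-sun path above is just the typical location on Ubuntu at the time; your system may differ.

$ ls /usr/lib/jvm
$ readlink -f $(which java)    # strip the trailing bin/java (and jre/, if present) to get JAVA_HOME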

Next, we will set up Hadoop’s XML configuration files. These files are located in the conf/ directory. We will configure them one by one, as follows.

First, we configure the core-site.xml.

<configuration>
     <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/local/hadoop-datastore/hadoop</value>
          <description>A base for other temporary directories.</description>
     </property>

     <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:54310</value>
          <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
     </property>
</configuration>

In this configuration, I set /usr/local/hadoop-datastore/hadoop as the base for Hadoop’s temporary directories. Because this directory doesn’t exist yet, I’ll create it with these commands:

$ sudo mkdir -p /usr/local/hadoop-datastore/hadoop
$ sudo chown hadoop:hadoop /usr/local/hadoop-datastore/hadoop
$ sudo chmod 750 /usr/local/hadoop-datastore/hadoop

The first command creates the directory (the -p flag also creates the missing parent directory). The second command changes the ownership of the directory to hadoop:hadoop, meaning the hadoop user account from the hadoop group. The third command gives the hadoop user the necessary permissions to read, modify, and traverse the directory.

The second property set in this configuration file is the name of the default file system. Hadoop has many file system types implemented from the abstract class org.apache.hadoop.fs.FileSystem. The implementation we’re going to use here is the local HDFS on port 54310. Actually, you can set the port as you like, as long as you remember it for later use.
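
One practical consequence of fs.default.name, once the cluster is running, is that plain HDFS paths are resolved against this instance. For example, the following two listings should be equivalent (shown here only as an illustration; the daemons from the later sections must already be up):

$ bin/hadoop dfs -ls /
$ bin/hadoop dfs -ls hdfs://localhost:54310/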

Next, we configure the mapred-site.xml.

<configuration>
     <property>
          <name>mapred.job.tracker</name>
          <value>localhost:54311</value>
          <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
     </property>
</configuration>

In this configuration file, we configure the JobTracker. We point it at localhost on port 54311, because we’re still using a single node, so the JobTracker and its map and reduce tasks all run on this machine.

Then, we configure the hdfs-site.xml.

<configuration>
     <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
     </property>
</configuration>

In this configuration file, we set the DFS replication to one, since we’re deploying on a single node. The default replication is 3, which is meant for a multi node cluster.
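
Later, once HDFS is running and holds some files, you can confirm the replication actually in effect with the file system checker. This is optional and only works after the daemons from the next section are started:

$ bin/hadoop fsck / -files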

Formatting and Running Hadoop

After configuring the Hadoop installation, we need to format the Hadoop file system. We only need to format it once, the first time we install Hadoop. This is how to do it.

  1. Open your terminal.
  2. On your terminal, type this command:
    hadoop@computer:~$ cd /usr/local/hadoop
    hadoop@computer:/usr/local/hadoop$ bin/hadoop namenode -format
    

    The first command changes from the hadoop home directory to the directory where we installed Hadoop. The second command formats the NameNode’s Hadoop file system. An optional check follows below.
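
    If you want to verify that formatting succeeded, the NameNode image should now exist under the hadoop.tmp.dir base we configured earlier; dfs/name is the default subdirectory Hadoop uses under that base, so the path below is an assumption based on that default.

    hadoop@computer:/usr/local/hadoop$ ls /usr/local/hadoop-datastore/hadoop/dfs/name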

Finally, we’re ready to run Hadoop. To start the HDFS and MapReduce daemons, type this command in the terminal:

hadoop@computer:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-computer.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-computer.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-computer.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-computer.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-computer.out

To check whether the Hadoop daemons are running, you can use the jps command.
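
On a healthy single node setup, you would expect jps to list a NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker process. Hadoop 0.20 also exposes web interfaces on its default ports, which are handy for a quick look around:

hadoop@computer:/usr/local/hadoop$ jps
# NameNode web UI:   http://localhost:50070/
# JobTracker web UI: http://localhost:50030/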

Now that Hadoop is running, we can continue to deploy the WordCount application we wrote earlier. If you followed my last tutorial (as I mentioned at the beginning of this post), you can get WordCount.jar from the /your-path-to-NetbeansProject/WordCount/dist directory. After that, copy it to the Hadoop installation directory (/usr/local/hadoop).

After that, we need to copy the input directory to HDFS. Following the last tutorial, the input directory is /home/username/input. This is how to copy the input directory and its contents to HDFS:

hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /home/hadoop/input input
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -ls input
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2010-07-29 20:52 /user/hadoop/input/file01
-rw-r--r--   3 hadoop supergroup         24 2010-07-29 20:52 /user/hadoop/input/file02

The first command copies the local folder /home/hadoop/input into the input directory on the running HDFS. The second command lists the contents of the input directory on HDFS.
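
If you want to peek at what actually landed on HDFS before running the job, you can print a file straight from the distributed file system (file01 here is just one of the sample files shown in the listing above):

hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -cat input/file01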

Now, let’s run the WordCount application on the hadoop installation. This is how to do it:

hadoop@computer:/usr/local/hadoop$ bin/hadoop jar WordCount.jar input output
10/07/29 21:06:16 INFO mapred.FileInputFormat: Total input paths to process : 2
10/07/29 21:06:17 INFO mapred.JobClient: Running job: job_201007292027_0002
10/07/29 21:06:18 INFO mapred.JobClient:  map 0% reduce 0%
10/07/29 21:06:27 INFO mapred.JobClient:  map 100% reduce 0%
10/07/29 21:06:39 INFO mapred.JobClient:  map 100% reduce 100%
10/07/29 21:06:41 INFO mapred.JobClient: Job complete: job_201007292027_0002
10/07/29 21:06:41 INFO mapred.JobClient: Counters: 18
10/07/29 21:06:41 INFO mapred.JobClient:   Job Counters
10/07/29 21:06:41 INFO mapred.JobClient:     Launched reduce tasks=1
10/07/29 21:06:41 INFO mapred.JobClient:     Launched map tasks=2
10/07/29 21:06:41 INFO mapred.JobClient:     Data-local map tasks=2
10/07/29 21:06:41 INFO mapred.JobClient:   FileSystemCounters
10/07/29 21:06:41 INFO mapred.JobClient:     FILE_BYTES_READ=100
10/07/29 21:06:41 INFO mapred.JobClient:     HDFS_BYTES_READ=46
10/07/29 21:06:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=270
10/07/29 21:06:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=31
10/07/29 21:06:41 INFO mapred.JobClient:   Map-Reduce Framework
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce input groups=4
10/07/29 21:06:41 INFO mapred.JobClient:     Combine output records=0
10/07/29 21:06:41 INFO mapred.JobClient:     Map input records=2
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce shuffle bytes=54
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce output records=4
10/07/29 21:06:41 INFO mapred.JobClient:     Spilled Records=16
10/07/29 21:06:41 INFO mapred.JobClient:     Map output bytes=78
10/07/29 21:06:41 INFO mapred.JobClient:     Map input bytes=46
10/07/29 21:06:41 INFO mapred.JobClient:     Combine input records=0
10/07/29 21:06:41 INFO mapred.JobClient:     Map output records=8
10/07/29 21:06:41 INFO mapred.JobClient:     Reduce input records=8

The command executes WordCount.jar with two arguments: the input and output directories. If you remember my last article about Hadoop in NetBeans, the WordCount application takes those two arguments. The MapReduce job writes its results into the output directory. To read the output, let’s type some commands in the terminal:

hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -ls output
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2010-07-29 21:06 /user/hadoop/output/_logs
-rw-r--r--   3 hadoop supergroup         31 2010-07-29 21:06 /user/hadoop/output/part-00000
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -cat output/part-00000
Bye	2
Hello	2
hadoop	2
world	2

The first command lists the files inside the output directory; the output of our WordCount application is in the part-00000 file. The second command prints the content of the part-00000 file to the terminal. Now, if we want to take the file out of HDFS, we can type these commands:

hadoop@computer:/usr/local/hadoop$ mkdir /home/hadoop/output
hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -getmerge output /home/hadoop/output
hadoop@computer:/usr/local/hadoop$ cd /home/hadoop/output
hadoop@computer:~/output$ ls
hadoop
hadoop@computer:/home/hadoop/output$ cat output
Bye	2
Hello	2
hadoop	2
world	2

The first command creates a directory for our output file; here we created the /home/hadoop/output directory. The second command fetches and merges the output from HDFS into that local directory. The third command changes into the output directory we just specified. When we list the directory with the fourth command, we see the merged output file there, and we display its contents with the fifth command.
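
Two small housekeeping notes, both optional: Hadoop refuses to run a job if its output directory already exists on HDFS, so delete it before re-running WordCount; and when you are done experimenting, you can stop all the daemons with the companion stop script:

hadoop@computer:/usr/local/hadoop$ bin/hadoop dfs -rmr output
hadoop@computer:/usr/local/hadoop$ bin/stop-all.sh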

Summary

In this article, we’ve set up a Hadoop installation on a single node cluster and tried the WordCount application on it. This post taught me a lot about how to use Hadoop. In the next post, we’ll try it on a multi node cluster!! Yay..

References:

  1. Michael Noll, “Running Hadoop on Ubuntu Linux (Single-Node Cluster)”.
