The Three Modes of Hadoop
As you may already know, we can configure and use Hadoop in three modes. These modes are:
Standalone mode (or local mode)
This is the default mode you get when you download and extract Hadoop for the first time. In this mode, Hadoop doesn't use HDFS to store input and output files; it simply uses the local filesystem. This mode is very useful for debugging your MapReduce code before you deploy it on a large cluster and handle huge amounts of data. In this mode, Hadoop's configuration file triplet (core-site.xml, mapred-site.xml, and
hdfs-site.xml) stays free from custom configuration.
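For reference, a freshly extracted triplet defines no properties at all; each of the three files is essentially just an empty configuration element, roughly like this (hdfs-site.xml shown as an example):

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml as shipped: no properties defined, so in standalone
     mode Hadoop falls back to its local-filesystem defaults. -->
<configuration>
</configuration>
```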
Pseudo-distributed mode (or single-node cluster)
In this mode, we configure the triplet so that Hadoop runs on a single node. The HDFS replication factor is one, because we use only one machine as the Master Node, Data Node, Job Tracker, and Task Tracker all at once. We can use this mode to test our code against real HDFS without the complexity of a fully distributed cluster. I've already covered the configuration process in my previous post.
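As a rough sketch of what the pseudo-distributed triplet looks like (these are the standard Hadoop 1.x property names; the localhost ports 9000 and 9001 are common choices in tutorials, not requirements):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor of one, since there is only one node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: the Job Tracker also runs on this same node -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```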
Fully distributed mode (or multi-node cluster)
In this mode, we use Hadoop at its full scale, on a cluster that can consist of a thousand nodes working together. This is the production phase, where your code and data are used and distributed across many nodes. You use this mode when your code is ready and works properly in the previous modes.
When you're developing your MapReduce application, you will use these three modes interchangeably. Editing your configuration triplet every time you want to switch to another mode can be quite frustrating and a waste of your precious time. Therefore, we can organize the triplet so that you can switch from one mode to another easily. Here's the way to do it.
We will keep a separate Hadoop configuration directory (conf/) for each mode. Let's assume that you've just extracted your Hadoop distribution and haven't made any changes to the configuration triplet. In the terminal, run these commands:
hadoop@computer:~$ cd /your/hadoop/installation/directory
hadoop@computer:~$ cp -R conf conf.standalone
hadoop@computer:~$ cp -R conf conf.pseudo
hadoop@computer:~$ cp -R conf conf.distributed
hadoop@computer:~$ rm -R conf
The first command moves to your Hadoop installation directory. The next three commands copy the conf/ directory to three different directories:
conf.standalone: stores the standalone mode configuration. You don't need to edit the triplet if you've just extracted your Hadoop distribution.
conf.pseudo: stores the pseudo-distributed mode configuration. You can edit the triplet as I explained in my single-node configuration post.
conf.distributed: stores the fully distributed mode configuration. You can set up the triplet for a multi-node cluster as in Michael Noll's post.
The last command removes the original conf directory. Please make sure you've run the three copy commands correctly before removing it. You've been warned.. :p
The next step is the trick. We use the ln command to create a symbolic link named conf that points to whichever configuration directory we want to use. For more information about the ln command, you can read the manual by typing man ln in your terminal, or you can read this post. And now here is how to do it.
Switching to standalone mode
hadoop@computer:~$ ln -s conf.standalone conf
Switching to pseudo-distributed mode
hadoop@computer:~$ ln -s conf.pseudo conf
Switching to fully distributed mode
hadoop@computer:~$ ln -s conf.distributed conf
Here is the explanation: we created a symbolic link between our mode-specific configuration folder and the conf folder (which is the one referenced by Hadoop at runtime). In standalone mode, we make conf a link that points to the conf.standalone directory, and Hadoop then reads its configuration from conf. The other modes work the same way; only the directory the link points to changes. Every time you want to switch from one mode to another, you can just run the corresponding command above. Quite nifty, right? Feel free to share your thoughts.. :D
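One caveat: ln -s fails with "File exists" if conf is already linked to another mode, so you need to remove the old link before creating the new one. A minimal helper sketch (the switch_mode function name is my own invention, not part of Hadoop):

```shell
# Switch the active Hadoop configuration by repointing the conf symlink.
# Run this from your Hadoop installation directory.
switch_mode() {
    mode="$1"                  # one of: standalone, pseudo, distributed
    rm -f conf                 # remove the old symlink (not its target directory)
    ln -s "conf.$mode" conf    # point conf at the chosen configuration directory
}
```

Because rm -f only removes the symlink itself, the conf.standalone, conf.pseudo, and conf.distributed directories are never touched when you switch.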