Programming Hadoop in Eclipse (Inverted Index Examples)

It has been two years since I wrote about programming Hadoop in NetBeans using Karmasphere Studio. Since then, Karmasphere has apparently dropped NetBeans support and focused on the other big IDE, Eclipse. I have no real problem using Eclipse, thanks to some Android projects that I’m working on right now. In this post, I’ll show you another example of programming Hadoop in Eclipse by implementing a distributed inverted index in MapReduce. So, let’s get started, shall we?

Installing Karmasphere Studio in Eclipse

The first thing to do is install the Karmasphere Studio plugins. To use Karmasphere Studio, you must register for free, after which you’ll get a serial key you can use to download the plugins. Then you can follow the step-by-step installation procedure on their website. The installation process is straightforward, and I’m sure you can manage it.

Besides the installation procedure, they also provide an example of building the famous WordCount MapReduce implementation using their marvelous MapReduce job workflow. I don’t remember whether this feature was available two years ago, but I found it helpful for debugging MapReduce jobs, even though I don’t use it in this post. Okay, next: inverted index.

Distributed Inverted Index

Okay, first things first: what is an inverted index? An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents. So, given some text documents, we are going to build a mapping from each word to the documents in which it appears, along with how often it appears there.

Let’s look at an example. I have two documents called doc1.txt and doc2.txt. The content of doc1.txt is shown below:

hello hadoop
goodbye hadoop
hello you
hello me

and the content of doc2.txt is shown below:

hello world
goodbye world

The inverted index constructed from these two documents will look like this:

goodbye	doc1.txt=>1 doc2.txt=>1
hadoop	doc1.txt=>2
hello	doc1.txt=>3 doc2.txt=>1
me	doc1.txt=>1
world	doc2.txt=>2
you	doc1.txt=>1 

Okay, inverted index construction fits naturally into the MapReduce paradigm. You can find a great explanation in Yahoo!’s Hadoop tutorial (they provide code there too, at the bottom). The Map phase reads each line and outputs the words as keys and the file name as the value. The Reduce phase outputs each word as the key and, as the value, the collection of file names it was found in, together with occurrence counts.
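To make that concrete, here is a rough sketch of the dataflow for the two sample documents above (illustrative only, not actual framework output):

Map output, one (word, filename) pair per token:
  (hello, doc1.txt) (hadoop, doc1.txt) (goodbye, doc1.txt) (hadoop, doc1.txt) ...
  (hello, doc2.txt) (world, doc2.txt) (goodbye, doc2.txt) (world, doc2.txt)

After the shuffle, the reducer for "hello" receives:
  key = hello, values = [doc1.txt, doc1.txt, doc1.txt, doc2.txt]

and emits:
  hello	doc1.txt=>3 doc2.txt=>1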

Code! Code! Code!

Let’s get our hands dirty. Open Eclipse and create a new Java Project. Let’s give it a name like DistributedIndexing. Set the path to your favorite place; I’ll use /programming/hadoop/distributedindexing. My Java project settings are shown below:

Click Next, and the Java Settings window will show up. Now click the Libraries tab. On this window, click Add Library… on the right, pick Hadoop Libraries from Karmasphere, click Next, and pick Hadoop MapReduce (0.18.3) as shown below. Click Finish when you’re done.

The new Java Project will be available on the left side of your Eclipse.

Now, in the reduce phase, we need to collect the file names (and occurrence counts) for each word received. So I’ll write my own custom Writable to handle them. For your information, a Writable is Hadoop’s serializable data type, usable as a key or value. I wrote DocSumWritable to hold the per-file counts and serve as the output value of the reduce phase. I put my custom Writable in the writable package. Here is the code:

package writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class DocSumWritable implements Writable {

    // Maps a file name to the number of times the word occurred in that file.
    private HashMap<String, Integer> map = new HashMap<String, Integer>();

    public DocSumWritable() {
    }

    public DocSumWritable(HashMap<String, Integer> map) {
        this.map = map;
    }

    private Integer getCount(String tag) {
        return map.get(tag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize: read the entry count, then each (file name, count) pair.
        map.clear();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            Text tag = new Text();
            IntWritable count = new IntWritable();
            tag.readFields(in);
            count.readFields(in);
            map.put(tag.toString(), count.get());
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize: write the entry count, then each (file name, count) pair.
        out.writeInt(map.size());
        for (String tag : map.keySet()) {
            new Text(tag).write(out);
            new IntWritable(getCount(tag)).write(out);
        }
    }

    @Override
    public String toString() {
        // TextOutputFormat calls this to render the reduce output value,
        // e.g. "doc1.txt=>3 doc2.txt=>1 ".
        StringBuilder output = new StringBuilder();
        for (String tag : map.keySet()) {
            output.append(tag).append("=>").append(getCount(tag)).append(" ");
        }
        return output.toString();
    }
}

Let me give you some explanation. The readFields and write methods handle deserialization and serialization of this object. The toString method is responsible for formatting the output value: it prints each file name with its count, as filename1=>count1 filename2=>count2 ….
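If you want to see what toString produces, here is a quick standalone check (a hypothetical snippet, not part of the project; note that HashMap iteration order is unspecified, so the pair order may vary):

HashMap<String, Integer> counts = new HashMap<String, Integer>();
counts.put("doc1.txt", 3);
counts.put("doc2.txt", 1);
System.out.println(new DocSumWritable(counts));
// prints something like: doc1.txt=>3 doc2.txt=>1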

Next, let’s write the mapper. I put the mapper in the mapred package. The mapper outputs each word as the key and its file name as the value. My mapper is named IndexMapper. To create it, right-click the project name and pick New > Other…, then pick Hadoop Mapper as shown in the picture below.

On the next window, we define the key and value data types. Let’s define them as below.

In IndexMapper, we use a StringTokenizer to get the words from each line and reporter.getInputSplit() to get the file name. The code is shown below.

package mapred;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class IndexMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        // The input split tells us which file this line came from.
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();

        String line = value.toString();

        // Emit one (word, filename) pair per token on the line.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            output.collect(new Text(token.toLowerCase()), new Text(fileName));
        }
    }
}
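One caveat worth noting: the cast to FileSplit works because the default TextInputFormat hands each mapper a FileSplit; if you swapped in an InputFormat with a different split type, that cast would throw a ClassCastException.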

Okay, now the reducer. I put a reducer called IndexReducer in the mapred package. Creating it works the same way as the mapper, except you choose Hadoop Reducer instead. Then we define the key-value data types as shown in the picture. To find DocSumWritable for the output value class, click the Search button and pick DocSumWritable.

In the reducer we collect the file names for each word, count how many times the word occurred in each file, and output those counts as the value using our custom DocSumWritable from above. The code is shown below.

package mapred;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import writable.DocSumWritable;

public class IndexReducer extends MapReduceBase implements Reducer<Text, Text, Text, DocSumWritable> {

    @Override
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, DocSumWritable> output, Reporter reporter)
            throws IOException {
        // Count how many times this word occurred in each file.
        HashMap<String, Integer> counts = new HashMap<String, Integer>();

        while (values.hasNext()) {
            String fileName = values.next().toString();
            Integer count = counts.get(fileName);
            counts.put(fileName, count == null ? 1 : count + 1);
        }

        output.collect(key, new DocSumWritable(counts));
    }
}

Okay. Now we create the Hadoop job configuration. This time pick New, choose Hadoop MapReduce Job, and set it up as below.

Okay. I don’t know why, but when I tried the Hadoop Job Workflow provided by Karmasphere Studio, the Mapper tab showed a warning saying the reporter has no input. Maybe the workflow couldn’t resolve the reporter being used to get the file name. Anyway, I edited the IndexJob source code and added the output value data type definition in the initCustom() method, as below.

public static void initCustom(JobConf conf) {
    // Add custom initialisation here; you may have to rebuild your project
    // before changes are reflected in the workflow.
    conf.setOutputValueClass(DocSumWritable.class);
}
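For reference, the Karmasphere wizard generates the rest of the IndexJob driver for you; a hand-written equivalent would look roughly like this sketch (a minimal version assuming the package layout above, not the exact generated code):

package job;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import mapred.IndexMapper;
import mapred.IndexReducer;
import writable.DocSumWritable;

public class IndexJob {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IndexJob.class);
        conf.setJobName("distributed-indexing");

        conf.setMapperClass(IndexMapper.class);
        conf.setReducerClass(IndexReducer.class);

        // The map emits (word, filename); the final output is (word, DocSumWritable).
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(DocSumWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}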

Okay, we’re done with the code. The project structure will look like this.

We’re almost done. Next, we must generate a jar file from the project. To do this, right-click the project and pick Export, then pick JAR file and your destination name; for me it’s /programming/hadoop/indexing.jar. We’re all set!
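(If you’d rather not click through the export wizard, building the jar from the command line works too, roughly like this, assuming Eclipse compiles classes into its default bin/ folder:)

jar cvf indexing.jar -C bin .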

Run! Run! Run!

Okay, now that we have created the jar file, let’s specify the input. I have created four input files in a folder that can be downloaded here. Extract it into your jar directory; for me that’s /programming/hadoop, so the input will be in /programming/hadoop/input. There are four files there.

Okay, next we’re going to run it. I assume you have already installed Hadoop MapReduce. I’ll use the standalone setup here, but I think it’ll work on a multi-node cluster too. Hadoop installation and configuration instructions can be found on the Hadoop homepage.

Open a terminal, go to the jar directory (/programming/hadoop), and type the following:

hadoop jar indexing.jar job.IndexJob input output

This executes indexing.jar, calling the job.IndexJob main class with the input folder as its input and the output folder as its output. If you followed along correctly, the output will be in the /programming/hadoop/output directory. The output file is part-00000. Here are some of the contents of part-00000:

albert kompas-2.txt=>1
amerika kompas-1.txt=>1
ana, kompas-1.txt=>1
anak kompas-1.txt=>2
anda kompas-1.txt=>5
aneka kompas-1.txt=>3
anggrek kompas-1.txt=>5
anggrek. kompas-1.txt=>2
anggrek? kompas-1.txt=>1
antara detik-2.txt=>1
antemortem detik-2.txt=>2 detik-1.txt=>3
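Since we ran Hadoop in standalone mode, the output sits on the local filesystem, so a plain cat output/part-00000 will show it. If you were running against HDFS on a real cluster, you’d read it through the Hadoop shell instead:

hadoop fs -cat output/part-00000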

What’s Next?

Next, you can develop your own Hadoop MapReduce applications and test them on a real multi-node cluster. Happy Hadoop-ing. :D

Note: You can get the source code of this tutorial on my GitHub.
