Programming Hadoop in Netbeans

23/01/2010

Hadoop MapReduce is an open source implementation of the MapReduce programming model for processing large amounts of data in a distributed environment. Hadoop is implemented in Java as a class library. There are several distributions of Hadoop, from Apache, Cloudera, and Yahoo!

Meanwhile, Netbeans is an integrated development environment (IDE) for programming in Java and many other languages. Like any other IDE, Netbeans has features that help programmers develop applications as easily and painlessly as possible. In this case, it helps us develop Hadoop MapReduce jobs.

In this post, I’ll show you step by step how to use Netbeans to develop a Hadoop MapReduce job. I’m using Netbeans 6.8 on Ubuntu Karmic Koala. The MapReduce program we are going to create is a simple one called wordcount. It reads the text in some files and lists every word along with how many times it appears across all the files. The source code of this program is available in the MapReduce tutorial packaged with the Apache Hadoop distribution.
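To make the idea concrete before Hadoop enters the picture, here is a minimal plain-Java sketch of what a word count does conceptually: a "map" step emits a 1 for every word, the values are grouped by word, and a "reduce" step sums each group. This is only an in-memory illustration, not the Hadoop program. The class name WordCountSketch and its methods are made up for this example, the two sample lines are just test data, and the code sticks to JDK 1.6 features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration; not part of Hadoop.
class WordCountSketch {

    // "Map" step: emit a 1 for every token and group the 1s by word.
    // The sorted TreeMap stands in for Hadoop's sort-and-group step.
    static Map<String, List<Integer>> mapAndGroup(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                List<Integer> ones = grouped.get(word);
                if (ones == null) {
                    ones = new ArrayList<Integer>();
                    grouped.put(word, ones);
                }
                ones.add(1);
            }
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello world Bye world", "Hello Hadoop Bye Hadoop");
        for (Map.Entry<String, Integer> entry : reduce(mapAndGroup(lines)).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The real job does the same thing, except that Hadoop takes care of splitting the input, grouping the pairs, and writing the result for us.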

This tutorial is divided into three steps. First, we will install Karmasphere Studio for Hadoop, a Netbeans extension. Then, we will type some code. Finally, we will run the MapReduce job in Netbeans. Okay, fasten your seat belt. Here we go.

Install Karmasphere Studio for Hadoop

To do this, you must have already installed JDK 1.6 and Netbeans (of course). There is a nifty tutorial with pictures on how to install Karmasphere Studio for Hadoop on their site, but I’ll write it again here.

  1. Open Netbeans and go to the Update Center using Tools > Plugins.
  2. In the Update Center, go to Settings tab and click the Add button. Enter the following Name and URL in the Update Center Customizer window:
    Name: Karmasphere Studio for Hadoop
    URL: http://hadoopstudio.org/updates/updates.xml
  3. Now, select the Available Plugins tab. Find “Karmasphere Studio for Hadoop” in the list and check it. Then click the Install button.
  4. Click Next and accept the license agreement. Click Install to confirm the list of plugins to be installed. Then, click Continue to download and install the plugins. The download is about 20-something MB (I forgot the exact size). Wait for it to finish, then restart your IDE.
  5. Done, we are good to go.

Typing some code

Now, we are going to type the code for the wordcount program. Before doing this, you must restart your IDE after the plugin installation. If you haven’t done it yet, do it now, I’ll wait. Done? Okay, let’s continue.

  1. We need to create a new Java application. To do that, go to File > New Project. Pick the Java Application project type and click Next.
  2. In the next window, enter WordCount as the name of the project. Then type WordCount as the Main Class. When you’re done, click Finish.
  3. Okay, the editor for WordCount.java is now open. But first, we must add the Hadoop library to the project. To do this, right-click Libraries under the WordCount project folder on the left side of the IDE, then pick Add Library.
  4. In the Add Library window, select Hadoop 0.20.0 as the version of Hadoop that we are going to use. Then click the Add Library button.
  5. The appropriate library has now been added to the project. Next, go back to the WordCount.java editor and replace the contents of the file with the code below:
    import java.io.IOException;
    import java.util.*;
    
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    
    public class WordCount{
    
    	// Mapper: emits (word, 1) for every token in a line of input.
    	public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>{
    
    		private final static IntWritable one = new IntWritable(1);
    		private Text word = new Text();
    
    		public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException{
    			String line = value.toString();
    			StringTokenizer tokenizer = new StringTokenizer(line);
    
    			while(tokenizer.hasMoreTokens()){
    				word.set(tokenizer.nextToken());
    				output.collect(word, one);
    			}
    		}
    	}
    
    	// Reducer: sums the counts collected for each word.
    	public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>{
    
    		public void reduce(Text key, Iterator<IntWritable> values,
    			OutputCollector<Text, IntWritable> output, Reporter reporter)
    			throws IOException{
    
    			int sum = 0;
    			while (values.hasNext()){
    				sum += values.next().get();
    			}
    
    			output.collect(key, new IntWritable(sum));
    		}
    	}
    
    	public static void main(String[] args) throws IOException{
    
    		// The job needs exactly two arguments: input path and output path.
    		if (args.length != 2){
    			System.err.println("Usage: WordCount <input path> <output path>");
    			System.exit(2);
    		}
    
    		JobConf conf = new JobConf(WordCount.class);
    		conf.setJobName("wordcount");
    
    		// Types of the keys and values written by the job.
    		conf.setOutputKeyClass(Text.class);
    		conf.setOutputValueClass(IntWritable.class);
    
    		conf.setMapperClass(Map.class);
    		conf.setReducerClass(Reduce.class);
    
    		// Read and write plain text files.
    		conf.setInputFormat(TextInputFormat.class);
    		conf.setOutputFormat(TextOutputFormat.class);
    
    		FileInputFormat.setInputPaths(conf, new Path(args[0]));
    		FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    
    		try{
    			JobClient.runJob(conf);
    		}catch(IOException e){
    			System.err.println(e.getMessage());
    		}
    	}
    }

    This program takes two arguments: the directory paths of the input and the output. I won’t explain the details of the code above in this post. Please refer to the Apache Hadoop MapReduce tutorial if you want to know more about it.

  6. Once we are sure there are no errors or typos, let’s build the program. To do this, right-click the WordCount project on the left side and pick Build. This step creates the JAR file of the program.
  7. Next, we will prepare the input for this program. We will create a folder with two text files inside it.
    For example, if you create the input folder in your home directory, the path will be /home/username/input. Inside it, create two text files; let’s name them file01 and file02.
    In the first file, type this sentence (without the quotes): “Hello world Bye world”
    In the second file, type this sentence (without the quotes): “Hello Hadoop Bye Hadoop”
    Actually, you can type anything you want. The two sentences are just examples. Save the files when you’re done.
  8. We are done with this step. Let’s go to the final step.
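If you would rather create the sample input from step 7 with code than by hand, something like the following sketch should do. The class name PrepareInput and its methods are made up for this example, and the default location (an input folder under your home directory) is just an assumption that matches the example path above. It only uses plain java.io, so it works on JDK 1.6.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper for creating the sample input; not part of Hadoop.
class PrepareInput {

    // Writes one line of text to dir/name, replacing any existing file.
    static void writeFile(File dir, String name, String text) throws IOException {
        Writer writer = new FileWriter(new File(dir, name));
        try {
            writer.write(text + "\n");
        } finally {
            writer.close();
        }
    }

    // Creates the input folder (if needed) and the two sample files in it.
    static void createSampleInput(File inputDir) throws IOException {
        if (!inputDir.isDirectory() && !inputDir.mkdirs()) {
            throw new IOException("Could not create " + inputDir);
        }
        writeFile(inputDir, "file01", "Hello world Bye world");
        writeFile(inputDir, "file02", "Hello Hadoop Bye Hadoop");
    }

    public static void main(String[] args) throws IOException {
        // Assumption: default to an "input" folder in the user's home directory.
        File inputDir = args.length > 0
                ? new File(args[0])
                : new File(System.getProperty("user.home"), "input");
        createSampleInput(inputDir);
        System.out.println("Sample input written to " + inputDir.getAbsolutePath());
    }
}
```

Run it once before running the job, optionally passing a different folder as the first argument.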

Running the MapReduce job

Okay. Now we are going to run the MapReduce job locally in Netbeans. This is how it’s done.

  1. On the left side of the IDE, click the Services tab. Right-click Hadoop Jobs and pick New Job.
  2. Enter WordCount as the Job Name and select the “Hadoop Job from pre-existing JAR file” type. Click Next when you’re done.
  3. Then, browse to the JAR file we created in the previous step. Click Browse and go to your Netbeans WordCount project folder. The JAR file is located in the dist folder. If you’re using the Netbeans default settings, the JAR file will be in /home/username/NetbeansProjects/WordCount/dist. Click Next when you’re done.
  4. In the Set Job Defaults step (Step 5 of 5), choose In-Process Thread (0.20.0) as the default cluster. Then, in Default Arguments, type the arguments needed by the program. In this case, these are the input and output directory paths. Type the input folder that we created earlier, followed by the output folder:
    /home/username/input /home/username/output
    For your information, we don’t need to create the output folder first. The program will create it for you. In fact, Hadoop refuses to run a job whose output folder already exists, so delete it (or pick a new path) before running the job again. Click Finish when you’re done.
  5. Now, we will finally run the MapReduce job. To do this, right-click WordCount under the Hadoop Jobs list and pick Run Job…
  6. In the Execute Hadoop Job window, enter WordCount as the Job Name and click Run.
  7. If your job executes successfully, there will be an output directory with a file inside it. In that file, you’ll find something like this:
    Bye	2
    Hadoop	2
    Hello	2
    world	2
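If you want to check the result from code instead of opening the file, here is a small sketch that reads the output folder back into a map. It assumes the result files are named with Hadoop's usual part- prefix (for example part-00000) and that each line is a word and a count separated by a tab, as in the listing above. The class name ReadWordCounts is made up for this example.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for inspecting the job output; not part of Hadoop.
class ReadWordCounts {

    // Reads every "part-*" file in outputDir and parses "word<TAB>count" lines.
    static Map<String, Integer> read(File outputDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        File[] files = outputDir.listFiles();
        if (files == null) {
            throw new IOException("Not a readable directory: " + outputDir);
        }
        for (File file : files) {
            // Assumption: only files starting with "part-" hold results.
            if (!file.isFile() || !file.getName().startsWith("part-")) {
                continue;
            }
            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    int tab = line.lastIndexOf('\t');
                    if (tab < 0) {
                        continue; // not a key/value line
                    }
                    String word = line.substring(0, tab);
                    int count = Integer.parseInt(line.substring(tab + 1).trim());
                    counts.put(word, count);
                }
            } finally {
                reader.close();
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: ReadWordCounts <output folder>");
            System.exit(2);
        }
        for (Map.Entry<String, Integer> entry : read(new File(args[0])).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the example in this post, you would pass /home/username/output as the argument.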

Now we’re done. If you have a question, feel free to ask me. But for your information, I’m still learning about this too, so let’s study it together. Good luck trying it out, and see you in the next post.

Comments

  1. is english mandatory here? >_>
    well, emm, i followed the step and succeed. while i’m still a noob at these, i think i can still understand a lil’ bit, maybe because i took paralel programming subject back then :D. after skimmed the wiki article, i assume the concept basically similar with the mpi in paralel programming

    btw, this is a great tutorial and very well written. keep up the good work!

