Programming Hadoop in Eclipse (Inverted Index Examples)

It has been two years since I wrote about programming Hadoop in Netbeans using Karmasphere Studio. Meanwhile, Karmasphere apparently no longer supports Netbeans and has focused on the other IDE, Eclipse. I have relatively few problems using Eclipse, thanks to some Android projects that I’m working on right now. In this post, I’ll show you another example of programming Hadoop in Eclipse by implementing a distributed inverted index in MapReduce. So, let’s get started, shall we?

Setting Up Mercurial for Netbeans Project

In software development, a version control system (VCS) plays an important role, especially when many programmers collaborate on the project. Besides keeping track of changes, version control helps handle task distribution and the later integration of the programmers’ work. Basically, there are two flavors of version control system: centralized and distributed. There are many comparisons between these two flavors on the net; one of them explains it well with some illustrations. The key difference is that in a distributed VCS every programmer has a full local copy of the project repository, while in a centralized VCS every change must be committed to the central repository.

Programming Hadoop in Netbeans

Note: It seems that Karmasphere Studio no longer supports Netbeans. For programming Hadoop in Eclipse, you can read about it here.

Hadoop MapReduce is an open source implementation of the MapReduce programming model for processing large-scale data in a distributed environment. Hadoop is implemented in Java as a class library. There are several distributions of Hadoop, from Apache, Cloudera, and Yahoo!

Meanwhile, Netbeans is an integrated development environment (IDE) for programming in Java and many other languages. Netbeans, like any other IDE, has features that help programmers develop applications as easily and painlessly as possible. In this case, it helps us develop Hadoop MapReduce jobs.

In this post, I’ll tell you step by step how to use Netbeans to develop a Hadoop MapReduce job. I’m using Netbeans 6.8 on Ubuntu Karmic Koala. The MapReduce program we are going to create here is a simple program called wordcount. This program reads text from some files and lists all the words along with how many times each word appears across all the files. The source code of this program is available in the MapReduce tutorial packaged with the Apache Hadoop distribution.

We divide this tutorial into three steps. First, we will install Karmasphere Studio for Hadoop, a Netbeans extension. Then, we will type some code. And finally, we will run the MapReduce job in Netbeans. Okay, fasten your seat belt. Here we go.
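The idea behind wordcount can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the map and reduce steps, not the code shipped with the Hadoop tutorial; the class name WordCountSketch and its method names are made up for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration only; no Hadoop classes are used here.
// All names in this class are hypothetical.
final class WordCountSketch {

    // "Map" step: emit one (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" step: add up the counts that belong to the same word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map step on every line, then reduces all emitted pairs.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }
}
```

Calling WordCountSketch.countWords with the two lines "hello hadoop" and "hello world" counts hello twice and hadoop and world once each. In a real Hadoop job the pairs would be grouped by key across machines before the reduce step; here everything happens in one list.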

Why Generate Database Keys?

Howdy.

Not so long ago, I created a GUI for data storage using Java and, of course, JDBC. I followed a tutorial on how to insert records into a database table. The tutorial said that before inserting a new record into the database, we should generate a unique key as the record’s identity. The generation is handled by Java code. The tutorial uses the System.currentTimeMillis() function to create the key. So, the code looks like this:

//set new id
Number time = System.currentTimeMillis();
Integer id = (time.intValue()/10000);

The generated key is then inserted into the database.

As you may have realized, database systems usually have their own key generation technique, for example the AUTO_INCREMENT attribute in MySQL. I followed the tutorial and, sure enough, it works. But I also tried database-generated keys, and they work too. So, why should we bother generating our own keys?
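To see what this formula actually produces, here is a small sketch that wraps it in a method taking the clock value as a parameter. The class and method names are made up for illustration. Two things are worth noticing: intValue() keeps only the low 32 bits of the millisecond clock, so the result can be negative, and dividing by 10000 means that all inserts within the same ten-second window get the same key. The second method is just one possible variation (divide first, then narrow), not something from the tutorial.

```java
// Hypothetical wrapper around the tutorial's formula, for experimenting only.
final class TimeKeySketch {

    // Same arithmetic as the tutorial: narrow the long to int, then divide.
    static int tutorialKey(long millis) {
        Number time = millis;
        return time.intValue() / 10000;
    }

    // Variation (an assumption, not from the tutorial): divide the long first,
    // then narrow, so the value stays positive for present-day clock values.
    static int divideFirstKey(long millis) {
        return (int) (millis / 10000);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("tutorial formula: " + tutorialKey(now));
        System.out.println("divide first:     " + divideFirstKey(now));
    }
}
```

Either way the key only changes every ten seconds, so this scheme suits a single user inserting records by hand, which is what my GUI did.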

The O’Reilly Java Author gave an explanation about that. They said:

However, using the supported key generation tools of your database of choice presents several problems:

  • Every database engine handles key generation differently. Thus, it is difficult to build a truly portable JDBC application that uses proprietary key generation schemes.
  • Until JDBC 3.0, a Java application had no clear way of finding out which keys were generated on an insert.
  • Automated key generation wreaks havoc with EJBs.

I got the point that the application should be portable. Who knows, someday we may migrate from one database system to another. So database-generated keys will be hard to control.

Another detailed answer came from Scott Selikoff. He wrote a complete article about database key generation and gave an example:

Now, let’s say a user is in the process of creating a new record in your system. For each user record, you also have a set of postal addresses. For example, Bob may be purchasing items on NewEgg and have a home address and a work address. Furthermore, Bob enters his two addresses at the time he creates his account, so the application server receives the information to create all 3 records at once. In such a situation, you would normally have 3 records: 1 user record for Bob and 2 address records. You could add the address info in the user table, although then you have to restrict the number of addresses Bob can have and/or have a user table with a lot of extra columns.

Inserting Bob into the user table is straight forward enough, but there is a problem when you go to insert users into the address table, namely that you need Bob’s newly generated User Id in order to insert any records into the address table. After all, you can’t insert addresses without being connected to a specific user, lest chaos ensue in your data management system.

The problems arise in complex database relationships. As we don’t know the generated key, we can’t set the foreign key of the corresponding tables. I didn’t realize this because the GUI I made used a simple database design, maybe with no table relations at all.
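Here is a minimal sketch of Bob’s case when the application generates the keys itself. Everything in it is hypothetical: the class names, the in-memory rows standing in for database tables, and the simple counter used as a key source (a real application would need keys that stay unique across restarts and servers). The point is only that the user’s key is known before any address row is built. With database-generated keys, the alternative since JDBC 3.0 is to ask the driver for the new key through Statement.getGeneratedKeys() after the insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for a user table and an address table.
final class AppKeySketch {

    // Assumed key source: a simple in-process counter.
    private static final AtomicLong NEXT_KEY = new AtomicLong(1);

    static long nextKey() {
        return NEXT_KEY.getAndIncrement();
    }

    static final class UserRow {
        final long id;
        final String name;

        UserRow(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    static final class AddressRow {
        final long id;
        final long userId; // foreign key pointing at UserRow.id
        final String label;

        AddressRow(long id, long userId, String label) {
            this.id = id;
            this.userId = userId;
            this.label = label;
        }
    }

    // The user's key already exists, so each address row can carry it right away.
    static List<AddressRow> addressesFor(UserRow user, String... labels) {
        List<AddressRow> rows = new ArrayList<>();
        for (String label : labels) {
            rows.add(new AddressRow(nextKey(), user.id, label));
        }
        return rows;
    }
}
```

For Bob, that would be one call to create the user row with nextKey(), followed by addressesFor(bob, "home", "work"). All three rows can then be sent to the database in any order the constraints allow.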

So, I got my question answered. Do you have another answer for my question? Feel free to share.

Lightweight

Do you know the opposite of lightweight? Maybe almost 99% of you would say that the opposite is heavyweight. It’s easy. Just take the adjective, which is light, and change it to its opposite, which is heavy. That works in language. But does it apply to all languages?

Maybe now, 50% of you would say that it applies to all languages. What if I said that if a thing isn’t lightweight, then it must be heavyweight? If you said that I was correct, maybe you forgot one thing. In mathematics, there is a subject called mathematical logic.

Rings a bell? Remember the if-then statement? If “you do it” then “something happens”. It means that “you do it” is just one of the possible causes of “something happens”. But if something happens, it doesn’t always mean that “you do it”. If you want a statement saying that “something happens” only because “you do it”, use if-and-only-if instead.

Well, why do I bother telling a story about if-then and if-and-only-if? Yeah, I want to tell you about my experience when I was coding Java. Some of you may know about Swing components in Java. They are divided into two categories, heavyweight and lightweight components. I was playing with a JInternalFrame instance back then. I wanted to make a popup window appear when I clicked a button in this JInternalFrame instance. But I also wanted the JInternalFrame to be disabled while the window was showing, meaning the user couldn’t do anything in the JInternalFrame.

Before we continue, I want to share that, according to the JDK API docs:

Disabling a lightweight component does not prevent it from receiving MouseEvents.

So then I checked whether the JInternalFrame was lightweight or not. To do this, I used a method called isLightweight. It returned false, so I thought that it was a heavyweight component. Then I disabled this internal frame when the window popped up. But you know what, it didn’t work. The internal frame could still receive MouseEvents. When I checked the JDK API docs for JInternalFrame again, they said that it was a lightweight component. So what happened here?

Well, actually it was totally my fault. I called the isLightweight method while the internal frame was not yet displayed. When I checked the docs for this method, the JDK said:

This method will always return false if this component is not displayable because it is impossible to determine the weight of an undisplayable component.
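In code, this caveat means that a false from isLightweight can only be trusted once isDisplayable returns true. The sketch below shows one way to make the third case explicit. The class WeightCheckSketch, the enum, and the method weightOf are names I made up for illustration; isDisplayable and isLightweight are the real java.awt.Component methods.

```java
import java.awt.Component;
import javax.swing.JInternalFrame;

// Hypothetical helper that refuses to guess the weight of an undisplayable component.
final class WeightCheckSketch {

    enum Weight { LIGHTWEIGHT, HEAVYWEIGHT, UNKNOWN }

    static Weight weightOf(Component component) {
        if (!component.isDisplayable()) {
            // isLightweight() would return false here no matter what the component is.
            return Weight.UNKNOWN;
        }
        return component.isLightweight() ? Weight.LIGHTWEIGHT : Weight.HEAVYWEIGHT;
    }

    public static void main(String[] args) {
        // The frame is never added to a window, so it is not displayable.
        JInternalFrame frame = new JInternalFrame("demo");
        System.out.println(frame.isLightweight()); // false, but that says nothing about its weight
        System.out.println(weightOf(frame));       // UNKNOWN
    }
}
```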

So, the method returned false because the component wasn’t displayable, not because it was a heavyweight component. Here, I got the if-then message. It reminded me of a television advertisement from not so long ago. It said: “smart people drink this tonic”*. I guess you could use if-then to work out its meaning. Well, have a nice day.


* I censored the trademark. I guess you already know about the advertisement.