Yet Another Introduction to MapReduce (part 2)

13/03/2010

I’m sorry for the long delay since the first part; I’ve been pretty busy lately. In this part, I write about the idea of MapReduce, how it works, and how it distributes data and processing. This article draws heavily on the MapReduce paper by Google. I am writing it up again to deepen my own understanding of the concept. Enjoy!

What is MapReduce?

According to Wikipedia, MapReduce is a software framework patented by Google to support distributed computing on large data sets on clusters of computers. The framework was presented by Jeffrey Dean and Sanjay Ghemawat at OSDI’04: Sixth Symposium on Operating Systems Design and Implementation, in December 2004. The main idea is to use functional programming techniques to simplify processing in a distributed environment.

MapReduce processes data using the list concept commonly used in functional programming. The process consists of two functions, map and reduce. Each function takes a list of input elements and produces a list of outputs. The map function takes the inputs and produces intermediate key-value pairs. These pairs are then sent to the reduce function, which takes them as its input. For each intermediate key, the reduce function merges the associated values together to produce the output. According to the paper, each reduce invocation typically produces zero or one output value.
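To make the map and reduce roles concrete, here is a minimal single-machine sketch of a word count. It is not how Google's framework or Hadoop is implemented: the class name `MapReduceSketch`, the `run` driver, and the in-memory grouping step are all assumptions made up for this illustration, and nothing here is distributed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the map/reduce idea.
public class MapReduceSketch {

    // map: takes one input element (a line of text) and emits an
    // intermediate (word, 1) pair for every word found in it.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: receives one intermediate key together with all of its
    // values and merges them into a single output value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: runs map on every input, groups the
    // intermediate pairs by key, then calls reduce once per key.
    static Map<String, Integer> run(List<String> inputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : inputs) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
    }
}
```

In a real deployment the grouping step in `run` is the part the framework handles across machines; the programmer only supplies the two functions.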

Yet Another Introduction to MapReduce (part 1)

03/02/2010

There are already many articles out there about what MapReduce is, the basic concepts behind it, how it works, and many other things. Even so, I still want to write a little introduction to MapReduce. It’s mandatory, at least for me, to write about “something” in order to understand that “something”. I challenge my understanding of MapReduce in this post. I’ll use some of the resources available in the cloud, like the ones I mentioned earlier. This is just another introduction to MapReduce.

Data, Data, Data

We are living in the cloud era. The Internet provides us with great resources to help our lives. Along the way, we have created a lot of data. Consider a search engine like Google or Bing. They index sites all across the network, and the number of sites these days is big. Netcraft reported that there are more than 200 million sites in the world. This means a search engine must process and analyze a lot of data.

Programming Hadoop in Netbeans

23/01/2010

Hadoop MapReduce is an open source implementation of the MapReduce programming model for processing large-scale data in a distributed environment. Hadoop is implemented in Java as a class library. There are several distributions of Hadoop, from Apache, Cloudera, and Yahoo!

Meanwhile, Netbeans is an integrated development environment (or IDE) for programming in Java and many other programming languages. Netbeans (like any other IDE) helps programmers develop applications as easily and painlessly as possible. In this case, it helps us develop Hadoop MapReduce jobs.

In this post, I’ll show you step by step how to use Netbeans to develop a Hadoop MapReduce job. I’m using Netbeans 6.8 on Ubuntu Karmic Koala. The MapReduce program we are going to create here is a simple program called wordcount. This program reads text from some files and lists every word along with how many times it appears across all the files. The source code of this program is available in the MapReduce tutorial bundled with the Apache Hadoop distribution.

We divide this tutorial into three steps. First, we will install Karmasphere Studio for Hadoop, a Netbeans extension. Then, we will type some code. And finally, we will run the MapReduce job in Netbeans. Okay, fasten your seat belt. Here we go.

Why Generate Database Keys?

09/06/2009

Howdy.

Not so long ago, I created a GUI for data storage using Java and, of course, JDBC. I followed a tutorial about how to insert records into a database table. The tutorial said that before inserting a new record into the database, we should generate a unique key as the record’s identity. The generation is handled by Java code. The tutorial uses the System.currentTimeMillis() function to create the key. So, the code looks like this:

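// set new id from the current time in milliseconds
Number time = System.currentTimeMillis();
// note: intValue() narrows the long timestamp to an int (low 32 bits only)
Integer id = (time.intValue()/10000);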

The generated key is then inserted into the database.

As you may have realized, database systems usually have their own key generation technique, for example the AUTO_INCREMENT attribute in MySQL. I followed the tutorial and, sure enough, it worked. But I also tried a database-generated key, and that worked too. So, why should we bother generating our own keys?

The authors of an O’Reilly Java book gave an explanation for that. They said:

However, using the supported key generation tools of your database of choice presents several problems:

  • Every database engine handles key generation differently. Thus, it is difficult to build a truly portable JDBC application that uses proprietary key generation schemes.
  • Until JDBC 3.0, a Java application had no clear way of finding out which keys were generated on an insert.
  • Automated key generation wreaks havoc with EJBs.

I got the point that the application should be portable. Who knows whether someday we’ll be migrating from one database system to another? In that case, database-generated keys would be hard to control.

Another detailed answer came from Scott Selikoff. He wrote a complete article about database key generation and gave an example:

Now, let’s say a user is in the process of creating a new record in your system. For each user record, you also have a set of postal addresses. For example, Bob may be purchasing items on NewEgg and have a home address and a work address. Furthermore, Bob enters his two addresses at the time he creates his account, so the application server receives the information to create all 3 records at once. In such a situation, you would normally have 3 records: 1 user record for Bob and 2 address records. You could add the address info in the user table, although then you have to restrict the number of addresses Bob can have and/or have a user table with a lot of extra columns.

Inserting Bob into the user table is straight forward enough, but there is a problem when you go to insert users into the address table, namely that you need Bob’s newly generated User Id in order to insert any records into the address table. After all, you can’t insert addresses without being connected to a specific user, lest chaos ensue in your data management system.

The problem arises in complex database relationships. If we don’t know the generated key, we can’t set the foreign key in the corresponding tables. I didn’t realize this because the GUI I made used a simple database design, maybe with no table relations at all.
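The foreign key problem can be sketched without a real database. In the rough illustration below, the application generates the user’s key before any insert, so the address records can reference it right away. Everything here is hypothetical: the `KeySketch` class, the `register` and `addressesOf` helpers, and the in-memory lists standing in for the user and address tables. It also swaps the tutorial’s time-based key for `java.util.UUID`, which is my substitution, not what the tutorial or the quoted article used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical illustration of application-generated keys.
public class KeySketch {

    public static final class User {
        public final String id;
        public final String name;

        User(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    public static final class Address {
        public final String id;
        public final String userId; // foreign key pointing at User.id
        public final String street;

        Address(String id, String userId, String street) {
            this.id = id;
            this.userId = userId;
            this.street = street;
        }
    }

    // In-memory lists standing in for the user and address tables.
    static final List<User> users = new ArrayList<>();
    static final List<Address> addresses = new ArrayList<>();

    // Assumption: a random UUID replaces the tutorial's time-based key.
    static String newKey() {
        return UUID.randomUUID().toString();
    }

    // The user's key exists before anything is stored, so the address
    // rows can be created in the same step with the foreign key set.
    static User register(String name, String... streets) {
        User user = new User(newKey(), name);
        users.add(user);
        for (String street : streets) {
            addresses.add(new Address(newKey(), user.id, street));
        }
        return user;
    }

    // Returns every stored address that belongs to the given user key.
    static List<Address> addressesOf(String userId) {
        List<Address> found = new ArrayList<>();
        for (Address a : addresses) {
            if (a.userId.equals(userId)) {
                found.add(a);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        User bob = register("Bob", "Home Street 1", "Work Avenue 2");
        System.out.println(addressesOf(bob.id).size() + " addresses stored for " + bob.name);
    }
}
```

With a database-generated key, the equivalent of `register` would have to insert the user first and then ask the database which key it assigned before inserting the addresses.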

So, I got my question answered. Do you have another answer for my question? Feel free to share.

Aggregator Adventures

27/05/2009

Howdy.

One day, while surfing the web, I found an interesting site. It functioned like a normal blog, but the posts on it came from various other blogs. Later, I learned that this kind of blog is called an aggregator. How does a blog aggregator work? It collects the RSS feeds of the blogs registered with it. The feed items are then published as normal posts in the aggregator. One difference is that each article title links to the “real” source blog.
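As a rough sketch of the collecting step, the code below reads the items of an RSS 2.0 document and pairs each title with the link back to its source blog. It uses only the XML parser that ships with Java. The `FeedSketch` class, the `readItems` helper, and the sample feed with its example.com URLs are made up for this illustration; a real aggregator such as Planet or WP-o-Matic does far more (fetching over HTTP, caching, cleaning up HTML).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration of reading posts out of an RSS feed.
public class FeedSketch {

    // Made-up feed content; the URLs are placeholders.
    static final String SAMPLE_FEED =
          "<rss version=\"2.0\"><channel>"
        + "<title>Blog A</title><link>http://example.com/blog-a</link>"
        + "<item><title>First post</title>"
        + "<link>http://example.com/blog-a/first</link></item>"
        + "<item><title>Second post</title>"
        + "<link>http://example.com/blog-a/second</link></item>"
        + "</channel></rss>";

    // Returns one "title -> link" string per <item> in the feed.
    static List<String> readItems(String rssXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> posts = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Look up title and link inside the item, not the channel.
                String title = item.getElementsByTagName("title").item(0).getTextContent();
                String link = item.getElementsByTagName("link").item(0).getTextContent();
                posts.add(title + " -> " + link);
            }
            return posts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        for (String post : readItems(SAMPLE_FEED)) {
            System.out.println(post);
        }
    }
}
```

Keeping the link next to the title is what lets the aggregator point each headline back at the original blog.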

Ilkomerz 41 Blog Aggregator

When I found out that some of my friends also have their own blogs, I started thinking about building an aggregator like this. The site that I found earlier is powered by Planet, a Python-based feed reader. I tried it and ended up failing. I just couldn’t configure and test it properly. I didn’t have an Internet connection and wasn’t familiar with Python at that time. So, I looked for an alternative and found the WP-o-Matic plugin for WordPress. I installed it, asked my friends’ permission to grab their feeds, and hosted it. I called it Ilkomerz 41 Blog Aggregator.

Another problem arose. The plugin reads and parses the feeds of its registered blogs at a regular interval. To make this fetching run automatically, it uses a cron job. WP-o-Matic has two cron job options. The first is a UNIX cron job run by the web host. Free hosting, like the one I use, doesn’t offer a cron job feature. So I went with the second option, web cron, which automatically calls the fetching script on the host.

I used the web cron option for some months until the aggregator suddenly went down. It worked again after I sent a support ticket to the hosting provider. I don’t think the cause was the web cron using too many resources. I suspected that, because of the multi-user nature of the aggregator, the blog was automatically sending emails to the blog writers, and the host may have treated them as spam. So I killed the auto-email feature. I also shut down WP-o-Matic’s web cron and switched to manual fetching. The web cron may not have been the cause, but I didn’t want to risk losing this blog a second time.

Manual fetching was a pain. I had to open Google Reader, check whether my friends had new posts, log in to the aggregator, and fetch the posts before they would show up. I thought there must be a better way to do this. By accident, I found out that some sites provide free cron jobs. One of them is SetCronJob. This site can call the cron script URL of the aggregator. So, I tried it yesterday and registered the aggregator with the site. Now my aggregator works well automatically. I don’t know what the future will bring, but I have high hopes for SetCronJob.

Do you have other opinions or experiences? Feel free to share them.
