Yet Another Introduction to MapReduce (part 2)

13/03/2010

I’m sorry for the long delay since the first part; I’ve been pretty busy lately. In this part, I write about the idea behind MapReduce, how it works, and how it distributes data and processing. This article draws heavily on the MapReduce paper by Google. I’m writing it up again to deepen my own understanding of the concept. Enjoy!

What is MapReduce?

According to Wikipedia, MapReduce is a software framework patented by Google to support distributed computing on large data sets on clusters of computers. The framework was presented by Jeffrey Dean and Sanjay Ghemawat at OSDI’04: Sixth Symposium on Operating System Design and Implementation, in December 2004. The main idea is to use functional programming techniques to simplify processing in a distributed environment.

MapReduce processes data using the list concept that is common in functional programming. The process consists of two functions, map and reduce. Each function takes a list of input elements and produces a list of output elements. The map function takes the inputs and produces intermediate key-value pairs. These pairs are then sent to the reduce function, which takes them as its input and, for each intermediate key, merges the values together to produce the output. According to the paper, each reduce invocation typically produces zero or one output value.
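To make the map and reduce steps concrete, here is a minimal single-machine sketch of word counting in Java. It only imitates the flow on one computer (map each line, group the intermediate pairs by key, reduce each group); the class and method names are my own and are not the API of Google's implementation or of any real framework.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine imitation of map -> group by key -> reduce.
// All names here are hypothetical, chosen only for illustration.
class WordCountSketch {

    // map: one input line in, a list of intermediate (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce: every value collected for one intermediate key in, one output value out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: applies map to every line, groups the pairs by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog", "the fox");
        System.out.println(run(lines)); // prints {dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

In a real MapReduce system the grouping step and the calls to map and reduce are spread over many machines; here the driver loop stands in for all of that.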

College Students, Your Job is in Danger

19/01/2010

On my college department’s mailing list, there is an interesting discussion about the quality of IT bachelor’s graduates in the workplace. The concerns raised come down to two points:

  • Workers with a bachelor’s degree lack practical skills. They cannot answer fundamental questions that every IT or computer science graduate should be able to answer.
  • Workers with a bachelor’s degree also lack soft skills, such as how to speak with the higher-ups and communicate with other workers.

As a result, companies prefer to hire vocational IT graduates. Why?

  • A vocational graduate sometimes has practical skills that a bachelor graduate doesn’t have. Computer science and IT knowledge is widely available, which means you don’t have to go to college just to learn how to program. The learning materials are all over the Internet, within everyone’s reach.
  • Vocational graduates are easier to manage. Some of them show more respect to the higher-ups than bachelor graduates do.
  • The standard salary for vocational graduates is lower than for bachelor graduates. Combine this with better skills and more respect, and bachelor graduates’ jobs are in grave danger.


Research Plan

18/01/2010

Howdy,

When I was in college, I tried to implement Web Map Service (WMS) and Web Feature Service (WFS) as a foundation for a distributed Geographical Information System (better known as GIS). My academic advisor at the time told me that the idea was not entirely new, but that a lot of people still didn’t know about it. So with this topic as my thesis, he hoped that one day more people would know about this technology.

The implementation I made was actually quite simple. But let me tell you the complete story. At first, I was thinking about developing geographical operations that could be run via a web service in the cloud. After some weeks of analyzing and gathering information, I found out that this work could be really hard and time consuming. I didn’t have a background in geography (I’m a computer science student) and I didn’t have much time before the next graduation. In the end, I just created a spatial data repository and made it accessible across the network using GeoServer, an open source implementation of WMS and WFS. I then created a simple web application, built with OpenLayers, to pull the spatial data and display it in the browser. I also provided a simple data update feature, using one of the features of WFS. It’s really simple, actually.

In my graduate study, I now want to try something entirely different. I want to explore MapReduce, a programming model for processing large-scale data in a distributed environment. I heard about this model from some mailing lists and websites, and I was surprised that the paper [pdf], the lecture notes and the videos are easy to get. So, for the time being, I have decided to do some experiments in order to learn something about it.

It’s still just a plan in my head, actually. I have never talked about it with a thesis advisor (because I don’t have one yet). But I can predict some problems that I will have to deal with if I go ahead with this research plan. They are:

  • The case. I don’t have any idea yet about the case I should solve with this research. My college advisor suggested doing something in bioinformatics, like genome assembly. I think I will consider it, but I’m open to ideas.
  • The machines and their network. The lab is always busy with other graduate students. Fortunately, a friend of mine told me that there is another place on campus that I can use for experiments. I need to write a permission letter first, though. Okay, I’ll do it.

In the meantime, I’ll focus on learning about MapReduce. Maybe I’ll post something about it on this blog. If you have a suggestion about what I should do with this programming model, let me know. I’d be really glad to hear it.


Credits:

EDSAC pictures, copyrighted Computer Laboratory, University of Cambridge, licensed under the Creative Commons Attribution 2.0 Generic license.

Scheduling

03/09/2009

My schedule for this semester is out. Unlike the timetable I used to get at my old college, this schedule is arranged by subject instead of by day. I needed some time to adjust to reading it, but so far I’m doing fine. The only thing not going well is the schedule itself.

It is a fact that the first-week schedule may leave some of the people involved dissatisfied. The chance of schedule clashes is pretty high, since a lot of people are involved. So the first week is usually a mediation period. In the coming week, the schedule will be rearranged to meet everyone’s needs.

Why Generate Database Keys?

09/06/2009

Howdy.

Not so long ago, I created a GUI for data storage using Java and, of course, JDBC. I followed a tutorial on how to insert records into a database table. The tutorial said that before inserting a new record, we should generate a unique key as the record’s identity. The key generation is handled by Java code. The tutorial uses the System.currentTimeMillis() function to create the key, so the code looks like this:

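//set new id
// currentTimeMillis() returns a long, which is boxed to a Long here
Number time = System.currentTimeMillis();
// intValue() keeps only the low 32 bits; dividing by 10000 leaves 10-second resolution
Integer id = (time.intValue()/10000);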

The generated key is then inserted into the database.
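One property of this formula is worth checking before relying on it: because of the division by 10000, two timestamps that fall in the same 10-second window give the same key. The sketch below wraps the tutorial’s formula in a method that takes the timestamp as a parameter, so the effect can be seen without waiting on the clock. The class and method names are my own, for illustration only.

```java
// Reproduces the tutorial's key formula for an arbitrary timestamp.
// TimeKeyDemo and idFor are hypothetical names, not part of the tutorial.
class TimeKeyDemo {

    static int idFor(long millis) {
        Number time = millis;            // boxed to Long, as in the tutorial code
        return time.intValue() / 10000;  // low 32 bits, then 10-second resolution
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        int first = idFor(now);
        int second = idFor(now + 1);     // a record created one millisecond later
        int third = idFor(now + 2);      // and another one millisecond after that
        System.out.println(first + " " + second + " " + third);
        // At most one 10-second boundary can fall between three consecutive
        // milliseconds, so at least two of these keys are equal.
        System.out.println("collision: " + (first == second || second == third));
    }
}
```

So the formula is only safe when records are created far apart in time, which may be fine for a small single-user GUI but not for anything busier.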

As you may have realized, database systems usually have their own key generation technique, for example the AUTO_INCREMENT attribute in MySQL. I followed the tutorial and, sure enough, it works. But I also tried database-generated keys, and they work too. So why should we bother generating our own keys?

An O’Reilly Java author gave an explanation for that. They said:

However, using the supported key generation tools of your database of choice presents several problems:

  • Every database engine handles key generation differently. Thus, it is difficult to build a truly portable JDBC application that uses proprietary key generation schemes.
  • Until JDBC 3.0, a Java application had no clear way of finding out which keys were generated on an insert.
  • Automated key generation wreaks havoc with EJBs.

I got the point that the application should be portable. Who knows, someday we may migrate from one database system to another, and database-generated keys would then be hard to control.

Another detailed answer came from Scott Selikoff. He wrote a complete article about database key generation and gave an example:

Now, let’s say a user is in the process of creating a new record in your system. For each user record, you also have a set of postal addresses. For example, Bob may be purchasing items on NewEgg and have a home address and a work address. Furthermore, Bob enters his two addresses at the time he creates his account, so the application server receives the information to create all 3 records at once. In such a situation, you would normally have 3 records: 1 user record for Bob and 2 address records. You could add the address info in the user table, although then you have to restrict the number of addresses Bob can have and/or have a user table with a lot of extra columns.

Inserting Bob into the user table is straight forward enough, but there is a problem when you go to insert users into the address table, namely that you need Bob’s newly generated User Id in order to insert any records into the address table. After all, you can’t insert addresses without being connected to a specific user, lest chaos ensue in your data management system.

The problem arises with complex database relationships. Since we don’t know the generated key, we can’t set the foreign key in the related tables. I didn’t notice this because the GUI I made used a simple database design, maybe with no table relations at all.
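Here is a small in-memory sketch of the Bob example, assuming the application generates the keys itself. The two lists stand in for the user and address tables, and a simple counter stands in for whatever key generator the application would really use; every name in it is hypothetical. The point is only that the user’s key is known before anything is stored, so the address rows can carry it as their foreign key right away.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for a user table and an address table.
// All class, field and method names are made up for illustration.
class AppKeySketch {

    static class User {
        final long id;
        final String name;
        User(long id, String name) { this.id = id; this.name = name; }
    }

    static class Address {
        final long id;
        final long userId;   // foreign key pointing at User.id
        final String street;
        Address(long id, long userId, String street) {
            this.id = id;
            this.userId = userId;
            this.street = street;
        }
    }

    // Application-side key generator (an assumption; any unique source would do).
    final AtomicLong nextId = new AtomicLong(1);
    final List<User> users = new ArrayList<>();
    final List<Address> addresses = new ArrayList<>();

    // Creates the user and all of its addresses in one pass.
    User register(String name, List<String> streets) {
        long userId = nextId.getAndIncrement();  // known before any "insert"
        User user = new User(userId, name);
        users.add(user);
        for (String street : streets) {
            addresses.add(new Address(nextId.getAndIncrement(), userId, street));
        }
        return user;
    }

    public static void main(String[] args) {
        AppKeySketch store = new AppKeySketch();
        User bob = store.register("Bob", Arrays.asList("home address", "work address"));
        for (Address a : store.addresses) {
            System.out.println(a.street + " belongs to user " + a.userId + " (Bob is " + bob.id + ")");
        }
    }
}
```

With database-generated keys, the same flow needs the database to report Bob’s key back after the first insert, which is what Statement.getGeneratedKeys() was added for in JDBC 3.0.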

So, I got my question answered. Do you have another answer to it? Feel free to share.
