Simple Crawling with Nutch

Like I told you in the last post, in order to build automatic part-of-speech tagging for text documents, I need to collect some corpora. In fact, because I wanna do it on a distributed system, I need a large corpus. One great source of corpora is the web, but extracting plain text from HTML manually is quite cumbersome. I heard that a crawler can do the extraction for us, and that is how I stumbled upon Nutch.

A Little About Nutch

Nutch is an open source search engine built on top of Lucene and Solr. According to Tom White, Nutch basically consists of two parts: a crawler and a searcher. The crawler fetches pages from the web and builds an inverted index from them, while the searcher answers users’ queries against the fetched pages. Nutch can run on a single computer, but it also works great on a multi-node cluster; it uses Hadoop MapReduce to run well in a distributed environment.

Simple Crawling with Nutch

Let’s get to the point. The objective I defined here is to build corpora from web pages. To achieve that, I’m just gonna crawl some web pages and extract their text. So I won’t be writing about searching for now, but I’m considering covering it in another post. Okay, this is the environment I used for this experiment:

  • Ubuntu 10.10 Maverick Meerkat
  • Java 6 OpenJDK
  • Nutch version 1.0, which you can download here.
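
In Nutch 1.0 the crawl itself is kicked off with Nutch’s own crawl command, and the fetched pages end up in segment directories. To show where the plain text for the corpus comes from, here is a minimal sketch of reading the parsed text back out of a finished segment. This is my own illustration, not code shipped with Nutch: it assumes the Nutch 1.0 segment layout, where parsed text lives in the parse_text data files under a segment directory, and the path in the comment is just an example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;

    // Illustrative sketch: dump the plain text stored by the crawler, e.g.
    //   java DumpParseText crawl/segments/20101115103000/parse_text/part-00000/data
    public class DumpParseText {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path data = new Path(args[0]);             // the "data" file inside parse_text

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            Text url = new Text();                     // key: the page URL
            ParseText parseText = new ParseText();     // value: the extracted plain text
            while (reader.next(url, parseText)) {
                System.out.println(url + "\t" + parseText.getText());
            }
            reader.close();
        }
    }

A segment also keeps the raw page content and fetch metadata in sibling directories, but for building a corpus the parse_text part is the interesting one.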

Once you’re ready, let’s get started, shall we? Continue reading

Let’s Get Started

Hello, it has been a while since I updated this blog. I’ve been a little busy with college stuff and things like that. And finally, I’ve come to the last year of my graduate study. After some consultations with professors at my college, I got something as my research focus. Actually, it’s still at the proposal stage, but I hope it will work out, because so many people are counting on me about it.

So, I wanna implement MapReduce to speed up processing in automatic part-of-speech tagging (POS tagging). POS tagging is the process of assigning a word class (noun, verb, and so on) to every word in an entire collection of text documents. To make the process automatic, we can use approaches from natural language processing. Some approaches involve supervised learning, which means the models have to be trained on a tagged corpus before we use them to tag real-world text. We can use MapReduce to speed up both the training and the actual tagging.
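
Just to make the idea concrete, here is a rough sketch of the kind of job I have in mind for the training side: counting how often each word occurs with each tag in a pre-tagged corpus, which is one ingredient of a supervised tagger (for example, the emission counts of an HMM-based one). This is my own illustration rather than the actual research code, and the word_TAG token format is an assumption about how the corpus is written.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative sketch: count word/tag pairs in a corpus of lines like
    //   "the_DT dog_NN barks_VBZ"
    public class TagCount {

        public static class TagMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text pair = new Text();

            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.contains("_")) {          // token looks like "word_TAG"
                        pair.set(token.toLowerCase());
                        context.write(pair, ONE);       // emit (word_TAG, 1)
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();                     // sum the partial counts
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

The nice part is that the map side can be spread over the whole corpus while the reduce side only sums partial counts, so the tagged corpus can grow without the counting step becoming a bottleneck.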

Since this is my first time dealing with (yeah) both MapReduce and natural language processing, I feel a little bit anxious. My anxiety is already taking over my excitement, even. Hearing this, maybe you’ll ask how come I feel more anxiety than excitement. The answer is “I don’t know”, but I hope this works out and I can finish the research on time. Oh, maybe it’s because of the time variable. Well, if we didn’t have the time variable, when would we ever start the work?

Well, this is just me rambling. Thank you to all the readers who have asked questions, left comments, and everything else on this blog. I hope we can keep in touch. Wish me luck. I’ll write about my research little by little on this blog. So, stay tuned.. and let’s get started!!

Hadoop on Single Node Cluster

Hello there? S’up?

In my previous post, we learned how to develop a Hadoop MapReduce application in Netbeans. After our application runs well in Netbeans, now it’s time to deploy it on a cluster of computers. Well, it’s supposed to be a multi-node cluster, but for now, let’s try it on a single node cluster. This article gives a step-by-step guide on how to deploy a MapReduce application on a single node cluster.

In this tutorial, I’m using Ubuntu 9.10 Karmic Koala. For the Hadoop MapReduce application, I’ll use the code from my previous post. You can try it by yourself or you can just download the jar file. Are you ready? Let’s go then..
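
For orientation, a jar that you deploy this way typically has a small driver class as its entry point. The sketch below is a generic illustration of such a driver, not the exact code from my previous post: it uses ToolRunner so the job picks up the cluster configuration when launched through the hadoop command, and the library TokenCounterMapper/IntSumReducer classes stand in for the application’s own mapper and reducer.

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Illustrative sketch of a generic job driver; the class name is a placeholder.
    public class MyJobDriver extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "my-mapreduce-job");
            job.setJarByClass(MyJobDriver.class);
            job.setMapperClass(TokenCounterMapper.class);   // stand-in for your mapper
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);        // stand-in for your reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyJobDriver(), args));
        }
    }

Once it is packaged into a jar, the usual way to launch it is through the hadoop command with an input and an output HDFS path as arguments.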

Preparing the Environment

First things first, we must prepare the deployment environment by installing and configuring all the required software. For this process, I followed a great tutorial by Michael Noll about how to run Hadoop on a single node cluster. For simplicity, I’ll write a summary of all the steps mentioned in Michael’s post, but I do recommend you read it for the details. Continue reading

HTML: Alternative to Presentation Program?

Here is the situation. You are going to give an important presentation at an international conference. You have made your slides, and it’s like the greatest presentation in the universe. You made it using the latest version of Microsoft PowerPoint or OpenOffice.org Impress. You double-checked your presentation and your laptop right before going on stage. Suddenly, out of nowhere, your laptop crashes, throws an error, blue-screens, whatever. You have no choice, so you transfer your slides to another computer that can run the presentation. Unfortunately, that computer doesn’t have a program that can open your slides. It has an older version of PowerPoint that can’t open them. Or maybe it runs another operating system that your presentation program doesn’t support. You panic and can’t think clearly. Everything goes dark and suddenly you pass out. Continue reading

NoSQL: the End of RDBMS?

What? NoSQL? Yeah, you read it correctly. NoSQL. I forget when and where I heard about this for the first time, but I took notice of this data store technology again when I attended the second Bancakan 2.0 meetup last March. Listening to the speaker, lynxluna, I was reminded of HBase, a scalable distributed database that is part of the Apache Hadoop project. In case you didn’t know, Apache Hadoop is just one implementation of the MapReduce framework.

What is NoSQL?

So, what the hell is NoSQL? Here is the definition of NoSQL on Wikipedia:

NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage.
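
To make that less abstract, here is a tiny sketch of what “no fixed table schema” looks like in HBase, the data store I mentioned above. This is my own illustration: the exact client API differs a bit between HBase versions, and the table and column names are made up. Rows in the same column family can carry completely different columns, and a new column appears simply by writing it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative sketch: assumes a table "posts" with a column family "content"
    // already exists on the cluster.
    public class NoSchemaDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "posts");

            // Two rows in the same family, each with its own set of columns:
            // no ALTER TABLE, no fixed schema, no join to another table.
            Put first = new Put(Bytes.toBytes("post-1"));
            first.add(Bytes.toBytes("content"), Bytes.toBytes("title"),
                      Bytes.toBytes("NoSQL: the End of RDBMS?"));
            table.put(first);

            Put second = new Put(Bytes.toBytes("post-2"));
            second.add(Bytes.toBytes("content"), Bytes.toBytes("title"),
                       Bytes.toBytes("Simple Crawling with Nutch"));
            second.add(Bytes.toBytes("content"), Bytes.toBytes("tags"),
                       Bytes.toBytes("nutch,hadoop"));
            table.put(second);

            Result row = table.get(new Get(Bytes.toBytes("post-2")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("content"), Bytes.toBytes("tags"))));
            table.close();
        }
    }

The trade-off, and the reason the quote mentions avoiding joins, is that any relationship between rows has to be handled by the application or encoded in how the row keys are designed.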

Continue reading

Have You Tried It on IE6?

About a year ago, I created a blog aggregator, also sometimes called a Planet. This planet displays all the blog posts from my registered friends. At first, I did it alone; I maintained and designed it by myself. Then my friend Andreas wanted to help me maintain the site, so I gave him an administrator role. Some days ago, he sent me this email:

Rif,

Have you opened the website in IE6? The layout and the design look screwed :(

IE6, or Internet Explorer version 6, is the browser that shipped with Windows XP. It was released in 2001 and, at that time, had better CSS support than its predecessor. The problem with the browser is its lack of support for web standards. If you’re a web designer, you surely know the design problems in IE6. This browser has a bad reputation among web designers. Continue reading

Wanted: Superman

Some days ago, there was a job vacancy posted on my undergraduate department’s mailing list. A company is looking for a programmer. I didn’t pay much attention to the email at first. Okay, here it is:

Mr. XXX, my office needs a programmer with these qualifications:

  • Have knowledge in VB, Java, and PHP
  • Have at least 1 year of experience as a programmer/developer in an IT division, an IT company, or a software house
  • Have the ability to give product presentations to potential clients
  • Have knowledge in CorelDRAW and Photoshop
  • Have knowledge in Linux
  • Have knowledge in building computer networks
  • Have knowledge in hardware

Continue reading