As I told you in the last post, in order to build automatic part-of-speech tagging for text documents, I need to collect some corpora. In fact, because I want to do it on a distributed system, I need a large corpus. One great source of corpora is the web, but extracting plain text from HTML manually is quite cumbersome. So I heard that we can use a crawler to extract text from the web. Then I stumbled upon Nutch.
A Little About Nutch
Nutch is an open source search engine built on Lucene and Solr. According to Tom White, Nutch basically consists of two parts: the crawler and the searcher. The crawler fetches pages from the web and creates an inverted index from them, while the searcher answers users' queries against the fetched pages. Nutch can run on a single computer, but it also works great on a multinode cluster: it uses Hadoop MapReduce to work well in a distributed environment.
Simple Crawling with Nutch
Let’s get to the point. The objective I defined here is to build corpora from web pages. To achieve that, I’m just gonna crawl some web pages and extract their text. So I won’t be writing about searching for now, but I’m considering covering it in another post. Okay, this is my environment for this experiment:
- Ubuntu 10.10 Maverick Meerkat
- Java 6 OpenJDK
- Nutch version 1.0, which you can download here.
After you’re ready, let’s get started, shall we?
Set the JAVA_HOME directory
First, you must make sure your JAVA_HOME is set to where Java is installed. To do this, open your terminal and enter these commands:
JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
On the first line, I give the JAVA_HOME variable the location of my Java installation directory; in my case, that is /usr/lib/jvm/java-6-openjdk. On the second line, I export JAVA_HOME as an environment variable using the export command. On the third line, the PATH variable is set to the old PATH (before the JAVA_HOME addition) with the new $JAVA_HOME/bin directory appended, and the last line exports it as an environment variable too. To check that the environment is set correctly, you can use these commands:
echo $JAVA_HOME
/usr/lib/jvm/java-6-openjdk
echo $PATH
[another path]:/usr/lib/jvm/java-6-openjdk
The echo command prints the environment variable. If your Java installation directory is listed, then you’re ready to go.
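Note that export lines typed in a terminal only last for the current shell session. To make them permanent, a common convention (assuming bash is your login shell) is to append them to your ~/.bashrc:

```shell
# Append the two exports to ~/.bashrc so every new shell picks them up.
# The JDK path is the one used in this walkthrough; adjust it to your system.
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export PATH=$PATH:$JAVA_HOME/bin
EOF
# Show what was just appended
tail -n 2 ~/.bashrc
```

After that, every new terminal you open will have JAVA_HOME set without retyping the commands.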
Install (Extract) Nutch
After you download Nutch, you can install it by simply extracting it to your favourite directory. In this article, I’ll just use my home directory as the extraction target, and I’ll refer to the extracted directory as NUTCH_HOME.
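For completeness, this is roughly how the extraction looks from the terminal. I’m assuming the release archive is named nutch-1.0.tar.gz and sits in your home directory; the if guard simply skips the step when the archive isn’t there:

```shell
cd "$HOME"
# Extract the release archive if it has been downloaded
# (archive name assumed; adjust it to match your download)
if [ -f nutch-1.0.tar.gz ]; then
    tar -xzf nutch-1.0.tar.gz
fi
# NUTCH_HOME points at the extracted directory
export NUTCH_HOME="$HOME/nutch-1.0"
echo "NUTCH_HOME is $NUTCH_HOME"
```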
Set The Crawler Name
Before running the crawler, we must set the identity of our crawler. It’s the right thing to do: it informs website owners whose crawler this is and what we intend to do on their website. In Nutch, we can do this by editing nutch-default.xml, the file that holds the default configuration of our Nutch installation. This is how to do it:
- Open the NUTCH_HOME/conf/nutch-default.xml file.
- Set the http.agent.name property. Insert the name of the crawler in the <value> element and enter its description in the <description> element.
- You can optionally set http.agent.url to the URL of a page that describes your crawler, and http.agent.email to your contact email.
For your reference, here is a portion of my configuration file:
...
<property>
  <name>http.agent.name</name>
  <value>Arif's Spider</value>
  <description>This crawler is used to fetch text documents from web pages
  that will be used as a corpus for Part-of-speech-tagging
  </description>
</property>
...
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>aku at arifn dot web dot id</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
Listing the URL to Crawl
Now, let’s create the list of URLs we wish to crawl. This is how to do it:
Create a file listing the URLs that we want to fetch. In this experiment, I want to crawl this blog. So I created a directory called urls in NUTCH_HOME, and inside this directory I created a file titled arifn. These are the commands to do it:
cd nutch
mkdir urls
echo 'http://arifn.web.id/blog' > urls/arifn
The first command changes to the NUTCH_HOME directory. The second creates a directory named urls. The third writes a file named arifn with this blog’s URL in it. Of course, you can also use a regular text editor to create this file. :D
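If you later want to fetch more than one site, the seed file can simply list several URLs, one per line. A quick sketch (the second URL is only a placeholder):

```shell
# Create the urls directory and a seed file with two entries,
# one URL per line (example.com is just a placeholder here).
mkdir -p urls
printf '%s\n' \
    'http://arifn.web.id/blog' \
    'http://example.com/' > urls/arifn
cat urls/arifn
```

Remember that conf/crawl-urlfilter.txt must also accept each extra domain, or its pages will be filtered out of the crawl.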
Open conf/crawl-urlfilter.txt. This file configures the URL filtering. Let’s change MY.DOMAIN.NAME into the domain we want to crawl, in this case arifn.web.id. This is the file after the change:
...
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*arifn.web.id/
...
Okay, now we’re ready to crawl!! :D
This is how we do the crawl. Open your terminal and make sure you’re in the Nutch installation directory. Type this command:
bin/nutch crawl urls -dir crawl -depth 1 >& crawl.log
This is the explanation of the command:
- ‘bin/nutch crawl’ executes the Nutch crawl command.
- ‘urls’ specifies the directory containing the list of URLs to fetch.
- ‘-dir crawl’ sets the destination of the crawling results, in this case the crawl directory.
- ‘-depth 1’ sets the link depth to fetch. A depth of 1 means we only crawl the first page and do not follow the links on that page.
- ‘>& crawl.log’ writes the log of the crawling process to the crawl.log file.
In this experiment, we only crawl the home page of this blog, so that we can see the result quickly. You can then experiment with the link depth of the website you want to crawl by changing the -depth parameter. For a list of crawl commands, you can refer to the Nutch Tutorial page.
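As a sketch of such an experiment, a deeper crawl of the same seed list could look like the command below. The -topN flag caps how many pages are fetched at each level; both -depth and -topN are standard options of the Nutch 1.x crawl command. The if guard only lets the command run from inside a Nutch installation directory:

```shell
# Crawl three levels deep, fetching at most 50 pages per level.
# Run this from NUTCH_HOME; the guard skips it anywhere else.
if [ -x bin/nutch ]; then
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50 > crawl.log 2>&1
else
    echo "bin/nutch not found: run this from your Nutch directory"
fi
```

Be careful with larger depths: the number of fetched pages can grow very quickly, which is exactly why -topN is useful.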
Seeing the Result
You’ve done a little crawling; now let’s see the result. The crawling results are placed where you pointed the -dir parameter, in this case the crawl directory. The resulting directory contains several folders. Here are the folders and their explanations:
- crawldb, the crawl database, which contains information about every URL known to Nutch.
- indexes, the Lucene-format indexes built from the crawled web pages.
- linkdb, the link database, which contains the list of known links to each URL, including the source and anchor text of each link.
- segments, a set of URLs fetched at one time. Each segment consists of several directories:
  - content, the raw content retrieved from the URLs
  - crawl_fetch, the status of fetching each URL
  - crawl_generate, the set of URLs to be fetched
  - crawl_parse, the outlinks of the URLs, used to update the db
  - parse_data, the outlinks and metadata parsed from each URL
  - parse_text, the parsed text of each URL
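You can peek at this layout with a plain directory listing; the guard below skips it when no crawl output exists yet:

```shell
# List the subdirectories of each segment (run from NUTCH_HOME).
if [ -d crawl/segments ]; then
    find crawl/segments -mindepth 2 -maxdepth 2 -type d | sort
else
    echo "no crawl/segments directory here yet"
fi
```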
First, we wanna make sure that the crawl process worked, so we check the stats of the crawl database:
bin/nutch readdb crawl/crawldb -stats
The readdb command reads from crawl/crawldb, the crawl database we just fetched. The -stats parameter generates statistics about the crawl database. Here is the result of the command above:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 59
retry 0: 59
min score: 0.009
avg score: 0.032915253
max score: 1.0
status 1 (db_unfetched): 58
status 2 (db_fetched): 1
CrawlDb statistics: done
From the result above, you can see that there are 59 URLs found on my home page, but only 1 link was fetched (this is because we set the -depth parameter to 1). Well, our crawl was successful.
Next, because I want to get the plain text of the web pages, I need to read from the segments directory we just crawled. To do this, I type this command:
bin/nutch readseg -dump crawl/segments/* arifn
This command dumps the data from all segments in the crawl/segments directory (specified with the asterisk *) to the arifn directory. This writes all of the information in the segments into a dump file in that directory.
Now, I just wanna take the parsed text from the web pages, so I modified the command into this:
bin/nutch readseg -dump crawl/segments/* arifn -nocontent -nofetch -nogenerate -noparse -noparsedata
With this command, I’ll only get the parsed text from the segments. Here is a portion of the text I got.
Recno:: 0
URL:: http://arifn.web.id/blog/
ParseText:: sidudun sidudun Let’s Get Started 24/10/2010 no comment yet Hello, it has been a while since I updated this blog. I’m a little busy with college stuffs and something like that. And finally, I have came to the last year of my graduate study. After doing some consultations with some professors in my college, I got something as my research focus. Actually, it still at proposal stage, but I hope this will works, because so many people are counting on me about it. So, I wanna implement MapReduce to optimize processing in automatic part-of-speech tagging (POS tagging). ...
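For the corpus I only need the text itself, so the dump’s bookkeeping lines can be stripped with sed. To make the snippet runnable on its own, it first recreates a tiny sample in the dump format sketched above; for real data, point sed at the dump file inside the arifn directory instead:

```shell
# Recreate a tiny sample dump (the real one lives under arifn/).
mkdir -p arifn
cat > arifn/dump <<'EOF'
Recno:: 0
URL:: http://arifn.web.id/blog/
ParseText:: Hello, it has been a while since I updated this blog.
EOF
# Drop the record number and URL lines, and strip the ParseText:: prefix,
# leaving just the plain text for the corpus.
sed -e '/^Recno::/d' -e '/^URL::/d' -e 's/^ParseText:: //' arifn/dump > corpus.txt
cat corpus.txt
```

With a real multi-page dump, the same three sed expressions turn the whole file into one plain-text corpus.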
Well, the next thing you can do is stop crawling my blog and go crawl another site :p. Then, you can experiment with fetching more than one URL and running Nutch on a multinode Hadoop cluster. Good luck!