Simple Crawling with Nutch

Like I told you in the last post, in order to build automatic part-of-speech tagging for text documents, I need to collect some corpora. In fact, because I want to do it on a distributed system, I need a large corpus. One great source of corpus material is the web. But extracting plain text from HTML manually is quite cumbersome, and I heard that we can use a crawler to extract text from the web instead. That's how I stumbled onto Nutch.

A Little About Nutch

Nutch is an open source search engine built on top of Lucene and Solr. According to Tom White, Nutch basically consists of two parts: a crawler and a searcher. The crawler fetches pages from the web and builds an inverted index from them. The searcher answers users' queries against the fetched pages. Nutch can run on a single computer, but it also works well on a multi-node cluster: Nutch uses Hadoop MapReduce so it can operate in a distributed environment.

Simple Crawling with Nutch

Let's get to the point. The objective I defined here is to build corpora from web pages. To achieve that, I'm just going to crawl some web pages and extract their text. So I won't be writing about searching for now, but I'm considering covering it in another post. Okay, this was my environment when I did this experiment:

  • Ubuntu 10.10 Maverick Meerkat
  • Java 6 OpenJDK
  • Nutch version 1.0, which you can download from the Apache Nutch site.

After you’re ready, let’s get started, shall we?

Set the JAVA_HOME directory

First, you must make sure your JAVA_HOME is set to the directory where Java is installed. To do this, open your terminal and enter these commands:

JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH

On the first line, I assign the JAVA_HOME variable the location of my Java installation directory; in my case, that is /usr/lib/jvm/java-6-openjdk. On the second line, I export JAVA_HOME as an environment variable using the export command. On the third line, the PATH variable is set to the old PATH (before the JAVA_HOME addition) with the new JAVA_HOME/bin directory appended. And the last line exports it as an environment variable too. To check that the environment is set correctly, you can use these commands:

echo $JAVA_HOME
/usr/lib/jvm/java-6-openjdk
echo $PATH
[another path]:/usr/lib/jvm/java-6-openjdk/bin

The echo command is used to print the environment variable. If your Java installation directory is listed, then you’re ready to go.
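
Note that variables exported this way only last for the current terminal session. If you want them set every time you open a terminal, one common way on Ubuntu is to append the same exports to your ~/.bashrc (this is just a convenience, not required by Nutch):

echo 'export JAVA_HOME=/usr/lib/jvm/java-6-openjdk' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc    # reload the file so the current shell picks up the change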

Install (Extract) Nutch

After you download Nutch, you can install it by simply extracting the archive to your favourite directory. In this article, I'll just extract it into my home directory, so the NUTCH_HOME directory (the extracted Nutch folder) will live under /home/user. The commands below sketch what that looks like.
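
A minimal sketch of the extraction, assuming the downloaded archive is named nutch-1.0.tar.gz and sits in your home directory (adjust the file and folder names to whatever your download actually uses):

cd ~
tar -xzf nutch-1.0.tar.gz   # extract the Nutch release into the home directory
mv nutch-1.0 nutch          # optional rename, so NUTCH_HOME is ~/nutch (matching the commands later in this post)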

Set The Crawler Name

Before running the crawler, we must set its identity. It's the right thing to do: it tells website owners that this is our crawler and what we intend to do on their websites. In Nutch, we can do this by editing nutch-default.xml, the file that holds the default configuration of our Nutch installation. This is how to do it:

  1. Open the NUTCH_HOME/conf/nutch-default.xml file.
  2. Set the http.agent.name property. Put the name of the crawler in the <value> element and its description in the <description> element.
  3. Optionally, you can also set http.agent.url to the URL of a page that describes your crawler, and http.agent.email to your contact email.

For reference, here is a portion of my configuration file:

...
<property>
  <name>http.agent.name</name>
  <value>Arif's Spider</value>
  <description>This crawler is used to fetch text documents from
web pages that will be used as a corpus for Part-of-speech-tagging
  </description>
</property>
...
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>aku at arifn dot web dot id</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
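
A side note on where to put these settings: I edited nutch-default.xml directly above, but Nutch also reads conf/nutch-site.xml, and values placed there override the defaults, so a cleaner setup is to keep your own properties in that file and leave the defaults untouched. A minimal sketch of such a nutch-site.xml (same property names as above):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Arif's Spider</value>
  </property>
</configuration>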

Listing the URL to Crawl

Now, let's create the list of URLs we wish to crawl. This is how to do it:

Create a file listing the URLs we want to fetch. In this experiment, I want to crawl this blog, so I created a directory called urls in NUTCH_HOME and, inside that directory, a file named arifn. These are the commands to do it:

cd nutch
mkdir urls
echo 'http://arifn.web.id/blog' > urls/arifn

The first command changes to the NUTCH_HOME directory. The second command creates a directory named urls. The third command writes a file named arifn inside that directory, containing this blog's URL. Of course, you can also use a regular text editor to create this file. :D

Next, open conf/crawl-urlfilter.txt. This file configures the URL filtering. Let's change MY.DOMAIN.NAME to the domain we want to crawl, in this case arifn.web.id. This is the file after the change:

...
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*arifn.web.id/
...
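
One small remark on the pattern above: the dots in arifn.web.id are not escaped, so strictly speaking they match any character. It works fine in practice, but if you want to be pedantic you could escape them like this:

# accept hosts in arifn.web.id only (dots escaped)
+^http://([a-z0-9]*\.)*arifn\.web\.id/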

Okay, now we’re ready to crawl!! :D

CRAWL!!!!

This is how we do the crawl. Open your terminal and make sure you're in the Nutch installation directory. Type this command:

bin/nutch crawl urls -dir crawl -depth 1 >& crawl.log

This is the explanation of the command:

  • the ‘bin/nutch crawl‘ part executes Nutch's crawl command.
  • the ‘urls‘ argument specifies the directory containing the list of URLs to fetch.
  • the ‘-dir crawl‘ option sets the destination of the crawling results, in this case the crawl directory.
  • the ‘-depth 1‘ option sets the link depth to fetch. A depth of 1 means we only crawl the seed page itself and do not follow the links found on it.
  • the ‘>& crawl.log‘ part redirects the output of the crawling process into the crawl.log file.

In this experiment, we only crawl the home page of this blog, so we can see the result quickly. You can then experiment with the link depth of the website you want to crawl by changing the -depth parameter. For the full list of crawl options, you can refer to the Nutch Tutorial page. Before moving on, it's worth keeping an eye on the log, as shown below.
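
Because the command above redirects everything to crawl.log, the terminal stays quiet while the crawl runs. A couple of handy checks, plain shell and nothing Nutch-specific:

tail -f crawl.log                       # follow the log while the crawl is running (from a second terminal)
grep -c 'fetching' crawl.log            # afterwards: count the fetch attempts
grep -iE 'exception|error' crawl.log    # quick check for problems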

Seeing the Result

You've done a little crawling; now let's look at the result. The crawling result is placed wherever you set the -dir parameter, in this case the crawl directory. The resulting directory contains several folders (you can confirm them with the quick listing after this list). Here are the folders and their explanations:

  • crawldb, the crawl database, which contains information about every URL known to Nutch.
  • index and indexes, the Lucene-format indexes built from the crawled web pages.
  • linkdb, the link database, which contains the list of known links to each URL, including the source and anchor text of each link.
  • segments, where each segment is a set of URLs fetched at one time. A segment consists of these directories:
    • content, containing the raw content retrieved from the URLs
    • crawl_fetch, containing the fetch status of each URL
    • crawl_generate, naming the set of URLs to be fetched
    • crawl_parse, containing the outlinks of each URL, used to update the crawldb
    • parse_data, containing outlinks and metadata parsed from each URL
    • parse_text, containing the parsed text of each URL
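
A quick sanity check right after the crawl; your listing should look roughly like this (the segment itself is named after a timestamp, so the exact path under crawl/segments will differ on your machine):

ls crawl
# crawldb  index  indexes  linkdb  segments
ls crawl/segments/*
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text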

First, we want to make sure that the crawl process actually worked, so we will check the stats of the crawldb.

bin/nutch readdb crawl/crawldb -stats

The readdb command reads from crawl/crawldb, the crawl database we just built. The -stats parameter prints statistics about the crawl database. Here is the result of the command above:

CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:	59
retry 0:	59
min score:	0.009
avg score:	0.032915253
max score:	1.0
status 1 (db_unfetched):	58
status 2 (db_fetched):	1
CrawlDb statistics: done

From the result above, you can see that 59 URLs were found on my home page, but only 1 page was fetched (this is because we set the -depth parameter to 1). So our crawl was successful.
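
If you want to see the individual URL records rather than just the totals, readdb also has a -dump option. The output directory name here (crawldb-dump) is just my choice, and the exact name of the part file inside it may differ on your setup:

bin/nutch readdb crawl/crawldb -dump crawldb-dump   # dump every URL record as plain text
less crawldb-dump/part-00000                        # inspect the result (part-file name may vary)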

Next, because I want to get the plain text of the web page, I need to read from the segment directory that we just crawled. To do this, I typed this command:

bin/nutch readseg -dump crawl/segments/* arifn

This command dumps the data from all segments in the crawl/segments directory (specified with the asterisk *; with -depth 1 there is only one segment, so the glob expands to that single directory) into the arifn directory. As a result, all of the information in the segments is written to the arifn directory.
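
If you are curious what the full dump looks like before filtering it down, you can simply open it. On my setup, readseg writes a single text file named dump inside the output directory; check the arifn directory if yours is named differently:

ls arifn
# dump
less arifn/dump   # browse the full segment dump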

Now I just want the parsed text from the web pages, so I modified the command into this:

bin/nutch readseg -dump crawl/segments/* arifn -nocontent -nofetch -nogenerate -noparse -noparsedata

With this command, I only get the parsed text from the segments. Here is a portion of the text I got:

Recno:: 0
URL:: http://arifn.web.id/blog/

ParseText::
sidudun sidudun Let’s Get Started 24/10/2010 no comment yet Hello, it has been a while since I updated this blog. I’m a little busy with college stuffs and something like that. And finally, I have came to the last year of my graduate study. After doing some consultations with some professors in my college, I got something as my research focus. Actually, it still at proposal stage, but I hope this will works, because so many people are counting on me about it. So, I wanna implement MapReduce to optimize processing in automatic part-of-speech tagging (POS tagging). ...
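
Since the end goal is a plain-text corpus, the last step is to strip the bookkeeping lines (Recno::, URL::, ParseText::) out of the dump. This is only a rough post-processing sketch based on the layout shown above, not a Nutch feature, so check it against your own dump before relying on it:

# keep only the parsed text, dropping the Recno::, URL:: and ParseText:: headers
grep -v '^Recno::' arifn/dump \
  | grep -v '^URL::' \
  | sed 's/^ParseText:://' > corpus.txt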

What’s Next?

Well, the next thing you can do is stop crawling my blog and go crawl another site :p. Then you can experiment with fetching more than one URL (a hypothetical example follows below) and with running Nutch on a multi-node Hadoop cluster. Good luck!
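
For the multi-URL experiment, the change is small: list several URLs in the seed file (one per line) and add a matching accept rule per domain in conf/crawl-urlfilter.txt. A purely hypothetical example, with example.com and example.org standing in for whatever sites you pick:

# urls/arifn (or a new seed file) – one URL per line
http://example.com/
http://example.org/

# conf/crawl-urlfilter.txt – one accept rule per domain
+^http://([a-z0-9]*\.)*example.com/
+^http://([a-z0-9]*\.)*example.org/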

28 thoughts on “Simple Crawling with Nutch”

  1. Great article.

    I'm jealous… I always want to research about Lucene and Hadoop after finishing my undergraduate thesis (web content mining). Unfortunately I don't have spare time for that :-D .

    If someday I have spare time, I will know to whom I can consult about Lucene and Hadoop.. :-D

  2. This tutorial is very helpful and tells you exactly (nothing less and nothing more) what you need to start crawling with Nutch.

    However, I'm having troubles crawling pages that have deep domain names ,e.g. http://ravenyoung.spaces.live.com/ (just as an example). I wonder if Nutch can crawl domain that have 3 or more dots in their names…

    1. tnx for the tutorial :ngayal:

      @remi i think it can, just set

      +^http://([a-z0-9]*.)*arifn.web.id/ this line to index more 3rd level domains… sorry i suck at regex but i am shore you will figure it out

  3. I am trying to develop an app wherein i need to get only the meta data of webpages.
    It’s like I will be getting all the users twits in the time line and then I am taking all the urls using regex of php and I am keeping them in a file. Now I am giving this file a an input to nutch for crawling.

    It takes hell lot of time and I am able to get all the outlinks of the webpages and their corresponding data.
    But I dont need all these. I just want the meta data(title, description) of these webpages in the file. Can anyone help me doing this with nutch.

  4. This is great Arif. I've a question.

    Do you know how to get the response code and response message for a given URL after Nutch crawled?

    Thanks,

    Ann

  5. Thanks and congrats, great article. I have a question. When I crawl the sites and get the raw html by readseg, some language specific characters(in my case it’s Turkish such as ğ,ü) are replaced by “?”. Any ideas or suggestions to solve this?

  6. Arif, Thanks! for such a nice article on nutch.. this would really help ppls like me to kick start on nutch easily.

  7. It is nicely explained, it resolved my issues…
    Is there any Tutorial on setting up nutch on single node hadoop?
    Any ideas are welcome!!

    Thanks
    Jaipal R

  8. i have integrated nutch with tomcat 7…it is work correctly and search website but i could not search for a search engine like yahoo.com or google.com..please reply me about this problem…Thanks in Advance

  9. Hi,

    I am very bad at Regex and i need to crawl a website to find out links to all the pdfs file from the site. Can u please help me with the regex?

      1. Hello Arifn

        I have a Question , when I start crawling I see a lot of fetches from URLs that I dont want… like twitter… plus.google.com … etc…

        fetching http://t.co/fBh8YmH8HO (queue crawl delay=5000ms)
        fetching http://www.finanzas.df.gob.mx/sitiosInteres.html (queue crawl delay=5000ms)
        fetching https://mobile.twitter.com/statuses/504802117794930688/retweet (queue crawl delay=5000ms)
        fetching https://twitter.com/NBA/status/446440855080296448 (queue crawl delay=1000ms)
        -activeThreads=50, spinWaiting=47, fetchQueues.totalSize=2495, fetchQueues.getQueueCount=11
        fetching http://www.debian.org/events/index.pt.html (queue crawl delay=5000ms)
        fetching http://directmemory.apache.org/examples/index.html (queue crawl delay=5000ms)
        fetching https://www.youtube.com/channel/UCGej5zp_KWZ-b_1w4Rq2hyA (queue crawl delay=5000ms)

        In my domain-urlfilter.txt I have

        +^http://([a-z0-9]*.)*telmex.com/
        +^http://([a-z0-9]*.)*telmex.com/

        In the plugin.xml of the urlfilter-domain I have defined

        I executed

        $ bin/crawl urls/ crawl/ http://127.0.0.1/solr/ 50

        In urls/ I have a file seed.txt which has

        http://www.telmex.com/
        https://tienda.telmex.com/shell/af/home.do

        So , I dont know why is doing this… I am using nutch 1.9

        Thanks

  10. hello! great post, it helped me a lot, but can you post same blog on the latest version of nutch? or site some parts where there are changes from this post.. thanks

  11. In my last post , the domain-urlfilter.txt did not paste fine because of the tags

    but Is “ok”, I hope you can help me

    The other Question is , why when I run it 2 times , the first time takes less time than the second (first time 20 minutes, second time hours)

  12. Exception in thread “main” java.lang.NoClassDefFoundError: Updated
    Caused by: java.lang.ClassNotFoundException: Updated
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:264)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:332)
    Could not find the main class: Updated. Program will exit.

    I am getting above error. Please Help!!!!!

  13. I used this command :bin/nutch readseg -dump crawl/segments/* arifn -nocontent -nofetch -nogenerate -noparse -noparsedata to view the parsed data but I get this message SegmentReader: dump segment: crawl/segments/20150122131252
    SegmentReader: done.
    Plese help as how to view the parsed text.
