Like I told you in the last post, in order to build an automatic part-of-speech tagger for text documents, I need to collect some corpora. In fact, because I wanna do it on a distributed system, I need a large corpus. One great source of corpora is the web. But extracting plain text from HTML manually is quite cumbersome. So I heard that we can use a crawler to extract text from the web. Then I stumbled upon Nutch.
A Little About Nutch
Nutch is an open source search engine built on Lucene and Solr. According to Tom White, Nutch basically consists of two parts: the crawler and the searcher. The crawler fetches pages from the web and creates an inverted index from them. The searcher answers users' queries against the fetched pages. Nutch can run on a single computer, but it also works great on a multinode cluster: it uses Hadoop MapReduce to work well in a distributed environment.
Simple Crawling with Nutch
Let's get to the point. The objective I defined here is to build corpora from web pages. To achieve that, I'm just gonna crawl some web pages and extract their text. So I won't be writing about searching for now, but I'm considering covering it in another post. Okay, this is my environment for this experiment:
- Ubuntu 10.10 Maverick Meerkat
- Java 6 OpenJDK
- Nutch version 1.0, which you can download here.
After you’re ready, let’s get started, shall we?
Set the JAVA_HOME directory
First, you must make sure your JAVA_HOME is set to where Java is installed. To do this, open your terminal and run these commands:
JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
On the first line, I give the JAVA_HOME variable the location of my Java installation directory; in my case, that is /usr/lib/jvm/java-6-openjdk. On the second line, I turn JAVA_HOME into an environment variable using the export command. On the third line, the PATH variable is set to the old PATH (before the JAVA_HOME addition) concatenated with the new $JAVA_HOME/bin directory. And the last line exports it as an environment variable too. To check that the environment is correctly set, you can use these commands:
echo $JAVA_HOME
/usr/lib/jvm/java-6-openjdk
echo $PATH
[another path]:/usr/lib/jvm/java-6-openjdk
The echo command is used to print the environment variable. If your Java installation directory is listed, then you’re ready to go.
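If you don't want to retype these commands every time you open a terminal, you can (this is just an optional extra, not part of the original setup) append the same exports to your ~/.bashrc so every new shell picks them up:

# optional: make JAVA_HOME and PATH persistent across terminals
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-openjdk' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc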
Install (Extract) Nutch
After you download Nutch, you can install it by just extracting it to your favourite directory. In this article, I'll just use my home directory as the extraction target. So, the NUTCH_HOME directory will be /home/user.
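If you grabbed the .tar.gz release, extracting it is a one-liner. I'm assuming here that the archive is named nutch-1.0.tar.gz and sits in your home directory (adjust the file name to whatever you actually downloaded):

cd ~
tar xzf nutch-1.0.tar.gz
# the archive unpacks into a nutch-1.0 folder; I renamed mine to plain "nutch"
# so it matches the "cd nutch" command used later in this post
mv nutch-1.0 nutch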
Set The Crawler Name
Before running the crawler, we must set its identity. It's the right thing to do to inform website owners whose crawler this is and what we intend to do on their website. In Nutch, we can do this by editing nutch-default.xml. This file contains the default configuration for our Nutch installation. This is how to do it:
- Open up the NUTCH_HOME/conf/nutch-default.xml file.
- Set the http.agent.name property. Insert the name of the crawler in the <value> element and enter its description in the <description> element.
- You can optionally set http.agent.url to the URL of a page that describes your crawler, and http.agent.email to your contact email.
For your reference, here is a portion of my configuration file:
...
<property>
  <name>http.agent.name</name>
  <value>Arif's Spider</value>
  <description>This crawler is used to fetch text documents from web pages
  that will be used as a corpus for Part-of-speech-tagging
  </description>
</property>
...
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>aku at arifn dot web dot id</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this address
  (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
Listing the URL to Crawl
Now, let's create the list of URLs we wish to crawl. This is how to do it:
Create a file listing the URLs that we wanna fetch. In this experiment, I wanna crawl this blog, so I created a directory called urls in NUTCH_HOME. Inside this directory, I created a file titled arifn. These are the commands to do it:
cd nutch
mkdir urls
echo 'http://arifn.web.id/blog' > urls/arifn
The first command changes to the NUTCH_HOME directory. The second command creates a directory named urls. The third command writes a file named arifn inside that directory, with this blog's URL in it. Well, actually you can use a regular text editor to create this file. :D
Open conf/crawl-urlfilter.txt. This file configures URL filtering. Let's change MY.DOMAIN.NAME to the domain we want to crawl, in this case arifn.web.id. This is the file after the change:
...
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*arifn.web.id/
...
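If you ever wanna crawl more than one domain, as far as I know you can just add another accept line per domain; anything that matches no + rule gets dropped by the catch-all rule at the bottom of the stock file. The example.com line below is purely hypothetical:

...
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*arifn.web.id/
+^http://([a-z0-9]*\.)*example.com/
...
# skip everything else
-.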
Okay, now we’re ready to crawl!! :D
CRAWL!!!!
This is how we do the crawl. Open your terminal and make sure you're in the Nutch installation directory. Type this command:
bin/nutch crawl urls -dir crawl -depth 1 >& crawl.log
This is the explanation of the command:
- 'bin/nutch crawl' executes Nutch's crawl command.
- 'urls' specifies the directory containing the list of URLs to fetch.
- '-dir crawl' sets the destination of the crawling results; in this case the destination is the crawl directory.
- '-depth 1' sets the link depth to fetch. If we define 1, it means we only crawl the seed page itself and do not follow the links on that page.
- '>& crawl.log' writes the log of the crawling process to the crawl.log file.
In this experiment, we only crawl the home page of this blog, so that we can see the result quickly. You can then experiment with the link depth of the website you want to crawl using the -depth parameter, as sketched below. For the full list of crawl options, you can refer to the Nutch Tutorial page.
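For instance, a slightly deeper crawl might look like this (just a sketch; the depth and -topN values here are illustrative numbers, not something from my experiment):

bin/nutch crawl urls -dir crawl -depth 3 -topN 50 >& crawl.log

Here -depth 3 follows links up to three levels away from the seed URLs, and -topN 50 limits each level to the 50 top-scoring pages, which keeps the crawl from exploding.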
Seeing the Result
You've done a little crawling; now let's see the result. The crawling result will be placed where you set the -dir parameter; in this case, I set it to the crawl directory. The resulting directory consists of several folders. Here are the folders and their explanations (there's a quick ls sketch right after this list):
- crawldb, the crawl database, which contains information about every URL known to Nutch.
- index and indexes, the Lucene-format indexes built from the crawled web pages.
- linkdb, or link database, which contains the list of known links to each URL, including the source URL and anchor text of each link.
- segments, where each segment is a set of URLs fetched at one time. A segment consists of several directories:
  - content, the raw content retrieved from each URL
  - crawl_fetch, the status of fetching each URL
  - crawl_generate, the set of URLs to be fetched
  - crawl_parse, the outlinks used to update the crawl database
  - parse_data, the outlinks and metadata parsed from each URL
  - parse_text, the parsed text of each URL
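If you just wanna peek at what the crawl produced before running any more Nutch commands, a couple of plain ls calls are enough (the timestamped segment name below is only an illustration; yours will reflect when you ran the crawl):

ls crawl
# crawldb  index  indexes  linkdb  segments
ls crawl/segments
# 20101115123456  (one timestamped directory per fetch cycle)
ls crawl/segments/*
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text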
First, we wanna make sure that the crawl process worked, so we will check the stats of the crawldb.
bin/nutch readdb crawl/crawldb -stats
The readdb command reads from crawl/crawldb, the crawl database we just built. The -stats parameter prints statistics about the crawl database. Here is the result of the command above:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 59
retry 0: 59
min score: 0.009
avg score: 0.032915253
max score: 1.0
status 1 (db_unfetched): 58
status 2 (db_fetched): 1
CrawlDb statistics: done
From the result above, you can see that 59 URLs were found on my home page, but only 1 link was fetched (this is because we set the -depth parameter to 1). Well, our crawl was successful.
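By the way, readdb can do more than print statistics. As far as I remember, it can also dump the whole crawl database as text, which is handy if you wanna see every URL Nutch has discovered so far. The output directory name dumpdb below is just my own pick, and the dump comes out as Hadoop-style part-xxxxx text files:

bin/nutch readdb crawl/crawldb -dump dumpdb
less dumpdb/part-00000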
Next, because I want to get the plain text of the web pages, I need to read from the segments directory that we just crawled. To do this, I type this command:
bin/nutch readseg -dump crawl/segments/* arifn
This command means I wanna dump the data from all segments in the crawl/segments directory (specified with the asterisk *) to the arifn directory. As a result, all of the information in the segments will be written to the arifn directory.
Now, I just wanna take the parsed text from the web pages. So, I modified the command into this:
bin/nutch readseg -dump crawl/segments/* arifn -nocontent -nofetch -nogenerate -noparse -noparsedata
With this command, I’ll only get the parsed text from the segments. Here is a portion of the text I got.
Recno:: 0
URL:: http://arifn.web.id/blog/

ParseText::
sidudun sidudun Let's Get Started 24/10/2010 no comment yet Hello, it has been a while since I updated this blog. I'm a little busy with college stuffs and something like that. And finally, I have came to the last year of my graduate study. After doing some consultations with some professors in my college, I got something as my research focus. Actually, it still at proposal stage, but I hope this will works, because so many people are counting on me about it. So, I wanna implement MapReduce to optimize processing in automatic part-of-speech tagging (POS tagging). ...
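In case you're wondering where that text actually lives: the readseg dump is written as a plain text file named dump (with a .dump.crc checksum next to it) inside the output directory, so you can read it with your usual tools:

less arifn/dump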
What’s Next?
Well, the next thing you can do is stop crawling my blog and go crawl another site :p. Then you can experiment with fetching more than one URL and running Nutch on a multinode Hadoop cluster. Good luck..
Great article.
I'm jealous… I always want to research about Lucene and Hadoop after finishing my undergraduate thesis (web content mining). Unfortunately I don't have spare time for that :-D .
If someday I have spare time, I will know to whom I can consult about Lucene and Hadoop.. :-D
This tutorial is very helpful and tells you exactly (nothing less and nothing more) what you need to start crawling with Nutch.
However, I'm having trouble crawling pages that have deep domain names, e.g. http://ravenyoung.spaces.live.com/ (just as an example). I wonder if Nutch can crawl domains that have 3 or more dots in their names…
tnx for the tutorial
@remi i think it can, just set this line
+^http://([a-z0-9]*.)*arifn.web.id/
to index more 3rd-level domains… sorry i suck at regex but i am sure you will figure it out
I am trying to develop an app wherein i need to get only the meta data of webpages.
It's like I will be getting all the users' tweets in the timeline, then I am taking all the URLs using regex in PHP and I am keeping them in a file. Now I am giving this file as an input to nutch for crawling.
It takes hell lot of time and I am able to get all the outlinks of the webpages and their corresponding data.
But I dont need all these. I just want the meta data(title, description) of these webpages in the file. Can anyone help me doing this with nutch.
CMIIW, for doing that you can use parse-html plugin of Nutch..
I've never done this, so I suggest you refer to these websites:
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
http://wiki.apache.org/nutch/PluginCentral
This is great Arif. I've a question.
Do you know how to get the response code and response message for a given URL after Nutch crawled?
Thanks,
Ann
what kind of response code and message do you mean?
Thanks and congrats, great article. I have a question. When I crawl the sites and get the raw html by readseg, some language specific characters(in my case it’s Turkish such as ğ,ü) are replaced by “?”. Any ideas or suggestions to solve this?
you can try this solution : http://stackoverflow.com/questions/9825793/utf-8-characters-not-showing-properly
Arif, Thanks! for such a nice article on nutch.. this would really help ppls like me to kick start on nutch easily.
You’re always welcome.. :)
It is nicely explained, it resolved my issues…
Is there any Tutorial on setting up nutch on single node hadoop?
Any ideas are welcome!!
Thanks
Jaipal R
i have integrated nutch with tomcat 7…it is work correctly and search website but i could not search for a search engine like yahoo.com or google.com..please reply me about this problem…Thanks in Advance
Hi,
I am very bad at Regex and i need to crawl a website to find out links to all the pdfs file from the site. Can u please help me with the regex?
Nice article arif. I tried to crawl using nutch 1.6 but it throws an error. http://stackoverflow.com/questions/17233197/nutch-crawler-read-segment-results
i cant find file name “conf/crawl-urlfilter.txt”.. em using nutch 2.2.1…
the new version of Nutch is kinda different with the version used in the post.
Hello Arifn
I have a Question , when I start crawling I see a lot of fetches from URLs that I dont want… like twitter… plus.google.com … etc…
fetching http://t.co/fBh8YmH8HO (queue crawl delay=5000ms)
fetching http://www.finanzas.df.gob.mx/sitiosInteres.html (queue crawl delay=5000ms)
fetching https://mobile.twitter.com/statuses/504802117794930688/retweet (queue crawl delay=5000ms)
fetching https://twitter.com/NBA/status/446440855080296448 (queue crawl delay=1000ms)
-activeThreads=50, spinWaiting=47, fetchQueues.totalSize=2495, fetchQueues.getQueueCount=11
fetching http://www.debian.org/events/index.pt.html (queue crawl delay=5000ms)
fetching http://directmemory.apache.org/examples/index.html (queue crawl delay=5000ms)
fetching https://www.youtube.com/channel/UCGej5zp_KWZ-b_1w4Rq2hyA (queue crawl delay=5000ms)
In my domain-urlfilter.txt I have
+^http://([a-z0-9]*.)*telmex.com/
+^http://([a-z0-9]*.)*telmex.com/
In the plugin.xml of the urlfilter-domain I have defined
I executed
$ bin/crawl urls/ crawl/ http://127.0.0.1/solr/ 50
In urls/ I have a file seed.txt which has
http://www.telmex.com/
https://tienda.telmex.com/shell/af/home.do
So , I dont know why is doing this… I am using nutch 1.9
Thanks
hello! great post, it helped me a lot, but can you post same blog on the latest version of nutch? or site some parts where there are changes from this post.. thanks
In my last post , the domain-urlfilter.txt did not paste fine because of the tags
but Is “ok”, I hope you can help me
The other Question is , why when I run it 2 times , the first time takes less time than the second (first time 20 minutes, second time hours)
How do I crawl or parse video and audio data using Nutch 2.2.1?
Exception in thread “main” java.lang.NoClassDefFoundError: Updated
Caused by: java.lang.ClassNotFoundException: Updated
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:264)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:332)
Could not find the main class: Updated. Program will exit.
I am getting above error. Please Help!!!!!
I used this command :bin/nutch readseg -dump crawl/segments/* arifn -nocontent -nofetch -nogenerate -noparse -noparsedata to view the parsed data but I get this message SegmentReader: dump segment: crawl/segments/20150122131252
SegmentReader: done.
Please help as to how to view the parsed text.
Same here. I found a folder which has some files that unknown to my Windows. It consists data and .data.crc. Please help :'(
* It consists dump and .dump.crc.
No problem. I found it. Just open it with the cat command on cygwin. Sorry for the trouble. Thanks for the info! :D
hi, can you tell me how to open the parsetext?