Friday, March 5, 2010

Play with Nutch - System Setup in Windows Environment

1. Preparation


Software needs:


Cygwin: provides a Unix-style shell on Windows. Nutch ships only shell scripts and has no separate scripts for the NT cmd shell (the NT cmd shell does not nest environments recursively).


Tomcat: the servlet container that serves the Nutch search web application.


Nutch 1.0: the latest version of Nutch at the time of writing. After downloading, extract the Nutch files under the /home/yourusername directory in Cygwin.


2. Set up environment variables for BOTH Windows and Cygwin


(a) Set up the following environment variables in Windows


JAVA_HOME: value = your_java_jre_location (e.g. D:\SoftWare\JAVA)
NUTCH_HOME: value = your_nutch_location (e.g. D:\SoftWare\cygwin\)
NUTCH_JAVA_HOME: same as JAVA_HOME


Then add these variables to the "Path" variable.


(b) Add the following lines to .bash_profile

PATH="/usr/local/bin:/usr/bin:/bin:$PATH:/cygdrive/d/SoftWare/JAVA";
export CLASSPATH='D:\SoftWare\cygwin\home\yourusername\nutch\lib\lucene-core-2.4.0';
export JAVA_HOME=/cygdrive/d/SoftWare/JAVA;
if [ -f ~/.bashrc ]; then . ~/.bashrc; fi
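
After reopening the Cygwin shell, it can help to confirm that the variables are actually visible. The following is a minimal sketch, not part of Nutch: check_env is a hypothetical helper name, and it only reports whether each named variable is set and non-empty.

```shell
# check_env is a hypothetical helper: it reports, for each variable name
# given as an argument, whether that variable is set and non-empty.
check_env() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    else
      echo "ok: $name=$value"
    fi
  done
  return "$missing"
}

# Check the three variables described in step 2(a).
check_env JAVA_HOME NUTCH_HOME NUTCH_JAVA_HOME || echo "set the missing variables before running Nutch"
```

If any variable is reported as missing, revisit the Windows settings and .bash_profile before moving on.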


3. Test your Nutch


Create a folder named "urls" in /home/yourusername/nutch/bin/, then create a text file "url.txt" inside it. Write the web address you want to crawl in that file, e.g. http://www.iub.edu/. Do not forget the trailing "/" at the end of the address; this is important.
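
The same step can be done from the Cygwin shell. This is a small sketch that assumes the current directory is the Nutch bin folder mentioned above; the trailing-slash check is an extra convenience added here for illustration.

```shell
# Assumption: run from /home/yourusername/nutch/bin, as in the text above.
seed_url="http://www.iub.edu/"

# Create the seed folder and write the start URL into url.txt.
mkdir -p urls
printf '%s\n' "$seed_url" > urls/url.txt

# Warn if the URL does not end with "/".
case "$seed_url" in
  */) ;;
  *) echo "warning: $seed_url has no trailing slash" >&2 ;;
esac

# Show what was written.
cat urls/url.txt
```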


Now modify crawl-urlfilter.txt under /home/yourusername/nutch/conf. Find the following lines:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/


Change them to:

# accept hosts in iub.edu
+^http://([a-z0-9]*\.)*iub.edu/
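
To get a feel for which URLs this rule lets through, the pattern (without the leading "+") can be tried with grep. This is only an approximation: Nutch evaluates the rule with Java regular expressions, while grep -E uses POSIX extended syntax, although for this particular pattern the two should behave the same. The accepts function and the sample URLs are made up for the illustration.

```shell
# The filter pattern from crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*iub.edu/'

# accepts is a hypothetical helper: succeeds if the URL matches the pattern.
accepts() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try a few sample URLs against the pattern.
for url in http://www.iub.edu/ http://libraries.iub.edu/index.shtml http://www.example.com/; do
  if accepts "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the third is rejected, since it does not contain the iub.edu host.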


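Then modify nutch-site.xml in the same folder. Add the following properties inside the <configuration> element to override the defaults:

<property>
  <name>http.agent.name</name>
  <value>HD nutch agent</value>
</property>

<property>
  <name>http.agent.version</name>
  <value>1.0</value>
</property>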



Now you can test your crawler. Run the following command:

$ nutch crawl urls -dir crawl -depth 3 -topN 50

If you are unsure about these arguments, just type "nutch crawl" with no arguments to see the usage.
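
As a rough guide to sizing a crawl: -depth is the number of link levels followed from the seed URLs, and -topN caps the number of pages fetched at each level, so their product gives an upper bound on the pages fetched. The sketch below only does that arithmetic and prints the command; the variable names are arbitrary.

```shell
# Values taken from the crawl command above.
depth=3
top_n=50

# Upper bound on fetched pages: at most top_n pages on each of depth levels.
max_pages=$((depth * top_n))
echo "at most $max_pages pages will be fetched"

# Print the corresponding command without running it.
echo "nutch crawl urls -dir crawl -depth $depth -topN $top_n"
```

The actual number of pages is usually lower, because the URL filter rejects links outside the accepted hosts.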

The crawler should be working now. You can find the crawled results in the "crawl" folder that we just specified in the command. If you want to search through the results, simply run:

$ nutch org.apache.nutch.searcher.NutchBean indiana

where "indiana" is the keyword we are searching for. You will then see the following:

$ nutch org.apache.nutch.searcher.NutchBean indiana
Total hits: 54
0 20100305123428/http://www.iub.edu/
 ... videos ?Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... site index Visit IU Bloomington Indiana University News RSS feed of
1 20100305123447/http://www.iub.edu/index.shtml
 ... videos ?Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... site index Visit IU Bloomington Indiana University News RSS feed of
2 20100305123537/http://emergency.iub.edu/faq.shtml
 ... edu ) and the Indiana University Emergency Preparedness Web site ... hear mean? The Indiana University campus, city of ...
3 20100305123447/http://www.iub.edu/videos/index.shtml
 ... performances. Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... site index Visit IU Bloomington Indiana University News RSS feed of
4 20100305123447/http://libraries.iub.edu/
 ... Star, and the Indiana Daily Student "IU-Bloomington Libraries ... Top Honors The Indiana University Bloomington Libraries have been ...
5 20100305123447/http://www.iub.edu/slideshows/index.shtml
 ... Stories ?Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... site index Visit IU Bloomington Indiana University News RSS feed of
6 20100305123447/http://www.iub.edu/academic/index.shtml
 ... departments Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... siteindex Visit IU Bloomington Indiana University News RSS feed of
7 20100305123447/http://www.iub.edu/comments/index.shtml
 ... here ?Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... site indexVisit IU Bloomington Indiana University News RSS feed of
8 20100305123447/http://www.iub.edu/student/index.shtml
 ... consultants Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... siteindex Visit IU Bloomington Indiana University News RSS feed of
9 20100305123447/http://www.iub.edu/about/index.shtml
 ... Bloomington Campus Info 107 S. Indiana Ave. Bloomington, IN 47405 ... siteindex Visit IU Bloomington Indiana University News RSS feed of

"Total hits" is the number of matching records found.
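
If you only need the hit count, for example in a script that checks whether a crawl produced anything, it can be pulled out of the output with sed. This is a sketch under the assumption that the output starts with a "Total hits: N" line as shown above; hits_from_output is a hypothetical helper, and the sample text stands in for real NutchBean output.

```shell
# hits_from_output is a hypothetical helper: it reads search output on stdin
# and prints the number from the first "Total hits: N" line.
hits_from_output() {
  sed -n 's/^Total hits: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sample text copied from the output above, used in place of a real run.
sample='Total hits: 54
0 20100305123428/http://www.iub.edu/'

printf '%s\n' "$sample" | hits_from_output
```

With a working installation, the real output could be piped in the same way: nutch org.apache.nutch.searcher.NutchBean indiana | hits_from_output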

4. Use Nutch on Tomcat servers

Copy nutch-1.0.war (in the root folder of Nutch) to the webapps folder under the Tomcat installation folder, and rename it to search.war. Start Tomcat so that it automatically extracts the contents of this archive. Then modify nutch-site.xml in tomcat\webapps\search\WEB-INF\classes and add the following property:

<property>
  <name>searcher.dir</name>
  <value>D:\SoftWare\cygwin\home\yourusername\nutch\bin\crawl</value>
</property>


This specifies the search directory, i.e. where the crawled contents are stored.
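
Before pointing Tomcat at the crawl folder, it may be worth checking that the folder looks complete. The sketch below assumes the crawl produced the usual crawldb, linkdb, segments, and index subfolders; adjust the list if your folder differs. check_crawl_dir is a hypothetical helper, and the demo runs on a throwaway folder rather than a real crawl.

```shell
# check_crawl_dir is a hypothetical helper: it reports which of the
# assumed subfolders exist under the given crawl folder.
check_crawl_dir() {
  dir=$1
  status=0
  for sub in crawldb linkdb segments index; do
    if [ -d "$dir/$sub" ]; then
      echo "found:   $sub"
    else
      echo "missing: $sub"
      status=1
    fi
  done
  return "$status"
}

# Demo on a temporary folder that has only two of the subfolders.
demo=$(mktemp -d)
mkdir -p "$demo/crawldb" "$demo/segments"
check_crawl_dir "$demo" || echo "crawl folder looks incomplete"
rm -rf "$demo"
```

For a real check, run it from the Nutch bin folder as: check_crawl_dir crawl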


Special note for Chinese users:
Modify server.xml in tomcat\conf. Find the "Connector" element and add the following attributes:

URIEncoding="UTF-8" useBodyEncodingForURI="true"

These two attributes solve the Chinese character encoding problem.

Then restart Tomcat and open http://localhost:8080/search in your browser to reach the Nutch search page. You have now set up a search engine!

Here is a screenshot. Have fun!!




