|
Latest Step-by-Step Installation Guide for Dummies: Nutch 0.9
By Peter P. Wang, Zillionics Inc.
To add your comments, please go to: http://nutchtube.blogspot.com/2008/02/latest-step-by-step-installation-guide.html
Run it by clicking the Configure Tomcat icon.
Click the Start button to start the Apache Tomcat service.
You should then see Tomcat's default page in your browser when you go to http://localhost:8080
+^http://([a-z0-9]*\.)*apache.org/
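The line above is a URL filter rule: the leading + marks an accept rule and the rest is a regular expression. As a rough sketch of what it lets through, the snippet below checks a few URLs against the same pattern with Python's re module. Nutch itself evaluates the rule with Java regular expressions; the assumption here is that this pattern behaves the same in both, since it uses only basic syntax. The helper name accepts and the sample URLs are made up for illustration.

```python
import re

# Rule copied from the text above; the leading "+" means "accept".
RULE = r"+^http://([a-z0-9]*\.)*apache.org/"

# Everything after the sign is the regular expression itself.
# Assumption: Python's re treats this pattern like Nutch's Java regex does.
PATTERN = re.compile(RULE[1:])


def accepts(url):
    """Return True if the URL matches the accept rule (hypothetical helper)."""
    # search() is enough because the pattern carries its own "^" anchor.
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://apache.org/",
        "http://lucene.apache.org/nutch/",
        "https://apache.org/",
        "http://example.com/",
    ):
        print(url, accepts(url))
```

The first two URLs are accepted and the last two are rejected. Note that the dot in apache.org is not escaped, so it matches any single character at that position, not only a literal dot.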
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Peter Wang</value>
    <description>Peter Pu Wang</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch spiderman</value>
    <description>Nutch spiderman</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://peterpuwang.googlepages.com</value>
    <description>http://peterpuwang.googlepages.com</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>MyEmail</value>
    <description>peterpuwang@yahoo.com</description>
  </property>
</configuration>
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
-dir dir: names the directory to put the crawl in.
-threads threads: sets the number of threads that will fetch in parallel.
-depth depth: sets the link depth from the root page that should be crawled.
-topN N: sets the maximum number of pages that will be retrieved at each level, up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
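To make the difference between a test crawl and a full crawl concrete, here is a small sketch in Python. It assumes that each level fetches at most -topN pages, so depth times topN is an upper bound on the number of pages fetched. The helper names are hypothetical, and the full-crawl topN of 50000 is just one illustrative value within the "tens of thousands" range mentioned above.

```python
def crawl_command(depth, top_n, url_dir="urls", crawl_dir="crawl"):
    """Build the argument list for the crawl command shown above."""
    return ["bin/nutch", "crawl", url_dir,
            "-dir", crawl_dir,
            "-depth", str(depth),
            "-topN", str(top_n)]


def max_pages(depth, top_n):
    """Upper bound on fetched pages.

    Assumption: each level fetches at most top_n pages.
    """
    return depth * top_n


if __name__ == "__main__":
    # Shallow test crawl from the text, then an illustrative full crawl.
    for depth, top_n in ((3, 50), (10, 50000)):
        print(" ".join(crawl_command(depth, top_n)),
              "-> at most", max_pages(depth, top_n), "pages")
```

Under that assumption the test crawl touches at most 150 pages, which is small enough to read through the output by hand before committing to a full crawl.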
d. Set Your Searcher Directory
Next, navigate to your nutch webapp folder then WEB-INF/classes. Edit the nutch-site.xml file and add the following to it (make sure you don't have two sets of <configuration></configuration> tags!):
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>your_crawl_folder_here</value>
  </property>
</configuration>
For example, if your Nutch directory is C:\nutch-0.9.0 and you specified crawl as the directory after the -dir option, then enter C:\nutch-0.9.0\crawl\ instead of your_crawl_folder_here.
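The warning about duplicate <configuration> tags can be checked mechanically. The sketch below, in Python, parses the contents of nutch-site.xml and returns the searcher.dir value. A second <configuration> block makes the file malformed XML, which the parser reports as an error. The function name read_searcher_dir and the sample strings are assumptions for illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def read_searcher_dir(xml_text):
    """Return the searcher.dir value, or None if the property is absent.

    Raises xml.etree.ElementTree.ParseError for malformed XML, which
    includes a file with two top-level <configuration> elements.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == "searcher.dir":
            return prop.findtext("value", default="").strip()
    return None


# Sample file contents using the example path from the text.
GOOD = r"""<configuration>
  <property>
    <name>searcher.dir</name>
    <value>C:\nutch-0.9.0\crawl\</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(read_searcher_dir(GOOD))
    try:
        # Simulate pasting a second <configuration> block into the file.
        read_searcher_dir(GOOD + "\n<configuration></configuration>")
    except ET.ParseError as err:
        print("malformed nutch-site.xml:", err)
```

Running a check like this before reloading the webapp can catch a pasted second <configuration> block early.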
e. Reload
Reload the application. Use the Tomcat Manager and click the "Reload" command for nutch, or restart Tomcat using the Windows Services tool. Open a browser and enter the URL http://localhost:8080. The Nutch search page should appear. As long as you've defined the correct location of your Nutch index directory (as shown above), clicking search should yield results.
Congratulations! It rocks!
Peter P. Wang
peterpuwang@gmail.com |