Quantcast
Viewing latest article 1
Browse Latest Browse All 4

Nutch的简单使用

Image may be NSFW.
Clik here to view.

Nutch是一个开源的搜索引擎,包括抓取,索引,搜索,不过它主要专注于抓取,下面我讲一下它的简单使用。

首先,从这里下载Nutch的最新release(作此文时最新release为1.0),或者从这里直接下载源码,然后解压。解压后,打开文件$_HOME/conf/-site.xml(_HOME为你nutch所在的文件夹,这个nutch-site文件是nutch的配置文件,不要直接修改nutch-default文件,那个是nutch的默认配置,-site.xml会覆盖nutch-default.xml中的配置,详情请见Nutch配置文件的加载。当然你也可以修改nutch-default,xml,但是nutch官方不推荐那样做),在<configuration>和</configuration>之间输入以下内容:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
<property>
  <name>http.agent.name</name>
  <value>spider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.
 
  NOTE: You should also check other related properties:
 
	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version
 
  and set their values appropriately.
 
  </description>
</property>
 
<property>
  <name>http.robots.agents</name>
  <value>spider,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

其中字段“http.agent.name”为你的crawler的名字(记得早期的版本可以不填的,现在的版本不填就报错),字段http.robots.agents,也可以不填,但是不填的话抓取的时候nutch会报:

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

烦的慌,你要是不怕烦的话可以不填。
然后再打开文件$NUTCH_HOME/conf/-urlfilter.txt,把该文件里面的MY.DOMAIN.NAME替换成你想抓取的域名,比如apache.org。

修改完以上的配置,现在就可以抓取了,抓取之前你得建立一个文件,里面存放你要抓取的url,比如建立一个文件urls,内容为:http://lucene.apache.org/nutch/,把该文件放到目录urls下面,Nutch抓取的时候只能对一个目录下的所有文件中的url进行抓取,不能对一个文件中的url进行抓取(这是由它的分布式系统Hadoop的特性决定的)。抓取很简单:

$NUTCH_HOME/bin/nutch crawl urls -dir crawl -depth 2

urls为待抓取的urls目录,crawl为输出目录(可以不写,默认为”crawl-”加当前日期和时间),depth为抓取深度,默认为5。输出如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
ahei@ubuntu3:~/nutch-1.0/bin$ ./nutch crawl urls -dir crawl -depth 2
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
	Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20091126170222
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20091126170222
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching http://lucene.apache.org/nutch/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20091126170222]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20091126170233
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: crawl/segments/20091126170233
Fetcher: threads: 10
QueueFeeder finished: total 38 records.
fetching http://wiki.apache.org/nutch/
fetching http://issues.apache.org/jira/browse/Nutch
fetching http://lucene.apache.org/nutch/tutorial.html
-activeThreads=10, spinWaiting=7, fetchQueues.totalSize=35
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=35
fetching http://lucene.apache.org/nutch/skin/breadcrumbs.js
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=34
Error parsing: http://lucene.apache.org/nutch/skin/breadcrumbs.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/nutch/skin/breadcrumbs.js
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
 
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=34
fetching http://lucene.apache.org/nutch/version_control.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=33
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=33
fetching http://wiki.apache.org/nutch/FAQ
fetching http://lucene.apache.org/nutch/apidocs-0.8.x/index.html
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=31
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=31
fetching http://lucene.apache.org/hadoop/
-activeThreads=10, spinWaiting=8, fetchQueues.totalSize=30
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=30
fetching http://forrest.apache.org/
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=29
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=29
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=29
fetching http://lucene.apache.org/nutch/apidocs-0.9/index.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=28
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=28
fetching http://lucene.apache.org/nutch/credits.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=27
fetching http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=26

抓取完数据之后怎样检验呢?使用命令:

$NUTCH_HOME/bin/nutch org.apache.nutch.searcher.NutchBean apache

这个命令会给出apache的搜索结果,这个命令默认是对crawl目录进行搜索,这是代码证明:

1
2
3
4
5
6
7
文件:$NUTCH_HOME/src/java/org/apache/nutch/searcher/NutchBean.java:87
  public NutchBean(Configuration conf, Path dir) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(this.conf);
    if (dir == null) {
      dir = new Path(this.conf.get("searcher.dir", "crawl"));
    }

要想对其他目录进行搜索,在nutch-site.xml中加入以下内容:

1
2
3
4
5
6
7
8
9
10
11
<property>
  <name>searcher.dir</name>
  <value>other-searcher-dir</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

搜索结果如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
ahei@ubuntu3:~/nutch-1.0/bin$ ./nutch org.apache.nutch.searcher.NutchBean apache
Total hits: 25
 0 20091126170222/http://lucene.apache.org/nutch/
 ... Lucene. January 2005: Nutch Joins Apache Incubator Nutch is a ... determined that the Apache license is the appropriate
 1 20091126170233/http://www.apache.org/
 ... including Apache XML, Apache Jakarta, Apache Cocoon, Apache Xerces, Apache Ant, and Apache ... Source projects such as NoSQL, Apache ... 
 2 20091126170233/http://www.apache.org/licenses/
 ... Copyright © 2009 The Apache Software Foundation, Licensed under the ... Apache License, Version 2.0 . Apache ... Apache and the  ... 
 3 20091126170233/http://forrest.apache.org/
 ... Welcome to Apache Forrest apache > forrest   Welcome Developers Versioned Docs ... Example sites Thanks Related projects Apache Gump Apache ... 
 4 20091126170233/http://lucene.apache.org/
 ... the release of Apache Mahout 0.1. Apache Mahout is a subproject ... on top of  ... 
 5 20091126170233/http://wiki.apache.org/nutch/
FrontPage - Nutch Wiki Search: Nutch Wiki Login FrontPage FrontPage RecentChanges FindPage HelpContents Immutable Page Comments Info Attachments More Actions:  ... 
 6 20091126170233/http://lucene.apache.org/nutch/index.html
 ... Lucene. January 2005: Nutch Joins Apache Incubator Nutch is a ... determined that the Apache license is the appropriate
 7 20091126170233/http://wiki.apache.org/nutch/FAQ
 ... all available at http://lucene.apache.org/nutch/mailing_lists.html . How ... 
 8 20091126170233/http://lucene.apache.org/nutch/tutorial8.html
 ... http://([a-z0-9]*\.)*apache.org/ This will include any ... in the domain apache.org . Edit the file ... 
 9 20091126170233/http://lucene.apache.org/nutch/tutorial.html
 ... crawl to the apache.org domain, the line ... http://([a-z0-9]*\.)*apache.org/ This will include any

Nutch的入门使用很简单吧,上面所述只是在一台机器上进行抓取,Nutch有个分布式系统Hadoop,可以实现分布式抓取,请看Nutch的分布式抓取


Viewing latest article 1
Browse Latest Browse All 4

Trending Articles