Developing Application Monitoring: Nutch 1.7 or Nutch 2.1? Setting up a Nutch 1.7 environment

Is Nutch 1.7 enough for about 100,000 pages? I tried it.

The valuable content is not always that large. When I crawled fewer than 300 websites, the data occupied less than 1 GB. If you have limited CPU and memory, Nutch 1.7 is a good choice.

So I think Nutch 1.7 fits this case.

I set up Nutch 1.7 with Solr 4.6, and ran into some problems.

Nutch reported:

2013-12-07 11:26:08,540 INFO parse.ParseSegment - Parsed (0ms):http://www.bjmm.org.cn/outpart/managerarticle.do?method=page&articleId=1651
2013-12-07 11:26:08,545 WARN mapred.LocalJobRunner - job_local1202925054_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)

Wrong solution: set fetcher.parse to true and fetcher.store.content to false. It did not help.
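For reference, this is roughly how those two properties would be overridden. It is only a sketch of the attempt that did not help, and it assumes the overrides go in conf/nutch-site.xml of the Nutch installation:

```xml
<!-- Assumed location: conf/nutch-site.xml -->
<configuration>
  <!-- Parse pages during the fetch step instead of in a separate parse job -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Do not keep the raw fetched content in the segment -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
</configuration>
```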

Right solution: patch the code as described in NUTCH-1640: https://issues.apache.org/jira/browse/NUTCH-1640?jql=project%20%3D%20NUTCH%20AND%20text%20~%20%22create%20thread%22

Something like this, in the file src/java/org/apache/nutch/parse/ParseSegment.java.

Add a field:

    private ParseUtil parseUtil = null;

Then replace

    parseResult = new ParseUtil(getConf()).parse(content);

with

    if (parseUtil == null) {
      parseUtil = new ParseUtil(getConf());
    }
    parseResult = parseUtil.parse(content);

Rebuild with ant and run the crawl again. It works.
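The idea behind the patch is plain lazy initialization: build the parser once and reuse it for every record, instead of constructing a new one per page. Here is a minimal, self-contained sketch of that pattern. ExpensiveParser and Mapper are hypothetical stand-ins for ParseUtil and the map code in ParseSegment, not the real Nutch classes, and the assumption that each new parser instance carries its own thread resources is mine.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LazyParserDemo {

    // Hypothetical stand-in for ParseUtil: counts how many instances are built.
    static class ExpensiveParser {
        static final AtomicInteger INSTANCES = new AtomicInteger();

        ExpensiveParser() {
            INSTANCES.incrementAndGet();
        }

        String parse(String content) {
            return content.trim();
        }
    }

    // Hypothetical stand-in for the mapper in ParseSegment.
    static class Mapper {
        // Created on first use, then reused for every later record.
        private ExpensiveParser parser = null;

        String map(String content) {
            if (parser == null) {
                parser = new ExpensiveParser();
            }
            return parser.parse(content);
        }
    }

    // Runs the mapper over `records` fake pages and reports how many
    // parser instances were constructed along the way.
    static int instancesAfter(int records) {
        ExpensiveParser.INSTANCES.set(0);
        Mapper mapper = new Mapper();
        for (int i = 0; i < records; i++) {
            mapper.map(" page " + i + " ");
        }
        return ExpensiveParser.INSTANCES.get();
    }

    public static void main(String[] args) {
        System.out.println(instancesAfter(1000)); // prints 1
    }
}
```

With the original per-record `new`, the same loop would build 1000 parsers. If each one holds threads, that would be consistent with the "unable to create new native thread" error above.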

Which PHP client is suitable? Solarium, Solr-client-php, ...

Solarium is a good choice because it has been updated recently.

Monday, December 9, 2013
