Monday, December 9, 2013

Nutch 1.7 or Nutch 2.1: setting up a Nutch 1.7 environment

Is Nutch 1.7 enough for about 100,000 pages? I tried it.
The valuable content is not always that large. When I crawled fewer than 300 websites, the data occupied less than 1 GB. If you have limited CPU and memory, Nutch 1.7 is a good choice.

So I think Nutch 1.7 fits this job.

I set up Nutch 1.7 with Solr 4.6, and there were some problems.
  • Nutch reported:

2013-12-07 11:26:08,540 INFO  parse.ParseSegment - Parsed (0ms):http://www.bjmm.org.cn/outpart/managerarticle.do?method=page&articleId=1651
2013-12-07 11:26:08,545 WARN  mapred.LocalJobRunner - job_local1202925054_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)

Wrong solution: setting fetcher.parse to true and fetcher.store.content to false. It did not help.
Right solution: modify the code as described in https://issues.apache.org/jira/browse/NUTCH-1640?jql=project%20%3D%20NUTCH%20AND%20text%20~%20%22create%20thread%22

Something like this:
in the file src/java/org/apache/nutch/parse/ParseSegment.java, add a field
private ParseUtil parseUtil = null;

and replace parseResult = new ParseUtil(getConf()).parse(content); with

      if (parseUtil == null) 
          parseUtil = new ParseUtil(getConf());
      parseResult = parseUtil.parse(content);

Rebuild with ant and run again. It works.
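The change above is plain lazy initialization: create the expensive object once and reuse it for every record. Here is a minimal standalone sketch of that pattern, without any Nutch dependency. LazyParserDemo and ExpensiveParser are hypothetical names made up for this illustration; ExpensiveParser only stands in for ParseUtil. My assumption, from reading the linked issue, is that each new ParseUtil brings its own threads, so creating one per page eventually hits the native thread limit.

```java
public class LazyParserDemo {

    // Hypothetical stand-in for an expensive object such as ParseUtil.
    // It counts how many times it has been constructed.
    public static class ExpensiveParser {
        public static int instances = 0;

        public ExpensiveParser() {
            instances++;
        }

        // Toy "parse": trim and lowercase the content.
        public String parse(String content) {
            return content.trim().toLowerCase();
        }
    }

    // Created on first use, then reused for every later record.
    private ExpensiveParser parser = null;

    // Plays the role of the per-record map() call in ParseSegment.
    public String map(String content) {
        if (parser == null) {
            parser = new ExpensiveParser();
        }
        return parser.parse(content);
    }

    public static void main(String[] args) {
        LazyParserDemo mapper = new LazyParserDemo();
        for (int i = 0; i < 1000; i++) {
            mapper.map(" Page " + i + " ");
        }
        // Only one parser was built for 1000 records.
        System.out.println("parser instances: " + ExpensiveParser.instances);
    }
}
```

With the original code (new ParseUtil(getConf()) inside the per-record call), the same loop would build 1000 parsers instead of one.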
  • Which PHP client is suitable? Solarium, Solr-client-php, ...
Solarium is a good choice because it has been updated recently.
