Tuesday, November 26, 2013

Setting up Nutch 2.2.1: problems and fixes

These days I am setting up a search engine for some official websites. The reason is that when I search for technical answers, the results are confusing and often lead me down the wrong path. In the end, I find official documentation more valuable than discussions, blogs and so on.

So I followed the official document at
http://wiki.apache.org/nutch/Nutch2Tutorial

If you want to run into fewer problems, follow it strictly. HBase 0.90 has to be used.

If you are new to Nutch, it is better to follow http://wiki.apache.org/nutch/NutchTutorial first. Some important configuration has to be set.

Below are the problems I ran into.

  • When I run

bin/crawl /home/nutch/apache-nutch-2.2.1/urls/seeds.txt mmtest http://localhost:8983/solr/ 10
It reports:

SolrIndexerJob: starting
Adding 11 documents
Adding 11 documents
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[mmtest]solr-index, jobid=job_local1440285148_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)



Solution:
Copy Nutch's schema.xml into the Solr configuration directory, then restart Solr.
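A rough sketch of this step, for reference. The Nutch path is taken from the crawl command above; the Solr conf directory is an assumption based on the stock Solr 4.5.1 example layout, and the function name install_nutch_schema is made up for illustration. Adjust both paths to your own installation.

```shell
# Assumed locations; override the variables if your layout differs.
NUTCH_HOME="${NUTCH_HOME:-/home/nutch/apache-nutch-2.2.1}"
SOLR_CONF="${SOLR_CONF:-/home/nutch/solr-4.5.1/example/solr/collection1/conf}"

# Hypothetical helper: back up Solr's stock schema, then install
# the schema.xml that ships in Nutch's conf directory.
install_nutch_schema() {
  cp "$SOLR_CONF/schema.xml" "$SOLR_CONF/schema.xml.orig" &&
  cp "$NUTCH_HOME/conf/schema.xml" "$SOLR_CONF/schema.xml"
}
```

Call install_nutch_schema once, then restart Solr so it reloads the schema.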
  • When I run it again
bin/crawl /home/nutch/apache-nutch-2.2.1/urls/seeds.txt mmtest http://localhost:8983/solr/ 10
It reports:
13033 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer  ? Unable to create core: collection1
org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'solr.EnglishPorterFilterFactory'. Schema file is /home/nutch/solr-4.5.1/example/solr/collection1/schema.xml
Solution:
In schema.xml, use solr.SnowballPorterFilterFactory with language="English" instead of solr.EnglishPorterFilterFactory, which this Solr version cannot load.
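One possible way to make this change from the command line is sketched below. It assumes the filter is declared with a class="solr.EnglishPorterFilterFactory" attribute written exactly like that; the function name fix_porter_filter is hypothetical. Editing the file by hand works just as well.

```shell
# Hypothetical helper: swap the filter class in the given schema.xml.
# sed keeps a copy of the original file with a .bak suffix.
fix_porter_filter() {
  sed -i.bak \
    's/class="solr\.EnglishPorterFilterFactory"/class="solr.SnowballPorterFilterFactory" language="English"/g' \
    "$1"
}
```

Usage: fix_porter_filter /path/to/your/solr/conf/schema.xml, followed by a Solr restart. Any other attributes on the filter element are left untouched.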
  • When I run the crawl again
It reports:
5377 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer  ? Unable to create core: collection1
org.apache.solr.common.SolrException: Unable to use updateLog: _version_ field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
Solution:
Insert the line below into schema.xml, next to the other field definitions:
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
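If you prefer to script it, here is a minimal sketch. It assumes that schema.xml has an opening <fields> tag on a line of its own and already defines a fieldType named long; the function name add_version_field is made up for this example.

```shell
# Hypothetical helper: add the _version_ field right after the first
# <fields> line, unless the schema already declares it.
add_version_field() {
  grep -q 'name="_version_"' "$1" && return 0
  awk '
    { print }
    /<fields>/ && !inserted {
      print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\" multiValued=\"false\"/>"
      inserted = 1
    }
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

Running it a second time changes nothing, because of the grep check at the top.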
  • When I run the crawl again
It reports:
4764 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore  ? org.apache.solr.common.SolrException: undefined field text
at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1235)
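Solution:
The schema has no field named text, so the default search field (df) has to name a field that exists. In solrconfig.xml, change
       <str name="df">text</str>
to
       <str name="df">content</str>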
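A sketch of the same edit with sed follows. Note the assumption: it rewrites every occurrence of the df setting in the file, and a stock solrconfig.xml may set df in more than one request handler. Check the result (or edit by hand) if you only want to change some of them. The function name set_default_field is hypothetical.

```shell
# Hypothetical helper: point every df (default field) entry that is
# currently "text" at "content". The original is kept as <file>.bak.
set_default_field() {
  sed -i.bak \
    's|<str name="df">text</str>|<str name="df">content</str>|g' \
    "$1"
}
```

Usage: set_default_field /path/to/your/solr/conf/solrconfig.xml, then restart Solr.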

Now it works.
