I followed the official tutorial at
http://wiki.apache.org/nutch/Nutch2Tutorial
To run into fewer problems, follow it strictly. HBase 0.90 has to be used.
If you are new to Nutch, it is better to work through http://wiki.apache.org/nutch/NutchTutorial first. It covers some important configuration that has to be set.
Below are the problems I ran into.
- When running
bin/crawl /home/nutch/apache-nutch-2.2.1/urls/seeds.txt mmtest http://localhost:8983/solr/ 10
It reports:
SolrIndexerJob: starting
Adding 11 documents
Adding 11 documents
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[mmtest]solr-index, jobid=job_local1440285148_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)
Solution:
Copy the schema.xml that ships with Nutch into the Solr directory, so that Solr knows the fields Nutch sends.
- When running it again
It reports:
13033 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer ? Unable to create core: collection1
org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'solr.EnglishPorterFilterFactory'. Schema file is /home/nutch/solr-4.5.1/example/solr/collection1/schema.xml
Solution:
In schema.xml, use SnowballPorterFilterFactory with language="English" instead of EnglishPorterFilterFactory, which this Solr version can no longer load.
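As a rough sketch, the replacement could look like the fragment below inside the analyzer of the "text" fieldType in schema.xml. The exact attributes on the original filter line are an assumption here; keep any other attributes your own line carries and change only the class name and the language attribute.

```xml
<!-- Before (fails to load in this Solr version); other attributes omitted, an assumption:
     <filter class="solr.EnglishPorterFilterFactory"/>
-->

<!-- After: same stemming step, provided by the Snowball filter -->
<filter class="solr.SnowballPorterFilterFactory" language="English"/>
```

Solr has to be restarted after the change so that the core is loaded again with the edited schema.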
- Running the crawl again
5377 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer ? Unable to create core: collection1
org.apache.solr.common.SolrException: Unable to use updateLog: _version_ field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
Solution:
Insert the following line into schema.xml, alongside the other <field> definitions:
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
- Running the crawl again
4764 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore ? org.apache.solr.common.SolrException: undefined field text
at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1235)
Solution:
In solrconfig.xml, change <str name="df">text</str> to
<str name="df">content</str>
Now it works.