I want to use Nutch to crawl a website that requires login and index it with Solr. When I run the crawl, I get the following error:
java.lang.IllegalArgumentException: No form exists: lzform
2017-07-07 14:24:52,256 ERROR httpclient.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: lzform
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:505)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:183)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:271)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:327)
Caused by: java.lang.IllegalArgumentException: No form exists: lzform
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:219)
    at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
    at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:503)
    ... 3 more
The plugin.includes property in nutch-site.xml is configured as follows:
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
httpclient-auth.xml is configured as follows:
<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="xxxlogin"
               loginFormId="lzform"
               loginRedirect="true">
    <loginPostData>
      <field name="form_email"
             value="xxxxxxxxxx@gmail.com"/>
      <field name="form_password"
             value="xxxxxxx"/>
    </loginPostData>
    <additionalPostHeaders>
      <field name="User-Agent"
             value="Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko" />
    </additionalPostHeaders>
    <removedFormFields>
      <field name="remember"/>
    </removedFormFields>
    <loginCookie>
      <policy>BROWSER_COMPATIBILITY</policy>
    </loginCookie>
  </credentials>
</auth-configuration>
The same error is reported whether or not the page I specify contains a form with the id lzform. The environment is Ubuntu 16.04 running in a virtual machine.
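The exception text suggests that the login page, as Nutch fetched it, contained no form identified as lzform. One way to narrow this down is to save the HTML returned by loginUrl to a non-browser client and list the forms it actually contains. The following is a minimal sketch using only the Python standard library. The names FormCollector, list_forms, and has_form are hypothetical helpers made up for this check, not part of Nutch, and accepting a match on either the id or the name attribute is an assumption for diagnostic purposes, not a statement of how Nutch looks up the form.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect the id and name attributes of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        # One (id, name) tuple per form; a missing attribute is stored as None.
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "form":
            attr_map = dict(attrs)
            self.forms.append((attr_map.get("id"), attr_map.get("name")))


def list_forms(html):
    """Return a list of (id, name) tuples for the forms found in html."""
    parser = FormCollector()
    parser.feed(html)
    return parser.forms


def has_form(html, form_id):
    """Return True if some form has form_id as its id or its name (assumed rule)."""
    return any(form_id in (fid, fname) for fid, fname in list_forms(html))


if __name__ == "__main__":
    # Stand-in for the saved login page; replace with the real response body.
    sample_page = (
        '<html><body>'
        '<form id="lzform" name="lzform" method="post">'
        '<input name="form_email"><input name="form_password">'
        '</form></body></html>'
    )
    print(list_forms(sample_page))
    print(has_form(sample_page, "lzform"))
```

If the saved page lists no form matching lzform, the server may be returning different markup to the crawler than to a browser (for example a redirect page or a page whose form is built by JavaScript), which would be consistent with the error appearing regardless of what the browser shows.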