Scrapy crawling Taobao product detail pages: requests are randomly forced into a 302 redirect to h5.m.taobao.com.

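  1. I am using scrapy + redis to fetch product details from a batch of Taobao detail-page URLs.

  2. A user-agent is set, cookies are passed in, and a proxy IP is configured.

  3. When fetching a URL, response.status is sometimes 200 and sometimes 302; it changes at random.

  4. Out of 1000 URLs, product information is retrieved successfully for a little over 400.

  5. Is the cookie not being passed successfully, or is the proxy IP unstable? Or is there some other cause? Please help me analyze this, thanks!

  6. Error log (Traceback):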

2017-07-14 15:51:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://item.taobao.com/item.htm?id=10245430841&ns=1&abbucket=0#detail> (referer: None)
2017-07-14 15:51:12 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTPS connection (1): rate.taobao.com
2017-07-14 15:51:12 [requests.packages.urllib3.connectionpool] DEBUG: "GET /detailCommon.htm?auctionNumId=10245430841 HTTP/1.1" 200 None
2017-07-14 15:51:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://item.taobao.com/item.htm?id=10245430841&ns=1&abbucket=0>
None
2017-07-14 15:51:12 [taobao] DEBUG: Read 1 requests from 'taobao:start_urls'
2017-07-14 15:51:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://item.taobao.com/item.htm?id=10245681616&ns=1&abbucket=0#detail>


2017-07-14 15:51:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://h5.m.taobao.com/awp/core/detail.htm?id=10245681616&ns=1&abbucket=0> from <GET https://item.taobao.com/item.htm?id=10245681616&ns=1&abbucket=0#detail>
2017-07-14 15:51:12 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://h5.m.taobao.com/awp/core/detail.htm?id=10245681616&ns=1&abbucket=0>


2017-07-14 15:51:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://h5.m.taobao.com/awp/core/detail.htm?id=10245681616&ns=1&abbucket=0> (referer: None) ['partial']
2017-07-14 15:51:12 [scrapy.core.scraper] ERROR: Spider error processing <GET http://h5.m.taobao.com/awp/core/detail.htm?id=10245681616&ns=1&abbucket=0> (referer: None)
2 answers
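  1. I found the cause: the user-agent list I was importing contained mobile UAs. After removing them, the problem went away.

  2. I have put together an up-to-date 2017 ua_list (PC/desktop only) for everyone: https://github.com/lovebaicai...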
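
For anyone hitting the same thing, here is a minimal sketch of the fix described above: filter mobile UAs out of the list before picking one at random. The marker regex, the class name `DesktopUserAgentMiddleware`, and the `UA_LIST` setting name are all assumptions made for illustration; they are not taken from the original project or from the linked ua_list. The keyword match is only a heuristic, so check it against your own list.

```python
import random
import re

# Heuristic markers that commonly appear in mobile browser User-Agent strings.
# This pattern is an assumption for illustration, not an exhaustive classifier.
_MOBILE_MARKERS = re.compile(
    r"Mobile|Android|iPhone|iPad|iPod|Windows Phone|Opera Mini",
    re.IGNORECASE,
)


def is_mobile_ua(user_agent):
    """Return True if the UA string looks like a mobile browser."""
    return bool(_MOBILE_MARKERS.search(user_agent))


def desktop_only(user_agents):
    """Keep only the UA strings that do not look mobile, preserving order."""
    return [ua for ua in user_agents if not is_mobile_ua(ua)]


class DesktopUserAgentMiddleware:
    """Downloader middleware that sets a random desktop UA on every request."""

    def __init__(self, user_agents):
        self.user_agents = desktop_only(user_agents)
        if not self.user_agents:
            raise ValueError("no desktop User-Agent strings left after filtering")

    @classmethod
    def from_crawler(cls, crawler):
        # UA_LIST is a hypothetical custom setting holding your UA strings.
        return cls(crawler.settings.getlist("UA_LIST"))

    def process_request(self, request, spider):
        # Overwrite the header so a mobile UA can never be sent.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request
```

To use it, you would register the class under `DOWNLOADER_MIDDLEWARES` in settings.py (the module path depends on your project) and put your UA strings in `UA_LIST`. Filtering once at startup rather than on every request keeps `process_request` cheap, and raising on an empty list makes a bad UA file fail loudly instead of silently sending a default UA.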
