Scrapy run from cmd to fetch a passage from a Douban online-reading book gets blocked by the site


I'm a beginner. While practicing how to scrape text from a web page, I ran into the following error:
`2017-10-20 11:35:20 [scrapy.core.engine] INFO: Spider opened
2017-10-20 11:35:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-20 11:35:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-20 11:35:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://read.douban.com/reade...; (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://read.douban.com/reade...; (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://read.douban.com/reade...; (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.core.scraper] ERROR: Error downloading <GET https://read.douban.com/reade...;: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.core.engine] INFO: Closing spider (finished)`

It looks like my program has been banned by the site. The code is below. Any help would be appreciated, thanks. I'm still learning.

# -*- coding: utf-8 -*-

from scrapy.spiders import Spider


class WenxianSpider(Spider):
    name = 'wenxian'
    start_urls = ['https://read.douban.com/reader/ebook/39766537/']

    def parse(self, response):
        # Select the text of one span inside one specific paragraph.
        books = response.xpath('//p[@data-pid="580813884"]/dfn/span[@data-offset="0"]/text()').extract()
        # booksgbk = books.encode("utf-8")

        for book in books:
            # When printing, 'ignore' skips characters that cannot be encoded.
            print(book.strip().encode("utf-8", 'ignore'))
1 answer

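Saying that your program was banned by the site is not quite accurate. More likely, you sent too many requests while debugging, the site detected this, and it temporarily blocked your IP address.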

You can use a pool of IP proxies to avoid this problem.
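A proxy pool could be wired into Scrapy roughly as follows. This is only a minimal sketch: the class name `RandomProxyMiddleware`, the `PROXY_POOL` list, and the proxy addresses are all hypothetical placeholders, not something from the question. It relies on Scrapy routing a request through whatever proxy URL is stored in `request.meta["proxy"]`.

```python
import random

# Hypothetical proxy addresses (assumption); replace with proxies you can actually use.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomProxyMiddleware(object):
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies=None):
        # Fall back to the module-level pool when no list is passed in.
        self.proxies = list(proxies or PROXY_POOL)

    def process_request(self, request, spider):
        # Scrapy sends the request through the proxy named in request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets the request continue through the middleware chain.
        return None
```

To enable it, the class would be registered in the project's `DOWNLOADER_MIDDLEWARES` setting, for example `{"myproject.middlewares.RandomProxyMiddleware": 350}`, where the module path and the priority are again assumptions.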

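Also, throttle the crawl speed and the interval between requests, and set headers such as Referer and User-Agent, so that the program looks more like an ordinary user than a crawler.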
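In Scrapy these suggestions map onto a few project settings. The fragment below is a sketch of what `settings.py` might contain. The setting names are standard Scrapy settings, but the specific values, the User-Agent string, and the Referer URL are illustrative assumptions that would need tuning for the real site.

```python
# settings.py (sketch): slow the crawl down and send browser-like headers.

# Base delay in seconds between requests to the same site (value is an assumption).
DOWNLOAD_DELAY = 2
# Randomize each wait to between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True

# Example browser User-Agent string (assumption); use one from a real browser.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Headers added to every request; the Referer value is an assumption.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "https://read.douban.com/",
}
```

A longer delay and a single concurrent request per domain trade crawl speed for a lower chance of tripping rate limits, which matters most during repeated debugging runs.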
