I'm a total beginner. While practicing how to scrape text content from a web page, I got the following error: `2017-10-20 11:35:20 [scrapy.core.engine] INFO: Spider opened
2017-10-20 11:35:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-20 11:35:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-20 11:35:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://read.douban.com/reade...; (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://read.douban.com/reade...; (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://read.douban.com/reade...; (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.core.scraper] ERROR: Error downloading <GET https://read.douban.com/reade...;: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2017-10-20 11:35:24 [scrapy.core.engine] INFO: Closing spider (finished)`
It looks like my program was banned by the site. The code is below. Could anyone help me out? Thanks, I'm still learning.
```python
# -*- coding:utf-8 -*-
from scrapy.spiders import Spider

class wenxianspider(Spider):
    name = 'wenxian'
    start_urls = ['https://read.douban.com/reader/ebook/39766537/']

    def parse(self, response):
        books = response.xpath('//p[@data-pid="580813884"]/dfn/span[@data-offset="0"]/text()').extract()
        # booksgbk = books.encode("utf-8")  # note: books is a list, so it has no .encode() method
        for book in books:
            # when printing, 'ignore' skips any characters that cannot be encoded
            print book.strip().encode("utf-8", 'ignore')
```
Saying "my program was banned by the site" isn't quite accurate. More likely, you sent too many requests while debugging, the site's monitoring picked that up, and it temporarily blocked your IP address.
You can use an IP proxy pool to work around this.
Beyond that, throttle your crawl speed, add delays between requests, and set headers such as Referer and User-Agent, so your spider looks more like an ordinary browser user rather than an obvious crawler.
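The throttling and header suggestions above could be sketched as additions to the project's `settings.py`. These are real Scrapy setting names, but the specific values and the Referer URL are just illustrative assumptions, not something tuned for Douban:

```python
# Illustrative settings.py additions for looking less like a bot (values are assumptions).

# Pretend to be a desktop browser instead of the default Scrapy user agent.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')

# Wait between requests; RANDOMIZE_DOWNLOAD_DELAY jitters the delay (0.5x-1.5x)
# so the request timing is less mechanical.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Send browser-like default headers with every request.
DEFAULT_REQUEST_HEADERS = {
    'Referer': 'https://read.douban.com/',          # hypothetical referer
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}

# Let Scrapy adapt the delay to the server's response times automatically.
AUTOTHROTTLE_ENABLED = True
```

A proxy pool is usually wired in separately, via a downloader middleware that sets `request.meta['proxy']` per request; that part is not shown here.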