I'm crawling Baidu with Scrapy using rotating proxies. With a small fraction of the proxies, requesting a normal URL returns this page instead: http://www.baidu.com/search/error.html
From the docs, it seems Scrapy can only catch failures through non-2xx status codes (404, 500, ...).
What I want is: if the URL of the response differs from the URL of the request, push the original request back into the retry queue and re-crawl it with a different proxy. I haven't found a way to do this yet.
spider.py
# -*- coding: utf-8 -*-
import re
import urllib

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.baidu.com"]
    # start_urls = ['http://www.baidu.com/s?q=&tn=baidulocal&ct=2097152&si=&ie=utf-8&cl=3&wd=%s' % urllib.quote('华为工资')]

    start_urls = []
    for line in open('/Users/sunjian/Desktop/ceshi/jieguo1.csv'):
        line = line.strip()
        try:
            ugc = re.search(r'(\d+),company', line).group(1)
            company = re.search(r'company:(.*?),', line).group(1)
            id = re.search(r'id:(\d+)', line).group(1)
            word = 'www.kanzhun.com/gso%s.html' % id
        except AttributeError:
            print 'error'
            continue  # skip lines that don't match
        url = ('http://www.baidu.com/s?q=&ct=2097152&si=&ie=utf-8&cl=3'
               '&wd=%s&class=%s-%s' % (word, ugc, id))
        start_urls.append(url)

    def __get_url_query(self, url):
        return re.search("wd=(.*?)&", url).group(1)

    def __get_url_class(self, url):
        return re.search("class=(.*)", url).group(1)

    def parse(self, response):
        query = urllib.unquote(self.__get_url_query(response.url))
        CLASS = urllib.unquote(self.__get_url_class(response.url))
        print response.url
Middleware settings in settings.py:
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
RETRY_TIMES = 10

# Downloader middleware settings
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 80,
    'ceshi.middlewares.ProxyMiddleware': 90,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
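The custom `ceshi.middlewares.ProxyMiddleware` referenced above is not shown in the question; a minimal sketch of what such a rotating-proxy middleware typically looks like (the proxy addresses and the class body here are assumptions, not the asker's actual code):

```python
import random

# Hypothetical proxy pool; replace with your own list or proxy source.
PROXIES = [
    'http://127.0.0.1:8888',
    'http://127.0.0.1:8889',
]


class ProxyMiddleware(object):
    """Assigns a random proxy to every outgoing request."""

    def process_request(self, request, spider):
        # request.meta['proxy'] is picked up downstream by the built-in
        # HttpProxyMiddleware (priority 110 in the settings above).
        request.meta['proxy'] = random.choice(PROXIES)
```

Because a fresh proxy is chosen in `process_request` each time, any request that gets re-scheduled will automatically go out through a different proxy.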
Scrapy's default configuration includes a RedirectMiddleware that handles redirects.
To achieve what you want:
Try overriding RedirectMiddleware's process_response method: when the response is a 301 redirect, return the response to the spider instead of following it.
In the spider, check the status code and re-add the URL to the crawl queue.
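A variant of this idea that stays entirely inside a downloader middleware: let RedirectMiddleware follow the redirect as usual, then check whether the final URL is the error page and, if so, re-schedule the original request with `dont_filter=True` so the dupefilter does not drop it. One caveat: after the redirect is followed, `request.url` already equals the error-page URL, and the original URL is kept in `request.meta['redirect_urls']`. A sketch under those assumptions (class name, retry limit, and meta key are made up; give this middleware a priority number below RedirectMiddleware's default of 600 so it sees the post-redirect response):

```python
ERROR_URL = 'http://www.baidu.com/search/error.html'
MAX_ERROR_RETRIES = 3  # illustrative cap, to avoid retrying forever


class BaiduErrorRetryMiddleware(object):
    """Re-schedules requests that ended up on Baidu's error page."""

    def process_response(self, request, response, spider):
        if response.url.startswith(ERROR_URL):
            # After RedirectMiddleware followed the redirect, the original
            # URL is the first entry of meta['redirect_urls'];
            # fall back to request.url if no redirect was recorded.
            original = request.meta.get('redirect_urls', [request.url])[0]
            retries = request.meta.get('error_retries', 0)
            if retries < MAX_ERROR_RETRIES:
                retryreq = request.replace(url=original, dont_filter=True)
                retryreq.meta['error_retries'] = retries + 1
                # Returning a Request sends it back to the scheduler;
                # the proxy middleware will then pick a new proxy for it.
                return retryreq
        return response
```

Enable it in DOWNLOADER_MIDDLEWARES with a number below 600, e.g. `'ceshi.middlewares.BaiduErrorRetryMiddleware': 550`. Normal responses pass through untouched.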