Following an example I found online, I wrote a CrawlSpider-based crawler, but after make_requests_from_url runs the spider just finishes; the final parse_page callback never executes. What could be the reason?
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class w3spider(CrawlSpider):
    name = "w3"
    allowed_domains = ['xxx.com']
    start_urls = ["http://w3.xxx.com/next/indexa.html"]
    rules = (
        Rule(LinkExtractor(allow=("viewDoc.do\?did=\d.*&cata\=.*",)),
             callback='parse_page', follow=True),
    )

    def start_requests(self):
        return [Request("https://login.xxx.com/login/",
                        meta={"cookiejar": 1},
                        callback=self.post_login)]

    def post_login(self, response):
        formdata = {
            "actionFlag": "loginAuthenticate",
            "lang": "en",
            "loginMethod": "login",
            "loginPageType": "mix",
            "redirect": "http%3A%2F%2Fw3xxx.com%2Fnext%2Findexa.html",
            "redirect_local": "",
            "redirect_modify": "",
            "scanedFinPrint": "",
            "uid": "hwx371981",
            "password": "QWER1234%^&*",
            "verifyCode": "2345",
        }
        return [FormRequest.from_response(response,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          formdata=formdata,
                                          callback=self.after_login,
                                          dont_filter=True)]

    def after_login(self, response):
        print(response.text)
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)
            print(url)

    def parse_page(self, response):
        print(response.url)
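For context, in the Scrapy versions of that era `make_requests_from_url` on the base `Spider` was essentially a one-liner that wraps the URL in a `Request` with `dont_filter=True` and no explicit callback, so the response falls through to `CrawlSpider.parse`, which applies the `rules`. A self-contained sketch with a stand-in `Request` class (not the real Scrapy object) to show the call shape:

```python
class Request:
    """Stand-in for scrapy.Request, only to illustrate the call shape."""
    def __init__(self, url, callback=None, dont_filter=False, meta=None):
        self.url = url
        self.callback = callback
        self.dont_filter = dont_filter
        self.meta = meta or {}

def make_requests_from_url(url):
    # Mirrors the (since-deprecated) Spider.make_requests_from_url:
    # no callback is set, so in a CrawlSpider the response is handled
    # by the default parse(), which runs the link-extraction rules.
    return Request(url, dont_filter=True)

req = make_requests_from_url("http://w3.xxx.com/next/indexa.html")
print(req.url, req.dont_filter, req.callback)
```

Note that a request produced this way carries no `meta`, so the `cookiejar` used in the login requests is not attached to it by default.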
http://w3.huawei.com/next/ind...
2017-11-23 09:31:07 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
http://w3.huawei.com/next/ind...
2017-11-23 09:31:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://w3.huawei.com/next/ind...; (referer: None)
2017-11-23 09:31:14 [scrapy.core.engine] INFO: Closing spider (finished)
As you can see, both print(url) calls ran, and the yield ran as well, but the final parse_page function was never called.
Another thing I don't understand: in `for url in self.start_urls`, start_urls contains only http://w3.xxx.com/next/indexa..., so why loop over it at all?
Regarding the `yield` and the loop: note that `self.start_urls` is a list, which is why the code iterates over it.
———— divider ————
I read through the CrawlSpider source carefully, and your code itself is fine. The problem is most likely in your regular expression: use non-greedy matching, i.e. `.*?` instead of `.*`.
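The greedy vs. non-greedy difference can be shown with a plain `re` example (the URL below is hypothetical; whether this is what trips up the LinkExtractor depends on the actual links on the page):

```python
import re

# A URL where "&cata=" appears twice, to expose the difference.
url = "viewDoc.do?did=123&cata=a&cata=b"

# Greedy .* runs as far right as possible, so the match extends
# to the LAST "&cata=".
greedy = re.search(r"did=\d.*&cata=", url)

# Non-greedy .*? stops as early as possible, at the FIRST "&cata=".
lazy = re.search(r"did=\d.*?&cata=", url)

print(greedy.group())  # did=123&cata=a&cata=
print(lazy.group())    # did=123&cata=
```

Both variants still *match* here; what changes is the span that is consumed, which matters when the pattern has to line up with the rest of the link.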