CrawlSpider cannot crawl the homepage after logging in

Following an example from an expert online, I wrote a CrawlSpider, but once make_requests_from_url has run, the spider just finishes; the final parse_page never runs. What could be the reason?

from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class w3spider(CrawlSpider):
    name = "w3"
    allowed_domains = ['xxx.com']
    start_urls = ["http://w3.xxx.com/next/indexa.html"]
    rules = (
        Rule(LinkExtractor(allow=(r"viewDoc.do\?did=\d.*&cata\=.*",)),
             callback='parse_page', follow=True),
    )

    def start_requests(self):
        # Begin at the login page instead of start_urls; the named
        # cookiejar keeps the session cookies across requests.
        return [Request("https://login.xxx.com/login/",
                        meta={"cookiejar": 1},
                        callback=self.post_login)]

    def post_login(self, response):
        formdata = {
            "actionFlag": "loginAuthenticate",
            "lang": "en",
            "loginMethod": "login",
            "loginPageType": "mix",
            "redirect": "http%3A%2F%2Fw3xxx.com%2Fnext%2Findexa.html",
            "redirect_local": "",
            "redirect_modify": "",
            "scanedFinPrint": "",
            "uid": "hwx371981",
            "password": "QWER1234%^&*",
            "verifyCode": "2345",
        }
        # Fill in and submit the login form found in the response.
        return [FormRequest.from_response(response,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          formdata=formdata,
                                          callback=self.after_login,
                                          dont_filter=True)]

    def after_login(self, response):
        print(response.text)

        # Once logged in, start the actual crawl from start_urls.
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)
            print(url)

    def parse_page(self, response):
        print(response.url)

The output:

http://w3.huawei.com/next/ind...
2017-11-23 09:31:07 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
http://w3.huawei.com/next/ind...
2017-11-23 09:31:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://w3.huawei.com/next/ind...> (referer: None)
2017-11-23 09:31:14 [scrapy.core.engine] INFO: Closing spider (finished)

As you can see, both print(url) calls ran and the yield ran, but the final parse_page never did.
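
For reference, in the Scrapy release in use at the time, make_requests_from_url is roughly the following (paraphrased from the Scrapy source; the method has since been deprecated). It sets no callback, so its requests fall back to the spider's default parse, which in a CrawlSpider is the built-in method that applies the rules:

def make_requests_from_url(self, url):
    # Paraphrase of scrapy.Spider.make_requests_from_url: no callback is
    # set, so the response is handed to self.parse by default.
    return Request(url, dont_filter=True)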

Something else I don't understand: in for url in self.start_urls, start_urls is just http://w3.xxx.com/next/indexa..., so why loop over it at all?

2 Answers

Change the yield statement to:

yield Request(url, callback=self.parse_page)

Also, note that self.start_urls is a list.
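
A minimal sketch of after_login with that change; forwarding the cookiejar meta key is my assumption here, so that these requests stay inside the logged-in session:

def after_login(self, response):
    for url in self.start_urls:
        # Assumption: carry the same cookiejar so the session cookies
        # from the login are sent with these requests too.
        yield Request(url,
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse_page,
                      dont_filter=True)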
———— Update ————
I've now read through the CrawlSpider source carefully, and your code is fine. The problem is most likely in your regular expression: you should use a non-greedy match, .*?
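
For example, the rule could be rewritten along these lines (a sketch: the literal dot is also escaped here, and the exact pattern depends on the real URLs):

rules = (
    Rule(LinkExtractor(allow=(r"viewDoc\.do\?did=\d.*?&cata=.*",)),
         callback='parse_page', follow=True),
)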

With that change the final parse_page runs, but the rules are no longer used to keep crawling.
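
That is expected: a response delivered to a custom callback such as parse_page never passes through CrawlSpider's built-in parse(), which is where the rules are applied. A sketch that keeps the rules working instead, again assuming the cookiejar meta key has to be forwarded:

def after_login(self, response):
    for url in self.start_urls:
        # No callback argument: the response goes to CrawlSpider's own
        # parse(), which extracts links via the rules and follows them.
        yield Request(url,
                      meta={'cookiejar': response.meta['cookiejar']},
                      dont_filter=True)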
