
I want my crawler to exit automatically once a certain condition is met. For example, while crawling pages, if it has fetched 20 duplicate pages in a row, it should call some method that stops the whole Scrapy run. Note that my Scrapy project contains only one spider.
To test this, I stripped out all the other code and used Lianjia's listing pages as the crawl target. The code is as follows:

from scrapy import signals
import scrapy
from scrapy import Spider
import time
from scrapy.exceptions import CloseSpider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        'http://sh.lianjia.com/ershoufang/d{}'.format(str(i)) for i in range(1,80)
    ]


    # @classmethod
    # def from_crawler(cls, crawler, *args, **kwargs):
    #     spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
    #     crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    #     time.sleep(1)
    #     return spider
    #     # time.sleep(1)
    #
    #
    # def spider_closed(self, spider):
    #     # self.close()
    #     spider.logger.info('Spider closed: %s', spider.name)



    def parse(self, response):
        time.sleep(1)
        # self.close()
        # ask Scrapy to stop as soon as the first response is parsed
        raise CloseSpider('end')
        print '----------'


    # def close(spider, reason):
    #     print reason,'=============='
  

The problem is the log output below:

2017-04-21 16:04:37 [scrapy.core.engine] INFO: Spider opened
2017-04-21 16:04:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-21 16:04:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6059
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://sh.lianjia.com/robots.txt> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d1> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d2> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d8> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d4> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d5> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d6> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d7> (referer: None)
2017-04-21 16:04:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d3> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] INFO: Closing spider (end)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d9> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d10> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d11> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d13> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d15> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d12> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d14> (referer: None)
2017-04-21 16:04:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d16> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d17> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d18> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d19> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d20> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d21> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d22> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d23> (referer: None)
2017-04-21 16:04:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.lianjia.com/ershoufang/d24> (referer: None)
2017-04-21 16:05:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7845,
 'downloader/request_count': 25,
 'downloader/request_method_count/GET': 25,
 'downloader/response_bytes': 491588,
 'downloader/response_count': 25,
 'downloader/response_status_count/200': 24,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'end',
 'finish_time': datetime.datetime(2017, 4, 21, 8, 5, 2, 877611),
 'log_count/DEBUG': 26,
 'log_count/INFO': 7,
 'response_received_count': 25,
 'scheduler/dequeued': 24,
 'scheduler/dequeued/memory': 24,
 'scheduler/enqueued': 24,
 'scheduler/enqueued/memory': 24,
 'start_time': datetime.datetime(2017, 4, 21, 8, 4, 38, 32776)}
2017-04-21 16:05:02 [scrapy.core.engine] INFO: Spider closed (end)

As the output shows, the crawler still went on to fetch another twenty-odd pages, even though right at the start of parse() I raised

raise CloseSpider()

Why doesn't it shut down? I would appreciate a clear explanation, and a workable solution as well.
Ideally I would rather not change this crawling strategy, since it is an incremental crawl. All I need is a way to make Scrapy exit; if anyone knows how to do that, please share it.

1 Answer


start_urls generates all 80 requests and puts them into the scheduler queue right at startup.
By the time CloseSpider is raised, a batch of those requests is already in flight, so their responses still come back (and get logged) after "Closing spider (end)" is announced.

If you want the requests processed one at a time, set CONCURRENT_REQUESTS to 1 in settings.py.
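
For reference, a minimal settings.py fragment for the suggestion above, assuming a standard Scrapy project layout. CONCURRENT_REQUESTS defaults to 16, which is why so many requests were already in flight when CloseSpider was raised.

# settings.py
# Limit Scrapy to one request in flight at a time, so a CloseSpider
# raised in parse() takes effect before further pages are downloaded.
CONCURRENT_REQUESTS = 1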

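To get closer to the original goal (stop after 20 consecutive duplicate pages), a rough sketch is below. It is only an illustration, not the poster's code: is_duplicate() is a hypothetical placeholder for whatever duplicate check the real project uses, and even with this approach the few requests already sent may still return before the engine closes unless CONCURRENT_REQUESTS is kept low.

import scrapy
from scrapy.exceptions import CloseSpider


class LianjiaSpider(scrapy.Spider):
    name = "lianjia"
    start_urls = [
        'http://sh.lianjia.com/ershoufang/d{}'.format(i) for i in range(1, 80)
    ]
    custom_settings = {
        # With one request in flight at a time, CloseSpider stops the
        # crawl almost immediately instead of 20 pages later.
        'CONCURRENT_REQUESTS': 1,
    }

    duplicate_count = 0   # consecutive duplicates seen so far
    max_duplicates = 20   # threshold at which the crawl should stop

    def parse(self, response):
        if self.is_duplicate(response):
            self.duplicate_count += 1
            if self.duplicate_count >= self.max_duplicates:
                # Stops scheduling new requests; responses already in
                # flight may still arrive before the engine closes.
                raise CloseSpider('too many consecutive duplicates')
        else:
            self.duplicate_count = 0
        # ... normal item extraction would go here ...

    def is_duplicate(self, response):
        # Hypothetical placeholder: replace with the real check, e.g.
        # comparing listing IDs against records already stored.
        return False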