Scrapy's crawl order is different on every run

    from scrapy import Request

    def parse(self, response):
        print("parse called")
        medias = response.xpath('//li[@class="media"]//h3/a')
        for index, media in enumerate(medias):
            # extract_first() returns None instead of raising if nothing matched
            url = media.xpath('./@href').extract_first()
            print("%d: %s" % (index + 1, url))
            yield Request(url, callback=self.parse_apply, dont_filter=True)

The links extracted by the selector are:

http:hostname/10223/
http:hostname/10142/
http:hostname/10093/#comm-12881
http:hostname/10075/#comm-12853
http:hostname/10042/#comm-12792
http:hostname/10040/#comm-12791
http:hostname/10025/#comm-12790
http:hostname/10016/#comm-12789
http:hostname/10013/#comm-12788
http:hostname/9972/#comm-12539
http:hostname/9931/#comm-12538
http:hostname/9829/#comm-12451
http:hostname/9845/#comm-12361
http:hostname/9740/#comm-12321
http:hostname/9834/#comm-12287
http:hostname/9824/#comm-12285
http:hostname/9748/#comm-12135
http:hostname/9706/#comm-12085
http:hostname/9610/#comm-12084
http:hostname/9596/#comm-11925
http:hostname/9598/#comm-11860
http:hostname/9566/#comm-11859
http:hostname/9522/#comm-11858
http:hostname/9513/#comm-11703
http:hostname/9472/#comm-11667
http:hostname/9439/#comm-11666
http:hostname/9432/#comm-11665
http:hostname/9398/#comm-11627
http:hostname/9394/#comm-11563
http:hostname/9382/#comm-11562
http:hostname/9311/#comm-11480
http:hostname/9306/#comm-11479
http:hostname/9195/#comm-11436
http:hostname/9149/#comm-11276
http:hostname/9060/#comm-11166
http:hostname/9024/#comm-11098
http:hostname/8745/#comm-10989
http:hostname/8912/#comm-10888
http:hostname/8868/#comm-10876
http:hostname/8853/#comm-10875

The links follow the order they appear on the page, but when Scrapy crawls them it doesn't necessarily start from the first one, and the order differs each time I run it. I haven't added pagination here; with pagination there's a further problem: not all links get processed. For example, each page has 40 links, but sometimes only 16 are processed, sometimes 17. Very strange.

2 Answers

Take a look at the official Scrapy documentation:
Scrapy at a glance
Architecture Overview

The documentation states this explicitly:

Here you notice one of the main advantages about Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

Requests are processed asynchronously: after Scrapy sends a request, it does not wait (block) for that request's response; it can send other requests or do other work in the meantime. And as we know, a server's response time depends on many factors — as 猫之良品 points out, network speed, parsing speed, resource contention, and so on — so the order in which responses come back is hard to predict.
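A minimal stdlib sketch (not Scrapy itself, which runs on Twisted) of why completion order differs from submission order when requests are in flight concurrently. The names and delays here are made up for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_request(name, delay):
    # Simulate a server whose response time varies per request.
    time.sleep(delay)
    return name

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submitted in order a, b, c -- but b and c "respond" faster than a.
    futures = [pool.submit(fake_request, name, delay)
               for name, delay in [("a", 0.3), ("b", 0.1), ("c", 0.2)]]
    # as_completed yields futures in the order they finish, not the
    # order they were submitted.
    completed = [f.result() for f in as_completed(futures)]

print(completed)  # likely ["b", "c", "a"], not the submission order
```

The same effect happens in Scrapy: the scheduler may dispatch requests roughly in order, but your callback fires in response-arrival order.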

The root of Scrapy's asynchrony is its reliance on the Twisted framework. Twisted is an event-driven Python framework, which for our purposes you can think of as asynchronous I/O.

If you need to guarantee order, you would have to use a synchronous I/O tool instead. If you want to solve this within Scrapy, see this answer:
Scrapy Crawl URLs in Order
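One common approach from that answer is to give each request a priority that decreases down the page, since Scrapy's scheduler dequeues higher-priority requests first. Here is a hedged sketch of the idea, modelling the scheduler with a plain `heapq` priority queue (the URLs are placeholders):

```python
import heapq

page_urls = ["/10223/", "/10142/", "/10093/"]  # order on the page

queue = []
for index, url in enumerate(page_urls):
    # Scrapy pops the highest priority first; heapq pops the smallest
    # item, so we push the negated priority.
    priority = len(page_urls) - index
    heapq.heappush(queue, (-priority, url))

# Dequeue everything: comes out in page order regardless of push order.
ordered = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(ordered)  # ["/10223/", "/10142/", "/10093/"]
```

In the spider itself this would look like `yield Request(url, callback=self.parse_apply, priority=len(medias) - index)`. Note that priority only orders the scheduler's queue; with concurrency greater than 1, responses can still arrive out of order, so strict ordering also requires limiting concurrent requests to 1.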

If you're not familiar with multithreading and synchronous vs. asynchronous I/O, these may help:
高性能IO模型浅析 (a Chinese-language overview of high-performance I/O models)
asynchronous vs non-blocking

Scrapy is asynchronous and concurrent, so the order depends on many factors: network speed, parsing speed, resource contention, and so on. A synchronous approach would certainly preserve order, but it would be very inefficient.
