
Overview

Scrapy is a web scraping framework written in Python for crawling web sites and extracting structured data from their pages. It is widely applicable, and can be used for data mining, monitoring, and automated testing.
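
To get a feel for that workflow, here is a minimal sketch of a spider: it names itself, lists a start URL, and yields structured items from parse(). The site and CSS selectors below (quotes.toscrape.com, the public Scrapy demo site) are only illustrative assumptions, not part of these notes.

import scrapy

class QuotesSpider(scrapy.Spider):
    # illustrative spider; site and selectors are assumptions
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one structured item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

It can be run with, for example, scrapy runspider quotes_spider.py -o quotes.json (the file name is arbitrary).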

Hardcore tips

Basic request and response objects

request: scrapy.http.request.Request
# HtmlResponse inherits from TextResponse, which inherits from Response
response: scrapy.http.response.html.HtmlResponse
response: scrapy.http.response.text.TextResponse
response: scrapy.http.response.Response
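
The inheritance chain is easy to verify directly. A small sketch (the URL and body are made up):

from scrapy.http import Request, Response, TextResponse, HtmlResponse

req = Request("http://www.example.com")
resp = HtmlResponse(url=req.url, body=b"<html></html>", request=req)

# HtmlResponse -> TextResponse -> Response
print(isinstance(resp, HtmlResponse))   # True
print(isinstance(resp, TextResponse))   # True
print(isinstance(resp, Response))       # True
print(resp.request is req)              # True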

Printing a spider's settings from inside the spider

for k in self.settings:
    print(k, self.settings.get(k))
    if isinstance(self.settings.get(k), scrapy.settings.BaseSettings):
        for kk in self.settings.get(k):
            print('\t', kk, self.settings.get(k).get(kk))
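
For context, a hedged sketch of a complete spider built around that loop; the spider name and start URL are placeholders. self.settings is only available once the spider is bound to a crawler, i.e. inside callbacks rather than in __init__.

import scrapy
from scrapy.settings import BaseSettings

class SettingsDumpSpider(scrapy.Spider):
    # hypothetical spider name and start URL
    name = "settings_dump"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        for k in self.settings:
            v = self.settings.get(k)
            print(k, v)
            if isinstance(v, BaseSettings):
                # dict-valued settings (e.g. the *_BASE dictionaries) show up
                # as nested BaseSettings objects
                for kk in v:
                    print('\t', kk, v.get(kk))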

Number of requests in Scrapy's queue

How to get the number of requests in queue in scrapy?

# scrapy.core.scheduler.Scheduler
# spider
len(self.crawler.engine.slot.scheduler)
# pipeline 
len(spider.crawler.engine.slot.scheduler)

Number of requests Scrapy currently has in flight

# scrapy.core.engine.Slot.inprogress is just a set
# spider
len(self.crawler.engine.slot.inprogress)
# pipeline 
len(spider.crawler.engine.slot.inprogress)
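
Both counters can be logged together from a callback. A sketch, under the assumption that these engine internals (crawler.engine.slot and its scheduler/inprogress attributes) keep their current names; they are not public API and may change between Scrapy versions.

import scrapy

class QueueStatsSpider(scrapy.Spider):
    # hypothetical spider; name and start URL are placeholders
    name = "queue_stats"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        slot = self.crawler.engine.slot
        # requests still waiting in the scheduler queue
        self.logger.info("queued: %d", len(slot.scheduler))
        # requests currently being downloaded (slot.inprogress is a set)
        self.logger.info("in flight: %d", len(slot.inprogress))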

Getting the pipeline object inside a spider

How to get the pipeline object in Scrapy spider

# Pipeline
import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # "settings" is assumed to be available here (e.g. read via from_crawler);
        # pymongo.Connection was removed, MongoClient is the current API
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        
    def get_date(self):
        pass

    def open_spider(self, spider):
        # expose this pipeline instance on the spider so its callbacks can use it
        spider.myPipeline = self
    
    def process_item(self, item, spider):
        pass
        
# spider
from scrapy import Spider

class MySpider(Spider):
    def __init__(self):
        self.myPipeline = None
        
    def start_requests(self):
        # the pipeline instance can be called directly to store data
        self.myPipeline.process_item(item, self)

    def parse(self, response):
        self.myPipeline.get_date()

Multiple cookie sessions in a single spider

Multiple cookie sessions per spider

# Scrapy supports tracking multiple cookie sessions in a single spider via the cookiejar Request meta key.
# By default it uses one cookie jar (session), but you can pass an identifier to use several.
for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)
        
# Note that the cookiejar meta key is not "sticky": you need to keep passing it along on subsequent requests.
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
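
Putting the two fragments together, a sketch of a complete spider that opens one independent session per start URL; the URLs and page flow are assumptions.

import scrapy

class MultiSessionSpider(scrapy.Spider):
    name = "multi_session"

    def start_requests(self):
        urls = ["http://www.example.com/a", "http://www.example.com/b"]  # assumed URLs
        for i, url in enumerate(urls):
            # each value of i gets its own cookie jar (independent session)
            yield scrapy.Request(url, meta={'cookiejar': i}, callback=self.parse_page)

    def parse_page(self, response):
        # cookiejar is not sticky: forward it explicitly on follow-up requests
        yield scrapy.Request("http://www.example.com/otherpage",
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_other_page)

    def parse_other_page(self, response):
        pass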

When a spider counts as finished

Closing spider (finished)

# scrapy.core.engine.ExecutionEngine
def spider_is_idle(self, spider):
    if not self.scraper.slot.is_idle():
        # scraper is not idle
        return False

    if self.downloader.active:
        # downloader has pending requests
        return False

    if self.slot.start_requests is not None:
        # not all start requests are handled
        return False

    if self.slot.scheduler.has_pending_requests():
        # scheduler has pending requests
        return False

    return True
    
# print these conditions from inside the spider
self.logger.debug('engine.scraper.slot.is_idle: %s' % repr(self.crawler.engine.scraper.slot.is_idle()))
self.logger.debug('\tengine.scraper.slot.active: %s' % repr(self.crawler.engine.scraper.slot.active))
self.logger.debug('\tengine.scraper.slot.queue: %s' % repr(self.crawler.engine.scraper.slot.queue))
self.logger.debug('engine.downloader.active: %s' % repr(self.crawler.engine.downloader.active))
self.logger.debug('engine.slot.start_requests: %s' % repr(self.crawler.engine.slot.start_requests))
self.logger.debug('engine.slot.scheduler.has_pending_requests: %s' % repr(self.crawler.engine.slot.scheduler.has_pending_requests()))

Catching the spider_idle signal to add more requests

Scrapy: How to manually insert a request from a spider_idle event callback?

from scrapy import signals
from scrapy.spiders import Spider

class FooSpider(Spider):
    yet = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        from_crawler = super(FooSpider, cls).from_crawler
        spider = from_crawler(crawler, *args, **kwargs)
        # call spider.idle() every time the spider_idle signal fires
        crawler.signals.connect(spider.idle, signal=signals.spider_idle)
        return spider

    def idle(self):
        if not self.yet:
            # create_request() is a placeholder from the original answer; scheduling
            # the extra request here prevents the engine from closing the spider
            self.crawler.engine.crawl(self.create_request(), self)
            self.yet = True
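
A variant of the same idea, sketched with a backlog of URLs: feed one request each time the spider goes idle and raise DontCloseSpider while work remains. The backlog URLs are made up, and note that engine.crawl() dropped its spider argument in newer Scrapy versions.

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class RefillSpider(scrapy.Spider):
    name = "refill"
    backlog = ["http://www.example.com/1", "http://www.example.com/2"]  # assumed URLs

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self, spider):
        if self.backlog:
            # schedule one more request; on Scrapy >= 2.6 call engine.crawl(request)
            # without the spider argument
            self.crawler.engine.crawl(scrapy.Request(self.backlog.pop(), callback=self.parse), self)
            # keep the engine from closing the spider while the backlog is non-empty
            raise DontCloseSpider

    def parse(self, response):
        pass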

Notes on selected settings

  • HTTPERROR_ALLOW_ALL
Default: False
With True, non-200 responses are passed to the callback; with the default False they are filtered by HttpErrorMiddleware and end up in the errback. Timeouts always go to the errback:

HTTPERROR_ALLOW_ALL    non-200 response    timeout
True                   callback            errback
False                  errback             errback
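
A sketch of how that plays out per request (spider name and URL are made up): with HTTPERROR_ALLOW_ALL = True the 404 response reaches parse(); flip it to the default False and the failure lands in the errback instead, just like a timeout would.

import scrapy

class StatusDemoSpider(scrapy.Spider):
    # hypothetical spider and URL
    name = "status_demo"
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}  # set to False to see the errback path

    def start_requests(self):
        yield scrapy.Request("http://www.example.com/404",
                             callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        # reached for non-200 responses only when HTTPERROR_ALLOW_ALL is True
        self.logger.info("callback got status %d", response.status)

    def on_error(self, failure):
        # timeouts and (with the default setting) filtered HTTP errors end up here
        self.logger.warning("errback: %r", failure)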

Architecture diagrams

Scrapy 1.1 architecture diagram

Latest Scrapy architecture diagram
walker's take: the newer diagram looks like just a more detailed version of the old one, with no substantive differences.

This article comes from walker snapshot.
