
While scraping a page I ran into a very strange problem: when I move the parsing code into a custom helper function, `yield item` is never executed. The URL being scraped: http://www.duilian360.com/chu...
The code is as follows:

import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...') # this line is never executed
                yield duilian_item

In the code above, the print statement is never executed, and a breakpoint never hits the parse_paragraph function either. But if I paste the body of parse_paragraph directly at the call site, the print statement does run, like this:

import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

    # def parse_paragraph(self, div_list):
    #     for div in div_list:
    #         duilian_text_list = div.xpath('./text()').extract()
    #         for duilian_text in duilian_text_list:
    #             duilian_item = DuilianItem()
    #             duilian_item['category_id'] = 1
    #             duilian = duilian_text
    #             duilian_item['name'] = duilian
    #             duilian_item['desc'] = ''
    #             print('I reach here...')
    #             yield duilian_item

Why is this? My code has many custom helper functions and many for loops; pasting their bodies at every call site would be ugly and hard to maintain, since the same helper may be called from many places.

Asked on November 9
3 answers

Accepted

I finally found the answer: the problem was how the function was called. Prefixing the call to the custom function with `yield from` fixes it.

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    yield from self.parse_paragraph(div_list)
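
The root cause is plain Python rather than Scrapy: calling a function that contains `yield` only creates a generator object; its body does not run until something iterates it. `yield from` delegates to the sub-generator, so its items flow out through the outer generator. A minimal sketch, with made-up function names for illustration:

```python
def inner():
    # This body only runs when the generator is actually iterated.
    print('I reach here...')
    yield 1
    yield 2

def outer_broken():
    inner()  # creates a generator object and throws it away; nothing runs
    yield 'done'

def outer_fixed():
    yield from inner()  # delegates: inner's items are re-yielded here
    yield 'done'

print(list(outer_broken()))  # ['done'] -- inner's print never appears
print(list(outer_fixed()))   # prints 'I reach here...' then [1, 2, 'done']
```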

That is not where the problem is. You haven't understood how Scrapy works: start_requests runs only once; it processes the URLs in the list, hands them to the scheduler, and is done.
What matters is that you override the default parse method (or your custom callback) and yield a Request from it. That is what keeps the crawl going:

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    self.parse_paragraph(div_list)
    yield scrapy.Request(something)

def parse_paragraph(self, div_list):
    for div in div_list:
        duilian_text_list = div.xpath('./text()').extract()
        for duilian_text in duilian_text_list:
            duilian_item = DuilianItem()
            duilian_item['category_id'] = 1
            duilian = duilian_text
            duilian_item['name'] = duilian
            duilian_item['desc'] = ''
            print('I reach here...') # this line is never executed
            yield duilian_item

You are only missing a `return`. Without it, the generator produced by the yield is never iterated, which is why you never see the print output.

Change it to this:

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
-       self.parse_paragraph(div_list)
+       return self.parse_paragraph(div_list)
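
This works because Scrapy iterates whatever iterable the callback returns: with `return`, the callback hands Scrapy the generator built by parse_paragraph; without it, the callback returns None and the generator is silently discarded. The same effect in plain Python, with hypothetical names:

```python
def parse_paragraph():
    yield 'item-1'
    yield 'item-2'

def with_return():
    # The caller receives the generator and can iterate it.
    return parse_paragraph()

def without_return():
    parse_paragraph()  # generator created and dropped; implicitly returns None

print(list(with_return()))  # ['item-1', 'item-2']
print(without_return())     # None -- nothing for the caller to iterate
```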
