Python Scrapy: custom function never gets called.

While crawling a page I ran into a very strange problem: when I use a custom function, the yield item never runs. The page being crawled: http://www.duilian360.com/chu...
The code:

import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...') # this line is never reached
                yield duilian_item

In the code above, the print statement is never executed, and a breakpoint never gets into the parse_paragraph function either. But if I paste the body of parse_paragraph directly at the call site, the print statement does produce output, like this:

import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url, callback=lambda response: self.parse_page(response))

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

    # def parse_paragraph(self, div_list):
    #     for div in div_list:
    #         duilian_text_list = div.xpath('./text()').extract()
    #         for duilian_text in duilian_text_list:
    #             duilian_item = DuilianItem()
    #             duilian_item['category_id'] = 1
    #             duilian = duilian_text
    #             duilian_item['name'] = duilian
    #             duilian_item['desc'] = ''
    #             print('I reach here...')
    #             yield duilian_item

Why is that? My code has many custom functions and many for loops; pasting code directly at every call site would be ugly and bad for maintenance, since the same custom function may be called from many places.

4 Answers

Finally found the answer. The calling style was wrong: prefix the call to the custom function with yield from and it works.

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    yield from self.parse_paragraph(div_list)
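
Why this works: in Python, calling a function whose body contains yield does not run the body at all; it only builds a generator object. parse_paragraph therefore never executes unless something iterates it, and yield from makes parse_page delegate to it, so parse_page itself becomes a generator that Scrapy can iterate. A plain-Python sketch of the difference (all names below are made up for the demo):

def make_items():
    print('I reach here...')  # runs only when the generator is actually iterated
    yield 'item'

def call_without_delegation():
    make_items()              # builds a generator object and discards it: nothing runs

def call_with_delegation():
    yield from make_items()   # delegates: the print and the yield run during iteration

call_without_delegation()             # prints nothing
for item in call_with_delegation():   # prints 'I reach here...' and then 'item'
    print(item)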

You are just missing a return. Without it, the generator produced by the yields is never executed, so you never see the print output.

Change it like this:

    def parse_page(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
-       self.parse_paragraph(div_list)
+       return self.parse_paragraph(div_list)
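
Applied to the spider in the question, the corrected pair would look roughly like this (same XPath and DuilianItem fields as in the original code). Returning the generator works because Scrapy iterates whatever iterable a callback returns:

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    # parse_paragraph(...) only builds a generator object; returning it lets
    # Scrapy iterate it and receive every yielded item.
    return self.parse_paragraph(div_list)

def parse_paragraph(self, div_list):
    for div in div_list:
        for duilian_text in div.xpath('./text()').extract():
            duilian_item = DuilianItem()
            duilian_item['category_id'] = 1
            duilian_item['name'] = duilian_text
            duilian_item['desc'] = ''
            yield duilian_item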

(Addendum)
This is the first time I've seen someone make a mistake and still be so certain that other people's answers are wrong.
First, start_requests normally runs only once, at startup. By default it wraps each URL in the start_urls list into a scrapy.Request and yields them to the scheduler one by one; once a page has been downloaded, the parse function is called back. If you need to keep crawling, just yield another scrapy.Request inside the parse function; when you have collected data, create an Item subclass instance and yield it, and Scrapy recognizes it and passes it to the pipeline. The answer below is right: doesn't your yield from just amount to a return here?
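
For illustration, a rough sketch of that loop as methods of the spider in the question; the next-page XPath is invented for the example:

def parse(self, response):
    # Each yielded DuilianItem is picked up by Scrapy and sent to the item pipelines.
    for duilian_text in response.xpath("//div[@class='contentF']/div[@class='content_l']/p/text()").extract():
        item = DuilianItem()
        item['category_id'] = 1
        item['name'] = duilian_text
        item['desc'] = ''
        yield item

    # Yielding another Request keeps the crawl going: the scheduler downloads it
    # and calls the callback with the new response.
    next_page = response.xpath("//a[@class='next']/@href").extract_first()  # hypothetical selector
    if next_page:
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)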


That's not where the problem is. You haven't understood how Scrapy works: the start_requests function runs only once; it processes the URLs in the list, hands them to the scheduler, and is done.
What matters is that you override the parse function (the default) or a custom callback and yield a Request from it. That is how the crawl keeps going.

def parse_page(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    self.parse_paragraph(div_list)
    yield scrapy.Request(something)

def parse_paragraph(self, div_list):
    for div in div_list:
        duilian_text_list = div.xpath('./text()').extract()
        for duilian_text in duilian_text_list:
            duilian_item = DuilianItem()
            duilian_item['category_id'] = 1
            duilian = duilian_text
            duilian_item['name'] = duilian
            duilian_item['desc'] = ''
            print('I reach here...') # this line is never reached
            yield duilian_item

Your answer is wrong. Because the custom function contains a yield, it is a generator, and a generator is not invoked simply by calling the function name.
