Why does the file created by my Scrapy spider's pipeline stay empty and never get written to?

Posted on
2024-04-11, Henan

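I've recently been learning how to persist scraped data with item pipelines in Scrapy and ran into this problem. All I know is that the fp I create is always None.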

  import scrapy
  import sys
  sys.path.append(r'D:\project_test\PyDemo\demo1\xunlian\mySpider\qiubai')
  from ..items import QiubaiItem
  class BiedouSpider(scrapy.Spider):
      name = "biedou"
      #allowed_domains = ["www.xxx.com"]
      start_urls = ["https://www.biedoul.com/wenzi/"]

      def parse(self, response):
          #pass
          dl_list = response.xpath('/html/body/div[4]/div[1]/div[1]/dl')

          for dl in dl_list:
              title = dl.xpath('./span/dd/a/strong/text()')[0].extract()
              content = dl.xpath('./dd//text()').extract()
              content = ''.join(content)
              # data parsing is done

              item = QiubaiItem()
              item['title'] = title
              item['content'] = content
              yield item  # hand the item over to the pipeline
              break

Next is items.py:

  import scrapy
  class QiubaiItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      title = scrapy.Field()
      content = scrapy.Field()

And this is pipelines.py:

   
  class QiubaiPipeline(object):
      def __init__(self):
          self.fp = None
      def open_spdier(self,spider):  # override the parent class's file-opening method
          print("开始爬虫")
          self.fp = open('./biedou.txt','w',encoding='utf-8')

      def close_spider(self,spider):
          print("结束爬虫")
          self.fp.close()

      def process_item(self, item, spider):
          title = str(item['title'])
          content = str(item['content'])
          self.fp.write(title+':'+content+'\n')
          return item

Here is the error output:

  PS D:\project_test\PyDemo\demo1\xunlian\mySpider\qiubai> py -m scrapy crawl biedou
  2024-04-11 10:36:12 [scrapy.core.scraper] ERROR: Error processing {'content': '笑点不同怎么做朋友。。', 'title': '笑点太低了吧'}
  Traceback (most recent call last):
    File "C:\Users\空条承太郎\AppData\Roaming\Python\Python312\site-packages\twisted\internet\defer.py", line 1078, in _runCallbacks
      current.result = callback(  # type: ignore[misc]
    File "C:\Users\空条承太郎\AppData\Roaming\Python\Python312\site-packages\scrapy\utils\defer.py", line 340, in f
      return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  TypeError: Object of type QiubaiItem is not JSON serializable
  结束爬虫
  2024-04-11 10:36:12 [scrapy.core.engine] ERROR: Scraper close failure
  Traceback (most recent call last):
    File "C:\Users\空条承太郎\AppData\Roaming\Python\Python312\site-packages\twisted\internet\defer.py", line 1078, in _runCallbacks
      current.result = callback(  # type: ignore[misc]
    File "D:\project_test\PyDemo\demo1\xunlian\mySpider\qiubai\qiubai\pipelines.py", line 24, in close_spider
      self.fp.close()
  AttributeError: 'NoneType' object has no attribute 'close'

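I've been hunting for the cause for a long time. I compared my code against examples of overriding the parent-class methods and my overrides look correct to me, and I have also manually enabled the pipeline in the settings file. Still, I can't figure out why the fp I create is None: nothing ever gets written, and the txt file isn't even created. I'd be very grateful if someone could help me solve this!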

python, crawler

AI Bot (BETA)

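Judging from the error output, there are two main problems:

TypeError: Object of type QiubaiItem is not JSON serializable: something tried to serialize the QiubaiItem object as JSON and failed. The traceback shown contains only Twisted and Scrapy frames, so it does not reveal which call did this. The pipeline as posted never calls a JSON serializer, and Scrapy does not write items to disk as JSON unless a feed export is configured. A Scrapy Item is mapping-like but is not a dict subclass, so passing one directly to json.dumps raises exactly this error.
AttributeError: 'NoneType' object has no attribute 'close': close_spider tried to close a file object that is still None. self.fp is set to None in __init__ and only reassigned by the method that opens the file, so this means that method never ran. A failing open() call raises an exception rather than returning None, so a failed open would not produce this symptom.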

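To resolve this, you can try the following steps:

Check the file path and permissions: make sure you have permission to create files in the current directory and that the path is correct. You can also try opening the file with an absolute path to rule out a working-directory problem.
Temporarily simplify process_item: have it only return the item, without writing to the file. If the TypeError still appears, the file handling is not its cause, which narrows down where to look.
Custom serialization: if something does need the item as JSON, convert it to a serializable form first, such as a string or a plain dict (for example dict(item)), and use that conversion in process_item.
Check the open_spider and close_spider methods: make sure both are actually called, that the file object is created in open_spider, and that it is closed in close_spider. Scrapy finds these hooks by name, so the spelling must match exactly. In the posted output, 结束爬虫 is printed but 开始爬虫 is not, which suggests the opening method never ran. Adding debug output to both methods shows whether each one runs.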
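
The last check can be partly automated. The sketch below is only an illustration: find_suspect_hook_names and EXPECTED_HOOKS are hypothetical names, not part of Scrapy, and the 0.8 similarity cutoff is an arbitrary assumption. It scans a pipeline class for method names that are close to, but not exactly, one of the hook names.

```python
import difflib

# Hook names this sketch checks for (an assumption: the three used in the question).
EXPECTED_HOOKS = ("open_spider", "close_spider", "process_item")


def find_suspect_hook_names(pipeline_cls):
    """Return {method_name: likely_intended_hook} for near-miss method names."""
    suspects = {}
    for name, value in vars(pipeline_cls).items():
        # Skip dunder attributes, non-callables, and correctly spelled hooks.
        if not callable(value) or name.startswith("__") or name in EXPECTED_HOOKS:
            continue
        close = difflib.get_close_matches(name, EXPECTED_HOOKS, n=1, cutoff=0.8)
        if close:
            suspects[name] = close[0]
    return suspects


class QiubaiPipeline:
    def open_spdier(self, spider):  # spelling copied from the question on purpose
        pass

    def close_spider(self, spider):
        pass

    def process_item(self, item, spider):
        return item


print(find_suspect_hook_names(QiubaiPipeline))
```

Run against the class from the question, this flags open_spdier as a probable misspelling of open_spider.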

Below is a revised pipelines.py example that includes a custom serialization method:

class QiubaiPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print("开始爬虫")
        self.fp = open('./biedou.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        print("结束爬虫")
        if self.fp:
            self.fp.close()

    def process_item(self, item, spider):
        # custom serialization method
        serialized_item = self.serialize_item(item)
        self.fp.write(serialized_item + '\n')
        return item

    def serialize_item(self, item):
        # convert the item into a format that can be written out
        title = str(item['title'])
        content = str(item['content'])
        return f"{title}:{content}"
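
If JSON output is wanted rather than the title:content text format, the conversion step could look like the sketch below. FakeItem is a hypothetical stand-in so the example runs without Scrapy installed; it only mimics the relevant property of a scrapy.Item, namely being mapping-like without being a dict subclass. item_to_json_line is likewise an illustrative name.

```python
import json
from collections.abc import MutableMapping


class FakeItem(MutableMapping):
    """Hypothetical stand-in for a scrapy.Item: a mapping, but not a dict subclass."""

    def __init__(self, **fields):
        self._values = dict(fields)

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


def item_to_json_line(item):
    """Serialize a mapping-like item as one JSON line, keeping non-ASCII text readable."""
    return json.dumps(dict(item), ensure_ascii=False)


item = FakeItem(title="笑点太低了吧", content="笑点不同怎么做朋友。。")

try:
    json.dumps(item)  # the item itself is not a dict, so json rejects it
    direct_error = None
except TypeError as exc:
    direct_error = str(exc)

line = item_to_json_line(item)
print(direct_error)
print(line)
```

Passing the item directly reproduces the "not JSON serializable" message, while converting it to a plain dict first serializes cleanly.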

Hopefully this example helps you solve the problem. If you have further questions, feel free to ask.

1 Answer

禹

Posted on
2024-04-16, Zhejiang

The method name is misspelled: open_spdier -> open_spider
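
To see why one transposed letter produces exactly the AttributeError in the question, here is a minimal sketch. run_hooks is a hypothetical, simplified stand-in for a framework that looks hooks up by exact attribute name; it is not Scrapy's actual code, and io.StringIO replaces the real file so nothing is written to disk.

```python
import io


class TypoPipeline:
    def __init__(self):
        self.fp = None

    def open_spdier(self, spider):  # misspelled, so run_hooks never finds it
        self.fp = io.StringIO()

    def close_spider(self, spider):
        self.fp.close()


class FixedPipeline(TypoPipeline):
    def open_spider(self, spider):  # correct spelling, found by name
        self.fp = io.StringIO()


def run_hooks(pipeline, spider=None):
    """Call each hook only if the pipeline defines it under that exact name."""
    outcomes = []
    for hook in ("open_spider", "close_spider"):
        method = getattr(pipeline, hook, None)
        if method is None:
            outcomes.append(f"{hook}: not defined, skipped")
            continue
        try:
            method(spider)
            outcomes.append(f"{hook}: ok")
        except AttributeError as exc:
            outcomes.append(f"{hook}: {exc}")
    return outcomes


print(run_hooks(TypoPipeline()))
print(run_hooks(FixedPipeline()))
```

With the typo, the opening hook is silently skipped, fp stays None, and closing fails; with the corrected name both hooks run.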
