Scrapy image crawling occasionally raises an error

Question

Posted on
2019-06-10

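When crawling Baidu Baike images with Scrapy, I occasionally get the error below. I have not been able to find a solution, so any pointers would be appreciated.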

2019-06-10 11:48:31 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg> referred in <None>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://gss2.bdstatic.com/-fo3dSag_xI4khGkpoWK1HF6hhy/baike/pic/item/63d9f2d3572c11df9f377a05612762d0f703c236.jpg>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/files.py", line 401, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 101, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 105, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 125, in get_images
    image, buf = self.convert_image(orig_image)
  File "/usr/local/lib/python3.6/dist-packages/scrapy/pipelines/images.py", line 151, in convert_image
    image.save(buf, 'JPEG')
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 1899, in save
    self.load()
  File "/usr/lib/python3/dist-packages/PIL/ImageFile.py", line 228, in load
    "(%d bytes not processed)" % len(b))
OSError: image file is truncated (4 bytes not processed)
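
The final `OSError` is raised by Pillow when the image data ends before the decoder has finished, which suggests that the bytes received for this URL were incomplete even though the response status was 200. As a rough way to confirm that, the sketch below checks whether a downloaded body looks like a complete JPEG. `looks_like_complete_jpeg` is a hypothetical helper, not part of Scrapy or Pillow, and the check is only a heuristic: it assumes the file is a JPEG and that nothing follows the end-of-image marker.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker, first two bytes of a JPEG
JPEG_EOI = b"\xff\xd9"  # end-of-image marker, normally the last two bytes


def looks_like_complete_jpeg(data: bytes) -> bool:
    """Heuristic check that `data` starts and ends like a JPEG file.

    Hypothetical helper for illustration. A file with extra bytes after
    the EOI marker would be reported as incomplete (a false negative).
    """
    return data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI)


# Synthetic bytes standing in for a downloaded body (not a real image).
complete_body = JPEG_SOI + b"\x00" * 16 + JPEG_EOI
truncated_body = complete_body[:-4]  # simulate a download that was cut short

print(looks_like_complete_jpeg(complete_body))
print(looks_like_complete_jpeg(truncated_body))
```

In a pipeline, a check like this could be applied to `response.body` before the image is decoded, so that an incomplete download can be logged or retried. Pillow also has a module-level flag, `PIL.ImageFile.LOAD_TRUNCATED_IMAGES`, that makes it decode whatever data is present instead of raising; that suppresses the error but does not recover the missing bytes.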

The relevant pipeline code is below:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BaidubaikeImagePipeline(ImagesPipeline):

    # Keep the image's original file name
    def file_path(self, request, response=None, info=None):
        image_guid = request.url.split('/')[-1]
        # Fall back to the 'full' directory when no save path was passed in meta
        image_save_path = request.meta.get('image_save_path') or 'full'
        return u'{0}/{1}'.format(image_save_path, image_guid)

    def get_media_requests(self, item, info):
        if item is None:
            return
        for image_url in item.get('image_urls') or []:
            # Skip inline PNG data URIs (base64-encoded image data, not a downloadable URL)
            if 'data:image/png;' in image_url:
                continue
            yield scrapy.Request(image_url, meta={'image_save_path': item.get('image_save_path')}, dont_filter=True)

    def item_completed(self, results, item, info):
        # `ok` is True when the download succeeded
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            print('Item contains no images')
            # raise DropItem("Item contains no images")
        return item
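
The `get_media_requests` method above skips only URLs containing `data:image/png;`, so a data URI of another type (for example `data:image/jpeg;`) would still be turned into a request. A more general filter, sketched below with the standard library, is to accept only `http` and `https` URLs. `is_downloadable` and the sample URLs are made up for illustration and are not part of Scrapy.

```python
from urllib.parse import urlparse


def is_downloadable(url: str) -> bool:
    """Return True only for http(s) URLs; data: URIs and other schemes are rejected.

    Hypothetical helper for illustration. Protocol-relative URLs such as
    '//host/a.jpg' have no scheme and are rejected as well.
    """
    return urlparse(url).scheme in ("http", "https")


# Sample URLs invented for this example.
urls = [
    "https://example.com/a.jpg",
    "data:image/png;base64,iVBORw0KGgo=",
    "data:image/jpeg;base64,/9j/4AAQ",
]
downloadable = [u for u in urls if is_downloadable(u)]
print(downloadable)
```

Inside the pipeline, the same test could replace the substring check before each `scrapy.Request` is yielded. This only tidies up the URL filtering; it does not address the truncated-image error itself.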

python web crawler
