"ERROR: Spider error processing" when scraping Tencent Video with Scrapy

高家伟

System: Windows 10. Python: WinPython distribution, Python 3.5.3. Scrapy version: 1.4.0.
While using Scrapy to scrape the comments under a Tencent Video page, I get the following error:
ERROR: Spider error processing

D:\data\TX_video_comments>scrapy crawl video_comments
2017-08-16 12:24:08 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: TX_video_comments)
2017-08-16 12:24:08 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'TX_video_comments', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['TX_video_comments.spiders'], 'FEED_URI': 'D:/data/datas.csv', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36', 'NEWSPIDER_MODULE': 'TX_video_comments.spiders'}
2017-08-16 12:24:08 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: d
2017-08-16 12:24:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-08-16 12:24:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-16 12:24:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-16 12:24:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-16 12:24:09 [scrapy.core.engine] INFO: Spider opened
2017-08-16 12:24:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-16 12:24:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-16 12:24:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://v.qq.com/x/cover/1td1r8yyzoou3sa.html> (referer: None)
2017-08-16 12:24:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid=j0024j1z506> (referer: https://v.qq.com/x/cover/1td1r8yyzoou3sa.html)
2017-08-16 12:24:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid=j0024j1z506> (referer: https://v.qq.com/x/cover/1td1r8yyzoou3sa.html)
Traceback (most recent call last):
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid=j0024j1z506>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\utils\defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\core\spidermw.py", line 49, in process_spider_input
    return scrape_func(response, request, spider)
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\core\scraper.py", line 146, in call_spider
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
  File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\twisted\internet\defer.py", line 303, in addCallbacks
    assert callable(callback)
AssertionError
2017-08-16 12:24:10 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-16 12:24:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 988,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 70392,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 16, 4, 24, 10, 41789),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/AssertionError': 1,
 'start_time': datetime.datetime(2017, 8, 16, 4, 24, 9, 223426)}
2017-08-16 12:24:10 [scrapy.core.engine] INFO: Spider closed (finished)

The program files are as follows:
items.py

import scrapy
class TxVideoCommentsItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()

pipelines.py

class TxVideoCommentsPipeline(object):
    
    def process_item(self, item, spider):
        return item

settings.py

BOT_NAME = 'TX_video_comments'

SPIDER_MODULES = ['TX_video_comments.spiders']
NEWSPIDER_MODULE = 'TX_video_comments.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'
# COOKIE was copied from Chrome while visiting the page that fails; some values are altered here, thanks for understanding
COOKIE = {'pgv_pvi':'5754868736',
          'RK':'dNt+quQe',
          'ptui_loginuin':'57135436694@qq.com',
          'pt2gguin':'o05713454694',
          'ptcz':'f680eb808678973cdf7dc9e1e6d6e609fa4ef1a2a108d300d41d89852abfe562',
          'tvfe_boss_uuid':'6c1bc84cdc6cd22b',
          'mobileUV':'1_15dd4f29ca7_33170',
          'o_cookie':'524353694',
          'pgv_pvid':'7383567854'
          }
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Saving files
# save file to local
FEED_URI = u'D:/data/datas.csv'
FEED_FORMAT = 'csv'
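
A side note on the `Unknown feed storage scheme: d` line in the log: it appears to come from `FEED_URI`. Parsed as a URI, `D:/data/datas.csv` has the drive letter as its scheme. The sketch below uses only the standard library to show this, along with a `file://` form that would likely avoid it. That Scrapy picks the feed storage from the parsed scheme is an assumption based on the log message, not something confirmed in this thread.

```python
from urllib.parse import urlparse

feed_uri = 'D:/data/datas.csv'

# The drive letter before the colon is parsed as the URI scheme.
bad_scheme = urlparse(feed_uri).scheme

# Prefixing the path with file:/// gives an explicit "file" scheme.
# (Assumption: a file URI is what the feed exporter expects for local paths.)
file_uri = 'file:///' + feed_uri
good_scheme = urlparse(file_uri).scheme

print(bad_scheme)   # d
print(file_uri)     # file:///D:/data/datas.csv
print(good_scheme)  # file
```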

Spider file
video_comments.py

import scrapy
import re
import json
#import requests
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
#from scrapy.selector import Selector
from TX_video_comments.items import TxVideoCommentsItem
from scrapy.conf import settings


class VideoCommentsSpider(CrawlSpider):
    name = 'video_comments'
    cookie = settings['COOKIE']
    start_urls = ['https://v.qq.com/x/cover/1td1r8yyzoou3sa.html']
    comment_url = 'https://coral.qq.com/article/{}/comment?commentid=0&reqnum=20'
    ncgi_url = 'https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid='

    def parse(self, response):
        vid = re.search('&vid=(.{2,20})&ptag=', response.body.decode('utf-8'), re.S).group(1)
        ncgi_url = self.ncgi_url + vid
        yield Request(ncgi_url, callback='parse_id', cookies=self.cookie)

    def parse_id(self, response):
        id = re.search('"comment_id":"(\d+)",', response.body.decode('utf-8'), re.S).group(1)
        commentUrl = self.comment_url.format(id)
        yield Request(commentUrl, callback='parse_comment', cookies=self.cookie)

    def parse_comment(self, response):
        js_dict = json.loads(response.body.decode('utf-8'))
        js_data = js_dict['data']
        comments = js_data['commentid']
        for each in comments:
            item = TxVideoCommentsItem()
            item['content'] = each['content']
            item['name'] = each['userinfo']['nick']
            item['ctime'] = each['timeDifference']
            yield item

I hope someone can help me figure out what went wrong.

2 answers

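I went back to the official Scrapy documentation and found that I had written the callback argument incorrectly. I passed the method name as a string (callback='parse_id'), but Request expects a callable, which is why the traceback ends at assert callable(callback). The correct form is callback=self.parse_id, and likewise callback=self.parse_comment in parse_id.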
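
To make the failure concrete, here is a minimal sketch in plain Python with no Scrapy dependency. `FakeSpider` and `add_callback` are hypothetical stand-ins, not Scrapy or Twisted APIs; `add_callback` only mimics the `assert callable(callback)` check that appears in the traceback.

```python
class FakeSpider:
    """Hypothetical stand-in for the spider; not a real Scrapy class."""

    def parse_id(self, response):
        return "parsed " + response


def add_callback(callback):
    """Mimics the `assert callable(callback)` check from the traceback."""
    assert callable(callback)
    return callback


spider = FakeSpider()

# A method name given as a string is not callable, so the check fails.
try:
    add_callback('parse_id')
    string_accepted = True
except AssertionError:
    string_accepted = False

# A bound method is callable and passes the check.
bound = add_callback(spider.parse_id)
result = bound("response")

print(string_accepted)  # False
print(result)           # parsed response
```

In the spider from the question, this corresponds to passing `self.parse_id` and `self.parse_comment` in the two `Request(...)` calls instead of the quoted names.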

wilf

Friend, have you used Scrapy to crawl Tencent Video URLs?
