System: Windows 10. Python: WinPython distribution, Python 3.5.3. Scrapy version: 1.4.0.
While using Scrapy to scrape the comments under a video on Tencent Video, I get the following error:
ERROR: Spider error processing
D:\data\TX_video_comments>scrapy crawl video_comments
2017-08-16 12:24:08 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: TX_video_comments)
2017-08-16 12:24:08 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'TX_video_comments', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['TX_video_comments.spiders'], 'FEED_URI': 'D:/data/datas.csv', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36', 'NEWSPIDER_MODULE': 'TX_video_comments.spiders'}
2017-08-16 12:24:08 [scrapy.extensions.feedexport] ERROR: Unknown feed storage scheme: d
2017-08-16 12:24:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2017-08-16 12:24:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-16 12:24:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-16 12:24:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-16 12:24:09 [scrapy.core.engine] INFO: Spider opened
2017-08-16 12:24:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-16 12:24:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-16 12:24:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://v.qq.com/x/cover/1td1r8yyzoou3sa.html> (referer: None)
2017-08-16 12:24:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid=j0024j1z506> (referer: https://v.qq.com/x/cover/1td1r8yyzoou3sa.html)
2017-08-16 12:24:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid=j0024j1z506> (referer: https://v.qq.com/x/cover/1td1r8yyzoou3sa.html)
Traceback (most recent call last):
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid=j0024j1z506>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\utils\defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\core\spidermw.py", line 49, in process_spider_input
return scrape_func(response, request, spider)
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\scrapy\core\scraper.py", line 146, in call_spider
dfd.addCallbacks(request.callback or spider.parse, request.errback)
File "D:\ProgramData\WinPython-64bit-3.5.3.1Qt5\python-3.5.3.amd64\lib\site-packages\twisted\internet\defer.py", line 303, in addCallbacks
assert callable(callback)
AssertionError
2017-08-16 12:24:10 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-16 12:24:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 988,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 70392,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 16, 4, 24, 10, 41789),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/AssertionError': 1,
'start_time': datetime.datetime(2017, 8, 16, 4, 24, 9, 223426)}
2017-08-16 12:24:10 [scrapy.core.engine] INFO: Spider closed (finished)
The project files are as follows:
items.py
import scrapy


class TxVideoCommentsItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
pipelines.py
class TxVideoCommentsPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py
BOT_NAME = 'TX_video_comments'
SPIDER_MODULES = ['TX_video_comments.spiders']
NEWSPIDER_MODULE = 'TX_video_comments.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'
# COOKIE was copied from Chrome while visiting the page that fails; some values are altered here, thanks for understanding
COOKIE = {'pgv_pvi':'5754868736',
'RK':'dNt+quQe',
'ptui_loginuin':'57135436694@qq.com',
'pt2gguin':'o05713454694',
'ptcz':'f680eb808678973cdf7dc9e1e6d6e609fa4ef1a2a108d300d41d89852abfe562',
'tvfe_boss_uuid':'6c1bc84cdc6cd22b',
'mobileUV':'1_15dd4f29ca7_33170',
'o_cookie':'524353694',
'pgv_pvid':'7383567854'
}
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Saving files
# save file to local
FEED_URI = u'D:/data/datas.csv'
FEED_FORMAT = 'csv'
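A side note on the feed settings above: the log also reports "Unknown feed storage scheme: d", and the feed export extension is missing from the enabled extensions list. A likely cause is that the Windows drive letter in FEED_URI is read as a URI scheme. The sketch below uses only the standard library to show that parse; the helper windows_path_to_file_uri is a hypothetical name introduced for illustration, and whether the resulting file:/// URI resolves the error on this exact setup is an assumption that should be checked by running the crawl.

```python
from urllib.parse import quote, urlparse


def windows_path_to_file_uri(path):
    """Hypothetical helper: turn 'D:/data/x.csv' into 'file:///D:/data/x.csv'."""
    normalized = path.replace('\\', '/')
    # Keep '/' and ':' readable, percent-encode anything unsafe (e.g. spaces).
    return 'file:///' + quote(normalized, safe='/:')


original_uri = 'D:/data/datas.csv'

# The drive letter is parsed as the scheme, matching the 'd' in the log message.
original_scheme = urlparse(original_uri).scheme

# Assumed fix: give the path an explicit file scheme.
FEED_URI = windows_path_to_file_uri(original_uri)
fixed_scheme = urlparse(FEED_URI).scheme

print(original_scheme, FEED_URI, fixed_scheme)
```

If this assumption holds, setting FEED_URI to the printed file:/// value in settings.py should let the CSV export load again.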
Spider file
video_comments.py
import scrapy
import re
import json
#import requests
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
#from scrapy.selector import Selector
from TX_video_comments.items import TxVideoCommentsItem
from scrapy.conf import settings


class VideoCommentsSpider(CrawlSpider):
    name = 'video_comments'
    cookie = settings['COOKIE']
    start_urls = ['https://v.qq.com/x/cover/1td1r8yyzoou3sa.html']
    comment_url = 'https://coral.qq.com/article/{}/comment?commentid=0&reqnum=20'
    ncgi_url = 'https://ncgi.video.qq.com/fcgi-bin/video_comment_id?otype=json&op=3&vid='

    def parse(self, response):
        vid = re.search('&vid=(.{2,20})&ptag=', response.body.decode('utf-8'), re.S).group(1)
        ncgi_url = self.ncgi_url + vid
        yield Request(ncgi_url, callback='parse_id', cookies=self.cookie)

    def parse_id(self, response):
        id = re.search('"comment_id":"(\d+)",', response.body.decode('utf-8'), re.S).group(1)
        commentUrl = self.comment_url.format(id)
        yield Request(commentUrl, callback='parse_comment', cookies=self.cookie)

    def parse_comment(self, response):
        js_dict = json.loads(response.body.decode('utf-8'))
        js_data = js_dict['data']
        comments = js_data['commentid']
        for each in comments:
            item = TxVideoCommentsItem()
            item['content'] = each['content']
            item['name'] = each['userinfo']['nick']
            item['ctime'] = each['timeDifference']
            yield item
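I hope someone can help me see what went wrong.
Update: I went back to the official Scrapy documentation and found that I had written the callback argument incorrectly. Request expects a callable, not a string, so the correct form is callback=self.parse_id (and likewise callback=self.parse_comment). Passing the string 'parse_id' is what trips the assert callable(callback) check in the traceback and raises the AssertionError.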
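To make the cause concrete, here is a minimal sketch in plain Python, with no Scrapy or Twisted needed. FakeSpider and add_callback are hypothetical stand-ins: add_callback only mimics the assert callable(callback) line visible in the traceback and is not Twisted's actual implementation.

```python
class FakeSpider:
    """Stand-in for the spider class; not a real Scrapy class."""

    def parse_id(self, response):
        return 'parse_id got ' + response


def add_callback(callback):
    """Hypothetical mimic of the check seen in the traceback."""
    assert callable(callback)
    return callback


spider = FakeSpider()

# A string only names the method; it cannot be called.
string_ok = callable('parse_id')
# A bound method such as spider.parse_id can be called.
method_ok = callable(spider.parse_id)

try:
    add_callback('parse_id')
    string_accepted = True
except AssertionError:
    # Same failure mode as the AssertionError in the crawl log.
    string_accepted = False

result = add_callback(spider.parse_id)('response')
print(string_ok, method_ok, string_accepted, result)
```

In the spider itself, the two requests would then read yield Request(ncgi_url, callback=self.parse_id, cookies=self.cookie) and yield Request(commentUrl, callback=self.parse_comment, cookies=self.cookie). One possible source of the confusion is that the Rule objects used with CrawlSpider do accept a callback given as a string, while Request does not in this Scrapy version.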