Help: my Scrapy crawl returns no data. I've been debugging for ages and still can't find the problem.
Goal: crawl the basic information of the major scenic spots of one city on the Xinxin travel site (cncn.com).
Here are my spider and item code.
spider:
from scrapy import Request
from scrapy.spiders import Spider
from XXtourism.items import XxtourismItem


class TourismSpider(Spider):
    name = "tourism"

    # Initial request
    def start_requests(self):
        url = "https://tianjin.cncn.com/jingdian/"
        yield Request(url, dont_filter=True)

    # Parse the list page
    def parse(self, response, *args, **kwargs):
        spots_list = response.xpath('//div[@class="city_spots_list"]/ul/li')
        for i in spots_list:
            try:
                # Spot name
                name = i.xpath('./a/div[@class="title"]/b/text()').extract_first()
                # Spot introduction
                introduce = i.xpath('./div[@class="text_con"]/p/text()').extract_first()
                item = XxtourismItem()
                item["name"] = name
                item["introduce"] = introduce
                # Generate the detail-page request
                url = i.xpath("./a/@href").extract_first()
                yield Request(url, meta={"item": item}, callback=self.pif_parse, dont_filter=True)
            except:
                pass

    def pif_parse(self, response):
        try:
            address = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
            time = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
            ticket = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
            response.find_element_by_xpath("//div[@class='type']/dl[3]//dd/a/text()")
            type = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first()
            if type:
                type = type
            else:
                type = ' '
            item = response.meta["item"]
            item["address"] = address
            item["time"] = time
            item["ticket"] = ticket
            item["type"] = type
            yield item
            # url = response.xpath("//div[@class='spots_info']/div[@class='type']/div[@class='introduce']/dd/a/@href").extract_first()
            # yield Request(url, meta={"item": item}, callback=self.fin_parse)
        except:
            type = ' '

    # def fin_parse(self, response):
    #     try:
    #         traffic = response.xpath("//div[@class='type']/div[@class='top']/div[3]/text()").extract()
    #         item = response.meta["item"]
    #         item["traffic"] = traffic
    #         yield item
    #     except:
    #         pass
item:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class XxtourismItem(scrapy.Item):
    # define the fields for your item here like:
    # Spot name
    name = scrapy.Field()
    # Spot address
    address = scrapy.Field()
    # Spot introduction
    introduce = scrapy.Field()
    # Spot type
    type = scrapy.Field()
    # Opening hours
    time = scrapy.Field()
    # Ticket overview
    ticket = scrapy.Field()
    # Transport overview
    traffic = scrapy.Field()
Here is the run log:
PS D:\Python\XXtourism\XXtourism> scrapy crawl tourism -o tourism.csv
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: XXtourism)
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23
.2.0 (OpenSSL 3.0.12 24 Oct 2023), cryptography 41.0.7, Platform Windows-10-10.0.19045-SP0
2023-12-20 18:16:56 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'XXtourism',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 3,
'NEWSPIDER_MODULE': 'XXtourism.spiders',
'SPIDER_MODULES': ['XXtourism.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
2023-12-20 18:16:56 [py.warnings] WARNING: D:\Ana\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change
in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2023-12-20 18:16:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-12-20 18:16:56 [scrapy.extensions.telnet] INFO: Telnet Password: c388126d14d4b80a
2023-12-20 18:16:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-12-20 18:16:57 [scrapy.core.engine] INFO: Spider opened
2023-12-20 18:16:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:16:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-20 18:16:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/> (referer: None)
2023-12-20 18:17:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjindaxue/> from <GET http://Tianjin.cncn.com/jingdian/tianjindaxue/>
2023-12-20 18:17:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/shuishanggongyuan/> from <GET http://Tianjin.cncn.com/jingdian/shuishanggongyuan/>
2023-12-20 18:17:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/nankaidaxue/> from <GET http://Tianjin.cncn.com/jingdian/nankaidaxue/>
2023-12-20 18:17:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/> from <GET http://Tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/>
2023-12-20 18:17:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/shijizhong/> from <GET http://Tianjin.cncn.com/jingdian/shijizhong/>
2023-12-20 18:17:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/jingyuan/> from <GET http://Tianjin.cncn.com/jingdian/jingyuan/>
2023-12-20 18:17:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/dagukoupaotai/> from <GET http://Tianjin.cncn.com/jingdian/dagukoupaotai/>
2023-12-20 18:17:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/> from <GET http://Tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/>
2023-12-20 18:17:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/huoyuanjiaguju/> from <GET http://Tianjin.cncn.com/jingdian/huoyuanjiaguju/>
2023-12-20 18:17:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/> from <GET http://Tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/>
2023-12-20 18:17:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/xikaijiaotang/> from <GET http://Tianjin.cncn.com/jingdian/xikaijiaotang/>
2023-12-20 18:17:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinziranbowuguan/> from <GET http://Tianjin.cncn.com/jingdian/tianjinziranbowuguan/>
2023-12-20 18:17:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/dongwuyuan/> from <GET http://Tianjin.cncn.com/jingdian/dongwuyuan/>
2023-12-20 18:17:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinhuanlegu/> from <GET http://Tianjin.cncn.com/jingdian/tianjinhuanlegu/>
2023-12-20 18:17:57 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:17:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/cifangzi/> from <GET http://Tianjin.cncn.com/jingdian/cifangzi/>
2023-12-20 18:18:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/haiheyishifengqingqu/> from <GET http://Tianjin.cncn.com/jingdian/haiheyishifengqingqu/>
2023-12-20 18:18:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjindaxue/> (referer: None)
2023-12-20 18:18:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/shuishanggongyuan/> (referer: None)
2023-12-20 18:18:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/nankaidaxue/> (referer: None)
2023-12-20 18:18:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/> (referer: None)
2023-12-20 18:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/shijizhong/> (referer: None)
2023-12-20 18:18:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/jingyuan/> (referer: None)
2023-12-20 18:18:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/dagukoupaotai/> (referer: None)
2023-12-20 18:18:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/> (referer: None)
2023-12-20 18:18:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/huoyuanjiaguju/> (referer: None)
2023-12-20 18:18:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/> (referer: None)
2023-12-20 18:18:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/xikaijiaotang/> (referer: None)
2023-12-20 18:18:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinziranbowuguan/> (referer: None)
2023-12-20 18:18:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/dongwuyuan/> (referer: None)
2023-12-20 18:18:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinhuanlegu/> (referer: None)
2023-12-20 18:18:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/cifangzi/> (referer: None)
2023-12-20 18:18:57 [scrapy.extensions.logstats] INFO: Crawled 16 pages (at 15 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:19:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/haiheyishifengqingqu/> (referer: None)
2023-12-20 18:19:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinguwenhuajie/> from <GET http://Tianjin.cncn.com/jingdian/tianjinguwenhuajie/>
2023-12-20 18:19:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/wudadao/> from <GET http://Tianjin.cncn.com/jingdian/wudadao/>
2023-12-20 18:19:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/> from <GET http://Tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/>
2023-12-20 18:19:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinguwenhuajie/> (referer: None)
2023-12-20 18:19:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/wudadao/> (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/> (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-20 18:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12810,
'downloader/request_count': 39,
'downloader/request_method_count/GET': 39,
'downloader/response_bytes': 151805,
'downloader/response_count': 39,
'downloader/response_status_count/200': 20,
'downloader/response_status_count/301': 19,
'elapsed_time_seconds': 142.80337,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 12, 20, 10, 19, 20, 220207),
'httpcompression/response_bytes': 458357,
'httpcompression/response_count': 20,
'log_count/DEBUG': 40,
'log_count/INFO': 12,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 20,
'scheduler/dequeued': 39,
'scheduler/dequeued/memory': 39,
'scheduler/enqueued': 39,
'scheduler/enqueued/memory': 39,
'start_time': datetime.datetime(2023, 12, 20, 10, 16, 57, 416837)}
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Spider closed (finished)
I followed the teacher's lesson step by step, and added a few extra fields myself (opening each detail page to scrape them).
I never get any data back. I also tried many fixes for the 301 redirects, but nothing worked. Please help me out!
So frustrating!

There's a bug in your code: execution falls into the except block, and since there's no yield item there, nothing ever comes out. You don't actually need that if/else check on type either.
It works after fixing that.
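Concretely, the stray `response.find_element_by_xpath(...)` line in `pif_parse` is a Selenium method that does not exist on a Scrapy response, so it raises AttributeError, the bare `except:` swallows it, and the generator exits without yielding — hence "scraped 0 items" despite all the 200 responses. A sketch of a corrected `pif_parse` (drop it into the spider class; XPaths are copied from the question and untested against the live site):

```python
def pif_parse(self, response):
    # Corrected detail-page callback: the Selenium-style call is removed,
    # and extract_first(default=...) replaces both the try/except and the
    # if/else on `type` -- missing fields simply come back as the default.
    item = response.meta["item"]
    item["address"] = response.xpath(
        "//div[@class='type']/dl[1]/dd/text()").extract_first()
    item["time"] = response.xpath(
        "//div[@class='type']/dl[4]/dd/p/text()").extract_first()
    item["ticket"] = response.xpath(
        "//div[@class='type']/dl[5]/dd/p/text()").extract_first()
    item["type"] = response.xpath(
        "//div[@class='type']/dl[3]//dd/a/text()").extract_first(default=' ')
    yield item
```

With no try/except, any remaining error will surface in the log instead of being silently swallowed, which is what you want while debugging.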
Running this under Scrapy is a bit of a hassle; use the script below instead.
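(The reply's original script was not included in the thread. As a hypothetical stand-in, here is a minimal requests + lxml sketch of the list-page step — the XPaths are copied from the question's spider and may need adjusting; `make_links_absolute` also sidesteps the `http://Tianjin.cncn.com` 301 redirects seen in the log by resolving hrefs against the https base URL.)

```python
# Hypothetical non-Scrapy alternative using requests + lxml (both assumed
# installed). Only the list page is parsed here; detail pages would follow
# the same pattern.
import requests
from lxml import html

LIST_URL = "https://tianjin.cncn.com/jingdian/"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def parse_list(page_html, base_url=LIST_URL):
    """Extract name / introduction / detail URL for each spot on the list page."""
    tree = html.fromstring(page_html)
    tree.make_links_absolute(base_url)  # hrefs become absolute https URLs
    spots = []
    for li in tree.xpath('//div[@class="city_spots_list"]/ul/li'):
        spots.append({
            "name": (li.xpath('./a/div[@class="title"]/b/text()') or [None])[0],
            "introduce": (li.xpath('./div[@class="text_con"]/p/text()') or [None])[0],
            "url": (li.xpath('./a/@href') or [None])[0],
        })
    return spots

def crawl():
    resp = requests.get(LIST_URL, headers=HEADERS, timeout=10)
    resp.encoding = resp.apparent_encoding  # the site may not declare UTF-8
    return parse_list(resp.text)
```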