Scrapy spider never gets any data — how do I fix this?

Help needed: my Scrapy crawl scrapes nothing. I've been debugging for ages and can't find the problem.

Goal: scrape the basic information of the major attractions of one city from the Xinxin Travel site (cncn.com).

Here are my spider and item code.

spider:

from scrapy import Request
from scrapy.spiders import Spider
from XXtourism.items import XxtourismItem


class TourismSpider(Spider):
    name = "tourism"

    # Initial request
    def start_requests(self):
        url = "https://tianjin.cncn.com/jingdian/"
        yield Request(url, dont_filter=True)



    # Parse callback
    def parse(self, response, *args, **kwargs):
        spots_list = response.xpath('//div[@class="city_spots_list"]/ul/li')
        for i in spots_list:
            try:
                # Attraction name
                name = i.xpath('./a/div[@class="title"]/b/text()').extract_first()
                # Attraction summary
                introduce = i.xpath('./div[@class="text_con"]/p/text()').extract_first()

                item = XxtourismItem()
                item["name"] = name
                item["introduce"] = introduce
                # Generate the detail-page request
                url = i.xpath("./a/@href").extract_first()
                yield Request(url, meta={"item": item}, callback=self.pif_parse, dont_filter=True)
            except:
                pass

    def pif_parse(self,response):
        try:
            address = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
            time = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
            ticket = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
            response.find_element_by_xpath("//div[@class='type']/dl[3]//dd/a/text()")
            type = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first()
            if type:
                type = type
            else:
                type = ' '

            item = response.meta["item"]
            item["address"] = address
            item["time"] = time
            item["ticket"] = ticket
            item["type"] = type
            yield item

            # url = response.xpath("//div[@class='spots_info']/div[@class='type']/div[@class='introduce']/dd/a/@href").extract_first()
            # yield Request(url,meta={"item":item},callback=self.fin_parse)
        except:
            type = ' '


    # def fin_parse(self,response):
    #     try:
    #         traffic = response.xpath("//div[@class='type']/div[@class='top']/div[3]/text()").extract()
    #
    #         item = response.meta["item"]
    #         item["traffic"] = traffic
    #
    #         yield item
    #
    #     except:
    #         pass

item:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XxtourismItem(scrapy.Item):
    # define the fields for your item here like:
    # Attraction name
    name = scrapy.Field()
    # Attraction address
    address = scrapy.Field()
    # Attraction summary
    introduce = scrapy.Field()
    # Attraction type
    type = scrapy.Field()
    # Opening hours
    time = scrapy.Field()
    # Ticket overview
    ticket = scrapy.Field()
    # Transport overview
    traffic = scrapy.Field()

Here is the run log:

PS D:\Python\XXtourism\XXtourism> scrapy crawl tourism -o tourism.csv
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: XXtourism)
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.0.12 24 Oct 2023), cryptography 41.0.7, Platform Windows-10-10.0.19045-SP0
2023-12-20 18:16:56 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'XXtourism',
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 3,
 'NEWSPIDER_MODULE': 'XXtourism.spiders',
 'SPIDER_MODULES': ['XXtourism.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
2023-12-20 18:16:56 [py.warnings] WARNING: D:\Ana\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change
 in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-12-20 18:16:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-12-20 18:16:56 [scrapy.extensions.telnet] INFO: Telnet Password: c388126d14d4b80a
2023-12-20 18:16:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-12-20 18:16:57 [scrapy.core.engine] INFO: Spider opened
2023-12-20 18:16:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:16:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-20 18:16:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/> (referer: None)
2023-12-20 18:17:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjindaxue/> from <GET http://Tianjin.cncn.com/jingdian/tianjindaxue/>
2023-12-20 18:17:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/shuishanggongyuan/> from <GET http://Tianjin.cncn.com/jingdian/shuishanggongyuan/>
2023-12-20 18:17:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/nankaidaxue/> from <GET http://Tianjin.cncn.com/jingdian/nankaidaxue/>
2023-12-20 18:17:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/> from <GET http://Tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/>
2023-12-20 18:17:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/shijizhong/> from <GET http://Tianjin.cncn.com/jingdian/shijizhong/>
2023-12-20 18:17:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/jingyuan/> from <GET http://Tianjin.cncn.com/jingdian/jingyuan/>
2023-12-20 18:17:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/dagukoupaotai/> from <GET http://Tianjin.cncn.com/jingdian/dagukoupaotai/>
2023-12-20 18:17:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/> from <GET http://Tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/>
2023-12-20 18:17:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/huoyuanjiaguju/> from <GET http://Tianjin.cncn.com/jingdian/huoyuanjiaguju/>
2023-12-20 18:17:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/> from <GET http://Tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/>
2023-12-20 18:17:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/xikaijiaotang/> from <GET http://Tianjin.cncn.com/jingdian/xikaijiaotang/>
2023-12-20 18:17:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinziranbowuguan/> from <GET http://Tianjin.cncn.com/jingdian/tianjinziranbowuguan/>
2023-12-20 18:17:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/dongwuyuan/> from <GET http://Tianjin.cncn.com/jingdian/dongwuyuan/>
2023-12-20 18:17:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinhuanlegu/> from <GET http://Tianjin.cncn.com/jingdian/tianjinhuanlegu/>
2023-12-20 18:17:57 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:17:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/cifangzi/> from <GET http://Tianjin.cncn.com/jingdian/cifangzi/>
2023-12-20 18:18:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/haiheyishifengqingqu/> from <GET http://Tianjin.cncn.com/jingdian/haiheyishifengqingqu/>
2023-12-20 18:18:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjindaxue/> (referer: None)
2023-12-20 18:18:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/shuishanggongyuan/> (referer: None)
2023-12-20 18:18:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/nankaidaxue/> (referer: None)
2023-12-20 18:18:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tanggubinhaishijiguangchang/> (referer: None)
2023-12-20 18:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/shijizhong/> (referer: None)
2023-12-20 18:18:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/jingyuan/> (referer: None)
2023-12-20 18:18:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/dagukoupaotai/> (referer: None)
2023-12-20 18:18:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/binhaihangmuzhutigongyuan/> (referer: None)
2023-12-20 18:18:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/huoyuanjiaguju/> (referer: None)
2023-12-20 18:18:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinhaichangjidihaiyangshijie/> (referer: None)
2023-12-20 18:18:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/xikaijiaotang/> (referer: None)
2023-12-20 18:18:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinziranbowuguan/> (referer: None)
2023-12-20 18:18:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/dongwuyuan/> (referer: None)
2023-12-20 18:18:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinhuanlegu/> (referer: None)
2023-12-20 18:18:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/cifangzi/> (referer: None)
2023-12-20 18:18:57 [scrapy.extensions.logstats] INFO: Crawled 16 pages (at 15 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:19:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/haiheyishifengqingqu/> (referer: None)
2023-12-20 18:19:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinguwenhuajie/> from <GET http://Tianjin.cncn.com/jingdian/tianjinguwenhuajie/>
2023-12-20 18:19:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/wudadao/> from <GET http://Tianjin.cncn.com/jingdian/wudadao/>
2023-12-20 18:19:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/> from <GET http://Tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/>
2023-12-20 18:19:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinguwenhuajie/> (referer: None)
2023-12-20 18:19:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/wudadao/> (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tianjin.cncn.com/jingdian/tianjinzhiyanmotianlun/> (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-20 18:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12810,
 'downloader/request_count': 39,
 'downloader/request_method_count/GET': 39,
 'downloader/response_bytes': 151805,
 'downloader/response_count': 39,
 'downloader/response_status_count/200': 20,
 'downloader/response_status_count/301': 19,
 'elapsed_time_seconds': 142.80337,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 12, 20, 10, 19, 20, 220207),
 'httpcompression/response_bytes': 458357,
 'httpcompression/response_count': 20,
 'log_count/DEBUG': 40,
 'log_count/INFO': 12,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 20,
 'scheduler/dequeued': 39,
 'scheduler/dequeued/memory': 39,
 'scheduler/enqueued': 39,
 'scheduler/enqueued/memory': 39,
 'start_time': datetime.datetime(2023, 12, 20, 10, 16, 57, 416837)}
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Spider closed (finished)

I followed my teacher's tutorial step by step, and added a few extra fields of my own (opening each detail page and scraping it).
I never get any data back. I've also tried many fixes for those 301 redirects, but nothing worked. Please help!

1 Answer

So frustrating! Your code has a bug: `response.find_element_by_xpath(...)` in `pif_parse` is a Selenium method that doesn't exist on a Scrapy `Response`, so it raises `AttributeError`. Execution then falls into the bare `except`, which never yields the item, so the run ends with nothing scraped. You don't actually need that call (or the `if type:` check) at all.
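The mechanism is easy to reproduce with a few lines of plain Python (hypothetical objects, no Scrapy needed): any exception raised before the `yield` inside a bare `try/except: pass` silently drops every item.

```python
# Minimal sketch of the failure mode: calling a method that does not exist
# (like the Selenium-only find_element_by_xpath on a Scrapy Response) raises
# AttributeError, the bare except swallows it, and the yield is never reached.
def parse_like(responses):
    for resp in responses:
        try:
            resp.find_element_by_xpath("//div")  # AttributeError on plain objects
            yield resp
        except Exception:
            pass  # error hidden; item silently dropped

results = list(parse_like([object(), object()]))
print(results)  # prints []
```

This is also why the log shows 20 pages crawled but 0 items scraped: every callback ran, and every one of them died quietly inside the `except`.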

After removing that line, the items come through.
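A sketch of the corrected `pif_parse`, reusing the question's own XPaths (I haven't re-verified them against the live page):

```python
def pif_parse(self, response):
    # Same XPaths as the question; extract_first() returns None on no match.
    item = response.meta["item"]
    item["address"] = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
    item["time"] = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
    item["ticket"] = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
    # `or ' '` replaces the if/else default; the Selenium call is gone,
    # and so is the bare except that was hiding the AttributeError.
    item["type"] = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first() or ' '
    yield item
```

Dropping the bare `except` entirely means any future selector mistake shows up as a traceback in the log instead of vanishing.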


Scrapy is a bit heavyweight to iterate on; for quick checks, try the standalone snippet below (its printed output follows the code).

import requests as r
from lxml.etree import HTML


def main():
    resp = r.get('https://tianjin.cncn.com/jingdian/')
    # The page is GB2312-encoded, so decode explicitly
    content = resp.content.decode('gb2312')
    html = HTML(content)
    nodes = html.xpath('//div[@class="city_spots_list"]/ul/li')
    for n in nodes:
        title = n.xpath('./a/div[@class="title"]//b//text()')
        print(title)


if __name__ == '__main__':
    main()

['天津之眼摩天轮']
['五大道']
['天津古文化街']
['海河意式风情区']
['瓷房子']
['天津欢乐谷']
['动物园']
['天津自然博物馆']
['西开教堂']
['海昌极地海洋世界']
['霍元甲故居']
['天津航母主题公园']
['大沽口炮台']
['静园']
['世纪钟']
['塘沽滨海世纪广场']
['南开大学']
['水上公园']
['天津大学']
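If the detail-page fields still come back empty after the fix, it helps to sanity-check the indexing logic itself. The stdlib toy below (invented HTML, not the real page markup) shows that `dl[1]`/`dl[3]` in those XPaths are 1-based positional steps, the same convention Scrapy's XPath engine uses:

```python
import xml.etree.ElementTree as ET

# Invented miniature of a detail page's <div class="type"> block, only to
# demonstrate 1-based positional steps; the live page's markup may differ.
snippet = """
<div class="type">
  <dl><dd>Nankai District, Tianjin</dd></dl>
  <dl><dd>placeholder</dd></dl>
  <dl><dd><a>Park</a></dd></dl>
  <dl><dd><p>8:00-18:00</p></dd></dl>
  <dl><dd><p>Free</p></dd></dl>
</div>
"""

root = ET.fromstring(snippet)
address = root.find("./dl[1]/dd").text      # first <dl>
spot_type = root.find("./dl[3]/dd/a").text  # third <dl>
opening = root.find("./dl[4]/dd/p").text    # fourth <dl>
print(address, spot_type, opening)
```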