Scrapy logs DEBUG 400 for some pages and scrapes nothing from them; pages logged with DEBUG 200 are scraped normally.

Question

Posted on
2017-07-16

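I am using Scrapy to crawl every Volvo model listed on Autohome (汽车之家), collecting the model, series, user rating, and vehicle class. All of these are simple fields available on the individual spec page (http://www.autohome.com.cn/sp...). The spider starts from the Volvo brand page and follows two levels of links to reach each spec page.

After running it, however, I only get part of the data. Entries logged as DEBUG: Scraped from <200 ... succeeded, while entries logged as DEBUG: Scraped from <400 ... failed. What do 200 and 400 mean here? If you have a relevant link, please share it. The DEBUG output is as follows: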

2017-07-16 12:57:57 [scrapy] DEBUG: Scraped from <200 http://www.autohome.com.cn/spec/25553/>
{'details': '2017款 2.0T Polestar', 'name': '沃尔沃S60', 'size': '中型车'}
2017-07-16 12:57:57 [scrapy] DEBUG: Crawled (400) <GET http://www.autohome.com.cn/spec/19583/#pvareaid=2042128> (referer: http://car.autohome.com.cn/price/series-404.html)
2017-07-16 12:57:57 [scrapy] DEBUG: Scraped from <400 http://www.autohome.com.cn/spec/27225/>
{}
2017-07-16 12:57:57 [scrapy] DEBUG: Scraped from <200 http://www.autohome.com.cn/spec/27221/>
{'details': '2017款 2.0T T5 智逸版 5座',
 'name': '沃尔沃XC90',
 'score': '4.57分',
 'size': '中大型SUV'}
2017-07-16 12:57:57 [scrapy] DEBUG: Scraped from <200 http://www.autohome.com.cn/spec/27227/>
{'details': '2017款 2.0T T4 智尚版',
 'name': '沃尔沃V60',
 'score': '4.46分',
 'size': '中型车'}
2017-07-16 12:57:57 [scrapy] DEBUG: Scraped from <400 http://www.autohome.com.cn/spec/19581/>
{}
2017-07-16 12:57:57 [scrapy] DEBUG: Scraped from <400 http://www.autohome.com.cn/spec/19583/>
{}

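When I open the failed pages with scrapy shell, I can extract the information normally, which is what confuses me. How can I get all the information I want to scrape?

The spider code is as follows: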

import scrapy
from autohome.items import AutohomeItem
import re


class AutohomeSpider(scrapy.spiders.Spider):
    
    name = 'autohome'
    # allowed_domains takes bare domain names, not URLs; this one covers the car. and www. subdomains
    allowed_domains = ['autohome.com.cn']
    start_urls = [
        'http://car.autohome.com.cn/price/brand-70.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="main-title"]/a/@href').extract():
            url_path = 'http://car.autohome.com.cn' + href
            yield scrapy.Request(url_path, callback=self.car_list, dont_filter=True)

    def car_list(self, response):
        for href in response.xpath('//div[@class="interval01-list-cars-infor"]/p/a/@href').extract():
            # This step returns two kinds of href; one of them is the "惠民补贴" (subsidy) link and must be filtered out
            if href[-6] != '6':
                yield scrapy.Request(href, callback=self.car_infor_collect, dont_filter=True)

    def car_infor_collect(self, response):
        item = AutohomeItem()

        data_0 = response.xpath('//div[@class="breadnav fn-left"]/a/text()').extract()
        if data_0:
            item['size'] = data_0[1]
            item['name'] = data_0[2]
            item['details'] = data_0[3]

        data_1 = response.xpath('//a[@class="fn-fontsize14 font-bold"]/text()').extract()
        if data_1:
            item['score'] = data_1[0]

        yield item

scrapy

python

1 Answer

屎壳螂

Posted on
2017-07-24

The site may enforce access limits, for example allowing a single IP to request only 10 links per hour. You could write a simple script to test this.
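A test script along those lines might look like the sketch below. It is only an illustration, not a confirmed diagnosis: it requests the same spec pages repeatedly, once with no pause and once with a pause, and counts the status codes that come back. The helper names (`fetch_status`, `probe`, `summarize`, `main`), the User-Agent header, the repeat count, and the 2-second delay are all assumptions chosen for the example; the two URLs are taken from the log in the question.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=10):
    """Send one GET request and return the HTTP status code."""
    # The User-Agent value is an assumption; some sites reject the urllib default.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is what we want to record.
        return error.code


def probe(urls, fetch=fetch_status, delay=0.0):
    """Request each URL in order and return a list of (url, status) pairs."""
    results = []
    for url in urls:
        results.append((url, fetch(url)))
        if delay:
            time.sleep(delay)  # pause between requests to see if throttling matters
    return results


def summarize(results):
    """Count how many times each status code occurred."""
    return Counter(status for _, status in results)


def main():
    # Spec pages from the question's log: one that returned 200, one that returned 400.
    urls = [
        "http://www.autohome.com.cn/spec/25553/",
        "http://www.autohome.com.cn/spec/19583/",
    ] * 10  # the repeat count is arbitrary
    for delay in (0.0, 2.0):
        print("delay", delay, dict(summarize(probe(urls, delay=delay))))
```

Call `main()` to run the probe. If the share of 400 responses drops noticeably when a delay is used, rate limiting is a plausible cause, and slowing the spider down with Scrapy's `DOWNLOAD_DELAY` or `AUTOTHROTTLE_ENABLED` settings would be worth trying. If the same URLs return 400 regardless of the delay, the cause is probably something else, such as request headers.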
