I've only just started learning Scrapy and know very little. I recently tried scraping Suning and the result was a mess, errors everywhere. Luckily plenty of people have hit these pitfalls before me and most problems were solved by searching, but there's one KeyError coming from the pipeline file that has kept me up for several nights with no fix, so I have to ask the experts here. Code is below.
settings:
ITEM_PIPELINES = {
    'weibo.pipelines.SuningPipeline': 200,
}
items:
class SuningItem(scrapy.Item):
    name = scrapy.Field()            # product name
    price = scrapy.Field()           # product price
    evaluate = scrapy.Field()        # number of positive reviews
    favorable_rate = scrapy.Field()  # positive review rate / count
    brand = scrapy.Field()           # brand
    style = scrapy.Field()           # style
    product_url = scrapy.Field()     # product URL
    storename = scrapy.Field()       # store name
There is another class with a different name in this items file as well; that shouldn't matter, right?
spider:
import scrapy
from weibo.items import SuningItem
from scrapy.http import Request
import re


class SuninSpider(scrapy.Spider):
    name = 'sunin'
    allowed_domains = ['suning.com']
    start_urls = ['https://search.suning.com/%E5%A5%B3%E9%9E%8B/']
    url = 'https://search.suning.com/%E5%A5%B3%E9%9E%8B/'

    # build the Request objects for each results page
    def start_requests(self):
        for i in range(0, 99):
            url = self.url + '&cp=' + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        suningitem = SuningItem()
        suningitem['product_url'] = response.xpath('//p[@class="sell-point"]/a/@href').extract()
        for i in range(0, len(suningitem['product_url'])):
            suningitem['product_url'][i] = "http:" + str(suningitem['product_url'][i])
        product_url = suningitem['product_url']
        for i in product_url:
            if 'http:' in i:
                yield Request(i, meta={'key': suningitem}, callback=self.product_parse)
                # return a Request for every collected product URL and pass the item along via meta
                # (a small standalone sketch of this pattern follows the spider code)
            else:
                i = 'http:' + i
                yield Request(i, meta={'key': suningitem}, callback=self.product_parse)
        yield suningitem

    def product_parse(self, response):
        suningitem = response.meta['key']
        product_url = response.url
        # print("---------" + str(product_url) + "------")
        suningitem['brand'] = response.xpath('//td[@class="val"]/a/text()').extract()
        # "款式" is the "style" label on the product page
        suningitem['style'] = response.xpath('//tr/td/div/span[text()="款式"]/../../../td[@class="val"]/text()').extract()
        suningitem['name'] = response.xpath('//h1/text()').extract()
        suningitem['storename'] = response.xpath('//div[@class="si-intro-list"]/dl[1]/dd/a/text()').extract()
        proID_pat = re.compile('/(\d+)\.html')
        storeID_pat = re.compile('/(\d.+)/')
        proID = proID_pat.findall(product_url)[0]
        storeID = storeID_pat.findall(product_url)[0]
        favorate_url = 'https://review.suning.com/ajax/review_satisfy/style-000000000' + str(proID[0]) + '-' + str(storeID[0]) + '-----satisfy.htm?'
        yield Request(favorate_url, meta={'key': suningitem}, callback=self.favorate_parse)
        price_url = 'https://pas.suning.com/nspcsale_1_000000000' + str(proID) + '_000000000' + str(proID) + '_' + str(storeID) + '_60_319_3190101_361003_1000103_9103_10766_Z001___R9001225.html'
        yield Request(price_url, meta={'key': suningitem}, callback=self.price_parse)
        yield suningitem

    def favorate_parse(self, response):
        suningitem = response.meta['key']
        # suningitem['favorable'] = response.xpath('').extract()
        # suningitem['evaluate'] = response.xpath('').extract()
        fa_pat = re.compile('"totalCount":(\d+),')
        eva_pat = re.compile('"fiveStarCount":(\d+),')
        suningitem['evaluate'] = eva_pat.findall(response)[0]
        suningitem['favorable'] = fa_pat.findall(response)[0]
        yield suningitem

    def price_parse(self, response):
        suningitem = response.meta['key']
        price_pat = re.compile('"netPrice":"(\d*?\.\d*?)"')
        suningitem['price'] = price_pat.findall(response.body.decode('utf-8', 'ignore'))[0]
        yield suningitem
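As noted in the comment inside parse, the item is handed to the detail-page callback through the request's meta dict. A minimal standalone sketch of just that pattern, using a hypothetical spider name and URLs that are not part of the original code:

import scrapy
from scrapy.http import Request


class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'                                  # hypothetical demo spider
    start_urls = ['https://example.com/list']           # hypothetical listing page

    def parse(self, response):
        item = {'product_url': response.url}            # a plain dict behaves like an Item for this purpose
        yield Request('https://example.com/detail',     # hypothetical detail page
                      meta={'key': item},               # the item rides along inside the request
                      callback=self.parse_detail,
                      dont_filter=True)

    def parse_detail(self, response):
        item = response.meta['key']                     # the same object comes back out in the callback
        item['name'] = response.xpath('//h1/text()').get()
        yield item                                      # yielded once, after the detail fields are filled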
pipelines:
class SuningPipeline(object):
    def __init__(self):
        self.lfile = open(r'd:\desktop\苏宁data.csv', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        ''' name = item['name']
        price = item['price']
        evaluate = item['evaluate']
        favorable = item['favorable']
        brand = item['brand']
        style = item['style']
        product_url = item['product_url']
        storename = item['storename']
        for i in range(0, len(name)-1):
            line = name[i] + ',' + price[i] + ',' + evaluate[i] + ',' + favorable[i] + ',' + brand[i] + ',' + style[i] + ','\
                   + product_url[i] + ',' + storename[i] + '\r\n'
            self.lfile.write(line) '''
        # The block commented out above was my first attempt at writing a local csv file;
        # I've used that approach successfully before and don't know why it raises KeyError this time.
        # The version below is something I picked up somewhere else, and it also raises KeyError...
        # (a csv-module sketch follows this class)
        self.lfile.write(
            str(item['name']) + ',' +
            str(item['price']) + ',' +
            str(item['evaluate']) + ',' +
            str(item['favorable']) + ',' +
            str(item['style']) + ',' +
            str(item['product_url']) + ',' +
            str(item['storename']) + '\r\n'
        )
        return item

    def close_spider(self, spider):
        self.lfile.close()
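As an aside, here is a minimal sketch of writing the same rows with Python's csv module. The path and field names are copied from the pipeline above; open_spider, the header row, and the item.get() defaults are assumptions of this sketch, not part of the original code:

import csv


class SuningPipeline(object):
    # fields in the order the original write used them
    fields = ['name', 'price', 'evaluate', 'favorable', 'style', 'product_url', 'storename']

    def open_spider(self, spider):
        # newline='' keeps the csv module from inserting extra blank lines on Windows
        self.lfile = open(r'd:\desktop\苏宁data.csv', 'w', encoding='utf-8', newline='')
        self.writer = csv.writer(self.lfile)
        self.writer.writerow(self.fields)  # header row

    def process_item(self, item, spider):
        # item.get() returns '' instead of raising KeyError when a field was never filled,
        # so a half-filled item shows up as blank cells rather than stopping the write
        self.writer.writerow([str(item.get(f, '')) for f in self.fields])
        return item

    def close_spider(self, spider):
        self.lfile.close()

csv.writer also takes care of quoting commas inside values, which the manual string concatenation above does not.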
If anyone has time, please take a look and point me in the right direction. If there are other problems or better approaches, please point those out too. Many thanks in advance...
Try removing the final yield suningitem at the end of the product_parse, price_parse, and favorate_parse functions.
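To illustrate that suggestion, here is a minimal sketch of the three callbacks with the intermediate yield suningitem lines removed. It additionally chains the review request to the price request (an assumption of this sketch, not something the original spider does) so the item is yielded exactly once, after every field has been assigned. The URLs, XPaths, and regexes are copied from the code above, and favorable_rate is used because that is the field name declared in SuningItem:

import re
import scrapy
from scrapy.http import Request


class SuninSpider(scrapy.Spider):
    # name, allowed_domains, start_urls, start_requests and parse stay as in the original spider

    def product_parse(self, response):
        suningitem = response.meta['key']
        suningitem['brand'] = response.xpath('//td[@class="val"]/a/text()').extract()
        suningitem['style'] = response.xpath('//tr/td/div/span[text()="款式"]/../../../td[@class="val"]/text()').extract()
        suningitem['name'] = response.xpath('//h1/text()').extract()
        suningitem['storename'] = response.xpath('//div[@class="si-intro-list"]/dl[1]/dd/a/text()').extract()
        proID = re.findall(r'/(\d+)\.html', response.url)[0]
        storeID = re.findall(r'/(\d.+)/', response.url)[0]
        favorate_url = ('https://review.suning.com/ajax/review_satisfy/style-000000000'
                        + proID[0] + '-' + storeID[0] + '-----satisfy.htm?')
        price_url = ('https://pas.suning.com/nspcsale_1_000000000' + proID + '_000000000' + proID + '_'
                     + storeID + '_60_319_3190101_361003_1000103_9103_10766_Z001___R9001225.html')
        # no yield suningitem here; carry the price URL along so favorate_parse can chain to it
        yield Request(favorate_url, meta={'key': suningitem, 'price_url': price_url},
                      callback=self.favorate_parse)

    def favorate_parse(self, response):
        suningitem = response.meta['key']
        # the regexes need a string, so match against response.text rather than the Response object
        suningitem['evaluate'] = re.findall(r'"fiveStarCount":(\d+),', response.text)[0]
        suningitem['favorable_rate'] = re.findall(r'"totalCount":(\d+),', response.text)[0]
        # still no intermediate yield; chain straight on to the price request
        yield Request(response.meta['price_url'], meta={'key': suningitem},
                      callback=self.price_parse)

    def price_parse(self, response):
        suningitem = response.meta['key']
        suningitem['price'] = re.findall(r'"netPrice":"(\d*?\.\d*?)"', response.text)[0]
        yield suningitem  # every field is filled by now, so this is the only place the item is yielded

This only shows the control flow for the three callbacks the reply mentions; whether the XPaths and regexes still match Suning's pages is untested.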