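Let me briefly describe what I need: I want to fetch different data from the following two URLs and, in the end, put everything into a single item.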
https://www.esr.com/sc/map_ch...
https://www.esr.com/sc/media_...
Spider code:
# -*- coding: utf-8 -*-
import scrapy
from news.items import EsrItem


class EsrSpider(scrapy.Spider):
    name = 'esr'
    allowed_domains = ['esr.com']

    def start_requests(self):
        yield scrapy.Request('https://www.esr.com/sc/map_china.php', self.parse1)
        yield scrapy.Request('https://www.esr.com/sc/media_news.php', self.parse3)

    def parse1(self, response):
        # Follow the link to each asset detail page.
        for web in response.xpath('//div[@class="earth_hide_ul"]/ul/li'):
            url_tmp = web.xpath('.//a/@href').extract()[0]
            urlquest = "https://www.esr.com/sc/" + url_tmp
            yield scrapy.Request(url=urlquest, callback=self.parse2)

    def parse2(self, response):
        # Fill in every asset* field from the detail page.
        item = EsrItem()
        item['assetstitle'] = response.xpath('//div[@class="flex justify_between_center"]/h3/text()').extract()[0]
        item['assetaddress'] = response.xpath("//ul[@class='map_item_ul'][1]/li/b/text()").extract()[0]
        tmp = response.xpath("//ul[@class='map_item_ul'][2]")
        item['assettedian'] = tmp.xpath("string(.)").extract()[0].strip()
        item['assetjiagou'] = response.xpath("//ul[@class='map_store_ul']/li[1]/div/span/text()").extract()[0]
        item['assettudimianji'] = response.xpath("//ul[@class='map_store_ul']/li[2]/div/span/text()").extract()[0].strip()
        item['assetjianzhumianji'] = response.xpath("//ul[@class='map_store_ul']/li[3]/div/span/text()").extract()[0].strip()
        item['assetjungongtime'] = response.xpath("//ul[@class='map_store_ul']/li[4]/div/span/text()").extract()[0].strip()
        assetpeople = response.xpath("//ul[@class='map_store_ul']/li[5]/div/span/a/text()").extract()[0].strip()
        # Drop the first 6 characters of the contact link's href.
        assetpeople_mail = response.xpath("//ul[@class='map_store_ul']/li[5]/div/span/a/@href").extract()[0][6:]
        item['assetpeople'] = assetpeople + assetpeople_mail
        # This is the line I am asking about: what should the URL be?
        yield scrapy.Request("<what should go here?>", callback=self.parse3, meta={'item': item})

    def parse3(self, response):
        pass
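As shown above, I overrode start_requests so that it yields two requests, whose callbacks are parse1 and parse3 (parse1 in turn hands each detail page to parse2).
In parse2 I have already extracted every item field that starts with asset, but the fields that start with new have to come from the other URL. I want to pass the asset data from parse2 into parse3 for further processing and then yield the item once at the end. But what should the url argument of the Request in parse2 be? I don't actually need to request another URL there; I only want all the data to end up in the same item. Or should I just yield the item directly in parse2? But then the data would be incomplete.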
My item:
import scrapy


class EsrItem(scrapy.Item):
    assetstitle = scrapy.Field()
    assetaddress = scrapy.Field()
    assettedian = scrapy.Field()
    assetjiagou = scrapy.Field()
    assettudimianji = scrapy.Field()
    assetjianzhumianji = scrapy.Field()
    assetjungongtime = scrapy.Field()
    assetpeople = scrapy.Field()
    newstitle = scrapy.Field()
    newtiems = scrapy.Field()
    newslink = scrapy.Field()
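For now I have not found a good solution. My temporary workaround is to create two separate item classes, yield each one on its own from its callback, and then check the type in the pipeline with isinstance(item, EsrItem) before processing the data.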
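The workaround might look roughly like the sketch below. AssetItem, NewsItem, and EsrPipeline are hypothetical names, and dataclasses are used here only to keep the example self-contained (recent Scrapy versions accept dataclass objects as items, and the isinstance check works the same way with scrapy.Item subclasses). Collecting items into lists is a stand-in for whatever processing the pipeline really does.

```python
from dataclasses import dataclass


@dataclass
class AssetItem:
    # Hypothetical item holding only the asset fields.
    assetstitle: str = ''
    assetaddress: str = ''


@dataclass
class NewsItem:
    # Hypothetical item holding only the news fields.
    newstitle: str = ''
    newslink: str = ''


class EsrPipeline:
    """Routes each item to its own handling branch based on its type."""

    def __init__(self):
        # Stand-ins for real storage (database tables, files, ...).
        self.assets = []
        self.news = []

    def process_item(self, item, spider):
        if isinstance(item, AssetItem):
            self.assets.append(item)
        elif isinstance(item, NewsItem):
            self.news.append(item)
        # Return the item so that later pipelines still receive it.
        return item
```

With this split, parse2 yields an AssetItem, parse3 yields a NewsItem, and neither callback has to wait for data from the other page.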