Python Scrapy crawler problem

Me1ody丶

4411824

Posted on 2017-11-06

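I am using Scrapy to crawl Baidu job listings (zhaopin.baidu.com). The company names I want to crawl are stored in a list, and I want to crawl 10 pages of job postings for each company in turn. The code is as follows: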

import scrapy
import json
from scrapy.http import Request
from baiduRecruit.items import BaidurecruitItem
import re

class RecruitcrawlSpider(scrapy.Spider):

    name = 'recruitcrawl'
    allowed_domains = ['zhaopin.baidu.com/']
    page=0
    company_list=['卓易','远东']
    start_urls=['http://zhaopin.baidu.com/api/quanzhiasync?query={}&sort_key=5&sort_type=1&city_sug=无锡&detailmode=close&rn=20&pn=0'.format(name)for name in company_list]

    def parse(self, response):
        result_json = json.loads(response.text)
        infos = result_json['data']['main']['data']['disp_data']
        name=result_json['data']['main']['data']['hilight']
        if infos:
            for info in infos:
                company_name =name
                try:
                    publish_time = info['lastmod']
                except:
                    publish_time = ''

                rinfo = BaidurecruitItem()
                rinfo['company_name'] = company_name
                rinfo['publish_time'] = publish_time
                yield rinfo

        if self.page<=9:
            self.page+=1
            current_url=response.url
            next_url=re.sub('pn=\d+','pn=%d'%(self.page*20),current_url)
            yield Request(url=next_url,callback=self.parse,dont_filter=True)

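However, far fewer items come back than expected (there should be about 250, but I only get around 130). Any help would be appreciated.
Thanks.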
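
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: `page` is a single counter on the spider, but both start URLs feed the same `parse` callback, so the two companies' request chains increment it together. Each company would then skip offsets, and the crawl would stop after roughly ten follow-up requests in total instead of ten pages per company. A minimal sketch of an alternative derives the next offset from the `pn` value in the response's own URL, so no state is shared between companies. The helper names `next_page_url` and `pages_for` and the `example.com` URL are hypothetical; the sketch uses only the standard library.

```python
import re

PAGE_SIZE = 20   # matches rn=20 in the question's URL
MAX_PAGES = 10   # pages wanted per company


def next_page_url(current_url, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the URL of the next page, or None when no further page is wanted.

    The offset is read from the URL itself, so each company's chain of
    requests advances independently of every other chain.
    """
    match = re.search(r'\bpn=(\d+)', current_url)
    if match is None:
        return None  # no pn parameter: nothing to paginate
    next_offset = int(match.group(1)) + page_size
    if next_offset >= page_size * max_pages:
        return None  # the requested number of pages has been reached
    return re.sub(r'\bpn=\d+', 'pn=%d' % next_offset, current_url, count=1)


def pages_for(start_url):
    """List every page URL that would be visited starting from start_url."""
    urls = [start_url]
    while True:
        following = next_page_url(urls[-1])
        if following is None:
            return urls
        urls.append(following)


if __name__ == '__main__':
    # Hypothetical URL, used only to show the offsets that get generated.
    for url in pages_for('http://example.com/api?query=a&rn=20&pn=0'):
        print(url)
```

In the spider, `parse` could call `next_page_url(response.url)` and yield a `Request` for it only when the result is not `None`, which would replace the `self.page` bookkeeping.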

python
