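Why are there no results when I crawl with the pyspider framework?

I'm using pyspider to scrape job postings from 51job. When I click run on the script-editing page to test it, the output is correct, but when I go back to the dashboard and run the project, nothing shows up in results. This happens every time. Could someone please take a look?

Code: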

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://jobs.51job.com/', callback=self.index_page, validate_cert=False, age=0)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow each category link on the landing page.
        for each in response.doc('.e5 .lkst a').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False, age=0)

    @config(priority=2)
    def detail_page(self, response):
        # Queue every job posting found on this listing page.
        for each in response.doc('.e .info .title a').items():
            self.crawl(each.attr.href, callback=self.detail_page_next, validate_cert=False, age=0, retries=3)
        # Follow the pagination links to further listing pages.
        for each in response.doc('.bk a').items():
            print("deep")
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False, age=0)

    @config(priority=1)
    def detail_page_next(self, response):
        return {
            "公司": response.doc('.cname').text(),       # company
            "公司规模": response.doc('.ltype').text(),   # company size
            "职位": response.doc('h1').text(),           # job title
            "薪资": response.doc('.cn strong').text(),   # salary
            "描述": response.doc('.job_msg').text(),     # description
            "地点": response.doc('.lname').text(),       # location
        }

The test run on the code page is correct:
[image]

Dashboard:
[image]

results:
[image]

1 answer

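Try the script below. Giving detail_page priority=2 makes the results show up sooner. pyspider schedules higher-priority tasks first, and in your script the listing-page callback had the higher priority, so the scheduler most likely kept following listing pages before it fetched any job posting.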

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-01-22 12:13:12
# Project: 51job


from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://jobs.51job.com/', callback=self.main_index, validate_cert=False, age=0)

    @config(age=10 * 24 * 60 * 60)
    def main_index(self, response):
        # Follow each category link on the landing page.
        for each in response.doc('.e5 .lkst a').items():
            self.crawl(each.attr.href, callback=self.index_page, validate_cert=False, age=0)

    @config(priority=1)
    def index_page(self, response):
        # Queue every job posting found on this listing page.
        for each in response.doc('.e .info .title a').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False, age=0, retries=3)
        # Follow the pagination links to further listing pages.
        for each in response.doc('.bk a').items():
            print("deep")
            self.crawl(each.attr.href, callback=self.index_page, validate_cert=False, age=0)

    # Higher priority than index_page, so postings are fetched before more listing pages.
    @config(priority=2)
    def detail_page(self, response):
        return {
            "公司": response.doc('.cname').text(),       # company
            "公司规模": response.doc('.ltype').text(),   # company size
            "职位": response.doc('h1').text(),           # job title
            "薪资": response.doc('.cn strong').text(),   # salary
            "描述": response.doc('.job_msg').text(),     # description
            "地点": response.doc('.lname').text(),       # location
        }
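
To get a feel for why swapping the two priorities matters, here is a toy model of a priority queue. It is only a sketch and not pyspider's actual scheduler: the function name tasks_before_first_result, the page and job counts, and the first-in-first-out tie-breaking are all assumptions made for illustration. It counts how many tasks run before the first posting page (the only kind of task that returns a result) is processed.

```python
import heapq
import itertools


def tasks_before_first_result(index_priority, detail_priority, pages=5, jobs_per_page=3):
    """Toy scheduler: how many tasks run until the first detail task is processed."""
    counter = itertools.count()  # tie-breaker: FIFO order among equal priorities
    queue = []

    def push(priority, kind, page):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(queue, (-priority, next(counter), kind, page))

    push(index_priority, "index", 1)
    processed = 0
    while queue:
        _, _, kind, page = heapq.heappop(queue)
        processed += 1
        if kind == "detail":
            # A detail task is the first one that would produce a result.
            return processed
        # An index task queues its job postings and the next listing page.
        for _ in range(jobs_per_page):
            push(detail_priority, "detail", page)
        if page < pages:
            push(index_priority, "index", page + 1)
    return None


if __name__ == "__main__":
    # Listing pages favoured (as in the question): all 5 listing pages run first.
    print(tasks_before_first_result(index_priority=2, detail_priority=1))  # 6
    # Posting pages favoured (as in the answer): a result right after the first listing page.
    print(tasks_before_first_result(index_priority=1, detail_priority=2))  # 2
```

With only 5 listing pages the delay is small, but on a site with a very large number of listing pages the first arrangement can postpone results for a long time, which would match an empty results page.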