Scrapy fails to fetch the next page

I'm crawling Douban Books, but I only get the results from the first page.

The spider class is as follows:

# -*- coding: utf-8 -*-
import scrapy
from doubanbook.items import DoubanbookItem

class KgbookSpider(scrapy.Spider):
    # unique identifier of this spider
    name = 'KGbook'
    # only pages under this domain are crawled; anything else is ignored
    allowed_domains = ['book.douban.com/tag/考古']
    # URL(s) downloaded when the spider starts
    start_urls = ['https://book.douban.com/tag/考古']

    # parse the page and produce items
    def parse(self, response):
        # used to store items
        #items = []
        # collect the <li> tag wrapping each book into the list booklist
        booklist = response.xpath('//*[@id="subject_list"]/ul/li')
        # iterate over the list, i.e. over each book's <li> tag
        for r in booklist:
            # create a DoubanbookItem() object
            item = DoubanbookItem()
            # book title
            item['title'] = r.xpath('./div[2]/h2/a/@title').extract_first().replace(' ', '')
            # author, translator, publisher and so on
            item['author'] = r.xpath('string(./div[2]/div[1])').extract_first().replace('\n', '').replace(' ', '')
            # rating
            item['score'] = r.xpath('string(./div[2]/div[2]/span[2])').extract_first()
            item['peoples'] = r.xpath('string(./div[2]/div[2]/span[3])').extract_first().replace('\n', '').replace(' ', '')
            item['summary'] = r.xpath('string(./div[2]/p)').extract_first().strip()
            yield item

        next_page = response.xpath('//*[@id="subject_list"]/div[2]/span[5]/a/@href').extract_first()#.replace('\n', '').replace(' ', '')
        #print(str(next_page))
        #next_page = ''.join(next_page)
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, dont_filter=False)

        return item

This is the markup of the next-page link:
[screenshot omitted; per the accepted answer's XPath, the link sits inside a <span class="next"> element]

If anyone can see where the error is, please point it out. Thanks!

2 Answers

Solved! The steps:

1. Comment out allowed_domains = ['book.douban.com/tag/考古'], or change it to allowed_domains = ['book.douban.com']. Entries in allowed_domains must be bare domains, not URLs with a path: Scrapy's OffsiteMiddleware compares each request's host against this list, a value containing a path never matches, and every follow-up request gets dropped (typically logged as a filtered offsite request).

2. Drop dont_filter=False. (False is already the default for scrapy.Request, so this only removes noise; the real fixes are steps 1 and 3. A sketch of steps 1 and 2 follows this list.)

3. The positional XPath of the pagination element changes from page to page, so locate the next-page link (or its parent) directly, for example by its class attribute, instead of by position.
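Applied to the spider above, steps 1 and 2 amount to the following (a sketch that keeps the asker's names; only the changed lines are shown):

    allowed_domains = ['book.douban.com']  # step 1: bare domain, no path
    # ... rest of the spider unchanged ...
    yield scrapy.Request(next_page_url, callback=self.parse)  # step 2: dont_filter defaults to False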

The pagination code after the fix (step 3):

        # anchor on the "next" span by class instead of a positional path
        next_page = response.xpath('//span[@class="next"]/a[contains(text(), "后页")]/@href').extract()
        if next_page:
            next_page = next_page[0]
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
        else:
            print(str(next_page))
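
For reference, on Scrapy 1.4 or newer the same pagination can be written with response.follow, which resolves the relative href against the current page URL by itself (a sketch, not the original code from this thread):

        next_page = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_page:
            # follow() builds the absolute URL, so no urljoin is needed
            yield response.follow(next_page, callback=self.parse)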
You can try my approach to requesting the next page. I wrote this code a while ago and it still has a bug or two, but it shows the idea:
import scrapy
from First.items import FirstItem


class SpiderMan(scrapy.Spider):
    name = "rose"
    start_urls = [
        "https://wenku.baidu.com/"
    ]

    def parse(self, response):
        # one <a> per document link on the listing page
        for item in response.xpath("//div/dl/dd/a"):
            title = item.xpath("text()").extract()
            targetUrl = item.xpath("@href").extract_first()
            oneItem = FirstItem()
            oneItem["title"] = title
            oneItem["targetUrl"] = targetUrl
            # follow each link, carrying the title along via meta
            # (urljoin handles relative hrefs)
            yield scrapy.Request(url=response.urljoin(targetUrl),
                                 meta={"title": title},
                                 callback=self.parse_url)

    def parse_url(self, response):
        # read back the title passed through meta
        title = response.meta["title"]
        print(title)
        for sel2 in response.xpath('//a[@class="Author logSend"]'):
            docName = sel2.xpath("text()").extract()

            oneItem = FirstItem()
            oneItem["docName"] = docName
            print(oneItem["docName"])

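A side note on the meta pattern above: on Scrapy 1.7 and newer, cb_kwargs passes values to the callback as plain keyword arguments, which is a little more direct (a sketch under that assumption):

    def parse(self, response):
        for item in response.xpath("//div/dl/dd/a"):
            title = item.xpath("text()").extract_first()
            targetUrl = item.xpath("@href").extract_first()
            # cb_kwargs entries arrive as keyword arguments in the callback
            yield scrapy.Request(response.urljoin(targetUrl),
                                 callback=self.parse_url,
                                 cb_kwargs={"title": title})

    def parse_url(self, response, title):
        print(title)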
