Scrapy cannot crawl the next level of pages

Question


Posted on 2017-01-14


I wrote a crawler based on the Scrapy framework. My goal is to scrape a site that works like a forum.
The first level is a paginated list, with URLs like:
http://www.example.com/group/p1/
http://www.example.com/group/p2/
...
http://www.example.com/group/pn/
The entries in each first-level list link to second-level pages, with URLs like:
http://www.example.com/group/view/1
http://www.example.com/group/view/2
...
http://www.example.com/group/view/999...


My spider code is as follows:

```python
import scrapy
from scrapy.spiders.crawl import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule
from urlparse import urljoin


class MyCrawler(CrawlSpider):
    name = "MyCrawler"
    start_urls = [
        "http://www.example.com/group/"
    ]
    rules = [
        Rule(LxmlLinkExtractor(allow=(),
                               restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])),
             callback='parse_list_page',
             follow=True)
    ]

    def parse(self, response):
        list_page = response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract()
        for item in list_page:
            yield scrapy.http.Request(self, url=urljoin(response.url, item), callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()

        self.log(community_name, 2)
```

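My problem is that parse_detail_page never seems to be executed. What is the cause? If my approach is wrong, how should a requirement like this be implemented? I assume this is a common need.
Thanks, everyone.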

Tags: Python crawler, scrapy

Views: 3.8k

1 Answer


lifew

2

Posted on 2017-09-30


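In a CrawlSpider you should not override the parse method, because CrawlSpider uses parse internally to apply its rules. Rename your parse method (for example, to parse_list_page, which your rule already names as its callback) and try again.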
