Scrapy Study Notes (2): Continuous Crawling and Saving Data

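When scraping multi-page content such as forums or Tieba threads, before I knew Scrapy I would first work out how many pages there were and then fetch them in a for loop. That approach is clumsy. With Scrapy you can instead build the URL of the next page, hand it to a Request, and keep crawling until there is no next-page link left.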
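
The idea of following the next-page link until it disappears can be sketched without Scrapy at all. This is only a minimal illustration: `FAKE_SITE`, the example.com URLs, and the `crawl` helper are hypothetical stand-ins for real HTTP responses, not part of Scrapy.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": each URL maps to the items on that page and
# the href of its "next" link (None on the last page).
FAKE_SITE = {
    "http://example.com/page/1/": (["post A", "post B"], "/page/2/"),
    "http://example.com/page/2/": (["post C"], "/page/3/"),
    "http://example.com/page/3/": (["post D"], None),
}


def crawl(start_url, site):
    """Yield every item, following "next" links until a page has none."""
    url = start_url
    while url is not None:
        items, next_href = site[url]
        yield from items
        # Build the next page's URL from the href, or stop if there is none.
        url = urljoin(url, next_href) if next_href is not None else None


all_items = list(crawl("http://example.com/page/1/", FAKE_SITE))
print(all_items)
```

Note that the loop never needs to know the page count in advance; it stops as soon as a page has no next link.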

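We will again use the site from the official tutorial as the example. First, inspect the element:

[Figure: the next-page element]

The tag for the next-page link looks like this: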

<a href="/page/2/">Next <span aria-hidden="true">→</span></a>

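Joining the href value /page/2/ with http://quotes.toscrape.com gives the URL of the next page. Likewise, joining the href of the Next link on page two gives the URL of page three. So as long as we can tell whether a page has a next-page link, we can keep crawling.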
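
To see how a relative href and the site address combine, here is a small sketch using the standard library's `urllib.parse.urljoin`. As far as I know, Scrapy's `response.urljoin` builds on the same function and uses the response's own URL as the base; the `base` and `page_two` variables are just illustrative names.

```python
from urllib.parse import urljoin

# The page the spider starts from.
base = "http://quotes.toscrape.com/"

# The href of the Next link on page one is an absolute path.
page_two = urljoin(base, "/page/2/")
print(page_two)

# On page two the Next link points at /page/3/, resolved the same way.
page_three = urljoin(page_two, "/page/3/")
print(page_three)
```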

Here is the code:

import scrapy


class myspider(scrapy.Spider):

    # Spider name
    name = "get_quotes"

    # Start URL
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):

        # Select elements with CSS selectors (BeautifulSoup or similar also works if you prefer).
        # Locate each whole quote block first, then extract the author, tags and text inside it.
        for quote in response.css('.quote'):
            yield {
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
                'content': quote.css('span.text::text').extract_first(),
            }

        # Use XPath to get the href attribute of the Next button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check whether a next-page href exists
        if next_href is not None:

            # If it exists, build the next page's URL with urljoin:
            # http://quotes.toscrape.com/page/2/
            next_page = response.urljoin(next_href)

            # Request the next page and handle it with parse again
            yield scrapy.Request(next_page, callback=self.parse)

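Here is the result:
[Figure: crawl results]

You can see that 10 pages were crawled, and the site has exactly 10 pages:
[Figure: number of pages on the site]

With that, every quote on the site has been scraped. Convenient, isn't it?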

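So far the scraped data is only printed to the screen and not stored anywhere. Next we will store it in MongoDB. You can look up MongoDB's advantages yourself, so I won't cover them here. Download it from the official site and follow the official installation guide to set it up.

To use MongoDB from Python you need pymongo; just run pip install pymongo.
First, a demonstration that stores the data directly inside the spider. Treat it as a MongoDB storage example only; this is not recommended in practice: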

import scrapy

# Import pymongo
import pymongo


class myspider(scrapy.Spider):

    # Spider name
    name = "get_quotes"

    # Start URL
    start_urls = ['http://quotes.toscrape.com']

    # Configure the client; default host localhost, port 27017
    client = pymongo.MongoClient('localhost', 27017)
    # Get the database named store_quotes (MongoDB creates it on first write)
    db_name = client['store_quotes']
    # Get the quotes collection (MongoDB's counterpart of a table)
    quotes_list = db_name['quotes']

    def parse(self, response):

        # Select elements with CSS selectors (BeautifulSoup or similar also works if you prefer).
        # Locate each whole quote block first, then extract the author, tags and text inside it.
        for quote in response.css('.quote'):
            # Store the scraped data in MongoDB with insert_one
            # (insert() was deprecated in PyMongo 3.0 and removed in 4.0).
            # The insert result is not an item, so it is not yielded.
            self.quotes_list.insert_one({
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
                'content': quote.css('span.text::text').extract_first(),
            })

        # Use XPath to get the href attribute of the Next button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check whether a next-page href exists
        if next_href is not None:

            # If it exists, build the next page's URL with urljoin:
            # http://quotes.toscrape.com/page/2/
            next_page = response.urljoin(next_href)

            # Request the next page and handle it with parse again
            yield scrapy.Request(next_page, callback=self.parse)

If you use the PyCharm editor, there is a MongoDB plugin called Mongo Plugin that makes it easy to browse the database. Add it from the Plugins settings.
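
Since writing to the database inside the spider is not recommended, one common alternative is to let the spider keep yielding plain dicts and move the storage into a Scrapy item pipeline. The following is only a rough sketch under assumptions: the class name `MongoQuotesPipeline` and its constructor arguments are hypothetical, the database and collection names are taken from the example above, and the method signatures follow the classic item pipeline interface, which may differ in your Scrapy version.

```python
class MongoQuotesPipeline:
    """Hypothetical item pipeline that stores each scraped item in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="store_quotes", collection_name="quotes"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so this module can be loaded without pymongo installed;
        # a connection is only needed once a crawl actually starts.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Release the connection when the crawl finishes.
        self.client.close()

    def process_item(self, item, spider):
        # insert_one adds an "_id" key to the document it receives, so pass a
        # copy and leave the original item untouched for later pipelines.
        self.collection.insert_one(dict(item))
        return item
```

To try it, the pipeline would have to be enabled through the `ITEM_PIPELINES` setting of the project, for example `{"myproject.pipelines.MongoQuotesPipeline": 300}`, where the module path `myproject.pipelines` is an assumed placeholder for your own project layout.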