BeautifulSoup: A Quick-Start Guide to a Handy Web Page Parser

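We have already covered plenty of crawler examples and techniques. Most of those articles, though, focused on how to fetch the content of a web page. Today we look at the next step: once you have the content, how do you extract the specific information you need from it?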

A fetched page is usually just a str object. The most direct way to look for information in it is the string find method combined with slicing:



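s = '<p>价格：15.7 元</p>'  # "价格" means price, "元" is yuan
prefix = '价格：'
start = s.find(prefix)
end = s.find(' 元')
# Skip past the prefix instead of hard-coding its length
print(s[start + len(prefix):end])
# 15.7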

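This copes with only the very simplest cases; as soon as things get slightly more complex, writing code this way becomes exhausting. A more general approach is to use regular expressions: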



import re
s = '<p>价格：15.7 元</p>'
# Raw string, so that \d reaches the regex engine as written
r = re.search(r'[\d.]+', s)
print(r.group())
# 15.7

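Regular expressions are the all-purpose tool for text parsing and can cope with just about any situation. Unfortunately, they take real effort to learn: we started with one problem (extracting data from a web page), reached for regular expressions, and now we have two problems.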

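An HTML document is structured text that follows certain rules, and that structure can be used to simplify information extraction. This is why web extraction libraries such as lxml, pyquery, and BeautifulSoup exist, and they are what we usually use to pull information out of pages. Of these, lxml parses very efficiently and supports XPath (a query syntax for locating information in HTML); pyquery takes its name from jQuery (the well-known front-end JS library) and lets you parse pages with jQuery-like syntax. The one we cover today, though, is the remaining one: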

BeautifulSoup

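BeautifulSoup (hereafter "bs"), literally "beautiful soup", takes its odd name from Alice's Adventures in Wonderland (which is also why its official site features strange illustrations and uses passages from Alice as sample text).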

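To my mind, the best thing about bs is that it is simple and easy to use: unlike regular expressions and XPath, it does not require you to memorize a lot of special syntax, even though those can be more efficient and direct. For most Python users, ease of use matters more than raw efficiency. That is the main reason I use bs myself and recommend it.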

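Below are the basics of bs, enough to start using it as soon as you finish reading. For the benefit of readers who bookmark without reading, here is a TL;DR summary first: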

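Ships with Anaconda; can also be installed with pip
A parser must be specified; parsers differ in performance and fault tolerance, so their results may differ as well
Basic workflow: create a bs object from text -> search with find/find_all or other methods -> output or save
Searches can be chained: first locate a section, then search further within it
While developing, watch the return type of each method; when something fails, read the error message and print more debugging output
The official documentation is friendly and also available in Chinese; recommended reading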

Installation

Installing with pip is recommended (for more on pip, see our earlier article "Crossin：如何安装 Python 的第三方模块"):

pip install beautifulsoup4

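Note that the package name is beautifulsoup4. Without the 4 you get the old version, bs3, which is kept only for compatibility and is no longer recommended. Whenever we say bs here, we mean bs4.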

bs4 also comes with an Anaconda installation (see our earlier article "Crossin：Python数据科学环境：Anaconda 了解一下").

When you use bs, you need to specify a "parser":

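html.parser - bundled with Python, but not very fault tolerant; it may lose some content on poorly written pages
lxml - fast parsing; requires a separate install
xml - also provided by the lxml library; supports XML documents
html5lib - the best fault tolerance, but somewhat slower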

Both lxml and html5lib need to be installed separately, but if you use Anaconda they are already included.
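
To see why the parser choice matters, here is a minimal sketch that feeds the same malformed snippet to every parser that happens to be installed. The snippet `<a></p>` and the helper name `parse_with_available` are illustrative assumptions rather than anything from bs itself; parsers that are missing from the environment are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: a closing </p> with no matching opening tag.
BROKEN_HTML = "<a></p>"


def parse_with_available(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse markup with each installed parser; return {parser name: result}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # This parser (lxml or html5lib) is not installed here; skip it.
            continue
    return results


for name, output in parse_with_available(BROKEN_HTML).items():
    print(f"{name:12} {output}")
```

If more than one parser is installed, the printed trees will typically differ, which is a good reason to name the parser explicitly so that results stay the same across machines.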

Quick Start

We will use the sample document from the official site as our example:



html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To initialize bs, create a BeautifulSoup object from the text. Specifying the parser explicitly is recommended:



from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Get a structural element and its attributes:



soup.title  # the title element
# <title>The Dormouse's story</title>

soup.p  # the first p element
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']  # the class attribute of that p element
# ['title']

soup.p.b  # the b element inside that p element
# <b>The Dormouse's story</b>

soup.p.parent.name  # tag name of the p element's parent node
# 'body'
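
Attribute access on a tag works much like a dictionary. The sketch below, run on a shortened copy of the sample document, contrasts square-bracket access with get; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a  # the first <a> element

href = link["href"]        # raises KeyError if the attribute is absent
title = link.get("title")  # returns None if the attribute is absent
text = link.string         # the tag's single text child
all_attrs = link.attrs     # dict holding every attribute of the tag

print(href)   # http://example.com/elsie
print(title)  # None
print(text)   # Elsie
print(all_attrs)
```

Using get is the safer choice when scraping real pages, where an expected attribute is often missing on some elements.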

Not all information can be reached simply by walking the structure; usually you search with the find and find_all methods:



soup.find_all('a')  # all a elements
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id='link3')  # the element whose id is link3
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

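find and find_all accept several search conditions at once, for example find('a', id='link3', class_='sister')
find returns a bs4.element.Tag object, which can itself be searched further. If several elements match, find returns only the first; if none match, it returns None.
find_all returns a list of bs4.element.Tag objects (strictly a ResultSet, which is a list subclass); whether it finds several, one, or none, the result is always a list.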
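
The return types above decide how the results can be used. Here is a minimal sketch of a chained search on a shortened copy of the sample document: locate a section first, then search only inside it, and check for None before going further. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Step 1: locate a section. find returns a Tag, or None when nothing matches.
story = soup.find("p", class_="story")

# Step 2: search inside that section only. find_all always returns a list.
hrefs = []
if story is not None:
    for a in story.find_all("a", class_="sister"):
        hrefs.append(a.get("href"))
print(hrefs)

# A failed find has to be checked before use: calling a method on None
# raises AttributeError.
missing = soup.find("table")
print(missing)                 # None
print(soup.find_all("table"))  # []
```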

Output:



x = soup.find(class_='story')
x.get_text()  # visible text only
# 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
x.prettify()  # full markup of the element, pretty-printed
# '<p class="story">\n Once upon a time there were three little sisters; and their names were\n <a class="sister" href="http://example.com/elsie" id="link1">\n  Elsie\n </a>\n ,\n <a class="sister" href="http://example.com/lacie" id="link2">\n  Lacie\n </a>\n and\n <a class="sister" href="http://example.com/tillie" id="link3">\n  Tillie\n </a>\n ;\nand they lived at the bottom of a well.\n</p>\n'
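
The text returned by get_text keeps the line breaks of the source markup. One common way to flatten it onto a single line, shown here as a sketch using only standard string methods (the name one_line is illustrative), is to split on whitespace and join again:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
x = soup.find(class_="story")

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses newlines and repeated spaces.
one_line = " ".join(x.get_text().split())
print(one_line)
# Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.
```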

If you have front-end experience and know CSS selectors well, bs provides a matching method for you too:



soup.select('html head title')
# [<title>The Dormouse's story</title>]
soup.select('p > #link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
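
As a further sketch on a shortened copy of the sample document, selectors can also combine tag names and classes, and select_one plays the same role for select that find plays for find_all. The selector strings and variable names below are chosen only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select returns a list of every match, like find_all.
names = [a.get_text() for a in soup.select("p.story a.sister")]
print(names)  # ['Elsie', 'Lacie', 'Tillie']

# select_one returns the first match or None, like find.
second = soup.select_one("#link2")
print(second["href"])            # http://example.com/lacie
print(soup.select_one("#nope"))  # None
```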

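That is a minimal introduction to BeautifulSoup; you should now have a rough idea of what bs can do. If you plan to use it in real development, I recommend also reading its official documentation. It is clearly written and available in Chinese, and reading just the first part is enough to put it to work in your code. Further details, such as specific methods and parameters, can be looked up as you need them.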

Chinese documentation:

Beautiful Soup 4.2.0 documentation (www.crummy.com)


Feel free to search for and follow: Crossin的编程教室