
I'm trying to scrape home price data from Zillow using Beautiful Soup.

I fetch the page by property ID, e.g. http://www.zillow.com/homes/for_sale/18429834_zpid/

When I call find_all() on the parsed page, I get no results:

 results = soup.find_all('div', attrs={"class":"home-summary-row"})

However, if I take the HTML and cut it down to just the part I want, e.g.:

 <html>
    <body>
        <div class=" status-icon-row for-sale-row home-summary-row">
        </div>
        <div class=" home-summary-row">
            <span class=""> $1,342,144 </span>
        </div>
    </body>
</html>

I get 2 results, both <div> elements with the class home-summary-row. So my question is: why don't I get any results when I search the whole page?

Working example:

 from bs4 import BeautifulSoup
import requests

zpid = "18429834"
url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
response = requests.get(url)
html = response.content
#html = '<html><body><div class=" status-icon-row for-sale-row home-summary-row"></div><div class=" home-summary-row"><span class=""> $1,342,144 </span></div></body></html>'
soup = BeautifulSoup(html, "html5lib")

results = soup.find_all('div', attrs={"class":"home-summary-row"})
print(results)

Originally posted by SFBA26; translation licensed under CC BY-SA 4.0.

python html web-scraping beautifulsoup


2 answers


Community wiki, posted 2022-12-19

✓ Accepted

According to the W3.org Validator, the HTML has a number of problems, such as stray closing tags and tags split across multiple lines. For example:

 <a
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >

Markup like this can make it harder for BeautifulSoup to parse the HTML.

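You may want to try running something to clean up the HTML first, such as removing the newlines and trailing whitespace at the end of each line. BeautifulSoup can also tidy up the HTML tree for you: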

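 from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` (v3) package is Python 2 only

# bad_html is the raw page source, e.g. response.content
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()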
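
To check whether a particular fragment really is a problem, it can help to parse it in isolation. The following is a minimal sketch, assuming BeautifulSoup 4 and the built-in html.parser. The bad_html fragment, including the link text, is a made-up example that combines the split <a> tag above with a stray </p>; it is not the actual Zillow page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an <a> tag split across lines plus a stray </p>.
bad_html = '''<div class=" home-summary-row">
<a
href="http://www.zillow.com/danville-ca-94526/sold/"  title="Recent home sales" class=""  data-za-action="Recent Home Sales"  >Recent home sales</a>
</p>
</div>'''

tree = BeautifulSoup(bad_html, "html.parser")

# prettify() re-serializes the parsed tree with one tag per line.
good_html = tree.prettify()
print(good_html)

# The attributes of the split tag are still available after parsing.
link = tree.find("a")
print(link["href"], link["title"])
```

In this sketch the attributes of the split tag remain readable after parsing, and html.parser drops the stray </p>. Other parsers may repair the same fragment differently.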

Originally posted by Soviut; translation licensed under CC BY-SA 3.0.

Community wiki, posted 2022-12-19

Your HTML is not well-formed, and in that case choosing the right parser is crucial. BeautifulSoup currently supports 3 HTML parsers, which work differently and handle broken HTML differently:

html.parser (built in, no extra module needed)
lxml (the fastest, requires lxml to be installed)
html5lib (the most lenient, requires html5lib to be installed)
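
To see how differently the parsers can repair invalid markup, a tiny fragment is enough. This is a minimal sketch, assuming BeautifulSoup 4; the helper name parse_with_each is hypothetical, and a parser that is not installed is reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# An unclosed <a> followed by a stray </p>.
broken = "<a></p>"

def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree, or None if the parser is missing}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The optional parser library is not installed.
            results[name] = None
    return results

results = parse_with_each(broken)
for name, output in results.items():
    print(name, "->", output if output is not None else "not installed")
```

With html.parser the stray </p> is simply dropped, giving <a></a>. According to the Beautiful Soup documentation, lxml and html5lib additionally wrap the result in a full document, and html5lib keeps the dangling tag as an empty <p></p>, so the three trees can differ even for two tags.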

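The "Differences between parsers" section of the documentation describes these differences in more detail. In your case, to demonstrate the difference: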

 >>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> zpid = "18429834"
>>> url = "http://www.zillow.com/homes/" + zpid + "_zpid/"
>>> response = requests.get(url)
>>> html = response.content
>>>
>>> len(BeautifulSoup(html, "html5lib").find_all('div', attrs={"class":"home-summary-row"}))
0
>>> len(BeautifulSoup(html, "html.parser").find_all('div', attrs={"class":"home-summary-row"}))
3
>>> len(BeautifulSoup(html, "lxml").find_all('div', attrs={"class":"home-summary-row"}))
3

As you can see, in your case both html.parser and lxml do the job, but html5lib does not.
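
Since the live page changes over time, the counts above may not be reproducible. One way to cope is to try the parsers in turn and keep the first one that finds anything. A minimal sketch, assuming BeautifulSoup 4; find_all_with_fallback is a hypothetical helper, and the sample markup is the trimmed-down snippet from the question rather than the real page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def find_all_with_fallback(html, tag, class_name,
                           parsers=("html.parser", "lxml", "html5lib")):
    """Return (parser_name, matches) for the first parser that finds a match."""
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
        matches = soup.find_all(tag, class_=class_name)
        if matches:
            return parser, matches
    return None, []

# Trimmed-down snippet from the question, used here instead of the live page.
sample = (
    '<html><body>'
    '<div class=" status-icon-row for-sale-row home-summary-row"></div>'
    '<div class=" home-summary-row"><span class=""> $1,342,144 </span></div>'
    '</body></html>'
)

parser_used, rows = find_all_with_fallback(sample, "div", "home-summary-row")
print(parser_used, len(rows))
```

The parser order is a design choice: html.parser comes first here only because it needs no extra installation. Put lxml first if speed matters and it is installed.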

Originally posted by alecxe; translation licensed under CC BY-SA 3.0.
