[Web scraping] lxml: getting the HTML of the current node and displaying Chinese correctly

universe_king


Getting the current node: etree.tostring
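As a minimal sketch of why Chinese text appears garbled: by default, etree.tostring serializes to ASCII bytes, so non-ASCII characters are written as numeric character references. The inline SAMPLE document below is a hypothetical stand-in for a real page.

```python
from lxml import etree

# Hypothetical inline document standing in for a real page.
SAMPLE = '<html><body><div id="scroll_marquee"><p>你好</p></div></body></html>'

tree = etree.HTML(SAMPLE)
node = tree.xpath('//*[@id="scroll_marquee"]')[0]

# With no encoding argument, tostring() returns ASCII-only bytes.
raw = etree.tostring(node)
default_html = raw.decode('ascii')

# The Chinese characters show up as references such as &#20320; rather than as text.
print(default_html)
```

Both methods below are ways of turning those references back into readable characters.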

Displaying Chinese correctly
Method 1: use the unescape function from the html library
html.unescape

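from lxml import etree
import html

# Read the saved page as text.
with open('list.html', 'r', encoding='utf-8') as f:
    text = f.read()

tree = etree.HTML(text)

# tostring() returns bytes in which Chinese characters are written as
# numeric character references; html.unescape turns them back into text.
node = tree.xpath('//*[@id="scroll_marquee"]')[0]
r = html.unescape(etree.tostring(node).decode('utf-8'))
print(r)
print(type(r))  # <class 'str'>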

Reference: Fixing garbled Chinese ("&#digits;") when calling tostring() while scraping web pages
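One caveat with Method 1, shown here as a small sketch on a hypothetical inline SAMPLE string: html.unescape decodes every character reference, including &lt; and &amp;, so the result may no longer be well-formed markup if the node's text contains escaped special characters.

```python
import html
from lxml import etree

# Hypothetical fragment whose text contains escaped markup characters.
SAMPLE = '<div id="scroll_marquee"><p>你好 &lt;b&gt; 1 &amp; 2</p></div>'

node = etree.HTML(SAMPLE).xpath('//*[@id="scroll_marquee"]')[0]

# Default serialization: ASCII bytes, everything non-ASCII escaped.
escaped = etree.tostring(node).decode('ascii')

# unescape restores the Chinese text, but also turns &lt;b into a literal "<b".
restored = html.unescape(escaped)

print(escaped)
print(restored)
```

If the output is only meant for reading this is harmless; if it will be parsed again, Method 2 is likely the safer choice.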

Method 2: pass an encoding to lxml's etree.tostring

from lxml import etree
import requests

response = requests.get('https://www.baidu.com/').text
tree = etree.HTML(response)
body = tree.xpath("//body")[0]
# encoding="utf-8" makes tostring() emit UTF-8 bytes instead of
# numeric character references; decode them to get a str.
strs = etree.tostring(body, encoding="utf-8").decode("utf-8")
print(strs)

Reference: Extracting HTML tag content with lxml: fixing tostring() not displaying Chinese
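A variation on Method 2, sketched on a hypothetical inline SAMPLE string so that it runs without network access: passing encoding="unicode" makes etree.tostring return a str directly, so no decode step is needed. Escaped markup characters such as &lt; stay escaped.

```python
from lxml import etree

# Hypothetical fragment standing in for a downloaded page.
SAMPLE = '<div id="scroll_marquee"><p>你好 &lt;b&gt;</p></div>'

node = etree.HTML(SAMPLE).xpath('//*[@id="scroll_marquee"]')[0]

# UTF-8 bytes, as in Method 2; must be decoded before use as text.
as_bytes = etree.tostring(node, encoding='utf-8')

# encoding="unicode" returns a Python str with the characters intact.
as_text = etree.tostring(node, encoding='unicode')

print(as_text)
```

In this sketch both calls produce the same markup; the only difference is whether the result is bytes or str.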

Updated on 2020-10-27
