The source page is encoded in UTF-8, but the scraped text still comes out garbled

import requests,re

req_list = requests.get('http://finance.eastmoney.com/news/cgnjj_3.html').text
list_url = re.search('<p class="title">.*?<a href="(.*?)".*?target="_blank">',req_list,re.S)
content_url = list_url.group(1)
content_source = requests.get(content_url).text
# Below: extract the article content
title = re.search('<h1>(.*?)</h1>',content_source).group(1)
time = re.search('<div class="time">(.*?)</div>',content_source).group(1)
source = re.search('<div class="source">(.*?)</div>',content_source,re.S).group(1)
content = re.search('<div id="ContentBody" class="Body">(.*?)<p class="res-edit">',content_source,re.S).group(1)
print(title)
print(time)
print(source)
print(content)

Everything I get back is garbled. I checked the original page, and its encoding really is UTF-8.

3 answers
req = requests.get('http://finance.eastmoney.com/news/cgnjj_3.html')
req.encoding = 'UTF-8'
req_list = req.text

Set the encoding explicitly, like this, before reading .text.

response = requests.get('http://finance.eastmoney.com/news/cgnjj_3.html')
response.encoding    # check the response encoding; for me this returned 'ISO-8859-1'
response.encoding = 'utf-8'
response.text    # OK
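
The 'ISO-8859-1' above is, as far as I know, not read from the page itself: requests falls back to it for text/* responses whose Content-Type header carries no charset. If hard-coding 'utf-8' for every site feels too brittle, a rough sketch of a fallback is below. The helper name decode_response and the stand-in object are made up for illustration and are not part of requests; only .encoding, .apparent_encoding and .content are real attributes of a requests.Response.

```python
from types import SimpleNamespace


def decode_response(response, default="utf-8"):
    """Decode response.content, distrusting the ISO-8859-1 header fallback.

    `response` is any object exposing .encoding, .apparent_encoding and
    .content, such as a requests.Response. Hypothetical helper.
    """
    declared = (response.encoding or "").lower()
    if declared in ("", "iso-8859-1"):
        # Assumption: ISO-8859-1 here means "no charset in the header",
        # so use the encoding detected from the body instead. A page that
        # really is ISO-8859-1 would rely on the detection being right.
        encoding = response.apparent_encoding or default
    else:
        encoding = response.encoding
    return response.content.decode(encoding, errors="replace")


# Stand-in for a real response so the sketch runs without network access.
fake = SimpleNamespace(
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
    content="财经新闻".encode("utf-8"),
)
recovered = decode_response(fake)
```

With a real request this would be called as decode_response(requests.get(url)) in place of reading .text directly.
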
req_list = req_list.encode("latin1").decode("utf-8")
print(req_list)
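
Why the encode/decode round trip works, as a small offline sketch: Latin-1 maps every byte value to exactly one character, so decoding UTF-8 bytes as Latin-1 produces garbage but loses nothing, and encoding back to Latin-1 returns the original bytes. The sample string is only an illustration, not text taken from the page.

```python
original = "东方财富"                    # sample text, chosen for illustration
raw = original.encode("utf-8")          # the bytes the server sends
garbled = raw.decode("latin1")          # what you see when ISO-8859-1 is assumed

# Undo the wrong decode, then apply the right one.
repaired = garbled.encode("latin1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

This only helps when the text was decoded as Latin-1 in the first place. If it was decoded with some other codec, or with errors replaced, the original bytes may not be recoverable this way.
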