Why does my Python crawler return a string of UTF-8 escape codes instead of a normal string?


Published: 2018-08-01
Updated: 2018-08-01

try:
    req = urllib.request.Request(url, headers=hds[page_num%len(hds)])
    source_code = urllib.request.urlopen(req).read()
    plain_text=str(source_code)   
except:
    print ("Error.")
    continue
    
soup = BeautifulSoup(plain_text, from_encoding='utf-8')
list_soup = soup.find('div', {'class': 'mod book-list'})

try_times+=1;
if list_soup==None and try_times<200:
    continue
elif list_soup==None or len(list_soup)<=1:
    break  # Stop when no information is retrieved after 200 requests

for book_info in list_soup.findAll('dd'):
    title = book_info.find('a', {'class':'title'}).string.strip()
    desc = book_info.find('div', {'class':'desc'}).string.strip()
    desc_list = desc.split('/')
    book_url = book_info.find('a', {'class':'title'}).get('href')
    
    # Print the title of the scraped book
    print(title)

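The code above is a crawler for Douban. Why is each book title I scrape (title) printed as a string of UTF-8 escape codes rather than a normal string? (The original post included a screenshot of the output here.)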

The title I get back is actually still of type str, so I cannot call decode on it. Is there any other way to solve this?
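
The symptom can be reproduced without any network access. The following is a minimal sketch, assuming the page body arrives as UTF-8 encoded bytes; the sample value `raw` is made up for illustration and is not real Douban data. In Python 3, calling `str()` on a `bytes` object returns its repr, including the `b'...'` wrapper and the `\x..` escapes, whereas `.decode('utf-8')` returns the actual text.

```python
# Hypothetical sample standing in for urlopen(req).read(); not real Douban data.
raw = "书名".encode("utf-8")      # bytes, the type that read() returns

as_repr = str(raw)                # repr of the bytes object, not decoded text
as_text = raw.decode("utf-8")     # the decoded string

print(as_repr)   # b'\xe4\xb9\xa6\xe5\x90\x8d'
print(as_text)   # 书名
```

This suggests the escapes probably come from `plain_text=str(source_code)` in the question rather than from BeautifulSoup itself.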

Tags: web crawler, encoding, python


1 Answer

Try this:

text.encode('latin-1').decode('unicode_escape')
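
Whether this one-liner is enough depends on which escapes the string contains. As a hedged sketch (the sample strings and the helper name `recover_text` are invented here and do not come from the original post): for literal `\uXXXX` escapes, `unicode_escape` alone returns readable text, but for literal `\xNN` escapes of UTF-8 bytes, such as those produced by `str(source_code)` in the question, it yields one character per byte, so a second `encode('latin-1').decode('utf-8')` round trip is needed.

```python
def recover_text(escaped: str) -> str:
    """Turn literal \\xNN escapes of UTF-8 bytes back into text.

    Hypothetical helper for illustration; assumes the input holds only
    ASCII characters plus \\xNN escapes, and that the bytes are UTF-8.
    """
    # Step 1: interpret the escapes; each escape becomes one char U+00NN.
    one_char_per_byte = escaped.encode("latin-1").decode("unicode_escape")
    # Step 2: map those chars back to bytes, then decode them as UTF-8.
    return one_char_per_byte.encode("latin-1").decode("utf-8")


# Made-up sample: the literal backslash escapes for a two-character title.
mangled = r"\xe4\xb9\xa6\xe5\x90\x8d"

half_fixed = mangled.encode("latin-1").decode("unicode_escape")
print(repr(half_fixed))         # still mojibake: one character per byte
print(recover_text(mangled))    # 书名

# With \uXXXX escapes the one-liner alone is sufficient.
print(r"\u4e66\u540d".encode("latin-1").decode("unicode_escape"))  # 书名
```

In the question's code, the simpler fix is usually to avoid the problem upstream: call `source_code.decode('utf-8')` instead of `str(source_code)`, or pass the raw bytes to `BeautifulSoup` so that `from_encoding` can take effect.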
