Text scraped from a JD.com page comes out garbled

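I am using Beautiful Soup to parse a JD.com page and extract all of the text in it, but when I print the result it is all garbled. The JD page is encoded in UTF-8, yet I get an error when I convert the text to GBK.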

The code is below; any advice is appreciated.

#encoding=gbk
from bs4 import BeautifulSoup
from bs4 import NavigableString
from bs4 import Comment
from bs4 import Doctype
import urllib2

def walker(soup, indent):
    text=""
    if soup.name is not None:
        for child in soup.children:
            if isinstance(child, NavigableString):
                if len(child) != 1: # how should I check whether the string is empty?
                    text = indent + unicode(child).encode('utf-8').strip() #.decode('utf-8').encode('gbk')
            text += walker(child, indent+"\t")
    return text

if __name__ == "__main__":
    soup = BeautifulSoup( urllib2.urlopen("http://item.jd.com/1592573020.html").read()) 
    doctypes=soup.findAll(text=lambda text: isinstance(text, Doctype))
    [doctype.extract() for doctype in doctypes]
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]

    for script in soup("script"):
        script.extract()
    for noscript in soup("noscript"):
        noscript.extract()
    for style in soup("style"):
        style.extract()
    text=walker(soup, "")
    print "text", text.decode('utf-8').encode('gbk') # the error is raised here
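
One possible cause of the error on the last line, offered as a guess rather than a confirmed diagnosis: the extracted text contains characters that GBK cannot represent (a non-breaking space, U+00A0, is a common one in scraped HTML), so a strict `.encode('gbk')` raises `UnicodeEncodeError`. The sketch below is written for Python 3 rather than the Python 2 used above, and the helper `to_gbk` and the sample string are hypothetical, made up only for illustration. It shows the strict call failing and a lenient call with `errors="replace"` succeeding.

```python
def to_gbk(text, errors="replace"):
    """Encode a str as GBK bytes; unencodable characters follow `errors`."""
    return text.encode("gbk", errors=errors)


# Hypothetical sample: Chinese text plus a non-breaking space (U+00A0),
# which has no mapping in GBK.
sample = "京东\u00a0价格"

try:
    to_gbk(sample, errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    # Strict encoding rejects the character that GBK cannot represent.
    strict_failed = True

lenient = to_gbk(sample)            # U+00A0 is replaced with b"?"
round_trip = lenient.decode("gbk")  # decode again to inspect the result
print(strict_failed, round_trip)
```

In Python 2 the same `errors` argument is accepted by `unicode.encode`, so `text.decode('utf-8').encode('gbk', 'replace')` would be the equivalent call; passing `'ignore'` instead drops the offending characters rather than substituting `?`.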
