# -*- coding:utf-8 -*-
'''
Created on 2015-10-08
'''

def main():
    s = u"你好"
    d = {'id':001, 'text':s}

    s1 = "你好"
    d1 = {'id':002, 'text':s1}

    print d
    print s
    print "------------"
    print d1
    print s1

if __name__ == "__main__":
    main()

Output:

{'text': u'\u4f60\u597d', 'id': 1}
你好
------------
{'text': '\xe4\xbd\xa0\xe5\xa5\xbd', 'id': 2}
你好

Why do the strings printed directly show up as normal Chinese characters, while the ones inside the dict appear as \uxxxx or \x.. escapes? Could an expert please explain?

PS: I ran into the same dict behaviour when storing Chinese text with sqlite3 and when scraping Chinese data with scrapy. It is quite a headache.

A Python encoding question: a small example that is very confusing

3 answers

selfboot

Posted on 2015-10-08

Updated on 2015-10-08

✓ Accepted answer

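OK, to explain this clearly I will assume you know what an encoding is. If you are not sure, read this first: 人机交互之字符编码 (Character Encoding in Human-Computer Interaction). Below I walk through your code.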

# -*- coding:utf-8 -*-

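This line tells the Python interpreter that the script is encoded in UTF-8, so the interpreter decodes the source file as UTF-8 (you must, of course, make sure the file really is saved as UTF-8).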

String vs Unicode String

s1 = "你好"
s2 = u"你好"

s1 is of type str and s2 is of type unicode, as shown here:

>>> s1 = "你好"
>>> type(s1)
<type 'str'>
>>> s2 = u"你好"
>>> type(s2)
<type 'unicode'>

Both types are Python Sequence Types.

A str holds a plain sequence of bytes, that is, what a piece of text looks like after it has been encoded:
```
>>> str_1 = "你好"
>>> str_1
'\xe4\xbd\xa0\xe5\xa5\xbd'
```
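My console uses UTF-8 by default, so what the interpreter receives for str_1 is 你好 encoded as UTF-8, the bytes e4 bd a0 e5 a5 bd. In your script, 你好 is likewise encoded as UTF-8 before it is assigned to s1.
A unicode string holds a sequence of code points. Each code point lies in the range 0 to 0x10ffff and identifies exactly one character in the Unicode character set. In other words, for a unicode string the interpreter sees the sequence of code points of its characters.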
```
>>> unicode_str2 = u"你好"
>>> unicode_str2
u'\u4f60\u597d'
>>> u"你"
u'\u4f60'
>>> u"好"
u'\u597d'
```
Here 你 corresponds to code point U+4F60 in the Unicode character set, and 好 to U+597D.
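To make the bytes versus code points distinction concrete, here is a minimal sketch. It is my own illustration rather than part of the original session, and it is written so that it should run unchanged on Python 2.7 and Python 3 (the variable names are arbitrary). It encodes the same text with two codecs and decodes it back.

```python
# -*- coding: utf-8 -*-
# Text as code points: `unicode` in Python 2, `str` in Python 3.
text = u"你好"

# One code point per character.
code_points = [hex(ord(ch)) for ch in text]

# Encoding turns code points into bytes; the bytes depend on the codec.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# Decoding with the matching codec restores the original text.
round_trip = utf8_bytes.decode("utf-8")

print(code_points)        # ['0x4f60', '0x597d']
print(repr(utf8_bytes))   # the six UTF-8 bytes e4 bd a0 e5 a5 bd
print(repr(gbk_bytes))    # the four GBK bytes c4 e3 ba c3
print(round_trip == text)
```

The same two characters come out as six bytes in UTF-8 and four bytes in GBK, which is why the repr of a str depends on the encoding of the console or source file while the repr of a unicode string does not.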

A closer look at print

Before looking at how print works in Python, you first need to know two special methods of objects, __repr__ and __str__:

object.__repr__(self): Called by the repr() built-in function and by string conversions (reverse quotes) to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value. If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an “informal” string representation of instances of that class is required.

object.__str__(self): Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. This differs from __repr__() in that it does not have to be a valid Python expression: a more convenient or concise representation may be used instead. The return value must be a string object.

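When we print an object in Python, the lookup follows the figure below:

[Figure: how print turns an object into a string via __str__ and __repr__]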

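str and unicode define their own __str__, which returns a string that is easy for us to read. dict and list do not define __str__, so printing them ends up calling __repr__, which is meant to describe the object precisely.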

>>> str.__str__
<slot wrapper '__str__' of 'str' objects>
>>> unicode.__str__
<slot wrapper '__str__' of 'unicode' objects>
>>> dict.__str__
<slot wrapper '__str__' of 'object' objects>
>>> dict.__repr__
<slot wrapper '__repr__' of 'dict' objects>

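dict.__str__ resolves to the __str__ of 'object', which shows that dict does not define its own __str__. dict does define __repr__, so print d is equivalent to print repr(d). The repr of a dict is in turn built from the repr of each key and value, which is where the escape sequences come from.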

>>> d1 = {'id':002, 'text':"你好"}
>>> print d1
{'text': '\xe4\xbd\xa0\xe5\xa5\xbd', 'id': 2}
>>> print repr(d1)
{'text': '\xe4\xbd\xa0\xe5\xa5\xbd', 'id': 2}
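The same fallback can be seen without any Chinese text. The sketch below uses a hypothetical class, Greeting, that I made up for illustration. Its __str__ and __repr__ deliberately return different strings, so you can see which one a container picks. It should behave the same on Python 2.7 and Python 3.

```python
class Greeting(object):
    """Hypothetical class whose str() and repr() differ on purpose."""

    def __str__(self):
        return "informal"

    def __repr__(self):
        return "Greeting()"


g = Greeting()

direct = str(g)            # uses Greeting.__str__
in_list = str([g])         # list formats its items with repr()
in_dict = str({"key": g})  # dict formats keys and values with repr()

print(direct)   # informal
print(in_list)  # [Greeting()]
print(in_dict)  # {'key': Greeting()}
```

Printing the object itself gives the informal form, but as soon as it sits inside a list or dict the container shows its repr, exactly as happens with the strings in d and d1.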

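When scraping Chinese data with scrapy: for the data you receive, first find out which encoding it is in, then convert it accordingly (decode the bytes into unicode for processing, and encode again only when you need bytes for output).
When storing Chinese text with sqlite3: for the data you want to save, just decode it into the form the sqlite3 module expects, which is unicode text rather than raw encoded bytes.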
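As a rough sketch of the sqlite3 case: the table name notes, its columns, and the assumption that the incoming bytes are UTF-8 are all mine, not from the question. The idea is simply to decode once at the boundary and hand unicode text to the database. It uses an in-memory database and should run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import sqlite3

# Bytes as they might arrive from a file or a network response.
# Assumption: they are UTF-8 encoded.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decode once, at the boundary, into text (code points).
text = raw.decode("utf-8")

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for this illustration.
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO notes (id, text) VALUES (?, ?)", (1, text))

row = conn.execute("SELECT text FROM notes WHERE id = ?", (1,)).fetchone()
stored = row[0]
conn.close()

print(stored == text)  # the value read back equals the text stored
```

The value read back is text again, so printing stored on its own shows the characters. Escapes only reappear if you print the whole row tuple, because a tuple, like a dict, is shown through repr.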

Further details

About repr(), the documentation says:

repr() is meant to generate representations which can be read by the interpreter (or will force a SyntaxError if there is no equivalent syntax).
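Since repr() is aimed at the interpreter rather than at the reader, a readable dump of a dict needs a different route. One option, assuming the dict holds only JSON-compatible values and the text is already unicode, is json.dumps with ensure_ascii=False. This is a sketch of that idea, not something the original answer proposed.

```python
# -*- coding: utf-8 -*-
import json

d = {u"id": 1, u"text": u"你好"}

# Default behaviour: non-ASCII characters are escaped, much like repr().
escaped = json.dumps(d, sort_keys=True)

# ensure_ascii=False keeps the characters themselves in the result.
readable = json.dumps(d, ensure_ascii=False, sort_keys=True)

print(escaped)  # {"id": 1, "text": "\u4f60\u597d"}
print(len(escaped) > len(readable))  # True: the escapes take more characters
```

Printing readable shows the Chinese characters directly, provided the console encoding can represent them.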

What do u"\uxxxx" and "\x" mean? Skip this part if you are not interested.

Escape Sequence	Meaning
\uxxxx	Character with 16-bit hex value xxxx (Unicode only)
\xhh	Character with hex value hh

Here are some examples:

>>> chr(0x41)
'A'
>>> "\x41"
'A'
>>> "\x01" # a non printable character
'\x01'
>>> "\x41abc"
'Aabc'
>>> print u"\u5b66"  # the Unicode table gives U+5B66 as the code point of the character `学`
学
>>> print u"\u5b66abc"
学abc
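If you end up holding the escaped form as data (for example, text copied out of a printed dict), the standard unicode_escape codec can convert in both directions. The following is a small sketch with variable names of my own choosing. It should run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Escape sequences in a literal are just another way to write a character.
same_byte = "\x41" == "A"
same_char = u"\u5b66" == u"学"
code_point = ord(u"学")

# unicode_escape produces the backslash-u form as plain ASCII bytes...
escaped = u"你好".encode("unicode_escape")

# ...and decoding those bytes gives the original text back.
restored = escaped.decode("unicode_escape")

print(same_byte, same_char)
print(hex(code_point))  # 0x5b66
print(restored == u"你好")
```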

vimac

Posted on 2015-10-08

Updated on 2015-10-08

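What you get back is just the raw encoded form, so there is nothing to worry about.
In Python, printing an object that is not a str implicitly calls the object's __str__ method (in effect, converting it to a string).
The __str__ method of dict (and of list, tuple and many other built-in types) applies this escaping to the strings it contains, so that the output consists only of ASCII characters.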

If you print d['text'] or print d1['text'], you will see the result you expect.

Here is an example:

class A:
     def __str__(self):
         return "hello"

a = A()
print a

The program above prints hello.

Edit:

I went and checked Python's C source. For a dict it is indeed repr that is used, not str, so my answer above is wrong.

See: https://github.com/python/cpython/blob/2.7/Objects/dictobject.c#L1023

dream

Posted on 2015-10-08

Updated on 2015-10-08

# calling repr
>>> a=u"你好"
>>> b="你好"
>>> print repr(a)
u'\u4f60\u597d'
>>> print repr(b)
'\xc4\xe3\xba\xc3'
