python 中的maketrans在utf-8文件中该怎么使用

Question

python 中的maketrans在utf-8文件中该怎么使用

发布于
2017-04-26

我写了一个处理文本的文件就是把文本中所有的符号都替换掉，替换成空格。用的python中maketrans和translate。其中在使用对于ASCII编码的文件时是正常的，但对于utf-8文件时，就报错，提示maketrans中的参数不等长，但是明明是一样长的啊：

File "/Users/lgq/Desktop/p3.py", line 10, in text_to_words
"abcdefghijklmnopqrstuvwxyz                                                   ") 
ValueError: the first two maketrans arguments must have equal length

我查了一下说是maketrans在utf-8下不能用，那我在utf-8下该怎么替换掉字符呢，求各位大神指点。

def text_to_words(the_text):
    """ 
        Return a list of words with all punctuation removed,
        and all in lowercase.
    """
    my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                            ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds


def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words."""
    f = open(filename, "r", encoding = "utf-8")
    content = f.read()
    f.close()
    wds = text_to_words(content)
    return wds


book_words = get_words_in_book("alice.txt")
print("There are {0} words in the book, the first 100 are\n{1}".
        format(len(book_words), book_words[:100]))

python3.x

python

阅读 3.7k

1 个回答

得票最新

caimaoy

85341931

发布于
2017-04-26

✓ 已被采纳

首先这两个字符串长度不相等， \" 是一个字符， \\ 也是一个字符
你可以用 len() 查看。
然后关于字符串什么的问题，最好说明 python 的版本

maketrans 参数长度不相等

 my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                            ")

测试代码：

from string import translate, maketrans

def text_to_words(the_text):
    """ 
        Return a list of words with all punctuation removed,
        and all in lowercase.
    """
    my_substitutions = maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                          ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

text_to_words('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~\'\\测试')

output

['abcdefghijklmnopqrstuvwxyz', '\xe6\xb5\x8b\xe8\xaf\x95']

这是 python2 的运行结果

撰写回答

你尚未登录，登录后可以

和开发者交流问题的细节
关注并接收问题和回答的更新提醒
参与内容的编辑和改进，让解决方法与时俱进

推荐问题

相似问题

找不到问题？创建新问题

python 中的maketrans在utf-8文件中该怎么使用

你尚未登录，登录后可以

字节的 trae AI IDE 不支持类似 vscode 的 ssh remote 远程开发怎么办？

DataCap 中验证码无法显示，后台出现 NullPointerException 错误?

发现深拷贝和浅拷贝效果一致：请问一下有什么区别呢？

如何实现一个深拷贝函数？

Python 成员变量在多个子类实例间共享，如何避免？

为什么 Qwen2.5-Omni-7B 官方教程都报错 Cannot import available module of Qwen2_5OmniModel in modelscope ？

Spark-TTS-0.5B 的 requirements.txt 在哪里？