TypeError: expected string or bytes-like object when using Python/NLTK word_tokenize

I have a dataset with ~40 columns and am using .apply(word_tokenize) on 5 of them, like so: df['token_column'] = df.column.apply(word_tokenize).

I get a TypeError for only one of the columns, which we'll call problem_column:

 TypeError: expected string or bytes-like object

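Here is the full error (with the df and column names, and any PII, stripped out). I'm new to Python and am still trying to work out which parts of the error message are relevant: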

 TypeError                                 Traceback (most recent call last)
<ipython-input-51-22429aec3622> in <module>()
----> 1 df['token_column'] = df.problem_column.apply(word_tokenize)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66440)()

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
    128     :type preserver_line: bool
    129     """
--> 130     sentences = [text] if preserve_line else sent_tokenize(text, language)
    131     return [token for sent in sentences
    132             for token in _treebank_word_tokenizer.tokenize(sent)]

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     95     """
     96     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 97     return tokenizer.tokenize(text)
     98
     99 # Standard word tokenizer.

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1233         Given a text, returns a list of the sentences in that text.
   1234         """
-> 1235         return list(self.sentences_from_text(text, realign_boundaries))
   1236
   1237     def debug_decisions(self, text):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1281         follows the period.
   1282         """
-> 1283         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1284
   1285     def _slices_from_text(self, text):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1272         if realign_boundaries:
   1273             slices = self._realign_boundaries(text, slices)
-> 1274         return [(sl.start, sl.stop) for sl in slices]
   1275
   1276     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1272         if realign_boundaries:
   1273             slices = self._realign_boundaries(text, slices)
-> 1274         return [(sl.start, sl.stop) for sl in slices]
   1275
   1276     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1312         """
   1313         realign = 0
-> 1314         for sl1, sl2 in _pair_iter(slices):
   1315             sl1 = slice(sl1.start + realign, sl1.stop)
   1316             if not sl2:

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    310     """
    311     it = iter(it)
--> 312     prev = next(it)
    313     for el in it:
    314         yield (prev, el)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1285     def _slices_from_text(self, text):
   1286         last_break = 0
-> 1287         for match in self._lang_vars.period_context_re().finditer(text):
   1288             context = match.group() + match.group('after_tok')
   1289             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
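
Reading the traceback from the bottom up, the last frame is the relevant one: `_slices_from_text` hands `text` to a compiled regular expression's `finditer`, and Python's `re` module raises this TypeError whenever it receives something that is not a `str` or a bytes-like object. The sketch below reproduces the same error without NLTK or pandas. It assumes the offending cell holds a float `NaN` (a common way for missing values to appear in an object column), and the pattern is only a placeholder, not NLTK's actual regex.

```python
import re

# Placeholder pattern; NLTK's real sentence-boundary regex is more involved.
pattern = re.compile(r"\w+")

def try_finditer(value):
    """Return the matches for value, or the TypeError message if re rejects it."""
    try:
        return [m.group() for m in pattern.finditer(value)]
    except TypeError as exc:
        return f"TypeError: {exc}"

print(try_finditer("room 12"))      # a real string is accepted
print(try_finditer(float("nan")))   # a float is rejected by re
```

So the frames above the last one are mostly plumbing; the useful question is which value in problem_column is not a string.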

All 5 columns are character/string (verified in SQL Server, in SAS, and with .select_dtypes(include=[object])).

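For good measure I used .to_string() to make sure problem_column really contains nothing but strings, but I still get the error. If I process the columns separately, good_column1 through good_column4 keep working and problem_column still raises the error.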
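
An `object` dtype only says that pandas stores Python objects in the column; it does not guarantee that every cell is a `str`. Missing values typically arrive as `None` or float `NaN`, and `.to_string()` renders them as text without changing what is stored. One way to check is sketched below on made-up data standing in for problem_column (the values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real problem_column.
df = pd.DataFrame(
    {"problem_column": pd.Series(["apt 4", np.nan, "12 main st", None], dtype=object)}
)

# The column is reported as object even though two cells are not strings.
print(df["problem_column"].dtype)

# Count the Python type of every cell.
type_counts = df["problem_column"].map(type).value_counts()
print(type_counts)

# Boolean mask selecting the rows that would make word_tokenize fail.
not_str = ~df["problem_column"].map(lambda v: isinstance(v, str))
print(df[not_str])
```

Any row selected by `not_str` is a candidate for the cell that triggers the TypeError.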

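I've searched around and, apart from removing all numbers from the set (which I can't do, because they are meaningful), I haven't found any other fix.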

Original post by LMGagne; translation licensed under CC BY-SA 4.0.


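    from nltk.tokenize import word_tokenize

    def custom_tokenize(text):
        # Falsy values (None, empty string) are replaced with a blank string.
        if not text:
            print('The text to be tokenized is a None type. Defaulting to blank string.')
            text = ''
        return word_tokenize(text)

    df['tokenized_column'] = df.column.apply(custom_tokenize)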
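
Note that `if not text` only catches falsy values such as `None` and `''`. A float `NaN`, which is how pandas usually represents a missing cell, is truthy and would still reach `word_tokenize`. A variant that checks the type explicitly is sketched below under a few assumptions: `tokenize_safely` is a hypothetical helper name, missing values become an empty token list, other non-string values such as numbers are converted with `str()` so they are kept, and `str.split` stands in for `word_tokenize` so the sketch runs without NLTK's tokenizer data.

```python
import numpy as np
import pandas as pd

def tokenize_safely(value, tokenizer=str.split):
    """Tokenize value, tolerating missing and non-string cells."""
    if isinstance(value, str):
        return tokenizer(value)
    if pd.isna(value):
        # None and NaN: nothing to tokenize.
        return []
    # Numbers and other scalars are meaningful here, so keep them as text.
    return tokenizer(str(value))

# Hypothetical data standing in for the real problem_column.
df = pd.DataFrame(
    {"problem_column": pd.Series(["apt 4", np.nan, 12, None], dtype=object)}
)
df["token_column"] = df["problem_column"].apply(tokenize_safely)
print(df)
```

With NLTK installed and its tokenizer models downloaded, the real tokenizer could be passed as a keyword argument, for example `df.problem_column.apply(tokenize_safely, tokenizer=word_tokenize)`.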
