NLTK Chinese word segmentation error (StanfordSegmenter)

from nltk.tokenize import StanfordSegmenter
segmenter = StanfordSegmenter(
    path_to_sihan_corpora_dict="E:/NLP/NLP_code/Installation/base/stanford-segmenter-2017-06-09/data",
    path_to_model="E:/NLP/NLP_code/Installation/base/stanford-segmenter-2017-06-09/data/pku.gz",
    path_to_dict="E:/NLP/NLP_code/Installation/base/stanford-segmenter-2017-06-09/data/dict-chris6.ser.gz")
res = segmenter.segment(u"北海已成为中国对外开放中升起的一颗明星")
print(res)

C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\python.exe D:/programming/leetcode/test.py
D:/programming/leetcode/test.py:3: DeprecationWarning: 
The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.parse.corenlp.CoreNLPTokenizer instead.'
  path_to_sihan_corpora_dict="E:/NLP/NLP_code/Installation/base/stanford-segmenter-2017-06-09/data",   path_to_model="E:/NLP/NLP_code/Installation/base/stanford-segmenter-2017-06-09/data/pku.gz",   path_to_dict="E:/NLP/NLP_code/Installation/base/stanford-segmenter-2017-06-09/data/dict-chris6.ser.gz")
Traceback (most recent call last):
  File "D:/programming/leetcode/test.py", line 4, in <module>
    res = segmenter.segment(u"北海已成为中国对外开放中升起的一颗明星")
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\tokenize\stanford_segmenter.py", line 182, in segment
    return self.segment_sents([tokens])
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\tokenize\stanford_segmenter.py", line 210, in segment_sents
    stdout = self._execute(cmd)
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\tokenize\stanford_segmenter.py", line 229, in _execute
    stdout, _stderr = java(cmd, classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\internals.py", line 129, in java
    p = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr)
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 971, in _execute_child
    args = list2cmdline(args)
  File "C:\Users\lybroman\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 461, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'NoneType' is not iterable
3 Answers

Some parameters are missing; java_class is required. Without it, the first element of the Java command line that the segmenter builds is None, which is exactly what triggers the TypeError in list2cmdline.
segmenter = StanfordSegmenter(
    java_class='edu.stanford.nlp.ie.crf.CRFClassifier',
    path_to_jar='/home/kenwood/stanford/segmenter/stanford-segmenter.jar',
    path_to_slf4j='/home/kenwood/stanford/segmenter/slf4j-api.jar',
    path_to_sihan_corpora_dict='/home/kenwood/stanford/segmenter/data',
    path_to_model='/home/kenwood/stanford/segmenter/data/pku.gz',
    path_to_dict='/home/kenwood/stanford/segmenter/data/dict-chris6.ser.gz'
)
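
For completeness: with java_class and the jar paths supplied, the call from the question should go through, as long as NLTK can locate the java binary (on PATH, or via the JAVAHOME environment variable). A minimal usage sketch reusing the segmenter constructed above:

# segment() returns the segmented sentence as one string, tokens separated by spaces.
res = segmenter.segment(u"北海已成为中国对外开放中升起的一颗明星")
print(res)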


I added java_class='edu.stanford.nlp.ie.crf.CRFClassifier' and it still raises an error.

Stanford segmenter setup on Windows (with the required environment variables set beforehand). Line 153 of the source shows that java_class must be passed; see https://github.com/nltk/nltk/...

segmenter = StanfordSegmenter(
    # Use raw strings (or forward slashes) for Windows paths so the
    # backslashes are not treated as escape sequences.
    path_to_sihan_corpora_dict=r"E:\stanford_nlp\stanford-segmenter-2018-10-16\data",
    path_to_model=r"E:\stanford_nlp\stanford-segmenter-2018-10-16\data\pku.gz",
    path_to_dict=r"E:\stanford_nlp\stanford-segmenter-2018-10-16\data\dict-chris6.ser.gz",
    java_class='edu.stanford.nlp.ie.crf.CRFClassifier')
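
Putting the Windows pieces together, here is a minimal end-to-end sketch. It assumes NLTK finds the Java binary via the JAVAHOME environment variable (or PATH); the JDK path and the jar file names are placeholders, so check the actual file names in your stanford-segmenter download (some releases ship a versioned jar such as stanford-segmenter-3.9.2.jar):

import os
from nltk.tokenize import StanfordSegmenter

# Hypothetical JDK location -- adjust to your machine, or skip this line
# if java.exe is already on PATH.
os.environ['JAVAHOME'] = r"C:\Program Files\Java\jdk1.8.0_161\bin"

seg_home = r"E:\stanford_nlp\stanford-segmenter-2018-10-16"

segmenter = StanfordSegmenter(
    java_class='edu.stanford.nlp.ie.crf.CRFClassifier',
    # Check these jar names against the files actually present in seg_home.
    path_to_jar=os.path.join(seg_home, "stanford-segmenter.jar"),
    path_to_slf4j=os.path.join(seg_home, "slf4j-api.jar"),
    path_to_sihan_corpora_dict=os.path.join(seg_home, "data"),
    path_to_model=os.path.join(seg_home, "data", "pku.gz"),
    path_to_dict=os.path.join(seg_home, "data", "dict-chris6.ser.gz"))

res = segmenter.segment(u"北海已成为中国对外开放中升起的一颗明星")
print(res)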