This question is a modified version of an earlier one. I have the following code:

import nltk

def content_text(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequent words in a text, 1) including and 2) excluding stopwords?

Originally posted by user2064809; translation licensed under CC BY-SA 4.0

python nltk word-frequency find-occurrences

2 Answers

Community wiki, posted 2023-01-08

✓ Accepted

NLTK provides a FreqDist class for this:

import nltk
# text is the input string to analyze
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

# use a set for fast membership tests, and compare the lowercased word
stopwords = set(nltk.corpus.stopwords.words('english'))
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common words:

# most_common returns a list of (word, count) pairs; keep only the words
mostCommon = [word for word, _ in allWordDist.most_common(10)]
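
As a rough, self-contained sketch of the same idea that runs without downloading any NLTK data, the snippet below counts words with `collections.Counter`, which `FreqDist` subclasses. The sample sentence, the tiny `STOPWORDS` set, and the helper name `top_words` are illustrative assumptions and not part of NLTK; in practice the set would come from `nltk.corpus.stopwords.words('english')`.

```python
from collections import Counter

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"the", "a", "on", "and", "is"}

def top_words(tokens, n=10, exclude=frozenset()):
    """Return the n most frequent lowercased tokens, skipping any in `exclude`."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in exclude)
    return [word for word, _ in counts.most_common(n)]

# Made-up sample text, split on whitespace for simplicity
sample = "The cat sat on the mat and the dog sat on the cat".split()

including = top_words(sample, n=3)                     # stopwords counted
excluding = top_words(sample, n=3, exclude=STOPWORDS)  # stopwords skipped
print(including)
print(excluding)
```

Lowercasing before the membership test matters here: without it, a capitalized "The" would slip past the stopword filter.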

Originally posted by igorushi; translation licensed under CC BY-SA 3.0

Community wiki, posted 2023-01-08

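I'm not sure about the "is stopwords" in your function; I think it needs to be "in". Either way, you can use a Counter dict and its most_common(10) to get the most frequent words: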

import nltk
from collections import Counter
from string import punctuation

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            # lowercase each token and strip trailing punctuation before any lookup
            words = [w.lower().rstrip(punctuation) for w in line.split()]
            words = [w for w in words if w]  # drop tokens that were only punctuation
            # update the count of all words in the line that are stopwords
            with_stp.update(w for w in words if w in stopwords)
            # update the count of all words in the line that are not stopwords
            without_stp.update(w for w in words if w not in stopwords)
    # return the ten most common (word, count) pairs from each counter
    return with_stp.most_common(10), without_stp.most_common(10)

wth_stop, wthout_stop = content_text(...)
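
Since the question asks how to print the results, note that `Counter.most_common(n)` returns a list of `(word, count)` tuples ordered from most to least frequent, so printing is a matter of unpacking each pair. A minimal sketch with a made-up word list; `format_top` is a hypothetical helper, not part of the answer's code:

```python
from collections import Counter

def format_top(counter, n=10):
    """Render the n most common entries as 'word: count' strings."""
    return [f"{word}: {count}" for word, count in counter.most_common(n)]

# Made-up tokens purely for illustration
counts = Counter(["we", "the", "we", "people", "we", "the"])
lines = format_top(counts, n=2)
print("\n".join(lines))
```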

If you are passing in an NLTK corpus object, just iterate over it:

import nltk
from collections import Counter

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # count words that are stopwords
            with_stp.update([word])
        else:
            # count words that are not stopwords
            without_stp.update([word])
    # return the ten most common words from each counter
    return [k for k, _ in with_stp.most_common(10)], [y for y, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The NLTK approach includes punctuation tokens in the counts, so it may not be what you want.
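
One possible way to keep punctuation out of the counts, sketched here under the assumption that only purely alphabetic tokens are wanted, is to filter with `str.isalpha()` before counting. The token list is a made-up stand-in for the output of a corpus reader such as `nltk.corpus.inaugural.words(...)`, and `count_alpha` is a hypothetical helper.

```python
from collections import Counter

def count_alpha(tokens):
    """Count lowercased tokens, ignoring any that contain non-letter characters."""
    return Counter(t.lower() for t in tokens if t.isalpha())

# Made-up tokens resembling tokenizer output, where punctuation is its own token
tokens = ["My", "fellow", "citizens", ":", "I", "stand", "here", ",", "my", "friends", "."]

counts = count_alpha(tokens)
print(counts.most_common(1))
```

This filter is deliberately strict: it also drops contractions and hyphenated words, since those contain non-letter characters, so a looser test may suit some texts better.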

Originally posted by Padraic Cunningham; translation licensed under CC BY-SA 3.0
