I want to count the frequency of every word in a text file.

 >>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file looks like this:

 # test.txt
aaa bbb ccc
bbb

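I implemented this in pure Python following some posts. However, I found that the pure-Python approach is not fast enough because the file is huge (> 1GB).

I think borrowing the power of sklearn is one candidate.

If you let CountVectorizer count frequencies for each line, I guess you would get the word frequencies by summing up each column. But that sounds like a rather indirect way to do it.

What is the most efficient and straightforward way to count the words in a file with Python?

Update

My (very slow) code is here: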

 from collections import Counter
import string  # needed for string.punctuation below

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }

Originally posted by Light Yagmi; translation licensed under CC BY-SA 4.0.

python nlp scikit-learn word-count frequency-distribution


2 answers


Community wiki

Posted on 2023-01-08

✓ Accepted answer

The most succinct approach is to use the tools Python gives you.

 from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

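That's it. map(str.split, f) makes a generator that returns a list of words from each line. Wrapping it in chain.from_iterable converts that into a single generator that produces one word at a time. Counter takes an input iterable and counts all the unique values in it. At the end, you return a dict-like object (a Counter) holding every unique word and its count, and while it is being built you only keep one line of data plus the running totals in memory, not the whole file.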
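To make the three stages concrete, here is a minimal sketch that runs the same pipeline on an in-memory list of lines instead of an open file. The list `lines` is an assumption standing in for the file object; any iterable of lines behaves the same way.

```python
from collections import Counter
from itertools import chain

# Stand-in for an open file: the contents of test.txt from the question.
lines = ["aaa bbb ccc\n", "bbb\n"]

per_line_words = map(str.split, lines)        # lazily yields ['aaa', 'bbb', 'ccc'], then ['bbb']
words = chain.from_iterable(per_line_words)   # lazily yields 'aaa', 'bbb', 'ccc', 'bbb'
counts = Counter(words)                       # consumes the stream and tallies each word

print(counts)  # Counter({'bbb': 2, 'aaa': 1, 'ccc': 1})
```

Nothing is materialized until Counter starts pulling words, which is why memory use stays at roughly one line plus the totals.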

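In theory, on Python 2.7 and 3.1, you might do slightly better by looping over the chained results yourself and counting with a dict or collections.defaultdict(int) (because Counter is implemented in Python there, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.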
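For comparison, the hand-rolled alternative mentioned above might look roughly like the sketch below. The function name `count_words_manual` is hypothetical, and it takes any iterable of lines (an open file works) rather than a filename, purely to keep the example self-contained.

```python
from collections import defaultdict
from itertools import chain

def count_words_manual(lines):
    """Hypothetical helper: tally words with an explicit loop instead of Counter."""
    counts = defaultdict(int)   # missing keys start at 0
    for word in chain.from_iterable(map(str.split, lines)):
        counts[word] += 1
    return dict(counts)

print(count_words_manual(["aaa bbb ccc\n", "bbb\n"]))
```

On current CPython this is unlikely to beat Counter, so treat it as an illustration of what Counter does for you rather than a recommendation.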

Update: It seems you want punctuation stripped and case-insensitive counting, so here is a variant of my earlier code that does that:

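 from collections import Counter
from itertools import chain
from string import punctuation

# Python 3 form; on Python 2 use line.translate(None, punctuation) instead.
strip_punct = str.maketrans('', '', punctuation)

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(strip_punct).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))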

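Your code runs much more slowly because it is creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would at least be algorithmically similar in scaling factor).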
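A minimal sketch of that "one Counter, one .update per line" idea, assuming Python 3. The names `count_with_update` and `_STRIP_PUNCT` are hypothetical, and the function accepts any iterable of lines so it can be tried without a file.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character (Python 3).
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def count_with_update(lines):
    """Hypothetical helper: accumulate into a single Counter, one update per line."""
    total = Counter()
    for line in lines:
        # update() adds this line's words to the running totals in place,
        # so no per-line Counter, set, or merged dict has to be rebuilt.
        total.update(line.lower().translate(_STRIP_PUNCT).split())
    return total

print(count_with_update(["Aaa, bbb ccc.\n", "BBB!\n"]))
```

The difference from the question's code is that the totals are mutated in place instead of being rebuilt from scratch over the full vocabulary on every line.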

Originally posted by ShadowRanger; translation licensed under CC BY-SA 3.0.

Community wiki

Posted on 2023-01-08

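A memory-efficient and accurate way is to make use of:

CountVectorizer in scikit-learn (for ngram extraction)
NLTK for word_tokenize
numpy matrix sum to collect the counts
collections.Counter for collecting the counts and vocabulary

An example: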

 import urllib.request
from collections import Counter

import numpy as np

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X is a sparse matrix: each row is one line (sentence), each column holds the count of one vocabulary token
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary. On scikit-learn older than 1.0, call get_feature_names() instead.
vocab = list(ngram_vectorizer.get_feature_names_out())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))

[out]:

 [(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:

 from collections import Counter
import numpy as np
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's timeit:

 import time

start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)

[out]:

 5.257147789001465
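If you want to time the pure-Python Counter approach in the same spirit, a rough sketch with the standard timeit module could look like this. The data here is synthetic (an assumption, not the sample file above), and `count_words` is a hypothetical helper, so the numbers are not comparable to the figure reported above.

```python
import timeit
from collections import Counter
from itertools import chain

def count_words(lines):
    """Hypothetical helper: the Counter + chain pipeline over an iterable of lines."""
    return Counter(chain.from_iterable(map(str.split, lines)))

# Synthetic input: 20,000 short lines held in memory.
sample_lines = ["aaa bbb ccc\n", "bbb\n"] * 10000

runs = 5
elapsed = timeit.timeit(lambda: count_words(sample_lines), number=runs)
print(elapsed / runs)  # average seconds per run; varies by machine
```

For a real comparison, feed both approaches the same file and tokenization rules, since word_tokenize and str.split do not split text the same way.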

Note that CountVectorizer can also take a file instead of a string, so there is no need to read the whole file into memory. In code:

 import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))

Originally posted by alvas; translation licensed under CC BY-SA 3.0.
