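I have a list of names, for example:

names = ['A', 'B', 'C', 'D']

and a list of documents, each of which mentions some of these names:

document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

I would like to get the output as a co-occurrence matrix, like:

  A B C D
A 0 2 1 1
B 2 0 2 1
C 1 2 0 1
D 1 1 1 0

There is a solution to this problem in R (creating a co-occurrence matrix), but I could not do it in Python. I am thinking of doing it in pandas, but have made no progress yet!

Originally posted by mk_sch; translation licensed under CC BY-SA 4.0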

Co-occurrence matrix from a nested list of words

2 answers

Posted on 2023-01-11

✓ Accepted

Obviously this can be extended for your purposes, but it performs the general operation:

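import math

# `document` is the nested list from the question.
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

for a in 'ABCD':
    for b in 'ABCD':
        count = 0

        for x in document:
            if a != b:
                # Two different names: count the documents that contain both.
                if a in x and b in x:
                    count += 1

            else:
                # Same name: count pairs of repeated occurrences, C(n, 2).
                n = x.count(a)
                if n >= 2:
                    count += math.factorial(n) // math.factorial(n - 2) // 2

        print('{} x {} = {}'.format(a, b, count))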

Originally posted by Malik Brahimi; translation licensed under CC BY-SA 3.0
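The loop above prints one line per pair, whereas the question asks for a matrix. A minimal sketch of one way to collect the same off-diagonal counts into a nested list, using `collections.Counter` and `itertools.combinations`, follows. It assumes each name counts at most once per document and that the diagonal is fixed at zero, which differs from the repeated-name handling above; the helper `cooc` and the variable names are illustrative, not from the original answer.

```python
from collections import Counter
from itertools import combinations

names = ['A', 'B', 'C', 'D']
document = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]

pair_counts = Counter()
for doc in document:
    # Keep each allowed name once per document, in a fixed (sorted) order,
    # so that every pair is stored under a single canonical key.
    present = sorted(set(doc) & set(names))
    pair_counts.update(combinations(present, 2))

def cooc(a, b):
    """Number of documents mentioning both a and b (0 on the diagonal)."""
    if a == b:
        return 0
    return pair_counts[tuple(sorted((a, b)))]

# Rows and columns follow the order of `names`.
matrix = [[cooc(a, b) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, row)
```

Each document is visited once, so this may scale better than re-scanning every document for every pair of names.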

Community wiki

Posted on 2023-01-11

Another option is to use the constructor csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)]) of scipy.sparse.csr_matrix, where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k].
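To make the relationship between data, row_ind and col_ind concrete, here is a small sketch on a made-up 2 x 3 example. The values are purely illustrative and are not taken from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three non-zero entries: (0, 0), (0, 1) and (1, 2), each with value 1.
data = np.array([1, 1, 1])
row_ind = np.array([0, 0, 1])
col_ind = np.array([0, 1, 2])

a = csr_matrix((data, (row_ind, col_ind)), shape=(2, 3))
dense = a.toarray()
print(dense)

# Check the defining relationship a[row_ind[k], col_ind[k]] == data[k].
for k in range(len(data)):
    assert dense[row_ind[k], col_ind[k]] == data[k]
```

Positions that are not listed in row_ind and col_ind are simply left as zeros and take no storage in the sparse matrix.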

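The trick is to generate row_ind and col_ind by iterating over the documents and creating a list of tuples (doc_id, word_id). data is simply a vector of ones of the same length.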

Multiplying the transpose of the docs-words matrix by the matrix itself gives the co-occurrence matrix.
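As a worked illustration of that product, the sketch below uses a small dense NumPy array for the question's three documents (rows) and the names A to D (columns). The dense array is an assumption made for readability; the answer itself works with a sparse matrix.

```python
import numpy as np

# X[d, w] is 1 if document d mentions name w (columns: A, B, C, D).
X = np.array([
    [1, 1, 0, 0],   # ['A', 'B']
    [0, 1, 1, 0],   # ['C', 'B', 'K']  ('K' is not an allowed name)
    [1, 1, 1, 1],   # ['A', 'B', 'C', 'D', 'Z']
])

# Entry (i, j) sums X[d, i] * X[d, j] over documents d, which is the
# number of documents that mention both name i and name j.
cooc = X.T @ X

# The diagonal holds how many documents mention each name on its own.
doc_freq = cooc.diagonal().copy()

# The question wants zeros on the diagonal.
np.fill_diagonal(cooc, 0)
print(cooc)
```

This also shows why the diagonal has to be cleared afterwards: before that step it contains per-name document counts rather than co-occurrences.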

In addition, this is efficient in terms of both run time and memory usage, so it should also handle large corpora.

import numpy as np
import itertools
from scipy.sparse import csr_matrix

def create_co_occurences_matrix(allowed_words, documents):
    print(f"allowed_words:\n{allowed_words}")
    print(f"documents:\n{documents}")
    word_to_id = dict(zip(allowed_words, range(len(allowed_words))))
    # Map each document to the sorted ids of its allowed words; other words are dropped.
    documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
    # One (doc_id, word_id) pair per occurrence of an allowed word.
    row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
    data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
    # Size the columns by the vocabulary so the ids match word_to_id even if the last word never occurs.
    docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), len(allowed_words)))
    words_cooc_matrix = docs_words_matrix.T @ docs_words_matrix  # transpose times matrix gives the co-occurrence matrix
    words_cooc_matrix.setdiag(0)
    print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
    return words_cooc_matrix, word_to_id

Example run:

allowed_words = ['A', 'B', 'C', 'D']
documents = [['A', 'B'], ['C', 'B', 'K'],['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix, word_to_id = create_co_occurences_matrix(allowed_words, documents)

Output:

allowed_words:
['A', 'B', 'C', 'D']
documents:
[['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix:
[[0 2 1 1]
 [2 0 2 1]
 [1 2 0 1]
 [1 1 1 0]]

Originally posted by Mockingbird; translation licensed under CC BY-SA 3.0
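Since the question mentions pandas, a labelled view of the sparse result may be convenient. The sketch below is one possible way to do that and is not part of either answer. To stay self-contained it rebuilds stand-ins for the two values returned by create_co_occurences_matrix from the output shown above; the names `labels` and `cooc_df` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Stand-ins for the values returned by create_co_occurences_matrix.
words_cooc_matrix = csr_matrix(np.array([
    [0, 2, 1, 1],
    [2, 0, 2, 1],
    [1, 2, 0, 1],
    [1, 1, 1, 0],
]))
word_to_id = {'A': 0, 'B': 1, 'C': 2, 'D': 3}

# Order the labels by their column id so they line up with the matrix.
labels = sorted(word_to_id, key=word_to_id.get)

# Densify only for display; for a large vocabulary this may not fit in memory.
cooc_df = pd.DataFrame(words_cooc_matrix.toarray(), index=labels, columns=labels)
print(cooc_df)
```

Individual counts can then be looked up by name, for example `cooc_df.loc['A', 'B']`.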
