doc2vec
Doc2vec, also known as Paragraph Vector or sentence embeddings, is an unsupervised algorithm that learns fixed-length feature representations from variable-length text (for example, sentences, paragraphs, or documents). It is an extension of Word2Vec, and one of its advantages is that it does not require a fixed sentence length: texts of different lengths can be used as training samples.
Put simply, the model is first trained on a large amount of text; the trained model can then convert any piece of text into a vector. Only once texts are represented as vectors can similarity between them be computed.
gensim ships with a ready-made Doc2Vec implementation that can be used directly.
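As a minimal sketch of the whole workflow (the sentences are toy data invented for illustration; a real corpus needs far more text):

import gensim

# Toy corpus: tokenize each text and wrap it in a TaggedDocument
# with a unique tag (here, its index in the list).
texts = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply on monday",
]
docs = [gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(t), [i])
        for i, t in enumerate(texts)]

# Train a tiny model (the parameters are illustrative only).
model = gensim.models.doc2vec.Doc2Vec(vector_size=10, min_count=1, epochs=20)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# Any new text can now be turned into a fixed-length vector.
vec = model.infer_vector(["my", "dog", "likes", "cats"])
print(vec.shape)  # (10,)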
Using gensim
import os
import gensim
import smart_open
import logging
import sqlite3

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def read_db3(fname):
    conn = sqlite3.connect(fname)
    cur = conn.cursor()
    cur.execute('select lngid,description from modify_title_info_zt where description !=""')
    outtext = cur.fetchone()
    while outtext:
        # Lowercase, strip punctuation, and tokenize the description field
        tokens = gensim.utils.simple_preprocess(outtext[1])
        # tags must be a list; the record id serves as the document tag
        yield gensim.models.doc2vec.TaggedDocument(tokens, [outtext[0]])
        outtext = cur.fetchone()

train_corpus = list(read_db3('zt_aipjournal_20210615_1.db3'))
The code above is straightforward: it lowercases the description field from the db3 file, strips punctuation, and tokenizes it, yielding the corpus used for training.
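For reference, this is what gensim.utils.simple_preprocess does to a sample string: it lowercases, strips punctuation, tokenizes, and by default drops tokens shorter than 2 or longer than 15 characters.

from gensim.utils import simple_preprocess

print(simple_preprocess("Doc2Vec: an Unsupervised Algorithm!"))
# expected output: ['doc', 'vec', 'an', 'unsupervised', 'algorithm']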
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=10, workers=4)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save('aip.model')
vector_size: dimensionality of the vectors (default 100)
min_count: ignores all words whose frequency is below this value
epochs: number of training iterations (default 10)
workers: number of worker threads used for training
After training, save the model for later use.
The corpus above contains 180,000 documents; training the 100-dimensional model took 500 seconds, and the 300-dimensional model 750 seconds.
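As a quick sanity check of the trained model (a pattern borrowed from the gensim documentation; model.dv is the gensim 4.x attribute for document vectors, called docvecs in 3.x), you can re-infer a vector for a training document and confirm that its own tag ranks near the top of its neighbours:

# Infer a vector for the first training document and look up its
# nearest neighbours among the learned document vectors.
inferred = model.infer_vector(train_corpus[0].words)
print(model.dv.most_similar([inferred], topn=3))
# prints (tag, cosine similarity) pairs; the document's own tag
# should usually appear at or near the top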
def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="utf8") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

test_corpus = list(read_corpus('test.txt', tokens_only=True))

new_model = gensim.models.doc2vec.Doc2Vec.load('aip.model')
vectorlist = []
for tokens in test_corpus:
    vectorlist.append(new_model.infer_vector(tokens))
import numpy as np
from gensim import matutils

line = 'Three hundred thirty-one Chinese school children on Taiwan were given an aqueous oil trachoma vaccine and 322 an aqueous oil placebo'
vector = new_model.infer_vector(gensim.utils.simple_preprocess(line))

for i in range(7):
    similarity = np.dot(matutils.unitvec(vector), matutils.unitvec(vectorlist[i]))
    print(similarity)
After training, the saved model can be loaded and used directly.
Similarity is measured with cosine similarity.
The function matutils.unitvec() scales a vector to unit length, so the similarity is obtained directly as the dot product of the two unit vectors.
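Equivalently, the same score can be computed with plain numpy, since matutils.unitvec(v) is just v divided by its L2 norm:

import numpy as np

def cosine(a, b):
    # cos(a, b) = a.b / (||a|| * ||b||), identical to the dot product
    # of the two unit-scaled vectors above
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))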
test.txt contains the abstracts of 7 articles; the final similarity values are computed between a passage excerpted from the last article and each of the 7 abstracts.
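One caveat worth knowing: infer_vector runs its own iterative optimization with random initialization, so repeated calls on the same text yield slightly different vectors. Passing a larger epochs value (called steps in older gensim versions) makes the result more stable, at the cost of inference time:

# 50 is an illustrative value; if unspecified, the model's training
# epochs value is reused for inference
vector = new_model.infer_vector(gensim.utils.simple_preprocess(line), epochs=50)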