How do I use TaggedDocument in gensim?


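I have two directories of text files that I want to read and tag, but I don't know how to do this with TaggedDocument. I assumed it would work as TaggedDocument([Strings], [Labels]), but that apparently does not work.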

Here is my code:

from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize
CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath ="./Sample/"
docs = []
tags = []
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]
for doc in CogList:
     str = open(CogPath+doc,'r').read().decode("utf-8")
     docs.append(str)
     print docs
     tags.append(doc)
     print "###########"
     print tags
     print "!!!!!!!!!!!"
for doc in NotCogList:
     str = open(NotCogPath+doc,'r').read().decode("utf-8")
     docs.append(str)
     tags.append(doc)
for doc in SampleList:
     str = open(SamplePath + doc, 'r').read().decode("utf-8")
     docs.append(str)
     tags.append(doc)

T = TaggedDocument(docs,tags)

model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

This is the error I get:

 Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'

Original post by Farhood, translated under the CC BY-SA 4.0 license

1 Answer

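The input to a Doc2Vec model should be a list of TaggedDocument objects, each of the form TaggedDocument(['list', 'of', 'words'], [TAG_001]): one object per document, not a single object wrapping every document. A good practice is to use the index of the sentence as its tag. For example, to train a Doc2Vec model on two sentences (i.e. documents or paragraphs):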

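import gensim
from gensim.models.doc2vec import TaggedDocument

s1 = 'the quick fox brown fox jumps over the lazy dog'
s1_tag = '001'
s2 = 'i want to burn a zero-day'
s2_tag = '002'

# Each document gets its own TaggedDocument: a list of words plus a list of tags
docs = []
docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))

# min_count=1 keeps every word; with only two short sentences, min_count=5 would
# discard the whole vocabulary and training would fail
model = gensim.models.Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
model.build_vocab(docs)

print('Start training process...')
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# save model (model_path is a placeholder; point it at your own output file)
model_path = 'doc2vec.model'
model.save(model_path)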

Original post by biendltb, translated under the CC BY-SA 4.0 license
