Word2Vec 训练英文文本如何按逗号分词

import logging
import os
import sys
import multiprocessing
 
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
    if len(sys.argv) < 4:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]
 
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
 
    # trim unneeded model memory = use(much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

训练集中是分好的,训练完词汇表中是独立的单词
图片描述

阅读 4.3k
2 个回答

定位到from gensim.models.word2vec.LineSentence
将line = utils.to_unicode(line).split(' ')改为line = utils.to_unicode(line).split(',')

LineSentence类的要求是:

Simple format: one sentence = one line; words already preprocessed and separated by     whitespace

你需要自己简单预处理一下。

现在比较流行的是doc2vec,有兴趣可以看下:https://segmentfault.com/a/11...

撰写回答
你尚未登录,登录后可以
  • 和开发者交流问题的细节
  • 关注并接收问题和回答的更新提醒
  • 参与内容的编辑和改进,让解决方法与时俱进
推荐问题
宣传栏