What can I do when jieba's word segmentation results are poor?


Posted 2024-01-25, Henan

Updated 2024-01-25

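What can I do when jieba's segmentation results are poor?
I want to build a word cloud from reviews of scenic spots. I segment the text with jieba and then run LDA on the segmented result to extract topics, but the top words of the extracted topics clearly show that the segmentation is wrong.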

Relevant code:

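import jieba
from nltk.corpus import stopwords  # assumed source of `stopwords` (NLTK corpus)

# Load the Chinese stop words
stop_words = set(stopwords.words('chinese'))
# Broadcast the set to the Spark executors; `spark` is an existing SparkSession
broadcastVar = spark.sparkContext.broadcast(stop_words)

# Segment Chinese text into words
def tokenize(text):
    return list(jieba.cut(text))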

# Remove Chinese stop words
def delete_stopwords(tokens,stop_words):
    # The input is already segmented
    words = tokens

    # Drop the stop words
    filtered_words = [word for word in words if word not in stop_words]
    # Rebuild the text
    filtered_text = ' '.join(filtered_words)

    return filtered_text
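# Remove punctuation and a fixed set of characters
def remove_punctuation(input_string):
    import string
    # Build a translation table that maps every punctuation mark and every character to delete to None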
    all_punctuation = string.punctuation + "！？｡。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.\t \n很好是去还不人太都中"
    translator = str.maketrans('', '', all_punctuation)
    # Use the table to strip all of those punctuation marks and characters
    no_punct = input_string.translate(translator)
    return no_punct

def Thematic_focus(text):
    from gensim import corpora, models
    num_words = 0
    if len(text)>200:
        num_words = 10
    elif 200>=len(text)>100:
        num_words = 8
    elif 100>=len(text)>50:
        num_words = 5
    else:
        num_words = 3

    tokens = tokenize(text)
    # Remove stop words
    stop_words = broadcastVar.value
    text = delete_stopwords(tokens,stop_words)
    # Remove punctuation
    text = remove_punctuation(text)
    # Segment again
    tokens = tokenize(text)
    print(type(tokens),type([tokens]))
    # return str(tokens)
    # Build the dictionary and the document-term (bag-of-words) matrix
    dictionary = corpora.Dictionary([tokens])

    corpus = [dictionary.doc2bow(tokens)]

    # Run the LDA model
    lda_model = models.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=50)

    # Extract the topics
    topics = lda_model.show_topics(num_words=num_words)
    # Return the topic (there is only one, since num_topics=1)

    for topic in topics:
        return str(topic)

I would like the segmentation to be more reasonable, or a better way to extract keywords from scenic-spot reviews.

Tags: python, jieba, word segmentation, text processing, data analysis, lda


1 Answer


算云烟


Posted 2024-05-09, Guangdong

✓ Accepted

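1. Reverse-engineer the Sogou travel lexicon to build your own dictionary, and segment according to that dictionary.
2. Starting from the open-source stop-word lists on GitHub, build your own stop-word list and use it to remove stop words.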
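
The two steps above could look roughly like the sketch below. It is only an illustration under several assumptions: the travel lexicon has already been converted to jieba's plain-text dictionary format (one `word [frequency] [tag]` entry per line), the stop-word list is a UTF-8 file with one word per line, and the function names and file paths are hypothetical. `jieba.load_userdict` and `jieba.lcut` are jieba's own calls; everything else is made up for the example.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stop word per line from a UTF-8 text file into a set."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(tokens, stop_words):
    """Drop stop words and whitespace-only tokens.

    Whole tokens are removed; characters inside a kept word are never touched.
    """
    return [t for t in tokens if t.strip() and t not in stop_words]


def segment_reviews(texts, user_dict_path, stop_words):
    """Segment each review with jieba after loading a custom dictionary.

    user_dict_path is a hypothetical plain-text file in jieba's
    user-dictionary format: word [frequency] [POS tag], one entry per line.
    """
    # Imported here so the two helpers above work without jieba installed.
    import jieba

    # Teach jieba the domain words (scenic-spot names, travel terms, ...).
    jieba.load_userdict(user_dict_path)
    return [filter_tokens(jieba.lcut(text), stop_words) for text in texts]
```

A call might then be `segment_reviews(reviews, "travel_dict.txt", load_stopwords("stopwords.txt"))`, where both file names are placeholders. Filtering whole tokens after segmentation, rather than deleting single characters from the raw string with `str.translate`, keeps words that merely contain one of those characters intact, which is likely to matter here because the question's `remove_punctuation` strips characters such as 人 and 中 from every word.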
