ES Custom Token Matching and Synonym Handling (qbit)

Preface

This article applies to Elasticsearch 7.17. The analyzer used is ik_max_word.

Design approach

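- Match most of the tokens produced by analysis (minimum_should_match).
- If the first token equals the original word, match either the original word or most of the remaining tokens.
- If the original word has synonyms, use only the original word's synonyms; otherwise, use the synonyms of the individual tokens.
- Synonym expansion happens in a single round only: either on the original word or on the tokens, never both.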
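
To make the first point concrete, here is a minimal sketch of how a conditional minimum_should_match value such as "5<-1 8<-2" (the example used in the GetLeafTermDSL docstring) is resolved into a required number of matching clauses. It follows the documented Elasticsearch semantics as I read them and handles integer values only, not percentages. The function name required_matches is hypothetical and is not part of the original code.

```python
def required_matches(clause_count: int, spec: str) -> int:
    """Resolve an integer-only minimum_should_match spec (assumed semantics).

    "5<-1 8<-2" reads as: with 5 or fewer clauses all must match,
    with 6 to 8 clauses all but one, with more than 8 all but two.
    Percentages are not handled by this sketch.
    """
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional parts apply from left to right; each one only takes
        # effect when the clause count exceeds its bound.
        for part in spec.split():
            bound_text, inner = part.split("<", 1)
            if clause_count <= int(bound_text):
                return result
            result = required_matches(clause_count, inner)
        return result
    value = int(spec)
    # A negative value means "all clauses except this many".
    result = clause_count + value if value < 0 else value
    return max(result, 0)


for n in (3, 6, 9):
    print(n, required_matches(n, "5<-1 8<-2"))
```

With this spec, short queries stay strict (every token must match), while longer queries tolerate one or two missing tokens.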

GetLeafTermDSL

Term matching at the leaf node.
Flowchart
Sample code

def GetLeafTermDSL(field: str, word: str, basicUniqueTokenList: list, minShouldMatch: str, bSyn: bool = False):
    r'''
    Build the term query DSL for a leaf node.
    field: field name
    word: search term
    basicUniqueTokenList: deduplicated token list of the search term
    minShouldMatch: minimum number of clauses to match, e.g. "5<-1 8<-2"
    bSyn: whether to enable synonyms
    '''
    shouldList = list()
    for token in basicUniqueTokenList:
        innerShouldList = list()
        innerShouldList.append(
            {
                "term": {
                    f"{field}.text": {
                        "value": token,
                        # "boost": 1.0
                    }
                }
            }
        )

        tokenSynList = tool.GetSynonymList(token, 1)    # fetch the synonyms of this token
        if (not bSyn) or (not tokenSynList):            # synonyms disabled OR the token has no synonyms
            shouldList.append(
                {
                    "bool": {
                        "should": innerShouldList
                    }
                }
            )
        else:       # synonyms enabled AND the token has synonyms
            for tokenSyn in tokenSynList:
                # Note: synonyms are matched against all_field.text, not f"{field}.text"
                innerShouldList.append(
                    {
                        "term": {
                            "all_field.text": {
                                "value": tokenSyn
                            }
                        }
                    }
                )
            shouldList.append(
                {
                    "bool": {
                        "should": innerShouldList
                    }
                }
            )

    # Example: 数据治理 (data governance) -> [数据治理, 数据, 治理]
    # More than one token, and the first token equals the original word
    if (len(basicUniqueTokenList) > 1) and (word.strip() == basicUniqueTokenList[0].strip()):
        # Drop the original word from shouldList
        return {
            "bool": {
                "should": [
                    {
                        "term": {
                            f"{field}.text": {
                                "value": word,
                                # "boost": 1.0
                            }
                        }
                    },
                    {
                        "bool": {
                            "should": shouldList[1:],           # the first entry is the original word itself; skip it
                            "minimum_should_match": minShouldMatch
                        }
                    }
                ]
            }
        }
    else:
        return {
            "bool": {
                "should": shouldList,
                "minimum_should_match": minShouldMatch
            }
        }
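
As an illustration of the first branch, the sketch below writes out by hand the DSL that GetLeafTermDSL would return for the example in the comments (数据治理 -> [数据治理, 数据, 治理]) with bSyn=False. The field name "title" and the helper term are hypothetical and exist only for this example.

```python
import json

field = "title"                  # hypothetical field name, not from the article
word = "数据治理"
min_should_match = "5<-1 8<-2"


def term(value: str) -> dict:
    """Build one term clause on the .text sub-field (illustrative helper)."""
    return {"term": {f"{field}.text": {"value": value}}}


# Either the whole word matches, or enough of the remaining tokens match.
expected_dsl = {
    "bool": {
        "should": [
            term(word),
            {
                "bool": {
                    "should": [
                        {"bool": {"should": [term("数据")]}},
                        {"bool": {"should": [term("治理")]}},
                    ],
                    "minimum_should_match": min_should_match,
                }
            },
        ]
    }
}

print(json.dumps(expected_dsl, ensure_ascii=False, indent=2))
```

The outer bool carries no minimum_should_match of its own. Because it contains only should clauses, Elasticsearch requires at least one of them to match by default, which gives the "original word OR most remaining tokens" behavior from the design list.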

Custom token search

The outer layer, which calls GetLeafTermDSL.
Flowchart
Sample code

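async def tokenQuery(bSyn: bool):
    # Must be async because it awaits GetBasicToken below.
    # field, value, basicUniqueTokenList and minShouldMatch are assumed to come from the enclosing scope.
    if not bSyn:        # synonyms disabled
        logger.debug("Synonyms disabled...")
        return GetLeafTermDSL(field, value, basicUniqueTokenList, minShouldMatch, False)
    # From here on, synonyms are enabled
    logger.debug("Synonyms enabled...")
    rawSynList = tool.GetSynonymList(value, 1)      # fetch the synonyms of the original word
    if rawSynList:      # the original word has synonyms
        logger.debug("The original word has synonyms...")
        shouldList = list()
        shouldList.append(GetLeafTermDSL(
            field, value, basicUniqueTokenList, minShouldMatch))
        for item in rawSynList:
            tokenList = await GetBasicToken(item, True)
            shouldList.append(GetLeafTermDSL(
                field, item, tokenList, minShouldMatch))
        return {
            "bool": {
                "should": shouldList
            }
        }
    else:
        # Synonyms enabled, but the original word has none: fall back to token synonyms
        logger.debug("Synonyms enabled, but the original word has no synonyms...")
        # Example: 数据治理 (data governance) -> [数据治理, 数据, 治理]
        # More than one token, and the first token equals the original word
        return GetLeafTermDSL(field, value, basicUniqueTokenList, minShouldMatch, True)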
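
GetBasicToken is not shown in the article. As a rough sketch of what its deduplication step might look like, the function below takes a response shaped like the output of the Elasticsearch _analyze API (a "tokens" list whose entries each carry a "token" key) and returns the unique tokens in their original order. The name unique_tokens and the sample response are assumptions: the response is written by hand for illustration, including an artificial duplicate, and is not real ik_max_word output.

```python
def unique_tokens(analyze_response: dict) -> list:
    """Return tokens from an _analyze-style response, deduplicated, order kept."""
    seen = set()
    tokens = []
    for entry in analyze_response.get("tokens", []):
        token = entry["token"]
        if token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


# Hand-written sample in the shape of an _analyze response (not real output).
sample_response = {
    "tokens": [
        {"token": "数据治理", "position": 0},
        {"token": "数据", "position": 1},
        {"token": "治理", "position": 2},
        {"token": "数据", "position": 3},   # artificial duplicate for the demo
    ]
}

print(unique_tokens(sample_response))
```

Keeping the original order matters here, because GetLeafTermDSL compares the first token with the original word to decide which branch to take.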
