ES Custom Token Matching and Synonym Handling (qbit)

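This article applies to Elasticsearch 7.17. The analyzer used is ik_max_word.

Design approach

- Match most of the analyzed tokens (minimum_should_match).
- If the first token equals the original term, match either the original term or most of the remaining tokens.
- If the original term has synonyms, use only the synonyms of the original term; otherwise, use the synonyms of the tokens.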

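- Synonym expansion runs for one round only: either the original term is expanded or the tokens are expanded, never both.

GetLeafTermDSL

Term matching at a leaf node.

Flow chart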
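The minShouldMatch argument of GetLeafTermDSL is passed straight through as Elasticsearch's minimum_should_match value, for example "5<-1 8<-2". Read left to right, that string means: with 5 or fewer optional clauses all of them must match, with 6 to 8 clauses one may be missing, and with more than 8 clauses two may be missing. The sketch below is only an illustration of that reading. The function name required_matches is hypothetical, it handles integer and conditional specs only (percentages are left out), and it is not the code Elasticsearch itself runs.

```python
def required_matches(spec: str, clause_count: int) -> int:
    """Approximate how many optional clauses must match.

    Simplified reading of minimum_should_match: supports plain integers
    ("3", "-1") and conditional specs ("5<-1 8<-2"), not percentages.
    """
    def apply(value: str, n: int) -> int:
        k = int(value)
        # A negative integer means "this many clauses may be missing".
        return n + k if k < 0 else k

    if "<" not in spec:
        result = apply(spec.strip(), clause_count)
    else:
        # At or below the first threshold, every clause is required.
        result = clause_count
        for part in spec.split():
            threshold, value = part.split("<")
            if clause_count <= int(threshold):
                break
            result = apply(value, clause_count)
    # Keep the result within 0..clause_count (a simplification).
    return max(0, min(clause_count, result))


for n in (3, 5, 6, 8, 9):
    print(n, required_matches("5<-1 8<-2", n))
```

With this setting, a short query of up to five tokens has to match every token, while longer queries tolerate one or two misses, which is what "match most of the tokens" means in practice.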

Sample code

def GetLeafTermDSL(field: str, word: str, basicUniqueTokenList: list, minShouldMatch: str, bSyn: bool = False):
    r'''
    Build the term-query DSL for a leaf node.
    field: field name
    word: search term
    basicUniqueTokenList: de-duplicated token list of the search term
    minShouldMatch: minimum number of clauses to match, e.g. "5<-1 8<-2"
    bSyn: whether to expand synonyms
    '''
    # tool.GetSynonymList is the author's synonym-lookup helper; its code is not part of this article.
    shouldList = list()
    for token in basicUniqueTokenList:

        # Each token becomes one bool/should group: the token itself plus, optionally, its synonyms.
        innerShouldList = list()
        innerShouldList.append(
            {
                "term": {
                    f"{field}.text": {
                        "value": token,
                        # "boost": 1.0
                    }
                }
            }
        )

        tokenSynList = tool.GetSynonymList(token, 1)    # look up the token's synonyms
        if bSyn and tokenSynList:                       # synonyms enabled AND the token has synonyms
            for tokenSyn in tokenSynList:
                innerShouldList.append(
                    {
                        "term": {
                            # Synonyms are matched against the hard-coded all_field.text field.
                            "all_field.text": {
                                "value": tokenSyn
                            }
                        }
                    }
                )
        shouldList.append(
            {
                "bool": {
                    "should": innerShouldList
                }
            }
        )

    # Example: 数据治理 -> [数据治理, 数据, 治理]
    # More than one token, and the first token equals the original term
    if (len(basicUniqueTokenList) > 1) and (word.strip() == basicUniqueTokenList[0].strip()):

        # Match the original term, or most of the remaining tokens
        return {
            "bool": {
                "should": [
                    {
                        "term": {
                            f"{field}.text": {
                                "value": word,
                                # "boost": 1.0
                            }
                        }
                    },
                    {
                        "bool": {
                            "should": shouldList[1:],           # the first group is the original term, so skip it
                            "minimum_should_match": minShouldMatch
                        }
                    }
                ]
            }
        }

    else:

        return {
            "bool": {
                "should": shouldList,
                "minimum_should_match": minShouldMatch
            }
        }

Overall custom token search

The outer code that calls GetLeafTermDSL.

Flow chart

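Sample code

async def tokenQuery(bSyn: bool):
    # field, value, basicUniqueTokenList, minShouldMatch, logger and GetBasicToken
    # are expected to be defined in the enclosing scope.
    # Declared async because GetBasicToken is awaited below.
    if not bSyn:    # synonyms disabled

        logger.debug("Synonyms disabled...")
        return GetLeafTermDSL(field, value, basicUniqueTokenList, minShouldMatch, False)

    # From here on, synonyms are enabled
    logger.debug("Synonyms enabled...")
    rawSynList = tool.GetSynonymList(value, 1)    # synonyms of the original term
    if rawSynList:    # the original term has synonyms

        logger.debug("The original term has synonyms...")
        # bSyn keeps its default (False) in these calls: the single synonym round
        # is spent on the original term, so tokens are not expanded again.
        shouldList = list()
        shouldList.append(GetLeafTermDSL(
            field, value, basicUniqueTokenList, minShouldMatch))
        for item in rawSynList:
            tokenList = await GetBasicToken(item, True)
            shouldList.append(GetLeafTermDSL(
                field, item, tokenList, minShouldMatch))
        return {
            "bool": {
                "should": shouldList
            }
        }

    else:

        # Synonyms enabled, but the original term has none: expand the tokens instead
        logger.debug("Synonyms enabled, but the original term has no synonyms...")
        # Example: 数据治理 -> [数据治理, 数据, 治理]
        # More than one token, and the first token equals the original term
        return GetLeafTermDSL(field, value, basicUniqueTokenList, minShouldMatch, True)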
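To make the shape of the generated query concrete, here is a condensed, synonym-free sketch of the leaf-node logic that can be run on its own. It is an illustration under assumptions, not the author's code: the helper name leaf_term_dsl and the field name title are hypothetical, and each token is emitted as a plain term clause instead of being wrapped in its own bool/should group as GetLeafTermDSL does.

```python
import json


def leaf_term_dsl(field: str, word: str, tokens: list, min_should_match: str) -> dict:
    """Condensed leaf-node builder without synonym expansion (illustrative only)."""
    def term(value: str) -> dict:
        return {"term": {f"{field}.text": {"value": value}}}

    # More than one token and the first token equals the original term:
    # match the original term OR most of the remaining tokens.
    if len(tokens) > 1 and word.strip() == tokens[0].strip():
        return {
            "bool": {
                "should": [
                    term(word),
                    {
                        "bool": {
                            "should": [term(t) for t in tokens[1:]],
                            "minimum_should_match": min_should_match,
                        }
                    },
                ]
            }
        }
    # Otherwise: match most of the tokens.
    return {
        "bool": {
            "should": [term(t) for t in tokens],
            "minimum_should_match": min_should_match,
        }
    }


# "title" is a hypothetical field name used only for this example.
dsl = leaf_term_dsl("title", "数据治理", ["数据治理", "数据", "治理"], "5<-1 8<-2")
print(json.dumps(dsl, ensure_ascii=False, indent=2))
```

For the term 数据治理 the outer should holds two alternatives: an exact term match on 数据治理, or a nested bool over 数据 and 治理. With only two clauses in that nested bool, "5<-1 8<-2" requires both to match.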
