Preface

The synonym puzzle

Request

GET /_analyze
{
  "tokenizer": "ik_max_word",
  "filter" : [
    {
      "expand": true,
      "type": "synonym_graph",
      "synonyms": ["联合工作,Team working"]
    }
  ],
  "explain" : true,
  "attributes" : ["keyword"],
  "text" : "联合工作"
}

Output

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "ik_max_word",
      "tokens" : [
        {
          "token" : "联合",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "工作",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 1
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "__anonymous__synonym_graph",
        "tokens" : [
          {
            "token" : "team",
            "start_offset" : 0,
            "end_offset" : 4,
            "type" : "SYNONYM",
            "position" : 0
          },
          {
            "token" : "联合",
            "start_offset" : 0,
            "end_offset" : 2,
            "type" : "CN_WORD",
            "position" : 0,
            "positionLength" : 2
          },
          {
            "token" : "working",
            "start_offset" : 0,
            "end_offset" : 4,
            "type" : "SYNONYM",
            "position" : 1,
            "positionLength" : 2
          },
          {
            "token" : "工作",
            "start_offset" : 2,
            "end_offset" : 4,
            "type" : "CN_WORD",
            "position" : 2
          }
        ]
      }
    ]
  }
}

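Questions

  1. Why does the positionLength of 联合 change from 1 to 2? (positionLength is omitted from the output when it has the default value 1.)
  2. Why does the position of 工作 change from 1 to 2?
  3. The token filter runs after the tokenizer. Once 联合工作 has been tokenized into 联合 and 工作, there should no longer be any synonym to match, right?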

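A process inferred from the input and output

  • Based on the input and output, qbit reconstructed the following hypothetical process
    [Figure: image.png]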
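
One way to read the filter output, assuming the usual Lucene token-graph convention that a token at position p with positionLength n is an edge from node p to node p + n: 联合 needs positionLength 2 so that it can span the node where team ends and working begins, and 工作 moves to position 2 so that it starts where 联合 ends. The lowercase team and working also suggest that the synonym rule itself was passed through the tokenizer, which would explain why a rule written as 联合工作 still matches the token sequence 联合, 工作. The Python sketch below is illustrative only (the function name is made up); it enumerates the paths through the graph formed by the four tokens above.

```python
from collections import defaultdict

# Tokens copied from the synonym_graph output above:
# (term, position, positionLength); positionLength is 1 when omitted.
TOKENS = [
    ("team", 0, 1),
    ("联合", 0, 2),
    ("working", 1, 2),
    ("工作", 2, 1),
]


def token_graph_paths(tokens):
    """Enumerate every path through the token graph, from first node to last."""
    edges = defaultdict(list)
    for term, position, length in tokens:
        # A token is an edge from node `position` to node `position + length`.
        edges[position].append((term, position + length))
    start = min(position for _, position, _ in tokens)
    end = max(position + length for _, position, length in tokens)

    def walk(node, prefix):
        if node == end:
            yield prefix
            return
        for term, next_node in edges[node]:
            yield from walk(next_node, prefix + [term])

    return list(walk(start, []))


paths = token_graph_paths(TOKENS)
for path in paths:
    print(" ".join(path))
# team working
# 联合 工作
```

Under this reading the graph has exactly two side-by-side paths, one per synonym variant, rather than four independent terms.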

Synonyms and minimum_should_match

Create the index

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "word_syn": {
          "type": "synonym_graph",
          "synonyms": [
            "联合工作,Team working"
          ],
          "updateable": "true"
        }
      },
      "analyzer": {
        "ik_max_word_syn": {
          "filter": [
            "word_syn"
          ],
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

Index the test data

POST my_index/_bulk
{ "index" : { "_id" : "1" } }
{ "title" : "联合工作" }
{ "index" : { "_id" : "2" } }
{ "title" : "team working" }
{ "index" : { "_id" : "3" } }
{ "title" : "联合工作,team working" }

Query 1

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "minimum_should_match": 2,
        "analyzer": "ik_max_word",
        "query": "联合工作"
      }
    }
  }
}

ES parses this as

(title:联合 title:工作)~2

2 documents found

联合工作
联合工作,team working

Query 2

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "minimum_should_match": 2,
              "analyzer": "ik_max_word",
              "query": "联合工作"
            }
          }
        },
        {
          "match": {
            "title": {
              "minimum_should_match": 2,
              "analyzer": "ik_max_word",
              "query": "team working"
            }
          }
        }
      ]
    }
  }
}

ES parses this as

((title:联合 title:工作)~2) ((title:team title:working)~2)

3 documents found

联合工作,team working
联合工作
team working

Query 3

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "minimum_should_match": 2,
        "analyzer": "ik_max_word_syn",
        "query": "联合工作"
      }
    }
  }
}

ES parses this as

((title:"team working" title:"联合 工作"))~2

0 documents found
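
The three result counts can be reproduced with a toy model of the parsed queries. This is a sketch, not Elasticsearch's implementation: the helper names are invented, and the document token streams are an assumption (ik_max_word may emit additional tokens for document 3). Each parenthesized group is modeled as a bool query with only optional clauses, and ~N as the number of those clauses that must match.

```python
# Assumed token streams for the three test documents (punctuation dropped).
DOCS = {
    "1": ["联合", "工作"],
    "2": ["team", "working"],
    "3": ["联合", "工作", "team", "working"],
}


def term(word):
    """Match if the word occurs anywhere in the document."""
    return lambda tokens: word in tokens


def phrase(*words):
    """Match if the words occur at consecutive positions."""
    n = len(words)
    return lambda tokens: any(
        tuple(tokens[i:i + n]) == words for i in range(len(tokens) - n + 1)
    )


def should(clauses, minimum_should_match=1):
    """Bool query with only optional clauses: at least N of them must match."""
    return lambda tokens: sum(c(tokens) for c in clauses) >= minimum_should_match


def search(query):
    return sorted(doc_id for doc_id, tokens in DOCS.items() if query(tokens))


# (title:联合 title:工作)~2
query_1 = should([term("联合"), term("工作")], 2)

# ((title:联合 title:工作)~2) ((title:team title:working)~2)
query_2 = should([
    should([term("联合"), term("工作")], 2),
    should([term("team"), term("working")], 2),
])

# ((title:"team working" title:"联合 工作"))~2
query_3 = should([
    should([phrase("team", "working"), phrase("联合", "工作")]),
], 2)

print(search(query_1))  # ['1', '3']
print(search(query_2))  # ['1', '2', '3']
print(search(query_3))  # []
```

In this model Query 3 returns nothing because its outer group contains a single optional clause, so a requirement of 2 matching clauses can never be met, regardless of the document.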

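qbit's thoughts

  • I expected Query 3 to be equivalent to Query 2.
  • When minimum_should_match is combined with synonyms, it is applied only at the outermost level. The parsed query wraps both phrase queries in a single outer clause, so a value of 2 can never be satisfied, which is not the result I expected.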
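
A possible application-side way to get the Query 2 behavior, sketched under the assumption that the synonym group is available to the client as a plain list of variants: build one match clause per variant and let each clause carry its own minimum_should_match. The function name and its defaults are hypothetical; the generated body has the same shape as Query 2.

```python
import json


def synonym_group_query(field, variants, analyzer="ik_max_word",
                        minimum_should_match=2):
    """Build a Query 2 style request body: one match clause per variant."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "match": {
                            field: {
                                "query": variant,
                                "analyzer": analyzer,
                                # Applied per variant, not to the whole group.
                                "minimum_should_match": minimum_should_match,
                            }
                        }
                    }
                    for variant in variants
                ]
            }
        }
    }


body = synonym_group_query("title", ["联合工作", "team working"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted after GET my_index/_search. A fixed value of 2 only makes sense when every variant analyzes to at least two tokens, as both variants do here.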
This article is from qbit snap
