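Preface
- ES version: 7.17
- Tokenizer used with the synonym filter: IK
- ES official documentation: Token graphs, which has no multi-word vs. multi-word example
- Analyzer composition (from: https://www.elastic.co/blog/found-text-analysis-part-1)
- synonym_graph is a TokenFilter: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/...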
The synonym puzzle
Request
GET /_analyze
{
"tokenizer": "ik_max_word",
"filter" : [
{
"expand": true,
"type": "synonym_graph",
"synonyms": ["联合工作,Team working"]
}
],
"explain" : true,
"attributes" : ["keyword"],
"text" : "联合工作"
}
Output
{
"detail" : {
"custom_analyzer" : true,
"charfilters" : [ ],
"tokenizer" : {
"name" : "ik_max_word",
"tokens" : [
{
"token" : "联合",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "工作",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
}
]
},
"tokenfilters" : [
{
"name" : "__anonymous__synonym_graph",
"tokens" : [
{
"token" : "team",
"start_offset" : 0,
"end_offset" : 4,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "联合",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0,
"positionLength" : 2
},
{
"token" : "working",
"start_offset" : 0,
"end_offset" : 4,
"type" : "SYNONYM",
"position" : 1,
"positionLength" : 2
},
{
"token" : "工作",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
}
]
}
]
}
}
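Questions
- Why does the positionLength of 联合 change from 1 to 2? (positionLength is omitted from the output when it has its default value of 1.)
- Why does the position of 工作 change from 1 to 2? The TokenFilter runs after the Tokenizer, and by then 联合工作 has already been split into 联合 and 工作, so there should be no synonym left to match, right?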
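A process inferred from the input and output
- Working backward from the input and output, qbit reconstructed a hypothetical version of what the filter does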
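One way to read the output is as a token graph: each token is an arc that starts at `position` and ends at `position + positionLength`. The sketch below is an illustration only, not Lucene's implementation; the function name `token_paths` is hypothetical. It takes the four tokens from the synonym_graph output above and enumerates every path from the first position to the last.

```python
# Tokens copied from the synonym_graph output above.
# positionLength defaults to 1 when it is omitted.
tokens = [
    {"token": "team", "position": 0},
    {"token": "联合", "position": 0, "positionLength": 2},
    {"token": "working", "position": 1, "positionLength": 2},
    {"token": "工作", "position": 2},
]


def token_paths(tokens):
    """Enumerate every start-to-end path through the token graph."""
    arcs = {}  # source position -> list of (token text, destination position)
    end = 0
    for t in tokens:
        src = t["position"]
        dst = src + t.get("positionLength", 1)
        arcs.setdefault(src, []).append((t["token"], dst))
        end = max(end, dst)

    def walk(node):
        if node == end:
            return [[]]
        paths = []
        for text, dst in arcs.get(node, []):
            for rest in walk(dst):
                paths.append([text] + rest)
        return paths

    return walk(0)


paths = token_paths(tokens)
print(paths)  # [['team', 'working'], ['联合', '工作']]
```

Read this way, the graph holds two parallel two-token paths between positions 0 and 3. Each path needs its own intermediate position (1 for team/working, 2 for 联合/工作), which would account for 联合 spanning two positions and 工作 moving to position 2. As for why a synonym is found at all: as far as I understand, the synonym rules are analyzed with the same tokenizer, so the rule 联合工作 is itself stored as the token sequence 联合, 工作 and matches the tokenizer's output.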
Synonyms and minimum_should_match
Create the index
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"word_syn": {
"type": "synonym_graph",
"synonyms": [
"联合工作,Team working"
],
"updateable": "true"
}
},
"analyzer": {
"ik_max_word_syn": {
"filter": [
"word_syn"
],
"type": "custom",
"tokenizer": "ik_max_word"
}
}
}
},
"mappings": {
"dynamic": false,
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
Index the test data
POST my_index/_bulk
{ "index" : { "_id" : "1" } }
{ "title" : "联合工作" }
{ "index" : { "_id" : "2" } }
{ "title" : "team working" }
{ "index" : { "_id" : "3" } }
{ "title" : "联合工作,team working" }
Query 1
GET my_index/_search
{
"query": {
"match": {
"title": {
"minimum_should_match": 2,
"analyzer": "ik_max_word",
"query": "联合工作"
}
}
}
}
ES parses this as
(title:联合 title:工作)~2
2 hits
联合工作
联合工作,team working
Query 2
GET my_index/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"minimum_should_match": 2,
"analyzer": "ik_max_word",
"query": "联合工作"
}
}
},
{
"match": {
"title": {
"minimum_should_match": 2,
"analyzer": "ik_max_word",
"query": "team working"
}
}
}
]
}
}
}
ES parses this as
((title:联合 title:工作)~2) ((title:team title:working)~2)
3 hits
联合工作,team working
联合工作
team working
Query 3
GET my_index/_search
{
"query": {
"match": {
"title": {
"minimum_should_match": 2,
"analyzer": "ik_max_word_syn",
"query": "联合工作"
}
}
}
}
ES parses this as
((title:"team working" title:"联合 工作"))~2
0 hits
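qbit's thoughts
- The expectation was that Query 3 would be equivalent to Query 2
- But when minimum_should_match is combined with synonyms, it is applied only at the outermost level of the parsed query, which is not the desired result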
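The three parsed forms can be compared with a toy model of boolean matching. This is a minimal sketch under stated assumptions, not how Lucene evaluates queries: the helpers `phrase`, `should`, and `hits` are hypothetical, scoring is ignored, and the token lists for the three documents are assumed rather than taken from real IK output.

```python
# Assumed token lists for the three test documents (illustrative, not real IK output).
docs = {
    "1": ["联合", "工作"],
    "2": ["team", "working"],
    "3": ["联合", "工作", "team", "working"],
}


def phrase(*terms):
    """Matcher that is true when `terms` occur consecutively in a token list."""
    def match(tokens):
        n = len(terms)
        return any(tuple(tokens[i:i + n]) == terms
                   for i in range(len(tokens) - n + 1))
    return match


def should(clauses, msm=1):
    """Matcher for a bool query made of `should` clauses with minimum_should_match = msm."""
    def match(tokens):
        return sum(1 for clause in clauses if clause(tokens)) >= msm
    return match


def hits(query):
    """Return the ids of the documents that the query matches."""
    return [doc_id for doc_id, tokens in docs.items() if query(tokens)]


# (title:联合 title:工作)~2
query1 = should([phrase("联合"), phrase("工作")], msm=2)
# ((title:联合 title:工作)~2) ((title:team title:working)~2)
query2 = should([query1, should([phrase("team"), phrase("working")], msm=2)])
# ((title:"team working" title:"联合 工作"))~2
query3 = should([should([phrase("team", "working"), phrase("联合", "工作")])], msm=2)

print(hits(query1))  # ['1', '3']
print(hits(query2))  # ['1', '2', '3']
print(hits(query3))  # []
```

In this model the outer query of Query 3 has a single clause (the group of synonym phrases), so a minimum_should_match of 2 can never be satisfied, which is consistent with the 0 hits reported above.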
This article is from qbit snap