[toc]
1. Composition of an Elasticsearch analyzer
An analyzer is composed of three parts:
1.1 Character Filter
Filters the raw text before tokenization. For example, if the original text is HTML, the HTML tags need to be stripped (html_strip).
1.2 Tokenizer
Splits the input (the text already processed by the character filters) into terms according to certain rules, such as whitespace.
1.3 Token Filter
Post-processes the terms produced by the tokenizer, e.g. converting uppercase to lowercase or filtering out stop words (in, the, etc.).
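To see the three parts working together, the _analyze API accepts all of them in one request. A minimal sketch (the sample text and the html_strip / standard / lowercase combination are chosen only for illustration):
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK brown fox</p>"
}
The char_filter strips the <p> tags, the tokenizer splits on word boundaries, and the token filter lowercases, yielding the terms the, quick, brown, fox.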
2. Testing tokenization with the _analyze API
2.1 Test with a named analyzer
2.1.1 Standard analyzer
Tokenizer: Standard Tokenizer
Based on Unicode text segmentation; suitable for most languages
Token Filters: Lower Case Token Filter, Stop Token Filter (disabled by default)
- Lower Case Token Filter: lowercases every term, which is why the tokens produced by the default standard analyzer are all lowercase
- Stop Token Filter (disabled by default): stop words are discarded from the token stream and never reach the index
GET _analyze
{
"analyzer": "standard",
"text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}
2.1.2 Observations on the standard output
- everything is lowercase
- numbers are kept
- no stop words are removed (the stop filter is disabled by default)
{
"tokens" : [
{
"token" : "for",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "example",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "uuu",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "you",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "can",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "see",
"start_offset" : 28,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "27",
"start_offset" : 32,
"end_offset" : 34,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "accounts",
"start_offset" : 35,
"end_offset" : 43,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "in",
"start_offset" : 44,
"end_offset" : 46,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "id",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "idaho",
"start_offset" : 51,
"end_offset" : 56,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
2.2 Other analyzers
- standard (the default)
- stop: like simple, but also removes stop words
- simple: splits on non-letter characters and lowercases
- whitespace: splits on whitespace only; nothing is lowercased or removed (compare with simple in the sketch after this list)
- keyword: keeps the full text as a single term, no tokenization
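A quick way to compare analyzers is to run the same text through each of them. A small sketch (sample text invented here): simple splits on non-letter characters and lowercases, yielding the, quick, brown, foxes, while whitespace only splits on spaces and keeps The, 27, QUICK, brown-foxes. unchanged.
GET _analyze
{
  "analyzer": "simple",
  "text": "The 27 QUICK brown-foxes."
}

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 27 QUICK brown-foxes."
}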
2.3 Specify a tokenizer and token filters to test tokenization
2.3.1 Use the same Tokenizer and Filter as standard
As the previous section noted, the standard analyzer uses the standard tokenizer and the lowercase token filter, so we can reproduce it by specifying the tokenizer and filter directly instead of naming an analyzer:
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}
The result is the same as above:
{
"tokens" : [
{
"token" : "for",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "example",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "uuu",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "you",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "can",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "see",
"start_offset" : 28,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "27",
"start_offset" : 32,
"end_offset" : 34,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "accounts",
"start_offset" : 35,
"end_offset" : 43,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "in",
"start_offset" : 44,
"end_offset" : 46,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "id",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "idaho",
"start_offset" : 51,
"end_offset" : 56,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
2.3.2 Add a stop filter and try again
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase","stop"],
"text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}
Notice that in is now gone, so the stop filter must be what removes stop words. The filter array contains two token filters (fields that accept multiple values take an array). If you remove lowercase from the array, uppercase letters are no longer folded to lowercase; that output is omitted here.
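For reference, that request without lowercase would look like the following (output likewise omitted). Note that the default stop-word list is lowercase and matching is case-sensitive, so the capitalized For would presumably survive as well:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}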
{
"tokens" : [
{
"token" : "example",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "uuu",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "you",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "can",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "see",
"start_offset" : 28,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "27",
"start_offset" : 32,
"end_offset" : 34,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "accounts",
"start_offset" : 35,
"end_offset" : 43,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "id",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "idaho",
"start_offset" : 51,
"end_offset" : 56,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
3. Analyzer components built into Elasticsearch
3.1 Built-in character filters
3.1.1 What is a character filter?
Character filters process the text before the tokenizer runs, e.g. adding, removing, or replacing characters; multiple character filters can be configured. They affect the position and offset values reported by the tokenizer.
3.1.2 Some built-in character filters
- html_strip: removes HTML tags
- mapping: string replacement
- pattern_replace: regex-based replacement
3.2 Built-in tokenizers
3.2.1 What is a tokenizer?
Splits the text (already processed by the character filters) into terms/tokens according to certain rules.
3.2.2 Built-in tokenizers
- whitespace: splits on whitespace
- standard
- uax_url_email: keeps URLs and email addresses as single tokens (see the sketch after this list)
- pattern: splits on a regular expression
- keyword: no tokenization
- path_hierarchy: splits path names into hierarchy levels
3.2.3 Custom tokenizers can also be implemented as a Java plugin
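uax_url_email is not shown in the demo section below, so here is a small sketch (sample text made up); unlike standard, it keeps the email address and the URL as single tokens:
GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Mail admin@example.com or visit http://www.elastic.co/guide"
}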
3.3 Built-in token filters
3.3.1 What is a token filter?
Processes the terms output by the tokenizer.
3.3.2 Built-in token filters
- lowercase: lowercases terms
- stop: removes stop words (in, the, etc.)
- synonym: adds synonyms (see the sketch after this list)
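The synonym filter is not covered by the demos below, so here is a minimal sketch (the synonym pair laptop, notebook is invented for illustration). Because the two terms are declared equivalent, both tokens are emitted at the same position, so a search for either one would match:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["laptop, notebook"]
    }
  ],
  "text": "I need a new laptop"
}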
4. Demo cases
4.1 html_strip / mapping + keyword
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "html_strip"
},
{
"type": "mapping",
"mappings": [
"- => _", ":) => _happy_", ":( => _sad_"
]
}
],
"text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"
}
The keyword tokenizer is used, so the text is kept whole and never split. Two char_filters are applied: html_strip (removes the HTML tags) and mapping (replaces the configured characters with the specified content).
Result: the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_:
{
"tokens" : [
{
"token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
"start_offset" : 3,
"end_offset" : 52,
"type" : "word",
"position" : 0
}
]
}
4.2 char_filter with regex replacement
GET _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "http://(.*)",
"replacement": "$1"
}
],
"text": "http://www.elastic.co"
}
The regex replacement is configured with type, pattern and replacement.
Result:
{
"tokens" : [
{
"token" : "www.elastic.co",
"start_offset" : 0,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
4.3 Tokenizer for path hierarchies
GET _analyze
{
"tokenizer": "path_hierarchy",
"text": "/user/niewj/a/b/c"
}
Tokenization result:
{
"tokens" : [
{
"token" : "/user",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj/a",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj/a/b",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj/a/b/c",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 0
}
]
}
4.4 whitespace tokenizer with the stop token filter
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop"], // ["lowercase", "stop"]
"text": "The girls in China are playing this game !"
}
Result: in and this are removed as stop words, but The keeps its capital letter and is retained: the whitespace tokenizer does not lowercase, and without the lowercase filter the capitalized The does not match the default (lowercase) stop-word list.
{
"tokens" : [
{
"token" : "The",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "girls",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "China",
"start_offset" : 13,
"end_offset" : 18,
"type" : "word",
"position" : 3
},
{
"token" : "playing",
"start_offset" : 23,
"end_offset" : 30,
"type" : "word",
"position" : 5
},
{
"token" : "game",
"start_offset" : 36,
"end_offset" : 40,
"type" : "word",
"position" : 7
},
{
"token" : "!",
"start_offset" : 41,
"end_offset" : 42,
"type" : "word",
"position" : 8
}
]
}
4.5 Custom analyzer
4.5.1 Define the custom analyzer in the index settings
PUT my_new_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer":{ // 1.自定义analyzer的名称
"type": "custom",
"char_filter": ["my_emoticons"],
"tokenizer": "my_punctuation",
"filter": ["lowercase", "my_english_stop"]
}
},
"tokenizer": {
"my_punctuation": { // 3.自定义tokenizer的名称
"type": "pattern", "pattern":"[ .,!?]"
}
},
"char_filter": {
"my_emoticons": { // 2.自定义char_filter的名称
"type": "mapping", "mappings":[":) => _hapy_", ":( => _sad_"]
}
},
"filter": {
"my_english_stop": { // 4.自定义token filter的名称
"type": "stop", "stopwords": "_english_"
}
}
}
}
}
4.5.2 Test the custom analyzer:
POST my_new_index/_analyze
{
"analyzer": "my_analyzer",
"text": "I'm a :) person in the earth, :( And You? "
}
Output:
{
"tokens" : [
{
"token" : "i'm",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "_hapy_",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "person",
"start_offset" : 9,
"end_offset" : 15,
"type" : "word",
"position" : 3
},
{
"token" : "earth",
"start_offset" : 23,
"end_offset" : 28,
"type" : "word",
"position" : 6
},
{
"token" : "_sad_",
"start_offset" : 30,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "you",
"start_offset" : 37,
"end_offset" : 40,
"type" : "word",
"position" : 9
}
]
}