[toc]


1. Composition of an Elasticsearch analyzer

An analyzer is composed of three parts:

1.1 Character Filter

Filters the original text. For example, if the original text is HTML, the HTML tags need to be removed: html_strip.

1.2 Tokenizer

Splits the input (the text already processed by the Character Filters) according to certain rules (such as on spaces).

1.3 Token Filter

Post-processes the terms produced by the Tokenizer, e.g. converting uppercase to lowercase or filtering out stop words (in, the, etc.).
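
These three stages can be chained in a single _analyze call. A minimal sketch using only built-in components (the sample text is made up for illustration):

 GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK Brown Foxes!</p>"
}

The HTML tags are stripped first, the remaining text is split into The / QUICK / Brown / Foxes, and the tokens are finally lowercased.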

2. Testing word segmentation with _analyze

2.1 Specifying an analyzer to test word segmentation

2.1.1 Standard analyzer

  • Tokenizer: Standard Tokenizer

    Based on Unicode text segmentation; suitable for most languages
  • Token Filters: Lower Case Token Filter / Stop Token Filter (disabled by default)

    • Lower Case Token Filter: lowercases every token --> so terms indexed by the default standard analysis are lowercase and are matched as lowercase at search time
    • Stop Token Filter (disabled by default) --> stop words are discarded from the token stream and never reach the index
 GET _analyze
{
  "analyzer": "standard",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

2.1.2 What the standard output shows

  • all tokens are lowercase
  • the numbers are kept
  • no stop-word removal (the stop filter is disabled by default)
 {
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.2 Other analyzers

  • standard: the default, described above
  • stop: removes stop words
  • simple: splits on non-letter characters and lowercases
  • whitespace: splits only on whitespace; nothing is removed or lowercased (see the sketch after this list)
  • keyword: the full text as a single term, no word segmentation
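
To compare them, run the same sentence through, say, the whitespace analyzer (a quick sketch; any of the names above can be substituted):

 GET _analyze
{
  "analyzer": "whitespace",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

With whitespace, tokens such as #!#For and (Idaho). keep their punctuation and case, unlike the standard output shown earlier.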

2.3 Specifying a Tokenizer and Token Filter to test word segmentation

2.3.1 Using the same Tokenizer and Filter as standard

As noted in the previous section, the standard analyzer uses the standard tokenizer and the lowercase filter. Let's replace the analyzer with an explicit tokenizer and filter:

 GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

The result is the same as above:

 {
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.3.2 Add a stop filter and try again

 GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase","stop"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

Observation: in is gone, so the stop filter must include in as a stop word.

The filter field here holds two token filters (ES fields that take multiple values accept an array). If you remove lowercase from the filter, uppercase letters are no longer converted to lowercase; I won't post those results here (the request itself is sketched after the result below).

 {
  "tokens" : [
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
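
For reference, the variant mentioned above that drops lowercase (whose output is omitted, as in the text) would simply be:

 GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}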

3. Analyzer components built into Elasticsearch

3.1 Character filters built into ES

3.1.1 What is a character filter?

Processes the text before it reaches the tokenizer, for example by adding, deleting, or replacing characters; multiple character filters can be configured.

This affects the position and offset values that the tokenizer produces.

3.1.2 Some built-in character filters

  • html_strip: removes HTML tags
  • mapping: string replacement
  • pattern_replace: regex-based replacement

3.2 Tokenizers built into ES

3.2.1 What is a tokenizer?

Splits the text (as already processed by the character filters) into terms (tokens) according to certain rules.

3.2.2 Built-in tokenizers

  • whitespace: splits on whitespace
  • standard
  • uax_url_email: keeps URLs and email addresses as single tokens (see the sketch after this list)
  • pattern: regex-based splitting
  • keyword: no tokenization, the whole input is one token
  • path_hierarchy: splits path names into a hierarchy
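
Since uax_url_email is not demonstrated in the cases below, here is a minimal sketch (the sample text is made up):

 GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact me at john.doe@example.com or visit https://www.elastic.co"
}

Where standard would break the email address and the URL into several tokens, uax_url_email keeps each as a single token (typed <EMAIL> and <URL>).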

3.2.3 Custom tokenizers can be implemented with a Java plugin

3.3 Token filters built into ES

3.3.1 What is a token filter?

Processes the terms output by the tokenizer.

3.3.2 Built-in token filters

  • lowercase: converts tokens to lowercase
  • stop: removes stop words (in, the, etc.)
  • synonym: adds synonyms (see the sketch after this list)
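
The synonym filter is not demonstrated later either, so here is a minimal sketch, assuming _analyze accepts an inline (anonymous) filter definition and using a made-up synonym pair:

 GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick response"
}

Both quick and fast should then appear at the same position in the output.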

4. Demo cases

4.1 html_strip / mapping + keyword

 GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip"
    },
    {
      "type": "mapping",
      "mappings": [
        "- => _", ":) => _happy_", ":( => _sad_"
      ]
    }
  ],
  "text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"
}

The keyword tokenizer is used here, so the text is kept whole as a single token rather than being split.

Two char_filters are used: html_strip (removes the HTML tags) and mapping (replaces the original content with the specified content).

The result below: the HTML tags are removed, the hyphens are replaced with underscores, and the emoticons are mapped to _happy_ / _sad_.

 {
  "tokens" : [
    {
      "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
      "start_offset" : 3,
      "end_offset" : 52,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.2 char_filter with regex replacement

 GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

The regex replacement is configured with type / pattern / replacement.

Result:

 {
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

4.3 Tokenizer with path (directory) splitting

 GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/niewj/a/b/c"
}

Word segmentation result:

 {
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b/c",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
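
path_hierarchy also accepts parameters such as delimiter. A sketch using an inline tokenizer definition (the date-like text and the - delimiter are just for illustration):

 GET _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "-"
  },
  "text": "2023-07-15"
}

This should yield 2023, 2023-07 and 2023-07-15 as the hierarchy of tokens.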

4.4 whitespace tokenizer with the stop token filter

 GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], // ["lowercase", "stop"]
  "text": "The girls in China are playing this game !"
}

Result: in and this are removed (stop words), but The keeps its capitalization and is retained (there is no lowercase filter, so it does not match the stop word the), and the ! survives because the tokenizer is whitespace rather than standard and so does not strip punctuation.

 {
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!",
      "start_offset" : 41,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    }
  ]
}
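
As the comment in the request above hints, putting lowercase in front of stop changes the picture:

 GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The girls in China are playing this game !"
}

With the tokens lowercased first, the now matches the stop list and is dropped as well, and China is indexed as china.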

4.5 Custom analyzer

4.5.1 Defining the custom analyzer in settings

 PUT my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{ // 1.自定义analyzer的名称
          "type": "custom",
          "char_filter": ["my_emoticons"], 
          "tokenizer": "my_punctuation", 
          "filter": ["lowercase", "my_english_stop"]
        }
      },
      "tokenizer": {
        "my_punctuation": { // 3.自定义tokenizer的名称
          "type": "pattern", "pattern":"[ .,!?]"
        }
      },
      "char_filter": {
        "my_emoticons": { // 2.自定义char_filter的名称
          "type": "mapping", "mappings":[":) => _hapy_", ":( => _sad_"]
        }
      },
      "filter": {
        "my_english_stop": { // 4.自定义token filter的名称
          "type": "stop", "stopwords": "_english_"
        }
      }
    }
  }
}

4.5.2 Testing the custom analyzer

 POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :( And You? "
}

Output:

 {
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "_hapy_",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "person",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "earth",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_sad_",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "you",
      "start_offset" : 37,
      "end_offset" : 40,
      "type" : "word",
      "position" : 9
    }
  ]
}
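
Once defined, the custom analyzer can be assigned to a text field in the index mapping; a minimal sketch (the field name content is made up for illustration):

 PUT my_new_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

Documents indexed into content will then be analyzed with my_analyzer at index time.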
