This article introduces ElasticSearch first from the top down and then from the bottom up, trying to answer the following questions:
- Why does searching for *foo-bar* fail to match "foo-bar"?
- Why does adding more documents sometimes make the index smaller?
- Why does ElasticSearch take up a lot of memory?
Version
ElasticSearch version: elasticsearch-2.2.0
Cluster on the cloud
Box in the cluster
Each white square box in the cloud represents a node (Node).
Between nodes
On one or more nodes, multiple small green squares combine to form an ElasticSearch index.
Small square in the index
Under an index, the small green squares distributed among multiple nodes are called shards.
Shard=Lucene Index
An ElasticSearch Shard is essentially a Lucene Index.
Lucene is a full-text search library (one of many kinds of search library), and ElasticSearch is built on top of it. Most of what follows is really about how ElasticSearch works on top of Lucene.
Graphical Lucene
Mini index: the segment
A Lucene index contains many small segments; we can think of them as mini-indexes inside Lucene.
Inside Segment
A segment contains many data structures:
- Inverted Index
- Stored Fields
- Document Values
- Cache
The most important: the Inverted Index
Inverted Index mainly includes two parts:
- An ordered Term dictionary (containing each Term and its frequency of occurrence).
- The postings for each Term (that is, the documents in which the Term appears).
When we search, we first tokenize the query, then look up each Term in the dictionary, and through its postings find the documents related to the search.
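The lookup described above can be sketched as a toy in Python. This is a minimal illustration only; Lucene's real on-disk postings are far more compact and sophisticated. The sample documents are made up.

```python
# A toy inverted index: Term -> set of document IDs (the postings).
docs = {
    0: "the fury of the storm",
    1: "fury road",
    2: "the quiet road",
}

# Build: tokenize each document and record which docs each term appears in.
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)

def search(query):
    """Tokenize the query and intersect the postings of each term."""
    postings = [inverted_index.get(t, set()) for t in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search("the fury"))  # -> [0]: only doc 0 contains both terms
```

Intersecting postings is what makes an AND query cheap: no document text is ever scanned at query time.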
Query "the fury"
AutoCompletion-Prefix
To find all words starting with the letter "c", such as "choice" and "coming", we can simply binary-search the sorted Term dictionary of the Inverted Index.
Expensive lookup
If you want to find all words that contain the letters "our", the system will scan the entire Inverted Index, which is very expensive.
In this case, if we want to optimize, the problem becomes: how do we generate suitable Terms?
Problem transformation
For problems such as the above, we may have several feasible solutions:
* suffix -> xiffus *
If we want to search by suffix, we can index each Term reversed, turning a suffix query into a prefix query.
(60.6384, 6.5017) -> u4u8gyykk
For GEO location information, it can be converted to GEO Hash.
123 -> {1-hundreds, 12-tens, 123}
For simple numbers, multiple forms of Term can be generated for it.
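Two of the transformations above can be sketched in a few lines of Python: reversing a word for suffix search, and generating one Term per decimal prefix of a number. (GEO hashing is omitted here; a real system would use a geohash library.)

```python
def suffix_terms(word):
    """Index the reversed word so a suffix query becomes a prefix query."""
    return [word, word[::-1]]          # "suffix" -> ["suffix", "xiffus"]

def number_terms(n):
    """Generate one Term per decimal prefix: 123 -> {"1", "12", "123"}."""
    s = str(n)
    return {s[:i] for i in range(1, len(s) + 1)}

print(suffix_terms("suffix"))   # ['suffix', 'xiffus']
print(number_terms(123))        # {'1', '12', '123'}
```

Both tricks trade index size for query speed: extra Terms are written at index time so that an otherwise expensive scan becomes a cheap dictionary lookup.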
Resolve spelling errors
There is a Python library that generates a tree-shaped state machine encoding the possible misspellings of a word, which solves the spelling-error problem.
Stored Field field lookup
When we want to find a document with a specific title, the Inverted Index cannot solve this problem well, so Lucene provides another data structure, Stored Fields. Essentially, Stored Fields are simple key-value pairs. By default, ElasticSearch stores the JSON source of the entire document as a Stored Field.
Document Values for sorting and aggregation
Even so, the structures above still cannot handle operations such as sorting, aggregation, and faceting well, because we would have to read a lot of unnecessary information.
So another data structure solves this problem: Document Values. This structure is essentially columnar storage: it highly optimizes the storage layout for data of the same type.
To improve efficiency, ElasticSearch can read all the Document Values under an index into memory, which greatly improves access speed but also consumes a lot of memory.
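The advantage of the columnar layout can be shown with a small sketch: to sort by one field, we only need that field's column, never the whole documents. The sample data is made up.

```python
# Row storage: one dict per document (like Stored Fields).
rows = [
    {"title": "a", "views": 30},
    {"title": "b", "views": 10},
    {"title": "c", "views": 20},
]

# Columnar layout (like Document Values): one contiguous array per
# field, indexed by document id.
views_column = [r["views"] for r in rows]   # [30, 10, 20]

# Sort doc ids by the "views" field, touching only that single column.
order = sorted(range(len(views_column)), key=lambda i: views_column[i])
print(order)  # doc ids in ascending order of views
```

Because all values in a column share one type and sit contiguously, they also compress well, which is the other half of the optimization mentioned above.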
In short, these data structures (Inverted Index, Stored Fields, Document Values) and their caches all live inside the segment.
When the search occurs
When a search happens, Lucene searches all segments, returns each segment's results, and finally merges them and presents them to the client.
Several properties of Lucene matter a great deal in this process:
- Segments are immutable
- Delete? When a deletion happens, all Lucene does is mark the position as deleted; the data stays where it was, unchanged.
- Update? So an update is essentially: delete first, then re-index.
- Compression everywhere
- Lucene is very good at compressing data. Basically all compression methods in textbooks can be found in Lucene.
- Cache everything
- Lucene will also cache all information, which greatly improves its query efficiency.
Cached story
When ElasticSearch indexes a document, it builds the corresponding caches for it and refreshes the data periodically (once per second by default); after that, the document becomes searchable.
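The once-per-second refresh is configurable per index via the `refresh_interval` setting. A minimal sketch of the settings payload you would PUT to an index's `_settings` endpoint (here we only build the JSON, no cluster is contacted):

```python
import json

# "1s" is the default; "-1" disables automatic refresh entirely,
# which is a common trick during bulk indexing.
settings = {"index": {"refresh_interval": "1s"}}
payload = json.dumps(settings)
print(payload)
```

Lengthening or disabling the refresh interval trades search freshness for indexing throughput.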
As time goes on, we accumulate many segments,
so ElasticSearch merges these segments, and in the process the old segments are deleted.
This is why adding documents can make the index take up less space: adding can trigger a merge, which results in more compression.
An example
Two segments will merge
These two segments will eventually be deleted and replaced by a new, merged segment.
At this time, the new segment is cold in the cache, while most of the other segments remain unchanged and stay warm.
The above scenes often happen inside the Lucene Index.
Search in Shard
The process of ElasticSearch searching from Shard is similar to the process of searching from Lucene Segment.
Unlike searching in Lucene Segment, Shards may be distributed on different Nodes, so when searching and returning results, all information will be transmitted through the network.
Note that:
1 search across 2 shards = 2 separate searches, one per shard
Processing of log files
When we want to search logs from a specific date, partitioning the log data into indices by timestamp greatly improves search efficiency.
Deleting old data also becomes very convenient: just delete the old index.
In the above case, each index has two shards
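Time-based partitioning usually just means one index per day with a date in its name. A small sketch (the `logs-YYYY.MM.DD` naming is a common convention, not an ElasticSearch requirement):

```python
from datetime import date, timedelta

def index_for(day):
    """Daily index name for a given date, e.g. logs-2016.02.01."""
    return "logs-" + day.strftime("%Y.%m.%d")

def indices_for_range(start, end):
    """All daily index names from start to end, inclusive."""
    days = (end - start).days
    return [index_for(start + timedelta(days=i)) for i in range(days + 1)]

print(indices_for_range(date(2016, 2, 1), date(2016, 2, 3)))
# ['logs-2016.02.01', 'logs-2016.02.02', 'logs-2016.02.03']
```

A date-range search then only has to touch the few indices whose names fall in the range, and retiring old data is a single index deletion instead of millions of document deletes.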
How to scale
A shard will not be split further, but a shard may be moved to a different node.
Therefore, if the pressure on the cluster grows to a certain level, we may consider adding new nodes, which may force us to re-index all the data; that is something we don't want to see, so at planning time we must think carefully about how to balance having enough nodes against having too few.
Node allocation and shard optimization
- Allocate better-performing machines to the nodes holding the more important data indices
- Ensure that each shard has a replica
Routing
Every node keeps a routing table, so when a request arrives at any node, ElasticSearch can forward it to the shards on the nodes that need to handle it.
A real request
Query
The query is a filtered query containing a multi_match query
Aggregation
Aggregate by author to get the top-10 authors together with their top-10 hits
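A request body matching that description, written in the ElasticSearch 2.x query DSL (a `filtered` query wrapping a `multi_match`, plus a `terms` aggregation on author with a `top_hits` sub-aggregation). The field names (`title`, `body`, `author`, `published`) and the search text are illustrative assumptions, not from the original slides:

```python
import json

body = {
    "query": {
        "filtered": {                      # ES 2.x syntax; removed in 5.x
            "query": {
                "multi_match": {"query": "the fury",
                                "fields": ["title", "body"]}
            },
            "filter": {"range": {"published": {"gte": "2016-01-01"}}},
        }
    },
    "aggs": {
        "top_authors": {
            "terms": {"field": "author", "size": 10},      # top-10 authors
            "aggs": {"best": {"top_hits": {"size": 10}}},  # their top hits
        }
    },
}

print(json.dumps(body, indent=2))
```

You would POST this body to an index's `_search` endpoint; here we only build the payload.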
Request distribution
This request may be distributed to any node in the cluster
The coordinating node
At this point, this node becomes the Coordinator of the current request, and it decides:
- based on the index information, which nodes the request should be routed to
- which replicas are available
- and so on
routing
Before the real search
ElasticSearch will convert Query to Lucene Query
Then perform calculations in all segments
There will also be a cache for the Filter condition itself
But queries themselves are not cached, so if the same Query is executed repeatedly, the application needs to do its own caching.
Therefore:
- use filters whenever possible
- use queries only when scoring is needed
return
After the search finishes, the results are returned level by level along the path the request came down.