
This article introduces ElasticSearch first from the top down and then from the bottom up, trying to answer the following questions:

  • Why does my search for *foo-bar* fail to match foo-bar?
  • Why can adding more documents make the index (Index) smaller?
  • Why does ElasticSearch use so much memory?

Version

ElasticSearch version: elasticsearch-2.2.0

Cluster on the cloud

[Image]

Box in the cluster

Each white square box in the cloud represents a node (Node).

[Image]

Between nodes

On one or more nodes, multiple small green squares combine to form an ElasticSearch Index.

[Image]

Small square in the index

Under an index, the small green squares distributed among multiple nodes are called shards.

[Image]

Shard=Lucene Index

An ElasticSearch Shard is essentially a Lucene Index.

[Image]

Lucene is a full-text search library (one of many kinds of search library), and ElasticSearch is built on top of it. Most of what follows is really about how ElasticSearch works on the basis of Lucene.

Graphical Lucene

Mini index: the segment

Lucene contains many small segments; we can think of each of them as a mini-index inside Lucene.

[Image]

Inside Segment

A segment contains several data structures:

  • Inverted Index
  • Stored Fields
  • Document Values
  • Cache

[Image]

The most important: the Inverted Index

[Image]

The Inverted Index consists of two main parts:

  • An ordered term dictionary (containing each word, the Term, and its frequency of occurrence).
  • For each Term, a postings list (that is, the documents in which the Term occurs).

When we search, we first tokenize the search text, then look the resulting Terms up in the dictionary, and through the postings find the documents related to the search.
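
As a rough illustration of these two parts (not Lucene's actual on-disk format), a minimal inverted index can be sketched in Python as an ordered term dictionary plus postings lists; the documents and the naive tokenizer below are made up:

```python
from collections import defaultdict

# Toy documents; in Lucene each of these would be an indexed document.
docs = {
    1: "the fury of the storm",
    2: "winter is coming",
    3: "the choice is yours",
}

# Build the inverted index: term -> postings (set of doc ids containing it).
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():      # naive tokenization
        postings[term].add(doc_id)

dictionary = sorted(postings)              # the ordered term dictionary

def search(query):
    """AND-search: return the doc ids that contain every query term."""
    result = None
    for term in query.lower().split():
        matched = postings.get(term, set())
        result = matched if result is None else result & matched
    return sorted(result or [])

print(search("the fury"))   # -> [1]
```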

[Image]

Query "the fury"

[Image]

Auto-completion (prefix)

If you want to find the words starting with the letter "c", such as "choice" and "coming" (Terms), a simple binary search over the ordered Inverted Index term dictionary is enough.
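
A small sketch of that prefix lookup, assuming an already-sorted term dictionary; Python's bisect module does the binary search:

```python
import bisect

dictionary = sorted(["and", "choice", "coming", "fury", "hour", "our", "storm", "yours"])

def prefix_terms(prefix):
    """All terms starting with `prefix`, found via binary search."""
    start = bisect.bisect_left(dictionary, prefix)
    end = bisect.bisect_left(dictionary, prefix + "\uffff")  # just past the prefix range
    return dictionary[start:end]

print(prefix_terms("c"))   # -> ['choice', 'coming']
```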

[Image]

Expensive lookup

If you want to find all the words that contain the letters "our", the system has to scan the entire Inverted Index, which is very expensive.

[Image]

In this case, if we want to optimize, the problem we face becomes how to generate suitable Terms.

Problem transformation

[Image]

For problems like the ones above, there are several feasible approaches:

*suffix -> xiffus*

If we want to search by suffix, we can also index a reversed form of the Term.

(60.6384, 6.5017) -> u4u8gyykk

Geo location information can be converted into a Geohash.

123 -> {1-hundreds, 12-tens, 123}

For simple numbers, multiple forms of the Term can be generated.
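
A minimal sketch of those index-time transformations; the geohash value is simply the one quoted above, and a real setup would use an analyzer or a geohash library rather than these hand-written helpers:

```python
def suffix_terms(term):
    # Also index the reversed term, so that a "*fix" style suffix search
    # becomes a cheap prefix search on the reversed form.
    return [term, term[::-1]]

def number_terms(n):
    # Index the number at several precisions: 123 -> ["1", "12", "123"].
    s = str(n)
    return [s[:i] for i in range(1, len(s) + 1)]

print(suffix_terms("suffix"))   # -> ['suffix', 'xiffus']
print(number_terms(123))        # -> ['1', '12', '123']
# Geo points would similarly be indexed as a geohash string,
# e.g. (60.6384, 6.5017) -> "u4u8gyykk".
```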

Resolve spelling errors

A Python library generates a tree-shaped state machine (a Levenshtein automaton) for a word, encoding its possible misspellings, and this is used to solve the problem of spelling errors.
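
Lucene's actual machinery is the Levenshtein automaton mentioned above; as a much simpler illustration of the same idea, here is a brute-force edit-distance scan over a toy term dictionary (fine as a sketch, far too slow for a real index):

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

dictionary = ["choice", "coming", "fury", "hour", "storm"]

def fuzzy(term, max_edits=1):
    # Keep only the dictionary terms within `max_edits` of the query term.
    return [t for t in dictionary if edit_distance(term, t) <= max_edits]

print(fuzzy("stor"))   # -> ['storm']
```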

[Image]

Stored Fields for field lookup

When we want to find the document that contains a specific title, the Inverted Index cannot solve this problem well, so Lucene provides another data structure, Stored Fields. Essentially, Stored Fields is a simple key-value store; by default, ElasticSearch stores the JSON source of the entire document in it.
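
Conceptually (ignoring Lucene's compressed on-disk layout), Stored Fields behave like a per-document key-value store, with the whole JSON source kept under the doc id; the documents below are made up:

```python
# Stored Fields, conceptually: doc id -> the stored field values.
# ElasticSearch keeps the whole JSON source here as the _source field.
stored_fields = {
    1: {"_source": {"title": "The Fury", "author": "alice", "likes": 10}},
    2: {"_source": {"title": "Winter",   "author": "bob",   "likes": 25}},
}

# Given a doc id obtained from the inverted index, fetching its title is a direct lookup:
print(stored_fields[1]["_source"]["title"])   # -> The Fury
```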

[Image]

Document Values for sorting and aggregation

Even so, we find that the structures above still cannot handle things like sorting, aggregation, and faceting well, because they may force us to read a lot of information we do not need.

So another data structure, Document Values, solves this problem. It is essentially columnar storage, which heavily optimizes the storage layout for values of the same type.

[Image]

To improve efficiency, ElasticSearch can load all the Document Values under an index into memory to operate on, which greatly improves access speed but also consumes a lot of memory.
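
A sketch of the same idea laid out as Document Values: one column per field, so sorting or aggregating only touches the columns it actually needs (field names and values are made up):

```python
# Document Values, conceptually: column-oriented storage, one array per field,
# indexed by doc id. Only the "likes" column is read to sort by likes.
doc_values = {
    "author": ["alice", "bob", "alice", "carol"],
    "likes":  [10, 25, 7, 42],
}

doc_ids = range(len(doc_values["likes"]))

# Sort doc ids by the "likes" column.
print(sorted(doc_ids, key=lambda d: doc_values["likes"][d], reverse=True))  # -> [3, 1, 0, 2]

# Aggregate: total likes per author, again reading only two columns.
totals = {}
for d in doc_ids:
    author = doc_values["author"][d]
    totals[author] = totals.get(author, 0) + doc_values["likes"][d]
print(totals)   # -> {'alice': 17, 'bob': 25, 'carol': 42}
```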

In short, these data structures (the Inverted Index, Stored Fields, Document Values) and their caches all live inside the segment.

When a search happens

When searching, Lucene searches every segment, then merges the per-segment results and returns them to the client.
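
A rough sketch of that per-segment search and merge, assuming each segment can return its own scored hits:

```python
import heapq

# Each segment returns its own top hits as (score, doc_id) pairs.
segment_results = [
    [(3.2, "doc-1"), (1.1, "doc-7")],      # segment 1
    [(2.8, "doc-4")],                      # segment 2
    [(4.0, "doc-9"), (0.5, "doc-3")],      # segment 3
]

def search_all_segments(results_per_segment, size=3):
    """Merge per-segment results into one global top-`size` list by score."""
    merged = heapq.merge(*[sorted(r, reverse=True) for r in results_per_segment],
                         reverse=True)
    return list(merged)[:size]

print(search_all_segments(segment_results))
# -> [(4.0, 'doc-9'), (3.2, 'doc-1'), (2.8, 'doc-4')]
```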

Several features of Lucene make this process very important:

  • Segments are immutable.
  • Delete? When a delete happens, Lucene only marks the document as deleted; the data stays where it is and does not change.
  • Update? An update is therefore essentially a delete followed by a re-index (Re-index).
  • Compression everywhere. Lucene is very good at compressing data; basically every compression technique from the textbooks can be found in Lucene.
  • Caches everything. Lucene also caches all kinds of information, which greatly improves query efficiency.

The story of caching

When ElasticSearch indexes a document, it builds the corresponding cache for it and refreshes the data periodically (every second by default); only after a refresh do those documents become searchable.
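
A small sketch of that refresh behaviour against a local ES 2.x node; the index name, document, and node URL are made up, and `requests` is used to talk straight to the REST API:

```python
import requests

ES = "http://localhost:9200"          # assumed local node
INDEX = "blog"                        # hypothetical index name

# Index a document. It is not searchable until the next periodic refresh
# (index.refresh_interval, 1s by default).
requests.put(ES + "/" + INDEX + "/post/1", json={"title": "the fury"})

# Force a refresh so the document becomes visible to searches right away.
requests.post(ES + "/" + INDEX + "/_refresh")

resp = requests.post(ES + "/" + INDEX + "/post/_search",
                     json={"query": {"match": {"title": "fury"}}})
print(resp.json()["hits"]["total"])
```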

[Image]

As time goes on, we end up with many segments,

[Image]

So ElasticSearch merges these segments, and in the process the merged-away segments are eventually deleted.

[Image]

This is why adding documents can make the index take up less space: the addition can trigger a merge, and a merge can lead to better compression.
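
The merge can also be triggered by hand to observe this effect; in ES 2.x this is the force-merge API (a sketch, reusing the assumed local node and index from above):

```python
import requests

ES = "http://localhost:9200"   # assumed local node

# Ask ElasticSearch to merge the segments of the "blog" index down to one.
# The old segments are deleted after the merge, and the merged segment is
# usually better compressed, so the index can end up smaller on disk.
resp = requests.post(ES + "/blog/_forcemerge", params={"max_num_segments": 1})
print(resp.status_code)
```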

An example

Two segments are about to be merged

[Image]

These two segments are merged into a new segment, and the old ones are eventually deleted.

[Image]

At this point the new segment is cold in the cache, while most of the other segments remain unchanged and stay warm.

The scenario above happens constantly inside the Lucene Index.

[Image]

Search in Shard

The process by which ElasticSearch searches a Shard is similar to the process by which Lucene searches a Segment.

[Image]

Unlike a search within a Lucene Segment, Shards may be distributed across different Nodes, so when searching and returning results, all of the information has to travel over the network.

Something to be aware of:

1 search that hits 2 shards = 2 separate searches, one per shard

[Image]

Processing of log files

When we want to search for logs generated on a specific date, partitioning and indexing the log data by timestamp greatly improves search efficiency.

It is also very convenient when we want to delete old data: just delete the old index.

[Image]

In the above case, each index has two shards
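
A sketch of that layout: one index per day, each created with two shards, and old data dropped by deleting whole indices (index names and the node URL are again made up):

```python
import requests

ES = "http://localhost:9200"   # assumed local node

# One index per day, each with two primary shards (as in the picture above).
for day in ["2016.02.01", "2016.02.02"]:
    requests.put(ES + "/logs-" + day,
                 json={"settings": {"number_of_shards": 2,
                                    "number_of_replicas": 1}})

# Searching a single day only touches that day's index (and its two shards).
requests.post(ES + "/logs-2016.02.02/_search",
              json={"query": {"match": {"message": "error"}}})

# Dropping old data is just deleting the old index.
requests.delete(ES + "/logs-2016.02.01")
```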

How to scale

[Image]

A shard will not be split any further, but a shard can be moved to a different node.

[Image]

Therefore, if the pressure on the cluster nodes grows beyond a certain level, we may consider adding new nodes, which would require us to re-index all the data. That is something we would rather avoid, so at planning time we need to think carefully about how to balance having enough nodes against having too few.

Node allocation and shard optimization

  • Allocate machines with better performance to the nodes holding the more important data indices
  • Make sure every shard has a replica (a sketch of the corresponding index settings follows below)
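
A sketch of those index settings, assuming the better machines have been started with a custom node attribute such as `node.box_type: strong` (the attribute name and index name are made up for the example):

```python
import requests

ES = "http://localhost:9200"   # assumed local node

requests.put(ES + "/important-index", json={
    "settings": {
        # Keep a replica of every shard.
        "index.number_of_replicas": 1,
        # Shard allocation filtering: only place this index's shards on
        # nodes started with the (hypothetical) attribute node.box_type=strong.
        "index.routing.allocation.require.box_type": "strong",
    }
})
```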

[Image]

Routing

Each node keeps a routing table, so when a request arrives at any node, ElasticSearch can forward it to the shards on the nodes that actually hold the data for further processing.

[Image]

A real request

[Image]

Query

[Image]

The query is of the filtered type and wraps a multi_match query.

Aggregation

[Image]

An aggregation by author, to get the top 10 authors alongside the top 10 hits.
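
A sketch of what such a request could look like in ES 2.x syntax: a filtered query wrapping a multi_match, plus a terms aggregation on the author field, returning the top 10 hits and the top 10 authors (the index, fields, and node URL are made up):

```python
import requests

ES = "http://localhost:9200"   # assumed local node

body = {
    "size": 10,                                   # top 10 hits
    "query": {
        "filtered": {                             # ES 2.x "filtered" query
            "query": {
                "multi_match": {
                    "query": "the fury",
                    "fields": ["title", "body"],  # hypothetical fields
                }
            },
            "filter": {
                "term": {"status": "published"}   # hypothetical filter
            },
        }
    },
    "aggs": {
        "top_authors": {
            "terms": {"field": "author", "size": 10}   # top 10 authors
        }
    },
}

resp = requests.post(ES + "/articles/_search", json=body)
print(resp.json()["hits"]["total"])
```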

Request distribution

This request may be distributed to any node in the cluster

[Image]

God node

[Image]

At this point, this node becomes the Coordinator for the current request, and it decides:

  • which nodes the request should be routed to, based on the index information
  • which replicas are available
  • and so on

Routing

[Image]

Before the real search

ElasticSearch will convert the Query into a Lucene Query

[Image]

Then the query is executed against all segments

[Image]

The Filter conditions themselves are also cached

[Image]

But queries are not cached, so if the same Query is executed repeatedly, the application has to do its own caching

[Image]

So:

  • filters can be used at any time
  • a Query is only needed when scoring is required (see the sketch below)
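
A sketch of that rule of thumb in ES 2.x syntax: the exact-match condition goes into the filter part (cacheable, no scoring), while the full-text part stays a scored query; the index and fields are again made up:

```python
import requests

ES = "http://localhost:9200"   # assumed local node

# Exact-match condition as a filter: no scoring, and the filter itself
# can be cached and reused by ElasticSearch.
filter_only = {"query": {"filtered": {"filter": {"term": {"status": "published"}}}}}

# Full-text condition as a query: documents are scored by relevance, and
# the result is not cached by ElasticSearch, so a repeated identical query
# would have to be cached by the application.
scored = {"query": {"match": {"title": "the fury"}}}

for body in (filter_only, scored):
    resp = requests.post(ES + "/articles/_search", json=body)
    print(resp.json()["hits"]["total"])
```
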
Return

After the search is over, the results are returned layer by layer back along the path the request came in on.

[Image]

Source: https://www.cnblogs.com/richaaaard/p/5226334.html


10多年IT职场老司机的经验分享,坚持自学一路从技术小白成长为互联网企业信息技术部门的负责人。2019/2020/2021年度 思否Top Writer