This article introduces ElasticSearch first from the top down and then from the bottom up, trying to answer the following questions:
- Why does searching for *foo-bar* fail to match "foo-bar"?
- Why does adding more documents sometimes make the index smaller?
- Why does ElasticSearch take up a lot of memory?
Version
ElasticSearch version: elasticsearch-2.2.0
Cluster on the cloud
Box in the cluster
Each white square box in the cloud represents a node (Node).
Between nodes
On one or more nodes, multiple small green squares combine to form an ElasticSearch index.
Small square in the index
Under an index, the small green squares distributed among multiple nodes are called shards.
Shard=Lucene Index
An ElasticSearch Shard is essentially a Lucene Index.
Lucene is a full-text search library (one of many kinds of search library), and ElasticSearch is built on top of it. Most of what follows is really about how ElasticSearch works on top of Lucene.
Graphical Lucene
Mini index: the segment
A Lucene index contains many small segments; we can think of them as mini-indexes inside Lucene.
Inside Segment
A segment contains many data structures:
- Inverted Index
- Stored Fields
- Document Values
- Cache
The most important: the Inverted Index
Inverted Index mainly includes two parts:
- An ordered Term dictionary (containing each Term and its frequency of occurrence).
- The postings for each Term (that is, the documents in which the Term appears).
When we search, we first tokenize the query, then look up each Term in the dictionary, and through its postings find the documents related to the search.
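The lookup described above can be sketched as a toy in Python. This is a minimal illustration only; Lucene's real on-disk postings are far more compact and sophisticated. The sample documents are made up.

```python
# A toy inverted index: Term -> set of document IDs (the postings).
docs = {
    0: "the fury of the storm",
    1: "fury road",
    2: "the quiet road",
}

# Build: tokenize each document and record which docs each term appears in.
inverted_index = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)

def search(query):
    """Tokenize the query and intersect the postings of each term."""
    postings = [inverted_index.get(t, set()) for t in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search("the fury"))  # -> [0]: only doc 0 contains both terms
```

Intersecting postings is what makes an AND query cheap: no document text is ever scanned at query time.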
Query "the fury"
AutoCompletion-Prefix
To find all words starting with the letter "c", such as "choice" and "coming", we can simply binary-search the sorted Term dictionary of the Inverted Index.
Expensive lookup
If you want to find all words that contain the letters "our", the system will scan the entire Inverted Index, which is very expensive.
In this case, if we want to optimize, the problem becomes: how do we generate suitable Terms?
Problem transformation
For problems such as the above, we may have several feasible solutions:
* suffix -> xiffus *
If we want to search by suffix, we can index each Term reversed, turning a suffix query into a prefix query.
(60.6384, 6.5017) -> u4u8gyykk
For GEO location information, it can be converted to GEO Hash.
123 -> {1-hundreds, 12-tens, 123}
For simple numbers, multiple forms of Term can be generated for it.
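Two of the transformations above can be sketched in a few lines of Python: reversing a word for suffix search, and generating one Term per decimal prefix of a number. (GEO hashing is omitted here; a real system would use a geohash library.)

```python
def suffix_terms(word):
    """Index the reversed word so a suffix query becomes a prefix query."""
    return [word, word[::-1]]          # "suffix" -> ["suffix", "xiffus"]

def number_terms(n):
    """Generate one Term per decimal prefix: 123 -> {"1", "12", "123"}."""
    s = str(n)
    return {s[:i] for i in range(1, len(s) + 1)}

print(suffix_terms("suffix"))   # ['suffix', 'xiffus']
print(number_terms(123))        # {'1', '12', '123'}
```

Both tricks trade index size for query speed: extra Terms are written at index time so that an otherwise expensive scan becomes a cheap dictionary lookup.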
Resolve spelling errors
There is a Python library that generates a tree-shaped state machine encoding the possible misspellings of a word, which solves the spelling-error problem.
Stored Field field lookup
When we want to find a document with a specific title, the Inverted Index cannot solve this problem well, so Lucene provides another data structure, Stored Fields. Essentially, Stored Fields are simple key-value pairs. By default, ElasticSearch stores the JSON source of the entire document as a Stored Field.
Document Values for sorting and aggregation
Even so, the structures above still cannot handle operations such as sorting, aggregation, and faceting well, because we would have to read a lot of unnecessary information.
So another data structure solves this problem: Document Values. This structure is essentially columnar storage: it highly optimizes the storage layout for data of the same type.
To improve efficiency, ElasticSearch can read all the Document Values under an index into memory, which greatly improves access speed but also consumes a lot of memory.
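The advantage of the columnar layout can be shown with a small sketch: to sort by one field, we only need that field's column, never the whole documents. The sample data is made up.

```python
# Row storage: one dict per document (like Stored Fields).
rows = [
    {"title": "a", "views": 30},
    {"title": "b", "views": 10},
    {"title": "c", "views": 20},
]

# Columnar layout (like Document Values): one contiguous array per
# field, indexed by document id.
views_column = [r["views"] for r in rows]   # [30, 10, 20]

# Sort doc ids by the "views" field, touching only that single column.
order = sorted(range(len(views_column)), key=lambda i: views_column[i])
print(order)  # doc ids in ascending order of views
```

Because all values in a column share one type and sit contiguously, they also compress well, which is the other half of the optimization mentioned above.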
In short, these data structures (Inverted Index, Stored Fields, Document Values) and their caches all live inside the segment.
When the search occurs
When a search happens, Lucene searches all segments, returns each segment's results, and finally merges them and presents them to the client.
Several properties of Lucene matter a great deal in this process:
- Segments are immutable
- Delete? When a deletion happens, all Lucene does is mark the position as deleted; the data stays where it was, unchanged.
- Update? So an update is essentially: delete first, then re-index.
- Compression everywhere
- Lucene is very good at compressing data. Basically all compression methods in textbooks can be found in Lucene.
- Cache everything
- Lucene will also cache all information, which greatly improves its query efficiency.
Cached story
When ElasticSearch indexes a document, it builds the corresponding caches for it and refreshes the data periodically (once per second by default); after that, the document becomes searchable.
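The once-per-second refresh is configurable per index via the `refresh_interval` setting. A minimal sketch of the settings payload you would PUT to an index's `_settings` endpoint (here we only build the JSON, no cluster is contacted):

```python
import json

# "1s" is the default; "-1" disables automatic refresh entirely,
# which is a common trick during bulk indexing.
settings = {"index": {"refresh_interval": "1s"}}
payload = json.dumps(settings)
print(payload)
```

Lengthening or disabling the refresh interval trades search freshness for indexing throughput.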
As time goes on, we accumulate many segments,
so ElasticSearch merges these segments, and in the process the old segments are deleted.
This is why adding documents can make the index take up less space: adding can trigger a merge, which results in more compression.
An example
Two segments will merge
These two segments will eventually be deleted and replaced by a new, merged segment.
At this time, the new segment is cold in the cache, while most of the other segments remain unchanged and stay warm.
The above scenes often happen inside the Lucene Index.
Search in Shard
The process of ElasticSearch searching from Shard is similar to the process of searching from Lucene Segment.
Unlike searching in Lucene Segment, Shards may be distributed on different Nodes, so when searching and returning results, all information will be transmitted through the network.
Note that:
1 search across 2 shards = 2 separate searches, one per shard
Processing of log files
When we want to search logs from a specific date, partitioning the log data into indices by timestamp greatly improves search efficiency.
Deleting old data also becomes very convenient: just delete the old index.
In the above case, each index has two shards
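Time-based partitioning usually just means one index per day with a date in its name. A small sketch (the `logs-YYYY.MM.DD` naming is a common convention, not an ElasticSearch requirement):

```python
from datetime import date, timedelta

def index_for(day):
    """Daily index name for a given date, e.g. logs-2016.02.01."""
    return "logs-" + day.strftime("%Y.%m.%d")

def indices_for_range(start, end):
    """All daily index names from start to end, inclusive."""
    days = (end - start).days
    return [index_for(start + timedelta(days=i)) for i in range(days + 1)]

print(indices_for_range(date(2016, 2, 1), date(2016, 2, 3)))
# ['logs-2016.02.01', 'logs-2016.02.02', 'logs-2016.02.03']
```

A date-range search then only has to touch the few indices whose names fall in the range, and retiring old data is a single index deletion instead of millions of document deletes.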
How to scale
A shard will not be split further, but a shard may be moved to a different node.
Therefore, if the pressure on the cluster grows to a certain level, we may consider adding new nodes, which may force us to re-index all the data; that is something we don't want to see, so at planning time we must think carefully about how to balance having enough nodes against having too few.
Node allocation and shard optimization
- Allocate better-performing machines to the nodes holding the more important data indices
- Ensure that each shard has a replica
Routing
Every node keeps a routing table, so when a request arrives at any node, ElasticSearch can forward it to the shards on the nodes that need to handle it.
A real request
Query
The query is a filtered query containing a multi_match query
Aggregation
Aggregate by author to get the top-10 authors together with their top-10 hits
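A request body matching that description, written in the ElasticSearch 2.x query DSL (a `filtered` query wrapping a `multi_match`, plus a `terms` aggregation on author with a `top_hits` sub-aggregation). The field names (`title`, `body`, `author`, `published`) and the search text are illustrative assumptions, not from the original slides:

```python
import json

body = {
    "query": {
        "filtered": {                      # ES 2.x syntax; removed in 5.x
            "query": {
                "multi_match": {"query": "the fury",
                                "fields": ["title", "body"]}
            },
            "filter": {"range": {"published": {"gte": "2016-01-01"}}},
        }
    },
    "aggs": {
        "top_authors": {
            "terms": {"field": "author", "size": 10},      # top-10 authors
            "aggs": {"best": {"top_hits": {"size": 10}}},  # their top hits
        }
    },
}

print(json.dumps(body, indent=2))
```

You would POST this body to an index's `_search` endpoint; here we only build the payload.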
Request distribution
This request may be distributed to any node in the cluster
The coordinating node
At this point, this node becomes the Coordinator of the current request, and it decides:
- based on the index information, which nodes the request should be routed to
- which replicas are available
- and so on
routing
Before the real search
ElasticSearch will convert Query to Lucene Query
Then perform calculations in all segments
There will also be a cache for the Filter condition itself
But queries themselves are not cached, so if the same Query is executed repeatedly, the application needs to do its own caching.
Therefore:
- use filters whenever possible
- use queries only when scoring is needed
return
After the search finishes, the results are returned level by level along the path the request came down.