Technology selection for massive-data system architecture
https://db-engines.com/en/ranking/search+engine
Full-text search engines (NLP, crawlers; e.g. Baidu)
Vertical search engines (e-commerce, OA, site search, video sites)
Requirements for a search engine:
- Fast queries (efficient compression algorithms, fast encoding and decoding)
- Accurate results (BM25, TF-IDF; see the scoring sketch after this list)
- Rich results (high recall)
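To make the BM25 point concrete, here is a minimal Python sketch of the per-term BM25 score. The formula is the standard Lucene-style BM25; the defaults k1=1.2 and b=0.75 and the sample numbers are assumptions for illustration, not values from these notes.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Classic BM25 score of a single term for a single document.

    tf          - occurrences of the term in this document
    doc_len     - length of this document (in terms)
    avg_doc_len - average document length across the index
    doc_count   - total number of documents
    doc_freq    - number of documents containing the term
    """
    # Inverse document frequency: rarer terms contribute more to the score.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term-frequency saturation, normalized by document length.
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# Example: a term that appears 3 times in a 100-term document,
# in an index of 1,000 documents where 50 documents contain the term.
print(bm25_term_score(tf=3, doc_len=100, avg_doc_len=120, doc_count=1000, doc_freq=50))
```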
Get started with Elasticsearch
Environment and installation
- Check compatibility between the operating system, the JDK, and Elasticsearch itself
- JDK compatibility: 8 / 11 / 14
- Download page and compatibility matrix: elastic.co/cn/downloads
Elasticsearch directory structure
- bin: executable scripts, including starting the es service, plugin management, and other utility commands
- config: configuration directory: es configuration, role configuration, JVM configuration, etc.
- lib: the Java libraries es depends on
- data: the default data directory, holding all data for nodes, shards, indexes, and documents; must be changed for production
- logs: the default log directory; must be changed for production
- modules: contains all es modules, such as Cluster, Discovery, Indices, etc.
- plugins: directory of installed plugins
- jdk/jdk.app: bundled Java runtime, available only since version 7.0
start up
- Start bin/elasticsearch, open localhost:9200
Multi-node mode
- Multiple installations, each running a single node
- A single installation running multiple nodes, by overriding paths and names per node:
bin/elasticsearch -E path.data=data1 -E path.logs=log1 -E node.name=node1 -E cluster.name=msb_teach
bin/elasticsearch -E path.data=data2 -E path.logs=log2 -E node.name=node2 -E cluster.name=msb_teach
Cluster health
Health status
- green: all primary and replica shards are active; the cluster is healthy
- yellow: at least one replica shard is unavailable, but all primary shards are active; data integrity is still guaranteed
- red: at least one primary shard is unavailable; the cluster is unavailable
Health check
- _cat/health
- _cluster/health
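Both endpoints can be checked quickly from code; a small Python sketch (assuming the requests library and a node listening on localhost:9200):

```python
import requests

ES = "http://localhost:9200"  # assumed local node address

# Structured JSON view of cluster health.
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"])                                    # "green", "yellow" or "red"
print(health["active_shards"], health["unassigned_shards"])

# Compact, human-readable one-line view.
print(requests.get(f"{ES}/_cat/health?v").text)
```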
kibana
- Verify that the service started successfully localhost:5601
- Configure the es service address (in kibana.yml)
elasticsearch.hosts: ["http://localhost:9201"]
Shut down kibana from the command line:
- close the terminal window
- ps -ef|grep 5601 or ps -ef|grep kibana or lsof -i:5601
- kill -9 pid
Causes and fixes for "Kibana server is not ready yet"
- Kibana and Elasticsearch versions are incompatible (keep them on the same version)
- The Elasticsearch service address differs from the elasticsearch.hosts configured in Kibana (set it in kibana.yml)
- Cross-origin access is disabled in Elasticsearch (allow it in elasticsearch.yml)
- The server has a firewall enabled (disable the firewall or adjust the server's security policy)
- The disk where Elasticsearch resides is more than 90% full (free up disk space; configure monitoring and alerting)
Seeing through the phenomenon to the essence: what an "index" really is
index
- Helps locate data quickly
- Uses a data structure as its carrier
- Is persisted on disk as files
Index structures in databases
Why B+Trees (MySQL) are not suited to large-scale retrieval
- MySQL, 100,000 rows, no index: 0.295 s
- MySQL, 1,000,000 rows, no index: 3.365 s
- MySQL, 1,000,000 rows, full-text index: 1.033 s
- MySQL, 10,000,000 rows, full-text index: 10.038 s
- ES, 10,000,000 docs: ~0.8 s
MySQL index structure: B+Tree visualization
Full interpretation of the inverted index
inverted index data structure
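A toy sketch of the core data structure: each term maps to a posting list of the document IDs that contain it (a real Lucene segment adds a term dictionary, skip data, and compression on top of this; the documents below are invented).

```python
from collections import defaultdict

docs = {
    1: "cheap phone with good camera",
    2: "good cheap laptop",
    3: "camera tripod",
}

# term -> sorted list of doc IDs (the posting list)
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):   # crude "analysis": whitespace split
        inverted[term].append(doc_id)

print(inverted["cheap"])    # [1, 2]
print(inverted["camera"])   # [1, 3]
```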
inverted index core algorithm
Compression algorithms for posting lists (inverted tables)
- FOR: Frame Of Reference
- RBM: Roaring Bitmap
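Both compression ideas can be sketched in a few lines of Python. The doc IDs below are made up for illustration; the real Lucene encodings work on fixed-size blocks and are considerably more involved.

```python
def for_compress(doc_ids):
    """Frame Of Reference sketch: store the gaps between sorted doc IDs and
    pack them with the minimum number of bits per value. Real Lucene splits
    the posting list into fixed-size blocks and picks a bit width per block;
    this sketch uses a single block for brevity."""
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    bits_per_delta = max(d.bit_length() for d in deltas)
    return deltas, bits_per_delta, len(deltas) * bits_per_delta

doc_ids = [73, 300, 302, 332, 343, 372]          # must be sorted
deltas, width, total_bits = for_compress(doc_ids)
print(deltas)              # [73, 227, 2, 30, 11, 29]
print(width, total_bits)   # 8 bits per delta instead of 32, 48 bits in total

# RBM (Roaring Bitmap) idea: split each 32-bit doc ID into a 16-bit container
# key (id // 65536) and a 16-bit low part (id % 65536); dense containers are
# stored as bitmaps, sparse ones as short arrays of 16-bit values.
doc_id = 132409
print(doc_id // 65536, doc_id % 65536)   # container 2, value 1337
```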
Retrieval principle of term index
- FST: Finite State Transducers
http://examples.mikemccandless.com/fst.py
How FST is implemented in Lucene
- FST: Finite State Transducers
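A hand-built toy to show the idea behind an FST term index: terms share prefixes (and, in a real Lucene FST, suffixes too), and each term's output value is spread along its arcs, so a single walk through the automaton both checks membership and produces the value. The terms, offsets, and transition table below are invented for illustration.

```python
# Miniature FST for the term dictionary {"car": 7, "cat": 5, "dog": 11},
# mapping each term to, say, the offset of its posting list.
# Each arc is: input char -> (output to add, next state).
FST = {
    "start": {"c": (5, "c"), "d": (11, "d")},
    "c":     {"a": (0, "ca")},
    "ca":    {"t": (0, "end"), "r": (2, "end")},
    "d":     {"o": (0, "do")},
    "do":    {"g": (0, "end")},
    "end":   {},
}

def lookup(term):
    node, output = "start", 0
    for ch in term:
        if ch not in FST[node]:
            return None                  # term is not in the dictionary
        arc_output, node = FST[node][ch]
        output += arc_output
    return output if node == "end" else None

print(lookup("cat"))   # 5
print(lookup("car"))   # 7  (5 on the shared 'c' arc + 2 on the 'r' arc)
print(lookup("dog"))   # 11
print(lookup("cow"))   # None
```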
Forward index vs. inverted index:
First, be clear about the two data structures: doc values map a document to its terms, while the inverted index maps a term to the document IDs that contain it. Why is the inverted index alone not suitable for aggregation? You cannot determine the total number of matching documents through the inverted index, and because fields are analyzed by default the aggregation result may be inaccurate, so a not_analyzed (keyword) field has to be created, which costs extra disk space. Take the simplest example: suppose we have a product table where each product carries several tags, and we run the following query
{
  "query": {
    "match": { "tags": "性价比" }
  },
  "aggs": {
    "tag_terms": {
      "terms": { "field": "tags.keyword" }
    }
  }
}
This aggregation query means: find all products tagged "性价比" (cost-effective), then aggregate all the tags of those products.
When the aggregation executes:
With only the inverted index, the process looks like this: scan every term in the inverted index and check whether the matching documents appear in that term's posting list; if they do, record the term. Since there is no way to know in advance whether the next term qualifies, every term has to be checked one by one, which amounts to a full scan.
With a forward index, which records which terms each doc contains (docid => all terms of the current field), we already have the set of matching docs, so we can fetch their values (all terms) directly by key (docid) instead of scanning.
So the efficiency of the forward index in aggregation queries comes, in essence, from the difference between the two data structures; it has nothing to do with combining it with the inverted index, which only does the pre-filtering. That, in principle, is why the forward index is aggregation-friendly, as the sketch below illustrates.
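The difference shows up even in a tiny Python sketch (invented data; the tag values just mirror the example query above, 性价比 meaning "cost-effective"): aggregating through the forward index is a handful of direct lookups, while doing it with only the inverted index forces a pass over every term in the dictionary.

```python
from collections import Counter

# Toy data. The forward index (doc values) maps doc id -> tags;
# the inverted index is the reverse mapping, tag -> doc ids.
forward = {
    1: ["性价比", "国产", "5G"],
    2: ["性价比", "大屏"],
    3: ["旗舰", "5G"],
}
inverted = {}
for doc_id, tags in forward.items():
    for tag in tags:
        inverted.setdefault(tag, []).append(doc_id)

# Query phase: the inverted index is perfect for finding the matching docs.
matching_docs = set(inverted["性价比"])

# Aggregation with the forward index: direct lookups by doc id, no scan.
agg_forward = Counter(tag for d in matching_docs for tag in forward[d])

# Aggregation with only the inverted index: every term must be checked
# against the matching docs - effectively a scan of the whole term dictionary.
agg_inverted = Counter()
for term, postings in inverted.items():
    hits = matching_docs.intersection(postings)
    if hits:
        agg_inverted[term] = len(hits)

print(agg_forward == agg_inverted)   # True - but the access patterns differ
print(agg_forward)                   # Counter({'性价比': 2, '国产': 1, '5G': 1, '大屏': 1})
```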
The two data structures also differ in how they compress data:
Doc values are a serialized, columnar storage structure whose values also include term-frequency data. This structure compresses very well (with FOR- and RBM-style algorithms). Because Lucene reads files at the bottom layer through mmap, data is in principle read from disk into the OS cache and decoded there; with the columnar forward-index layout the data can be compressed about as efficiently as a posting list, so far less has to be read from disk, and once it sits in the OS cache reads are very fast. This is the other reason doc values are well suited to aggregation.
An easy-to-understand example:
Twenty students enrol at a tutoring school; each student can sign up for several classes, and each class has a head teacher.
The forward index is the head teacher: for each class, which students it contains.
The inverted index is each student's own record of which classes they enrolled in.
Now we want to know which students are in the music class and the art class. With the head teachers we only have to ask twice. If we ask the students instead, we must ask every single one whether they signed up for music or art; until every student has been asked, we can never be sure whether someone we have not asked is in the music or art class.
In this example the head teacher corresponds to the forward index: each doc is a class, each doc contains several terms, and each term is a student.
The head teacher knows which students are in each class, i.e. which terms each doc contains. A student only knows which classes he belongs to, which corresponds to knowing which docs contain a given term.
Summary and analysis of Elasticsearch interview questions
https://wenyuanblog.com/blogs/elasticsearch-interview-questions.html
https://blog.csdn.net/yy339452689/article/details/105865771/