BE.JAVA-Elasticsearch Notes-Technical Secret

Technical selection of massive data system architecture

https://db-engines.com/en/ranking/search+engine

Full-text search engine (NLP, crawler, such as Baidu)
Vertical search engine (e-commerce, OA, site search, video site)

Requirements for a search engine:

Fast query speed (efficient compression algorithm, fast encoding and decoding)
Accurate results (BM25, TF-IDF)
Rich search results (recall rate)

Get started with Elasticsearch

Environment installation

Elasticsearch directory structure

bin: executable scripts, including those for starting the es service, managing plug-ins, and other utility commands
config: configuration file directory: es configuration, role configuration, JVM configuration, etc.
lib: the Java libraries that es depends on
data: the default data storage directory, holding all node, shard, index, and document data. In production this path must be changed.
logs: the default log file directory. In production this path must be changed.
modules: contains all es modules, such as Cluster, Discovery, Indices, etc.
plugins: directory of installed plugins
jdk/jdk.app: only present from version 7.0 onward, which ships with its own Java environment

Startup

Run bin/elasticsearch, then open localhost:9200

Multi-node mode

Multiple projects, a single node each

Single project, multiple nodes

elasticsearch -E path.data=data1 -E path.logs=log1 -E node.name=node1 -E cluster.name=msb_teach
elasticsearch -E path.data=data2 -E path.logs=log2 -E node.name=node2 -E cluster.name=msb_teach
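
As a minimal sketch of the single-project, multi-node setup, the helper below builds one startup command per node. Each node gets its own data path, log path, and node name, while the cluster name is shared. The function name `node_commands` and the `dataN`/`logN`/`nodeN` naming pattern are assumptions made for illustration; only the `-E` settings themselves come from the commands above.

```python
def node_commands(cluster_name, node_count, binary="bin/elasticsearch"):
    """Build one startup command per node (hypothetical helper).

    Every node needs its own path.data, path.logs and node.name;
    only cluster.name is shared so the nodes join the same cluster.
    """
    commands = []
    for i in range(1, node_count + 1):
        settings = {
            "path.data": f"data{i}",
            "path.logs": f"log{i}",
            "node.name": f"node{i}",
            "cluster.name": cluster_name,
        }
        args = " ".join(f"-E {key}={value}" for key, value in settings.items())
        commands.append(f"{binary} {args}")
    return commands


if __name__ == "__main__":
    # Print the commands for a two-node local cluster.
    for command in node_commands("msb_teach", 2):
        print(command)
```

Run each printed command in its own terminal window so the nodes stay in the foreground.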

Cluster health

Health status
- green: all primary and replica shards are active, and the cluster is healthy
- yellow: at least one replica is unavailable, but all primaries are active, so the data is still complete
- red: at least one primary is unavailable, so part of the data cannot be served and the cluster is considered unavailable
Health check
- _cat/health
- _cluster/health
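
The three states can be read as a simple rule over shard states. The sketch below is only an illustration of that rule, not how Elasticsearch computes it internally; the function name `cluster_status` and the flat list of `(is_primary, is_active)` tuples are assumptions made for the example.

```python
def cluster_status(shards):
    """Derive green / yellow / red from shard states.

    `shards` is a list of (is_primary, is_active) tuples; this flat
    representation is an assumption made for the example.
    """
    if any(is_primary and not active for is_primary, active in shards):
        return "red"      # at least one primary shard is not active
    if any(not is_primary and not active for is_primary, active in shards):
        return "yellow"   # all primaries active, but some replica is not
    return "green"        # every primary and every replica is active


if __name__ == "__main__":
    # One primary with an unassigned replica, as on a single-node cluster.
    print(cluster_status([(True, True), (False, False)]))
```

A single-node cluster with replicas configured typically reports yellow for exactly this reason: the replicas have no second node to be assigned to.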

kibana

Verify that the service started successfully: localhost:5601
Configure the es service address in kibana.yml: elasticsearch.hosts: ["http://localhost:9201"]
Stop Kibana from the command line:
- close the window, or
- find the process ID with ps -ef|grep 5601, ps -ef|grep kibana, or lsof -i:5601
- kill -9 pid
Causes and solutions for "Kibana server is not ready yet"
- Incompatible versions of Kibana and Elasticsearch (keep them on the same version)
- The Elasticsearch service address differs from the elasticsearch.hosts configured in Kibana (correct it in kibana.yml)
- Cross-domain access is disabled in Elasticsearch (allow it in elasticsearch.yml)
- A firewall is enabled on the server (turn it off or adjust the server security policy)
- The disk holding Elasticsearch has less than 10% free space, i.e. usage is above 90% (free up disk space; configure monitoring and alerting)

Looking past the surface: what an "index" really is

Index

Speeds up lookups
Uses a data structure as its carrier
Is persisted to disk as files

The structure of the database

Why B+Trees (MySQL) are not suitable for big data retrieval

MySQL, hundred-thousand level, no index: 0.295s
MySQL, million level, no index: 3.365s
MySQL, million level, full-text index: 1.033s
MySQL, ten-million level, full-text index: 10.038s
es, ten-million level: 0.8s

MySQL index structure: B-Trees visualization

Full interpretation of the inverted index

inverted index data structure

inverted index core algorithm

Compression algorithms for posting lists
- FOR: Frame Of Reference
- RBM: Roaring Bitmap
Retrieval principle of the term index
- FST: Finite State Transducers
  http://examples.mikemccandless.com/fst.py
  
  How FST is implemented in Lucene
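
To make the two posting-list compression schemes concrete, here is a minimal sketch. `for_encode` delta-encodes sorted doc IDs and records, per block, the number of bits the largest delta needs; `roaring_containers` splits IDs into a high 16-bit key and a low 16-bit value and chooses an array or a bitmap container by cardinality. The tiny block size, the one-byte width header, and all function names are assumptions for illustration; Lucene's real on-disk formats differ in detail.

```python
def for_encode(doc_ids, block_size=3):
    """Frame Of Reference: delta-encode sorted IDs, one bit width per block."""
    if not doc_ids:
        return []
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    blocks = []
    for start in range(0, len(deltas), block_size):
        block = deltas[start:start + block_size]
        width = max(1, max(block).bit_length())  # bits needed per value
        blocks.append((width, block))
    return blocks


def for_decode(blocks):
    """Rebuild the original doc IDs by prefix-summing the deltas."""
    doc_ids, current = [], 0
    for _width, block in blocks:
        for delta in block:
            current += delta
            doc_ids.append(current)
    return doc_ids


def encoded_bits(blocks):
    """Packed payload size in bits, plus one byte per block for the width."""
    return sum(8 + width * len(block) for width, block in blocks)


def roaring_containers(doc_ids, array_limit=4096):
    """Roaring Bitmap: group by high 16 bits, pick a container per group."""
    buckets = {}
    for doc_id in doc_ids:
        buckets.setdefault(doc_id >> 16, []).append(doc_id & 0xFFFF)
    return {
        key: ("array" if len(lows) <= array_limit else "bitmap", len(lows))
        for key, lows in buckets.items()
    }


if __name__ == "__main__":
    ids = [73, 300, 302, 332, 343, 372]
    blocks = for_encode(ids)
    print(blocks, encoded_bits(blocks), "bits vs", 32 * len(ids))
    print(roaring_containers([1, 2, 65536 + 5]))
```

For the doc IDs 73, 300, 302, 332, 343, 372 the deltas are 73, 227, 2, 30, 11, 29. With blocks of three, the first block needs 8 bits per value and the second 5, so the sketch stores 55 bits instead of 192 bits for six 32-bit integers.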

Forward index vs. inverted index:
First, be clear about the two data structures. Doc values map documents to terms; the inverted index maps terms to document IDs. Start with why, in principle, the inverted index is a poor fit for aggregation. You cannot determine the total number of docs from the inverted index, and because text is analyzed by default, an aggregation over it may return inaccurate results, so you have to create an additional not_analyzed field (a keyword sub-field in current versions), which increases disk usage. As a simple example, suppose there is a product table in which each product has several tags, and we run the following query:

{
  "query": {
    "match": {
      "tags": "性价比"
    }
  },
  "aggs": {
    "tag_terms": {
      "terms": {
        "field": "tags.keyword"
      }
    }
  }
}

This aggregation asks for all tags carried by the products whose tags match "性价比" ("cost-effective").

When executing the aggregation
If we use only the inverted index, the process looks like this: scan each term in the inverted index and check whether the docs in that term's posting list include a product tagged "cost-effective"; if so, record the term. Since there is no way to know in advance whether the next term qualifies, every term has to be checked one by one, which amounts to a full table scan.

With a forward index, which records the terms each doc contains (that is, a mapping from docid to all terms of the current field), finding the tags of all matching docs is a direct lookup: use the key (docid) to fetch the values (all terms), with no table scan.

So the forward index is efficient for aggregate queries because of the difference between the two data structures, not because it is combined with the inverted index; the inverted index only does the pre-filtering. That is why, in principle, the forward index is friendly to aggregate queries.
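
The difference can be sketched on a toy product table. Both functions below answer the same question, the tag counts of products matching a term, but the first fetches tags by doc ID from the forward index while the second has to walk every term's posting list. The sample products, the English tag values, and the function names are hypothetical, and plain dictionaries stand in for the real index structures.

```python
from collections import Counter, defaultdict

# Hypothetical product table: doc ID -> tags (sample data for illustration).
products = {
    1: ["cost-effective", "lightweight"],
    2: ["cost-effective", "waterproof"],
    3: ["waterproof"],
}

# Forward index (doc values): doc ID -> terms.
doc_values = products

# Inverted index: term -> sorted list of doc IDs.
inverted = defaultdict(list)
for doc_id, tags in sorted(products.items()):
    for tag in tags:
        inverted[tag].append(doc_id)


def aggregate_with_doc_values(query_term):
    """Pre-filter with the inverted index, then count tags via doc values."""
    counts = Counter()
    for doc_id in inverted.get(query_term, []):   # query phase
        counts.update(doc_values[doc_id])          # one lookup per hit
    return counts


def aggregate_with_inverted_only(query_term):
    """Same result, but every term's posting list has to be scanned."""
    hits = set(inverted.get(query_term, []))
    counts = Counter()
    for term, posting in inverted.items():         # full scan of all terms
        matched = sum(1 for doc_id in posting if doc_id in hits)
        if matched:
            counts[term] = matched
    return counts


if __name__ == "__main__":
    print(aggregate_with_doc_values("cost-effective"))
    print(aggregate_with_inverted_only("cost-effective"))
```

The work in the first function grows with the number of matching docs; in the second it grows with the number of distinct terms in the field, whether or not they are related to the query.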

The two data structures also differ in data compression:
Doc values are a serialized columnar storage structure, and this layout compresses very well (FOR and RBM compression algorithms). At the bottom layer Lucene reads files through mmap, so in principle data is read from disk into the OS cache and decoded there. Because columnar data can be compressed as efficiently as a posting list, the files are small, which greatly speeds up reads from disk; once the data sits in the OS cache, reads are very fast. This is a further reason doc values are well suited to aggregation.

An easy-to-understand example
Twenty students enroll in tutoring classes. Each student can sign up for several classes, and each class has a head teacher.

The forward index is the head teacher: which students each class contains.

The inverted index is the students: which classes each student enrolled in.

Now we want to know which students are in the music class and which are in the art class. Asking the head teacher twice, once per class, is enough. If we ask the students instead, we have to ask every single one whether they signed up for music or art, because until every student has been asked we cannot know whether one we skipped is in either class.

In this example the head teacher corresponds to the forward index, each doc is a class, each doc contains several terms, and each term is a student.
The head teacher knows which students are in each class, that is, which terms each doc contains. A student only knows which classes they belong to, that is, which docs contain this term.

Summary and analysis of Elasticsearch interview questions

https://wenyuanblog.com/blogs/elasticsearch-interview-questions.html
https://blog.csdn.net/yy339452689/article/details/105865771/

seasonley
