Technology selection for massive-data system architectures

https://db-engines.com/en/ranking/search+engine

Full-text search engines (NLP, crawlers; for example Baidu)
Vertical search engines (e-commerce, OA, site search, video sites)

What a search engine must deliver:

  • Fast query speed (efficient compression algorithm, fast encoding and decoding)
  • Accurate results (BM25, TF-IDF)
  • Rich search results (recall rate)
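
To make the "accurate results" requirement concrete, here is a minimal sketch of BM25 scoring over a tiny in-memory corpus. The corpus, the whitespace tokenizer, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration, and the IDF uses the variant with 1 added inside the logarithm so it never goes negative. It is not meant to reproduce how Elasticsearch scores documents internally.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical documents, for illustration only.
docs = [
    "cheap phone with great battery",
    "great phone great camera",
    "laptop with great screen",
]
scores = bm25_scores("phone camera", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document that matches both query terms ranks first, and a document that matches neither scores zero. A rare term such as "camera" contributes more than a common one, which is the IDF part shared with TF-IDF.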

Get started with Elasticsearch

Environment installation

Elasticsearch directory structure

  • bin: executable scripts, including the scripts that start the ES service, manage plugins, and run utility commands
  • config: configuration directory (ES configuration, role configuration, JVM configuration, etc.)
  • lib: the Java libraries that ES depends on
  • data: the default data directory, holding all node, shard, index, and document data. It must be changed in production environments.
  • logs: the default log directory. It must be changed in production environments.
  • modules: contains all ES modules, such as Cluster, Discovery, Indices, etc.
  • plugins: directory of installed plugins
  • jdk/jdk.app: present only from version 7.0 onward; the bundled Java environment

Startup

  • Run bin/elasticsearch, then open localhost:9200
  • Multi-node mode

    • Multiple project copies, one node each
    • One project, multiple nodes

      elasticsearch -E path.data=data1 -E path.logs=log1 -E node.name=node1 -E cluster.name=msb_teach
      elasticsearch -E path.data=data2 -E path.logs=log2 -E node.name=node2 -E cluster.name=msb_teach

Cluster health

  • Health status

    • green: all primary and replica shards are active; the cluster is healthy
    • yellow: at least one replica shard is unavailable, but all primary shards are active, so the data is still complete
    • red: at least one primary shard is unavailable, so the cluster is not fully available
  • Health check

    • _cat/health
    • _cluster/health
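
The three health rules above can be sketched as a small Python function. The `(is_primary, is_active)` tuple representation of a shard is hypothetical and chosen only for illustration; on a real cluster the `status` field in the `_cluster/health` response already reports the color.

```python
def cluster_status(shards):
    """Derive green/yellow/red from (is_primary, is_active) shard tuples."""
    # Any inactive primary shard means some data cannot be served.
    if any(primary and not active for primary, active in shards):
        return "red"
    # All primaries are active, but at least one replica is missing.
    if any(not primary and not active for primary, active in shards):
        return "yellow"
    return "green"

# Hypothetical shard layouts, for illustration only.
healthy = [(True, True), (False, True)]
no_replica = [(True, True), (False, False)]
lost_primary = [(True, False), (False, True)]

print(cluster_status(healthy), cluster_status(no_replica), cluster_status(lost_primary))
```

A single-node cluster with replicas configured typically shows yellow for exactly this reason: the replica shards have no second node to be allocated to.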

Kibana

  • Verify that the service started successfully: open localhost:5601
  • Configure the ES service address in kibana.yml: elasticsearch.hosts: ["http://localhost:9201"]
  • Stopping Kibana from the command line:

    • Close the terminal window
    • Find the process: ps -ef | grep 5601, or ps -ef | grep kibana, or lsof -i:5601
    • kill -9 <pid>
  • Causes of and fixes for "Kibana server is not ready yet"

    • Kibana and Elasticsearch versions are incompatible (keep both on the same version)
    • The Elasticsearch service address differs from the elasticsearch.hosts value configured for Kibana (set in kibana.yml)
    • Elasticsearch forbids cross-origin access (allow it in elasticsearch.yml)
    • A firewall is enabled on the server (turn it off or adjust the server's security policy)
    • Disk usage on the disk that holds Elasticsearch exceeds 90% (free up disk space and set up monitoring and alerts)

Looking past the surface: what an "index" really is

Index

  • Speeds up lookups
  • Is built on a data structure
  • Is persisted to disk as files

The structure of the database

Why B+ trees (MySQL) are not suitable for big-data retrieval

  • MySQL, hundreds of thousands of rows, no index: 0.295 s
  • MySQL, millions of rows, no index: 3.365 s
  • MySQL, millions of rows, full-text index: 1.033 s
  • MySQL, tens of millions of rows, full-text index: 10.038 s
  • ES, tens of millions of documents: 0.8 s

Visualization of the MySQL B-tree index structure

The inverted index explained in full

Inverted index data structure
Core algorithms of the inverted index

  • Compression algorithms for posting lists

    • FOR: Frame of Reference
    • RBM: Roaring Bitmap
  • How the term index is searched

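
The two posting-list compression schemes named above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Lucene's implementation: the block size of 3, the sample doc IDs, and the function names are all chosen for readability (real implementations use much larger blocks). FOR stores the gaps between sorted doc IDs and packs each block with just enough bits for its largest gap. A Roaring Bitmap splits each ID into a high 16-bit key and a low 16-bit value, and stores sparse buckets as sorted arrays and dense ones as bitmaps.

```python
def for_encode(doc_ids, block_size=3):
    """Delta-encode sorted doc IDs, then bit-pack each block at its own width."""
    if not doc_ids:
        return []
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    blocks = []
    for i in range(0, len(deltas), block_size):
        block = deltas[i:i + block_size]
        width = max(max(block).bit_length(), 1)  # bits needed for the largest delta
        packed = 0
        for value in block:
            packed = (packed << width) | value   # pack the values side by side
        blocks.append((width, len(block), packed))
    return blocks

def for_decode(blocks):
    """Unpack each block and rebuild the doc IDs with a running sum."""
    doc_ids, current = [], 0
    for width, count, packed in blocks:
        mask = (1 << width) - 1
        for j in range(count):
            current += (packed >> (width * (count - 1 - j))) & mask
            doc_ids.append(current)
    return doc_ids

def roaring_containers(doc_ids, array_limit=4096):
    """Group IDs by their high 16 bits; keep sparse buckets as arrays, dense ones as bitmaps."""
    buckets = {}
    for doc_id in doc_ids:
        buckets.setdefault(doc_id >> 16, []).append(doc_id & 0xFFFF)
    containers = {}
    for key, lows in buckets.items():
        if len(lows) <= array_limit:
            containers[key] = ("array", sorted(lows))
        else:
            bitmap = 0
            for low in lows:
                bitmap |= 1 << low               # one bit per possible low value
            containers[key] = ("bitmap", bitmap)
    return containers

posting = [73, 300, 302, 332, 343, 372]
blocks = for_encode(posting)
packed_bits = sum(width * count for width, count, _ in blocks)
print(blocks, packed_bits, for_decode(blocks) == posting)

sparse = roaring_containers([1000, 62101, 131385, 132052, 191173, 196658])
dense = roaring_containers(range(5000))
print(sparse, dense[0][0])
```

With these sample values the six IDs need 39 bits of payload instead of 6 × 32 bits for raw integers. The 4096 threshold is where an array of 16-bit values and a 65536-bit bitmap take the same space (8 KB), so each bucket uses whichever form is smaller.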

Forward index vs. inverted index:
First, be clear about the two data structures: doc values map each document to its terms, while the inverted index maps each term to document IDs. Let us start with why the inverted index is a poor fit for aggregation. The inverted index alone cannot tell you the total number of docs, and because text fields are analyzed by default, aggregating on them can give inaccurate results, so you have to create an additional not_analyzed field, which increases disk usage. As the simplest example, suppose there is a product table in which each product has several tags, and we run the following query:

{
  "query": {
    "match": {
      "tags": "性价比"
    }
  },
  "aggs": {
    "tag_terms": {
      "terms": {
        "field": "tags.keyword"
      }
    }
  }
}

This aggregation query returns all the tags carried by products that have the tag "性价比" ("cost-effective").

When executing the aggregation:
If we use the inverted index, the process goes like this: scan each term in the inverted index and check whether the docs in that term's posting list carry the "cost-effective" tag; if they do, record the term. Because we cannot know in advance whether the next term qualifies, we have to check every term one by one, which amounts to a full scan.

If we use a forward index instead, which records the terms each doc contains (a mapping from the current doc ID to all terms in the current field), then since what we want is the terms of every doc that matches the condition, we can use the key (doc ID) to fetch the values (all terms) directly, with no scan.

Therefore, the forward index is efficient for aggregation queries because of the difference between the two data structures, not because it is combined with the inverted index; the inverted index is used only for pre-filtering. That is why, in principle, the forward index is friendly to aggregation queries.
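
The difference can be sketched with two plain Python dictionaries. The product data, the English tag names, and the function names are hypothetical, and real doc values are a compressed on-disk columnar structure rather than a dict. The sketch only shows the access pattern: with a forward index each matching doc is looked up directly by its ID, whereas with the inverted index alone every term has to be visited.

```python
from collections import Counter

# Hypothetical product table: doc ID -> tags. This plays the role of the forward index.
forward = {
    1: ["cost-effective", "waterproof"],
    2: ["cost-effective", "fast-charging"],
    3: ["gaming"],
    4: ["cost-effective", "waterproof", "gaming"],
}

# Inverted index built from the same data: term -> posting list of doc IDs.
inverted = {}
for doc_id, tags in forward.items():
    for tag in tags:
        inverted.setdefault(tag, []).append(doc_id)

def aggregate_with_forward(query_tag):
    """Pre-filter via the inverted index, then read each hit's terms by doc ID."""
    counts = Counter()
    for doc_id in inverted.get(query_tag, []):
        counts.update(forward[doc_id])       # direct key lookup, no scan
    return counts

def aggregate_with_inverted(query_tag):
    """Same result from the inverted index alone: every term must be checked."""
    hits = set(inverted.get(query_tag, []))
    counts = Counter()
    for term, posting_list in inverted.items():   # visits the whole term dictionary
        matched = sum(1 for doc_id in posting_list if doc_id in hits)
        if matched:
            counts[term] = matched
    return counts

by_forward = aggregate_with_forward("cost-effective")
by_inverted = aggregate_with_inverted("cost-effective")
print(by_forward == by_inverted, dict(by_forward))
```

Both functions return the same tag counts, but the work done by the second grows with the number of distinct terms in the index, while the first grows only with the number of matching docs.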

The two data structures also differ in data compression:
Doc values are a serialized columnar storage structure, whose values also contain term-frequency data. This structure compresses very well (FOR and RBM compression algorithms). At the bottom layer Lucene reads files through local mmap, so in principle data is read from disk into the OS cache and decoded there. Because the columnar data of the forward index can be compressed as efficiently as a posting list, the files are small, which greatly speeds up reading from disk, and once the data sits in the OS cache reads are very fast. This is another reason doc values suit aggregation.

An easy-to-understand example:
Twenty students enroll in tutoring classes. Each student can sign up for several classes, and each class has a head teacher.

The forward index is the head teacher, who knows which students are in each class.

The inverted index is the student, who knows which classes he or she enrolled in.

Now we want to know which students are in the music class and the art class. We only need to ask twice, once per head teacher. If we ask the students instead, we have to ask every single one whether they signed up for the music or art class, because as long as one student has not been asked, we cannot know whether that student is in either class.

In this example, the head teacher corresponds to the forward index, each doc is a class, each doc contains several terms, and each term is a student.
The head teacher knows which students are in each class, that is, which terms each doc contains. A student only knows which classes he or she belongs to, that is, which docs (classes) contain this term.

Summary and analysis of Elasticsearch interview questions

https://wenyuanblog.com/blogs/elasticsearch-interview-questions.html
https://blog.csdn.net/yy339452689/article/details/105865771/


seasonley