Elasticsearch database optimization combat: let your ES fly

Abstract: ES has become an all-round data product, and it is becoming more and more popular in many fields. This article aims to analyze the use of ES from the database field.

This article is shared from Huawei Cloud Community " Elasticsearch Database Acceleration Practice ", the original author: css_blog.

1. Program description

What are the main functions of Elasticsearch? Different scenarios have different positioning. In log scenarios, we can use ELK ecology to build a log analysis system. In the search field, ES is currently the most popular search engine. In the field of big data, ES can provide a data warehouse with massive logs against Hbase; in the field of database, ES can be used as an analytical database for query and analysis. ES has become an all-round data product, and it is becoming more and more popular in many fields. This article aims to analyze the use of ES from the database field.

ES is not a relational database. The data update uses optimistic locking and is controlled by version numbers. It does not support transaction processing. This is also the difference between ES and traditional databases (Mysql); however, ES supports accurate query acceleration, multiple conditions and arbitrary combinations of queries, and multiple Aggregate query, the query speed is very fast, it can replace the scenario requirements of database query with complex conditions, and even can replace the database as a secondary index.

In the database acceleration scenario, the usual practice is that the product order data generated by the customer will be written to the Mysql-type relational database, and the database writing guarantees transactional performance. However, as the data of the product order becomes more and more, at the same time, the customer query conditions are changing. It is impossible to build indexes for all fields, and the query capability of the database is far from satisfying the query requirements. We consider using ES to synchronize database data in full, and do multi-condition aggregation queries in ES. The results of the query can be searched in Mysql. When querying product order details, the Mysql data and ES data may not be required to be consistent in real time, and can be passed through canal. Consumption of Mysql binlog log information, synchronize to ES, achieve one-time write, and ensure data consistency. The following databases are explained using Mysql as an example.

Second, the index principle analysis

Why ES's query capabilities far exceed Mysql relational databases is mainly determined by the difference in their implementation principles and the data structure of the underlying storage. The following compares the implementation principles of the two products.

Elasticsearch will process all the input text, build an index and put it in memory, thereby improving search efficiency. At this point, ES is better than MySQL's B+ tree structure. MySQL needs to put the index on the disk. Each time it is read, the index needs to be read from the disk and then the corresponding data node is found, but ES can be found directly in the memory The approximate location of the target document to maximize efficiency. And MySQL’s disadvantages are more obvious when performing combined queries. It does not support complex combined queries such as aggregation operations. Even if you want to combine queries, you must build indexes in advance, but ES can complete this complex operation. By default, each field is All are indexed, and can be combined with each other when querying.

(1) Database index B+ tree

The indexes in the database are all organized by trees. Commonly used are B tree, B-tree, and B+tree. The following describes the organization structure of B+tree.

First of all, let's imagine why we need to build an index. Suppose we have a table book that stores the book information, name, author, publication time, etc. we keep. We have 10,000 records. If we need to find a "database" Book, then our SQL is:

select name，author form book where name = ‘database’;

We need to scan the entire table for full comparison. If we build an index on the name, the title of the book has been sorted in order, and we only need to find the corresponding position in the query to get the result quickly.

The essence of indexing is to filter out the final desired results by continuously narrowing down the range of data you want to obtain, while turning random events into sequential events. That is to say, with this indexing mechanism, we can always use The same search method to lock data.

The database uses B+tree to build indexes:

B+tree data is only stored in leaf nodes. In this way, on the basis of the B-tree, the number of keywords stored in each node is more, and the tree has fewer levels, so the query data is faster. All the key pointers have leaf nodes, so the number of searches each time is the same, so the query speed more stable.

(2) Elasticsearch index principle

ES builds an index using inverted index storage.

Create an index for all the input data, and associate everything with the document. When we look up the data, we directly look up the dictionary (Term), find the document ID corresponding to the Term, and then find the data. This is similar to the way Mysql uses B+tree to build an index, but if the term of the dictionary is large, the search for the term will be slow. ES further suggests a dictionary index (FST) to improve the search ability of the dictionary.

The Term Index is stored in the memory in the form of a tree, and the FST+compression common prefix method is used to greatly save memory. The block where the Term Dictionary is located is queried through the Term Index and then the term is found on the disk to reduce the number of IOs.

After Term Dictionary is sorted, the time complexity of retrieval is reduced from the original N to logN through the dichotomy.

Three, query comparative analysis

The following common scenarios for database search compare ES and database:

full text search

ES supports full-text search and can segment data. Each word is indexed through FSP, but Mysql relational database does not support it. Imagine if the search is not the entire field but several keywords in the field, the Mysql search must be all Table scan.

precise search

If Mysql has indexed the field, the performance difference between ES search and Mysql search is not big, and Mysql may be faster, but ES is a distributed system that can support PB-level data search, and the advantage of searching for large tables is more obvious.

Multi-condition query

We know that Mysql needs to index fields to speed up the search process, and ES is fully indexed by default. For multi-condition queries, Mysql is triggered to build a joint index. Otherwise, if multiple fields are searched, Mysql first selects a field to search, and the result is used second The final result is obtained by filtering through a field.

ES uses the intersection of multiple field result sets, and uses bitmap or skiplist to speed up the search speed, which has obvious advantages over Mysql.

aggregate search

Mysql aggregate search requires full table scan sorting if index is not established, and range query is performed on B+tree if index is established.

In order to speed up the aggregation search, ES uses Doc value to solve the aggregation search problem. DocValue is columnar storage.

The storage results are as follows:

Docvalue data is sorted by document ID, DocValue turns random reading into sequential reading,
In es, because of the existence of fragmentation, data is split into multiple copies and placed on different machines. But for the user experience, it seems that there is only one library. For aggregate queries, the processing is completed in two stages:

Shard's local Lucene Index computes the local aggregation results in parallel.
Receive the partial aggregation results of all Shards, and aggregate the final aggregation results.

This two-stage aggregation architecture allows each shard not to return the original data, but only to return the aggregation result with a much smaller amount of data. This greatly reduces the consumption of network bandwidth.

multi-copy acceleration

We know that ES has the concept of shard and replica. On the one hand, replicas can ensure the reliability of data, and on the other hand, multiple replicas can speed up search speed and improve search concurrency.

Fourth, the database to Elasticsearch synchronization program

Combining the actual usage mode of the user and the size of the data volume, there are many different ways to choose from Mysql data to ES.

Canal=>Elasticsearch

Use Canal to directly consume Mysql binlog logs and write them to ES. In this way, if Mysql writes a lot, you will face Canal write blocking problems.

Canal =>Kafka=>Elasticsearch

Canal data is written to Kafka, and another app is used to consume Kafka data and synchronized to ES

V. Question summary

1. Index shard problem

When Mysql data is synchronized to ES, we face the problem of index establishment. Before data is written to ES, we need to plan the number of data shards and replicas in advance. Replicas can be dynamically modified, but the number of shards cannot be modified after creation.

As the amount of Mysql data increases, if there are too few shards, it will cause the problem of too much data per shard.

If an index is 600G, there are only 3 shards, and each shard is 200G, which will greatly deplete the query capability and is not conducive to data migration.

We can create indexes on a monthly basis and associate all indexes with index aliases.

test_data-202101
test_data-202102

2. Query acceleration issues

In the scene of using ES to accelerate the database, we hope that the ES query capability is as fast as possible. When the ES query does not meet the requirements, we need to tune the query.

Commonly used methods are:

Click to follow, and learn about Huawei Cloud's fresh technology for the first time~

Elasticsearch database optimization combat: let your ES fly

1. Program description

Second, the index principle analysis

(1) Database index B+ tree

(2) Elasticsearch index principle

Three, query comparative analysis

Fourth, the database to Elasticsearch synchronization program

V. Question summary

1. Index shard problem

2. Query acceleration issues

华为云开发者联盟

引用和评论

华为云开发者联盟入选 2023 中国技术品牌影响力企业榜，深耕开发者生态

Dolphinscheduler IDEA本地调试

【Hadoop】HDFS架构解析

MySQL慢查询日志：性能优化的终极指南

做到真正0丢失、0重复：Apache SeaTunnel 实现万亿级数据一致性全解密

【Hadoop】HBase系统解析及适用场景

Devin 发布 DeepWiki，2 星的项目直接装出万星的气场