How to quickly create an index in GaussDB (for MySQL)? Senior Architect of Huawei Cloud Database Reveals the Secret for You

Abstract: In a cloud service environment, creating indexes on large volumes of customer data is slow, and solving this performance problem has become a challenge for cloud service vendors. Huawei Cloud GaussDB (for MySQL) introduces parallel index creation to remove performance bottlenecks such as batch index creation and ad hoc index addition, helping users build indexes faster. Read on to learn how it works.

This article is shared from the Huawei Cloud community: "GaussDB (for MySQL): How to quickly create an index? Huawei Cloud Database Senior Architect". Author: Su Bin, Senior Architect of Huawei Cloud Database.

Su Bin, a senior architect of Huawei Cloud Database, has 16 years of experience in database kernel research and development. Previously, as a core developer on the official MySQL InnoDB team, he took part in and led the development and release of many important features. He is currently responsible for, and takes part in, the design and development of the core functions of RDS for MySQL and GaussDB (for MySQL), the main products of Huawei Cloud RDS.

About MySQL Index

Databases use indexes to speed up queries. MySQL supports several index structures to improve query performance (see the MySQL documentation: https://dev.mysql.com/doc/refman/8.0/en/create-index.html). The B+tree index is the most widely used, because it strikes a good balance between query and modification performance, and its storage and maintenance costs are relatively low.

A MySQL table is itself stored as a clustered index (which must be a B+tree index), plus any number of secondary indexes, which may also be B+tree indexes. In other words, a MySQL table is made up of a group of indexes. Indexing is a double-edged sword: enough indexes improve the performance of the queries they serve, but every index must be kept in sync with the data, so more indexes mean higher overhead for data modification. The trade-off is usually dynamic. Users may not know which indexes they need when the table is first defined, and indexes have to be adjusted as the business evolves, which brings its own problems of dynamic index creation.

MySQL index creation logic

Let's first look at how MySQL creates an index. Index creation can use one of two DDL (Data Definition Language) algorithms. The first is the COPY algorithm, which is very inefficient: it changes the table structure by copying data between two tables and, in particular, it requires a table lock, so it is rarely used now. The second is the INPLACE algorithm, which does not require that table lock, so many DDL operations, such as creating an index, do not block DML (Data Manipulation Language) operations. The algorithm is implemented at the storage engine level, which leaves more room for optimization. DDL statements also have an INSTANT algorithm, but it does not support index creation, so it is not covered here.

With the INPLACE algorithm, versions before 5.7 inserted index records one by one into the newly created empty index. Because the inserted records arrive out of order, this approach causes obvious performance problems and can waste space. Starting with version 5.7, MySQL sorts the index records first and then builds the index bottom-up, inserting in batches and packing pages compactly. If several indexes are to be created, the same algorithm is run separately for each index. The new algorithm has three main steps: reading data, sorting data, and building the index.
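The three steps (read, sort, build) can be sketched in a few lines of Python. This is only an illustration of the bottom-up idea, not InnoDB's code: the page capacity, the fill factor, and the names build_level and bulk_build are assumptions made for the example.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # (secondary key, primary key)


def build_level(entries: List, capacity: int, fill: float) -> List[List]:
    """Pack already-sorted entries into pages, left to right."""
    # At least 2 entries per page, so every level is smaller than the one below.
    per_page = max(2, int(capacity * fill))
    return [entries[i:i + per_page] for i in range(0, len(entries), per_page)]


def bulk_build(rows: List[Tuple[int, int]], capacity: int = 4, fill: float = 1.0):
    """rows are (primary key, indexed column) pairs read from the clustered index."""
    # Step 1: read. Project each row to an index record (key, pk).
    records: List[Record] = [(key, pk) for pk, key in rows]
    # Step 2: sort by secondary key, then by primary key.
    records.sort()
    # Step 3: build bottom-up, leaf level first.
    levels = [build_level(records, capacity, fill)]
    while len(levels[-1]) > 1:
        # One separator per child page: its smallest record.
        separators = [page[0] for page in levels[-1]]
        levels.append(build_level(separators, capacity, fill))
    return levels  # levels[0] holds the leaves, levels[-1] the root level
```

Because the records are sorted before any page is written, each page is filled once and never split, which is where the performance and space advantage over record-by-record insertion comes from.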

Generally speaking, DDL operations such as index creation take much longer than ordinary DML operations. Users who add an index dynamically to speed up queries have to wait a long time, which greatly affects the business. In addition, if the user's MySQL instance has binlog replication enabled, a time-consuming DDL operation is likely to leave the replica lagging behind the primary for a long time.

Figure: MySQL index creation flow chart

The problem of index creation in cloud scenarios

As more and more users host their data on cloud services and data volumes keep growing, the problems caused by dynamically adding indexes greatly affect the user experience. Customers' single tables have gradually reached several terabytes or even dozens of terabytes, and complaints about slow index creation are increasing. In particular, if index creation takes too long, it may be hard to find an off-peak window long enough to create the index without disturbing the business. How to solve the performance problem of creating indexes on large amounts of data has therefore become a challenge for cloud service vendors.

There is another scenario in cloud migration that matters a great deal to the customer experience. Customers who move their business to the cloud need to migrate large amounts of data (Huawei Cloud provides the Data Replication Service, DRS, to support various data migration scenarios). A relatively efficient migration procedure is:

1. Logically export the source data
2. Create the table on the target side (note that the table does not contain secondary indexes)
3. Insert the data exported from the source into the target
4. Create the secondary indexes on the target table

If dynamic data synchronization is involved, the steps are more complicated; that is outside the scope of this article and is not expanded on here. The important point in the steps above is that, in steps 2 and 4, the secondary indexes are not created together with the table on the target side. This optimization has a large effect on performance, especially when a table has many secondary indexes. When insertions into a B-tree index arrive in order, both insertion performance and the resulting space utilization are at their best: splits happen only at the tail of the insertion range, and thanks to optimizations in the split algorithm, the pages produced by a split have a relatively high fill rate. With random insertion, and especially concurrent random insertion, splits occur at many different nodes and leave pages about half full, so the final B-tree ends up bloated.
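The effect of insertion order on page fill can be reproduced with a toy leaf-page simulation. The sketch below is a simplified model, not InnoDB's split algorithm: it assumes a page capacity of 8 keys, splits an overflowing page in the middle, and applies one right-edge optimization (an append at the tail opens a fresh page instead of splitting). The function name average_fill is hypothetical.

```python
import bisect
import random


def average_fill(keys, capacity=8):
    """Insert keys one by one into leaf pages and return the mean page fill ratio."""
    pages = [[]]  # leaf pages, kept in key order
    for key in keys:
        # Locate the page whose key range covers the new key.
        firsts = [p[0] for p in pages[1:]]
        idx = bisect.bisect_right(firsts, key)
        page = pages[idx]
        bisect.insort(page, key)
        if len(page) > capacity:
            if idx == len(pages) - 1 and page[-1] == key:
                # Append at the right edge: move only the new key to a fresh page.
                pages.append([page.pop()])
            else:
                # Ordinary split: both halves end up roughly half full.
                mid = len(page) // 2
                pages.insert(idx + 1, page[mid:])
                del page[mid:]
    return sum(len(p) for p in pages) / (capacity * len(pages))


rng = random.Random(42)
ordered = list(range(2000))
shuffled = ordered[:]
rng.shuffle(shuffled)

ordered_fill = average_fill(ordered)   # every page is filled before the next one opens
random_fill = average_fill(shuffled)   # splits leave many pages partly empty
print(f"ordered: {ordered_fill:.2f}, random: {random_fill:.2f}")
```

In this model the ordered load fills every page completely, while the shuffled load leaves a noticeably lower average fill, so the same data occupies more pages.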

With this background, the reason for the steps above is easy to understand. While the table data is being inserted, the secondary indexes are left out; once all the data is in place, the secondary indexes are created in batch mode, which is the most efficient way to create them. Otherwise, every inserted record also has to be inserted into each secondary index, those insertions arrive in random order, and concurrent performance is much worse.

Creating secondary indexes in batch after the data has been loaded is an effective solution, but with a large amount of data it is still very time-consuming, so customers wait a long time after the data migration completes. For the business, this wait may be hours or even days. Table-level concurrent index creation is an option, but it has obvious shortcomings: it only applies when there are multiple tables, and concurrency across tables is not the most effective form of concurrency because the tables interfere with one another considerably.
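As a rough sketch of how a migration tool might separate the table definition from its secondary indexes, the helper below generates the two DDL statements. The function name deferred_index_ddl and the sysbench-style table layout are assumptions for illustration; this is not how DRS is implemented.

```python
def deferred_index_ddl(table, columns, primary_key, secondary_indexes):
    """Return (CREATE TABLE without secondary indexes, ALTER TABLE that adds them later)."""
    column_defs = [f"  `{name}` {sql_type}" for name, sql_type in columns]
    column_defs.append(f"  PRIMARY KEY (`{primary_key}`)")
    create = f"CREATE TABLE `{table}` (\n" + ",\n".join(column_defs) + "\n) ENGINE=InnoDB"
    # All secondary indexes go into a single ALTER TABLE, run after the data load.
    adds = [
        f"ADD INDEX `{name}` (" + ", ".join(f"`{c}`" for c in cols) + ")"
        for name, cols in secondary_indexes.items()
    ]
    alter = (f"ALTER TABLE `{table}` " + ", ".join(adds)) if adds else None
    return create, alter


create_stmt, alter_stmt = deferred_index_ddl(
    "sbtest1",
    [("id", "INT NOT NULL"), ("k", "INT NOT NULL"), ("c", "CHAR(120)"), ("pad", "CHAR(60)")],
    "id",
    {"k_1": ["k"]},
)
print(create_stmt)  # run before loading the data
print(alter_stmt)   # run after the load completes
```

Running the CREATE TABLE statement first and the ALTER TABLE statement only after the load lets every secondary index be built from sorted records instead of by random insertion.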

How to quickly create an index in GaussDB (for MySQL)?

In summary, index creation has two performance bottlenecks: batch index creation after users migrate data, and ad hoc addition of a secondary index. In both cases, the index needs to be built faster to improve the user experience.

Huawei Cloud GaussDB (for MySQL) introduces parallel index creation, which addresses the fact that community MySQL uses only a single thread to create an index, and so relieves both pain points. The single-threaded logic has two problems. First, it underuses resources. Second, index creation alternates between CPU work and IO work, and while one kind of operation is in progress the other simply waits, even though the two do not compete for the same resource. Multi-threaded index creation can make full use of CPU and IO resources: while some threads are doing CPU computation, other threads can perform IO concurrently.

GaussDB (for MySQL) uses full-link parallelism to create indexes. As mentioned earlier, creating an index involves several stages, and the parallel algorithm parallelizes every one of them. From reading data, to sorting, to building the index, each step is processed concurrently by the specified N threads. Its logic is shown in the figure below:

GaussDB (for MySQL) applies a variety of optimizations to the merge sort in particular, so that a conventional merge sort can be fully parallelized and make full use of CPU, memory, and IO resources. The merge step that follows the parallel index build also uses a simplified set of algorithms that correctly handle the various index structure scenarios.
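The general shape of a parallel read and sort followed by a merge can be sketched as below. This is a simplified illustration and not the GaussDB (for MySQL) implementation: the function name parallel_sorted_records is hypothetical, the final merge here is single-threaded, and Python threads show the structure of the pipeline rather than real CPU parallelism.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sorted_records(rows, n_threads=4):
    """rows: (primary key, indexed value) pairs. Returns index records sorted by (value, pk)."""
    # Phase 1 (read): split the scan into up to n_threads contiguous ranges.
    chunk = max(1, -(-len(rows) // n_threads))  # ceiling division
    ranges = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]

    def scan_and_sort(part):
        # Phase 2 (sort): each worker projects and sorts its own range.
        return sorted((value, pk) for pk, value in part)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        runs = list(pool.map(scan_and_sort, ranges))

    # Phase 3 (merge): combine the sorted runs into one ordered stream
    # that a bottom-up index builder can consume.
    return list(heapq.merge(*runs))
```

Each worker produces an independent sorted run, so the workers never need to coordinate until the merge, which is what makes the read and sort stages easy to spread over N threads.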

Supported indexes and scenarios

Parallel index creation in GaussDB (for MySQL) currently supports B-tree secondary indexes. Full support for secondary indexes on virtual columns will be provided in the near future. MySQL's spatial indexes and full-text indexes are not covered by parallel index creation.

Note that creating a primary key index cannot currently be parallelized. If a SQL statement that would otherwise create indexes in parallel also creates a primary key index, or one of the spatial or full-text indexes mentioned above, the client receives a warning that the operation does not support parallel index creation, and the statement is executed with single-threaded index creation.

From the perspective of the SQL statement, as mentioned above, different algorithms can be used to create an index. Because the COPY algorithm (ALGORITHM=COPY) does not use batch insertion, it does not benefit from parallel index creation. With the INPLACE algorithm, an index created without rebuilding the table benefits from this optimization; if the table must be rebuilt, the primary key index has to be re-created, so parallel index creation cannot be used.
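The eligibility rules in this section can be summarized as a small decision function. This is a reading of the rules as stated in the text, written as a hypothetical helper (uses_parallel_build and its parameters are not a product API); secondary indexes on virtual columns are left out because their support status is still changing.

```python
def uses_parallel_build(algorithm, index_kinds, rebuilds_table):
    """Return True if the statement can use parallel index creation under the stated rules.

    algorithm      -- "INPLACE" or "COPY"
    index_kinds    -- kinds of index the statement creates, e.g. ["BTREE"] or ["PRIMARY"]
    rebuilds_table -- True if the INPLACE operation has to rebuild the table
    """
    if algorithm.upper() != "INPLACE":
        return False  # COPY does not use batch insertion
    if rebuilds_table:
        return False  # a rebuild re-creates the primary key index
    unsupported = {"PRIMARY", "SPATIAL", "FULLTEXT"}
    # One unsupported index makes the whole statement run single-threaded.
    return not ({kind.upper() for kind in index_kinds} & unsupported)


print(uses_parallel_build("INPLACE", ["BTREE"], False))             # True
print(uses_parallel_build("INPLACE", ["BTREE", "FULLTEXT"], False)) # False
print(uses_parallel_build("COPY", ["BTREE"], False))                # False
```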

Example

Let's look at a few examples of how parallel index creation speeds up index builds and how the constraints above take effect.

1. We use a sysbench table containing 100 million rows.

2. Build an index on the k column of the table with the community default of a single thread. It takes 146.82s.

3. Set innodb_rds_parallel_index_creation_threads = 4 to build the index with 4 threads. The build takes 38.72s, about 3.79 times faster.

4. Suppose we modify the primary key index. Even though multiple threads are specified, we receive a warning, and the index is actually built by a single thread.
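The timings in steps 2 and 3 can be checked with a little arithmetic. The snippet below also lists statements of the kind such a test might send; the table name sbtest1, the index name idx_k, and setting the parameter with a plain SET are assumptions, since the article does not show the exact statements or the parameter's scope.

```python
# Statements of the kind used in step 3 (names and SET scope are assumed).
PARALLEL_BUILD = [
    "SET innodb_rds_parallel_index_creation_threads = 4",
    "ALTER TABLE sbtest1 ADD INDEX idx_k (k)",
]

single_thread_s = 146.82  # step 2: community default, one thread
four_threads_s = 38.72    # step 3: four threads
threads = 4

speedup = single_thread_s / four_threads_s
efficiency = speedup / threads  # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# speedup: 3.79x, parallel efficiency: 95%
```

A parallel efficiency of about 95% at 4 threads means the build scales almost linearly in this test.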

Precautions

First, the parameter innodb_rds_parallel_index_creation_threads. It controls the total number of threads that all parallel DDL statements in the system can use, and its value range is [1, 128]. A value of 1 means the index is created with the original single thread, and a value of N means the next DDL creates the index with N threads. If one DDL is running with 100 threads, another parallel DDL started at the same time can use only the remaining 28 threads; if all 128 threads are occupied by parallel DDL statements, a new DDL can only follow the original single-threaded creation logic.
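The shared budget described above can be modeled with a small counter. This is a toy model of the stated behavior, not the server's implementation: the class ParallelDdlBudget is hypothetical, and the choice to fall back to a single thread when fewer than 2 threads remain is an assumption.

```python
import threading


class ParallelDdlBudget:
    """Toy model of a system-wide thread budget shared by parallel DDL statements."""

    def __init__(self, total=128):
        self.free = total
        self._lock = threading.Lock()

    def acquire(self, requested):
        """Grant up to `requested` threads, or 1 (single-threaded path) if the budget is used up."""
        with self._lock:
            if requested <= 1 or self.free < 2:
                return 1  # original single-threaded logic, nothing taken from the budget
            granted = min(requested, self.free)
            self.free -= granted
            return granted

    def release(self, granted):
        """Return threads to the budget when a parallel DDL finishes."""
        with self._lock:
            if granted > 1:
                self.free += granted


budget = ParallelDdlBudget()
first = budget.acquire(100)   # gets 100 threads
second = budget.acquire(100)  # only 28 remain
third = budget.acquire(16)    # budget exhausted: single-threaded
print(first, second, third)   # 100 28 1
```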

Although parallel index creation speeds up index builds, it still needs careful evaluation in each usage scenario. Once the parallel algorithm is applied, the DDL uses hardware resources as fully as it can, which means other operations get fewer resources. How to create indexes therefore depends on an analysis of the specific scenario.

In the migration scenario, there is no business traffic yet and the user wants all indexes created as soon as possible, so the thread count can be set as high as is practical. For example, on a 16-core instance, we can set the number of parallel threads to 16 to finish sooner.

If the index is created while the user's business is running, we do not want the DDL operation to have too much impact on the running workload, such as DML operations. In this case, specify a relatively small number of threads, such as 2 to 4 (depending on CPU specifications and load; running multiple DDL operations concurrently is not encouraged). This still speeds up index creation while keeping DML running normally.

In summary, GaussDB (for MySQL) supports parallel index creation. By shortening index creation time, it addresses the two types of problems customers care about and improves the customer experience. But there is always more to do. Index creation has other problems that need to be optimized and solved, such as how to reduce the impact of index creation on IO. We will work on these points in follow-up releases and bring more improvements to customers.

Currently, the parallel index creation optimization in Huawei Cloud GaussDB (for MySQL) has been released. You are welcome to try it on the official Huawei Cloud website: https://www.huaweicloud.com/product/gaussdb_mysql.html

Attachment: Huawei Cloud GaussDB (for MySQL) kernel expert series articles

Huawei overseas female scientists will reveal the secret for you: How powerful is the vertical integration of GaussDB (for MySQL) cloud stack?

HUAWEI CLOUD database kernel experts reveal for you: How fast is GaussDB (for MySQL) parallel query?

