Li Chunzhu (lichunzhu), TiDB R&D engineer
SIG: Migrate SIG of the TiDB Community, mainly covering TiDB's data processing tools, including TiDB data backup, import and export, TiDB change data capture, and migration of data from other databases into TiDB.

Foreword

Dumpling is a data export tool written in Go. It currently supports databases that speak the MySQL protocol and is specially optimized for the characteristics of TiDB. The previous article, Go Dumpling! Make the exported data more stable, introduced Dumpling's advanced usage. This article introduces the logic behind Dumpling's in-table concurrency optimization, to help you understand more deeply how Dumpling works.

Why do you need in-table concurrency

The export logic inside Dumpling can be described with a producer-consumer model. The producer thread traverses the set of database tables to be exported and sends the generated export SQL to the consumer threads, and each consumer thread formats the SQL execution result and writes it to a file. It is not difficult to see that different consumers can export concurrently without interfering with each other.
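
To make the model concrete, here is a minimal Go sketch of this producer-consumer structure. It is not Dumpling's actual code: the table names and consumer count are made up, and the "export" is just a print statement. The producer emits one export SQL per table over a channel, and several consumers drain the channel concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Tables to export (hypothetical names).
	tables := []string{"t1", "t2", "t3", "t4"}
	tasks := make(chan string)

	// Producer: traverse the tables and generate one export SQL per table.
	go func() {
		for _, tbl := range tables {
			tasks <- fmt.Sprintf("SELECT * FROM `%s`", tbl)
		}
		close(tasks)
	}()

	// Consumers: export concurrently without interfering with each other.
	// In real code each consumer would execute the SQL, format the rows,
	// and write them to a file.
	const consumers = 4
	var wg sync.WaitGroup
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for sql := range tasks {
				fmt.Printf("consumer %d exporting: %s\n", id, sql)
			}
		}(i)
	}
	wg.Wait()
}
```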

It follows from the above that the tables to be exported are independent of each other and can be exported concurrently by different consumers. However, in most business scenarios the amount of data varies greatly between tables, and a consumer thread can easily end up stuck on a single large table while the other threads sit idle. It is therefore necessary to split large tables into smaller "export units" (hereinafter referred to as chunks) so that multiple consumer threads can export one table in parallel, thereby improving the export speed. Chunks should be divided as evenly as possible: uneven chunk division recreates the same big-table/small-table concurrency problem, which can double the export time and greatly increase memory usage on the database server.

In-table concurrency when exporting MySQL

So how do you divide a large table into smaller, more uniform chunks? Compared with other column types, integers can be split most evenly into multiple bounded ranges, which makes an integer column the ideal basis for division. At the same time, to ensure that the divided ranges can hit an index, rather than wasting computing resources on repeated full-table scans, the column used for division should be the first column of an index. From this, the in-table concurrency division method for MySQL can be derived:

First, Dumpling selects an index whose first column is an integer and records that column as field, preferring the candidate with the largest Cardinality so that the values in the column are as distinct as possible. After selecting the integer column, Dumpling uses an EXPLAIN statement to roughly estimate the number of rows the table will export under the given conditions, recorded as count. Given the rows parameter specified at startup, which sets the target chunk size in rows, Dumpling needs to divide the data into count/rows chunks. Then, by selecting min(field) and max(field), it records the minimum and maximum values of the column under the given conditions as min_field and max_field. Assuming the data is roughly evenly distributed within this range, the division step can be calculated as d = (max_field - min_field) * rows / count. Each chunk is then constrained by a WHERE condition over one of the ranges [min_field, min_field+d), [min_field+d, min_field+2d), …
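
To make the arithmetic concrete, here is a small Go sketch of the division with made-up numbers rather than anything from a real export: suppose EXPLAIN estimates count = 1,000,000 rows, rows is 200,000, and min(field)/max(field) are 1 and 5,000,000.

```go
package main

import "fmt"

func main() {
	var (
		count    int64 = 1_000_000 // estimated row count from EXPLAIN
		rows     int64 = 200_000   // the rows parameter
		minField int64 = 1         // SELECT MIN(field)
		maxField int64 = 5_000_000 // SELECT MAX(field)
	)

	// Step size: d = (max_field - min_field) * rows / count.
	d := (maxField - minField) * rows / count
	if d == 0 {
		d = 1 // guard for very dense ranges
	}

	// Emit WHERE ranges [min, min+d), [min+d, min+2d), ...
	for lo := minField; lo <= maxField; lo += d {
		fmt.Printf("WHERE field >= %d AND field < %d\n", lo, lo+d)
	}
}
```

Because d is computed with integer truncation and real data is rarely perfectly uniform, the resulting chunks only approximate rows rows each, which leads to the observation below.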

From the above implementation it can be seen that after rows is specified, the actual size of each chunk is not necessarily exactly rows. Meanwhile, increasing rows directly increases the step size of each chunk, that is, the data volume per chunk. Therefore, if you find that the database consumes too much memory during a Dumpling export, you can reduce rows appropriately to shrink each chunk. In practice the rows setting should be moderate: if it is too large, each chunk consumes too much memory and the concurrency effect suffers; if it is too small, Dumpling frequently requests small amounts of data from the database, which slows the export down. In current practice, configuring --rows=200000 generally balances the concurrency effect and the export speed.

In-table concurrency when exporting TiDB v3.0/v4.0

As can be seen from the above, when a table has no evenly distributed integer index, or when the EXPLAIN estimate of the row count is inaccurate, the in-table concurrency effect is greatly reduced. So how do TiDB and Dumpling handle this? As mentioned in the article on computing in the TiDB database, TiDB assigns each row in a table a row ID, denoted RowID. The RowID is unique within the table and can be obtained directly from the database by selecting _tidb_rowid. Therefore, a simple idea is to use _tidb_rowid as the integer primary key described above and divide chunks with the same method.

However, as Best Practices for TiDB High Concurrent Write Scenarios explains, to avoid TiDB write hotspots, TiDB tables often use an AUTO_RANDOM column or set the SHARD_ROW_ID_BITS attribute at table creation. These settings make the distribution of the _tidb_rowid column extremely uneven, so Dumpling's chunk division for in-table concurrency becomes inaccurate, which hurts the export speed and can even cause OOM.

As described in TiDB database storage, TiDB data is mapped into KV key-value pairs and stored on TiKV as range-based Regions: each Region holds the data in a range [StartKey, EndKey), and TiKV tries to keep the data stored in each Region under a certain size. These properties are very helpful for dividing chunks evenly. Therefore, Dumpling obtains the StartKey of every Region of the target table from the TIKV_REGION_STATUS table under TiDB's INFORMATION_SCHEMA database, decodes the row ID it needs from each key, and then uses the resulting row IDs in WHERE conditions to divide the chunks.
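
The following Go sketch illustrates this lookup under stated assumptions; it is not Dumpling's implementation. Instead of decoding Region start keys client-side as Dumpling does, it leans on TiDB's TIDB_DECODE_KEY function (available in recent TiDB versions) to turn each START_KEY into JSON such as {"_tidb_rowid":120000,"table_id":"45"}. The DSN, schema, and table name are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One row per data Region of the target table; decode each START_KEY
	// into JSON containing the boundary _tidb_rowid.
	rows, err := db.Query(`
		SELECT TIDB_DECODE_KEY(START_KEY)
		FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS
		WHERE DB_NAME = ? AND TABLE_NAME = ? AND IS_INDEX = 0
		ORDER BY START_KEY`, "test", "t")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	var boundaries []string
	for rows.Next() {
		var decoded sql.NullString
		if err := rows.Scan(&decoded); err != nil {
			log.Fatal(err)
		}
		if decoded.Valid {
			boundaries = append(boundaries, decoded.String)
		}
	}
	// In real code: parse _tidb_rowid out of each JSON boundary and build
	// WHERE clauses of the form _tidb_rowid >= lo AND _tidb_rowid < hi
	// between consecutive boundaries.
	fmt.Println(boundaries)
}
```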

From this implementation it can be seen that the granularity of Dumpling's in-table concurrency here is the Region size, and the specific value of rows has no effect on the division result. However, whether rows is set at all still determines whether Dumpling exports a TiDB database with in-table concurrency.

In-table concurrency when exporting TiDB v5.0

TiDB v5.0.0 began to support clustered indexes, which avoid the extra table lookup TiDB previously needed when using rowid and improve write and query speed. Tables with a clustered index enabled no longer have the _tidb_rowid column. At the same time, in certain scenarios such as after a split region operation, a Region's StartKey is not necessarily a legal value. The idea of dividing by Region is still effective, but a better way to obtain the Region boundary data is needed.

To solve this problem, TiDB v5.0.0 and above support the SELECT fields FROM table TABLESAMPLE REGIONS() syntax. When this SQL is executed, TiKV scans every Region the table occupies, fetches the first legal KV pair from each, and returns the resulting rows. For example, when this SQL is used to select the columns of the clustered index, it returns, for each Region of the table, the clustered index column values of the first row in that Region, which can be used to divide the chunks evenly.
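
Here is a sketch of how a client could use this syntax, assuming a table t whose clustered index is a single integer column id (both names hypothetical, DSN a placeholder); the values returned, one per Region, serve as chunk boundaries.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Returns the clustered-index value of the first row in each Region.
	rows, err := db.Query("SELECT id FROM t TABLESAMPLE REGIONS()")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	var bounds []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		bounds = append(bounds, id)
	}
	// Consecutive boundary values delimit the chunks, e.g.
	// WHERE id >= bounds[i] AND id < bounds[i+1].
	fmt.Println(bounds)
}
```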

Dumpling follow-up development plan

The following are some plans and ideas for Dumpling's subsequent development. Dumpling has now been migrated into the tidb repo; you are welcome to discuss and participate in its development in the Dumpling Repo.

Generally speaking, as long as the target database has a corresponding driver or client, such as godror, the Go driver for the Oracle database, export support can be implemented by slightly modifying the export statements and the calling Go code. Community members are welcome to participate and help Dumpling support exporting more types of databases.
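
As a hypothetical illustration of this point (the sketch below is not part of Dumpling): because the export path goes through Go's database/sql, pointing it at Oracle is largely a matter of registering the godror driver and adjusting the SQL dialect. The DSN and table name are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/godror/godror" // Oracle driver, registers as "godror"
)

func main() {
	// Placeholder DSN; real values depend on the Oracle instance.
	db, err := sql.Open("godror", "scott/tiger@127.0.0.1:1521/orclpdb1")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The export statement needs Oracle's dialect, e.g. double quotes
	// instead of backticks for identifiers.
	rows, err := db.Query(`SELECT * FROM "T"`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cols)
}
```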

Dumpling currently does not support exporting TiDB Sequences; supporting them would make the export functionality more complete.

Dumpling needs to support checksums[4][5] to ensure the correctness of exported data.

Dumpling should support resuming from a breakpoint when a snapshot-mode export of TiDB is interrupted partway through.

Contact: the #sig-migrate channel in the tidbcommunity Slack workspace; you can join this channel through this invitation link.
