Li Chunzhu (lichunzhu), TiDB R&D engineer
SIG: Migrate SIG of the TiDB Community, mainly covering TiDB's data processing tools, including TiDB data backup, import and export, TiDB change data capture, and migration of data from other databases into TiDB.

Foreword

Dumpling is a data export tool written in Go. It currently supports databases that speak the MySQL protocol and is specially optimized for the characteristics of TiDB. The previous article, Go Dumpling! Make the exported data more stable, introduced Dumpling's advanced usage. This article introduces the logic behind Dumpling's in-table concurrency optimization, to help you understand more deeply how Dumpling works.

Why do you need in-table concurrency

The export logic inside Dumpling can be described with a producer-consumer model. The producer thread traverses the set of database tables to be exported and sends the generated export SQL to the consumer threads, and each consumer thread formats the SQL execution result and writes it to a file. It is not difficult to see that different consumers can export concurrently without interfering with each other.
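
To make the model concrete, here is a minimal Go sketch of this producer-consumer structure. It is not Dumpling's actual code: the table names and consumer count are made up, and the "export" is just a print statement. The producer emits one export SQL per table over a channel, and several consumers drain the channel concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Tables to export (hypothetical names).
	tables := []string{"t1", "t2", "t3", "t4"}
	tasks := make(chan string)

	// Producer: traverse the tables and generate one export SQL per table.
	go func() {
		for _, tbl := range tables {
			tasks <- fmt.Sprintf("SELECT * FROM `%s`", tbl)
		}
		close(tasks)
	}()

	// Consumers: export concurrently without interfering with each other.
	// In real code each consumer would execute the SQL, format the rows,
	// and write them to a file.
	const consumers = 4
	var wg sync.WaitGroup
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for sql := range tasks {
				fmt.Printf("consumer %d exporting: %s\n", id, sql)
			}
		}(i)
	}
	wg.Wait()
}
```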

It follows from the above that the tables to be exported are independent of each other and can be exported concurrently by different consumers. However, in most business scenarios the amount of data varies greatly between tables, and a consumer thread can easily end up stuck on a single large table while the other threads sit idle. It is therefore necessary to split large tables into smaller "export units" (hereinafter referred to as chunks) so that multiple consumer threads can export one table in parallel, thereby improving the export speed. Chunks should be divided as evenly as possible: uneven chunk division recreates the same big-table/small-table concurrency problem, which can double the export time and greatly increase memory usage on the database server.

In-table concurrency when exporting MySQL

So how do you divide a large table into smaller, more uniform chunks? Compared with other column types, integers can be split most evenly into multiple bounded ranges, which makes an integer column the ideal basis for division. At the same time, to ensure that the divided ranges can hit an index, rather than wasting computing resources on repeated full-table scans, the column used for division should be the first column of an index. From this, the in-table concurrency division method for MySQL can be derived:

First, Dumpling selects an index whose first column is an integer and records that column as field, preferring the candidate with the largest Cardinality so that the values in the column are as distinct as possible. After selecting the integer column, Dumpling uses an EXPLAIN statement to roughly estimate the number of rows the table will export under the given conditions, recorded as count. Given the rows parameter specified at startup, which sets the target chunk size in rows, Dumpling needs to divide the data into count/rows chunks. Then, by selecting min(field) and max(field), it records the minimum and maximum values of the column under the given conditions as min_field and max_field. Assuming the data is roughly evenly distributed within this range, the division step can be calculated as d = (max_field - min_field) * rows / count. Each chunk is then constrained by a WHERE condition over one of the ranges [min_field, min_field+d), [min_field+d, min_field+2d), …
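
To make the arithmetic concrete, here is a small Go sketch of the division with made-up numbers rather than anything from a real export: suppose EXPLAIN estimates count = 1,000,000 rows, rows is 200,000, and min(field)/max(field) are 1 and 5,000,000.

```go
package main

import "fmt"

func main() {
	var (
		count    int64 = 1_000_000 // estimated row count from EXPLAIN
		rows     int64 = 200_000   // the rows parameter
		minField int64 = 1         // SELECT MIN(field)
		maxField int64 = 5_000_000 // SELECT MAX(field)
	)

	// Step size: d = (max_field - min_field) * rows / count.
	d := (maxField - minField) * rows / count
	if d == 0 {
		d = 1 // guard for very dense ranges
	}

	// Emit WHERE ranges [min, min+d), [min+d, min+2d), ...
	for lo := minField; lo <= maxField; lo += d {
		fmt.Printf("WHERE field >= %d AND field < %d\n", lo, lo+d)
	}
}
```

Because d is computed with integer truncation and real data is rarely perfectly uniform, the resulting chunks only approximate rows rows each, which leads to the observation below.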

From the above implementation it can be seen that after rows is specified, the actual size of each chunk is not necessarily exactly rows. Meanwhile, increasing rows directly increases the step size of each chunk, that is, the data volume per chunk. Therefore, if you find that the database consumes too much memory during a Dumpling export, you can reduce rows appropriately to shrink each chunk. In practice the rows setting should be moderate: if it is too large, each chunk consumes too much memory and the concurrency effect suffers; if it is too small, Dumpling frequently requests small amounts of data from the database, which slows the export down. In current practice, configuring --rows=200000 generally balances the concurrency effect and the export speed.

In-table concurrency when exporting TiDB v3.0/v4.0

As can be seen from the above, when a table has no evenly distributed integer index, or when the EXPLAIN estimate of the row count is inaccurate, the in-table concurrency effect is greatly reduced. So how do TiDB and Dumpling handle this? As mentioned in the article on computing in the TiDB database, TiDB assigns each row in a table a row ID, denoted RowID. The RowID is unique within the table and can be obtained directly from the database by selecting _tidb_rowid. Therefore, a simple idea is to use _tidb_rowid as the integer primary key described above and divide chunks with the same method.

However, as Best Practices for TiDB High Concurrent Write Scenarios explains, to avoid TiDB write hotspots, TiDB tables often use an AUTO_RANDOM column or set the SHARD_ROW_ID_BITS attribute at table creation. These settings make the distribution of the _tidb_rowid column extremely uneven, so Dumpling's chunk division for in-table concurrency becomes inaccurate, which hurts the export speed and can even cause OOM.

As described in TiDB database storage, TiDB data is mapped into KV key-value pairs and stored on TiKV as range-based Regions: each Region holds the data in a range [StartKey, EndKey), and TiKV tries to keep the data stored in each Region under a certain size. These properties are very helpful for dividing chunks evenly. Therefore, Dumpling obtains the StartKey of every Region of the target table from the TIKV_REGION_STATUS table under TiDB's INFORMATION_SCHEMA database, decodes the row ID it needs from each key, and then uses the resulting row IDs in WHERE conditions to divide the chunks.
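
The following Go sketch illustrates this lookup under stated assumptions; it is not Dumpling's implementation. Instead of decoding Region start keys client-side as Dumpling does, it leans on TiDB's TIDB_DECODE_KEY function (available in recent TiDB versions) to turn each START_KEY into JSON such as {"_tidb_rowid":120000,"table_id":"45"}. The DSN, schema, and table name are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One row per data Region of the target table; decode each START_KEY
	// into JSON containing the boundary _tidb_rowid.
	rows, err := db.Query(`
		SELECT TIDB_DECODE_KEY(START_KEY)
		FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS
		WHERE DB_NAME = ? AND TABLE_NAME = ? AND IS_INDEX = 0
		ORDER BY START_KEY`, "test", "t")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	var boundaries []string
	for rows.Next() {
		var decoded sql.NullString
		if err := rows.Scan(&decoded); err != nil {
			log.Fatal(err)
		}
		if decoded.Valid {
			boundaries = append(boundaries, decoded.String)
		}
	}
	// In real code: parse _tidb_rowid out of each JSON boundary and build
	// WHERE clauses of the form _tidb_rowid >= lo AND _tidb_rowid < hi
	// between consecutive boundaries.
	fmt.Println(boundaries)
}
```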

From this implementation it can be seen that the granularity of Dumpling's in-table concurrency here is the Region size, and the specific value of rows has no effect on the division result. However, whether rows is set at all still determines whether Dumpling exports a TiDB database with in-table concurrency.

In-table concurrency when exporting TiDB v5.0

TiDB v5.0.0 began to support clustered indexes, which avoid the extra table lookup TiDB previously needed when using rowid and improve write and query speed. Tables with a clustered index enabled no longer have the _tidb_rowid column. At the same time, in certain scenarios such as after a split region operation, a Region's StartKey is not necessarily a legal value. The idea of dividing by Region is still effective, but a better way to obtain the Region boundary data is needed.

To solve this problem, TiDB v5.0.0 and above support the SELECT fields FROM table TABLESAMPLE REGIONS() syntax. When this SQL is executed, TiKV scans every Region the table occupies, fetches the first legal KV pair from each, and returns the resulting rows. For example, when this SQL is used to select the columns of the clustered index, it returns, for each Region of the table, the clustered index column values of the first row in that Region, which can be used to divide the chunks evenly.
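
Here is a sketch of how a client could use this syntax, assuming a table t whose clustered index is a single integer column id (both names hypothetical, DSN a placeholder); the values returned, one per Region, serve as chunk boundaries.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Returns the clustered-index value of the first row in each Region.
	rows, err := db.Query("SELECT id FROM t TABLESAMPLE REGIONS()")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	var bounds []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		bounds = append(bounds, id)
	}
	// Consecutive boundary values delimit the chunks, e.g.
	// WHERE id >= bounds[i] AND id < bounds[i+1].
	fmt.Println(bounds)
}
```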

Dumpling follow-up development plan

The following are some plans and ideas for Dumpling's subsequent development. Dumpling has now been migrated into the tidb repo; you are welcome to discuss and participate in its development in the Dumpling Repo.

Generally speaking, as long as the target database has a corresponding driver or client, such as godror, the Go driver for the Oracle database, export support can be implemented by slightly modifying the export statements and the calling Go code. Community members are welcome to participate and help Dumpling support exporting more types of databases.
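
As a hypothetical illustration of this point (the sketch below is not part of Dumpling): because the export path goes through Go's database/sql, pointing it at Oracle is largely a matter of registering the godror driver and adjusting the SQL dialect. The DSN and table name are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/godror/godror" // Oracle driver, registers as "godror"
)

func main() {
	// Placeholder DSN; real values depend on the Oracle instance.
	db, err := sql.Open("godror", "scott/tiger@127.0.0.1:1521/orclpdb1")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The export statement needs Oracle's dialect, e.g. double quotes
	// instead of backticks for identifiers.
	rows, err := db.Query(`SELECT * FROM "T"`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cols)
}
```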

Dumpling currently does not support exporting TiDB Sequences; supporting them would make the export functionality more complete.

Dumpling needs to support checksums[4][5] to ensure the correctness of exported data.

Dumpling should support resuming from a breakpoint when a snapshot-mode export of TiDB is interrupted partway through.

Contact: the #sig-migrate channel in the tidbcommunity Slack workspace; you can join this channel through this invitation link.
