Introduction
In data synchronization scenarios, checking the consistency of upstream and downstream data is a critical task: without data validation, inconsistent data can have a serious negative impact on business decisions. Sync-diff-inspector is a consistency check tool developed by the Data Platform team. It can check the consistency of upstream and downstream data in a variety of data synchronization scenarios, such as multiple sources to a single target (for example, sharded MySQL tables merged into one TiDB table) and a single source to a single target (for example, TiDB table to TiDB table). In the data verification process, efficiency and accuracy are essential. First, let's look at the architecture diagram of Sync-diff-inspector to get a general understanding of its role and implementation principles.
Sync-diff-inspector 2.0 architecture diagram
Why Sync-diff-inspector 2.0?
In version 1.0, we encountered some problems reported by customers, including:
- A memory overflow occurred on the TiDB side when the consistency check was performed on a large table.
- Float type data verification was not supported.
- The result output was not user-friendly, and the verification results needed to be streamlined.
- GC occurred during the inspection process, causing the verification to fail.
These problems were related to the original implementation:
- A single thread was used to divide chunks, and comparison could not start until all chunks of a table had been divided, which caused TiKV utilization to drop during this period.
- The checkpoint feature wrote the status of every verified chunk into the database, so the IO of these writes became the bottleneck of the verification process.
- When the checksums of a chunk range differed, a row-by-row comparison was performed directly, which consumed a lot of IO resources.
- Without an adaptive GC mechanism, the snapshot being verified could be garbage-collected, causing the verification to fail.
- ...
Sync-diff-inspector 2.0 new features
Chunk division
To compare whether the data of two tables is the same, you can compute a checksum of each table, but to determine which rows differ you need to compare them row by row. To reduce the number of rows that must be compared row by row when the checksums do not match, Sync-diff-inspector adopts a compromise: it divides the table into several chunks in index order and compares the upstream and downstream data of each chunk.
Chunk division follows the previous approach: TiDB statistics divide the table into several buckets by index range, and these buckets are then merged or split according to the configured chunk size. The splitting process selects random rows as range boundaries.
The original version of Sync-diff-inspector used a single thread to divide chunks, and the divided chunks had to wait until the whole table was divided before comparison could start. Here, chunks are divided asynchronously to improve resource utilization during this period. Two situations reduce resource utilization:
- During chunk division, because the predetermined chunk size is smaller than the size of a bucket, a bucket needs to be split into several chunks. This is a relatively slow process, so the consumer side (the chunk comparison threads) has to wait and resource utilization drops. Two measures are taken here: multiple buckets are divided asynchronously to improve resource utilization, and because some tables have no bucket information and can only be divided as a single bucket, multiple tables are divided in parallel to increase the overall number of buckets being divided asynchronously.
- Chunk division itself also consumes resources. Dividing chunks too fast slows down chunk comparison to some extent, so a channel on the consumer side is used to limit the speed of multi-table chunk division.
In summary, the optimized chunk division in Sync-diff-inspector consists of three parts. As shown in the figure below, there are 3 chunk iterators, each dividing one table. The progress of chunk division is regulated through a global channel. Note that the limit applies only across tables: once a chunk iterator starts dividing, all of its chunks are produced asynchronously, and the iterator blocks when the buffer of the global channel is full. Only when all chunks of a chunk iterator have entered the global channel's buffer does the division of the next table begin.
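The following Go sketch illustrates this bounded-channel producer/consumer pattern. It is a minimal illustration only: the names (Chunk, divideTable), the buffer size, and the worker count are assumptions for the example and do not come from the actual sync-diff-inspector code.

```go
package main

import (
	"fmt"
	"sync"
)

// Chunk is a simplified stand-in for a chunk's key range within one table.
type Chunk struct {
	Table string
	ID    int
}

// divideTable simulates splitting one table's buckets into chunks and pushing
// them into the shared bounded channel. The send blocks when the buffer is
// full, which throttles how far division can run ahead of comparison.
func divideTable(table string, chunkCount int, out chan<- Chunk) {
	for i := 0; i < chunkCount; i++ {
		out <- Chunk{Table: table, ID: i}
	}
}

func main() {
	// The buffer size limits how many divided-but-not-yet-compared chunks
	// can be pending at once.
	chunkCh := make(chan Chunk, 8)
	tables := []string{"t1", "t2", "t3"}

	// Producer: divide tables one after another; within a table the chunks
	// are produced asynchronously with respect to the consumers.
	go func() {
		for _, t := range tables {
			divideTable(t, 20, chunkCh)
		}
		close(chunkCh)
	}()

	// Consumers: compare chunks as soon as they become available.
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for c := range chunkCh {
				// Placeholder for the real upstream/downstream checksum comparison.
				fmt.Printf("worker %d compares %s chunk %d\n", worker, c.Table, c.ID)
			}
		}(w)
	}
	wg.Wait()
}
```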
Checkpoint and repair SQL
Sync-diff-inspector supports resuming a check from a breakpoint. The Diff process records breakpoint information every ten seconds. If the verification program exits abnormally, running Sync-diff-inspector again will continue the verification from the most recently saved breakpoint. If the configuration file changed between runs, Sync-diff-inspector discards the breakpoint information and performs the verification from scratch.
The completeness and correctness of this feature depend on the global order and continuity defined during chunk division. Compared with the original version, the checkpoint implemented in Sync-diff-inspector 2.0 does not need to record the status of every chunk; it only needs to record the latest chunk in the consecutive run of completed chunks, which greatly reduces the amount of data to be recorded. A chunk's place in the global order is defined by a structure that contains the table the chunk belongs to, the bucket it was divided from (or the first and last bucket, if the chunk was merged from two or more buckets), how many chunks that bucket was divided into, and which one of those chunks this is. This structure also makes it possible to determine whether two chunks are adjacent. Each time the breakpoint timer fires, the last chunk of the consecutive run of compared chunks is selected as the checkpoint, and its information is written to a local file.
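As a rough illustration, such a globally ordered chunk identifier might look like the following Go structure. The field names and the comparison rule are simplified for the example; the real structure in sync-diff-inspector may differ.

```go
package checkpoint

// ChunkID is an illustrative, simplified version of the structure that gives
// each chunk a global order.
type ChunkID struct {
	TableIndex  int // which table the chunk belongs to
	BucketStart int // first bucket covered by this chunk
	BucketEnd   int // last bucket (equals BucketStart unless buckets were merged)
	ChunkIndex  int // index of this chunk within the bucket's split
	ChunkCount  int // how many chunks the bucket was split into
}

// Less defines the global order used for checkpointing and repair-SQL file names.
func (c ChunkID) Less(o ChunkID) bool {
	if c.TableIndex != o.TableIndex {
		return c.TableIndex < o.TableIndex
	}
	if c.BucketEnd != o.BucketEnd {
		return c.BucketEnd < o.BucketEnd
	}
	return c.ChunkIndex < o.ChunkIndex
}
```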
When rows that differ are found, Sync-diff-inspector generates repair SQL and saves it in a local file. Because chunks are checked out of order and in parallel, a file is created for each chunk that contains differing rows, and the file name is the chunk's globally ordered structure. Repair SQL and checkpoint records must be written in a particular order:
- If the repair SQL is written first and the program then exits abnormally, chunks whose repair SQL was written but not yet covered by the checkpoint will be checked again at the next run. Normally their repair SQL files are simply overwritten. However, because bucket splitting chooses split points randomly, even though the number of chunks per bucket is fixed, a differing row that fell into the third chunk last time may fall into the range of the fourth chunk this time. In that case duplicate repair SQL would appear.
- If the checkpoint is written first and the program then exits abnormally, the next run starts from the range after the chunk recorded by the checkpoint. If a chunk produced repair SQL that was not yet written to a file, that repair SQL is lost.
Here, the repair SQL record is written first. At the next run, all repair SQL files belonging to chunks after the one recorded by the checkpoint (since the files are named after the chunks' globally ordered structure, the order of two chunks is easy to determine) are moved to a trash folder, which avoids duplicate repair SQL.
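A minimal Go sketch of this cleanup step is shown below. It assumes, purely for illustration, that repair SQL files are named so that plain string comparison reflects chunk order; the actual file layout and naming in sync-diff-inspector may differ.

```go
package diff

import (
	"os"
	"path/filepath"
	"strings"
)

// cleanStaleFixSQL moves repair-SQL files that belong to chunks after the
// checkpointed chunk into a trash directory, so that a resumed run does not
// leave duplicate repair SQL behind.
func cleanStaleFixSQL(fixDir, trashDir, checkpointChunk string) error {
	entries, err := os.ReadDir(fixDir)
	if err != nil {
		return err
	}
	if err := os.MkdirAll(trashDir, 0o755); err != nil {
		return err
	}
	for _, e := range entries {
		name := strings.TrimSuffix(e.Name(), ".sql")
		if name > checkpointChunk { // chunk comes after the checkpoint
			if err := os.Rename(filepath.Join(fixDir, e.Name()), filepath.Join(trashDir, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```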
Binary checksum and adaptive chunkSize
Sync-diff-inspector computes checksums and splits large tables into chunks. Each checksum carries some fixed overhead (including session establishment and transmission time), so if chunks are small, this overhead accounts for a larger share of each checksum's time. Usually the predetermined chunk size, chunkSize, should therefore be set relatively large. But if chunkSize is set too large and the upstream and downstream checksums of a chunk differ, comparing such a large chunk row by row directly becomes very expensive.
In data synchronization, generally only a small amount of data is inconsistent. Based on this assumption, when the upstream and downstream checksums of a chunk are found to differ during verification, the chunk can be bisected into two sub-chunks of roughly equal size, and the checksum comparison is performed on the sub-chunks to further narrow the possible range of inconsistent rows. The advantage of this optimization is that the time and memory consumed by checksum comparison are far less than those of row-by-row comparison; by repeatedly narrowing the range of possibly inconsistent rows with checksums, the number of rows that must be compared row by row is reduced, which speeds up the comparison and reduces memory consumption. Moreover, since each checksum computation is equivalent to traversing one sub-chunk of the bisection, and ignoring the fixed per-checksum overhead, the cost of the binary check is roughly equivalent to doing only two extra checksums over the original chunk.
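The following Go sketch shows the bisection strategy described above. The source interface, its method names, and the rowLimit threshold are assumptions made for the example, not the tool's actual API; the real checksums are computed with SQL against the databases.

```go
package diff

// Range is a stand-in for a chunk's boundary on the chosen index.
type Range struct{ Lo, Hi int }

// source abstracts the queries needed by the check; method names are illustrative.
type source interface {
	Checksum(r Range) (sum uint64, rows int) // checksum and row count over r
	Split(r Range) (left, right Range)       // bisect r at its middle row
	CompareRows(other source, r Range) int   // row-by-row comparison, returns #diff rows
}

// rowLimit: below this size we stop bisecting and compare rows directly
// (an illustrative threshold, not necessarily the tool's default).
const rowLimit = 3000

// differs reports whether the upstream and downstream checksums of r disagree,
// and returns the upstream row count of r.
func differs(up, down source, r Range) (bool, int) {
	upSum, upCnt := up.Checksum(r)
	downSum, downCnt := down.Checksum(r)
	return upSum != downSum || upCnt != downCnt, upCnt
}

// bisect is called for a chunk whose checksums are already known to differ.
// It keeps bisecting while only one half is inconsistent, and falls back to
// row-by-row comparison when both halves differ or the range becomes small.
func bisect(up, down source, r Range, rows int) int {
	if rows <= rowLimit {
		return up.CompareRows(down, r)
	}
	left, right := up.Split(r)
	leftBad, leftRows := differs(up, down, left)
	rightBad, _ := differs(up, down, right)
	switch {
	case leftBad && rightBad:
		// Differences on both sides of the split: stop bisecting, compare rows.
		return up.CompareRows(down, r)
	case leftBad:
		return bisect(up, down, left, leftRows)
	case rightBad:
		return bisect(up, down, right, rows-leftRows)
	default:
		// Should not happen if r truly differed; compare rows to be safe.
		return up.CompareRows(down, r)
	}
}
```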
Since computing a checksum is equivalent to traversing all rows in a range, the number of rows in that range can be counted at the same time. This matters because the checksum performs a CRC32 operation on each row and then XORs the per-row results together; such a checksum cannot distinguish, for example, three identical rows from a single one (their XOR is the same), and because the index columns are not necessarily unique, this kind of error can occur. Counting the rows of each chunk at the same time also makes it possible to use the LIMIT syntax to locate the index value of a chunk's middle row, which is the prerequisite for the bisection method.
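Here is a minimal Go illustration of the XOR-of-CRC32 idea over in-memory rows. In the real tool the checksum and row count are computed by SQL on the database side, and the column concatenation and escaping differ; this sketch only shows why the row count catches cases the XOR alone cannot.

```go
package diff

import (
	"hash/crc32"
	"strings"
)

// rowChecksum summarizes a set of rows the same way the chunk checksum does:
// CRC32 of each row's concatenated columns, XORed together, plus a row count.
// The XOR alone cannot tell one copy of a row from three copies (x^x^x == x),
// which is why the row count is computed at the same time.
func rowChecksum(rows [][]string) (sum uint32, count int) {
	for _, cols := range rows {
		sum ^= crc32.ChecksumIEEE([]byte(strings.Join(cols, ",")))
		count++
	}
	return sum, count
}
```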
But chunkSize cannot be set too large either. When both sub-chunks produced by a bisection contain different rows, bisection stops and row comparison is performed. An overly large chunk is more likely to contain multiple different rows at the same time, so the benefit of the binary check diminishes. Here, the default chunkSize of each table is set to 50,000 rows, and each table is divided into at most 10,000 chunks.
Index processing
The upstream and downstream tables may have different schemas; for example, the downstream table may have only some of the upstream indexes. An inappropriate index choice increases the time spent on one side's database. When verifying the table structure, only the indexes that exist both upstream and downstream are kept (if no such index exists, all indexes are kept). On the other hand, some indexes contain columns that are not unique, so there may be a large number of rows with the same index value and the chunks would be divided unevenly. When selecting an index, Sync-diff-inspector gives priority to the primary key or a unique index, and then to the index with the lowest repetition rate.
Where processing
Suppose there is a table: create table t (a int, b int, c int, primary key (a, b, c));
and the range of a divided chunk is ((1, 2, 3), (1, 2, 4)).
The original Sync-diff-inspector generates a where condition that is the conjunction of:
- ((a > 1) OR (a = 1 AND b > 2) OR (a = 1 AND b = 2 AND c > 3))
- ((a < 1) OR (a = 1 AND b < 2) OR (a = 1 AND b = 2 AND c <= 4))
Since the leading columns a and b have the same value at both ends of the range, this can be optimized to (a = 1) AND (b = 2) AND ((c > 3) AND (c <= 4)).
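The Go sketch below shows the idea behind this collapse. It is a simplification made for the example: it assumes integer bounds and only handles the case where all columns after the first differing one can be expressed as a single range, which is exactly the situation in the example above.

```go
package diff

import (
	"fmt"
	"strings"
)

// collapsedWhere builds the optimized where condition for a chunk range on a
// multi-column index: leading columns whose lower and upper bounds are equal
// become equality predicates, and the first differing column becomes a range.
func collapsedWhere(cols []string, lower, upper []int) string {
	var conds []string
	for i, col := range cols {
		if lower[i] == upper[i] {
			conds = append(conds, fmt.Sprintf("(%s = %d)", col, lower[i]))
			continue
		}
		conds = append(conds, fmt.Sprintf("((%s > %d) AND (%s <= %d))", col, lower[i], col, upper[i]))
		break
	}
	return strings.Join(conds, " AND ")
}

// Example: collapsedWhere([]string{"a", "b", "c"}, []int{1, 2, 3}, []int{1, 2, 4})
// returns "(a = 1) AND (b = 2) AND ((c > 3) AND (c <= 4))".
```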
Adaptive GC
In the original version of Sync-diff-inspector, when verifying a large number of tables, the data being checked could be garbage-collected during the process, causing the verification to fail. Sync-diff-inspector now supports adaptive GC: a background goroutine is started during the initialization phase of the Diff process and keeps updating the GC safepoint TTL during the check, so that the corresponding snapshot is not garbage-collected and the verification can proceed smoothly.
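The following Go sketch shows the general shape of such a background goroutine. The updateSafePoint callback is a hypothetical stand-in for the real call that refreshes the GC safepoint (for example through the PD client); the actual function, parameters, and refresh interval in sync-diff-inspector may differ.

```go
package diff

import (
	"context"
	"time"
)

// keepSnapshotAlive periodically refreshes a GC safepoint so that the snapshot
// used for verification is not garbage-collected while the check is running.
func keepSnapshotAlive(ctx context.Context, snapshotTS uint64, ttl time.Duration,
	updateSafePoint func(ctx context.Context, safePointTS uint64, ttl time.Duration) error) {
	ticker := time.NewTicker(ttl / 2) // refresh well before the TTL expires
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // verification finished or was cancelled
		case <-ticker.C:
			// Transient errors are ignored here for brevity; a real tool would log them.
			_ = updateSafePoint(ctx, snapshotTS, ttl)
		}
	}
}
```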
Handling Float columns
According to the characteristics of the float type, its effective precision is only 6 digits. Therefore, in the checksum SQL, round(%s, 5-floor(log10(abs(column)))) is applied to float columns to take 6 significant digits as part of the checksum string. When the column value is 0, this expression evaluates to NULL; however, ISNULL(column) is also included in the checksum string and is false in that case, so 0 can still be distinguished from NULL.
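As a small illustration, a Go helper that builds this per-column expression for the checksum SQL might look like the sketch below. The exact SQL template used by sync-diff-inspector may differ; this only restates the rounding and NULL-marking idea.

```go
package diff

import "fmt"

// floatColumnExpr builds the per-column expression used in the checksum SQL for
// a FLOAT column: keep 6 significant digits and record whether the value is NULL,
// so that 0 (whose rounded expression becomes NULL) stays distinguishable from
// a real NULL.
func floatColumnExpr(column string) string {
	return fmt.Sprintf("round(%s, 5-floor(log10(abs(%s)))), ISNULL(%s)", column, column, column)
}
```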
User interaction optimization
Sync-diff-inspector displays the following information:
- The log, written to a log file.
- A progress bar in the foreground, together with the tables currently being compared.
- The verification result of each table, including the overall comparison time, the amount of data compared, the average speed, the per-table comparison result, and the per-table configuration.
- The generated repair SQL.
- Checkpoint information, recorded at a fixed time interval.
The effect is as follows:
For details, please refer to the overview.
Performance improvement
Based on the above optimizations, we conducted a performance test. Using Sysbench, we constructed 668.4 GB of data across 190 tables, each with 10 million rows. The test results are as follows:
As the test results show, compared with the original version, Sync-diff-inspector 2.0 checks significantly faster while greatly reducing the memory footprint on the TiDB side.
Future outlook
Open architecture
In Sync-diff-inspector, we define the Source abstraction. Currently it only supports data consistency verification between TiDB and TiDB, MySQL and MySQL, and MySQL and TiDB. In the future, by implementing the corresponding Source methods, it can be adapted to other databases such as Oracle and Aurora for data consistency verification.
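To give a feel for what such an abstraction looks like, here is an illustrative Go interface. The method names and signatures are assumptions for this sketch and are not the actual Source interface defined in sync-diff-inspector.

```go
package source

import "context"

// Source sketches the kind of per-database abstraction described above.
type Source interface {
	// GetTables returns the tables to be checked on this side.
	GetTables(ctx context.Context) ([]string, error)
	// Checksum computes the checksum and row count of a chunk range.
	Checksum(ctx context.Context, table, whereClause string) (sum uint64, rows int64, err error)
	// QueryRows streams the rows of a chunk range for row-by-row comparison.
	QueryRows(ctx context.Context, table, whereClause string) (RowIterator, error)
}

// RowIterator iterates over the rows of one chunk.
type RowIterator interface {
	Next() (row map[string]string, ok bool, err error)
	Close()
}
```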
Support more types
Due to their special types, some columns (such as json, bit, binary, and blob) are currently not supported by sync-diff-inspector. They need special handling in the checksum SQL statement; for example, for json columns, the value of each key appearing in the json needs to be extracted with json_extract.
A more aggressive binary checksum
The new version of sync-diff-inspector uses binary checksums to reduce the amount of data compared row by row, but as soon as it finds that both chunks produced by a bisection contain inconsistent data, it stops bisecting and compares rows. This approach is pessimistic: it assumes the chunk may contain many inconsistencies at that point. However, in practice the application scenarios of sync-diff-inspector generally involve only a small amount of inconsistent data. A more aggressive approach is to keep bisecting, ending up with an array of minimum-size chunks (by default 3,000 rows) that contain inconsistencies, and then compare only those chunks row by row.