Foreword
The relay log is similar to the binary log: a set of files containing database change events, plus related index and meta files. For details, refer to the official documentation. Enabling the relay log for an upstream in DM has the following advantages over not enabling it:

Without the relay log, each subtask connects to the upstream database to pull binlog data, which puts considerable pressure on the upstream. With the relay log enabled, DM only needs a single connection to pull the binlog to local storage, and each subtask then reads the local relay log data.
The upstream database usually expires the binlog after some time, or actively purges it to free space. Without the relay log, if DM's replication progress lags behind, replication fails once the binlog is purged and only a full migration can recover; with the relay log enabled, binlog data is pulled and written to local storage in real time regardless of the current replication progress, which effectively avoids this problem.
However, in DM versions <= v2.0.7, enabling the relay log has the following problems:

Data replication latency increases significantly compared with not enabling the relay log. The following table shows the results of a single-task benchmark; the average latency rises noticeably. In the table, the latency percentile columns are prefixed with a ".".
321.jpg

CPU consumption increases when relay is enabled. (Because of the higher latency, in some simple scenarios such as a single task the resource usage may actually be lower than without the relay log; but as the number of tasks grows, CPU consumption with relay enabled increases.)
Because of these problems, we made a number of performance optimizations to DM's relay log in the new version.

Current relay implementation
Before introducing the specific optimizations, let's briefly review how relay is currently implemented in DM. For the detailed implementation, refer to DM source code reading series (6): implementation of relay log; this article will not go into depth here.

The current relay module can be divided into two parts, relay writer and relay reader, whose structure is as follows:

322.jpg

Relay writer
The relay writer does the following three things in sequence:

Uses the binlog reader to read binlog events from the upstream MySQL/MariaDB;

Converts the read binlog events using the binlog transformer;

Stores the converted binlog events in the local relay directory as relay log files using the binlog writer.

Relay reader
With relay enabled, the syncer obtains binlog events through the relay reader. The relay reader mainly does the following:

Reads the binlog files in the relay directory and sends the events to the syncer;
When the end of a file is reached, periodically (currently every 100ms) checks whether the current binlog file size or the meta file content has changed; if so, it either continues reading (the binlog file grew) or switches to a new file (the meta file changed). A simplified sketch of this polling loop is shown below.
As can be seen from the above, the relay reader and relay writer are independent of each other and interact only through the binlog, meta, and index files in the relay directory.
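To make the current polling behaviour concrete, here is a minimal sketch of such a check loop. It is only an illustration of the mechanism, not DM's actual code: the function and callback names are invented for this example, and the real implementation inspects the meta file's content rather than just its modification time.

```go
package relaysketch

import (
	"os"
	"time"
)

// pollRelayDir is a simplified sketch of the periodic check described above.
// It watches the size of the current binlog file and the modification time of
// the meta file every 100ms and invokes the corresponding callback.
func pollRelayDir(binlogPath, metaPath string, onFileGrown, onFileSwitched func()) {
	var lastSize int64
	var lastMetaMod time.Time

	ticker := time.NewTicker(100 * time.Millisecond) // check interval used by DM <= v2.0.7
	defer ticker.Stop()

	for range ticker.C {
		if fi, err := os.Stat(binlogPath); err == nil && fi.Size() > lastSize {
			lastSize = fi.Size()
			onFileGrown() // new events appended: continue reading the same file
		}
		if fi, err := os.Stat(metaPath); err == nil && fi.ModTime().After(lastMetaMod) {
			lastMetaMod = fi.ModTime()
			onFileSwitched() // meta changed: the writer rotated to a new binlog file
		}
	}
}
```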

Test environment description
Before describing the optimizations, here is the environment used during the optimization work:

The upstream is MySQL, the version is 5.7.35-log;

The downstream is a single-instance TiDB, version v5.2.1 (server version string 5.7.25-TiDB-v5.2.1);

DM uses 1 master and 1 worker

The Latency benchmark version is the master branch on 2021-10-14 (commit hash is d2dc22d)
The CPU benchmark version is the relay refactoring branch on 2021-11-01 (commit hash is 9f7ce1d6)
In small-scale tests (<= 4 tasks), MySQL/TiDB/DM run on the same host for convenience; the host has 8 cores and 16 GB of memory;

In larger-scale tests (> 4 tasks), MySQL, TiDB, and DM each run on a separate host; all hosts have 8 cores and 16 GB of memory;

Migration delay is measured with a self-updating time column on the downstream (a rough sketch of this method is shown below); for details, refer to the third method in the article on the various ways to calculate the delay of DM replicating data to TiDB.
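As a rough illustration of this measurement approach (not the exact setup from the linked article), the downstream can be polled for a table that has a self-updating timestamp column; the table and column names below (sbtest.sbtest1, updated_at) are placeholders.

```go
package delaysketch

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL protocol driver, also works for TiDB
)

// measureDelay assumes the replicated table has a column such as
//   updated_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6)
// so that every replicated write refreshes it. The replication delay is then
// roughly NOW() - MAX(updated_at) as observed on the downstream TiDB.
func measureDelay(downstream *sql.DB) (time.Duration, error) {
	var seconds float64
	err := downstream.QueryRow(
		`SELECT TIMESTAMPDIFF(MICROSECOND, MAX(updated_at), NOW(6)) / 1000000 FROM sbtest.sbtest1`,
	).Scan(&seconds)
	if err != nil {
		return 0, err
	}
	return time.Duration(seconds * float64(time.Second)), nil
}
```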

Latency optimization
As you can see from the "Current relay implementation" section above, there are two points that may affect latency:

The relay reader's current timing-check implementation itself has some impact on latency: in the worst case, a binlog event may be delayed by an extra 100ms before it can be replicated downstream;
The relay writer writes to disk and the relay reader then reads from disk. Does this write-then-read path have a significant impact on latency?
Investigation shows that Linux has a page cache mechanism (macOS has something similar): reading recently written files does not go through the disk but reads the cache in OS memory, so in theory the impact is limited.
After researching and testing these problems, we arrived at two candidate solutions:

Cache the binlog that the relay writer has recently written in memory; if the relay reader requests this binlog, it reads it directly from memory;
The relay reader still reads from files, and the relay writer notifies the relay reader whenever it writes a new event.
Option 1 has to switch between reading from memory and reading from files depending on the downstream write speed, which is relatively complicated to implement. Moreover, because of the OS-level page cache, adding another cache layer in the application has only a limited effect on latency.

For option 2, we did some preliminary tests: when the frequency of the relay reader's check is increased, the latency with relay enabled is basically the same as without relay. We also investigated MySQL's relay log and found that it reads files as well. Therefore, we chose option 2.

The implementation is relatively simple: add a Listener to the relay writer, notify the Listener whenever a new binlog event is written (by sending a message to a channel), and change the relay reader's periodic check to listening on that channel, as sketched below.
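A minimal sketch of this notify-on-write pattern is shown below, assuming a Listener interface on the writer side and a channel that the reader selects on; the type and method names are illustrative and do not reproduce DM's actual interfaces.

```go
package relaysketch

// Listener is notified by the relay writer whenever a new binlog event has
// been written to the relay log.
type Listener interface {
	OnEvent() // called after each event is flushed to the relay log
}

// channelListener turns write notifications into messages on a channel that
// the relay reader can select on instead of polling every 100ms.
type channelListener struct {
	ch chan struct{}
}

func newChannelListener() *channelListener {
	// A buffer of 1 is enough: the reader only needs to know "something new
	// was written since I last checked", not how many events were written.
	return &channelListener{ch: make(chan struct{}, 1)}
}

func (l *channelListener) OnEvent() {
	select {
	case l.ch <- struct{}{}:
	default: // a notification is already pending, no need to queue another
	}
}

// Notified returns the channel the relay reader listens on when it reaches
// the end of the current relay log file.
func (l *channelListener) Notified() <-chan struct{} {
	return l.ch
}
```

With this in place, the reader blocks on the channel when it reaches the end of the current relay log file instead of waking up every 100ms.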

The following figure shows the latency results in the 4-table, 4-task test. As can be seen, the latency with relay enabled is very close to that without relay:

323.jpg
CPU optimization
After the latency optimization, we also tested CPU usage with the relay log enabled and found that it is still noticeably higher. From the test results for this scenario (figure below), CPU consumption increases significantly after relay is enabled, and the spikes become larger:

324.jpg
Using Go's built-in pprof to capture a CPU profile, we can see from the figure below that the syncer, relay reader, and relay writer account for a large share of the CPU time. Comparing this against the code logic, we found that:

The relay reader uses go-mysql's ParseFile interface, which reopens the file on every call and re-reads the leading FORMAT_DESCRIPTION event; this corresponds to the first blue mark in the figure below;
During the latency optimization, since the relay reader and writer are independent of each other, we kept the implementation simple and only used the channel to signal that a new binlog write had happened; but the newly written binlog may already have been consumed by the previous read, which led to many useless meta and index file checks.
325.jpg
In response to the above problems, we have made the following optimizations:

Use go-mysql's ParseReader to eliminate the cost of repeatedly opening and re-reading the file (see the sketch after this list);
Refactor the relay module to integrate the relay writer and reader so that the two can communicate directly. After receiving a notification through the channel, the relay reader checks whether the file currently being written by the relay writer is the same file it is reading (i.e. whether the file is in the actively-written state) and obtains the position written so far. With this information, useless meta and index file checks can be avoided.
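The ParseReader change from the first item can be sketched as follows. The snippet assumes the ParseReader/ParseFile signatures of recent go-mysql-org/go-mysql versions and is not DM's actual relay reader; in particular, the handling of the 4-byte binlog magic header is simplified.

```go
package relaysketch

import (
	"io"
	"os"

	"github.com/go-mysql-org/go-mysql/replication"
)

// readRelayFile sketches the ParseReader-based approach: instead of calling
// ParseFile for every read (which reopens the file and re-reads the leading
// FORMAT_DESCRIPTION event), open the file once, skip the binlog magic
// header, and feed the handle to ParseReader so parsing proceeds from the
// current offset.
func readRelayFile(path string, onEvent replication.OnEventFunc) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Binlog/relay files start with a 4-byte magic header; ParseReader
	// expects the reader to be positioned at the first event.
	if _, err := f.Seek(4, io.SeekStart); err != nil {
		return err
	}

	parser := replication.NewBinlogParser()
	return parser.ParseReader(f, onEvent)
}
```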
As the figure below shows, CPU usage drops considerably after optimization, but the spikes are still fairly large:

326.jpg

Since the rate of write events generated by sysbench in our test is fairly stable, DM runs no special scheduled code of its own, and Go is a compiled language with garbage collection, we suspected the spikes mainly come from GC. However, this is not obvious from the CPU profile, as shown in the figure below:

327.jpg

Enabling the GC log with GODEBUG=gctrace=1 (see the figure below), we can see that:

After enabling relay, the resident memory is nearly doubled (239 MB -> 449 MB), and the total heap space is also nearly doubled.
Investigation showed this is caused by a memory leak in the TiDB component embedded in DM, which has not been resolved for the time being.
After enabling relay, the CPU occupied by GC greatly increases, especially the background GC time and idle GC time.
328.jpg
The following figure is the flame graph of the alloc_space part of the heap profile after the above optimization:

Note: pprof's heap profile covers the program's entire run so far, not a particular time window, so the figure below also shows some long-lived (resident) memory allocations.

329.jpg

By examining the heap profile and comparing it with the code, we found the following points that can be optimized:

When go-mysql parses binlog events from files, a new bytes.Buffer is allocated for each event and keeps growing during the read. After optimization, a buffer pool is used to reduce the overhead of repeated allocation and growth (see the first sketch after this list);
The heartbeat event in the local streamer uses time.After. Code using this interface is more concise, but the channel it creates is only released when the timer fires; moreover, the local streamer's read-event path is called at high frequency, and creating a timer and channel on every call is expensive. After optimization, a reusable timer is used instead (see the second sketch after this list);
Relay used a timeout context when reading events from the upstream, which created an extra channel on every read; in this scenario the timeout is not necessary, so the timeout context was removed after optimization;
The relay reader and relay writer wrote debug logs without checking the log level first, creating some Field objects on every call. Each object is small, but these calls are frequent, so the overhead adds up. After optimization, a log-level check is added to the frequently called debug logs (see the third sketch after this list).
Note: the zap logger that DM uses for logging performs well; for infrequent calls, it is generally fine to call log.Debug directly.
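For the first item, the generic buffer-pool pattern looks roughly like the sketch below; this is the standard sync.Pool idiom, not the actual patch applied to go-mysql.

```go
package relaysketch

import (
	"bytes"
	"sync"
)

// eventBufPool illustrates the buffer-pool idea: instead of allocating a new
// bytes.Buffer for every parsed binlog event (and letting it grow again each
// time), buffers are taken from and returned to a pool so that their backing
// arrays are reused across events.
var eventBufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func getEventBuffer() *bytes.Buffer {
	buf := eventBufPool.Get().(*bytes.Buffer)
	buf.Reset() // clear any leftover data from the previous event
	return buf
}

func putEventBuffer(buf *bytes.Buffer) {
	eventBufPool.Put(buf)
}
```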
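For the second item, the fix is to allocate one time.Timer and re-arm it before each wait instead of calling time.After in the hot loop; the function and parameter names in the sketch are invented for this example.

```go
package relaysketch

import "time"

// waitForEvent waits for either a writer notification or the heartbeat
// timeout. time.After would allocate a new timer and channel on every call
// and only release the channel when the timer fires; reusing a single timer
// avoids that per-call allocation on the high-frequency read path.
func waitForEvent(notify <-chan struct{}, timer *time.Timer, heartbeat time.Duration) (timedOut bool) {
	// Drain a possible previous expiry so that Reset is safe to call.
	if !timer.Stop() {
		select {
		case <-timer.C:
		default:
		}
	}
	timer.Reset(heartbeat)

	select {
	case <-notify:
		return false // a new binlog event was written
	case <-timer.C:
		return true // heartbeat interval elapsed with no new event
	}
}
```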
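For the last item, the point is to avoid building zap Field values when the debug level is disabled. Whether DM uses a plain level comparison or zap's Check API, the pattern looks like this sketch (message and field names are illustrative):

```go
package relaysketch

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// logEventDebug only constructs the zap Fields when debug logging is actually
// enabled, so the frequently called read path pays almost nothing when the
// log level is info or higher.
func logEventDebug(logger *zap.Logger, file string, pos int64) {
	if ce := logger.Check(zapcore.DebugLevel, "read binlog event"); ce != nil {
		ce.Write(zap.String("file", file), zap.Int64("position", pos))
	}
}
```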
The results after optimization are shown below; CPU usage drops a lot, and the spikes are much smaller:

3210.jpg

The following figure shows the test results in the 20-table, 20-task scenario:

3211.jpg

Remaining issues and future work
After the above optimizations, the latency gap between enabling and not enabling relay is very small, and the increase in CPU usage is also relatively low. There are still some points that can be optimized, which we expect to add gradually in subsequent versions:

go-mysql uses io.CopyN to read files. This function allocates a small object internally, which still has some impact on GC under high-frequency use, but the impact is small and it was not changed this time;
Some optimizations that apply both with and without relay have not been done this time, such as the timeout context created when the streamer reads events;
At present, multiple readers reading the same file still incur considerable overhead. Possible directions for further optimization:
Reduce the number of reads, for example by having one reader read the file and the others read from memory, or by adding the memory cache envisioned earlier;
Merge tasks that share the same downstream to reduce the number of tasks.

