BR chose to perform backup by scanning at the Transaction KV level. The core of the backup is thus an MVCC Scan distributed across multiple TiKV nodes: simple, crude, but effective. It inherits many of TiKV's advantages by design: it is distributed, scales horizontally, and is flexible (any key range can be backed up, and in the future any version not yet garbage-collected).

Compared with the earlier practice of using mydumper for SQL-layer backups, BR backs up and restores far more efficiently: it eliminates the SQL-layer overhead, backs up indexes as well, and produces only sorted SST files, which greatly speeds up restore.

The strength of BR was demonstrated in a previous article ( https://pingcap.com/zh/blog/cluster-data-security-backup ). This article describes the concrete implementation of BR's backup side in detail. In short, BR applies "operator push-down" to backup: the task is sent to TiKV through a gRPC interface, and TiKV then dumps the data to external storage by itself.

The basic process of BR

Interface

To distinguish it from ordinary MVCC Scan requests, TiKV provides a dedicated Backup interface. Unlike a normal read request, it does not return data to the client; instead, it writes the data it reads directly to the designated external storage (ExternalStorage):

service Backup {
    // The TiKV node that receives a backup request backs up every region
    // within the requested range for which it is the leader, and streams
    // the results back to the client (one item in the stream per region).
    rpc backup(BackupRequest) returns (stream BackupResponse) {}
}

// NOTE: some unimportant fields are omitted.
message BackupRequest {
    // The range to back up: [start_key, end_key).
    bytes start_key = 2;
    bytes end_key = 3;
    // The MVCC versions to back up.
    uint64 start_version = 4;
    uint64 end_version = 5;
    
    // Rate limit. To stay consistent with restore, the limit applies to
    // the stage that saves backup files.
    uint64 rate_limit = 7;
    
    // The target storage of the backup.
    StorageBackend storage_backend = 9;
    // Compression of the backup -- an optimization that trades CPU for I/O bandwidth.
    CompressionType compression_type = 12;
    int32 compression_level = 13;
    // Backups support encryption.
    CipherInfo cipher_info = 14;
}

message BackupResponse {
    Error error = 1;
    // A backup request returns multiple responses; each carries a completed
    // sub-range. This information can be used to track backup progress.
    bytes start_key = 2;
    bytes end_key = 3;
    // The list of backup files for this range, used for tracking during restore.
    repeated File files = 4;
}

Client

According to the databases and tables the user asks to back up, the BR client uses TiDB's interfaces to compute the key ranges ("ranges") that need backing up. The rules are (a sketch follows the list):

  1. A range is generated from all data keys of each table (all keys with the t{table_id}_r prefix).
  2. A range is generated from all index keys of each index (all keys with the t{table_id}_i{index_id} prefix).
  3. If the table is partitioned (which means it may have multiple table IDs), ranges are generated for each partition according to the rules above.
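
For illustration, here is a minimal Go sketch of this range computation, under simplified assumptions: encodeInt stands in for TiDB's actual tablecodec integer codec, and the Range type and helper names are hypothetical, not BR's API.

package main

import (
    "encoding/binary"
    "fmt"
)

// Range is a half-open key interval [Start, End).
type Range struct {
    Start, End []byte
}

// encodeInt encodes an int64 into 8 memcomparable bytes by flipping the
// sign bit, a simplified stand-in for TiDB's integer codec.
func encodeInt(v int64) []byte {
    var b [8]byte
    binary.BigEndian.PutUint64(b[:], uint64(v)^(1<<63))
    return b[:]
}

// tablePrefix builds the t{table_id} prefix. A fresh slice is returned on
// every call so the appends below cannot clobber each other.
func tablePrefix(tableID int64) []byte {
    return append([]byte("t"), encodeInt(tableID)...)
}

// recordRange covers all data keys of one table: the t{table_id}_r prefix.
func recordRange(tableID int64) Range {
    return Range{
        Start: append(tablePrefix(tableID), []byte("_r")...),
        End:   append(tablePrefix(tableID), []byte("_s")...), // "_r" + 1: the next prefix
    }
}

// indexRange covers all keys of one index: the t{table_id}_i{index_id} prefix.
func indexRange(tableID, indexID int64) Range {
    return Range{
        Start: append(append(tablePrefix(tableID), []byte("_i")...), encodeInt(indexID)...),
        End:   append(append(tablePrefix(tableID), []byte("_i")...), encodeInt(indexID+1)...),
    }
}

func main() {
    // For a partitioned table, repeat this for every partition's table ID.
    fmt.Printf("record range: %x - %x\n", recordRange(42).Start, recordRange(42).End)
    fmt.Printf("index  range: %x - %x\n", indexRange(42, 1).Start, indexRange(42, 1).End)
}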

To obtain maximum parallelism, the BR client sends backup requests over these ranges to all TiKV nodes in parallel.

Of course, backup is not always smooth sailing. We inevitably run into problems while backing up: network errors, TiKV's flow control kicking in (Server is Busy), or Key is Locked errors. When that happens, we must narrow the failed ranges and resend the requests (otherwise we would redo work that has already been done...).

The process of picking the right ranges to resend after a failure is called "fine-grained backup" in BR. Specifically:

  1. In the preceding "coarse-grained backup", every time the BR client receives a BackupResponse, it stores its [start_key, end_key) as a range in an interval tree (you can think of it as a simple BTreeSet<(Vec<u8>, Vec<u8>)>).
  2. "Coarse-grained backup" ignores any retryable error, but the corresponding range is then not stored in the interval tree, leaving a "hole" in the tree. The pseudocode for these two steps is as follows.
func Backup(tree RangeTree) {
    // ... 
    for _, resp := range responses {
        if resp.Success {
            tree.Insert(resp.StartKey, resp.EndKey)  
        }
    }
}

// An example:
// When backing up the range [1, 5),
// if [1, 2), [3, 4) and [4, 5) succeeded while [2, 3) failed,
// the tree would be: { [1, 2), [3, 4), [4, 5) },
// and the range [2, 3) becomes a "hole" in it.
//
// Since the range tree is sorted, it is easy to find
// all holes in O(n) time, where n is the number of ranges.
  1. After the "coarse-grained backup" is over, we traverse the interval tree, find all the "holes" in it, and perform "fine-grained backups" in parallel:
  • Find all regions that contain the hole.
  • Initiate a Backup RPC in the corresponding range of the region to their leader.
  • After success, put the corresponding range into the interval tree.
  4. After a round of "fine-grained backup", if holes remain in the interval tree, go back to step 3; after a certain number of failed retries, report an error and exit. (A sketch of the hole-finding step follows.)
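
The hole-finding step itself is straightforward. A minimal sketch in the same Go-flavored pseudocode, assuming the completed ranges are kept sorted by start key and non-overlapping (Range reuses the shape from the sketch above; bytes is Go's standard library package):

// findHoles returns the sub-ranges of [start, end) that are not covered by
// the sorted, non-overlapping ranges backed up so far. These holes are
// exactly what "fine-grained backup" retries.
func findHoles(done []Range, start, end []byte) []Range {
    var holes []Range
    cur := start
    for _, r := range done { // done is sorted by Start
        if bytes.Compare(cur, r.Start) < 0 {
            holes = append(holes, Range{Start: cur, End: r.Start})
        }
        if bytes.Compare(cur, r.End) < 0 {
            cur = r.End
        }
    }
    if bytes.Compare(cur, end) < 0 {
        holes = append(holes, Range{Start: cur, End: end})
    }
    return holes
}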

After the "backup" process above completes, BR uses the Coprocessor interface to ask TiKV to compute checksums of the tables the user specified.

This checksum serves as a reference during restore, and it is also compared against the per-file checksums TiKV generates while backing up; this comparison step is called "fast checksum".
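
Conceptually, "fast checksum" is just an aggregation over the per-file checksums. A hedged sketch follows: the field names mirror the checksum fields carried by the File message in the backup protobuf, but the code is a simplification, not BR's actual implementation.

// fileChecksum mirrors the checksum fields carried by each backup File.
type fileChecksum struct {
    Crc64Xor   uint64
    TotalKvs   uint64
    TotalBytes uint64
}

// fastChecksum folds the per-file checksums of one table into a single
// triple, which is then compared with the coprocessor checksum of that table.
func fastChecksum(files []fileChecksum) (crc, kvs, bytes uint64) {
    for _, f := range files {
        crc ^= f.Crc64Xor // XOR is order-independent: files may arrive in any order
        kvs += f.TotalKvs
        bytes += f.TotalBytes
    }
    return
}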

During the "backup" process, BR will collect the backup table structure, backup timestamp, and generated backup file through the interface of TiDB, and store it in a "backupmeta". This is an important reference when recovering.

TiKV

To achieve resource isolation and reduce resource contention, backup-related tasks run in a dedicated thread pool. The threads in this pool are called "bkwkr" (an extremely abstract abbreviation of "backup worker").

After TiKV receives the gRPC backup request, the BackupRequest is converted into a Task.

Then, from the start_key and end_key of the Task, TiKV builds a structure called Progress, which breaks the Task's large range into many sub-ranges by:

  1. Scanning the regions that fall within the range.
  2. For each region where the current TiKV is the leader, emitting that region's range as a backup subtask.

Progress is a "pull model" interface: forward . Subsequently, each Backup Worker created by TiKV will call this interface in parallel to obtain a set of Regions to be backed up, and then perform the following three steps:

  1. Through the RaftKV interface, the Backup Worker performs a Raft read and finally obtains a snapshot of the corresponding region at the backup TS. ("Get Snapshot")
  2. On this snapshot, the Backup Worker scans the consistent version at backup_ts: here we scan out Percolator transactions. To make restore easier, we prepare two temporary buffers, "default" and "write", corresponding to the Default CF and the Write CF of TiKV's Percolator implementation, respectively. ("Scan")
  3. The raw key-values of the two CFs in each scanned transaction are first flushed into the corresponding buffers. Once the whole region has been backed up (or, for an oversized region, once the backup file is split mid-way), the two buffers are saved as files to external storage, and their ranges and sizes are recorded. Finally, a BackupResponse is returned to BR. ("Save")
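
Putting the pull model and the three steps together, each Backup Worker conceptually runs a loop like the one below. This is Go-flavored pseudocode for what is Rust inside TiKV; the names (Forward, getSnapshot, scanTxns, save, respond, batchSize) are descriptive, not the real identifiers.

func backupWorker(progress *Progress, storage ExternalStorage, req BackupRequest) {
    for {
        regions := progress.Forward(batchSize) // pull the next batch of leader regions
        if len(regions) == 0 {
            return // all sub-ranges have been handed out: this worker is done
        }
        for _, r := range regions {
            snap := getSnapshot(r, req.EndVersion)       // 1. Raft read at the backup TS
            defaultBuf, writeBuf := scanTxns(snap, req)  // 2. consistent MVCC scan into two buffers
            files := save(storage, defaultBuf, writeBuf) // 3. flush the buffers as SST files
            respond(BackupResponse{StartKey: r.StartKey, EndKey: r.EndKey, Files: files})
        }
    }
}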

To guarantee file-name uniqueness, a backup file's name contains the store ID of the current TiKV, the region ID being backed up, a hash of the range's start key, and the CF name.
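
For illustration, such a name can be assembled roughly as follows; the exact format string here is hypothetical, only the four components are what TiKV guarantees:

// The real format in TiKV may differ; this only shows the four components.
h := sha256.Sum256(startKey) // hash of the range's start key
name := fmt.Sprintf("%d_%d_%x_%s.sst", storeID, regionID, h[:8], cf)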

Backup files use RocksDB's Block Based SST format: it natively supports file-level checksums and compression, and the files have the potential to be ingested directly and quickly during restore.

External storage is a general storage abstraction built to adapt to multiple backup targets. It is somewhat like the VFS in Linux, but much simplified: it supports only simple operations such as saving and downloading whole files. It currently adapts to the mainstream cloud storage services, and supports serialization to and deserialization from URLs. For example, s3://some-bucket/some-folder specifies backing up to the some-folder directory of the some-bucket bucket on S3.
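
In Go terms, the abstraction is roughly an interface like the following. This is a simplified sketch: BR's real ExternalStorage interface has more methods, but the spirit is the same.

// ExternalStorage is a tiny "VFS": it only knows how to save and
// load whole files on some backup target (local disk, S3, GCS, ...).
type ExternalStorage interface {
    // WriteFile saves the whole content under the given name.
    WriteFile(ctx context.Context, name string, data []byte) error
    // ReadFile downloads the whole file.
    ReadFile(ctx context.Context, name string) ([]byte, error)
    // URI returns the serialized form, e.g. "s3://some-bucket/some-folder".
    URI() string
}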

Challenges and optimizations of BR

With the basic process above, BR's pipeline works end to end: much like operator push-down, BR pushes the backup task down to TiKV, making reasonable use of TiKV's resources to achieve distributed backup.

We encountered many challenges along the way; this section discusses them.

BackupMeta and OOM

As mentioned above, BackupMeta stores all the metadata of a backup: the table schemas, the index of all backup files, and so on. Now imagine a large enough cluster: say, one hundred thousand tables holding dozens of terabytes in total, each table with several indexes.

Such a cluster can eventually produce millions of backup files, and BackupMeta can then grow to several GB. Moreover, because of how Protocol Buffers works, we have to read the entire file into memory before unmarshalling it into Go objects, which roughly doubles the peak memory usage. In some extreme environments this can lead to OOM.

To alleviate this problem, we designed a layered BackupMeta format. Simply put, BackupMeta is split into two kinds of files, index files and data files, in a structure similar to a B+ tree:

Concretely, we added these fields to BackupMeta, pointing to the root nodes of the corresponding "B+ trees":

message BackupMeta {
    // Some fields omitted...
    // An index to files contains data files.
    MetaFile file_index = 13;
    // An index to files contains Schemas.
    MetaFile schema_index = 14;
    // An index to files contains RawRanges.
    MetaFile raw_range_index = 15;
    // An index to files contains DDLs.
    MetaFile ddl_indexes = 16;
}

MetaFile is the node of this "B+ tree":

// MetaFile describes a multi-level index of data used in backup.
message MetaFile {
    // A set of files that contains a MetaFile.
    // It is used as a multi-level index.
    repeated File meta_files = 1;
    
    // A set of files that contains user data.
    repeated File data_files = 2;
    // A set of files that contains Schemas.
    repeated Schema schemas = 3;
    // A set of files that contains RawRanges.
    repeated RawRange raw_ranges = 4;
    // A set of files that contains DDLs.
    repeated bytes ddls = 5;
}

A MetaFile can take two forms: a "leaf node" that carries the actual data (with the last four fields filled in), or an internal node that points to the next level of nodes through meta_files. Here, File is a reference to another file on external storage, carrying basic information such as the file name.

In the current implementation, to avoid the complexity of B-tree-style split and merge operations, we use only a one-level index, and store the table schemas and file metadata in small files of 128 MB each. This is enough to avoid the OOM problems caused by BackupMeta.
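
With a one-level index, reading the file list back no longer requires holding the whole BackupMeta in memory: the reader loads the small root node, then streams the leaf files one at a time. A hedged sketch, reusing the ExternalStorage sketch above and the generated protobuf types (proto is github.com/golang/protobuf/proto; collectDataFiles is an illustrative name, not BR's API):

// collectDataFiles walks the one-level file index: the root MetaFile lists
// leaf MetaFiles stored on external storage, and each leaf carries the real
// data_files. Only one leaf (at most ~128 MB) is resident in memory at a time.
func collectDataFiles(ctx context.Context, s ExternalStorage, root *MetaFile, visit func(*File)) error {
    for _, leaf := range root.MetaFiles { // references to the leaf nodes
        data, err := s.ReadFile(ctx, leaf.Name)
        if err != nil {
            return err
        }
        var node MetaFile
        if err := proto.Unmarshal(data, &node); err != nil {
            return err
        }
        for _, f := range node.DataFiles {
            visit(f)
        }
    }
    return nil
}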

GC, GC never changes

Because a backup scan spans a long time, the whole process is inevitably affected by GC.
Not only BR: other ecosystem tools hit GC problems too. For example, TiCDC needs incremental scans; if the starting version has already been garbage-collected, it cannot replicate consistent data.

In the past, our solution was to ask users to adjust the GC lifetime manually, but this often produced a "gotcha on first use" effect: a user happily started a backup, went off to do other things, and a few hours later found that the backup had failed because of GC...

This does nothing good for users' mood. To make the ecosystem tools more pleasant to use, PD provides a feature called "Service GC Safepoint". Each service can set a "safepoint" through PD's interface, and TiDB guarantees that historical versions at or after the safepoint will not be garbage-collected. To prevent a BR that exits unexpectedly from blocking the cluster's GC forever, the safepoint also carries a TTL: if it is not refreshed within that time, PD removes the service safepoint.

For BR, it only needs to set this safepoint to the backup TS. For reference, the safepoint is named "br-<random UUIDv4>" and has a TTL of five minutes.
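
With PD's Go client this looks roughly like the sketch below. UpdateServiceGCSafePoint is the PD client call; the refresh loop and error handling here are a simplification of what BR actually does.

import (
    "context"
    "errors"
    "time"

    "github.com/google/uuid"
    pd "github.com/tikv/pd/client"
)

// keepGCSafepoint registers a service safepoint at backupTS and refreshes it
// until ctx is cancelled, so that GC never collects versions BR still needs.
func keepGCSafepoint(ctx context.Context, cli pd.Client, backupTS uint64) error {
    serviceID := "br-" + uuid.New().String()
    const ttl = int64(5 * 60) // seconds: BR's five-minute TTL
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()
    for {
        // PD returns the minimal service safepoint; if it is already past
        // backupTS, the versions we need may have been GC-ed: fail fast.
        min, err := cli.UpdateServiceGCSafePoint(ctx, serviceID, ttl, backupTS)
        if err != nil {
            return err
        }
        if min > backupTS {
            return errors.New("backup TS is behind the GC safepoint")
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}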

Backup compression

A backup running at full speed can generate considerable traffic: see the relevant section of the article linked at the beginning, which showed off BR's numbers, for details.

If you devote enough cores to backup, you quickly hit the NIC's limit (without compression, for example, about 4 cores are enough to saturate a gigabit NIC). To prevent the NIC from becoming the bottleneck, we introduced compression during backup.

We reuse the compression support in RocksDB's Block Based Table format, with zstd as the default. Compression costs extra CPU but reduces the load on the NIC; when the NIC is the bottleneck, it significantly speeds up backup.

Rate limiting and isolation

As mentioned above, to reduce the impact on other tasks, all backup requests execute in a dedicated thread pool.

Even so, a backup that consumes too much CPU inevitably affects other loads in the cluster: mainly because backup occupies many cores and interferes with the scheduling of other tasks, and it also reads the disk heavily, slowing down the flushing of writes to disk.

To reduce resource usage, BR provides a rate-limiting mechanism. When the user starts BR with the --ratelimit parameter, the third step on the TiKV side, "Save", is rate-limited, which in turn throttles the preceding steps.

One point worth noting: the size of the backup data is often much smaller than the space the cluster actually occupies, because the backup contains only one replica and a single MVCC version of the data. The ratelimit applies to the Save stage, so it limits the speed of writing backup data.

On the "server" side, you can also limit the current by adjusting the size of the thread pool. This parameter is called backup.num-threads. Considering that we allow user-side current limiting, its default value is very high: 75% of all CPUs . If you need to perform more thorough current limiting on the server side, you can modify this parameter. For reference, an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz CPU can generate 10M zstd-compressed SST files per second.

Summary

With the Service GC Safepoint, we solved the "hard to use" problem caused by manually adjusting GC.

With the newly designed BackupMeta, we solved the OOM problem in massive-table scenarios.
With backup compression, rate limiting and other measures, we made BR both lighter on the cluster and faster (even though you may not get both at once).

Overall, BR is a "third way" between "physical backup" and "logical backup": compared with tools such as mydumper or dumpling, it eliminates the extra cost of the SQL layer; compared with hunting for a consistent physical-layer snapshot across a distributed system, it is easier to implement and more nimble. At the current stage, it is a disaster-recovery backup solution well suited to TiDB.

