Abstract: This article introduces the implementation mechanisms of the various standby rebuild methods, analyzes their application scenarios, and offers suggestions on using the new parameters so that you can get the best results.

This article is shared from the HUAWEI CLOUD community post "First code and then read: an article introducing the implementation mechanism of various methods of standby machine reconstruction". Original author: Victor_NK.

1 Requirements introduction

A GaussDB (DWS) instance will inevitably encounter faults during operation that leave it in an error state or unable to start. When that happens, the standby needs to be rebuilt. The main purpose of the standby rebuild function is to repair single points of failure in an instance. It can also be used for initialization during cluster installation, metadata synchronization during cluster expansion, and warm standby replacement after a node failure. This article covers how each rebuild method works, which scenarios it suits, and how to use the new parameters to get the best results.

2 Design plan

2.1 Function classification

Depending on the implementation method, standby rebuild can be divided into full rebuild and incremental rebuild.

To rebuild the standby, run the gs_ctl build tool on the machine to be repaired. During the rebuild, the standby must establish a connection with the primary DN to exchange data. On the gs_ctl command line, the -b=mode parameter specifies the mode used to rebuild the standby DN. Four values of mode are currently supported:

1. full: full rebuild. The data directory of the standby DN is resynchronized by fetching the differences between the full images of the primary and the standby.
2. fullcleanup: full rebuild. The data directory of the standby DN is resynchronized from a full image. Unlike full mode, the standby DN's data directory is cleaned up before synchronization, with the configuration files retained. The primary then sends everything in its data directory except the configuration files to the standby.
3. incremental: incremental rebuild. The WAL log is parsed to obtain the data that differs between the primary and standby DNs, and the standby DN is repaired incrementally.
4. auto (-b not specified): an incremental rebuild is attempted first; if it fails, a full rebuild is performed.
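The fallback order of the auto mode can be sketched in a few lines of Python. This is only an illustration of the control flow described in the list above, not the gs_ctl implementation: `auto_build` and the `run_build` callback are hypothetical names, and the callback stands in for whatever actually performs a rebuild in the given mode.

```python
from typing import Callable, List, Tuple


def auto_build(run_build: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Try an incremental rebuild first and fall back to a full rebuild.

    run_build is a hypothetical callback that performs one rebuild in the
    given mode and returns True on success.
    Returns the mode that succeeded and the list of modes attempted.
    """
    attempted: List[str] = []
    for mode in ("incremental", "full"):
        attempted.append(mode)
        if run_build(mode):
            return mode, attempted
    raise RuntimeError(f"all rebuild attempts failed: {attempted}")


if __name__ == "__main__":
    # Simulate a standby whose incremental rebuild fails.
    print(auto_build(lambda mode: mode == "full"))
```

Injecting the rebuild step as a callback keeps the sketch independent of any real command line, so the fallback order can be checked without a cluster.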

In a production environment, which method to use depends on the requirements and the application scenario.

2.2 Application scenarios

Standby rebuild covers several functional scenarios: DN build DN, CN build CN, and CN build DN. The characteristics of each build scenario are as follows:
image.png
Table 1 Application scenarios of Build

To choose the best repair method, you need to understand how each mode works and set the relevant parameters appropriately for the application scenario.

3 Implementation process

3.1 fullcleanup mode: based on push method

Fullcleanup mode is a push mode: the primary controls the data flow ("I give you what I have"). Regardless of how badly or how widely the standby is damaged, the primary transfers all data in its data directory, except the configuration files, to the standby, and the standby is started once the rebuild completes.

The main working process is shown in Figure 1:
image.png

Figure 1 Working process of fullcleanup build

The defining feature of fullcleanup mode build is that the standby is rebuilt completely. The drawbacks are just as clear. The primary must copy all data and Xlog files of the instance, which consumes a lot of network bandwidth and has some impact on running business. The standby's pre-repair data is not managed atomically, so once the process fails the standby cannot be restored to its original state. If the build fails because of an occasional network fault, all the work done so far is discarded and the build must start again from scratch.

Therefore, this method is the most conservative option and a last resort, to be chosen only when all other rebuild methods have failed.

3.2 full mode: pull method to obtain differences based on file verification

Full mode is a pull mode: the standby controls the data flow ("I take what I need") and only has to fill in the data that differs between the primary and the standby. The prerequisite is that the standby knows how it differs from the primary. Full mode approaches this directly through file comparison. The primary and the standby each traverse their local data directories at the same time, using multiple threads for better performance, and collect information such as whether each file exists, its size, and its file checksum. From this information, the resulting File Map Lists are repeatedly merged and filtered to obtain the smallest set of differing files, which reduces the number of files and the amount of data copied. The standby can either back up its own differing files as a backup set (for reliability) or simply pull the differing files from the primary and update in place (for performance). The standby can also use multiple threads to pull files from the primary concurrently.
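A minimal Python sketch of the file-comparison idea follows. It is an illustration under stated assumptions, not the actual build code: the function names are hypothetical, SHA-256 stands in for whatever file verification the product really uses, and the traversal is single-threaded for brevity, whereas the article describes a multi-threaded one on both sides.

```python
import hashlib
import os
from typing import Dict, List, Tuple

# Relative path -> (size in bytes, checksum). A stand-in for the File Map List.
FileMap = Dict[str, Tuple[int, str]]


def build_file_map(root: str) -> FileMap:
    """Walk a data directory and record the size and checksum of every file."""
    file_map: FileMap = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(full, root)
            file_map[rel] = (os.path.getsize(full), digest.hexdigest())
    return file_map


def diff_file_maps(primary: FileMap, standby: FileMap) -> Tuple[List[str], List[str]]:
    """Return (files to pull from the primary, files to remove on the standby)."""
    to_pull = sorted(p for p, meta in primary.items() if standby.get(p) != meta)
    to_remove = sorted(p for p in standby if p not in primary)
    return to_pull, to_remove
```

Files that are identical on both sides never appear in either list, which is where the saving over fullcleanup mode comes from.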

The main working process is shown in Figure 2:
image.png

Figure 2 Full build working process

Full mode build makes full use of the files that already exist on the standby, reduces the amount of data synchronized, and makes backup, recovery, and parallel control easy. The cost is some time spent on local I/O and computation. When there is no resource bottleneck, it is usually several times faster than fullcleanup mode, and it is also more reliable. It is the first choice for a full rebuild.

3.3 incremental mode: pull method based on Xlog analysis to obtain differences

Incremental mode is another pull method. It suits inconsistencies caused by logs, such as a dual-primary situation in which both the primary and the standby act as primary. Incremental rebuild uses the files of the primary DN and the WAL log of the standby DN to repair the standby DN's files and file blocks, following the principle of "return the excess, make up the shortfall": what the standby has in excess is removed, and what it lacks is fetched from the primary. Compared with a full rebuild, the copy granularity is finer, the volume of data and WAL logs copied is smaller, and the cost is lower.
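The block-level idea can be sketched as follows. This is a simplified model under explicit assumptions, not the actual algorithm: the `WalRecord` fields, the notion of a `fork_lsn` (the last log position both nodes agree on), and the shape of the returned plan are all hypothetical, chosen only to illustrate how the standby's own WAL can identify what needs repair.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class WalRecord(NamedTuple):
    """Hypothetical, simplified view of one WAL record on the standby."""
    lsn: int      # log sequence number of the record
    path: str     # data file the record modifies
    block: int    # block number inside that file


def plan_incremental(
    primary_files: Set[str],
    standby_files: Set[str],
    standby_wal: Iterable[WalRecord],
    fork_lsn: int,
) -> Dict[str, List]:
    """Work out what the standby must change to match the primary.

    Blocks the standby modified after fork_lsn are suspect and are
    overwritten from the primary; files only the standby has are removed;
    files only the primary has are copied in full.
    """
    touched: Set[Tuple[str, int]] = {
        (rec.path, rec.block) for rec in standby_wal if rec.lsn > fork_lsn
    }
    return {
        "overwrite_blocks": sorted(t for t in touched if t[0] in primary_files),
        "remove_files": sorted(standby_files - primary_files),
        "copy_files": sorted(primary_files - standby_files),
    }
```

Only the blocks named in the standby's divergent WAL are transferred, which is why the copied volume is much smaller than in a full rebuild.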

The main working process is shown in Figure 3:
image.png

Figure 3 Work process of incremental build

Incremental mode build applies only to inconsistencies caused by logs, such as the dual-primary scenario. Faults such as damaged data files or a lost data directory on the standby cannot be repaired by incremental rebuild. In those cases, the standby can be repaired with a full rebuild instead.

4 Summary and considerations

As the performance and reliability of the standby rebuild function have improved, several new parameters have been added to build. They are worth understanding and keeping in mind when you use it.

4.1 -T THREAD-NUMBER

Applies to the pull-based full and incremental modes. It specifies the number of connections that the standby opens to the primary, which are used for multi-threaded concurrent computation and file pulling.

The default value of 4 normally gives good performance. When resources permit, a higher thread count is recommended. Note, however, that while more threads improve build performance and shorten the rebuild time, they also increase the consumption of network connections, CPU, and network I/O. Take the available resources into account and set the value accordingly.
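The effect of the thread count can be illustrated with a small Python sketch that pulls files through a fixed-size worker pool. It is a generic illustration, not the build tool's code: `pull_files` and the `fetch` callback are hypothetical, and `fetch` stands in for one connection retrieving one file from the primary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

DEFAULT_THREADS = 4  # the default thread count mentioned in the article


def pull_files(
    paths: Iterable[str],
    fetch: Callable[[str], bytes],
    threads: int = DEFAULT_THREADS,
) -> Dict[str, bytes]:
    """Fetch every path with at most `threads` concurrent workers.

    fetch is a hypothetical callback that retrieves one file's content.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order, so results line up with path_list.
        contents = list(pool.map(fetch, path_list))
    return dict(zip(path_list, contents))
```

Each worker occupies one connection for the duration of a fetch, so raising the thread count increases throughput only as long as the network, CPU, and disks on both sides have headroom.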

4.2 -u

Applies to the full and incremental modes, which support atomic operation. With this option, atomic restore and backup are not performed during the build. It suits scenarios where disk space is insufficient or no backup is needed.

Incremental build is atomic by default: atomic restore and backup are strictly performed during the process, errors that occur along the way can be recovered from, and once the cause of the error is removed the build can be run again. Because full build involves a large number of data files, it is not atomic by default. It still attempts an atomic restore but ignores whether the restore succeeds, and it performs a backup only when the files the standby pulls from the primary amount to no more than 20% of the total. It is best to state explicitly whether or not you need atomic behavior when using it.
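The default backup behavior described above can be summarized as a small decision function. This is a reading of the paragraph, not the tool's logic: `should_back_up` is a hypothetical name, `skip_atomic` models the effect of -u, and the 20% threshold is assumed here to be measured by file count, which the article does not specify.

```python
def should_back_up(
    mode: str,
    pulled_files: int,
    total_files: int,
    skip_atomic: bool = False,
) -> bool:
    """Decide whether a build backs up the standby's files first.

    skip_atomic models -u: no atomic restore or backup at all.
    The 20% threshold is assumed to be a ratio of file counts.
    """
    if mode not in ("full", "incremental"):
        raise ValueError(f"mode does not support atomic operation: {mode}")
    if skip_atomic:
        return False
    if mode == "incremental":
        return True  # incremental build is atomic by default
    if total_files <= 0:
        return False
    # Integer arithmetic avoids floating-point edge cases at exactly 20%.
    return pulled_files * 100 <= total_files * 20
```

For example, a full build that pulls 20 of 100 files would back up under this reading, while one that pulls 21 would not.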

4.3 -B --backupdir=DIR

Applies to the full and incremental modes, which support atomic operation. It makes the build restore from and back up to a backup set at the specified path. It suits scenarios with high reliability requirements: the backup is kept atomically until the build succeeds, so the original standby data is not lost.

Note that the free disk space at the backup set path specified by the user should be greater than the size of the DN instance's data directory. The backup path should be empty or contain only this node's backup set. Unrelated data will invalidate the backup set, and regenerating the backup set risks losing the original data copy. Apart from the pg_rewind_bak path, the backup set path specified by the user should be isolated from the working path of the standby instance.
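The space and isolation requirements can be checked before starting a build. The following Python sketch is a pre-flight check under stated assumptions, not part of the tool: the function names are hypothetical, and the pg_rewind_bak exception mentioned above is deliberately not modeled, so any path inside the data directory is rejected.

```python
import os
import shutil


def directory_size(path: str) -> int:
    """Total size in bytes of all regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def backup_dir_is_usable(backup_dir: str, data_dir: str) -> bool:
    """Check isolation from the data directory and sufficient free space."""
    backup_abs = os.path.realpath(backup_dir)
    data_abs = os.path.realpath(data_dir)
    # Reject a backup path that is the data directory or lies inside it.
    if os.path.commonpath([backup_abs, data_abs]) == data_abs:
        return False
    free = shutil.disk_usage(backup_abs).free
    return free > directory_size(data_abs)
```

Resolving both paths with `os.path.realpath` before comparing them prevents a symbolic link from hiding the fact that the backup path lives inside the instance's working directory.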
