Rebuilding the Data Warehouse Standby DN: Quickly Repair a DN Single Point of Failure

Abstract: Single points of failure cannot be avoided in large-scale distributed systems. When a DN suffers a single point of failure, what recovery methods are available, and how is recovery performed? This article focuses on repairing a DN single point of failure with gs_ctl build.

This article is shared from the HUAWEI CLOUD community post "HUAWEI CLOUD data warehouse standby machine DN reconstruction, fast repair DN single point of failure!"; original author: welblupen.

1. Technical background

GaussDB (DWS) uses a primary/standby/secondary high-availability architecture for DNs. In a distributed environment, the complete cluster data is spread across multiple DN groups by sharding, and each DN group is responsible for one data shard. A group consists of a primary DN, a standby DN, and a secondary DN (the "slave standby"). The primary and the standby each hold a complete copy of the shard's data. The secondary normally stores no data; it only holds data temporarily while the standby is down. After a failed standby comes back, it must connect to the primary and copy data and xlog logs from it to keep the cluster data consistent.
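The roles inside one DN group can be modelled in a few lines of Python. This is only an illustrative sketch of the description above: the class DnGroup and the method write_targets are hypothetical names, not part of GaussDB (DWS).

```python
from dataclasses import dataclass


@dataclass
class DnGroup:
    """Hypothetical model of one DN group (one data shard)."""
    shard_id: int
    standby_up: bool = True

    def write_targets(self) -> list[str]:
        # The primary always receives the data. The second copy goes to the
        # standby, or temporarily to the secondary while the standby is down.
        second = "standby" if self.standby_up else "secondary"
        return ["primary", second]


group = DnGroup(shard_id=0)
normal_targets = group.write_targets()

group.standby_up = False  # simulate a standby failure
degraded_targets = group.write_targets()
```

In this model the secondary only appears as a target while the standby is down, which is why the standby has to catch up from the primary after it returns.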

2. Scenarios where the standby DN needs to be rebuilt

2.1. After a single point of failure on the primary, the standby fails over and becomes the new primary, the original primary is demoted, and the cluster enters a degraded state. After the original primary recovers, the WAL CRC check between the primary and standby may fail. The CM system detects this state and automatically repairs it by rebuilding the standby.

2.2. After a single point of failure on the standby, the standby's state becomes unknown and the cluster is degraded. After the standby recovers from the failure, it needs to be rebuilt to synchronize its data with the primary.

3. Types of standby DN reconstruction

3.1. Incremental reconstruction: gs_ctl build -b incremental -Z datanode

use:

An incremental build can repair the primary/standby log divergence (log fork) caused by common host or instance failures, and it can also repair some cases of lost data files. If a host exception occurs during the rebuild, you can roll back manually to avoid data loss.

process:

Obtain the differing files: parse the xlog to determine which files differ between the primary and standby DNs.
Backup and recovery: the differing files are backed up and restored strictly atomically. Errors during this step are recoverable, and once the cause is removed the build can simply be run again (it is re-entrant).
File transfer: the standby creates the specified number of threads (1-16) to pull the differing files from the primary.
Completion: finish the incremental reconstruction and wait for the xlog to be flushed to disk.

analysis:

Incremental reconstruction uses the xlog to work out which files differ between the primary and standby DNs and sends only those files to the standby. It rebuilds the standby quickly and cheaply, without damaging the data the standby already holds.
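The file transfer step, in which the standby pulls the differing files with 1 to 16 threads, can be sketched as follows. This is a minimal illustration and not the gs_ctl implementation: pull_diff_files, the fetch callback, and the in-memory primary_files dictionary are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def pull_diff_files(diff_files, fetch, threads=4):
    """Pull each file named in diff_files using a bounded pool of threads."""
    if not 1 <= threads <= 16:
        raise ValueError("threads must be between 1 and 16")
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order, so names and contents line up.
        contents = list(pool.map(fetch, diff_files))
    return dict(zip(diff_files, contents))


# Hypothetical in-memory "primary", used only to make the sketch runnable.
primary_files = {"base/1": b"aaa", "base/2": b"bbb", "base/3": b"ccc"}

pulled = pull_diff_files(["base/1", "base/3"], primary_files.__getitem__, threads=2)
```

Only the files in the difference list are transferred, which is what keeps the incremental mode cheap.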

3.2. Full reconstruction: gs_ctl build -b full -Z datanode

use:

A full reconstruction of the standby can repair most scenarios involving damaged or lost data and logs, but it takes longer than an incremental build.

process:

Obtain the differing files: a hardware-accelerated CRC-32C algorithm computes the CRC value of each file on the primary DN. The same computation is run locally on the standby, and the two results are compared to produce the list of differing files.
Backup and recovery: this step is not atomic by default. Atomic recovery is attempted, but the build continues whether or not it succeeds.
File transfer: the standby creates the specified number of threads (1-16) to pull the differing files from the primary.
Completion: finish the full reconstruction and wait for the xlog to be flushed to disk.

analysis:

Full reconstruction takes the primary DN's files as the reference and verifies the standby DN's files against them. If a block of a standby file is inconsistent, the primary sends that block to the standby. Compared with full cleanup and reconstruction, less data is copied and fewer WAL logs are transferred, so the cost is moderate.
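The checksum comparison can be illustrated with a small Python sketch. Two assumptions to note: Python's standard library has no CRC-32C, so zlib.crc32 (plain CRC-32) is used here as a stand-in, and the function names and the tiny block size are chosen purely for illustration.

```python
import zlib


def block_checksums(data: bytes, block_size: int) -> list[int]:
    """Checksum every block of data (zlib CRC-32 stands in for CRC-32C)."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def differing_blocks(primary: bytes, standby: bytes, block_size: int) -> list[int]:
    """Indexes of primary blocks the standby is missing or holds differently."""
    p = block_checksums(primary, block_size)
    s = block_checksums(standby, block_size)
    return [i for i, crc in enumerate(p) if i >= len(s) or s[i] != crc]


# Toy data: block 1 is corrupted on the standby and block 2 is missing.
primary_data = b"AAAABBBBCCCC"
standby_data = b"AAAAXXXX"
to_resend = differing_blocks(primary_data, standby_data, block_size=4)
```

Only the blocks listed in to_resend would have to be sent by the primary, which matches the moderate cost described above.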

3.3. Full cleanup and reconstruction: gs_ctl build -b fullcleanup -Z datanode

use:

The difference from full mode is that the data directory of the standby DN is cleaned up before synchronization. This mode can repair most scenarios involving damaged or lost data and logs, but it takes longer than the other modes.

process:

Clean up the standby's data files: clear the standby's data directory, keeping the configuration files.
Transfer the full image to the standby: the primary uses a single thread to send its entire data directory, except for the configuration files, to the standby.
Completion: finish the full reconstruction and wait for the xlog to be flushed to disk.

analysis:

In a full cleanup and reconstruction, the standby empties its data directory while keeping its configuration files, then sends a full reconstruction request to the primary. The primary sends its entire data directory, except for the configuration files, to the standby. The standby is started once the rebuild finishes. This is the most expensive mode.
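The cleanup step can be pictured as "delete everything in the data directory except the configuration files". The sketch below runs against a throwaway temporary directory. The file names in KEEP and the function clean_data_dir are assumptions for illustration; the article does not list which configuration files gs_ctl actually preserves.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed configuration file names, for illustration only.
KEEP = {"postgresql.conf", "pg_hba.conf"}


def clean_data_dir(data_dir: Path, keep=KEEP) -> list[str]:
    """Remove every top-level entry of data_dir whose name is not in keep."""
    removed = []
    for entry in sorted(data_dir.iterdir()):
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


# Demonstration on a temporary directory, never on a real data directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "postgresql.conf").write_text("# config\n")
    (root / "base").mkdir()
    (root / "base" / "1").write_bytes(b"data")
    (root / "pg_xlog").mkdir()
    removed = clean_data_dir(root)
    remaining = sorted(p.name for p in root.iterdir())
```

After the cleanup only the configuration file is left, so everything else has to come from the primary, which explains why this mode is the slowest.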

4. Summary

The main purpose of standby DN reconstruction is to repair a single point of failure. Depending on how it is carried out, reconstruction is divided into incremental reconstruction, full reconstruction, and full cleanup and reconstruction, all of which interact with the primary DN. When a DN suffers a single point of failure, the operator should choose the reconstruction mode that fits the actual extent of the damage and the resources available, and use it to rebuild the standby's data.
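One way to act on this advice is to try the cheapest mode first and escalate only when it fails. The sketch below assumes that policy, which is not prescribed by the article, and builds the command lines from the flags shown in section 3. The function names and the injected run callback are hypothetical, and a real invocation will likely need further options (such as the data directory) that the article does not show.

```python
from typing import Callable, Optional

# Modes from section 3, ordered from cheapest to most expensive.
MODES = ("incremental", "full", "fullcleanup")


def build_command(mode: str) -> list[str]:
    """Return the argv for one build mode, using only the flags from the article."""
    if mode not in MODES:
        raise ValueError(f"unknown build mode: {mode}")
    return ["gs_ctl", "build", "-b", mode, "-Z", "datanode"]


def rebuild_with_escalation(run: Callable[[list[str]], bool]) -> Optional[str]:
    """Try each mode in order and return the first one for which run() succeeds."""
    for mode in MODES:
        if run(build_command(mode)):
            return mode
    return None


# Dry run with a fake runner that pretends only the full build succeeds.
attempted = []


def fake_run(argv: list[str]) -> bool:
    attempted.append(argv[3])  # argv[3] is the mode passed to -b
    return argv[3] == "full"


used_mode = rebuild_with_escalation(fake_run)
```

In a real script the run callback could wrap subprocess.run and return True on a zero exit code; the fake runner here only records which modes were attempted.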

For more information about GaussDB (DWS), search for "GaussDB DWS" on WeChat and follow the official account, which shares the latest PB-level data warehouse technology.
