From Principle to Practice: A Hands-On Guide to Dual-Cluster Disaster Recovery for Data Warehouses

Abstract: This article explains how to analyze problems in dual-cluster disaster recovery by walking through the dual-cluster architecture, its log structure, and the analysis steps.

This article is shared from the Huawei Cloud community post "From Principle to Practice: A Hands-On Guide to Dual-Cluster Disaster Recovery for Data Warehouses". Original author: Puyol.

Dual cluster principle

The disaster recovery solution of GaussDB (DWS) is a dual-cluster synchronization architecture: two independent clusters periodically synchronize data to achieve disaster tolerance. Data is currently synchronized by periodically taking incremental backups and restoring them through Roach, the GaussDB (DWS) backup and restore tool. The dual-cluster framework is a complex distributed system, so when a problem occurs, locating it quickly and accurately and restoring service is an urgent task. This is even more pressing on the cloud. This article explains how to analyze dual-cluster disaster recovery problems by walking through the dual-cluster architecture, its log structure, and the analysis steps.

We start with the principles of the dual-cluster deployment scheme, covering the deployment architecture and the main parameters as background, so that the problem analysis method is easier to follow.

Introduction to Architecture

1. Logical architecture example

The figure below is a schematic diagram of a homogeneous dual-cluster deployment. Both the primary and standby clusters are 3c3d. The dual-cluster framework script is deployed on the master node of the primary cluster, where it performs regular backups, and the master node of the standby cluster regularly restores the backup sets. The base data must first be backed up in full; subsequent backups are incremental.
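The "full first, then incremental" rule can be written as a small decision function. This is only an illustrative sketch: the function name and the list of completed backup types are hypothetical and are not part of the dual-cluster framework script.

```python
from typing import List


def next_backup_type(completed_backups: List[str]) -> str:
    """Return the kind of backup to run next.

    An incremental backup needs a full backup as its base, so the
    first backup in a series must be a full one.
    """
    if "full" not in completed_backups:
        return "full"
    return "incremental"


history: List[str] = []
for _ in range(3):
    history.append(next_backup_type(history))
print(history)  # ['full', 'incremental', 'incremental']
```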

2. Deployment Architecture

The figure below shows the deployment architecture that corresponds to the logical diagram above. It involves three programs: the dual-cluster synchronization script (SyncDataToStby.py) and the backup programs (GaussRoach.py and gs_roach).

Calling relationship on the backup side: SyncDataToStby.py -> GaussRoach.py -> gs_roach

The calling relationship on the recovery side: SyncDataToStby.py -> GaussRoach.py -> gs_roach

Understanding this call relationship matters directly for analyzing problems.

SyncDataToStby.py is the entry point of the whole dual-cluster call chain and controls the normal operation of the dual cluster. Normally it runs as a long-lived resident process. If it exits abnormally, a crontab job in the background restarts the dual-cluster script: crontab -> SyncDataToStby.py -> GaussRoach.py -> gs_roach
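The restart mechanism follows a common watchdog pattern: a job scheduled by cron checks whether the long-lived process is still alive and relaunches it if not. The sketch below illustrates that pattern only and is not the actual crontab job shipped with GaussDB (DWS). The helper names are hypothetical, the start command is a placeholder supplied by the caller, and it assumes the standard Linux pgrep utility is available.

```python
import subprocess
from typing import Callable, List


def is_running(script_name: str) -> bool:
    """Return True if some process has script_name in its command line."""
    # pgrep -f matches the full command line and exits with 0 on a match.
    result = subprocess.run(
        ["pgrep", "-f", script_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ensure_running(
    script_name: str,
    start_cmd: List[str],
    check: Callable[[str], bool] = is_running,
    launch: Callable[[List[str]], object] = subprocess.Popen,
) -> bool:
    """Relaunch the script if it is not running; return True if relaunched."""
    if check(script_name):
        return False  # Still alive, nothing to do.
    launch(start_cmd)  # Start it again without waiting for it to finish.
    return True
```

A cron entry would call something like ensure_running("SyncDataToStby.py", start_cmd) at a fixed interval, where start_cmd is whatever command line the deployment actually uses. The check and launch parameters are injectable so the logic can be exercised without touching real processes.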

Introduction to main parameters

Identify the problem

System logs are a powerful tool for understanding how a system operates and for reconstructing what happened when a problem occurred. Problem analysis for the dual cluster likewise depends on log analysis. First, let's look at the logs that the dual cluster produces:

Log directory structure

Based on the logical and deployment diagrams in the previous section, the following figure shows the log location that corresponds to each program, so that the right log can be found for each one.

As shown in the figure above, the dual-cluster logs are also stored under $GAUSSLOG, in their own directory, "roach". This directory is also the log path for backup and restore. The subdirectories are introduced below from the top of the call chain to the bottom.

Frame directory

Stores the logs generated by SyncDataToStby.py, covering dual-cluster scheduling, backup set cleanup, status display, and parsing of the configuration file and command-line parameters.

Controller directory

Stores the logs generated by GaussRoach.py, covering backup and recovery preparation, parsing of backup and recovery parameters, backup cluster processing, error handling, and so on.

Agent directory

Stores the logs generated by the gs_roach tool, covering operations such as gs_roach connecting to gaussdb/gtm/cm to initiate a backup or restore and generating or restoring a backup set.

What gs_roach does: on the backup side, it packs the data files of cn/dn/gtm/cm into backup files in order and generates the backup set meta-information file; on the recovery side, it decompresses the backup set files into the data directories of the corresponding cn/dn/gtm/cm according to the meta-information file.
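The mapping between the three programs and their log directories can be captured in a few lines. The sketch below assumes that the frame, controller, and agent directories sit directly under $GAUSSLOG/roach, which is how the description above reads. The function name and the fallback path used when GAUSSLOG is unset are placeholders for illustration.

```python
import os
from pathlib import Path
from typing import Dict

# Program -> log subdirectory, listed from the top of the call chain down.
LAYER_LOG_DIRS = {
    "SyncDataToStby.py": "frame",
    "GaussRoach.py": "controller",
    "gs_roach": "agent",
}


def roach_log_dirs(gausslog: str) -> Dict[str, Path]:
    """Return the assumed log directory of each program under gausslog/roach."""
    root = Path(gausslog) / "roach"
    return {tool: root / subdir for tool, subdir in LAYER_LOG_DIRS.items()}


# "/path/to/gausslog" is only a placeholder for when GAUSSLOG is not set.
for tool, path in roach_log_dirs(os.environ.get("GAUSSLOG", "/path/to/gausslog")).items():
    print(f"{tool:20s} -> {path}")
```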

Positioning steps

1) Determine whether the problem is on the backup side or the recovery side: check the Sync log on the master node of the dual cluster to identify the faulty module.
2) Determine the layer at which the error occurred. The dual cluster executes as a chain of calls from upper layers to lower layers, and the logs follow the same time order, so work through the layers in this sequence:

crontab -> SyncDataToStby.py -> GaussRoach.py -> gs_roach
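The layer-by-layer check described in the steps above can be sketched as a small script that walks the log directories in call order and reports the first error found in each. The sketch makes several assumptions that should be adjusted to the real environment: that log files end in .log, that failing lines contain the literal marker "ERROR", and that the three directories sit directly under the roach log root. The function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Log directories in call order: SyncDataToStby.py, GaussRoach.py, gs_roach.
CALL_ORDER = ["frame", "controller", "agent"]


def first_error_per_layer(
    roach_log_root: Path, marker: str = "ERROR"
) -> List[Tuple[str, Optional[str]]]:
    """Return (layer, first matching line or None) for each layer, top down."""
    findings = []
    for layer in CALL_ORDER:
        hit = None
        layer_dir = roach_log_root / layer
        if layer_dir.is_dir():
            # Assumption: file names sort roughly in chronological order.
            for log_file in sorted(layer_dir.rglob("*.log")):
                with log_file.open(errors="replace") as fh:
                    for line in fh:
                        if marker in line:
                            hit = f"{log_file.name}: {line.strip()}"
                            break
                if hit:
                    break
        findings.append((layer, hit))
    return findings


def deepest_failing_layer(
    findings: List[Tuple[str, Optional[str]]]
) -> Optional[str]:
    """Return the lowest layer in the call chain that logged an error."""
    failing = [layer for layer, hit in findings if hit is not None]
    return failing[-1] if failing else None
```

Errors in a lower layer usually surface again in the layers above it, so the deepest layer that reports an error is a reasonable place to start reading. That is a heuristic, not a guarantee.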

Each module logs its process in detail, and each specific problem has to be analyzed case by case. In general, problems fall into the following categories:

1) Configuration errors: user, environment variable file

2) Backup cluster path permission issues

3) Backup failure because the cluster status is not Normal

4) Restore failure caused by node failure or backup set corruption

Subsequent articles will describe the problem location steps in detail, by module and error type.

Summary

The dual-cluster disaster recovery function of GaussDB (DWS) is an independent, complex distributed system that involves three layers of tools, which can make problem location confusing. To locate a problem, first understand the architecture and operating mechanism, then analyze the logs on the relevant node in the order in which the layers are called. Later articles will introduce typical problems and their fixes, module by module.

To learn more about GaussDB (DWS), search for "GaussDB DWS" on WeChat and follow the official account, which shares the latest and most complete PB-level data warehouse technology. Learning materials are also available through the account.
