This article introduces the principles and usage of the backup and recovery feature of AnalyticDB for PostgreSQL.

1. Background

AnalyticDB for PostgreSQL (hereafter ADB PG) is a cloud-native data warehouse built by the Alibaba Cloud database team on the PostgreSQL kernel (referred to as PG). ADB PG has distinctive technical advantages in business scenarios such as real-time interactive data analysis, HTAP, ETL, and BI report generation.

For an enterprise-level data warehouse product, the importance of data security is self-evident. Backup and recovery is the basic means of ensuring data security, and an important guarantee that ADB PG can restore the database in an emergency. As the name suggests, backup and recovery backs up the database so that it can be restored when necessary. Currently, the backup and recovery feature of ADB PG is used in the following scenarios:

  • Restoring an instance from backup data when data is destroyed or the instance becomes unavailable due to a system failure or human error.
  • Quickly cloning an identical instance from an existing instance.
  • Changing the specifications of the source instance while keeping the number of nodes the same.

The following sections introduce the principles and usage of ADB PG backup and recovery.

2. Introduction

ADB PG is a distributed database with an MPP, horizontally scalable architecture. An ADB PG instance consists of one or more coordinator nodes (Master) and multiple compute nodes (Compute Node). The coordinator node receives user requests, produces a distributed execution plan and dispatches it to the compute nodes, then gathers the execution results and returns them to the client; the compute nodes handle parallel computation, analysis, and data storage. Data can be distributed among compute nodes randomly, by hash, or by replication. The following figure shows the architecture of ADB PG:

[Figure: ADB PG architecture]

The physical backup and recovery feature of ADB PG is built on cluster-wide basic backups and log backups. It can back up the data of every node while the distributed database continues to serve traffic, guarantee data consistency, and, when needed, restore the distributed database to the moment of the backup.

A basic backup is a complete copy of all data in the database. It compresses a full data snapshot of the cluster and stores it on separate offline storage media. The cluster does not block user reads and writes during a basic backup, so the logs generated during the backup are also backed up to ensure the backup's integrity.

Log backup (also called incremental backup) means backing up the log files generated by the cluster to offline storage media. The log files record the user's DML and DDL operations on the database. With one complete basic backup plus continuous log backups, a new cluster can be restored to any covered historical point in time, ensuring data security for that period.

ADB PG guarantees backup and recovery with an RPO (recovery point objective) as low as 10 minutes.

3. Principles

Before covering the backup and recovery principles of ADB PG in full, we briefly introduce the PITR (Point-in-Time Recovery) mechanism of stand-alone PG. The backup and recovery mechanism of ADB PG builds on stand-alone PG's PITR and adds a distributed data consistency guarantee.

(1) PITR mechanism of stand-alone PG

WAL log:

The PostgreSQL database records all changes to the data (including DDL and DML operations) in the WAL (Write-Ahead Log). The WAL can be regarded as an infinitely growing append-only file; PG splits the log data into multiple files of a fixed size for storage. Every data modification made by a transaction is appended to the WAL and assigned a unique LSN (Log Sequence Number), and when the transaction commits, PG ensures that its WAL records have been persisted.

The purpose of these log files is to let the database "replay" the WAL after a crash to recover data that had not yet been persisted but whose transactions had already committed.
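As an illustration of the mechanism above, here is a toy sketch (plain Python, not PostgreSQL internals; all names are invented) of an append-only log with LSNs whose replay rebuilds lost state:

```python
# Minimal sketch: a WAL as an append-only list of records, each tagged
# with a monotonically increasing LSN. After a "crash", committed but
# unflushed changes are rebuilt by replaying the log in LSN order.

class MiniWAL:
    def __init__(self):
        self.records = []          # append-only log
        self.next_lsn = 0

    def append(self, change):
        """Append a change record and return its LSN."""
        lsn = self.next_lsn
        self.records.append((lsn, change))
        self.next_lsn += 1
        return lsn

def replay(wal, state=None):
    """Rebuild table state by applying every logged change in order."""
    state = dict(state or {})
    for lsn, (key, value) in wal.records:
        state[key] = value
    return state

wal = MiniWAL()
wal.append(("a", 1))   # e.g. an INSERT
wal.append(("a", 2))   # e.g. an UPDATE of the same row
wal.append(("b", 3))

# The data pages were lost, but the WAL survived: replay recovers the state.
print(replay(wal))     # {'a': 2, 'b': 3}
```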

Recovery point:

Given that the WAL can be "replayed", another question arises: when should the replay stop? This is what a restore point solves.

A restore point is effectively a marker written into the WAL that records a log location. When PG replays the log, it decides whether to stop the "replay" by checking whether it has reached this marker.

The following SQL creates a marker named t1 in the WAL:

postgres=# select pg_create_restore_point('t1');
LOG:  restore point "t1" created at 0/2205780
STATEMENT:  select pg_create_restore_point('t1');
 pg_create_restore_point
-------------------------
 0/2205780
(1 row)

When the database replays the WAL sequentially, it checks whether the current log record contains this restore point name, and if so, stops the replay. In addition, PG also supports recovery to a specified point in time, transaction ID, or LSN.
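The stop-at-marker behavior can be sketched as follows (a hypothetical simplification, not PG's recovery code; `replay_until` and the record format are invented):

```python
# Toy model: a restore point is just a marker record in the log, and
# replay stops once the named marker is reached.

def replay_until(records, target_name):
    """Apply data records in order; stop when the named marker is seen."""
    state = {}
    for record in records:
        if record[0] == "restore_point":
            if record[1] == target_name:
                break              # reached the mark: stop replaying
            continue               # some other restore point: keep going
        _, key, value = record
        state[key] = value
    return state

log = [
    ("data", "x", 1),
    ("restore_point", "t1"),       # created via pg_create_restore_point
    ("data", "x", 99),             # changes after the mark are not applied
]
print(replay_until(log, "t1"))     # {'x': 1}
```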

Basic backup and incremental backup:

A basic backup is a complete copy of the database's data. You can use the pg_basebackup tool to take a basic backup of a stand-alone PG. The backup data can be saved locally or on other offline storage media (such as OSS).

$ pg_basebackup -D pg_data_dir/ -p 6000
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

Incremental backup means backing up the generated WAL files. In PG, the database parameter archive_command specifies how WAL data is backed up: whenever PG finishes a WAL file, it tries to archive it by executing the archive_command. For example, the following command sends the log file to a specified OSS bucket.

archive_command="ossutil cp %p oss://bucket/path/%f"

[Figure: full backup and incremental backup of stand-alone PG]

Note that a basic backup does not block reads and writes of the database, so the WAL corresponding to data updates made during the backup must also be backed up to guarantee consistency on recovery.

PITR recovery:

To restore the database, first download the basic backup data and start the cluster from it, then download the log file backups and "replay" them to the specified restore point. In stand-alone PG, the recovery target can be a transaction ID, a timestamp, a WAL sequence number (LSN), or a restore point name.

[Figure: stand-alone PG PITR recovery]
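To make the recovery side concrete: in stand-alone PG (before version 12) the replay target is declared in a recovery.conf file. A hypothetical configuration matching the ossutil archiving example above might look like this (the bucket path and restore point name are assumptions):

```
restore_command = 'ossutil cp oss://bucket/path/%f %p'
recovery_target_name = 't1'
```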

(2) ADB PG's distributed and consistent backup and recovery mechanism

As a distributed database, ADB PG uses two-phase commit to manage distributed transactions. Simply copying stand-alone PG's PITR mechanism would cause data inconsistency. Consider the following scenario: distributed transactions are issued in the order A, B, C, but for various reasons (such as network delay, node load, or explicit commit), the commit order may differ on each node, as shown below:

  • The Master commits in the order A, B, C
  • Compute Node 1 commits in the order A, C, B
  • Compute Node 2 commits in the order B, C, A

[Figure: differing commit orders across nodes]

If a recovery point is created during this process and you later restore to it, the states of the nodes in the cluster will obviously be inconsistent after recovery.
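A toy simulation (invented Python, not ADB PG code) makes the inconsistency concrete: replaying each node's own log up to the same marker leaves the nodes with different sets of committed transactions:

```python
# Each node logs commits in its own local order; a naive recovery point
# lands between commits at a different position on each node.

def state_at_marker(log):
    """Replay a node-local log until the recovery-point marker."""
    state = set()
    for entry in log:
        if entry == "RESTORE_POINT":
            break
        state.add(entry)           # entry = a committed transaction id
    return state

# Commit order differs per node; the marker is written mid-stream.
node1 = ["A", "C", "RESTORE_POINT", "B"]
node2 = ["B", "RESTORE_POINT", "C", "A"]

print(state_at_marker(node1))      # {'A', 'C'}
print(state_at_marker(node2))      # {'B'}  -- inconsistent with node1
```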

Two-phase transaction commit lock and consistent recovery point:

To solve this problem, we introduced a two-phase commit lock. A distributed transaction commit acquires the lock in SHARED mode, while creating a recovery point acquires it in EXCLUSIVE mode. Therefore, if distributed transactions are waiting to commit on any node, creating a recovery point must wait until the distributed transactions on all nodes have committed.

[Figure: two-phase commit lock and consistent recovery point]

This fundamentally solves the problem above: a recovery point can no longer be created while a distributed transaction is still committing. With the two-phase commit lock, the state of every node at the created recovery point is guaranteed to be consistent, so we call a recovery point created in ADB PG a consistent recovery point.
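The lock's behavior can be sketched roughly as follows (a hypothetical Python model, not ADB PG's implementation; the class and method names are invented):

```python
import threading

class SharedExclusiveLock:
    """Toy model: commits hold the lock SHARED, restore points EXCLUSIVE."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0           # in-flight distributed commits
        self._excl = False         # a recovery point is being created

    def acquire_shared(self):      # called when a distributed commit starts
        with self._cond:
            self._cond.wait_for(lambda: not self._excl)
            self._shared += 1

    def release_shared(self):      # called when the commit finishes everywhere
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):   # called to create a consistent recovery point
        with self._cond:
            self._cond.wait_for(lambda: self._shared == 0 and not self._excl)
            self._excl = True

    def release_exclusive(self):
        with self._cond:
            self._excl = False
            self._cond.notify_all()

lock = SharedExclusiveLock()
lock.acquire_shared()              # commit of a transaction begins
lock.release_shared()              # ...and completes on every node
lock.acquire_exclusive()           # only now may the recovery point be written
lock.release_exclusive()
```

The exclusive acquire blocks until no commit holds the lock in shared mode, which is exactly the "wait for all in-flight distributed transactions" behavior described above.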

Distributed backup and recovery process:

With the transaction commit lock and consistent recovery points, we can safely back up each node of ADB PG and create consistent recovery points without worrying about inconsistent node states.

ADB PG backups are likewise divided into basic backups and log backups (also called incremental backups). A basic backup is a complete copy of each node of the cluster. ADB PG backs up the compute nodes and the coordinator node concurrently and streams the backup data to offline storage (such as OSS). The basic backup does not block the cluster's read and write services, so if users write or update data during the backup, the corresponding WAL must also be backed up. As shown below, ADB PG copies the data of every node in parallel and uploads it to OSS in a streaming manner.

[Figure: ADB PG basic backup process]

The log backup of ADB PG backs up the WAL generated by the compute nodes and coordinator nodes in the cluster. Each node dumps its own WAL to offline storage (such as OSS). At the same time, the cluster periodically creates a consistent recovery point and backs up the WAL containing it.

[Figure: ADB PG log backup process]

Restoring a new cluster uses the basic backup and the log backup together. First, a restore instance with the same number of nodes as the original is created. Each node pulls the specified basic backup in parallel, then pulls the WAL backup files it needs and replays them locally until the specified consistent recovery point is reached. The result is a new cluster whose data and state match those of the source instance at the consistent recovery point. The recovery process is shown in the figure below:

[Figure: ADB PG recovery process]
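The per-node restore steps can be sketched as follows (illustrative Python only; node names, log contents, and the `CP1` marker are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Each node of the new cluster restores independently: pull its base
# backup, then replay its own WAL backup up to the shared consistent
# recovery point, which every node's log contains.

def restore_node(node_logs, target):
    """Replay one node's log backup until the consistent recovery point."""
    state = []
    for entry in node_logs:
        if entry == target:
            break                  # all nodes stop at the same marker
        state.append(entry)
    return state

cluster_logs = {                   # node name -> its own WAL backup
    "master": ["A", "B", "CP1", "C"],
    "seg0":   ["B", "A", "CP1", "C"],
    "seg1":   ["A", "B", "CP1"],
}

with ThreadPoolExecutor() as pool:
    futures = {
        node: pool.submit(restore_node, logs, "CP1")
        for node, logs in cluster_logs.items()
    }
    result = {node: fut.result() for node, fut in futures.items()}

# Every node ends up with exactly the transactions committed before CP1.
print({node: sorted(txns) for node, txns in result.items()})
```

Because the consistent recovery point was cut while no distributed transaction was mid-commit, stopping each node at that marker yields the same transaction set everywhere.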

4. Usage

(1) Backup information in the console

  • View the basic backup set

Users can view the database's basic backup sets on the "Backup and Restore" page of the instance console. Currently, basic backup data is stored on OSS with a default retention period of 7 days.


[Figure: basic backup set list]

Each row in the table represents one basic backup and records its start time, end time, status (success/failure), data size, and consistency point in time. The consistency point in time indicates the historical moment to which this basic backup can restore the cluster while keeping the database consistent.

  • View consistent recovery point

A consistent recovery point is a historical point in time to which the cluster can be restored. Users can view all recovery points of the current instance on the "Recovery Points" tab of the backup and recovery page.

[Figure: consistent recovery point list]

Each row in the table represents a consistent recovery point and records its timestamp, that is, the historical point in time to which the cluster can be restored.

  • View the log file list

The log files record all changes to the database; during cluster recovery, the corresponding log files are used to bring the cluster to a consistent state. The log files used for recovery are all stored on OSS. Users can view the log file list under "Log Backup" on the backup and recovery page.

[Figure: log backup file list]

  • View backup strategy

The backup strategy covers the cycle and time window in which the instance performs backups, the frequency of creating consistent recovery points, the number of days backups are retained, and so on.

Users can view and modify the backup strategy under "Backup Settings" on the backup and recovery page.

[Figure: backup settings]

  • Modify the backup strategy

Click the "Modify Backup Configuration" button to modify the backup strategy.

[Figure: modify backup configuration]

(2) Instance recovery steps

First, look at the data on the source instance:

[Figure: data on the source instance]

  • Enter the recovery page

Users can click Restore in the console's instance list, data backup list, or recovery point list to enter the instance recovery page.

[Figure: restore entry points in the console]

The recovery page is as follows:

[Figure: instance recovery page]

The purchase page for a restored instance is much the same as the page for purchasing a new instance, with the following restrictions:

1. The number of masters of the restored instance must be 1.

2. The number of segments (compute nodes) selected must match the source instance.

3. The storage space selected must be greater than or equal to that of the source instance.

  • Choose a recovery time point

In the "Clone Source Backup Set" drop-down on the recovery page, select the historical point in time to restore to, that is, specify a consistent recovery point.

[Figure: selecting a recovery time point]

  • Click to buy

After clicking to buy, the flow is the same as purchasing a new instance. Once the instance is created, the newly restored instance appears in the console.

  • New instance restored

Looking at the data on the restored new instance, you can see that it is exactly the same as on the source instance.

[Figure: data on the restored instance]

5. Summary

Backup and recovery are essential for ADB PG to guarantee data security. The backup and recovery feature has already been applied in multiple user scenarios and guarantees an RPO as low as 10 minutes. Going forward, ADB PG backup and recovery will continue to optimize backup and recovery performance, support differential backups and more storage media, improve the user experience, and provide users with further gains in functionality, performance, and cost.



Alibaba Cloud Developers
3.2k reputation, 6.3k followers

Alibaba's official technical account, presenting technical innovation, hands-on experience, and career growth insights from across the Alibaba ecosystem.