Troubleshooting a network saturation problem caused by bidirectional data replication between ODPS active/standby clusters

1. Fault description
At a customer site, full bidirectional data replication between the ODPS primary and standby data centers saturated the network link between the two centers, seriously affecting daily ODPS jobs. In this active/standby deployment, all user jobs run in the primary data center, the data they produce lands in the primary center by default, and the primary center's data is asynchronously replicated to the standby center by the ODPS ReplicationService. So why was data also being synchronized in the reverse direction, back to the primary center? That is what this investigation set out to determine.
2. Fault investigation
During the investigation, we observed the machines' network load before and after disabling data replication. With active/standby data replication enabled, the machines' network cards showed heavy inbound and outbound traffic.
Figure 1
Further investigation showed that replication jobs were running in both the primary and standby data centers, many of them for tables that had not been updated at all. This unnecessary full data synchronization caused a great deal of network overhead.

Figure 2

Figure 3

Figure 4
3. Root cause analysis
Before solving the problem, we need to understand the overall ODPS same-city dual-data-center disaster recovery solution, and the working principle of the asynchronous cross-data-center data replication used within it.
3.1 Overall technical solution for ODPS same-city dual-data-center disaster recovery
Among ODPS deployments, the recovery or service-switchover solution differs for each failure or cluster-disaster scenario. This customer uses the ODPS same-city dual-data-center disaster recovery solution, so let us first look at that solution as a whole.
- Features:
- Complete MaxCompute services (control cluster, computing cluster, tunnel, front end) are deployed in both the primary and standby data centers.
- Both data centers are usable, but under normal conditions the standby center stays silent and does not handle business requests.
- Data transmission between the two centers is enabled, and a predefined policy synchronizes data to the standby center periodically or on demand.
- Core logic:
- Metadata is replicated synchronously;
- Business data is replicated asynchronously;
- RTO: minute level is achievable;
- RPO: depends on the data volume and synchronization frequency, usually at the hour level.
- Network requirements:
- Metadata synchronization latency should be kept within 5 ms;
- Business data synchronization latency should be kept within 20 ms.

Figure 5
- The modules are described as follows:
- VIP1: a virtual IP pointing to a group of tunnel data-service nodes, bound to the ODPS tunnel domain name; all ODPS data uploads and downloads go through VIP1.
- VIP2: a virtual IP pointing to a group of ODPS DDL/DML command-service nodes, bound to the ODPS service domain name; all DDL/DML commands are submitted to ODPS through VIP2.
- Tunnel front-end cluster: a group of nodes running the ODPS Tunnel service process for data upload and download; they call the user center and Meta service to authenticate user requests, and read and write data in the computing/storage cluster.
- Command front-end cluster: a group of front-end nodes running the ODPS DDL/DML command-processing service; they forward DDL/DML commands to the ODPS control service for processing.
- Control service: processes DDL/DML commands sent from the front end. For DML it performs SQL parsing, query optimization, and query-plan generation; the physical execution plan is carried out by distributed jobs submitted to the computing cluster.
- User center: the UMM service, responsible for managing users across Alibaba Cloud and the big-data platform.
- Meta service: ODPS uses OTS as its Meta storage service, responsible for managing the metadata of ODPS objects, including project information, table schemas, the storage paths of data on the Feitian clusters, the version information of data on different Feitian clusters, metadata of user UDFs, and so on.
- Computing cluster: the Feitian cluster for storage and computing. It stores all data and UDF programs; every DML command is compiled into a Feitian distributed DAG (directed acyclic graph) job and submitted to the computing cluster for execution. The core modules of the Feitian cluster are Pangu and Fuxi: Pangu is the distributed file system, and Fuxi handles resource and job scheduling.
In this solution, the primary and standby data centers each run a full set of ODPS services, sharing the same user center (UMM) and Meta service. Both UMM and OTS have their own dual-data-center disaster recovery capabilities; see their respective disaster recovery solutions for details.
3.2 Working principle of asynchronous cross-data-center data replication
Now let's look at how asynchronous cross-data-center data replication works:
- Under normal conditions, the ODPS in the primary data center serves all requests and the ODPS in the standby center receives none. Upper-layer data services use ODPS only through two service domain names:
- ODPS service domain name: points to the virtual IP of the command front-end cluster, i.e. VIP2 in the architecture diagram.
- Tunnel service domain name: points to the virtual IP of the tunnel front-end cluster, i.e. VIP1 in the architecture diagram.
- ODPS uses an asynchronous data replication mechanism: the Replication Service in the primary center continuously synchronizes the primary center's ODPS data to the computing cluster in the standby center. ODPS introduces a data version mechanism:
- The same piece of data (a table or partition) may have different versions on multiple computing clusters. The ODPS Meta service maintains version information for each piece of data, in the following form:
`{"LatestVersion":"V1","Status":{"ClusterA":"V1","ClusterB":"V0"}}`
When a piece of data gets a new version, a distributed cross-cluster replication task is triggered to copy it to the other computing clusters. Inter-data-center replication traffic can be throttled by limiting these replication tasks.
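The version record and trigger condition described above can be sketched as follows. This is a minimal illustration of the mechanism, not the actual ReplicationService API; all function and variable names are made up:

```python
# Sketch of the data-version check that drives cross-cluster replication:
# a cluster whose recorded version lags LatestVersion needs a copy.
# Names (meta_record, needs_replication) are illustrative only.

def needs_replication(meta_record, standby_cluster):
    """Return True if the standby cluster's version lags the latest version."""
    latest = meta_record["LatestVersion"]
    standby_version = meta_record["Status"].get(standby_cluster)
    return standby_version != latest

# Example Meta record, mirroring the snippet above:
meta_record = {"LatestVersion": "V1", "Status": {"ClusterA": "V1", "ClusterB": "V0"}}

print(needs_replication(meta_record, "ClusterB"))  # ClusterB is at V0 -> True
print(needs_replication(meta_record, "ClusterA"))  # ClusterA already at V1 -> False
```

Note that a missing entry in `Status` also compares unequal to `LatestVersion`, so an "unknown" cluster looks exactly like an out-of-date one; this detail becomes relevant to the root cause later.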
For a large-scale offline data processing system such as ODPS, data production can spike unexpectedly: the volume generated in a given period may be very large, while the inter-data-center bandwidth is limited, so newly computed data takes time to copy to the standby center. ODPS therefore provides real-time views of the currently unsynchronized tables/partitions, the real-time disaster-recovery synchronization rate, and related information. The real-time synchronization rate mainly depends on two factors:
* the inter-data-center bandwidth;
* how busy the ODPS computing cluster in the primary center is.
Because data replication is itself carried out by Feitian distributed tasks, it consumes computing resources (mainly CPU and memory) of the primary center's ODPS computing cluster. Based on these two factors, ODPS can estimate when data synchronization will complete.
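A rough model of that completion-time estimate can be written down directly from the two factors. This is a back-of-the-envelope sketch under assumed numbers, not the formula ODPS actually uses:

```python
# Toy model: the effective replication rate is the link bandwidth reduced by
# how busy the primary compute cluster is (replication jobs compete for its
# CPU/memory). All names and factors here are illustrative assumptions.

def estimate_sync_seconds(pending_bytes, link_bandwidth_bytes_per_s, cluster_busy_ratio):
    """cluster_busy_ratio in [0, 1): fraction of compute capacity unavailable
    to replication jobs."""
    effective_rate = link_bandwidth_bytes_per_s * (1.0 - cluster_busy_ratio)
    return pending_bytes / effective_rate

# 2 TiB pending, a 10 Gbit/s link (~1.25 GiB/s), cluster 60% busy:
eta = estimate_sync_seconds(2 * 1024**4, 1.25 * 1024**3, 0.6)
print(round(eta / 3600, 1))  # -> 1.1 (hours)
```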
Considering the competition for bandwidth, storage, and other resources between clusters, users choose for themselves whether to create a disaster recovery project. When creating a project through ascm/dtcenter, you can create either a single-cluster project or a multi-cluster project; single-cluster projects do not support disaster recovery.
Disaster recovery project configuration:
1. After a multi-cluster project is created through ascm/dtcenter, the active and standby clusters do not synchronize data by default. Data synchronization must be enabled in the bcc page configuration (MaxCompute module -> Operation and Maintenance menu -> Business Operation and Maintenance -> Project Management -> project list).
Figure 6
2. Resource replication configuration of one of the customer's projects:
Figure 7
Configuration item descriptions (the following settings take effect only after resource replication is enabled):
* SyncObject: configures the target cluster and the types of data to synchronize. Referring to the figure above, change the ODPS cluster name to the site's actual active/standby cluster names to enable full ODPS data replication.
* ScanMetaInterval: the data synchronization interval, in seconds.
* EnableEvent: whether data is synchronized in real time. When set to true, any change in the primary cluster's data is synchronized to the standby cluster immediately, and the ScanMetaInterval setting is ignored.
Note: configuring real-time synchronization, or setting a short synchronization interval, consumes a large share of the network bandwidth. When the volume of data to synchronize is large, it is recommended to turn off real-time synchronization and increase the synchronization interval.
3. Disaster recovery replication risks and practice notes:
As mentioned in the previous section, if EnableEvent is set to true, any change to data in the project immediately triggers synchronization, and each synchronization copies the full table or partition. For example, suppose a non-partitioned table holds 1 TB of data. Inserting a single row into it requires copying the entire 1 TB to the standby data center; if the table is written 10 times within one minute, 1 TB is copied 10 times in that minute, producing 9 TB of redundant traffic. On hybrid cloud it is strongly recommended to turn off real-time replication to reduce the bandwidth and storage pressure on the data centers, and to switch to a scheduled replication strategy instead, e.g. setting ScanMetaInterval so that a scan-and-copy runs once every 6 hours:

    EnableEvent=false
    ScanMetaInterval=21600  # 6 hours = 21600 seconds
* To prevent ODPS from saturating cluster bandwidth during peak hours, you can set a global limit under adminconsole -> cross-cluster replication, capping the replication bandwidth between ODPS clusters (in Gb), as shown in the figure below:
Figure 8
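The redundant-traffic arithmetic in the 1 TB example above is worth making explicit, since it shows why real-time replication scales with write frequency rather than with the amount of new data:

```python
# With EnableEvent=true, each write to a 1 TB non-partitioned table triggers
# a full-table copy to the standby data center, regardless of how little
# data the write actually added.

table_size_tb = 1
writes_per_minute = 10

traffic_tb = table_size_tb * writes_per_minute   # total copied in one minute
redundant_tb = traffic_tb - table_size_tb        # only the final copy is needed

print(traffic_tb, redundant_tb)  # -> 10 1: 10 TB copied, 9 TB of it redundant
```

With scheduled replication (ScanMetaInterval=21600), the same minute of writes would instead result in at most one full-table copy per 6-hour scan.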
From the log of a replication job running from the standby center back to the primary center (screenshot below), the cluster name is uppercase, while the sourceLocation set by the customer in the resource replication configuration is lowercase: the customer's configuration was wrong.
Figure 9
After discussing with the ODPS R&D team, we confirmed the root cause: when the ODPS ReplicationService initiates asynchronous replication of project data, it uses the cluster named in SyncObject to check, via OTS, the version number of the project on the standby cluster. Because of the case mismatch, the project appears not to exist in the standby data center, so synchronization is initiated; when the data is about to land, however, data verification rejects it, so nothing is actually stored. This causes unnecessary network overhead but does not affect data quality.
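The failure mode reduces to a case-sensitive lookup. A minimal sketch, with purely illustrative names (`meta_status`, `standby_version` are not real ODPS identifiers):

```python
# Sketch of the root cause: the version lookup keys on the cluster name
# case-sensitively, so a lowercase name configured in SyncObject never
# matches the uppercase name stored in Meta (OTS). The standby copy then
# appears to be missing, and a full (and ultimately futile) sync is launched.

meta_status = {"CLUSTER_B": "V1"}  # Meta stores the cluster name in uppercase

def standby_version(configured_name):
    # Case-sensitive lookup, as the ReplicationService effectively performs:
    return meta_status.get(configured_name)

print(standby_version("cluster_b"))   # misconfigured lowercase -> None ("missing")
print(standby_version("CLUSTER_B"))   # matching case -> 'V1', no spurious sync
```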
After modifying the resource configuration through bcc to change the cluster name in SyncObject to uppercase and restarting the ODPS ReplicationService, the problem was completely resolved.
Figure 10
4. Conclusion
Data replication between the ODPS active and standby clusters was saturating the bandwidth. The root cause was that the cluster name in the ODPS project's resource replication configuration was lowercase while the actual ODPS cluster name is uppercase; during synchronization the project was therefore considered non-existent, resulting in bidirectional synchronization, which testing also verified. The problem was resolved by batch-correcting the cluster names in the project configurations to uppercase via bcc and restarting the ReplicationService.
Special attention: customers must keep the cluster name in each project's resource replication configuration in bcc identical to the actual cluster name, including case.
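A pre-check like the following can catch this class of misconfiguration before it costs bandwidth. This is a hedged sketch: the data layout and names are made up, and in practice the configured names would be read from bcc rather than hard-coded:

```python
# Flag SyncObject cluster names that differ from the real cluster names
# only in case -- exactly the mismatch behind this incident.

actual_clusters = {"CLUSTER_A", "CLUSTER_B"}   # illustrative cluster names

project_sync_clusters = {                      # per-project configured names
    "proj_sales": ["CLUSTER_B"],
    "proj_logs": ["cluster_b"],                # mismatched case: triggers the bug
}

def find_case_mismatches(actual, configured_by_project):
    upper_actual = {c.upper(): c for c in actual}
    issues = []
    for project, clusters in configured_by_project.items():
        for name in clusters:
            canonical = upper_actual.get(name.upper())
            if canonical is not None and name != canonical:
                issues.append((project, name, canonical))
    return issues

print(find_case_mismatches(actual_clusters, project_sync_clusters))
# -> [('proj_logs', 'cluster_b', 'CLUSTER_B')]
```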
Through this troubleshooting exercise we have gained a good understanding of the current ODPS multi-data-center data synchronization solution and its multi-data-center disaster recovery architecture.
We are the Alibaba Cloud Intelligent Global Technical Service SRE team. We are committed to becoming a technology-driven, service-oriented engineering team for highly available business systems, providing professional, systematic SRE services to help customers make better use of the cloud and build more stable and reliable business systems on it. We hope to share more technology that helps enterprise customers move to the cloud, use it well, and run their businesses on it more stably and reliably. Scan the QR code below to join the Alibaba Cloud SRE Technical Institute DingTalk circle and discuss cloud-platform topics with more cloud experts.
> Copyright notice: the content of this article is contributed spontaneously by real-name registered Alibaba Cloud users. The copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not assume corresponding legal responsibility. For the specific rules, see the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the suspected infringing content.