This article was written by Ran Xiaolong and first published on the public account "Tencent Cloud Middleware". It is republished here with the authorization of the original account. To reprint it, please contact the original account. The article introduces Apache Pulsar's cross-regional replication solutions for different scenarios.
The following article is from Tencent Cloud Middleware, author Ran Xiaolong
About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation, cloud-native distributed messaging and streaming platform. It integrates messaging, storage, and lightweight function computing, and its architecture separates compute from storage. It supports multi-tenancy, persistent storage, and multi-data-center, cross-regional data replication, and it offers the characteristics expected of streaming data storage: strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/
Apache Pulsar is a multi-tenant, high-performance solution for server-to-server messaging. It supports multi-tenancy, low latency, read-write separation, cross-regional replication, rapid capacity expansion, and flexible fault tolerance. It natively supports cross-regional replication, even across continents. Combined with its tenant-level and namespace-level abstractions, it can flexibly support different types of cross-regional replication in different scenarios.
With Geo-Replication we can, first, easily distribute services across multiple data centers and, second, handle data-center-level failures: when one data center becomes unavailable, services can be moved to other data centers and continue serving external traffic.
Apache Pulsar has built-in multi-cluster, cross-regional replication. Geo-Replication means that clusters located in different physical regions are configured so that data can be replicated between them.
Depending on whether messages are replicated synchronously or asynchronously, cross-regional replication can be divided into the following two modes:
- Synchronous mode: If the data requires a very high level of disaster tolerance, you can use a synchronous cross-city deployment, in which copies of the data exist in different cities. The drawback is that network fluctuations between cities have a large impact on performance, because the client receives a success response only after the write has succeeded in multiple cities.
- Asynchronous mode: If the required level of disaster tolerance is lower, you can use an asynchronous cross-city deployment. For example, with two independent data centers in Shanghai and Toronto, a message written in Shanghai is written to Toronto asynchronously. The advantage is that the performance of the main write path is not affected; the drawback is the additional storage overhead.
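The trade-off between the two modes can be illustrated with a toy model. This is only a sketch, not Pulsar code: the latency figures and the function names `sync_ack_latency_ms` and `async_ack_latency_ms` are assumptions made up for the example, and the cities are assumed to be written in parallel.

```python
# Toy model of the latency a client sees before its write is acknowledged.
# All numbers are hypothetical.
WRITE_LATENCY_MS = {"shanghai": 2, "toronto": 180}


def sync_ack_latency_ms(latencies: dict[str, int]) -> int:
    """Synchronous mode: acknowledge only after every city has written."""
    return max(latencies.values())


def async_ack_latency_ms(latencies: dict[str, int], local: str) -> int:
    """Asynchronous mode: acknowledge after the local write only;
    the remote cities are written in the background."""
    return latencies[local]


if __name__ == "__main__":
    print(sync_ack_latency_ms(WRITE_LATENCY_MS))                # 180
    print(async_ack_latency_ms(WRITE_LATENCY_MS, "shanghai"))   # 2
```

In the synchronous case the slowest city dominates every write, which is why fluctuations on the inter-city network show up directly in client latency.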
Below we discuss the cross-regional replication scheme of Pulsar in asynchronous mode.
Pulsar currently supports the following three asynchronous cross-regional replication schemes:
- Full mesh (fully connected)
- One-way replication
- Failover mode
Depending on whether configurationStoreServers (a global ZooKeeper) is configured, these schemes fall into two groups:
- With configurationStoreServers configured
  - Full mesh (fully connected)
- Without configurationStoreServers configured
  - One-way replication
  - Failover mode
The core question in cross-regional replication is whether the clusters can exchange data with one another. This depends mainly on the following configuration:
- cluster (cluster name)
- zookeeper (local cluster zk servers)
- configuration-store (global zk servers)
When initializing a Pulsar cluster, you can specify this information. For example:
bin/pulsar initialize-cluster-metadata \
  --cluster pulsar-cluster-1 \
  --zookeeper zk1.us-west.example.com:2181 \
  --configuration-store zk1.us-west.example.com:2181 \
  --web-service-url http://pulsar.us-west.example.com:8080 \
  --web-service-url-tls https://pulsar.us-west.example.com:8443 \
  --broker-service-url pulsar://pulsar.us-west.example.com:6650 \
  --broker-service-url-tls pulsar+ssl://pulsar.us-west.example.com:6651
Full-mesh mode allows data to be shared among multiple clusters, as shown in the figure below:
- configurationStoreServers: Stores the configuration of each cluster, so that the clusters can discover each other's addresses. It also stores tenant and namespace information. Its main purpose is to simplify operations: when the information of one cluster is updated, the other clusters learn of the change through the global ZooKeeper.
- tenant: Specifies which clusters the tenant being created is allowed to operate on (--allowed-clusters).
- namespace: Specifies the clusters between which the namespace being created replicates data (--clusters).
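The relationship between the tenant-level and namespace-level settings can be sketched as a small model. This is not Pulsar's implementation: the classes `Tenant` and `Namespace` and the method `set_clusters` are hypothetical, and the rule that a namespace may only replicate between clusters allowed for its tenant is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    name: str
    allowed_clusters: set[str]  # plays the role of --allowed-clusters


@dataclass
class Namespace:
    tenant: Tenant
    name: str
    replication_clusters: set[str] = field(default_factory=set)  # plays the role of --clusters

    def set_clusters(self, clusters: set[str]) -> None:
        # Assumption of this sketch: the namespace can only replicate between
        # clusters that its tenant is allowed to use.
        not_allowed = clusters - self.tenant.allowed_clusters
        if not_allowed:
            raise ValueError(
                f"clusters not allowed for tenant {self.tenant.name}: {sorted(not_allowed)}"
            )
        self.replication_clusters = set(clusters)


if __name__ == "__main__":
    tenant = Tenant("my-tenant", allowed_clusters={"beijing", "shanghai"})
    ns = Namespace(tenant, "my-tenant/my-namespace")
    ns.set_clusters({"beijing", "shanghai"})      # accepted
    try:
        ns.set_clusters({"beijing", "toronto"})   # toronto is not allowed for this tenant
    except ValueError as err:
        print(err)
```

The tenant setting bounds where data may go at all, while the namespace setting selects the clusters that actually replicate a given set of topics.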
Data replication among multiple clusters can always be reduced to replication between two clusters. On that basis, the principle of Geo-Replication is shown in the following figure:
Suppose two clusters are deployed, one in Beijing and one in Shanghai. When a user sends data with a producer in the Beijing cluster, the data is first written to the local topic (topic1) in the Beijing data center. At the same time, a replication cursor is created. This cursor is dedicated to replication and records how far the data has been replicated. A replication producer is also created. It reads the data from topic1 in the Beijing data center and writes it to topic1 in the Shanghai data center. When the broker in the Shanghai data center receives the replication producer's request, it writes the data to its local topic of the same name (topic1). If a user in the Shanghai data center then starts a consumer, it receives the messages produced by the producer in the Beijing data center, and vice versa.
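The interplay of the replication cursor and the replication producer described above can be sketched as follows. This is a simplified in-memory model, not Pulsar's replicator: the classes `Topic` and `Replicator` and the method `replicate_pending` are hypothetical names, and a Python list stands in for each cluster's copy of topic1.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """An append-only log standing in for one cluster's copy of a topic."""
    messages: list[str] = field(default_factory=list)

    def publish(self, payload: str) -> None:
        self.messages.append(payload)


@dataclass
class Replicator:
    """Reads the local topic from the cursor and republishes to the remote topic."""
    local: Topic
    remote: Topic
    cursor: int = 0  # index of the next local message that still has to be copied

    def replicate_pending(self) -> int:
        """Copy everything after the cursor; return the number of messages copied."""
        copied = 0
        while self.cursor < len(self.local.messages):
            # The replicator acts as a producer on the remote cluster's topic.
            self.remote.publish(self.local.messages[self.cursor])
            self.cursor += 1  # advance only after the remote write
            copied += 1
        return copied


if __name__ == "__main__":
    beijing_topic1, shanghai_topic1 = Topic(), Topic()
    replicator = Replicator(local=beijing_topic1, remote=shanghai_topic1)
    beijing_topic1.publish("m1")
    beijing_topic1.publish("m2")
    print(replicator.replicate_pending())  # 2
    print(shanghai_topic1.messages)        # ['m1', 'm2']
    print(replicator.replicate_pending())  # 0
```

Because the cursor only moves forward after a message has been written remotely, it always marks the point up to which replication has completed.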
The following issues need to be explained here:
- In a full-mesh scenario, data in the Beijing data center is replicated to the cluster in the Shanghai data center, and data in the Shanghai data center is replicated to Beijing. So after a message from Beijing has been replicated to Shanghai, will Shanghai replicate it back to Beijing and create an endless loop? It will not. When a producer sends a message, the cluster it belongs to is known. When the replication producer replicates the message, it tags the message with a label, replicated_from, which records the cluster the message came from. This prevents reverse replication.
- With Geo-Replication, exactly-once semantics for messages can also be guaranteed, through at-least-once delivery plus broker-side deduplication (producer name + sequence ID).
- The replication latency depends on the network latency between the two data centers. If the latency is high, check the network conditions between the two data centers.
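The first two points can be combined in one small model: a tag on replicated messages stops them from being sent back, and deduplication on producer name plus sequence ID absorbs retries. This is a sketch under simplifying assumptions, not broker code: the classes `Message` and `Cluster`, the methods `publish` and `replicate_to`, and the field name `replicated_from` as used here are illustrative, and `replicate_to` deliberately resends the whole log on every call to mimic at-least-once delivery.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Message:
    producer_name: str
    sequence_id: int
    payload: str
    replicated_from: Optional[str] = None  # None means "produced by a local client"


@dataclass
class Cluster:
    name: str
    log: list[Message] = field(default_factory=list)
    last_sequence_id: dict[str, int] = field(default_factory=dict)  # per producer name

    def publish(self, msg: Message) -> bool:
        """Store msg unless it is a duplicate; return True if it was stored."""
        if msg.sequence_id <= self.last_sequence_id.get(msg.producer_name, -1):
            return False  # deduplication: this sequence ID was already stored
        self.last_sequence_id[msg.producer_name] = msg.sequence_id
        self.log.append(msg)
        return True

    def replicate_to(self, remote: "Cluster") -> int:
        """Send locally produced messages to remote; return how many were sent."""
        sent = 0
        for msg in self.log:
            if msg.replicated_from is not None:
                continue  # arrived via replication: never send it onward, so no loop
            remote.publish(Message(msg.producer_name, msg.sequence_id,
                                   msg.payload, replicated_from=self.name))
            sent += 1
        return sent


if __name__ == "__main__":
    beijing, shanghai = Cluster("beijing"), Cluster("shanghai")
    beijing.publish(Message("producer-a", 0, "hello"))
    beijing.replicate_to(shanghai)
    beijing.replicate_to(shanghai)              # retry: resent, but deduplicated remotely
    print(len(shanghai.log))                    # 1
    print(shanghai.replicate_to(beijing))       # 0
    print(len(beijing.log))                     # 1
```

The retry shows why at-least-once delivery alone is not enough: the message is sent twice, and it is the deduplication check on the receiving side that keeps a single copy.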
Once the global ZooKeeper is configured, data replication is bidirectional, and data can flow between all clusters attached to the global ZooKeeper.
As mentioned above, once a global ZooKeeper is configured, one-way data replication is not possible. In many scenarios, however, we do not need all clusters to be fully connected, and in that case one-way replication is worth considering. Note that one-way replication does not require a separate configurationStoreServers. You only need to set configurationStoreServers to the ZooKeeper address of the local cluster (zookeeperServers).
So how do we replicate across clusters without configuring a global ZooKeeper?
As mentioned above, the global ZooKeeper mainly stores the address information of the clusters and the corresponding namespace information; it holds no other metadata. So in the one-way replication scenario, you need to provide this information to the clusters in the other data centers yourself, so that they can read the namespace information of the other clusters.
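Why replication becomes one-way when each cluster keeps only local metadata can be shown with a small model. This is an illustration under stated assumptions, not Pulsar's actual logic: the class `Cluster`, the helpers `register_cluster`, `set_namespace_clusters`, and `replicates`, the three conditions checked, and the service URL are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    # Local metadata only: there is no shared (global) configuration store.
    known_clusters: dict[str, str] = field(default_factory=dict)           # cluster name -> service URL
    namespace_clusters: dict[str, set[str]] = field(default_factory=dict)  # namespace -> replication clusters

    def register_cluster(self, name: str, service_url: str) -> None:
        self.known_clusters[name] = service_url

    def set_namespace_clusters(self, namespace: str, clusters: set[str]) -> None:
        self.namespace_clusters[namespace] = set(clusters)


def replicates(src: Cluster, dst: Cluster, namespace: str) -> bool:
    """True if, in this model, src would replicate `namespace` to dst."""
    return (
        dst.name in src.known_clusters                                 # src knows how to reach dst
        and dst.name in src.namespace_clusters.get(namespace, set())   # src's policy lists dst
        and namespace in dst.namespace_clusters                        # the namespace exists on dst
    )


if __name__ == "__main__":
    ns = "my-tenant/my-namespace"
    beijing, shanghai = Cluster("beijing"), Cluster("shanghai")
    # Only Beijing is told about Shanghai; Shanghai knows nothing about Beijing.
    beijing.register_cluster("shanghai", "pulsar://shanghai.example.com:6650")
    beijing.set_namespace_clusters(ns, {"beijing", "shanghai"})
    shanghai.set_namespace_clusters(ns, {"shanghai"})
    print(replicates(beijing, shanghai, ns))  # True
    print(replicates(shanghai, beijing, ns))  # False
```

With a global ZooKeeper both clusters would read the same cluster and namespace information, which is why replication is then bidirectional.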
Failover mode is a special case of one-way replication.
In failover mode, the cluster in the remote data center is used only for data backup and has no producers or consumers. Only when the active cluster goes down are the producers and consumers switched to the standby cluster, where they continue producing and consuming. Because of replicated subscriptions, the subscription state is also replicated to the standby data center.
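The failover flow can be sketched as follows. This is a simplified model, not Pulsar client or broker code: the class `ClusterState` and the functions `replicate`, `pick_cluster`, and `consume_one` are hypothetical, and the sketch copies messages and subscription position to the standby instantly, which is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClusterState:
    name: str
    up: bool = True
    messages: list[str] = field(default_factory=list)
    sub_position: int = 0  # index of the next message the subscription will deliver


def replicate(active: ClusterState, standby: ClusterState) -> None:
    """Copy both the data and the subscription state to the standby cluster."""
    standby.messages = list(active.messages)
    standby.sub_position = active.sub_position


def pick_cluster(active: ClusterState, standby: ClusterState) -> ClusterState:
    """Clients use the active cluster and switch to the standby only when it is down."""
    return active if active.up else standby


def consume_one(cluster: ClusterState) -> Optional[str]:
    """Deliver the next message of the subscription, or None if there is none."""
    if cluster.sub_position >= len(cluster.messages):
        return None
    msg = cluster.messages[cluster.sub_position]
    cluster.sub_position += 1
    return msg


if __name__ == "__main__":
    active = ClusterState("beijing", messages=["m1", "m2", "m3"])
    standby = ClusterState("shanghai")
    print(consume_one(pick_cluster(active, standby)))  # m1
    replicate(active, standby)
    active.up = False                                  # the active cluster goes down
    print(pick_cluster(active, standby).name)          # shanghai
    print(consume_one(pick_cluster(active, standby)))  # m2
```

Because the subscription position travels with the data, the consumer continues on the standby cluster from where it stopped instead of starting over.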