This article is translated from "Distributed Locks with Apache Pulsar".
Original link: https://theboreddev.com/distributed-locks-with-apache-pulsar/
Translator
Wei Wei (kivenwei) has 7 years of R&D experience, mainly with SOA and microservice architectures. He is familiar with open source development in cloud computing, CNCF, and the cloud native ecosystem, and currently works at CGN Intelligent Technology Co., Ltd.

One of the most challenging jobs a software engineer faces is ensuring that only one component of a distributed application performs a given computation at a time.

For example, suppose our application runs on three nodes and we need to run a scheduled task once a day. How do we ensure that only one of the nodes triggers the task? If the task emails a client and all three nodes run it, the client might receive that email three times. That is the opposite of what we want, so how do we fix this?

One might say: "Easy, just run one node!"

Actually, it's not that easy. In most cases we must guarantee an adequate level of availability, and running a single node means that if anything goes wrong, the service becomes unavailable.

What we really need is to elect a "master node" responsible for this task. Another factor to consider: if the primary node fails, the task must immediately be delegated to one of the standby nodes so that execution is not disrupted.

Let's look at what we want to achieve, as shown below:

Operationally, we need a simple way to "elect" the master node responsible for executing the task, while the other nodes patiently wait for their turn. These standby nodes sit in a so-called "sleep" state and only wake up if the primary node fails or becomes unresponsive.

How to Solve This Problem

In some scenarios, teams resort to fairly complex implementations to ensure that only one of the nodes performs the task.

Compare-and-swap (CAS) atomic operations, now supported by some database engines, may be a quick and reasonable solution to this problem. We can solve it by exploiting a feature of the database instead of reimplementing it ourselves. But what if the database doesn't support such atomic operations?
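As a minimal sketch of the CAS approach, the conditional update below only succeeds when the lock is still free, and the database performs the check and the write atomically. The table name, column names, and lock name are illustrative assumptions, not from the article; SQLite stands in for whatever engine you actually use.

```python
# CAS-style lock acquisition: an atomic conditional UPDATE.
# Schema and names ("locks", "daily-email") are illustrative assumptions.
import sqlite3

def acquire_lock(conn, lock_name, owner):
    # The UPDATE matches only if the lock is currently free, so the
    # check-and-set happens as one atomic statement inside the database.
    cur = conn.execute(
        "UPDATE locks SET owner = ? WHERE name = ? AND owner IS NULL",
        (owner, lock_name),
    )
    conn.commit()
    return cur.rowcount == 1  # True only for the node that won the race

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO locks (name, owner) VALUES ('daily-email', NULL)")

print(acquire_lock(conn, "daily-email", "node-1"))  # True  - lock acquired
print(acquire_lock(conn, "daily-email", "node-2"))  # False - already held
```

Only one of the competing nodes sees a row count of 1, so only one node proceeds to run the task.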

Without atomic operations the problem gets more complicated: every node competes to acquire the lock, but two nodes can read the lock's value as "free" at the same time and both set it successfully, neither noticing that the other has also "acquired" the lock. Then not one but both nodes will perform the task, such as sending an email to a customer.

However, even if the database supports CAS operations, we still need some mechanism to ensure that a standby node takes over when the primary fails: a heartbeat-like process that constantly checks node status and reacts when a node goes down. Building this ourselves is time-consuming and labor-intensive; ideally we would use a well-established and thoroughly tested product instead.

This is why, if you have already deployed Apache Pulsar, it makes sense to use it to solve this problem. A similar solution can be implemented with Kafka, but this post focuses on Apache Pulsar.

Implementing Distributed Locks with Apache Pulsar

So how do we take advantage of Apache Pulsar? Pulsar provides a subscription mode called failover, which essentially implements a leader election mechanism among consumers.

How can we use this election mechanism to ensure the scheduled task is executed only once?

Without going into implementation specifics, since they depend heavily on the usage scenario, here is one straightforward approach:

Starting the Scheduler Automatically on Heartbeat Events

One way to do this is to start a consumer that listens for heartbeat events, and then immediately begin publishing heartbeats. The consumers subscribe to the topic with a failover subscription, so only one node is able to start the scheduler. If the primary node fails, one of the standby nodes takes over and starts the task immediately. The following figure illustrates the idea:

In this example we have a topic that manages the distributed lock; each consumer periodically sends heartbeats to this topic and subscribes to it with a failover subscription. Only one of the nodes becomes the master and handles heartbeat events. If the master has not yet started the mail scheduler, it starts it as soon as the first heartbeat arrives; subsequent heartbeats are ignored for as long as the scheduler is running.
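The master's heartbeat handling described above can be sketched as follows. The Pulsar wiring is shown only as comments (assuming the `pulsar-client` package and illustrative topic and subscription names); the runnable part is the idempotent start logic, with a stand-in scheduler class.

```python
# Sketch of the master's heartbeat handling: the first heartbeat starts the
# scheduler, and later heartbeats are ignored while it is running.
# In a real deployment this handler would be driven by a Pulsar consumer
# with a failover subscription, e.g. (assuming the pulsar-client package,
# with illustrative topic/subscription names):
#
#   client = pulsar.Client("pulsar://localhost:6650")
#   consumer = client.subscribe("heartbeats", "lock-subscription",
#                               consumer_type=pulsar.ConsumerType.Failover)
#
# Only the elected consumer receives heartbeats and runs this logic.

class EmailScheduler:
    """Stand-in for the real task scheduler; names are illustrative."""
    def __init__(self):
        self.running = False
        self.starts = 0

    def start(self):
        self.running = True
        self.starts += 1

def on_heartbeat(scheduler):
    # Idempotent: only the first heartbeat actually starts the scheduler.
    if not scheduler.running:
        scheduler.start()

scheduler = EmailScheduler()
for _ in range(5):          # five heartbeats arrive at the master
    on_heartbeat(scheduler)

print(scheduler.starts)  # 1 - started once, later heartbeats were ignored
```

Because the start is idempotent, it doesn't matter how often heartbeats arrive; what matters is that the failover subscription guarantees only one consumer handles them at a time.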

What happens if the master node fails? See the figure below:

In the event of a primary node failure, Pulsar's failover subscription detects that the node has failed and a standby node takes over. In the diagram, the standby node on the left receives a heartbeat, which triggers it to start executing the task. Once the previous primary node recovers, it rejoins as a standby node and goes back to "sleep".

Conclusion

We generally do not recommend adopting a new technology just to implement distributed locks; finding and leveraging a product you already run often saves time and reduces operational complexity.

Of course, you can also build your own distributed locks on top of other systems, but doing so takes time and is error-prone. In Pulsar, this capability already exists and has been made reliable through thorough testing by other engineers, saving you valuable time and reducing operational problems.
