Algorithm Leader丨How does a distributed system elect a leader?

Hello everyone, I am Li Shuai, the lab researcher for this issue. Leader election is one of the trickiest topics in distributed systems. At the same time, understanding how a leader is elected and what its responsibilities are is key to understanding distributed systems.

In a distributed system, a service usually runs as a cluster of multiple nodes or instances so that it can be scalable and highly available.

These nodes can work at the same time to increase the service's processing and computing capacity. However, if they operate on shared resources at the same time, they must coordinate their operations so that no node overwrites changes made by another, which would leave the data inconsistent.

Therefore, we need to elect a leader among the nodes to manage and coordinate the whole cluster. This is the familiar Master-Slave architecture.

To select a master node (the leader) among multiple nodes in a distributed environment, the following strategies are usually used:

Select the node with the largest or smallest process ID or instance ID as the master node.
Implement a common leader election algorithm, such as Bully.
Use a distributed mutual exclusion lock, which guarantees that only one node can hold the lock for a given period of time and act as the master node.
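The lock-based strategy can be sketched with an in-memory lease. This is only an illustration: `LeaseLock` is a hypothetical class standing in for a real distributed lock service, and the current time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """Hypothetical in-memory stand-in for a distributed lock with a lease."""
    ttl: float                      # lease length in seconds
    holder: Optional[str] = None    # node currently holding the lock
    expires_at: float = 0.0         # time at which the current lease runs out

    def try_acquire(self, node: str, now: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


lock = LeaseLock(ttl=10.0)
print(lock.try_acquire("node-a", now=0.0))   # node-a becomes the master
print(lock.try_acquire("node-b", now=5.0))   # lease is still held by node-a
print(lock.try_acquire("node-b", now=12.0))  # lease has expired, node-b takes over
```

The master has to keep renewing the lease before it expires; if it crashes, the lease runs out and another node can take over.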

In this article, I will introduce several common election algorithms, including ZAB, Bully, and Token Ring Election.

Bully algorithm

Garcia-Molina introduced the Bully algorithm in a 1982 paper. It is a very common election algorithm in distributed systems, and its rule is simple: the node with the largest ID becomes the master node.

Suppose we have a cluster in which every node can connect to every other node, and each node knows the other nodes' information (ID and node address).

When the cluster is initialized, each node first checks whether it has the largest ID among the surviving nodes. If so, it sends a Victory message to the other nodes to announce itself as the master node. Under this rule, the P6 node becomes the master node of the cluster.

Now suppose some failures take nodes offline. If the node that goes offline is a slave, the cluster still has one master and multiple slaves, so the impact is small. But if the P6 master node goes offline, the cluster is left without a leader.

Now we need to elect a new master node!

Our nodes can all reach each other, and they regularly exchange heartbeat checks. Suppose the P3 node detects that the P6 master node has failed. P3 then initiates a new election.
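How a node decides that a peer has failed is not spelled out here. A common approach is a timeout on the last heartbeat received. The sketch below assumes that approach; `HeartbeatMonitor` and its `timeout` parameter are hypothetical names used only for illustration.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical failure detector: a peer is suspected once it has been
    silent for longer than `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def record(self, peer: str, now: float) -> None:
        """Remember when the latest heartbeat from `peer` arrived."""
        self.last_seen[peer] = now

    def suspected(self, now: float) -> List[str]:
        """Return the peers whose last heartbeat is older than the timeout."""
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("P6", now=0.0)   # P6 was last heard from at t=0
monitor.record("P4", now=2.0)   # P4 is still sending heartbeats
print(monitor.suspected(now=4.0))  # P6 has been silent for too long
```

A node that finds the master in the suspected list would be the one to start the election.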

First, P3 will send an Election message to all nodes with a larger ID than its own.

Since P6 is offline, it does not respond to the request, while P4 and P5 receive the Election request and respond with an Alive message. After the P3 node receives these messages, it stops its own election, because nodes with a larger ID than its own are still alive, and they take over the election.

Next, the P4 node sends election messages to the P5 and P6 nodes.

The P5 node responds with an Alive message and takes over the election.

Likewise, the P5 node sends an election message to the P6 node.

This time the P6 node does not respond. P5 is now the node with the largest ID among the surviving nodes, so P5 becomes the new master node and sends a Victory message to the other nodes, announcing that it has become the leader!

After a while, the fault is repaired and the P6 node comes back online. Because it has the largest ID, it directly sends a Victory message to the other nodes to announce that it is the master node. When P5 receives this message from a node with a larger ID than its own, it steps down and becomes a slave node.
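The walk-through above can be replayed with a small simulation. This is a minimal sketch rather than a networked implementation: `bully_election` is a hypothetical function, the node IDs 1 to 6 are assumed from the P3 to P6 example, and message passing is replaced by a set of live node IDs.

```python
from typing import List, Set


def bully_election(initiator: int, alive: Set[int], all_ids: List[int],
                   log: List[str]) -> int:
    """Simulate one Bully election and return the ID of the new master."""
    higher = sorted(n for n in all_ids if n > initiator)
    log.append(f"P{initiator} sends Election to {higher}")
    # Only nodes that are still alive can answer with an Alive message.
    responders = [n for n in higher if n in alive]
    if not responders:
        # Nobody with a larger ID answered, so the initiator wins.
        log.append(f"P{initiator} sends Victory to everyone")
        return initiator
    log.append(f"P{initiator} gets Alive from {responders} and stops")
    # Every responder takes over the election. Following the lowest one is
    # enough for this simulation: all takeovers end at the same node.
    return bully_election(responders[0], alive, all_ids, log)


log: List[str] = []
master = bully_election(3, alive={1, 2, 3, 4, 5},
                        all_ids=[1, 2, 3, 4, 5, 6], log=log)
print("\n".join(log))
print(f"new master: P{master}")
```

With P6 offline and P3 as the initiator, the election passes through P4 and ends with P5 as the new master, matching the steps described above.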

Token Ring algorithm

The Token Ring Election algorithm is closely tied to the network topology of the cluster. Its defining feature is that all nodes form a ring, and each node knows its downstream node and can communicate with it.

When the cluster is initialized, one of the nodes first sends an election message containing its own ID to the next node. When the next node receives the message, it appends its own ID and passes the message on. Eventually the message completes a closed loop.

This election is initiated from the P3 node.

When the P3 node receives the message from P4, it finds that the message contains its own node ID, so it knows that the election message has traveled the entire ring. From the node list "3, 6, 5, 2, 1, 4", it selects the largest ID as the master node, that is, it elects P6 as the leader.

Next, the P3 node sends a message to the downstream node, announcing that P6 is the master node, until the message travels the entire ring and returns to P3. At this point, the election is completed.

Now suppose a failure takes the master node P6 offline. The upstream node P3 is the first to notice (through the heartbeat check), and it starts a new election. Because P3 cannot reach its downstream node P6, it tries the next node further downstream, P5, and sends it an election message carrying its own node ID. The message is then passed downstream step by step.

When the election message returns to the P3 node, P3 selects the largest ID from the node list "3, 5, 2, 1, 4", so P5 now becomes the master node.

Next, the P3 node sends a message to the downstream node announcing that P5 is the master node.

Once the message has traveled the entire ring and returned to P3, the election is complete.
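The election pass around the ring can be sketched as follows. This is a simplified simulation: `ring_election` is a hypothetical function, the ring order 3, 6, 5, 2, 1, 4 is inferred from the node lists in the walk-through, and the initiator is assumed to be alive. The follow-up pass that announces the leader is left out.

```python
from typing import List, Set, Tuple


def ring_election(ring: List[int], alive: Set[int],
                  initiator: int) -> Tuple[List[int], int]:
    """Pass an election message once around the ring, skipping dead nodes.

    Returns the IDs collected in visiting order and the elected leader.
    """
    start = ring.index(initiator)
    collected = [initiator]            # the initiator puts its own ID first
    i = (start + 1) % len(ring)
    while ring[i] != initiator:        # stop when the message is back home
        if ring[i] in alive:           # an unreachable node is skipped
            collected.append(ring[i])
        i = (i + 1) % len(ring)
    return collected, max(collected)   # the largest collected ID wins


ring = [3, 6, 5, 2, 1, 4]
print(ring_election(ring, alive={1, 2, 3, 4, 5, 6}, initiator=3))
print(ring_election(ring, alive={1, 2, 3, 4, 5}, initiator=3))
```

The first call elects P6 with every node alive; the second call skips the offline P6 and elects P5.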

ZAB - Atomic Broadcast Protocol for ZooKeeper

As we all know, Apache ZooKeeper is a distributed coordination service widely used in cloud computing. At its core is an atomic broadcast protocol, ZAB (ZooKeeper Atomic Broadcast). ZAB borrows ideas from Paxos, but it is neither Basic-Paxos nor Multi-Paxos.

ZooKeeper has had two leader election algorithms, LeaderElection and FastLeaderElection, with FastLeaderElection as the default. FastLeaderElection is often likened to Fast-Paxos, although it is ZooKeeper's own election protocol rather than a textbook Fast-Paxos implementation.

Below I will introduce the implementation of leader election in the ZAB protocol.

First, suppose we have three nodes: S1, S2, and S3. Each node has its own local data and a ballot box. The local data consists of myid, zxid, and epoch.

myid: The node ID that must be configured when each node is initialized. It is a unique number.
epoch: The election round number. It defaults to 0 and is incremented with each election, so comparing epochs tells you the order of elections.
zxid: ZooKeeper's global transaction ID, a unique 64-bit number. The high 32 bits are the epoch and the low 32 bits are a counter. How is zxid kept globally unique? Once a leader has been elected, every write first increments the zxid on the leader node and is then synchronized to the follower nodes. Keeping a number increasing and non-repeating on a single node is much simpler. Comparing zxids tells you the order in which events occurred.
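The zxid layout can be illustrated with a little bit arithmetic. The helper names `make_zxid` and `split_zxid` below are made up for this sketch and are not ZooKeeper's API.

```python
from typing import Tuple

COUNTER_MASK = 0xFFFFFFFF  # low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack the epoch into the high 32 bits and the counter into the low 32."""
    return (epoch << 32) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Unpack a zxid into (epoch, counter)."""
    return zxid >> 32, zxid & COUNTER_MASK


zxid = make_zxid(epoch=3, counter=7)
print(hex(zxid))          # epoch in the high half, counter in the low half
print(split_zxid(zxid))   # back to (epoch, counter)
# Any zxid from a later epoch is larger than every zxid from an earlier one.
print(make_zxid(2, 0) > make_zxid(1, COUNTER_MASK))
```

Because the epoch occupies the high bits, a plain integer comparison of two zxids orders them by epoch first and by counter second.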

Now voting starts. Each vote consists of the node's local data described above, [myid, zxid, epoch]. Each node first votes for itself, puts that vote in its own ballot box, and then broadcasts the vote to the other nodes.

After one round of vote exchange, each node's ballot box now has the votes of all nodes.

Each node then compares the votes in its ballot box. The rules are as follows:

First the zxid is compared, and the vote with the largest zxid wins (the larger the zxid, the newer the data). If the zxids are the same, the myid (that is, the serverId of the node) is compared, and the larger myid wins. If the winning vote differs from the node's current vote, the node updates its vote and broadcasts the updated vote to the other nodes.
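The comparison can be written as a small function. This is a sketch of only the two rules described here, assuming the larger myid wins a tie on zxid; `Vote` and `winning_vote` are hypothetical names, and the sample values are made up.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    myid: int    # ID of the node being voted for
    zxid: int    # latest transaction ID known to that node
    epoch: int   # election round the vote belongs to


def winning_vote(a: Vote, b: Vote) -> Vote:
    """Larger zxid wins; on equal zxid, the larger myid wins."""
    return a if (a.zxid, a.myid) > (b.zxid, b.myid) else b


s1 = Vote(myid=1, zxid=0x100000005, epoch=1)
s3 = Vote(myid=3, zxid=0x100000005, epoch=1)
print(winning_vote(s1, s3).myid)  # same zxid, so the myid decides
```

Comparing the pair (zxid, myid) as a tuple gives exactly this ordering, since Python compares tuples element by element.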

Node S3: According to the competition rules, the winning vote is S3 itself, so there is no need to update the local vote and broadcast again.

Nodes S1 and S2: According to the competition rules, re-vote for S3, overwrite the previous vote for themselves, and broadcast the vote again.

Note that if multiple votes from the same node are received in the same election round, the latest vote overwrites the earlier one.

At this point, node S3 has received the re-votes from nodes S1 and S2. Together with its own vote for itself, this satisfies the "majority principle", so node S3 becomes the leader while S1 and S2 become followers. From then on, the leader regularly sends heartbeats to the followers as a liveness check.
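The two voting rounds and the majority check can be put together in a short simulation. It is a simplified sketch under stated assumptions: `elect_leader` is a hypothetical function, every participating node is assumed to see every ballot, and the zxid values are made up.

```python
from collections import Counter
from typing import Dict, Optional


def elect_leader(cluster_size: int, last_zxid: Dict[int, int]) -> Optional[int]:
    """Return the elected myid, or None when no majority is reached.

    `last_zxid` maps the myid of each participating node to its latest zxid.
    """
    if not last_zxid:
        return None
    # Round 1: every node votes for itself and broadcasts (zxid, myid).
    first_round = [(zxid, myid) for myid, zxid in last_zxid.items()]
    # Round 2: every node compares the ballots and re-votes for the best one.
    best_myid = max(first_round)[1]
    second_round = {voter: best_myid for voter in last_zxid}
    # The candidate needs votes from more than half of the whole cluster.
    votes = Counter(second_round.values())[best_myid]
    return best_myid if votes > cluster_size // 2 else None


print(elect_leader(3, {1: 100, 2: 100, 3: 100}))  # equal zxid, largest myid wins
print(elect_leader(3, {1: 200, 3: 100}))          # newer data beats larger myid
print(elect_leader(3, {2: 50}))                   # one vote out of three: no leader
```

The last call shows why the majority matters: a single reachable node cannot elect itself in a three-node cluster.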

Summary

This article introduced several classic leader election algorithms in distributed systems: ZAB, Bully, and Token Ring Election. Some follow the rule that "the eldest is the greatest", while others use "democratic voting", where the minority obeys the majority. You can compare their advantages and disadvantages and choose an appropriate election algorithm for your application.

Why is the Paxos algorithm not covered? Because Paxos is a consensus algorithm. In Basic-Paxos, consensus can be reached without a leader node, which might be called "equality of all beings", and the leader in Multi-Paxos exists only to improve efficiency. Paxos is of course very important; it can be said to be a foundation of distributed systems.
