
Preface

I believe everyone is familiar with Zookeeper: many distributed middleware systems use zk to provide distributed consistency and coordination. Dubbo officially recommends zk as its registry, and zk is also an important component of Hadoop and HBase. It shows up in plenty of other well-known open source middleware as well.

Many people have known about zk for a long time, understand its basic ideas, and know how to use it. But when asked in an interview about the election and data synchronization mechanisms inside a zk cluster, they draw a blank.

In fact, the election and synchronization of many distributed middleware systems are similar to zk's. In this article, I will focus on the election and synchronization mechanisms within a zk cluster.

Deployment of a ZK cluster

First, let's look at the majority ("more than half alive") mechanism:

A cluster requires at least three servers, and the official documentation strongly recommends an odd number of servers, because zookeeper judges whether the whole cluster is available by whether the majority of its nodes are alive. With 3 nodes, half is 1.5, which rounds up to 2: if two of them go down, the whole cluster is down. With an even number such as 4, losing two nodes also means the majority is no longer alive, so the cluster goes down as well. In other words, 4 servers tolerate no more failures than 3, which is not a cost-effective use of resources.
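To make the arithmetic concrete, here is a minimal sketch of that majority rule (my own illustration, not ZooKeeper source code); the examples in main show why 4 servers tolerate no more failures than 3:

public class QuorumDemo {
    // Smallest number of alive voting nodes required: floor(n / 2) + 1.
    static int quorumSize(int totalNodes) {
        return totalNodes / 2 + 1;
    }

    static boolean clusterAvailable(int totalNodes, int aliveNodes) {
        return aliveNodes >= quorumSize(totalNodes);
    }

    public static void main(String[] args) {
        System.out.println(clusterAvailable(3, 2)); // true:  3 nodes survive 1 failure
        System.out.println(clusterAvailable(3, 1)); // false: 2 failures kill a 3-node cluster
        System.out.println(clusterAvailable(4, 2)); // false: 2 failures also kill a 4-node cluster
    }
}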

Configuration syntax:

server.<node ID>=<IP>:<data sync port>:<election port>
  • Node ID: a number between 1 and 255, written into the {dataDir}/myid file on the corresponding service node.
  • IP address: the address of the node. The addresses may be identical (a pseudo-cluster), but in production different machines are recommended, otherwise fault tolerance cannot be achieved.
  • Data synchronization port: the port used for data replication between the leader and its followers.
  • Election port: the port used by the nodes for leader election.

Assuming there are 3 zookeeper nodes, we need to write a config file for each of them, 3 copies in total, placed on the different servers. The configuration is as follows:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zk_data
clientPort=2181
# cluster configuration
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888

The dataDir parameter specifies the directory where zk stores its data. Inside it is a file named myid; on the 3 machines the myid files contain 1, 2, and 3 respectively, matching each node's ID.

After writing the 3 configuration files, start each node separately, and we have a 3-node zookeeper cluster up and running:

./bin/zkServer.sh start conf/zoo.cfg

Roles in the ZK cluster

There are three roles in a zookeeper cluster: leader, follower, and observer.

  • leader: The master node. It handles write requests and is chosen by election; if it goes down, a new leader is elected.
  • follower: A follower node. It serves read requests, and it is also a candidate for leader, with the right to vote.
  • observer: An observer node. It also serves read requests, but unlike a follower it has no voting rights and cannot be elected leader. Observers are also not counted when calculating whether the cluster is available.

About the observer configuration:

Just add the observer suffix to a node's line in the cluster configuration, for example:

server.3=127.0.0.1:2883:3883:observer

Among these, there is exactly one leader; the rest are followers and observers. In production we generally do not configure observers, because observers have no voting rights. You can think of an observer as a temporary worker rather than a formal employee, one who can never be promoted; apart from that, it behaves the same as a follower.

So when are observers useful? zk workloads generally have many more reads than writes, so when the read pressure on the whole cluster gets too high, we can bring in a few of these temporary-worker observers for a performance boost, and remove them at any time once they are no longer needed.

When connecting to zk, we generally configure all the zk nodes, separated by commas. In fact, connecting to any one node, or any subset of the cluster, also works; it's just that if only one node is configured and that node goes down, we are disconnected. So it is recommended to configure multiple nodes in the connection string.
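For example, with ZooKeeper's native Java client, the comma-separated list is passed as the connect string. A minimal sketch; the addresses are the ones from the sample config above, and the 30-second session timeout is an arbitrary choice:

import org.apache.zookeeper.ZooKeeper;

public class ConnectDemo {
    public static void main(String[] args) throws Exception {
        // The client picks one node from this list and automatically
        // fails over to another if the current connection is lost.
        String connectString = "192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181";
        ZooKeeper zk = new ZooKeeper(connectString, 30000,
                event -> System.out.println("state: " + event.getState()));
        // ... read and write znodes here ...
        zk.close();
    }
}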

How to view the roles in the ZK cluster

We can use the following command to view a node's role in the zk cluster:

./bin/zkServer.sh status conf/zoo.cfg

I built a 3-node pseudo-cluster on my own machine (all nodes sharing one host), with the configuration files named zoo1.cfg, zoo2.cfg, and zoo3.cfg. Checking with the command above gives:

[screenshot: status output — node 2 is the leader, nodes 1 and 3 are followers]

As you can see, node 2 is the leader and the others are followers. If you start the nodes in the order zoo1.cfg, zoo2.cfg, zoo3.cfg, then no matter how many times you repeat the startup, node 2 is always the leader. And if you now shut down node 2 and check the roles again, you will find that node 3 has become the leader.

[screenshot: status output after node 2 is stopped — node 3 is now the leader]

Both of these phenomena are explained by zookeeper's election mechanism.

The election mechanism of a ZK cluster

Let's use a 3-node zk cluster for a simple illustration of the election:

[diagram: election process among the 3 nodes]

zk conducts multiple rounds of voting until some node's vote count reaches more than half of the nodes. With 3 nodes started in order, there are 2 rounds of voting in total:

  • In the first round, each node votes for itself as it starts, so zk1, zk2, and zk3 each hold one vote.
  • In the second round, each running node switches its vote to the node with the larger myid. So when zk2 starts, it picks up zk1's vote in addition to its own, for 2 votes in total. 2 votes is more than half of the current total number of nodes, so voting ends and zk2 is elected leader.

Some readers will ask: what about zk3? Since zk2 has already been elected and voting has ended, nobody switches their vote to zk3 when it starts; zk3 simply joins the cluster as a follower.

Of course, this is a simplified version of the election. The real election also compares the zxid, which will be discussed later.

So when is a zk election triggered? One trigger is cluster startup; the other is the leader going down. In the example above, if node 2 goes down, the rules say the new leader should be zk3, the surviving node with the largest myid.
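To see the two rounds in code form, here is a toy simulation of the simplified myid-only rule above (my own sketch, not ZooKeeper's real FastLeaderElection). Starting the nodes in order 1, 2, 3 always elects zk2, matching the phenomenon observed earlier:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SimpleElectionDemo {
    public static void main(String[] args) {
        int total = 3;
        int quorum = total / 2 + 1;               // 2 votes needed in a 3-node cluster
        List<Integer> started = new ArrayList<>();
        for (int myid = 1; myid <= total; myid++) {
            started.add(myid);
            // Every started node switches its vote to the largest myid seen so far.
            int candidate = Collections.max(started);
            int votes = started.size();
            if (votes >= quorum) {
                System.out.println("leader elected: zk" + candidate);
                break;                            // voting ends; zk3 joins later as a follower
            }
        }
        // Prints "leader elected: zk2".
    }
}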

The data synchronization mechanism of a ZK cluster

Zookeeper's data synchronization ensures that all nodes hold consistent data. It is roughly divided into two processes: the normal client write flow, and the data recovery flow after a node in the cluster goes down.

The normal client write flow

The client write flow is roughly as follows: the leader receives the client's write request and then synchronizes it to each of the other nodes:

[diagram: the leader receives the write and synchronizes it to the other nodes]

At this point some readers get confused: the client generally connects to all the nodes, and it does not know which one is the leader.

Indeed, the client establishes connections to all configured nodes and sends requests to them in turn, for example node 1 for the first request, node 2 for the second, and so on.

If the node the client happens to send the write to is the leader, the flow above applies. If the connected node is not the leader but a follower, the flow becomes:

[diagram: a follower forwards the write to the leader, which broadcasts proposals (blue lines) and collects ACKs (green lines)]

If the node the client picks is a follower, that follower forwards the request to the current leader. The leader then broadcasts the proposal to all followers along the blue lines, and after synchronizing the data each node tells the leader along the green lines that the data has been synchronized (but not yet committed). Once the leader has received ACK confirmations from more than half of the nodes, it considers the data committable, broadcasts a commit to all follower nodes, and every node commits the data. The whole synchronization is then finished.
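A sketch of the leader-side commit decision in that flow (hypothetical names, not ZooKeeper source): a proposal may be committed only after ACKs from more than half of the voting nodes, observers excluded:

import java.util.HashSet;
import java.util.Set;

class ProposalTracker {
    private final int votingNodes;              // leader + followers; observers do not count
    private final Set<Long> acks = new HashSet<>();

    ProposalTracker(int votingNodes) {
        this.votingNodes = votingNodes;
    }

    // Record one ACK (the leader ACKs its own proposal too).
    // Returns true as soon as a majority is reached and COMMIT can be broadcast.
    boolean ack(long serverId) {
        acks.add(serverId);
        return acks.size() >= votingNodes / 2 + 1;
    }
}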

Next, let's talk about the data synchronization process after a node goes down.

When the leader of a zookeeper cluster goes down, a new election is triggered. During the election the whole cluster cannot serve external requests; service resumes only once a new leader has been chosen.

Back to the 3-node example of zk1, zk2, zk3, where zk2 is the leader and zk1, zk3 are followers. Suppose zk2 goes down, triggering a re-election; by the election rules zk3 becomes the new leader, and the cluster now consists of just zk1 and zk3. Suppose the cluster then creates another piece of node data, after which zk2 restarts. zk2's data is now necessarily older than that of zk1 and zk3, so how does it get back in sync?

Zookeeper resolves this with the ZXID, a 64-bit transaction ID. The lower 32 bits are a simple counter: any data change increments them by 1. The upper 32 bits are the leader epoch number: whenever a new leader is elected, it takes the latest ZXID from its local transaction log, parses the epoch out of the upper 32 bits, adds 1 to it, and resets the lower 32 bits to 0. This guarantees that ZXIDs stay unique and monotonically increasing across leader changes.
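The bit layout is easy to verify with a few lines of Java (a sketch of the scheme just described, not ZooKeeper source):

public class ZxidDemo {
    static long epoch(long zxid)   { return zxid >>> 32; }          // upper 32 bits: leader epoch
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }   // lower 32 bits: change counter

    // What a newly elected leader conceptually does: epoch + 1, counter reset to 0.
    static long nextEpochZxid(long lastZxid) {
        return (epoch(lastZxid) + 1) << 32;
    }

    public static void main(String[] args) {
        long zxid = 0x200000005L;                          // epoch 2, 5th change under this leader
        System.out.println(epoch(zxid));                   // 2
        System.out.println(counter(zxid));                 // 5
        System.out.printf("0x%x%n", nextEpochZxid(zxid));  // 0x300000000
    }
}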

The command to view the ZXID of a data node is:

First enter the zk client command line:
./bin/zkCli.sh -server 127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183
Then run stat followed by the data node's path:
stat /test

The output looks like this:

[screenshot: output of stat /test]

As you can see, there are 3 ZXID fields, each of which means:

cZxid: the transaction ID when the node was created.

mZxid: the transaction ID when the node's data was last updated.

pZxid: the transaction ID of the last change to this node's children (a child created or deleted).
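The same three fields can also be read programmatically; the native Java client returns them in a Stat object (a minimal sketch, assuming a node /test exists on a local server):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {});
        Stat stat = zk.exists("/test", false);   // same data as `stat /test` in zkCli
        if (stat != null) {
            System.out.printf("cZxid=0x%x mZxid=0x%x pZxid=0x%x%n",
                    stat.getCzxid(), stat.getMzxid(), stat.getPzxid());
        }
        zk.close();
    }
}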

The command to view a server's latest ZXID is:

echo stat|nc 127.0.0.1 2181
(This command requires appending 4lw.commands.whitelist=* to the cfg file in advance and then restarting the server.)

[screenshot: output of echo stat | nc]

The Zxid shown here is the ID of the current server's most recent transaction.

If the data of the entire cluster is consistent, then every node's latest ZXID should be the same. Zookeeper therefore uses this ordered ZXID to ensure data consistency between nodes. Back to the earlier question: when the old leader restarts after going down, it compares its latest ZXID with the current leader's; if its ZXID is smaller than the leader's ZXID, it synchronizes the missing data from the leader.

A second look at ZK elections

Now let's revisit zk's election mechanism with the concept of the ZXID in hand.

Suppose we again have the three-node cluster with zk2 as the leader, and zk2 goes down. zk3 is elected as the new leader, with zk1 as a follower. Now one piece of data in the cluster is updated, and then zk1 and zk3 are both shut down. Finally zk1, zk2, and zk3 are restarted one by one. After this startup, can zk2 be elected leader again?

Put another way: the nodes start one by one, zk1 and zk3 hold the latest data, but zk2's data is stale. Under the election rules described earlier, can zk2 still win the election?

The answer is no; in the end, zk1 is elected.

Why is that?

Because zk2's latest ZXID is no longer the newest, and the election gives priority to the node with the larger ZXID. Here the largest ZXIDs belong to zk1 and zk3, so the leader can only come from those two. Following the startup order and the earlier rules: zk1 starts first and votes for itself; when zk2 starts, it sees that zk1's ZXID is larger than its own and votes for zk1 instead. zk1 now has 2 votes, which is more than half, so zk1 is elected leader.
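The full comparison rule can be written as a small predicate, modeled on the total-order comparison in ZooKeeper's FastLeaderElection (a simplified sketch): a higher epoch wins, then a higher ZXID, then a higher myid as the final tie-breaker:

public class VoteCompareDemo {
    // Returns true if the challenger should replace the current vote:
    // compare leader epoch first, then ZXID, then myid.
    static boolean challengerWins(long newEpoch, long newZxid, long newId,
                                  long curEpoch, long curZxid, long curId) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch;
        if (newZxid  != curZxid)  return newZxid  > curZxid;
        return newId > curId;
    }

    public static void main(String[] args) {
        // zk2 restarts with a stale ZXID: it loses to zk1 even with the same epoch.
        System.out.println(challengerWins(2, 0x200000005L, 2,   // zk2's stale vote
                                          2, 0x200000006L, 1)); // zk1's newer vote -> false
    }
}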

Finally

In fact, zk's election and synchronization are not complicated. If you build a 3-node pseudo-cluster locally and run through the cases above, you should be able to understand the whole process. As a veteran consistency-and-coordination middleware, zk is also a frequent interview topic; if you understand the core content of this article, this type of question will no longer be a blind spot. Finally, if you like this article, please like, follow, and share.


