In-depth understanding of the core principles of Zookeeper

The previous article, Zookeeper basic principles & detailed application scenarios, introduced the basic principles and application scenarios of Zookeeper, including its underlying storage principles and how to use Zookeeper to implement distributed locks. But I think that only scratched the surface, so this article gives a detailed introduction to Zookeeper's core underlying principles. Those who are not familiar with Zookeeper can look back at the previous article first.

ZNode

The ZNode is the basis of Zookeeper and its smallest unit of data storage. Zookeeper abstracts a storage structure similar to a file system into a tree, and each node in the tree is called a ZNode. Each ZNode maintains a data structure that records version numbers for data changes and ACL (Access Control List) changes, together with timestamps.

With these version numbers and update timestamps, Zookeeper can validate the cache held by a client and coordinate updates.

Moreover, when a Zookeeper client executes an update or delete operation, it must supply the version number of the data it intends to modify. If Zookeeper detects that the supplied version number does not match the current one, it will not perform the operation. If it matches, the data in the ZNode is updated and the corresponding version number is updated along with it.
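The version check described above is a form of optimistic concurrency control. Here is a minimal in-memory sketch of the idea. It is not the Zookeeper client API: ZNode, set_data and BadVersionError are hypothetical names used only for illustration. It does borrow one convention from the real API, where an expected version of -1 skips the check.

```python
from dataclasses import dataclass


class BadVersionError(Exception):
    """Raised when the caller's expected version does not match the node's."""


@dataclass
class ZNode:
    data: bytes = b""
    version: int = 0  # number of data changes so far


def set_data(node: ZNode, data: bytes, expected_version: int) -> int:
    """Update node data only if expected_version matches; -1 skips the check."""
    if expected_version != -1 and expected_version != node.version:
        raise BadVersionError(
            f"expected version {expected_version}, node is at {node.version}"
        )
    node.data = data
    node.version += 1  # the version is bumped together with the data
    return node.version


node = ZNode(data=b"v0")
new_version = set_data(node, b"v1", expected_version=0)  # succeeds

try:
    set_data(node, b"stale", expected_version=0)  # version 0 is now stale
    stale_rejected = False
except BadVersionError:
    stale_rejected = True
```

A client that loses this race would normally re-read the node, pick up the new version number, and retry.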

This kind of version number logic is used by many frameworks. For example, in RocketMQ, when a Broker registers with the NameServer, it also carries such a version number, called DataVersion.

Next, let's look at the data structure that maintains these version numbers. It is called the Stat Structure, and its fields are:

Field	Meaning
czxid	The zxid of the change that created the node
mzxid	The zxid of the change that last modified the node
pzxid	The zxid of the change that last modified the children of the node
ctime	The time the node was created, in milliseconds since the Unix epoch
mtime	The time the node was last modified, in milliseconds since the Unix epoch
version	The number of changes to the data of the current node (that is, the version number)
cversion	The number of changes to the child nodes of the current node
aversion	The number of ACL changes of the current node
ephemeralOwner	SessionID of the owner of the current temporary node (0 if it is not a temporary node)
dataLength	The length of the data of the current node
numChildren	The number of children of the current node

For example, through the stat command, we can view the specific value of the Stat Structure in a ZNode.

The zxid here is related to the Zookeeper cluster and will be introduced in detail later.

ACL

ACL (Access Control List) is used to control permissions on a ZNode, and its permission control is similar to that in Linux. Linux has three types of permission: read, write and execute, corresponding to the letters r, w and x respectively. The permission granularity is also divided into three levels: owner permission, group permission and permission for other users, for example:

drwxr-xr-x  3 USERNAME  GROUP  1.0K  3 15 18:19 dir_name

What is granularity? Granularity is the classification of the objects that the permissions act on. To put the three granularities another way, permissions are defined for the user (Owner), the group the user belongs to (Group), and everyone else (Other). This is a standard for access control and a typical three-part model.

Although Zookeeper's model is also three-part, its three parts are different. In Zookeeper they are Scheme, ID and Permissions, which mean the permission mechanism, the users allowed to access, and the specific permissions respectively.

Scheme represents a permission mode, which has the following 5 types:

world: in this Scheme, ID can only be anyone, which means everyone can access
auth: represents users who have passed authentication
digest: uses username + password for verification
ip: only allows certain specific IPs to access the ZNode
x509: authenticates through the client's certificate

At the same time, there are five types of permissions:

CREATE: can create child nodes
READ: can get the node's data and list its child nodes
WRITE: can set node data
DELETE: can delete child nodes
ADMIN: can set permissions

As in Linux, each permission has an abbreviation, for example:

Through the getAcl method, the user can view the permissions of the corresponding ZNode. As shown in the figure, the output has three parts. They are:

scheme: uses world
id: the value is anyone, which means that all users have permission
permissions: the specific permissions are cdrwa, the abbreviations of CREATE, DELETE, READ, WRITE and ADMIN respectively
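To make the three-part form concrete, here is a small sketch that splits a string such as world:anyone:cdrwa into its scheme, id and permissions. parse_acl is a hypothetical helper written for this article, not part of any Zookeeper client. The bit values are the ones Zookeeper defines for its permissions (READ=1, WRITE=2, CREATE=4, DELETE=8, ADMIN=16).

```python
# Permission bits as defined in Zookeeper's ZooDefs.Perms.
PERMS = {"r": 1, "w": 2, "c": 4, "d": 8, "a": 16}
NAMES = {"r": "READ", "w": "WRITE", "c": "CREATE", "d": "DELETE", "a": "ADMIN"}


def parse_acl(acl: str):
    """Split 'scheme:id:perms' into its three parts and decode the perms."""
    scheme, rest = acl.split(":", 1)
    # The id itself may contain ':' (for example a digest id),
    # so the permissions are taken from the last ':' onwards.
    ident, perms = rest.rsplit(":", 1)
    mask = 0
    for letter in perms:
        mask |= PERMS[letter]  # raises KeyError on an unknown letter
    names = [NAMES[letter] for letter in perms]
    return scheme, ident, mask, names


scheme, ident, mask, names = parse_acl("world:anyone:cdrwa")
```

With all five letters present, the mask has every permission bit set, which is why cdrwa is effectively "all permissions".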

Session mechanism

Knowing the Version mechanism of Zookeeper, we can continue to explore the Session mechanism of Zookeeper.

We know that there are 4 types of nodes in Zookeeper, which are persistent nodes, persistent sequential nodes, temporary nodes, and temporary sequential nodes.

As we discussed in the previous article, if a client creates a temporary node and then disconnects, all of its temporary nodes are deleted. In fact, "disconnects" is not very accurate. It is more accurate to say that when the Session the client established with its connection expires, all temporary nodes created in that Session are deleted.

So how does Zookeeper know which temporary nodes are created by the current client?

The answer is the ephemeralOwner (owner of the temporary node) field in the Stat Structure.

As mentioned above, if the current node is a temporary node, then ephemeralOwner stores the SessionID of the owner that created it. With the SessionID, Zookeeper can match the node to the corresponding client. When the Session expires, all temporary nodes created by that client are deleted.
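The bookkeeping behind this can be sketched with a toy in-memory server. This is only an illustration of the idea, assuming a simple index from SessionID to paths; the Server class and its methods are hypothetical and do not mirror Zookeeper's internal classes.

```python
from collections import defaultdict


class Server:
    """Toy server that tracks which session owns which temporary node."""

    def __init__(self):
        self.nodes = {}                     # path -> ephemeralOwner (0 = persistent)
        self.by_session = defaultdict(set)  # session id -> temporary node paths

    def create(self, path, session_id, ephemeral=False):
        owner = session_id if ephemeral else 0
        self.nodes[path] = owner
        if ephemeral:
            self.by_session[session_id].add(path)

    def expire_session(self, session_id):
        """Delete every temporary node whose ephemeralOwner is this session."""
        removed = sorted(self.by_session.pop(session_id, set()))
        for path in removed:
            del self.nodes[path]
        return removed


server = Server()
server.create("/config", session_id=0x1A, ephemeral=False)
server.create("/locks/a", session_id=0x1A, ephemeral=True)
server.create("/locks/b", session_id=0x2B, ephemeral=True)
removed = server.expire_session(0x1A)
```

Only /locks/a disappears: /config was created by the same session but is persistent, and /locks/b belongs to a different session.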

When a service creates a connection, it must provide a connection string listing all servers and ports, separated by commas, for example:

127.0.0.1:2181,127.0.0.1:2888,127.0.0.1:3888

After the Zookeeper client receives this string, it randomly selects a server and port to connect to. If the connection is later lost, the client selects the next server from the string and keeps trying until a connection succeeds.

In addition to this basic IP + port form, Zookeeper 3.2.0 and later versions also support a path in the connection string, for example:

127.0.0.1:2181,127.0.0.1:2888,127.0.0.1:3888/app/a

In this way, /app/a is treated as the root directory for the current service, and every node path it creates is prefixed with /app/a. For example, if I create a node /node_name, its complete path will be /app/a/node_name. This feature is especially suitable for multi-tenant environments, where each tenant believes it is working in the top-level root directory /.
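The following sketch shows how such a connection string could be taken apart and how the path suffix maps client paths to real paths. It is an illustration only: parse_connect_string and full_path are hypothetical helpers, the addresses are example values, and the real client's parsing handles more cases than this.

```python
import random


def parse_connect_string(connect: str):
    """Return (list of (host, port), chroot) from 'h1:p1,h2:p2/chroot'."""
    chroot = ""
    slash = connect.find("/")
    if slash != -1:
        connect, chroot = connect[:slash], connect[slash:]
    servers = []
    for entry in connect.split(","):
        host, port = entry.rsplit(":", 1)
        servers.append((host, int(port)))
    return servers, chroot


def full_path(chroot: str, path: str) -> str:
    """Map a client-visible path to the real path under the chroot."""
    return chroot + path if path != "/" else (chroot or "/")


servers, chroot = parse_connect_string(
    "127.0.0.1:2181,127.0.0.1:2888,127.0.0.1:3888/app/a"
)
random.shuffle(servers)  # the client tries the servers in random order
real = full_path(chroot, "/node_name")
```

The tenant only ever sees /node_name, while the server stores the node under /app/a/node_name.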

Once the Zookeeper client and server have established a connection, the client gets a 64-bit SessionID and a password. What is this password used for? We know that Zookeeper can be deployed as multiple instances. If the client disconnects and connects to another Zookeeper server, it sends this password along when establishing the new connection. The password is a security measure of Zookeeper, and every Zookeeper node can verify it. In this way, the Session remains valid even when the client connects to a different Zookeeper node.

A Session expires in two situations, namely:

The specified expiration time has passed
The client does not send a heartbeat within the specified time

In the first case, the expiration time is passed to the server when the Zookeeper client establishes a connection. The server only supports an expiration time between 2 times tickTime and 20 times tickTime.

tickTime is a configuration item of the Zookeeper server. It is the basic time unit, in milliseconds, that Zookeeper uses for heartbeats and timeouts. A typical setting is tickTime=2000.
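The supported range can be expressed as a simple clamp. This is a sketch of the rule described above, with negotiate_session_timeout as a hypothetical name and tickTime assumed to be 2000 ms. On a real server the two bounds can also be overridden with the minSessionTimeout and maxSessionTimeout settings, which this sketch ignores.

```python
def negotiate_session_timeout(requested_ms: int, tick_time_ms: int = 2000) -> int:
    """Clamp the client's requested timeout to [2 * tickTime, 20 * tickTime]."""
    minimum = 2 * tick_time_ms
    maximum = 20 * tick_time_ms
    return max(minimum, min(requested_ms, maximum))


too_short = negotiate_session_timeout(1_000)    # raised to the minimum
in_range = negotiate_session_timeout(30_000)    # accepted unchanged
too_long = negotiate_session_timeout(120_000)   # lowered to the maximum
```

With tickTime=2000, any request ends up between 4 seconds and 40 seconds.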

Session expiration is managed by the Zookeeper server. Once a Session expires, the server immediately removes all temporary nodes created by that client and notifies every client watching those nodes of the change.

For the second case, the heartbeat in Zookeeper is implemented with PING requests. Whenever the connection has been idle for a while, the client sends a PING request to the server. This is the essence of the heartbeat: it lets the server know that the client is alive, and it lets the client know that its connection to the server is still valid. The client schedules these PINGs well within the session timeout, so that a healthy connection never lets the Session expire.

Watch mechanism

After understanding ZNode and Session, we can finally move on to the next key feature, Watch. The idea of watching (listening to) a node has already come up more than once above. First, one sentence to summarize its role:

Register a listener on a node; once the node changes (for example, it is updated or deleted), the listener receives a Watch Event.

Just as there are multiple types of ZNode, there are also multiple types of Watch, namely one-time Watch and permanent Watch.

One-time Watch: once it is triggered, the Watch is removed
Permanent Watch: it is retained after being triggered and continues to monitor changes on the ZNode. This is a new feature in Zookeeper 3.6.0

A one-time Watch can be set through a parameter when calling getData(), getChildren() or exists(), while a permanent Watch has to be registered with addWatch().

The one-time Watch has a problem: there is a time interval between the Watch event reaching the client and the client setting up a new Watch. If a change occurs during this interval, the client cannot perceive it.
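The missed-event window is easy to reproduce in a toy model. The sketch below is not the Zookeeper API: WatchedNode is a hypothetical class, and its callbacks receive the new data directly so the effect is visible, whereas a real Watch Event only tells the client what kind of change happened and on which path.

```python
class WatchedNode:
    """Toy ZNode that supports one-time and permanent watches."""

    def __init__(self):
        self.data = None
        self.one_time = []   # callbacks removed after they fire once
        self.permanent = []  # callbacks kept after they fire

    def set_data(self, data):
        self.data = data
        fired, self.one_time = self.one_time, []  # one-time watches are cleared
        for callback in fired + self.permanent:
            callback(data)


node = WatchedNode()
seen_once, seen_always = [], []
node.one_time.append(seen_once.append)
node.permanent.append(seen_always.append)

node.set_data("a")  # both watches fire; the one-time watch is now gone
node.set_data("b")  # happens before the client re-registers
node.one_time.append(seen_once.append)  # the client finally re-registers
node.set_data("c")
```

The one-time watcher never hears about "b", while the permanent watcher sees all three changes.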

Zookeeper cluster architecture

ZAB protocol

With the groundwork laid above, we can now look at Zookeeper from the perspective of its overall architecture. To ensure high availability, Zookeeper adopts an architecture based on master-slave read-write separation.

We know that in Redis Cluster, which has a similar master-slave architecture, nodes communicate through the Gossip protocol. So what is the communication protocol in Zookeeper?

The answer is the ZAB (Zookeeper Atomic Broadcast) protocol.

The ZAB protocol is an atomic broadcast protocol that supports crash recovery. It is used to transfer messages between Zookeeper nodes and keep all nodes in sync. ZAB is also designed for high performance and high availability, is easy to use and maintain, and supports automatic failure recovery.

The ZAB protocol divides the nodes in the Zookeeper cluster into three roles, namely Leader , Follower and Observer , as shown in the following figure:

In general, this architecture is similar to the Redis master-slave or MySQL master-slave architecture (if you are interested, you can read my earlier articles, which cover both):

Redis master-slave
MySQL master and slave

The difference is that there are two roles in the usual master-slave architecture, namely Leader and Follower (or Master, Slave), but there is one more Observer in Zookeeper.

So the question is, what is the difference between Observer and Follower?

Essentially, the two have the same function: both give Zookeeper the ability to scale horizontally so that it can handle more concurrency. The difference is that the Observer does not participate in voting during the leader election process.

Sequential consistency

As mentioned above, the Zookeeper cluster uses read-write separation. Only the Leader node can handle write requests. If a Follower node receives a write request, it forwards the request to the Leader node and does not process the write request itself.

After the Leader node receives messages, it processes them one by one in the strict order in which the requests arrived. This is a major feature of Zookeeper: it guarantees the sequential consistency of messages.

For example, if message A arrives before message B, then on every Zookeeper node message A is applied before message B. Zookeeper guarantees the global order of messages.

zxid

How does Zookeeper guarantee the order of messages? The answer is through zxid .

You can simply think of zxid as the unique ID of a message in Zookeeper. Nodes communicate and synchronize data by sending Proposals (transaction proposals), and each Proposal carries a zxid and the specific data (the message). A zxid is a 64-bit number that consists of two parts:

epoch: the high 32 bits. It can be understood as a dynasty, or a version that advances with each Leader. Each Leader's epoch is different
counter: the low 32 bits. A counter that increases automatically with every message

This is also the underlying implementation of the unique zxid generation algorithm. Because each Leader uses a unique epoch, and different messages within the same epoch have different counter values, every Proposal in the Zookeeper cluster has a unique zxid.
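Packing the two parts into one number is plain bit arithmetic. The sketch below assumes the 64-bit layout with the epoch in the high 32 bits and the counter in the low 32 bits; make_zxid and split_zxid are hypothetical helper names.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack epoch (high 32 bits) and counter (low 32 bits) into one id."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int):
    """Return (epoch, counter) for a zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


old_leader_last = make_zxid(epoch=1, counter=500)
new_leader_first = make_zxid(epoch=2, counter=0)
```

Because the epoch sits in the high bits, any zxid from a newer Leader compares greater than every zxid from an older one, no matter how large the old counter was. That is what lets a plain integer comparison order messages across Leader changes.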

Recovery mode

A normally running Zookeeper cluster is in broadcast mode. Conversely, if the Leader crashes or can no longer reach more than half of the nodes, the cluster enters recovery mode.

What is the recovery mode?

In the Zookeeper cluster, there are two modes, namely:

Recovery mode
Broadcast mode

When the Zookeeper cluster fails, it enters recovery mode, also called Leader Activation. As the name suggests, the Leader is elected at this stage. The nodes generate a zxid and a Proposal and then vote among themselves. Voting follows two main principles:

The zxid of the elected Leader must be the largest among all Followers
More than half of the Followers must have returned an ACK, indicating that they recognize the elected Leader
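The two principles can be sketched in a few lines. This is a heavily simplified model, not Zookeeper's actual election algorithm: there are no rounds of message exchange, elect_leader is a hypothetical function, and breaking zxid ties by the larger server id is an assumption made for this sketch.

```python
def elect_leader(last_zxids: dict):
    """Pick the reachable server with the largest zxid if a majority can vote.

    last_zxids maps server id -> last zxid, or None if the server is down.
    Ties on zxid are broken by the larger server id (an assumption here).
    """
    total = len(last_zxids)
    alive = {sid: zxid for sid, zxid in last_zxids.items() if zxid is not None}
    quorum = total // 2 + 1
    if len(alive) < quorum:
        return None  # a majority cannot acknowledge anyone: no Leader
    return max(alive, key=lambda sid: (alive[sid], sid))


leader = elect_leader({1: 0x100000005, 2: 0x100000007, 3: None})
no_leader = elect_leader({1: 0x100000005, 2: None, 3: None})
```

In the first call two of three servers are reachable, so the one with the larger zxid wins. In the second call only one server is left, which is not a majority, so nobody can be elected.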

If an exception occurs during the election, Zookeeper simply starts a new round of elections. If all goes well, the Leader is elected successfully, but at this point the cluster still cannot serve requests normally, because the crucial data synchronization between the new Leader and the Followers has not yet been performed.

After that, the Leader waits for the remaining Followers to connect and then sends the missing data to each Follower through Proposals.

As for how to know which data is missing: Proposals themselves are recorded in a log, so a diff can be computed from the value in the lower 32-bit counter of the zxid in each Proposal.

Of course, there is an optimization here. If too much data is missing, sending Proposals one by one is too inefficient. So if the Leader finds that a Follower is missing too much data, it takes a snapshot of the current data, packages it, and sends it to the Follower directly.
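The choice between replaying Proposals and shipping a snapshot can be sketched as follows. This is an assumption-laden illustration: plan_sync is a hypothetical function, the "DIFF" and "SNAP" labels are just names for the two strategies in this sketch, and snapshot_threshold is a made-up cut-off, not a real Zookeeper setting.

```python
def plan_sync(leader_log: list, follower_last_zxid: int, snapshot_threshold: int = 3):
    """Decide how to bring a follower up to date.

    leader_log is the leader's committed zxids in ascending order.
    snapshot_threshold is a hypothetical cut-off chosen for this sketch.
    """
    missing = [zxid for zxid in leader_log if zxid > follower_last_zxid]
    if len(missing) > snapshot_threshold:
        return "SNAP", []   # too far behind: ship a full snapshot
    return "DIFF", missing  # replay only the missing proposals


log = [1, 2, 3, 4, 5, 6]
near = plan_sync(log, follower_last_zxid=4)
far = plan_sync(log, follower_last_zxid=1)
```

A Follower that is two Proposals behind gets just those two, while one that is five behind gets the snapshot instead.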

The newly elected Leader's epoch is the original value plus 1, and its counter is reset to 0.

Do you think it's over here? In fact, the cluster still cannot serve requests normally.

After data synchronization is completed, the Leader sends a NEW_LEADER Proposal to the Followers. Only after more than half of the Followers have returned an ACK for this Proposal does the Leader commit the NEW_LEADER Proposal, and only then can the cluster work normally.

At this point, the recovery mode ends, and the cluster enters the broadcast mode .

Broadcast mode

In broadcast mode, after the Leader receives a message, it sends a Proposal (transaction proposal) to all the Followers, and each Follower returns an ACK to the Leader after receiving the Proposal. When the Leader has received a quorum of ACKs, the current Proposal is committed and applied to the nodes' memory. How many is a quorum?

Zookeeper officially recommends that more than half of the Zookeeper nodes return an ACK. Assuming there are N Zookeeper nodes, the formula is N/2 + 1, with the division rounded down.

If that is not very intuitive, in plain language: once more than half of the Followers have returned an ACK, the Proposal can be committed and applied to the ZNodes in memory.

Zookeeper uses 2PC to ensure data consistency between nodes (as shown in the figure above). But if the Leader had to wait for every Follower, the communication overhead would grow and Zookeeper's performance would drop. So, to improve performance, the requirement that all Follower nodes return an ACK is relaxed to more than half of the Followers returning an ACK.
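The quorum formula is worth working through with a few numbers. In the sketch below, quorum_size and can_commit are hypothetical helper names, and integer division stands in for "N/2 rounded down".

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def can_commit(ack_count: int, n: int) -> bool:
    """A proposal can be committed once a quorum of ACKs has arrived."""
    return ack_count >= quorum_size(n)


# For each cluster size: (quorum, number of failures the cluster survives)
table = {n: (quorum_size(n), n - quorum_size(n)) for n in (3, 4, 5)}
```

A 3-node cluster needs 2 ACKs and survives 1 failure, and a 4-node cluster needs 3 ACKs but still survives only 1 failure. It takes 5 nodes to survive 2 failures, which is why clusters are usually deployed with an odd number of nodes.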

Well, that is the entire content of this post. You are welcome to search for and follow [SH's full stack notes], and reply [queue] to obtain MQ learning materials, including basic concept analysis and detailed RocketMQ source code analysis, which are updated continuously.
If you found this article helpful, please give it a thumbs up, follow, share, or leave a comment.
