The evolution of ZooKeeper's service form in Alibaba

Author: Grass Valley

Inside Alibaba, Apache ZooKeeper has gone through several phases: adopting the open source version for internal use, optimizing it in depth, contributing back to the community, and finally developing an enterprise edition that serves cloud customers. To keep the rest of this article clear, the key terms used along the way are defined below.

Apache ZooKeeper: Provides distributed coordination services such as distributed locks and distributed queues, and can also be used as a registry and configuration center.
TaoKeeper: A deeply modified system based on ZooKeeper that began serving Taobao in 2008.
MSE: Alibaba Cloud's one-stop microservice platform for the industry's mainstream open source microservice ecosystem.
ZooKeeper Enterprise Service: A sub-product of MSE that provides an enhanced, cloud-hosted version of open source ZooKeeper, available in basic and professional editions.

The evolution of ZooKeeper's service form in Alibaba

As early as 2008, Alibaba built TaoKeeper, distributed coordination software based on the open source implementation of ZooKeeper and tailored to Taobao's e-commerce business. This coincided with Taobao's service-oriented transformation, when various other distributed middleware products such as HSF, ConfigServer, and VIPServer were also born.

About ten years later, in 2019, Alibaba launched its company-wide move to the cloud, and all products needed to be upgraded to a public cloud architecture. MSE was born at that time, and it has been compatible with mainstream ZooKeeper versions since it went online.

The whole process went through the following three stages:

Stage 1: Version 1.0, in 2008, mainly supported applications within the group that needed distributed coordination. At that time all businesses were co-located, with more than 1,000 applications, and eventually 150+ shared clusters were operated and maintained by hand. As time went by and businesses were split into microservices, the capacity of the shared clusters grew explosively. This caused two problems. First, because businesses shared clusters, the blast radius of any failure was large, which put stability at great risk. Second, routine operations such as replacing a machine could affect the whole cluster, and a single configuration mistake would affect every business.

Stage 2: To solve the problems of stage 1, we evolved ZooKeeper to version 2.0. Containerization was just emerging at that time. After carefully studying a containerization plan, and confirming that both performance and operability could meet requirements, we carried out a large number of changes: splitting businesses, migrating clusters, and operating and maintaining each cluster as the smallest stable unit, so that we could finally sleep peacefully. After the split, relying on the large-scale operation and maintenance capabilities of K8s, these problems were largely solved. The result was dedicated (exclusive-mode) clusters with resource isolation, and the SLA improved to 99.9%.

Stage 3: Migrating to the cloud to provide public cloud services, which brought the evolution to version 3.0. This version focuses on enhancements over open source, such as building on Dragonwell, JVM parameter tuning, Prometheus integration, multi-AZ deployment with nodes forcibly spread evenly across zones, dynamic configuration, and smooth scale-out and scale-in. Observability, high availability, and security have been improved, and the SLA can reach 99.95%.

Best Practices for ZooKeeper in Technical Scenarios

Next, I will introduce best-practice scenarios for ZooKeeper, which fall into three categories:

Microservices, where the representative integrations are Dubbo and Spring Cloud
Big data, where the representative integrations are Flink, HBase, Hadoop, and Kafka
Self-developed distributed systems, including systems within your own company, that need distributed coordination such as distributed locks

Microservice Field - Registry

In microservice scenarios, ZooKeeper is mainly used as a registry, relying on its registration/subscription model. Dubbo stores its registration data in ZooKeeper as a tree of nodes organized by service.

When a Provider starts, it creates an ephemeral (temporary) node under the fixed providers path in ZooKeeper and stores its service information in that node, such as the application name, IP address, and port. When a Consumer starts, it watches all child nodes under the providers path of the service it needs. Whenever those children change, ZooKeeper notifies the Consumer, which then obtains the address list of all Providers. The lifetime of a Provider's ephemeral node is tied to the session (long-lived connection) between the Provider and ZooKeeper. When the Provider goes down or deliberately goes offline, the ephemeral node is deleted. Consumers subscribed to the service receive the event through their Watch, update their address list, and remove that Provider from their list of call targets.
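
The register/subscribe flow above can be sketched with a small in-memory model. This is a simulation, not ZooKeeper client code: FakeRegistry, Consumer, the service path, and the addresses are all hypothetical names chosen for illustration. Unlike real ZooKeeper watches, which fire once and must be re-registered, the watch in this model stays registered.

```python
from collections import defaultdict

# Hypothetical path, modeled on the providers directory described above.
PROVIDERS = "/dubbo/com.example.DemoService/providers"


class FakeRegistry:
    """In-memory stand-in for ephemeral nodes and child watches (not a real client)."""

    def __init__(self):
        self.children = defaultdict(dict)  # path -> {node name: owning session}
        self.watchers = defaultdict(list)  # path -> callbacks taking the child list

    def create_ephemeral(self, path, name, session):
        self.children[path][name] = session
        self._notify(path)

    def watch_children(self, path, callback):
        self.watchers[path].append(callback)
        callback(sorted(self.children[path]))  # deliver the current list first

    def close_session(self, session):
        # A crash or a clean shutdown ends the session, which deletes its nodes.
        for path in list(self.children):
            nodes = self.children[path]
            owned = [name for name, owner in nodes.items() if owner == session]
            for name in owned:
                del nodes[name]
            if owned:
                self._notify(path)

    def _notify(self, path):
        for callback in self.watchers[path]:
            callback(sorted(self.children[path]))


class Consumer:
    """Keeps a local address list in sync with the providers directory."""

    def __init__(self, registry, path):
        self.addresses = []
        registry.watch_children(path, self._on_change)

    def _on_change(self, nodes):
        self.addresses = nodes


registry = FakeRegistry()
consumer = Consumer(registry, PROVIDERS)
registry.create_ephemeral(PROVIDERS, "10.0.0.1:20880", session="provider-1")
registry.create_ephemeral(PROVIDERS, "10.0.0.2:20880", session="provider-2")
registry.close_session("provider-1")  # provider-1 goes down
print(consumer.addresses)  # only provider-2's address remains
```

The Consumer never polls: its address list changes only when the registry reports that the set of child nodes has changed.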

There are two points to note here:

Do not register too much service data in ZooKeeper. When there are many Providers or Consumers and they go online and offline frequently, it is very easy to trigger Full GC in ZooKeeper.
When a Provider goes offline abnormally, its ephemeral node survives until the session timeout expires. Set the session timeout according to business needs, because a value that is too long or too short will affect business calls.
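
As a rough illustration of the second point, the sketch below models how a requested session timeout is bounded on the server side. It assumes the stock ZooKeeper defaults, where the minimum and maximum session timeouts are 2 and 20 times tickTime, and it assumes tickTime = 2000 ms as in the sample configuration. The function name is hypothetical, and a managed service may apply different limits.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000):
    """Clamp a requested session timeout to [2 * tickTime, 20 * tickTime]."""
    lowest = 2 * tick_time_ms    # assumed default minSessionTimeout
    highest = 20 * tick_time_ms  # assumed default maxSessionTimeout
    return max(lowest, min(requested_ms, highest))


for requested in (1_000, 30_000, 60_000):
    print(requested, "->", negotiated_timeout_ms(requested))
```

Under these assumed defaults, a request for 60 seconds is capped at 40 seconds, which is then the longest a crashed Provider's node can linger before Consumers are notified.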

Big Data Field - HA High Availability

In the big data field, systems such as Flink, Hadoop, HBase, and Kafka use ZooKeeper as their default distributed coordination component. ZooKeeper's characteristics help them solve many distributed problems, the most important of which is implementing HA (high availability) to improve cluster availability. Typically there are two or more nodes, divided into an active node (Active) and standby nodes (Standby).

Consider two Servers that form an HA pair. When a Server starts, it tries to create an ephemeral node at an agreed path in ZooKeeper. Because ZooKeeper allows only one creation of a given path to succeed, whichever Server creates the node first becomes Active. ZooKeeper notifies the other nodes in the cluster, and they switch to the Standby state.

When the Active node goes down, ZooKeeper notifies the other nodes of the change, and the Standby nodes immediately try to create the node. The one that succeeds takes over as Active.

That is roughly the whole process. One point to watch: when the network is unstable, active/standby switching is not instantaneous, and split-brain may occur, that is, two nodes both act as master. To mitigate this, a client can wait a short while during a switchover and switch only after the state has stabilized.
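
The active/standby takeover described above can be sketched as follows. FakeZk is a hypothetical in-memory model of the one behavior the scheme depends on (only one creation of a path succeeds, and the node disappears with its owner's session). It is not a ZooKeeper client, and the path /ha/active and the server names are made up for the example.

```python
class NodeExistsError(Exception):
    """Raised when a second writer tries to create an existing path."""


class FakeZk:
    def __init__(self):
        self.nodes = {}     # path -> name of the owning server
        self.watchers = {}  # path -> callbacks fired once when the node is deleted

    def create_ephemeral(self, path, owner):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = owner

    def watch_delete(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def session_lost(self, owner):
        # Losing the session removes every ephemeral node the owner created.
        owned = [path for path, holder in self.nodes.items() if holder == owner]
        for path in owned:
            del self.nodes[path]
            for callback in self.watchers.pop(path, []):
                callback()


class Server:
    def __init__(self, name, zk, path="/ha/active"):
        self.name, self.zk, self.path = name, zk, path
        self.state = "STANDBY"

    def campaign(self):
        try:
            self.zk.create_ephemeral(self.path, self.name)
            self.state = "ACTIVE"
        except NodeExistsError:
            # Someone else is Active: stand by and retry when the node goes away.
            self.state = "STANDBY"
            self.zk.watch_delete(self.path, self.campaign)


zk = FakeZk()
a, b = Server("server-a", zk), Server("server-b", zk)
a.campaign()
b.campaign()
print(a.state, b.state)       # server-a won the race
zk.session_lost("server-a")   # server-a crashes and its node is removed
print(zk.nodes["/ha/active"], b.state)
```

The model delivers notifications instantly. In a real network, the delay between a session expiring and the standby being notified is what opens the split-brain window mentioned above.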

Distributed coordination scenarios of self-developed systems

When developing a distributed system, you will inevitably encounter many distributed coordination problems. ZooKeeper is like a universal toolbox: its features can be combined into a solution for each scenario. The following functions are commonly used when writing distributed systems.

Master election

A system often needs to elect a Master to perform certain tasks. SchedulerX, for example, uses ZooKeeper for this: it has many Worker nodes, and some tasks are non-idempotent and may be executed by only one process. There are two main ways to choose a master from among many Workers:

Preempting the master node: Agree on a fixed path. Whoever successfully creates an ephemeral node there becomes the master. When the master goes down, its ephemeral node expires and is released, ZooKeeper notifies the other nodes, and they compete again to create the node.
Smallest node: This uses ZooKeeper's ephemeral sequential nodes. During election, each server creates an ephemeral sequential node under an agreed directory, and by convention the node with the smallest sequence number becomes the master.
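
A minimal sketch of the smallest-node rule, working on plain child names rather than a live cluster. It relies on the fact that ZooKeeper appends a 10-digit, zero-padded counter to sequential node names. The server- prefix and the function names are assumptions made for the example.

```python
def sequence_of(node_name):
    """Parse the 10-digit, zero-padded counter appended to sequential nodes."""
    return int(node_name[-10:])


def elect_master(children):
    """Return the child with the smallest sequence number."""
    return min(children, key=sequence_of)


children = ["server-0000000012", "server-0000000010", "server-0000000011"]
master = elect_master(children)
children.remove(master)             # the master's session ends
successor = elect_master(children)  # the next smallest node takes over
print(master, successor)
```

Because the order is fixed when the nodes are created, every server that reads the same child list reaches the same decision without any further communication.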

Distributed lock

In a distributed environment, programs run on independent nodes, and a distributed lock is a way to control synchronized access to shared resources across them. The following describes how ZooKeeper implements distributed locks. There are two main types:

1. Exclusive locks: Once a process has acquired this lock, no other process is allowed to read or write the resource.

The implementation is simple. It relies on the ZooKeeper guarantee that a node at a given path can be created only once: whoever creates the node successfully holds the lock. The other processes watch the node. When the ephemeral node is deleted, they are notified and compete to create it again. This is the same approach as preempting the master node in master election.

2. Shared locks: Also known as read locks. Multiple processes can hold this lock at the same time to perform reads. A process that wants to write must wait until no reads are in progress and it is first in line to acquire a write-type lock.

To read, a process creates an ephemeral sequential node of type R. If no W node has a smaller sequence number than its own, the read lock is acquired and it can read. To write, a process creates a W node and checks whether its own node has the smallest sequence number of all nodes. If so, it may write.
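
The read/write decision above can be expressed as a small function over the list of child node names. This is a sketch of the rule only, assuming nodes are named with a hypothetical R- or W- prefix followed by the 10-digit sequence number that ZooKeeper appends. It does not talk to a real cluster.

```python
def sequence_of(node_name):
    """Parse the 10-digit, zero-padded counter appended to sequential nodes."""
    return int(node_name[-10:])


def can_acquire(my_node, children):
    """Decide whether my_node currently holds the lock it asked for."""
    mine = sequence_of(my_node)
    earlier = [c for c in children if sequence_of(c) < mine]
    if my_node.startswith("W-"):
        # A writer needs every earlier reader and writer to be gone.
        return not earlier
    # A reader only needs every earlier writer to be gone.
    return not any(c.startswith("W-") for c in earlier)


children = ["R-0000000001", "R-0000000002", "W-0000000003", "R-0000000004"]
for node in children:
    print(node, can_acquire(node, children))
```

In this example the two leading readers proceed together, the writer waits for both of them, and the last reader waits for the writer.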

Distributed queue

The most common model for a distributed queue is FIFO (first in, first out): requests that enter the queue first are completed first, and later requests are processed afterwards.

ZooKeeper implements a FIFO queue in much the same way as a shared lock, specifically a shared lock in which every request is a write:

1. Get all child nodes under the /Queue node, that is, all elements in the queue

2. Determine the position of your own node's sequence number among all child nodes

3. If your sequence number is not the smallest, wait, and register a Watcher on the node with the largest sequence number that is still smaller than your own (your immediate predecessor)

4. After receiving the Watcher notification, repeat from step 1
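
Steps 2 and 3 above come down to finding your immediate predecessor in sequence order. A minimal sketch, assuming child names end in the 10-digit sequence number that ZooKeeper appends. The item- prefix and the function names are hypothetical.

```python
def sequence_of(node_name):
    """Parse the 10-digit, zero-padded counter appended to sequential nodes."""
    return int(node_name[-10:])


def node_to_watch(my_node, children):
    """Return the predecessor to watch, or None if my_node is at the head."""
    ordered = sorted(children, key=sequence_of)
    position = ordered.index(my_node)
    return None if position == 0 else ordered[position - 1]


children = ["item-0000000007", "item-0000000005", "item-0000000009"]
for node in children:
    print(node, "watches", node_to_watch(node, children))
```

Watching only the predecessor, rather than the whole /Queue directory, means that each deletion wakes up a single waiter instead of all of them.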

Configuration Center

Using ZooKeeper as a configuration center also relies on its registration/subscription model. One point to note: ZooKeeper is not suitable for storing large data. Keep the data in each node under about 1 MB, otherwise performance problems may occur.
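
A publisher can guard against oversized configuration before writing it. The check below is a sketch that uses the 1 MB guideline from the text as its threshold. The constant and function names are hypothetical, and the limit actually enforced depends on server settings.

```python
MAX_NODE_BYTES = 1024 * 1024  # the roughly 1 MB guideline from the text


def fits_in_node(config_text):
    """True if the UTF-8 encoded configuration stays within the guideline."""
    return len(config_text.encode("utf-8")) <= MAX_NODE_BYTES


print(fits_in_node("timeout=3000"))
print(fits_in_node("x" * (MAX_NODE_BYTES + 1)))
```

The size is measured after encoding because the stored payload is bytes: text containing non-ASCII characters can exceed the limit with far fewer characters than bytes.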

ZooKeeper Enterprise Service by MSE

The relationship between MSE and ZooKeeper

Microservices Engine (MSE) is a one-stop microservice platform for the industry's mainstream open source microservice ecosystem. All services in the ecosystem can be integrated on this platform, and each engine it provides is hosted independently, with the aim of delivering high-performance, highly available, highly integrated, and secure services. MSE currently provides the following modules:

Registry and configuration center (ZooKeeper/Nacos/Eureka)
Cloud-native gateway (Envoy)
Distributed transactions (Seata)
Microservice governance (Dubbo/Spring Cloud/Sentinel/OpenSergo)

Like Nacos, ZooKeeper can serve as a registry and configuration center, but it also provides distributed coordination, which is widely used in the big data field.

The ZooKeeper enterprise service provided by MSE comes in a basic edition and a professional edition. The former suits development and test environments and simple production environments, while the latter adds many improvements in performance, observability, and high availability. Next, we introduce the advantages of the professional edition over a self-built deployment.

More stable and highly available than self-built ZooKeeper

Product Architecture Diagram of MSE

Multi-AZ deployment: ZooKeeper can elect a leader only when more than half of its nodes are available. When a 5-node ZooKeeper cluster is deployed across 3 availability zones, the nodes are distributed 2/2/1, so if any single availability zone fails, ZooKeeper as a whole remains available. Latency between Alibaba Cloud AZs is currently under 3 ms, which is short and manageable.
Highly available load balancing: User nodes access ZooKeeper through an endpoint, which is a SingleTunnel SLB provided by MSE. This SLB is an active-standby high-availability system. It automatically balances user requests across the back-end nodes, and when a back-end node fails, it removes that node automatically so that requests reach only healthy nodes.
Node self-healing: Relying on the liveness probe capability of K8s, a failed node is recovered automatically, which keeps the service continuously available.
Data security: The professional edition of ZooKeeper can back up snapshots, so that if something unexpected happens to the cluster, its data can be rebuilt and restored quickly.
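
The majority arithmetic behind the 2/2/1 layout in the first point can be checked in a few lines. The function names and zone labels are hypothetical, and the layouts are examples rather than MSE configuration.

```python
def quorum(total_nodes):
    """Smallest number of nodes that is more than half of the cluster."""
    return total_nodes // 2 + 1


def survives_any_single_az_failure(layout):
    """layout maps each availability zone to its node count."""
    total = sum(layout.values())
    return all(total - lost >= quorum(total) for lost in layout.values())


print(survives_any_single_az_failure({"az-a": 2, "az-b": 2, "az-c": 1}))
print(survives_any_single_az_failure({"az-a": 3, "az-b": 1, "az-c": 1}))
```

With 5 nodes the quorum is 3. Losing a zone that holds 2 nodes leaves 3, which is still a majority, whereas a 3/1/1 layout cannot survive the loss of its largest zone.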

The above guarantees high availability at the architecture-design level.

In research and development, we have a complete stability assurance system, with standards covering everything from development to the final change. One example is the "three axes" of change management: a change may go online only if it is observable, can be rolled back, and can be released in grayscale, otherwise it is rejected. At runtime we have a series of inspection components and configuration consistency checks, which we will keep improving. Any problem found during inspection must be resolved immediately.

MSE also runs fault drills as a matter of routine. Common fault scenarios such as network interruption, CoreDNS downtime, and ECS downtime are exercised regularly. For online alerting, we use active probing to detect problems promptly, and on-call staff handle them 24 hours a day. We follow a 1/5/10 emergency process: a problem must be found within 1 minute, resolved within 5 minutes, and service restored within 10 minutes.

All of this ultimately serves the stability and high availability of MSE. The largest MSE cluster in production supports more than 400,000 long-lived connections. It has run stably for 3 years without failure, and the SLA has reached 99.95%.

No operation and maintenance burden, with rich console functions

If you build ZooKeeper yourself, here is what needs to be done:

Building infrastructure: Prepare basic resources such as ECS and SLB, then plan the network.
Installing ZooKeeper: Installation involves many configuration parameters, and you must be familiar with them or you will be at a loss when a problem occurs. Different parameter values also affect cluster performance at runtime, so this requires sufficient professional knowledge.
Scaling out and in: You must plan the allocation of MyId values. The MyId of each newly added machine must be incremented beyond the existing ones, otherwise the new machine cannot join the old cluster and synchronize data, because only a node with a larger MyId actively connects to a node with a smaller one to synchronize cluster data. New nodes must also join in a strict startup order, and the number of machines added at one time must be less than half the size of the original cluster, otherwise the master may be elected among the new nodes, causing data loss. When adding many nodes, this procedure has to be repeated several times. Missing a single step can easily lead to failed master election, data loss, and production incidents.
Server configuration changes: After configuration items in zoo.cfg are updated, every machine in the cluster must be restarted manually for them to take effect.
Data management: Open source ZooKeeper has no graphical management tool. To view data, you have to query it through zkClient or write code, which is complicated and cumbersome.
Online fault handling: When, for example, ZooKeeper is stuck in GC or the network is disconnected, handling it requires professional operations staff who are familiar with ZooKeeper, the JVM, and the operating system.

These are the problems of a self-built deployment. The ZooKeeper enterprise service provided by MSE solves them through productization.

When you need a ZooKeeper cluster, you can purchase one with a single click and use it out of the box within 3 minutes. When capacity becomes a problem, you can scale out or in smoothly with one click. The console also provides graphical operations such as resetting data and setting parameters, along with default dashboards for commonly used core metrics, paired with alerts. Using the ZooKeeper enterprise service saves worry and effort and improves enterprise IT ROI.

Observability enhancements

The third advantage of ZooKeeper Professional Edition is enhanced observability:

Rich monitoring dashboards: The professional edition is deeply integrated with Prometheus, free of charge, and provides more than 20 commonly used ZooKeeper monitoring metrics plus 4 core resource monitoring metrics
Core alert rules: These basically meet daily operation and maintenance needs. If you need more, contact us at any time and we will arrange it
Open, rich standard Metrics: The professional edition exposes more than 70 of ZooKeeper's built-in Metrics through an API, so you can use the data to build your own monitoring dashboards

Performance boost

The write performance optimization is improved by 20%, and the data reliability reaches 99.9999999% (that is, nine nines).

ZooKeeper's write performance depends heavily on disk performance: a write succeeds only after the data has been written to the transaction log on disk. To improve write performance, we use Alibaba Cloud ESSD high-performance cloud disks, with a maximum of 50,000 IOPS, a maximum throughput of 350 MB/s, and data reliability of 99.9999999% (nine 9s). Overall write TPS improves by about 20%.

Built on Dragonwell, read performance is about doubled

We integrated Alibaba's high-performance JDK (Dragonwell), enabled its coroutine optimization capability, and refined the lock granularity of ZooKeeper's read and write task queues. Under high concurrency, read performance is about twice that of open source.

GC time is reduced by 80%, greatly reducing Full GC occurrences

ZooKeeper is a latency-sensitive application, and the duration and frequency of GC affect its throughput. We therefore tuned the JVM parameters: heap settings are set dynamically according to the instance specification, and fragmented resources are reclaimed in advance to avoid Full GC. Overall, the optimization reduces GC time by 80% and avoids Full GC as far as possible.

Building Dubbo + ZooKeeper microservices on MSE

Before you start, you need to purchase a ZooKeeper instance. You can choose pay-as-you-go billing and release the instance when you no longer need it. If you will use it long term, you can choose a yearly or monthly subscription.

When choosing the network access method, there are three cases:

1. If you only use a VPC network, choose private network, select the VPC and vSwitch, and leave the other options unchanged (note: do not select public network bandwidth)

2. If you only need public network access, choose public network, then select the bandwidth you need

3. If you need both public network and VPC access, choose private network and also select the public network bandwidth you need, so that two access endpoints are created

After the purchase, the ZooKeeper cluster is created in about 5 minutes. Note the access endpoint, because you will need to configure this address in the Dubbo configuration file later.

Once the environment is ready, configure the Provider and Consumer. For detailed steps, you can watch the live video: https://yqh.aliyun.com/live/detail/28603

Closing remarks

The ZooKeeper enterprise service provided by MSE aims to give users a distributed coordination service that is more reliable, lower in cost, more efficient, and fully compatible with open source. It offers pay-as-you-go and yearly or monthly subscription billing, and supports 23 regions in China and abroad, including Hangzhou, Shanghai, Beijing, and Shenzhen, covering 95% of users by region. If you have other service needs, please contact us.

Buy MSE Professional Edition ZooKeeper now and enjoy 10% off, for both new and existing customers.

You can also search for group number 34754806 on DingTalk to join the user group for discussion and Q&A.

Visit the MSE official website to purchase.
