
Text | Lin Yuzhi (alias: Yuan San)

Senior expert at Ant Group, focusing on microservices and service discovery

Proofreading | Li Xudong

This article is 8,624 words and takes about 18 minutes to read

|Introduction|

Service discovery is one of the most important dependencies when building a distributed system. Within Ant Group, this responsibility is shared by the registry and Antvip: the registry provides service discovery within a data center, while Antvip provides service discovery across data centers.

This article focuses on the registry, which is deployed as multiple clusters (one per IDC) with no data synchronization between clusters.

PART. 1 Background

Looking back, the registry at Ant Group dates back to around 2007/2008 and has evolved for more than 13 years. Over that time, both the business it serves and its own capabilities have changed enormously.

A brief review of the historical development of the registry:

V1: Introducing Taobao's configserver

img

V2: Horizontal scaling

img

Starting from this version, Ant and Alibaba evolved independently. The main difference was the direction chosen for data storage: Ant chose horizontal scaling, sharding data across nodes, while Alibaba chose vertical scaling, increasing the memory specification of the data nodes.

This choice shaped the storage architectures of SOFARegistry and Nacos several years later.

V3/V4: LDC support and disaster recovery

img

V3 added support for LDC unitization.

V4 added a decision-making mechanism and a runtime server list, eliminating the manual intervention previously required when a single machine went down. This improved availability and reduced operation and maintenance costs to a certain extent.

V5: SOFARegistry

img

The first four versions were known as confreg. The V5 project, SOFARegistry, was launched in 2017 with the following goals:

1. Code maintainability: the confreg codebase carried a heavy historical burden

  • Only a small number of modules used Guice for dependency management; most modules interacted through static references. Core modules and extension modules were hard to separate, which was not conducive to open-sourcing the product.
  • The interaction model between client and server was complex, extremely costly to understand, and unfriendly to multi-language support.

2. Operation and maintenance pain points: Raft was introduced to solve the problem of maintaining the server list, so operating the cluster also meant operating Raft, which was simplified through an operator.

3. Robustness: a multi-replica backup mechanism was added to the consistent hash ring (3 replicas by default), so the loss of up to 2 replicas is transparent to the business.

4. Cross-cluster service discovery: cross-cluster service discovery within a site required additional support from Antvip, and we hoped to unify the capabilities of the two sets of facilities. At the same time, commercial scenarios also required cross-data-center data synchronization.

These goals were only partially achieved, and the parts that were achieved were not good enough. For example, some of the operation and maintenance pain points remained, and cross-cluster service discovery faced serious stability challenges against the large-scale data of the main site.

V6: SOFARegistry 6.0

In November 2020, having distilled the lessons learned from internal and commercial use, and to meet future challenges, we launched a large-scale refactoring plan for SOFARegistry 6.0.

It took 10 months to develop the new version, upgrade the fleet to it, and roll out application-level service discovery at the same time.

PART. 2 Challenges

Current problems

The challenge of cluster scale

  • Data growth: as the business develops, the number of business instances keeps growing, and the number of pubs/subs grows with it. Taking one cluster as an example, with 2019 as the baseline, the number of pubs approached ten million in 2020.

The figure below compares the cluster's data across Double Eleven over the years, together with the optimization effect of switching to application-level discovery. Compared with Double Eleven 2019, interface-level pubs for Double Eleven 2021 grew by 200% and subs by 80%.

img

  • Growing fault blast radius: the more instances connected to a cluster, the more services and instances a fault affects. Ensuring business stability is the most basic and highest-priority requirement.
  • Testing horizontal scalability: once a cluster reaches a certain scale, does it still have the ability to keep scaling out? Expanding from 10 nodes to 100 is not the same thing as expanding from 100 to 500.
  • HA capability: as the number of cluster instances grows, the overall hardware failure rate across nodes grows accordingly. Can the cluster recover quickly from all kinds of machine failures? Anyone with operations experience knows that the difficulties of a large cluster grow exponentially compared with a small one.
  • Push performance: most service discovery products choose eventual consistency for their data, but how long is "eventual" at different cluster scales? In fact, none of the related products gives clear figures.

We actually believe this metric is the core metric of a service discovery product. The delay affects calls: a newly added address receives no traffic, a deleted address is not removed in time, and so on. Ant Group's PaaS imposes an SLO on the registry's push delay: if the delay of pushing a changed list exceeds the agreed value, the address list on the business side is considered wrong. We have also experienced failures in our history caused by pushes that were not timely.

Growth in business instance scale also puts pressure on push performance: the number of instances under each pub on the publishing side grows, and the number of business instances on the subscription side grows. A simple estimate: if pubs and subs each double, the amount of data pushed grows by 2 × 2 = 4 times; it is a product relationship. Push performance also determines the maximum number of business instances that can be operated on at the same time. For example, in an emergency scenario where the business restarts at large scale, if push is the bottleneck it will prolong fault recovery.
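
As a toy illustration of that product relationship (the numbers below are purely illustrative):

```java
// Illustrative only: push volume scales roughly with publishers × subscribers,
// so doubling both sides quadruples the data pushed for a full change.
public class PushAmplification {
    public static void main(String[] args) {
        int publishers = 1_000, subscribers = 1_000;
        long before = (long) publishers * subscribers;
        long after = (long) (publishers * 2) * (subscribers * 2);
        System.out.println("amplification factor: " + (after / before)); // 4
    }
}
```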

Cluster scale can be considered the biggest challenge. The core architecture determines the upper limit, and the cost of changing it once it is settled is very high; and by the time a bottleneck is discovered, it is often too late. We had to choose an architecture that raises the technical ceiling of the product.

Operational challenges

One of the main goals of the SOFARegistryX project was to have better operability than confreg: the meta role was introduced, with leader election and meta-information stored via Raft, providing a control plane for the cluster. But the facts proved that we still underestimated the importance of operations; as the saying jokingly attributed to Mr. Lu Xun goes: "A programmer's job is only two things: one is operations, the other is still operations."

The goals we set three years ago now lag seriously behind today's needs.

  • Growth in the number of clusters: Ant Group's internal business is deployed at different sites (simply put, each site is a relatively independent business requiring a different level of isolation), and each site needs multiple clusters: disaster recovery requires deployment in separate data centers, and development requires multiple environments. The number of deployed clusters has grown beyond our imagination; it has now reached hundreds and is still growing rapidly, at a rate comparable to the Federal Reserve's money supply in recent years. Operation and maintenance work that used to be manageable has, with this growth, become overwhelming far too often, squeezing the energy of development and operations engineers and leaving no bandwidth for anything beyond firefighting.

img

  • Business interruption: business operations run 24/7: monthly capacity adaptation, self-healing, MOSN upgrades that plow through every application on the site, and so on. The figure below shows the number of machines operated on per minute; even on weekends and late at night, operation tasks never stop.

img

Ant Group engineers are familiar with, and weary of, the registry's operation announcements. Because of the business's sensitivity, the registry used to be released and operated only after shutting down pushes, during which the release/restart actions of the entire site had to be locked. To minimize the impact on the business, the registry team could only sacrifice sleep (and hair) and perform these operations during the late-night low-traffic window. Even so, zero interruption to the business could not be achieved.

The Challenge of Naming in the Cloud Native Era

img

In the cloud native era, several trends can be observed:

  • The adoption of microservices/FaaS has led to more lightweight applications: the number of instances grows, and the registry needs to support a larger business scale.
  • Application instance lifecycles are shorter: on-demand FaaS, autoscaling capacity adaptation and similar mechanisms make instances come and go more frequently, and the registry's performance is mainly reflected in how quickly it responds to instance changes.
  • Multi-language support: in the past, Java was Ant Group's main development stack, and non-Java languages were second-class citizens when connecting to infrastructure. With the demands of AI and new businesses, non-Java scenarios are increasing. Maintaining an SDK for every language would be a nightmare. The sidecar (MOSN) is one solution, but can we support a less intrusive access method, or even SDK-free access?
  • Service routing: in most past scenarios, endpoints could be considered equal, and the registry only needed to provide a list of addresses. In Mesh's precise routing scenarios, Pilot provides not only EDS (the address list) but also RDS (routing); the registry needs to enrich its own capabilities.
  • K8s: K8s has become the de facto distributed operating system. How should K8s Services connect with the registry? Furthermore, can we solve service discovery for K8s Services across multiple clusters?

"Summarize"

In summary, besides keeping our feet on the ground and solving the problems at hand, we also need to look up at the stars: addressing the naming challenges of the cloud native era is another main goal of the V6 refactoring.

PART. 3 SOFARegistry 6.0: performance-oriented

SOFARegistry 6.0 is not just a registry engine; it needs to work with surrounding facilities to improve development, operations and emergency-response efficiency, and to solve the following problems. (The modules in red are the more challenging areas.)

img

The work related to SOFARegistry 6.0 includes:

img

Architecture optimization

The idea behind the architecture transformation: keep the V5 storage sharding architecture, and focus on optimizing the consistency of meta-information and on ensuring that correct data is pushed.

img

Consistency of meta-information

V5 introduced Raft's strong consistency into the meta role for leader election and for storing meta-information, which consists of the node list and configuration. Data sharding relies on consistent hashing over the node list obtained from meta. This has two problems:

  • Raft/operator is complex to operate

    (1) Customized operation processes: orchestration such as change-peer must be supported. Inside Ant Group, a specialized operation process is costly, and it is also not conducive to external delivery.

    (2) The cost of implementing a robust operator is very high, including integrating with change management and controlling changes to the operator itself.

    (3) Sensitivity to network/disk availability: in external delivery scenarios we face relatively poor hardware, and troubleshooting is more expensive.

  • Fragile strong consistency

The use of meta-information is premised on strong consistency. If there is a network problem, for example a session node partitioned away from meta, a wrong routing table will corrupt data sharding. We need a mechanism that keeps the data correct for a short period even when the meta-information is inconsistent, leaving a buffer for emergency response.

Pushing correct data

When data nodes are operated on at large scale, drastic changes to the node list cause continuous data migration, and the integrity and correctness of pushed data are at risk. V5 avoided this by introducing 3 replicas: as long as one replica is available, the data is correct. However, this constraint places a heavy burden on operations: each operation must take down fewer than two replicas at a time, or an operation sequence satisfying the constraint must be chosen.

For V5 and earlier versions, operations were relatively crude: a one-size-fits-all shutdown release, with PaaS locked to prohibit business changes, and push only re-enabled after the data nodes had stabilized, so as to avoid the risk of pushing incorrect data.

Moreover, planned operations can be handled this way, but for the sudden failure of multiple data nodes the risk still exists.

We need a mechanism that keeps pushed data correct when changes to the data node list trigger data migration, while tolerating a slightly higher push delay.

"Results"

  • Pluggable meta storage/election components: Raft is removed for on-site deployments, and a database is used for leader election and for storing configuration, reducing operation cost.
  • Data uses fixed slot sharding; meta provides scheduling, and the slot scheduling information is stored in the slotTable. Session/data nodes can tolerate weak consistency of this information, improving robustness.
  • Multi-replica scheduling reduces the cost of data migration when data nodes change. With the current online data volume, promoting a follower to leader takes about 200ms (the follower already holds most of the data), while directly assigning a new leader and synchronizing its data takes 2s-5s.
  • Optimized data communication/replication links, improving performance and scalability.
  • Large-scale operations no longer require locking PaaS late at night, which reduces interruption to the business, preserves the hair of the operations engineers, and improves happiness.

Data link and slot scheduling:

  • Slot sharding follows the practice of Redis Cluster: virtual hash slots are used, and every dataId is mapped by a hash function to one of the 0 ~ N integer slots.
  • The meta leader senses the list of surviving data nodes through heartbeats and distributes the replicas of each slot as evenly as possible across the data nodes. The mapping is stored in the slotTable, and session/data nodes are actively notified when it changes.
  • At the same time, session/data nodes fetch the latest slotTable through heartbeats, to cover the risk of a failed notification from meta.
  • A slot has a state machine on the data node: Migrating -> Accept -> Moved. While migrating, the slot's data must be fully up to date before it enters the Accept state and can be used for pushes, which guarantees the integrity of pushed data (a minimal sketch of the slot mapping and this state check follows the list).
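
Below is a minimal sketch of the slot mapping and the per-slot state check described above. The class, field and method names, and the slot count, are assumptions for illustration; the actual SOFARegistry implementation differs in detail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names and slot count; illustration only.
public class SlotTableSketch {
    static final int SLOT_NUM = 256;

    enum SlotState { MIGRATING, ACCEPT, MOVED }

    // slotId -> leader data node, as scheduled by the meta leader and
    // distributed via the slotTable
    private final Map<Integer, String> slotLeaders = new ConcurrentHashMap<>();
    // local state of each slot on a data node
    private final Map<Integer, SlotState> slotStates = new ConcurrentHashMap<>();

    // Every dataId maps to a fixed slot via a hash function.
    static int slotOf(String dataId) {
        return Math.floorMod(dataId.hashCode(), SLOT_NUM);
    }

    // Used by session nodes to route a dataId to the data node owning its slot.
    String leaderOf(String dataId) {
        return slotLeaders.get(slotOf(dataId));
    }

    // A slot may only serve pushes after migration has finished and the slot
    // has entered the Accept state, so pushed data is complete.
    boolean canServePush(String dataId) {
        return slotStates.get(slotOf(dataId)) == SlotState.ACCEPT;
    }
}
```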

img

Data migration when data nodes change:

img

We pressure-tested the push capability of a cluster with 100,000+ connected clients. At a push volume of 12M per minute, push delay p999 stays below 8s, with session CPU at 20% and data CPU at 10%. The physical resource water level is low, leaving plenty of push headroom.

img

At the same time, we are verifying horizontal scalability online. The cluster has been scaled up to 370 session nodes, 60 data nodes and 3 meta nodes; because meta has to process heartbeats from all nodes, its CPU reaches 50%, so either vertical scaling to 8C or further optimization of the heartbeat overhead is required. Based on a safe water level of 2 million pubs per data node (each pub costs roughly 1.5KB), and reserving enough buffer that the cluster can still serve after losing 1/3 of its data nodes, the cluster can support about 120 million pubs with a single copy, or 60 million pubs with two copies.
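
A back-of-the-envelope version of that capacity estimate, using only the figures above (a rough sketch, not a guarantee):

```java
// Rough capacity estimate from the figures above: 2M pubs per data node at
// the safe water level, 60 data nodes; two copies halve the usable capacity.
public class CapacityEstimate {
    public static void main(String[] args) {
        long pubsPerDataNode = 2_000_000L;
        int dataNodes = 60;
        long singleCopy = pubsPerDataNode * dataNodes; // ~120 million pubs
        long doubleCopy = singleCopy / 2;              // ~60 million pubs
        System.out.println("single copy: " + singleCopy + ", double copy: " + doubleCopy);
    }
}
```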

Application-level service discovery

The registry keeps the pub format very flexible. When RPC frameworks implement service discovery on top of it, they map each interface to a pub; SOFA/HSF/Dubbo2 all use this mode. The model is natural, but it makes the pub/sub counts and the push volume expand dramatically.

Dubbo3 proposed application-level service discovery and its principles [1]. In terms of implementation, SOFARegistry 6.0 follows Dubbo3's approach, integrating the service metadata module on the session side, with some adaptations for compatibility.
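
To make the data-volume difference concrete, here is a minimal sketch contrasting the two publication models. The record types and numbers are hypothetical illustrations, not SOFARegistry's or Dubbo3's actual data model: interface-level discovery publishes one record per (interface, instance) pair, while application-level discovery publishes one record per instance plus one shared metadata entry per application.

```java
import java.util.List;

// Hypothetical, simplified records contrasting the two publication models.
record InterfaceLevelPub(String interfaceName, String instanceAddress) {}
record ApplicationLevelPub(String appName, String instanceAddress) {}
// Per-application metadata: published once per app, not once per instance.
record AppMetadata(String appName, List<String> exposedInterfaces) {}

public class PubModelSketch {
    public static void main(String[] args) {
        int interfaces = 50, instances = 100; // an app exposing 50 services on 100 instances
        // interface-level: one pub per (interface, instance) pair
        System.out.println("interface-level pubs: " + interfaces * instances); // 5000
        // application-level: one pub per instance plus one metadata entry
        System.out.println("application-level pubs: " + (instances + 1));      // 101
    }
}
```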

"Application-level service pub data split"

img

"compatibility"

One difficulty of application-level service discovery is how to stay compatible with interface-level discovery at low cost. Although most applications can eventually be upgraded to the application level, the following problems arise during the upgrade:

  • There are many applications, and the points in time at which they upgrade to the application level are spread out widely
  • Some applications cannot be upgraded at all, such as some very old ones

We adopted a solution centered on application-level services while remaining compatible with interface-level discovery:

img

During the upgrade, the old and new versions of SOFARegistry run side by side, each reachable through its own domain name. Upgraded application endpoints (MOSN in the figure) use dual subscription and dual publishing and switch over gradually in grayscale, ensuring that applications that have not adopted MOSN, or have not turned on the switch, are unaffected during the switchover.

After most applications completed the application-level migration, the upgraded applications were all on SOFARegistry 6.0, but a small number of applications remained that had not adopted MOSN. These remaining old applications were also switched to SOFARegistry 6.0 via domain names and continue to interact with the registry through interface-level subscription and publishing. To ensure that upgraded and non-upgraded applications can still subscribe to each other, two pieces of support were added:

  • Converting application-level Publishers to interface-level Publishers: interface-level subscribers cannot directly subscribe to application-level publication data, so for interface-level subscriptions we convert AppPublisher to InterfacePublisher on demand, and applications without MOSN can subscribe to this data smoothly. Since only a small number of applications lack MOSN, few application-level Publishers need to be converted (a minimal sketch of this conversion follows the list).
  • Application-level subscribers issue an additional interface-level subscription to cover publishers that have not been upgraded. Because very few applications remain in that state, the vast majority of these interface-level subscriptions never produce a push task, so they add no push pressure.
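
A minimal sketch of the on-demand AppPublisher-to-InterfacePublisher conversion for interface-level subscribers, assuming hypothetical types and method names rather than the real SOFARegistry code:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical types and method names; the real SOFARegistry code differs.
record AppPublisher(String appName, String address, List<String> interfaces) {}
record InterfacePublisher(String interfaceName, String address) {}

public class CompatConversionSketch {
    // For an interface-level subscription, expand matching application-level
    // publishers into interface-level publishers on demand.
    static List<InterfacePublisher> toInterfaceLevel(String subscribedInterface,
                                                     List<AppPublisher> appPublishers) {
        return appPublishers.stream()
                .filter(p -> p.interfaces().contains(subscribedInterface))
                .map(p -> new InterfacePublisher(subscribedInterface, p.address()))
                .collect(Collectors.toList());
    }
}
```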

"Effect"

img

The figure above shows the effect of switching a cluster to application-level discovery. The interface-level pubs remaining after the switch are the converted data kept for compatibility, and the interface-level subs are not reduced because they are needed for compatibility with interface-level publishing. Ignoring compatibility, pub data is reduced by up to 97%, greatly relieving the pressure that data scale puts on the cluster.

SOFARegistryChaos: automated testing

The eventual-consistency model of a registry has always been hard to test:

  • How long does "eventually" actually take?
  • Was any wrong data pushed before consistency was reached?
  • Was any data missing from pushes before consistency was reached?
  • The impact of cluster failures/data migration on data correctness and latency
  • The impact of clients calling APIs frequently, in various orders
  • The impact of clients frequently disconnecting

To answer these questions we developed SOFARegistryChaos, which, in addition to functional, performance, large-scale stress and chaos testing, provides complete testing of eventual consistency. Through a plug-in mechanism it can also be pointed at other service discovery products, and with K8s-based deployment the test components can be spun up quickly.

With these capabilities we can test more than our own product; for example, we can quickly measure ZooKeeper's service discovery performance for comparison.
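
The plug-in mechanism can be pictured as an adapter that any service discovery product implements so SOFARegistryChaos can drive it. The interface below is a hypothetical illustration, not SOFARegistryChaos's actual API:

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical plug-in interface: any service discovery product wrapped in
// such an adapter could be driven by the same randomized test scenarios.
public interface DiscoveryClientAdapter {
    void register(String dataId, String address);                 // publish an address
    void unregister(String dataId, String address);               // remove an address
    void subscribe(String dataId, Consumer<List<String>> onPush); // receive pushed lists
    void disconnect();                                            // simulate a dropped connection
}
```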

img

Test observability

Observability of the key data is exposed through metrics, and visualization is available by integrating with Prometheus:

  • Push delay
  • Eventual-consistency check within a configured time window
  • The point in time at which fault injection occurs
  • Integrity of pushed data while consistency is being reached
    Testing this property is an interesting innovation: a subset of clients and their pubs is kept fixed, and for every push triggered by other changes we verify that the data for this fixed subset is complete and correct (a minimal sketch of this check follows the list).
  • Push count
  • Pushed data volume
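
A minimal sketch of that fixed-pub integrity check, with hypothetical names: a frozen set of publishers is registered up front, and every push received for the corresponding dataId must still contain all of those addresses, no matter what other changes or faults are injected.

```java
import java.util.Set;

// Hypothetical names; illustration of the fixed-pub integrity check.
public class PushIntegrityCheck {
    private final Set<String> frozenAddresses; // addresses of the fixed publishers

    PushIntegrityCheck(Set<String> frozenAddresses) {
        this.frozenAddresses = frozenAddresses;
    }

    // Called on every push for the frozen dataId while other publishers,
    // subscribers and faults churn around it: the fixed addresses must
    // always be present.
    boolean verify(Set<String> pushedAddresses) {
        return pushedAddresses.containsAll(frozenAddresses);
    }
}
```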

img

Troubleshooting failed cases

In the test scenarios, the ordering of client operations and fault injection is randomized. The SOFARegistryChaos master records and collects the ordering of all operation commands, so when a case fails the problem can be located quickly from the details of the failed data and each client's API call history.

For example, in the failed case shown below, a subscriber on one node failed verification of its subscription data for a dataId: the list was expected to be empty, but one entry was pushed. The operation traces of all publishers related to that dataId during the test are displayed alongside.

img

Black box detection

Have you experienced cases like these:

  • The business suddenly tells you their system has a problem, and you are baffled: your own system shows no anomaly.
  • By the time you discover the fault yourself, it has already had a serious impact on the business.

Because of its characteristics, the registry's impact on the business often lags. For example, if only 1K out of 2K IPs are pushed, the error does not make the business immediately perceive an anomaly, yet something has in fact gone wrong. For a registry it is therefore all the more important to detect problems early and cure the illness before symptoms appear.

Here we introduce black box detection: simulating user behavior in a broad sense and probing whether the whole link works.

In fact, SOFARegistryChaos can itself act as a user of the registry, and an enhanced one that provides end-to-end alerting.

We deployed SOFARegistryChaos online with small traffic as a monitoring item. When the registry misbehaves but has not yet had a perceptible impact on the business, we get the chance to intervene in time and reduce the risk of an incident escalating into a major failure.
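
A minimal sketch of such a black-box probe, assuming a hypothetical registry client interface: it publishes a synthetic marker address, subscribes to the same dataId, and raises an alert if the push (including the marker) does not arrive within the agreed SLO.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical client abstraction and probe flow; illustration only.
public class BlackBoxProbe {
    interface RegistryClient {
        void register(String dataId, String address);
        void unregister(String dataId, String address);
        void subscribe(String dataId, Consumer<List<String>> onPush);
    }

    static void probe(RegistryClient client, long sloMillis) throws InterruptedException {
        String dataId = "blackbox.probe.service";   // synthetic marker service
        String address = "127.0.0.1:12345";         // synthetic marker address

        CompletableFuture<List<String>> pushed = new CompletableFuture<>();
        client.subscribe(dataId, pushed::complete);
        client.register(dataId, address);
        try {
            List<String> addresses = pushed.get(sloMillis, TimeUnit.MILLISECONDS);
            if (!addresses.contains(address)) {
                alert("push arrived but the registered address is missing");
            }
        } catch (ExecutionException | TimeoutException e) {
            alert("push did not arrive within the SLO of " + sloMillis + "ms");
        } finally {
            client.unregister(dataId, address);
        }
    }

    static void alert(String message) {
        System.err.println("[blackbox-alert] " + message);
    }
}
```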

Sharpening the axe does not delay cutting the firewood

With SOFARegistryChaos, the efficiency of verifying core capabilities has improved dramatically, and with quality assured, developing and writing code has become much easier. In the three and a half months from July to October we iterated and released 5 versions, close to one version every 3 weeks. This pace would have been unimaginable before, and along the way we also gained a complete end-to-end alerting capability.

Operation and maintenance automation

Nightly build

Although we have a very large number of clusters, they are split across multiple environments, and some environments, such as those below grayscale, have slightly lower stability requirements than production. Can the clusters in these environments pick up new versions quickly and at low cost while quality is still guaranteed? Building on SOFARegistryChaos, we and our quality/SRE colleagues are setting up a nightly build facility.

SOFARegistryChaos acts as the admission control for changes: the new version is deployed automatically, and once it passes the SOFARegistryChaos tests it is automatically rolled out to clusters below the grayscale level, with manual intervention required only for the production release.

img

Through nightly builds, the release cost for non-production environments drops significantly, and new versions get exercised by business traffic as early as possible.

Fault drill

Although we have done a lot of quality work, how do we actually perform in the face of various online failures? Whether it is a mule or a horse, you have to take it out for a walk to find out.

Together with the SRE team, we regularly run fault-tolerance drills online, including but not limited to network failures and large-scale machine downtime. Moreover, a drill cannot be a one-off: disaster tolerance that is not continuously maintained is effectively zero, so in the simulation/grayscale clusters we normalize disaster-tolerance drills in a drill-and-iterate cycle.

img

Fault localization and diagnosis

Once fault-tolerance drills become routine, quickly locating the source of a fault becomes the next problem on the table; otherwise every drill costs too much effort.

Every SOFARegistry node has received many observability improvements and exposes rich diagnostic data, which the SRE diagnosis system uses for real-time diagnosis. In the case shown here, for example, a session node failure caused the SLO to be broken. With localization in place, the self-healing system can also take over: if a session node is diagnosed with a network failure, for example, self-healing can trigger automatic replacement of the failed node.

img

Today, most of our disaster-recovery drills and emergency cases no longer require human intervention, and only drills this cheap can be run routinely.

"income"

By continuously exposing problems through drills and fixing them in rapid iterations, the stability of SOFARegistry has steadily improved.

"Summarize"

Beyond optimizing the engine itself, SOFARegistry 6.0 invested heavily in testing, operations and emergency response. The goal is to raise the efficiency of R&D, quality and operations personnel, freeing them from inefficient manual work and improving their happiness.

PART. 4 Open source: one person can go fast, but a group of people can go further

SOFARegistry is an open source project and an important part of the SOFAStack open source community. We hope to advance SOFARegistry through the power of the community rather than have it developed by Ant Group engineers alone.

Over the past year, because our focus was on the 6.0 refactoring, SOFARegistry's open source work stagnated. This is an area where we have not done well enough.

We have drawn up a community plan for the next six months. In December we will open-source 6.0 based on the internal version. The open source code contains all the core capabilities of the internal version; the only difference is that the internal version has additional compatibility support for confreg-client.

img

In addition, from 6.1 onward we hope that design and discussion will also happen in the community, making the whole R&D process more transparent and open.

PART. 5 We are still on the road

2021 was a year for SOFARegistry to review the past, consolidate its foundations across the board, and improve efficiency.

Of course, we are still at an early stage and there is a long way to go. For example, this year's Double Eleven scale brought a series of very hard problems:

  • Too many instances of a single application in one cluster (hot applications reach 7K instances in a single cluster), causing excessive CPU/memory overhead on the business side when receiving address pushes.
  • Pushing the full address list causes too many connections, and so on.

There are other challenges:

  • Incremental push, reducing the pushed data volume and the client-side resource overhead
  • Unified service discovery with cross-cluster support
  • Adapting to new trends in the cloud native era
  • Running the open source community
  • Product ease of use

"refer to"

[1] Dubbo3 application-level service discovery and related principles:

https://dubbo.apache.org/zh/blog/2021/06/02/dubbo3-%E5%BA%94%E7%94%A8%E7%BA%A7%E6%9C%8D%E5%8A%A1%E5%8F%91%E7%8E%B0/

About us:

The Ant application service team is a core technical team serving the entire Ant Group. It has built a world-leading financial-grade distributed architecture and infrastructure platform, is a leader in cloud-native fields such as Service Mesh, and develops and operates the world's largest Service Mesh cluster; Ant Group's messaging middleware supports trillions of message flows every day.

We welcome anyone interested in Service Mesh, microservices, service discovery and related fields to join us.

Contact email: yuzhi.lyz@antgroup.com


