
Riding the wave

What is the hottest technical topic in the past two years?

Service Mesh is certainly among them.

If you had to explain Service Mesh in one sentence:

It can be compared to TCP/IP between applications or microservices: it handles network calls, rate limiting, circuit breaking, and monitoring between services.

When writing an application you generally do not need to care about the TCP/IP layer (for example, RESTful applications built on HTTP). Likewise, with a Service Mesh you no longer need to care about the inter-service logic that used to live in service frameworks; you simply hand it over to the Service Mesh.

Service Mesh runs as a sidecar and is transparent to applications; all traffic between applications passes through it. Control over application traffic can therefore be implemented in the Service Mesh, which is a natural traffic-interception point for rate limiting and circuit breaking.

Now that more than 80% of Ant's applications have been meshed, building unified rate limiting and circuit breaking in the Mesh is naturally the next step.

(Service Mesh) is the infrastructure layer that handles communication between services. It is responsible for reliably delivering requests through the complex service topology that makes up a modern cloud-native application.

In practice, the Service Mesh usually takes the form of an array of lightweight network proxies deployed alongside the application code, without the application needing to be aware of them.

Compared with traditional rate-limiting components, Mesh-based rate limiting has many advantages and has delivered significant gains in R&D efficiency and R&D cost:

- MOSN's built-in traffic-interception architecture means applications do not need to integrate a hijacking SDK one by one;

- there is no need to develop different versions of the rate-limiting component for specific languages;

- rate-limiting capabilities can be upgraded without requiring services to upgrade in sync.

"Background Business"

Before Mesh unified rate limiting was implemented, there were a number of different rate-limiting products inside Ant Group, each providing different flow-control strategies:

Rate-limiting configurations for different traffic types (SOFARPC, wireless gateway RPC, HTTP, messaging, etc.) are scattered across different platforms and maintained by different teams. Product quality and documentation quality vary, learning costs are high, and the user experience is poor.

Different rate-limiting strategies require different SDKs, which introduce many transitive dependencies; security vulnerabilities force frequent upgrades, and maintenance costs are high.

Not only does this require unnecessary investment in development and construction, it also creates friction and inconvenience for the business teams.

On the other hand, our business keeps growing, yet a large number of services still use the simplest single-machine rate limiting, with no adaptive rate limiting, hot-spot rate limiting, fine-grained rate limiting, cluster rate limiting, or other advanced capabilities.

Faults occur frequently because of missing rate-limiting capabilities, missing rate-limit configurations, and incorrect rate-limit configurations.

Under the Mesh architecture, the sidecar has natural advantages for traffic management: services no longer need to integrate or upgrade rate-limiting components inside their applications, and the middleware team no longer needs to develop and maintain multiple versions of rate-limiting components for different technology stacks.

With Service Mesh now rolled out at scale inside Ant, consolidating the various rate-limiting capabilities into MOSN and consolidating all rate-limiting rules into a "unified rate-limiting center" further strengthens MOSN's traffic-management capabilities, while drastically reducing the cost for services to adopt and configure rate limiting.

Against this background, we built a unified rate-limiting capability in MOSN.

Standing on the shoulders of giants

While building the unified rate-limiting capability, we investigated many mature products, including our own Guardian, Shiva, and Dujiangyan, as well as open-source products such as concurrency-limits, Hystrix, and Sentinel.

We found that Alibaba's open-source Sentinel is the one that brings these capabilities together best.

While building Shiva, we also exchanged ideas with the Sentinel team at the group level, who were actively building sentinel-golang, the Golang version of Sentinel.

MOSN, the open-source Mesh framework that Ant built on the Golang stack, paired with Sentinel's powerful flow-control capabilities and strong community influence, is a natural match; the two complement each other well.

However, Sentinel was not out-of-the-box for us. We are not a brand-new business without historical baggage, so we had to consider compatibility with Ant's infrastructure and with our historical rate-limiting products. Our research identified several areas that required investment:

  1. Control-plane rule distribution has to go through Ant's own infrastructure.
  2. sentinel-golang's single-machine rate-limiting and circuit-breaking logic differs considerably from our previous products.
  3. Cluster rate limiting also has to be implemented on Ant's infrastructure.
  4. Sentinel's adaptive rate limiting is too coarse-grained; Ant has more fine-grained requirements.
  5. The log-collection scheme needs to be adjusted.

After comprehensive consideration, we decided to extend Sentinel, standing on the shoulders of giants to build Ant's own Mesh rate-limiting capabilities.

Leveraging Sentinel's good extensibility, we implemented Ant's own single-machine rate limiting, service circuit breaking, cluster rate limiting, adaptive rate limiting, and so on, contributed some general-purpose changes back to the open-source community, and built unified log monitoring and alerting together with a unified rate-limiting center.
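For context, here is a minimal sketch of the stock sentinel-golang flow-control API that we build on (the resource name and threshold are made up for illustration; Ant's internal extensions, such as rule distribution via the unified rate-limiting center, are not shown):

```go
package main

import (
	"fmt"

	sentinel "github.com/alibaba/sentinel-golang/api"
	"github.com/alibaba/sentinel-golang/core/flow"
)

func main() {
	// Initialize Sentinel with its default configuration.
	if err := sentinel.InitDefault(); err != nil {
		panic(err)
	}

	// Load a simple QPS rule for a hypothetical resource.
	if _, err := flow.LoadRules([]*flow.Rule{{
		Resource:               "some-service:someMethod", // hypothetical resource name
		TokenCalculateStrategy: flow.Direct,
		ControlBehavior:        flow.Reject,
		Threshold:              100, // at most 100 requests per stat interval
		StatIntervalInMs:       1000,
	}}); err != nil {
		panic(err)
	}

	// Guard a call with an entry; blocked requests receive a non-nil BlockError.
	e, b := sentinel.Entry("some-service:someMethod")
	if b != nil {
		fmt.Println("request blocked:", b.BlockType())
		return
	}
	// ... business logic ...
	e.Exit()
}
```

It is this kind of extensibility, with pluggable rules and statistics, that lets the Ant-specific behavior be added without forking the core.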

Finally, we completed the construction of these capabilities in MOSN. The table below compares MOSN rate limiting with other rate-limiting components:

Occam's Razor

Pluralitas non est ponenda sine necessitate.

Entities should not be multiplied without necessity.

Each rate-limiting strategy comes with its own SDK and management console. The interaction experience varies, the quality of documentation and runbooks varies, and each is maintained and supported by a different team. If you had to use them all, you would surely hate it.

One of the core goals of Mesh unified rate limiting is to cut all of this away: simplify the complexity, reduce the learning and usage cost for business developers, and reduce our own maintenance cost.

- consolidate all flow-control capabilities into MOSN, keeping the best of each product and discarding the rest;

- consolidate all rate-limiting rules into the unified rate-limiting center.

This should be the last rate-limiting wheel we ever build, right?

Surpassing the master

As mentioned above, we built Mesh unified rate limiting on Sentinel's shoulders, so what have we done that Sentinel does not have?

In fact, we have provided our own implementations for almost all the rate-limiting capabilities Sentinel offers, with many highlights and enhancements.

Here are some of our technical highlights.

Adaptive rate limiting

For business developers, assessing capacity and re-running load tests for every single interface is a burden; with limited energy they can only protect the key interfaces, so some low-traffic interfaces will inevitably be left without rate limits.

Meanwhile, those responsible for quality and stability keep seeing, during incident reviews, faults caused by missing rate-limit configurations, mis-configured limits, failed load tests, blocked threads, and the like.

We hope that even when rate limits are missing or mis-configured, MOSN can, when system resources run critically low, accurately find the culprit behind the shortage and automatically adjust the abnormal traffic in real time according to the system water level.

Against this requirement, we implemented a self-detecting, self-adjusting rate-limiting strategy that fits the definition of a mature cloud-native capability.

The implementation principle of adaptive rate limiting is not complicated. Put simply, the overall system water level is monitored in real time, and once the limit is triggered, traffic is adjusted proportionally at second-level granularity.

The core logic is as follows:

- System resource detection: system resource usage is sampled every second; if it exceeds the threshold continuously for N seconds (5 by default), baseline calculation is triggered and load-test traffic is blocked to free resources for online business;

- Baseline calculation: traverse the current statistics of all interfaces, use a series of algorithms to find the big resource consumers, then identify among them the abnormal traffic that has grown significantly, and store a snapshot of their current resource usage as baseline data;

- Baseline adjuster: adjust the baseline data stored in the previous step according to the actual situation. Based on the result of system resource detection, the baseline value is adjusted every second: if the system still exceeds the threshold, the baseline value is reduced proportionally; otherwise it is restored proportionally, and so on;

- Rate-limiting decision (see the sketch after this list): system traffic keeps flowing through the adaptive rate-limiting module, which tries to fetch the interface's baseline data. If there is none, the interface has not been limited and the request is let through directly; if baseline data exists, the current concurrency is compared against the baseline to decide whether the request may pass.
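The following is only a simplified sketch of the decision and adjustment steps described above; the type names, the 20% shrink step, and the per-interface concurrency bookkeeping are assumptions for illustration, not MOSN's actual implementation:

```go
package adaptive

import (
	"sync"
	"sync/atomic"
)

// baseline is a snapshot of the concurrency an interface is allowed,
// taken when the system water level first exceeds the threshold.
type baseline struct {
	allowed int64 // currently allowed concurrency, re-adjusted every second
}

// AdaptiveLimiter holds per-interface baselines and in-flight counters.
// Both maps are assumed to be populated when a baseline is first created.
type AdaptiveLimiter struct {
	mu        sync.RWMutex
	baselines map[string]*baseline
	inflight  map[string]*int64
}

// Allow implements the decision step: no baseline means the interface has
// never been limited, so the request passes directly; otherwise the current
// concurrency is compared against the baseline value.
func (l *AdaptiveLimiter) Allow(iface string) bool {
	l.mu.RLock()
	b := l.baselines[iface]
	c := l.inflight[iface]
	l.mu.RUnlock()
	if b == nil || c == nil {
		return true
	}
	if atomic.AddInt64(c, 1) > atomic.LoadInt64(&b.allowed) {
		atomic.AddInt64(c, -1)
		return false
	}
	return true
}

// Release must be called when a request admitted by Allow finishes.
func (l *AdaptiveLimiter) Release(iface string) {
	l.mu.RLock()
	c := l.inflight[iface]
	l.mu.RUnlock()
	if c != nil {
		atomic.AddInt64(c, -1)
	}
}

// Adjust is called once per second by the baseline adjuster: while the
// system is still over the threshold the baseline shrinks proportionally,
// otherwise it is restored step by step up to the original snapshot.
func (l *AdaptiveLimiter) Adjust(overThreshold bool, original map[string]int64) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	for iface, b := range l.baselines {
		cur := atomic.LoadInt64(&b.allowed)
		switch {
		case overThreshold:
			atomic.StoreInt64(&b.allowed, cur*8/10) // shrink by 20%
		case cur < original[iface]:
			next := cur*12/10 + 1
			if next > original[iface] {
				next = original[iface] // never restore past the snapshot
			}
			atomic.StoreInt64(&b.allowed, next)
		}
	}
}
```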

This self-implemented adaptive rate limiting has the following advantages:

- Worry-free configuration: no code intrusion and minimal configuration;

- Second-level adjustment: single-machine self-detection and self-adjustment with no external dependencies, bringing the water level back within seconds;

- Intelligent identification: identifies load-test resources and abnormal traffic;

- Accurate identification: compared with other adaptive rate-limiting techniques, such as Netflix's concurrency-limits or Sentinel's system-level adaptive limiting based on BBR ideas, it can pinpoint the culprit down to the interface dimension, and even down to the parameter or calling-source dimension.

Cluster rate limiting

Before introducing cluster rate limiting, let's briefly consider the scenarios where single-machine rate limiting falls short.

Single-machine rate-limit counters are kept independently in each machine's memory; the machines know nothing about each other, and every machine usually uses the same rate-limit configuration.

Consider the following scenarios:

- Suppose the total limit the business wants is smaller than the number of machines. For example, the business has 1000 machines but wants to cap total QPS at 500, which averages out to less than 1 QPS per machine. How should the single-machine limit be configured?

- Suppose the business wants to cap total QPS at 1000 across 10 machines, but traffic is not distributed evenly across them. How should the single-machine limits be allocated?

Any problem in computer science can be solved by adding another layer of indirection. It is natural to keep the rate-limit counters in a unified external cache; this is the basic idea of cluster rate limiting.
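To make the idea concrete (Ant's implementation is built on internal cache infrastructure rather than Redis, so this is only an illustrative stand-in), a synchronous cluster counter can be sketched as a fixed-window counter in a shared store:

```go
package cluster

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// AllowSync consults the shared counter on every request: it increments the
// counter for the current time window and lets the request pass only while
// the counter stays within the cluster-wide threshold.
func AllowSync(ctx context.Context, rdb *redis.Client, resource string,
	threshold int64, window time.Duration) (bool, error) {

	// One counter per resource per time window (e.g. per second).
	slot := time.Now().UnixNano() / int64(window)
	key := fmt.Sprintf("ratelimit:%s:%d", resource, slot)

	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First request in this window: let the key expire after the window.
		rdb.Expire(ctx, key, window)
	}
	return n <= threshold, nil
}
```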

However, synchronously querying the cache on every request has some problems:

- when traffic is heavy, the cache comes under great pressure and must be provisioned with enough resources;

- synchronously querying the cache, especially across cities, adds noticeable latency; in the worst case a 30 ms+ cross-city call is not something every business can accept.

We therefore provide two modes of cluster rate limiting, synchronous and asynchronous. For high-traffic or latency-sensitive scenarios we designed a second-level cache scheme: instead of querying the remote cache on every request, each instance accumulates counts locally and only consults the remote cache after consuming a certain share or after a certain time interval has passed. If the remote quota has already been used up, traffic is blocked until the next time window begins. This asynchronous mode strikes a balance between the performance and the accuracy of cluster rate limiting in high-traffic scenarios (a sketch of the local-accumulation idea follows).
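Below is a minimal sketch of that local-accumulation idea; the batch size, window handling, and the remote-counter call are assumptions for illustration only. Each instance deducts a batch of tokens from the shared quota and serves requests locally from that batch until it runs out or the window rolls over.

```go
package cluster

import (
	"context"
	"sync"
	"time"
)

// TakeBatch deducts up to n tokens from the shared cluster-wide quota and
// reports how many were actually granted; it is an injected dependency,
// e.g. backed by the same external counter used in the synchronous mode.
type TakeBatch func(ctx context.Context, resource string, n int64) (granted int64, err error)

// AsyncLimiter trades a little accuracy for latency: most requests are
// answered from a locally held batch without any remote call.
type AsyncLimiter struct {
	mu        sync.Mutex
	resource  string
	batchSize int64     // tokens to fetch from the remote counter at once
	local     int64     // tokens left in the locally held batch
	window    time.Time // start of the current time window
	take      TakeBatch
}

func (l *AsyncLimiter) Allow(ctx context.Context, windowLen time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	// Drop the leftover batch when a new time window starts.
	now := time.Now()
	if now.Sub(l.window) >= windowLen {
		l.window = now.Truncate(windowLen)
		l.local = 0
	}

	// Refill from the remote counter only when the local batch is exhausted.
	if l.local <= 0 {
		granted, err := l.take(ctx, l.resource, l.batchSize)
		if err != nil || granted <= 0 {
			// Remote quota exhausted (or unreachable): block until the next window.
			return false
		}
		l.local = granted
	}
	l.local--
	return true
}
```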

Fine-grained rate limiting

Traditional interface-granularity rate limiting may not satisfy some complex business requirements. For example, the same interface may need to be treated differently depending on the calling source, or may need independent rate-limit configurations keyed by the value of a business parameter (such as merchant ID or activity ID).

Fine-grained rate limiting is designed to handle exactly this kind of complex configuration.

Let's first sort out the conditions business developers may want to express. They basically fall into the following categories:

  1. By business source

For example, a service exposed by application A is called by the three systems B, C, and D, and we only want to limit the traffic coming from B, leaving C and D unrestricted.

  2. By business parameter value

For example, by UID, activity ID, merchant ID, payment scenario ID, and so on.

  3. By full-link business tag ¹

For example, "Huabei withholding", "Yu'e Bao purchase payment", and so on.

[Note 1]: The full-link business tag is an identifier generated according to rules configured by the business. The tag is transparently propagated in the RPC protocol so that the business source can be identified across services.

In more complex scenarios, these conditions may be combined with logical operators, for example: the traffic source is A and the activity ID is xxx, or the business tag is A or B and the parameter value is xxx, and so on.

Some of the conditions above can be read directly from the request header, such as the source application and source IP; we call these basic information. Business parameters and the full-link tag, on the other hand, are not available for every application; we call these business information.

A traffic-condition rule allows basic information and business information to be combined with basic logical operations, and generates an independent sub-resource point from the result of each combination.

According to the conditional rules configured by the business, traffic is split into several sub-resource points, and independent rate-limit rules are then configured on each sub-resource point, which satisfies the need for fine-grained rate limiting (a simplified sketch follows).
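The sketch below shows one way such conditional splitting could work; the type names, matching logic, and key format are hypothetical and only illustrate the idea of deriving sub-resource points, not MOSN's actual rule model:

```go
package finegrained

import "fmt"

// RequestInfo carries the basic information read from the request header
// and the business information extracted from business parameters.
type RequestInfo struct {
	Interface string            // e.g. "com.example.PayService:pay"
	SourceApp string            // basic information: calling application
	Params    map[string]string // business information: e.g. merchantId, activityId
}

// Condition is a single predicate of a rule,
// e.g. "source application is B" or "merchantId is 12345".
type Condition func(r RequestInfo) bool

// Rule combines conditions (a simple AND here) and names the
// sub-resource point that matching traffic is routed to.
type Rule struct {
	SubResource string
	Conditions  []Condition
}

// SubResourceKey splits traffic: the first matching rule decides the
// sub-resource point; otherwise traffic stays on the parent interface.
// An independent rate-limit rule is then configured per returned key.
func SubResourceKey(r RequestInfo, rules []Rule) string {
	for _, rule := range rules {
		matched := true
		for _, c := range rule.Conditions {
			if !c(r) {
				matched = false
				break
			}
		}
		if matched {
			return fmt.Sprintf("%s|%s", r.Interface, rule.SubResource)
		}
	}
	return r.Interface
}
```

With this, a rule whose only condition is "source application is B" maps B's calls to a key such as "com.example.PayService:pay|from-B", and only that key gets a dedicated rate-limit rule, leaving traffic from C and D untouched.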

Do More

Now that rate-limiting and circuit-breaking capabilities are unified, what else can we do? Let's share some of our thoughts.

Rate limiting X self-healing

After implementing adaptive rate limiting, we quickly promoted it across the group. Adaptive rate limiting is triggered almost every day, but we found that many of the triggers are caused by single-machine failures. With hundreds of thousands of containers running online, occasional single-machine jitter is unavoidable.

Rate limiting solves the overall capacity problem, but after a heavily depended-upon service is throttled, the affected business requests still fail. A better way is to quickly shift the traffic to other healthy machines.

A traditional self-healing platform detects machine faults through monitoring and then performs the follow-up self-healing actions, and monitoring usually has a data delay of 2 to 3 minutes. If the adaptive rate limiter reports to the self-healing platform the moment it triggers, the platform can judge whether it is a single-machine problem and then carry out the self-healing, which improves the effectiveness of self-healing and further increases business availability.

Likewise, if the self-healing platform receives an adaptive-rate-limit trigger and finds that it is not a single-machine problem but an overall capacity problem, it can scale out quickly to self-heal the capacity issue.

Rate limiting X degradation platform

When a service the business strongly depends on fails, rate limiting guarantees that the failure will not snowball into a service avalanche due to capacity problems, but it cannot by itself improve business availability. A single-machine failure can be handled by shifting traffic, but what should be done when an overall failure occurs?

A better way is to fall back to a degraded service prepared in advance.

A degradation platform built on a serverless platform can sink some of the general degradation logic into the base (for example, cache-based bookkeeping and asynchronous recovery), while each business implements its own serverless degradation module as needed. Even when a service is completely unavailable, MOSN can still forward requests to the degraded service, achieving a higher overall availability.

"Summary"

As MOSN's rate-limiting capabilities are gradually enriched and more Mesh high-availability capabilities are built, MOSN is becoming an important part of our technical-risk and high-availability infrastructure.

The above is some of our experience in practicing and landing Mesh rate limiting. We hope it gives you a deeper understanding of Service Mesh, and we look forward to your attention to MOSN so that we can get more feedback from the community and do better.

I hope everyone will work hard and make progress together.

"Zhang Xihong", the core member of the open source project MOSN, shared "Current Limitation on Technical Vents" at SOFAMeetup "Chengdu Station" on August 11, leading everyone to understand the future exploration direction of Mesh current limiting and fuse.


