[Internet Business Double Eleven] Service link isolation technology and practice based on ServiceMesh technology

Text｜Zhang Huaren (alias: Hua Lun)

Architect of Basic Technology Architecture Department of Internet Commercial Bank

Proofreading｜Kan Guangwen (alias: Kongmen)

This article is 4,832 words, about a 10-minute read

｜Introduction｜

Under a microservice architecture, the call relationships between services are intricate, and a single application may carry several different business flows. Because these flows run in the same application process, they inevitably affect one another.

If the traffic of one business surges, the load on the application process rises sharply and requests start to queue, so other business traffic is bound to be affected. Most of the time this mutual influence is tolerable or can be avoided. In certain scenarios, however, we may need to isolate some business traffic to eliminate the risk of businesses affecting each other:

For example, when background scheduling traffic affects online user requests;
Or when a low-sensitivity business, or even one that is allowed to fail, affects a high-sensitivity business that must be protected.

The demand for service link isolation is widespread in the industry. The usual solution is to create a new application, and then migrate the business that needs to be isolated to this new application.

With the new-application approach, R&D, operation and maintenance all cost roughly double, and related applications have to be modified and migrated. This is barely acceptable when only a single new application has to be created. Some applications at Internet Commercial Bank, such as Gaobao Minimalist Gateway and Gaobao Customer View, currently use this approach. It is cumbersome, and when multiple applications along the entire link of a particular business need to be isolated, its cost rises non-linearly and becomes unacceptable.

Under a cloud-native architecture, containers and traffic can be managed at a finer granularity. For the business traffic isolation scenarios above, we have a simpler, more flexible, and more general alternative, which we call "business unit isolation". It meets the needs above without creating new applications. This solution has been applied in a number of business scenarios at Internet Commercial Bank, including core links, and it passed the test of this year's Double Eleven promotion.

So what exactly is "business unit isolation"? How do we use it to achieve business link isolation? This article explains both in detail.

PART. 1 Concept and basic principles

Concept and operation and maintenance model

"Business unit isolation" is a set of traffic dyeing and resource isolation solutions that helps businesses achieve business link isolation relatively easily. During research and verification we also proposed an optimization and drove its implementation, which further reduced the cost of business access.

"Business unit isolation" is best explained through two new concepts: "AIG" and "business unit".

An AIG is a set of resources that an application sets aside to support a particular business. The business link that serves a particular business or class of businesses, made up of the AIGs of one or more applications, is called a business unit. Ensuring that traffic matching certain characteristics, and only that traffic, is diverted to a given business unit is what we call "isolated deployment of business units".

Figure: Simplified AIG operation and maintenance model

Main tasks and supporting facilities

It is not difficult to see from the concept of "business unit isolation": to achieve traffic isolation of a certain business link, at least the following things need to be done:

1. Business unit construction: Create AIGs for the applications on the link to form a business unit, and ensure that by default no traffic flows into the new business unit.

2. Business traffic identification: Traffic of the specific business that enters the upstream application must be identified in some way.

3. Specific business diversion: There must be a mechanism that directs the identified business traffic to the newly created business unit.

These tasks clearly require cooperation between the infrastructure side and the application side. As shown in the figure below, the related infrastructure and functions are as follows:

1. Business unit construction: AIG needs to provide complete R&D/O&M/monitoring support;

2. Traffic identification (RPC): The application (A) upstream of the business unit on the link needs to integrate the marking and dyeing SDK, so that marking and dyeing rules can be issued to it through the dyeing management and control platform;

3. Traffic identification (scheduling): Complex scheduling (message-triggered batch tasks that the application distributes on its own within a single LDC) is converted into SOFARPC-based streaming tasks so that it can be dyed and isolated;

4. Specific business diversion: The refined routing on the MOSN side needs to support AIG, so that traffic can flow into the new business unit.

Figure: Overview of the business unit isolation solution and its supporting facilities

Business unit construction

The business unit is actually a relatively abstract concept, corresponding to a business link.

In practice at Internet Commercial Bank, to make business units more concrete, we require that for the applications in one business unit, the aigcode part of the AIG name (appname-aigcode) be kept as consistent as possible.
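The naming convention can be illustrated with a minimal sketch. It assumes AIG names are plain strings of the form appname-aigcode and that the aigcode is the part after the last hyphen; the article does not say how names containing several hyphens are split, and the function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_aig_name(aig_name: str) -> Tuple[str, str]:
    """Split 'appname-aigcode' into (appname, aigcode).

    Assumption: the aigcode is everything after the LAST hyphen.
    """
    appname, sep, aigcode = aig_name.rpartition("-")
    if not sep or not appname or not aigcode:
        raise ValueError(f"not an appname-aigcode name: {aig_name!r}")
    return appname, aigcode


def group_business_units(aig_names: List[str]) -> Dict[str, List[str]]:
    """Group application names by shared aigcode, one group per business unit."""
    units: Dict[str, List[str]] = defaultdict(list)
    for name in aig_names:
        appname, aigcode = parse_aig_name(name)
        units[aigcode].append(appname)
    return dict(units)
```

For instance, `group_business_units(["A-vostro", "B-vostro"])` groups applications A and B under the aigcode `vostro`, which is why keeping the aigcode consistent makes a business unit easy to recognize.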

Therefore, constructing a specific business unit essentially means creating, for each related application on the link, a resource group (AIG) dedicated to isolating that business.

For a single application, building an AIG consists of two parts:

The first is to initialize the AIG metadata;

The second is the various operation and maintenance actions around this AIG (scaling, bringing machines online and offline, restart, sidecar injection and upgrade, etc.).

To support AIG, almost every operation and maintenance action on the PaaS side has to be adapted, which is a very large workload. The PaaS side therefore had to weigh the trade-offs and decided to support AIG only in the target workload operation and maintenance mode. As a result, AIG strongly depends on applications migrating from the existing image mode to the workload mode.

In workload operation and maintenance mode, PaaS describes release and operation content as CRD resources and hands them to the underlying sigma (K8s) to carry out. Switching to workload mode helps unify the group's release and operation system, and it also better supports scenarios such as elastic scaling and self-healing.

However, compared with the image mode, the workload mode significantly changes users' habits and experience, and there were many related problems in the initial stage. So although workload adoption at Internet Commercial Bank had been advancing in an orderly manner, when core businesses later joined the AIG project we did not want a forced switch to workload mode to affect operation, maintenance, and emergency response for those businesses. We therefore asked PaaS to support enabling workload mode only for AIG machines. This requirement has been met, and complete verification of mixed-mode operation and maintenance has been carried out.

RPC traffic isolation

After the business unit is created, how do we ensure that, before any traffic is diverted to it, no RPC traffic flows into the new business unit by default?

An application machine receives RPC traffic because its IP is mounted in the registry (SOFARegistry) and in the cross-data-center load balancer (AntVip): after the application process starts successfully, MOSN registers the service information with SOFARegistry, and during the release process the PaaS side calls an interface to mount the machine's IP on AntVip once the machine passes its health check.

Therefore, to ensure that a new AIG machine receives no traffic by default, adjustments are required on both the MOSN side and the PaaS side.

The overall adjustment plan is shown in the figure below:

Figure: How an AIG receives no RPC traffic by default

How to identify the RPC traffic of a specific business?

After the upstream application integrates the marking and dyeing SDK, its requests are intercepted by the RPC interceptor in the SDK, both when it is called by other applications as a server and when it calls other applications as a client. The interceptor compares the RPC request against the issued marking and dyeing rules; once a rule matches, a business request identifier is added to the RPC header.

Figure: Traffic identification based on the marking and dyeing SDK
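The interceptor step can be sketched roughly as follows. This is not the SDK's actual API: the `RpcRequest` and `MarkRule` types, the rule fields, and the service names in the usage note are hypothetical. The header key `sl_biz_unit` is taken from the routing rule example later in this article, and using it as the header name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: the identifier is carried under the key used in the routing
# rule example later in the article (sl_biz_unit).
BIZ_UNIT_HEADER = "sl_biz_unit"


@dataclass
class RpcRequest:
    service: str
    method: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class MarkRule:
    """Hypothetical marking rule: match on service/method, stamp an identifier."""
    service: str
    method: Optional[str]  # None matches any method of the service
    biz_unit: str

    def matches(self, req: RpcRequest) -> bool:
        return req.service == self.service and self.method in (None, req.method)


def mark_request(req: RpcRequest, rules: List[MarkRule]) -> RpcRequest:
    """Interceptor step: the first matching rule writes the identifier into the header."""
    for rule in rules:
        if rule.matches(req):
            req.headers[BIZ_UNIT_HEADER] = rule.biz_unit
            break
    return req
```

With a rule such as `MarkRule("ReplenishService", "replenish", "vostro")`, a matching request leaves the interceptor carrying `sl_biz_unit=vostro`, while requests to other services pass through unmarked.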

The final step is to divert traffic to a specific business unit.

With MOSN's powerful refined routing capabilities, we can route traffic to a designated business unit and keep it converged within that unit. Business unit isolation mainly uses MOSN's client-side routing. When a client application initiates a call, the request passes through the MOSN in the current Pod, which directs the traffic according to the routing rules we have issued.

Figure: Diverting traffic to a specific business unit and converging traffic within the unit

Scheduling traffic isolation

Scheduling is essentially message-driven, and simple scheduling scenarios usually do not need isolation. Many scenarios that do need isolation currently use the "message task + three-tier distribution" mode, in which scheduling triggers batch processing logic.

Three-tier distribution sends requests over the tb-remoting protocol rather than the standard SOFARPC protocol and does not pass through MOSN, so MOSN cannot control where these requests go.

To solve this problem, AntScheduler introduced a new streaming scheduling mode that transforms three-tier distribution into multiple standard SOFARPC calls, so that it works seamlessly with MOSN to meet the traffic isolation requirement.
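The idea of turning one batch into several ordinary RPC calls can be sketched as below. This is only an illustration of the principle, not AntScheduler's implementation: the function name, the `rpc_call` callback, and the shard size are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def dispatch_as_rpc_calls(
    items: Sequence[str],
    shard_size: int,
    rpc_call: Callable[[List[str], Dict[str, str]], None],
    headers: Dict[str, str],
) -> int:
    """Split one batch into shards and send each shard as its own RPC call.

    Because every shard is an ordinary RPC request carrying headers, a
    sidecar on the request path can apply routing rules to each of them.
    Returns the number of calls made.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    calls = 0
    for start in range(0, len(items), shard_size):
        shard = list(items[start:start + shard_size])
        rpc_call(shard, dict(headers))  # copy so callees cannot mutate shared headers
        calls += 1
    return calls
```

The point of the design is that each shard travels the same path as any other RPC request, so the routing layer sees it and can direct it, which is not the case for requests distributed over a separate protocol.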

For scenarios where scheduling traffic should be routed directly to an AIG, this can be configured directly in the AntScheduler interface. After configuration, the platform issues service-level MOSN client routing rules.

For scenarios where the entire link is isolated, the scheduling platform is connected to the marking and dyeing platform, so the RPC traffic it initiates is marked automatically. Downstream applications can then define further dyeing and diversion based on this mark.

Figure: "Message task + three-tier distribution" vs. "streaming task"

PART. 2 Asynchronous replenishment link isolation

After the "business unit isolation" infrastructure was in place, several business scenarios were gradually connected. Asynchronous replenishment link isolation is the first application of "business unit isolation" to a core link. It isolates real-time transaction traffic from asynchronous replenishment traffic so that the two do not affect each other. During this year's Double Eleven promotion, the asynchronous replenishment business unit carried 10% of the asynchronous replenishment traffic and ran smoothly.

Next, I will use this project as an example to describe in detail how we use "business unit isolation" to achieve business link isolation.

Project Background

The applications in this project are on the core links of Internet Commercial Bank and are already heavily protected. The business is expected to grow rapidly, so guaranteeing high availability of the link faces a huge challenge.

The link currently carries two kinds of traffic: real-time transaction traffic, and replenishment traffic initiated asynchronously by the upstream.

For replenishment traffic, failure can be tolerated because the request has already been persisted. Real-time transaction traffic, by contrast, must be protected.

As the business develops, asynchronous replenishment traffic will increase sharply and real-time transaction traffic risks being affected. The business side therefore wants a way to isolate asynchronous replenishment traffic from real-time transaction traffic, to ensure the high availability of real-time transactions.

Overall plan

Since the link involves multiple core applications, if the traditional new application solution is adopted, the initial transformation and subsequent maintenance costs are extremely high, so the business hopes to adopt the "business unit isolation" solution. After in-depth communication with the business side, it is confirmed that a new asynchronous replenishment business unit is to be created and carries the following traffic:

1. Asynchronous replenishment flow (RPC) from the upstream application U;

2. Follow-up traffic from upstream application U's replenishment scheduling (scheduling -> RPC).

Asynchronous replenishment RPC isolation

The upstream application U of the asynchronous replenishment unit needs a small modification to integrate the traffic marking and dyeing SDK, so that we can identify the asynchronous replenishment traffic.

After application U integrates the SDK, its requests are intercepted by the RPC interceptor in the SDK, both when it is called by other applications as a server and when it acts as a client, and can then be marked and dyed. The RPC request or response header of dyed traffic carries a traffic identifier. MOSN recognizes this identifier during routing and directs the traffic to the asynchronous replenishment business unit.

The following figure shows the marking, dyeing and drainage logic of RPC traffic for asynchronous replenishment:

Asynchronous replenishment scheduling isolation

To identify scheduling traffic, the application must switch from the "message task + three-tier distribution" mode to the streaming task mode, which turns the work into multiple SOFARPC calls. MOSN's refined routing can then direct those calls to the designated AIG.

In this project, replenishment scheduling RPC requests are already marked, so only the dyeing rules and the MOSN diversion rules need to be issued on the upstream application U side.

The whole logic is shown in the following figure:

Stress testing and grayscale mechanism

The marking and dyeing SDK can identify stress-test traffic when marking and dyeing, but in this project we did not use that capability and instead added a restriction in the MOSN routing rules.

On the one hand, the SDK does not yet support identifying Internet Commercial Bank's stress-test traffic;

On the other hand, the MOSN rule issuance process is simpler.

MOSN supports configuring multiple routing rules. Each rule consists of a scope, a condition, and a routing destination. Rules support grayscale at any ratio and can be restricted to stress-test traffic, which keeps the whole diversion process safe. The following figure shows the MOSN routing rule with which application U diverts 1/1000 of its stress-test traffic (shadowTest=T) to the asynchronous replenishment AIG (A-vostro) of application A:
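How a rule made of scope, condition, destination, and a grayscale ratio might be evaluated is sketched below. This is not MOSN's rule format or implementation: the `RouteRule` type and its field names are hypothetical, and the hash-based sampling is an assumption, since the article does not describe how MOSN selects the grayscale fraction.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RouteRule:
    scope: Dict[str, str]      # e.g. {"app": "U"}: which caller the rule applies to
    condition: Dict[str, str]  # e.g. {"shadowTest": "T"}: required request headers
    destination: str           # e.g. "A-vostro": target AIG
    gray_numerator: int = 1    # route gray_numerator / gray_denominator of matches
    gray_denominator: int = 1


def _in_gray_bucket(request_id: str, rule: RouteRule) -> bool:
    # Assumption: sampling uses a stable hash of a request id, so the same
    # request always gets the same decision.
    bucket = zlib.crc32(request_id.encode("utf-8")) % rule.gray_denominator
    return bucket < rule.gray_numerator


def route(caller: Dict[str, str], headers: Dict[str, str], request_id: str,
          rules: List[RouteRule]) -> Optional[str]:
    """Return the destination AIG of the first matching rule, or None for default routing."""
    for rule in rules:
        in_scope = all(caller.get(k) == v for k, v in rule.scope.items())
        cond_ok = all(headers.get(k) == v for k, v in rule.condition.items())
        if in_scope and cond_ok and _in_gray_bucket(request_id, rule):
            return rule.destination
    return None
```

Under these assumptions, the rule described above would be `RouteRule({"app": "U"}, {"shadowTest": "T"}, "A-vostro", 1, 1000)`: requests from U that lack the `shadowTest=T` header, or that fall outside the 1/1000 bucket, return `None` and keep their default routing.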

Self-convergence of flow in the unit

After traffic enters the business unit, it goes on to call other applications. MOSN routing rules must be issued to keep that traffic within the business unit; otherwise it flows back to the default business unit.

The original plan was to keep routing on the traffic identifier written by the marking and dyeing SDK, with rules such as: scope: app=U; condition: sl_biz_unit=xxx; destination: mosn_aig=A-vostro.

However, this kind of rule is tightly bound to a specific client application and server application. In complex scenarios such as this project, a rule has to be issued for every call relationship, and the overall effort of sorting out and maintaining the rules is very large.

We identified this problem during investigation and verification. After discussing it with the colleagues involved, we proposed a simpler and more practical solution (AIG self-convergence). MOSN can identify the aigcode of the AIG it is running in, and a rule is issued to all applications that call the target application. The rule then depends only on the target application and the aigcode, for example: scope: aigcode=vostro; destination: mosn_aig=A-vostro. After this simplification, the number of rules equals the number of applications in the unit.
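The difference in rule count between the two plans can be illustrated with a small sketch. The rule strings follow the examples given in the text; the function names, the dictionary layout, and the call graph in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def pairwise_rules(calls: List[Tuple[str, str]], aigcode: str,
                   biz_unit: str) -> List[Dict[str, str]]:
    """Original plan: one rule per (client, server) call relationship."""
    return [
        {
            "scope": f"app={client}",
            "condition": f"sl_biz_unit={biz_unit}",
            "destination": f"mosn_aig={server}-{aigcode}",
        }
        for client, server in sorted(set(calls))
    ]


def self_convergence_rules(unit_apps: List[str], aigcode: str) -> List[Dict[str, str]]:
    """Simplified plan: one rule per application in the unit, scoped by aigcode."""
    return [
        {"scope": f"aigcode={aigcode}", "destination": f"mosn_aig={app}-{aigcode}"}
        for app in sorted(set(unit_apps))
    ]
```

For a unit of four applications A, B, C, D with the call relationships A->B, A->C, B->C, B->D, C->D, the pairwise plan needs five rules while self-convergence needs four. The gap widens as call relationships are added, because the self-convergence count depends only on the number of applications.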

The self-convergence rules of this project are as follows:

｜Summary and Outlook｜

This article introduces a new solution that Internet Commercial Bank uses for business traffic isolation scenarios, together with our experience of putting it into practice.

Compared with the traditional, cumbersome approach of adding new applications, the "business unit isolation" solution, built on cloud-native technologies such as containers and ServiceMesh, is more lightweight and flexible. So far we have achieved isolation of RPC, scheduling, and HTTP traffic, and we will go on to support isolation of message traffic and other traffic types.

Readers who have similar needs or are interested in the related technical solutions are welcome to discuss them with us at any time.
