Realizing the New Value of Service Mesh: Precise Control of the "Blast Radius"

Author: Zhijian

Software evolves through continuous iteration. To a certain extent, we are less worried about software being imperfect than about it iterating too slowly to improve. In distributed software, how to verify a new version quickly and safely has long been a shared concern and an area of active exploration, and the emergence of Service Mesh has pushed that exploration to a new level. The "swimlane" concept is not new in distributed software, but this time we built it on service mesh as the underlying technology, taking full advantage of cloud-native technology's natural strength in flexible traffic management.
This article shares the full-link traffic marking and routing capabilities developed at Alibaba Cloud, which offer a new experience of service mesh technology while realizing new value from it.

Concepts and Scenarios

Figure 1 uses the Bookinfo sample application officially provided by Istio to illustrate the key concepts in the usage scenarios. The purple rounded boxes represent Envoy. All swimlanes in the figure are the same in nature; the different names only distinguish sub-scenarios or users.

• Baseline: the environment in which all services of the business are deployed. The baseline can be a real production environment, or an environment built for development work that is completely separate from production.

• Traffic lane (swimlane): a soft environment that is isolated from the baseline environment. Machines (i.e., Pods in Kubernetes) join a swimlane by being labeled. The machines in a swimlane can still communicate with the machines in the baseline at the network level.

• Traffic fallback: a swimlane does not need to deploy the same set of services as the baseline environment. When a service that the call chain depends on does not exist in the swimlane, the traffic needs to fall back to the baseline environment, and return to the swimlane further down the chain when necessary. For example, the dev1 swimlane in Figure 1 does not have the reviews service that the productpage service depends on, so the traffic needs to fall back to the reviews service in the baseline (shown by the dark blue line in the figure), and the reviews service in the baseline then needs to send the traffic back to the ratings service in the dev1 swimlane.

• Traffic label passthrough: every sidecar on the call chain needs to automatically copy the traffic label carried in an incoming request into each outgoing request forked from it, so that the label is passed through the full link and traffic can be routed by it. Otherwise, traffic cannot shuttle back and forth between the swimlane and the baseline.
• Entry service: the first service that traffic reaches when it enters a swimlane. In Figure 1, the box for such a service is marked with a triangle on its left border.

Figure 1

The swimlane technique can be used in the following scenarios:

• Routine development of a single service or of multiple services. The developer builds a swimlane, deploys the services with new functions into it, and steers test traffic into it for verification by defining rules based on traffic characteristics. Because the swimlane only needs to contain the new version of the service under test, this saves the trouble of building a full-link test environment. In this scenario, you need to pay attention to how data written by test traffic is persisted, and clean up the dirty data left behind by development and joint debugging.

• Full-link grayscale release. When launching a major function involves multiple services, swimlanes allow more comprehensive functional verification in the form of a full-link grayscale release. After the full-link function passes acceptance, the new versions of the services are released to the baseline.

• Safeguarding critical business. For businesses such as retail (for example, POS cash registers), we do not want a software failure to cause a major public backlash. In this case, we can isolate the business traffic in a swimlane as an additional safeguard.

Technical implementation

Traffic marking schemes and implementation

When using the swimlane technique, there are three schemes, depending on where the traffic is marked. Although the schemes differ, the service mesh implementation behind them is exactly the same; they are listed separately only to help readers understand.

Figure 2 illustrates option one. In this scheme, a first-level gateway sits in front of the service mesh's Ingress gateway; let's call it an API gateway (e.g., Nginx). An API gateway can usually add extra headers to a received request, based on the characteristics of the traffic, before forwarding it, which completes the marking. In the figure, the HTTP header x-asm-traffic-lane: dev1 is added to specific traffic, meaning that this traffic should be sent to the dev1 swimlane. In this scheme, Envoy in the service mesh does not need to perform any marking.

Figure 2

Figure 3 illustrates option two. In this scheme, client traffic goes directly to the Ingress gateway of the service mesh. The Ingress gateway identifies the traffic by its characteristics using Istio's native VirtualService matching rules, adds an HTTP header named x-asm-traffic-lane before forwarding the request, and then routes the traffic to the corresponding swimlane.

Figure 3

Figure 4 illustrates option three. In essence, this scheme is exactly the same as option two: it also identifies the traffic through Istio's native VirtualService matching rules and adds an HTTP header named x-asm-traffic-lane. The only difference is that the Envoy in option two acts as the Ingress gateway, while the Envoy in option three acts as a sidecar.

Figure 4

Once traffic is marked, every Envoy in the service mesh passes the mark along the full link and routes by it, based on the traffic mark and the configuration issued by the control plane.
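Whichever of the three options is used, the marking step itself amounts to matching a request against a rule and setting the x-asm-traffic-lane header. The following is a minimal Python sketch of that idea, not the gateway's or Envoy's actual implementation. The rule shape (header name, exact value, lane) and the function name mark_request are hypothetical; the end-user: dev2 rule mirrors the example used later in this article.

```python
from typing import Dict, List, Tuple

LANE_HEADER = "x-asm-traffic-lane"  # header that carries the traffic label

# Hypothetical rule shape: (header name, exact value to match, lane to assign).
MarkRule = Tuple[str, str, str]


def mark_request(headers: Dict[str, str], rules: List[MarkRule]) -> Dict[str, str]:
    """Return a copy of the headers, labeled by the first matching rule (if any)."""
    marked = dict(headers)
    for name, value, lane in rules:
        if headers.get(name) == value:
            marked[LANE_HEADER] = lane
            break
    return marked


rules = [("end-user", "dev2", "dev2")]
marked = mark_request({"end-user": "dev2"}, rules)
untouched = mark_request({"end-user": "alice"}, rules)
```

Here marked carries x-asm-traffic-lane: dev2, while untouched has no label and would stay on the baseline.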

Traffic label passthrough

Figure 5 illustrates the traffic details between a service in the service mesh and the Envoy sidecar next to it.

Figure 5

From Envoy's perspective, traffic is forwarded in both the inbound and outbound directions. I1 is incoming traffic, which Envoy forwards to the local Svc A on receipt; O1 is outgoing traffic (produced when Svc A calls other services in order to process I1), which Envoy forwards to the external callee on receipt. Inbound and outbound refer only to requests and have nothing to do with the responses to those requests. An incoming request may lead to multiple outgoing requests (i.e., "forks"), depending on the business logic of Svc A.

The core problem swimlane technology has to solve is how to make every outgoing request forked from a marked incoming request carry the same label. Our solution builds on distributed tracing (e.g., OpenTelemetry). Distributed tracing uses a traceId to uniquely identify a call-chain tree: once the root request has been assigned a globally unique traceId, every new call forked from it must carry the same HTTP header with the same value. This means service developers must make sure the header is propagated to downstream calls in their code (e.g., by calling the OpenTelemetry SDK to propagate headers). The prerequisite for swimlane technology is therefore that every service uses distributed tracing, which is already one of the best practices of microservice architecture, so it is easy to meet. Returning to Figure 5: when Svc A receives request I2 and, while processing it, needs to issue call O1, it must make sure the traceId header in I2 is propagated to the O1 request. This is a detail that the developers of Svc A need to pay special attention to.
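The propagation duty on the service side can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the helper name forward_headers is hypothetical, header names are assumed to be lower-case already, and x-request-id stands in for whatever traceId header the chosen tracing system uses. A real service would more likely rely on its tracing SDK for this.

```python
from typing import Dict, Iterable


def forward_headers(
    incoming: Dict[str, str],
    names: Iterable[str] = ("x-request-id",),
) -> Dict[str, str]:
    """Pick the tracing headers from the incoming request (I2) so they can be
    attached to every outgoing call (O1) made while handling it."""
    return {name: incoming[name] for name in names if name in incoming}


# Svc A handles I2 and builds the headers of an outgoing call O1.
i2_headers = {"x-request-id": "trace-1", "accept": "text/html"}
o1_headers = {"content-type": "application/json", **forward_headers(i2_headers)}
```

Only the tracing header is copied; unrelated headers of the incoming request, such as accept, are not forwarded.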

Once every service request in the service mesh carries a traceId, passing the traffic label through the full link in Envoy is straightforward. It breaks down into these steps:

• Envoy keeps an internal mapping table that records the mapping between traceId and traffic label. In Figure 5, for example, the traffic label is carried in the HTTP header x-asm-traffic-lane: x-asm-traffic-lane: dev1 means the label is dev1, and x-asm-traffic-lane: canary means the label is canary.
• When request I1 reaches Envoy, Envoy adds a record to the mapping table based on the traceId and traffic label carried in the request.
• For each O1 request it receives, Envoy looks up the traffic label in the mapping table by the request's traceId and adds it to the O2 request before forwarding it.
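The steps above can be sketched as a toy model in Python. This is not Envoy code; the class and method names (LaneTable, on_inbound, on_outbound) are hypothetical, and x-request-id is assumed to be the traceId header, as in the Bookinfo example discussed later.

```python
from typing import Dict

LANE_HEADER = "x-asm-traffic-lane"  # header that carries the traffic label
TRACE_HEADER = "x-request-id"       # header assumed to carry the traceId


class LaneTable:
    """Toy model of a sidecar's traceId -> traffic label mapping table."""

    def __init__(self) -> None:
        self._lane_by_trace: Dict[str, str] = {}

    def on_inbound(self, headers: Dict[str, str]) -> None:
        """Record the label carried by an incoming request (I1)."""
        trace_id = headers.get(TRACE_HEADER)
        lane = headers.get(LANE_HEADER)
        if trace_id and lane:
            self._lane_by_trace[trace_id] = lane

    def on_outbound(self, headers: Dict[str, str]) -> Dict[str, str]:
        """Return the headers of the forwarded request (O2): a copy of the
        outgoing request's headers (O1) plus the recorded label, if any."""
        forwarded = dict(headers)
        lane = self._lane_by_trace.get(headers.get(TRACE_HEADER, ""))
        if lane is not None:
            forwarded[LANE_HEADER] = lane
        return forwarded


table = LaneTable()
table.on_inbound({TRACE_HEADER: "trace-1", LANE_HEADER: "dev1"})
o2 = table.on_outbound({TRACE_HEADER: "trace-1"})        # label restored
unmarked = table.on_outbound({TRACE_HEADER: "trace-2"})  # unknown trace, no label
```

The sketch leaves out concerns a real proxy would have to handle, such as expiring old entries so the table does not grow without bound.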

The advantage of this traceId-based label passthrough in the service mesh is that both marking traffic and propagating the label are completely decoupled from the services. The capability sinks into the service mesh, which is good at traffic governance, and this further unlocks flexibility in traffic scheduling.

Definition of traffic labels and traceId

Alongside the existing CRs in Istio, we added a new CRD, TrafficLabel. We chose to add a new CRD rather than extend VirtualService directly because VirtualService was designed from the start to be application-oriented. When a business is complex enough that many applications along the full link need to be placed in a swimlane, the VirtualService of every application has to be adjusted, and the timeliness and operability of making that many changes become a problem. Another way to extend VirtualService would be to let it carry global rules, but that requires a rule-merging mechanism, which is also problematic in practice. The Istio community has discussed merging multiple VirtualServices at length; merging is currently supported only on the gateway and not for sidecars, out of concern that differences in merge order could cause failures.

Figure 6 shows how to use a TrafficLabel CR to define a globally effective traffic labeling method in the istio-system root namespace. It defines a label named x-asm-traffic-lane, which is used as the HTTP request header that stores the traffic label (for example, dev1, dev2, or canary), and it obtains the traceId from x-request-id. Users can set this according to the distributed tracing system they have chosen. In the figure it is taken from the x-request-id header because Envoy uses x-request-id to uniquely identify a link across the whole network. Using x-request-id as the key of the mapping table also means we can demonstrate swimlanes directly with the Bookinfo sample application from the Istio open source community, because every Bookinfo service propagates the x-request-id header from incoming requests to outgoing requests.

apiVersion: istio.alibabacloud.com/v1beta1
kind: TrafficLabel
metadata:
  name: global-traffic-label
spec:
  rules:
  - labels:
      - name: x-asm-traffic-lane
    protocols: "http"
    traceIdHeader: x-request-id
  hosts:
    - "*"

Figure 6

Routing by traffic label

To support routing by traffic label, Istio's VirtualService needs to be extended so that the destination field can specify the traffic destination with a variable such as $x-asm-traffic-lane, as shown in Figure 7. In other words, traffic carrying the x-asm-traffic-lane: dev2 header is routed to the dev2 swimlane, which is actually a subset named dev2 defined by the DestinationRule, as shown in Figure 8. Note that the variable name $x-asm-traffic-lane in the VirtualService in Figure 7 must match the label name defined in the TrafficLabel in Figure 6.

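apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
  # Rule 1: mark requests whose end-user header is dev2 and send them to the dev2 subset.
  - match:
    - headers:
        end-user:
          exact: dev2
    route:
    - destination:
        host: reviews
        subset: dev2
      # Fall back to the baseline if the dev2 subset has no instances or is unavailable.
      fallback:
        case: noinstances|notavailable
        target:
          host: reviews
          subset: baseline
      headers:
        request:
          set:
            x-asm-traffic-lane: dev2
  # Rule 2: route labeled requests to the subset named by the label's value.
  - route:
    - destination:
        host: reviews
        subset: $x-asm-traffic-lane
      fallback:
        case: noinstances|notavailable
        target:
          host: reviews
          subset: baseline
  # Rule 3: default route to the baseline.
  - route:
    - destination:
        host: reviews
        subset: baseline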

Figure 7

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - labels:
      version: v2
    name: baseline
  - labels:
      version: v3
    name: dev2

Figure 8

From the DestinationRule in Figure 8, it is easy to see that only one swimlane, dev2, is defined besides the baseline; Figure 7 is the matching VirtualService. Together they correspond to the baseline and the dev2 swimlane in Figure 1.
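The combined effect of the variable subset and the fallback in Figures 7 and 8 can be approximated by a few lines of Python. This is a simplified sketch rather than the mesh's actual routing logic: the function name pick_subset is hypothetical, and live_subsets is an assumed stand-in for "the subsets of this service that currently have available instances".

```python
from typing import Dict, Set

LANE_HEADER = "x-asm-traffic-lane"
BASELINE = "baseline"


def pick_subset(headers: Dict[str, str], live_subsets: Set[str]) -> str:
    """Resolve `subset: $x-asm-traffic-lane` for one service, falling back to
    the baseline when the request is unlabeled or the lane has no instances."""
    lane = headers.get(LANE_HEADER)
    if lane and lane in live_subsets:
        return lane
    return BASELINE


reviews_subsets = {"baseline", "dev2"}  # reviews is deployed in the dev2 swimlane
details_subsets = {"baseline"}          # details exists only in the baseline

to_reviews = pick_subset({LANE_HEADER: "dev2"}, reviews_subsets)
to_details = pick_subset({LANE_HEADER: "dev2"}, details_subsets)
unlabeled = pick_subset({}, reviews_subsets)
```

A dev2-labeled request reaches the dev2 version of reviews, falls back to the baseline for details, and unlabeled traffic always stays on the baseline.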

Product implementation

In the cloud-native context, ease of use is in the spotlight, and we understand well what that implies. When designing the product's interactions, we therefore tried to set aside what we already know, put ourselves in the user's situation to think and optimize, and strive for a balance between functionality and ease of use.

Before using a swimlane, a user is assumed to have built a baseline environment that includes all services. In K8s, the baseline environment is usually deployed in a dedicated namespace so that its services are easier to operate and manage. To create a swimlane, the user only needs to provide its name. The rest of this section walks through creating a swimlane named dev2.

After the swimlane is created, services need to be published into it. Because a published service already exists in the baseline environment and its K8s Service resource has already been created, publishing a service into the swimlane really means creating another Deployment under that Service; intuitively, it creates another software version of an existing service. The publish action therefore involves confirming the baseline version, the number of instances, and the container image address.

After services are published into the swimlane, check the swimlane's service list to make sure they have all started normally. At this point no traffic is entering the swimlane yet; you need to configure traffic diversion rules to steer baseline traffic into the swimlane.

Traffic diversion rules can be configured on HTTP headers, URIs, and cookies, so that exactly the traffic under test is selected to enter the swimlane. The rule in the figure below directs traffic whose HTTP header end-user has the value dev2 into the dev2 swimlane. When configuring rules, you also need to specify the entry service correctly.

After the traffic diversion rules are applied, you can log in on the web page with the user name dev2 to see the effect of the services in the dev2 swimlane. The following two figures show the page as served by the full baseline and by the dev2 swimlane, respectively. Because the productpage and details services are not deployed in the dev2 swimlane, requests to them fall back to the baseline, so the content of The Comedy of Errors and Book Details is identical in the two figures.

After a service is published into the swimlane, the swimlane's service list shows a traffic comparison between each service and its baseline version, which helps developers understand how the services in the swimlane are running.

In addition, the service topology diagram clearly shows how services in the dev2 swimlane (lane-dev2 in the figure) are invoked.

Summary and Outlook

The service-mesh-based swimlane technology we have explored lets developers create an isolated environment in seconds for development testing or for safeguarding critical business, and keeps the "blast radius" to a minimum through precise traffic diversion rules. It delivers well on the new experience and new value of cloud-native service mesh technology.

Next, we will further integrate the swimlane and version grayscale features in a scenario-based way, so that users can use them intuitively. At the functional level, we will extend the protocols supported in swimlanes, such as RocketMQ and Dubbo 3.0, to maximize the value of swimlane technology by broadening its application scenarios.

Finally, we will continue to build a modern service governance platform with microservice architecture based on the concept of Service Mesh as Infra, and work with industry partners to accelerate the development and promotion of this new technology.

About the author: Li Yun (alias: Zhijian) is the technical lead of Alibaba Cloud Service Mesh Hybrid Cloud Products. Since 2018, he has led the team developing service mesh technology within Alibaba Group, and has given many talks on cloud native and service mesh at QCon.
