Introduction: This article shares the thinking and practice behind Alibaba's trinity strategy for service mesh technology, and introduces some of the features of Alibaba Cloud Service Mesh (ASM), including several that were released recently.

Authors: Zong Quan, Yu Zeng

Alibaba's Trinity Strategy

Alibaba Cloud put forward its trinity strategy of open source, self-research, and commercialization early on. Let me first share my understanding of it.

1.png

Years of software development experience tell us that building great software depends on a few key elements:

  • Communication
  • Feedback
  • Practice

In software development we cannot work behind closed doors, and we cannot "invent" business requirements at will. Business scenarios and product features have to be refined through real use. Open source gives us a platform for joint innovation on which everyone can define specifications and standards together. When vendors follow those standards, customers face no lock-in risk: they can migrate at any time, choose the best vendor for their workloads, and run their business in the simplest, most convenient, and most economical way.

When customers evaluate Alibaba Cloud's service mesh, one important criterion is whether it is compatible with community Istio, because they worry about being locked in to Alibaba Cloud.

When it comes to self-research, some colleagues may ask whether open source and self-research contradict each other. The answer is no.

The self-research we mean here is self-research based on open source, not abandoning the open source version and reinventing the wheel. It means understanding the open source products deeply enough to:

  • Master all of the source code;
  • Be able to modify every line of it;
  • Of course, also serve the specific, unique scenarios of our own business, including some that cannot be standardized.

Building on self-research and a deep command of the open source products, we move features that address common customer scenarios to the cloud and package them as cloud products that customers can use out of the box. This is the original intention behind commercialization.

Back at Alibaba Group, open source, self-research, and commercialization form a technology flywheel.

For Alibaba's engineers, Double 11 is a "feast" every year. To give shoppers a smooth experience and merchants more diverse ways to run promotions, Double 11 multiplies the Alibaba e-commerce platform's requirements for efficiency, reliability, and scale, which in turn brings out the potential of its engineers. As a core piece of the underlying technology, Alibaba's middleware goes through a comprehensive evolution and upgrade with every Double 11.

Alibaba has launched Dubbo, RocketMQ, Nacos, Seata and other well-known open source projects in the open source community, encouraging developers to build a middleware ecosystem, including ServiceMesh related technologies.

Embracing open source service mesh technology

2.png


Alibaba Cloud began investigating and practicing service mesh technology very early. Istio officially released version 1.0 in 2018 and entered the public eye. During this early period, Alibaba had already begun contributing to open source projects in the related ecosystem.

In the microservice ecosystem, Alibaba Cloud also maintains open source service frameworks such as Dubbo and Spring Cloud Alibaba. With e-commerce as a large proving ground, Alibaba Cloud has deep expertise in microservices. We compared the Sidecar model and the original model feature by feature to understand the advantages and disadvantages of each. In the process we have also actively contributed to open source projects in the Istio microservice ecosystem, such as Envoy (Dubbo Filter, RocketMQ Filter), Nacos MCP support, Spring Cloud Alibaba, and Sentinel.

Many service frameworks are in use today. How can services built on different frameworks interoperate? A service framework is like a railroad track: it is the foundation of interconnection. Only once frameworks can interoperate is higher-level business interoperability possible. Unifying them under common standards and combining them into a new generation of service framework is therefore an inevitable trend.

Dubbo and HSF are both microservice RPC frameworks used inside Alibaba. Throughout the continuous iteration of Alibaba's business, they have provided solid support for the underlying microservice capabilities and carried one Double 11 promotion after another.

With the cloud-native wave, along with overall resource cost optimization and DevOps, some shortcomings of the original microservice frameworks Dubbo and HSF have gradually been exposed: limited multi-language support, configuration that is not separated from code logic, SDK upgrades that require pushing every business team to act, and difficulty interoperating with the different frameworks used by acquired businesses.

Some businesses within Alibaba Group began using service mesh technology to transform the underlying microservice framework. While meshing the Dubbo framework, the Alibaba Cloud service mesh team contributed the Envoy Dubbo Filter, which brings existing Dubbo services into the mesh so that they can gain the incremental value a service mesh provides.

Meanwhile, the Dubbo community itself is iterating toward cloud native. The community began discussing Dubbo's cloud-native evolution plan starting with Dubbo 2.7.8. To better fit cloud-native scenarios (the infrastructure has changed, and Kubernetes has become the de facto standard for resource scheduling and orchestration), the Dubbo team is evolving Dubbo 2.0 into Dubbo 3.0 and has proposed a Proxyless Mesh design.

As businesses gradually migrate to the cloud, migration paths vary and architectures are in transition from existing to cloud-native designs. The infrastructure on which applications are deployed is flexible and changeable, and microservices on the cloud are becoming more diverse. Cross-language, cross-vendor, and cross-environment calls inevitably call for unified protocols and frameworks based on open standards to meet interoperability requirements. These scenarios are exactly where a service mesh excels, which gives it plenty of room to grow.

Currently, the Dubbo 3.0 community version has been released. The core changes are:

  • Application-level service discovery
  • The Dubbo 2.0 protocol evolved into the Triple protocol, which is based on gRPC
  • Proxyless Mesh without a Sidecar

3.png

Meshing does not happen overnight. Existing businesses, and similar businesses moving to the cloud, go through an intermediate transition stage. Traditional microservice frameworks such as Dubbo and Spring Cloud rely on service registries such as Nacos, Eureka, and ZooKeeper, and the mesh needs to be compatible with them. Based on the open MCP over xDS protocol of the Istio control plane, support for the protocol was first implemented in Nacos so that Istiod can connect directly to the Nacos registry.

4.png

Open source products often cannot be used directly in large-scale production environments. They need adaptation and tuning, plus some productization work. One example is Intel's mTLS acceleration solution.

5.png


Intel contributed the implementation upstream on the Envoy side, but Istiod does not yet support it. As a cloud product, we want to give customers out-of-the-box capabilities. Building on Intel's open source mTLS acceleration solution, ASM implements the corresponding extension in Istiod on the control plane. Because the acceleration depends on the CPU type of the underlying nodes (Ice Lake), ASM turns the acceleration on or off adaptively according to where the user's workloads are actually deployed. With multiBuffer acceleration turned on and Alibaba Cloud g7-generation ECS instances as nodes, QPS improved by nearly 80%.

When service meshes come up, one question is asked often: "What is the difference between a service mesh and Dapr?"


6.png

Dapr uses a Sidecar architecture: it runs as a separate process alongside the application and provides functions such as service invocation, network security, and distributed tracing. This naturally raises the question of how Dapr compares with service mesh solutions such as Istio.

Although Dapr and a service mesh overlap in some functions, they differ in focus: a service mesh concentrates on networking concerns, while Dapr provides building blocks that make it easier for developers to build applications as microservices. Dapr is developer-centric, whereas a service mesh is infrastructure-centric. In addition, Dapr does not provide traffic control features such as routing or traffic splitting.

Of course, the two can be deployed together, in which case both the Dapr Sidecar and the service mesh Sidecar run alongside the application.

Adoption and practice of service mesh at Alibaba

As described earlier, Alibaba has open sourced a number of products for the microservice ecosystem. These products originated in internal business scenarios. After they were incubated in those scenarios and proven by large-scale business use, it became clear that external customers have similar needs, which is why these internal implementations were open sourced.

The same is true of the Istio-based mesh. Businesses within the Group began exploring Mesh very early. Let's look at the details:

7.png

As the overall architecture diagram shows, Alibaba Group provides Mesh users with a console. The console takes an application-centric view and integrates platforms such as CI/CD, authorization management, safe production, and the SRE operations system, giving meshed applications a unified portal. Users can manage the full application lifecycle following DevOps practices, and the mesh provides service governance, full-link grayscale, and safe production capabilities, so that application owners can operate their applications on a self-service basis with self-healing.

The core Mesh capabilities support RPC protocols such as Dubbo, MetaQ (RocketMQ), and LWP, and extend the mesh with capabilities such as full-link traffic coloring, routing policies, and a plug-in marketplace.

Alibaba Group also supports third-party system integration through OpenAPI and the Kubernetes API.

Building on the community Istio architecture, the mesh is deeply integrated with Alibaba Group's internal middleware (Diamond, ConfigServer) so that it stays compatible with how existing businesses already work, allowing them to connect to the mesh seamlessly. This is also why, at the ASM product level, we support scenarios with multiple registries such as Nacos: some mesh users need them.

The operations plane is also abstracted so that service traffic management rules (VirtualService, DestinationRule, and so on) can be configured through the UI console. Integration with OpenKruise enables Sidecars to be turned on, turned off, and hot upgraded at Pod granularity. Integration with Prometheus, Grafana, and the Group's internal alerting service ARMS provides observability and monitoring for microservices.

8.png


The evolution path of Alibaba Group's service mesh

The service mesh evolution of Alibaba Group is divided into three stages: non-intrusive partial scale, non-intrusive full scale, and cloud-native end state. Currently, the Group's business meshing is in the second stage.

The first stage: meshing existing services involves a transition period, and this transition must be non-intrusive so that business developers do not notice it. This is the background for adopting a non-intrusive solution. The mesh must also cover the existing microservice governance capabilities while providing its own incremental value.

The second stage: full scale. While solving the resource overhead and performance problems that come with scale, we load service configuration lazily through the Sidecar CRD to achieve configuration isolation, reduce Sidecar memory overhead by optimizing and trimming metrics, and implement lazy encoding and decoding by optimizing the Dubbo/HSF Filter, which improves data plane processing performance and reduces latency.

As the internal business Dubbo 2.0/HSF evolves to Dubbo 3.0, it eventually evolves to the cloud-native final state solution.

The third stage: cloud-native end state. As the infrastructure evolves toward Kubernetes, service discovery and service governance capabilities sink into the infrastructure in cloud-native scenarios. The mesh decouples business logic from service governance and separates configuration from code logic, which makes DevOps easier and brings the rich, extensible traffic scheduling and observability that a mesh provides.

Dubbo/HSF RPC supports multiple serialization methods, and Mesh does not provide friendly support for some serialization, such as Java serialization.

Therefore, in the first stage of meshing, the Sidecar does not encode or decode Java-serialized traffic and simply passes it through transparently. For Hessian2 serialization, the mesh implements full encoding and decoding support and, for performance reasons, does so lazily. On this basis we can mark (color) this type of traffic and, by extending VirtualService, implement label routing and fallback. This also enables business scenarios such as canary release and full-link grayscale.
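
The lazy codec idea can be illustrated with a minimal Python sketch. It is not the Dubbo/HSF Filter implementation: the class and function names are hypothetical, the `traffic-label` header name is an assumption, and JSON stands in for Hessian2 only to keep the example self-contained. The point it shows is that routing reads only the headers, the body is decoded at most once and only on demand, and the original bytes are forwarded untouched.

```python
import json


class LazyRpcFrame:
    """Hypothetical RPC frame: headers are parsed, the body stays raw until needed."""

    def __init__(self, headers, raw_body, serialization):
        self.headers = headers
        self.raw_body = raw_body
        self.serialization = serialization
        self._decoded = None
        self.decode_count = 0  # how many times the body was actually decoded

    def body(self):
        """Decode the body on first access only (lazy decoding)."""
        if self.serialization == "java":
            # Mirrors the passthrough case: this kind of body is never decoded.
            raise ValueError("Java-serialized bodies are passed through, not decoded")
        if self._decoded is None:
            self._decoded = json.loads(self.raw_body)  # stand-in for Hessian2
            self.decode_count += 1
        return self._decoded


def route(frame, default="base"):
    """Choose a target subset from a header label without touching the body."""
    return frame.headers.get("traffic-label", default)


def forward(frame):
    """Return the bytes to send upstream; the raw body is reused as-is."""
    return frame.raw_body
```

In this sketch a frame that is only routed and forwarded never pays the decoding cost, which is the saving the lazy approach is after.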

The internal MeshSDK layer will be gradually upgraded to the Dubbo 3.0 SDK. When the mesh is enabled, the Dubbo 3.0 SDK provides only RPC and similar capabilities, which corresponds to ThinSDK mode. After meshing, the Sidecar supports the protocol better and resource overhead is reduced to some extent. When the Sidecar fails, the application can quickly switch back to FatSDK mode without the business noticing.

Traffic scheduling for services within the Group involves some more complex scenarios, especially for larger services: for example, deployments across multiple data centers and regions, with multiple service versions and multiple environments to route between within a single region.
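
The switch between the two modes can be sketched roughly as follows. This is an illustration only, not the Dubbo 3.0 SDK API: `RpcClient`, `sidecar_healthy`, `via_sidecar`, and `direct` are hypothetical names, and the health check is assumed to be supplied by the caller.

```python
class RpcClient:
    """Hypothetical client that prefers the Sidecar path and falls back to direct calls."""

    def __init__(self, sidecar_healthy, via_sidecar, direct):
        self.sidecar_healthy = sidecar_healthy  # callable returning True or False
        self.via_sidecar = via_sidecar          # thin path: the Sidecar does governance
        self.direct = direct                    # fat path: the SDK does governance itself

    def mode(self):
        """Report which mode the next call would start in."""
        return "thin" if self.sidecar_healthy() else "fat"

    def call(self, request):
        """Send through the Sidecar when possible, otherwise call the provider directly."""
        if self.sidecar_healthy():
            try:
                return self.via_sidecar(request)
            except ConnectionError:
                # The Sidecar died between the health check and the call.
                pass
        return self.direct(request)
```

Because the fallback happens inside the client, the calling business code issues the same `call` in both modes.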

This involves routing and back-end cluster selection in different dimensions. These dimensions may include:

  • Region-based routing
  • Data center routing
  • Unit-based routing
  • Environment routing
  • Multi-version routing

Group e-commerce scenarios are particularly typical. For these, Istio was extended internally with new CRDs (RouteChain and TrafficLabel) and an extended VirtualService to support traffic marking and label-based routing.
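
To make the idea of routing across several dimensions concrete, here is a small Python sketch. It does not reproduce the RouteChain CRD, whose schema is not described in this article; the function name, the endpoint fields, and the rule "skip a dimension when nothing matches" are all assumptions chosen for illustration.

```python
def route_chain(endpoints, request_ctx, dimensions):
    """Narrow the candidate endpoints one dimension at a time.

    A dimension that the request does not specify, or that would leave no
    candidates, is skipped, so the request always has somewhere to go.
    """
    candidates = list(endpoints)
    for dim in dimensions:
        wanted = request_ctx.get(dim)
        if wanted is None:
            continue
        narrowed = [e for e in candidates if e.get(dim) == wanted]
        if narrowed:
            candidates = narrowed
    return candidates


ENDPOINTS = [
    {"addr": "10.0.0.1", "region": "hangzhou", "env": "prod", "version": "v1"},
    {"addr": "10.0.0.2", "region": "hangzhou", "env": "prod", "version": "v2"},
    {"addr": "10.0.1.1", "region": "shanghai", "env": "prod", "version": "v2"},
]
```

Calling `route_chain(ENDPOINTS, {"region": "hangzhou", "version": "v2"}, ["region", "env", "version"])` first keeps the two Hangzhou endpoints and then narrows them to the v2 instance. The order of the `dimensions` list decides which dimension takes priority.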

9.png

The commercial product Alibaba Cloud Service Mesh ASM also exposes these capabilities to varying degrees, which makes it possible to implement scenarios such as canary release, A/B testing, and full-link grayscale.

Cloud products: Alibaba Cloud Service Mesh ASM

So far we have covered Alibaba's service mesh practice in open source and in large-scale adoption. Next we turn to the cloud product, the third part of the cloud-native trinity. By distilling experience from real business scenarios, Alibaba Cloud keeps driving the technology forward and has accumulated a set of core service mesh technologies.

In terms of large-scale adoption: on-demand push of dynamic rule configuration, lossless Sidecar hot upgrades for large-scale businesses, broad support for heterogeneous computing infrastructure, and support for multiple registries.

In terms of traffic management: fine-grained traffic control, on-demand dynamic interception of traffic by protocol and port, zero-configuration request label routing and traffic coloring, and fine-grained management of multiple protocols.

In terms of observability: integrated intelligent operations that combine logging, monitoring, and tracing. Observability is also enhanced with eBPF, which provides non-intrusive observability across the entire link and helps troubleshoot services quickly.

In terms of security: support for SPIFFE/SPIRE, a zero-trust network, an enhanced authentication mechanism, and progressive adoption of mTLS.

In terms of performance optimization: eBPF-based network acceleration and performance optimization that integrates software and hardware.

10.png

Alibaba Cloud Service Mesh ASM is the industry's first Istio-compatible managed service mesh platform. It offers a complete set of service mesh capabilities (fine-grained application traffic management, end-to-end observability, security, and high availability) and supports complex scenarios such as multi-language environments, multiple microservice frameworks, and multi-protocol interconnection. The ASM technical architecture has been upgraded to V2.0: it hosts the core control plane components, keeps the architecture of the standard and professional editions unified, and smoothly supports upgrades across community versions. While staying consistent with community standards, ASM adds a range of enhancements, mainly traffic management and protocol enhancements, multiple zero-trust security capabilities, and integration with multiple registries such as Nacos and Consul. In addition, mesh diagnostics can quickly analyze the health of the mesh and help respond quickly to control plane alerts.

ASM is fully integrated with other cloud services, including observability capabilities such as distributed tracing, Prometheus monitoring, and Log Service. Integration with AHAS supports service rate limiting, cluster rate limiting, and adaptive rate limiting; combined with the Microservices Engine (MSE), it supports service governance and provides a consistent governance experience across multiple VPC clusters. For custom extensions, it supports the OPA security engine, WebAssembly, and other extension mechanisms.

Users can work with the service mesh through the ASM console, OpenAPI, declarative cloud-native APIs, and the kubeconfig of either the data plane or the control plane. With its refined control plane and management plane, ASM provides unified mesh governance (Anywhere Service Mesh) for services running on heterogeneous computing infrastructure, from ingress gateways to data plane Sidecar injection. It supports Container Service ACK, Serverless Kubernetes, edge clusters, externally registered Kubernetes clusters, and other infrastructure such as ECS virtual machines.

Functional Design of Service Mesh ASM

ASM implements full-link grayscale with traffic marking and label routing. In a microservice architecture, building a complete test system to verify new features before they go online is time-consuming and labor-intensive, and it gets harder as the number of microservices grows. "Traffic marking" and "routing by label" form a general solution to related problems such as test environment management and online full-link grayscale release. Because it is built on service mesh technology, the solution is independent of the development language and adapts to different layer 7 protocols; ASM currently supports HTTP/gRPC and Dubbo. ASM introduces a new TrafficLabel CRD that defines where the Sidecar obtains the traffic labels it must pass through. Traffic across the whole link is logically isolated: it is marked (colored) and routed by label. With ASM, developers no longer each need to deploy a complete environment; multi-environment governance is handled by the mesh, which greatly reduces R&D costs.
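
The following Python sketch simulates how a label carried through a call chain yields logical isolation without a full copy of every service. It is a toy model rather than ASM behavior as specified: the header name `x-traffic-label`, the `base` default version, and the rule that a service without a matching version falls back to `base` are assumptions made for illustration.

```python
# Which versions of each service are deployed (hypothetical example data).
DEPLOYED = {
    "a": {"base", "gray"},
    "b": {"base"},          # no gray version of service b exists
    "c": {"base", "gray"},
}


def pick_version(service, label, deployed):
    """Use the labelled version when it exists, otherwise fall back to base."""
    return label if label in deployed[service] else "base"


def call_chain(services, headers, deployed, label_header="x-traffic-label"):
    """Walk the call chain; the label header is passed along unchanged at every hop."""
    label = headers.get(label_header, "base")
    path = []
    for service in services:
        path.append((service, pick_version(service, label, deployed)))
    return path
```

A request labelled `gray` goes to the gray versions of `a` and `c` and to the base version of `b`. Because the label survives the hop through `b`, only the services that actually changed need a gray deployment.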

11.png


12.png

ASM supports canary release. Release is the last step in taking a feature update online, and some problems accumulated during development are only triggered at this final step. Releasing is itself a complicated process in which it is easy to make mistakes or skip key operations. Canary release is flexible to configure and its strategy can be customized: traffic can be shifted by percentage or by specific content (such as particular accounts or parameters), so that problems do not affect all users. In the figure, applications carry an environment label, and TrafficLabel marks user traffic as gray when the HTTP header satisfies user-id % 100 == 20. A label-based routing rule is delivered through VirtualService, so traffic with a userId of 120 is routed to the gray environment while traffic with a userId of 121 is routed to the normal environment. Canary release in ASM supports routing by traffic percentage and by request characteristics (such as HTTP headers and method parameters), integrates with the service mesh ingress gateway, and supports the HTTP, gRPC, and Dubbo protocols.

Besides using traffic marking and label routing for full-link grayscale and canary release, ASM can be combined with KubeVela for progressive release. KubeVela is an out-of-the-box, modern application delivery and management platform that simplifies application delivery for hybrid environments while remaining flexible enough to keep up with the iteration pressure of constantly changing business needs. The Open Application Model (OAM), the application delivery model behind KubeVela, is highly extensible in both design and implementation: it is fully application-centric, offers programmable delivery workflows, and is independent of the infrastructure. ASM supports complex canary release processes in combination with KubeVela by translating the configuration defined in KubeVela into traffic governance rules and delivering them to the data plane.
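
The marking rule in this example is simple arithmetic, and a short Python sketch makes it easy to check. The function names and the `normal`/`gray` environment names are hypothetical; only the `user-id % 100 == 20` condition and the two sample user IDs come from the text above. Treating a missing or non-numeric header as normal traffic is an assumption.

```python
def label_for(headers, modulus=100, remainder=20):
    """Return "gray" when the user-id header satisfies user-id % modulus == remainder."""
    try:
        user_id = int(headers.get("user-id", ""))
    except ValueError:
        # Missing or non-numeric user-id: treat as normal traffic (assumption).
        return "normal"
    return "gray" if user_id % modulus == remainder else "normal"


def route(headers):
    """Map the traffic label to a target environment name."""
    environments = {"gray": "gray-env", "normal": "normal-env"}
    return environments[label_for(headers)]
```

With this rule, user 120 is labelled gray because 120 % 100 is 20, while user 121 gives 21 and stays in the normal environment. Roughly one user in a hundred falls into the gray group.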

13.png

14.png

ASM provides zero-trust security capabilities. Plain HTTP communication between services in a microservice network is not secure: once an internal service is compromised, an attacker can use that machine as a springboard to attack the internal network. ASM reduces the attack surface in a cloud-native environment and provides the basic framework required for a zero-trust application network. By managing service-to-service security through ASM, you get end-to-end encryption, service-level identity authentication, and fine-grained authorization policies across the mesh.

Compared with the traditional construction of a security mechanism in the application code, the ASM zero-trust security system has the following advantages:

  • The policy lifecycle of ASM Sidecar proxies is independent of the application, so the proxies are easier to manage.
  • ASM supports dynamic policy configuration, so policies are easier to update and changes take effect immediately without redeploying the application.
  • ASM provides the ability to authenticate the end user credentials attached to the request, such as JWT.
  • ASM's centralized control architecture enables enterprise security teams to build, manage, and deploy security policies that are applicable to the entire enterprise.

The authentication and authorization systems are deployed as services in the mesh. Like other services in the mesh, they obtain security guarantees from the mesh itself, including encryption in transit, identity, policy enforcement points, and authentication and authorization of end-user credentials. The policy control plane defines and manages multiple types of authentication policies; the mesh control plane assigns identities to workloads in the mesh and rotates certificates automatically; and the Sidecar proxies in the data plane enforce the policies. The rules configured in the figure allow only the transaction service to call the order service and reject calls from the shopping cart service to the order service.
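
The allow rule from this example can be modelled in a few lines of Python. This is a simplified sketch, not the policy engine itself: `Rule` and `is_allowed` are hypothetical names, service identities are reduced to plain strings, and the evaluation follows the common convention that a destination with at least one allow rule rejects everything the rules do not match, while a destination with no rules accepts all callers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Allow calls from one source service to one destination service."""
    source: str
    destination: str


def is_allowed(source, destination, allow_rules):
    """Evaluate allow rules for a call from source to destination."""
    rules_for_destination = [r for r in allow_rules if r.destination == destination]
    if not rules_for_destination:
        # No policy targets this destination, so nothing restricts it.
        return True
    return any(r.source == source for r in rules_for_destination)


# Only the transaction service may call the order service.
RULES = [Rule(source="transaction", destination="order")]
```

Here `is_allowed("transaction", "order", RULES)` is true and `is_allowed("cart", "order", RULES)` is false, matching the behavior described for the figure.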

15.png

Because ASM hosts the control plane, it can manage multiple data plane clusters, and traffic management CRs are stored in the control plane, where users operate on governance rules through the control plane's Kubernetes API. The new version of ASM additionally aims to:

1. Support the operating habits users have from unmanaged mode, so that they can read and write Istio resources in the data plane Kubernetes cluster;

2. Support common command-line tools such as Helm;

3. Stay compatible with the API operations of other open source software that runs in single-cluster add-on mode.

To this end, ASM implements a Kubernetes API that lets data plane clusters access Istio resources. Both access paths are available at the same time, and users can choose between them according to their scenario.

16.png

ASM is compatible with community standards and provides smooth control plane upgrades. The data plane can be upgraded in two ways: rolling upgrade and hot upgrade. For a rolling upgrade, set the upgrade strategy to RollingUpdate; when a Pod with an injected Sidecar is redeployed, its Envoy image is automatically upgraded to the new version. The figure illustrates the second way: ASM combined with the hot upgrade feature of the OpenKruise project, which does not interrupt service while the data plane is upgraded, so the upgrade is invisible to the application. Releasing or updating an application automatically generates a SidecarSet configuration, and updating that SidecarSet completes the data plane upgrade. This capability is currently in grayscale rollout in the new version.

17.png


ASM works with Alibaba Cloud Application High Availability Service (AHAS) to control traffic for applications deployed in the mesh. It currently supports single-instance rate limiting, cluster rate limiting, and adaptive rate limiting. ASM also natively supports Istio's global and local rate limiting. Global rate limiting uses a global gRPC service to limit request rates across the whole mesh; local rate limiting limits the request rate of each service instance. The two can be used together.
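
As a rough illustration of how a per-instance limit and a mesh-wide limit can be combined, the sketch below uses a token bucket, a common rate limiting technique. It is a single-process toy: in a real mesh the global limit is a separate gRPC service, whereas here a second in-memory bucket stands in for it. The class and function names are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a fixed rate."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        """Refill for the elapsed time, then try to take one token."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admit(local_bucket, shared_bucket, now):
    """Admit a request only if the local check and then the shared check both pass."""
    return local_bucket.allow(now) and shared_bucket.allow(now)
```

Checking the local bucket first means an instance that is already over its own limit rejects the request without consulting the shared limiter. One design consequence of this ordering is that a request rejected by the shared limiter has still consumed a local token.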

ASM also supports connecting to the registry of the Microservices Engine (MSE) through the MCP over xDS protocol and synchronizing service information into the mesh. MSE Nacos supports the MCP protocol natively. Users only need to enable Nacos registry integration when creating or updating an ASM instance; the services in the registry are then synchronized to the mesh, which makes it easy to mesh Dubbo and Spring Cloud services without changing any business code.
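
To give a feel for what "synchronizing service information into the mesh" can produce, the sketch below converts registry-style instance records into a dictionary shaped like an Istio ServiceEntry. It is an assumption-laden illustration, not the MCP over xDS implementation: the input record fields (`ip`, `port`, `metadata`), the function name, and the port naming scheme are hypothetical, and how ASM actually represents synchronized services is not specified in this article.

```python
def to_service_entry(service_name, instances, protocol="TCP"):
    """Build a ServiceEntry-shaped dict from registry instance records."""

    def port_name(port):
        return f"{protocol.lower()}-{port}"

    ports = sorted({inst["port"] for inst in instances})
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "ServiceEntry",
        "metadata": {"name": service_name.lower().replace(".", "-")},
        "spec": {
            "hosts": [service_name],
            "resolution": "STATIC",
            "ports": [
                {"number": p, "name": port_name(p), "protocol": protocol}
                for p in ports
            ],
            "endpoints": [
                {
                    "address": inst["ip"],
                    "ports": {port_name(inst["port"]): inst["port"]},
                    # Registry metadata becomes endpoint labels usable for routing.
                    "labels": dict(inst.get("metadata", {})),
                }
                for inst in instances
            ],
        },
    }
```

Carrying registry metadata such as a version tag into endpoint labels is what would let mesh routing rules select subsets of instances that were registered outside Kubernetes.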

18.png

19.png

Finally, here are a few customer cases showing how customers use ASM to shorten the time needed to adopt service mesh technology, reduce the cost of troubleshooting, and save control plane resources.

1. As its business grew, the "12 Chinese Zodiac" (twelve complete test environments) that Dongfeng Nissan had created earlier could no longer meet its many concurrent demands, and environments even had to be allocated by lottery. By introducing ASM, the company built an "infinite zodiac" system based on traffic management that provides environments automatically on demand. With the operations-free experience, easy upgrades, and rich product support that ASM provides, the R&D team can concentrate on enjoying the value of the service mesh.

2. To cope with global business expansion and integrated operations, another customer deployed business applications across regions on ASM and Container Service ACK and optimized the access experience with a strategy of routing access by region, which effectively reduces access latency and improves response speed.

3. Sunmi Technology introduced ASM to build an integrated software and hardware solution for intelligent digital commercial POS systems, using the mesh to solve core problems such as gRPC load balancing, distributed tracing, and unified traffic management.

This article shared the thinking and practice behind Alibaba's trinity strategy for service mesh technology, along with some of the features of Alibaba Cloud Service Mesh ASM. Recently released features include version history management for Istio resources, access to Istio resources through the Kubernetes API of data plane clusters, cross-region failover and cross-region traffic distribution, control plane log collection and log alerting, and progressive release with KubeVela. For details on these and other features covering traffic management, observability, zero-trust security, and solutions, see the Alibaba Cloud Service Mesh ASM product documentation. If you are interested in ASM, search for group number 30421250 to join the ASM user group and explore service mesh technology together.

Copyright Notice: content of this article is contributed spontaneously by Alibaba Cloud real-name registered users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own its copyright and does not assume corresponding legal responsibilities. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and the "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find suspected plagiarism in this community, fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.
