The cloud-native evolution path of Alibaba's billion-level long-connection gateway

Author: Light Cone

The AServer access gateway carries the ingress traffic of the entire Alibaba Group. It keeps long-lived connections alive for hundreds of millions of users, forwards requests according to tens of thousands of routing policies, and acts as the bridge between those users and hundreds of thousands of back-end service nodes. It must support hundreds of millions of online users, tens of millions of QPS, and tens of thousands of API control policies that take effect promptly, while forwarding traffic safely and reliably and keeping the user experience smooth.

Supporting business traffic and control at this scale requires precise control over every detail of the system and the elimination of every potential risk point.

A cloud-native architecture greatly simplifies O&M operations and reduces potential risk. During last year's Double Eleven, the thousands of Pods of the AServer access gateway passed the traffic peak smoothly. This article describes how the AServer access gateway moved from its previous-generation architecture to a fully cloud-native one.

Architecture evolution background

The annual Double Eleven promotion is the most severe test for all of Alibaba's services, and especially for the AServer access gateway. As the group's front door, it must absorb the traffic peak of the promotion and scrub attack traffic, so the cluster is very large.

The huge cluster size and the extreme demands on machine performance make operation and maintenance complex. As more services are onboarded, the supported business scenarios have expanded, and the requirements for routing-policy flexibility and for how quickly changes take effect have risen, creating strong demand for dynamic orchestration of routing policies. Because services are diverse, business lines follow different network-freeze schedules, and faults must be isolated, there is also a stability-driven demand for traffic isolation.

Operational complexity, the demand for dynamic orchestration, traffic isolation, and extreme performance requirements drive the continuous evolution of the AServer access gateway. While keeping pace with business development, it has gradually reduced operation and maintenance costs and improved system stability, so that it withstands the test of Double Eleven year after year.

Business background

The AServer access gateway carries the ingress traffic of the entire Alibaba Group. It began as a tengine gateway that supported domain-name forwarding policies: requests were forwarded to different back-end services according to domain name, and the business model was relatively simple.

In the "All in wireless" era, to optimize the mobile user experience and reduce development cost for server-side engineers, the group developed the MTOP (Mobile Taobao Open Platform) API gateway in-house. It provides a consistent API platform for clients and servers: requests share the same domain name and are forwarded to the corresponding service based only on the API information carried in the URI. The access gateway therefore had to support routing by API (distinguished by URI), and the number of such rules grew rapidly to tens of thousands within a few years.

As the business becomes more fine-grained, different scenarios under the same API need to be distinguished, for example by the source of traffic to a Double Eleven promotion venue, such as Mobile Taobao, Alipay, or other externally placed pages. To keep up, the gateway needs fine-grained control: it must manage and distribute traffic based on request parameters and request headers. Each request must match a unique path out of tens of thousands of flexibly configured rules while the gateway maintains extremely high performance, which is very challenging.
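The matching requirement described above can be sketched as follows. This is a minimal illustration in Python, not the gateway's implementation: the `Rule` fields, the API name, the header and parameter names, and the back-end names are all hypothetical. Rules are bucketed by API (taken from the URI) so that a request only has to be compared with the rules of its own API, and within a bucket the most specific rule whose header and parameter conditions all hold wins.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rule:
    api: str                                    # API name carried in the URI (hypothetical format)
    backend: str                                # destination service (hypothetical name)
    headers: Dict[str, str] = field(default_factory=dict)
    params: Dict[str, str] = field(default_factory=dict)

    def specificity(self) -> int:
        # More conditions means a more specific rule.
        return len(self.headers) + len(self.params)

    def matches(self, headers: Dict[str, str], params: Dict[str, str]) -> bool:
        return (all(headers.get(k) == v for k, v in self.headers.items())
                and all(params.get(k) == v for k, v in self.params.items()))


class RouteTable:
    def __init__(self, rules: List[Rule]):
        # Bucket rules by API so a request is never compared with unrelated rules.
        self._by_api: Dict[str, List[Rule]] = {}
        for rule in rules:
            self._by_api.setdefault(rule.api, []).append(rule)
        # Within one API, try the most specific rules first.
        for bucket in self._by_api.values():
            bucket.sort(key=Rule.specificity, reverse=True)

    def route(self, api: str, headers: Dict[str, str], params: Dict[str, str]) -> Optional[str]:
        for rule in self._by_api.get(api, []):
            if rule.matches(headers, params):
                return rule.backend
        return None  # no rule for this API


table = RouteTable([
    Rule("item.detail.get", "detail-default"),
    Rule("item.detail.get", "detail-alipay", headers={"x-source": "alipay"}),
    Rule("item.detail.get", "detail-promo",
         headers={"x-source": "alipay"}, params={"venue": "1111"}),
])
```

With this table, a request for `item.detail.get` that carries `x-source: alipay` and `venue=1111` resolves to `detail-promo`, the same API with only the header resolves to `detail-alipay`, and a request with neither falls back to `detail-default`.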

business model diagram

Operation and maintenance system background

At the beginning, the supporting infrastructure was immature. The gateway layer was built on tengine, and the simplest and quickest approach was to use physical machines: deploying the process and its configuration was enough to bring up the service. As the business grew, configuration management became a bottleneck, and the gateway layer needed a strong configuration management platform to generate business configuration in a standardized way. The self-developed configuration management platform divides configuration into three parts: application configuration, public configuration, and certificate configuration.

Public configuration: managed with Git versioning; generates the basic configuration tengine needs to run, such as which modules are enabled and tengine's runtime logic
Application configuration: generates the tengine configuration required by each business from standard templates
Certificate configuration: because certificates expire, the platform also renews them automatically so that a renewal is never forgotten

The initial system deployment architecture:

This solution allows businesses to onboard themselves. The tengine configuration is generated from templates on the configuration management platform, then pushed to the gateway machines on a schedule and reloaded so that it takes effect.
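Template-based generation of this kind can be sketched in a few lines. The template below is an assumption for illustration only: the platform's real templates are not shown in this article, and the application name, domain, and addresses are made up. It uses only standard nginx-style directives, which tengine also accepts.

```python
from string import Template

# Hypothetical application template; ${...} placeholders are filled per application.
SERVER_TEMPLATE = Template("""\
upstream ${app}_backend {
${servers}
}
server {
    listen 80;
    server_name ${domain};
    location / {
        proxy_pass http://${app}_backend;
    }
}
""")


def render_app_config(app: str, domain: str, backends: list) -> str:
    """Render one application's configuration from the standard template."""
    servers = "\n".join(f"    server {addr};" for addr in backends)
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so an incomplete configuration is never produced silently.
    return SERVER_TEMPLATE.substitute(app=app, domain=domain, servers=servers)


config_text = render_app_config(
    "demo", "demo.example.com", ["10.0.0.1:8080", "10.0.0.2:8080"]
)
print(config_text)
```

In the workflow described above, the rendered text would then be pushed to each gateway machine and activated by a reload.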

This operating model did not depend on infrastructure and could evolve quickly. However, as the business grew and clusters expanded, the drawbacks of operating physical machines gradually became apparent, and addressing them became the top priority. Binary releases on physical machines relied on manual deployment: operators ran commands in batches to install the rpm package and restart the process, all from a command-line terminal.

Clearly this model cannot meet current stability requirements. Manual releases can easily cause systemic failures through operator error. It is also difficult to guarantee consistency across physical machines, both of binaries and of the machine environment (such as kernel parameters). Manual operation and maintenance no longer keeps up.

Containerization is the best solution to the release and environment-consistency problems. As the group's infrastructure matured, the obstacles to containerizing the access gateway were removed. The invariants (system configuration, binaries) are packaged together into a single release unit, while the variables (application configuration, public configuration, certificates) continue to be managed by the configuration management platform, adapted to work with containers.

The release and configuration change process after the containerization transformation:

The containerized architecture simplifies site construction and scaling out and in, and greatly improves release efficiency. It adds an approval process that systematically prevents failures caused by manual operations. The release process can also be connected to the monitoring system to raise alerts and suspend a release automatically.

Core issues

As the e-commerce business keeps accelerating and its scale reaches a bottleneck, it expands horizontally, becomes more fine-grained, and iterates faster. The cost for the gateway layer to adapt to these business changes keeps rising, which brings the following core problems:

Complex operation and maintenance: because of extreme performance requirements, gateway clusters have special machine requirements, and gateway configuration management is also special. These peculiarities make operations complex and prevent the gateway from connecting cleanly to the group's existing operation and maintenance system, so that system needs to be upgraded;
Insufficient dynamic orchestration: as more services are onboarded and the supported scenarios expand, services demand ever more flexibility and real-time behavior from routing policies. Static configuration cannot satisfy business needs in either real-time effect or policy flexibility, so dynamic orchestration of routing policies is required;
High traffic-isolation cost: there is no lightweight way to isolate traffic by business scope, and creating a new cluster is too expensive. Supporting different network-freeze schedules across business lines and supporting fault isolation requires a lightweight multi-cluster traffic isolation solution.

The rapid development of cloud native in recent years has also given the gateway layer better architectural options.

Cloud native architecture

To solve these problems, the AServer access gateway began its cloud-native evolution, combining the group's business scenarios with the cloud-native open source ecosystem. To allow step-by-step verification, the work was split into three phases: operation and maintenance system upgrade, service governance and gateway meshing, and the north-south architecture split. Each step is described in detail below.

Operation and maintenance system upgrade

Problems to be solved

Containerized deployment greatly simplified deployment and operations and solved the most prominent problems of the time, but changing the deployment method alone was not enough:

Because of the access gateway's peculiarities (such as the need to connect to the configuration management platform and the large number of VIPs required), it could not connect directly to the group's infrastructure, so a separate custom operation and maintenance tool was developed. Scaling out and in required coordinating multiple basic components through non-standard interfaces, which greatly slowed iteration on operation and maintenance products
Operations such as replacing a faulty machine relied on polling by an external system, and the group's infrastructure systems could act only through the custom operation and maintenance platform, which introduced long delays
Gateway operations were detached from the group's operation and maintenance system

Evolution thinking

The group's unified infrastructure ASI (Alibaba Serverless infrastructure), designed for cloud-native applications, has gradually matured and provides complete cloud-native technology stack support based on the native K8S API.

The cloud-native approach has strong orchestration capabilities. Custom k8s extensions make it easy to smooth over the peculiarities of the gateway layer, and ASI's existing automated operation and maintenance can be applied directly to the gateway layer.

The gateway layer's special machine models can be handled by dividing node pools. The machine model and kernel parameters of the gateway node pool can be customized, which removes the special handling from gateway operations and allows unified management.

Evolution plan

Using the k8s Controller extension mechanism and custom container orchestration, the gateway can watch Pod change events during scale-out and scale-in, add or remove machines on the configuration management platform accordingly, and attach or detach VIPs, smoothing over its operational peculiarities. All resources are defined through declarative APIs, which makes operation and maintenance convenient.
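The event handling described above can be sketched as a small, idempotent handler. This is a simplified model rather than the actual controller: `ConfigPlatform` and `VipManager` are hypothetical stand-ins for the configuration management platform and the VIP service, and in a real controller the events would come from the Kubernetes watch API instead of a hand-written list.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class PodEvent:
    kind: str      # "ADDED", "MODIFIED" or "DELETED", as in Kubernetes watch events
    pod_ip: str
    ready: bool    # whether the Pod currently passes its readiness check


class ConfigPlatform:
    """Stand-in for the configuration management platform (hypothetical API)."""

    def __init__(self) -> None:
        self.machines: Set[str] = set()

    def add_machine(self, ip: str) -> None:
        self.machines.add(ip)

    def remove_machine(self, ip: str) -> None:
        self.machines.discard(ip)


class VipManager:
    """Stand-in for the VIP / load balancer service (hypothetical API)."""

    def __init__(self) -> None:
        self.members: Set[str] = set()

    def attach(self, ip: str) -> None:
        self.members.add(ip)

    def detach(self, ip: str) -> None:
        self.members.discard(ip)


def handle_event(event: PodEvent, platform: ConfigPlatform, vip: VipManager) -> None:
    """Idempotent handler: the same event can be replayed without side effects."""
    if event.kind != "DELETED" and event.ready:
        # Scale-out: register for configuration pushes first, then take traffic.
        platform.add_machine(event.pod_ip)
        vip.attach(event.pod_ip)
    else:
        # Scale-in or unhealthy Pod: stop traffic first, then deregister.
        vip.detach(event.pod_ip)
        platform.remove_machine(event.pod_ip)


platform, vip = ConfigPlatform(), VipManager()
events = [
    PodEvent("ADDED", "10.0.0.1", ready=False),    # created but not ready yet
    PodEvent("MODIFIED", "10.0.0.1", ready=True),  # became ready
    PodEvent("ADDED", "10.0.0.2", ready=True),
    PodEvent("DELETED", "10.0.0.2", ready=True),   # removed during scale-in
]
for event in events:
    handle_event(event, platform, vip)
```

The ordering inside the handler is a design choice of this sketch: a Pod receives configuration before it receives traffic, and loses traffic before it is deregistered.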

Gateway operations still need a very simple operation and maintenance platform, used only for site construction. Compared with ordinary applications, building a gateway site requires creating VIPs in the corresponding region and binding domain names. The platform is lightweight and easy to maintain:

Through the ASI transformation, access gateway operations are integrated into the group's ASI cloud-native system (improving delivery efficiency and removing special-case operations), and general-purpose capabilities are pushed down into ASI and the basic systems. The gateway also gains risk isolation, self-healing, and elasticity:

Risk isolation: Sidecars separate security capabilities from engineering capabilities so the two do not interfere with each other. A failure in the security capability affects only traffic scrubbing; once the security capability is degraded, the service as a whole remains available;
Self-healing: the original containerized setup relied on polling by external applications to detect failures, which lacked both accuracy and timeliness. After the ASI upgrade, the container's own health detection identifies and replaces a faulty container within 3-5 minutes;
Elasticity: after the ASI transformation, each system is integrated through standard declarative APIs, bringing together the group's components. This greatly simplifies scaling out and in and lays the groundwork for automatic elasticity;
Shielding machine-model differences: by dividing node pools, gateway applications can use special machine models, and the underlying configuration hides the differences so that no special operations are needed.

Service governance & gateway meshing

Problems to be solved

As more types of services are onboarded at the gateway layer, tens of thousands of API routing rules must be supported and routing policies become ever more fine-grained, which tengine's native capabilities cannot satisfy. Custom tengine modules with non-standard definitions served the business well over the past few years, but as business demands became more refined, the cost of developing custom tengine modules has grown.

Original structure

Routing configuration combines module configuration with native configuration. Several module configurations jointly determine the routing policy, and because the configuration is scattered, the complete routing path of a request cannot be identified;
Because configuration is divided by functional module, incremental updates at business granularity are hard to implement;
The tengine-based architecture has limited support for dynamic changes. Domain-name changes are pushed on a schedule and take effect once a day, which cannot keep up with rapid business iteration;
Different control platforms connect directly through non-standard protocols, so integration is costly and changes are hard to converge and control in one place;
Different business lines (such as Taoxi and Youku) require resource isolation. Because most module configurations are static public configurations, building a new site is relatively costly.

Evolution thinking

How to orchestrate routing policies dynamically and control them at fine granularity is the primary consideration in the cloud-native system. Industry gateways such as Kong and Ambassador show that mainstream gateway data planes are built on either nginx or envoy. The products were compared on extensibility, dynamic orchestration, and maturity:

In terms of dynamic capability, standardization, and performance, envoy as the data plane fits the cloud-native direction better:

Dynamic and flexible
- The standard xDS protocol implemented by envoy is flexible enough that all configuration can be delivered and changed dynamically
- Envoy is sufficiently extensible; routing logic unique to the group can be implemented as filter extensions
Standard
- Envoy is a standard istio component, with strong community support and rapid development
- Alibaba Group's mesh adoption uses the istio stack, so choosing envoy as the data plane unifies the gateway with the group's service control
Performance
- Implemented in C++, with good enough performance and higher development efficiency than tengine

Envoy's drawback is that, as a standard istio component, its strength is east-west routing. Used as a north-south gateway, it needs some performance and stability optimization, but in the long run dynamic capability and standardization matter more.

Evolution plan

The group's Pilot is reused as the unified control plane component, which turns the gateway itself into part of the mesh:

The control plane needs a layer of control logic that converges permissions and exposes write access to each business product. Each product writes routing policies through the k8s declarative API; the Pilot control plane then converts them into the xDS data-plane protocol and synchronizes them to the Envoy data plane in real time. The implementation architecture of the southbound routing gateway:

The group's configuration is very large: hundreds of thousands of routing rules, thousands of applications, and hundreds of thousands of business nodes, a scale rarely seen in open source deployments. Applying the Pilot + Envoy solution to the north-south gateway therefore required optimizing and customizing the native components to solve the performance and stability problems caused by scale:

Pilot support for the SRDS protocol: solves the linear-matching performance problem caused by large-scale API configuration
Incremental configuration updates: implements and refines incremental updates in the control plane, avoiding the enlarged blast radius of full updates
Node change optimization: reduces the impact of state changes across hundreds of thousands of business nodes on control-plane and data-plane performance
Extension customization: implements custom filters for group-specific routing rules
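The linear-matching problem in the first item, and why scoping helps, can be illustrated with a small sketch. It is neither Envoy's nor Pilot's code: the `(domain, api)` scope key, the request fields, and the cluster names are assumptions made for illustration. The point is only that a request first selects a small scope by key, so it is compared against a handful of routes instead of the whole table.

```python
from typing import Callable, Dict, List, Optional, Tuple

Predicate = Callable[[dict], bool]
Route = Tuple[Predicate, str]       # (condition on the request, destination cluster)
ScopeKey = Tuple[str, str]          # (domain, api): an assumed key layout


class ScopedRoutes:
    def __init__(self) -> None:
        self._scopes: Dict[ScopeKey, List[Route]] = {}

    def add(self, domain: str, api: str, predicate: Predicate, cluster: str) -> None:
        self._scopes.setdefault((domain, api), []).append((predicate, cluster))

    def candidates(self, request: dict) -> int:
        """Number of routes that have to be examined for this request."""
        return len(self._scopes.get((request["domain"], request["api"]), []))

    def lookup(self, request: dict) -> Optional[str]:
        # One hash lookup selects the scope; only its routes are scanned in order.
        for predicate, cluster in self._scopes.get((request["domain"], request["api"]), []):
            if predicate(request):
                return cluster
        return None


scoped = ScopedRoutes()
for i in range(10_000):
    api = f"api.{i}"
    scoped.add("gw.example.com", api, lambda r: r.get("source") == "alipay", f"{api}-alipay")
    scoped.add("gw.example.com", api, lambda r: True, f"{api}-default")

request = {"domain": "gw.example.com", "api": "api.9999", "source": "alipay"}
print(scoped.lookup(request), scoped.candidates(request))
```

A flat list holding the same 20,000 routes would, in the worst case, be scanned almost to the end for this request, whereas the scoped table examines only the two routes registered for `api.9999`.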

With these customizations and optimizations, the open source stack matches the group's needs well. The special requirements of different businesses within the group can be met through flexible configuration combinations and a fast-iterating control plane that passes configuration through transparently.
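The incremental update idea mentioned among the optimizations can be sketched as a diff between the configuration a data-plane node already holds and the configuration the control plane wants it to hold. The resource names and contents below are invented, and the real control plane's wire format is not shown in this article; the sketch only shows why a delta keeps the scope of a change small.

```python
from typing import Dict, List, NamedTuple


class Delta(NamedTuple):
    upserts: Dict[str, str]   # resource name -> new content
    removals: List[str]       # resource names to delete


def compute_delta(current: Dict[str, str], desired: Dict[str, str]) -> Delta:
    """Return only what changed between two configuration snapshots."""
    upserts = {name: body for name, body in desired.items()
               if current.get(name) != body}
    removals = sorted(name for name in current if name not in desired)
    return Delta(upserts, removals)


def apply_delta(current: Dict[str, str], delta: Delta) -> Dict[str, str]:
    """Apply a delta on the receiving side without touching unchanged resources."""
    updated = dict(current)
    updated.update(delta.upserts)
    for name in delta.removals:
        updated.pop(name, None)
    return updated


current = {"route/a": "v1", "route/b": "v1", "route/c": "v1"}
desired = {"route/a": "v1", "route/b": "v2", "route/d": "v1"}

delta = compute_delta(current, desired)
print(delta)
```

Here only `route/b` and `route/d` are sent and `route/c` is removed; `route/a` is never touched, so a mistake in one business's rule cannot disturb the others through a full reload.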

North-south split

Problems to be solved

As the bridge between users and services, the gateway keeps long-lived connections alive on the user side and optimizes protocols so that users connect to the group as quickly and stably as possible; on the service side it supports flexible routing, circuit breaking, rate limiting, and load balancing. Although connection keep-alive and routing/forwarding are exposed together as gateway capabilities, the two differ considerably in iteration pace and service characteristics.

In big promotions, even when unexpected traffic peaks occur, the gateway layer, as the barrier protecting business services, stays rock solid thanks to high performance and reserved capacity headroom. Long-lived connection keep-alive and protocol optimization have long iteration cycles and very high performance; routing/forwarding and traffic scrubbing have flexible, complex policies and therefore naturally consume more resources. Splitting the two architecturally can greatly improve overall resource utilization.

Evolution ideas and plans

Protocol offloading and long-lived connection keep-alive interact with clients and can sustain very high performance, so they are split out into a separate northbound cluster. Because performance is good, a small number of machines is enough to build a high dam against the flood. Business routing policies and security scrubbing consume more resources, so they are split into a southbound cluster, which the northbound dam protects from overload. The southbound cluster can therefore reserve less headroom, which improves overall resource utilization while remaining flexible enough to configure for fast-moving business needs.
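One way to picture the "high dam" is as admission control at the northbound layer: only a bounded rate of requests is passed south, so the southbound cluster can be sized for that bound rather than for the raw peak. The article does not say which rate-limiting algorithm the gateway uses; the token bucket below and its numbers are assumptions chosen for illustration.

```python
class TokenBucket:
    """Token bucket with explicit timestamps so the behavior is deterministic."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec      # sustained rate passed to the southbound layer
        self.capacity = burst         # largest burst passed through at once
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # shed at the northbound layer


def forwarded(bucket: TokenBucket, arrivals) -> int:
    """Count how many requests from an arrival-time sequence are passed south."""
    return sum(1 for t in arrivals if bucket.allow(t))


# A spike of 1,000 simultaneous requests against a bucket sized for 100.
spike = forwarded(TokenBucket(rate_per_sec=100, burst=100), [0.0] * 1000)
print(spike)  # 100
```

However large the spike arriving at the north, the south sees at most the burst size plus the sustained rate, which is what allows it to reserve less headroom.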

Overall structure

Through three stages of evolution, the final architecture diagram is as follows:

AServer access gateway cloud native architecture

Unified control plane: service onboarding, service discovery, and rate-limiting control go through the group's unified control plane, which converges and centrally handles changes;
Northbound connection layer: based on tengine, it carries hundreds of millions of online users and traffic peaks, acts as the high dam, and raises the resource utilization of the southbound routing layer;
Southbound routing layer: based on Envoy, with Pilot converting routing policies to the xDS protocol and delivering them dynamically, it provides dynamic routing and a lightweight traffic isolation solution;
Cloud-native base: operation and maintenance is built on the group's unified infrastructure ASI, which hides gateway differences and reduces operational complexity.

Future

The AServer access gateway is evolving toward cloud native step by step. Each step started from a problem that had troubled us for a long time, but the goal was never only to fix that problem; it was to solve it with the solutions of the current era. The cloud-native transformation is far from finished, and the advantages of cloud native have not yet been fully realized. Technology upgrades ultimately serve the product. The cloud-native upgrade has given us a powerful engine; the next step is to use it to reshape the product so that developers who build on the gateway ultimately benefit.

Product integration

What is the ideal state of a gateway product? Developers use it every day without needing to care that the gateway exists; the state in which it is least noticeable may be the best one. Today's access gateway exposes some implementation details in its product form: onboarding an ingress application requires interacting with several different systems. Once the cloud-native transformation is complete, the product can become truly all-in-one, integrated and closed-loop.

Rapid elasticity

The ASI Pod upgrade is complete, and operations such as replacing faulty machines and migrating machines are now automatic, which lowers operation and maintenance costs. But the most important capability gained from moving to the cloud is rapid elasticity: scaling out quickly before the Double Eleven peak and scaling in quickly afterwards greatly reduces the machine resources reserved for promotions and saves substantial cost. Many problems remain to be solved, such as security, reliability, and real-time elasticity, and they must be worked out together with the cloud infrastructure before the cloud's advantages are truly realized.
