
As the communications industry keeps iterating, new market segments are changing the business. Larger-scale ecosystem partnerships and channels have brought model innovation to the system, and also greater pressure.

At the same time, the regional environments of the international sites, together with local policies and regulations, bring new opportunities and challenges to globalization efforts.

This article discusses gateway technology in the cloud-native era. Against a background of globalization, platformization, and refinement, it looks at how to find opportunities for self-transformation in the cloud-native era, and how to shed heavy technical debt and evolve toward a high-performance, high-availability, low-cost architecture with real technological breakthroughs. It draws on years of practical experience running the gateway under peak Double 11 traffic, and we hope it helps readers.

New trends and challenges in the development of cloud communication gateways

Alibaba Cloud Communication SMS Gateway is a cloud-native gateway built on a leading communication architecture and large-scale distributed gateway processing technology. It provides stable communication services with high-availability guarantees covering disaster redundancy, recovery, and switchover, so that it can meet customers' SLA requirements and ultimately maximize resource utilization and profit.

High performance, high availability, low cost - both trends and challenges.

High performance, 100,000-level concurrency, second-level reach

Alibaba Cloud Communications started in 2017. It was incubated within Alibaba Communications in its early days and was later integrated into Alibaba Cloud. After just a few years of development, it is now one of the most popular cloud service products on Alibaba Cloud. Large-scale growth began in 2019. This year, it reached a historical peak on the day of the Tmall Double 11 event, with coverage of more than 200 countries around the world.

From a technical point of view, the cloud communication SMS gateway handles traffic distribution at 100,000 QPS on Double 11. These are not simple queries: each request requires interaction with operators or other third-party systems. Scheduling this much business traffic and this many resources means protecting the system while also keeping transmission and response latency low, with global coverage and delivery within seconds. This is a big challenge.

The requirement is to deliver high concurrency and high performance at the same time.

So where are the main bottlenecks today?

1. The current gateway architecture mainly trades scale for performance: it needs a large-scale distributed cluster deployment to provide high concurrency.

2. For network transmission, it relies on long-lived connections, such as those defined by standard communication protocols, carried over the Internet.

High availability, minute-level fault isolation and recovery

As the business grows, the number of cloud communication resource nodes will reach the tens of thousands, and keeping that many nodes stable under 100,000-level concurrency is a very hard problem. In addition, cloud communication has business scenarios with sudden traffic bursts. Marketing text messages, for example, can generate a massive number of send requests within a few minutes. This instantaneous traffic often forms a flood peak that hits the system.

From a technical point of view, the cloud communication SMS gateway uses a distributed microservice architecture, split and deployed by domain, and relies heavily on asynchronous programming and multi-threaded concurrent scheduling models, so the system is clearly complex. With such a large cluster and such a dense communication network, we must reach 100% business fault monitoring coverage and alarm accuracy, and also ensure fault isolation and rapid recovery so that the overall system stays highly available. This is another big challenge.

So what hidden risks remain in the current system?

1. The current gateway architecture is mainly a multi-center, multi-group deployment, which requires isolating and deploying services, scenarios, and customers along different dimensions.

2. For data storage resources, the stability of the database needs particular attention.

Low cost, elastic and scalable container resources

As computing scale grows exponentially, and especially with the hundreds of servers deployed during Double 11, costs rise further whenever traffic and resources increase again. Once the big promotion is over and traffic ebbs, container resources have to be scaled back in. For stateful services, however, the resource migration that scaling out and in requires is costly and far from easy.

From a technical point of view, stateful services are tied to resources because SMS uses a long-lived, asynchronous, full-duplex communication mode. The essential conflict is resource utilization under tidal traffic. For this kind of stateful service, besides matching traffic to resources optimally, cutting the cost and waste of idle resources, and improving CPU utilization, we also need to make container resources stateless so that they can scale elastically and further reduce operation and maintenance costs. This is another big challenge.

So what are the current technical difficulties?

1. The current gateway deployment is mainly in the DevOps mode, and it is necessary to apply for resources in advance before deploying the image container.

2. In the management of resource connections, it is necessary to pre-allocate resource connections to realize the binding of resource connections and container IPs.

Breaking the game with leverage: cloud-native edge gateway architecture

Cloud native is a set of technical methods born of the cloud and designed to respond to the cloud. Cost reduction and efficiency gains are the greatest value of cloud-native applications.

Next, let’s talk about the technical advantages established by the SMS gateway in combination with the technical characteristics of cloud native.

Easy deployment, wide coverage, minute-level service deployment

Cloud native is a set of technical methods born of the cloud, and Alibaba Cloud has center and edge nodes all over the world. The question, then, is how the SMS gateway can use cloud-native technologies such as containerization, service mesh, and microservices, combined with the edge cloud, to create a lightweight edge gateway and cloud network deployment architecture. The aim is easy deployment and wide coverage for global access and distribution, which in turn improves gateway performance and reduces operation and maintenance costs.

To achieve the goal of heterogeneous deployment, two things matter: the system architecture must support cloud-native easy deployment, and the DevOps platform must support easy deployment of application environments.

First, at the system architecture level, the SMS gateway is split and decoupled into a two-tier architecture to support the business. The resulting lightweight gateway is easier to deploy in each region, giving customers nearby access and a guaranteed low-latency SMS sending experience. As shown below:

The two-tier architecture of the SMS gateway gives the business a variety of support options. Because it is lightweight, the gateway can be built and deployed quickly on containers, whether on a public cloud, a hybrid cloud, or a proprietary cloud. The two tiers can be deployed independently or combined, which helps create diverse deployment architectures.

Secondly, at the DevOps platform level, edge gateways have to be deployable to multi-cloud environments, including public clouds, hybrid clouds, and proprietary clouds, so the middleware and resources they depend on should be as lightweight and open source as possible. For this reason, we designed the edge gateway to be built entirely on a cloud-native base, which makes it more adaptable in deployment.

For the DevOps platform, we chose two options: a fully managed cluster and an edge fully managed cluster. Both platforms use virtualization to encapsulate the underlying resource pool into containers, and both can be combined with image services for rapid service deployment. In particular, the edge fully managed platform can also manage external resource pools, so for hybrid cloud deployments we can deploy to a customer's containerized services using images alone.

To sum up, the edge gateway is based on Alibaba Cloud edge nodes, which helps push services down to within 10 kilometers of users and reduces latency and bandwidth costs. It lowers technology costs and enables fast global multi-node deployment while maintaining stability.

Easy scheduling, low latency, millisecond response

Cloud native is also a set of technical methods that respond to the cloud. As mentioned above, the SMS gateway is a multi-group deployment solution: gateways are deployed independently in areas close to users so that they can connect to suppliers with low latency and high quality. This raises a question: how are edge nodes scheduled at such a large scale, and how complex is that scheduling?

For complex traffic scheduling scenarios, we reduce the complexity of the business architecture. An architecture upgrade decouples business logic from traffic control logic and turns complex scheduling into a unified traffic scheduling model that is observable and controllable, in order to achieve easy scheduling and low-latency service discovery.

To achieve the goal of easy scheduling, two key points also need to be solved: the system architecture must support cloud-native easy scheduling, and the communication network architecture must support easy scheduling of the application environment.

First, at the system architecture level, we implemented a routing and addressing scheduling algorithm based on a three-level strategy, which establishes data link communication between nodes, between nodes and resources, and between resources and connections. On top of that, a dynamic awareness algorithm based on multi-factor, multi-weight cooperative routing control keeps routing and addressing stable and reliable in abnormal situations.
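The article does not publish the routing algorithm, so the following is only a minimal sketch of what multi-factor, multi-weight route selection could look like. The factors (success rate, latency, load), the weight values, the `Channel` type, and the `pick_channel` function are all assumptions made for illustration, not the gateway's actual design.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical factor weights; a real system would tune or learn these.
WEIGHTS = {"success_rate": 0.5, "latency": 0.3, "load": 0.2}
MAX_LATENCY_MS = 1000.0  # assumed latency at which the latency score drops to 0


@dataclass
class Channel:
    name: str
    success_rate: float  # recent delivery success ratio, 0..1
    latency_ms: float    # recent average response latency
    load: float          # current utilization, 0..1
    healthy: bool = True # set to False when the channel is detected as abnormal


def score(ch: Channel) -> float:
    """Combine the factors into one weighted score (higher is better)."""
    latency_score = max(0.0, 1.0 - ch.latency_ms / MAX_LATENCY_MS)
    return (WEIGHTS["success_rate"] * ch.success_rate
            + WEIGHTS["latency"] * latency_score
            + WEIGHTS["load"] * (1.0 - ch.load))


def pick_channel(channels: List[Channel]) -> Optional[Channel]:
    """Return the best healthy channel, or None if every channel is down."""
    candidates = [c for c in channels if c.healthy]
    if not candidates:
        return None
    return max(candidates, key=score)
```

Marking a channel as unhealthy removes it from the candidate set, so traffic shifts to the next-best route without the caller changing anything. That is the behavior the "stable and reliable in abnormal situations" goal calls for.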

In addition, SMS scenarios such as verification codes, notifications, and marketing have very demanding timeliness requirements. Technically, we implemented an adaptive elastic flow control algorithm based on scenario priority. Queues are isolated from each other, but the rate limit of each queue is still affected by how the other queues are running. A higher-priority queue gets a larger rate limit and a lower-priority queue a smaller one, and these limits are adjusted dynamically as the system runs, so the algorithm adapts quickly. Whichever algorithm is used, the main goal is to make traffic smoother and delivery more immediate.
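One way to picture priority-based adaptive flow control is a weighted fair-share calculation that is re-run periodically as queue backlogs change. The sketch below is an illustration under stated assumptions, not the gateway's algorithm: the function name `allocate_rates`, the priority weights, and the use of backlog as the demand signal are all hypothetical.

```python
from typing import Dict, Tuple


def allocate_rates(total_qps: float,
                   queues: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Split total_qps across queues by priority weight.

    queues maps a queue name to (priority_weight, backlog), where backlog is
    the rate the queue could actually use right now. A queue never receives
    more than its backlog; capacity it cannot use is redistributed to the
    remaining queues in proportion to their weights.
    """
    rates = {name: 0.0 for name in queues}
    active = {n: (w, b) for n, (w, b) in queues.items() if b > 0}
    remaining = float(total_qps)

    while active and remaining > 1e-9:
        weight_sum = sum(w for w, _ in active.values())
        # Queues whose weighted share already covers their whole backlog.
        saturated = {n: b for n, (w, b) in active.items()
                     if remaining * w / weight_sum >= b}
        if not saturated:
            # Nobody is saturated: hand out the rest strictly by weight.
            for n, (w, _) in active.items():
                rates[n] += remaining * w / weight_sum
            break
        for n, b in saturated.items():
            rates[n] += b
            remaining -= b
            del active[n]
    return rates
```

For example, with 1000 QPS of capacity and weights 5, 3, and 1 for verification, notification, and marketing queues, a verification backlog of only 100 leaves the other 900 QPS to be shared 3:1 by the other two queues. Re-running the function as backlogs change gives the dynamic adjustment the paragraph describes.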

Secondly, at the communication network architecture level, we mainly use middleware products available on the cloud, such as Nacos, Redis, and MNS. In VPC networking we also make heavy use of EIP, NAT, SLB, VPN, IPSec, and other networking and acceleration technologies to keep communication latency low.

Cloud services are usually deployed in independent VPCs, and traffic in and out of a VPC passes through SLB or NAT. Traffic from public network users accessing resources on the cloud is forwarded through SLB, and traffic from cloud resources accessing the public network is forwarded through NAT. For cross-region access between cloud networks, traffic first goes through the cross-region gateway to the gateway inside the region and then on to the gateway outside the region, which keeps network transmission performance guaranteed.

Easy operation and maintenance, cost saving, elastic scaling in seconds

As mentioned above, the SMS gateway has a huge cluster and nodes around the world. Besides scheduling, this raises two more questions: can costs be kept under control with edge nodes at this scale, and how do we operate and scale elastically under tidal traffic?

In essence, the core difficulty of operating the SMS gateway is that connections are stateful, which causes all sorts of complex problems. The biggest one is that stateful containers cannot be scaled out or in elastically, so this is also where cost savings have to come from. To achieve the goal of easy operation and maintenance, two key points need to be solved: the system architecture must support cloud-native easy operation and maintenance, and observability technology must support digital, intelligent operation and maintenance.

First, at the system architecture level, we rebuilt the traditional communication gateway for the cloud using a distributed, loosely coupled gateway architecture that decouples the business processing module from the communication protocol session module. The business processing layer no longer needs to care about connection state and can scale out and in dynamically with traffic, while self-developed data connectors provide route discovery and scheduling capabilities.

To keep deployment and design lightweight, we split the cloud network architecture as a whole into independent domain modules, each of which solves the problems in its own domain. For business service areas that need to collaborate, we communicate between services through service integration and extension rather than developing the functionality on the local gateway. This keeps the local gateway lightweight and single-purpose, and therefore easier to operate and maintain.
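The article does not describe the data connector's interface, so the sketch below only illustrates the decoupling idea. Session nodes that hold long-lived carrier connections register themselves, and stateless business workers ask a registry which node to forward a message to. `ConnectorRegistry`, its methods, and the round-robin choice are all hypothetical names and design choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class ConnectorRegistry:
    """Toy route-discovery table: carrier id -> session nodes holding connections."""

    def __init__(self) -> None:
        self._nodes: Dict[str, List[str]] = defaultdict(list)
        self._cursor: Dict[str, int] = {}

    def register(self, carrier_id: str, node: str) -> None:
        """Called by a session node once its carrier connection is established."""
        if node not in self._nodes[carrier_id]:
            self._nodes[carrier_id].append(node)

    def deregister(self, carrier_id: str, node: str) -> None:
        """Called when a session node loses its connection or shuts down."""
        nodes = self._nodes.get(carrier_id, [])
        if node in nodes:
            nodes.remove(node)

    def resolve(self, carrier_id: str) -> Optional[str]:
        """Pick a session node for this carrier in round-robin order."""
        nodes = self._nodes.get(carrier_id)
        if not nodes:
            return None
        i = self._cursor.get(carrier_id, 0) % len(nodes)
        self._cursor[carrier_id] = i + 1
        return nodes[i]
```

Because workers only call `resolve` and hold no connection of their own, they can be added or removed freely as traffic rises and falls. Only the session nodes remain tied to connection state.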

Secondly, at the level of digital and intelligent operation and maintenance, the first questions to ask are: why dig deep into observability technology? What should observable data cover? Is the data isolated or aggregated? What does the network structure look like?

"Observability" is a broad concept that includes application performance metrics, distributed tracing, container monitoring, system monitoring, log monitoring, and more. Each of these is a separate point, but for a business application system, what we should build is an all-round observability system.

Specifically, in terms of layers, the top layer is "seeing": viewing metrics and alarms. The next layer is "analysis": tracing the call chain, analyzing RT, and locating where an exception occurred. The bottom layer is "reaction": for relatively well-defined scenarios, the system automates orchestration-based root cause analysis and orchestration-based fault location.

To sum up, observability should be multi-faceted. What we actually solve is how to aggregate and analyze this observable data and feed the results back to the business gateway, so as to achieve automated, AIOps-style operation and maintenance.

Throughout its evolution, the gateway has stayed committed to scale, edge deployment, and digital intelligence:

The cloud network architecture of cloud gateway plus edge gateway provides a global multi-site, multi-node network topology;

The edge architecture continues to evolve so that large-scale gateways can be deployed quickly and conveniently, while the cloud-network communication mode is rebuilt to give cloud gateways elastic horizontal scalability;

Finally, observability technology monitors gateway nodes globally, instruments them with metrics and traces, and builds orchestration-based root cause analysis capability.

I hope the above gives you a new understanding of cloud communication. If you are interested, you are welcome to comment and exchange ideas.

"Video Cloud Technology" is a public account for audio and video technology worth following. Every week it pushes practical technical articles from the front line of Alibaba Cloud, and it is a place to talk with first-class engineers in the audio and video field. Reply [Technology] in the official account to join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and get the latest industry information.

CloudImagine