
1. Background

With the growth of Ctrip's overseas hotel business, the volume of data transmitted between overseas suppliers around the world and Ctrip's headquarters IDC has increased rapidly. Technically, this growing data volume places higher demands on the bandwidth and latency of cross-border private network lines. On the business side, the currently limited cross-border line resources have begun to affect processing efficiency and user experience. In terms of cost, cross-border leased lines are an expensive resource, and simply expanding them would put enormous pressure on the IT budget. So we started to consider whether the public cloud, combined with the business characteristics of hotel direct connection, could solve the twin problems of growing bandwidth pressure and supplier interface latency.

The hotel direct connection system uses automated interfaces to connect the systems of suppliers or hotel groups with Ctrip, enabling the flow and interaction of static information, dynamic information, order functions, and more. A large share of Ctrip's overseas hotel business is currently connected through this system.


This article focuses on the application architecture adjustments and cloud-native transformation made while migrating Ctrip's hotel direct connection service to Amazon EKS (Elastic Kubernetes Service), and describes in detail the optimization of DNS query latency and the reduction of cross-AZ traffic costs.

2. Pain points

Ctrip's overseas hotel direct connection integrates with thousands of overseas suppliers, and all interface access goes out through proxies (see Figure 1). A single inbound request may be split into as many as dozens of requests to a supplier, and the response messages are very large (usually tens to hundreds of KB). Although we may need only a small part of the information in a response, the current architecture forces the entire message to be transferred back for processing, which wastes a great deal of bandwidth.

Figure 1

At the same time, because suppliers are located all over the world while all requests and responses must exit through the group's proxy, the interface response latency of some suppliers is inflated by sheer physical distance, which degrades the user experience.

3. Cloud service selection and preliminary plan

One of the core goals of this project is to improve the network transmission capacity and latency of our connections with global suppliers and thereby improve user experience, which requires a cloud vendor with resources widely distributed around the world that can help Ctrip access suppliers' data as close to them as possible. After several rounds of discussion with multiple public cloud vendors, weighing each vendor's technical level, service capability, cost, and price, we concluded that Amazon Cloud Technology has clear advantages in both coverage and global network capability (see Figure 2): it provides a broad range of services across 25 Regions and 80 Availability Zones worldwide, with data centers interconnected over its backbone network, which will improve data exchange between sites in the future. Its maturity, together with the responsiveness and professionalism of its on-site service team, also stood out, so we ultimately chose Amazon Cloud Technology as the cloud partner for this deployment.

Figure 2

To integrate better with resources on the cloud, we kept the containerized deployment approach already used in our IDC. Considering the high-availability design and SLA guarantees of a managed container platform, as well as community compatibility, we finally chose the managed Amazon EKS container platform for deployment.

In terms of resources, after transforming the service we used a large number of Spot Instances as Amazon EKS worker nodes, which greatly reduced cost and improved efficiency.

At the same time, leveraging the network and platform advantages of the public cloud, the business services originally deployed in Ctrip's headquarters IDC were redeployed to overseas public cloud sites closer to the suppliers, achieving a highly reliable, low-latency direct connection between Ctrip and its overseas suppliers. We also stripped out part of the data preprocessing logic and deployed it on the overseas public cloud, so that only the processed, valuable data (rather than the original full raw data) is compressed and then transmitted back to the Ctrip headquarters data center. This relieves pressure on the cross-border leased-line network, improves business data processing efficiency, reduces cost, and optimizes the user experience.

4. Experience of moving hotel direct connection to the cloud

4.1 Cloud-native transformation of business applications

To make full use of the convenience and cost savings that cloud services bring, we first did some research and analysis: migrating the application to the public cloud as-is would still deliver business value, but at a relatively high cost. We therefore made cloud-native optimizations to the direct connection service. The main adjustments are as follows:

1) Moving the supplier access module to the cloud

Saving bandwidth requires both reducing the number of requests passing through the proxy and shrinking the size of each message. Our approach is to move the request-splitting logic onto Amazon Cloud Technology, so that for each user request only a single request/response passes through the proxy. On the cloud side we also remove useless attributes from the messages returned by suppliers, merge related nodes according to business attributes, and finally compress the result before returning it, thereby reducing the message size (see Figure 3). Judging from current runtime data, proxy bandwidth usage is now only 30% to 40% of what it was before.

Figure 3

Public cloud vendors generally charge for network traffic by volume. In the default design for inbound and outbound network access, the Amazon Cloud Technology NAT gateway is used, so network traffic costs would be relatively high. Hotel direct connection traffic, however, has a distinctive profile: request messages are usually under 1 KB, while response messages average 10 KB to 100 KB. Taking advantage of this, we adopted a self-built Squid proxy solution running on Amazon EKS (see Figure 4). This way only the small outbound request packets incur traffic charges, while the much larger inbound response packets do not, greatly reducing the network traffic charges on Amazon Cloud Technology.

Figure 4
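As an illustration of this setup, here is a minimal sketch of a self-built Squid forward proxy running on EKS. All names, the namespace, the image, and the replica count are illustrative assumptions, not Ctrip's actual configuration:

```yaml
# Sketch: a self-built Squid forward proxy on Amazon EKS (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: squid-proxy            # hypothetical name
  namespace: proxy             # hypothetical namespace
spec:
  replicas: 3                  # several replicas for availability
  selector:
    matchLabels:
      app: squid-proxy
  template:
    metadata:
      labels:
        app: squid-proxy
    spec:
      containers:
        - name: squid
          image: ubuntu/squid:latest   # any maintained Squid image works
          ports:
            - containerPort: 3128      # Squid's default proxy port
---
# Internal Service giving business pods a stable address for the proxy.
apiVersion: v1
kind: Service
metadata:
  name: squid-proxy
  namespace: proxy
spec:
  selector:
    app: squid-proxy
  ports:
    - port: 3128
      targetPort: 3128
```

Business pods then point their HTTP/HTTPS proxy settings at this Service; because the proxy runs on ordinary worker nodes rather than a NAT gateway, only the small outbound requests are billed as outgoing traffic.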

2) Reducing network latency by using Amazon Cloud Technology's global data centers to access suppliers nearby

Many overseas suppliers' services are deployed all over the world, while all of our overseas access previously exited through a single proxy, so suppliers whose servers are far away suffered high network latency purely from physical distance. Using Amazon's data centers around the world, we can deploy services close to a supplier's own infrastructure and use Amazon's backbone network to reduce the latency from each data center back to the Amazon data center nearest our proxy; finally, that Amazon Cloud Technology data center connects to Ctrip's IDC through a dedicated line (see Figure 5). For the suppliers most affected by physical distance, this end-to-end path reduces response time by up to 50%.

Figure 5

4.2 Ongoing architecture adjustments and performance and cost optimization

In the current solution, we developed a separate set of applications for the cloud. The problem is that whenever the business changes, we must update both the application deployed in Ctrip's IDC and the one on Amazon Cloud Technology, which raises system maintenance costs. The root cause is that the original application depends heavily on Ctrip's internal base components, while this migration uses a completely independent account and VPC network. Deploying an identical copy on the cloud is unrealistic: the cost would be too high, and some sensitive data cannot be stored in the cloud. In the future we will therefore move to an adapter architecture, reusing a single set of applications that can adapt to different cloud environments without depending on Ctrip's base components.

After the business went live, in order to verify the feasibility of moving larger workloads to the cloud in the future, we have also been continuously optimizing performance, cost, and high availability.

4.2.1 Using cloud elastic scaling capabilities

Take computing resource cost as an example: computing instance cost = instance running time × instance price. If the operating model of the local data center is copied to the cloud unchanged, cloud computing resources actually cost more than the local data center. We therefore need to make full use of on-demand billing on the cloud to cut the cost of idle resources. Instance running time is proportional to the number of service replicas in the Kubernetes cluster and the computing resources allocated to them, and the number of replicas is in turn proportional to traffic.

Hotel direct connection scenarios involve unpredictable bursts of traffic, for example around travel policies announced near holidays or live-streamed marketing campaigns. The elasticity of cloud native lets us use a reasonable amount of resources to absorb such bursts.

Kubernetes' HPA (Horizontal Pod Autoscaler) collects cluster load metrics in real time, decides whether the scaling conditions are met, and scales pods accordingly. Scaling pods alone is not enough: we also run the Cluster Autoscaler component in the cluster to watch for pods that cannot be scheduled due to insufficient resources and to automatically request additional nodes from the cloud platform's instance pools. Conversely, Cluster Autoscaler also detects nodes with low resource utilization, reschedules their pods onto other available nodes, and reclaims the idle nodes.

Elastic scaling case
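As a minimal sketch of the HPA side, the manifest below scales a hypothetical direct-connection deployment on average CPU utilization; the workload name, replica bounds, and threshold are assumptions for illustration:

```yaml
# Sketch: CPU-based HorizontalPodAutoscaler (illustrative names and thresholds).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: direct-connect-hpa       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: direct-connect         # hypothetical workload
  minReplicas: 4                 # floor for steady-state traffic
  maxReplicas: 50                # ceiling for burst traffic
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # add pods when average CPU exceeds 60%
```

When the added pods no longer fit on the existing nodes, Cluster Autoscaler requests new nodes from the scaling groups; when traffic subsides, the reverse happens.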

The elasticity of cloud native not only reduces resource cost but also improves the service's tolerance of infrastructure failures: if some Availability Zones are interrupted, the remaining Availability Zones add a corresponding number of nodes to keep the whole cluster available.

Kubernetes supports tuning the CPU and memory of pod containers, so the goal is to find a quota that delivers good performance at a reasonable cost. Before a service goes live on the cloud, we therefore run load tests close to the real environment and observe how changes in business traffic affect the cluster (resource utilization at periodic peaks and troughs, the service's resource bottlenecks, an appropriate buffer for traffic spikes, and so on). The quota should neither be so tight that high utilization causes stability problems such as OOM kills or frequent CPU throttling, nor so loose that low utilization wastes money (after all, even if your application uses only 1% of an instance, you still pay 100% of its fee).
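For example, quotas derived from such load tests end up as requests and limits on the pod spec; the numbers below are placeholders standing in for the values observed at business peaks and troughs:

```yaml
# Sketch: resource quotas tuned from load tests (placeholder values).
apiVersion: v1
kind: Pod
metadata:
  name: tuned-app               # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.0    # hypothetical image
      resources:
        requests:
          cpu: "500m"           # baseline from observed steady-state usage
          memory: "512Mi"
        limits:
          cpu: "1"              # headroom for spikes; too tight causes CPU throttling
          memory: "1Gi"         # exceeding this triggers an OOM kill, so keep a buffer
```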

4.2.2 Using public cloud Spot Instances

Cloud platforms rent out part of their idle computing capacity as Spot Instances at prices lower than On-Demand Instances; as the name implies, the final price is determined by bidding based on market supply and demand. In our experience, unless an instance type is particularly popular, the price is roughly 10% to 30% of the on-demand cost. Cheap Spot Instances naturally have limitations: the cloud platform may adjust the capacity of the Spot pools and reclaim some instances. Statistically, the probability of reclamation is usually below 3%, and an instance is notified two minutes before it is reclaimed. We use the node termination handler component provided by Amazon Cloud Technology to reschedule containers onto other available instances as soon as the reclamation notice arrives, reducing the impact of resource reclamation on the service. The figure below shows how one cloud divides its Spot capacity pools: even the same instance type forms an independent pool in each Availability Zone.

Figure 6

To minimize the impact of Spot interruptions, including the rebalancing of instances across multiple Availability Zones, we use Amazon ASGs (Auto Scaling Groups) to select several instance types and draw from their different capacity pools, managing each pool with its own independent ASG, which maximizes resource utilization.

Figure 7
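As a sketch of this layout, an eksctl configuration might declare one Spot node group per Availability Zone, each diversified across interchangeable instance types; the cluster name, region, zones, and instance types below are assumptions:

```yaml
# Sketch: eksctl fragment for a single-AZ, 100% Spot node group (illustrative values).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: hotel-direct                  # hypothetical cluster name
  region: ap-southeast-1              # hypothetical region
nodeGroups:
  - name: spot-1a                     # one group (one ASG) per AZ
    availabilityZones: ["ap-southeast-1a"]
    minSize: 0
    maxSize: 20
    instancesDistribution:
      instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]  # interchangeable pools
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0      # 100% Spot
      spotAllocationStrategy: capacity-optimized  # favor pools least likely to be reclaimed
```

Repeating the group per Availability Zone keeps each ASG inside a single zone, which matters again for the cross-AZ traffic optimization in Section 4.2.4.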

Ctrip hotel direct connection uses a hybrid deployment of On-Demand and Spot Instances to achieve both low cost and high availability. Critical system components (such as Cluster Autoscaler) and stateful services that would lose data if interrupted (such as Prometheus) run on On-Demand Instances, while fault-tolerant, elastic, stateless business applications run on Spot Instances. Through Kubernetes node affinity, each type of service is scheduled onto instances carrying the corresponding label (see Figure 8).

Figure 8
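As a sketch of this scheduling rule: EKS managed node groups label their nodes with eks.amazonaws.com/capacityType (SPOT or ON_DEMAND), and self-managed groups can apply an equivalent label via kubelet --node-labels. Assuming such a label, a stateless business deployment can be pinned to Spot nodes like this (workload name and image are illustrative):

```yaml
# Sketch: pin a fault-tolerant, stateless service to Spot-labeled nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-biz             # hypothetical stateless business service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: stateless-biz
  template:
    metadata:
      labels:
        app: stateless-biz
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/capacityType
                    operator: In
                    values: ["SPOT"]   # critical/stateful components target ON_DEMAND instead
      containers:
        - name: app
          image: example/app:1.0       # hypothetical image
```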

By combining the Kubernetes-native HPA and Cluster Autoscaler components with full use of Amazon ASGs and Spot capacity, costs can be reduced by 50% to 80%.

4.2.3 DNS resolution performance optimization

As the service scale grew, we found that call latency between services increased significantly, averaging 1.5 s with peaks of 2.5 s. Analysis showed that the bottleneck was the heavy load on DNS resolution. We ultimately adopted the LocalDNS approach that is mainstream in the community, caching hot domain names locally to reduce the volume of resolution requests hitting CoreDNS and improve performance:

Figure 9

As shown in Figure 9, a DaemonSet-based NodeLocal DNSCache is deployed on every node to relieve the DNS query pressure on the CoreDNS service. The LocalDNS cache listens for the DNS resolution requests of the client pods on its own node, tries to answer each request from its cache first, and on a miss queries CoreDNS for the result and caches it for subsequent local requests.
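NodeLocal DNSCache is installed from the upstream nodelocaldns manifest; stripped down to its essentials, it is a DaemonSet whose cache binds a link-local address on every node. The trimmed sketch below assumes a cluster DNS Service IP of 10.100.0.10 and omits the ServiceAccount, Corefile ConfigMap, and health probes that the full manifest carries:

```yaml
# Trimmed sketch of the NodeLocal DNSCache DaemonSet (the full upstream manifest has more).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
    spec:
      hostNetwork: true            # the cache listens on a link-local IP of each node
      dnsPolicy: Default
      containers:
        - name: node-cache
          image: registry.k8s.io/dns/k8s-dns-node-cache:1.22.20
          args:
            - "-localip"
            - "169.254.20.10,10.100.0.10"  # conventional link-local cache IP + assumed cluster DNS IP
            - "-conf"
            - "/etc/Corefile"
            - "-upstreamsvc"
            - "kube-dns-upstream"          # Service pointing at CoreDNS for cache misses
```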

As shown below, the LocalDNS solution reduced peak latency from 2.5 s to 300 ms, shortening response time by roughly 80%:

Before using LocalDNS, the average response time was 1.5 to 2.5 s.

Before optimization

After adopting the LocalDNS solution, response time dropped to 300 to 400 ms, an improvement of about 80%.

After optimization

4.2.4 Public cloud cross-availability zone traffic optimization

After aggressively optimizing resource costs with Spot Instances, we noticed that once the services were scaled up, cross-Availability-Zone traffic was very high (60%). The reason is that, to maximize availability, we deploy service units across different Availability Zones, and the resulting volume of inter-service traffic incurs cross-AZ traffic charges (see Figure 10).

Figure 10

However, for the high availability of the overall system, we did not want to deploy the service in a single Availability Zone and lower its SLA. We needed to reduce cross-AZ traffic while keeping the service highly available.

After researching different solutions, we finally chose to expose services through Amazon NLB and to use NLB's option of disabling cross-AZ load balancing, so that traffic between upstream and downstream services stays within the same Availability Zone. At the same time, the LocalDNS component mentioned above pins each upstream service's resolution of the NLB addresses in the different Availability Zones, ensuring that upstream-downstream traffic flows only within a zone. The architecture after this change is shown below:

Figure 11
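A sketch of such a Service follows: an internal NLB with cross-zone load balancing explicitly disabled, plus the Local traffic policy discussed next. The annotations assume the legacy in-tree AWS cloud provider (the AWS Load Balancer Controller uses a slightly different annotation set), and the names and ports are illustrative:

```yaml
# Sketch: internal NLB with cross-zone load balancing off (illustrative values).
apiVersion: v1
kind: Service
metadata:
  name: downstream-svc             # hypothetical downstream service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    # Keep traffic inside the caller's AZ (off is also the NLB default):
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "false"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # forward only to pods on the node that received the traffic
  selector:
    app: downstream-svc
  ports:
    - port: 80
      targetPort: 8080
```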

Because back-end traffic is normally forwarded by Kubernetes' kube-proxy, which can hop across nodes and Availability Zones, we use the externalTrafficPolicy: Local policy to pin forwarding to the pods on the node that received the traffic. The Local forwarding policy, however, brings some problems of its own (see Figure 12):

Figure 12

As shown in the figure above, under the Local forwarding policy, an unbalanced distribution of back-end pods can cause traffic black holes and uneven service load. On top of this, we therefore use the Amazon EKS scaling group strategy to distribute the underlying node resources evenly across Availability Zones, and use the Kubernetes anti-affinity strategy to spread pods across nodes in different Availability Zones as far as possible. This keeps traffic balanced to the greatest extent while preserving the high availability of the cross-AZ deployment.
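A sketch of the anti-affinity rule: it prefers (rather than requires) spreading replicas across Availability Zones, so scheduling still succeeds when one zone is temporarily short of capacity; names are illustrative:

```yaml
# Sketch: spread replicas of a service across Availability Zones (illustrative names).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: downstream-svc
spec:
  replicas: 6
  selector:
    matchLabels:
      app: downstream-svc
  template:
    metadata:
      labels:
        app: downstream-svc
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: downstream-svc
                topologyKey: topology.kubernetes.io/zone  # avoid stacking replicas in one AZ
      containers:
        - name: app
          image: example/app:1.0     # hypothetical image
```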

After optimization, cross-Availability-Zone traffic was reduced by 95.4% (see Figure 13).

Figure 13

5. Directions for subsequent optimization and improvement

Although the current architecture solves some of our business problems, there are still shortcomings to improve. In order to access suppliers nearby, we deployed and tested our cluster in an independent VPC network, so the related storage dependencies and log and monitoring components had to be deployed separately on the cloud. This undoubtedly increases the operation and maintenance burden as well as the difficulty of migrating the service to other cloud environments.

In the latest architecture design, we plan to address this in two ways. First, we intend to decouple the functions that need to run on the cloud from their dependence on persistent data storage, so that this data no longer has to be transmitted to the cloud. Second, since the company already has a mature environment in other Amazon Cloud Technology data centers, we only need to work with OPS to connect the VPC networks of the two Amazon Cloud Technology data centers; we can then reuse the company's existing logging and monitoring framework and reduce operation and maintenance costs.

6. Summary

Through the cloud-native practice of Ctrip hotel direct connection, this article has shared how to quickly build a stable and efficient production environment on the cloud, achieving rapid delivery, intelligent elasticity, and cost optimization on the cloud. With the help of the cloud-native ecosystem, infrastructure is automated and part of the operations workload is freed up, allowing more investment in business iteration, a more agile response to changing business requirements, and rapid trial and feedback through monitoring and logs. We hope this helps more teams that want to move to the cloud avoid detours and embrace the benefits of cloud native.

Author of this article

Software technical expert at Ctrip

Focuses on system architecture and is dedicated to developing highly available, high-performance business support systems.

