Virtual nodes easily handle LOL S11's million-level concurrent traffic: Tengjing Sports' elastic container practice

Authors

Liu Rumeng, R&D engineer at Tengjing Sports, specializes in high concurrency, microservice governance, and DevOps, and is mainly responsible for the architecture design and infrastructure construction of the e-sports service platform.

Zhan Xuejiao, product manager for Tencent Cloud Elastic Container Service (EKS), is mainly responsible for product planning for EKS virtual nodes and container instances.

Business introduction

Since 2019, Tengjing's entire e-sports event data service has been fully hosted on the Tencent Cloud TKE container service. The Tengjing Data Open Platform currently provides mainly authorization for and querying of professional event data. As Douyu, Huya, Penguin, Palm League, WeChat Live, Weibo and other platforms have connected one after another, overall platform traffic has grown explosively.

The 2021 League of Legends Global Finals (hereinafter S11) set a new record for platform traffic, reaching million-level QPS and tens of billions of calls. For strongly periodic, high-concurrency business scenarios such as e-sports, fast and effective automatic scaling and improved resource utilization are the keys to keeping up with rapid business growth while keeping costs under control.

This article describes how the Data Open Platform used virtual node elastic scheduling plus a VPC-CNI architecture during the LOL S11 event to cope with a burst of million-level traffic.

Business characteristics

E-sports events have distinctive business characteristics that place very high demands on the automatic scalability of services.

Periodic

E-sports events are clearly periodic. Traffic peaks during matches and drops sharply the rest of the time, with a difference of hundreds of times between peak and trough. Elastic scaling is needed to release redundant resources and reduce costs during the trough.

High concurrency

During matches, the service must carry millions of QPS, which requires fast scale-out and a resource pool with sufficient inventory.

Sudden increase

When a match begins, players flood into the live broadcast rooms. The service must remain stable so that the sudden surge in traffic does not trigger a cluster avalanche.

Architecture introduction

Overall structure

The cluster uses Istio as its service mesh framework for microservice governance. Traffic enters the Istio Ingress (directly connected to Pods) through multiple CLBs, which works around the bandwidth cap of a single CLB, and is then distributed to services. This allows very fine-grained traffic management, such as canary (grayscale) releases, rate limiting, and circuit breaking.

Normal node + virtual node

With VPC-CNI enabled and the direct Pod connection mode adopted, the cluster is no longer constrained by NodePort network forwarding capacity. A small number of normal nodes handle the everyday low-load scenarios, while the elastic scaling of virtual nodes handles the extremely high load during matches.

DevOps

A Docker-based CI/CD service supports multiple environments (cloud and local) and multi-cluster orchestration to meet the business's different deployment requirements.

The evolution of the elastic scaling solution

Driven by the business characteristics above, the elastic scaling solution evolved through three stages: manual scaling => node pool => virtual node. The current solution meets business needs well.

Early stage of business: manual scaling

In the early days the load was low, and given the business characteristics, manual scale-out and scale-in could basically meet demand.

Because manual scaling needs a certain time window, some redundant resources had to be kept on hand to absorb sudden traffic increases, so resource utilization was low, only about 6%.
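To illustrate why capacity sized for the peak leads to single-digit utilization, here is a minimal sketch. The demand profile and the provisioned capacity are hypothetical numbers chosen for illustration; they are not measurements from the platform.

```python
def average_utilization(hourly_demand_cores, provisioned_cores):
    """Mean fraction of statically provisioned capacity that is actually used."""
    if provisioned_cores <= 0:
        raise ValueError("provisioned_cores must be positive")
    # Demand above the provisioned capacity cannot be served, so cap it.
    used = sum(min(d, provisioned_cores) for d in hourly_demand_cores)
    return used / (provisioned_cores * len(hourly_demand_cores))


# Hypothetical day: 2 match hours at 1000 cores, 22 quiet hours at 10 cores.
demand = [1000] * 2 + [10] * 22
# Hypothetical static capacity: the peak plus a buffer for slow manual scaling.
static_capacity = 1500

utilization = average_utilization(demand, static_capacity)
print(f"{utilization:.1%}")
```

With a short, tall peak and a long trough, most of the statically provisioned capacity sits idle for most of the day, which is the pattern the later stages try to eliminate.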

Business development: node pool

As the business grew, the periodic pattern of traffic peaks and troughs became more pronounced. With frequent scaling required, manual scaling not only carried high labor costs but also could not avoid human error.

When traffic ramps up slowly, a node pool meets business needs fairly well. However, servers still have to be configured, scale-out is slow, redundant resources remain, and resource utilization stays low. In addition, cordoning and evicting nodes during scale-in is detrimental to service stability.

Rapid business development: virtual nodes, second-level scaling, 30% cost savings

In the rapid growth stage, the gap between peak and trough traffic is huge, concurrency keeps rising, and traffic surges arrive within seconds. The scale-out speed of a node pool is no longer enough to meet business needs, and there is a risk of insufficient inventory when purchasing servers.

Virtual nodes are an elastic scheduling capability provided by TKE that offers nearly unlimited resource expansion. Pods can be scheduled directly onto cloud resources maintained by the Elastic Container Service (EKS) without adding nodes. Compared with node pools, scaling with virtual nodes skips the purchase, initialization, and return of servers, which greatly speeds up scaling, reduces the chance of failures during scale-out, and makes scaling faster, more efficient, and cheaper.

In terms of elasticity, virtual nodes can start hundreds of Pods within tens of seconds, which copes well with high-burst scenarios such as S11. In terms of cost, they avoid the leftover buffer resources that arise on normal nodes when Pod resource requests do not pack perfectly onto the nodes, which saves resource costs.

On this basis, we use business-side data to preheat resources automatically ahead of frequent traffic surges. For operations-driven business scenarios, we coordinate closely with the operations department and prepare for manual scale-out in advance.
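The article does not describe how preheating is implemented. The sketch below shows one plausible shape for it, assuming the business-side data is a match schedule with an expected peak QPS per match. The function name, the lead time, the headroom percentage, the per-Pod capacity, and the sample schedule are all hypothetical.

```python
from datetime import datetime, timedelta


def preheat_plan(matches, qps_per_pod, lead=timedelta(minutes=10), headroom_pct=120):
    """Return sorted (scale_up_time, replicas) pairs, one per scheduled match.

    matches: iterable of (start_time, expected_peak_qps) tuples.
    qps_per_pod: QPS one Pod is assumed to sustain (hypothetical figure).
    headroom_pct: safety margin, e.g. 120 means plan for 120% of the forecast.
    """
    if qps_per_pod <= 0:
        raise ValueError("qps_per_pod must be positive")
    plan = []
    for start, expected_peak_qps in matches:
        numerator = expected_peak_qps * headroom_pct
        denominator = 100 * qps_per_pod
        # Integer ceiling division avoids floating-point rounding surprises.
        replicas = (numerator + denominator - 1) // denominator
        plan.append((start - lead, replicas))
    return sorted(plan)


# Hypothetical schedule entry: one match expected to peak at 1,000,000 QPS.
matches = [(datetime(2021, 11, 6, 20, 0), 1_000_000)]
plan = preheat_plan(matches, qps_per_pod=2_000)
for when, replicas in plan:
    print(when.isoformat(), replicas)
```

A scheduler could apply each entry by raising the workload's minimum replica count at the planned time, so that the autoscaler only has to absorb the deviation from the forecast rather than the whole surge.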

Network forwarding scheme optimization

The problem

When the cluster is exposed to the public network, external traffic is by default forwarded into the cluster through the NodePort of cluster nodes. When few Pods are deployed on virtual nodes and the overall cluster load is low, this mode has no network forwarding bottleneck. But as the number of Pods on virtual nodes grows and the overall load rises, more nodes must be added just for network forwarding, which runs counter to the goals of automatic scaling, rapid scale-out, and cost reduction.

Optimization

With VPC-CNI enabled, the direct Pod connection mode is used. Containers and nodes sit on the same network plane, and each Pod is assigned a fixed IP. Traffic goes directly from the CLB to the Istio Ingress without being forwarded through NodePort, which improves forwarding efficiency. No dedicated forwarding nodes are needed, which greatly improves cluster scalability. In this mode, the upper limit of cluster scale-out is bounded by the number of available IPs in the network segments allocated to the cluster, so IP ranges must be planned in advance to avoid capping cluster growth.

Final results

By combining virtual nodes with directly connected Pods in VPC-CNI mode, the overall carrying capacity of the cluster has improved greatly, and cost control has improved considerably.
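Because each Pod consumes one IP from the cluster's subnets in this mode, the Pod ceiling can be estimated up front. The following is a rough sketch using Python's standard ipaddress module. The CIDR blocks are hypothetical, and the number of addresses reserved per subnet is an assumption that should be checked against the cloud provider's documentation.

```python
import ipaddress


def max_pod_ips(subnet_cidrs, reserved_per_subnet=3):
    """Estimate how many Pod IPs the given subnets can supply.

    reserved_per_subnet is an assumed count of addresses the platform keeps
    for itself in each subnet (for example network, gateway, broadcast).
    """
    total = 0
    for cidr in subnet_cidrs:
        net = ipaddress.ip_network(cidr)
        total += max(net.num_addresses - reserved_per_subnet, 0)
    return total


# Hypothetical container subnets: two /22 blocks of 1024 addresses each.
subnets = ["10.0.0.0/22", "10.0.4.0/22"]
capacity = max_pod_ips(subnets)
print(capacity)
```

Comparing this figure with the replica count needed at peak, plus IPs used by nodes and other workloads in the same subnets, shows whether the address plan leaves enough room before an event.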

Scaling in seconds

With virtual nodes plus the K8s HPA, the cluster can scale to carry million-level traffic within tens of seconds, easily meeting rapid scale-out and scale-in requirements. Combined with business-side data, automatic resource preheating improves the cluster's ability to withstand sudden traffic increases. Nodes no longer need to be cordoned or evicted when scaling in, which improves service stability.

Million-level carrying capacity

VPC-CNI direct Pod connection removes the NodePort forwarding bottleneck. In addition, the near-unlimited resource expansion of virtual nodes greatly raises the ceiling for horizontal scaling of the cluster. Read-heavy scenarios such as the Data Open Platform can easily scale to one million or even tens of millions of QPS.

Lower cost

The efficient scaling of virtual nodes, together with the K8s HPA automatic scaling mechanism, reduces resource preparation and idle time, avoids resource fragmentation on normal nodes, and effectively improves resource utilization. In the end, this saved the business 30% in costs.
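For readers unfamiliar with the HPA, its core calculation as documented by Kubernetes is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1 (0.1 by default). The sketch below reproduces only that formula. The replica bounds and the sample metric values are hypothetical, and the real controller additionally handles unready Pods, missing metrics, and stabilization windows.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=1000, tolerance=0.1):
    """Simplified version of the Kubernetes HPA replica calculation."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum replica counts.
    return max(min_replicas, min(desired, max_replicas))


# Hypothetical readings: 50 Pods at 90% average CPU against a 60% target.
print(hpa_desired_replicas(50, current_metric=90, target_metric=60))
```

Because the result scales with the metric ratio, a surge that multiplies load asks for a proportionally larger replica count in one step, which is only useful if the underlying capacity can be supplied as quickly as virtual nodes allow.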
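The fragmentation effect can be illustrated with a small packing exercise. This sketch places Pods on fixed-size nodes using a simple first-fit rule and reports the capacity that is paid for but left unused. The node size and Pod requests are hypothetical, and first-fit is a stand-in for the real Kubernetes scheduler, which uses more elaborate scoring.

```python
def first_fit_free_cpu(pod_cpu_requests, node_cpu):
    """Pack Pods onto fixed-size nodes with first-fit; return free CPU per node."""
    free = []  # remaining CPU on each node purchased so far
    for req in pod_cpu_requests:
        if req > node_cpu:
            raise ValueError("Pod request exceeds node size")
        for i, remaining in enumerate(free):
            if remaining >= req:
                free[i] = remaining - req
                break
        else:
            # No existing node has room, so another whole node is purchased.
            free.append(node_cpu - req)
    return free


# Hypothetical workload: ten Pods requesting 3 cores each, on 8-core nodes.
pods = [3] * 10
free = first_fit_free_cpu(pods, node_cpu=8)
purchased = 8 * len(free)
print(len(free), "nodes,", sum(free), "of", purchased, "cores unused")
```

In this example each node fits only two Pods, so a quarter of the purchased cores can never be used. Under the assumption that virtual node capacity is provisioned per Pod request, that leftover does not arise.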

Reference documents

Container Service TKE:
https://cloud.tencent.com/document/product/457/6759

Overview of virtual nodes:
https://cloud.tencent.com/document/product/457/53027

Elastic cluster:
https://cloud.tencent.com/document/product/457/39804

VPC-CNI mode introduction:
https://cloud.tencent.com/document/product/457/50355

