Author

He Pengfei, an expert product manager at Tencent Cloud, previously served as product manager and architect for container private cloud and TKEStack. He has participated in the design of containerization transformations for Tencent's internal businesses and for external customers, and is currently responsible for the design of cloud-native hybrid cloud products.

Hu Xiaoliang, an expert engineer at Tencent Cloud, focuses on the cloud-native field. He is currently responsible for the design and development of the open-source TKEStack community and hybrid cloud projects.

Preface

Hybrid cloud is a deployment model. On the one hand, enterprises choose hybrid cloud for asset utilization, cost control, and risk reduction; on the other hand, a hybrid deployment lets them draw on the comparative advantages of different cloud service providers, so that those providers' capabilities complement one another. Containers and hybrid cloud are a natural match: containers standardize application packaging, greatly reducing the coupling between the application runtime environment and the heterogeneous infrastructure of a hybrid cloud. This makes multi-cloud/hybrid-cloud agile development and continuous delivery easier for enterprises, and makes standardized management of applications across regions possible.
The TKE container team provides a series of product capabilities for hybrid cloud scenarios. This article introduces the feature built for burst-traffic scenarios: bursting a third-party cluster to EKS.

Low-cost expansion

IDC resources are limited. When a burst of business traffic needs to be handled, the compute resources in the IDC may not be sufficient, and using public cloud resources to absorb the temporary traffic is a good choice. The common deployment architecture is to create a new cluster in the public cloud, deploy part of the workloads to the cloud, and route traffic to the different clusters through DNS rules or load-balancing policies:

In this mode, the deployment architecture of the business changes, so a full evaluation is needed before adoption:

  1. Which business workloads need to be deployed on the cloud, all of them or only part;
  2. Whether the workloads deployed on the cloud depend on the IDC environment, such as the intranet DNS, databases, shared services, etc.;
  3. How to present business logs and monitoring data from both on and off the cloud in a unified way;
  4. How business traffic is scheduled between on-cloud and off-cloud deployments;
  5. How the CD tooling adapts to multi-cluster service deployment.

Such a transformation is a worthwhile investment for businesses that need long-term multi-region access, but for burst-traffic scenarios the cost is high. For this scenario we therefore introduced the ability to conveniently consume public cloud resources from within a single cluster to handle sudden business traffic: bursting a third-party cluster to EKS. EKS is Tencent Cloud's elastic container service, which can create and destroy large numbers of Pods in seconds. You only declare the Pod resources you need, with no cluster nodes whose availability must be maintained, which makes it ideal for elastic scenarios. Simply installing the relevant plugin package in the cluster gives it the ability to burst to EKS.

Compared with adding virtual machine nodes on the cloud directly, this approach scales out and in faster. We also provide two scheduling mechanisms to meet customers' scheduling-priority requirements:

Global switch: at the cluster level, when cluster resources are insufficient, any workload that needs to create a new Pod may create replicas on Tencent Cloud EKS;

Partial switch: at the workload level, the user can specify that a single workload keeps N replicas in the local cluster, while the remaining replicas are created on Tencent Cloud EKS.

To ensure that every workload retains enough replicas in the local IDC, when the traffic burst passes and scale-in is triggered, the replicas on Tencent Cloud EKS are removed first (this requires a TKE release cluster; a detailed introduction to TKE releases will appear in a later article in this series).

In this mode, the business deployment architecture does not change, and cloud resources can be consumed elastically from within a single cluster, avoiding a series of derived problems such as business architecture transformation, CD pipeline changes, multi-cluster management, and monitoring/logging system changes. Cloud resources are used on demand and billed on demand, which greatly reduces cost. However, to guarantee workload security and stability, we require the user's IDC to be connected to a Tencent Cloud VPC over a private line, and users also need to evaluate applicability in terms of storage dependencies and latency tolerance.

In underlay network mode, EKS Pods can communicate with the local cluster's Pods and nodes (you need to add a route for the local Pod CIDR in the Tencent Cloud VPC; see the routing configuration documentation). Bursting a third-party cluster to EKS has been integrated into TKEStack; for detailed usage and examples, see the usage documentation.

Demonstration

Steps

Get the tke-resilience helm chart

 git clone https://github.com/tkestack/charts.git

Configure VPC information:

Edit charts/incubator/tke-resilience/values.yaml and fill in the following information:

cloud:
  appID: "{Tencent Cloud account APPID}"
  ownerUIN: "{Tencent Cloud account ID}"
  secretID: "{Tencent Cloud account secretID}"
  secretKey: "{Tencent Cloud account secretKey}"
  vpcID: "{ID of the VPC where EKS Pods are placed}"
  regionShort: {short name of the region where EKS Pods are placed}
  regionLong: {full name of the region where EKS Pods are placed}
  subnets:
    - id: "{ID of the subnet where EKS Pods are placed}"
      zone: "{availability zone where EKS Pods are placed}"
eklet:
  podUsedApiserver: {API server address of the current cluster}

Install the tke-resilience helm chart

 helm install tke-resilience --namespace kube-system ./charts/incubator/tke-resilience/

Confirm that the chart's Pods are running properly
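One way to check, assuming the chart's component names contain "tke-resilience" (adjust to what the chart actually deploys):

```shell
# List the chart's Pods in kube-system; all should be Running
kubectl -n kube-system get pods | grep tke-resilience
```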

Create the demo nginx application, ngx1
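A minimal Deployment for the demo could look like the following sketch (the image tag and labels are assumptions, not taken from the original demo):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ngx1
  labels:
    app: ngx1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ngx1
  template:
    metadata:
      labels:
        app: ngx1
    spec:
      containers:
        - name: nginx
          image: nginx:latest
```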

Effect demonstration:

Global scheduling

Since this feature is enabled by default, we first set AUTO_SCALE_EKS in the kube-system namespace to false.
By default, ngx1 has 1 replica.

Adjust the number of ngx1 replicas to 50.

You can see many Pods stuck in the Pending state due to insufficient resources.
After setting AUTO_SCALE_EKS in kube-system back to true and waiting a short while, observe the Pod status again: the Pods that were Pending have been scheduled to the EKS virtual node eklet-subnet-167kzflm.
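The steps above can be sketched as the following commands. The label `app=ngx1` and, in particular, the assumption that AUTO_SCALE_EKS lives in a ConfigMap named `tke-resilience` are illustrative guesses; check the chart's actual objects before running:

```shell
# Scale the demo Deployment to 50 replicas to exhaust local capacity
kubectl scale deployment ngx1 --replicas=50

# Inspect Pods stuck in Pending for lack of local resources
kubectl get pods -l app=ngx1 --field-selector=status.phase=Pending

# Re-enable the global switch (object name "tke-resilience" is assumed)
kubectl -n kube-system patch configmap tke-resilience \
  --type merge -p '{"data":{"AUTO_SCALE_EKS":"true"}}'

# Watch the Pending Pods get scheduled onto the EKS virtual node
kubectl get pods -l app=ngx1 -o wide -w
```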

Specified scheduling

We adjust the number of ngx1 replicas back to 1.

Edit the ngx1 YAML and enable the partial switch:

spec:
  template:
    metadata:
      annotations:
        # Enable the partial switch
        AUTO_SCALE_EKS: "true"
        # Number of replicas to keep in the local cluster
        LOCAL_REPLICAS: "2"
    spec:
      # Use the TKE scheduler
      schedulerName: tke-scheduler

Change the number of ngx1 replicas to 3. Even though the local cluster still has sufficient resources, once the 2 local replicas are in place, the third replica is scheduled to EKS.
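To confirm where each replica landed, list the Pods together with their nodes (the label is an assumption carried over from the demo Deployment):

```shell
# Two Pods should show a local node in the NODE column; the third
# should show the EKS virtual node (eklet-subnet-167kzflm in this demo)
kubectl get pods -l app=ngx1 -o wide
```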

Uninstall the tke-resilience plugin

helm uninstall tke-resilience -n=kube-system

In addition, TKEStack has integrated tke-resilience; users can install it from the TKEStack application market.

Application scenarios

Cloud burst

Scenarios such as e-commerce promotions and live streaming need to scale out a large number of temporary workloads in a short time. The resources are needed only briefly, so reserving a large amount of capacity year-round for such short-lived peaks inevitably wastes resources, and the demand varies from event to event and is hard to estimate accurately. With this feature you no longer need to focus on resource preparation: relying on Kubernetes' autoscaling, the business can quickly create large numbers of workloads when traffic arrives, and after the peak passes the Pods on the cloud are destroyed first, so no resources are wasted.
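As a sketch of the autoscaling side, a standard HorizontalPodAutoscaler could drive the scale-out that then bursts to EKS when local capacity runs out (the target name and thresholds below are illustrative assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ngx1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ngx1
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```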

Offline calculation

In big data and AI scenarios, computing tasks also have highly elastic compute requirements: to finish a task quickly, a large amount of compute is needed for a short period, after which the machines sit at low load, so utilization fluctuates widely and resources are wasted. Moreover, GPU resources are scarce: stockpiling large numbers of GPU devices is not only very costly but also brings resource-management problems such as raising utilization, adapting to new cards, handling old cards, and heterogeneous computing. The rich range of GPU card types on the cloud gives users more choices, and the use-and-return model ensures zero resource waste, so every cent genuinely goes to real business needs.

Future evolution

  1. Multi-region support: deploy applications to multiple regions on the cloud, with region-aware deployment and related features
  2. Cloud-edge combination: together with TKE-Edge, provide application deployment and scheduling policies for weak-network scenarios, removing the dependency on a private line
