The road to cloud-native containerization for AI application products: cutting costs by 40% and raising resource utilization by 20 percentage points

Author

Guo Yunlong is a senior engineer at Tencent Cloud. He works in the AI Application Product Center of CSIG Cloud Product Division III, where he is responsible for developing the center's back-end business framework.

Introduction

Public cloud SaaS customers need AI capabilities, and meeting that demand places three requirements on us. We must deliver and iterate on services and models quickly. We must keep the service success rate high under unstable input quality and high concurrency. We must also keep improving resource utilization. The AI Application Product Center has carried out a series of research and practice toward these goals, and this article focuses on the team's practical experience with containerization.

Background and problems

Public cloud AI SaaS products (such as face fusion) generally work as follows. Consumer (C-side) or business (B-side) customers capture images, audio, video and other media with capture devices, and submit them through cloud APIs or other access methods. The server then processes this multimedia content using ample computing power and resources together with relatively mature algorithms.

As shown in the figure above, this general process presents us with three challenges:

Unstable capture quality: Capture devices differ, so capture quality varies. Take image processing as an example: large and small images put different amounts of pressure on our services, and a concentrated burst of large images can cause concurrent failures.
Frequent short-term, high-concurrency demand: Customers build many different kinds of gameplay on top of our capabilities. Using face fusion to promote game events is very common, but such events put our services under high concurrency for short periods.
Fast model and service iteration: The AI SaaS market is highly competitive. Customers frequently raise new requirements, and algorithms inevitably have bad cases, so our services must be upgraded and iterated frequently.

Let's look at a simplified view of our architecture before containerization (as shown in the figure above). Our services were developed and deployed on physical machines. As a result, our logical services followed the "big ball of mud" pattern in both structure and infrastructure, and algorithm services were frequently co-located on the same machines.

This architecture caused frequent resource contention between services during busy hours. The contention hurt success rates and latency and left us unable to meet customer needs well. In idle hours, by contrast, resource utilization was very low and resources were easily wasted.

Two real examples illustrate this:

When we released an upgrade, we first removed a node from the LB and waited until no more traffic reached it. Then we upgraded the service on that node and tested it manually. Only after the test passed did we add the node back to the LB.
When a customer ran a campaign, they raised high-concurrency requirements. If the current physical machine/VM resource pool could not meet them, we had to send an urgent request for physical machines to the resource team. Once the resource team had allocated machines, we manually re-initialized each machine's environment and network, and then repeated the procedure from the first example. After the campaign ended, the machines sat idle, which easily wasted money.
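The manual upgrade in the first example is a per-node loop: drain, upgrade, verify, re-add. The sketch below spells that loop out so each step is visible. `LoadBalancer`, `upgrade_node` and the callbacks are hypothetical stand-ins, not the team's actual tooling. A Kubernetes rolling update combined with readiness probes automates much of this sequence.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for the LB: maps node name -> weight."""
    backends: dict = field(default_factory=dict)

    def remove(self, node):
        self.backends.pop(node, None)

    def add(self, node, weight=100):
        self.backends[node] = weight


def upgrade_node(lb, node, inflight, deploy, smoke_test,
                 poll_s=0.0, timeout_s=5.0):
    """Drain one node, upgrade it, verify it, then put it back in the LB.

    inflight(node) returns how many requests the node is still serving;
    deploy(node) and smoke_test(node) are caller-supplied callbacks.
    Returns True only if the node passed the test and was re-added.
    """
    lb.remove(node)                          # 1. stop routing new traffic to it
    deadline = time.monotonic() + timeout_s
    while inflight(node) > 0:                # 2. wait for traffic to drain
        if time.monotonic() > deadline:
            raise TimeoutError(f"{node} still busy after {timeout_s}s")
        time.sleep(poll_s)
    deploy(node)                             # 3. upgrade the service
    if not smoke_test(node):                 # 4. test before taking traffic
        return False                         # keep the node out of the LB
    lb.add(node)                             # 5. add it back to the LB
    return True
```

Multiply this loop by every node, and add the manual environment setup from the second example, and it becomes clear why scaling out and upgrading took hours to days.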

We wanted to better meet customers' need for continuous iteration and to reduce the R&D and operations burden. To do so, we urgently needed elasticity and an efficient service management and control platform. We took advantage of the company-wide push to move to the cloud and carried out several rounds of research and optimization on our architecture components. This article focuses on the containerization process.

Containerization process record

So far, the process can be divided into three steps: containerization, stability improvement, and utilization increase.

Containerization

Here, containerization extends to the business itself. Besides migrating the service carrier from physical machines to containers, we also restructured the original complex logic.

As shown in the figure below, we first slimmed down the services themselves and split them into microservices. We then used container isolation to completely separate the previously co-located services. How to split a service into microservices varies from business to business, so I won't go into the details in this article.

Stability improvement

After this first step, we quickly benefited from faster service upgrades and scale-out. At the same time, our rather simplistic understanding of containerization brought some new problems:

Services with highly variable call volumes suffered business failures because of frequent scale-out and scale-in.
Some large images sent by customers were not processed efficiently in containers with few CPU cores.
Containers could not be scaled out on demand because cluster resources were short.

We found a solution for each of these three problems.

Flexible use of probes

At first, our services had no liveness or readiness probes configured. A preStop hook added a layer of protection during scale-in, but that protection was incomplete, and service failures still occurred during scale-out.
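A preStop hook only protects scale-in if the process can actually finish the requests it is serving before it exits. Below is a minimal sketch of that drain logic. It assumes a hypothetical setup in which the preStop hook calls into the service, which then calls `begin_drain()` and `wait_idle()`. Once draining starts, new requests are refused, and shutdown waits until in-flight requests reach zero or a timeout expires. In Kubernetes, the time a preStop hook runs counts against `terminationGracePeriodSeconds`, so the drain timeout should stay below that value.

```python
import threading
import time


class Drainer:
    """Tracks in-flight requests so that shutdown can wait for them.

    Hypothetical helper: request handlers call try_start()/finish(); the
    preStop path calls begin_drain() and then wait_idle().
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = 0
        self.draining = False   # a readiness check can also report "not ready" once True

    def try_start(self):
        """Admit a request unless draining has begun; True if admitted."""
        with self._lock:
            if self.draining:
                return False
            self._inflight += 1
            return True

    def finish(self):
        """Mark one admitted request as done."""
        with self._lock:
            self._inflight -= 1

    def begin_drain(self):
        """Stop admitting new requests."""
        with self._lock:
            self.draining = True

    def wait_idle(self, timeout_s, poll_s=0.01):
        """Block until no requests are in flight; False if the timeout hits."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._inflight == 0:
                    return True
            time.sleep(poll_s)
        return False
```

This covers only scale-in. The scale-out failures described above come from new instances receiving traffic before they are ready, which is what readiness probes address.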

Probes gave us another powerful tool. Initially, we followed the example in the linked documentation and performed a simple port check to determine whether the service was running normally. Later we discovered more flexible techniques and use cases. Here are a few examples for reference; we hope they inspire more interesting practices.
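The simple port check is essentially what a Kubernetes `tcpSocket` probe does. The standard-library sketch below makes the same check explicit, which also exposes its limitation. A successful check proves only that something accepts connections on the port, not that the service can handle requests. That gap is why the examples that follow go further.

```python
import socket


def port_is_open(host, port, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds.

    Like a tcpSocket probe, this says nothing about whether the service
    behind the port is actually healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:   # refused, unreachable, or timed out
        return False
```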

Example 1: Early on, the LB agent often failed to fetch routes when it started. We used the readiness probe to preload the LB routes (as shown in the figure below). As a result, the service is marked as successfully started only after the routes have been fetched.
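Here is a minimal sketch of a readiness check that also preloads routes. The article does not show how the LB agent is queried, so `fetch_routes` is a hypothetical placeholder, and the marker file path is likewise an assumption. Each probe run retries the fetch until it succeeds once. A marker file then records the success so that later probe runs stay cheap.

```python
import os
import tempfile

# Hypothetical marker path; any location writable by the container works.
READY_MARKER = os.path.join(tempfile.gettempdir(), "lb_routes_ready")


def fetch_routes():
    """Hypothetical placeholder for the real call to the LB agent;
    should return True once the service's routes have been obtained."""
    return False


def check_ready(fetch=fetch_routes, marker=READY_MARKER):
    """Readiness check that doubles as an LB route preloader."""
    if os.path.isfile(marker):
        return True                  # routes were fetched by an earlier probe
    try:
        ok = bool(fetch())
    except Exception:                # a failed fetch just means "not ready yet"
        ok = False
    if ok:
        with open(marker, "w") as f:
            f.write("ok\n")
    return ok


def main():
    # Exit code 0 => Kubernetes marks the container Ready; non-zero => not Ready.
    # An exec probe script would end with: raise SystemExit(main())
    return 0 if check_ready() else 1
```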

Example 2: Some instances running older OS versions had weak passwords. Fixing this at the source would have meant upgrading every image built on those old OS versions, which would have been an extremely heavy workload for us. So we used probes here too: all weak passwords are removed before the container marks the service as started.
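The article does not say how the weak passwords were detected or removed. The sketch below covers only the one case that can be checked without touching password hashes: accounts whose password field in `/etc/shadow` is empty. A startup step could lock such accounts (for example with `passwd -l`). A readiness check built on `hardened()` would then keep the container out of service until no such accounts remain. Reading `/etc/shadow` requires root.

```python
def accounts_needing_lock(shadow_text):
    """Return account names whose /etc/shadow password field is empty.

    An empty second field means the account accepts an empty password.
    Locked ("!" or "*") and hashed entries are left alone.
    """
    names = []
    for line in shadow_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 2 and fields[1] == "":
            names.append(fields[0])
    return names


def hardened(shadow_path="/etc/shadow"):
    """Readiness condition: True only when no passwordless accounts remain."""
    with open(shadow_path) as f:
        return not accounts_needing_lock(f.read())
```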

Example 3: One service has unusual, frequently fluctuating memory usage. When its available memory drops below a certain value, the service occasionally fails even though its port still responds normally. In this case, we can mount a Python script through a ConfigMap and have the probe run more complex checks.
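The original script is not included in the article, so here is a minimal sketch of such a memory check. It assumes cgroup v2, where the container's limit and usage are exposed in `memory.max` and `memory.current`. The 512 MiB threshold is hypothetical. The files are read inside the container rather than from `/proc/meminfo`, because `/proc/meminfo` usually reports the host's memory, not the container's.

```python
# cgroup v2 files describing this container's own memory accounting.
MAX_FILE = "/sys/fs/cgroup/memory.max"
CURRENT_FILE = "/sys/fs/cgroup/memory.current"

# Hypothetical threshold: below this much headroom the service starts failing.
MIN_FREE_BYTES = 512 * 1024 * 1024


def free_bytes(max_text, current_text):
    """Headroom = limit - usage; None if the cgroup has no limit ("max")."""
    limit = max_text.strip()
    if limit == "max":
        return None
    return int(limit) - int(current_text.strip())


def healthy(max_text, current_text, min_free=MIN_FREE_BYTES):
    free = free_bytes(max_text, current_text)
    return free is None or free >= min_free


def main():
    with open(MAX_FILE) as f_max, open(CURRENT_FILE) as f_cur:
        ok = healthy(f_max.read(), f_cur.read())
    # Non-zero exit => probe failure. The ConfigMap-mounted script would
    # end with: raise SystemExit(main())
    return 0 if ok else 1
```

`memory.current` includes page cache, so this check errs on the side of reporting pressure. A stricter version could subtract reclaimable file memory reported in `memory.stat`.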

Screening and adapting to large images

After containerization, we found that the service success rate fluctuated whenever a particular algorithm received high-resolution images. The cause was that the algorithm consumes more compute when extracting features from such images. On physical machines, the problem had been masked by their large core counts; once the service moved to containers with fewer cores, it surfaced. To solve it, we added a large-image filter to the upper-layer logic (as shown in the figure below). Large images are routed back to the physical machine cluster, because TKEx initially offered containers with at most 8 cores (support was later expanded to 24 cores and above). Ordinary images go to the container cluster.
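Here is a minimal sketch of the large-image filter. The article does not give the real threshold, so the cut-off below is a hypothetical 4096 x 4096 pixels. Only PNG headers are parsed: a PNG stores its width and height at bytes 16 to 24, inside the IHDR chunk, so the dimensions can be read without decoding the image. A production filter would also need JPEG and other formats. For those, a library such as Pillow can read the size from the file header without loading pixel data.

```python
import struct
from enum import Enum

PNG_SIG = b"\x89PNG\r\n\x1a\n"

# Hypothetical cut-off; the article does not state the real threshold.
MAX_CONTAINER_PIXELS = 4096 * 4096


class Cluster(Enum):
    CONTAINER = "container"
    PHYSICAL = "physical"


def png_size(data):
    """Read (width, height) from a PNG header without decoding pixels."""
    if len(data) < 24 or data[:8] != PNG_SIG or data[12:16] != b"IHDR":
        raise ValueError("not a PNG")
    return struct.unpack(">II", data[16:24])


def route(width, height, max_pixels=MAX_CONTAINER_PIXELS):
    """Send oversized images to the physical cluster, the rest to containers."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return Cluster.PHYSICAL if width * height > max_pixels else Cluster.CONTAINER
```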

Multi-cluster deployment

When using TKEx, we often found that a deployed workload could not scale out to its configured maximum because overall cluster resources were insufficient. For a while this was a real headache.

The TKEx team recommended that we deploy a copy of the workload in other clusters. That way, when one cluster could not scale out, another cluster acted as a backup. After this adjustment, our scale-out success rate gradually increased.
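The backup-cluster idea can be sketched as a simple placement loop. Replicas go to the primary cluster first, and whatever does not fit spills over to the next cluster in priority order. This is not how TKEx schedules workloads; the function and the cluster names are hypothetical and only illustrate why a second cluster raises the scale-out success rate.

```python
def place_replicas(wanted, free_slots):
    """Spread `wanted` replicas over clusters in priority order.

    free_slots: list of (cluster_name, replicas_that_still_fit), primary
    first. Returns ({cluster: replicas}, number_left_unplaced).
    """
    plan = {}
    remaining = wanted
    for name, free in free_slots:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            plan[name] = take
            remaining -= take
    return plan, remaining
```

For example, if 10 replicas are needed but the primary cluster has room for only 6, the other 4 land in the backup cluster. A non-zero second return value is the case the article describes next, when the whole region is short.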

Later, resources became short across the entire region. We therefore deployed some less latency-sensitive services across multiple regions (as shown in the figure below) to further reduce the risk of insufficient cluster resources.

Multi-region deployment combined with an LB helps when resources in one region run out. However, LBs generally adjust each node's weight dynamically based on back-end response time. Pay attention to the following two points:

Disable nearest-access (proximity) routing.
Adjust LB weights according to where the upstream and downstream services are deployed. For example, suppose the upstream service is deployed in Guangzhou and the downstream service in both Nanjing and Guangzhou. In that case, the LB weights for Nanjing and Guangzhou would be 130 and 100 respectively.
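To see what 130:100 weights mean for traffic share, here is a sketch of nginx-style smooth weighted round-robin. The article does not say which algorithm its LB uses, so treating it as smooth weighted round-robin is an assumption. With static weights, Nanjing would receive 130/230 (about 56.5%) of requests and Guangzhou about 43.5%. As noted above, a real LB would then shift these shares further based on observed response times.

```python
def smooth_wrr(weights, n):
    """Nginx-style smooth weighted round-robin; returns the first n picks.

    Every round, each backend's running score grows by its weight; the
    highest score wins and is reduced by the total weight. Over one full
    cycle of sum(weights) picks, each backend is chosen exactly `weight`
    times, interleaved rather than in bursts.
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for name, w in weights.items():
            current[name] += w
        best = max(current, key=current.get)
        current[best] -= total
        picks.append(best)
    return picks


picks = smooth_wrr({"nanjing": 130, "guangzhou": 100}, 230)
print(picks.count("nanjing"), picks.count("guangzhou"))
```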

Utilization increase

After this round of stability improvements, we could use elasticity with more confidence, and utilization improved significantly. However, two problems still held utilization back. One is that some services have large models and start slowly. When traffic surges, these services cannot scale out in time, so we have to reserve resources in advance, which drags utilization down.

To address the first problem, we selected services whose traffic follows a regular pattern. For these, we used the scheduled HPA capability provided by TKE to scale out ahead of each known traffic peak.
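The idea behind scheduled scaling can be sketched as a time-based floor on the replica count. This is not TKE's API: the schedule, the baseline and the 15-minute lead time are hypothetical, and windows are assumed not to cross midnight. The lead time matters here because the services in question load large models slowly. The floor would be combined with the ordinary, reactive HPA result by taking the larger of the two.

```python
from datetime import datetime, time as dtime

# Hypothetical schedule: (window start, window end, minimum replicas).
SCHEDULE = [
    (dtime(11, 30), dtime(14, 0), 40),   # expected midday peak
    (dtime(19, 0), dtime(23, 0), 60),    # expected evening peak
]
BASELINE_MIN = 10


def _minutes(t):
    return t.hour * 60 + t.minute


def scheduled_min_replicas(now, schedule=SCHEDULE, baseline=BASELINE_MIN,
                           lead_minutes=15):
    """Replica floor at `now`.

    A window becomes active `lead_minutes` before its start so slow-loading
    model services are warm when the peak arrives.
    """
    m = _minutes(now)
    floor = baseline
    for start, end, replicas in schedule:
        if _minutes(start) - lead_minutes <= m < _minutes(end):
            floor = max(floor, replicas)
    return floor
```

For example, at 11:20 the midday window is already active (it opens at 11:15 with the lead time), so the floor is 40. At 10:00 the floor is the baseline of 10.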

Results

Metric	Before optimization	After optimization
Resource footprint	1,500+ CPU physical machines (80,000+ cores); 800+ GPU physical machines (1,600 P4 cards)	60,000 CPU cores; 1,000 T4 cards
Resource utilization	10%	30%
Cost	Baseline	-40%
Service success rate	99.9%	99.95%
Scale-out time	Small scale (<2,000 cores): 3 hours; large scale: 2 days	Small scale (<2,000 cores): 10 minutes; large scale: 6 hours
Upgrade time	Small scale (<50 instances): 6 hours; large scale: 2 days	Small scale (<50 instances): 30 minutes; large scale: 6 hours
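The table's figures translate into the following improvement ratios. The only assumption is treating "80,000+" CPU cores as exactly 80,000, which makes the computed core reduction a lower bound. The utilization gain is 20 percentage points, which is also a threefold relative increase.

```python
MIN_PER_HOUR = 60
MIN_PER_DAY = 24 * MIN_PER_HOUR

util_before_pct, util_after_pct = 10, 30
cpu_cores_before, cpu_cores_after = 80_000, 60_000   # "80,000+" taken as 80,000

util_gain_points = util_after_pct - util_before_pct      # absolute gain
util_ratio = util_after_pct / util_before_pct             # relative gain
cpu_core_reduction = 1 - cpu_cores_after / cpu_cores_before

# Before/after durations in minutes, giving how many times faster each became.
speedups = {
    "scale-out, small": (3 * MIN_PER_HOUR) / 10,
    "scale-out, large": (2 * MIN_PER_DAY) / (6 * MIN_PER_HOUR),
    "upgrade, small": (6 * MIN_PER_HOUR) / 30,
    "upgrade, large": (2 * MIN_PER_DAY) / (6 * MIN_PER_HOUR),
}
```

Small-scale scale-out became 18 times faster and small-scale upgrades 12 times faster. Large-scale operations of both kinds became 8 times faster. CPU cores fell by at least 25%.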

Our AI services have now essentially completed the move to containers, with a high success rate and fast scale-out. You are welcome to try them out.

About us

For more cloud-native case studies and knowledge, follow the [Tencent Cloud Native] official account.

Bonus: reply [manual] to the official account to receive the "Tencent Cloud Native Roadmap Manual" and "Tencent Cloud Native Best Practices".