Peak utilization above 80%: moving video cloud offline transcoding from in-house infrastructure onto TKE

Author

Liu Zhaorui is a senior R&D engineer at Tencent Cloud, responsible for products such as Tencent Bright Eyes ultra-fast high-definition and image quality rebirth, with a focus on codec optimization, image quality enhancement, and related technologies.

Background and problems

As data tariffs fall and bandwidth grows, video has become an increasingly important way for people to obtain information. Video services such as cloud video on demand and video processing have grown rapidly as a result. The video transcoding platform is the foundation of these products, and it must meet demands for high concurrency, strict SLAs, and high compression ratios, all of which pose significant challenges.

Across the general transcoding workflow, we face the following challenges and requirements:

Different transcoding products need different core counts. For example, ultra-fast high-definition and latency-sensitive services need large-core instances to keep complex operations stable, whereas ordinary transcoding can run on small-core instances. The merge and slicing services in distributed transcoding care more about I/O performance and disk size.
The transcoding service makes heavy use of the AVX instruction set, so general-purpose CPU compute is rarely the bottleneck; what matters is the frequency at which the CPU executes AVX instructions. Because a cluster usually contains a mix of CPU models, choosing the right model is very important for transcoding, and it must be possible to select the CPU model when TKE scales out pods.
There are many short-lived, high-concurrency demands. Customers use our capabilities in different ways; for example, a customer may want to run ultra-fast high-definition compression or image quality enhancement across every video on their site. This requires obtaining a very large amount of resources at short notice and returning them quickly afterwards to save cost.
Fast model and service iteration: competition among cloud vendors is fierce, and customers frequently raise new requirements. Pods must therefore support fast, non-disruptive version updates.

Containerization & full cloud migration record

Containerization

Containerization here mainly involved sorting out the service workflows of the business and standardizing the overall release process:

Requesting instance types with different performance profiles per business

Before the migration to TKE, physical machine models were fixed: each came with a set combination of CPU cores, memory, and disk capacity. For a given business this often wasted resources, because not every resource could be fully used. Transcoding, for example, is CPU-bound and uses very little memory, yet a 48-core physical machine is typically paired with 64 GB of memory, so part of that memory sits idle.

After the migration to TKE, the CPU, memory, and disk resources each business needs can be allocated precisely for its scenario, so that every resource is put to full use.

CPU model restrictions

The transcoding service makes heavy use of the AVX instruction set. Many CPU models run at a high frequency for general-purpose computation but are limited when executing AVX instructions; such CPUs may offer many cores, yet their encoding efficiency is very low. When scaling out pods, the business therefore wants to avoid certain CPU models.

To solve this problem, TKE supports affinity configuration based on CPU model.
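
As an illustration of what such a rule can look like, the sketch below builds a standard Kubernetes `nodeAffinity` stanza that keeps pods off nodes labeled with certain CPU models. It assumes the nodes have already been labeled with their CPU model. The label key `example.com/cpu-model` and the model names are hypothetical placeholders, not TKE's actual labels, and this is not the configuration used by the author.

```python
import json

# Hypothetical node label key and CPU model names; substitute whatever labels
# the nodes in your cluster actually carry.
CPU_MODEL_LABEL = "example.com/cpu-model"
SLOW_AVX_MODELS = ["model-a", "model-b"]


def avoid_cpu_models(label_key, models):
    """Build a Kubernetes nodeAffinity that excludes nodes whose label value matches."""
    return {
        "nodeAffinity": {
            # Hard requirement: pods are only scheduled onto nodes that pass the rule.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "NotIn",
                                "values": list(models),
                            }
                        ]
                    }
                ]
            }
        }
    }


affinity = avoid_cpu_models(CPU_MODEL_LABEL, SLOW_AVX_MODELS)
print(json.dumps(affinity, indent=2))
```

In a pod template the resulting structure would sit under `spec.affinity`. Using `NotIn` expresses "avoid these models" directly; an `In` rule listing the preferred models would be the allow-list equivalent.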

Rapid scale-out and scale-in

Although transcoding is an offline service, key customers still have strict SLA requirements, so the service must be able to scale out and in quickly to follow their changing demand.

When such bursts of requests arrive, TKE can absorb them by scaling dynamically, and once the traffic burst ends it can scale in just as quickly to reduce cost.

Dynamic scaling brings its own challenge, however. Many transcoding tasks are long-running and must not be interrupted. For example, if a video of more than 100 hours has already been transcoding for more than 50 hours, a scaling operation must not interrupt the task and force the transcode to start over. TKE handles this scenario well through deletion protection.
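
The article does not describe how TKE's deletion protection works internally. As a rough sketch of the idea only, the following assumes each pod reports how many transcoding tasks it is running and treats only idle pods as eligible for scale-in. The names `PodStatus` and `pick_scale_in_candidates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodStatus:
    name: str
    running_tasks: int  # transcoding tasks currently in progress on this pod


def pick_scale_in_candidates(pods, want):
    """Return up to `want` pod names that are safe to remove (no running tasks)."""
    idle = [p.name for p in pods if p.running_tasks == 0]
    return idle[:want]


pods = [
    PodStatus("transcode-0", running_tasks=2),
    PodStatus("transcode-1", running_tasks=0),
    PodStatus("transcode-2", running_tasks=1),
    PodStatus("transcode-3", running_tasks=0),
]

# Ask to remove three pods; only the two idle ones are eligible, so the
# busy pods keep running and their long transcodes are not restarted.
print(pick_scale_in_candidates(pods, 3))  # ['transcode-1', 'transcode-3']
```

The trade-off in this sketch is that scale-in can lag behind the target while long tasks finish, which is the price of never restarting a half-completed transcode.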

Fast business updates and releases

Cloud transcoding serves multiple foundational cloud products and a large number of internal and external customers. Requirements arrive and releases ship at a fast pace, with a new version upgrade every week, so support for rapid release is a firm business requirement. At the same time, a release must not interrupt tasks that are already being processed. For this, TKE supports an in-place upgrade option: the business code in a pod is upgraded without destroying and rebuilding the running container, which allows hot updates while the service is running.

lxcfs & fixed IP for precise task scheduling

Transcoding differs from ordinary request serving in that the resource consumption of a transcoding request cannot be predicted before it starts. For example, a live game video and a classroom lecture video can differ in resource consumption by an order of magnitude. Task scheduling therefore relies on each transcoding machine actively reporting its current number of tasks and the load of each task, and the scheduler distributes new task requests according to the actual load.
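
The reporting-based scheduling described above can be sketched as follows. This is a minimal illustration rather than the production scheduler: the `WorkerReport` fields, the `pick_worker` function, and the 0.8 load ceiling are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerReport:
    address: str
    task_count: int   # number of transcoding tasks the worker is running
    cpu_load: float   # fraction of the worker's CPU in use, 0.0 to 1.0


def pick_worker(reports, max_load=0.8):
    """Pick the least-loaded worker, or None if every worker is at or above max_load."""
    eligible = [r for r in reports if r.cpu_load < max_load]
    if not eligible:
        return None
    # Prefer the lowest reported CPU load; break ties by fewer running tasks.
    return min(eligible, key=lambda r: (r.cpu_load, r.task_count))


reports = [
    WorkerReport("10.0.0.1", task_count=3, cpu_load=0.75),
    WorkerReport("10.0.0.2", task_count=1, cpu_load=0.40),
    WorkerReport("10.0.0.3", task_count=5, cpu_load=0.95),
]

chosen = pick_worker(reports)
print(chosen.address)  # 10.0.0.2
```

Because the decision depends entirely on the reported numbers, the scheduler is only as accurate as the load each worker measures for itself.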

However, in an ordinary pod, tools such as ps report the load of the host machine rather than the actual load of the pod, which leads to unbalanced scheduling. To solve this problem, TKE supports lxcfs configuration; with lxcfs, the actual load of the current pod can be obtained accurately.
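
As a small sketch of how a worker might read its own load once lxcfs is in place, the code below parses the standard Linux `/proc/loadavg` format. It assumes lxcfs has been mounted over `/proc/loadavg` inside the pod so that the file reflects the container rather than the host; whether a given TKE lxcfs setup virtualizes this particular file is an assumption, and the function names are hypothetical.

```python
from pathlib import Path


def parse_loadavg(text):
    """Parse the content of /proc/loadavg into a dict."""
    fields = text.split()
    # The fourth field has the form "<runnable>/<total>" scheduling entities.
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }


def read_pod_load(path="/proc/loadavg"):
    """Read load averages from `path`; return None where the file does not exist."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_loadavg(p.read_text())


# Sample content in the standard /proc/loadavg layout.
sample = "0.20 0.18 0.12 1/80 11206"
print(parse_loadavg(sample))
```

Without lxcfs the same file read inside a container typically shows host-wide figures, which is exactly the imbalance problem described above.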

A further problem in this scenario is that if every pod rebuild requested a new IP, it would add to the burden of IP management on the scheduling side. To address this, TKE also supports fixed IPs, IP reservation, and related capabilities.

Production results

The video cloud offline transcoding service now runs at an average CPU utilization above 50%, with peak utilization above 80%. In addition, dynamic scaling and rapid release support effectively safeguard business needs and guarantee capacity for traffic bursts.

About us

For more cases and knowledge about cloud native, follow the WeChat official account of the same name, [Tencent Cloud Native].
