Author
Li Huibo, Senior Operations Engineer at Tencent, works in the Technical Operation and Quality Center of the TEG Cloud Architecture Platform Department and is responsible for video-transcoding operations for WeChat and QQ social services.
Summary
With the rise and rapid growth of short video, the demand for video transcoding keeps increasing: low-bit-rate, high-definition output; 4K, ultra-HD, HD, and SD renditions adapted to different terminals and network conditions to improve user experience; and diverse user needs such as watermarks, logos, cropping, and screenshots. The platform also has to respond quickly to varied resource demands and scale out and in flexibly. With the advance of the company's self-developed cloud project, a stable supply of diverse machine types gives us more choices, so that transcoding for WeChat Moments, video accounts, advertising, Official Accounts, and other services remains fast, stable, and resilient to sudden resource demands.
Service scenario
MTS (Media Transcoding Service) is positioned as a quasi-real-time (and offline) video-processing service for video-on-demand scenarios. It provides businesses with basic video-processing functions that complete within minutes, such as high-definition compression, screenshots, watermarking, and simple editing, and it can also integrate deeper capabilities such as customized image-quality enhancement and quality evaluation.
The business developer defines a batch-processing template. When a content producer uploads a video, a transcoding job is triggered that outputs compressed videos in multiple specifications plus a video cover, after which the content can be published and pushed. A sketch of what such a template might look like is shown below.
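As an illustration only, a batch-processing template of this kind might look roughly like the following sketch; the field names (outputs, cover, watermark, and so on) are hypothetical and are not the actual MTS API:

# Hypothetical batch-processing template (illustrative only, not the MTS API)
template: social-feed-default
outputs:
  - name: hd              # high-definition rendition for Wi-Fi playback
    resolution: 1280x720
    bitrate_kbps: 1800
    codec: h264
  - name: sd              # standard-definition rendition for weak networks
    resolution: 854x480
    bitrate_kbps: 900
    codec: h264
cover:
  type: screenshot        # grab one frame as the video cover
  offset_seconds: 1
watermark:
  image: logo.png         # business logo overlaid on the output
  position: top-right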
Background
The WeChat and short-video platforms carry a huge number of video files, and almost all of these videos are processed on the transcoding platform according to business needs in order to reduce bit rate, cut costs, and reduce stalling caused by users' networks. On the earliest transcoding platform, each business basically maintained its own independent cluster. There were many clusters, resources could not be scheduled across clusters, and single-cluster capacity was small, so businesses with large request volumes had to deploy multiple clusters.
This posed great challenges for operations: we needed a platform with a higher capacity ceiling, more flexible resource scheduling, and more convenient operation. With the advance of the company's self-developed cloud project and containerization on TKE, the transcoding platform had to be able to connect to TKE resources quickly and use the company's massive computing resources to meet the business demand for video transcoding.
Construction ideas and planning
Video access and transcoding often face bursts from multiple businesses at once. Under the premise of guaranteed service quality, we need to raise resource utilization and improve operational efficiency.
Platform capacity building: raise the single-cluster capacity ceiling, isolate businesses from each other with per-business frequency control, and schedule resources flexibly
Resource management construction: fully mine idle and fragmented resources around the TKE container platform, and use HPA to scale elastically across peak and off-peak hours to improve CPU utilization. At the same time, exploit the fact that video access services have high traffic but low CPU usage while transcoding services have low traffic but high CPU usage, and co-locate the two workloads so that physical-machine resources are fully used and pure-traffic clusters do not sit at low load
Operation system construction: adapt to business scenarios, improve the processes for bringing workers on and off the shelf and removing faulty ones, add process monitoring and alerting, and build a stability-assurance platform
Platform capacity building
Architecture upgrade
Old transcoding platform architecture:
- A master/slave structure with relatively weak disaster tolerance; Master performance was limited, and a single cluster could manage only about 8,000 workers
- At the resource-scheduling level, workers with different core counts could not be told apart properly, so some ran at high load while others ran at low load
- Frequency control could not be applied per business, so a burst from a single business affected the entire cluster
New transcoding platform architecture:
- The Master/Access modules are merged into sched; the sched scheduling module can be deployed in a distributed manner, and any node that fails is removed automatically
- Workers keep heartbeats with sched and report their own status and CPU core count, so the scheduler can assign tasks according to worker load and keep workers of different CPU specifications balanced within the same cluster
- A single cluster can manage 30,000+ workers, enough to absorb the sudden increase in business during holidays
- Businesses are merged into one large public cluster, where each business can be frequency-controlled individually, reducing interference between businesses
With this architecture upgrade, the platform is no longer limited by single-cluster capacity, daily and holiday peaks can be met quickly, and merging businesses into large clusters staggers their peaks and raises resource utilization. A hypothetical per-business frequency-control configuration is sketched below.
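As a sketch only (the key names are hypothetical and are not the platform's real configuration format), per-business frequency control in the merged public cluster could be expressed like this:

# Hypothetical per-business frequency-control configuration (illustrative only)
cluster: public-transcode-01
business_quota:
  - business_id: moments          # WeChat Moments
    max_concurrent_tasks: 8000    # cap on tasks running at the same time
    max_qps: 300                  # cap on task-submission rate
  - business_id: channels         # video accounts
    max_concurrent_tasks: 5000
    max_qps: 200
  - business_id: ads
    max_concurrent_tasks: 1000
    max_qps: 50                   # a burst from one business cannot exhaust the cluster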
Access service svpapi upgraded to DevOps 2.0
Riding the momentum of TKE, the short-video platform access service svpapi has been standardized and upgraded. The important improvements include:
- The previously separate change, monitoring, and basic resource-management systems are integrated into the unified Zhiyan portal, covering R&D testing, daily version releases, elastic resource scaling, business monitoring and alerting, and business log retrieval and analysis. TKEStack is operated only through the CI/CD pipeline, which shields direct access and provides better security
- Module configuration distinguishes usage scenarios and is hosted in Qicaishi, supporting business switches that take effect within about one minute, as well as flexible traffic scheduling and per-business flow and frequency control during holidays
- In the second half of the year, the access service plans to use Zhiyan to monitor cluster traffic and, combined with TKEStack's HPA capability, scale resources automatically according to traffic, moving toward unattended resource expansion
Resource management construction
With the platform capability in place, the next step is to classify and balance resources of different container specifications. The main difficulties are:
1. Diversity of business scenarios: there are many TKE clusters with different performance specifications, from 6 cores to 40 cores, and all of them need to be usable
2. Resource management and operation: Dockerfile image building, adapting to different TApp cluster configurations, bringing containers on and off the shelf, operation and maintenance change specifications, and so on
First, sort out the container configurations under the different TKE clusters:
# Different CPU specifications are adapted through different environment variables
- name: MTS_DOCKER_CPU_CORE_NUM
  value: "16"
- name: MTS_DOCKER_MEM_SIZE
  value: "16"
# Affinity settings for the compute-power platform: pods are only placed on nodes
# whose max_vm_steal is below 70, keeping them off nodes whose load exceeds 70%
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: max_vm_steal
          operator: Lt
          values:
          - "70"
Resource scheduling balance
Transcoding is an asynchronous task: every task takes a different amount of time to process and is stateful, so tasks cannot simply be balanced through Polaris; the scheduling strategy has to be designed on the platform side.
- Balance tasks according to the worker performance of machines with different CPU specifications
- Schedule according to worker version, supporting rapid version iteration for small businesses
For containers of different specifications, a Score plus the worker version is used to balance scheduling.
With tasks balanced across CPU specifications in this way, the utilization of C6 and C12 containers stays quite close, so large-specification container resources are not wasted. A hypothetical worker heartbeat report that such a scheduler could rely on is sketched below.
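For illustration, a worker heartbeat report used for this kind of balancing might carry fields like the following; the exact fields and names are assumptions, not the platform's real protocol:

# Hypothetical worker heartbeat payload (illustrative only)
worker_ip: 10.0.0.21
version: v2.3.1          # lets sched route a small business's tasks to a specific version
cpu_cores: 16            # matches MTS_DOCKER_CPU_CORE_NUM in the container environment
mem_gb: 16
running_tasks: 3
score: 0.81              # remaining-capacity score derived from cores and current load;
                         # sched assigns new tasks to the worker with the highest score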
Operation system construction
How are worker resources scaled out to the corresponding transcoding cluster? A resource-management layer was added here. Previously, the specified workers had to be added to or removed from a cluster through manual calls. On the platform side, a dedicated OSS system was developed: the cluster's sched/worker/task information is turned into pages for easy operation, and APIs for bringing workers on and off the shelf are encapsulated. TKE itself knows nothing about the transcoding platform; to keep the two decoupled, the operations side developed the logic that connects TKE scale-out/scale-in events to these APIs, defined the process, and calls the OSS API to keep the transcoding side in sync with TKE's expansion and contraction. The specific logic is shown in the figure:
TKE supports the Polaris service: the corresponding TApp is associated with a Polaris service name, and the Polaris service acts as metadata management for the different transcoding clusters as IPs are scaled in and out, ensuring that resources stay consistent between the TKE side and the transcoding side.
Process monitoring
The transcoding platform manages tens of thousands of workers, and the status of container processes cannot be monitored in time while they are running or during a new version release; batch scanning takes too long to detect abnormal processes quickly. We therefore integrated with the group's process-monitoring platform to build process monitoring and alerting for the transcoding containers. Abnormal workers are reported promptly through the robot's enterprise WeChat messages and phone alerts and are then removed, improving service quality.
Resource utilization optimization
The transcoding business currently serves mostly the company's own social-networking products. The holiday effect is obvious and resource demand is large; most jobs are quasi-real-time and sensitive to transcoding latency, so on top of the guaranteed speed a buffer of 30% to 50% is reserved. The business is at a low ebb in the early morning, however, so some resources are wasted during those hours. TKE supports automatic scaling based on system metrics, and its billing model is based on actual usage within a day. We therefore configure elastic scaling based on CPU-utilization metrics: shrink at off-peak times and expand automatically at peak times, reducing resource occupation and cost.
Flexible expansion and contraction
During the early-morning low peak, the number of replicas is scaled down according to the actual load on the nodes.
At peak times the workloads' actual CPU usage reaches more than 75% of their requests, improving CPU utilization while keeping the business stable. A minimal HPA sketch is shown below.
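A minimal sketch of CPU-based elastic scaling, assuming a standard Kubernetes HorizontalPodAutoscaler (autoscaling/v2beta2, current at the time) targeting the worker TApp; the workload name and the exact TKEStack HPA object are assumptions, not the platform's actual configuration:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mts-worker-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps.tkestack.io/v1
    kind: TApp
    name: mts-worker              # hypothetical worker workload
  minReplicas: 200                # floor kept during the early-morning low peak
  maxReplicas: 1000               # headroom for holiday bursts
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75    # expand before actual usage exceeds ~75% of requests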
Summary of Achievements
At present, the transcoding platform has been consolidated from many scattered small clusters into three large clusters. Operating capability and resource utilization have both improved, and we are continuing to raise our cloud-native maturity. As of May 2021:
- Internal video services such as WeChat Moments, video accounts, C2C, Official Accounts, Take a Look, advertising, and Qzone have been connected, and 100 million+ video transcoding jobs are processed every day
- Daily CPU utilization is maintained at about 70%, with automatic elastic scaling according to load, significantly improving business maturity
About us
For more cases and knowledge about cloud native, please follow the public account of the same name [Tencent Cloud Native]~
Welfare:
① Reply [Manual] in the official account backend to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices"~
② Reply [Series] in the official account backend to get the "15-series, 100+ article collection of super-practical cloud-native originals", including Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and other series.
[Tencent Cloud Native] New products, new technologies, new activities, and cloud news: scan the QR code to follow the public account of the same name and get more useful content in time!