
Jiang Biao, a senior engineer at Tencent Cloud, has focused on operating-system technologies for more than 10 years and is a veteran Linux kernel enthusiast. He is currently responsible for the R&D of Tencent Cloud's native OS and for OS/virtualization performance optimization.

Introduction

TencentOS Server (also known as Tencent Linux, abbreviated Tlinux) is a Linux operating system developed by Tencent for cloud scenarios. It provides special features and performance optimizations, giving applications in cloud server instances a higher-performance, safer, and more reliable operating environment. Tencent Linux is free to use; applications developed on CentOS (and compatible distributions) can run on Tencent Linux directly, and users continue to receive update maintenance and technical support from Tencent Cloud.

TencentOS has undergone more than 10 years of iteration and evolution inside Tencent, supporting all of Tencent's businesses. It is commercially deployed on more than 3 million nodes and has withstood the extreme test of massive, complex business workloads.

General OS architecture

[Figure: typical architecture of a general-purpose OS]

The definition of a traditional OS (borrowing from classic textbooks):

An operating system is a program that controls the execution of application programs and acts as the interface between applications and the computer (hardware).

An operating system has three goals:

  • Convenience: makes the computer easier to use
  • Efficiency: allows computer resources to be used in a more efficient manner
  • Evolvability: permits effective development, testing, and introduction of new system functions without disrupting existing services

The typical architecture of a traditional general-purpose OS (Linux) is shown above. The operating system contains the functional modules and interfaces needed to achieve the three goals above, and is broadly divided into two parts:

  • Kernel: provides the basic abstraction of the underlying hardware. Different kernel modules provide hardware management and related auxiliary functions, and serve upper-layer applications through system calls (a minimal example follows this list).
  • Basic libraries and related service components (user mode): provide the basic runtime environment in which real business logic runs
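
To make the two-part split concrete, here is a minimal user-mode C program: the application requests the same kernel service (writing to a file descriptor) once through the basic library (glibc) and once directly through the system-call layer.

```c
/* Minimal illustration of the kernel/user-mode split described above. */
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const char msg[] = "hello from user mode\n";

    /* glibc wrapper: the usual path applications take */
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);

    /* the same service requested directly via the system-call layer */
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}
```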

OS in the IaaS scenario

图片

In the IaaS scenario, the OS is mainly used to provide the running environment for cloud hosts (virtual machines). Here, the types and number of tasks running in the OS are controllable, and the scenario is much simpler than the general-purpose one. There are basically only the following kinds of tasks:

  • VM-related threads (typically Qemu + vCPU threads)
  • Management agents for the various control planes
  • A few control threads needed by the OS itself (such as per-CPU workers)

In the IaaS scenario, to make virtual-machine performance approach (or even surpass) that of the physical machine, the usual approach is subtraction: improving performance by cutting virtualization and OS overhead as close to zero as possible. Typical methods include:

  • CPU pinning (core binding) at the CPU level, as sketched after this list
  • Pre-allocation at the memory level
  • Various kernel-bypass technologies at the IO level
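
As a concrete illustration of the first item, here is a minimal C sketch of CPU pinning using the standard Linux sched_setaffinity() interface; the choice of CPU 2 is arbitrary.

```c
/* Bind the calling thread (e.g. a Qemu vCPU thread) to one physical core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 == the calling thread */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    if (pin_to_cpu(2))
        perror("sched_setaffinity");
    else
        printf("pinned to CPU 2\n");
    return 0;
}
```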

For the OS, the end result is:

The OS gets thinner and thinner, and may eventually disappear

Looking at the OS from another angle (the cloud-native perspective)

[Figure: the OS from a cloud-native perspective]

When the cloud-native wave hits and you look at the OS from another angle (the cloud-native angle), you see a different picture. The cloud-native scenario poses new challenges to the OS and injects new impetus into the further development and evolution of OS technologies.

In the cloud-native scenario, the OS provides underlying support and services for different kinds of applications (Apps, Containers, Functions, Sandboxes). Compared with the IaaS scenario, the most obvious difference is:

The boundary between application and system has moved up significantly: below the application, everything is OS

For the OS, the end result is:

The OS gets thicker and thicker (unlimited possibilities), in sharp contrast with the IaaS scenario

TencentOS For Cloud Native

Against the backdrop of the cloud-native wave sweeping the industry, and with the rapid transformation of Tencent's own business architecture, the containerization, micro-servicification, and serverless-ification of business have placed new demands on the underlying infrastructure (including the OS at its core). TencentOS has transformed rapidly in response: targeting Tencent's own cloud-native scenarios and requirements, it has undergone deep reconstruction and redesign, fully embracing cloud native and marching toward the goal of a cloud-native OS.

Overall architecture

[Figure: TencentOS overall architecture]

TencentOS currently implements the following cloud-native features at the kernel layer (each expanded on below):

  • Cloud Native Scheduler - Tencent Cloud Native Scheduler (TCNS)
  • Cloud Native Resource QoS - RUE
  • Quality Monitor
  • Cloud Native SLI
  • Cgroupfs

Cloud Native Scheduler - Tencent Cloud Native Scheduler (TCNS)

TCNS is the overall kernel-scheduler solution that TencentOS provides for cloud-native scenarios. It covers containers, secure containers, and general scenarios, and it is remarkably effective for business scenarios that require CPU isolation under multi-priority colocation (mixed deployment) together with extreme real-time performance and stability. The requirements and candidate solutions for CPU isolation in cloud-native scenarios are explained in detail in "On the CPU Isolation of Cloud Native Resource Isolation Technology (1)"; technical discussions of the kernel scheduler's real-time guarantees will follow in later articles of this OS series, so stay tuned.

TCNS mainly includes 3 modules:

  • BT Scheduler
  • VMF Scheduler
  • ECFS

BT Scheduler

BT Scheduler is a new scheduling class designed by TencentOS for CPU isolation in (container) colocation scenarios. Its position is shown in the figure below:

[Figure: position of the BT scheduling class in the scheduler hierarchy]

Its core design is a brand-new scheduling class with lower priority than the default CFS: it can run only when no higher-priority task is runnable, and it is dedicated to running offline tasks (online tasks use the CFS class).

The core benefit of this design: it realizes absolute preemption of offline business by online business, achieving a nearly perfect CPU-isolation effect in colocation scenarios.

The BT scheduling class is itself a fully functional new scheduling class, providing a full set of capabilities similar to CFS.
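
As a purely illustrative sketch of how a user-space agent might classify a task as offline, the snippet below switches a process to a hypothetical SCHED_BT policy via the standard sched_setscheduler() call. The policy constant and its numeric value are assumptions for illustration; the article does not spell out the real TencentOS interface.

```c
/* Hypothetical sketch: move a task into the BT (offline) scheduling class. */
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

#define SCHED_BT 7   /* assumed policy id, not defined in upstream Linux */

static int make_offline(pid_t pid)
{
    struct sched_param param = { .sched_priority = 0 };

    /* offline tasks run only when no CFS (online) task is runnable */
    return sched_setscheduler(pid, SCHED_BT, &param);
}

int main(void)
{
    if (make_offline(0))   /* 0 == the calling process */
        perror("sched_setscheduler");
    return 0;
}
```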

In addition, the BT scheduling class also implements the following features:

[Figure: additional features implemented by the BT scheduling class]

For other information about BT, you can click the following link to learn:

https://cloud.tencent.com/developer/article/1519558

Note: although that content is a bit dated (the new version has since gone through several rounds of iteration), it is still useful as a reference. An up-to-date introduction to BT will be published in a follow-up article, so stay tuned.

VMF Scheduler

The VMF (VM First) scheduler is a kernel scheduler specially designed by TencentOS for secure-container scenarios (and virtual-machine scenarios); it is a complete reimplementation of the kernel scheduler.

The main motivation for rewriting the scheduler is that the existing CFS scheduler, built on the principle of "complete fairness", cannot guarantee the scheduling latency (real-time behavior) of virtual-machine (secure-container) threads.

The core design of VMF includes:

  • An unfair scheduler: CPU resources are tilted toward virtual-machine processes, ensuring that virtual-machine (secure-container) threads are scheduled first
  • Scheduling based on task type rather than fine-grained priorities. In our view, CFS priorities cannot accurately describe the running characteristics of different processes. Kernel threads are a typical example: they are clearly important and their individual execution slices are very short, yet it is hard to assign them a fitting priority; neither high nor low is appropriate, and a priority value alone cannot accurately describe their runtime behavior. In VMF, we profile and model the characteristics of different processes, classify all processes, and design fine-grained scheduling strategies for each class, meeting the extreme real-time demands of cloud-native scenarios.
  • Unidirectional aggressive preemption: a VM process can preempt other tasks as soon as possible under any circumstances, but never the other way around (see the sketch after this list). This guarantees the real-time behavior of VM processes to the greatest extent without compromising the scheduler's throughput.
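
The following toy C sketch (not the actual VMF code) captures the unidirectional-preemption rule from the last item: a waking VM thread preempts anything, while a running VM thread is never preempted.

```c
/* Simplified wakeup-preemption decision, per the rule described above. */
#include <stdbool.h>
#include <stdio.h>

enum task_class { CLASS_VM, CLASS_KTHREAD, CLASS_OTHER };

struct task {
    enum task_class class;
};

/* Decide at wakeup time whether `waking` should preempt `curr`. */
static bool should_preempt(const struct task *waking,
                           const struct task *curr)
{
    if (curr->class == CLASS_VM)
        return false;          /* nothing preempts a running VM thread */
    if (waking->class == CLASS_VM)
        return true;           /* a waking VM thread preempts ASAP */
    return false;              /* others fall back to normal policy */
}

int main(void)
{
    struct task vm = { CLASS_VM }, worker = { CLASS_OTHER };

    printf("VM preempts worker: %d\n", should_preempt(&vm, &worker)); /* 1 */
    printf("worker preempts VM: %d\n", should_preempt(&worker, &vm)); /* 0 */
    return 0;
}
```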

In addition, we have designed many other features for other scenarios and needs; space does not allow a detailed description here, and we plan to cover them in a dedicated article later.

Overall, the self-developed VMF scheduler gives us several key benefits:

  • Extreme scheduling-latency metrics (excellent real-time behavior): the maximum latency in extreme cases is at the microsecond level
  • The new scheduler is much lighter than CFS, with an overall code size less than 1/3 that of CFS
  • Even in the presence of some interference, the real-time behavior of virtual-machine (secure-container) threads is guaranteed
  • VMF's class-based design can provide different levels of CPU QoS guarantee for different types of processes
  • With a completely self-developed scheduler, we can build many dazzling customizations you could hardly imagine otherwise; anyone who has tried to customize CFS knows how painful that is

A full description of VMF is likewise planned as a separate article, so stay tuned. There is also a virtualization session from the OS2ATC 2020 conference on this topic: "Tianus Hypervisor - 'Zero Loss' Tencent Cloud Lightweight Virtualization Technology"

https://www.bilibili.com/video/BV1Ky4y1m7yr/?aid=798624805&cid=277695096&page=1

<Note: Start at 1:24:00>

ECFS Scheduler

ECFS is an optimization for general scenarios (on the upstream route), based on the community's mainline CFS scheduler. The core optimization (design) points are:

  • Introduce a new task scheduling type to distinguish online from offline tasks
  • Optimize the preemption logic: guarantee online-over-offline preemption and avoid unnecessary offline-over-online preemption
  • Absolute-preemption design
  • Hyper-threading interference isolation

The specific principles will not be expanded here; please look forward to subsequent articles in this OS series.

Cloud Native Resource QoS - RUE

RUE (Resource Utilization Enhancement), known by its Chinese brand name "Ruyi" (如意), is the product in the TencentOS matrix designed for server resource QoS in cloud-native scenarios, improving resource utilization and reducing operating costs. Ruyi provides unified scheduling and allocation of the CPU, IO, network, and memory resources of cloud machines. Compared with traditional server resource-management solutions, it is better suited to cloud scenarios: it significantly improves the resource efficiency of cloud machines, reduces operating costs for customers, and provides resource value-added services for public-cloud, hybrid-cloud, and private-cloud customers. Ruyi's core technology ensures that services of different priorities do not interfere with each other, achieving an efficient unity of resource utilization, resource isolation, and resource quality of service.

Architecture

[Figure: RUE architecture]

RUE includes the following main modules:

Cgroup Priority

The concept of a globally unified Pod priority is introduced into the kernel. It runs through the processing stacks of all resources (CPU, memory, IO, and network) to provide unified priority control.
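
As a hypothetical illustration only: if the unified priority were exposed as a per-cgroup interface file, an agent could set it with a single write, and the kernel would apply it across CPU/memory/IO/network. The file name pod.priority and its semantics are invented for this sketch; RUE's real interface is not spelled out in this article.

```c
/* Hypothetical: write one unified priority value into a Pod's cgroup. */
#include <stdio.h>

static int set_pod_priority(const char *cgroup_path, int prio)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "%s/pod.priority", cgroup_path);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", prio);   /* one value governs CPU/Mem/IO/Net */
    return fclose(f);
}

int main(void)
{
    /* example: mark this Pod's cgroup as high priority (assumed semantics) */
    return set_pod_priority("/sys/fs/cgroup/pod-example", 0) ? 1 : 0;
}
```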

CPU QoS

Based on the TCNS implementation described in the previous section, absolute preemption and near-perfect isolation can be achieved at the CPU level.

Memory QoS

Through priority awareness on the allocation and reclaim paths, containers of different priorities receive different levels of memory-allocation QoS (the memory availability of low-priority containers is sacrificed to guarantee the memory QoS of high-priority containers). Several original features have been implemented here to guarantee the allocation latency of high-priority containers to the greatest extent, a key capability the upstream kernel lacks.

IO QoS

Allows users to divide containers into different priorities from the IO perspective and allocate IO resources by priority, ensuring that low-priority containers do not interfere with high-priority ones while letting low-priority containers use idle IO resources, thereby improving resource utilization. IO QoS covers three aspects: bandwidth QoS, latency QoS, and write-back QoS. It also provides a minimum-bandwidth guarantee to prevent the priority inversion that starving low-priority containers could cause.
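
For flavor, upstream cgroup v2 already exposes a basic bandwidth knob (io.max); the sketch below caps one cgroup's reads on a single device. RUE's latency QoS, write-back QoS, and minimum-bandwidth guarantee go beyond what this upstream interface offers, and the cgroup path here is just an example.

```c
/* Cap device 8:0 to 10 MB/s reads for one cgroup (cgroup v2 io.max syntax). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/fs/cgroup/offline-pod/io.max", "w");

    if (!f) {
        perror("open io.max");
        return 1;
    }
    fprintf(f, "8:0 rbps=10485760\n");   /* 10 MiB/s read-bandwidth cap */
    return fclose(f) ? 1 : 0;
}
```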

Net QoS

Allows users to allocate the server NIC's bandwidth to containers by priority, letting low-priority containers use idle bandwidth without interfering with the network bandwidth of high-priority containers. It also provides a minimum-bandwidth guarantee to prevent the priority inversion that starving low-priority containers could cause.

RUE's overall structure is relatively complex, involving many changes and optimizations to the upstream kernel. The features involved are too numerous and extensive to expand on here; follow-up articles will discuss them one by one, so stay tuned.

Overall effect

  • Introduces the concept of a globally unified Pod priority to provide unified priority control
  • Suited to colocating containers (Pods/tasks) of multiple priorities, which can greatly improve resource utilization

Quality Monitor

In colocation scenarios, to improve whole-machine resource utilization, the tendency is to overcommit as aggressively as possible. With the underlying isolation technologies (resource QoS) in place, interference isolation can be guaranteed to a certain extent, but two main challenges remain:

  • How to evaluate the QoS effect and perceive "interference"?
  • How to effectively troubleshoot the occurrence of "interference"?

On the other hand, the upper-layer scheduler (K8s) also needs more meaningful indicators from the underlying kernel (service-quality assessment plus finer-grained metrics) so it can perform refined operations, improve the overall performance of colocation clusters, and strengthen the overall competitiveness of the colocation solution.

The existing system has some scattered statistics of different dimensions, but:

  • They are not "friendly" enough: the upper-layer scheduler (K8s) cannot interpret them and needs more meaningful abstract data as a basis for "basic" scheduling decisions.
  • They are not "professional" enough: colocation scenarios need targeted monitoring data, on which K8s can base more "refined" operations.

Furthermore, the existing system lacks always-on debugging facilities that can capture the scene promptly when "interference" (or jitter in a high-priority container) occurs and support effective analysis and localization. The shortcomings of existing means:

  • They usually have to be deployed after the fact (Ftrace/Kprobe, etc.), but business jitter can be hard to reproduce, or happen only occasionally, making it difficult to capture.
  • They are expensive and hard to keep deployed in an always-on manner.

PSI, which arrived with Cgroup v2, is a very good attempt: it reflects the health of the system to a certain extent, but it is still a bit thin for evaluating the QoS effect of colocation scenarios.

TencentOS therefore designed Quality Monitor, dedicated to evaluating the quality of service (QoS) of containers in all dimensions. It provides an always-on, low-overhead, event-triggered monitoring mechanism that captures the abnormal context promptly and effectively when service quality degrades (misses its target).

Quality Monitor mainly consists of two modules:

Score

A service-quality score, specifically defined as:

Per-Prio score = 100 - (percentage of time stalled due to interference (resource preemption) from processes of other priorities)

Per-Cg score = 100 - (percentage of time stalled due to interference (resource preemption) from processes of other Cgroups)

Note: interference here includes both software and hardware interference
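
The score arithmetic itself is simple; a small sketch, assuming stall time and window length are measured in nanoseconds:

```c
/* Score = 100 minus the percentage of wall time stalled by interference. */
#include <stdio.h>

static double quality_score(double stall_ns, double total_ns)
{
    return 100.0 - 100.0 * stall_ns / total_ns;
}

int main(void)
{
    /* e.g. 50 ms of interference-induced stall in a 1 s window -> score 95 */
    printf("score = %.1f\n", quality_score(50e6, 1e9));
    return 0;
}
```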

Monitor Buffer

An always-on memory area that monitors interference and jitter, automatically recording the relevant context when key indicators (based on cloud-native SLIs) miss expectations (exceed their limits).

Overall effect:

  • Provides service-quality scores in two dimensions, priority and Pod, to evaluate container QoS
  • When service quality degrades (interference occurs), the abnormal context can be captured via the Monitor Buffer

Cloud Native SLI

Definition

SLI (Service Level Indicator) is a metric used to observe service level, such as latency, throughput, or error rate;

An SLO is a target specified in terms of SLIs;

From the cloud-native perspective, cloud-native SLIs can be understood (narrowly) as the metrics for observing a container's service level: key indicators from the container's point of view, which also form the basis for defining container SLOs.

On the other hand, the upstream kernel's existing statistics and monitoring at the Cgroup level are still relatively primitive and coarse: there is only basic statistical information, such as memory/CPU usage, and no usable container-perspective SLI data collection and abstraction.

TencentOS designed cloud-native SLI: through real-time, low-overhead collection and computation inside the kernel, it provides rich, professional, multi-dimensional SLI metrics for the upper layer (K8s), on which users can define corresponding SLOs.

Cloud native SLI includes the following modules:

CPU SLI

Collects and computes CPU-dimension SLIs, including scheduling latency, kernel-mode blocking latency, load, context-switch frequency, etc.

Memory SLI

Collects and computes memory-dimension SLIs, including memory-allocation latency, memory-allocation speed, direct-reclaim latency, memory-compaction latency, memory-reclaim latency, memory-allocation failure rate, etc.

IO SLI

Collects and computes IO-dimension SLIs, including IO latency, IO throughput, IO error rate, etc.

NET SLI

Collects and computes network-dimension SLIs, including network latency, network throughput, error rate, etc.
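
A hypothetical sketch of how a node agent might consume these SLIs, assuming they were exposed as per-cgroup files (the file name cpu.sli and its line format are invented for illustration; the article does not define the export interface):

```c
/* Hypothetical: read per-container CPU SLIs from a cgroup interface file. */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/fs/cgroup/pod-example/cpu.sli", "r");

    if (!f) {
        perror("open cpu.sli");
        return 1;
    }
    /* e.g. lines like "sched_delay_avg_ns 12345" (assumed format) */
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    return fclose(f) ? 1 : 0;
}
```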

Overall effect

  • Provides fine-grained SLI metrics at the container level
  • K8s and other modules (such as Quality Monitor) can perform refined operations based on these metrics

Cgroupfs

In the cloud-native scenario, the underlying resource-isolation technologies (Namespace, Cgroup) provide basic resource isolation from the container's perspective, but container isolation is still far from complete. In particular, some resource statistics in the /proc and /sys file systems are not fully containerized (namespaced). As a result, common commands from the physical-machine/virtual-machine world (such as free and top), when run inside a container, cannot accurately display container-perspective information and instead show system-wide global data (such as the system's total and free memory). This has long been a stubborn problem in cloud-native (container) scenarios.

The direct cause is that the relevant information has not been containerized; the root cause is that container isolation is still insufficient.

For the problem of key information in the /proc file system not being containerized, the solution recommended by the community is:

lxcfs

lxcfs is a virtual file system tailored to exactly this scenario. Underneath, it implements a user-mode file system based on FUSE, providing containerized views of the statistics in /proc, plus a few individual entries from /sys; its implementation is relatively simple and direct.

lxcfs basically solves the problem of using common basic commands (free, top, vmstat, etc.) inside containers, but it still has the following shortcomings:

  • It relies on the extra component lxcfs, which is hard to integrate deeply with containers and hard to control.
  • It is implemented in user mode on top of FUSE, so its overhead is higher than a kernel implementation and its information is less precise.
  • Its stability is relatively poor (according to user feedback): problems such as hangs (high overhead) and unavailable information occur frequently.
  • Its customizability is poor: the current implementation is based entirely on basic information visible from user mode (which is still quite limited), so deeper customization (driven by user needs) runs into a capability bottleneck (limited by the user-mode implementation).

TencentOS provides a corresponding solution in kernel mode, named Cgroupfs.

Its core design: a new virtual file system (placed in the root file system) containing the container-perspective /proc, /sys, and similar file systems we need, with a directory structure kept consistent with the global procfs and sysfs to ensure compatibility with user tools. When the relevant files are actually read, cgroupfs dynamically generates the corresponding container information view from the context of the reading process.
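
The intended effect, sketched under the assumption of a /cgroupfs mount point (the article specifies only that the layout mirrors procfs/sysfs): a reader opens what looks like /proc/meminfo and gets container-scoped numbers.

```c
/* Read a container-scoped meminfo view via cgroupfs (assumed mount point). */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/cgroupfs/proc/meminfo", "r");  /* assumed path */

    if (!f) {
        perror("open cgroupfs meminfo");
        return 1;
    }
    /* MemTotal/MemFree here reflect the container's cgroup limits,
       not the host's global numbers */
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    return fclose(f) ? 1 : 0;
}
```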

The directory structure is as follows:

[Figure: cgroupfs directory structure]

Overall effect

  • A kernel-mode, container-perspective virtual file system (/proc, /sys) that isolates global information and supports conventional commands (top, free, iotop, vmstat, etc.)
  • Oriented toward Cgroup V2, with a unified hierarchy
  • Can be deeply customized and extended as needed

TencentOS For Kubernetes

Under the cloud-native wave, Kubernetes leads the way as the de facto industry standard. As cloud native enters deep water, businesses pay more attention to the actual gains of moving to the cloud, and resource utilization and cost receive ever more scrutiny. In the native Kubernetes system, workloads of different priorities can be colocated in the same cluster via Service QoS Classes and Priority to increase resource utilization and reduce operating costs. However, this "user-mode" behavior is constrained by the design of Linux kernel cgroups, whose inherent isolation granularity is insufficient: after colocation, business can be damaged by resource contention, sometimes making the whole exercise counterproductive.

Against this background, TencentOS's cloud-native priority design solves this problem well. By mapping Kubernetes Service QoS Classes one-to-one onto TencentOS priorities, priority becomes visible at the kernel layer, where a strong low-level isolation mechanism guarantees the service quality of colocated workloads to the greatest extent. This priority mechanism runs through the entire cgroups subsystem. A toy sketch of the mapping follows.
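
A toy sketch of the one-to-one mapping (the numeric priority values and the lower-is-higher convention are assumptions for illustration; the article states only that the mapping exists):

```c
/* Map the three Kubernetes Service QoS Classes to a kernel-level priority. */
enum k8s_qos_class { GUARANTEED, BURSTABLE, BEST_EFFORT };

/* lower number == higher kernel priority (assumed convention) */
static int qos_to_priority(enum k8s_qos_class qos)
{
    switch (qos) {
    case GUARANTEED:  return 0;   /* online, latency-sensitive */
    case BURSTABLE:   return 1;
    case BEST_EFFORT: return 2;   /* offline, may be absolutely preempted */
    }
    return 2;
}

int main(void)
{
    return qos_to_priority(BEST_EFFORT) == 2 ? 0 : 1;
}
```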

[Figure: mapping between Kubernetes Service QoS Classes and TencentOS priorities]

The Tencent Cloud container team has open-sourced its TKE release. This feature will be supported in the next version; users can follow the community for updates.

More

Beyond its cloud-native focus, TencentOS Server is itself a general-purpose server OS, and over more than 10 years of kernel work many features, large and small, have been developed or customized. For more TencentOS documentation and code, follow the link below:

https://github.com/Tencent/TencentOS-kernel

Concluding remarks

TencentOS has been thinking about and exploring its own cloud-native path all along. The journey has begun, but it is far from over!

