Introduction: This article is the first in our colocation practice series. It introduces why resource isolation technology matters for colocation, the challenges of putting it into production, and our approach to addressing them.
Authors: Qian Jun, Nan Yi
Review & proofreading: Xiyang, Haizhu
As the name implies, colocation means deploying different types of services on the same machine and letting them share its CPU, memory, IO, and other resources, in order to maximize resource utilization and thereby reduce procurement and operating costs.
Alibaba began exploring colocation in 2014. After seven years of honing, this technology for dramatically raising resource utilization has officially entered commercial use.
Through full-link isolation of compute, memory, storage, and network resources, millisecond-level adaptive scheduling, and intelligent decision-making and operations, Alibaba runs full colocation under Double Eleven traffic and supports millions of colocated Pods internally. CPU and GPU resources, ordinary containers and secure containers, and the various heterogeneous infrastructures in localized environments can all be colocated efficiently. This has cut the cost of Alibaba's core e-commerce production clusters by more than 50%, while keeping interference with core business below 5%.
To tackle the problem of improving resource efficiency in the cloud-native era, we are launching a series of articles based on our practice of colocation at scale, sharing the details of colocation technology and the practical problems encountered in large-scale production. As the opening article of the series, this one introduces why resource isolation matters for colocation, the challenges of landing it in production, and our approach.
The relationship between colocation and resource isolation: resource isolation is the cornerstone of colocation
Colocation usually mixes tasks of different priorities, such as high-priority real-time tasks (latency-sensitive, low resource consumption; called online) and low-priority batch tasks (latency-insensitive, high resource consumption; called offline). When high-priority services need resources, low-priority tasks must give them back immediately, and running low-priority tasks must not cause significant interference to high-priority tasks.
To meet these requirements, kernel resource isolation on each individual machine is the most critical technology. Alibaba Cloud has worked on kernel resource isolation for many years and accumulated industry-leading experience. The technologies involved fall mainly under the kernel's three major subsystems: scheduling, memory, and IO. Each subsystem has been deeply reworked and optimized for cloud-native colocation scenarios, with features including CPU Group Identity, SMT expeller, and cgroup-based asynchronous memory reclaim. These key technologies let customers adopt the best configuration for their workload characteristics, effectively raising resource utilization and lowering resource costs. They fit container cloud colocation scenarios well and are also key technologies that large-scale colocation solutions strongly depend on.
The following figure shows where resource isolation capabilities sit in the overall colocation solution:
Why we need resource isolation, and what obstacles it runs into
Suppose we have a server running both high-priority online services and offline tasks. Online tasks have a clear response time (RT) requirement and demand the lowest possible RT, so they are called latency-sensitive (LS) workloads; offline tasks will consume as many resources as they can get, so they are called Best Effort (BE) workloads. If we do not intervene, offline tasks will likely occupy various resources frequently and for long stretches, so that online tasks get no chance to be scheduled, are scheduled too late, or cannot obtain bandwidth, and online RT rises sharply. In this scenario we therefore need mechanisms that isolate online and offline containers in their resource usage, ensuring that online high-priority containers obtain resources in time and their QoS is guaranteed, while overall resource utilization still improves.
Let's look at what can happen when online and offline workloads run together:
- First, the CPU is the most likely point of contention with offline tasks, since CPU scheduling is at the core: online and offline tasks may be scheduled onto the same core and compete for execution time;
- Tasks may also run on a pair of sibling hyper-threads (HTs), competing for instruction issue bandwidth and other pipeline resources;
- Next, the various levels of CPU cache will inevitably be consumed, and cache capacity is limited, so the question of how to divide cache resources arises;
- Even if we solve cache partitioning at every level perfectly, problems remain: memory sits one level below the CPU cache and is contended in just the same way, so memory must be divided between online and offline tasks much like the cache;
- In addition, when the CPU's Last Level Cache (LLC) misses, memory bandwidth consumption (what we call runtime capacity, as distinct from the static capacity of memory size) grows, so resource consumption in memory and in the CPU caches affect each other;
- Even with CPU and memory isolation done well on the machine, online high-priority services and offline tasks both depend heavily on the network at runtime, so understandably the network may also need isolation;
- Finally, some workloads may preempt IO, so we also need an effective IO isolation strategy.
The above is a very simple walk through the resource isolation chain. As you can see, interference or contention can arise at every link.
Introduction to the isolation technologies: each one plays to its strengths
Kernel resource isolation mainly involves the kernel's scheduling, memory, and IO subsystems. These technologies build on Linux cgroup v1 to provide basic resource isolation and QoS guarantees; they suit container cloud scenarios and are key technologies that large-scale colocation solutions strongly depend on.
Beyond basic CPU, memory, and IO isolation, we have also built supporting tools such as a resource isolation view, SLI (Service Level Indicator) resource monitoring metrics, and resource contention analysis, forming a complete resource isolation and colocation solution that covers monitoring, alerting, operations, and diagnostics, as shown in the figure below:
Scheduler optimization for elastic container scenarios
How to guarantee the quality of compute service while raising compute resource utilization as much as possible is a classic problem in container scheduling. As CPU utilization keeps rising, the CPU bandwidth controller's lack of elasticity becomes more and more serious: when a container needs CPU for a short burst, the bandwidth controller throttles the container's CPU usage, which hurts the workload's latency and throughput.
CPU Burst is an elastic container bandwidth control technology originally proposed by the Alibaba Cloud operating system team and contributed to the Linux community and the OpenAnolis community; it was merged into Linux 5.14 and into Anolis ANCK 4.19. As long as a container's average CPU utilization stays below a set limit, CPU Burst allows short bursts of CPU usage above the quota, improving service quality and speeding up containerized workloads.
After enabling CPU Burst in container scenarios, the service quality of the tested containers improves significantly. As the figure below shows, in real tests the RT long-tail problem almost disappears once this feature is turned on.
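For readers who want to try this, here is a minimal sketch of enabling CPU Burst through the cgroup v1 CPU controller, assuming a kernel with the feature (upstream Linux 5.14+ or ANCK 4.19); the cgroup name is a placeholder:

```python
# Minimal sketch: enabling CPU Burst on a cgroup v1 CPU controller.
# Assumes /sys/fs/cgroup/cpu is the mounted v1 CPU controller and the
# cgroup name "online-app" is an illustrative placeholder.
from pathlib import Path

cg = Path("/sys/fs/cgroup/cpu/online-app")
cg.mkdir(parents=True, exist_ok=True)

# Quota: 200% of one CPU over a 100 ms period.
(cg / "cpu.cfs_period_us").write_text("100000")
(cg / "cpu.cfs_quota_us").write_text("200000")

# Burst: allow up to 100 ms of accumulated unused quota to be spent
# in short spikes above the quota without being throttled.
(cg / "cpu.cfs_burst_us").write_text("100000")
```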
Group Identity technology
To meet the business demand for CPU resource isolation, the kernel must keep the service quality of high-priority services unaffected, or keep the impact within a bounded range, while maximizing CPU utilization. The kernel scheduler therefore needs to give high-priority tasks more scheduling opportunities, minimizing their scheduling latency and the impact low-priority tasks have on them. This is a common requirement across the industry.
Against this background, we introduced the concept of Group Identity: every CPU cgroup carries an identity, and scheduling priority takes effect at the granularity of the CPU cgroup. High-priority groups gain better timely-preemption capability, which guarantees the performance of high-priority tasks; this suits business scenarios where online and offline workloads run mixed. During colocation it minimizes the scheduling delays that offline tasks would otherwise inflict on online services, and underlying core mechanisms such as extra CPU preemption opportunities for high-priority services ensure that online services are not affected by the CPU scheduling latency introduced by offline tasks.
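As an illustration, here is a sketch of how identities might be assigned per CPU cgroup. The cpu.bvt_warp_ns file and its value range are interfaces of the ANCK / Alibaba Cloud Linux kernel family as we understand them, not upstream Linux; the cgroup names are placeholders:

```python
# Minimal sketch: marking cgroup identities under Group Identity.
# The cpu.bvt_warp_ns interface is assumed from ANCK / Alibaba Cloud
# Linux kernels; cgroup names are illustrative placeholders.
from pathlib import Path

def set_identity(cgroup: str, value: int) -> None:
    # Assumed semantics: positive = high-priority (online),
    # 0 = default, -1 = low-priority (offline).
    Path(f"/sys/fs/cgroup/cpu/{cgroup}/cpu.bvt_warp_ns").write_text(str(value))

set_identity("online-app", 2)   # online service: highest priority
set_identity("batch-job", -1)   # offline batch: preempted promptly
```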
SMT expeller technology
In some online business scenarios, QPS with hyper-threading is markedly lower than without it, and RT rises accordingly. The root cause lies in the physical nature of hyper-threading: HT technology presents two logical cores on one physical core. The two logical cores have their own independent registers (eax, ebx, ecx, MSRs, and so on) and APIC, but they share the physical core's execution resources, including the execution engine, L1/L2 caches, TLB, and system bus. This means that if one of a pair of HT siblings runs an online task while the other runs an offline task at the same time, the two compete with each other. This is the problem we need to solve.
To reduce this competition as much as possible, we want that while an online task executes on a core, its HT sibling stops running offline tasks; and when an online task is scheduled onto a core whose HT sibling is running an offline task, the offline task is driven away. That sounds rough on offline workloads, doesn't it? But this is our mechanism for keeping HT resources free of contention.
The SMT expeller feature builds on the Group Identity framework to implement hyper-threading (HT) isolation scheduling, ensuring that high-priority services are never interfered with by low-priority tasks running on the sibling HT.
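To make the hardware unit concrete, the following sketch enumerates HT sibling groups from standard Linux sysfs topology files; these sibling pairs are exactly what SMT expeller arbitrates between. (Enabling the expeller itself uses an ANCK-specific interface not shown here; identities come from Group Identity, above.)

```python
# Minimal sketch: listing hyper-thread sibling groups via standard sysfs.
from pathlib import Path

seen = set()
for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    siblings = (cpu / "topology" / "thread_siblings_list").read_text().strip()
    if siblings not in seen:
        seen.add(siblings)
        # e.g. "0,32" on a 2-way SMT machine: the pair the expeller protects
        print(f"HT sibling group: {siblings}")
```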
Processor hardware resource management technology
Our kernels support Intel® Resource Director Technology (Intel® RDT), a hardware resource management capability of the processor. It includes Cache Monitoring Technology (CMT) for monitoring cache usage, Memory Bandwidth Monitoring (MBM) for monitoring memory bandwidth, Cache Allocation Technology (CAT) for allocating cache, and Memory Bandwidth Allocation (MBA) for allocating memory bandwidth.
Among these, CAT turns the LLC (Last Level Cache) into a resource that supports Quality of Service (QoS). In a colocation environment, without LLC isolation the continuous reads and writes of offline applications occupy large amounts of LLC, constantly polluting the online workload's cache, which hurts its data access and even interrupt latency, degrading performance.
MBA allocates memory bandwidth. For services sensitive to memory bandwidth, bandwidth affects performance and latency even more than LLC control does. In a colocation environment, offline workloads are resource hogs; AI jobs in particular consume large amounts of memory bandwidth. Once memory bandwidth hits its bottleneck, the performance and latency of online services can degrade severalfold, and CPU utilization visibly climbs.
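For reference, here is a minimal sketch of partitioning LLC and memory bandwidth through the upstream resctrl interface (requires an RDT-capable CPU, a kernel with CONFIG_X86_CPU_RESCTRL, and `mount -t resctrl resctrl /sys/fs/resctrl` done first); the group name, cache mask, bandwidth value, and PID are illustrative assumptions:

```python
# Minimal sketch: reserving LLC ways and memory bandwidth for an online
# group via the standard Linux resctrl filesystem.
from pathlib import Path

grp = Path("/sys/fs/resctrl/online")
grp.mkdir(exist_ok=True)

# CAT: give this group the upper 4 ways of an (assumed) 8-bit L3 mask on
# cache domain 0; MBA: leave it unthrottled at 100% of memory bandwidth.
# Offline groups would get a disjoint, smaller mask and a lower MB value.
(grp / "schemata").write_text("L3:0=f0\nMB:0=100\n")

# Assign the online service's threads (placeholder PID) to this group.
(grp / "tasks").write_text("12345")
```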
Memcg background reclaim
In the stock kernel, when a container's memory usage reaches its upper limit, any further allocation triggers direct memory reclaim in the context of the current process, which inevitably drags down that process's execution and causes performance problems. Is there a way to have a container reclaim memory asynchronously and ahead of time, once its usage crosses a certain watermark? That way there is a fairly good chance that processes in the container avoid falling into direct reclaim when an allocation pushes usage to the limit.
The kernel has a kswapd background thread that asynchronously reclaims memory when system-wide usage reaches a certain level. But consider this case: a high-priority business container's memory usage is already tight, yet the host overall still has plenty of free memory, so the kernel's kswapd thread is never woken up, and the memory of this pressured high-priority container never gets a chance to be reclaimed. This is a serious contradiction. Because the stock kernel has no asynchronous memory reclaim mechanism at the memory cgroup level, a container's reclaim depends heavily on host-level kswapd or falls back to its own synchronous (direct) reclaim, which severely hurts the performance of some high-priority containers.
Given this background, the Alibaba Cloud operating system team provides a memcg-level asynchronous reclaim strategy similar to host-level kswapd. It lets container-level memory reclaim run ahead of time, according to user configuration, to relieve memory pressure early.
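A minimal sketch of what the configuration might look like. The memory.wmark_ratio interface is specific to ANCK / Alibaba Cloud Linux kernels, so the file name and semantics here are assumptions based on that kernel family; the cgroup name is a placeholder:

```python
# Minimal sketch: enabling memcg background (asynchronous) reclaim.
# memory.wmark_ratio is an assumed ANCK / Alibaba Cloud Linux interface.
from pathlib import Path

cg = Path("/sys/fs/cgroup/memory/online-app")
cg.mkdir(parents=True, exist_ok=True)

(cg / "memory.limit_in_bytes").write_text(str(8 << 30))  # 8 GiB limit

# Assumed semantics: start background reclaim when usage crosses 90% of
# the limit, so the container rarely hits the limit and falls into
# direct reclaim on the allocation path.
(cg / "memory.wmark_ratio").write_text("90")
```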
The asynchronous reclaim flow is illustrated in the following figure:
Memcg global min watermark classification
Resource-hungry offline tasks often allocate large amounts of memory in an instant, pushing the system's free memory down to the global min watermark and forcing every task in the system into the slow path of direct memory reclaim. Latency-sensitive online services then easily suffer performance jitter. In this scenario, neither global kswapd background reclaim nor memcg-level background reclaim can help.
Starting from the fact that memory-hungry offline tasks are usually latency-insensitive, we designed the memcg global min watermark classification feature to solve this jitter. On top of the standard upstream globally shared min watermark, the global min watermark of offline tasks is moved up so they enter direct reclaim earlier, while the global min watermark of latency-sensitive online tasks is moved down somewhat, isolating the min watermarks of offline and online tasks. When an offline task allocates a burst of memory, it is throttled at its raised min watermark, which spares online tasks from direct reclaim; then, once global kswapd has reclaimed a certain amount of memory, the offline task's brief throttling is lifted.
The core idea is to gate memory allocation separately by giving offline containers a different global watermark, so offline container tasks enter direct reclaim ahead of online services when allocating memory. This solves the problems caused by offline containers allocating large amounts of memory in an instant.
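As an illustration, here is a sketch of separating the watermarks for an online and an offline memcg. The memory.wmark_min_adj interface (negative values lower a group's view of the global min watermark, positive values raise it) is specific to ANCK / Alibaba Cloud Linux; the exact file name, value range, and cgroup names are assumptions:

```python
# Minimal sketch: splitting the global min watermark by workload class.
# memory.wmark_min_adj is an assumed ANCK / Alibaba Cloud Linux interface.
from pathlib import Path

base = Path("/sys/fs/cgroup/memory")

# Online: lower its min watermark so it is the last to hit direct reclaim.
(base / "online-app" / "memory.wmark_min_adj").write_text("-25")

# Offline: raise its min watermark so a burst of allocations throttles it
# early, before online tasks are dragged into direct reclaim.
(base / "batch-job" / "memory.wmark_min_adj").write_text("50")
```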
Readers with some grounding in Linux memory management can also consult the following figure, which records in detail how memory behaves under the various watermarks during online/offline colocation:
Memcg OOM priority
In real business scenarios, and especially in memory-oversold environments, when a global OOM occurs there is good reason to kill the lower-priority offline workload and protect the high-priority online workload; when an offline memcg OOM occurs, there is likewise good reason to kill its lower-priority jobs and keep the higher-priority ones. This is a fairly common requirement in cloud-native scenarios, but the standard Linux kernel does not have the capability. When choosing a process to kill, the kernel runs an algorithm to select the victim, usually the process with the highest OOM score. That victim may well be an online high-priority business process, which is not what we want to see.
For these reasons, the Alibaba Cloud operating system team provides a memcg OOM priority feature. With it, when the system OOMs due to memory shortage, the kernel selects its victims among low-priority business processes and avoids killing high-priority ones, greatly reducing the impact on customer business caused by an online process exiting.
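A sketch of how the priorities might be configured. The memory.priority and memory.use_priority_oom interfaces are specific to ANCK / Alibaba Cloud Linux kernels; the file names, value range, and cgroup names here are all assumptions:

```python
# Minimal sketch: expressing per-memcg OOM kill preferences.
# memory.priority / memory.use_priority_oom are assumed ANCK interfaces.
from pathlib import Path

base = Path("/sys/fs/cgroup/memory")

# Assumed semantics: higher value = higher priority = killed later on OOM.
(base / "online-app" / "memory.priority").write_text("12")
(base / "batch-job" / "memory.priority").write_text("1")

# Opt the parent memcg into priority-based victim selection among children.
(base / "memory.use_priority_oom").write_text("1")
```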
Cgroup v1 writeback throttling
Ever since the block IO cgroup was merged into the kernel, it has had one problem: it can only throttle Direct IO (and buffered IO that is followed by fsync), because when such IO reaches the block throttle layer, the current process is the one actually issuing the IO, so the right cgroup can be identified from the process and charged correctly, and IO exceeding the user-configured bandwidth/IOPS limits gets throttled. For buffered writes that are ultimately issued by kworker threads, the block throttle layer cannot tell from the current process which cgroup the IO belongs to, and therefore cannot throttle it.
Cgroup v2 now supports throttling such asynchronous IO, but cgroup v1 does not. Since cgroup v1 remains the mainstream in cloud-native environments, the Alibaba Cloud operating system team established the Page <-> Memcg <-> blkcg relationship among these three to implement asynchronous IO throttling under cgroup v1, with a throttling algorithm essentially the same as cgroup v2's.
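With writeback accounting in place, the standard cgroup v1 throttle knobs apply. A minimal sketch follows (blkio.throttle.write_bps_device is standard cgroup v1; that it also covers buffered writeback is the ANCK enhancement described above; the device numbers and cgroup name are placeholders):

```python
# Minimal sketch: capping write bandwidth for a cgroup v1 blkio group.
# "253:0" is a placeholder major:minor for the target block device.
from pathlib import Path

cg = Path("/sys/fs/cgroup/blkio/batch-job")
cg.mkdir(parents=True, exist_ok=True)

# Limit writes to this device to 50 MiB/s; on ANCK kernels this also
# covers dirty-page writeback issued on the group's behalf by kworkers.
(cg / "blkio.throttle.write_bps_device").write_text(f"253:0 {50 << 20}")
```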
blk-iocost weight control
Normally, to prevent one IO-hungry job from easily exhausting the entire system's IO resources, we set an upper limit on each cgroup's IO bandwidth. The biggest drawback is that even when the device is idle, a cgroup whose issued IO has exceeded its configured ceiling cannot continue to send IO, which wastes storage resources.
From this requirement arose a new IO controller: blk-iocost. This controller allocates disk resources by blkcg weight. It can make full use of disk IO resources on the premise of meeting business IO QoS: only when the disk's IO capacity reaches the limit set by the QoS target does the iocost controller use weights to apportion IO among the groups. On top of that, blk-iocost has a degree of self-adaptation, avoiding wasted disk capacity as much as possible.
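For reference, here is a sketch using the upstream cgroup v2 iocost interface (ANCK also makes iocost available under cgroup v1, where file names may differ); the device numbers, latency targets, and weights are illustrative assumptions, not recommendations:

```python
# Minimal sketch: enabling blk-iocost and weighting two groups via the
# upstream cgroup v2 interface mounted at /sys/fs/cgroup.
from pathlib import Path

root = Path("/sys/fs/cgroup")

# Enable iocost on the device with user-specified latency QoS targets:
# aim for 95th-percentile read/write completion latency under 5 ms.
(root / "io.cost.qos").write_text(
    "253:0 enable=1 ctrl=user rpct=95.00 rlat=5000 wpct=95.00 wlat=5000"
)

# Once the device saturates, share its capacity ~10:1 online:offline.
(root / "online-app" / "io.weight").write_text("default 500")
(root / "batch-job" / "io.weight").write_text("default 50")
```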
Looking ahead
All of the resource isolation capabilities above have been contributed in full to the OpenAnolis community. The relevant source code can be found in ANCK (Anolis Cloud Kernel); interested readers can follow the OpenAnolis community: https://openanolis.cn/
Meanwhile, the Alibaba Cloud Container Service team is working with the operating system team to deliver these capabilities externally through Alibaba Cloud Container Service ACK Agile Edition and the CNStack (CloudNative Stack) product family, continuing to implement ACK Anywhere to empower more enterprises. The commercial version is built entirely on cloud-native community standards and installs seamlessly into a K8s cluster as a plug-in, delivering the online/offline colocation capability to customers. The core OS-layer isolation capabilities have been released in Anolis OS, an open-source, neutral, and open Linux distribution that supports multiple architectures.
In the near future, the Alibaba Cloud colocation core technology series will continue to share articles on CPU Burst in practice and on kernel resource isolation, covering CPU, memory, IO, and network isolation. Stay tuned!