
Author

Jiang Biao, a senior engineer at Tencent Cloud, has focused on operating system technologies for more than 10 years and is a veteran Linux kernel enthusiast. He is currently responsible for the R&D of Tencent Cloud's native OS and for OS/virtualization performance optimization.

Introduction

Colocation usually refers to online/offline colocation (mixing online and offline workloads): deploying online services (usually latency-sensitive, high-priority tasks) and offline jobs (usually CPU-hungry, low-priority tasks) on the same node at the same time, in order to raise the node's resource utilization. The key difficulty lies in the underlying resource isolation technology, which depends heavily on the OS kernel; and the isolation capabilities of the stock Linux kernel once again come up somewhat short (or at least imperfect) in the face of colocation demands. Only deep hacking can meet production-grade requirements.

(Cloud-native) resource isolation technology mainly covers four aspects: CPU, memory, IO, and network. This article focuses on CPU isolation technology and the relevant background; follow-up articles (in this series) will step by step expand to the other aspects.

Background

Whether in an IDC or in the cloud, resource utilization is a common problem faced by most users and vendors. On the one hand, hardware is very expensive (everyone buys it, most of the core hardware technology is in other people's hands, so buyers have no pricing power and usually little bargaining power) and its life cycle is short (it must be replaced after a few years). On the other hand, it is deeply embarrassing that something so expensive cannot be fully used. Take CPU utilization: in most scenarios the average utilization is very low. If I venture that it is below 20% (as a daily or weekly average), I believe most readers will not object, which means the expensive hardware is actually used less than one fifth of the time. Anyone trying to run a household responsibly would find that painful.

Therefore, improving host (node) resource utilization is a problem worth exploring, and the payoff is obvious. The idea is also very straightforward:

The conventional line of thinking: deploy more workloads. Easy to say, and everyone has tried it. The core difficulty: typical workloads have pronounced peaks and valleys.

What you want is a load curve that stays flat and close to capacity; what you mostly get is sharp peaks over a very low baseline.

When doing capacity planning for a business, you have to plan for the worst case (assuming all workloads have the same priority). Specifically, at the CPU level, capacity must be planned against the CPU peak (possibly a weekly peak, or even a monthly/yearly peak), usually with some headroom left for emergencies.

In reality, the peak is high but the average is low, so in most scenarios the actual CPU utilization ends up very low.

I made an assumption above: "all workloads have the same priority". In that case the workloads' worst case determines the whole machine's outcome (low resource utilization). Change that assumption and give workloads priorities, and there is far more room to maneuver: the quality of service of low-priority workloads can be sacrificed (usually tolerably) to guarantee that of high-priority workloads. This way, alongside an appropriate amount of high-priority service, more (low-priority) workloads can be deployed, raising overall resource utilization.

Colocation (mixed deployment) was born out of this. The "mixing" here is essentially "prioritizing". Narrowly, it can be understood simply as "online + offline" colocation; broadly, it extends to the mixed deployment of workloads at multiple priority levels.

The core technology involved spans two levels:

  1. Underlying resource isolation, (usually) provided by the operating system (kernel). This is the core focus of this article (series).
  2. Upper-level resource scheduling, (usually) provided by the resource orchestration/scheduling framework above (K8s being the typical example). I plan a separate series of articles on this; stay tuned.

Colocation is also a very hot topic and technical direction in the industry. The mainstream leading vendors keep investing in it: the value is obvious, and the technical barriers are high. The technology has long roots. The famous K8s (and its predecessor Borg) actually grew out of Google's colocation scenario, and judging by both history and results, Google is regarded as the industry benchmark, reportedly reaching about 60% average CPU utilization. For details, see its classic papers:

https://dl.acm.org/doi/pdf/10.1145/3342195.3387517

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43438.pdf

Of course, Tencent (Cloud) also began exploring colocation very early and has gone through several major technology/solution iterations, by now reaching a good deployment scale with good results. The details deserve a separate article and are not covered here.

Technical challenges

As mentioned earlier, the underlying resource isolation technology is critical in colocation scenarios. "Resources" fall into four broad categories:

  • CPU
  • Memory
  • IO
  • Network

This article focuses on CPU isolation, analyzing the technical difficulties, the status quo, and the possible solutions at the CPU isolation level.

CPU isolation

Among the four resource types above, CPU isolation can be considered the most fundamental isolation technology. On the one hand, CPU is a compressible (reusable) resource: reuse is comparatively easy and upstream's solutions are comparatively usable. On the other hand, CPU is strongly coupled to the other resources: the use (acquisition/release) of other resources often happens in process context and thus indirectly depends on the CPU. For example, when a task's CPU is isolated (suppressed), its IO and network requests will (in most cases) also be held back, simply because the task is suppressed (not scheduled).

The effect of CPU isolation therefore indirectly shapes the isolation of the other resources, which makes CPU isolation the core isolation technology.

Kernel scheduler

Inside the OS, CPU isolation essentially depends entirely on the kernel scheduler. The kernel scheduler is the basic kernel component that distributes CPU time among competing loads (a very official definition); concretely (in the narrow sense), it corresponds to the default scheduler of the Linux kernel that we deal with most: the CFS scheduler (essentially a scheduling class, i.e., a set of scheduling policies).

The kernel scheduler decides when, and which, tasks (processes) are selected to run on the CPU. In a colocation scenario it therefore decides the CPU time of online and offline tasks, and with it the CPU isolation effect.

Isolation in the upstream kernel

The Linux kernel provides five scheduling classes by default, and only two of them are really usable for business workloads:

  • CFS
  • Real-time scheduler (rt/deadline)

In colocation scenarios, the essence of CPU isolation is this pair of requirements:

  • When online tasks need to run, suppress offline tasks as much as possible
  • When online tasks are not running, let offline tasks use the idle CPU

For "suppression", based on the Upstream kernel (based on CFS), there are several ideas (plans) as follows:

Priority

Lower the priority of offline tasks, or raise the priority of online tasks. Without changing the scheduling class (staying with the default CFS), the dynamically adjustable priority (nice) range is [-20, 20).

Priority manifests concretely as the time slice a task can be allotted within a single scheduling period. Specifically:

  • Between the normal priority 0 and the lowest priority 19, the time slice weight ratio is 1024/15, approximately 68:1
  • Between the highest priority -20 and the normal priority 0, the weight ratio is 88761/1024, approximately 87:1
  • Between the highest priority -20 and the lowest priority 19, the weight ratio is 88761/15, approximately 5917:1

The suppression ratio looks fairly high. Say offline tasks are set to priority 19 while online stays at the default 0 (the usual practice); the online:offline time slice weight ratio is then 68:1.

Assuming a single scheduling period of 24ms (the default configuration on most systems), a rough estimate gives offline about 24ms/69 ≈ 348us per scheduling period, i.e., roughly 1/69 ≈ 1.4% of the CPU.
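As a concrete illustration, here is a minimal C sketch of this knob, with a hypothetical offline task pid; it demotes the task to the weakest CFS nice level via setpriority(2):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/resource.h>

int main(void)
{
    pid_t offline_pid = 12345; /* hypothetical: pid of an offline task */

    /* nice 19 maps to CFS weight 15, vs. 1024 at the default nice 0,
     * giving roughly the 68:1 time slice ratio discussed above. */
    if (setpriority(PRIO_PROCESS, offline_pid, 19) != 0) {
        perror("setpriority");
        return 1;
    }
    return 0;
}
```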

The actual runtime behavior differs a bit, though. For throughput, CFS protects a minimum granularity per run (the minimum time a process runs once scheduled): sched_min_granularity_ns, set to 10ms in most cases. This means that once an offline task preempts, it can keep running for 10ms, and hence the scheduling latency (RR switching latency) of online tasks can reach 10ms.

Wakeup also has minimum granularity protection (on wakeup preemption, the preempted task is guaranteed a minimum runtime): sched_wakeup_granularity_ns, set to 4ms in most cases. This means that once offline tasks are running, the wakeup latency (another typical scheduling latency) of online tasks can also reach 4ms.
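To check these two knobs on a given machine, they can simply be read from procfs; a minimal sketch (the paths below hold for kernels of this era, such as 5.4; newer kernels moved them to debugfs under /sys/kernel/debug/sched/):

```c
#include <stdio.h>

/* Print one scheduler tunable from procfs, if readable. */
static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f && fgets(buf, sizeof(buf), f))
        printf("%s = %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/sys/kernel/sched_min_granularity_ns");
    show("/proc/sys/kernel/sched_wakeup_granularity_ns");
    return 0;
}
```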

In addition, adjusting priority does not improve the preemption logic. Concretely, the preemption paths (wakeup preemption and the periodic tick) do not consult priority, and there is no priority-specific preemption policy (an offline task's low priority does not make its preemption more restrained or its preemption window shorter), so offline tasks may preempt unnecessarily and cause interference.

Cgroup (CPU shares)

The Linux kernel provides the CPU cgroup (which containers/pods map onto), and a container's priority can be controlled by setting its cgroup's share value; that is, "suppression" can be achieved by lowering the offline cgroup's share. For cgroup v1 the default share value is 1024; for cgroup v2 the default share (weight) value is 100 (both adjustable, of course). If the offline cgroup's share/weight is set to 1 (the lowest value), the CFS time slice weight ratios are 1024:1 and 100:1 respectively, corresponding to roughly 0.1% and 1% of the CPU for offline.
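A minimal sketch of turning that knob from C, assuming a v1 hierarchy mounted at /sys/fs/cgroup/cpu and a hypothetical "offline" cgroup (note that in practice the v1 kernel clamps cpu.shares to a small floor of 2):

```c
#include <stdio.h>

/* Write a value to a cgroup control file. */
static int write_val(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* cgroup v1: cpu.shares, default 1024 (kernel floor is 2) */
    write_val("/sys/fs/cgroup/cpu/offline/cpu.shares", "2");
    /* cgroup v2: cpu.weight, default 100, range [1, 10000] */
    write_val("/sys/fs/cgroup/offline/cpu.weight", "1");
    return 0;
}
```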

The actual runtime behavior is still constrained by sched_min_granularity_ns and sched_wakeup_granularity_ns, with the same logic as in the priority scheme.

And as with the priority scheme, the preemption logic is not optimized based on share values, so extra interference is possible.

Special policy

CFS also provides a special scheduling policy, SCHED_IDLE, dedicated to tasks of extremely low priority; it almost seems made for "offline tasks". A SCHED_IDLE task is essentially a CFS task with a weight of 3; its time slice weight ratio against an ordinary task is 1024:3, about 341:1, which puts the CPU share of offline tasks at roughly 0.3%.
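A minimal sketch of marking the calling task SCHED_IDLE (the policy carries no real-time priority, so sched_priority must be 0):

```c
#define _GNU_SOURCE /* SCHED_IDLE is Linux-specific */
#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 0 };

    /* pid 0 means the calling thread. */
    if (sched_setscheduler(0, SCHED_IDLE, &param) != 0) {
        perror("sched_setscheduler(SCHED_IDLE)");
        return 1;
    }
    /* From here on, this task runs as a weight-3 CFS entity:
     * roughly 1024:3 against a nice-0 task on a busy CPU. */
    return 0;
}
```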

The actual runtime behavior is again constrained by sched_min_granularity_ns and sched_wakeup_granularity_ns, as in the priority scheme.

CFS does make a special preemption optimization for SCHED_IDLE tasks (suppressing SCHED_IDLE tasks' preemption of other tasks and shortening their preemption window). From this angle, SCHED_IDLE takes a small step toward "fitting" the colocation scenario (even if upstream never intended it to).

Moreover, since SCHED_IDLE is a per-task flag and there is no cgroup-level SCHED_IDLE marking, while CFS scheduling first picks a (task) group and then a task within it, SCHED_IDLE alone is of no practical use for cloud-native (container) colocation.

Overall, although CFS provides priorities (share and SCHED_IDLE are similar in principle: both boil down to priority) and can suppress low-priority tasks to some extent, the core design of CFS is "fairness", which fundamentally rules out "absolute suppression" of offline tasks. Even with the "priority" (weight) set to the minimum, offline tasks still obtain a fixed time slice, and that time slice does not come from idle CPU time; it is grabbed from the online tasks' time. In other words, CFS's "fair design" means interference from offline against online can never be fully avoided, and a perfect isolation effect is unattainable.

Furthermore, pushing offline tasks' priority to the extreme (which is essentially what all the schemes above do) also compresses the priority space of offline tasks. If you want to further distinguish priorities among offline tasks (offline workloads may need their own QoS levels too, a demand that does arise in practice), there is nothing left to work with.

Finally, at the implementation level: since online and offline both use the CFS scheduling class, at runtime they share the run queue (rq), their loads are accounted together, and they share the load balancing mechanism. On the one hand, offline operations on shared resources (such as the run queue) require synchronization (locking), and the lock primitives themselves are not priority-aware, so offline interference cannot be ruled out. On the other hand, load balancing cannot pick out offline tasks for special treatment (e.g., more aggressive balancing to prevent starvation and raise CPU utilization), so the balancing behavior of offline tasks cannot be controlled.

Real-time priority

At this point you may be thinking: if absolute preemption (suppression of offline) is what we need, why not use the real-time scheduling classes (rt/deadline)? Compared with CFS, the real-time classes deliver exactly the "absolute suppression" effect.

True. But down this path you can only set the online business to real-time and keep the offline tasks in CFS, so that online absolutely preempts offline. And if you worry about offline starving, the rt_throttle mechanism is there to ensure offline is not starved to death.
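A minimal sketch of that idea, giving an online task a real-time policy (requires CAP_SYS_NICE; the rt_throttle safety net corresponds to the sched_rt_runtime_us / sched_rt_period_us sysctls, 950ms per 1s by default):

```c
#include <stdio.h>
#include <sched.h>

int main(void)
{
    /* Real-time priorities range over 1..99; even priority 1 is enough
     * to absolutely preempt every CFS (offline) task. */
    struct sched_param param = { .sched_priority = 1 };

    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler(SCHED_FIFO)");
        return 1;
    }
    return 0;
}
```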

It looks "perfect", but it's not. The essence of this approach will compress the priority space and living space of online tasks (contrary to the result of lowering the priority of offline tasks before), the result is that online services can only use real-time scheduling (although most online services are not satisfied The characteristics of real-time type) can no longer use the native capabilities of CFS (such as fair scheduling, Cgroup, etc., which are just needed for online tasks).

Put simply, the problem is that the real-time classes do not match the needs of online tasks themselves. Online services are fundamentally not real-time tasks, and forcing them into real-time brings serious side effects: system tasks (the OS's own work, such as various kernel threads and system services) would be starved.

To sum up, for the real-time priority scheme:

  1. We value the real-time classes' "absolute suppression" of CFS tasks (exactly what we want).
  2. But in the current upstream kernel implementation, the only way to get it is to put online tasks in a real-time class above CFS, which is unacceptable in real application scenarios.

Priority inversion

At this point you may still have a big question mark in your mind: with "absolute suppression", won't there be a priority inversion problem? What then?

The answer: yes, there is indeed a priority inversion problem.

Here is the priority inversion logic in this scenario. Suppose online and offline tasks share resources (e.g., common data in the kernel, such as the /proc file system). If an offline task takes a lock on a shared resource ("lock" in the abstract sense; it need not literally be a lock) and is then "absolutely suppressed" so that it never runs again, then when an online task needs the same shared resource and waits for the corresponding lock, priority inversion occurs, resulting in deadlock (or at least long-term blocking). Priority inversion is a classic problem that any scheduling model must account for.
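A minimal userspace sketch of the inversion (illustrative only; in a real colocation setup the shared resource would typically live inside the kernel):

```c
/* Build with: cc inversion.c -o inversion -lpthread */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static void *offline_task(void *arg)
{
    pthread_mutex_lock(&shared_lock);
    /* Pretend the scheduler never runs this thread again from here:
     * the sleep stands in for "absolutely suppressed, no CPU time". */
    sleep(3600);
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

static void *online_task(void *arg)
{
    /* Blocks behind the suppressed offline task: inversion. */
    pthread_mutex_lock(&shared_lock);
    pthread_mutex_unlock(&shared_lock);
    return NULL;
}

int main(void)
{
    pthread_t off, on;

    pthread_create(&off, NULL, offline_task, NULL);
    sleep(1); /* let the offline thread grab the lock first */
    pthread_create(&on, NULL, online_task, NULL);
    pthread_join(on, NULL); /* blocks for the full hour */
    return 0;
}
```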

A rough summary of the conditions under which priority inversion occurs:

  • Online and offline tasks share resources.
  • There is concurrent access to the shared resources, protected by sleeping locks.
  • After taking the lock, the offline task is completely suppressed and gets no chance to run. This can be read as: all CPUs are 100% occupied by online tasks, leaving offline no opportunity to run. (In theory, as long as any CPU is idle, offline tasks can reach it via the load balancing mechanism.)

In cloud-native colocation scenarios, the way to handle priority inversion depends on how you look at the problem. Consider the following angles:

  1. How likely is priority inversion? That depends on the actual application scenario. In theory, if online and offline services share no resources, priority inversion simply cannot occur. In cloud-native scenarios there are roughly two cases:

(1) Secure containers. Here the business actually runs inside a "virtual machine" (abstractly speaking), and the virtual machine itself isolates most resources. In this scenario priority inversion can basically be avoided (and where it does exist, it can be identified and handled case by case).

(2) Ordinary containers. Here the business runs in containers and some resources are shared, such as common kernel resources and shared file systems. As analyzed above, even with shared resources the conditions for priority inversion are quite strict, the most critical being that all CPUs are 100% occupied by online tasks. In real scenarios this is very rare and can be considered an extreme case; in practice, such "extreme cases" can be handled separately.

Therefore, in (most) real cloud-native scenarios, priority inversion can be considered avoidable, provided the scheduler optimization/hack is good enough.

  2. How should priority inversion be handled? Although it only occurs in extreme scenarios, if it must be handled (upstream will certainly insist on it), how?

(1) Upstream's approach. The native Linux kernel's CFS reserves a certain weight even for the lowest priority (think SCHED_IDLE), so even the lowest-priority task still gets some time slice, and priority inversion can (basically) be avoided. This has always been the community's stance: be general; even the most extreme scenario must be covered. And this very design is exactly why "absolute suppression" cannot be achieved. As a design, it is not wrong; but for cloud-native colocation it is not perfect: it does not perceive how starved offline actually is, so even when offline is not starving it may still preempt online and cause unnecessary interference.

(2) Another approach, designed for cloud-native scenarios: perceive offline starvation and the possibility of priority inversion, and allow preemption only when offline starvation may lead to inversion (i.e., as a last resort). This avoids needless preemption (interference) while still preventing priority inversion, achieving a (relatively) perfect result. Admittedly, such a design is not so generic and not so graceful, so upstream is basically unlikely to accept it.

Hyperthreading interference

One more key issue has been left out so far: hyperthreading interference. This is a chronic ailment of colocation scenarios, and the industry has never had a targeted solution.

The problem, concretely: hyperthreads on the same physical CPU share core hardware resources, such as caches and execution units. When an online task and an offline task run simultaneously on a pair of sibling hyperthreads, they interfere with each other through contention for those hardware resources. And CFS did not consider this issue in its design at all.
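For orientation, the sibling topology can be read from sysfs; a minimal sketch listing the hyperthread siblings of the first few CPUs (the CPU count is illustrative):

```c
#include <stdio.h>

int main(void)
{
    char path[128], buf[64];

    for (int cpu = 0; cpu < 4; cpu++) { /* first 4 CPUs, for illustration */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (f && fgets(buf, sizeof(buf), f))
            printf("cpu%d siblings: %s", cpu, buf); /* e.g. "0,32" */
        if (f)
            fclose(f);
    }
    return 0;
}
```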

As a result, in colocation scenarios the performance of online services suffers. In actual tests with a CPU-intensive benchmark, the performance interference caused by hyperthreading reached 40%+.

Note: according to Intel's official figures, a physical core with both hyperthreads busy delivers only about 1.2x the performance of a single thread.

Hyperthreading interference is a key issue in colocation scenarios, and CFS (almost) completely ignored it in its original design. That is not so much a design defect as a matter of scope: CFS was never designed for colocation; it was born for a more general, macro-level scenario.

Core scheduling

At this point, readers with a professional background (in kernel scheduling) may raise another question: haven't you heard of core scheduling? Doesn't it solve the hyperthreading interference problem?

Indeed, a very professional question. Core scheduling is a feature posted in 2019 by Peter Zijlstra, the maintainer of the kernel scheduler subsystem (building on the coscheduling concept proposed earlier in the community). Its main goal is to solve (or rather, mitigate or work around) the L1TF vulnerability (data leakage through the cache shared between hyperthreads), and its main application scenario is cloud hosts: preventing processes of different virtual machines from running on the same pair of hyperthreads and leaking data.

The core idea: prevent differently tagged processes from running on the same pair of hyperthreads.

The status quo: the core scheduling patch set has gone through ten iterations (v10) and nearly two years of discussion and rework. Finally, just recently (2021-04-22), Peter posted a version that looks like it might make it into master (when exactly is hard to say; it is still not complete):

https://lkml.org/lkml/2021/4/22/501
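For reference, a minimal sketch of the prctl interface the patch set proposes (and which was later merged into mainline); the constants are defined by hand here in case the system headers predate the feature, and a suitably new kernel is assumed:

```c
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE                    62
#define PR_SCHED_CORE_CREATE             1
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

int main(void)
{
    /* Create a new core-scheduling cookie for this process: tasks with
     * different cookies are never co-scheduled on sibling hyperthreads. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0 /* self */,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
        perror("prctl(PR_SCHED_CORE)");
        return 1;
    }
    return 0;
}
```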

This topic deserves a separate in-depth article of its own, so I will not expand on it here. Stay tuned...

Here I will simply throw out my (personal) take:

  • Core scheduling can indeed be used to solve the problem of hyperthreading interference.
  • But core scheduling was designed to fix a security vulnerability (L1TF), not colocation hyperthreading interference. Because it must guarantee security, it has to provide absolute isolation, which requires complex (expensive) synchronization primitives (such as a core-level rq lock), heavyweight mechanisms such as core-wide task picking and excessive forced idling, plus matching interrupt-context concurrency isolation, and so on.
  • The design and implementation of core scheduling are too heavy and the overhead too high; enabling it causes serious performance regressions, and it cannot distinguish online from offline. It is not suitable for (cloud-native) colocation scenarios.

The essence, once again: core scheduling was not designed for cloud-native colocation scenarios.

Summary

Based on the analysis above, we can summarize the advantages and problems of the existing solutions.

The CFS priority-based scheme (share/SCHED_IDLE being similar). Advantages:

  • Generic and capable; it covers most application scenarios well
  • Can (basically) avoid priority inversion

Problems:

  • The isolation effect is imperfect (no absolute suppression)
  • Assorted other minor imperfections

The real-time class scheme. Advantages:

  • Absolute suppression; perfect isolation effect
  • Has a mechanism (rt_throttle) to avoid priority inversion

Problems:

  • Not applicable: online tasks cannot (in most cases) use real-time scheduling classes
  • The priority inversion safeguard (rt_throttle) exists, but once it kicks in the isolation is no longer perfect

Core scheduling for hyperthreading interference. Advantages:

  • Perfect isolation against hyperthreading interference

Problems:

  • The design is too heavy and the overhead is too large

Conclusion

For the sake of generality and design elegance, the upstream Linux kernel can hardly satisfy the extreme demands of a specific scenario (cloud-native colocation). Pursuing excellence to the extreme requires deep hacking, and TencentOS Server has been on that road. (Sounds familiar? Indeed, I have said so before!)

As for the concrete implementation and code analysis of the Linux kernel scheduler (based on the 5.4 kernel (Tkernel4)), we will publish a corresponding series of analysis articles, discussing the pain points of cloud-native scenarios alongside the relevant code, stripping away some of the Linux kernel's mystery and exploring a broader hacking space. Stay tuned.

Food for thought

  1. If you want online services to use CFS (leveraging CFS's powerful capabilities) while also having "absolute suppression", what would the ideal approach look like? (I feel the answer is about to surface!)
  2. If you do not need a perfect isolation effect (absolute suppression), but you do need to handle priority inversion, need a "near-perfect" isolation effect, and want to reuse existing mechanisms as much as possible (no big scheduler hacks; lower risk), what then? (Look carefully at the analysis and summary of the existing solutions above; the answer is almost within reach.)