Abstract: Starting from the most basic scheduling algorithm, this article analyzes the principles of the mainstream scheduling algorithms one by one and explores the inner workings of CPU scheduling.

This article is shared from the Huawei Cloud Community article "Exploring the Principles of CPU Scheduling", by Yuan Runzi.

Preface

Software engineers tend to treat the OS (Operating System) as a trustworthy butler: we hand our programs over to it to run, but rarely look deeply into how it works. Indeed, as a general-purpose software system, the OS performs well enough in most scenarios. Still, some special scenarios require us to tune the OS so that the business system can complete its work more efficiently. That, in turn, requires a deep understanding of how the OS works: not just being able to call on the butler, but knowing how to make the butler do a better job.

The OS is a very large software system. This article explores only the tip of the iceberg: the principles of CPU scheduling.

When CPU scheduling comes up, many people's first reaction is time-slice-based scheduling: each process gets a time slice during which it occupies the CPU, and when the slice is used up it yields the CPU to other processes. As for the deeper mechanics, such as how the OS decides that a time slice has been used up and how it switches to another process, few people seem to know.

In fact, time-slice-based scheduling is only one of many CPU scheduling algorithms. Starting from the most basic algorithm, this article analyzes the principles of the mainstream scheduling algorithms one by one and explores the inner workings of CPU scheduling.

CPU context switching

Before exploring the principles of CPU scheduling, let's first look at CPU context switching, which is the foundation of CPU scheduling.

Almost all OSs today can "simultaneously" run far more tasks than there are CPUs, and the OS allocates the CPU to these tasks in turn. This requires the OS to know where each task's instructions and data live and where execution should resume. That information is held in the CPU's registers; in particular, the address of the next instruction to be executed is held in a special register called the program counter (PC). We call this register state the CPU context, also known as the hardware context.

When the OS switches the running task, it saves the context of the previous task and loads the context of the next task into the CPU registers. This action is called a CPU context switch.

The CPU context is part of the process context. The process context we often talk about consists of the following two parts:

  • User-level context: the process's runtime stack, data segment, code segment and other information.
  • System-level context: process identification information, the process's execution-state information (the CPU context), process control information, and so on.

This raises two questions: (1) How is the CPU context of the previous task saved? (2) When is a context switch performed?

Question 1: How to save the CPU context of the previous task?

The CPU context is saved in the process's kernel space. When the OS allocates virtual memory to a process, it reserves a kernel space that only kernel code can access. Before switching the CPU context, the OS first saves the current CPU's general-purpose registers, PC and other execution-state information into the kernel space of the process; when that process is scheduled again, this information is taken out and reloaded onto the CPU to resume the task.
[Figure: the CPU context is saved to and restored from the process's kernel space]
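To make the idea concrete, here is a minimal, purely illustrative sketch in Python. All names such as CPUContext and context_switch are hypothetical and this is not real kernel code; it only shows the bookkeeping: the outgoing task's register values are copied into its own save area, and the incoming task's saved values are loaded back onto the CPU.

```python
from dataclasses import dataclass, field

@dataclass
class CPUContext:
    """Hardware context: the register values that must survive a switch."""
    pc: int = 0                         # program counter: next instruction to run
    sp: int = 0                         # stack pointer
    general_regs: list = field(default_factory=lambda: [0] * 16)

@dataclass
class Task:
    """Toy process control block; saved_context stands in for the kernel-space save area."""
    name: str
    saved_context: CPUContext = field(default_factory=CPUContext)

def context_switch(cpu: CPUContext, prev: Task, nxt: Task) -> None:
    # 1. Save the outgoing task's registers into its save area.
    prev.saved_context = CPUContext(cpu.pc, cpu.sp, list(cpu.general_regs))
    # 2. Load the incoming task's saved registers back onto the CPU.
    cpu.pc = nxt.saved_context.pc
    cpu.sp = nxt.saved_context.sp
    cpu.general_regs = list(nxt.saved_context.general_regs)
```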

Question 2: When is the context switch performed?

If the OS wants to perform a task context switch, it must itself occupy the CPU to execute the switching logic. But while a user program is running, the CPU is occupied by that program; the OS is not running at that moment and naturally cannot perform a context switch. There are two solutions to this problem: a cooperative strategy and a preemptive strategy.

The cooperative strategy relies on the user program actively giving up the CPU, for example by making a system call (System Call) or triggering an exception such as division by zero. But this strategy is not reliable: if the user program never gives up the CPU, or even runs a malicious infinite loop, that program will occupy the CPU forever, and the only way to recover is to restart the system.

The preemptive strategy relies on the hardware's timer interrupt mechanism (Timer Interrupt). During initialization, the OS registers an interrupt handler (Interrupt Handler) with the hardware. When the timer interrupt fires, the hardware hands control of the CPU back to the OS, and the OS can switch the CPU context inside the interrupt handler.
[Figure: a timer interrupt hands the CPU back to the OS, which switches the context in the interrupt handler]

Metrics for scheduling

The quality of a CPU scheduling algorithm is generally measured by the following two metrics:

  • Turnaround time: the time from a task's arrival to its completion, i.e. T_turnaround = T_completion - T_arrival
  • Response time: the time from a task's arrival to the first time it is scheduled, i.e. T_response = T_firstrun - T_arrival

The two metrics are opposed to some extent: driving down the average response time tends to increase the average turnaround time. Which metric to pursue depends on the type of task. For example, a program-compilation task wants a small turnaround time, completing the build as quickly as possible; an interactive task wants a small response time, so that the user experience is not affected.
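As a small illustration of the two metrics, the sketch below (hypothetical helper names) computes the average turnaround and response time from per-task timestamps; the worked examples later in the article can be checked against it.

```python
def avg_turnaround(tasks):
    """tasks: list of dicts with 'arrival', 'first_run' and 'completion' times."""
    return sum(t["completion"] - t["arrival"] for t in tasks) / len(tasks)

def avg_response(tasks):
    return sum(t["first_run"] - t["arrival"] for t in tasks) / len(tasks)

# Three tasks arriving at t=0, each running 10s under FIFO (see the next section):
fifo = [
    {"arrival": 0, "first_run": 0,  "completion": 10},
    {"arrival": 0, "first_run": 10, "completion": 20},
    {"arrival": 0, "first_run": 20, "completion": 30},
]
print(avg_turnaround(fifo))  # 20.0
print(avg_response(fifo))    # 10.0
```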

Workload assumptions

The workload on an OS (that is, the running status of the various tasks) is always changing. To better understand the principles of the various CPU scheduling algorithms, we first make the following assumptions about the workload:

  • Assumption 1: all tasks have the same running time.
  • Assumption 2: all tasks arrive at the same time.
  • Assumption 3: once started, a task runs until it completes.
  • Assumption 4: all tasks only use the CPU (i.e., they perform no I/O operations).
  • Assumption 5: the running time of every task is known in advance.

With the preparation done, we can now enter the wonderful world of CPU scheduling algorithms.

FIFO: first in first out

The FIFO (First In First Out) scheduling algorithm is known for its simplicity and ease of implementation: it schedules the first task to arrive and runs it to completion, then schedules the next task, and so on. If multiple tasks arrive at the same time, one of them is picked at random.

Under the workload assumed above, FIFO works well. For example, suppose three tasks A, B and C satisfy all of the assumptions, each runs for 10s, and all arrive at t=0. The schedule then looks like this:
[Figure: FIFO schedule of A, B and C, each running 10s]

Following the FIFO scheduling principle, A, B and C complete at 10s, 20s and 30s respectively, so the average turnaround time is 20s ((10+20+30)/3), which is a good result.
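A minimal FIFO simulation, assuming the task list is already in arrival order, might look like this:

```python
def fifo_schedule(tasks):
    """tasks: list of (name, arrival, runtime), pre-sorted by arrival time.
    Returns {name: (start, completion)} under FIFO scheduling."""
    time, result = 0, {}
    for name, arrival, runtime in tasks:
        start = max(time, arrival)          # wait for the task to arrive if needed
        time = start + runtime              # run it to completion, uninterrupted
        result[name] = (start, time)
    return result

# A, B, C all arrive at t=0 and run 10s each:
print(fifo_schedule([("A", 0, 10), ("B", 0, 10), ("C", 0, 10)]))
# {'A': (0, 10), 'B': (10, 20), 'C': (20, 30)}  -> average turnaround 20s
```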

However, reality is always cruel. If assumption 1 is broken, for example the running time of A becomes 100s while B and C still run for 10s, the schedule becomes:
[Figure: FIFO schedule where A runs 100s before B and C]

Following the FIFO principle, because A runs for a long time, B and C cannot be scheduled for a long while, and the average turnaround time deteriorates to 110s ((100+110+120)/3).

Therefore, the FIFO scheduling strategy easily leads to task starvation when task running times differ widely!

To solve this problem, we could schedule the shorter tasks B and C first. That is exactly the idea behind the SJF scheduling algorithm.

SJF: The shortest task first

SJF (Shortest Job First) selects, from the tasks that arrive at the same time, the one with the shortest running time to schedule first, then the next shortest, and so on.

For the workload in the previous section, scheduling with SJF looks like this; the average turnaround time becomes 50s ((10+20+120)/3), more than a 2x improvement over FIFO's 110s.
[Figure: SJF schedule running B and C before the long task A]
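SJF only changes the order in which the ready tasks are drained: pick the shortest first. A minimal non-preemptive sketch, assuming all tasks arrive at t=0:

```python
def sjf_schedule(tasks):
    """Non-preemptive SJF; all tasks are assumed to arrive at t=0.
    tasks: list of (name, arrival, runtime)."""
    time, result = 0, {}
    for name, _, runtime in sorted(tasks, key=lambda t: t[2]):
        result[name] = (time, time + runtime)   # shortest remaining work goes first
        time += runtime
    return result

# A runs 100s, B and C run 10s each, all arriving at t=0:
print(sjf_schedule([("A", 0, 100), ("B", 0, 10), ("C", 0, 10)]))
# B and C finish at 10s and 20s, A at 120s -> average turnaround 50s
```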

Let us now also break assumption 2: A arrives at t=0, while B and C arrive at t=10. The schedule then becomes:
[Figure: SJF schedule where B and C arrive at t=10 and must wait for A]

Because tasks B and C arrive later than A, they have to wait for A to finish before they get a chance to be scheduled, no matter how long A runs. The average turnaround time deteriorates to 103.33s ((100+(110-10)+(120-10))/3), and the starvation problem appears again!

STCF: The shortest time to complete first

To solve SJF's starvation problem, we need to break assumption 3, that is, allow a task to be interrupted while it is running. If B and C are scheduled as soon as they arrive, the problem is solved. This is preemptive scheduling, whose mechanism was described in the CPU context switching section: when the timer interrupt fires, the OS performs the context switch between tasks A and B.

Adding preemption to the cooperative SJF algorithm yields the STCF algorithm (Shortest Time-to-Completion First). Its scheduling principle: whenever a task arrives whose remaining running time is shorter than that of the current task, the current task is interrupted and the shorter task is scheduled first.

Scheduling this workload with STCF looks like this; the average turnaround time is optimized to 50s ((120+(20-10)+(30-10))/3), and the starvation problem is solved once again!
[Figure: STCF schedule preempting A when B and C arrive]
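STCF can be sketched as a tick-by-tick simulation: at every time unit, run the arrived but unfinished task with the least remaining time. This toy version ignores context-switch cost:

```python
def stcf_schedule(tasks):
    """Preemptive STCF, simulated one time unit at a time.
    tasks: list of (name, arrival, runtime). Returns completion times."""
    remaining = {name: runtime for name, _, runtime in tasks}
    arrival = {name: arr for name, arr, _ in tasks}
    completion, time = {}, 0
    while remaining:
        ready = [n for n in remaining if arrival[n] <= time]
        if not ready:                     # nothing has arrived yet: CPU idles
            time += 1
            continue
        # Preempt in favour of the task with the shortest remaining time.
        current = min(ready, key=lambda n: remaining[n])
        remaining[current] -= 1
        time += 1
        if remaining[current] == 0:
            completion[current] = time
            del remaining[current]
    return completion

# A (100s) arrives at t=0; B and C (10s each) arrive at t=10:
print(stcf_schedule([("A", 0, 100), ("B", 10, 10), ("C", 10, 10)]))
# {'B': 20, 'C': 30, 'A': 120} -> average turnaround (120 + 10 + 20) / 3 = 50s
```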

So far we have only cared about turnaround time. How do the FIFO, SJF and STCF scheduling algorithms fare on response time?

Assume the three tasks A, B and C all arrive at t=0 and each runs for 5s. Under these three algorithms the schedule looks like this, and the average response time is 5s ((0+(5-0)+(10-0))/3):
[Figure: FIFO/SJF/STCF schedule of three 5s tasks arriving at t=0]

To make matters worse, the average response time grows as the task running time grows, which is catastrophic for interactive tasks and seriously hurts the user experience. The root of the problem is that when tasks arrive at the same time and run for the same length of time, the last task has to wait for all the others to complete before it is scheduled for the first time.

To optimize response time, we turn to the familiar time-slice-based scheduling.

RR: round-robin scheduling based on time slice

The RR (Round Robin) algorithm allocates a time slice to each task. When a task's time slice is used up, the scheduler interrupts the current task and switches to the next one, and so on.

Note that the length of the time slice must be an integer multiple of the timer interrupt period. For example, if the timer interrupt fires every 2ms, a task's time slice can be set to 2ms, 4ms, 6ms... Otherwise, even after a task's time slice is exhausted, no timer interrupt occurs and the OS cannot switch tasks.

Now schedule with RR, giving A, B and C a 1s time slice each. The schedule looks like this, and the average response time is 1s ((0+(1-0)+(2-0))/3):
[Figure: RR schedule of A, B and C with a 1s time slice]

From the RR scheduling principle it is clear that the smaller the time slice, the smaller the average response time. But as the time slice shrinks, the number of task switches grows, and so does the cost of context switching. Setting the time slice is therefore a trade-off: we cannot blindly pursue response time while ignoring the cost of CPU context switches.
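A round-robin sketch with a configurable quantum (again ignoring context-switch cost) could look like the following; running it with a 1s quantum reproduces the response and turnaround times quoted in this section:

```python
from collections import deque

def rr_schedule(tasks, quantum):
    """Round robin. tasks: list of (name, runtime), all arriving at t=0."""
    queue = deque(tasks)
    time, first_run, completion = 0, {}, {}
    while queue:
        name, remaining = queue.popleft()
        first_run.setdefault(name, time)          # record first time on the CPU
        ran = min(quantum, remaining)
        time += ran
        if remaining > ran:
            queue.append((name, remaining - ran)) # quantum expired: back of the queue
        else:
            completion[name] = time
    return first_run, completion

# A, B, C each run 5s; a 1s quantum gives response times of 0s, 1s and 2s,
# and completion times of 13s, 14s and 15s:
print(rr_schedule([("A", 5), ("B", 5), ("C", 5)], quantum=1))
```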

The cost of a CPU context switch is more than just saving and restoring registers. While a program runs, it gradually builds up cached state in hardware such as the various levels of CPU cache, the TLB and the branch predictor. When the task is switched, these caches have to warm up again, which is a significant cost.

In addition, the average turnaround time of the RR schedule is 14s (((13-0)+(14-0)+(15-0))/3), much worse than the 10s (((5-0)+(10-0)+(15-0))/3) of FIFO, SJF and STCF. This confirms what was said earlier: turnaround time and response time are to some extent opposed. To optimize turnaround time, SJF and STCF are recommended; to optimize response time, RR is recommended.

The impact of I/O operations on scheduling

So far we have not considered any I/O operations. We know that when a task issues an I/O operation it does not occupy the CPU; it blocks waiting for the I/O to complete. Now let us break assumption 4: tasks A and B both arrive at t=0 and each needs 50ms of running time, but A issues an I/O operation that blocks for 10ms after every 10ms of running, while B performs no I/O.

If STCF is used for scheduling, the schedule looks like this:
[Figure: STCF schedule where the CPU idles during A's I/O]

As the figure shows, the total schedule for tasks A and B takes 140ms, longer than their combined actual running time of 100ms. Moreover, while A is blocked on I/O the scheduler does not switch to B, leaving the CPU idle!

To solve this problem, we only need to schedule with RR and give tasks A and B 10ms time slices. Then, while A is blocked on I/O, B can be scheduled; when B's time slice runs out, A happens to return from its I/O, and so on. The total schedule length is optimized to 100ms.
[Figure: RR schedule with 10ms slices overlapping B's execution with A's I/O]

This scheduling scheme relies on assumption 5: the scheduler must know in advance the running time of A and B, the duration of the I/O operations and so on, in order to make full use of the CPU. Reality is far more complicated: the I/O blocking time will not be the same every time, and the scheduler cannot know A's and B's running characteristics precisely. When assumption 5 is also broken, how should the scheduler be implemented so as to maximize CPU utilization while still scheduling sensibly?

Next we introduce a CPU scheduling algorithm that performs well even when all of the workload assumptions are broken, and that is adopted by many modern operating systems: MLFQ.

MLFQ: Multi-level feedback queue

The objectives of the MLFQ (Multi-Level Feedback Queue) scheduling algorithm are as follows:

  1. Optimize turnaround time.
  2. Reduce the response time of interactive tasks and improve user experience.

From the earlier analysis we know that to optimize turnaround time we should schedule short-running tasks first (as SJF and STCF do), and to optimize response time we should use time-slice-based scheduling like RR. But these two goals seem contradictory: reducing the response time tends to increase the turnaround time.

For MLFQ, the following two problems need to be solved:

  1. Without knowing a task's running information (running time, I/O behavior, etc.) in advance, how can turnaround time and response time be balanced?
  2. How can it learn from past scheduling decisions in order to make better ones in the future?

Prioritize tasks

The most notable difference between MLFQ and the algorithms introduced so far is the use of multiple priority queues to hold tasks of different priorities, together with the following two rules:

  • Rule 1: if Priority(A) > Priority(B), schedule A.
  • Rule 2: if Priority(A) = Priority(B), schedule A and B with the RR algorithm.

[Figure: multiple priority queues, with A and B above C]

Priority changes

MLFQ must also change task priorities over time. Otherwise, by Rule 1 and Rule 2, task C in the figure above would never get a chance to run until A and B had finished, giving C a very long response time. So the following priority-change rules are added:

  • Rule 3: when a new task arrives, put it in the highest-priority queue.
  • Rule 4a: if a task uses up an entire time slice without voluntarily giving up the CPU (for example for an I/O operation), its priority is lowered by one level.
  • Rule 4b: if a task gives up the CPU before its time slice runs out, its priority stays the same.

Rule 3 ensures that every newly arrived task gets a chance to be scheduled, avoiding task starvation.

Rules 4a and 4b reflect the observation that most interactive tasks run briefly and give up the CPU frequently, so to guarantee their response time they keep their current priority; CPU-intensive tasks, on the other hand, usually do not care much about response time, so their priority can be lowered.

Under the rules above, when a long-running task A arrives, the schedule looks like this:
[Figure: long-running task A gradually dropping to lower priority queues]

If a short task B arrives when A has run to t=100, the schedule looks like this:
[Figure: short task B entering the top queue at t=100 and finishing before A]

From this schedule we can see that MLFQ has the advantage of STCF: it can finish short-running tasks first and shorten the turnaround time.

If interactive task C arrives when A has run to t=100, the schedule looks like this:
[Figure: interactive task C staying at the top priority while A fills in during C's I/O]

Whenever the current task blocks, MLFQ selects another task to run according to priority, so the CPU does not go idle. In the figure above, while task C is blocked on I/O, task A gets the time slice; when C returns from I/O, A is preempted again, and so on. In addition, because C voluntarily gives up the CPU within its time slice, C's priority stays the same, which effectively preserves the user experience of interactive tasks.

CPU-intensive tasks starve to death

So far, MLFQ seems able to take care of both turnaround time and the response time of interactive tasks. Is it really perfect?

Consider the following scenario: when task A has run to t=100, interactive tasks C and D arrive at the same time. The schedule becomes:
[Figure: interactive tasks C and D monopolizing the CPU while A starves]

As the figure shows, if there are many interactive tasks on the system, CPU-intensive tasks may starve!

To solve this problem, the following rule can be added:

  • Rule 5: after the system has run for some period S, move all tasks to the highest-priority queue (Priority Boost).

With this rule added, and assuming S is set to 50ms, the schedule becomes the following and the starvation problem is solved:

[Figure: periodic priority boost letting A run again]

Malicious task problem

Consider the following malicious task E. In order to occupy the CPU for a long time, task E deliberately issues an I/O operation when only 1% of its time slice remains, and returns from it quickly. By Rule 4b, E stays on the original highest-priority queue, so the schedule looks like this:

[Figure: task E gaming Rule 4b to monopolize the top queue]

To solve this problem, Rule 4 needs to be adjusted as follows:

  • Rule 4: each priority level is assigned a time-slice allotment; once a task uses up its allotment at that level, its priority drops one level (whether or not it gave up the CPU in the meantime).

With the new Rule 4 applied to the same workload, the schedule becomes the following, and the problem of the malicious task E hogging the CPU no longer occurs:

[Figure: task E demoted once it exhausts its allotment at each level]

This completes the introduction of MLFQ's basic principles. Finally, let us summarize the five key rules of MLFQ:

  • Rule 1: if Priority(A) > Priority(B), schedule A.
  • Rule 2: if Priority(A) = Priority(B), schedule A and B with the RR algorithm.
  • Rule 3: when a new task arrives, put it in the highest-priority queue.
  • Rule 4: each priority level is assigned a time-slice allotment; once a task uses up its allotment at that level, its priority drops one level.
  • Rule 5: after the system has run for some period S, move all tasks to the highest-priority queue (Priority Boost).
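To tie the five rules together, here is a deliberately simplified MLFQ sketch. It models CPU-bound tasks only (no I/O, so Rule 4b never triggers), does not preempt a running task mid-slice when a higher-priority task arrives, and uses made-up level counts, allotments and boost interval; it is meant only to show Rules 1 to 5 interacting, not how a real kernel implements them.

```python
from collections import deque

def mlfq_schedule(tasks, levels=3, quantum=(2, 4, 8), boost_every=50):
    """Toy MLFQ for CPU-bound tasks. tasks: list of (name, arrival, runtime);
    quantum[i] is the allotment at priority level i (0 = highest)."""
    queues = [deque() for _ in range(levels)]
    remaining, completion = {}, {}
    pending = sorted(tasks, key=lambda t: t[1])   # tasks that have not arrived yet
    time, next_boost = 0, boost_every

    def admit_arrivals():
        while pending and pending[0][1] <= time:
            name, _, runtime = pending.pop(0)
            remaining[name] = runtime
            queues[0].append(name)                # Rule 3: new tasks start at the top

    while pending or any(queues):
        admit_arrivals()
        if time >= next_boost:                    # Rule 5: periodic priority boost
            for q in queues[1:]:
                while q:
                    queues[0].append(q.popleft())
            next_boost += boost_every
        if not any(queues):
            time += 1                             # nothing runnable yet: CPU idles
            continue
        level = next(i for i, q in enumerate(queues) if q)   # Rules 1 and 2
        name = queues[level].popleft()
        run = min(quantum[level], remaining[name])
        time += run                               # run until done or allotment used
        remaining[name] -= run
        admit_arrivals()
        if remaining[name] == 0:
            completion[name] = time
        else:                                     # Rule 4: allotment used up, demote
            queues[min(level + 1, levels - 1)].append(name)
    return completion

# Long task A arrives at t=0; short task B arrives at t=100,
# enters the top queue and finishes long before A:
print(mlfq_schedule([("A", 0, 200), ("B", 100, 10)]))
```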

Now, back to the two questions raised at the beginning of this section:

1. Without knowing a task's running information (running time, I/O operations, etc.) in advance, how does MLFQ balance turnaround time and response time?

When it cannot tell in advance whether a task is long-running or short-running, MLFQ first assumes it is a short-running task. If the assumption is right, the task completes quickly and both turnaround time and response time are optimized; if it is wrong, the task's priority is gradually lowered, giving other short-running tasks more scheduling opportunities.

2. How does MLFQ learn from historical scheduling in order to make better decisions in the future?

MLFQ mainly judges whether a task is interactive by whether it voluntarily gives up the CPU. If it does, the task is kept at its current priority, guaranteeing its scheduling priority and improving the responsiveness of interactive tasks.

Of course, MLFQ is not a perfect scheduling algorithm; it has problems of its own. One of the most troublesome is tuning MLFQ's many parameters, such as the number of priority queues, the length of the time slices and the interval between Priority Boosts. There are no perfect reference values for these parameters; they can only be set according to the workload.

For example, the time slice of tasks in the low-priority queues can be set longer: low-priority tasks are usually CPU-intensive and do not care much about response time, and a longer time slice reduces the cost of context switching.

CFS: Linux's completely fair scheduling

In this section we introduce the scheduling algorithm we most often deal with in practice: CFS (Completely Fair Scheduler) on Linux. Unlike the MLFQ of the previous section, CFS does not aim to optimize turnaround time or response time; instead it aims to divide the CPU fairly among tasks.

Of course, CFS also allows priorities to be set on processes, letting users and administrators decide which processes should get more scheduling time.

Basic principles

Most scheduling algorithms are based on fixed time slices, but CFS takes a different approach: it schedules based on an accounting value called the virtual runtime (vruntime).

CFS maintains a vruntime value for each task and accumulates it whenever the task runs. For example, when task A runs for a 5ms time slice, its vruntime is updated: vruntime += 5ms. On the next scheduling decision, CFS selects the task with the smallest vruntime, for example:
[Figure: CFS picking the task with the smallest vruntime]

When should CFS switch tasks? Switching more often makes scheduling fairer, but the context-switching cost is higher. CFS therefore exposes a configurable parameter, sched_latency, that lets users influence the switching interval. CFS sets the time slice given to each task to time_slice = sched_latency / n (where n is the current number of tasks), so that within every sched_latency period each task gets an equal share of the CPU, guaranteeing fairness.

For example, if sched_latency is set to 48ms and there are currently 4 tasks A, B, C and D, each task's time slice is 12ms; after C and D finish, the time slice of A and B is updated to 24ms:
[Figure: time slices growing from 12ms to 24ms after C and D exit]

From this principle it follows that, with sched_latency fixed, as the number of tasks grows, each task's time slice shrinks and the cost of task switching grows. To avoid excessive switching cost, CFS provides another configurable parameter, min_granularity, which sets the minimum time slice of a task. For example, with sched_latency set to 48ms and min_granularity set to 6ms, even if there are 12 runnable tasks, each task's time slice is 6ms rather than 4ms.
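A stripped-down sketch of this bookkeeping, with equal-weight tasks and the sched_latency and min_granularity values from the example above (pick_next uses a plain dictionary where real CFS uses a red-black tree):

```python
SCHED_LATENCY = 48      # ms: period within which every runnable task should run once
MIN_GRANULARITY = 6     # ms: lower bound on a single time slice

def time_slice(n_tasks):
    return max(SCHED_LATENCY / n_tasks, MIN_GRANULARITY)

def pick_next(vruntimes):
    """Pick the runnable task with the smallest vruntime."""
    return min(vruntimes, key=vruntimes.get)

# Four equal-weight tasks: each gets a 12 ms slice per scheduling period.
vruntimes = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}
for _ in range(8):
    task = pick_next(vruntimes)
    vruntimes[task] += time_slice(len(vruntimes))   # charge the task for its CPU time
print(vruntimes)   # every task has accumulated 24 ms of vruntime
```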

Assign weights to tasks

Sometimes we want to give an important business process more CPU time and the other, less important processes less. But with the basic mechanism described above, every task scheduled by CFS gets an equal share of the CPU. Is there a way to do this?

We can assign weights to tasks, so that tasks with higher weights get more CPU!

After adding the weighting mechanism, the time slice of task i is calculated as:

time_slice_i = weight_i / (sum of all tasks' weights) * sched_latency

For example, with sched_latency still set to 48ms and two tasks A and B, where A's weight is 1024 and B's weight is 3072, the formula above gives A a time slice of 12ms and B a time slice of 36ms.

As the previous section showed, CFS always selects the task with the smallest vruntime to run, and after each run the accounting rule is vruntime += runtime, so changing only the time-slice formula is not enough. The vruntime accounting rule is therefore adjusted to:

vruntime_i += weight_0 / weight_i * runtime_i   (where weight_0 is the weight of the default priority, 1024)

Continuing the previous example and assuming neither A nor B performs I/O, the schedule after updating the vruntime rule looks like this; task B is allocated more CPU than task A:
[Figure: weighted CFS schedule giving B three times as much CPU as A]
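As a numeric check of the example above, the snippet below applies the two weighted formulas, assuming (as in Linux) that 1024 is the weight of the default priority:

```python
SCHED_LATENCY = 48
WEIGHT_0 = 1024                          # weight of the default priority (nice 0)

def time_slice(weight, all_weights):
    return weight / sum(all_weights) * SCHED_LATENCY

def vruntime_delta(weight, runtime):
    return WEIGHT_0 / weight * runtime   # higher weight => vruntime grows more slowly

weights = {"A": 1024, "B": 3072}
for name, w in weights.items():
    ts = time_slice(w, weights.values())
    print(name, ts, vruntime_delta(w, ts))
# A gets a 12 ms slice and its vruntime grows by 12;
# B gets a 36 ms slice but its vruntime also grows by only 12,
# so A and B keep alternating while B receives 3x as much CPU time.
```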

Use red-black trees to improve vruntime search efficiency

Each time CFS switches tasks it selects the task with the smallest vruntime, so it needs a data structure that stores every runnable task together with its vruntime.

The most intuitive choice is an ordered linked list sorted by vruntime. When switching tasks, CFS would only need to take the task at the head of the list, with O(1) time complexity. For example, with 10 tasks whose vruntimes are kept in the sorted list [1, 5, 9, 10, 14, 17, 18, 21, 22, 24], every insertion or deletion costs O(N), and the cost grows linearly with the number of tasks!

To balance the efficiency of queries, insertions and deletions, CFS uses a red-black tree to store the task and vruntime information. The complexity of query, insert and delete then becomes O(log N) and does not grow linearly with the number of tasks, which greatly improves efficiency.
[Figure: red-black tree of runnable tasks keyed by vruntime]

In addition, to keep storage efficient, CFS only keeps tasks in the Running state in the red-black tree.

Coping with I/O and sleep

The strategy of always selecting the task with the smallest vruntime also has a starvation problem. Consider two tasks A and B with a 1s time slice. At first A and B take turns running. Then, after some scheduling, B goes to sleep, say for 10s. When B wakes up, vruntime_B will be 10s smaller than vruntime_A, so for the next 10s B will always be scheduled and task A will starve.
[Figure: after waking, B's small vruntime lets it monopolize the CPU]

To solve this problem, CFS stipulates that when a task returns from sleep or I/O, its vruntime is set to the minimum vruntime currently in the red-black tree. In the example above, after B wakes up from sleep, vruntime_B is set to 11, so task A will not starve.
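In the toy model this rule is a one-line clamp on wakeup (hypothetical helper, mirroring the rule just described):

```python
def on_wakeup(task, vruntimes):
    """Called when `task` returns from sleep or I/O.
    vruntimes maps the currently runnable tasks to their vruntime values."""
    if vruntimes:
        # Align with the smallest vruntime among runnable tasks, so the woken
        # task cannot monopolize the CPU to "pay back" its sleep time.
        vruntimes[task] = min(vruntimes.values())
    else:
        vruntimes[task] = 0.0
```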

This approach does have a flaw: if a task sleeps for only a very short time, it will still be scheduled first when it wakes, which is unfair to the other tasks.

Closing words

This article has spent considerable length explaining the principles of several common CPU scheduling algorithms. Each algorithm has its own advantages and disadvantages, and there is no perfect scheduling strategy. In practice, we need to choose an appropriate scheduling algorithm for the actual workload, configure reasonable scheduling parameters, and weigh turnaround time against response time, and task fairness against switching cost. All of this confirms the famous line from "Fundamentals of Software Architecture": Everything in software architecture is a trade-off.

The scheduling algorithms described in this article were all analyzed on a single-core processor; scheduling on multi-core processors is much more complicated, since it must also consider, for example, synchronization of shared data between processors and cache affinity. But the essential principles are still rooted in the basic scheduling algorithms described here.

References

  1. Operating Systems: Three Easy Pieces, Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
  2. Computer System Fundamentals (3): Exceptions, Interrupts and Input/Output


