
Concurrency in Go, in the form of goroutines, is very convenient, but have you ever wondered how goroutines actually get run? In this article we will study the design of the Go runtime scheduler in depth, and see how to use it during performance debugging to interpret the scheduler trace information of a Go program.

To understand why a runtime scheduler exists and how it works, we first have to go back to the history of operating systems; without understanding the root of the problem, the answer will not make sense.

The history of the operating system

  1. Single user (no operating system)
  2. Batch processing (one job at a time, run to completion)
  3. Multiprogramming

The purpose of multiprogramming is to overlap the CPU and I/O of different jobs. How was this done?

  • Multiprogrammed batch processing
    IBM OS/MFT (Multiprogramming with a Fixed number of Tasks)
  • Multiprogrammed batch processing
    IBM OS/MVT (Multiprogramming with a Variable number of Tasks) - here, each job gets only the amount of memory it needs. That is, the partitioning of memory changes as jobs come and go.
  • Time-sharing
    This is multiprogramming with rapid switching between jobs. Deciding when to switch, and which job to switch to, is called scheduling.

Most modern operating systems use a time-sharing scheduler.

What do these schedulers schedule?

  1. Processes - different programs in execution
  2. Threads - the basic unit of CPU utilization, a subset of a process

These come at a price.

Scheduling cost

Because process creation is time-consuming and resource-intensive, it is more efficient to use a single process containing multiple threads. But then multithreading hit its own limits, most famously the C10k problem.

For example, if the scheduler period is defined as 10 ms (milliseconds) and there are 2 threads, each thread gets 5 ms. With 5 threads, each thread gets 2 ms. But what if there are 1000 threads? Do we give each thread a 10 µs (microsecond) time slice? At that point you would spend most of your time on context switching and get no real work done.

We need to bound the length of the time slice. If the minimum time slice is 2 ms and there are 1000 threads, the scheduler period grows to 2 s (seconds); with 10,000 threads, it grows to 20 s. In this simple example, if every thread uses its full time slice, it takes 20 seconds for all threads to run once. So we need something that makes concurrency cheap without causing too much overhead.

User-level threads
• Threads are managed entirely by the runtime system (a user-level library).
• Ideally fast and efficient: switching threads is not much more expensive than a function call.
• The kernel knows nothing about user-level threads and manages the process as if it were single-threaded.

In Go, we know these as "goroutines" (logically).

Goroutine

A goroutine is a lightweight thread managed by the Go runtime (logically, a thread of execution). To start one, place the go keyword before a function call.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(11)
    for i := 0; i <= 10; i++ {
        go func(i int) {
            defer wg.Done()
            fmt.Printf("loop i is - %d\n", i)
        }(i)
    }
    wg.Wait()
    fmt.Println("Hello, Welcome to Go")
}
Output:

loop i is - 10
loop i is - 0
loop i is - 1
loop i is - 2
loop i is - 3
loop i is - 4
loop i is - 5
loop i is - 6
loop i is - 7
loop i is - 8
loop i is - 9
Hello, Welcome to Go

Looking at the output, two questions arise.

  1. How do 11 goroutines run concurrently?
  2. In what order do goroutines run?

These two questions prompt further thinking:

  1. How should these goroutines be distributed over multiple OS threads running on the available CPU cores?
  2. In what order should these goroutines run to maintain fairness?

The rest of the discussion will focus on how the Go runtime scheduler solves these problems, from a design perspective. A scheduler may target any of many possible goals; for our case, we will limit ourselves to the following requirements.

  1. It should be parallel, scalable, and fair.
  2. Each process should be scalable to millions of goroutines (10⁶)
  3. Memory efficient. (RAM is cheap, but not free.)
  4. System calls should not cause performance degradation. (Maximize throughput and minimize waiting time)

So let's start modeling the scheduler to solve these problems step by step.

1. One kernel thread per goroutine (1:1 scheduling)

Limitations

1. Parallel and scalable.
* Parallel: yes
* Scalable: yes
2. But each process cannot scale to millions of goroutines (10⁶).

2. M:N hybrid threading

M kernel threads execute N "goroutines"

Kernel threads are required for the actual execution of code and for parallelism, but they are expensive to create, so we map N goroutines onto M kernel threads. A goroutine is Go code, so we have full control over it; it also lives in user space, so it is cheap to create.

But the operating system knows nothing about goroutines. Instead, each goroutine carries a state, which helps the scheduler decide which goroutine to run next. Compared with a kernel thread, this state information is tiny, which makes goroutine context switches very fast.

  • Running - the goroutine currently executing on a kernel thread.
  • Runnable - the goroutine is waiting for a kernel thread to run it.
  • Blocked - the goroutine is waiting on some condition (for example, blocked on a channel, a system call, a mutex, etc.)
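For reference, the runtime tracks these states as status constants on each goroutine. Below is a simplified excerpt of the constants defined in runtime/runtime2.go (several further states, such as _Gdead and _Gcopystack, are omitted here):

// Simplified excerpt of goroutine status constants (runtime/runtime2.go).
const (
    _Gidle = iota // just allocated, not yet initialized
    _Grunnable    // on a run queue, waiting for a kernel thread to run it
    _Grunning     // currently executing on a kernel thread (M)
    _Gsyscall     // executing a system call, not running Go code
    _Gwaiting     // blocked in the runtime (channel, mutex, timer, ...)
)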


Two kernel threads running two goroutines at a time

Therefore, the Go runtime scheduler multiplexes N goroutines onto M kernel threads, managing goroutines in all of these states.

A simple M:N scheduler
In our simple M:N scheduler, we have a global run queue; certain operations put a new goroutine into it, and the M kernel threads come to the scheduler to take goroutines from the run queue to execute. Since multiple threads access the same memory, we lock this structure with a mutex for memory-access synchronization. A toy sketch follows the figure below.


Simple M:N
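As a toy illustration of this design (not the runtime's actual implementation), a mutex-guarded global run queue might look like this:

package main

import (
    "fmt"
    "sync"
)

// G stands in for a goroutine descriptor in this toy model.
type G struct{ id int }

// GlobalRunQueue is shared by all worker threads and therefore
// protected by a mutex for memory-access synchronization.
type GlobalRunQueue struct {
    mu sync.Mutex
    gs []*G
}

func (q *GlobalRunQueue) put(g *G) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.gs = append(q.gs, g)
}

func (q *GlobalRunQueue) get() *G {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.gs) == 0 {
        return nil
    }
    g := q.gs[0]
    q.gs = q.gs[1:]
    return g
}

func main() {
    q := &GlobalRunQueue{}
    q.put(&G{id: 1})
    q.put(&G{id: 2})
    fmt.Println(q.get().id) // 1 - note that every access pays for the lock
}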

Where do the blocked goroutines go?
Some examples of operations on which a goroutine can block:

  1. Sending and receiving on a channel.
  2. Network I/O.
  3. Blocking system calls.
  4. Timers.
  5. Mutexes.

So, where do we put these blocked goroutines?

A blocked goroutine should not block the underlying kernel thread! (This avoids the cost of thread context switching.)

Goroutines blocked on a channel operation
Each channel has a recvq (waitq) that stores the blocked goroutines trying to receive data from the channel, and a sendq (waitq) that stores the blocked goroutines trying to send data to the channel.


Goroutine was blocked during channel operation.

After the channel operation, the channel puts the now-unblocked goroutine back into the run queue.


The unblocked goroutine is put back on the run queue after the channel operation
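These wait queues are real fields on the runtime's channel structure. Below is a simplified excerpt of runtime/chan.go (most fields are omitted; this is for illustration, not runnable on its own):

// Simplified excerpt of the runtime channel structure (runtime/chan.go).
type hchan struct {
    buf   unsafe.Pointer // ring buffer for buffered channels
    recvq waitq          // goroutines blocked trying to receive
    sendq waitq          // goroutines blocked trying to send
    lock  mutex          // protects all fields in hchan
}

// waitq is a linked list of parked goroutines (wrapped in sudog records).
type waitq struct {
    first *sudog
    last  *sudog
}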

What about system calls?

First, let's look at blocking system calls. A blocking system call blocks the underlying kernel thread, so we cannot schedule any other goroutine on that thread.

Blocking system calls therefore reduce the level of parallelism.


No other goroutine can be scheduled on thread M2, resulting in wasted CPU.

The way to restore parallelism is to wake up another thread when we enter a system call; that thread picks a runnable goroutine from the run queue.


Now, when the system call completes, running its goroutine right away would leave us with more running threads than cores. To avoid this over-scheduling, we do not run the goroutine returning from a blocking system call immediately; instead, we put it back into the scheduler's run queue.


Avoiding over-subscribed scheduling

So while our program is running, the number of threads may exceed the number of cores. Although not stated explicitly, all idle threads are also managed by the runtime, to avoid ending up with too many of them.

The initial limit is 10,000 threads; if a program exceeds it, it crashes.

Non-blocking system calls - the goroutine is blocked on the integrated runtime poller instead, and the thread is released to run another goroutine.

Take non-blocking I/O such as an HTTP call as an example. The first system call, following the workflow above, will not succeed because the resource is not yet ready, which forces Go to use the network poller and park the goroutine.

This is part of the implementation behind reads in the net package (simplified):

for {
    n, err := syscall.Read(fd.Sysfd, p)
    if err != nil {
        n = 0
        if err == syscall.EAGAIN && fd.pd.pollable() {
            // Not ready yet: park this goroutine on the runtime poller
            // until the fd becomes readable, then retry the read.
            if err = fd.pd.waitRead(fd.isFile); err == nil {
                continue
            }
        }
    }
    // ... further error handling elided ...
    return n, err
}

Once the first system call has completed and explicitly indicated that the resource is not yet ready, the goroutine is parked until the network poller notifies it that the resource is ready. In this case, thread M is not blocked.

The poller uses select, kqueue, epoll, or IOCP, depending on the operating system, to learn which file descriptors are ready; once a file descriptor is ready for reading or writing, it puts the goroutine back on the run queue.
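In practice this is transparent to the programmer: many goroutines can wait on network I/O at once while only a handful of threads exist. A minimal illustration (the URL is a placeholder):

package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // While waiting for the response, this goroutine is parked on
            // the netpoller; the underlying thread is free to run others.
            resp, err := http.Get("http://example.com") // placeholder URL
            if err != nil {
                return
            }
            resp.Body.Close()
        }()
    }
    wg.Wait()
    fmt.Println("all requests done")
}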

There is also a sysmon OS thread: if the network has not been polled for more than 10 milliseconds, it polls it in the background and adds the ready Gs to the queue.

Essentially, goroutines block on:

  1. channel
  2. Mutex
  3. Network IO
  4. Timer

The runtime now has a scheduler with the following capabilities.

  • It can handle parallel execution (multithreaded).
  • Handle blocking system calls and network I/O.
  • Handle blocking calls at the user level (on the channel).

But this is not scalable.


A global run queue with a mutex

As you can see in the figure, we have a global run queue guarded by a mutex, and we eventually run into problems such as:

  1. The cost of cache-coherence guarantees.
  2. Fierce lock contention when creating, destroying, and scheduling goroutines (G).

We overcome these scalability problems with a distributed scheduler.

Distributed scheduler - one run queue per thread.


Distributed Run Queue Scheduler

The immediate benefit of this design is that each per-thread local run queue no longer needs a mutex. There is still a global run queue with a mutex, but it is used only in special cases and does not affect scalability.

Now, we have multiple run queues:

  1. Local run queue
  2. Global run queue
  3. Network poller

Where should we run the next goroutine from?

In Go, the polling order is defined as follows.

  1. Local run queue
  2. Global run queue
  3. Network poller
  4. Work stealing

That is: check the local run queue first; if it is empty, check the global run queue; then check the network poller; and finally steal work. We have now covered 1, 2, and 3; let's look at work stealing.

Work stealing

If the local run queue is empty, try to steal work from other queues.


"Stealing" Work

Work stealing solves the problem that arises when one thread has too much work to do while another sits idle. In Go, if the local queue is empty, work stealing tries one of the following (see the sketch after this list).

  • Pull work from the global queue.
  • Pull work from the network poller.
  • Steal work from other local queues.
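Here is a runnable toy model of this polling order. All names and types are illustrative, not the runtime's real API (the real logic lives in the runtime's findrunnable):

package main

import "fmt"

// Toy goroutine and processor types, for illustration only.
type gor struct{ id int }

type proc struct{ local []*gor }

var (
    global []*gor // global run queue
    ready  []*gor // goroutines made runnable by the network poller
    allP   []*proc
)

func pop(q *[]*gor) *gor {
    if len(*q) == 0 {
        return nil
    }
    g := (*q)[0]
    *q = (*q)[1:]
    return g
}

// findRunnable follows the documented order: local queue, global queue,
// network poller, then stealing half of another P's local queue.
func findRunnable(self *proc) *gor {
    if g := pop(&self.local); g != nil {
        return g
    }
    if g := pop(&global); g != nil {
        return g
    }
    if g := pop(&ready); g != nil {
        return g
    }
    for _, victim := range allP {
        if victim == self || len(victim.local) == 0 {
            continue
        }
        half := (len(victim.local) + 1) / 2
        stolen := victim.local[:half]
        victim.local = victim.local[half:]
        self.local = append(self.local, stolen[1:]...)
        return stolen[0]
    }
    return nil // nothing to do: this thread would park
}

func main() {
    p1 := &proc{local: []*gor{{1}, {2}, {3}, {4}}}
    p2 := &proc{} // empty local queue: must steal from p1
    allP = []*proc{p1, p2}
    fmt.Printf("p2 runs g%d after stealing\n", findRunnable(p2).id)
}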

So far, Go has a Scheduler with the following functions at runtime.

  • It can handle parallel execution (multithreaded).
  • Handle blocking system calls and network I/O.
  • Handle blocking calls at the user level (on the channel).
  • Scalable

But this is not efficient.

Remember how we restored parallelism during blocking system calls?

The consequence is that at any time there may be many kernel threads in system calls (10, maybe 1000), possibly far more than the number of cores. We end up paying a per-thread overhead during:

  • Work stealing - it must scan the local run queues of all kernel threads (ideally, those with runnable goroutines) at once, and most of them will be empty.
  • Garbage collection and the memory allocator suffer from the same scanning problem.

We overcome this efficiency problem with M:P:N threading.

3. M:P:N threading (the 3-level scheduler) - introducing the logical processor P

P - a processor, which can be regarded as a local scheduler running on a thread.


M:P:N thread

The number of logical processors P is always fixed (by default, the number of logical CPUs usable by the current process; it can be changed with GOMAXPROCS).
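You can query both values with the standard runtime package (printed numbers depend on your machine):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Number of logical CPUs usable by the current process.
    fmt.Println("logical CPUs:", runtime.NumCPU())
    // GOMAXPROCS(0) reports the current number of Ps without changing it.
    fmt.Println("GOMAXPROCS :", runtime.GOMAXPROCS(0))
}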

The local run queues (LRQs) are attached to this fixed number of logical processors (P).


Distributed three-level run queue scheduler

When a Go program starts, the runtime first creates a fixed number of logical processors P, based on the machine's number of logical CPUs (or as requested).

Each goroutine (G) runs on an OS thread (M) that is assigned to a logical processor (P).

Therefore, the overhead is now constant during:

  • Work stealing - we only have to scan the local run queues of a fixed number of logical processors (P).
  • Garbage collection and the memory allocator get the same benefit.

What about system calls, now that we have a fixed number of logical processors (P)?

Go optimizes system calls, whether they block or not, by wrapping them in the runtime.


The blocking system call wrapper

The blocking syscall path is wrapped between runtime.entersyscall (SB) and runtime.exitsyscall (SB).
Literally, some logic is executed before entering the system call, and some logic is executed after exiting it. When a blocking system call is made, this wrapper automatically detaches P from thread M, allowing another thread to run on it.
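Conceptually, the wrapper looks like the sketch below. This is pseudocode: entersyscall and exitsyscall are unexported runtime internals, and rawSyscall stands in for the actual kernel trap.

// Pseudocode sketch of how the runtime brackets a blocking system call.
func Syscall(trap, a1, a2, a3 uintptr) (r1, r2, errno uintptr) {
    entersyscall() // hand the P off so another M can run its goroutines
    r1, r2, errno = rawSyscall(trap, a1, a2, a3) // may block this M
    exitsyscall() // try to reacquire a P (the same one, or an idle one)
    return
}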


Handing off P during a blocking system call

This allows the Go runtime to handle blocking system calls efficiently without growing the run queue.

What happens when a blocking syscall exits?

  • The runtime tries to acquire the exact same P again and resume execution.
  • Failing that, the runtime tries to get a P from the idle list and resume execution.
  • Otherwise, the runtime puts the goroutine on the global queue and returns the associated M to the free list.

Spinning threads and idle threads

What happens when thread M2 becomes idle after its syscall returns? What should we do with the idle M2 thread? In theory, a thread that has finished what it needs to do should be destroyed by the operating system, so that threads of other processes can be scheduled onto the CPU. This is what we often call "preemptive scheduling" of threads in the operating system.

Consider the syscall situation above. If we destroy thread M2, and thread M3 is then about to enter a syscall, at that moment the runnable goroutines cannot be served until a new kernel thread is created and scheduled by the OS. Frequent thread creation, destruction, and preemption not only increases the load on the OS, it is almost unacceptable for programs with high performance requirements.

Therefore, to make proper use of OS resources and avoid loading the OS with frequent thread preemption, we do not destroy kernel thread M2; instead, we let it spin, keeping it around for future use. Although this looks like a waste of resources, compared with frequent preemption between threads and frequent creation and destruction, an idle thread is still the cheaper trade-off.

Spinning thread - for example, in a Go program with one kernel thread M (1) and one logical processor P (1), if the executing M is blocked by a syscall, a number of "spinning threads" equal to the number of Ps is required so that the waiting runnable goroutines can continue to execute. During this period, the number of kernel threads M is therefore greater than the number of Ps (spinning threads + blocked threads). This means that even when runtime.GOMAXPROCS is set to 1, the program is in a multithreaded state.

What about fairness of scheduling? - Fairly choosing the goroutine to execute next.

Like many other schedulers, Go has fairness constraints, imposed by the implementation: a runnable goroutine should eventually run.

The following are four typical fairness constraints in the Go runtime scheduler.

Any goroutine that runs for more than 10 ms is marked as preemptible (a soft limit). However, preemption is only performed at function prologues; Go currently relies on cooperative preemption points inserted by the compiler into function prologues.

  • Infinite loop - preemption (~10ms time slice) - soft limit

But be careful with infinite loops, because Go's scheduler was not preemptive (up to and including Go 1.13). If a loop does not contain any preemption points (such as function calls or memory allocation), it prevents other goroutines from running. A simple example:

package main
func main() {
    go println("goroutine ran")
    for {}
}

Run the command:

GOMAXPROCS=1 go run main.go

Before Go 1.14, the statement is never printed: lacking any preemption point, the main goroutine can occupy the processor indefinitely.

  • Local run queue - preemption (~10ms time slice) - soft limit
  • Global run queue starvation is avoided by checking the global run queue once every 61 scheduler ticks.
  • Network poller starvation - a background thread polls the network occasionally, in case it is not being polled by the main worker thread.

Go 1.14 introduces a new "non-cooperative preemption".

The Go runtime now has a scheduler with all the necessary features.

  • It can handle parallel execution (multithreaded).
  • Handle blocking system calls and network I/O.
  • Handle blocking calls at the user level (on the channel).
  • Scalable
  • Efficient.
  • Fair.

This enables a huge amount of concurrency, and it always tries to achieve maximum utilization and minimum latency.

Now that we have a general understanding of the Go runtime scheduler, how do we make use of it? Go provides a trace tool, the scheduler trace, designed to give insight into the behavior of the goroutine scheduler and to debug scalability issues related to it.

Scheduler tracing

Run a Go program with the GODEBUG=schedtrace=DURATION environment variable to enable scheduler tracing, where DURATION is the output period in milliseconds.
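For example, tracing once per second might look like this (the numbers below are illustrative; the fields show GOMAXPROCS, idle Ps, thread counts, spinning and idle threads, the global run queue length, and the per-P local run queue lengths in brackets):

$ GODEBUG=schedtrace=1000 ./myapp
SCHED    0ms: gomaxprocs=4 idleprocs=2 threads=6 spinningthreads=1 idlethreads=2 runqueue=0 [1 0 0 0]
SCHED 1000ms: gomaxprocs=4 idleprocs=0 threads=6 spinningthreads=0 idlethreads=1 runqueue=3 [4 1 0 2]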

