Go study notes - GMP detailed explanation - IT小马の菜园

The origin of the Golang scheduler

Single-process problems

With a single execution flow, the computer can only process one task at a time.
A blocked process wastes CPU time.

Multi-process and multi-threading problems

The more processes/threads there are, the higher the switching cost.
Multithreading brings synchronization contention (locks, resource conflicts).
High memory usage: a process occupies 4 GB of virtual memory (on a 32-bit OS), and a thread occupies about 4 MB.
High CPU scheduling overhead.

Threads and coroutines

Threads are scheduled by the operating system kernel. Scheduling is preemptive, and even a basic switch has to trap into kernel mode.
Coroutines are scheduled in user mode. Scheduling is cooperative: the next coroutine runs only after the current one gives up the CPU.

Types of coroutine-to-thread relationship

N:1 relationship: N coroutines are bound to 1 thread

Advantages: Coroutine switching is done entirely in user mode on that thread and never traps into kernel mode, so it is very lightweight and fast.

Disadvantages: It cannot use multiple cores, and if one coroutine blocks, none of the other coroutines can run, so there is no real concurrency.

1:1 relationship: 1 coroutine is bound to 1 thread

Advantages: Coroutine scheduling is done by the operating system, so the N:1 disadvantages do not apply.

Disadvantages: Creating, destroying, and switching coroutines is all done by the operating system, which makes these operations too expensive.

M:N relationship: M coroutines are bound to N threads

Advantages: It can take advantage of multiple cores.

Disadvantages: It relies heavily on the optimization and algorithms of the coroutine scheduler.

goroutine

Coroutines in Go are called goroutines. They are very lightweight: a goroutine occupies only a few KB.

Go multiplexes goroutines onto a set of threads. Even if one goroutine blocks, the runtime can move the other goroutines of that thread to other runnable threads. Features:

Low memory footprint (a few KB each), which supports high concurrency
More flexible scheduling (done by the runtime), with a low switching cost
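As a rough illustration of how cheap goroutines are, the sketch below starts a large number of them and waits for all of them to finish. The helper name runMany and the count of 100000 are made up for this example; it is not a benchmark.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runMany starts n goroutines, each of which increments a shared counter
// once, and returns the counter value after all of them have finished.
// (runMany is a hypothetical helper used only for this illustration.)
func runMany(n int) int64 {
	var wg sync.WaitGroup
	var done int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1)
		}()
	}
	wg.Wait() // block until every goroutine has called Done
	return atomic.LoadInt64(&done)
}

func main() {
	fmt.Println("goroutines finished:", runMany(100000))
}
```

Starting the same number of OS threads would typically be far more expensive, which is the point the feature list above is making.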

Early scheduler: GM

A single global goroutine (G) queue, which multiple threads (M) poll for Gs to run.

Shortcomings:

Creating, destroying, and scheduling a G all require an M to acquire the lock, which causes fierce lock contention.
Handing Gs between Ms causes delays and extra system load. For example, when G creates a new goroutine G', M creates G'. For M to continue executing G, G' has to be handed to another thread M', which also hurts locality: G' is related to G and would preferably run on M rather than on M'.
System calls cause threads to block and unblock frequently (with the CPU switching between Ms), which increases system overhead.

GMP model design idea

GMP model

G (goroutine): G stores the information needed for concurrent execution, such as the code entry address, context, running environment (the associated P and M), and stack. Creating, parking, resuming, and stopping a G are all managed by the runtime.
P (processor): Ps are created when the program starts. P is a management data structure, not an entity: it adds a layer of indirection that reduces the complexity of mapping M to G, and it controls the parallelism of Go code. The upper limit is GOMAXPROCS, which defaults to the number of CPU cores.
M (OS kernel thread): M is an entity that is scheduled and executed at the operating system level. M is only responsible for execution: Ms are constantly woken up or created, and then they execute. The default upper limit is 10000.

In Go, threads are the entities that run goroutines, and the scheduler's job is to assign runnable goroutines to worker threads.

Global queue: Stores Gs waiting to run.
P's local queue: Like the global queue, it stores Gs waiting to run, but it holds at most 256 of them. When a new G' is created, it is added to P's local queue first. If that queue is full, half of the Gs in the local queue are moved to the global queue.
List of Ps: All Ps are created at program startup and stored in an array, with at most GOMAXPROCS (configurable) entries.
M: To run a task, a thread must first acquire a P and then take a G from P's local queue. When P's queue is empty, M tries to take a batch of Gs from the global queue and put them into P's local queue, or steals half of the Gs from another P's local queue and puts them into its own P's local queue. M runs G, and when G finishes, M takes the next G from P, and so on.

Number of P

It is determined at startup by the environment variable $GOMAXPROCS, or at run time by calling runtime.GOMAXPROCS(). This means that at any point during program execution, at most $GOMAXPROCS goroutines are running at the same time.
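A small sketch of reading and changing the number of Ps from code. runtime.GOMAXPROCS returns the previous setting and leaves it unchanged when the argument is less than 1; the helper names currentProcs and setProcs are made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentProcs reports the current number of Ps without changing it:
// runtime.GOMAXPROCS leaves the setting untouched when its argument is < 1.
func currentProcs() int { return runtime.GOMAXPROCS(0) }

// setProcs sets the number of Ps to n and returns the previous value.
func setProcs(n int) int { return runtime.GOMAXPROCS(n) }

func main() {
	fmt.Println("CPU cores:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS at startup:", currentProcs())

	prev := setProcs(2) // limit the program to 2 Ps
	fmt.Println("GOMAXPROCS now:", currentProcs())

	setProcs(prev) // restore the original value
}
```

The same limit can be applied without touching the code by running the program with the GOMAXPROCS environment variable set, for example GOMAXPROCS=2 ./app (the binary name is a placeholder).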

Number of M

When a Go program starts, the maximum number of Ms is set; the default is 10000.
The SetMaxThreads function in runtime/debug sets the maximum number of Ms.
When an M blocks, a new M may be created.
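A minimal sketch of adjusting the thread limit. debug.SetMaxThreads returns the previous limit, so the value can be restored afterwards; the helper name setThreadLimit and the value 20000 are arbitrary choices for this example.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setThreadLimit sets the maximum number of OS threads (Ms) the program may
// use and returns the previous limit.
func setThreadLimit(n int) int { return debug.SetMaxThreads(n) }

func main() {
	prev := setThreadLimit(20000)
	fmt.Println("previous thread limit:", prev)

	// Restore the earlier limit. The program crashes if it ever tries to
	// use more threads than the limit allows.
	setThreadLimit(prev)
}
```

In practice this limit is rarely changed; hitting it usually points to goroutines stuck in blocking calls rather than to a limit that is too low.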

When to create P

When the program starts, the system creates n Ps according to the maximum number of Ps.

When to create M

When there are not enough Ms to associate with the Ps and run the runnable Gs in them.

For example, if all Ms are blocked while a P still has many ready tasks, the runtime looks for an idle M, and if there is none, it creates a new M.

Scheduler Design Strategies

Reusing threads

Avoid frequently creating and destroying threads; reuse them instead.

1) Work stealing mechanism

When a thread has no G to run, it tries to steal Gs from a P bound to another thread instead of being destroyed.

2) Hand off mechanism

When a thread blocks because the G it is running makes a system call (syscall), the thread releases its bound P, and the P is handed to another idle thread.

Take advantage of parallelism

GOMAXPROCS sets the number of Ps, so at most GOMAXPROCS threads are distributed across multiple CPUs and run at the same time. GOMAXPROCS also limits the degree of parallelism: for example, with GOMAXPROCS = number of cores / 2, at most half of the CPU cores are used in parallel.

Preemption

With coroutines, the next coroutine can only run after the current one actively gives up the CPU.
In Go, a goroutine can occupy the CPU for at most about 10 ms before it is preempted, which prevents other goroutines from being starved.
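The effect of preemption can be observed with a small experiment. The sketch below assumes Go 1.14 or later on a platform that supports asynchronous preemption; on older versions the busy loop would never give up the only P and the program would hang. The function name mainRegainsCPU and the timings are made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// mainRegainsCPU runs a busy loop on the only P and reports whether the
// calling goroutine is scheduled again within a reasonable time.
func mainRegainsCPU() bool {
	prev := runtime.GOMAXPROCS(1) // a single P: only one goroutine runs at a time
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	done := make(chan struct{})
	go func() {
		// A tight loop that never yields voluntarily.
		for atomic.LoadInt32(&stop) == 0 {
		}
		close(done)
	}()

	start := time.Now()
	time.Sleep(50 * time.Millisecond) // the busy goroutine takes over the P

	// Getting here means the runtime preempted the busy loop so that this
	// goroutine could run again.
	atomic.StoreInt32(&stop, 1)
	<-done
	return time.Since(start) < 2*time.Second
}

func main() {
	fmt.Println("main goroutine was rescheduled:", mainRegainsCPU())
}
```

With purely cooperative scheduling, the call to time.Sleep would never return here, because the busy loop contains no point at which it gives up the CPU.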

Global G queue

When the local queue of P is empty, it will try to get a batch of G from the global queue and put it in the local queue of P.

go func() scheduling process

1. Create a goroutine through go func();

2. There are two queues for storing G, one is the local queue of the local scheduler P, and the other is the global G queue. The newly created G will be stored in the local queue of P first, and if the local queue of P is full, it will be stored in the global queue;

3. G can only run in M, an M must hold a P, and the relationship between M and P is 1:1. M will pop an executable G from P's local queue for execution. If P's local queue is empty, it will steal an executable G from other MP combinations to execute;

4. The process by which an M schedules and executes Gs is a loop;

5. If a syscall or another blocking operation occurs while M is executing a G, M blocks. If there are still Gs to execute at that point, the runtime detaches thread M from P and creates a new operating system thread (or reuses an idle thread if one is available) to serve this P;

6. When M's system call returns, the G tries to obtain an idle P and is put into that P's local queue for execution. If no P can be obtained, thread M goes dormant and joins the idle threads, and the G is put into the global queue.

Scheduler life cycle

M0: The main thread, numbered 0, that exists once the program has started. Its instance lives in the global variable runtime.m0 and does not need to be allocated on the heap. M0 performs the initialization and starts the first G; after that, M0 is the same as any other M.
G0: The first goroutine created each time an M is started. G0 is used only for scheduling and does not point to any executable function. Each M has its own G0, whose stack space is used during scheduling and system calls. The global variable G0 is the G0 of M0.

Example:

 package main

import "fmt"

func main() {
    fmt.Println("Hello world")
}

The runtime creates the initial m0 and g0 and associates the two.
Scheduler initialization: initialize m0, the stack, and the GC, then create and initialize the list of Ps.
The main function in the example code is main.main; the runtime also has a main function, runtime.main. After compilation, runtime.main calls main.main. When the program starts, a main goroutine is created for runtime.main and added to P's local queue.
Start m0. m0 is already bound to a P, so it takes a G from P's local queue and gets the main goroutine.
G owns a stack, and M sets up the running environment from the stack and scheduling information in G.
M runs G.
When G exits, M goes back to get the next runnable G, and this repeats until main.main returns. Then runtime.main runs the defer and panic handling, or calls runtime.exit to exit the program.

The scheduler's life cycle spans almost the whole life of a Go program. Everything before the runtime.main goroutine runs is preparation for the scheduler; the scheduler truly starts when the runtime.main goroutine starts running, and it ends when runtime.main ends.

Visualizing GMP

1. go tool trace

 package main

import (
    "os"
    "fmt"
    "runtime/trace"
)

func main() {

    // Create the trace file
    f, err := os.Create("trace.out")
    if err != nil {
        panic(err)
    }

    defer f.Close()

    // Start tracing
    err = trace.Start(f)
    if err != nil {
        panic(err)
    }
    defer trace.Stop()

    // Main work
    fmt.Println("Hello World")
}

Run the program

 $ go run trace.go 
Hello World

This produces a trace.out file, which we can then open with a tool for analysis.

 $ go tool trace trace.out 
2020/02/23 10:44:11 Parsing trace...
2020/02/23 10:44:11 Splitting trace...
2020/02/23 10:44:11 Opening browser. Trace viewer is listening on http://127.0.0.1:33479

Browser access: http://127.0.0.1:33479

G information

There are two Gs in the program. One is the special G0, the initial G that every M must have; we do not need to discuss it here.

The other, G1, should be the main goroutine (the goroutine that executes the main function); it is in the runnable and running states for a period of time.

M information

There are two Ms in the program. One is the special M0, used for initialization; we do not need to discuss it here.

main.main is called in G1, and the trace goroutine G18 is created. G1 runs on P1 and G18 runs on P0.

There are two Ps here, and we know that a P must be bound to an M in order to schedule Gs.

The additional M2 is presumably the M that was dynamically created for P0 to execute G18.

2. Debug trace

 package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 5; i++ {
        time.Sleep(time.Second)
        fmt.Println("Hello World")
    }
}

Compile

 $ go build trace2.go

Run in Debug mode

 $ GODEBUG=schedtrace=1000 ./trace2 
SCHED 0ms: gomaxprocs=2 idleprocs=0 threads=4 spinningthreads=1 idlethreads=1 runqueue=0 [0 0]
Hello World
SCHED 1003ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World
SCHED 2014ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World
SCHED 3015ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World
SCHED 4023ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]
Hello World

SCHED : the debug output marker, indicating that this line was printed by the goroutine scheduler;
0ms : the time from program startup to the printing of this line;
gomaxprocs : the number of Ps; there are 2 in this example because the number of Ps defaults to the number of CPU cores, and it can also be set through GOMAXPROCS;
idleprocs : the number of idle Ps; the difference between gomaxprocs and idleprocs is the number of Ps executing Go code;
threads : the number of OS threads (Ms), including the Ms used by the scheduler plus threads such as sysmon that the runtime uses itself;
spinningthreads : the number of OS threads in the spinning state;
idlethreads : the number of OS threads in the idle state;
runqueue=0 : the number of Gs in the scheduler's global queue;
[0 0] : the number of Gs in the local queues of the 2 Ps, respectively.
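To make the field list concrete, here is a small sketch that parses one SCHED line and computes the number of Ps executing Go code as gomaxprocs minus idleprocs. The helpers parseSched and busyPs are made up for this example and only handle the key=value fields shown in the output above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSched extracts the integer key=value fields from one schedtrace line.
// Tokens without "=" (such as "SCHED", "0ms:" or the per-P counts in
// brackets) are skipped.
func parseSched(line string) map[string]int {
	fields := make(map[string]int)
	for _, tok := range strings.Fields(line) {
		parts := strings.SplitN(tok, "=", 2)
		if len(parts) != 2 {
			continue
		}
		n, err := strconv.Atoi(parts[1])
		if err != nil {
			continue
		}
		fields[parts[0]] = n
	}
	return fields
}

// busyPs returns the number of Ps that are executing Go code.
func busyPs(line string) int {
	f := parseSched(line)
	return f["gomaxprocs"] - f["idleprocs"]
}

func main() {
	first := "SCHED 0ms: gomaxprocs=2 idleprocs=0 threads=4 spinningthreads=1 idlethreads=1 runqueue=0 [0 0]"
	later := "SCHED 1003ms: gomaxprocs=2 idleprocs=2 threads=4 spinningthreads=0 idlethreads=2 runqueue=0 [0 0]"
	fmt.Println(busyPs(first)) // 2: both Ps are busy at startup
	fmt.Println(busyPs(later)) // 0: the program is sleeping
}
```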

GMP Scenario Analysis

Scenario 1: Local priority

P owns G1, and M1 starts running G1 after acquiring P. G1 uses go func() to create G2. For locality, G2 is preferentially added to the local queue of P1.

Scenario 2: Thread reuse

After G1 runs (function: goexit ), the goroutine running on M is switched to G0, and G0 is responsible for the switching of coroutines during scheduling (function: schedule ). Take G2 from P's local queue, switch from G0 to G2, and start running G2 (function: execute ). The multiplexing of thread M1 is realized.

Scenario 3: The local queue is full

Assume that the local queue of each P can only store 3 Gs. G2 creates 6 Gs; the first 3 (G3, G4, G5) are added to the local queue of P1, which is now full.

Scenario 4: Local Load Balancing

When G2 creates G7, it finds that the local queue of P1 is full and needs to perform load balancing (transfer the first half of the G in the local queue in P1 and the newly created G to the global queue).

When these Gs are transferred to the global queue, they will be out of order. So G3, G4, G7 are transferred to the global queue.
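The overflow rule can be sketched as a toy model. This is not the runtime's implementation: the type toySched and its fields are invented for illustration, the Gs are plain integers, the local queue capacity is assumed to be 4 so that the result matches the G3, G4, G7 outcome described above, and the reordering of transferred Gs is left out.

```go
package main

import "fmt"

// toySched is a toy model of one P's local run queue plus the global queue.
type toySched struct {
	local  []int // G ids in the P's local queue
	global []int // G ids in the global queue
	limit  int   // capacity of the local queue (assumed; small for the demo)
}

// put enqueues a newly created G. If the local queue is full, the first half
// of the local queue and the new G are moved to the global queue.
func (s *toySched) put(g int) {
	if len(s.local) < s.limit {
		s.local = append(s.local, g)
		return
	}
	half := len(s.local) / 2
	s.global = append(s.global, s.local[:half]...)
	s.global = append(s.global, g)
	s.local = append([]int(nil), s.local[half:]...)
}

func main() {
	s := &toySched{limit: 4}
	for g := 3; g <= 8; g++ { // G2 creates G3 through G8
		s.put(g)
	}
	fmt.Println("local: ", s.local)  // local:  [5 6 8]
	fmt.Println("global:", s.global) // global: [3 4 7]
}
```

Creating G7 triggers the overflow, and G8 then fits into the local queue again because half of it was just moved away.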

Scenario 5: Join the local queue

When G2 creates G8, the local queue of P1 is not full, so G8 will be added to the local queue of P1.

G8 is added to the local queue of P1 because P1 is bound to M1 at this time, and G2 is being executed by M1. A new G created by G2 is therefore placed preferentially on the P bound to its own M.

Scenario 6: Spinning threads

When a running G creates a new G, it tries to wake up another idle P and M combination to execute it.

Suppose G2 wakes up M2. M2 binds P2 and runs G0, but there is no G in the local queue of P2, so M2 is now a spinning thread (a thread that has no G but is running and constantly looking for one).

Scenario 7: Global Queue Load Balancing

M2 tries to fetch a batch of G from the global queue (referred to as "GQ") and put it into the local queue of P2 (function: findrunnable() ). The amount of G taken by M2 from the global queue conforms to the following formula:

 n = min(len(GQ) / GOMAXPROCS + 1, cap(LQ) / 2)

Take at least 1 G from the global queue, but do not move too many Gs from the global queue to a P's local queue at once, so that some are left for the other Ps. This is load balancing from the global queue to P's local queue.

Suppose there are 4 Ps in this scenario (GOMAXPROCS is set to 4, so at most 4 Ps can be used by Ms). M2 therefore moves only 1 G (G3) from the global queue into P2's local queue, then switches from G0 to G3 and runs G3.
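The formula for n can be written as a small function and checked against this scenario. The function name globalBatch is made up, the division is integer division, and the two extra guards (returning 0 for an empty global queue and never taking more Gs than the queue holds) are additions for this sketch rather than part of the formula in the text.

```go
package main

import "fmt"

// globalBatch returns how many Gs an M moves from the global queue to its
// P's local queue in one go:
//
//	n = min(len(GQ)/GOMAXPROCS + 1, cap(LQ)/2)
func globalBatch(globalLen, gomaxprocs, localCap int) int {
	if globalLen == 0 {
		return 0 // nothing to take (guard added for this sketch)
	}
	n := globalLen/gomaxprocs + 1
	if n > localCap/2 {
		n = localCap / 2
	}
	if n > globalLen {
		n = globalLen // never take more than is queued (guard added for this sketch)
	}
	return n
}

func main() {
	// Scenario 7: 3 Gs in the global queue, 4 Ps, local queue capacity 3.
	fmt.Println(globalBatch(3, 4, 3)) // 1

	// A larger case: 100 queued Gs, 4 Ps, local queue capacity 256.
	fmt.Println(globalBatch(100, 4, 256)) // 26
}
```

Dividing by GOMAXPROCS is what leaves a share of the global queue for every other P.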

Scenario 8: work stealing

Assume that G2 keeps running on M1. After 2 more rounds, M2 has moved G7 and G4 from the global queue into P2's local queue and finished running them. Both the global queue and P2's local queue are now empty, as shown in the left half of the Scenario 8 figure.

If the global queue has no G, M must perform work stealing: it steals half of the Gs from another P that has Gs and puts them into its own P's local queue. P2 takes half of the Gs from the tail of P1's local queue. In this example, half is just 1 G, G8, which is put into P2's local queue and executed.

Scenario 9: Spinning thread max limit

G5 and G6 in P1's local queue have been stolen by other Ms and run to completion. M1 and M2 are currently running G2 and G8 respectively. M3 and M4 have no goroutines to run, so they are spinning, constantly looking for goroutines.

There are at most GOMAXPROCS spinning threads in the system (in this example GOMAXPROCS = 4, so there are 4 Ps in total); any extra threads are put to sleep.

Scenario 10: Blocking System Call

Assume that M3 and M4 are currently spinning threads, and M5 and M6 are idle threads that are not bound to any P. (Note that there are at most 4 Ps here, so the number of Ms is always greater than or equal to the number of Ps, and most of the time Ms compete for the Ps they need in order to run.) G8 creates G9 and then makes a blocking system call, so M2 and P2 are immediately unbound. P2 then checks the following: if P2's local queue has Gs, the global queue has Gs, or there is an idle M, P2 immediately wakes up an M and binds to it; otherwise, P2 is added to the idle P list and waits for an M to acquire it. In this scenario, P2's local queue holds G9, so P2 can bind to the idle thread M5.

Scenario 11: Non-blocking system call

G8 creates G9. Suppose G8 then makes a non-blocking system call.

M2 and P2 are unbound, but M2 remembers P2, and then G8 and M2 enter the system call state. When G8 and M2 exit the system call, M2 first tries to reacquire P2. If that fails, it tries to acquire an idle P. If there is still no P available, G8 is marked runnable and added to the global queue, and M2, having no P to bind to, goes dormant (if it sleeps for a long time, it waits to be reclaimed and destroyed by the GC).

Summary

The essence of Go scheduling is to allocate a large number of goroutines to a small number of threads for execution, and use multi-core parallelism to achieve more powerful concurrency.

Reference

[Golang self-cultivation road](
