Goroutines are lightweight, user-mode threads and sit at the core of the Go language. Deciding when these goroutines run and how to allocate operating system resources to them sensibly requires a well-designed scheduler.
What makes a good scheduler? One that assigns the right goroutine to the right place at the right time, guaranteeing both fairness and efficiency.

Starting from go func

package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 10; i++ {
        go func() {
            fmt.Println(i)
        }()
    }
    time.Sleep(1 * time.Second)
}

In this code we start 10 goroutines, each of which prints the variable i. Because the moment at which each of these 10 goroutines gets scheduled is not fixed, the value of i is only read when the goroutine actually runs, so the printed values are unpredictable (with the loop-variable semantics before Go 1.22, all the closures share the same i).
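For comparison, here is a small variation (not from the original article) that passes i as an argument, so each goroutine prints its own copy regardless of when it is scheduled:

package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 10; i++ {
        go func(n int) {
            fmt.Println(n) // n is a copy of i taken when the goroutine is created
        }(i)
    }
    time.Sleep(1 * time.Second)
}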

Every goroutine we start in our code is a computing task, and these tasks are submitted to the Go runtime. If there are a great many of them, say tens of thousands, they cannot all be executed at the same time, so the tasks have to be stored somewhere first. The usual approach is to put them in an in-memory queue where they wait to be executed.

The consumer side is a scheduling loop maintained by the Go runtime. Put simply, the scheduling loop keeps taking computing tasks off the queue and executing them. This is essentially a producer-consumer model, which decouples user tasks from the scheduler.

Here, G in the figure represents a goroutine computing task, and M represents an operating system thread.
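As a mental model only, and not how the runtime is actually implemented, the producer-consumer idea can be sketched with a channel standing in for the run queue and worker goroutines standing in for Ms:

package main

import (
    "fmt"
    "sync"
)

// task stands in for a G: a unit of work waiting to be run.
type task func()

func main() {
    queue := make(chan task, 128) // the "run queue"
    var wg sync.WaitGroup

    // Consumers: each worker plays the role of an M draining the queue.
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range queue {
                t()
            }
        }()
    }

    // Producer: every go func submission becomes a task in the queue.
    for i := 0; i < 10; i++ {
        i := i
        queue <- func() { fmt.Println("task", i) }
    }
    close(queue)
    wg.Wait()
}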

Scheduling strategy

Next, we explain the scheduling strategy in detail.

Production side

Production side 1.0

Following the example above, we have produced 10 computing tasks, and we must store them in memory until the scheduler consumes them. The most natural data structure is a queue: first come, first served. But there is a problem. On today's multi-core, multi-threaded machines there is more than one consumer, and when multiple consumers pull from the same queue there are thread-safety issues, so the queue must be locked. All computing tasks G must be executed on an M.
(Figure: the G-M model)
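A minimal sketch of this 1.0 design (illustrative names, not runtime code): every consumer must take the same lock, so with many threads they all serialize on it:

package main

import "sync"

// G stands for a runnable goroutine in this sketch.
type G struct{ id int }

// globalQueue is the single, lock-protected run queue of the 1.0 model.
type globalQueue struct {
    mu sync.Mutex
    gs []*G
}

func (q *globalQueue) put(g *G) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.gs = append(q.gs, g)
}

// get is what every M must call; with many Ms they all contend on q.mu.
func (q *globalQueue) get() *G {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.gs) == 0 {
        return nil
    }
    g := q.gs[0]
    q.gs = q.gs[1:]
    return g
}

func main() {
    q := &globalQueue{}
    q.put(&G{id: 1})
    _ = q.get()
}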

Production side 2.0

To solve this locking problem, Go splits the global queue into multiple local queues, each managed by a P.
(Figure: the G-M-P model)

This way, each M only needs to find a P, bind to it, and then execute the Gs in that P's local queue, which eliminates the lock contention on the common path.

But the local queue of each P cannot grow without limit (its length is currently 256). Imagine a scenario with tens of thousands of goroutines: the local queue may not be able to hold them all, so Go keeps the global queue around to handle this overflow.

So why is the local queue an array while the global queue is a linked list? Since the global queue is the fallback for the local queues, its size must be unbounded, so it has to be a linked list.

The global queue is allocated on the global scheduler structure, and there is only one copy:

type schedt struct {
    ...
    // Global runnable queue.
    runq     gQueue // global queue
    runqsize int32  // length of the global queue
    ...
}

So why is the local queue an array rather than a linked list? Because of the principle of locality: the hardware prefetches contiguous memory into the CPU cache, so an array is usually already cached when it is read, which makes it cache-friendly and fast. The nodes of a linked list are scattered across memory and are rarely all in cache, so traversing it is slower. Weighing performance against capacity, the local queue is implemented as an array.
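As a rough illustration of that locality argument (a toy micro-benchmark, not taken from the article or the runtime), walking a contiguous slice is typically much faster than chasing pointers through an equally long linked list:

package main

import (
    "fmt"
    "time"
)

type node struct {
    val  int
    next *node
}

func main() {
    const n = 1 << 20

    // Contiguous storage: neighbouring elements share cache lines.
    arr := make([]int, n)
    for i := range arr {
        arr[i] = i
    }

    // Scattered storage: each node is a separate allocation.
    var head *node
    for i := 0; i < n; i++ {
        head = &node{val: i, next: head}
    }

    start := time.Now()
    sum := 0
    for _, v := range arr {
        sum += v
    }
    fmt.Println("array walk:", time.Since(start), sum)

    start = time.Now()
    sum = 0
    for p := head; p != nil; p = p.next {
        sum += p.val
    }
    fmt.Println("list walk: ", time.Since(start), sum)
}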

To exploit locality, Go also adds a runnext field to P. It holds at most one G, and the G in runnext is always the next to be scheduled. Why runnext is needed is explained below. The complete data structures on the production side are as follows:

Definition of P structure:

type p struct {
    ...
    // Queue of runnable goroutines. Accessed without lock.
    runqhead uint32        // head of the local queue
    runqtail uint32        // tail of the local queue
    runq     [256]guintptr // local queue, capacity 256
    runnext  guintptr      // runnext, holds at most one G
    ...
}

Complete production process

  • When we execute go func, the main thread m0 calls newproc() to create a G structure; the P bound to m0 is chosen first.
  • The new G is first offered to the P's runnext slot: if runnext is empty, the G is placed there and production ends.
  • If runnext is already occupied, the G that was in runnext is kicked into the local queue and the new G takes its place in runnext; production ends.
  • If the local queue is also full, half of the Gs in the local queue are taken out together with the current G; this bundle is called batch in the source code and is pushed onto the global queue in one operation, and production ends. This frees space in the local queue so that later productions are not blocked by a full local queue.

So the G in runnext is always the most recently produced G, and it is also the first to be scheduled. This again follows the principle of locality: the most recently created goroutine gets the highest priority and runs first.

The logic of runqput:

func runqput(_p_ *p, gp *g, next bool) {

    // With the randomized scheduler, occasionally skip the runnext slot.
    if randomizeScheduler && next && fastrand()%2 == 0 {
        next = false
    }

    if next {
    retryNext:
        // Read the old runnext value.
        oldnext := _p_.runnext
        // CAS the current G into runnext, swapping out the old one.
        if !_p_.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
            goto retryNext
        }
        // The old runnext was empty: production is done.
        if oldnext == 0 {
            return
        }
        // The old runnext was not empty: the displaced G becomes gp and is
        // appended to the tail of the local queue below.
        gp = oldnext.ptr()
    }

retry:
    // Try to put gp into the local queue.
    h := atomic.LoadAcq(&_p_.runqhead) // load-acquire, synchronize with consumers
    t := _p_.runqtail
    // The local queue is not full, so store gp there.
    if t-h < uint32(len(_p_.runq)) {
        _p_.runq[t%uint32(len(_p_.runq))].set(gp)
        atomic.StoreRel(&_p_.runqtail, t+1) // store-release, makes the item available for consumption
        return
    }
    // If the local queue were not full we would have returned above; if it is
    // full, move half of the local queue (plus gp) to the global queue.
    if runqputslow(_p_, gp, h, t) {
        return
    }
    // the queue is not full, now the put above must succeed
    goto retry
}

Consumer side

The consumer side is a scheduling loop: it keeps consuming Gs from the local and global queues, binding each G to an M, executing it, then consuming the next G, binding it to an M, executing it, and so on. Who runs this scheduling loop? The answer is g0. Every M has a g0, which drives the scheduling loop on its own thread:

type m struct {
    g0      *g     // goroutine with scheduling stack
    ...
}

g0 is a special goroutine. To prepare an M for running a computing task G, g0 helps acquire a thread M, binds a P to the M, lets the Gs on that P be executed, and then formally enters the scheduling loop. The scheduling loop consists of four steps:

  • schedule: runs on g0 and applies the scheduling policy, e.g. fetching a G from P's runnext, local queue, or the global queue, then calling execute()
  • execute: binds the G to the M, initializes some fields, and calls gogo()
  • gogo: architecture-specific code that switches stacks and hands the thread M over to the chosen G
  • goexit: runs some cleanup logic and calls schedule() to start the next round of the scheduling loop

That is, each iteration of the scheduling loop performs the context switch g0 -> G -> g0.
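Conceptually, the loop driven by g0 can be sketched as below; the real runtime functions share these names but are implemented in Go and assembly with very different bodies:

package main

import "fmt"

// A conceptual sketch only, not the runtime implementation.
type G struct{ id int }

func findRunnable() *G { return &G{id: 1} }                // runnext -> local -> global/netpoll/steal
func execute(g *G)     { fmt.Println("running G", g.id) } // bind G to this M, gogo() switches stacks

func scheduleLoop() {
    for i := 0; i < 3; i++ { // the real loop never exits
        g := findRunnable() // schedule: pick a G
        execute(g)          // execute + gogo: switch from g0's stack to G's stack
        // when G blocks or finishes, goexit/mcall switches back to g0
        // and the loop picks the next G
    }
}

func main() { scheduleLoop() }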

schedule

schedule is the core of the scheduling loop. Since the Gs belonging to a P are spread across runnext, the local queue, and the global queue, schedule checks each place in turn for a runnable G. The general logic is:

  • First look at runnext on the P; if a G is there, return it directly
  • If runnext is empty, search the local queue; if a G is found, return it directly
  • If the local queue is also empty, search the global queue, the network poller, and the other Ps (work stealing), blocking until a runnable G is obtained

The source code is implemented as follows:

func schedule() {
    _g_ := getg()
    var gp *g
    var inheritTime bool
    ...
    if gp == nil {
        // Check the global queue once every 61 scheduling loops. To stay fair
        // and avoid starving the global queue, schedtick guarantees that the
        // global run queue is polled with a certain probability whenever it
        // has Gs waiting to run.
        if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
            lock(&sched.lock)
            gp = globrunqget(_g_.m.p.ptr(), 1)
            unlock(&sched.lock)
        }
    }
    if gp == nil {
        // First try the P's runnext and local queue.
        gp, inheritTime = runqget(_g_.m.p.ptr())
    }
    if gp == nil {
        // Still nothing: look in the global queue, then the network poller,
        // then try to steal Gs from other Ps.
        gp, inheritTime = findrunnable() // blocks until work is available
        // findrunnable blocks, so by this point we always have a runnable G.
    }
    ...
    // Call execute and continue the scheduling loop.
    execute(gp, inheritTime)
}

About schedtick: every 61 iterations of the scheduling loop, the scheduler checks the global queue first. Why? If hundreds of thousands of Gs keep being pushed onto P's local queues, the Gs in the global queue might never run and would starve. So before pulling from the local queue there has to be a check that periodically takes a G from the global queue, to guarantee fairness.

When the scheduler does go to the global queue, it first computes how many Gs each P would receive if the Gs in the global queue were divided evenly among all Ps; call this number n. It then moves n Gs from the global queue to the local queue of the current P, but never more than half of the local queue's capacity (that is, 128):

func globrunqget(_p_ *p, max int32) *g {
    ...
    // gomaxprocs = number of Ps
    // sched.runqsize = length of the global queue
    // n = the global queue divided evenly across all Ps, plus 1
    n := sched.runqsize/gomaxprocs + 1
    if n > sched.runqsize {
        n = sched.runqsize
    }
    if max > 0 && n > max {
        n = max
    }
    // n must not exceed half of the local queue capacity, i.e. 128
    if n > int32(len(_p_.runq))/2 {
        n = int32(len(_p_.runq)) / 2
    }

    // Take n Gs off the global queue and move them to the current P's local queue.
    sched.runqsize -= n

    gp := sched.runq.pop()
    n--
    for ; n > 0; n-- {
        gp1 := sched.runq.pop()
        runqput(_p_, gp1, false)
    }
    return gp
}

The point of moving a batch is that the next few scheduling iterations can find work locally and do not need to lock the global queue again, which is good for performance.

The logic of looking for runnable Gs on other Ps is called work stealing: a victim P is chosen by a random algorithm, half of the Gs in its local queue are stolen into the current P's local queue, and one G from the tail of the stolen batch is executed immediately.
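A simplified sketch of the steal-half idea (the real runtime uses a lock-free ring buffer and random victim selection; the names and the mutex here are only for illustration):

package main

import (
    "fmt"
    "sync"
)

// G stands for a runnable goroutine in this sketch.
type G struct{ id int }

// P holds a local run queue; a mutex keeps the sketch short, unlike the
// lock-free ring buffer the runtime actually uses.
type P struct {
    mu   sync.Mutex
    runq []*G
}

// stealFrom moves half of the victim's queue to the thief and returns one G
// to run right away (nil if nothing was stolen).
func stealFrom(thief, victim *P) *G {
    victim.mu.Lock()
    n := len(victim.runq) / 2
    stolen := make([]*G, n)
    copy(stolen, victim.runq[len(victim.runq)-n:])
    victim.runq = victim.runq[:len(victim.runq)-n]
    victim.mu.Unlock()

    if len(stolen) == 0 {
        return nil
    }
    g := stolen[len(stolen)-1] // run the last stolen G immediately
    thief.mu.Lock()
    thief.runq = append(thief.runq, stolen[:len(stolen)-1]...)
    thief.mu.Unlock()
    return g
}

func main() {
    thief, victim := &P{}, &P{}
    for i := 0; i < 8; i++ {
        victim.runq = append(victim.runq, &G{id: i})
    }
    // In the runtime the victim P is picked at random; here there is only one.
    g := stealFrom(thief, victim)
    fmt.Printf("thief has %d Gs, victim keeps %d, running G %d first\n",
        len(thief.runq), len(victim.runq), g.id)
}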

GMP model

At this point, I believe that everyone has already understood the concept of GMP. Let's finally summarize:

  • G: goroutine, a computing task, made up of code plus its context (current execution position, stack information, state, and so on)
  • M: machine, an operating system thread. Code can only run on the CPU through a thread; on Linux, Ms are created with the clone system call
  • P: processor, a virtual processor. An M must acquire a P before it can execute the Gs in that P's queue; otherwise the M goes to sleep
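As a small aside you can verify yourself, the number of Ps equals GOMAXPROCS, which user code can query through the standard runtime package:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // GOMAXPROCS(0) only queries the current value: the number of Ps.
    fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))
    // NumCPU reports the logical CPUs visible to the process.
    fmt.Println("logical CPUs:   ", runtime.NumCPU())
    // NumGoroutine reports how many Gs currently exist.
    fmt.Println("Gs:             ", runtime.NumGoroutine())
}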

Blocking handling

So far we have assumed that G executes normally. If G blocks while waiting (on a channel, a system call, and so on), the M must be unbound from the G on P at that moment to keep CPU utilization high. How scheduling is triggered when a goroutine enters and returns from a system call will be explained in the next article.

Follow us

If you are interested in this series of articles, you are welcome to subscribe to our official account so you don't miss the next post.

