
WaitGroup is a concurrency control primitive that is often used to wait for a group of goroutines in a program to finish.
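
As a minimal sketch of typical usage (the sumSquares helper is hypothetical and exists only for illustration), each goroutine is registered with Add before it starts, reports completion with Done, and the caller blocks in Wait until all of them have finished:

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares is a hypothetical helper: it squares each number in its own
// goroutine and uses a WaitGroup to wait for all of them before summing.
func sumSquares(nums []int) int {
	var wg sync.WaitGroup
	results := make([]int, len(nums))
	for i, n := range nums {
		wg.Add(1) // register the goroutine before starting it
		go func(i, n int) {
			defer wg.Done() // counter minus 1 when this goroutine ends
			results[i] = n * n
		}(i, n)
	}
	wg.Wait() // blocks until the counter is back to 0
	total := 0
	for _, r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3})) // prints 14
}
```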

Implementation Principle

Data structure

The WaitGroup data structure consists of two fields:

  • noCopy, an auxiliary field that lets the vet tool detect copies;
  • state1, an array with a composite meaning: it holds the WaitGroup counter, the number of waiters blocked in Wait, and the semaphore.

type WaitGroup struct {
    // A trick to prevent copying: it lets the vet tool report code that copies the value
    noCopy noCopy
    // One 64-bit (8-byte) value is split into two halves: the high 32 bits are the counter, the low 32 bits are the waiter count
    // The remaining 32 bits are used as the semaphore
    // Atomic operations on a 64-bit value require 64-bit alignment, which 32-bit compilers do not guarantee, so the role of each array element differs between architectures; see the state method below
    // In short, the 64-bit-aligned pair of elements is used as state and the remaining 32 bits as the semaphore
    state1 [3]uint32
}

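// state returns the address of the state and the address of the semaphore
func (wg *WaitGroup) state() (statep *uint64, semap *uint32) {
    if uintptr(unsafe.Pointer(&wg.state1))%8 == 0 {
        // The address is 64-bit aligned: the first two elements form the state and the last element is the semaphore
        return (*uint64)(unsafe.Pointer(&wg.state1)), &wg.state1[2]
    } else {
        // The address is only 32-bit aligned: the last two elements form the state (they are 64-bit aligned, so 64-bit atomic operations work on them) and the first element is the semaphore
        return (*uint64)(unsafe.Pointer(&wg.state1[1])), &wg.state1[0]
    }
}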

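When state1 is 64-bit aligned, as it is in a 64-bit environment, the first element of state1 holds the waiter count, the second element holds the WaitGroup counter, and the third element is the semaphore. (This element order assumes a little-endian machine, where the low 32 bits of the 64-bit state occupy the first element.)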
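
The packing of the counter and the waiter count into one 64-bit word can be sketched on its own. The pack and unpack helpers below are hypothetical names, not part of the sync package; they only illustrate the bit layout and how adding a shifted delta changes the counter without touching the waiter count:

```go
package main

import "fmt"

// pack builds a state word: counter in the high 32 bits, waiters in the low 32 bits.
func pack(counter int32, waiters uint32) uint64 {
	return uint64(uint32(counter))<<32 | uint64(waiters)
}

// unpack splits a state word back into its two halves.
func unpack(state uint64) (counter int32, waiters uint32) {
	return int32(state >> 32), uint32(state)
}

func main() {
	s := pack(3, 2)
	c, w := unpack(s)
	fmt.Println(c, w) // prints 3 2

	// Apply a delta of -1 to the counter half only, using the same
	// shift-and-add arithmetic that Add performs on the state word.
	delta := -1
	s += uint64(delta) << 32
	c, w = unpack(s)
	fmt.Println(c, w) // prints 2 2
}
```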


noCopy: auxiliary vet check

The noCopy field tells the vet tool that this data structure must not be copied by value. More precisely, it must not be copied after its first use.

vet statically checks data types that implement the Locker interface and issues a warning if the code copies a value of such a type. WaitGroup itself does not satisfy the Locker interface, so it gains this check by including a noCopy field, whose type implements Locker. And because noCopy is an unexported type stored in an unexported, non-embedded field, WaitGroup does not expose Lock/Unlock methods.

If you want vet to warn when your own data structure is copied, you can achieve this in the same way, by including a field of a noCopy type in it.
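
Because the noCopy type in the sync package is unexported, your own code has to declare an equivalent type. The following is a minimal sketch under that assumption; the Counter type is hypothetical, and the local noCopy type simply mirrors the idea of a zero-size type with no-op Lock and Unlock methods:

```go
package main

import "fmt"

// noCopy is a local helper type (an assumption for this sketch, declared
// here because sync.noCopy cannot be used outside the sync package).
// Its Lock/Unlock methods do nothing; they exist so vet's copy check sees a Locker.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// Counter is a hypothetical type that should not be copied after first use.
type Counter struct {
	noCopy noCopy // named field, so Lock/Unlock are not promoted to Counter
	n      int
}

func (c *Counter) Inc()       { c.n++ }
func (c *Counter) Value() int { return c.n }

func main() {
	var c Counter
	c.Inc()
	c.Inc()
	// Using c through a pointer is fine. A statement such as `d := c`
	// would still compile, but it is the kind of copy vet is meant to flag.
	fmt.Println(c.Value()) // prints 2
}
```

The check is advisory: the compiler still accepts a copy, so it only helps when go vet is part of the build or review process.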

Methods

Add & Done

The Add method mainly operates on the counter part of state. With the race-detection and error-checking code removed, its implementation is as follows:

func (wg *WaitGroup) Add(delta int) {
    statep, semap := wg.state()
    // The high 32 bits hold the counter v, so shift delta left by 32 bits before adding it
    state := atomic.AddUint64(statep, uint64(delta)<<32)
    v := int32(state >> 32) // current counter value
    w := uint32(state) // waiter count

    if v > 0 || w == 0 {
        return
    }

    // Here the counter v is 0 and the waiter count w is nonzero, so the value of state equals the waiter count
    // Reset the waiter count to 0; since v is also 0, the combined *statep can simply be set to 0. Then wake up all the waiters
    *statep = 0
    for ; w != 0; w-- {
        runtime_Semrelease(semap, false, 0)
    }
}

// Done simply decrements the counter by 1
func (wg *WaitGroup) Done() {
    wg.Add(-1)
}

Wait

The Wait method repeatedly checks the value of state. If the counter is 0, all tasks have completed, so the caller does not need to wait and returns immediately. If the counter is greater than 0, some tasks are still unfinished, so the caller becomes a waiter: it adds itself to the waiter count and blocks.

Its main implementation code is as follows:

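func (wg *WaitGroup) Wait() {
    statep, semap := wg.state()

    for {
        state := atomic.LoadUint64(statep)
        v := int32(state >> 32) // current counter value
        w := uint32(state) // waiter count
        if v == 0 {
            // The counter is 0, so the calling goroutine does not need to wait and can carry on with the logic that follows
            return
        }
        // Otherwise increment the waiter count by 1. Wait may be called concurrently, so the increment can fail, which is why everything is wrapped in a for loop
        if atomic.CompareAndSwapUint64(statep, state, state+1) {
            // Block and sleep until woken up
            runtime_Semacquire(semap)
            // Woken up, no longer blocked: return
            return
        }
    }
}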
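
The Add and Wait logic above can be modeled outside the runtime with a toy type. This is only a sketch, not the sync package implementation: miniWG and runWorkers are hypothetical names, an unbuffered channel stands in for the runtime semaphore, and apart from the negative-counter check the misuse checks are omitted.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// miniWG is a toy model of WaitGroup for illustration only.
type miniWG struct {
	state uint64        // high 32 bits: counter, low 32 bits: waiter count
	sema  chan struct{} // stands in for the runtime semaphore
}

func newMiniWG() *miniWG { return &miniWG{sema: make(chan struct{})} }

func (wg *miniWG) Add(delta int) {
	state := atomic.AddUint64(&wg.state, uint64(delta)<<32)
	v := int32(state >> 32) // counter
	w := uint32(state)      // waiter count
	if v < 0 {
		panic("miniWG: negative counter")
	}
	if v > 0 || w == 0 {
		return
	}
	// Counter reached 0 with waiters present: reset state and wake each waiter.
	atomic.StoreUint64(&wg.state, 0)
	for ; w != 0; w-- {
		wg.sema <- struct{}{}
	}
}

func (wg *miniWG) Done() { wg.Add(-1) }

func (wg *miniWG) Wait() {
	for {
		state := atomic.LoadUint64(&wg.state)
		if int32(state>>32) == 0 {
			return // nothing to wait for
		}
		// Register as a waiter; retry if another goroutine changed state first.
		if atomic.CompareAndSwapUint64(&wg.state, state, state+1) {
			<-wg.sema // block until Add wakes this waiter
			return
		}
	}
}

// runWorkers starts n goroutines and returns how many had finished when Wait returned.
func runWorkers(n int) int64 {
	wg := newMiniWG()
	var done int64
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&done)
}

func main() {
	fmt.Println(runWorkers(8)) // prints 8
}
```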

Common errors

Counter set to a negative value

The value of the WaitGroup counter must be greater than or equal to 0. Whenever the counter is changed, WaitGroup checks the new value, and if it is negative, it panics.

In general, there are two ways in which the counter can become negative:

  1. A negative number is passed to Add. If the counter is still greater than or equal to 0 after this negative number is added, there is no problem; otherwise a panic occurs.
  2. The Done method is called too many times, exceeding the WaitGroup count.
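
The second case can be reproduced in a few lines. In this sketch the extraDone helper is a hypothetical name; it recovers from the panic only so that the message can be inspected, which is not something production code would normally do:

```go
package main

import (
	"fmt"
	"sync"
)

// extraDone is a hypothetical helper that calls Done once more than Add
// allows and returns the recovered panic message.
func extraDone() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var wg sync.WaitGroup
	wg.Add(1)
	wg.Done() // counter: 1 -> 0
	wg.Done() // counter would become -1, so this call panics
	return "no panic"
}

func main() {
	fmt.Println(extraDone())
}
```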

Add called at the wrong time

When using WaitGroup, the principle you must follow is to call Wait only after all the Add calls have been made; otherwise it may cause a panic or unexpected results.
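
A common way to break this rule is to call Add inside the goroutine being started, so that Wait may run before any Add has happened. The sketch below shows one way to keep the order safe; runTasks is a hypothetical helper, and the comment marks where the mistaken Add would be:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runTasks is a hypothetical helper that starts n goroutines and returns
// how many of them had finished by the time Wait returned.
func runTasks(n int) int32 {
	var wg sync.WaitGroup
	var finished int32

	wg.Add(n) // every Add happens here, before any goroutine starts and before Wait
	for i := 0; i < n; i++ {
		go func() {
			// Calling wg.Add(1) here instead would be the timing error:
			// Wait could observe a zero counter and return too early.
			defer wg.Done()
			atomic.AddInt32(&finished, 1)
		}()
	}

	wg.Wait()
	return atomic.LoadInt32(&finished)
}

func main() {
	fmt.Println(runTasks(5)) // prints 5
}
```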

Reusing WaitGroup before the previous Wait is over

As long as the WaitGroup's count has returned to zero, it can be treated as a newly created WaitGroup and reused. However, if we reuse the WaitGroup before the previous Wait call has returned, the program may panic. Let's look at an example. The count of the WaitGroup is initially set to 1; a goroutine is then started that first calls Done and then calls Add. That Add call may execute concurrently with Wait in the main goroutine.

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        time.Sleep(time.Millisecond)
        wg.Done() // decrement the counter by 1
        wg.Add(1) // increment the counter by 1
    }()
    wg.Wait() // the main goroutine waits; this may run concurrently with the wg.Add(1) call on line 7
}

In this example, the Done call on line 6 restores the WaitGroup count to 0 and wakes the waiter blocked in Wait on line 9. As soon as that goroutine wakes up, the rest of its Wait call may run concurrently with the Add call on line 7. The full implementation of Wait checks that state is still 0 after waking up (a check omitted from the simplified code above), so when it detects this conflict it panics.

Although WaitGroup can be reused, there is a precondition: you must wait until the previous round of Wait has returned before reusing the WaitGroup for the next round of Add/Wait. If you call Add for the next round before the previous Wait has finished, a panic is possible.
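
One way to satisfy this precondition is to run the rounds strictly one after another in the same goroutine, so that each Wait has returned before the next Add. The runRounds helper below is a hypothetical sketch of that pattern:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runRounds is a hypothetical helper that reuses one WaitGroup for several
// rounds and returns the total number of tasks completed.
func runRounds(rounds, workers int) int64 {
	var wg sync.WaitGroup
	var total int64
	for r := 0; r < rounds; r++ {
		wg.Add(workers) // safe: the previous round's Wait has already returned
		for i := 0; i < workers; i++ {
			go func() {
				defer wg.Done()
				atomic.AddInt64(&total, 1)
			}()
		}
		wg.Wait() // finish this round completely before the next Add
	}
	return atomic.LoadInt64(&total)
}

func main() {
	fmt.Println(runRounds(3, 4)) // prints 12
}
```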


Author: 与昊