Introduction | This article comes from the Tencent Cloud Developer Community column [Technical Thinking Guangyi · Original Collection by Tencent Technicians], a sharing and communication window the community created for Tencent engineers and the wider developer community. The column invites Tencent engineers to share original technical work and to learn and grow together with developers. The author of this article is Luo Yuanguo, a backend development engineer at Tencent.
Stack Memory

Stack memory is allocated and released automatically by the compiler. The stack stores function parameters and local variables, which are created when a function is called and destroyed when it returns.

Each goroutine maintains its own stack, which only that goroutine can use; no other goroutine may touch it.

Today a stack starts at 2KB, but the stack's structure and initial size have changed across several versions:
v1.0~v1.1: The minimum stack memory space is 4KB.
v1.2: Raised the minimum stack memory to 8KB.
v1.3: Replace the previous version's segmented stack with a contiguous stack.
v1.4~v1.19: Reduced the minimum stack memory to 2KB.
The stack structure evolved from the segmented stack to the contiguous stack, as described below.
Segmented Stack

As a goroutine's call depth grows, or as it needs more and more local variables, the runtime calls runtime.morestack and runtime.newstack to create new stack segments. These segments are not contiguous in memory; instead, the goroutine's multiple segments are linked together as a doubly linked list, and the runtime follows pointers to traverse the chain of segments.
Advantage: memory is allocated for the goroutine on demand, and unused segments are released promptly, keeping memory usage low.

Disadvantage: if the current goroutine's stack segment is nearly full, any function call triggers a stack growth, and the stack shrinks again when the function returns. If such a call sits inside a loop, the repeated allocation and release of segments incurs huge overhead. This is known as the hot split problem.
To mitigate this problem, Go had to raise the initial stack size from 4KB to 8KB in version 1.2.
Contiguous Stack

The contiguous stack solves the problems of the segmented stack. The core principle: whenever the program's stack space is insufficient, the runtime initializes a new stack twice the size of the old one and migrates all values from the old stack to the new one, so there is enough room for new local variables and function calls.
The expansion caused by insufficient stack space will go through the following steps:
Call runtime.newstack to allocate a larger stack memory space in the memory space.
Use runtime.copystack to copy everything from the old stack to the new stack.
Adjust pointers that point into the old stack so they point to the corresponding variables on the new stack.
Call runtime.stackfree to destroy and reclaim the memory space of the old stack.
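The copy-and-adjust behavior in the steps above can be observed from ordinary Go code: force a stack growth with deep recursion and watch the address of a variable change. This is an illustrative sketch; the helper names (`grow`, `stackMoved`) are invented for the example, and it assumes the variable stays on the stack (which it does in practice for a `uintptr` conversion, since a `uintptr` is not adjusted by `runtime.copystack` while a real pointer is).

```go
package main

import (
	"fmt"
	"unsafe"
)

// grow forces the goroutine stack past its initial size, so the
// runtime allocates a larger contiguous stack and copies the old
// one into it (runtime.newstack -> runtime.copystack).
//go:noinline
func grow(n int) {
	var pad [256]byte // make each frame big enough to matter
	_ = pad
	if n > 0 {
		grow(n - 1)
	}
}

// stackMoved records the raw address of a stack variable, forces a
// stack growth, and reports whether the variable's address changed.
func stackMoved() bool {
	var x int
	before := uintptr(unsafe.Pointer(&x)) // raw integer: NOT adjusted by copystack
	grow(4096)                            // ~1MB of frames, far past the 2KB initial stack
	after := uintptr(unsafe.Pointer(&x))  // &x is a real pointer, so it WAS adjusted
	return before != after
}

func main() {
	fmt.Println("stack moved:", stackMoved())
}
```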
Like heap memory, stack memory is allocated from mSpans, but the mSpan state differs by use: spans used for heap memory are in state mSpanInUse, while spans used for stack memory are in state mSpanManual.
The runtime maintains two important global variables for stack space: runtime.stackpool and runtime.stackLarge, the global stack cache and the large-stack cache respectively. The former serves allocations smaller than 32KB; the latter serves allocations of 32KB or more. To improve stack allocation efficiency, the scheduler initializes these two global objects during startup. They are introduced below:
(1) stackpool

stackpool serves stack allocations below 32KB. Stack sizes must be powers of two, with a minimum of 2KB. On Linux, stackpool provides mSpan linked lists in four sizes: 2KB, 4KB, 8KB, and 16KB. The stackpool structure is defined as follows:
// Global pool of spans that have free stacks.
// Stacks are assigned an order according to size.
//
// order = log_2(size/FixedStack)
//
// There is a free list for each order.
var stackpool [_NumStackOrders]struct {
	item stackpoolItem
	_    [cpu.CacheLinePadSize - unsafe.Sizeof(stackpoolItem{})%cpu.CacheLinePadSize]byte
}

//go:notinheap
type stackpoolItem struct {
	mu   mutex
	span mSpanList
}
// mSpanList heads a linked list of spans.
//
//go:notinheap
type mSpanList struct {
	first *mspan // first span in list, or nil if none
	last  *mspan // last span in list, or nil if none
}
(2) stackLarge

Stacks of 32KB or more are allocated from stackLarge, which is also an array of mSpan linked lists; its length is heapAddrBits − pageShift (35 on linux/amd64). The mSpan sizes start at 8KB (one page), and each subsequent list holds mSpans twice the size of the previous one.

The 8KB and 16KB lists are in practice always empty; they are reserved so that the base-2 logarithm of an mSpan's page count can be used directly as the array index. The stackLarge structure is defined as follows:
// Global pool of large stack spans.
var stackLarge struct {
	lock mutex
	free [heapAddrBits - pageShift]mSpanList // free lists by log_2(s.npages)
}
(3) Memory allocation

If the runtime allocated stack memory only from global variables, threads would inevitably contend for locks, hurting the program's execution efficiency. Since stack memory is closely tied to threads, the runtime adds a per-thread stack cache to mcache to reduce the impact of lock contention.

As with heap memory allocation, each P has a local cache (mcache.stackcache) for stack allocation, which acts as a local cache in front of stackpool. It is defined in mcache as follows:
//go:notinheap
type mcache struct {
	// The following members are accessed on every malloc,
	// so they are grouped here for better caching.
	nextSample uintptr // trigger heap sample after allocating this many bytes
	scanAlloc  uintptr // bytes of scannable heap allocated

	// Allocator cache for tiny objects w/o pointers.
	// See "Tiny allocator" comment in malloc.go.

	// tiny points to the beginning of the current tiny block, or
	// nil if there is no current tiny block.
	//
	// tiny is a heap pointer. Since mcache is in non-GC'd memory,
	// we handle it by clearing it in releaseAll during mark
	// termination.
	//
	// tinyAllocs is the number of tiny allocations performed
	// by the P that owns this mcache.
	tiny       uintptr
	tinyoffset uintptr
	tinyAllocs uintptr

	// The rest is not accessed on every malloc.
	alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass

	stackcache [_NumStackOrders]stackfreelist

	// flushGen indicates the sweepgen during which this mcache
	// was last flushed. If flushGen != mheap_.sweepgen, the spans
	// in this mcache are stale and need to be flushed so they
	// can be swept. This is done in acquirep.
	flushGen uint32
}
stackcache [_NumStackOrders]stackfreelist is the stack's local cache. On Linux, each P's local cache has four (_NumStackOrders) free-block linked lists: 2KB, 4KB, 8KB, and 16KB. The constant is defined as follows:
// Number of orders that get caching. Order 0 is FixedStack
// and each successive order is twice as large.
// We want to cache 2KB, 4KB, 8KB, and 16KB stacks. Larger stacks
// will be allocated directly.
// Since FixedStack is different on different systems, we
// must vary NumStackOrders to keep the same maximum cached size.
// OS | FixedStack | NumStackOrders
// -----------------+------------+---------------
// linux/darwin/bsd | 2KB | 4
// windows/32 | 4KB | 3
// windows/64 | 8KB | 2
// plan9 | 4KB | 3
_NumStackOrders = 4 - goarch.PtrSize/4*goos.IsWindows - 1*goos.IsPlan9
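The formula above can be checked against the table in the comment by evaluating it for each OS. This is a small illustrative sketch; the function `numStackOrders` is invented here to mirror the runtime expression `4 - goarch.PtrSize/4*goos.IsWindows - 1*goos.IsPlan9` as plain arithmetic.

```go
package main

import "fmt"

// numStackOrders evaluates the runtime formula for _NumStackOrders:
//
//	4 - PtrSize/4*IsWindows - 1*IsPlan9
//
// ptrSize is 4 or 8; isWindows and isPlan9 are 0 or 1, as in the
// goos package's integer flags.
func numStackOrders(ptrSize, isWindows, isPlan9 int) int {
	return 4 - ptrSize/4*isWindows - 1*isPlan9
}

func main() {
	fmt.Println(numStackOrders(8, 0, 0)) // linux/darwin/bsd (64-bit): 4 orders
	fmt.Println(numStackOrders(4, 1, 0)) // windows/32: 3 orders
	fmt.Println(numStackOrders(8, 1, 0)) // windows/64: 2 orders
	fmt.Println(numStackOrders(8, 0, 1)) // plan9: 3 orders
}
```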
Stack allocations smaller than 32KB:

For stacks smaller than 32KB, the current P's local cache is used first.

If the list for the requested size in the local cache is empty, 16KB worth of memory blocks are moved from stackpool into the local cache (stackcache), and the allocation is then served from the local cache.

If the corresponding list in stackpool is also empty, a 32KB span is allocated directly from heap memory, split into blocks of the requested size, and placed in stackpool.

In some situations the local cache cannot be used (for example, when the goroutine's M has no P attached); in those cases the allocation is served directly from stackpool.
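The size-to-order mapping that drives the small-stack path can be sketched as follows. This is a simplified illustration of the order computation at the top of `runtime.stackalloc` (order = log2(size/FixedStack)); the function name `stackOrder` is invented, and FixedStack is assumed to be 2KB as on Linux.

```go
package main

import "fmt"

// stackOrder computes the free-list order used by stackpool and the
// per-P stackcache: order = log2(size/fixedStack). Only orders 0..3
// (2KB..16KB) are served from the caches; larger stacks go through
// stackLarge instead.
func stackOrder(size uintptr) (order int, cached bool) {
	const fixedStack = 2048 // minimum stack size on Linux
	if size < fixedStack || size&(size-1) != 0 {
		return 0, false // stack sizes must be powers of two >= 2KB
	}
	for n := size / fixedStack; n > 1; n >>= 1 {
		order++
	}
	return order, order < 4
}

func main() {
	for _, size := range []uintptr{2048, 4096, 8192, 16384, 32768} {
		order, cached := stackOrder(size)
		fmt.Printf("%6d bytes -> order %d, served from cache: %v\n", size, order, cached)
	}
}
```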
Stack allocations of 32KB or more:

Compute the number of 8KB pages needed and take its base-2 logarithm (log2(npages)); use the result as an index into the stackLarge array to find the corresponding free mSpan list.

If that list is not empty, take an mSpan from it and use it. If the list is empty, allocate a span with the required number of pages directly from heap memory and use it for the stack.

For example, to allocate a 64KB stack: 64/8 = 8 pages, and log2(8) = 3, so index 3 is used.
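The index computation in this example can be written out directly. This is an illustrative sketch, not the runtime code; the function name `stackLargeIndex` is invented, and the 8KB page size (pageShift = 13) matches linux/amd64.

```go
package main

import "fmt"

// stackLargeIndex computes the stackLarge free-list index for a large
// stack: the base-2 logarithm of the number of 8KB pages the stack
// occupies.
func stackLargeIndex(size uintptr) int {
	const pageShift = 13 // 8KB pages
	npage := size >> pageShift
	log2 := 0
	for npage > 1 {
		npage >>= 1
		log2++
	}
	return log2
}

func main() {
	fmt.Println(stackLargeIndex(64 << 10)) // 64KB = 8 pages -> index log2(8)
	fmt.Println(stackLargeIndex(32 << 10)) // 32KB = 4 pages -> index log2(4)
}
```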
(4) Memory release

When is a stack released? When a goroutine finishes:

If its stack has not grown (still 2KB), the goroutine is put on the free-G list with its stack kept.

If its stack has grown, the stack is released and the goroutine is put on the free-G list without a stack.

The stacks of these idle goroutines are also released when the GC runs markroot; at that point those goroutines are moved to the no-stack idle list as well.

So a regular goroutine's stack is released in two places:

First, when the goroutine exits, gfput releases a grown stack, while a g whose stack never grew is put on sched.gFree.stack.

Second, the GC walks the sched.gFree.stack list, releases the stacks of all the g's on it, and moves them to the sched.gFree.noStack list.
When a stack is released, is it returned to the current P's local cache, to the global stack cache, or straight back to heap memory? Any of these is possible; it depends on the situation. As with allocation, stacks smaller than 32KB and stacks of 32KB or more are treated differently on release.

Stacks smaller than 32KB are returned to the local cache. If the total stack space on the corresponding local list then exceeds 32KB, part of it is returned to stackpool, keeping only 16KB on the local list. If the local cache cannot be used, the blocks go directly back to stackpool. Furthermore, if all memory blocks of an mSpan are found to be free, the mSpan is returned to heap memory.

For stacks of 32KB or more: if the GC is currently in the off phase (gcphase == _GCoff), the span is released directly back to heap memory; otherwise it is first put back into stackLarge.
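The release paths described above can be summarized as a small decision function. This is an illustrative sketch of the branching in `runtime.stackfree`, not the real code; the type and names (`stackFreeDest`, `cacheOK`, `gcOff`) are invented for the example.

```go
package main

import "fmt"

// stackFreeDest picks the destination for a freed stack of the given
// size. cacheOK reports whether the P-local stackcache is usable, and
// gcOff reports whether the GC is in the _GCoff phase.
func stackFreeDest(size uintptr, cacheOK, gcOff bool) string {
	const maxCachedStack = 32 << 10 // 32KB: boundary between the two paths
	if size < maxCachedStack {
		if cacheOK {
			return "local stackcache" // may later spill excess back to stackpool
		}
		return "stackpool"
	}
	if gcOff {
		return "heap" // released directly back to heap memory
	}
	return "stackLarge"
}

func main() {
	fmt.Println(stackFreeDest(4<<10, true, false))
	fmt.Println(stackFreeDest(64<<10, true, true))
	fmt.Println(stackFreeDest(64<<10, true, false))
}
```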
(5) Stack expansion

While a goroutine runs, its stack grows and shrinks as needed. The default maximum is 1GB on 64-bit systems. The initial size and the upper limit can be found in the Go source file runtime/stack.go.
Expansion process: the compiler inserts calls to runtime.morestack, so that before almost every function call the goroutine checks whether its stack memory is sufficient. If the stack needs to grow, runtime.newstack is called to create a new one.

The size of the old stack is computed from the stack boundaries recorded in the goroutine's stack information mentioned above, and a new stack of twice that size is created. Before creating it, the runtime checks that the new size does not exceed the memory limit for a single stack.

The most complicated part of the whole process is adjusting every pointer into the old stack so that it points into the new stack. Once that is done, the old stack's memory is released.
(6) Stack shrinking

While a goroutine runs, if it uses no more than 1/4 of its stack space, runtime.shrinkstack shrinks the stack during garbage collection. Of course, a batch of safety checks runs first, and shrinking proceeds only if they all pass.

Shrinking process:
When shrinking is triggered, the new stack is half the size of the old one; if that would fall below the program's 2KB minimum, the shrinking process stops.

Shrinking calls the same runtime.copystack function used during expansion: it opens up a new stack space, copies the old stack's data over, and adjusts the original pointers.

The only place that initiates stack shrinking is the GC. When the GC uses the scanstack function to find and mark root nodes, it shrinks a stack if it can do so safely; if it cannot do so immediately, it sets the stack-shrink flag (g.stackPreempt).

This flag is checked before the goroutine yields the CPU; if it is set, the stack is shrunk first, and then the CPU is yielded.
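The sizing rule above can be sketched as a small function. This is an illustrative sketch of the decision in `runtime.shrinkstack` under the stated assumptions (2KB minimum, shrink when usage is below 1/4); the function name `shrinkTarget` is invented for the example.

```go
package main

import "fmt"

// shrinkTarget returns the stack size after a shrink attempt: if no
// more than a quarter of the stack is in use, halve it, but never go
// below the 2KB minimum.
func shrinkTarget(total, used uintptr) uintptr {
	const minStack = 2048
	if used >= total/4 {
		return total // more than a quarter in use: keep the stack as-is
	}
	if total/2 < minStack {
		return total // halving would go below the 2KB minimum: stop
	}
	return total / 2
}

func main() {
	fmt.Println(shrinkTarget(32<<10, 4<<10))  // lightly used 32KB stack: halved
	fmt.Println(shrinkTarget(32<<10, 16<<10)) // heavily used 32KB stack: kept
	fmt.Println(shrinkTarget(2048, 0))        // already at the minimum: kept
}
```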
References:
1. GoLang's Stack Memory Management
2. Plain-Language Go Memory Management Trilogy (2): Demystifying Stack Memory Management