Hello everyone, I am Fried Fish.

I read Dave Cheney's "Inlining optimisations in Go" while studying at home over the weekend and found it full of useful material, so I translated it, trimmed it a little, and am sharing it with everyone here.

It explains how the Go compiler implements inlining and how this optimization affects your Go code.

Now, let's dig in.

What is inlining?

Inlining is the act of merging smaller functions into their respective callers. How this is done has changed over the history of computing:

  • Early on: this optimization was usually performed by hand.
  • Now: inlining is one of a class of fundamental optimizations that happens automatically during compilation.

Why is inlining important?

Inlining is important in essentially every compiled language, for two reasons:

  • It removes the overhead of the function call itself.
  • It allows the compiler to apply other optimization strategies more efficiently.

Basically, it's better performance.

Function call overhead

The basics

There is a cost to calling a function in any language. Putting parameters into registers or the stack (depending on the ABI) and reversing the process on return are all overhead.

Calling a function requires jumping the program counter from one point in the instruction stream to another, which can cause the pipeline to stall. Once inside a function, some prologue is usually needed to prepare a new stack frame for the function's execution, and a similar epilogue is needed to exit the frame before returning to the caller.
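
If you want to see these prologue and epilogue instructions for yourself, you can ask the compiler to print the assembly it generates with the -gcflags=-S flag (the exact output depends on your Go version and target architecture):

 % go build -gcflags=-S

For most functions you should be able to spot the entry prologue described in the next section and the matching epilogue before each return.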

Overhead in Go

In Go, a function call has an extra cost to support dynamic stack growth. On entry, the amount of stack space available to the goroutine is compared to the amount required by the function.

If there is not enough stack space available, the prologue jumps to runtime logic to increase the stack by copying it to a new, larger location.

Once this is done, the runtime jumps back to the start of the original function, does another stack check, which now passes, and continues the call. In this way, goroutines can start with a small stack allocation and only grow when needed.

This check is cheap, requires only a few instructions, and since the goroutine's stack grows exponentially, the check rarely fails. Therefore, branch prediction units in modern processors can hide the cost of stack checks by assuming that stack checks always succeed. In the case where the processor mispredicts stack checks and has to discard the work it did while speculatively executing, the cost of pipeline stalls is relatively small compared to the cost of work required to grow the goroutine stack at runtime.
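
Below is a minimal, purely conceptual sketch of that check. The names here (gstack, guard, needsGrow) are hypothetical and are not the real runtime's types; the sketch only illustrates the cheap comparison the compiler inserts into the prologue:

 package main

import "fmt"

// gstack is a hypothetical stand-in for a goroutine's stack bookkeeping;
// the real runtime uses different types and fields.
type gstack struct {
    sp    uintptr // current stack pointer
    guard uintptr // threshold the prologue compares against
}

// needsGrow mirrors the cheap comparison done on function entry: is there
// enough room below the stack pointer for this function's frame?
func needsGrow(s *gstack, frameSize uintptr) bool {
    return s.sp-frameSize < s.guard
}

func main() {
    s := &gstack{sp: 0x2000, guard: 0x1800}
    if needsGrow(s, 0x900) {
        // In the real runtime this branch jumps to the stack-growth logic,
        // which copies the stack to a larger allocation and restarts the
        // function; here we only report it.
        fmt.Println("stack would need to grow")
    }
}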

Optimization in Go

While modern processors mitigate the overhead of both the generic and the Go-specific parts of each function call with techniques such as speculative execution, that overhead cannot be eliminated entirely, so every function call carries a performance cost over and above the time spent doing useful work. Since the overhead of a function call is fixed, smaller functions pay a relatively higher cost than larger ones, because they tend to do less useful work per call.

So the way to eliminate these overheads is to eliminate the function call itself, which the Go compiler does, under certain conditions, by replacing the call to a function with the body of the function. This is known as inlining, because it brings the body of the function inline with its caller.

Improved optimization opportunities

Dr. Cliff Click describes inlining as the optimization performed by modern compilers, because it forms the basis for other optimizations such as constant propagation and dead code elimination.

In effect, inlining lets the compiler see further, allowing it to observe, in the context of a particular call, logic that can be further simplified or eliminated entirely.

Since inlining can be applied recursively, optimization decisions can be made not only in the context of each individual function, but also applied to function chains in the call path.
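
As a small illustration (the function isPositive below is made up for this example), once a tiny function is inlined at a call site with constant arguments, constant propagation and dead code elimination can fold the whole branch away:

 package main

import "fmt"

// isPositive is small enough to be inlined at its call sites.
func isPositive(n int) bool {
    return n > 0
}

func main() {
    // After inlining, this becomes `if 42 > 0 { ... }`; constant propagation
    // turns the condition into `true`, and dead code elimination removes the
    // else branch entirely.
    if isPositive(42) {
        fmt.Println("positive")
    } else {
        fmt.Println("not positive")
    }
}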

Inlining in action

With inlining disabled

The effect of inlining can be demonstrated with this small example:

 package main

import "testing"

//go:noinline
func max(a, b int) int {
    if a > b {
        return a
    }
    return b
}

var Result int

func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        r = max(-1, i)
    }
    Result = r
}

Running this benchmark yields the following results:

 % go test -bench=. 
BenchmarkMax-4   530687617         2.24 ns/op

From this result, the cost of a call to max(-1, i) is about 2.24 ns, which already seems reasonably fast.

With inlining allowed

Now let's remove the //go:noinline pragma and see how the performance changes when inlining is allowed.

The result is as follows:

 % go test -bench=. 
BenchmarkMax-4   1000000000         0.514 ns/op

Comparing the two results, 2.24 ns versus 0.51 ns, the inlined version is more than four times faster; according to benchstat, inlining improves performance by about 78%.

The result is as follows:

 % benchstat {old,new}.txt
name   old time/op  new time/op  delta
Max-4  2.21ns ± 1%  0.49ns ± 6%  -77.96%  (p=0.000 n=18+19)
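
As an aside, the old.txt and new.txt files above are simply the benchmark run several times before and after removing the pragma, and you can ask the compiler to report its inlining decisions with the -m flag; with the pragma removed you should see messages along the lines of "can inline max" and "inlining call to max":

 % go test -bench=. -count=20 > old.txt   # with //go:noinline
% go test -bench=. -count=20 > new.txt   # after removing the pragma
% benchstat old.txt new.txt
% go build -gcflags=-m                   # prints the compiler's inlining decisions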

Where do these improvements come from?

First, removing the function call and its associated prologue is the main contributor to the improvement. Pulling the contents of the max function into its caller reduces the number of instructions the processor executes and eliminates several branches.

Now that the contents of the max function are visible to the compiler, it can make some additional improvements when it optimizes BenchmarkMax.

Note that once max is inlined, the body of BenchmarkMax as the compiler sees it is no longer what the user wrote; it is equivalent to the following code:

 func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        if -1 > i {
            r = -1
        } else {
            r = i
        }
    }
    Result = r
}

Running the benchmark again, we see that our manually inlined version performs just as well as the compiler inlined version.

The result is as follows:

 % benchstat {old,new}.txt
name   old time/op  new time/op  delta
Max-4  2.21ns ± 1%  0.48ns ± 3%  -78.14%  (p=0.000 n=18+18)

Now that the compiler has inlined max into BenchmarkMax, it can apply optimizations that were not possible before.

For example, the compiler notices that i is initialized to 0 and only ever incremented, so any comparison with i can assume that i is never negative. The condition -1 > i will therefore never be true.

After proving that -1 > i can never be true, the compiler can simplify the code to:

 func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        if false {  // note: the condition is now constant false
            r = -1
        } else {
            r = i
        }
    }
    Result = r
}

And since that branch is now a constant, the compiler can eliminate unreachable paths, leaving only code like this:

 func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        r = i
    }
    Result = r
}

Through inlining and the optimizations it unlocks, the compiler has reduced the expression r = max(-1, i) to simply r = i.

This example nicely illustrates the process of inlining optimization and where the performance improvement comes from.

Inline Restrictions

So far, this article has discussed only so-called leaf inlining: inlining a function at the bottom of the call stack into its immediate caller.

Inlining is a recursive process: once a function is inlined into its caller, the compiler may inline the resulting code into its caller in turn, and so on.

For example, consider the following code:

 func BenchmarkMaxMaxMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        r = max(max(-1, i), max(0, i))
    }
    Result = r
}

This will run as fast as the previous example because the compiler is able to repeatedly apply the above optimizations, reducing the code to the same r = i expression.

Summary

This article has introduced the basic concepts of inlining and analyzed them step by step through Go examples, to give a concrete understanding of how it behaves in real code.

Go compiler optimizations like this are everywhere, even when you don't notice them.

This article is continuously updated; you can also read it on WeChat by searching for [Brain Fried Fish]. It has been collected on GitHub at github.com/eddycjy/blog, together with a Go learning map and roadmap. Stars are welcome and help motivate further updates.
