Almost everyone in China who studies Go's internals has probably read this article, and it is worth reading if you plan to learn the runtime.

As we all know, Go uses plan9 assembly, invented by the Unix old-timers (just kidding). Even if you know some x86 assembly, plan9 differs in places. You may be reading some code and discover that what looks like SP is not actually SP, which can drive you crazy.

This article gives a comprehensive introduction to plan9 assembly and answers most of the questions you are likely to run into when working with it.

The platform used in this article is linux amd64. Different platforms have different instruction sets and registers, so they cannot be discussed together. That is simply the nature of assembly.

Basic instructions

Stack adjustment

Intel and AT&T assembly provide the push and pop instruction families. plan9 does have push and pop instructions, but they generally do not appear in generated code. Most of the stack adjustments we see are implemented by doing arithmetic on the hardware SP register, for example:

SUBQ $0x18, SP // subtract from SP to allocate the function's stack frame
...               // irrelevant code omitted
ADDQ $0x18, SP // add to SP to release the function's stack frame

The general instructions are similar to the X64 platform, which will be detailed in the following sections.

Data handling

Constants are written as $num in plan9 assembly and may be negative. They are decimal by default; use the form $0x123 for a hexadecimal number.

MOVB $1, DI      // 1 byte
MOVW $0x10, BX   // 2 bytes
MOVL $1, DX      // 4 bytes
MOVQ $-10, AX     // 8 bytes

As you can see, the width of the move is determined by the MOV suffix, which differs slightly from Intel assembly. Compare the similar x64 assembly:

mov rax, 0x1   // 8 bytes
mov eax, 0x100 // 4 bytes
mov ax, 0x22   // 2 bytes
mov ah, 0x33   // 1 byte
mov al, 0x44   // 1 byte

The direction of operands in plan9 assembly is opposite to intel assembly, similar to AT&T.

MOVQ $0x10, AX ===== mov rax, 0x10
       |    |------------|      |
       |------------------------|

However, there are exceptions to every rule. To learn about them, see [1] in the references.

Common calculation instructions

ADDQ  AX, BX   // BX += AX
SUBQ  AX, BX   // BX -= AX
IMULQ AX, BX   // BX *= AX

Similar to data handling instructions, you can also modify the suffix of the instruction to correspond to operands of different lengths. For example, ADDQ/ADDW/ADDL/ADDB.
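To make the suffix convention concrete, here is a minimal Go sketch that maps each operand-size suffix to a width in bytes. The helper name suffixWidth is hypothetical, and the pairing of suffixes with Go integer types is only an illustration of the widths, not something the assembler itself does.

```go
package main

import (
	"fmt"
	"unsafe"
)

// suffixWidth returns the operand width in bytes for a plan9 amd64
// instruction suffix (B, W, L, Q). It returns 0 for an unknown suffix.
// The Go types are used only to illustrate the matching widths.
func suffixWidth(suffix byte) int {
	switch suffix {
	case 'B':
		return int(unsafe.Sizeof(int8(0)))
	case 'W':
		return int(unsafe.Sizeof(int16(0)))
	case 'L':
		return int(unsafe.Sizeof(int32(0)))
	case 'Q':
		return int(unsafe.Sizeof(int64(0)))
	}
	return 0
}

func main() {
	for _, op := range []string{"ADDB", "ADDW", "ADDL", "ADDQ"} {
		fmt.Printf("%s operates on %d byte(s)\n", op, suffixWidth(op[len(op)-1]))
	}
}
```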

Conditional Jump/Unconditional Jump

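// Unconditional jumps
JMP addr   // jump to an address; it can be an address in the code, though handwritten code never really does this
JMP label  // jump to a label, which must be inside the same function
JMP 2(PC)  // jump forward/backward x lines relative to the current instruction
JMP -2(PC) // same as above

// Conditional jumps
JZ target // jump if the zero flag is set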

Instruction Set

You can refer to the arch part of the source code.

As an extra note, Go 1.10 added support for many SIMD instructions, so from that version on writing them is less painful than before: there is no longer any need to fill in the instruction bytes by hand.

Registers

General-purpose registers

General registers of amd64:

(lldb) reg read
General Purpose Registers:
       rax = 0x0000000000000005
       rbx = 0x000000c420088000
       rcx = 0x0000000000000000
       rdx = 0x0000000000000000
       rdi = 0x000000c420088008
       rsi = 0x0000000000000000
       rbp = 0x000000c420047f78
       rsp = 0x000000c420047ed8
        r8 = 0x0000000000000004
        r9 = 0x0000000000000000
       r10 = 0x000000c420020001
       r11 = 0x0000000000000202
       r12 = 0x0000000000000000
       r13 = 0x00000000000000f1
       r14 = 0x0000000000000011
       r15 = 0x0000000000000001
       rip = 0x000000000108ef85  int`main.main + 213 at int.go:19
    rflags = 0x0000000000000212
        cs = 0x000000000000002b
        fs = 0x0000000000000000
        gs = 0x0000000000000000

All of these can be used in plan9 assembly. The general-purpose registers used in application code are mainly rax, rbx, rcx, rdx, rdi, rsi, and r8~r15, 14 registers in all. rbp and rsp can also be used, but bp and sp manage the top and bottom of the stack, so it is best not to use them for calculations.

Registers used in plan9 do not need to be prefixed with r or e, such as rax, just write AX:

MOVQ $101, AX = mov rax, 101

The following is the correspondence between the names of general-purpose registers in X64 and plan9:

X64    rax  rbx  rcx  rdx  rdi  rsi  rbp  rsp  r8  r9  r10  r11  r12  r13  r14  rip
Plan9  AX   BX   CX   DX   DI   SI   BP   SP   R8  R9  R10  R11  R12  R13  R14  PC

Pseudo register

Go assembly also introduces 4 pseudo-registers. Quoting the official documentation:

  • FP: Frame pointer: arguments and locals.
  • PC: Program counter: jumps and branches.
  • SB: Static base pointer: global symbols.
  • SP: Stack pointer: top of stack.

The official description has some problems. We will expand these descriptions a bit:

  • FP: Use the form symbol+offset(FP) to refer to the function's input parameters, for example arg0+0(FP) and arg1+8(FP). Using FP without a symbol will not compile. At the assembly level the symbol has no effect; it is there mainly to improve readability. In addition, although the official document calls the pseudo-register FP a frame pointer, it is not a frame pointer at all. By traditional x86 convention, the frame pointer is the BP register, which points to the bottom of the whole stack frame. If the current callee is add and add's code references FP, the location FP points to is not in the callee's stack frame but in the caller's. For details, see the Stack structure section.
  • PC: This is the familiar program counter from computer architecture. It corresponds to the ip register on x86 and to rip on amd64. Apart from the occasional jump, handwritten plan9 code rarely deals with the PC register.
  • SB: The global static base pointer, generally used to declare functions and global variables. You will see concrete usage in the sections on functions and in the examples later.
  • SP: The plan9 SP register points to the start of the current stack frame's local variables, and local variables are referenced in the form symbol+offset(SP). Legal values of offset lie in [-framesize, 0), a half-open interval that includes -framesize and excludes 0. If every local variable is 8 bytes, the first local variable is localvar0-8(SP). This is another poorly named register, and it is not the same thing as the hardware register SP. When the stack frame size is 0, the pseudo-register SP and the hardware register SP point to the same location. In handwritten assembly, symbol+offset(SP) means the pseudo-register SP, while offset(SP) means the hardware register SP. Be careful. In code produced by the toolchain (go tool compile -S / go tool objdump), every SP is currently the hardware register SP, whether or not it carries a symbol.

Here we briefly explain the points that are easy to confuse:

  1. Pseudo SP and hardware SP are not the same thing. When writing code, the distinction between pseudo SP and hardware SP is to see if there is a symbol in front of the SP. If there is a symbol, it is a pseudo register, if not, it is a hardware SP register.
  2. The relative position of SP and FP will change, so you should not try to use the pseudo SP register to find the values referenced by FP + offset, such as the input parameters and return values of functions.
  3. The pseudo SP mentioned in the official document points to the top of the stack, which is problematic. The location of the local variable pointed to is actually the bottom of the entire stack (except for caller BP), so bottom is more appropriate.
  4. In the output of go tool objdump / go tool compile -S there are no pseudo SP or FP registers, so the method above for telling pseudo SP from hardware SP does not apply to the output of these two commands. Compiled and disassembled code contains only the real SP register.
  5. FP and the framepointer in Go's official source code are not the same thing. The framepointer in the source code refers to the value of the caller's BP register, which here equals the caller's pseudo SP.

It doesn't matter if you don't understand the description above; it should make sense once you are familiar with the function stack structure and come back to reread it. In my personal opinion, these are pits dug by the Go team itself.

Variable declaration

The so-called variables in assembly are generally read-only values stored in the .rodata or .data section. Corresponding to the application layer, it is the global const, var, and static variables/constants that have been initialized.

Use DATA combined with GLOBL to define a variable. The usage of DATA is:

DATA    symbol+offset(SB)/width, value

Most of the parameters mean what they say, but the offset needs a little attention. It is the offset of the value relative to the symbol, not an offset relative to some global address.

Use the GLOBL instruction to declare the variable as global, and receive two additional parameters, one is the flag, and the other is the total size of the variable.

GLOBL divtab(SB), RODATA, $64

GLOBL must follow the DATA instruction. The following is a complete example that defines multiple readonly global variables:

DATA age+0x00(SB)/4, $18  // forever 18
GLOBL age(SB), RODATA, $4

DATA pi+0(SB)/8, $3.1415926
GLOBL pi(SB), RODATA, $8

DATA birthYear+0(SB)/4, $1988
GLOBL birthYear(SB), RODATA, $4

As mentioned before, when all symbols are declared, their offset is generally 0.

Sometimes you may want to define an array or string in a global variable. In this case, you need to use a non-zero offset, for example:

DATA bio<>+0(SB)/8, $"oh yes i"
DATA bio<>+8(SB)/8, $"am here "
GLOBL bio<>(SB), RODATA, $16

Most of this is straightforward, but we have introduced a new marker here, <>, which follows the symbol name and means the global variable is visible only in the current file, similar to static in C. If the variable is referenced from another file, a relocation target not found error is reported.
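The way a longer value is split across several DATA lines with increasing offsets can be sketched with a small Go generator. This is only an illustration: dataDirectives is a hypothetical helper, and it assumes the string length is a multiple of 8 and that the string needs no escaping beyond what %q produces.

```go
package main

import (
	"fmt"
	"strings"
)

// dataDirectives splits s into 8-byte chunks and emits one DATA line per
// chunk, followed by the GLOBL line for a file-local (<>) read-only symbol.
// Assumption: len(s) is a multiple of 8; other lengths are not handled.
func dataDirectives(symbol, s string) string {
	var b strings.Builder
	for off := 0; off < len(s); off += 8 {
		// The offset is relative to the symbol, not to a global address.
		fmt.Fprintf(&b, "DATA %s<>+%d(SB)/8, $%q\n", symbol, off, s[off:off+8])
	}
	fmt.Fprintf(&b, "GLOBL %s<>(SB), RODATA, $%d\n", symbol, len(s))
	return b.String()
}

func main() {
	fmt.Print(dataDirectives("bio", "oh yes i"+"am here "))
}
```

For the 16-byte string used above, this prints the same three lines as the bio example.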

The flag mentioned in this section can also have other values:

  • NOPROF = 1
(For  `TEXT`  items.) Don't profile the marked function. This flag is deprecated.
  • DUPOK = 2
It is legal to have multiple instances of this symbol in a single binary. The linker will choose one of the duplicates to use.
  • NOSPLIT = 4
(For  `TEXT`  items.) Don't insert the preamble to check if the stack must be split. The frame for the routine, plus anything it calls, must fit in the spare space at the top of the stack segment. Used to protect routines such as the stack splitting code itself.
  • RODATA = 8
(For  `DATA`  and  `GLOBL`  items.) Put this data in a read-only section.
  • NOPTR = 16
(For  `DATA`  and  `GLOBL`  items.) This data contains no pointers and therefore does not need to be scanned by the garbage collector.
  • WRAPPER = 32
(For  `TEXT`  items.) This is a wrapper function and should not count as disabling  `recover`.
  • NEEDCTXT = 64
(For  `TEXT`  items.) This function is a closure so it uses its incoming context register.

When using the literal names of these flags, you must #include "textflag.h" in the assembly file.
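Since each flag listed above is a distinct power of two, several flags can be combined into one value with a bitwise OR. The Go sketch below mirrors the listed values as constants to show the arithmetic; the helper has is hypothetical, and the constants here are a copy for illustration, not an import of textflag.h.

```go
package main

import "fmt"

// Flag values copied from the list above, for illustration only.
const (
	NOPROF   = 1
	DUPOK    = 2
	NOSPLIT  = 4
	RODATA   = 8
	NOPTR    = 16
	WRAPPER  = 32
	NEEDCTXT = 64
)

// has reports whether flag is set in flags.
func has(flags, flag int) bool { return flags&flag != 0 }

func main() {
	flags := RODATA | NOPTR // read-only data that contains no pointers
	fmt.Println(flags, has(flags, RODATA), has(flags, NOSPLIT))
}
```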

Interoperability of global variables in .s and .go files

A .s file can directly use global variables defined in a .go file. Look at the following simple example:

refer.go:

package main

var a = 999
func get() int

func main() {
    println(get())
}

refer.s:

#include "textflag.h"

TEXT ·get(SB), NOSPLIT, $0-8
    MOVQ ·a(SB), AX
    MOVQ AX, ret+0(FP)
    RET

·a(SB) means that the symbol needs the linker to perform relocation for us. If the symbol cannot be found, a relocation target not found error is reported.

The example is relatively simple, you can try it yourself.

Function declaration

Let's take a look at the definition of a typical plan9 assembly function:

// func add(a, b int) int
//   => this declaration lives in any .go file of the same package
//   => it has only the function header, no body
TEXT pkgname·add(SB), NOSPLIT, $0-24
    MOVQ a+0(FP), AX
    MOVQ b+8(FP), BX
    ADDQ AX, BX
    MOVQ BX, ret+16(FP)
    RET

Why is it called TEXT? If you have a little understanding of the segmentation of program data in files and memory, you should know that our code is stored in the .text section in a binary file, which is a conventional way of naming. In fact, TEXT in plan9 is an instruction to define a function. In addition to TEXT, there is also the DATA/GLOBL mentioned in the previous variable declaration.

The pkgname part of the definition can be omitted, and you can write it if you want to. However, if you write pkgname, you need to change the code after renaming the package, so it is recommended not to write it.

The midpoint · is special: it is a Unicode middle dot, typed on a Mac with option+shift+9. After the program is linked, every middle dot · is replaced with a period. For example, your method is runtime·main, and the symbol in the compiled program is runtime.main. Yes, it looks perverse. To summarize briefly:


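                              size of arguments and return values
                                  |
 TEXT pkgname·add(SB),NOSPLIT,$32-32
       |        |               |
 package name func name   stack frame size (local variables + the total argument space needed for any functions it calls, excluding the ret address pushed when calling other functions)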
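The middle-dot substitution described above can be mimicked in a few lines of Go. This is a sketch of the renaming only: linkedName is a hypothetical helper, and it does not model how an omitted package name gets filled in.

```go
package main

import (
	"fmt"
	"strings"
)

// linkedName rewrites the Unicode middle dot used in assembly source
// into the period that appears in the linked symbol name.
func linkedName(asmSymbol string) string {
	return strings.ReplaceAll(asmSymbol, "·", ".")
}

func main() {
	fmt.Println(linkedName("runtime·main"))
}
```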

Stack structure

The following is a stack structure diagram of a typical function:

                                                                                   
                       -----------------                                           
                       current func arg0                                           
                       ----------------- <----------- FP(pseudo FP)                
                        caller ret addr                                            
                       +---------------+                                           
                       | caller BP(*)  |                                           
                       ----------------- <----------- SP(pseudo SP, actually the BP position of the current stack frame)
                       |   Local Var0  |                                           
                       -----------------                                           
                       |   Local Var1  |                                           
                       -----------------                                           
                       |   Local Var2  |                                           
                       -----------------                -                          
                       |   ........    |                                           
                       -----------------                                           
                       |   Local VarN  |                                           
                       -----------------                                           
                       |               |                                           
                       |               |                                           
                       |  temporarily  |                                           
                       |  unused space |                                           
                       |               |                                           
                       |               |                                           
                       -----------------                                           
                       |  call retn    |                                           
                       -----------------                                           
                       |  call ret(n-1)|                                           
                       -----------------                                           
                       |  ..........   |                                           
                       -----------------                                           
                       |  call ret1    |                                           
                       -----------------                                           
                       |  call argn    |                                           
                       -----------------                                           
                       |   .....       |                                           
                       -----------------                                           
                       |  call arg3    |                                           
                       -----------------                                           
                       |  call arg2    |                                           
                       |---------------|                                           
                       |  call arg1    |                                           
                       -----------------   <------------  hardware SP position
                         return addr                                               
                       +---------------+                                           
                                                                                   

In principle, if the current function calls other functions, the return addr is also on the caller's stack, but pushing the return addr onto the stack is done by the CALL instruction, and on RET the SP is restored to the position shown in the figure. When working out the relative positions of SP and the parameters, we can treat the hardware SP as pointing to the position shown in the figure.

The caller BP in the picture refers to the value of the BP register of the caller. Some people call the caller BP the frame pointer of the caller. In fact, this habit is inherited from the x86 architecture. In Go's asm document, the pseudo-register FP is also called a frame pointer, but these two frame pointers are not the same thing at all.

In addition, it should be noted that the caller BP is inserted by the compiler during compilation. When the user writes the code, the caller BP part is not included when calculating the frame size. The main basis for judging whether to insert caller BP is:

  1. The stack frame size of the function is greater than 0
  2. The following function returns true
func Framepointer_enabled(goos, goarch string) bool {
    return framepointer_enabled != 0 && goarch == "amd64" && goos != "nacl"
}

If the compiler does not insert the caller BP (the frame pointer in the source code) in the final assembly, only the 8-byte caller return address lies between pseudo SP and pseudo FP; if BP is inserted, there are an extra 8 bytes. In other words, the relative position of pseudo SP and pseudo FP is not fixed: the gap may be 8 bytes or 16 bytes, and the deciding criteria vary with the platform and the Go version.
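The arithmetic behind these relative positions can be written down as a small Go sketch. The function layout and its parameters are hypothetical names; whether the caller BP slot exists is passed in as an input because, as noted above, that decision depends on the platform and the Go version.

```go
package main

import "fmt"

// layout returns the addresses that pseudo SP and pseudo FP refer to,
// given the hardware SP inside the function body (after the frame has
// been allocated), the declared frame size, and whether the compiler
// inserted the caller BP slot.
func layout(hardwareSP, frameSize uint64, bpSaved bool) (pseudoSP, pseudoFP uint64) {
	pseudoSP = hardwareSP + frameSize // locals sit between hardware SP and pseudo SP
	pseudoFP = pseudoSP + 8           // skip the caller return address
	if bpSaved {
		pseudoFP += 8 // skip the saved caller BP as well
	}
	return pseudoSP, pseudoFP
}

func main() {
	sp, fp := layout(0x1000, 8, true)
	fmt.Printf("frame 8, BP saved:    pseudo SP=%#x pseudo FP=%#x\n", sp, fp)
	sp, fp = layout(0x1000, 0, false)
	fmt.Printf("frame 0, no BP saved: pseudo SP=%#x pseudo FP=%#x\n", sp, fp)
}
```

The addresses are made up; only the differences between them matter.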

As the figure shows, the FP pseudo-register points to the start of the function's incoming parameters. Because the stack grows toward lower addresses, the parameters are laid out in the direction opposite to stack growth so that they are easy to reference through the register, that is:

                              FP
high ----------------------> low
argN, ... arg3, arg2, arg1, arg0

Assuming all parameters are 8 bytes, we can use symname+0(FP) to access the first parameter, symname+8(FP) to access the second, and so on. Referencing local variables through pseudo SP works on a similar principle, but because pseudo SP points to the bottom of the local variables, symname-8(SP) is the first local variable, symname-16(SP) is the second, and so on. Again, this assumes every local variable occupies 8 bytes.
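The two offset rules can be expressed as a pair of tiny Go helpers. argRef and localRef are hypothetical names, and both assume every parameter and local variable is exactly 8 bytes, as in the paragraph above.

```go
package main

import "fmt"

// argRef returns the operand for the i-th (0-based) 8-byte parameter,
// addressed upward from pseudo FP.
func argRef(name string, i int) string {
	return fmt.Sprintf("%s+%d(FP)", name, 8*i)
}

// localRef returns the operand for the i-th (0-based) 8-byte local
// variable, addressed downward from pseudo SP.
func localRef(name string, i int) string {
	return fmt.Sprintf("%s-%d(SP)", name, 8*(i+1))
}

func main() {
	fmt.Println(argRef("arg0", 0), argRef("arg1", 1))
	fmt.Println(localRef("localvar0", 0), localRef("localvar1", 1))
}
```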

The caller return address and current func arg0 at the top of the figure are both allocated by the caller and are not counted in the current stack frame.

Because the official document is rather vague, let's take a panoramic view of a function call to see how the real and pseudo SP/FP/BP relate:

                                                                                                                              
                                       caller                                                                                 
                                 +------------------+                                                                         
                                 |                  |                                                                         
       +---------------------->  --------------------                                                                         
       |                         |                  |                                                                         
       |                         | caller parent BP |                                                                         
       |           BP(pseudo SP) --------------------                                                                         
       |                         |                  |                                                                         
       |                         |   Local Var0     |                                                                         
       |                         --------------------                                                                         
       |                         |                  |                                                                         
       |                         |   .......        |                                                                         
       |                         --------------------                                                                         
       |                         |                  |                                                                         
       |                         |   Local VarN     |                                                                         
                                 --------------------                                                                         
 caller stack frame              |                  |                                                                         
                                 |   callee arg2    |                                                                         
       |                         |------------------|                                                                         
       |                         |                  |                                                                         
       |                         |   callee arg1    |                                                                         
       |                         |------------------|                                                                         
       |                         |                  |                                                                         
       |                         |   callee arg0    |                                                                         
       |                         ----------------------------------------------+   FP(virtual register)                       
       |                         |                  |                          |                                              
       |                         |   return addr    |  parent return address   |                                              
       +---------------------->  +------------------+---------------------------    <-------------------------------+         
                                                    |  caller BP               |                                    |         
                                                    |  (caller frame pointer)  |                                    |         
                                     BP(pseudo SP)  ----------------------------                                    |         
                                                    |                          |                                    |         
                                                    |     Local Var0           |                                    |         
                                                    ----------------------------                                    |         
                                                    |                          |                                              
                                                    |     Local Var1           |                                              
                                                    ----------------------------                            callee stack frame
                                                    |                          |                                              
                                                    |       .....              |                                              
                                                    ----------------------------                                    |         
                                                    |                          |                                    |         
                                                    |     Local VarN           |                                    |         
                                  SP(Real Register) ----------------------------                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    |                          |                                    |         
                                                    +--------------------------+    <-------------------------------+         
                                                                                                                              
                                                              callee

argsize and framesize calculation rules

argsize

In the function declaration:

 TEXT pkgname·add(SB),NOSPLIT,$16-32

As mentioned earlier, $16-32 means $framesize-argsize. When a Go function is called, the parameters and return values must be prepared by the caller on its own stack frame, but the callee still has to state this argsize in its declaration. argsize is the sum of the parameter sizes plus the sum of the return value sizes. For example, with three int64 parameters and one int64 return value, argsize = sizeof(int64) * 4.

However, the real world is never as good as we assume. Function parameters are often mixed with multiple types, and memory alignment issues need to be considered.

If you are not sure how much argsize your function signature needs, you can simply implement an empty function with the same signature, and then go tool objdump to find out how much space should be allocated in reverse.
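For mixed types, a rough estimate can be computed by laying the values out one after another with alignment padding. The Go sketch below does that under stated assumptions: each value is aligned to its own alignment, and the results start on an 8-byte boundary on amd64. The names field, alignUp, and argSize are hypothetical, and the exact rule may differ, so the empty-function check described above remains the authority.

```go
package main

import "fmt"

// field describes one parameter or result by its size and alignment in bytes.
type field struct{ size, align int }

// alignUp rounds off up to the next multiple of align (a power of two).
func alignUp(off, align int) int { return (off + align - 1) &^ (align - 1) }

// argSize lays out the parameters, then the results, and returns the total.
// Assumption: results begin on an 8-byte boundary (amd64).
func argSize(params, results []field) int {
	off := 0
	for _, p := range params {
		off = alignUp(off, p.align) + p.size
	}
	if len(results) > 0 {
		off = alignUp(off, 8)
	}
	for _, r := range results {
		off = alignUp(off, r.align) + r.size
	}
	return off
}

func main() {
	i64 := field{8, 8}
	// Three int64 parameters and one int64 result, as in the text.
	fmt.Println(argSize([]field{i64, i64, i64}, []field{i64}))
	// A mixed signature: (int32, int64, bool) returning int16.
	fmt.Println(argSize([]field{{4, 4}, i64, {1, 1}}, []field{{2, 2}}))
}
```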

framesize

The framesize of the function is a little more complicated. The framesize of the handwritten code does not need to consider the caller BP inserted by the compiler, but should consider:

  1. Local variables, and the size of each variable.
  2. Whether the function calls other functions; if so, the callee's parameters and return values must be included. Although the return address (rip) is also stored in the caller's stack frame, saving and restoring the PC register is handled by the CALL and RET instructions, so in handwritten code you do not need to count the 8 bytes this PC value occupies on the stack.
  3. In principle, all that matters is that calling a function does not overwrite local variables. Allocating a few extra bytes of framesize does no harm.
  4. On the premise of ensuring that there is no problem with the logic, there is no problem if you are willing to override local variables. Just make sure that the caller and callee when entering and exiting the assembly function can get the return value correctly.
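The first two points can be turned into a rough Go calculation. This is a sketch, not the compiler's rule: call and frameSize are hypothetical names, and it assumes the space for outgoing arguments and results is reused across calls, so only the largest callee counts.

```go
package main

import "fmt"

// call describes one callee by the bytes its arguments and results occupy.
type call struct{ args, results int }

// frameSize returns the bytes of locals plus the largest argument/result
// area needed by any callee. The caller BP and return address are not
// counted, matching the rules above.
func frameSize(locals int, calls []call) int {
	maxOut := 0
	for _, c := range calls {
		if n := c.args + c.results; n > maxOut {
			maxOut = n
		}
	}
	return locals + maxOut
}

func main() {
	// No locals, one call to a function taking two ints and returning one int.
	fmt.Println(frameSize(0, []call{{args: 16, results: 8}}))
}
```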

Address calculation

Address calculation also uses the lea instruction, the original English meaning is Load Effective Address , and the amd64 platform address is 8 bytes, so just use LEAQ directly:

LEAQ (BX)(AX*8), CX
// The 8 in the code above is the scale
// scale must be 1, 2, 4, or 8
// With any other value:
// LEAQ (BX)(AX*3), CX
// ./a.s:6: bad scale: 3

// With LEAQ, a scale is required even when simply adding two registers
// The following does not work
// LEAQ (BX)(AX), CX
// asm: asmidx: bad address 0/2064/2067
// The correct form is
LEAQ (BX)(AX*1), CX


// An extra offset can be added on top of the register arithmetic
LEAQ 16(BX)(AX*1), CX

// Arithmetic on three registers: forget it
// LEAQ DX(BX)(AX*8), CX
// ./a.s:13: expected end of operand, found (

The benefits of using LEAQ are also obvious, which can save the number of instructions. If basic arithmetic instructions are used to realize the functions of LEAQ, two to three calculation instructions are required to realize the complete functions of LEAQ.
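What LEAQ computes in a single instruction can be spelled out in Go. The function lea below is a hypothetical model of offset(base)(index*scale), not part of any library, and it only accepts the scale values 1, 2, 4, and 8.

```go
package main

import (
	"errors"
	"fmt"
)

// lea models LEAQ offset(base)(index*scale), dst:
// it returns base + index*scale + offset without touching memory.
func lea(base, index, scale uint64, offset int64) (uint64, error) {
	switch scale {
	case 1, 2, 4, 8:
	default:
		return 0, errors.New("bad scale")
	}
	// Unsigned wraparound makes a negative offset subtract as expected.
	return base + index*scale + uint64(offset), nil
}

func main() {
	addr, err := lea(100, 3, 8, 16) // like LEAQ 16(BX)(AX*8), CX with BX=100, AX=3
	fmt.Println(addr, err)
	_, err = lea(100, 3, 3, 0) // a scale of 3 is rejected
	fmt.Println(err)
}
```

Done with plain arithmetic instructions, the same result takes a multiply (or shift) and two additions.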

Example

add/sub/mul

math.go:

package main

import "fmt"

func add(a, b int) int // declaration of the assembly function

func sub(a, b int) int // declaration of the assembly function

func mul(a, b int) int // declaration of the assembly function

func main() {
    fmt.Println(add(10, 11))
    fmt.Println(sub(99, 15))
    fmt.Println(mul(11, 12))
}

math.s:

#include "textflag.h" // needed because the function declarations use flags such as NOSPLIT

// func add(a, b int) int
TEXT ·add(SB), NOSPLIT, $0-24
    MOVQ a+0(FP), AX // parameter a
    MOVQ b+8(FP), BX // parameter b
    ADDQ BX, AX    // AX += BX
    MOVQ AX, ret+16(FP) // return value
    RET

// func sub(a, b int) int
TEXT ·sub(SB), NOSPLIT, $0-24
    MOVQ a+0(FP), AX
    MOVQ b+8(FP), BX
    SUBQ BX, AX    // AX -= BX
    MOVQ AX, ret+16(FP)
    RET

// func mul(a, b int) int
TEXT ·mul(SB), NOSPLIT, $0-24
    MOVQ  a+0(FP), AX
    MOVQ  b+8(FP), BX
    IMULQ BX, AX    // AX *= BX
    MOVQ  AX, ret+16(FP)
    RET
    // the trailing blank line is required, otherwise you may get unexpected EOF

Put these two files in any directory, execute go build and run to see the effect.

Pseudo register SP, pseudo register FP and hardware register SP

Let's write a simple code to prove the positional relationship between pseudo SP, pseudo FP and hardware SP.
spspfp.s:

#include "textflag.h"

// func output(int) (int, int, int)
TEXT ·output(SB), $8-32
    MOVQ 24(SP), DX // no symbol, so this SP is the hardware register SP
    MOVQ DX, ret3+24(FP) // third return value
    MOVQ perhapsArg1+16(SP), BX // the frame size is > 0, so FP is 16 bytes above pseudo SP
    MOVQ BX, ret2+16(FP) // second return value
    MOVQ arg1+0(FP), AX
    MOVQ AX, ret1+8(FP)  // first return value
    RET

spspfp.go:

package main

import (
    "fmt"
)

func output(int) (int, int, int) // declaration of the assembly function

func main() {
    a, b, c := output(987654321)
    fmt.Println(a, b, c)
}

Execute the above code, you can get the output:

987654321 987654321 987654321

Thinking with the code, we can know that our current stack structure is like this:

------
ret2 (8 bytes)
------
ret1 (8 bytes)
------
ret0 (8 bytes)
------
arg0 (8 bytes)
------ FP
ret addr (8 bytes)
------
caller BP (8 bytes)
------ pseudo SP
frame content (8 bytes)
------ hardware SP

The framesize in this section's example is greater than 0. Readers can try changing the framesize to 0 and adjusting the offsets used with pseudo SP and hardware SP, to study the relative positions of pseudo FP, pseudo SP, and hardware SP when the framesize is 0.

The point of this example is that the relative positions of pseudo SP and pseudo FP can change. In handwritten assembly, do not use pseudo SP with an offset > 0 to refer to data; otherwise the result may not be what you expect.
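
The offsets in this experiment can be checked with a little arithmetic. The Go sketch below models the frame drawn above: pseudo SP sits framesize bytes above hardware SP, and pseudo FP sits above the return address and, when it is saved, the caller's BP. The type and function names are hypothetical, and whether BP is saved is passed in as a parameter rather than derived, since that depends on the assembler's rules for the function in question.

```go
package main

import "fmt"

// frame holds offsets in bytes, measured upward from the hardware SP.
type frame struct {
	pseudoSP int
	pseudoFP int
}

// layout models an amd64 frame: local frame content, then optionally the
// saved caller BP, then the return address, then the arguments.
func layout(framesize int, savesBP bool) frame {
	const word = 8
	f := frame{pseudoSP: framesize}
	f.pseudoFP = f.pseudoSP + word // skip the return address
	if savesBP {
		f.pseudoFP += word // skip the saved caller BP
	}
	return f
}

func main() {
	// The spspfp.s example: framesize 8 with the caller's BP saved.
	f := layout(8, true)
	fmt.Println(f.pseudoSP, f.pseudoFP) // 8 24

	// Assumption: a zero-size frame in which BP is not saved.
	g := layout(0, false)
	fmt.Println(g.pseudoSP, g.pseudoFP) // 0 8
}
```

The first case agrees with the assembly: 24(SP) on the hardware SP and perhapsArg1+16(SP) on the pseudo SP both land on arg1+0(FP).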

Assembly calls non-assembly functions

output.s:

#include "textflag.h"

// func output(a,b int) int
TEXT ·output(SB), NOSPLIT, $24-24
    MOVQ a+0(FP), DX // arg a
    MOVQ DX, 0(SP) // arg x
    MOVQ b+8(FP), CX // arg b
    MOVQ CX, 8(SP) // arg y
    CALL ·add(SB) // before the call, both arguments have been moved to the top of the stack through the hardware SP register
    MOVQ 16(SP), AX // add places its return value here
    MOVQ AX, ret+16(FP) // return result
    RET

output.go:

package main

import "fmt"

func add(x, y int) int {
    return x + y
}

func output(a, b int) int

func main() {
    s := output(10, 13)
    fmt.Println(s)
}

Loops in assembly

Combining DECQ and JZ gives us the loop logic found in high-level languages:

sum.s:

#include "textflag.h"

// func sum(sl []int64) int64
TEXT ·sum(SB), NOSPLIT, $0-32
    MOVQ $0, SI
    MOVQ sl_base+0(FP), BX // &sl[0], addr of the first elem
    MOVQ sl_len+8(FP), CX  // len(sl)
    INCQ CX           // CX++, because the loop must run len times

start:
    DECQ CX       // CX--
    JZ   done
    ADDQ (BX), SI // SI += *BX
    ADDQ $8, BX   // advance the pointer
    JMP  start

done:
    // Where does the return value offset 24 come from?
    // You can find out with go tool compile -S sum.go.
    // A call to sum passes three values:
    // the slice's base address, its len, and its cap.
    // Summing only needs len, but cap still takes up argument space,
    // at 16(FP).
    MOVQ SI, ret+24(FP)
    RET

sum.go:

package main

func sum([]int64) int64

func main() {
    println(sum([]int64{1, 2, 3, 4, 5}))
}
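
For comparison, here is a Go sketch that mirrors the control flow of sum.s one instruction at a time. The name sumLoop is hypothetical, and an index stands in for the element pointer held in BX.

```go
package main

import "fmt"

// sumLoop follows the same steps as sum.s.
func sumLoop(sl []int64) int64 {
	var si int64      // SI: running total
	i := 0            // stands in for BX, the element pointer
	cx := len(sl) + 1 // CX after INCQ, so the first DECQ leaves len(sl)
	for {
		cx-- // DECQ CX
		if cx == 0 { // JZ done
			break
		}
		si += sl[i] // ADDQ (BX), SI
		i++         // ADDQ $8, BX
	}
	return si
}

func main() {
	fmt.Println(sumLoop([]int64{1, 2, 3, 4, 5})) // 15
}
```

Because the counter is decremented before it is tested, an empty slice exits on the first pass without reading any element.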

Extended topic

Some data structures in the standard library

Numerical type

There are many numeric types in the standard library:

  1. int/int8/int16/int32/int64
  2. uint/uint8/uint16/uint32/uint64
  3. float32/float64
  4. byte/rune
  5. uintptr

At the assembly level, each of these types is just a piece of contiguous memory holding data; only the length differs, so you just need to pay attention to the data length when operating on them.

slice

As mentioned in the previous example, a slice passed to a function is actually expanded into three parameters:

  1. Address of the first element
  2. len of the slice
  3. cap of the slice

Once you know this, handling a slice in assembly is straightforward: walk the elements in order or access them by index.
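
The three words can also be observed from Go itself. The sketch below reinterprets a slice value as a struct of pointer, len, and cap. The sliceHeader type is a hypothetical mirror written for this illustration, and the cast relies on the current runtime layout rather than on anything the language guarantees.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sliceHeader is a hypothetical mirror of the three words that make up
// a slice value.
type sliceHeader struct {
	data unsafe.Pointer
	len  int
	cap  int
}

// sliceHeaderOf reinterprets the slice value as its three words.
func sliceHeaderOf(s []int64) sliceHeader {
	return *(*sliceHeader)(unsafe.Pointer(&s))
}

func main() {
	s := make([]int64, 3, 8)
	h := sliceHeaderOf(s)
	fmt.Println(h.data == unsafe.Pointer(&s[0]), h.len, h.cap) // true 3 8
}
```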

string

package main

//go:noinline
func stringParam(s string) {}

func main() {
    var x = "abcc"
    stringParam(x)
}

Use go tool compile -S to output its assembly:

0x001d 00029 (stringParam.go:11)    LEAQ    go.string."abcc"(SB), AX  // get the address of the string in the RODATA section
0x0024 00036 (stringParam.go:11)    MOVQ    AX, (SP) // put the address at the top of the stack as the first argument
0x0028 00040 (stringParam.go:11)    MOVQ    $4, 8(SP) // the string length is the second argument
0x0031 00049 (stringParam.go:11)    PCDATA  $0, $0 // GC-related
0x0031 00049 (stringParam.go:11)    CALL    "".stringParam(SB) // call stringParam

At the assembly level, a string is an address plus a length.
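
The same two words are visible from Go. This sketch reinterprets a string value as a pointer plus a length. The stringHeader type is a hypothetical mirror for illustration, and the cast depends on the current runtime layout.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringHeader is a hypothetical mirror of the two words that make up
// a string value.
type stringHeader struct {
	data unsafe.Pointer
	len  int
}

// stringHeaderOf reinterprets the string value as its two words.
func stringHeaderOf(s string) stringHeader {
	return *(*stringHeader)(unsafe.Pointer(&s))
}

func main() {
	h := stringHeaderOf("abcc")
	first := *(*byte)(h.data) // read the first byte through the data pointer
	fmt.Printf("%d %c\n", h.len, first) // 4 a
}
```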

struct

At the assembly level, a struct is just a piece of contiguous memory. When passed as a parameter, it is laid out field by field on the caller's stack and passed to the callee:

struct.go:

package main

type address struct {
    lng int
    lat int
}

type person struct {
    age    int
    height int
    addr   address
}

func readStruct(p person) (int, int, int, int)

func main() {
    var p = person{
        age:    99,
        height: 88,
        addr: address{
            lng: 77,
            lat: 66,
        },
    }
    a, b, c, d := readStruct(p)
    println(a, b, c, d)
}

struct.s:

#include "textflag.h"

TEXT ·readStruct(SB), NOSPLIT, $0-64
    MOVQ arg0+0(FP), AX
    MOVQ AX, ret0+32(FP)
    MOVQ arg1+8(FP), AX
    MOVQ AX, ret1+40(FP)
    MOVQ arg2+16(FP), AX
    MOVQ AX, ret2+48(FP)
    MOVQ arg3+24(FP), AX
    MOVQ AX, ret3+56(FP)
    RET

The program prints 99 88 77 66, which shows that even with a nested struct, the memory layout is still contiguous.
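
The contiguous layout can be confirmed without any assembly by asking the compiler for the field offsets. This sketch redeclares the two structs from struct.go; fieldOffsets is a hypothetical helper, and the numbers in the comment assume a 64-bit platform such as amd64.

```go
package main

import (
	"fmt"
	"unsafe"
)

type address struct {
	lng int
	lat int
}

type person struct {
	age    int
	height int
	addr   address
}

// fieldOffsets returns the byte offset of each int inside person,
// counting the nested address fields from the start of person.
func fieldOffsets() [4]uintptr {
	var p person
	return [4]uintptr{
		unsafe.Offsetof(p.age),
		unsafe.Offsetof(p.height),
		unsafe.Offsetof(p.addr) + unsafe.Offsetof(p.addr.lng),
		unsafe.Offsetof(p.addr) + unsafe.Offsetof(p.addr.lat),
	}
}

func main() {
	fmt.Println(fieldOffsets(), unsafe.Sizeof(person{})) // [0 8 16 24] 32 on amd64
}
```

These are the same offsets that struct.s reads through arg0+0(FP) to arg3+24(FP).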

map

Compiling the following file with go tool compile -S shows what happens when a value is assigned to a map key:

m.go:

package main

func main() {
    var m = map[int]int{}
    m[43] = 1
    var n = map[string]int{}
    n["abc"] = 1
    println(m, n)
}

Look at the output for the statement m[43] = 1, which the compiler reports as m.go:7:

0x0085 00133 (m.go:7)   LEAQ    type.map[int]int(SB), AX
0x008c 00140 (m.go:7)   MOVQ    AX, (SP)
0x0090 00144 (m.go:7)   LEAQ    ""..autotmp_2+232(SP), AX
0x0098 00152 (m.go:7)   MOVQ    AX, 8(SP)
0x009d 00157 (m.go:7)   MOVQ    $43, 16(SP)
0x00a6 00166 (m.go:7)   PCDATA  $0, $1
0x00a6 00166 (m.go:7)   CALL    runtime.mapassign_fast64(SB)
0x00ab 00171 (m.go:7)   MOVQ    24(SP), AX
0x00b0 00176 (m.go:7)   MOVQ    $1, (AX)

We analyzed the function call process earlier; the first few lines here prepare the arguments for runtime.mapassign_fast64(SB). The signature of this function in the runtime is:

func mapassign_fast64(t *maptype, h *hmap, key uint64) unsafe.Pointer {

Even without reading the implementation, we can guess how the function's inputs and outputs relate. The input parameters map to the assembly instructions as follows:

t *maptype
=>
LEAQ    type.map[int]int(SB), AX
MOVQ    AX, (SP)

h *hmap
=>
LEAQ    ""..autotmp_2+232(SP), AX
MOVQ    AX, 8(SP)

key uint64
=>
MOVQ    $43, 16(SP)

The return value is the memory address of the value slot for the key. With that address in hand, we can write the value into it:

MOVQ    24(SP), AX
MOVQ    $1, (AX)

The whole process is fairly involved, though it can be reproduced by hand. Keep in mind, however, that different map types use different assign functions in the runtime. Interested readers can compile the example in this section and experiment on their own.

Generally speaking, manipulating maps in assembly is not a wise choice.

channel

Channel is another fairly complex data structure in the runtime. Operating on one at the assembly level means calling the functions in the runtime's chan.go, much as with map, so I won't go into it here.

Get goroutine id

A goroutine in Go is a struct called g, which carries its own unique id. The runtime does not expose this id, yet for some reason many people still want it, so various libraries for getting the goroutine id have appeared.

As mentioned in the struct section, a struct is just contiguous memory. If we know its starting address and the offset of a field, we can easily read that field out:

go_tls.h:

#ifdef GOARCH_arm
#define LR R14
#endif

#ifdef GOARCH_amd64
#define    get_tls(r)    MOVQ TLS, r
#define    g(r)    0(r)(TLS*1)
#endif

#ifdef GOARCH_amd64p32
#define    get_tls(r)    MOVL TLS, r
#define    g(r)    0(r)(TLS*1)
#endif

#ifdef GOARCH_386
#define    get_tls(r)    MOVL TLS, r
#define    g(r)    0(r)(TLS*1)
#endif

goid.go:

package goroutineid

import "runtime"

var offsetDict = map[string]int64{
    // ... some lines omitted
    "go1.7":    192,
    "go1.7.1":  192,
    "go1.7.2":  192,
    "go1.7.3":  192,
    "go1.7.4":  192,
    "go1.7.5":  192,
    "go1.7.6":  192,
    // ... some lines omitted
}

var offset = offsetDict[runtime.Version()]

// GetGoID returns the goroutine id
func GetGoID() int64 {
    return getGoID(offset)
}

func getGoID(off int64) int64

goid.s:

#include "textflag.h"
#include "go_tls.h"

// func getGoID(off int64) int64
TEXT ·getGoID(SB), NOSPLIT, $0-16
    get_tls(CX)
    MOVQ g(CX), AX
    MOVQ off+0(FP), BX
    LEAQ 0(AX)(BX*1), DX
    MOVQ (DX), AX
    MOVQ AX, ret+8(FP)
    RET

That gives us a small library that reads the goid field out of struct g. It is published here as a toy:

https://github.com/cch123/goroutineid
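
When checking an assembly implementation like this one, it helps to have a second source for the id. The sketch below uses a different and much slower technique: it parses the first line of runtime.Stack output, which currently has the form "goroutine N [running]:". This is not the method used by the library above, the header format is not a documented stable API, and goIDFromStack is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
)

// goIDFromStack parses the goroutine id from the stack trace header.
// It returns -1 if the header does not have the expected shape.
func goIDFromStack() int64 {
	buf := make([]byte, 64)
	n := runtime.Stack(buf, false) // trace of the current goroutine only
	fields := bytes.Fields(buf[:n])
	if len(fields) < 2 {
		return -1
	}
	id, err := strconv.ParseInt(string(fields[1]), 10, 64)
	if err != nil {
		return -1
	}
	return id
}

func main() {
	fmt.Println(goIDFromStack())
}
```

If the two approaches disagree on the same goroutine, the offset table is the first thing to suspect, since the position of goid inside g changes between Go versions.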

SIMD

SIMD stands for Single Instruction, Multiple Data. On Intel platforms the SIMD instruction sets are SSE, AVX, AVX2, and AVX512. They introduce additional instructions and wider registers, such as:

  • 128-bit XMM0~XMM31 registers (XMM16~XMM31 require AVX512).
  • 256-bit YMM0~YMM31 registers (YMM16~YMM31 require AVX512).
  • 512-bit ZMM0~ZMM31 registers.

The relationship between these registers is similar to that between RAX, EAX, and AX. The instructions can move or compute on several pieces of data at once, for example:

  • movups: moves 4 unaligned single-precision values to an xmm register or memory
  • movaps: moves 4 aligned single-precision values to an xmm register or memory

We are quite likely to see these instructions when an array is passed as a function argument, for example:

arr_par.go:

package main

import "fmt"

func pr(input [3]int) {
    fmt.Println(input)
}

func main() {
    pr([3]int{1, 2, 3})
}

go tool compile -S:

0x001d 00029 (arr_par.go:10)    MOVQ    "".statictmp_0(SB), AX
0x0024 00036 (arr_par.go:10)    MOVQ    AX, (SP)
0x0028 00040 (arr_par.go:10)    MOVUPS  "".statictmp_0+8(SB), X0
0x002f 00047 (arr_par.go:10)    MOVUPS  X0, 8(SP)
0x0034 00052 (arr_par.go:10)    CALL    "".pr(SB)

As we can see, in some cases the compiler already takes performance into account and uses SIMD instructions to optimize data movement for us.

SIMD is a broad topic in its own right, so I won't go into detail here.

Special thanks to

During my research, whenever I got stuck I pestered Zhuo Juju (https://mzh.io/). Special thanks to him for providing many clues and hints.

Reference

  1. https://quasilyte.github.io/blog/post/go-asm-complementary-reference/#external-resources
  2. http://davidwong.fr/goasm
  3. https://www.doxsey.net/blog/go-and-assembly
  4. https://github.com/golang/go/files/447163/GoFunctionsInAssembly.pdf
  5. https://golang.org/doc/asm

Reference [4] deserves special attention: the callee stack frame shown in its slides includes the caller's return address, which I don't think is quite appropriate.



Xargin