
Important reminder: if you'd rather skip the assembly and see the numbers directly, please scroll down to the result section.

I came across this question on Zhihu and found it very interesting. The answers there vary, but none of them gives actual benchmark results. Some do post disassembly, but it uses the x87 FPU instruction set. Good grief, it's already the 2020s — is any program really still using that ancient floating-point instruction set?

So I decided to study it myself: read the disassembly and write a benchmark. The platform is x86-64, the compiler is clang 12, and compiler optimizations are enabled (talking about speed without optimization is meaningless).

Code and disassembly

https://gcc.godbolt.org/z/rvT9nEE9Y


A quick assembly language primer

Before diving into the disassembly, we need a basic understanding of assembly language. The original code in this article is very simple — no loops or branches — so only a few assembly instructions are involved.

  1. Assembly language is strongly tied to the platform. Here we use x86-64 (the 64-bit extension of the x86 instruction set, also called AMD64 because AMD invented it first), abbreviated x64
  2. x64 assembly has two syntaxes: Intel syntax (mainly used by Microsoft compilers) and AT&T syntax (the default for gcc-compatible compilers, though gcc can also emit Intel syntax). I personally find Intel syntax easier to read, so it is used here
  3. Basic syntax. Take mov rcx, rax : mov is the instruction mnemonic, meaning "assign". rcx and rax are its two operands; both are general-purpose register names. In Intel syntax the first operand also receives the result of the operation, so:

    1. mov rcx, rax : assignment; copies register rax into register rcx . In C: rcx = rax
    2. add rcx, rax : addition; adds rcx and rax and stores the result in rcx . In C: rcx += rax
  4. Registers. With optimizations enabled, most operations happen directly in registers, with no memory access involved. Only three kinds of registers (on x64) appear below.

    1. Registers named rxx are 64-bit registers
    2. Registers named exx are 32-bit registers; each is the low 32 bits of the corresponding rxx
    3. xmmX registers are 128-bit SSE registers. Since this article involves no SIMD, they can simply be treated as floating-point registers. For double-precision values only the low 64 bits are used
  5. Calling convention. In C, all code lives in functions. How a caller passes values to a callee, and how the callee returns a value, is the function calling convention. Calling conventions are the most basic requirement for application ABI compatibility across platforms, and they differ between operating systems. All disassembly in this article was generated with godbolt, which targets Linux, so it follows the System V calling convention common on Linux. Since the code in this article is very simple (each function has only one parameter), the reader only needs to know three things:

    1. The first integer parameter is passed in rdi / edi ( rdi / edi holds the caller's first argument). rdi is a 64-bit register, corresponding to long (on Linux); edi is a 32-bit register, corresponding to int
    2. The first floating-point parameter is passed in xmm0 , regardless of single or double precision
    3. An integer return value is placed in rax / eax ; a floating-point return value in xmm0

Integer case

Divide an integer by 100
int int_div(int num) {
    return num / 100;
}

The result is

int_div(int):                            # @int_div(int)
        movsxd  rax, edi
        imul    rax, rax, 1374389535
        mov     rcx, rax
        shr     rcx, 63
        sar     rax, 37
        add     eax, ecx
        ret

A little explanation. movsxd is a sign-extending move, translatable as rax = (long)edi ; imul is signed integer multiplication; shr is a logical right shift (fills the vacated bits with 0); sar is an arithmetic right shift (preserves the sign bit)

As you can see, the compiler simulates the division with a multiplication and shifts, which means the compiler believes even this long sequence of instructions is faster than a division instruction. The pairing of the arithmetic and logical right shifts exists to handle negative numbers correctly. If you use an unsigned type, the result is simpler

unsigned int_div_unsigned(unsigned num) {
    return num / 100;
}

The result is

int_div_unsigned(unsigned int):                  # @int_div_unsigned(unsigned int)
        mov     eax, edi
        imul    rax, rax, 1374389535
        shr     rax, 37
        ret

You can also force the compiler to emit an actual division instruction, using the volatile trick

int int_div_force(int num) {
    volatile int den = 100;
    return num / den;
}

The result is

int_div_force(int):                     # @int_div_force(int)
        mov     eax, edi
        mov     dword ptr [rsp - 4], 100
        cdq
        idiv    dword ptr [rsp - 4]
        ret

A little explanation. cdq (Convert Doubleword to Quadword) sign-extends a 32-bit integer to 64 bits; idiv is signed integer division. The integer division instruction is awkward to use: its operand cannot be an immediate, and if the divisor is 32 bits, the dividend must first be widened to 64 bits — that is exactly what cdq does (sign extension is required to fill the upper bits correctly). In addition, the compiled code now contains a memory operand, dword ptr [rsp - 4] , a side effect of volatile , which will affect the result somewhat.

Integer times 0.01
int int_mul(int num) {
    return num * 0.01;
}

The result is

.LCPI3_0:
        .quad   0x3f847ae147ae147b              # double 0.01
int_mul(int):                            # @int_mul(int)
        cvtsi2sd        xmm0, edi
        mulsd   xmm0, qword ptr [rip + .LCPI3_0]
        cvttsd2si       eax, xmm0
        ret

A little explanation. cvtsi2sd (Convert Doubleword Integer to Scalar Double-Precision) converts an integer to a double-precision float, translatable as xmm0 = (double)edi . mulsd is double-precision floating-point multiplication; cvttsd2si converts a double back to an integer, truncating the fractional part.

Because there is no instruction that mixes integer and floating-point operands, the integer must first be converted to floating point, and the result converted back after the operation. Integers and floating-point numbers are stored differently in computers: integers are simply two's complement, while floating-point numbers use IEEE 754, essentially binary scientific notation. The conversion is not a matter of just copying bits.

The case of floating-point numbers

Floating point number divided by 100
double double_div(double num) {
    return num / 100;
}

The result is

.LCPI4_0:
        .quad   0x4059000000000000              # double 100
double_div(double):                        # @double_div(double)
        divsd   xmm0, qword ptr [rip + .LCPI4_0]
        ret

A little explanation. divsd is double-precision floating-point division. Because a constant cannot be mov ed directly into an SSE register (SSE instructions take no immediate operands), such constants are placed in memory first and loaded from there, hence qword ptr [rip + .LCPI4_0]

Floating point number multiplied by 0.01

double double_mul(double num) {
    return num * 0.01;
}

The result is

.LCPI5_0:
        .quad   0x3f847ae147ae147b              # double 0.01
double_mul(double):                        # @double_mul(double)
        mulsd   xmm0, qword ptr [rip + .LCPI5_0]
        ret

The code is almost identical to the division version; there is only one instruction, so no explanation is needed

Benchmark

https://quick-bench.com/q/1rmqhuLLUyxRJNqSlcJfhubNGdU

Result


Sorted by time, fastest first (times relative to the fastest):

  1. Floating-point multiplication: 100%
  2. Unsigned integer division: 150%
  3. Signed integer division (compiled to multiply and shift): 200%
  4. Integer multiplication (by 0.01): 220%
  5. Forced integer division: 900%
  6. Floating-point division: 1400%

Analysis

  1. Floating-point multiplication requires only one instruction, mulsd , with a latency of only 4~5 cycles; in theory it is the fastest.
  2. Unsigned integer division is compiled into multiplication, shift and move instructions. The integer multiplication instruction imul has a latency of about 3~4 cycles; adding the shift and move, the total time is slightly higher than floating-point multiplication.
  3. Signed integer division compiles to slightly more instructions than the unsigned version, but the extra shift and add instructions are very lightweight, so the time is very close.
  4. I was surprised that integer multiplication takes nearly as long as integer division compiled to multiplication. The integer/floating-point conversion instructions cvtsi2sd and cvttsd2si have latencies of 3~7 cycles depending on the CPU model. Of course, instruction performance is not just about latency — multiple instructions can execute concurrently — but these three instructions depend on each other and cannot overlap.
  5. Forced integer division is slower, as expected. The 32-bit integer division instruction idiv has a latency of about 10~11 cycles, and up to 57 for 64-bit integers. In addition, the memory access (which in practice should only hit the cache) also costs some speed.
  6. The slowest is floating-point division. divsd has a latency of 14~20 cycles depending on the CPU model, yet it is unexpectedly slower even than the forced integer division with its memory access.

This article does not test single-precision floating point ( float ), because by default the compiler converts float to double for accuracy, then converts the result back, making float slower than double . This can be avoided with --ffast-math , but quick-bench does not expose that option. It is also worth mentioning that with --ffast-math enabled, the compiler will compile floating-point division into floating-point multiplication

Note: The delay information of all instructions can be found here: https://www.agner.org/optimize/instruction_tables.pdf


CarterLi