Note: if you would rather skip the assembly and jump straight to the numbers, scroll down to the Result section.
I came across this question on Zhihu and found it very interesting. The answers were varied, but none of them gave actual benchmark results. Some posted disassembly, but the assembly used the x87 FPU instruction set. Good grief, it is the 2020s; does any program still use that ancient floating-point instruction set?
So I decided to study it myself: read the disassembly and write the benchmark. The platform is x86-64, the compiler is clang 12, and compiler optimizations are enabled (talking about speed without optimization is meaningless).
Code and disassembly
https://gcc.godbolt.org/z/rvT9nEE9Y
A quick primer on assembly language
Before diving into the disassembly, we need a basic understanding of assembly language. The original code in this article is very simple, with no loops or branches, so only a handful of instructions are involved.
- Assembly language is strongly tied to the platform. Here we use x86-64 (the 64-bit extension of x86, also called AMD64 because AMD invented it first), abbreviated x64.
- x64 assembly has two syntaxes: Intel syntax (mainly used by Microsoft compilers) and AT&T syntax (the default for gcc-compatible compilers, though gcc can also emit Intel syntax). I personally find Intel syntax easier to read, so it is used here.
Basic syntax. For example:
mov rcx, rax
mov is the instruction mnemonic ("move", i.e. assignment). rcx and rax are the two operands of the mov instruction; both are general-purpose register names. In Intel syntax, the first operand also receives the result of the operation. So mov rcx, rax assigns the value of register rax to register rcx; translated into C, it is rcx = rax.
add rcx, rax is an addition instruction: it adds rcx and rax and stores the result in rcx; translated into C, it is rcx += rax.
Registers. With optimization enabled, most operations happen directly in registers, with no memory access involved. Only three kinds of registers (on x64) appear below:
- rxx (prefix r) is a 64-bit general-purpose register.
- exx (prefix e) is a 32-bit register, and is the low 32-bit half of the corresponding rxx.
- xmmX is a 128-bit SSE register. Since this article involves no SIMD, you can simply treat it as a floating-point register; for double-precision values only its low 64 bits are used.
Calling convention. In C, all code lives in functions. The rules for how a caller passes arguments to a callee and how the callee returns a value are called the calling convention. It is the most basic requirement for ABI compatibility across platforms, and different operating systems use different conventions. The disassembly in this article was all generated with godbolt, which runs on Linux, so it follows the System V calling convention common on Linux. Since the code here is very simple (each function takes a single parameter), the reader only needs to know three points:
- The first integer argument is passed in rdi / edi. rdi is the 64-bit register, corresponding to long (on Linux); edi is the 32-bit register, corresponding to int.
- The first floating-point argument is passed in xmm0, regardless of single or double precision.
- Integer return values go in rax / eax; floating-point return values go in xmm0.
Integer case
Divide an integer by 100
int int_div(int num) {
return num / 100;
}
The result is
int_div(int): # @int_div(int)
movsxd rax, edi
imul rax, rax, 1374389535
mov rcx, rax
shr rcx, 63
sar rax, 37
add eax, ecx
ret
A little explanation: movsxd is a sign-extending move, which translates to rax = (long)edi; imul is signed integer multiplication; shr is a logical right shift (fills from the top with 0); sar is an arithmetic right shift (preserves the sign bit).
As you can see, the compiler simulates the division with a multiplication and shifts, which means it believes even this longer instruction sequence is faster than a division instruction. The mix of arithmetic and logical right shifts exists to handle negative numbers correctly. If you use an unsigned type, the result is simpler:
unsigned int_div_unsigned(unsigned num) {
return num / 100;
}
The result is
int_div_unsigned(unsigned int): # @int_div_unsigned(unsigned int)
mov eax, edi
imul rax, rax, 1374389535
shr rax, 37
ret
You can also force the compiler to emit a real division instruction, using the volatile trick:
int int_div_force(int num) {
volatile int den = 100;
return num / den;
}
The result is
int_div_force(int): # @int_div_force(int)
mov eax, edi
mov dword ptr [rsp - 4], 100
cdq
idiv dword ptr [rsp - 4]
ret
A little explanation: cdq (Convert Doubleword to Quadword) is a signed 32-bit to 64-bit integer conversion; idiv is signed integer division. The integer division instruction is awkward to use: its operand cannot be an immediate, and when the divisor is 32 bits, the dividend must first be widened to 64 bits, which is exactly what cdq does (the sign bit has to be replicated into the upper half). Note that the compiled code now also contains a memory operand, dword ptr [rsp - 4]; this is a side effect of volatile and will have some impact on the result.
Integer times 0.01
int int_mul(int num) {
return num * 0.01;
}
The result is
.LCPI3_0:
.quad 0x3f847ae147ae147b # double 0.01
int_mul(int): # @int_mul(int)
cvtsi2sd xmm0, edi
mulsd xmm0, qword ptr [rip + .LCPI3_0]
cvttsd2si eax, xmm0
ret
A little explanation: cvtsi2sd (Convert Signed Integer to Scalar Double) converts an integer to a double-precision float, translating to xmm0 = (double)edi. mulsd is double-precision multiplication, and cvttsd2si converts a double back to an integer, truncating the fractional part.
Because there are no instructions that mix integer and floating-point operands, the integer is first converted to a floating-point number and converted back after the operation completes. Integers and floating-point numbers are stored differently: integers are plain two's complement, while floating-point numbers use the IEEE 754 format, a kind of binary scientific notation, so the conversion is not a simple relabeling of bits.
The case of floating-point numbers
Floating point number divided by 100
double double_div(double num) {
return num / 100;
}
The result is
.LCPI4_0:
.quad 0x4059000000000000 # double 100
double_div(double): # @double_div(double)
divsd xmm0, qword ptr [rip + .LCPI4_0]
ret
A little explanation: divsd is double-precision floating-point division. Because an immediate cannot be moved directly into an SSE register, constant operands are placed in memory first, hence qword ptr [rip + .LCPI4_0].
Floating point number multiplied by 0.01
double double_mul(double num) {
return num * 0.01;
}
The result is
.LCPI5_0:
.quad 0x3f847ae147ae147b # double 0.01
double_mul(double): # @double_mul(double)
mulsd xmm0, qword ptr [rip + .LCPI5_0]
ret
The result closely mirrors the division case: a single instruction, no further explanation needed.
Benchmark
https://quick-bench.com/q/1rmqhuLLUyxRJNqSlcJfhubNGdU
Result
Sorted by time, fastest first (relative time, fastest = 100%):
- Floating-point multiplication: 100%
- Unsigned integer division: 150%
- Signed integer division (compiled to multiply and shift): 200%
- Integer multiplied by 0.01: 220%
- Forced integer division: 900%
- Floating-point division: 1400%
Analysis
- Floating-point multiplication needs only one instruction, mulsd, with a latency of just 4-5 cycles; in theory it is the fastest.
- Unsigned integer division compiles to multiply, shift, and move instructions. The integer multiply imul has a latency of about 3-4 cycles; adding the shift and move, the total time is slightly higher than floating-point multiplication.
- Signed integer division compiles to slightly more instructions than the unsigned version, but the extra shift and add are very lightweight, so the time is very close.
- I was surprised that integer-times-0.01 is so close to integer division compiled as multiplication. The integer/floating-point conversion instructions cvtsi2sd and cvttsd2si have latencies of 3-7 cycles depending on the CPU model. Of course, instruction efficiency is not just a matter of latency; instruction-level parallelism matters too, but these three instructions form a dependency chain and cannot overlap.
- Forced integer division being slower matches expectations. The 32-bit integer division instruction idiv has a latency of about 10-11 cycles, rising to as much as 57 for 64-bit integers. The memory access (in practice it should only touch the cache) also costs some speed.
- The slowest is floating-point division. divsd has a latency of 14-20 cycles depending on the CPU model, yet it is, unexpectedly, even slower than the forced integer division with its memory access.
This article does not test single-precision floating-point numbers (float), because by default the compiler converts float to double for accuracy, then converts the result back, which makes float slower than double. This behavior can be disabled with -ffast-math, but quick-bench does not expose that option. It is also worth mentioning that with -ffast-math enabled, the compiler compiles floating-point division by a constant into floating-point multiplication.
Note: the latency figures for all instructions can be found in Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf