Float32 to binary

The binary format of floating-point numbers in C# follows the IEEE754 standard (IEEE Standard for Binary Floating-Point Arithmetic).

Taking decimal 11.25 as an example, the binary value of float32 is:
01000001001101000000000000000000

Where does this value come from?
IEEE754 is a standard for floating-point numbers that divides each number into three parts:
Sign | Exponent | Mantissa

Sign indicates whether the floating-point number is positive or negative (0 if the number is greater than or equal to 0, 1 if it is less than 0)
Exponent represents the exponent of the floating-point number (similar to the exponent in scientific notation)
Mantissa holds the significant digits (similar to the significand in scientific notation)
Below, these three parts are referred to as S, E, and M respectively.

Float32 occupies 32 bits, that is, 4 bytes of 8 bits each:

S (1 bit) | E (8 bits, bias 127) | M (23 bits), that is:
0 | 1000 0010 | 01101000000000000000000
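As a quick cross-check of this layout, here is a minimal Python sketch that reinterprets the bytes of a float32 and slices out the three fields. The helper name `float32_fields` is made up for this illustration; `struct` is Python's standard-library module for packing binary data.

```python
import struct

def float32_fields(value):
    """Return the (S, E, M) bit strings of value stored as a float32."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes
    # as an unsigned 32-bit integer so the raw bits can be formatted.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    # 1 sign bit, 8 exponent bits, 23 mantissa bits.
    return bits[0], bits[1:9], bits[9:]

sign, exponent, mantissa = float32_fields(11.25)
print(sign, exponent, mantissa)
# 0 10000010 01101000000000000000000
```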

How are the three parts S, E and M determined?

For example, 11.25 expressed in decimal scientific notation is 1.125 × 10^1
The sign is +, the exponent is 1, and the significand is 1.125

IEEE754 first converts the decimal number to binary, and then represents it in binary scientific notation.
Again taking 11.25 as an example:
S is 0 (the number is greater than or equal to 0, so the sign S = 0; if it were less than 0, S would be 1)
Convert the integer part and the fractional part of the number to binary separately, then join them together.

Convert the integer part to binary:

Conversion method: repeatedly divide the integer by 2 and arrange the remainders in reverse order (stop when the quotient reaches 0)
The integer part I is 11, which converts to binary 1011

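$$ \begin{aligned} 11 \div 2 &= 5 \text{ remainder } 1 \\ 5 \div 2 &= 2 \text{ remainder } 1 \\ 2 \div 2 &= 1 \text{ remainder } 0 \\ 1 \div 2 &= 0 \text{ remainder } 1 \end{aligned} $$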
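The repeated-division method can be sketched in Python as follows. The function name `int_to_binary` is hypothetical, and the sketch only handles non-negative integers.

```python
def int_to_binary(n):
    """Convert a non-negative integer to binary by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # quotient carries on, remainder is a bit
        remainders.append(str(r))
    # The first remainder is the lowest bit, so read them in reverse.
    return "".join(reversed(remainders))

print(int_to_binary(11))   # 1011
```

Python's built-in `format(11, "b")` gives the same string; the loop is spelled out only to mirror the manual steps.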

Convert fractional part to binary:

Conversion method: repeatedly multiply the fraction by 2 and take the integer part of each product in forward order (stop when the fractional part reaches 0)
The fractional part F is 0.25, which converts to binary 01

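$$ \begin{aligned} 0.25 \times 2 &= 0.5 &&\rightarrow \text{integer part } 0,\ \text{fraction } 0.5 \\ 0.5 \times 2 &= 1.0 &&\rightarrow \text{integer part } 1,\ \text{fraction } 0 \end{aligned} $$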
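The multiply-by-2 method can be sketched the same way. This is an illustrative Python helper, not part of any library: the name `frac_to_binary` and the `max_bits` cut-off are assumptions, and `Fraction` is used so that the decimal input is held exactly while it is being doubled.

```python
from fractions import Fraction

def frac_to_binary(text, max_bits=32):
    """Binary digits of a decimal fraction such as "0.25", by repeated doubling."""
    frac = Fraction(text)
    digits = []
    # Stop when nothing is left, or after max_bits digits for fractions
    # that never terminate in binary.
    while frac != 0 and len(digits) < max_bits:
        frac *= 2
        bit = int(frac)            # integer part of the product: 0 or 1
        digits.append(str(bit))
        frac -= bit                # keep only the fractional part
    return "".join(digits)

print(frac_to_binary("0.25"))    # 01
print(frac_to_binary("0.625"))   # 101
```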

Integer and fractional parts joined together:

IF = 1011.01

Expressed in binary scientific notation:

IF = 1.01101 × 2^3

Exponent part E:

E = 3; IEEE754 stipulates that the bias 127 is added to this value
E = 3 + 127 = 130, which converts to binary 1000 0010
As before, divide the integer by 2 repeatedly and arrange the remainders in reverse order (stop when the quotient reaches 0)

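$$ \begin{aligned} 130 \div 2 &= 65 \text{ remainder } 0 \\ 65 \div 2 &= 32 \text{ remainder } 1 \\ 32 \div 2 &= 16 \text{ remainder } 0 \\ 16 \div 2 &= 8 \text{ remainder } 0 \\ 8 \div 2 &= 4 \text{ remainder } 0 \\ 4 \div 2 &= 2 \text{ remainder } 0 \\ 2 \div 2 &= 1 \text{ remainder } 0 \\ 1 \div 2 &= 0 \text{ remainder } 1 \end{aligned} $$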
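The normalization and bias steps can be sketched together in Python. The helper name `biased_exponent_bits` is hypothetical, and the sketch assumes a value of at least 1, so that the binary integer part starts with 1 and the exponent is simply the number of places the binary point moves to the left.

```python
BIAS_FLOAT32 = 127

def biased_exponent_bits(int_bits, bias=BIAS_FLOAT32, width=8):
    """Return (exponent, biased exponent bit string) for a value >= 1."""
    # Moving the binary point to just after the leading 1 shifts it
    # len(int_bits) - 1 places, e.g. 1011.01 -> 1.01101 x 2^3.
    exponent = len(int_bits) - 1
    return exponent, format(exponent + bias, "0{}b".format(width))

print(biased_exponent_bits("1011"))   # (3, '10000010')
```

Passing `bias=1023, width=11` gives the Float64 variant of the same step.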

Mantissa part M:

M = 101101
Since the leading digit in binary scientific notation is always 1 (for a normalized number), it can be omitted rather than stored, so the mantissa becomes 01101
M = 01101

The three parts S, E and M put together:

0 | 1000 0010 | 01101 (padded with zeros to 23 bits)
0 | 1000 0010 | 0110 1000 0000 0000 0000 000

Converted to binary, 11.25 is therefore 0 1000 0010 0110 1000 0000 0000 0000 000
The "Online Floating Point Conversion Tool" can be used to verify that the result is correct.
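The result can also be checked without an online tool. The following Python sketch chains the manual steps together and compares the outcome with the bits that `struct` produces. The names `encode_float32` and `reference_bits` are invented for this example. The sketch assumes an absolute value of at least 1 and a fraction that fits in 23 mantissa bits; it truncates instead of rounding and ignores zero, subnormals, infinities, and NaN.

```python
import struct
from fractions import Fraction

def encode_float32(text):
    """Encode a decimal string as float32 bits by following the manual steps."""
    value = Fraction(text)
    sign = "1" if value < 0 else "0"
    value = abs(value)

    int_part = int(value)
    frac = value - int_part
    int_bits = format(int_part, "b")        # integer part in binary

    frac_bits = ""                          # fractional part by repeated doubling
    while frac != 0 and len(frac_bits) < 32:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit

    exponent = len(int_bits) - 1            # places the binary point moves left
    exp_bits = format(exponent + 127, "08b")
    # Drop the leading 1, then pad (or truncate) to 23 bits.
    mantissa = (int_bits[1:] + frac_bits)[:23].ljust(23, "0")
    return sign + exp_bits + mantissa

def reference_bits(x):
    """Bits of x as stored in a float32, via the struct module."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return format(raw, "032b")

print(encode_float32("11.25"))
print(encode_float32("11.25") == reference_bits(11.25))   # True
```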

Binary to Float32

How do we convert binary 11000000110110000000000000000000 to Float32?
As before, first split the binary into the three parts S, E, and M, then convert each part to compute the final result.

1 | 1000 0001 | 1011 0...
S = 1, so the sign is negative (-)
E = 1000 0001
M = 1011 0...

Convert the exponent E to an integer:

E = 1000 0001. To convert to an integer, multiply each bit by its power of 2 and add the results:
1×2^7 + 0×2^6 + 0×2^5 + 0×2^4 + 0×2^3 + 0×2^2 + 0×2^1 + 1×2^0 =
128 + 0 + 0 + 0 + 0 + 0 + 0 + 1 = 129

Subtract 127 (127 was added before converting to binary, so it must be subtracted when restoring)
E = 129 - 127 = 2

Mantissa M to decimal:

M = 1011
Restore the leading 1 (to save space it was dropped when encoding, so it must be added back when decoding)
M = 11011
Written in binary scientific notation:
M = 1.1011 × 2^2
M = 110.11
Integer part I = 110, fractional part F = .11

Integer part I to decimal:

I = 110 converted to decimal: 6
1×2^2 + 1×2^1 + 0×2^0 =
4 + 2 + 0 = 6
I = 6

Convert fractional part F to decimal:
1×(1/2^1) + 1×(1/2^2) =
1×0.5 + 1×0.25 =
0.5 + 0.25 = 0.75
F = 0.75

The integer and fractional parts are added together:
6 + 0.75 = 6.75

Apply the sign bit:
S = 1 (the sign is negative)
float32 = -6.75

Binary 11000000110110000000000000000000 converted to decimal gives the final result -6.75
The "Online Floating Point Conversion Tool" can be used to verify that the result is correct.
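The decoding steps can be sketched in Python as well. The function name `decode_float32` is hypothetical, and the sketch only covers normal numbers (exponent field neither all zeros nor all ones). Instead of shifting the binary point by hand, it restores the leading 1, adds up the mantissa bits as negative powers of 2, and multiplies by 2^E, which gives the same value.

```python
def decode_float32(bits):
    """Decode a 32-character bit string into the float32 value it encodes."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127        # remove the bias
    significand = 1.0                         # the restored leading 1
    for i, bit in enumerate(bits[9:], start=1):
        if bit == "1":
            significand += 2.0 ** -i          # weights 1/2, 1/4, 1/8, ...
    return sign * significand * 2.0 ** exponent

# S = 1, E = 1000 0001, M = 1011 padded with zeros to 23 bits
bits = "1" + "10000001" + "1011".ljust(23, "0")
print(decode_float32(bits))   # -6.75
```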

Float64 to binary

The IEEE754 standard stipulates that Float64 occupies 64 bits, 8 bytes in total. The three parts S, E, and M are:

S (1 bit) | E (11 bits, bias 1023) | M (52 bits)

Take 9.625 as an example:

S = 0 (positive number)

Convert the integer part to binary:

Conversion method: repeatedly divide the integer by 2 and arrange the remainders in reverse order (stop when the quotient reaches 0)
Integer part I = 9, which converts to binary 1001

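$$ \begin{aligned} 9 \div 2 &= 4 \text{ remainder } 1 \\ 4 \div 2 &= 2 \text{ remainder } 0 \\ 2 \div 2 &= 1 \text{ remainder } 0 \\ 1 \div 2 &= 0 \text{ remainder } 1 \end{aligned} $$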

Convert fractional part to binary:

Conversion method: repeatedly multiply the fraction by 2 and take the integer part of each product in forward order (stop when the fractional part reaches 0)
The fractional part F is 0.625, which converts to binary 101

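$$ \begin{aligned} 0.625 \times 2 &= 1.25 &&\rightarrow \text{integer part } 1,\ \text{fraction } 0.25 \\ 0.25 \times 2 &= 0.5 &&\rightarrow \text{integer part } 0,\ \text{fraction } 0.5 \\ 0.5 \times 2 &= 1.0 &&\rightarrow \text{integer part } 1,\ \text{fraction } 0 \end{aligned} $$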

Integer and fractional parts joined together:

IF = 1001.101

Expressed in binary scientific notation:

IF = 1.001101 × 2^3

Exponent part E:

E = 3, plus the bias 1023
E = 3 + 1023 = 1026, which converts to binary 1000 0000 010
Conversion method: repeatedly divide the integer by 2 and arrange the remainders in reverse order (stop when the quotient reaches 0)

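$$ \begin{aligned} 1026 \div 2 &= 513 \text{ remainder } 0 \\ 513 \div 2 &= 256 \text{ remainder } 1 \\ 256 \div 2 &= 128 \text{ remainder } 0 \\ 128 \div 2 &= 64 \text{ remainder } 0 \\ 64 \div 2 &= 32 \text{ remainder } 0 \\ 32 \div 2 &= 16 \text{ remainder } 0 \\ 16 \div 2 &= 8 \text{ remainder } 0 \\ 8 \div 2 &= 4 \text{ remainder } 0 \\ 4 \div 2 &= 2 \text{ remainder } 0 \\ 2 \div 2 &= 1 \text{ remainder } 0 \\ 1 \div 2 &= 0 \text{ remainder } 1 \end{aligned} $$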

Mantissa part M:

M = 1001101
Since the leading digit in binary scientific notation is always 1 (for a normalized number), it can be omitted rather than stored, so the mantissa becomes 001101
M = 001101

The three parts S, E and M put together:

0 | 1000 0000 010 | 001101 (padded with zeros to 52 bits)

Converted to binary, 9.625 is therefore 0 1000 0000 010 001101 0...
The "Online Floating Point Conversion Tool" can be used to verify that the result is correct.
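The Float64 layout can be cross-checked in Python in the same spirit. The helper name `float64_fields` is made up for this sketch; `">d"` and `">Q"` are the `struct` format codes for a big-endian 64-bit float and a big-endian unsigned 64-bit integer.

```python
import struct

def float64_fields(value):
    """Return the (S, E, M) bit strings of value stored as a float64."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    bits = format(raw, "064b")
    # 1 sign bit, 11 exponent bits, 52 mantissa bits.
    return bits[0], bits[1:12], bits[12:]

sign, exponent, mantissa = float64_fields(9.625)
print(sign, exponent, mantissa[:8] + "...")
# 0 10000000010 00110100...
```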

Binary to Float64

How do we convert binary 110000000101100100010... to Float64?
As before, first split the binary into the three parts S, E, and M, then convert each part to compute the final result.

1 | 1000 0000 101 | 1001 0001 0...
S = 1, so the sign is negative (-)
E = 1000 0000 101
M = 1001 0001 0...

Convert the exponent E to an integer:

E = 1000 0000 101. To convert to an integer, multiply each bit by its power of 2 and add the results:
1×2^10 + 0×2^9 + 0×2^8 + 0×2^7 +
0×2^6 + 0×2^5 + 0×2^4 + 0×2^3 +
1×2^2 + 0×2^1 + 1×2^0 =
1024 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 4 + 0 + 1 = 1029

Subtract 1023 (1023 was added before converting to binary, so it must be subtracted when restoring)
E = 1029 - 1023 = 6

Mantissa M to decimal:

M = 1001 0001
Restore the leading 1 (to save space it was dropped when encoding, so it must be added back when decoding)
M = 1100 1000 1
Written in binary scientific notation:
M = 1.10010001 × 2^6
M = 1100100.01
Integer part I = 1100100, fractional part F = .01

Integer part I to decimal:

I = 1100100 converted to decimal: 100
1×2^6 + 1×2^5 + 0×2^4 + 0×2^3 + 1×2^2 + 0×2^1 + 0×2^0 =
64 + 32 + 0 + 0 + 4 + 0 + 0 = 100
I = 100

Convert fractional part F to decimal:
0×(1/2^1) + 1×(1/2^2) =
0×0.5 + 1×0.25 =
0 + 0.25 = 0.25
F = 0.25

The integer and fractional parts are added together:
100 + 0.25 = 100.25

Apply the sign bit:
S = 1 (the sign is negative)
float64 = -100.25

Binary 110000000101100100010... converted to decimal gives the final result -100.25
The "Online Floating Point Conversion Tool" can be used to verify that the result is correct.
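For Float64 the decoding can be sketched with integer arithmetic. The function name `decode_float64` is hypothetical, and only normal numbers are handled. The sketch treats the restored leading 1 plus the 52 mantissa bits as one 53-bit integer and then scales it by 2^(E - 52), which is equivalent to placing the binary point by hand.

```python
def decode_float64(bits):
    """Decode a 64-character bit string into the float64 value it encodes."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:12], 2) - 1023      # remove the bias
    mantissa = int(bits[12:], 2)              # the 52 stored bits as an integer
    significand = (1 << 52) | mantissa        # put the leading 1 back
    # The significand is 1.M scaled up by 2^52, so scale it back down here.
    return sign * significand * 2.0 ** (exponent - 52)

# S = 1, E = 1000 0000 101, M = 1001 0001 padded with zeros to 52 bits
bits = "1" + "10000000101" + "10010001".ljust(52, "0")
print(decode_float64(bits))   # -100.25
```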

Summary:

In most programming languages, floating-point numbers follow the IEEE754 standard and consist of the three parts S, E, and M. The representation is similar to scientific notation, with a few special rules (the exponent bias and the omitted leading 1) added to save as much space as possible. For some values, such as 0.1, multiplying the fractional part by 2 never reaches 0 (0.2, 0.4, 0.8, 0.6, 0.2, ... repeating forever), so these values cannot be represented exactly in binary and lose precision when stored. If you run into precision loss, where a value comes out slightly larger or slightly smaller than expected, you now know that this principle is the cause. It is neither a mistake in your code nor a bug in the computer; it follows directly from the conversion rules and cannot be avoided. Changing the type from float to double improves the precision, but the problem remains.

To work around the loss of precision, you can scale the decimals to integers before calculating. For example, with a = 1.1 and b = 1.3, a + b can be computed as (Math.Round(a * 10) + Math.Round(b * 10)) / 10. Watch out for overflow beyond the maximum range of the number type.
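Both the precision loss and the scaled-integer workaround are easy to reproduce. The sketch below uses Python rather than C#, so `round` stands in for `Math.Round`; the variable names are illustrative, and the scale factor 10 assumes one decimal place as in the example above.

```python
from decimal import Decimal

# Decimal(0.1) shows the exact value of the stored double,
# which is slightly above the decimal value 0.1.
print(Decimal(0.1) > Decimal("0.1"))   # True

# A classic symptom of values that cannot be stored exactly:
print(0.1 + 0.2 == 0.3)                # False

# Workaround: scale to integers, add exactly, then scale back.
a, b = 1.1, 1.3
total = (round(a * 10) + round(b * 10)) / 10
print(total)                           # 2.4
```

The integer sum 11 + 13 is exact; only the final division by 10 is rounded, and it rounds to the double closest to 2.4.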


冰封百度