1. The double (64-bit) floating-point storage format

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range than single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies binary64 as having:

  • Sign bit: 1 bit
  • Exponent: 11 bits
  • Significand precision: 53 bits (52 explicitly stored)

The sign bit determines the sign of the number (including when this number is zero, which is signed).

The exponent field is an 11-bit unsigned integer from 0 to 2047, in biased form: the stored value is the true exponent plus a bias of 1023. True exponents range from −1022 to +1023, because the stored values 0 (all 0s, true exponent −1023) and 2047 (all 1s, true exponent +1024) are reserved for special numbers (zero/subnormals and infinities/NaN, respectively).

The 53-bit significand precision gives from 15 to 17 significant decimal digits of precision (2−53 ≈ 1.11 × 10−16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.[1]

The format is written with the significand having an implicit integer bit of value 1 (except for special data, see the exponent encoding above). With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈ 15.955). The bits are laid out, from most to least significant, as sign (1 bit), exponent (11 bits), then fraction (52 bits).

From the above it follows that a float64 can hold at most 53 significant bits of an integer (the implicit bit plus the 52 fraction bits); integers whose magnitude requires more than 53 bits cannot all be represented exactly, and precision is lost.
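A minimal check of that limit in Go: every integer up to 2^53 is exact, but 2^53 + 1 rounds back down to 2^53.

```go
package main

import "fmt"

func main() {
	limit := float64(1 << 53) // 9007199254740992, the largest exact integer span

	fmt.Println(limit == limit+1)         // true: the +1 is absorbed by rounding
	fmt.Println(limit-1 == (limit-1)+1)   // false: below 2^53 addition is still exact
}
```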

Example 2

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	fmt.Printf("%b\n", math.Float64bits(1<<52))
	fmt.Printf("%b\n", math.Float64bits(1<<51))
	fmt.Printf("%b\n", math.Float64bits(1<<50))

	fmt.Printf("%b\n", math.Float64bits(1<<51+2)) // OK
	fmt.Printf("%b\n", math.Float64bits(1<<51+3)) // OK
	// 100001100100000000000000000000000000000000000000000000000000100
	// 100001100100000000000000000000000000000000000000000000000000110

	fmt.Printf("%b\n", math.Float64bits(1<<52+2)) // OK
	fmt.Printf("%b\n", math.Float64bits(1<<52+3)) // OK
	// 100001100110000000000000000000000000000000000000000000000000010
	// 100001100110000000000000000000000000000000000000000000000000011

	fmt.Printf("%b\n", math.Float64bits(1<<53+2)) // OK: exact
	fmt.Printf("%b\n", math.Float64bits(1<<53+3)) // +3 rounds to +4, but is still distinct from +2
	// 100001101000000000000000000000000000000000000000000000000000001
	// 100001101000000000000000000000000000000000000000000000000000010

	fmt.Printf("%b\n", math.Float64bits(1<<54+6)) // precision lost
	fmt.Printf("%b\n", math.Float64bits(1<<54+7)) // precision lost: same bits as above
	// 100001101010000000000000000000000000000000000000000000000000010
	// 100001101010000000000000000000000000000000000000000000000000010

	fmt.Printf("%b\n", math.Float64bits(1<<55+2)) // precision lost
	fmt.Printf("%b\n", math.Float64bits(1<<55+3)) // precision lost: both round to 1<<55
	// 100001101100000000000000000000000000000000000000000000000000000
	// 100001101100000000000000000000000000000000000000000000000000000
}
```

3. Comparing the bit patterns after precision loss at 53 and 54 bits

4. 52-bit fraction precision
