
In the previous article, What are fixed-point numbers?, we introduced how computers use fixed-point numbers to represent numbers.

To quickly review: with fixed-point representation, the convention is that the position of the decimal point is fixed, and the integer part and the fractional part are each converted to binary; together they form the fixed-point result.

But fixed-point representation of decimals suffers from a limited numeric range and limited precision, which is why computers generally use “floating-point numbers” to represent decimals instead.

In this article, we’ll take a closer look at how floating-point numbers represent decimals and the range and precision of floating-point numbers.

What is a floating point number?

First, we need to understand: what exactly is a floating-point number?

We already know that “fixed point” means the position of the decimal point is fixed by convention. The “floating point” in floating-point numbers, then, means that the decimal point can float.

How should we understand that?

Floating-point numbers are in fact based on scientific notation. For example, the decimal number 8.345 can be written in several equivalent ways:

8.345 = 8.345 * 10^0
8.345 = 83.45 * 10^-1
8.345 = 834.5 * 10^-2
...

See? In scientific notation, the position of the decimal point becomes “floating”, which is exactly how floating-point numbers got their name, in contrast to fixed-point numbers.

Using the same rule, we can apply scientific notation to binary numbers as well; we simply change the base from 10 to 2.

How do floating point numbers represent numbers?

We already know that floating-point numbers represent a number in scientific notation and can be written as follows:

V = (-1)^S * M * R^E

The meanings of each variable are as follows:

  • S: the sign bit, either 0 or 1; 0 means positive and 1 means negative
  • M: the mantissa, a decimal value. For example, in 8.345 * 10^0, the mantissa is 8.345
  • R: the radix (base); R is 10 for decimal numbers and 2 for binary numbers
  • E: the exponent, an integer. For example, in 83.45 * 10^-1, the exponent is -1
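
To make the formula concrete, here is a minimal Python sketch (the function name is my own, purely illustrative) that evaluates V from its four components:

```python
def float_value(s, m, r, e):
    """Evaluate V = (-1)^S * M * R^E."""
    return (-1) ** s * m * r ** e

print(float_value(0, 8.345, 10, 0))   # 8.345
print(float_value(0, 83.45, 10, -1))  # 8.345 (same value, up to float rounding)
print(float_value(1, 1.5, 2, 3))      # -12.0, i.e. (-1)^1 * 1.5 * 2^3
```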

To represent a number as a floating-point number in a computer, we only need to store these variables.

Suppose we use 32 bits to represent a floating-point number, and we fill those bits with the variables above according to some rule.

Say we define the rule as follows:

  • The sign bit S takes 1 bit
  • The exponent E takes 10 bits
  • The mantissa M takes 21 bits

(The base R is always 2 for binary, so it never needs to be stored.)

Following this rule, converting the decimal number 25.125 to a floating-point number works like this (D for decimal, B for binary):

  1. Integer part: 25(D) = 11001(B)
  2. Fractional part: 0.125(D) = 0.001(B)
  3. In binary scientific notation: 25.125(D) = 11001.001(B) = 1.1001001 * 2^4(B)

So the sign bit S = 0, the mantissa M = 1.1001001(B), and the exponent E = 4(D) = 100(B).

Filling the 32 bits according to the rule above gives this layout:

0 | 0000000100 | 110010010000000000000

(1-bit sign, 10-bit exponent, 21-bit mantissa, zero-padded on the right.)

There is our floating-point result. Isn't that easy?
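
Here is a small Python sketch of this conversion (the function names are my own, purely illustrative):

```python
def decimal_to_binary(x, max_frac_bits=21):
    """Convert a positive decimal such as 25.125 to a binary digit string."""
    int_part, frac_part = int(x), x - int(x)
    bits = bin(int_part)[2:]              # integer part: 25 -> '11001'
    frac_bits = ""
    while frac_part and len(frac_bits) < max_frac_bits:
        frac_part *= 2                    # the multiply-by-2 method
        frac_bits += str(int(frac_part))  # take the integer digit at each step
        frac_part -= int(frac_part)
    return bits + "." + frac_bits if frac_bits else bits

def normalize(bit_string):
    """Shift the point after the leading 1: '11001.001' -> ('1.1001001', 4)."""
    digits = bit_string.replace(".", "")
    point = bit_string.index(".") if "." in bit_string else len(bit_string)
    lead = digits.index("1")              # position of the most significant 1
    mantissa = digits[lead] + "." + digits[lead + 1:]
    return mantissa.rstrip("0").rstrip("."), point - lead - 1

print(decimal_to_binary(25.125))  # 11001.001
print(normalize("11001.001"))     # ('1.1001001', 4)
```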

But there's a problem: the rule we just defined, 1 bit for the sign S, 10 bits for the exponent E, and 21 bits for the mantissa M, is something we simply made up.

What if you wanted to define a different rule, say 1 bit for the sign S, 5 bits for the exponent E, and 26 bits for the mantissa M this time; would that work? Sure it would.

Under this rule, the same number would be laid out as:

0 | 00100 | 11001001000000000000000000

(1-bit sign, 5-bit exponent, 26-bit mantissa.)

Comparing the two rules, we can see that the way bits are divided between the exponent and the mantissa has the following consequences:

  1. More exponent bits mean fewer mantissa bits: the representable range grows, but precision gets worse. Conversely, fewer exponent bits and more mantissa bits shrink the range but improve precision
  2. The floating-point encoding of a number depends on the rule chosen; different rules produce different results, with different ranges and precision

This is exactly what happened in the early days of floating point. There were many computer manufacturers at the time, such as IBM and Microsoft, and each defined its own floating-point rules, so different manufacturers represented the same number differently.

As a result, a program doing floating-point arithmetic on another vendor's computer first had to convert its numbers into that vendor's floating-point format before computing, which inevitably added cost.

So how do you solve this problem? What was urgently needed was a unified floating-point standard.

Floating point standard

It was not until 1985 that the IEEE published the floating-point standard commonly known as IEEE 754, which unified the representation of floating-point numbers and provided two formats:

  • Single precision (float): 32 bits in total, with a 1-bit sign S, an 8-bit exponent E, and a 23-bit mantissa M
  • Double precision (double): 64 bits in total, with a 1-bit sign S, an 11-bit exponent E, and a 52-bit mantissa M
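
In Python, the standard struct module gives access to both formats, and we can confirm their sizes (a quick sanity check, nothing more):

```python
import struct

print(struct.calcsize("f"))  # 4 bytes = 32 bits, single precision
print(struct.calcsize("d"))  # 8 bytes = 64 bits, double precision
```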

To maximize the range and precision of the numbers represented, the standard also lays down rules for the mantissa and the exponent:

  1. A normalized binary mantissa M always begins with 1 (because 1 <= M < 2), so that leading 1 can be omitted and stored implicitly as a hidden bit. This way the 23-bit single-precision mantissa effectively represents 24 significant bits, and the 52-bit double-precision mantissa represents 53
  2. The exponent E is stored as an unsigned integer. In single precision it occupies 8 bits, so its stored value ranges from 0 to 255; but since real exponents can be negative, a bias of 127 is added to the actual exponent before storing, giving E an actual range of -127 to 128. In double precision, E occupies 11 bits and the bias is 1023, giving a range of -1023 to 1024

Beyond the mantissa and exponent rules, the standard also defines the following special cases:

  • Exponent E neither all 0s nor all 1s: a normalized number, interpreted by the rules above
  • Exponent E all 0s, mantissa non-zero: a denormalized number; the hidden bit is no longer 1 but 0 (M = 0.xxxxx), which makes it possible to represent 0 and very small numbers
  • Exponent E all 1s, mantissa all 0s: positive or negative infinity (the sign comes from the S bit)
  • Exponent E all 1s, mantissa non-zero: NaN (Not a Number)
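
To see these rules in action, here is a small Python sketch (illustrative only; it relies on nothing beyond the standard struct module) that splits a number stored as a 32-bit float into its three fields and classifies it according to the table above:

```python
import struct

def decode_float32(x):
    """Split a number (stored as a 32-bit IEEE 754 float) into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                # 1 sign bit
    e = (bits >> 23) & 0xFF       # 8 exponent bits (biased by 127)
    m = bits & 0x7FFFFF           # 23 mantissa bits
    if e == 0xFF:
        kind = "NaN" if m else "infinity"
    elif e == 0:
        kind = "zero" if m == 0 else "denormalized"
    else:
        kind = "normalized"
    return s, e, m, kind

print(decode_float32(25.125))        # (0, 131, 4784128, 'normalized')
print(decode_float32(float("inf")))  # (0, 255, 0, 'infinity')
print(decode_float32(float("nan")))  # typically (0, 255, 4194304, 'NaN')
```

Note how 25.125 decodes to a biased exponent of 131 = 4 + 127, exactly as the bias rule prescribes.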

The representation of a standard floating point number

With this unified floating-point standard, let's convert 25.125 to a standard single-precision float:

  1. Integer part: 25(D) = 11001(B)
  2. Fractional part: 0.125(D) = 0.001(B)
  3. In binary scientific notation: 25.125(D) = 11001.001(B) = 1.1001001 * 2^4(B)

So S = 0; the mantissa M = 1.1001001, stored as 1001001 (the leading 1 is dropped as the hidden bit); and the exponent E = 4 + 127 (the bias) = 131(D) = 10000011(B). Filling the 32 bits gives:

0 | 10000011 | 10010010000000000000000

(1-bit sign, 8-bit exponent, 23-bit mantissa.)

This is the standard 32-bit floating-point representation of 25.125.
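
We can verify this bit pattern directly with Python's struct module:

```python
import struct

# Pack 25.125 as an IEEE 754 single-precision float and inspect the raw bits
(bits,) = struct.unpack(">I", struct.pack(">f", 25.125))
print(f"{bits:032b}")  # 01000001110010010000000000000000
                       # = 0 | 10000011 | 10010010000000000000000
```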

A double works the same way, except that the exponent E is filled into 11 bits and the mantissa M into 52 bits.

Why do floating-point numbers lose precision?

Let's now look at the familiar problem of floating-point precision loss.

If we now wanted to represent 0.2 as a floating-point number, what would the result be?

To convert 0.2 to binary, we repeatedly multiply by 2 until no fractional part remains; the integer digit produced at each step, read from top to bottom, forms the binary fraction.

0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1
0.2 * 2 = 0.4 -> 0 (the cycle repeats)
...

So 0.2(D) = 0.00110011…(B), with the pattern repeating forever.

Because the decimal 0.2 cannot be converted exactly into a binary fraction, and a computer can only store a number in a limited number of bits, this infinitely repeating binary fraction has to be truncated when stored, and that truncation is where the precision loss comes from.
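
This truncation is easy to observe in Python, whose float is a 64-bit IEEE 754 double:

```python
# 0.2 cannot be stored exactly; printing more digits exposes the rounding
print(f"{0.2:.20f}")     # 0.20000000000000001110
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False -- the classic consequence of truncation
```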

What is the range and precision of floating point numbers?

Finally, how large a range, and how fine a precision, can a floating-point number achieve?

Take the single-precision float as an example. The largest binary number it can represent is 1.11111…1 * 2^127 (with 23 ones after the decimal point), and since 1.11111…1 ≈ 2, the maximum value is approximately 2^128 ≈ 3.4 * 10^38, so a float covers roughly -3.4 * 10^38 to +3.4 * 10^38.

And how fine is its precision?

The smallest binary fraction a float can represent is 0.0000…1 (22 zeros and one 1 after the decimal point), which is 1/2^23 in decimal notation.

Using the same method, the largest binary number a double can represent is 1.111…1 * 2^1023 (52 ones after the decimal point) ≈ 2^1024 ≈ 1.79 * 10^308, so a double covers the range -1.79 * 10^308 to +1.79 * 10^308.

The minimum precision of a double is 0.0000…1 (51 zeros and one 1 after the decimal point), which is 1/2^52 in decimal notation.
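
These limits are easy to confirm in Python through sys.float_info, which describes the 64-bit double behind Python's float:

```python
import sys

print(sys.float_info.max)      # 1.7976931348623157e+308, i.e. ~1.79 * 10^308
print(sys.float_info.epsilon)  # 2.220446049250313e-16, i.e. 2^-52
print(2 ** -52)                # matches epsilon, the precision limit above
```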

As you can see, although the range and precision of floating-point numbers are still finite, they are very large, which is why computers normally store decimals as floating-point numbers.

Conclusion

In this article, we looked at how floating-point numbers are represented. To summarize:

  1. Floating-point numbers are based on scientific notation
  2. Filling a fixed number of bits with the variables of that notation (sign, exponent, mantissa) produces the floating-point result
  3. In the early days of floating point, each computer manufacturer made its own floating-point rules, so the same number was represented differently by different manufacturers and had to be converted before calculations
  4. Later, the IEEE proposed the floating-point standard IEEE 754, unifying the format and defining single precision (float) and double precision (double); computer manufacturers have followed a common format ever since
  5. The floating-point representation of a decimal may lose precision, because some decimal fractions cannot be converted exactly to binary and are truncated when stored in a fixed number of bits
  6. Floating-point numbers offer such a wide range and fine precision that the decimals we use every day are stored as floating-point numbers in computers

