Update (2021-08-20)

While recently reading How Programs Run, I found that IEEE 754 actually uses a non-standard shift code (bias) for the exponent of both single-precision and double-precision floating-point numbers. The bias is $2^{n-1}-1$, so for single precision, whose exponent field is 8 bits, the bias is 127; for double precision, whose exponent field is 11 bits, the bias is 1023. When I first wrote this article, I used the standard shift code (bias $2^n$), but this does not affect the understanding of the article. Keep this in mind when you read the sections below that use shift code for exponents.

Writing in the front

I was once asked in an interview: why is 0.1 + 0.2 !== 0.3 in JavaScript? I said floating point is inaccurate, so they are not equal. Obviously the interviewer was not satisfied — doesn't everyone know that much?

Later, when I read the "little red book" (Professional JavaScript for Web Developers), I found that it also stresses that you should not test 0.1 + 0.2 === 0.3, but it only gives the following reason:

One thing to be clear about regarding the rounding errors in floating-point calculations: they are a consequence of the IEEE 754 floating-point format, and ECMAScript is not unique in this regard; other languages that use the same numeric format have the same problem.

The book only mentions that ECMAScript's use of IEEE 754 floating-point storage causes the error, but it does not dig into IEEE 754 itself, let alone into how the computer works. In this article, I'll dig into the answer from the computer's perspective; I hope you find it useful.
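Before digging in, the phenomenon itself can be reproduced in any JavaScript console:

```javascript
// The classic surprise: the binary double nearest to 0.1 plus the one
// nearest to 0.2 does not equal the double nearest to 0.3.
const sum = 0.1 + 0.2;
console.log(sum);         // 0.30000000000000004
console.log(sum === 0.3); // false
```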

What is a floating point number?

The first thing to know is that computers represent numbers in two ways: fixed point and floating point.

  • Fixed-point numbers: the position of the radix point is fixed. Given a 16-bit machine word, the radix point can be specified to sit between bit x and bit x+1 — for example between bits 3 and 4, which means the integer part has 2 bits and the fractional part has 13 bits (1 bit is the sign bit). A machine whose radix point sits right after the sign bit is called a fractional machine (it can only represent fractions), and one whose radix point sits at the end is called an integer machine (it can only represent integers).

  • Floating-point numbers: the position of the radix point is not fixed — it works like decimal scientific notation. A floating-point number consists of an exponent (the order code plus its sign bit) and a mantissa (the mantissa code plus its sign bit). The number of mantissa bits determines the precision of the floating-point number, and the number of exponent bits determines its range.

Because a fixed-point number has a fixed number of integer and fractional bits, its representable range is limited; a floating-point number uses the exponent to expand the representable range greatly, which makes floating point more versatile. The following diagram illustrates the difference between the two:
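To make the range-versus-precision trade-off concrete, here is what JavaScript's own 64-bit floating-point numbers span (these are built-in constants, shown purely as an illustration):

```javascript
// Floating point trades uniform precision for enormous range.
console.log(Number.MAX_VALUE);        // 1.7976931348623157e+308
console.log(Number.MIN_VALUE);        // 5e-324, the smallest positive value
// Integer precision runs out long before the range does:
console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991, i.e. 2^53 - 1
```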

Floating-point expression

Here’s an expression for floating point numbers:


$$N = S \times r^j$$

Here S is the mantissa (a fraction, which may be positive or negative, with its own sign bit), j is the exponent (an integer, which may be positive or negative, with its own sign bit), and r is the radix of the mantissa (in computers r is usually 2, 4, 8, 16, and so on).

Normalization of floating point numbers

Why normalization?

A computer’s floating-point numbers need to be normalized to ensure accuracy.

As an example, take the binary number 0.000110011001100… (1100 repeating). Without normalization, the leading zeros in 0.0001… waste mantissa space. By shifting the mantissa left to remove the leading zeros and adjusting the exponent accordingly — that is, by normalizing — the mantissa can hold more significant bits, improving precision.

Normalized form

The normalization of floating-point numbers varies according to the mantissa base value r:

  • r = 2: the highest mantissa bit is 1
  • r = 4: the highest two mantissa bits are not both 0 (two binary bits encode one base-4 digit, so only the top two bits need to be non-00)
  • r = 8: the highest three mantissa bits are not all 0 (likewise)
  • …

Normalize operations

Floating-point numbers can be normalized by shifting in either of two directions: left normalization and right normalization. For example, with r = 2:

  • Left normalization: shift the mantissa 1 bit to the left and decrease the exponent by 1
  • Right normalization: shift the mantissa 1 bit to the right and increase the exponent by 1

When r = 4, the mantissa moves in units of 2 bits at a time, and so on.

It can also be seen that the larger r is, the larger the range of floating-point numbers that can be represented, and the lower the precision of the floating-point numbers represented.
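The left/right normalization operations for r = 2 can be sketched in a few lines of JavaScript. This is a toy model working on true values (real hardware shifts complement bit patterns, as discussed later), and `normalize` is a hypothetical helper name:

```javascript
// Normalize (S, j) so that 1/2 <= |S| < 1, keeping the value S * 2^j unchanged.
function normalize(S, j) {
  if (S === 0) return { S: 0, j: 0 };
  while (Math.abs(S) >= 1) { S /= 2; j += 1; }   // right normalization
  while (Math.abs(S) < 0.5) { S *= 2; j -= 1; }  // left normalization
  return { S, j };
}

console.log(normalize(0.09375, 0)); // { S: 0.75, j: -3 } — leading zeros removed
console.log(normalize(2.4, 0));     // { S: 0.6, j: 2 }   — overflow shifted right
```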

Floating-point operations in computers (addition and subtraction)

Now that we know about floating-point numbers and their normalization, let’s look at how floating-point numbers can be added and subtracted on a computer. It is divided into the following steps:

  1. Exponent alignment
  2. Mantissa addition
  3. Normalization
  4. Rounding
  5. Extension: representable range and overflow detection

The radix r in the following steps is 2. Assume the following two numbers (see the floating-point expression above for the formula):


$$x = S_x \times 2^{j_x} \qquad y = S_y \times 2^{j_y}$$

Note: the steps below are explained in general terms; the concrete analysis of 0.1 + 0.2 !== 0.3 in JavaScript — the focus of this article — comes later.

Let's start with exponent alignment.

Exponent alignment

Since the exponents of two floating-point numbers are not necessarily equal, before adding or subtracting we must make them equal. The process of adjusting the exponents and mantissas so that the two exponents match is called exponent alignment.

Computing the exponent difference

First, we need the difference between the two exponents. Since computers add and subtract using complement arithmetic, note that this exponent difference — like the mantissa sum (or difference) below — is in practice computed in complement form.

Note: although computers usually subtract using complement arithmetic, under the IEEE 754 standard the exponent is stored as a shift code (biased notation). The specific reasons and analysis are given later.


$$\Delta j = j_x - j_y = \begin{cases} = 0 & j_x = j_y \quad \text{already aligned} \\ > 0 & j_x > j_y \begin{cases} \text{align } x \text{ to } y: & S_x \leftarrow \Delta j, \quad j_x - \Delta j \\ \text{align } y \text{ to } x: & S_y \rightarrow \Delta j, \quad j_y + \Delta j \quad ✓ \end{cases} \\ < 0 & \dots \end{cases}$$

(Here $S \leftarrow \Delta j$ means shifting the mantissa left by $\Delta j$ bits, $S \rightarrow \Delta j$ means shifting right; the ✓ marks the option actually used, per the rule below.)

The alignment rule

We align the smaller exponent to the larger one, because then the mantissa shifts to the right: even if bits are shifted out and lost, only the precision of the data is affected, whereas shifting left could shift out high-order bits and corrupt the magnitude of the number.
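The small-to-large alignment rule can be sketched on the same toy true-value model (`align` is a hypothetical helper; hardware does this on bit patterns):

```javascript
// Align two numbers { S, j } (value = S * 2^j) to the larger exponent.
// The smaller-exponent mantissa shifts right, losing only low-order bits.
function align(a, b) {
  const dj = a.j - b.j;
  if (dj > 0) b = { S: b.S / 2 ** dj, j: b.j + dj };
  if (dj < 0) a = { S: a.S / 2 ** -dj, j: a.j - dj };
  return [a, b];
}

const [x, y] = align({ S: 0.8, j: -3 }, { S: 0.8, j: -2 });
console.log(x); // { S: 0.4, j: -2 } — mantissa halved, exponent raised
console.log(y); // { S: 0.8, j: -2 } — unchanged
```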

Mantissa addition

Mantissa addition simply adds the mantissas of the two aligned floating-point numbers to complete the operation; in practice this also uses complement arithmetic.

Normalization

Normalization was briefly introduced above, but in terms of true values. Here we look at the concrete details of normalization based on how the computer actually stores the mantissa of a floating-point number (in complement form).

We know that normalization maximizes the use of the mantissa bits for higher precision, so after the mantissa addition the new mantissa may no longer satisfy the normalized form and may need to be renormalized.

Normalization: the definition

Let’s review the normalization form mentioned above:

The normalization of floating-point numbers varies according to the mantissa base value r:

  • r = 2: the highest mantissa bit is 1
  • r = 4: the highest two mantissa bits are not both 0 (two binary bits encode one base-4 digit, so only the top two bits need to be non-00)
  • r = 8: the highest three mantissa bits are not all 0 (likewise)
  • …

Assuming the base value r = 2, combined with the normalized form, we can give the following definition of the normalized mantissa S:


$$r = 2: \qquad \frac{1}{2} \leq |S| < 1$$

Normalization: the judgment

With the above definition, a mantissa greater than or less than 0 has the following normalized forms:

| S > 0 | Normalized form | S < 0 | Normalized form |
|---|---|---|---|
| True value | 0.1xx…x | True value | −0.1xx…x |
| Sign-magnitude (original code) | 0.1xx…x | Sign-magnitude (original code) | 1.1xx…x |
| Two's complement | 0.1xx…x | Two's complement | 1.0xx…x |
| One's complement | 0.1xx…x | One's complement | 1.0xx…x |

It is not difficult to see:

  • For the true value and sign-magnitude forms, whether positive or negative, the number is normalized as long as the first fraction bit is 1
  • For the complement form, the number is normalized when the sign bit differs from the first fraction bit (which the computer can easily check with an XOR circuit)

Since computers use complement form, we will only look at the normalized form of the complement. Note that when judging whether a number is normalized, we usually use the complement-form rule — the sign bit differs from the first fraction bit — rather than the definition, because the complement representation of the mantissa S has two special cases:

  • Special case 1

$$S = -\frac{1}{2} = -0.100\dots0 \qquad [S]_{\text{sign-mag}} = 1.100\dots0 \qquad [S]_{\text{comp}} = 1.100\dots0$$

By definition, $S = -\frac{1}{2}$ is a normalized number, but once the computer stores it in complement form, its XOR circuit will judge $[-\frac{1}{2}]_{\text{comp}}$ to be unnormalized. So if we run into this special case in manual analysis, we need to handle the normalization ourselves.

  • Special case 2

$$S = -1 \qquad [S]_{\text{comp}} = 1.000\dots0$$

Here, $S = -1$ is not a normalized number by definition, yet its complement $[-1]_{\text{comp}}$ does pass the complement-form test (sign bit 1, first fraction bit 0).

Note: a fractional fixed-point machine cannot represent −1 in sign-magnitude form but can represent it in complement form, because sign-magnitude distinguishes +0 from −0, while complement form has a single 0, leaving the freed-up pattern to encode −1.

Normalize operations

With the above judgment criteria, we can easily conclude that the normalization operation is:

  • Left normalization (with the mantissa stored in complement form):
    • If the mantissa is positive, left normalization is needed when the first fraction bit is 0
    • If the mantissa is negative, left normalization is needed when the first fraction bit is 1 (because of complement storage, the true value of this bit is generally 0). Note the special case 1 above: there, right normalization should be used instead of left (since the first bit's 1 is also a true-value 1, shifting left would corrupt the data).
  • Right normalization: when the mantissa addition overflows ($|S| > 1$), right normalization is needed to prevent a data overflow error

Rounding

When the data is longer than the machine word that stores it, we must round. In general, shifted-out bits can arise both during exponent alignment and during right normalization, and rounding should be applied in both cases to preserve as much accuracy as possible.

Several ways of rounding

  1. Truncation: simply discard the shifted-out bits
  2. 0-discard-1-round: similar to decimal rounding — if the discarded bit is 0, do nothing; if it is 1, add 1 to the lowest retained bit
  3. Force-1: regardless of whether a 0 or a 1 is discarded, force the lowest bit of the retained result to 1
  4. …

A computer may implement any of several rounding methods; the choice depends on accuracy requirements, accumulated error, and hardware complexity. We'll see concrete rounding modes again when we discuss the IEEE 754 standard.

Extension: Floating point numbers represent ranges and overflow judgments

We mentioned above that shifting can cause a number to overflow. Besides overflow during operations, overflow can also occur when storing data.

Given a floating-point format (with complement storage), we need to determine which numbers fall outside the representable range; this is called overflow detection.

I'll take a shortcut here and simply show a screenshot from teacher Liu Hongwei's Principles of Computer Composition course:

A simple analysis shows that, because the machine word length is fixed, floating-point numbers have limited range and precision. This gives four key boundary values — the largest and smallest positive and negative numbers — and five intervals:

  • Below the smallest negative number (or above the largest positive number): the smallest negative number is the one whose mantissa has the largest representable magnitude (−1) and whose exponent is also the largest; beyond the maximum exponent, the number overflows. The positive side is similar (note: the mantissa of a positive number cannot reach 1; at most it is $1 - 2^{-n}$). Numbers falling in these ranges require overflow exception capture and handling.
  • The representable positive (negative) range: floating-point numbers in this range can be stored correctly, but the representable values are discrete — not every number here can be represented exactly, and some precision may be lost (per the rounding rules).
  • The underflow range: values whose magnitude lies between the largest negative and the smallest positive representable numbers; these are usually represented as 0.
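All three behaviors — overflow, underflow, and in-range rounding — are easy to observe in JavaScript:

```javascript
// Overflow: beyond the largest double, the result becomes Infinity.
console.log(Number.MAX_VALUE * 2); // Infinity
// Underflow: halve the smallest positive double and it flushes to 0.
console.log(Number.MIN_VALUE / 2); // 0
// In range, representable values are discrete, so results are rounded.
console.log(0.1 + 0.2);            // 0.30000000000000004
```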

The IEEE 754 standard

Since floating-point numbers in JavaScript are stored and computed in accordance with IEEE 754, we will introduce the standard here.

The IEEE 754 standard is a floating-point standard covering formats, operations, rounding methods, overflow and exception handling, and a series of related definitions. This article mainly involves the formats and the rounding rules.

The main format

The three most common formats and their bit widths are listed below. Click here to view all formats.

| Format | Sign bit | Exponent | Mantissa | Total bits |
|---|---|---|---|---|
| Short real (single precision) | 1 | 8 | 23 | 32 |
| Long real (double precision) | 1 | 11 | 52 | 64 |
| Temporary real (extended precision) | 1 | 15 | 64 | 80 |

Note that because the IEEE 754 standard uses an implicit "1" in the mantissa, the mantissa effectively represents one more bit than it stores.

A brief note on the implicit "1": as mentioned in the normalization judgment above, the first mantissa bit of a normalized number is always 1, so it can be hidden — never actually stored in the computer — and restored during calculation. This gains one extra bit of precision. The "1" here refers to the true-value 1.
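The implicit "1" is exactly why a double's 52 stored mantissa bits deliver 53 significant bits, which JavaScript surfaces as its safe-integer limit:

```javascript
// 52 stored bits + 1 implicit bit = 53 significant bits.
console.log(Number.MAX_SAFE_INTEGER === 2 ** 53 - 1); // true
// Beyond 53 bits, consecutive integers become indistinguishable:
console.log(2 ** 53 === 2 ** 53 + 1);                 // true
```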

Rounding rules

In total, the IEEE 754 standard defines five rounding rules.

The first two are roundings to nearest, which, as the name implies, round to the nearest representable value. Of these, "to nearest, ties to even" is the default rule for binary floating point and the recommended default for decimal floating point. A related mode, "to nearest, ties to odd", is not defined in this standard.

The directed roundings round toward a fixed direction (0, +∞, or −∞).

| Mode \ Example value | +11.5 | +12.5 | −11.5 | −12.5 |
|---|---|---|---|---|
| to nearest, ties to even | +12.0 | +12.0 | −12.0 | −12.0 |
| to nearest, ties away from zero | +12.0 | +13.0 | −12.0 | −13.0 |
| toward 0 | +11.0 | +12.0 | −11.0 | −12.0 |
| toward +∞ | +12.0 | +13.0 | −11.0 | −12.0 |
| toward −∞ | +11.0 | +12.0 | −12.0 | −13.0 |

Application in JavaScript

JavaScript numbers are stored and manipulated using the 64-bit double precision floating-point standard mentioned above.
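You can inspect the actual 1/11/52-bit layout of any JavaScript number with a `DataView`. The helper below (`decompose` is just an illustrative name) returns the sign, the unbiased exponent, and the 52 stored mantissa bits:

```javascript
// Split a double into sign, exponent (bias 1023 removed), and mantissa bits.
function decompose(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x); // big-endian by default
  const hi = view.getUint32(0);
  const lo = view.getUint32(4);
  const sign = hi >>> 31;
  const biasedExp = (hi >>> 20) & 0x7ff; // 11 exponent bits
  const mantissa =
    (hi & 0xfffff).toString(2).padStart(20, '0') + // high 20 mantissa bits
    lo.toString(2).padStart(32, '0');              // low 32 mantissa bits
  return { sign, exponent: biasedExp - 1023, mantissa };
}

console.log(decompose(0.1));
// { sign: 0, exponent: -4,
//   mantissa: '1001100110011001100110011001100110011001100110011010' }
```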

The rounding rule is the default mentioned above: round to nearest, ties to even. Let's focus on this rule, in two parts:

  1. Round to nearest

    Recall the floating-point representable range and overflow discussion: we already know that not every number can be represented exactly, which means rounding is needed to land on the nearest representable value. That nearest value may be larger or smaller than the original number; we determine which candidate is closer and take it as the rounding result.

  2. Ties to even

    What does the second part mean? If the original number is exactly the same distance from the larger candidate as from the smaller one, pick the even one. You know what an even number is in decimal; in binary, "even" simply means the least significant bit is 0 — pick that candidate as the rounding result.

For example, suppose we keep 4 significant fraction bits. For the number 0.000101, the larger candidate is 0.0010 and the smaller is 0.0001; clearly 0.0001 is closer, so we take it. But for 0.000111, 0.0010 is closer, so we take 0.0010.

If the number is 0.000110 — exactly halfway between the two candidates — "ties to even" applies: 0.0010 ends in 0 (even) while 0.0001 ends in 1 (odd), so we take 0.0010.

Here is an idea borrowed from a zhihu answer that lets you read the rule straight off the binary digits: when rounding, if the first discarded bit is 0, the discarded value is less than half of the least significant retained bit, so round down (e.g. 01 is less than 10); if the first discarded bit is 1 and any later discarded bit is also 1, the discarded value is more than half, so round up; and if the first discarded bit is 1 with all zeros after it — exactly half (10 equals 10, as in the example above) — choose whichever result is "even".
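This procedure can be sketched on exact integers using BigInt (`roundTiesToEven` is a hypothetical helper; dropping a number's low bits mirrors discarding mantissa bits):

```javascript
// Drop the low `drop` bits of an unsigned BigInt, rounding to nearest
// and breaking exact ties toward the even result.
function roundTiesToEven(value, drop) {
  const kept = value >> BigInt(drop);
  const discarded = value & ((1n << BigInt(drop)) - 1n);
  const half = 1n << BigInt(drop - 1);
  if (discarded > half) return kept + 1n;       // over half: round up
  if (discarded < half) return kept;            // under half: round down
  return (kept & 1n) === 0n ? kept : kept + 1n; // exact tie: choose even
}

// The 0.000101 / 0.000111 / 0.000110 examples, keeping 4 fraction bits:
console.log(roundTiesToEven(0b000101n, 2)); // 1n -> 0.0001
console.log(roundTiesToEven(0b000111n, 2)); // 2n -> 0.0010
console.log(roundTiesToEven(0b000110n, 2)); // 2n -> 0.0010 (tie, even)
```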

If you recognize this rule, does it remind you of the "0-discard-1-round" method from Principles of Computer Composition (also mentioned above)? Isn't it a bit like "round to nearest, ties away from zero"?

Calculation of 0.1 + 0.2 in JavaScript

Now that we’ve got the basics, let’s see how the computer adds 0.1 and 0.2.

Convert to a binary number

First, convert 0.1 and 0.2 to binary. For the fractional part, the conversion multiplies by 2 repeatedly and takes the integer part each time. Let's compute 0.1 first:

| Operation | Integer part | Fractional part |
|---|---|---|
| 0.1 × 2 = 0.2 | 0 | 0.2 |
| 0.2 × 2 = 0.4 | 0 | 0.4 |
| 0.4 × 2 = 0.8 | 0 | 0.8 |
| 0.8 × 2 = 1.6 | 1 | 0.6 |
| 0.6 × 2 = 1.2 | 1 | 0.2 |
| … | … | … |

Finally, the following results can be obtained:


0.1 = 0.0001100110011001100110011001100… (1100 repeating)
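The multiply-by-2 table can be automated with a short loop (`fracToBinary` is an illustrative helper; the output is only trustworthy for the first few dozen bits, because 0.1 itself is already stored as a rounded double):

```javascript
// Expand the fractional part of x into `bits` binary digits.
function fracToBinary(x, bits) {
  let frac = x;
  let out = '0.';
  for (let i = 0; i < bits; i++) {
    frac *= 2;                                // shift one bit left
    if (frac >= 1) { out += '1'; frac -= 1; } // the integer part is the next bit
    else out += '0';
  }
  return out;
}

console.log(fracToBinary(0.1, 16)); // 0.0001100110011001
```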

Similarly for 0.2 (0.1 is listed again for easy comparison):


0.1 = 0.0001 100110011001100110011001100…
0.2 = 0.001 100110011001100110011001100…

Storing them as floating-point numbers

We take radix r = 2. Under the IEEE 754 standard, the exponent is represented in shift code (shift code itself will not be described here; only the converted results are given), and the mantissa is represented in complement form and normalized.

In short: shift code = true value + $2^n$, where n is the number of numeric bits.

For ease of presentation, in the formulas below I use `.` to separate a sign bit from the numeric bits (for both the exponent and the mantissa), `:` to separate the exponent from the mantissa, and `[]` to mark overflowed or implicit bits.

First, the mantissa needs to be normalized. Since the mantissa is positive and the leading "1" can be made implicit, the mantissas can be written as:


S(0.1) = 0[1].1001100110011001100110011001100110011001100110011001[1001…] (52 bits after the point)
S(0.2) = 0[1].1001100110011001100110011001100110011001100110011001[1001…] (52 bits after the point)

As you can see, the numbers above overflow the 52 mantissa bits; the discarded bits are 1001…, which is more than half of the last retained bit. Applying the default rounding rule — round to nearest, ties to even — we round up and get:


S(0.1) = 0[1].1001100110011001100110011001100110011001100110011010 (52 bits after the point)
S(0.2) = 0[1].1001100110011001100110011001100110011001100110011010 (52 bits after the point)

With the mantissa represented, next comes the exponent (in shift code):


j(0.1) = 0.1111111100 (true value −4)
j(0.2) = 0.1111111101 (true value −3)

I'll just mention one advantage of shift code here — you can probably see it yourself: with shift code, comparing exponents, including negative ones, becomes a simple unsigned comparison.

Starting the calculation

Now we compute — first, exponent alignment.

Following the small-to-large alignment rule, 0.1's exponent increases by 1 and its mantissa shifts right by 1 bit (note that the first bit shifted in from the left is the implicit "1", not a "0"):


S(0.1) = 0.1100110011001100110011001100110011001100110011001101[0]
S(0.2) = 0[1].1001100110011001100110011001100110011001100110011010

Next, sum the mantissa:


S(0.1) = 0.1100110011001100110011001100110011001100110011001101[0]
S(0.2) = 0[1].1001100110011001100110011001100110011001100110011010
sum(0.1+0.2) = 0[10].0110011001100110011001100110011001100110011001100111
j(sum) = 0.1111111101 (true value −3)

Sum is not a normalized number. We normalize its mantissa:


S(sum) = 0[1].0011001100110011001100110011001100110011001100110011[1]

Rounding again with round to nearest, ties to even — the discarded bit is 1 with nothing after it, an exact tie, and the last retained bit is 1 (odd), so we round up:


S(sum) = 0[1].0011001100110011001100110011001100110011001100110100
j(sum) = 0.1111111110 (true value −2)

Converting and normalizing 0.3 the same way, we get:


S(0.3) = 0[1].0011001100110011001100110011001100110011001100110011
j(0.3) = 0.1111111110 (true value −2)

Compare the results

Finally, let's put 0.3 and sum side by side:


sum = 0.1111111110 : 0[1].0011001100110011001100110011001100110011001100110100
0.3 = 0.1111111110 : 0[1].0011001100110011001100110011001100110011001100110011

The two results differ by $2^{-2} \times 2^{-52} = 2^{-54}$!
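You can confirm this difference directly in JavaScript — it is exactly one unit in the last place at exponent −2:

```javascript
// The gap between the computed sum and the stored 0.3 is exactly 2^-54.
console.log(0.1 + 0.2 - 0.3);              // 5.551115123125783e-17
console.log(0.1 + 0.2 - 0.3 === 2 ** -54); // true
```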

Converting the two binary numbers above into familiar decimal:


sum(0.1+0.2) = 0.300000000000000044408920985006…
0.3 = 0.299999999999999988897769753748…

JavaScript console output:

```javascript
var sum = 0.300000000000000044408920985006; // 15 zeros in the middle
var dot3 = 0.299999999999999988897769753748;
console.log(sum, dot3); // 0.30000000000000004 0.3
```
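As a practical takeaway, compare floating-point results with a tolerance rather than ===. `Number.EPSILON` ($2^{-52}$, the gap between 1 and the next representable double) is a reasonable tolerance for values of magnitude around 1; the helper name `nearlyEqual` is just illustrative:

```javascript
// Tolerance-based comparison, suitable for values near magnitude 1.
function nearlyEqual(a, b, eps = Number.EPSILON) {
  return Math.abs(a - b) < eps;
}

console.log(0.1 + 0.2 === 0.3);           // false
console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true
```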

Conclusion

This article went through many stages and was rewritten several times. I remember when I finished it the first time, I thought I had surely understood everything; but rereading it later, I realized I was still unclear on many points, such as the IEEE 754 rounding rules and the handling of the implicit "1".

When learning Principles of Computer Composition, the teacher only covered relatively easy methods like "0-discard-1-round", but real-world standards such as IEEE 754 are not so simple. Clearly, study and practice need to be combined.

I still can't say that I fully understand everything, and some confusion may remain. If you have any doubts, please raise them in the comments section, and we can learn and make progress together!

References

  1. MOOC – Principles of Computer Composition (Liu Hongwei)
  2. Is floating point math broken?
  3. Rounding floating-point numbers
  4. IEEE 754: Rounding Rules
  5. fixed point vs floating point number
  6. IEEE 754 Roundings to nearest, ties to even