preface

Of course, this problem is not only in Javascript. Almost all programming languages use ieEE-745 floating-point representation. Any programming language that uses binary floating-point numbers will have this problem. However, many other languages have encapsulated methods to avoid accuracy problems, and JavaScript is a weakly typed language, which is not designed to have a strict data type for floating-point numbers, so the accuracy error problem is particularly prominent. But perhaps few people wonder why this is the case. In this article, we will start with the basics of computers and tell you why these problems occur and how to solve them.

1. How to convert from decimal to binary?

As we all know, the data stored inside the computer is all binary, no matter what data type, will be converted to binary storage and calculation at the end. The decimal system is no exception. We write an assignment to const a = 1 in the code, which is converted to a binary number that the computer can recognize so that the CPU can execute. The number 1 is also converted to binary, so what is the conversion pattern between the two?

For example, the integer 1 converts to binary:

0000 0000 0000 0001

How are positive integers, negative integers, and floating-point numbers converted to binary

1.2. Positive integers are converted to binary

Simply summarize its method is: divide two take mod, and then reverse order, high fill zero.

For example, the number 13, mod by dividing by two as above:

13/2 = 3 = 6-1 6/2-0 l = 1-1 1/2 = 1-0

Then discharge in reverse order: 1011

Then the high value is filled (assuming the memory is 32 bits) : 0000 0000 0000 0000 1011

1.3. Negative integers are converted to binary

In computers, negative numbers are stored as their positive complement.

So what is a complement? A little bit of computer based children should be able to know?

A computer has three major codes: source code, inverse code and complement code.

The source code is the number that was just converted from a positive integer to binary

An inverse code is a new binary number, such as 13, when the original code is reversed by bits:

0000 0000 0000 0000 0000 1011

Reverse code: 1111 1111 1111 1111 1111 1111 1111 0100

The complement is the inverse plus 1, namely:

Complement: 1111 1111 1111 1111 1111 1111 1111 1111 0101

The resulting complement is the memory storage of -13, expressed in hexadecimal: 0xFFFF FFF5

So the negative integer conversion takes three more steps than the positive integer:

  1. You take the absolute value of negative integers and you get A
  2. Converts the positive integer A to binary, resulting in B
  3. If you take the inverse of every digit of B, you get C
  4. I’m going to add 1 to C

1.4. Convert floating point numbers to binary

Turn the decimal binary using “by two integer, order” method, the specific approach is to use 2 take a decimal fraction, can get the product, the product of the integer part, and in 2 by the rest of the decimal part, and get a product, and then remove the integer part of the product, so, until the decimal portion is zero, 0 or 1 at this time as the last of the binary, Or achieve the required accuracy.

For example: 0.125 to binary:

2 = 0.25 – > 0.125 0 2 = 0.5-0.25 – > 0.5 * 2 = 1.0 – > 1

So the binary is (0.001)B

We investigate: 0.125 = 0 2 ^ (1) + 0 ^ 2 + 1 * 2 ^ (2) (3) = 1/8 = 0.125

And it works.

The principle of the above conversion method can be referred to: decimal to binary

2, JS double precision format storage

Js digital storage standard is IEEE 754, which adopts 64-bit double precision floating point number, where:

Bit 0: sign bit, s represents, 0 represents a positive number, 1 represents a negative number;

Places 1 to 11: Storage index part, e represents;

Bits 12 to 63: Store decimal parts (i.e. significant digits), denoted by f

The diagram below:

To finally understand the final storage format of numbers, we also need to know scientific notation

The expression is: m × b^n (1 ≤ m

An 🌰 :

1234 = 1.234 × 10^3

Of course, there are binary representations:

Here’s an example:

The decimal floating-point number 7.25 is converted to a binary representation: 111.01(don’t ask me how to convert it, read on in section 1 above), in binary scientific notation:

1.1101 2 ^ 2, we can come to investigate: (12 ^ ^ 0 + 12 + 12 (1) ^ (2) + 02 ^ ^ (3) + 12 (4)) ^ 2 = (1 + 0.5 + 0.25 + 0.0625) ^ 2 = 7.25

Now that we know about the scientific notation conversion for binary, let’s talk about some of IEEE 754’s rules:

1, the sign bit determines the positive and negative of a number, the exponential part determines the size of the number, and the decimal part determines the precision of the number.

2. IEEE 754 states that the significant digit M is always 1 by default and is not stored in 64-bit floating-point numbers. That is, the significant digit is always 1.xx… In the form of xx, where xx.. Parts of xx are stored in 64-bit floating point numbers, which may be up to 52 bits long, but can represent 53 significant digits. For example, when saving 1.01, only save 01, wait until read, add the first 1. The purpose of this is to save one significant digit.

3, E is an unsigned int. Since E has 11 digits, its value ranges from 0 to 2047. However, we know that in scientific notation E can be negative, so IEEE 754 states that the true value of E must be stored in memory with an intermediate number, which is 1023 for an 11-bit E. For example, 7.25, E = 2, saved in memory should be 2+127=129, which is 000 10000 0001

4. In addition, E also needs to consider the following three situations:

(1) E is not all 0 or all 1. At this point, the floating-point number is represented by the rule above, which is to subtract 127 (or 1023) from the calculated value of the exponent E to get the true value, followed by the significant number M with the first 1.

(2) E is all 0. At this point, the floating-point exponent E is equal to 0-127 (or 0-1023), and the significant number M is no longer added with the first 1, but reverted to the decimal of 0.xxxxxx. This is done to represent plus or minus zero, and very small numbers that are close to zero.

(3) E are all 1. In this case, if the significant digit M is all 0, it means ± infinity (plus or minus depending on the sign bit S); If the significant number M is not all 0, it means that the number is not a number (NaN).

To put all the information together, let’s use the following example:

The format of the floating point number 0.125 stored in memory is calculated as follows:

  1. Convert 0.125 to binary: 0.001
  2. Binary to scientific notation: 1.0 * 2^-3
  3. According to the above expression, its E=-3+1023, M=0, so the storage contents are as follows:

The reverse direction can be converted directly to decimal:

  1. 0 is a positive number
  2. E fits the first case above: the true value is 1020-1023=-3
  3. The significant number M is 0

So the scientific notation for binary is: 1.0*2^-3 => converted to decimal is 0.125

3. Floating point operation defects of JS

With the foundation laid in the previous two sections, we can now explain why 0.1+0.2 equals a long string of numbers in JS. 0.1*0.2 would also be a long string of numbers, right?

The root cause is that the conversion of 0.1 to binary is an infinite loop:

0.1 => 0.0 0011 0011 0011…..

Because in the format of the double effective number is 52, so complete after conversion is 0.0 0.1, 0011, 0011, 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011 0011

Switch to scientific notation: 1.1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 * 2^-4

So the real storage structure is:

Similarly, the real storage structure of 0.2 is:

The actual storage pattern as operands to floating-point addition, get a 0-01111111101-0011001100110011001100110011001100110011001100110100

Converted to a decimal is 0.300000000000000044408920985006

Reference: 0.300000000000000044408920985006

But why does javascript only print the first 17 significant digits?

See my stackOverflow question for the reason

How does JavaScript determine the number of digits to produce when formatting floating-point values?

Another classic example is: 0.1 + 0.125 why doesn’t the result have many decimal points?

Because the two together: 0.225000000000000005551115123126

According to the above explanation, we get 0.22500000000000000, which is 0.225 if we drop the last 0

Reference: 0.225000000000000005551115123126

3, JS large integer overflow

Because all numbers need to be converted to binary, and the converted binary is stored in a memory cell using scientific notation, we know that the length of M determines the range of numbers that can be represented.

The fixed length of M is 52 digits, plus the omitted one digit, the maximum number that can be represented is 2^ 53-1 = 9007199254740991, so the range is -9007199254740991 to 9007199254740991

This can be verified by a constant defined by JS:

Number. MIN_SAFE_INTEGER and Number. MAX_SAFE_INTEGER

What if I store a value that exceeds this maximum integer? What’s going to happen?

The answer, of course, is overflow. So now not only do we need to know about overflow, but what are the rules of overflow?

Since M is at most 53 digits, any number exceeding that number will overflow and be truncated, for example 🌰 :

Binary representation of 9007199254740992:

100000000000000000000000000000000000000000000000000000

Since only 52 bits can be saved, this value will be truncated:

1.0000000000000000000000000000000000000000000000000000 * 2 ^ 53

Save to memory is:

Take it out and restore it, you can restore it to the original value, because after adding 53 zeros you get the original binary number, and then after converting it you get the original number: 9007199254740992.

But what about 9007199254740993? Let’s follow the steps above:

Binary is: 100000000000000000000000000000000000000000000000000001

The last 1 will be truncated because M is limited to 52 bits:

1.0000000000000000000000000000000000000000000000000000 * 2 ^ 53

It will be the same as 9007199254740992 when saved to memory, so it will be the same as 9007199254740992 when restored to decimal, because it cannot restore the last digit, which is 1, so it will still be 9007199254740992.

So if you write in the console this is true: 9007199254740992 === 9007199254740993

Do you see any patterns in this example?

The overflow number will be truncated so that the corresponding numbers will be equal.

They’ve come up with a rule about this:

1, (2 ^ 2 ^ 53, 54) between the number of can choose one from two, can accurate said even number 2, (2 ^ 2 ^ 54, 55) between the number of four selected one, the only accurate said four multiples of 3,.. Skip more multiples of 2 in turn

The following image illustrates the severity of the error caused by integer overflow:

4. Solutions

  1. Using mathjs
  2. Use the number – precision
  3. Using bigjs
  4. Write yourself a few simple usable operation functions, this can be found on the Internet.

The trade-offs and choices between these methods are determined by project conditions.

After this introduction, if the article has any flaws, you can leave a message to point out, thank you.

reference

  1. A double – precision floating – point number
  2. Convert double digits online
  3. 0.30000000000000004
  4. What Every Computer Scientist Should Know About Floating-Point Arithmetic
  5. Hexadecimal transfers
  6. The toString () specification
  7. CSC231 An Introduction to Fixed- and Floating-Point Numbers
  8. Is Your Model Susceptible to Floating-Point Errors
  9. JavaScript floating point traps and solutions