preface

Finally survived the busiest business season; time to clear my head and pick up some basics again. I hope you can stick with it ~ keep it up!

An earlier article briefly introduced the basic operations of integer addition, subtraction, multiplication and division in binary; if you haven't seen it, it's worth a quick look first. In this section we'll explore how floating-point numbers carry out those basic operations in binary, and along the way understand the problem of precision loss ~

Open the Chrome console and type in something very simple:

0.1 + 0.2 // 0.30000000000000004

Are you surprised that such a simple calculation, whether in JS or Python, is not exactly 0.3?

origin

To understand this, we first need to know how floating-point numbers are actually stored on a computer. I don't know about you, but my initial reaction was to assume 32 bits of storage space and think of it the way integers work: for example, bits 1 to 24 hold the integer part and the remaining 8 bits the fractional part. Would that work? Of course it could, but consider the following question first:

Imagine that the red area is the biggest space we can put a number in. Now we have a problem: we want to add more zeros, but they won't fit, because the space is limited. What do we do?

Yeah, that's right: scientific notation, the thing we all learned in school. When a number has too many digits, we switch to scientific notation, and the nice thing about it is that a small number of digits can express a huge range. Back to the computer: if we use 32 bits for real numbers, how many distinct values can we represent at most? 2^32, which is about 4 billion. Is 4 billion a lot of numbers? It is, but compared with the infinitely many real numbers it's a drop in the ocean, nowhere near enough. So computer designers had to think about this problem: how do we make those bits hold more numbers?

There really is such a thing as a "fixed-point number"

Remember the scheme where bits 1 to 24 are the integer part and the rest the fractional part? That kind of storage is called a fixed-point number. A classic variant spends 4 bits on each decimal digit 0 to 9: 32 bits then hold 8 decimal digits, say 6 for the integer part and 2 for the fractional part, so we can represent the numbers from 0 to 999,999.99, that is, 100 million distinct real numbers. This way of expressing base 10 inside base 2 is called BCD (Binary-Coded Decimal); in the common 8421 code, for example, the weights of the four bits from left to right are 8, 4, 2, 1. Look it up if you're interested.
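As a quick illustration (a minimal sketch of the 8421 idea, not how any particular chip does it), you can pack decimal digits into 4-bit nibbles like this:

// A minimal sketch: encode a non-negative integer's decimal digits as 8421 BCD nibbles.
function toBCD(n) {
  return String(n)
    .split('')
    .map(d => Number(d).toString(2).padStart(4, '0')) // each digit 0~9 becomes 4 bits
    .join(' ');
}

toBCD(59) // '0101 1001', 5 is 0101 and 9 is 1001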

What are the problems with fixed-point numbers?

Fixed-point numbers have several obvious disadvantages:

  • It uses a lot of bits, yet the range of numbers it can represent is limited;
  • It cannot represent very large and very small numbers at the same time

In fact, the root cause is the "fixed" in fixed-point: the decimal point cannot move. Is there a way to make the numbers representable in 32 bits feel more "infinite", better suited to our needs?

Of course there is; the wisdom of the pioneers who designed computers was boundless.

How are floating-point numbers represented?

The IEEE 754 standard defines two basic floating-point formats: 32-bit single precision, float (float32), and 64-bit double precision, double (float64).

A 32-bit single-precision float is divided into three parts:

  1. The first part is the sign bit, denoted s, 1 bit, which encodes plus or minus; one thing to remember is that in floating point, all numbers are signed;
  2. The second part is the exponent, denoted e, 8 bits, whose raw value ranges over 0~255. To represent both large and small numbers, the raw values 1~254 are mapped (by subtracting a bias of 127) onto -126~127, so the largest and smallest magnitudes can be covered at the same time;
  3. The third part is the significand, denoted f, 23 bits, which holds the significant digits;

Combining these parts with scientific notation, our floating-point number can be expressed by the formula

(-1)^s * 1.f * 2^e

Did you notice anything wrong with the formula? You'll find that it cannot express 0. Indeed, this is a clever design: the raw exponent values 0 (all 8 bits 0) and 255 (all 8 bits 1) are reserved for special values; you can think of them as special flag patterns. For example, when e and f are both 0, the floating-point number is 0. See the table below:

e (8 bits)    f (23 bits)   meaning
all 0s        0             0
all 0s        not 0         subnormal number
all 1s        0             infinity (sign given by s)
all 1s        not 0         NaN (not a number)

Take 0.5 for example. The sign bit of 0.5 is 0, f is 0, e is -1,

So (-1)^0 * 1.0 * 2 ^ -1 = 0.5

In 32 bits, that looks like this:

       s (1 bit)   e (8 bits)   f (23 bits)
0.5    0           0111 1110    0000…0
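If you'd like to check this yourself, here's a small sketch (assuming a browser or Node environment with typed arrays) that splits a float32 into its s, e and f bits:

// A minimal sketch: read a float32's 32-bit pattern and split it into s / e / f.
function float32Bits(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);                                  // store x as a float32 (big-endian)
  const bits = view.getUint32(0).toString(2).padStart(32, '0');
  return { s: bits[0], e: bits.slice(1, 9), f: bits.slice(9) };
}

float32Bits(0.5) // { s: '0', e: '01111110', f: '00000000000000000000000' }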

In this way, 32 bits can obviously cover a very large range of real numbers, and because the position of the decimal point can "float", numbers stored this way are called floating-point numbers.

So now we know how floating-point numbers are stored, but that doesn't yet solve the problem we started with: why 0.1 + 0.2 != 0.3. First we need to know how 0.1 is stored:

(-1)^s * 1.f * 2^e = 0.1

Solving for e:

s = 0, f = 0, e = Math.log2(0.1) // -3.321928094887362; but e must be an integer, so 0.1 cannot be stored exactly

Among the one-decimal numbers 0.1 through 0.9, only 0.5 can be represented exactly; none of the others can. This is why 0.1 + 0.2 runs into precision problems. In other words, floating-point numbers are approximations both in representation and in calculation, and approximation inevitably raises questions. Would you want your bank to store your deposits and compute interest in floating point? Of course not: if it rounds in your favor, fine, but if it rounds against you, isn't that a real loss? ~
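You can see the approximation right in the console (JS's Number is a 64-bit double, so these digits are for float64, not float32):

(0.1).toPrecision(21)        // '0.100000000000000005551'
(0.2).toPrecision(21)        // '0.200000000000000011102'
(0.1 + 0.2).toPrecision(21)  // '0.300000000000000044409'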

Floating point & binary

To convert a binary floating-point number, say 0.1001, to base 10: every place after the binary point carries a weight of 2^-n, so the conversion is:

(1 * 2^-1) + (0 * 2^-2) + (0 * 2^-3) + (1 * 2^-4) = 0.5625

You can think of it like this: going from binary to decimal, digits to the left of the point take exponents of 2 starting from 0 and increasing by 1 per position, while digits to the right start from -1 and decrease by 1 per position.

Converting a decimal floating-point number to binary works differently for the two parts. The integer part uses the familiar "divide by 2 and keep the remainder"; the fractional part runs in the opposite direction: multiply by 2, check whether the result reaches 1, and if it does, write down a 1 and subtract 1; then repeat the operation on what's left.
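In code, that "multiply by 2 and take the integer bit" procedure looks roughly like this (a minimal sketch; it just cuts off after maxBits bits):

// A minimal sketch: convert the fractional part of a decimal number to binary,
// cutting off after maxBits bits (a non-terminating fraction would loop forever).
function fractionToBinary(frac, maxBits = 24) {
  let bits = '';
  while (frac > 0 && bits.length < maxBits) {
    frac *= 2;
    if (frac >= 1) { bits += '1'; frac -= 1; } // reached 1: write down 1 and subtract it
    else           { bits += '0'; }            // still below 1: write down 0
  }
  return '0.' + bits;
}

fractionToBinary(0.1) // '0.000110011001100110011001', the "0011" pattern repeating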

For example, take the decimal number 9.1. Converting the fractional part 0.1 this way produces the infinitely repeating pattern "0011": 0.000110011…. The integer part 9 in binary is 1001, so the result is 1001.000110011…

Shifting the binary point 3 places to the left gives the floating-point form 1.001000110011… * 2^3

Matching this against (-1)^s * 1.f * 2^e:

s = 0, f = 00100011001100110011001 (the first 23 bits of the fraction)

The exponent is 3. Recall that the raw e values 1~254 are split around the bias 127, which stands for 0, so we add 3 to 127 and get 130, which in binary is 1000 0010. So e = 1000 0010, and the fields line up as follows:

So the final binary representation is: 0100 0001 0001 0001 1001 1001 1001 1001

If we convert this floating-point representation to decimal, the actual value is 9.09999942779541015625. I’m sure you won’t be surprised by now.
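You can check this number yourself with a few lines (a sketch of the decoding; toFixed(20) is only there to print enough digits):

// A sketch: decode a 32-bit IEEE 754 pattern, given as a bit string, back to decimal.
function decodeFloat32(bits) {
  bits = bits.replace(/\s/g, '');                         // drop the grouping spaces
  const s = bits[0] === '1' ? -1 : 1;                     // sign
  const e = parseInt(bits.slice(1, 9), 2) - 127;          // exponent, bias 127 removed
  const f = parseInt(bits.slice(9), 2) / Math.pow(2, 23); // 23 fraction bits
  return s * (1 + f) * Math.pow(2, e);
}

decodeFloat32('0100 0001 0001 0001 1001 1001 1001 1001').toFixed(20)
// '9.09999942779541015625'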

Watch your “savings”

First, let's see what the process of a floating-point addition actually is, taking 0.5 + 0.125 as the example. First, 0.5 in our formula comes out as:

s = 0, significand 1.f = 1.0000…, e = -1

and 0.125 converts to:

s = 0, significand 1.f = 1.0000…, e = -3

To add them, the exponents are aligned first (shifting from the smaller toward the larger, so e is unified to -1), then the significands are added bit by bit while the sign bit and the unified exponent are kept, giving:

                            sign s   exponent e   significand 1.f
0.5                         0        -1           1.0
0.125                       0        -3           1.0
0.125 (exponent aligned)    0        -1           0.01
0.5 + 0.125                 0        -1           1.01

(-1)^0 * 1.25 * 2^-1 = 0.625;

Ps: Why 1.25? We computed 1.01, but don't forget the arithmetic is in base 2; converting back to base 10, the fractional bits take weights 2^-1, 2^-2, … from the left (ignore trailing zeros), so 0.01 in binary is 2^-2 = 0.25, plus the leading 1 gives 1.25.
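Here's a toy sketch of that "align the exponents, then add the significands" idea (just an illustration with { s, e, f } objects, nothing like real FPU circuitry):

// A toy sketch of "align, then add": a number is { s, e, f } meaning (-1)^s * f * 2^e.
function addFloats(a, b) {
  const e = Math.max(a.e, b.e);            // unify on the larger exponent
  const fa = a.f * Math.pow(2, a.e - e);   // the smaller number's significand shifts right,
  const fb = b.f * Math.pow(2, b.e - e);   //   e.g. 1.0 * 2^-3 becomes 0.25 * 2^-1 (0.01 in binary)
  return { s: 0, e: e, f: fa + fb };       // signs ignored in this toy version
}

addFloats({ s: 0, e: -1, f: 1.0 }, { s: 0, e: -3, f: 1.0 })
// { s: 0, e: -1, f: 1.25 }  ->  (-1)^0 * 1.25 * 2^-1 = 0.625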

As you can see, the floating-point addition process can actually be implemented with an adder, so the circuit cost is not very high, but some other problems need attention:

During the calculation, alignment has to happen first, but the significand is only 23 bits long. If a large number and a small number are added, then during alignment the small number's significand is shifted right, and the bits pushed past the 23rd position simply overflow and are lost; those dropped significant bits produce an error in the result. If the two exponents differ by more than 23, say by a factor of 2^24 (about 16 million times), the sum is just the larger number, and the smaller one is discarded entirely…

Some of you will rush off to the Chrome console and type the following:

Math.pow(2, 24) + 0.1 // 16777216.1

Don't worry, buddy, JS's built-in Number is a 64-bit double, so the gap has to be bigger; you can try

Math.pow(2, 50) + 0.1 // 1125899906842624

See? The decimal part is gone. This phenomenon is also called "big numbers eating small numbers".

So if a bank kept deposits in the IEEE 754 32-bit floating-point format: say you're a big boss with RMB 20,000,000 in your account, and one of your employees transfers you RMB 1. Haha, sorry, the bank's books lose it, and your balance is unchanged! That's why banks and e-commerce systems generally use fixed-point numbers or integers wherever money is involved, to avoid precision loss. If you ever touch a bank's database, be careful ~
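In practice that often just means keeping money as an integer count of the smallest unit; a minimal sketch (the variable names are only illustrative):

// A sketch: keep money as an integer number of fen (cents) and only format at the edges.
const balanceFen = 20000000 * 100;      // RMB 20,000,000.00 held as 2,000,000,000 fen
const afterTransfer = balanceFen + 100; // the employee's RMB 1.00 arrives as 100 fen
(afterTransfer / 100).toFixed(2)        // '20000001.00', nothing gets eaten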

conclusion

In this article, we worked through the representation, storage, conversion and calculation of floating-point numbers to analyze how they behave in the real computer world, and we saw why floating-point numbers lose precision:

  1. A floating-point number may not convert to base 2 exactly when it is stored
  2. During calculation, big numbers can eat small numbers, which also makes results inaccurate

extension

Precision loss is not unsolvable; there are mature solutions. I won't go into them here, but if you're interested you can study:

The Kahan Summation algorithm
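For a taste, here is a small sketch of the idea: keep a separate compensation term that carries the low-order bits each addition would otherwise drop:

// A sketch of Kahan summation: c carries the low-order error lost in each addition.
function kahanSum(numbers) {
  let sum = 0, c = 0;
  for (const x of numbers) {
    const y = x - c;      // correct the incoming value by the stored error
    const t = sum + y;    // the big sum swallows y's low-order bits...
    c = (t - sum) - y;    // ...and this algebraic trick recovers what was swallowed
    sum = t;
  }
  return sum;
}

kahanSum(new Array(10).fill(0.1)) // 1, while adding 0.1 ten times naively gives 0.9999999999999999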

Note:

Most of the content of this article draws on teacher Xu Wenhao's column on computer organization principles, plus some understanding of my own, as a simple summary. I'll keep sharing what I learn from time to time; if you think it's good, give it a thumbs-up.

Ps: Some of you might ask: if only 0.5 converts to an exact number, why does 0.1 + 0.1 show no problem? I haven't studied this carefully, but my guess is that since the calculation is itself an approximation built on approximations, if the result still lands within the range that reads back as the expected number, we see no error, and beyond that range we do. In short, what we can confirm is that the values we handle during calculation are indeed approximations, and that is indeed the cause of some floating-point precision loss ~

If you are interested, you can go here to see the actual numbers stored in the computer