Why are floating-point calculations imprecise?

Most of the programming languages we use have accuracy problems in floating-point calculations. For example:

double num = 0.1 + 0.1 + 0.1;
// The output is 0.30000000000000004
double num2 = 0.65 - 0.6;
// The output is 0.05000000000000004

In my tests, I found that C/C++ appeared not to have this problem. I initially thought compiler optimization had fixed it, but that makes no sense: given the cause of the problem (explained below), if a compiler optimization could solve it for C/C++, other languages could solve it too. It turned out the printing was at fault: the default output is rounded. Print a few more decimal places with printf("%0.17f\n", num); or cout << setprecision(17) << num2 << endl; and the precision problem shows up in C/C++ as well.
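The same check can be done in Python; a minimal sketch (Python's default output already prints the shortest digits that round-trip, and explicit formatting reveals even more of the stored value):

num = 0.1 + 0.1 + 0.1
print(num)            # 0.30000000000000004 (shortest round-trip form)
print(f"{num:.17f}")  # 0.30000000000000004, like the printf("%0.17f") above
print(f"{num:.25f}")  # even more digits of the stored binary approximation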

So why exactly are these calculations inaccurate? It starts with the fact that computers represent all data in binary. If you are familiar with conversion between binary and decimal, the cause of the precision problem is easy to understand; if not, let's review the conversion process. In general, we convert binary to decimal by summing each digit times its weight, and decimal to binary by repeatedly dividing by 2 and reading the remainders in reverse order. Those already familiar with this can skip ahead.

// Binary to decimal
10010 = 0 * 2^0 + 1 * 2^1 + 0 * 2^2 + 0 * 2^3 + 1 * 2^4 = 18

// Decimal to binary
18 / 2 = 9 ... 0
 9 / 2 = 4 ... 1
 4 / 2 = 2 ... 0
 2 / 2 = 1 ... 0
 1 / 2 = 0 ... 1
// Remainders in reverse order: 10010
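The worked example can be double-checked with Python built-ins:

print(int('10010', 2))  # 18
print(bin(18))          # 0b10010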

So, the question is: how do decimal fractions and binary fractions convert to each other? Going from decimal to binary, the integer part is handled as above (divide by 2, read the remainders in reverse order), while the fractional part is repeatedly multiplied by 2, taking the integer bit each time, in order. Going from a binary fraction to decimal still works by summing weighted digits, now with negative powers of 2 after the point.

// Binary to decimal
10.01 = 1 * 2^1 + 0 * 2^0 + 0 * 2^-1 + 1 * 2^-2 = 2.25

// Decimal to binary
// Integer part
2 / 2 = 1 ... 0
1 / 2 = 0 ... 1
// Fractional part
0.25 * 2 = 0.5 ... 0
0.5  * 2 = 1.0 ... 1
// Result: 10.01
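Here is a minimal Python sketch of the fractional-part algorithm (the function name and the 32-bit cutoff are my own choices, not from the article):

def frac_to_binary(frac, max_bits=32):
    # Repeatedly multiply by 2; each integer bit is the next binary digit.
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # the integer part becomes the next digit
        bits.append(str(bit))
        frac -= bit
    return '0.' + ''.join(bits)

print(frac_to_binary(0.25))  # 0.01
print(frac_to_binary(0.1))   # never terminates exactly; truncated at 32 bits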

Now that we know how to convert fractions, let's get back to why floating-point operations are inaccurate. Take a simple example: what does the decimal number 2.1 look like when converted to binary?

// 2.1 split into two parts
// Integer part
2 / 2 = 1 ... 0
1 / 2 = 0 ... 1
// Fractional part
0.1 * 2 = 0.2 ... 0
0.2 * 2 = 0.4 ... 0
0.4 * 2 = 0.8 ... 0
0.8 * 2 = 1.6 ... 1
0.6 * 2 = 1.2 ... 1
0.2 * 2 = 0.4 ... 0
0.4 * 2 = 0.8 ... 0
0.8 * 2 = 1.6 ... 1
0.6 * 2 = 1.2 ... 1
...

The fractional part falls into an infinite loop, so the result is 10.0001100110011.... A computer stores fractions with a fixed length, so it truncates part of the repeating tail and ends up storing only an approximation rather than the exact value. In other words, binary simply cannot represent the decimal number 2.1 exactly, and when even the representation is inexact, calculation is bound to go wrong.
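You can see the approximation that actually gets stored by handing the float to Python's Decimal, which exposes the float's exact stored value (a quick check):

from decimal import Decimal

# Constructing Decimal from a float captures the float's exact binary value:
print(Decimal(2.1))  # 2.100000000000000088817841970012523233890533447265625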

Solutions to the loss of precision

There are three common approaches:

  1. If the business does not require exact results, simply round and ignore the problem.
  2. Convert the values to integers, do the arithmetic, then scale back (see the sketch after this list).
  3. Use BCD (binary-coded decimal) encoding to store and calculate decimal digits (interested readers can look it up).
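Here is a minimal sketch of approach 2, using money as the classic case (the amounts and names are my own illustration): represent values in the smallest unit, e.g. cents, so the arithmetic itself is exact integer arithmetic.

price_cents = 65      # 0.65 expressed as an integer number of cents
discount_cents = 60   # 0.60
diff_cents = price_cents - discount_cents  # exact integer result: 5
print(diff_cents / 100)  # 0.05 -- only the final display step touches floats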

Generally, every language offers a high-precision arithmetic solution (more expensive than normal arithmetic), such as Python's decimal module and Java's BigDecimal class. Note that the decimal value must be passed in as a string when constructing it; otherwise the precision is already lost (see the pitfall below). For other languages, look up the equivalent facility.

# Python example
from decimal import Decimal

num = Decimal('0.1') + Decimal('0.1') + Decimal('0.1')
print(num)  # 0.3
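To see the pitfall mentioned above, compare constructing Decimal from a float with constructing it from a string (a quick sketch):

from decimal import Decimal

print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal('0.1'))  # 0.1 -- the string preserves the intended decimal value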
// Java example (wrapped in a class so it compiles standalone)
import java.math.BigDecimal;

public class BigDecimalDemo {
    public static void main(String[] args) {
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.1")).add(new BigDecimal("0.1"));
        System.out.println(sum);  // 0.3
    }
}

Extension: floating-point types in detail

As mentioned above, the storage for floating-point numbers is limited. So how does the computer actually store floating-point data? Is it really just a limit on the fraction length, as the discussion above might suggest? Not quite: a float is divided into three parts: the sign bit, the exponent bits, and the mantissa bits.

We usually use scientific notation to represent very large or very small numbers. For example, 1000.00001 is written as 1.00000001 * 10^3, and 0.0001001 is written as 1.001 * 10^-4.

The sign bit

0 is positive, 1 is negative

The exponent bits

The exponent bits are interesting because they must represent both positive and negative exponents, so a scheme called excess (or biased) notation is used. What does it mean? A bias value is chosen, and a stored value equal to the bias means an exponent of 0. For single-precision floats there are eight exponent bits, which can hold values from 0 to 255, and the bias is 2^8 / 2 - 1 = 127; a stored value of 127 means the exponent is 0. If the stored value is 128, the exponent is 128 - 127 = 1; if it is 126, the exponent is 126 - 127 = -1.
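A tiny sketch of the excess-127 mapping (the numbers are chosen to match the examples above):

BIAS = 2**8 // 2 - 1                     # 127 for single precision
for stored in (126, 127, 128):
    print(stored, '->', stored - BIAS)   # -1, 0, 1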

Mantissa bits

In the examples above, 1.00000001 and 1.001 are the significands, but why store only a "mantissa"? Because in binary scientific notation the number always looks like 1.xx: the 1 before the point is always there, so storing it would waste a bit. The mantissa bits therefore store only the fractional part, i.e. the 00000001 and 001 from the examples above.
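Putting the three parts together, the value of a normal single-precision float, written in the article's own notation, is:

value = (-1)^sign * 1.mantissa * 2^(exponent - 127)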

Let's take a small example and look at the concrete storage of the single-precision float 1.25. You can use the following Python code to see the exact bit pattern in which the float is stored.

import struct

num = 1.25
# '!f' packs the value as a big-endian 4-byte IEEE 754 single-precision float.
bins = ' '.join('{:0>8b}'.format(c) for c in struct.pack('!f', num))
print(bins)  # 00111111 10100000 00000000 00000000

The Java version of the code is a little long, so I'll just leave a link: Java version code link

Running the program above on 1.25, we get the concrete binary representation 00111111101000000000000000000000. Splitting it up: the leading 0 is the sign bit, so the number is positive; the next eight bits, 01111111, are the exponent; and 01000000000000000000000 is the mantissa. Let's verify: 01111111 is 127 in decimal, so the exponent is 127 - 127 = 0. The mantissa 01000000000000000000000, plus the implicit leading 1 (and dropping the trailing zeros), is 1.01 in binary, which is 1 + 0/2 + 1/4 = 1.25 in decimal. So the value is 1.25 * 2^0 = 1.25.
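To close the loop, here is a sketch that extracts the three fields programmatically and reconstructs the value (normal numbers only; zeros, subnormals, infinities, and NaNs are ignored for brevity):

import struct

def decode_float(num):
    bits = int.from_bytes(struct.pack('!f', num), 'big')
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits, excess-127
    mantissa = bits & 0x7FFFFF         # 23 mantissa bits
    # Add the implicit leading 1, then scale by the unbiased exponent.
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    print(f'sign={sign} exponent={exponent - 127} value={value}')

decode_float(1.25)  # sign=0 exponent=0 value=1.25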