All of a sudden, 2021 is over and this is my year-end blog 🙂

In the computer world, there are two ways of representing numbers: fixed-point numbers and floating-point numbers. Today we'll start with fixed-point numbers and then move on to floating-point numbers (forgive me if parts of this read like rambling).

You've probably heard that floating-point numbers can be inaccurate. In finance, for example in banks, systems cannot tolerate a single numerical error, so they tend to store their data as fixed-point numbers. Fixed-point numbers store values exactly, within their declared range and precision.

Computers store data in binary. To store a decimal number, we first need a lookup table.

Take a look at the chart below:

Decimal   BCD code
0         0000
1         0001
2         0010
3         0011
4         0100
5         0101
6         0110
7         0111
8         1000
9         1001

The BCD code above is just an encoding rule: to convert a decimal number to a fixed-point representation, we look up each decimal digit in the table above, one digit at a time.

Let's say I have the number 100.20.

Looking up each digit in the table, we can convert it to:

1    0    0    2    0
0001 0000 0000 0010 0000

Notice that the decimal point is not stored here. In general, programs that use fixed-point numbers keep the position of the decimal point somewhere else (for example, as a fixed scale), so we don't need to worry about storing it.

Since numbers can be positive or negative, we also add a four-bit sign nibble on the far left. In this article's convention, 1 means positive and 0 means negative. The 100.20 above then becomes:

0001 0001 0000 0000 0010 0000

A byte in a computer contains eight bits. The data above takes up three bytes; grouping by bytes instead of groups of four is less intuitive, but closer to how it is actually stored:

00010001 00000000 00100000
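If you'd like to play with this, here is a minimal JavaScript sketch of the packing described above. The function name packBCD is just for illustration, it follows this article's sign convention rather than any standard packed-decimal format, and it skips the fixed-width padding discussed next.

```javascript
// A minimal sketch of the BCD packing described above (this article's own
// convention: sign nibble 1 = positive, 0 = negative; the decimal point is
// assumed to be tracked elsewhere).
function packBCD(text) {
  const negative = text.startsWith('-');
  const digits = text.replace(/[-.]/g, '');        // "100.20" -> "10020"
  const nibbles = [negative ? 0 : 1];              // sign nibble first
  for (const d of digits) nibbles.push(Number(d)); // each digit -> 0..9

  if (nibbles.length % 2 !== 0) nibbles.push(0);   // pad to a whole byte
  const bytes = [];
  for (let i = 0; i < nibbles.length; i += 2) {
    bytes.push((nibbles[i] << 4) | nibbles[i + 1]); // two nibbles per byte
  }
  return bytes.map(b => b.toString(2).padStart(8, '0')).join(' ');
}

console.log(packBCD('100.20')); // "00010001 00000000 00100000"
```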

How many bytes a fixed-point number occupies is decided by the system's needs, generally based on the largest and smallest values it has to represent.

Suppose the largest number in our system is 999.99 and the smallest is -999.99; then three bytes are enough for every number in the system. 999.99 would be represented as

00011001 10011001 10011001

In the same system, the number -0.01 is stored as

00000000 00000000 00000001

As a result, -0.01 wastes two and a half of its three bytes on zeros. But in places like banks, precision is paramount, and monetary amounts are not unimaginably long. To take a more concrete example, suppose a system only needs to support amounts up to 100 million RMB; how many bytes does each number need? In fact, only 5 bytes:

100000000
00010001 00000000 00000000 00000000 00000000
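As a quick sanity check on those byte counts, here is a back-of-the-envelope sketch following the same scheme of one nibble per digit plus a sign nibble:

```javascript
// Bytes needed for a BCD number with a one-nibble sign:
// one nibble per digit, plus the sign nibble, rounded up to whole bytes.
const bcdBytes = digits => Math.ceil((digits + 1) / 2);

console.log(bcdBytes(5)); // 3 bytes for the ±999.99 system (5 digits)
console.log(bcdBytes(9)); // 5 bytes for amounts up to 100,000,000 (9 digits)
```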

For comparison, an int is typically four bytes, a long is eight bytes, and a double (double-precision floating point) is also eight bytes.

Therefore, fixed-point numbers work well in scenarios where accuracy is high but data length is not very long.

But everyday use is different: when we define numbers, we may want to put the decimal point anywhere, or define arbitrarily large numbers.

Fixed-point numbers can't place the decimal point arbitrarily, and we don't want separate data types just to hold a 2-decimal-place number and a 3-decimal-place number.

How much space would a language have to reserve for fixed-point numbers to cover every possible usage scenario? It's hard to say. So fixed-point numbers are difficult to use as the general-purpose number type.

Number storage has always been a trade-off between precision and memory, and floating-point numbers sit on the memory-saving side of that trade-off.

Put simply, a floating-point number represents a value using scientific notation in binary.

In junior high school we learned scientific notation for decimal numbers. For example, 1000 is written as:


1.00 * 10^3
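As an aside, JavaScript's built-in toExponential prints a number in exactly this decimal scientific notation:

```javascript
console.log((1000).toExponential(2)); // "1.00e+3"
console.log((1234).toExponential(2)); // "1.23e+3", two digits after the point
```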

The 1 followed by two zeros indicates that the number is accurate to two decimal places.

1.00 is called the significand; in decimal it is greater than or equal to 1 and less than 10.

The 10 is the base (I forget the exact term, but I vaguely remember it being called that).

Scientific notation in binary uses the same logic, except that the significand must be greater than or equal to binary 1 and less than binary 10, that is, greater than or equal to 1 and less than 2 in decimal.

And the base is no longer 10, but 2.

For a binary number 10001 (decimal 17), we can express it as


1.0001 * 2^4

Actually, this notation might be confusing to some of you.

Some readers may wonder why 1.0001 * 16 is 16.0016 rather than 17.

Don't expect that formula to give 17 when you evaluate it in decimal: 1.0001 here is a binary number, and the carry rules of binary arithmetic are different from decimal, so you can't just compute 1.0001 * 16 as if both were decimal numbers.

To actually compute the result, we need to convert 16 to binary as well (binary 10000) and do the multiplication in binary. The real result is this:

1.0001 * 10000 = 10001

Converting binary 10001 back to decimal gives 17.
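You can check this in JavaScript, where binary 1.0001 is the decimal value 1.0625 (1 + 1/16):

```javascript
console.log(parseInt('10001', 2)); // 17: binary 10001 read as an integer
console.log(0b10001);              // 17 again, using a binary literal
// Binary 1.0001 is 1 + 1/16 = 1.0625 in decimal; shifted left 4 places:
console.log(1.0625 * 2 ** 4);      // 17
```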

From this binary scientific notation we can also infer an implicit rule: the significand must be greater than or equal to binary 1 and less than binary 10, which means the digit to the left of the binary point is always 1. Since it is always 1, the computer doesn't need to store it; only the fractional part needs to be stored.

Next, floating point numbers.

Floating-point numbers follow the basic structure of binary scientific notation above and come in two basic formats: single precision, stored in 4 bytes, and double precision, stored in 8 bytes. Single precision takes fewer bytes and, accordingly, covers a smaller range and fewer exact digits than double precision.

We can get an idea of what precision means here by referring to the decimal system.

Imagine one decimal format that keeps two decimal places and another that keeps four. If we store an integer such as 10, both represent it exactly, with no difference.

Now suppose we want to store pi. The former gives 3.14 and the latter 3.1415. Both have some error, but the latter keeps more digits and is therefore more accurate.

This also reflects the two differences between single and double precision: double precision can represent a larger range of numbers, and for a value that can't be stored exactly, the double-precision error is smaller.

Let’s first introduce the single-precision format:

As stated above, it has four bytes, or 32 bits, and these four bytes are allocated as follows:

s = 1 sign bit | e = 8 exponent bits (0 ~ 255) | f = 23 significand bits

How does it represent a number in terms of these three parts?


(-1)^s * 1.f * 2^{e-127}

Continuing with the example of 10001, it is:


(-1)^0 * 1.0001 * 2^{4}   (so the stored exponent field is e = 4 + 127 = 131)
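To make this concrete, here is a small sketch that stores 17 as a single-precision float and pulls the three fields back out with a DataView (decodeFloat32 is just an illustrative name, not a built-in):

```javascript
// Decode the s / e / f fields of a single-precision float (a sketch).
function decodeFloat32(x) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);
  const bits = view.getUint32(0);
  const s = bits >>> 31;                 // 1 sign bit
  const e = (bits >>> 23) & 0xff;        // 8 exponent bits, biased by 127
  const f = bits & 0x7fffff;             // 23 significand bits (the ".f" part)
  return { s, e, exponent: e - 127, f: f.toString(2).padStart(23, '0') };
}

console.log(decodeFloat32(17));
// { s: 0, e: 131, exponent: 4, f: '00010000000000000000000' }
// i.e. (-1)^0 * 1.0001 (binary) * 2^(131 - 127) = 17
```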

Let’s look at the double format, which has 8 bytes and 64 bits:

s = 1 sign bit | e = 11 exponent bits (0 ~ 2047) | f = 52 significand bits

The format for representing numbers is the same as for single precision, except that f has more bits and e is larger:


(-1)^s * 1.f * 2^{e-1023}
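The same decoding for a double needs 64-bit integers, which in JavaScript means BigInt. Again, this is only a sketch, and decodeFloat64 is an illustrative name:

```javascript
// Decode the s / e / f fields of a double-precision float (a sketch).
function decodeFloat64(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const s = Number(bits >> 63n);            // 1 sign bit
  const e = Number((bits >> 52n) & 0x7ffn); // 11 exponent bits, biased by 1023
  const f = bits & ((1n << 52n) - 1n);      // 52 significand bits
  return { s, e, exponent: e - 1023, f: f.toString(2).padStart(52, '0') };
}

console.log(decodeFloat64(17));
// { s: 0, e: 1027, exponent: 4, f: '0001' followed by 48 zeros }
// Same significand 1.0001 as single precision, just a larger exponent bias.
```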

Let’s take a look at the effects of s, e, and f on the format of a floating point number.

s is the sign bit; it's only one bit, so its value is either 0 or 1.

When s is 0, (-1)^s evaluates to 1; when s is 1, it evaluates to -1. So it simply determines the sign.

e is the exponent part; intuitively, it controls how far the binary point is shifted. For example, 1.0001 * 2^{5} means the binary point of 1.0001 moves 5 places to the right.

f is the significand in the scientific-notation format, and it is what precision mainly depends on. Let's make that a bit more concrete.

Take single precision for example.

According to the single-precision formula above, the largest value it can express is:


(-1)^s * 1.f * 2^{127}

In base 10 that is about 3.402823466 * 10^{38}, which is quite a large number. But take a number well inside that range, 16777216; represented in binary scientific notation it is:


1.00000000000000000000000 * 2^{24}

Since the numbers a computer can represent are discrete rather than continuous, the next representable number after the one above (that is, the significand increased by its smallest step) is:


1.00000000000000000000001 * 2^{24}

In decimal, that number is 16777218. Everything in the open interval (16777216, 16777218) is missing: any value in between gets stored with the same bit pattern as one of its neighbours, just like 16777216 above. In other words, numbers in that interval cannot be represented exactly.
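You can observe this gap directly in JavaScript: a Float32Array stores its elements in single precision, so 16777217 has nowhere to go and gets rounded to a neighbour:

```javascript
const f32 = new Float32Array(1);
f32[0] = 16777216;            // 2^24, exactly representable
console.log(f32[0]);          // 16777216
f32[0] = 16777217;            // falls in the gap (16777216, 16777218)
console.log(f32[0]);          // 16777216, rounded to the nearest representable value
console.log(Math.fround(16777217) === 16777216); // true, same rounding

// Doubles behave the same way, just much further out (beyond 2^53):
console.log(2 ** 53 === 2 ** 53 + 1); // true
```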

As mentioned above, this is the precision problem floating-point numbers face. Double-precision floating-point numbers have the same issue, but with far more significand bits they do considerably better than single precision.

If you store an irrational number, or a number that needs more digits than the significand can hold, the stored value will not be exact. For example, 0.1 converts to binary as an infinitely repeating fraction:

0.0001 1001 1001 1001 1001 1001...

To store it, you have to cut it off at some point, so the stored value is already inaccurate, and you can't expect arithmetic on it to come out exactly right.
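This is exactly where the classic JavaScript surprise comes from:

```javascript
console.log(0.1 + 0.2);         // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3); // false: both sides carry truncation error
// A common workaround is to compare with a small tolerance instead of ===:
console.log(Math.abs(0.1 + 0.2 - 0.3) < Number.EPSILON); // true
```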

Finally, in JavaScript, the number type is stored in the double-precision floating-point format. In addition to normal values, there are a few special values: NaN and Infinity. Compare them against the formula below:


(-1)^s * 1.f * 2^{e-1023}
  1. If e == 2047 and f != 0, the value is NaN.
  2. If e == 2047 and f == 0, the value is positive or negative infinity (depending on the sign bit).

For these special values, you don't need to worry about what the formula would evaluate to; they are just reserved bit patterns.
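If you're curious, you can confirm those bit patterns with the same DataView trick as before (doubleBits is again just an illustrative helper):

```javascript
// Show the raw 64-bit pattern of a double as sign | exponent | fraction.
function doubleBits(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const b = view.getBigUint64(0).toString(2).padStart(64, '0');
  return `${b[0]} ${b.slice(1, 12)} ${b.slice(12)}`;
}

console.log(doubleBits(Infinity));  // 0 11111111111 000...0  (e = 2047, f = 0)
console.log(doubleBits(-Infinity)); // 1 11111111111 000...0  (sign bit set)
console.log(doubleBits(NaN));       // 0 11111111111 1000...0 (e = 2047, f != 0)
// The exact fraction bits of NaN can vary by engine; the key point is f != 0.
```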

Well, that’s all this article has to say about fixed-point and floating point numbers.

Thanks for reading