@[TOC]

The addition and subtraction of floating point numbers

We can approach the binary case by analogy with the steps of decimal floating-point addition and subtraction.

Steps of decimal floating-point addition and subtraction:

Floating-point addition and subtraction consists of five steps: ① exponent alignment, ② mantissa addition or subtraction, ③ normalization, ④ rounding, ⑤ overflow check

For example, calculate 9.85211 × 10^12^ + 9.96007 × 10^10^

Solution:
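A minimal sketch of how the five steps apply to this sum, assuming a mantissa of 6 significant decimal digits and round-half-up (neither is stated above). The function name `add_decimal` and its parameters are made up for illustration, and the sketch keeps every digit during alignment, whereas real hardware keeps only a few guard digits.

```python
from decimal import Decimal, ROUND_HALF_UP

def add_decimal(m1, e1, m2, e2, digits=6):
    """Add m1 * 10**e1 and m2 * 10**e2, keeping `digits` significant digits."""
    # 1. Exponent alignment: the operand with the smaller exponent is
    #    shifted right until both exponents match.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2.scaleb(e2 - e1)
    # 2. Mantissa addition.
    m, e = m1 + m2, e1
    # 3. Normalization: bring the mantissa back to 1 <= |m| < 10.
    while abs(m) >= 10:
        m, e = m.scaleb(-1), e + 1
    while m != 0 and abs(m) < 1:
        m, e = m.scaleb(1), e - 1
    # 4. Rounding to the assumed number of significant digits.
    m = m.quantize(Decimal(1).scaleb(1 - digits), rounding=ROUND_HALF_UP)
    if abs(m) >= 10:  # rounding may carry out, e.g. 9.999996 -> 10.00000
        m, e = m.scaleb(-1), e + 1
    # 5. Overflow check is omitted here: a Python int can hold any exponent.
    return m, e

mantissa, exponent = add_decimal(Decimal("9.85211"), 12, Decimal("9.96007"), 10)
print(mantissa, exponent)
```

After alignment the second operand becomes 0.0996007 × 10^12^, the mantissa sum is 9.9517107, and rounding to six digits gives 9.95171 × 10^12^.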

The addition and subtraction of binary floating-point numbers

Binary floating-point addition and subtraction follows the same five steps as the decimal case above.

Let’s look at an example: given the decimal numbers X=−5/256 and Y=+59/1024, calculate X−Y according to the machine’s complement (two's complement) floating-point rules. The result is expressed in the following binary format: == 2 bits for the exponent sign, 3 bits for the exponent, 2 bits for the mantissa sign, and 9 bits for the mantissa ==

Solution:

First we represent the exponent and the mantissa in complement form. Start from the true values, with the exponents written in binary:

5D = 101B, 1/256 = 2^-8^ → X = −101 × 2^-8^ = −0.101 × 2^-5^ = −0.101 × 2^-101^

59D = 111011B, 1/1024 = 2^-10^ → Y = +111011 × 2^-10^ = +0.111011 × 2^-4^ = +0.111011 × 2^-100^

Then convert both numbers to complement form:

X: 11011,11.011000000 == (X is negative, so its complement is obtained by inverting the bits and adding 1; the exponent and the mantissa are converted the same way) ==

Y: 11100,00.111011000
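The encoding above can be checked with a short sketch. The helper names `twos_complement` and `encode` are hypothetical, and the field widths (5 exponent bits and 11 mantissa bits, sign bits included) follow the format stated in the example.

```python
from fractions import Fraction

def twos_complement(value: int, bits: int) -> str:
    """Return `value` as a `bits`-wide two's-complement bit string."""
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def encode(mantissa: Fraction, exponent: int) -> str:
    """Encode mantissa * 2**exponent as 'eeeee,ss.fffffffff'."""
    exp_bits = twos_complement(exponent, 5)        # 2 sign bits + 3 exponent bits
    scaled = mantissa * (1 << 9)                   # mantissa has 9 fraction bits
    if scaled.denominator != 1:
        raise ValueError("mantissa needs more than 9 fraction bits")
    man_bits = twos_complement(int(scaled), 11)    # 2 sign bits + 9 fraction bits
    return f"{exp_bits},{man_bits[:2]}.{man_bits[2:]}"

print(encode(Fraction(-5, 8), -5))    # X = -0.101B x 2^-5
print(encode(Fraction(59, 64), -4))   # Y = +0.111011B x 2^-4
```

Masking with `(1 << bits) - 1` is what produces the complement of a negative value, so no explicit "invert and add 1" step is needed in the code.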

1. Exponent alignment

Make the exponents of the two numbers equal. The smaller exponent is raised to match the larger one: each time the mantissa is shifted right by one bit, the exponent is increased by 1.

Exponent difference: [ΔE] complement = 11011 + 00100 = 11111, so ΔE = −1 and X has the smaller exponent.

Alignment: X: 11011,11.011000000 → 11100,11.101100000, that is, X = −0.0101 × 2^-100^

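2. Mantissa addition and subtraction

Subtracting Y is done by adding −Y, whose mantissa is the complement of Y's mantissa:

−Y: 11100,11.000101000

Then we add the mantissas of X and −Y: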

	11.101100000
+ 	11.000101000
	10.110001000

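So X−Y: 11100,10.110001000

3. Normalization

The two mantissa sign bits differ (10), so the mantissa has overflowed. Shift it right by one bit and add 1 to the exponent:

X−Y: 11100,10.110001000 → 11101,11.011000100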

4. Rounding

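No rounding is needed: the bit shifted out during right normalization is 0.

5. Overflow check

The exponent is within range (its two sign bits are equal), so there is no overflow. The true value of the result is 2^-3^ × (−0.1001111B)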
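The whole calculation can be replayed with a small sketch. It assumes the format used in the example (9 mantissa fraction bits, exponent range −8 to 7) and simplifies rounding to plain truncation, which is enough here because no 1 bit is shifted out. The name `sub_float` is made up for illustration.

```python
from fractions import Fraction

FRAC_BITS = 9            # mantissa fraction bits in the example format
EXP_MIN, EXP_MAX = -8, 7 # range of a 3-bit exponent plus sign

def sub_float(mx, ex, my, ey):
    """Return (mantissa, exponent) of x - y, where x = mx * 2**ex, y = my * 2**ey."""
    scale = 1 << FRAC_BITS
    ix, iy = int(mx * scale), int(my * scale)   # mantissas as scaled integers
    # 1. Exponent alignment: shift the mantissa with the smaller exponent right.
    #    Python's >> on a negative int is an arithmetic shift.
    while ex < ey:
        ix, ex = ix >> 1, ex + 1
    while ey < ex:
        iy, ey = iy >> 1, ey + 1
    # 2. Mantissa subtraction (equivalent to adding the complement of y).
    im, e = ix - iy, ex
    # 3. Normalization: right-normalize on mantissa overflow,
    #    left-normalize while the sign bit equals the first fraction bit.
    while im >= scale or im < -scale:
        im, e = im >> 1, e + 1
    while im != 0 and -scale // 2 <= im < scale // 2:
        im, e = im << 1, e - 1
    # 4. Rounding: simplified to truncation by the shifts above.
    # 5. Overflow check on the exponent.
    if not EXP_MIN <= e <= EXP_MAX:
        raise OverflowError("exponent out of range")
    return Fraction(im, scale), e

m, e = sub_float(Fraction(-5, 8), -5, Fraction(59, 64), -4)
print(m, e)   # mantissa -79/128 is -0.1001111B, exponent -3
```

Multiplying the returned mantissa by 2^-3^ gives −79/1024, which equals −5/256 − 59/1024 computed directly.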

Addition and subtraction of floating point numbers – rounding rules

“Round down on 0, round up on 1” method:

This is similar to ordinary rounding in decimal arithmetic. When the mantissa is shifted right, if the highest bit shifted out is 0, the shifted-out bits are simply discarded; if it is 1, 1 is added to the last bit of the mantissa. Adding 1 may make the mantissa overflow again, in which case another right normalization is needed.

“Always set 1” method:

When the mantissa is shifted right, the last bit of the shifted mantissa is always set to “1”, regardless of whether the highest bit lost is “1” or “0”. This method can make the mantissa either larger or smaller.

For example,
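A minimal sketch of the two rounding rules, assuming the mantissa is given as a non-negative integer magnitude and `n` is the number of bits shifted out. Both function names are hypothetical.

```python
def shift_round_0_1(mantissa: int, n: int) -> int:
    """Shift right by n bits; add 1 if the highest bit shifted out is 1."""
    if n == 0:
        return mantissa
    highest_dropped = (mantissa >> (n - 1)) & 1
    return (mantissa >> n) + highest_dropped

def shift_set_1(mantissa: int, n: int) -> int:
    """Shift right by n bits and force the last kept bit to 1."""
    if n == 0:
        return mantissa
    return (mantissa >> n) | 1

# Dropping the two low bits of 101110B: the highest dropped bit is 1.
print(bin(shift_round_0_1(0b101110, 2)))  # 1011B + 1
print(bin(shift_set_1(0b101110, 2)))      # last bit forced to 1

# Rounding up can carry out of the field: 1111B -> 111B + 1 = 1000B,
# which no longer fits in 3 bits and calls for another right normalization.
print(bin(shift_round_0_1(0b1111, 1)))
```

With the "always set 1" rule, 101000B shifted by 2 becomes 1011B (larger than the exact 1010B), while 101110B becomes 1011B (smaller than the exact 1011.10B), which illustrates both directions of error.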

Type casting

Feasibility of conversions:

char → int → long → double and float → double: no loss in the conversion, because both the range and the precision grow (for a 32-bit int and long)

int → float: possible loss of precision

float → int: possible overflow and loss of precision

The reason is the range and the number of significant bits of each type:

int: represents integers from −2^31^ to 2^31^−1, with 32 significant bits

float: represents normalized values with magnitude from 2^-126^ to 2^127^×(2−2^-23^), of either sign, with 23+1=24 significant bits
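The effect of the 24-bit limit can be seen in a short sketch. Python has no native 32-bit `float` or `int`, so the sketch simulates them: `to_float32` round-trips a value through the standard `struct` module's single-precision format, and `fits_int32` is a hypothetical range check.

```python
import struct

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

def fits_int32(x: float) -> bool:
    """Check whether the integer part of x lies in the 32-bit int range."""
    return INT_MIN <= int(x) <= INT_MAX

# int -> float: 2**24 needs 24 significant bits and survives,
# 2**24 + 1 needs 25 bits and is rounded.
print(to_float32(2**24) == 2**24)
print(to_float32(2**24 + 1) == 2**24 + 1)

# float -> int: the fraction is dropped, and large values overflow.
print(int(2.75))
print(fits_int32(3e9))
```

The second comparison is false because 2^24^+1 is rounded to 2^24^, and `fits_int32(3e9)` is false because 3×10^9^ exceeds 2^31^−1.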