
C Primer Plus In-Depth Series: From Binary to Error, a Line-by-Line Disassembly of the Mysterious 4008175468544 in C Floating-Point Arithmetic

2025-03-27 16:09:47

Preface

Tip: to read this article you should at least understand the binary representation rules of double and float.

The code examples in the book are as follows:

#include <stdio.h>

int main(void)
{
  float a,b;

  b = 2.0e20 + 1.0;
  a = b - 2.0e20;
  printf("%f \n",a);

  return 0;
}

My test environment is as follows; on it, a equals 4008175468544.000000.

Linux version 5.15.0-134-generic (buildd@lcy02-amd64-092) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #145~20.04.1-Ubuntu SMP Mon Feb 17 13:27:16 UTC 2025

Why does such a result occur? The book's explanation is brief, so when I reached this point I naturally wanted to dig deeper, and I wrote up the results as this blog post for your reference.

My approach to this problem is to analyze the effect of each line of code, one line at a time; somewhere along the way the source of the error should reveal itself. The following is that line-by-line analysis.

Line-by-line analysis: the first line of code

First is the first line:

b = 2.0e20 + 1.0;

Starting with the right-hand side, 2.0e20 + 1.0: we first need the binary representations of these two constants, and then perform the floating-point addition.

Binary representations of 2.0e20 and 1.0

First, 2.0e20 is taken as double-type data. According to the storage rules for double, its underlying binary is as follows:

0-10001000010-0101101011110001110101111000101101011000110001000000

where:

  • Sign bit: 0

  • Exponent part: 10001000010

  • Mantissa part: 0101101011110001110101111000101101011000110001000000

A small digression (skip it if you're not interested): given a large number like this, how is its binary representation computed?

  1. First, prime-factorize \(2 \times 10^{20}\):

\(2 \times 10^{20} = 2 \times (2 \times 5)^{20} = 2^{21} \times 5^{20}\)

This shows that \(2 \times 10^{20}\) is the product of \(2^{21}\) and \(5^{20}\); the whole number can be viewed as the binary form of \(5^{20}\) shifted left by 21 bits (i.e. multiplied by \(2^{21}\)), so what we need is the binary of \(5^{20}\).

  2. Find the binary of \(5^{20}\):

\(5^{20}\) in decimal is 95367431640625.

To convert a decimal integer to binary, simply divide 95367431640625 by 2 repeatedly, collecting the remainders (each 0 or 1); reading the remainders in reverse order gives the binary. I will just write out the result:

10101101011110001110101111000101101011000110001 (47 bits in total)

  3. Overall binary representation:

From the above, the binary of \(2 \times 10^{20}\) is:

\(10101101011110001110101111000101101011000110001 \times 2^{21}\). Normalizing it, we get

\(1.0101101011110001110101111000101101011000110001 \times 2^{67}\)

  4. With the overall binary in hand, the concrete double-type storage is straightforward.

First the exponent part: 67 + 1023 = 1090, which in binary is 10001000010.

Then the mantissa part: take the digits after the binary point directly; there are only 46 of them, so 6 zeros are appended to fill the required 52 mantissa bits.

(Back from the digression.) Next, 1.0 is also taken as a double, with the following underlying binary:

0 01111111111 0000000000000000000000000000000000000000000000000000

I won't explain this bit pattern in detail; with the groundwork above, readers can easily derive it.

Here is a pair of utility functions that print the binary bits of float and double values, very convenient for verification:

#include <stdio.h>
#include <stdint.h>

void print_bits_float(float f)
{
    union {
        float a;
        uint32_t b;
    } c;
    c.a = f;

    for (int i = 0; i < 32; i++) {
        printf("%d", (c.b & (1U << (31 - i))) ? 1 : 0);

        /* separators after the sign bit and the 8-bit exponent */
        if (i == 0 || i == 8)
            printf(" ");
    }
    printf("\n");
}


void print_bits_double(double d)
{
    union {
        double a;
        uint64_t b;
    } c;
    c.a = d;

    for (int i = 0; i < 64; i++) {
        /* caution: 1ULL, not 1 -- shifting a plain int by more
         * than 31 positions is undefined behaviour */
        printf("%d", (c.b & (1ULL << (63 - i))) ? 1 : 0);

        /* separators after the sign bit and the 11-bit exponent */
        if (i == 0 || i == 11)
            printf(" ");
    }
    printf("\n");
}

With the binary of 2.0e20 and 1.0 in hand, we can perform the floating-point addition.

2.0e20 + 1.0 floating point number addition operation

Exponent alignment

The first step of the addition is exponent alignment. What does that mean? The operand with the smaller exponent is rewritten to use the larger exponent, to make the computation easy.

for example:

\(0.5_{10}\) + \(0.4375_{10}\), where the subscript denotes the base; both values here are decimal.

First convert them to binary: \(0.5_{10} = 1.0_{2} \times 2^{-1}\) and \(0.4375_{10} = 1.11_{2} \times 2^{-2}\)

Find the operand with the smaller exponent, i.e. \(1.11 \times 2^{-2}\) (because -2 is less than -1), and align its exponent -2 to the larger exponent -1:

\(1.11 \times 2^{-2} = 0.111 \times 2^{-1}\)

Note that because the exponent has grown, the binary point must be shifted left by the corresponding number of places to keep the value unchanged. In this example both expressions denote the same value, \(0.0111_{2}\). If the conversion seems unclear, write out the value without an exponent before and after the alignment and check that the two match.

If this is still hard to follow, it is enough to understand what multiplying a binary fraction by \(2^{N}\) means: when N is greater than 0, it shifts the binary point N places to the right; when N is less than 0, it shifts the binary point N places to the left (just as in decimal).

Going back to our code: we are adding 2.0e20 and 1.0, whose binary we worked out earlier. Written as normalized numbers:

The normalized form of 2.0e20 is (with the full 52-bit mantissa):

\(1.0101101011110001110101111000101101011000110001000000 \times 2^{67}\)

The normalized form of 1.0 is (with the full 52-bit mantissa; below it is written simply as 1.0 for brevity):

\(1.0000000000000000000000000000000000000000000000000000 \times 2^{0}\)

We find the operand with the smaller exponent, i.e. \(1.0 \times 2^{0}\); after aligning it to the larger exponent \(2^{67}\), it becomes:

\(0.000...0001 \times 2^{67}\) (67 zeros in total, 66 of them after the binary point)

This completes the exponent alignment.

Adding the significands

After alignment, the second step is adding the significands. The textbook name is "mantissa operation", but the operation has to account for the leading 1, so I find "significand addition" easier to understand. (Significand: the complete value including the leading 1, not just the mantissa bits after the binary point.)

The significand of 2.0e20 is:

\(1.0101101011110001110101111000101101011000110001000000\)

The significand of 1.0 after alignment is:

0.000...0001 (67 zeros in total, 66 of them after the binary point)

Looking closely, aligning 1.0 shifts its binary point 67 places to the left, and in doing so the 1 effectively vanishes. A double's mantissa holds 52 bits, while the aligned significand of 1.0 has 68 digits after the binary point; only the first 52 bits after the point are kept, and anything beyond is discarded outright. The 1 falls squarely in the discarded range. So after alignment, the mantissa of 1.0 becomes all zeros and the whole value becomes 0.0.

Since 1.0 has become 0.0, the sum is still 2.0e20. If you use print_bits_double above to print 2.0e20 and 2.0e20 + 1.0, you will find the two bit patterns are exactly identical.

Because the result is already a normalized number, the usual final normalization step (shift the binary point and adjust the exponent accordingly) is unnecessary here.

That completes the analysis of the addition on the right of the equals sign; now for the assignment.

Assigning double precision to single precision

As shown above, 2.0e20 + 1.0 evaluates to 2.0e20, but as double-precision data. Since b in the code is declared float, the double value must be converted to single precision.

First, the sign bit and the exponent part.

The double's sign bit is 0, so the float's sign bit is also 0: a positive number. As for the exponent, we established above that the normalized exponent of 2.0e20 is 67; for a single-precision float the bias 127 is added instead, giving 127 + 67 = 194, which is 11000010 in binary.

Second, the mantissa part.

A double's mantissa has 52 bits but a float's only 23, so the float cannot hold the full double mantissa and we must decide how to round. The IEEE 754 standard defines four rounding modes:

  1. Round to Nearest (ties to even): the default rounding mode; round to the closest representable value.

  2. Round toward Zero (truncation): simply drop the bits beyond the target precision.

  3. Round toward +∞ (ceiling): always adjust toward positive infinity. For a positive number, truncate if the discarded bits are all 0, otherwise add 1 to the least significant kept bit; for a negative number, truncate regardless of the discarded bits.

  4. Round toward −∞ (floor): always adjust toward negative infinity. For a negative number, truncate if the discarded bits are all 0, otherwise add 1 to the least significant kept bit of the magnitude; for a positive number, truncate regardless of the discarded bits.

Next, we focus on the Round to Nearest (ties to even) mode, because that is the mode used in our experiment.

Round to Nearest (Ties to Even) mode

The rule: round to the nearest representable value; if the value is exactly halfway between two neighbours, round to the one whose last kept bit is even.

For ease of understanding, take binary data as an example, and suppose 4 significant bits are to be kept:

1.001 (the four kept bits) + 6 trailing bits (to be judged)

The two nearest representable values are 1.001 and 1.010.

Now analyze the trailing six bits, whose values range from 000000 to 111111. We split this range in two:

Analyzing the value range 000000-011111

When the trailing six bits lie in 000000-011111, each such value is closer to 1.001 000000 than to 1.010 000000.

Take the largest value in this range, 1.001 011111, as an example:

  • The distance between this data and 1.001 is 1.001 011111 - 1.001 000000 = 0.000 011111

  • The distance between this data and 1.010 is 1.010 000000 - 1.001 011111 = 0.000 100001

So the nearest representable value is 1.001; and since this was the largest value in the range, the same holds for every other value in the range.

Analyzing the value range 100001-111111

Similarly, when the trailing six bits lie in 100001-111111, each such value is farther from 1.001 000000 than from 1.010 000000.

Take the smallest value in this range, 1.001 100001, as an example:

  • The distance between this data and 1.001 is 1.001 100001 - 1.001 000000 = 0.000 100001

  • The distance between this data and 1.010 is 1.010 000000 - 1.001 100001 = 0.000 011111

So the nearest representable value is 1.010; and since this was the smallest value in the range, the same holds for every other value in the range.

Analyzing the midpoint 1.001 100000

Finally, the only value not yet analyzed: 1.001 100000. Calculation shows:

  • Its distance from 1.001 is 0.000 100000

  • Its distance from 1.010 is also 0.000 100000

Because the distances are equal, this value is rounded to the neighbour whose last kept bit is even, namely 1.010.

In short: when the first discarded bit (reading left to right) is 0, truncate; when the discarded bits are exactly 1000...000, round to the even neighbour; otherwise, carry (round up).

Actual assignment analysis

With the rounding rules understood, we can return to the code from the book and continue the assignment.

We already know that the 1.0 "disappears" in 2.0e20 + 1.0 and the result is still 2.0e20. Its double-precision binary, once more:

0-10001000010-0101101011110001110101111000101101011000110001000000

Converting it to a single-precision float:

  • Sign bit: copy the double's sign bit, i.e. 0

  • Exponent part: as analyzed above, 194 in decimal, 11000010 in binary

  • Mantissa part: keep the first 23 bits; the remaining 29 bits decide the rounding

    • First 23 bits: 01011010111100011101011

    • Last 29 bits: 11000101101011000110001000000

      By the rounding rules, the first discarded bit is 1 and the discarded bits are not exactly 1000...000, so we must carry. The rounded mantissa is:

      01011010111100011101011 + 1 = 01011010111100011101100

  • The resulting single-precision floating-point binary is: 0-11000010-01011010111100011101100

At this point, the first line of code has been analyzed.

Line-by-line analysis: the second line of code

We analyzed the first line of code above. Now for the crucial second line:

 a = b - 2.0e20;

With the groundwork above, we immediately know that 2.0e20 is again taken as a double, and its underlying binary is already clear. But before the subtraction can happen, b must be promoted from a single-precision float to a double. We analyzed double-to-float conversion, where the rounding rule caused a carry; promotion from float to double is much simpler. Skipping the sign and exponent bits, the mantissa is formed by copying the float's 23 mantissa bits into the high end of the double's mantissa field and padding the rest with zeros. The promoted binary is as follows:

0-10001000010-01011010111100011101100 00000000000000000000000000000

After both types are double, you can do subtraction.

To perform the subtraction, first restore the leading 1, then subtract bit by bit:

1.0101101011110001110110000000000000000000000000000000

-(Subtraction)

1.0101101011110001110101111000101101011000110001000000

--------------------------------------------------------------------------

0.0000000000000000000000000111010010100111001111000000

Don't forget the exponent; written as a complete floating-point number:

\(0.0000000000000000000000000111010010100111001111000000 \times 2^{67}\)

This is not yet in normalized form; converting it to a normalized expression:

\(1.11010010100111001111000000 \times 2^{41}\)

Finally, this value must be converted to single precision before it can be stored in the variable a. Because the mantissa has only 20 significant bits in total, it fits in the float's 23 mantissa bits and no rounding is needed. The resulting single-precision fields are:

  • Sign bit: 0

  • Exponent part: 41 + 127 = 168, which in binary is 10101000

  • Mantissa part: 11010010100111001111000

Converting this number back to decimal and printing it gives: 4008175468544.000000. Our analysis journey ends here.

Summary

To wrap up, here are the key steps in the whole process:

  • Both 2.0e20 and 1.0 are taken as double values

  • In 2.0e20 + 1.0, exponent alignment shifts 1.0's binary point to the left and its significant bit vanishes (it lands beyond the 52 mantissa bits)

  • Converting the double result to single precision triggers a rounding carry, so the stored float value is larger than the original double value

  • The float is later promoted back to double, with its low 29 mantissa bits filled with zeros

Therefore, the key sources of error in the whole process are:

  1. The double-precision sum is assigned to a single-precision variable, and the rounding rule causes a carry;

  2. In the subsequent subtraction, the single-precision value is promoted back to double with its low 29 bits zero-filled, making it larger than the true double-precision binary of 2.0e20. The difference between the two is exactly the final result: 4008175468544.000000.

The lesson: use one consistent data type throughout a computation, and avoid implicit conversions between single and double precision.

If the code uses either of the following two variants instead, it correctly prints 0.0:

#include <stdio.h>

int main(void)
{
  double a,b;

  b = 2.0e20 + 1.0;
  a = b - 2.0e20;
  printf("%f \n",a);

  return 0;
}
#include <stdio.h>

int main(void)
{
  float a,b;

  b = 2.0e20f + 1.0f;
  a = b - 2.0e20f;
  printf("%f \n",a);

  return 0;
}