Internal implementation of the floating point algorithm

Scientific computing relies heavily on floating-point operations, and the numbers involved may be 16-bit, 32-bit, 64-bit, 80-bit, or even 128-bit. The open source project SoftFloat provides a software implementation of floating-point arithmetic that can emulate these operations efficiently without hardware support.

So how exactly are comparisons and basic arithmetic on floating-point numbers implemented? Let's take 32-bit floating-point numbers as an example.

We begin with 32-bit floating-point addition. SoftFloat first declares a structure, float32_t:

typedef struct { uint32_t v; } float32_t;

This structure holds the underlying bit representation of a 32-bit floating-point number. SoftFloat also declares a union:

union ui32_f32 { uint32_t ui; float32_t f; };
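
As a rough illustration of what these two declarations enable, the sketch below moves a value from a native float into a float32_t and then reads its bits back through the union. The helpers f32_from_native and f32_bits are hypothetical names made up for this example, not part of SoftFloat, and the code assumes the platform's float is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t v; } float32_t;
union ui32_f32 { uint32_t ui; float32_t f; };

/* Hypothetical helper: copy the bits of a native float into a float32_t. */
float32_t f32_from_native(float x)
{
    float32_t r;
    assert(sizeof x == sizeof r.v);   /* assumption: float is 32 bits wide */
    memcpy(&r.v, &x, sizeof r.v);
    return r;
}

/* Hypothetical helper: read a float32_t as a raw 32-bit word via the union. */
uint32_t f32_bits(float32_t a)
{
    union ui32_f32 u;
    u.f = a;        /* store through the float32_t member */
    return u.ui;    /* read the same bytes back as an integer */
}
```

Under those assumptions, f32_bits(f32_from_native(1.0f)) yields 0x3F800000: sign bit 0, biased exponent 127, and a zero fraction.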

The union serves two purposes: it preserves the bitwise representation of the floating-point number, and it lets that representation be read as a 32-bit unsigned integer for direct comparison, which the algorithms below rely on. Let's look at addition first.

float32_t f32_add( float32_t a, float32_t b )
{
    union ui32_f32 uA;
    uint_fast32_t uiA;
    union ui32_f32 uB;
    uint_fast32_t uiB;
#if ! defined INLINE_LEVEL || (INLINE_LEVEL < 1)
    float32_t (*magsFuncPtr)( uint_fast32_t, uint_fast32_t );
#endif

    uA.f = a;
    uiA = uA.ui;
    uB.f = b;
    uiB = uB.ui;
#if defined INLINE_LEVEL && (1 <= INLINE_LEVEL)
    if ( signF32UI( uiA ^ uiB ) ) {
        return softfloat_subMagsF32( uiA, uiB );
    } else {
        return softfloat_addMagsF32( uiA, uiB );
    }
#else
    magsFuncPtr =
        signF32UI( uiA ^ uiB ) ? softfloat_subMagsF32 : softfloat_addMagsF32;
    return (*magsFuncPtr)( uiA, uiB );
#endif

}

Here uiA and uiB hold the operands as unsigned integers, and signF32UI extracts the sign bit. Because XOR sets a bit only where its inputs differ, signF32UI(uiA ^ uiB) tells us whether the two sign bits differ: if the signs are the same, the magnitudes are added; if they differ, the magnitudes are subtracted. Since no hardware floating-point type is available, everything has to be simulated with integer types.

Storing a float and an integer in the same union is a technique known as type punning. Here, though, the union only holds the bit representation, not a real floating-point value.
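
The sign test can be tried on ordinary floats. This is only a sketch: bits_of, sign_bit, signs_differ, and add_path are hypothetical names for this example, sign_bit merely plays the role the text gives to signF32UI, and the code assumes float is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bits of a native float. */
uint32_t bits_of(float x)
{
    uint32_t ui;
    assert(sizeof x == sizeof ui);   /* assumption: float is 32 bits wide */
    memcpy(&ui, &x, sizeof ui);
    return ui;
}

/* Top bit of a 32-bit pattern, i.e. the sign of the encoded value. */
bool sign_bit(uint32_t ui)
{
    return (ui >> 31) != 0;
}

/* XOR leaves bit 31 set only when exactly one operand is negative. */
bool signs_differ(float a, float b)
{
    return sign_bit(bits_of(a) ^ bits_of(b));
}

/* Which magnitude operation an addition needs for these operands. */
const char *add_path(float a, float b)
{
    return signs_differ(a, b) ? "subtract magnitudes" : "add magnitudes";
}
```

For example, add_path(3.0f, -1.0f) selects "subtract magnitudes", because 3 + (-1) is really a difference of magnitudes, while add_path(-3.0f, -1.0f) selects "add magnitudes".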

float32_t f32_sub( float32_t a, float32_t b )
{
    union ui32_f32 uA;
    uint_fast32_t uiA;
    union ui32_f32 uB;
    uint_fast32_t uiB;
#if ! defined INLINE_LEVEL || (INLINE_LEVEL < 1)
    float32_t (*magsFuncPtr)( uint_fast32_t, uint_fast32_t );
#endif

    uA.f = a;
    uiA = uA.ui;
    uB.f = b;
    uiB = uB.ui;
#if defined INLINE_LEVEL && (1 <= INLINE_LEVEL)
    if ( signF32UI( uiA ^ uiB ) ) {
        return softfloat_addMagsF32( uiA, uiB );
    } else {
        return softfloat_subMagsF32( uiA, uiB );
    }
#else
    magsFuncPtr =
        signF32UI( uiA ^ uiB ) ? softfloat_addMagsF32 : softfloat_subMagsF32;
    return (*magsFuncPtr)( uiA, uiB );
#endif

}
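
A rough way to see why the add and subtract calls trade places in f32_sub: a - b equals a + (-b), and negating b only flips its sign bit. The sketch below checks that identity on native floats. flip_sign and sub_via_add are hypothetical names for this example, and the code assumes float is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: negate a float by flipping bit 31 of its encoding. */
float flip_sign(float x)
{
    uint32_t bits;
    assert(sizeof x == sizeof bits);  /* assumption: float is 32 bits wide */
    memcpy(&bits, &x, sizeof bits);
    bits ^= UINT32_C(0x80000000);     /* toggle only the sign bit */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* a - b computed as a + (-b). */
float sub_via_add(float a, float b)
{
    return a + flip_sign(b);
}
```

With exactly representable inputs such as 5.5f and 2.25f, sub_via_add gives the same value as the native expression 5.5f - 2.25f.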

Subtraction simply swaps the two branches of the sign test; otherwise it is identical to addition. Now let's see how the comparison operation works.

bool f32_le( float32_t a, float32_t b )
{
    union ui32_f32 uA;
    uint_fast32_t uiA;
    union ui32_f32 uB;
    uint_fast32_t uiB;
    bool signA, signB;

    uA.f = a;
    uiA = uA.ui;
    uB.f = b;
    uiB = uB.ui;
    if ( isNaNF32UI( uiA ) || isNaNF32UI( uiB ) ) {
        softfloat_raiseFlags( softfloat_flag_invalid );
        return false;
    }
    signA = signF32UI( uiA );
    signB = signF32UI( uiB );
    return
        (signA != signB) ? signA || ! (uint32_t) ((uiA | uiB)<<1)
            : (uiA == uiB) || (signA ^ (uiA < uiB));

}

The final expression is a bit convoluted, so let's break it down step by step. If the signs differ (one operand is positive and the other negative) and the sign of A is 1, then A is negative and must be less than or equal to B. Otherwise, evaluation moves to the branch after ||: the expression (uiA | uiB)<<1 shifts out the highest bit (the sign bit) of both operands and checks whether anything remains, which covers the case of +0 and -0. Don't overlook the leading !, because the test is whether both values are zero. If A and B have the same sign, they satisfy the comparison when their bit patterns are equal; otherwise, if both are positive the integer comparison uiA < uiB gives the answer directly, and if both are negative signA inverts that result.
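
The same case analysis can be written out as explicit branches. This is a sketch for illustration rather than SoftFloat's code: bits_of, is_nan_bits, and le_bits are hypothetical names, the function returns false for NaN without raising any exception flag, and the code assumes float is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bits of a native float. */
uint32_t bits_of(float x)
{
    uint32_t ui;
    assert(sizeof x == sizeof ui);   /* assumption: float is 32 bits wide */
    memcpy(&ui, &x, sizeof ui);
    return ui;
}

/* NaN: exponent field all ones and a nonzero fraction. */
bool is_nan_bits(uint32_t ui)
{
    return (ui & UINT32_C(0x7F800000)) == UINT32_C(0x7F800000)
        && (ui & UINT32_C(0x007FFFFF)) != 0;
}

/* a <= b decided purely from the two bit patterns. */
bool le_bits(uint32_t uiA, uint32_t uiB)
{
    bool signA = (uiA >> 31) != 0;
    bool signB = (uiB >> 31) != 0;

    if (is_nan_bits(uiA) || is_nan_bits(uiB)) return false;  /* unordered */
    if (signA != signB) {
        if (signA) return true;                    /* negative <= positive */
        return (uint32_t)((uiA | uiB) << 1) == 0;  /* holds only for +0 <= -0 */
    }
    if (uiA == uiB) return true;                   /* identical encodings */
    /* Same sign: magnitudes order like integers, reversed for negatives. */
    return signA ? (uiA > uiB) : (uiA < uiB);
}
```

For instance, le_bits(bits_of(-2.0f), bits_of(-1.0f)) is true even though the encoding of -2.0f is the larger integer, and le_bits(0x00000000, 0x80000000) is true because +0 and -0 compare as equal.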

Concluding remarks

I am currently in the middle of the campus recruitment season and preparing for it. I will share my insights and experiences when I have time, and I hope to land an offer as early as possible.