## Stupid Float Tricks

## Type Punning is Not a Joke

I left the last post with a promise to share an interesting property of the IEEE float format. There are several equivalent ways of stating this property, and here are two of them.

For floats of the same sign:

- Adjacent floats have adjacent integer representations
- Incrementing the integer representation of a float moves to the next representable float, moving away from zero

Depending on your math and mental state these claims will seem somewhere between fantastic/improbable and obvious/inevitable. I think it’s worth pointing out that these properties are certainly not inevitable. Many floating-point formats before the IEEE standard did not have these properties. These tricks only work because of the implied one in the mantissa (which avoids duplicate encodings for the same value), the use of an exponent bias, and the placement of the different fields of the float. The float format was carefully designed in order to guarantee this interesting characteristic.

I could go on at length to explain why incrementing the integer representation of a float moves to the next representable float (incrementing the mantissa increases the value of the float, and when the mantissa wraps to zero that increments the exponent, QED) but instead I recommend you either trust me or else play around with Float_t in your debugger until you see how it works.

One thing to be aware of is the understated warning that this only applies for floats of the same sign. The representation of positive zero is adjacent to the representation for 1.40129846e-45, but the representation for negative zero is about two billion away, because its sign bit is set which means that its integer representation is the most negative 32-bit integer. That means that while positive and negative zero compare equal as floats, their integer representations have radically different values. This also means that tiny positive and negative numbers have integer representations which are about two billion apart. Beware!

Another thing to be aware of is that while incrementing the integer representation of a float normally increases the value by a modest and fairly predictable ratio (typically the larger number is at most about 1.0000012 times larger) this does not apply for very small numbers (between zero and FLT_MIN) or when going from FLT_MAX to infinity. When going from zero to the smallest positive float or from FLT_MAX to infinity the ratio is actually infinite, and when dealing with numbers between zero and FLT_MIN the ratio can be as large as 2.0. However in-between FLT_MIN and FLT_MAX the ratio is relatively predictable and consistent.

Here’s a concrete example of using this property. This code prints all 255*2^23+1 positive floats, from +0.0 to +infinity:

union Float_t { int32_t i; float f; struct { uint32_t mantissa : 23; uint32_t exponent : 8; uint32_t sign : 1; } parts; }; void IterateAllPositiveFloats() { // Start at zero and print that float. Float_t allFloats; allFloats.f = 0.0f; printf("%1.8e\n", allFloats.f); // Continue through all of the floats, stopping // when we get to positive infinity. while (allFloats.parts.exponent < 255) { // Increment the integer representation // to move to the next float. allFloats.i += 1; printf("%1.8e\n", allFloats.f); } } |

The (partial) output looks like this:

0.00000000e+000

1.40129846e-045

2.80259693e-045

4.20389539e-045

5.60519386e-045

7.00649232e-045

8.40779079e-045

9.80908925e-045

…

3.40282306e+038

3.40282326e+038

3.40282347e+038

1.#INF0000e+000

For double precision floats you could use _nextafter() to walk through all of the available doubles, but I’m not aware of a simple and portable alternative to this technique for 32-bit floats.

We can use this property and the Float_t union to find out how much precision a float variable has at a particular range. We can assign a float to Float_t::f, then increment or decrement the integer representation, and then compare the before/after float values to see how much they have changed. Here is some sample code that does this:

float TestFloatPrecisionAwayFromZero(float input) { union Float_t num; num.f = input; // Incrementing infinity or a NaN would be bad! assert(num.parts.exponent < 255); // Increment the integer representation of our value num.i += 1; // Subtract the initial value to find our precision float delta = num.f - input; return delta; } float TestFloatPrecisionTowardsZero(float input) { union Float_t num; num.f = input; // Decrementing from zero would be bad! assert(num.parts.exponent || num.parts.mantissa); // Decrementing a NaN would be bad! assert(num.parts.exponent != 255 || num.parts.mantissa == 0); // Decrement the integer representation of our value num.i -= 1; // Subtract the initial value to find our precision float delta = num.f - input; return -delta; } struct TwoFloats { float awayDelta; float towardsDelta; }; struct TwoFloats TestFloatPrecision(float input) { struct TwoFloats result = { TestFloatPrecisionAwayFromZero(input), TestFloatPrecisionTowardsZero(input), }; return result; } |

Note that the difference between the values of two adjacent floats can always be stored exactly in a (possibly subnormal) float. I have a truly marvelous proof of this theorem which the margin is too small to contain.

These functions can be called from test code to learn about the float format. Better yet, when sitting at a breakpoint in Visual Studio you can call them from the watch window. That allows dynamic exploration of precision:

Usually the delta is the same whether you increment the integer representation or decrement it. However if incrementing or decrementing changes the exponent then the two deltas will be different. This can be seen in the example above where the precision at 65536 is twice as good (half the delta) going towards zero compared to going away from zero.

## Caveat Incrementor

Pay close attention to the number of caveats that you have to watch for when you start partying on the integer representation of a float. It’s safe in a controlled environment, but things can quickly go bad in the real world:

- Some compilers may prohibit the type-punning/aliasing used by Float_t (gcc and VC++ allow it)
- Incrementing the integer representation of infinity gives you a NaN
- Decrementing the integer representation of zero gives you a NaN
- Incrementing or decrementing some NaNs will give you zero or infinity
- The ratio of the value of two adjacent floats is usually no more than about 1.0000012, but is sometimes much much larger
- The representations of positive and negative zero are far removed from each other
- The representations of FLT_MIN and -FLT_MIN are as far from each other as the representations of FLT_MAX and -FLT_MAX
- Floating-point math is
*always*more complicated than you expect

## Log Rolling

A related property is that, for floats that are positive, finite, and non-zero, the integer representation of a float is a piecewise linear approximation of its base 2 logarithm I just like saying that. It sounds cool.

The reason that the integer representation is (after appropriate scaling and biasing) a piecewise linear representation of the base 2 logarithm of a (positive) float is because the exponent is logically the base-2 logarithm of a float, and it is in the high bits. The mantissa linearly interpolates between power-of-2 floats. The code below demonstrates the concept:

void PlotFloatsVersusRep() { // Let's plot some floats and their representations from 1.0 to 32.0 for (float f = 1.0f; f <= 32.0f; f *= 1.01f) { Float_t num; num.f = f; // The representation of 1.0f is 0x3f800000 and the representation // of 2.0f is 0x40000000, so if the representation of a float is // an approximation of its base 2 log then 0x3f800000 must be // log2(1.0) == 0 and 0x40000000 must be log2(2.0) == 1. // Therefore we should scale the integer representation by // subtracting 0x3f800000 and dividing by // (0x40000000 - 0x3f800000) double log2Estimate = (num.i - 0x3f800000) / double(0x40000000 - 0x3f800000); //printf("%1.5f,%1.5f\n", f, log2Estimate); double log2 = log(f) / log(2.0); printf("%1.5f,%1.5f,%1.5f,%1.5f\n", f, log2Estimate, log2, log2Estimate / log2); } } |

If we drop the results into Excel and plot them with the x-axis on a base-2 log scale then we get this lovely chart:

If it’s plotted linearly then the ‘linear’ part of ‘piecewise linear’ becomes obvious, but I like the slightly scalloped straight line better. The estimate is exact when the float is a power of two, and at its worst is about 0.086 too small.

In the last of this week’s stupid float tricks I present you with this dodgy code:

int RoundedUpLogTwo(uint64_t input) { assert(input > 0); union Float_t num; num.f = (float)input; // Increment the representation enough that for any non power- // of-two (FLT_MIN or greater) we will move to the next exponent. num.i += 0x7FFFFF; // Return the exponent after subtracting the bias. return num.parts.exponent - 127; } |

Depending on how you think about such things this is either the simplest and most elegant, or the most complicated and obtuse way of finding out how many bits it takes to represent a particular integer.

## Random aside

The IEEE standard guarantees correct rounding for addition, multiplication, subtraction, division, and square-root. If you’ve ever wondered if this is important, then try this: using the Windows calculator (I’m using the Windows 7 version), calculate sqrt(4) - 2. The answer should, of course, be zero. However the answer that the calculator actually returns is:

Scientific mode: -8.1648465955514287168521180122928e-39

Standard mode: -1.068281969439142e-19

This is utterly fascinating. It shows that the calculator is using impressive precision (it did the calculations to *40* digits of accuracy, 20 digits in standard mode) and yet it got the wrong answer. Because the Windows calculator is using so much precision it will, for many calculations, get a more accurate answer than an IEEE float or double. However, because it fails to correctly round the answer (the last digit cannot be trusted) it sometimes gives answers that are laughably wrong.

I can just imagine the two calculator modes arguing: the standard mode says “according to my calculations sqrt(4) is 1.9999999999999999999” and the scientific mode says “according to *my* calculations it’s 1.999999999999999999999999999999999999992 – that is far more accurate”. Meanwhile an IEEE float says “I’ve only got about eight digits of precision but I think sqrt(4) is 2.”

Having lots of digits of precision is nice, but without correct rounding it can just end up making you look foolish.

## That’s all folks

In the next post I’ll discuss some concrete examples that show the value of knowing how much precision is available around certain values. Until next time, float on.