Skip to the content of the web site.

Floating-point Error Propagation

Single-precision floating-point numbers (floats) store only 24-bits of precision, and consequently after a sequence of operations, the accuracy of said sequence diminishes. An important question is: how quickly does the accuracy diminish, and is the result of a sequence of operations still at least a reasonable approximation.

This class allows the user to create a special class of floats which track:

When a Float( float ) is generated, it stores the float, the float cast as a double, and as bounds, it is assumed that the floating point number represents the largest possible range of doubles which could be rounded to that value.

For example, when 1.0023253 is stored as a float, its binary representation is 0    01111111-00000000100110000110010, which, when cast as a double is 0 01111111111-0000000010011000011001000000000000000000000000000000, however, the range of possible values which could be represented by the initial float are 0 01111111111-0000000010011000011000110000000000000000000000000000 to 0 01111111111-0000000010011000011001010000000000000000000000000000. These four values are stored.

With each operation (+, -, *, and /) performed on a Float, these operations are also performed on the float and double values whereas the lower and upper bounds are calculated with the rounding-modes set not to nearest (the default), but towards +∞ or -∞, as appropriate. This is done by the appropriate function calls in the fenv.h.

Accessors

When cast as a float, the object returns the float value.

When cast as a double, the object returns the double value.

The member functions lower() and upper() return the lower and upper bounds.

Printing

The operator << is overloaded to print the float ~ double (lower, upper).

Example

Consider the following code which shows that calculating (a + b)*c is not necessarily better than calculating a*c + b*c. This uses my other function num2bin() which converts a numeric value to a string representation of the binary format.

#include <iostream>
#include "Float.h"
#include "num2bin.h"

using namespace ca_uwaterloo_alumni_dwharder;

using namespace std;

int main() {
	Float a = 1.0023253;
	Float b = 1.9998235;
	Float c = 1.9998235;

	cout.precision( 12 );

	Float x = (a + b)*c;
	Float y = a*c + b*c;

	cout << "(a + b)*c =" << endl;
	cout << (a + b)*c << endl;
	cout << "a*c + b*c =" << endl;
	cout << a*c + b*c << endl << endl;

	cout << "a =" << endl;
	cout << static_cast<float>( a ) << '\t'
		<< num2bin( static_cast<float>( a ) ) << endl;
	cout << static_cast<double>( a ) << '\t'
		<< num2bin( static_cast<double>( a ) ) << endl;
	cout << a.lower() << '\t' << num2bin( a.lower() ) << endl;
	cout << a.upper() << '\t' << num2bin( a.upper() ) << endl << endl;

	cout << "b =" << endl;
	cout << static_cast<float>( b ) << '\t'
		<< num2bin( static_cast<float>( b ) ) << endl;
	cout << static_cast<double>( b ) << '\t'
		<< num2bin( static_cast<double>( b ) ) << endl;
	cout << b.lower() << '\t' << num2bin( b.lower() ) << endl;
	cout << b.upper() << '\t' << num2bin( b.upper() ) << endl << endl;

	cout << "c =" << endl;
	cout << static_cast<float>( c ) << '\t'
		<< num2bin( static_cast<float>( c ) ) << endl;
	cout << static_cast<double>( c ) << '\t'
		<< num2bin( static_cast<double>( c ) ) << endl;
	cout << c.lower() << '\t' << num2bin( c.lower() ) << endl;
	cout << c.upper() << '\t' << num2bin( c.upper() ) << endl << endl;

	cout << "(a + b)*c =" << endl;
	cout << static_cast<float>( x ) << '\t'
		<< num2bin( static_cast<float>( x ) ) << endl;
	cout << static_cast<double>( x ) << '\t'
		<< num2bin( static_cast<double>( x ) ) << endl;
	cout << x.lower() << '\t' << num2bin( x.lower() ) << endl;
	cout << x.upper() << '\t' << num2bin( x.upper() ) << endl << endl;

	cout << "a*c + b*c =" << endl;
	cout << static_cast<float>( y ) << '\t'
		<< num2bin( static_cast<float>( y ) ) << endl;
	cout << static_cast<double>( y ) << '\t'
		<< num2bin( static_cast<double>( y ) ) << endl;
	cout << y.lower() << '\t' << num2bin( y.lower() ) << endl;
	cout << y.upper() << '\t' << num2bin( y.upper() ) << endl << endl;

	return 0;
}

The output is:

(a + b)*c =
6.00376701355 ~ 6.00376746866 (6.00376705132, 6.00376788600)
a*c + b*c =
6.00376749039 ~ 6.00376746866 (6.00376705132, 6.00376788600)

a =
1.00232529640	0    01111111-00000000100110000110010
1.00232529640	0 01111111111-0000000010011000011001000000000000000000000000000000
1.00232523680	0 01111111111-0000000010011000011000110000000000000000000000000000
1.00232535601	0 01111111111-0000000010011000011001010000000000000000000000000000

b =
1.9998234510	0    01111111-11111111111101000110111
1.9998234510	0 01111111111-1111111111110100011011100000000000000000000000000000
1.99982339144	0 01111111111-1111111111110100011011010000000000000000000000000000
1.99982351065	0 01111111111-1111111111110100011011110000000000000000000000000000

c =
1.9998234510	0    01111111-11111111111101000110111
1.9998234510	0 01111111111-1111111111110100011011100000000000000000000000000000
1.99982339144	0 01111111111-1111111111110100011011010000000000000000000000000000
1.99982351065	0 01111111111-1111111111110100011011110000000000000000000000000000

(a + b)*c =
6.00376701355	0    10000001-10000000001111011011100
6.00376746866	0 10000000001-1000000000111101101110011110100010101010100011110000
6.00376705132	0 10000000001-1000000000111101101110000010100010001101001001000000
6.00376788600	0 10000000001-1000000000111101101110111010100011000111111110110000

a*c + b*c =
6.00376749039	0    10000001-10000000001111011011101
6.00376746866	0 10000000001-1000000000111101101110011110100010101010100011110000
6.00376705132	0 10000000001-1000000000111101101110000010100010001101001001000000
6.00376788600	0 10000000001-1000000000111101101110111010100011000111111110110000

You will note that the float resulting from the calculation (a + b)*c is not only not the appropriately rounded value of the corresponding double, but it does not even fall in the calculated interval. In the second case, the result from calculating a*c + b*c, the float is the appropriately rounded value of the double.

Specifically, in the first case, the calculated float (blue) falls outside the interval: 6.00376701⋅⋅⋅ < 6.00376705⋅⋅⋅ < 6.00376746⋅⋅⋅ < 6.00376788⋅⋅⋅ but in the second, the calculated float (blue) falls within the interval: 6.00376705⋅⋅⋅ < 6.00376746⋅⋅⋅ < 6.00376749⋅⋅⋅ < 6.00376788⋅⋅⋅.