Error Analysis

Integers are relatively easy to work with: all arithmetic operations are well defined, operations obey familiar properties such as associativity, (a + b) + c = a + (b + c), and the number of decimal digits required to store a positive integer n is ⌊log₁₀(n)⌋ + 1; for example, 56323 requires five digits. Rational numbers, however, do not share this last property: 509825308235/509825308236 is approximately 1, but requires 24 digits to store.
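The digit counts quoted above can be checked with a short script. This is only an illustrative sketch in Python (the helper name decimal_digits is hypothetical and not part of these notes); it counts the decimal digits of a positive integer and applies the same count to the numerator and denominator of the rational number.

```python
import math
from fractions import Fraction


def decimal_digits(n: int) -> int:
    """Return the number of decimal digits of a positive integer n."""
    return len(str(n))


n = 56323
print(decimal_digits(n))              # counted directly from the digits
print(math.floor(math.log10(n)) + 1)  # the same count from the logarithm

# Consecutive integers share no common factor, so this fraction is
# already in lowest terms and both parts must be stored in full.
q = Fraction(509825308235, 509825308236)
print(decimal_digits(q.numerator) + decimal_digits(q.denominator))
print(float(q))                       # very close to 1
```

Both counts for 56323 agree at five digits, while the rational number needs 12 + 12 = 24 digits even though its value is within about 2×10⁻¹² of 1.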

We will look at an efficient method of storing real numbers (floating-point numbers), an efficient way of storing numbers on a computer (binary), and the double-precision floating-point number representation (double). We will also look at weaknesses that result from using this format, as demonstrated by this example:

>> format long
>> sum = 0    
sum = 0
>> for i = 1:100000       % a for loop from 1 to 100000
       sum = sum + 0.1;
end
>> sum
sum = 10000.0000000188

Here, we add 0.1 one hundred thousand times, so the sum should be exactly 10000. The computed answer, however, is not as accurate as we may wish: the absolute error is approximately 1.88×10⁻⁸, which gives a relative error of 1.88×10⁻¹². While this may seem insignificant, if a calculation as simple as this produces an incorrect answer, consider what may happen with a longer sequence of more complex calculations.
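The error figures can be reproduced outside Matlab. The following is a minimal sketch in Python, whose float type is a double-precision number on common platforms; the function name repeated_sum is hypothetical and chosen only for this illustration. It repeats the summation and computes the absolute error |approx − exact| and the relative error |approx − exact|/|exact|.

```python
def repeated_sum(term: float, count: int) -> float:
    """Add term to a running total count times, one addition at a time."""
    total = 0.0
    for _ in range(count):
        total += term
    return total


approx = repeated_sum(0.1, 100000)
exact = 10000.0

abs_error = abs(approx - exact)     # about 1.88e-8
rel_error = abs_error / abs(exact)  # about 1.88e-12

print(f"{approx:.10f}")             # 10000.0000000188
print(f"{abs_error:.2e}")
print(f"{rel_error:.2e}")
```

Dividing the absolute error by the magnitude of the exact answer is what reduces the error of roughly 1.88×10⁻⁸ to a relative error of roughly 1.88×10⁻¹².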

Readers who program in C++ may note that exactly the same sum is produced with the built-in double data type in this program: sum.cpp. The program prints the text "The sum is 10000.0000000188".