Integers are relatively easy to work with: all
arithmetic operations are well defined, operations obey
familiar properties such as associativity, (a + b) + c = a + (b + c), and the number
of decimal digits required to store a positive integer n is
⌊log10(n)⌋ + 1; for example,
56323 requires five digits. Rational numbers, however,
do not share this last property: 509825308235/509825308236 is approximately 1, but requires 24 digits.
We will look at an efficient method of storing real numbers
(floating-point numbers), an efficient way of storing
numbers on a computer (binary), and the double-precision
floating-point number representation (double). We
will also look at weaknesses that result from using this format,
as demonstrated by this example:
>> format long
>> sum = 0
sum = 0
>> for i = 1:100000 % a for loop from 1 to 100000
sum = sum + 0.1;
end
>> sum
sum = 10000.0000000188
Here, we add 0.1 one hundred thousand times, so the sum should be exactly
10000. The computed answer, however, is not as accurate as we may wish: the absolute error is about 1.88×10^-8, so the relative error is about 1.88×10^-12.
While this may seem insignificant, if a calculation as simple as this results in
an incorrect answer, consider what may happen with a longer sequence of more complex calculations.
Readers who program in C++ may note that the exact same sum occurs with the double built-in data
type in this program: sum.cpp. The output of this program is the text
The sum is 10000.0000000188.