# Introduction

In this topic, we consider some of the problems which occur
as a result of using a floating-point representation. Specifically,
we will look at the quadratic formula as an example.


# Theory

Because we are attempting to represent an infinite number of different
real numbers with a finite number of floating-point numbers, we must run
into problems. Some of the more common problems, some of which we will
see in this course, are listed below.

# Overflow and Underflow

The largest floating-point number which can be represented using
our format is 0999999, or 9.999 × 10^{50}. Beyond this,
we can no longer approximate a number with the given relative error.
The solution is to represent any such number by a special
floating-point number designated as positive infinity (+∞).
In our case, we could, for example, represent this as 0000001, since
we require that the first digit of the mantissa of a normal float
be nonzero. Similarly, 1000001 could represent -∞.

Similarly, the smallest positive number is 0001000 or 10^{-49}.
To represent any number smaller than this, we use +0.
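
We can observe the same behaviour with Matlab's double-precision floats,
where the largest finite value is about 1.8 × 10^{308} and the smallest
positive (subnormal) value is about 4.9 × 10^{-324}; a quick check:

1e400     % overflows to Inf
-1e400    % overflows to -Inf
1e-400    % underflows to 0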

# Subtractive Cancellation

Consider the following difference: 3.593 - 3.574 = 0.019. Using
our system, these are represented by 0493593, 0493574, and 0471900.
All three appear to have the same precision: four decimal digits
in the mantissa. However, because 3.593 represents any number in the
interval (3.5925, 3.5935) and 3.574 represents any number in the interval
(3.5735, 3.5745), the actual difference may be any number in the
interval (0.018, 0.020). Therefore, while we have an apparent
precision of four digits, we actually have zero significant digits.

This phenomenon, which occurs when we subtract nearly equal numbers,
is termed *subtractive cancellation*. The simplest example
is the approximation of the derivative: (f(*x* + *h*) - f(*x*))/*h*. If
we use f(*x*) = *x*^{2} with *x* = 3.253 and *h* = 0.002, we get
the approximation (3.255^{2} - 3.253^{2})/0.002 = (10.60 - 10.58)/0.002 = 10.,
which is a poor approximation of the actual derivative 6.506.
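
We can mimic this four-digit example in Matlab with round(x, 4, 'significant')
(available since R2014b); the following is only a sketch of the computation above:

f = @(x) round(x.^2, 4, 'significant');   % evaluate f(x) = x^2, then round to 4 digits
x = 3.253;
h = 0.002;
(f(x + h) - f(x))/h                       % approximately 10; the true derivative is 6.506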

Many of the topics in this course will be affected by subtractive
cancellation.

# Adding Large and Small Numbers

Consider 1038. + 0.02353 = 1038.02353 ≈ 1038. Thus,
even though the second number is not zero, when it is added to
1038., there is no change to the first. This leads to the
further problem that floating-point addition is not associative, i.e.,
(a + b) + c ≠ a + (b + c).
We will see this in one instance, in Gaussian elimination, in this
course.
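
A quick double-precision illustration of this non-associativity:

a = 1e20; b = -1e20; c = 1;
(a + b) + c    % 1: a and b cancel exactly, then c survives
a + (b + c)    % 0: c is absorbed when added to b first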

# Order of Additions

If we consider the sum 1000 + 0.5 + 0.5 and calculate it from
left to right, we get (1000 + 0.5) + 0.5 = 1000 + 0.5 = 1000
because 1000.5 rounds to 1000 with four decimal digits of precision
(ties round to even). However,
if we calculate the sum from right to left, we get
1000. + (0.5 + 0.5) = 1000. + 1 = 1001.
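
The same effect appears in binary double precision, where 2^{-53} plays
the role of 0.5 above:

(1 + 2^-53) + 2^-53    % 1: each small term is rounded away in turn
1 + (2^-53 + 2^-53)    % 1 + 2^-52: the small terms survive when added to each other first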

This is presented as yet another example of non-associativity, similar to
the previous example. This will not be seen in any of the algorithms covered
in this course.

# HOWTO

# Subtractive Cancellation

Any time you subtract two numbers which are almost equal, subtractive
cancellation will occur, and the difference will not have as much
precision as either the minuend or the subtrahend. In designing
numerical algorithms, we must avoid such situations.
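
As a concrete illustration, consider the quadratic formula when b^{2} ≫ 4ac:
one of the two roots requires subtracting two nearly equal numbers. A minimal
Matlab sketch of the standard workaround (the values chosen here are illustrative):

a = 1; b = 1e8; c = 1;
(-b + sqrt(b^2 - 4*a*c))/(2*a)             % suffers severe subtractive cancellation
q = -(b + sign(b)*sqrt(b^2 - 4*a*c))/2;    % b and sqrt(...) now have the same sign
c/q                                        % accurate small root, approximately -1e-8
q/a                                        % the other root, approximately -1e8

Example 5 below shows the same loss of precision when the roots are computed
with the unmodified formula.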

# Overflow and Underflow

Overflow occurs when the exponent is too large to be represented
with the given form. According to IEEE 754, the appropriate representation
of such a number is ±∞. Underflow occurs when the exponent is
too small (too negative) to be represented by the given form and should
be represented by ±0.

# Adding Large and Small Numbers

If a large float is added to a small float, the small float may
be too small relative to the larger to change its value. Algorithms
which may cause such behaviour must be avoided.
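
A one-line check in double precision:

(1 + 1e-20) == 1    % true: 1e-20 is lost entirely when added to 1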

# Order of Additions

In performing a sequence of additions, the numbers should be added in
order from the smallest in magnitude to the largest in magnitude. This
ensures that the cumulative sum of many small terms still contributes
to the result.
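
In double precision the effect of ordering is often small, but in single
precision it is easy to see; a sketch using the sum from Question 3 below:

s1 = single(0);
for n = 1:100000                  % largest terms first
    s1 = s1 + single(1)/n^2;
end
s2 = single(0);
for n = 100000:-1:1               % smallest terms first
    s2 = s2 + single(1)/n^2;
end
[s1, s2]                          % s2 is closer to the true value, about 1.64492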

# Examples

1. Consider the factored polynomial (*x* − 1)^{8}.
Expanded, this produces the polynomial:

*x*^{8} − 8*x*^{7} + 28*x*^{6} − 56*x*^{5} + 70*x*^{4} − 56*x*^{3} + 28*x*^{2} − 8*x* + 1

First, we can evaluate the original factored polynomial:

d = 0.000007237;
x = [1 - d, 1 + d]
0.999992763000000 1.000007237000000
y0 = (x - 1).^8

Usually, polynomials are not available in factored form
and consequently they must be evaluated in expanded form. This may
be done by summing the terms, either in order of decreasing or
increasing degree:

y1 = [0,0];
p = [1 -8 28 -56 70 -56 28 -8 1];
for i = 1:length(p)
    y1 = y1 + p(i)*x.^(9 - i);
end
y1
2.66453525910038e-015 -1.24344978758018e-014

y2 = [0,0];
p = [1 -8 28 -56 70 -56 28 -8 1];
for i = 1:length(p)
    y2 = y2 + p(i)*x.^(i - 1);
end
y2
2.77555756156289e-015 -1.24344978758018e-014

Relative to evaluating the factored form, the error of these evaluations
is increased by approximately 26 orders of magnitude.

Another means of evaluating expanded polynomials is
Horner's rule, which rewrites the polynomial in
the following form and evaluates it from the inside out:

(((((((*x* − 8)*x* + 28)*x* − 56)*x* + 70)*x* − 56)*x* + 28)*x* − 8)*x* + 1

This may be evaluated as follows:

y3 = [0 0];
p = [1 -8 28 -56 70 -56 28 -8 1];
for i = 1:length(p)
    y3 = y3.*x + p(i);
end
y3
-1.55431223447522e-015 3.44169137633799e-015

In this case, the error is less than that of evaluating the polynomial
in either decreasing or increasing degree.
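
As an aside, Matlab's built-in polyval also evaluates polynomials with
Horner's scheme, so it should produce values comparable to y3; a quick
check with the same p and x as above:

y4 = polyval(p, x)    % expected to be comparable to y3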

Figure 1 plots the evaluation of the factored polynomial (black) and
the expanded polynomial using standard evaluation in decreasing degree (magenta) and increasing degree (blue) and
Horner's rule (red) at 2^{17} + 1 equally-spaced points on
[1 − 2^{-10}, 1 + 2^{-10}].

Figure 1. The polynomial (*x* − 1)^{8} plotted (in black)
together with the standard evaluation of the expanded form in increasing
degree (blue) and decreasing degree (magenta), and
Horner's rule (red).

What is immediately noticeable is that when the expanded polynomial is evaluated in
decreasing degree, the error is highly quantized, jumping between a few discrete levels.
The error is less quantized when evaluated in increasing degree; however, in both cases,
there is a distinct jump in the error by a factor of two as we cross the value *x* = 1.
The error of Horner's rule remains smaller and is not visibly affected by this change
in the spacing of the floats.

2. Using four decimal digits of precision, subtract 3.523 from 3.537.

The difference is 0.014, which is already exactly representable
with four decimal digits, and might be interpreted as 0.01400. However,
recall that 3.523 could represent any number in (3.5225, 3.5235) and 3.537
could represent any number in (3.5365, 3.5375), so the
actual difference could be any number in the interval (0.013, 0.015).
Thus, while the original numbers had a maximum relative error of
about 0.00014, the difference has a relative error of about 0.071, which
is larger by a factor of 500.
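
A quick check of these bounds and relative errors in Matlab:

3.5365 - 3.5235    % 0.013, the smallest possible true difference
3.5375 - 3.5225    % 0.015, the largest possible true difference
0.0005/3.523       % approximately 0.00014, the relative error of the operands
0.001/0.014        % approximately 0.071, the relative error of the difference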

3. Using 4 decimal digits of precision add 532.3 and 3.235.

The sum is 535.535 which, when rounded to four decimal digits, equals 535.5.
Thus, part of the precision of the smaller number has been lost. This is
not as dramatic as x + h = x when h ≠ 0, but
it can still be a problem.
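
Simulating the same sum with round (R2014b and later):

round(532.3 + 3.235, 4, 'significant')    % 535.5: the trailing 0.035 is lost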

4. Using 3 decimal digits of precision, add the powers of 2 from 0 to 17 in
the order from 0 to 17 and then in reverse order from 17 to 0.

Recall that with any such system, numbers must be rounded before
and after any operation is performed. For example, 2^{10} = 1024 ≈ 1020
using 3 significant digits (after rounding). Thus, the partial sums in increasing order are:

1 3 7 15 31 63 127 255 511 1020 2040 4090 8190 16400 32800 65600 131000 262000

while in decreasing order, they are:

131000 196000 229000 245000 253000 257000 259000 260000 261000 261000 261000 261000 261000 261000 261000 261000 261000 261000

The correct value of this sum is 2^{18} - 1 = 262143, and thus
the relative error of the first sum is 0.00055 while the relative error
of the second sum is 0.0044; by adding from smallest to largest, we have
gained almost an entire digit of precision.

5. Figure 2 gives a summary of the calculation of
the roots of the given quadratic equation. The left
column calculates the roots using 4 decimal digits
of precision, the right uses 10 digits. Note that
up until we perform the difference, all numbers
in the left-hand column are the best possible 4-digit
approximations of the numbers in the right-hand column.
With the difference, a significant amount of relative
error is introduced, and all calculations after that
point are equally imprecise.

Figure 2. Finding the roots of a quadratic.

In both cases, the inaccurate digits are marked in
red.

# Questions

Unless stated otherwise, assume that floats use the representation
of four decimal digits in the mantissa.

1. What is the largest float which may be added to 3.523 which will result
in a sum of 3.523 and why? (0.0004999)

2. a. What is the largest float which may be added to 722.4 which will result
in a sum of 722.4 and why? (0.05000)

2. b. What is the largest float which may be added to 722.3 which will result
in a sum of 722.3 and why? (0.04999)

3. How would you calculate the sum of *n*^{-2} for *n* = 1, 2, ..., 100000 and why?

(Start the sum at *n* = 100000 and work down to *n* = 1, so that the
smallest terms are accumulated first.)

# Matlab

We can still observe these phenomena in Matlab:

>> format long
>> x = 0;
>> for i=1:1000000
x = x + 1/i;
end
>> x
x = 14.3927267228648
>> x = 0;
>> for i=1000000:-1:1
x = x + 1/i;
end
>> x
x = 14.3927267228658

The second sum, which accumulates the smallest terms first, is the more
accurate of the two; the true value is approximately 14.3927267228657.

This approximation of the derivative of sin(x) at x = 1 is not very good,
even though *h* = 10^{-15} is quite small; the subtraction in the numerator
cancels almost all of the significant digits:

>> (sin(1 + 1e-15) - sin(1))/1e-15
ans = 0.555111512312578
>> cos(1)
ans = 0.540302305868140
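
The culprit is the subtractive cancellation in the numerator. A step size
near sqrt(eps), such as *h* = 10^{-8}, balances the cancellation error
against the truncation error, and the result should then agree with cos(1)
to roughly seven digits:

>> (sin(1 + 1e-8) - sin(1))/1e-8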

# Maple

You can see the same problems with floating-point numbers through
numerous examples in Maple:

53.23559325 - 53.23823923;
53235593.25 + 53.23823923;

As a more complete example, we sum 1/i for i = 1, ..., 10000 with three
digits of precision, first from the largest term down and then from the
smallest term up:

> Digits := 3:
> S := 0.0:
> for i from 1 to 10000 do
> S := S + 1.0/i;
> end:
> S;
6.16
> S := 0.0:
> for i from 10000 to 1 by -1 do
> S := S + 1.0/i;
> end:
> S;
7.15

The correct value of this sum is approximately 9.788, so neither three-digit
result is accurate, but summing from the smallest terms first does noticeably
better.