# Introduction

In this topic, we consider some of the problems which occur as a result of using a floating-point representation. Specifically, we will look at the quadratic formula as an example.

# References

• Chapra, Section 3.4.2, Arithmetic Manipulations of Computer Numbers, p.66.

# Theory

Because we are attempting to represent an infinite number of different real numbers with a finite number of floating-point numbers, we must run into problems. Some of the more common problems, some of which we will see in this course, are listed below.

# Overflow and Underflow

The largest floating-point number which can be represented using our format is 0999999, or 9.999 × 10^50. Beyond this, we can no longer approximate a number with the given relative error. The solution is to represent any such number by a special floating-point number designated positive infinity (+∞). In our case, we could, for example, represent +∞ as 0000001, since we require that the first digit of the mantissa of a normal float be nonzero. Similarly, 1000001 could represent −∞.

Similarly, the smallest positive number is 0001000, or 10^-49. Any positive number smaller than this is represented by +0.
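These overflow and underflow rules can be imitated with Python's `decimal` module. This is a sketch only: the context below models the four significant digits and the exponent range of our format, not the seven-digit encodings themselves, and `decimal` permits a few subnormal values slightly below 10^-49 that our format does not.

```python
from decimal import Context, Decimal

# Toy context: 4 significant digits, decimal exponents from -49 to 50.
ctx = Context(prec=4, Emax=50, Emin=-49)
ctx.clear_traps()  # return special values instead of raising exceptions

big = Decimal("9.999E+50")                   # largest representable number
overflowed = ctx.multiply(big, Decimal(10))  # too large -> Infinity

tiny = Decimal("1E-49")                              # smallest positive normal
underflowed = ctx.multiply(tiny, Decimal("1E-10"))   # far too small -> 0

print(overflowed)        # Infinity
print(underflowed == 0)  # True
```

Anything past 9.999 × 10^50 collapses to the special value +∞, and anything far enough below 10^-49 collapses to zero, just as described above.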

# Subtractive Cancellation

Consider the following difference: 3.593 - 3.574 = 0.019. Using our system, these are represented by 0493593, 0493574, and 0471900. All three appear to have the same precision: four decimal digits in the mantissa. However, because 3.593 represents any number in the range (3.5925, 3.5935), and 3.574 represents any number in the range (3.5735, 3.5745), the actual difference may be any number in the interval (0.018, 0.020). Therefore, while we have an apparent precision of four digits, we actually have zero significant digits.

This phenomenon, which occurs when we try to subtract similar numbers, is termed subtractive cancellation. The easiest example is the formula for approximating the derivative: (f(x + h) - f(x))/h. If we use f(x) = x^2 with x = 3.253 and h = 0.002, we get the approximation (3.255^2 - 3.253^2)/0.002 = (10.60 - 10.58)/0.002 = 10, which is a poor approximation of the actual derivative 6.506.
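The arithmetic above can be reproduced with a four-digit decimal context in Python (a sketch of the same rounding behaviour; the encoding of our toy format is not modelled):

```python
from decimal import Context, Decimal

ctx = Context(prec=4)  # four decimal digits, round-half-even

x = Decimal("3.253")
h = Decimal("0.002")

xh = ctx.add(x, h)                  # 3.255
fxh = ctx.multiply(xh, xh)          # 3.255^2 = 10.595025 -> 10.60
fx = ctx.multiply(x, x)             # 3.253^2 = 10.582009 -> 10.58
approx = ctx.divide(ctx.subtract(fxh, fx), h)  # (10.60 - 10.58)/0.002

print(approx)        # 1E+1, i.e. 10
print(2 * float(x))  # the exact derivative is 2x = 6.506
```

Each intermediate value is the best four-digit approximation, yet the subtraction destroys almost all of the precision, giving 10 instead of 6.506.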

Many of the topics in this course will be affected by subtractive cancellation.

# Adding Large and Small Numbers

Consider 1038. + 0.02353 = 1038.02353 ≈ 1038. Thus, even though the second number is not zero, when it is added to 1038., there is no change to the first. This leads to the further problem that floating-point addition is not associative, i.e., (a + b) + c ≠ a + (b + c). We will see this in one instance, in Gaussian elimination, in this course.

If we consider the sum 1000. + 0.5 + 0.5 and calculate it from left to right, we get (1000. + 0.5) + 0.5 = 1000. + 0.5 = 1000. because 1000.5 rounds to 1000. with four decimal digits of precision. However, if we calculate the sum from right to left, we get 1000. + (0.5 + 0.5) = 1000. + 1 = 1001.
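This order dependence is easy to demonstrate with a four-digit decimal context (a Python sketch; `decimal`'s default round-half-even matches the rounding assumed here):

```python
from decimal import Context, Decimal

ctx = Context(prec=4)  # four decimal digits, round-half-even

a, b, c = Decimal(1000), Decimal("0.5"), Decimal("0.5")

# (1000 + 0.5) rounds to 1000 (half-even), so the second 0.5 is lost too.
left_to_right = ctx.add(ctx.add(a, b), c)

# 0.5 + 0.5 = 1 survives intact, and 1000 + 1 = 1001 is exact.
right_to_left = ctx.add(a, ctx.add(b, c))

print(left_to_right, right_to_left)  # 1000 1001
```

The two groupings of the same three numbers give different answers, so floating-point addition is not associative.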

This is yet another example of non-associativity, similar to the previous example. This will not be seen in any of the algorithms covered in this course.

# Subtractive Cancellation

Any time you subtract two numbers which are almost equal, subtractive cancellation will occur, and the difference will not have as much precision as either the subtrahend or the minuend. In designing numerical algorithms, we must avoid such situations.

# Overflow and Underflow

Overflow occurs when the exponent is too large to be represented with the given form. According to IEEE 754, the appropriate representation of such a number is ±infinity. Underflow occurs when the exponent is too small (too negative) to be represented by the given form and should be represented by ±0.

# Adding Large and Small Numbers

If a large float is added to a small float, the small float may be negligible, leaving the value of the larger float unchanged. Algorithms which may cause such behaviour must be avoided.

In performing a sequence of additions, the numbers should be added in order from the smallest in magnitude to the largest in magnitude. This ensures that the cumulative sum of many small arguments still contributes to the result.
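Double-precision floats behave the same way. In the sketch below (values chosen purely for illustration), the spacing between adjacent doubles near 10^16 is 2, so adding 1.0 at a time accomplishes nothing, while summing the small terms first preserves their cumulative effect:

```python
# Adding 1.0 to 1.0e16 one term at a time: each individual addition
# rounds back to 1.0e16, so all ten contributions are lost.
large_first = 1.0e16
for _ in range(10):
    large_first += 1.0

print(large_first == 1.0e16)  # True: the additions vanished

# Summing the small terms first: their total (10) is large enough
# to survive the final addition.
small_first = 1.0e16 + sum([1.0] * 10)
print(small_first)  # 1.000000000000001e+16
```

The same ten numbers are added in both cases; only the order differs.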

# Examples

1. Consider the factored polynomial (x − 1)^8. Expanded, this produces the polynomial:

x^8 − 8x^7 + 28x^6 − 56x^5 + 70x^4 − 56x^3 + 28x^2 − 8x + 1

First, we can evaluate the original factored polynomial:

```
d = 0.000007237;
x = [1 - d, 1 + d]
   0.999992763000000   1.000007237000000
y0 = (x - 1).^8
```

Usually, polynomials are not available in factored form, and consequently they must be evaluated in expanded form. By default, this may be done by summing the terms, either in order of decreasing or increasing degree:

```
y1 = [0,0];
p = [1 -8 28 -56 70 -56 28 -8 1];
for i = 1:length(p)
    y1 = y1 + p(i)*x.^(9 - i);
end
y1
   2.66453525910038e-015  -1.24344978758018e-014
```
```
y2 = [0,0];
p = [1 -8 28 -56 70 -56 28 -8 1];
for i = 1:length(p)
    y2 = y2 + p(i)*x.^(i - 1);
end
y2
   2.77555756156289e-015  -1.24344978758018e-014
```

The true value of (x − 1)^8 at these points is approximately 7.5 × 10^-42, so the error in these evaluations has increased by roughly 26 orders of magnitude.

Another means of evaluating an expanded polynomial is Horner's rule, which says that we rewrite the polynomial in the following nested form and evaluate it from the inside out:

(((((((x − 8)x + 28)x − 56)x + 70)x − 56)x + 28)x − 8)x + 1

This may be evaluated as follows:

```
y3 = [0 0];
p = [1 -8 28 -56 70 -56 28 -8 1];
for i = 1:length(p)
    y3 = y3.*x + p(i);
end
y3
  -1.55431223447522e-015  3.44169137633799e-015
```

In this case, the error is less than that of evaluating the expanded polynomial in either decreasing or increasing degree.
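The same experiment can be sketched in Python's double-precision arithmetic (plain doubles rather than Matlab, so the exact digits of the rounding noise will differ slightly, but the magnitudes behave the same way):

```python
d = 0.000007237
p = [1, -8, 28, -56, 70, -56, 28, -8, 1]  # coefficients of (x - 1)^8

results = []
for x in (1 - d, 1 + d):
    # Factored form: essentially exact, about 7.5e-42.
    factored = (x - 1) ** 8

    # Expanded form, summed in decreasing degree: cancellation noise
    # on the order of 1e-15 swamps the true value.
    expanded = 0.0
    for i, c in enumerate(p):
        expanded += c * x ** (8 - i)

    # Horner's rule: still noisy, but typically somewhat better.
    horner = 0.0
    for c in p:
        horner = horner * x + c

    results.append((factored, expanded, horner))
    print(factored, expanded, horner)
```

The factored evaluation is tiny and positive, as it should be; the expanded evaluations are pure rounding noise, dozens of orders of magnitude too large.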

Figure 1 plots the evaluation of the factored polynomial (black) and the expanded polynomial using standard evaluation in decreasing degree (magenta) and increasing degree (blue) and Horner's rule (red) at 2^17 + 1 equally-spaced points on [1 − 2^-10, 1 + 2^-10].

Figure 1. The polynomial (x − 1)^8 plotted (in black) together with the standard evaluation of the expanded form both in increasing degree (blue), decreasing degree (magenta), and Horner's rule (red).

What is immediately noticeable is that when evaluating the expanded form in decreasing degree, the error is exceptionally discrete. The error is less discrete when evaluating in increasing degree; however, in both cases, there is a distinct jump in the error by a factor of two as we cross the value x = 1. Horner's rule remains more tightly bounded and is not affected by the change in the precision.

2. Using four decimal digits of precision, subtract 3.523 from 3.537.

The difference is 0.014, which is unchanged when rounded to four decimal digits. This result may be interpreted as 0.01400; however, recall that 3.523 could represent any number in (3.5225, 3.5235) and 3.537 could represent any number in (3.5365, 3.5375), and thus the actual answer could be any number in the range (0.013, 0.015). Thus, while the original numbers had a maximum relative error of about 0.00014, the answer has a relative error of about 0.071, which is larger by a factor of 500.

3. Using four decimal digits of precision, add 532.3 and 3.235.

The exact sum is 535.535, which rounds to 535.5 with four decimal digits. Thus, part of the precision of the smaller number has been lost. This is not as dramatic as x + h = x when h ≠ 0, but it can still be a problem.
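Both of these examples can be checked with a four-digit decimal context (a Python sketch of the same arithmetic):

```python
from decimal import Context, Decimal

ctx = Context(prec=4)  # four decimal digits

# Example 2: the difference appears exact but hides the lost precision.
diff = ctx.subtract(Decimal("3.537"), Decimal("3.523"))
print(diff)   # 0.014

# Example 3: the exact sum 535.535 loses its last two digits on rounding.
total = ctx.add(Decimal("532.3"), Decimal("3.235"))
print(total)  # 535.5
```

The printed results match the hand calculations above.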

4. Using 3 decimal digits of precision, add the powers of 2 from 0 to 17 in the order from 0 to 17 and then in reverse order from 17 to 0.

Recall that with any such system, numbers must be rounded both before and after any operation is performed. For example, 2^10 = 1024 ≈ 1020 using three significant digits (after rounding). Thus, the partial sums in increasing order are:

```
1 3 7 15 31 63 127 255 511 1020 2040 4090 8190 16400 32800 65600 131000 262000
```

while in decreasing order, they are:

```
131000 196000 229000 245000 253000 257000 259000 260000 261000 261000 261000 261000 261000 261000 261000 261000 261000 261000
```

The correct value of this sum is 2^18 − 1 = 262143, and thus the relative error of the first sum is 0.00055 while the relative error of the second sum is 0.0044; by adding from smallest to largest, we have gained almost an entire digit of precision.
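The partial sums above can be reproduced with a three-digit decimal context, rounding each addend and each partial sum as described (a sketch; `decimal`'s default round-half-even is assumed to be the rounding used here):

```python
from decimal import Context, Decimal

ctx = Context(prec=3)  # three significant digits, round-half-even

def rounded_sum(terms):
    """Accumulate terms, rounding before and after each addition."""
    s = Decimal(0)
    for t in terms:
        s = ctx.add(s, ctx.plus(Decimal(t)))  # ctx.plus rounds the addend
    return s

powers = [2 ** k for k in range(18)]  # 2^0 through 2^17
increasing = rounded_sum(powers)
decreasing = rounded_sum(reversed(powers))

print(int(increasing), int(decreasing))  # 262000 261000
```

Summing from the smallest power to the largest gives 262000; summing from the largest to the smallest gives 261000, exactly as in the tables above.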

5. Figure 2 gives a summary of the calculation of the roots of the given quadratic equation. The left column calculates the roots using 4 decimal digits of precision; the right uses 10 digits. Note that up until we perform the difference, all numbers in the left-hand column are the best possible 4-digit approximations of the numbers in the right-hand column. With the difference, a significant amount of relative error is introduced, and all calculations after that point are equally imprecise.

Figure 2. Finding the roots of a quadratic.

In both cases, the inaccurate digits are marked in red.
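One standard way to avoid the subtraction that causes the cancellation in the quadratic formula is to compute the large-magnitude root first and recover the other from the product of the roots, c/a. The sketch below uses double precision and a hypothetical quadratic x^2 + 10^8 x + 1 (not the equation from the figure) chosen so that b^2 greatly exceeds 4ac:

```python
import math

# Hypothetical example: x^2 + 1e8*x + 1 = 0.  The small root is very
# close to -c/b = -1e-8, and -b + sqrt(b^2 - 4ac) cancels catastrophically.
a, b, c = 1.0, 1.0e8, 1.0
disc = math.sqrt(b * b - 4 * a * c)

# Naive quadratic formula for the small root: severe cancellation.
naive_small = (-b + disc) / (2 * a)

# Stable approach: the large root has no cancellation, and the small
# root follows from the product of the roots, x1 * x2 = c/a.
stable_big = (-b - disc) / (2 * a)
stable_small = c / (a * stable_big)

print(naive_small)   # noticeably off from -1e-8
print(stable_small)  # essentially -1e-8
```

The naive result is wrong by roughly 25 percent, while the stable form recovers the small root to full double precision; the same idea applies to the four-digit calculation of Figure 2.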

# Questions

Unless stated otherwise, assume that floats use the representation of four decimal digits in the mantissa.

1. What is the largest float which may be added to 3.523 which will result in a sum of 3.523 and why? (0.0004999)

2. a. What is the largest float which may be added to 722.4 which will result in a sum of 722.4 and why? (0.05000)

2. b. What is the largest float which may be added to 722.3 which will result in a sum of 722.3 and why? (0.04999)

3. How would you calculate the sum of n^-2 for n = 1, 2, ..., 100000 and why?

(starting at n = 100000 and summing down to n = 1)

# Matlab

We can still observe these phenomena in Matlab:

```
>> format long
>> x = 0;
>> for i=1:1000000
x = x + 1/i;
end
>> x
x = 14.3927267228648
>> x = 0;
>> for i=1000000:-1:1
x = x + 1/i;
end
>> x
x = 14.3927267228658
```

This approximation to the derivative of sin(x) at x = 1 is not very good, even though h = 10^-15 is quite small:

```
>> (sin(1 + 1e-15) - sin(1))/1e-15
ans = 0.555111512312578
>> cos(1)
ans = 0.540302305868140
```

# Maple

You can see the problems with floating-point numbers with numerous examples in Maple:

```
53.23559325 - 53.23823923;
53235593.25 + 53.23823923;
```

As a demonstration:

```
> Digits := 3:
> S := 0.0:
> for i from 1 to 10000 do
>    S := S + 1.0/i;
> end:
> S;
6.16
> S := 0.0:
> for i from 10000 to 1 by -1 do
>    S := S + 1.0/i;
> end:
> S;
7.15
```