## Laboratory 5

# 5.1 The Double-Precision Floating-Point Format

## 5.1.1 Required Background

## 5.1.2 The Double-precision Floating-point Representation

### 5.1.2.1 Sign Bit

### 5.1.2.2 Exponent

### 5.1.2.3 The Mantissa

### 5.1.2.4 Zero and Denormalized Numbers

### 5.1.2.5 Infinity and Not-a-Number

### 5.1.2.6 Comparisons

### 5.1.2.7 Boolean Operations in Matlab

### 5.1.2.8 While Loops

# 5.2 Newton Polynomials and Horner's Rule

## 5.2.1 Subtractive Cancellation and Polynomials

## 5.2.2 Newton Polynomials

## 5.2.3 Newton Polynomials in Matlab

## 5.2.4 Horner's Rule

This laboratory will investigate two topics:

- The double-precision floating-point format (MATH 211), and
- Newton Interpolation and Horner's Rule (MATH 215).

The IEEE *Standard for Floating-point Arithmetic*
(IEEE 754-2008) is a
ubiquitous standard for numerical computations. The first
version, IEEE 754-1985, was the collaborative work of numerous
individuals from many different companies with potentially
conflicting interests and loyalties; however, led by Dr. William Kahan,
the committee was able to produce a remarkable document.
In a 1998 interview, Dr. Kahan spoke about some of the issues
which had to be dealt with in creating the original standard.

In the MATH 211 Laboratory 1, you looked at a decimal floating-point representation in Figure 1.

Figure 1. The six-decimal-digit floating-point representation.

In this Laboratory, we will look at the double-precision floating-point
representation which is implemented as a `double` in the C, C#,
C++, and Java programming languages.

In this laboratory, you may need the following table:

Hexadecimal Digit | Binary Equivalent
---|---
0 | 0000
1 | 0001
2 | 0010
3 | 0011
4 | 0100
5 | 0101
6 | 0110
7 | 0111
8 | 1000
9 | 1001
a | 1010
b | 1011
c | 1100
d | 1101
e | 1110
f | 1111

Students should be familiar with binary and hexadecimal numbers, taught in ECE 124, and the decimal floating-point format, taught in the MATH 211 Laboratory 1.

The double-precision floating-point representation, or `double`,
derives its name from the fact that each value occupies 8 bytes, twice
the 4 bytes used by the single-precision floating-point representation
(or `float`). It has been recognized that, for most computations, the
single-precision floating-point representation does not store sufficient
precision. Unless you are sure that a single-precision floating-point
number is appropriate, use `double`.
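The loss of precision is easy to demonstrate. As an illustrative sketch in Python (used here only because its standard `struct` module can round-trip a value through the 4-byte format):

```python
import struct

# Round-trip 1/3 through the 4-byte single-precision format:
x = 1/3
f = struct.unpack('>f', struct.pack('>f', x))[0]

print(x)   # 0.3333333333333333  (roughly 16 significant decimal digits)
print(f)   # 0.3333333432674408  (only about 7 significant digits survive)
```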

Like the representation given in the MATH 211 Laboratory 1, the 8 bytes = 8×8 bits = 64 bits are divided into a single sign bit, 11 bits for the exponent, and the remaining 52 bits for the mantissa. This is shown in Figure 2.

Figure 2. The double-precision floating-point format.

Any 64-bit number may be written as a 16-hexadecimal digit number. Therefore, when a double is written as a hexadecimal number, the first three hexadecimal digits represent the sign bit and the exponent. Figures 3 and 4 represent the interpretation of these as numbers.

Figure 3. The double-precision floating-point representation using 64 bits.

Figure 4. The double-precision floating-point representation using 16 hexadecimal digits.

Matlab will display a double's representation by using `format hex`, for
example:

```matlab
>> format long
>> 1
ans = 1
>> -2
ans = -2
>> pi
ans = 3.14159265358979
>> format hex
>> 1
ans = 3ff0000000000000
>> -2
ans = c000000000000000
>> pi
ans = 400921fb54442d18
```

To begin, the first bit is the sign bit: `0` represents
positive numbers and `1` represents negative numbers. Therefore,
in the hexadecimal representation, if the first three hexadecimal digits are between
`800` and `fff`, the number is negative.

Unless you are absolutely sure that the precision of a `double`
is not needed, __do not use float__.

The next 11 bits represent the exponent. Thus, the exponent field can
hold a value between `00000000000`_{2} and
`11111111111`_{2},
or 0_{10} and 2047_{10}. As with the decimal format,
the double format uses a bias, and based on the first
three hexadecimal digits of 1, *i*.*e*., `3FF`,
it is apparent that the bias is `01111111111`_{2}
or 1023_{10}.
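The three fields and the bias can be inspected directly. The following is a small sketch in Python (chosen only because its standard `struct` module exposes the raw 64 bits):

```python
import struct

def double_bits(x):
    """Return (sign bit, exponent field, mantissa field) of a double."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]  # the 64-bit pattern
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7ff    # the 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)  # the 52 mantissa bits
    return sign, exponent, mantissa

# The number 1 = 1.0 x 2^0, so its stored exponent field equals the bias:
print(double_bits(1.0))    # (0, 1023, 0)
print(double_bits(-2.0))   # (1, 1024, 0): sign 1, exponent 1024 - 1023 = 1
```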

Two exponent values are reserved for special representations:
`00000000000`_{2} = 0_{10} and `11111111111`_{2} = 2047_{10}.
Thus, the remaining exponents represent numbers in the range
2^{1 - 1023} = 2^{-1022}, ...,
2^{2046 - 1023} = 2^{1023}, which
is from 2.2251×10^{-308} to 8.9885×10^{307}.

Recall that the decimal floating-point format requires that the first digit
of the mantissa be non-zero. This was required to ensure a unique
representation. With binary numbers, the **only** non-zero bit is 1, and
therefore it is unnecessary to store this leading bit. Therefore,
the mantissas of the three numbers 1, -2, and π are
`1.0000000000000`_{16},
`1.0000000000000`_{16}, and
`1.921fb54442d18`_{16}, respectively. In binary,
the last mantissa is
**1**.1001001000011111101101010100010001000010110100011000_{2}.
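The hidden leading bit can be verified by reassembling π from its stored fields; a sketch in Python:

```python
import math
import struct

bits = struct.unpack('>Q', struct.pack('>d', math.pi))[0]
print(format(bits, '016x'))        # 400921fb54442d18, as in the Matlab output

exponent = (bits >> 52) & 0x7ff    # biased exponent: 0x400 = 1024
mantissa = bits & ((1 << 52) - 1)  # the 52 stored mantissa bits

# Reattach the implied leading 1 and undo the bias of 1023:
value = (1 + mantissa / 2**52) * 2.0**(exponent - 1023)
print(value == math.pi)            # True
```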

Previously, it was commented that the exponent
`00000000000`_{2} is
reserved for special representations. This occurs when the first
three hexadecimal digits are `000`_{16}
or `800`_{16}. In this case, if the mantissa
is `XXXXXXXXXXXXX`_{16}, it represents
the denormalized number ±`0.XXXXXXXXXXXXX`_{16} × 2^{-1022}.

For example, the smallest normalized number occurs when the exponent
field is `001`, representing 2^{1 - 1023} = 2^{-1022}, as is
demonstrated here:

```matlab
>> format hex
>> 2^(-1022)
ans = 0010000000000000
```

For any number smaller than this, we use denormalized numbers:

```matlab
>> 2^-1023
ans = 0008000000000000
>> ans/2
ans = 0004000000000000
```

The first number is `0.8`_{16} × 2^{-1022} = 2^{-1023} while the second number is `0.4`_{16} × 2^{-1022} = 2^{-1024}.

For example, to interpret the number `0000580000000000`, we note the first three hexadecimal
digits are `000` and consequently, this is a denormalized number. The mantissa is
`0580000000000` which represents the binary number
`000001011000000...0`. Interpreting this as
a floating-point number, we have `0.000001011`_{2} × 2^{-1022}. You should recall
the following values:

Binary | Power | Decimal
---|---|---
0.000001_{2} | 2^{-6} | 0.015625
0.00000001_{2} | 2^{-8} | 0.00390625
0.000000001_{2} | 2^{-9} | 0.001953125

In binary, the mantissa represents 2^{-6} + 2^{-8} + 2^{-9} = 0.021484375. If we enter this into Matlab, we get the desired result:

```matlab
>> 0.021484375 * 2^-1022
ans = 0000580000000000
```

The smallest possible number which can be represented is `0000000000000001`, which
is `0.0000000000001`_{16} × 2^{-1022}: in binary, a 1 with 51 zeros following the
radix point. That is, it is the number 2^{-52} × 2^{-1022} = 2^{-1074}.
We can verify this in Matlab:

```matlab
>> 2^-1074
ans = 0000000000000001
>> ans / 2
ans = 0000000000000000
```

The last answer is, of course, the representation of 0.
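These bit patterns can be reproduced outside of Matlab as well; a sketch in Python:

```python
import struct

def to_hex(x):
    """Return the 16-hexadecimal-digit representation of a double."""
    return format(struct.unpack('>Q', struct.pack('>d', x))[0], '016x')

print(to_hex(2.0**-1022))       # 0010000000000000  (smallest normalized)
print(to_hex(2.0**-1074))       # 0000000000000001  (smallest denormalized)
print(to_hex(2.0**-1074 / 2))   # 0000000000000000  (underflows to zero)
```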

The other reserved exponent is the largest, `11111111111`_{2} = 2047_{10},
corresponding to the first three hexadecimal digits being
`7FF` or `FFF`. If the number is plus or minus infinity,
the mantissa is zero:

```matlab
>> format hex
>> Inf
ans = 7ff0000000000000
>> -Inf
ans = fff0000000000000
```

This does not represent the mathematical infinity, but rather, any floating-point number at or beyond ±2^{1024}:

```matlab
>> format
>> 2^1023
ans = 8.9885e+307
>> 2^1024
ans = Inf
```

The properties are as expected:

```matlab
>> Inf + 10
ans = Inf
>> 1/0
ans = Inf
>> -3 * Inf
ans = -Inf
>> 1/Inf
ans = 0
```

There are some mathematical operations which are not defined.
For example, what is `Inf - Inf` or `0/0` or
even `0 * Inf`? It is
not reasonable to represent any of these as `Inf` or `-Inf`,
and therefore there is one final representation: NaN or *not-a-number*.
This is what you would have referred to in high school as *undefined*:

```matlab
>> 0/0
ans = NaN
>> Inf - Inf
ans = NaN
>> Inf * 0
ans = NaN
>> format hex
>> NaN
ans = fff8000000000000
```

As you can see in the last example, the representation is
close to that of `-Inf` except that the mantissa is
all zeros except for a leading 1 bit.

NaN has one particularly interesting property according to the IEEE 754 standard: when compared to itself, it must return false:

```matlab
>> 3 == 3
ans = 1
>> Inf == Inf
ans = 1
>> pi == pi
ans = 1
>> NaN == NaN
ans = 0
```

Thus, to detect `NaN`, one must make an explicit
function call:

```matlab
>> isnan( [0 1 -1 Inf -Inf NaN pi] )
ans = 0 0 0 0 0 1 0
```
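Other languages expose the same IEEE 754 behaviour. A sketch in Python, where the self-comparison trick shows why an explicit test is needed:

```python
import math

nan = float('nan')

print(nan == nan)        # False: NaN compares unequal even to itself
print(math.isnan(nan))   # True

# The self-inequality is exactly what an isnan-style test can exploit:
def my_isnan(x):
    return x != x

print([my_isnan(v) for v in [0.0, 1.0, float('inf'), nan]])
# [False, False, False, True]
```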

A double-precision floating-point number is positive if the first hexadecimal
digit is `7` or less and negative otherwise.

Two double-precision floating-point numbers are equal if and only if all 64 bits are equal
(the exceptions being that +0 and -0 compare equal despite differing representations, and that NaN never compares equal).

Because of denormalized numbers, two doubles `x` and `y` are equal if and only if
`(x - y) == 0` evaluates to true (1). For example, both `sqrt(2)` and `sqrt(8)/2`
represent the same number and evaluate to the
same value:

```matlab
>> format
>> (sqrt(2) - sqrt(8)/2) == 0
ans = 1
>> format hex
>> sqrt(2)
ans = 3ff6a09e667f3bcd
>> sqrt(8)/2
ans = 3ff6a09e667f3bcd
```

On the other hand, while `sqrt(18)/3` should also equal `sqrt(2)`, due to numerical error, they are not precisely the same:

```matlab
>> format
>> (sqrt(2) - sqrt(18)/3) == 0
ans = 0
>> format hex
>> sqrt(2)
ans = 3ff6a09e667f3bcd
>> sqrt(18)/3
ans = 3ff6a09e667f3bcc
```

In addition, two positive numbers may be compared by simply comparing their bits. For example, given the hexadecimal representations of ten numbers:

```
52832e8d36bf5eed
15b7e09834b53d11
2bd1f986a855506b
17033585aedd9016
1b9b8cfca975ca98
5d191fca8e5f7883
5180e4696a543549
2f71833e25f42212
6887272068f2910d
15f45c2c1f3ad0d0
```

It is immediately clear that they are, from smallest to largest:

```
15b7e09834b53d11
15f45c2c1f3ad0d0
17033585aedd9016
1b9b8cfca975ca98
2bd1f986a855506b
2f71833e25f42212
5180e4696a543549
52832e8d36bf5eed
5d191fca8e5f7883
6887272068f2910d
```

When we look at each of these, we see that this ordering is correct:

Representation | Actual Value
---|---
52832e8d36bf5eed | 3.052664278175422e+089
15b7e09834b53d11 | 4.759798560884662e-204
2bd1f986a855506b | 1.314871551396497e-097
17033585aedd9016 | 8.030437775543794e-198
1b9b8cfca975ca98 | 1.087816536642441e-175
5d191fca8e5f7883 | 2.991919959128576e+140
5180e4696a543549 | 4.102002070144849e+084
2f71833e25f42212 | 3.692440052604274e-080
6887272068f2910d | 3.380270393986167e+195
15f45c2c1f3ad0d0 | 6.493842243300212e-203

These are operations which may be performed very quickly in hardware.
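The claim that positive doubles order the same way as their bit patterns can be spot-checked; a sketch in Python, interpreting the 64 bits as an unsigned integer:

```python
import struct

def bits(x):
    """The 64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack('>Q', struct.pack('>d', x))[0]

# For positive doubles, value order and bit-pattern order agree:
samples = [4.76e-204, 6.49e-203, 8.03e-198, 3.69e-80, 4.10e84, 3.05e89]
print(sorted(samples) == sorted(samples, key=bits))   # True
```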

Like C and C++, Matlab considers 0 to be equivalent to `false`
and any non-zero number to be true. Operators such
as `==`, `<=`, or `>` do not return
Boolean values; rather, they return 0 or 1.

You may ask: I can tell whether or not any one entry within
a matrix is `NaN`, but how can I tell if no entries
are `NaN`?

This introduces two column-based matrix functions
(like `sum`, `max`, etc.). These are
the `any` and `all` functions:

The `any` function returns 1 for a column if any
of the entries of the column are non-zero. The `all`
function returns 1 for a column if all entries are non-zero.
For example:

```matlab
>> M = [1 0 0 0; 1 1 0 0; 1 1 1 0]
M =
     1     0     0     0
     1     1     0     0
     1     1     1     0
>> any( M )
ans = 1 1 1 0
>> all( M )
ans = 1 0 0 0
```

Thus, for a quick test to determine that no entries of a matrix
are `NaN`, use:

```matlab
~any( any( isnan( M ) ) )
```

The `~` is the Matlab equivalent of *not*. It
is the same as the C/C++/Java/C# `!`.

The result of `isnan` is a matrix of 0s and 1s. The first call
to `any` will produce a 1 in any column which
contains a `NaN`. If any column contained a `NaN`,
then the result of the first call would produce a 1 in that column
and therefore the second call to `any` would also return
1. The negation of 1 is 0.
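The same column-wise logic can be mimicked outside of Matlab; a small sketch in Python (plain nested lists standing in for a Matlab matrix):

```python
import math

M = [[1, 0, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0]]

# Column-wise any and all, mirroring the Matlab example above:
cols = list(zip(*M))
print([int(any(c)) for c in cols])   # [1, 1, 1, 0]
print([int(all(c)) for c in cols])   # [1, 0, 0, 0]

# The equivalent of ~any( any( isnan( M ) ) ):
def no_nans(M):
    return not any(math.isnan(x) for row in M for x in row)

print(no_nans(M))                        # True
print(no_nans([[1.0, float('nan')]]))    # False
```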

Boolean tests can be used in `if` statements, but they
can also be used in while loops:

```matlab
>> format long
>> s1 = 0;
>> s2 = 1;
>> i = 2;
>> while ( s1 ~= s2 )
s1 = s2;
s2 = s2 + 1/i^2;
i = i + 1;
end
% Executes for a while...
>> s2
s2 = 1.644934057834575
>> i
i = 94906267
>> pi^2/6
ans = 1.644934066848226
```

This is not the way to test for convergence: we generally
do not require this much accuracy, and in some cases, the loop may not stop
even when the result is *close enough*.
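The same stopping criterion can be tried on a faster-converging sum. A sketch in Python, using the geometric series Σ 2^{-i} (chosen so the loop finishes in a few dozen iterations rather than tens of millions):

```python
# Add powers of 1/2 until the partial sum stops changing in double
# precision; the loop exits once 2**-i falls below the rounding
# threshold of the current sum.
s1, s2, i = 0.0, 1.0, 1
while s1 != s2:
    s1 = s2
    s2 = s2 + 2.0**-i
    i += 1

print(s2)   # 2.0: the partial sums have converged to the limit exactly
```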

In this section, we will see a new means of finding and evaluating interpolating polynomials.

In the previous laboratory, you saw how to calculate interpolating polynomials. For example, the polynomial which interpolates the three points used later in this laboratory, (-1, -1.4), (2, 1.3), and (3, 5.4), is 0.8x^{2} + 0.1x - 2.1.

There are problems, however, in evaluating such polynomials
at an arbitrary point x. For example, it
is not possible to avoid subtractive cancellation: the value of the
polynomial may be approximately
zero for some value of x. Evaluating the polynomial
at such a point may yield the value `1.056122122394809e-8`
while the correct answer is `1.056122122449e-8`.

In a higher degree polynomial, it is almost impossible to determine the order in which terms should be added and consequently, it is more appropriate to avoid such a situation altogether.

In order to avoid these problems, we will introduce what at first appears to be a non-answer: Newton polynomials. We will then look at a very efficient means of evaluating polynomials: Horner's rule.

Given a single point (x_{1}, y_{1}), it is easy to find the interpolating polynomial of degree zero:

p_{0}(x) = y_{1}.

This is shown in Figure 5.

Figure 5. The constant function passing through (x_{1}, y_{1}).

This seems anti-climactic, but it is the first step to finding
a Newton polynomial. The next question is: given a second point
(x_{2}, y_{2}), instead of finding a
completely new polynomial, can we find a *corrector* function
which we can add to p_{0}(x) such
that the sum p_{1}(x)
passes through both points? This is demonstrated in Figure 6 where
we are trying to find the purple function which we can add to the
red line.

The first step is determining that the corrector must be zero at x_{1} and therefore it must be of the form c_{1}(x - x_{1}).

Figure 6. Finding the correcting term c_{1}(x - x_{1}) which,
when added to p_{0}(x), causes the result to pass through
(x_{2}, y_{2}).

The next step is to find the coefficient c_{1} so that it forces p_{1}(x_{2}) = y_{1} + c_{1}(x_{2} - x_{1}) = y_{2}.

We can solve this for c_{1} to get that

c_{1} = (y_{2} - y_{1})/(x_{2} - x_{1}).

Finally, in order to conclude this example, we will add a third point (x_{3}, y_{3}) and thus we must find the next quadratic function which causes the resulting sum to pass through the third point. This is shown in Figure 7 where we must now find the orange function so that when it is added to the blue function, the resulting sum (magenta) passes through all three points.

Figure 7. Finding the correcting term c_{2}(x - x_{1})(x - x_{2}) which,
when added to p_{1}(x), causes the result to pass through
(x_{3}, y_{3}).

To find this second corrector, we begin by noting it must be zero at the two points x_{1} and x_{2}. This requires that it must be of the form c_{2}(x - x_{1})(x - x_{2}).

We also require that

p_{2}(x_{3}) = p_{1}(x_{3}) + c_{2}(x_{3} - x_{1})(x_{3} - x_{2}) = y_{3},

or, in other words,

y_{1} + c_{1}(x_{3} - x_{1}) + c_{2}(x_{3} - x_{1})(x_{3} - x_{2}) = y_{3}.

In order to find an appropriate solution, we will use the trick of adding *zero*; that is,
we will replace x_{3} - x_{1} in the second term with (x_{3} - x_{2}) + (x_{2} - x_{1}):

y_{1} + c_{1}((x_{3} - x_{2}) + (x_{2} - x_{1})) + c_{2}(x_{3} - x_{1})(x_{3} - x_{2}) = y_{3}.

We can now distribute the multiplier:

y_{1} + c_{1}(x_{3} - x_{2}) + c_{1}(x_{2} - x_{1}) + c_{2}(x_{3} - x_{1})(x_{3} - x_{2}) = y_{3}.

The third term now simplifies to

c_{1}(x_{2} - x_{1}) = (y_{2} - y_{1})/(x_{2} - x_{1}) × (x_{2} - x_{1}) = y_{2} - y_{1}

and we can now cancel the two y_{1} which appear on the left-hand side:

y_{2} + c_{1}(x_{3} - x_{2}) + c_{2}(x_{3} - x_{1})(x_{3} - x_{2}) = y_{3}.

With the next step, we subtract y_{2} from each side and then divide through by x_{3} - x_{2} to get

c_{1} + c_{2}(x_{3} - x_{1}) = (y_{3} - y_{2})/(x_{3} - x_{2}).

Taking the first term on the left-hand side to the other and dividing through by x_{3} - x_{1} yields

c_{2} = ((y_{3} - y_{2})/(x_{3} - x_{2}) - (y_{2} - y_{1})/(x_{2} - x_{1}))/(x_{3} - x_{1}).
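The coefficient formulas can be checked numerically. A sketch in Python, using the same three points that appear in the Matlab example later in this section (`c1` and `c2` name the first- and second-order Newton coefficients):

```python
x = [-1.0, 2.0, 3.0]
y = [-1.4, 1.3, 5.4]

c1 = (y[1] - y[0]) / (x[1] - x[0])                          # first coefficient
c2 = ((y[2] - y[1]) / (x[2] - x[1]) - c1) / (x[2] - x[0])   # second coefficient

# The resulting quadratic interpolates all three points:
def p2(t):
    return y[0] + c1 * (t - x[0]) + c2 * (t - x[0]) * (t - x[1])

print(c1, c2)              # approximately 0.9 and 0.8
print([p2(t) for t in x])  # approximately [-1.4, 1.3, 5.4]
```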

What you see here is a systematic mechanism for finding the coefficients. Consider the following approach: define the first divided differences

y'_{1} = (y_{2} - y_{1})/(x_{2} - x_{1})

and

y'_{2} = (y_{3} - y_{2})/(x_{3} - x_{2}).

Next, we see that we may write

c_{1} = y'_{1} and c_{2} = (y'_{2} - y'_{1})/(x_{3} - x_{1}) = y''_{1}.

To find these in a straightforward manner, consider that we can calculate the values using the format in Figure 8.

Figure 8. Finding three Newton polynomial coefficients.

The interpolating Newton polynomial is now

p_{2}(x) = y_{1} + y'_{1}(x - x_{1}) + y''_{1}(x - x_{1})(x - x_{2}).

Similarly, we can find the polynomial which interpolates a fourth point (x_{4}, y_{4}) by calculating the table in Figure 9 to yield the polynomial

p_{3}(x) = y_{1} + y'_{1}(x - x_{1}) + y''_{1}(x - x_{1})(x - x_{2}) + y'''_{1}(x - x_{1})(x - x_{2})(x - x_{3}).

Figure 9. Finding the Newton polynomial coefficients for interpolating four points.

You can read more about Newton polynomials at the appropriate chapter of Numerical Analysis for Engineers.

We must now calculate these coefficients in Matlab. Consider the vectors

```matlab
x = [-1 2 3]';
y = [-1.4 1.3 5.4]';
```

Notice that if we calculate

```matlab
>> y1 = (y(2:3) - y(1:2)) ./ (x(2:3) - x(1:2))
y1 =
    0.9000
    4.1000
```

You will notice that y1(1) = y'_{1} = 0.9 and y1(2) = y'_{2} = 4.1. Similarly, we may calculate

```matlab
>> y2 = (y1(2) - y1(1)) ./ (x(3) - x(1))
y2 = 0.8
```

which is again y''_{1} = 0.8.

We will create an m×m matrix Y whose first column contains the entries

`Y(:,1) = y`

and then, using Matlab efficiently, calculate the second column as the first divided differences

`Y(1:2, 2) = (Y(2:3, 1) - Y(1:2, 1)) ./ (x(2:3) - x(1:2))`

and finally we will calculate the last entry

`Y(1, 3) = (Y(2, 2) - Y(1, 2)) ./ (x(3) - x(1))`.

We may extract the coefficients of the Newton polynomial by accessing the entries in the first row.

The algorithm itself may be used for an arbitrary number of points. For example, with five points, we would begin with the first column

`Y(:,1) = y`

and then calculate, in order, the columns of first, second, and third divided differences

`Y(1:4, 2)`, `Y(1:3, 3)`, `Y(1:2, 4)`,

and finally

`Y(1, 5)`.

We can read the Newton polynomial off of the first row:

p(x) = Y(1,1) + Y(1,2)(x - x_{1}) + Y(1,3)(x - x_{1})(x - x_{2}) + Y(1,4)(x - x_{1})(x - x_{2})(x - x_{3}) + Y(1,5)(x - x_{1})(x - x_{2})(x - x_{3})(x - x_{4}).

The code for computing this matrix is given by

```matlab
function [Y] = newtonpoly( x, y )
    [m n] = size( x );

    if ( m ~= 1 && n ~= 1 )
        throw( MException( 'MATLAB:invalid_argument', ...
               'The arguments must be vectors' ) );
    end

    if ( ~all( size( y ) == size( x ) ) )
        throw( MException( 'MATLAB:invalid_argument', ...
               'The vectors ''x'' and ''y'' must have the same dimensions' ) );
    end

    m = max( m, n );
    Y = zeros( m, m );
    Y(:,1) = y;

    for k = 2:m
        Y(1:(end - k + 1), k) = ( ...
            Y(2:(end - k + 2), k - 1) - Y(1:(end - k + 1), k - 1) ...
        ) ./ (x(k:end) - x(1:(end - k + 1)));
    end
end
```

A very quick trial run demonstrates that this works:

```matlab
>> newtonpoly( x, y )
ans =
   -1.4000    0.9000    0.8000
    1.3000    4.1000         0
    5.4000         0         0
```

Thus, the Newton interpolating polynomial is

p(x) = -1.4 + 0.9(x + 1) + 0.8(x + 1)(x - 2).
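The divided-difference table generalizes readily. As a cross-check on `newtonpoly`, here is a sketch of the same computation in Python:

```python
def newton_table(x, y):
    """Build the triangular table of divided differences; the first
    row holds the coefficients of the Newton polynomial."""
    m = len(x)
    Y = [[0.0] * m for _ in range(m)]
    for i in range(m):
        Y[i][0] = y[i]
    for k in range(1, m):
        for i in range(m - k):
            Y[i][k] = (Y[i + 1][k - 1] - Y[i][k - 1]) / (x[i + k] - x[i])
    return Y

Y = newton_table([-1.0, 2.0, 3.0], [-1.4, 1.3, 5.4])
print(Y[0])   # approximately [-1.4, 0.9, 0.8], matching newtonpoly
```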

Suppose we now add a fourth point, (5, 7.6):

```matlab
x = [x; 5];
y = [y; 7.6];
```

When calculating the coefficients of the Newton polynomial, we get

```matlab
>> newtonpoly( x, y )
ans =
   -1.4000    0.9000    0.8000   -0.3000
    1.3000    4.1000   -1.0000         0
    5.4000    1.1000         0         0
    7.6000         0         0         0
```

You will notice that the six non-zero entries remained unchanged and the only entries which were added were those on the anti-diagonal. Thus, the Newton interpolating polynomial is now

p(x) = -1.4 + 0.9(x + 1) + 0.8(x + 1)(x - 2) - 0.3(x + 1)(x - 2)(x - 3).

Now comes the more important question: how do we evaluate this polynomial? This seems complicated, but using Horner's rule, it is actually a very fast algorithm: it runs in O(n) time, where n is the number of points (that is, the number of terms in the polynomial).

If we have the polynomial a_{1}x^{2} + a_{2}x + a_{3} defined by the vector `a = [a1 a2 a3]`, it
may be evaluated as follows:

```matlab
% Evaluate the polynomial 'a' at the point 'x'
term = 1;
y = 0;

for k = length(a):-1:1
    y = y + a(k)*term;
    term = term * x;
end
```

With the first iteration, y = a_{3}; with the second iteration, y = a_{3} + a_{2}x; and finally with the last iteration, y = a_{3} + a_{2}x + a_{1}x^{2}.

We can do away with the local variable `term` by using the
backward approach:

```matlab
% Evaluate the polynomial 'a' at the point 'x'
y = a(1);

for k = 2:length(a)
    y = y*x + a(k);
end
```

With the initialization, y = a_{1}; with the first iteration, y = a_{1}x + a_{2}; and with the last iteration, y = (a_{1}x + a_{2})x + a_{3} = a_{1}x^{2} + a_{2}x + a_{3}.
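The backward approach translates directly to other languages; a sketch in Python:

```python
def horner(a, x):
    """Evaluate a[0]*x^(n-1) + ... + a[n-1] using Horner's rule."""
    y = a[0]
    for coeff in a[1:]:
        y = y * x + coeff
    return y

print(horner([1, 2, 3], 3.2))   # approximately 19.64  (x^2 + 2x + 3 at 3.2)
print(horner([1, 2, 3], 3.5))   # 22.25
```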

In the laboratory, you will write a function `function [y] = horners( a, x )`
which evaluates the polynomial defined by the vector `a`
at the point `x` and assigns the result to the variable `y`.

Your implementation must, in addition to evaluating the polynomial
at the point `x`, allow the user to evaluate the polynomial
at each point of a vector or matrix.

Your function must have the following output:

```matlab
>> format long
>> horners( [1 2 3], 3.2 )
ans = 19.640000000000001
>> horners( [1 2 3], 3.5 )
ans = 22.250000000000000
```

Your function should also be able to perform the following:

```matlab
>> format
>> horners( [1 2 3], 0:0.1:1 )
ans =
    3.0000 3.2100 3.4400 3.6900 3.9600 4.2500 4.5600 4.8900 5.2400 5.6100 6.0000
```

If we define `xs` to be 100 equally spaced points on the interval
[-3, 3], we can also plot the polynomial on that interval:

```matlab
xs = linspace( -3, 3, 100 );
ys = horners( [1 2 3], xs );
plot( xs, ys );
ylim( [min( ys ) - 1, max( ys ) + 1] );
title( 'The polynomial x^2 + 2x + 3 plotted on the interval [-3, 3]' );
```

The output of this sequence of commands should be a plot similar to Figure 10.

Figure 10. The output of Matlab when plotting a polynomial.

What is the purpose of the statement `ylim( [min( ys ) - 1, max( ys ) + 1] );`?

Recall that in Matlab, the function defined
by `function [y] = f(x)` must be saved in the
file `f.m`. Many Matlab functions are also implemented in a
similar fashion. Take a look at the Matlab library function
`polyval.m` and determine how
it evaluates polynomials.

During the lab, you will modify `horners` to evaluate Newton
polynomials.

You can read more about Horner's rule at the appropriate chapter of Numerical Analysis for Engineers.