
## Laboratory 5

This laboratory will investigate two topics:

1. The double-precision floating-point format (MATH 211), and
2. Newton Interpolation and Horner's Rule (MATH 215).

# 5.1 The Double-Precision Floating-Point Format

The IEEE Standard for Floating-Point Arithmetic (IEEE 754-2008) is a ubiquitous standard for numerical computations. The first version, IEEE 754-1985, was the collaborative work of numerous individuals from many different companies with potentially conflicting interests and loyalties; however, led by Dr. William Kahan, the committee was able to produce a remarkable document. In a 1998 interview, Dr. Kahan spoke about some of the issues which had to be dealt with in creating the original standard.

In the MATH 211 Laboratory 1, you looked at a decimal floating-point representation in Figure 1.

Figure 1. The six-decimal-digit floating-point representation.

In this Laboratory, we will look at the double-precision floating-point representation which is implemented as a double in the C, C#, C++, and Java programming languages.

In this laboratory, you may need the following table of hexadecimal digits and their binary equivalents:

Hexadecimal   Binary
    0          0000
    1          0001
    2          0010
    3          0011
    4          0100
    5          0101
    6          0110
    7          0111
    8          1000
    9          1001
    a          1010
    b          1011
    c          1100
    d          1101
    e          1110
    f          1111

## 5.1.1 Required Background

Students must be familiar with the binary and hexadecimal numbers taught in ECE 124 and the decimal floating-point format taught in the MATH 211 Laboratory 1.

## 5.1.2 The Double-precision Floating-point Representation

The double-precision floating-point representation, or double, derives its name from the fact that each value occupies 8 bytes, twice the 4 bytes which a single-precision floating-point value (or float) uses. It has been recognized that for most computations, the single-precision floating-point representation does not store sufficient precision. Unless you are sure that a single-precision floating-point number is appropriate, use double.

Like the representation given in the MATH 211 Laboratory 1, the 8 bytes = 8×8 bits = 64 bits are divided into a single sign bit, 11 bits for the exponent, and the remaining 52 bits for the mantissa. This is shown in Figure 2.

Figure 2. The double-precision floating-point format.

Any 64-bit number may be written as a 16-hexadecimal digit number. Therefore, when a double is written as a hexadecimal number, the first three hexadecimal digits represent the sign bit and the exponent. Figures 3 and 4 represent the interpretation of these as numbers.

Figure 3. The double-precision floating-point representation using 64 bits.

Figure 4. The double-precision floating-point representation using 16 hexadecimal digits.

Matlab will display the underlying representation of a double by using the command format hex, for example:

format long
1
ans =
1
-2
ans =
-2
pi
ans =
3.14159265358979
format hex
1
ans =
3ff0000000000000
-2
ans =
c000000000000000
pi
ans =
400921fb54442d18
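These bit patterns can also be inspected outside of Matlab. As a sketch, the following Python snippet (the helper name double_to_hex is our own) uses the standard struct module to produce the same 16 hexadecimal digits:

```python
import struct
import math

def double_to_hex(x):
    """Return the 16-hexadecimal-digit representation of a double."""
    # '>d' packs the value as a big-endian IEEE 754 double (8 bytes).
    return struct.pack('>d', x).hex()

print(double_to_hex(1.0))      # 3ff0000000000000
print(double_to_hex(-2.0))     # c000000000000000
print(double_to_hex(math.pi))  # 400921fb54442d18
```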


### 5.1.2.1 Sign Bit

To begin, the first bit is the sign bit: 0 represents positive numbers and 1 represents negative numbers. Therefore, in the hexadecimal representation, if the first three hexadecimal digits are between 800 and fff, the number is negative.

Unless you are absolutely sure that the precision of a double is not needed, do not use float.

### 5.1.2.2 Exponent

The next 11 bits represent the exponent. Thus, the exponent can be a value between 00000000000₂ and 11111111111₂, or 0₁₀ and 2047₁₀. As with the decimal format, the double format uses a bias and, based on the first three hexadecimal digits of 1, i.e., 3FF, it is apparent that the bias is 01111111111₂ or 1023₁₀.

Two exponents are reserved for special representations: 00000000000₂ = 0₁₀ and 11111111111₂ = 2047₁₀. Thus, the exponent represents numbers in the range 2^(1 − 1023) = 2^-1022, ..., 2^(2046 − 1023) = 2^1023, which is from 2.2251×10^-308 to 8.9885×10^307.

### 5.1.2.3 The Mantissa

Recall that the decimal floating-point format requires that the first digit of the mantissa be non-zero. This was required to ensure a unique representation. With binary numbers, the only non-zero bit is 1, and therefore it is unnecessary to store this bit. Thus, the mantissas of the three numbers 1, -2, and π are 1.0000000000000₁₆, 1.0000000000000₁₆, and 1.921fb54442d18₁₆, respectively. In binary, the last mantissa is 1.1001001000011111101101010100010001000010110100011000₂.
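The three fields can be split apart with a few bit operations. As a Python sketch (the variable names are our own), we unpack the 64 bits of pi and recover its sign, unbiased exponent, and mantissa:

```python
import struct
import math

# Interpret the 64 bits of pi as one unsigned integer.
bits = int.from_bytes(struct.pack('>d', math.pi), 'big')

sign     = bits >> 63              # the single sign bit
exponent = (bits >> 52) & 0x7ff    # 11 bits, biased by 1023
mantissa = bits & ((1 << 52) - 1)  # 52 bits, with an implied leading 1

print(sign)             # 0, since pi is positive
print(exponent - 1023)  # 1, since pi = 1.921fb54442d18 (hex) x 2^1
print(hex(mantissa))    # 0x921fb54442d18
```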

### 5.1.2.4 Zero and Denormalized Numbers

Previously, it was commented that the exponent 00000000000₂ is reserved for special representations. This occurs when the first three hexadecimal digits are 000₁₆ or 800₁₆. In this case, if the mantissa is XXXXXXXXXXXXX₁₆, it represents the denormalized number ±0.XXXXXXXXXXXXX₁₆ × 2^-1022.

For example, the smallest normalized number occurs when the first three hexadecimal digits are 001, i.e., the exponent is 1, giving 1.0000000000000₁₆ × 2^(1 − 1023) = 2^-1022, which is demonstrated here:

format hex
2^(-1022)
ans =
0010000000000000


For any number smaller than this, we use denormalized numbers:

2^-1023
ans =
0008000000000000

ans/2
ans =
0004000000000000


The first number is 0.8000000000000₁₆ × 2^-1022 = 0.1₂ × 2^-1022 = 2^-1023, while the second number is 0.4000000000000₁₆ × 2^-1022 = 0.01₂ × 2^-1022 = 2^-1024.

For example, to interpret the number 0000580000000000, we note the first three hexadecimal digits are 000 and consequently, this is a denormalized number. The mantissa is 0580000000000₁₆, which represents the binary number 0000010110000...0₂. Interpreting this as a floating-point number, we have 0.000001011₂ × 2^-1022. You should recall the following values:

Binary          Decimal
0.000001        0.015625
0.00000001      0.00390625
0.000000001     0.001953125

In binary, the mantissa represents 0.000001011₂ = 0.015625 + 0.00390625 + 0.001953125 = 0.021484375. If we enter this into Matlab, we get the desired result:

0.021484375 * 2^-1022
ans =
0000580000000000
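The same check can be performed in Python; this sketch builds 0.021484375 from its exact binary components and confirms the denormalized bit pattern:

```python
import struct

# 0.021484375 = 2^-6 + 2^-8 + 2^-9 = 0.000001011 in binary; every
# term is a power of two, so the sum is exact in floating point.
value = 2**-6 + 2**-8 + 2**-9
print(value)   # 0.021484375

# Scaling by 2^-1022 produces the denormalized number discussed above.
print(struct.pack('>d', value * 2**-1022).hex())   # 0000580000000000
```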


The smallest positive number which can be represented is 0000000000000001, which is 0.000...0001₂ × 2^-1022 with 51 zeros following the radix point before the final 1. That is, it is the number 2^-52 × 2^-1022 = 2^-1074. We can verify this in Matlab:

2^-1074
ans =
0000000000000001

ans / 2
ans =
0000000000000000


The last answer is, of course, the representation of 0.
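The same underflow can be observed in Python; a minimal sketch:

```python
import struct

smallest = 2**-1074   # the smallest positive double, a denormalized number

print(struct.pack('>d', smallest).hex())       # 0000000000000001
print(struct.pack('>d', smallest / 2).hex())   # 0000000000000000
print(smallest / 2 == 0)                       # True: the result underflows to zero
```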

### 5.1.2.5 Infinity and Not-a-Number

The other reserved exponent is the largest, 11111111111₂ = 2047₁₀, corresponding to the first three hexadecimal digits being 7ff or fff. If the number is plus or minus infinity, the mantissa is zero:

format hex
Inf
ans =
7ff0000000000000

-Inf
ans =
fff0000000000000


This does not represent the mathematical infinity, but rather, any floating-point number with magnitude 2^1024 or larger:

format
2^1023
ans =
8.9885e+307

2^1024
ans =
Inf


The properties are as expected:

Inf + 10
ans =
Inf

1/0
ans =
Inf

-3 * Inf
ans =
-Inf

1/Inf
ans =
0


There are some mathematical operations which are not defined. For example, what is Inf - Inf or 0/0 or even 0 * Inf? It is not reasonable to represent any of these as ±Inf, and therefore there is one final number: NaN, or not-a-number. This is what you would have referred to in high school as undefined:

0/0
ans =
NaN

Inf - Inf
ans =
NaN

Inf * 0
ans =
NaN

format hex
NaN
ans =
fff8000000000000


As you can see in the last example, the representation is close to that of -Inf except that the mantissa is all zeros except for a leading 1 bit.

NaN has one particularly interesting property according to the IEEE 754 standard: when compared to itself, it must return false:

3 == 3
ans =
1

Inf == Inf
ans =
1

pi == pi
ans =
1

NaN == NaN
ans =
0


Thus, to detect NaN, one must make an explicit function call:

isnan( [0 1 -1 Inf -Inf NaN pi] )
ans =
0   0   0   0   0   1   0
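The same behaviour can be sketched in Python. Note that Python raises an exception for 0/0 rather than returning NaN, but arithmetic on infinities still produces NaN as IEEE 754 requires, and math.isnan plays the role of isnan:

```python
import math

# Inf - Inf is undefined, so the result is NaN.
nan = math.inf - math.inf

print(math.isnan(nan))           # True
print(nan == nan)                # False: NaN never compares equal to itself
print(math.isnan(math.inf * 0))  # True: 0 * Inf is also undefined

# Mirror of isnan( [0 1 -1 Inf -Inf NaN pi] ) in Matlab:
print([math.isnan(v) for v in [0.0, 1.0, -1.0, math.inf, -math.inf, nan, math.pi]])
```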


### 5.1.2.6 Comparisons

A double-precision floating-point number is non-negative if the first hexadecimal digit is 7 or less, and negative otherwise.

Two double precision floating-point numbers are equal if and only if all the bits are equal.

Because of denormalized numbers, two doubles x and y are equal if and only if (x - y) == 0 evaluates to true (1). For example, both sqrt(2) and sqrt(8)/2 represent the same number and evaluate to the same value:

format
(sqrt(2) - sqrt(8)/2) == 0
ans =
1
format hex
sqrt(2)
ans =
3ff6a09e667f3bcd

sqrt(8)/2
ans =
3ff6a09e667f3bcd


On the other hand, while sqrt(18)/3 should also equal sqrt(2), due to numerical error, they are not precisely the same:

format
(sqrt(2) - sqrt(18)/3) == 0
ans =
0
format hex
sqrt(2)
ans =
3ff6a09e667f3bcd

sqrt(18)/3
ans =
3ff6a09e667f3bcc


In addition, two positive numbers may be compared by simply comparing their bits. For example, given the hexadecimal representations of nine numbers:

52832e8d36bf5eed
15b7e09834b53d11
2bd1f986a855506b
17033585aedd9016
1b9b8cfca975ca98
5d191fca8e5f7883
5180e4696a543549
2f71833e25f42212
6887272068f2910d


It is immediately clear that they are, from smallest to largest:

15b7e09834b53d11
17033585aedd9016
1b9b8cfca975ca98
2bd1f986a855506b
2f71833e25f42212
5180e4696a543549
52832e8d36bf5eed
5d191fca8e5f7883
6887272068f2910d


When we look at each of these, we see that the ordering is correct:

Representation      Actual Value
52832e8d36bf5eed    3.052664278175422e+089
15b7e09834b53d11    4.759798560884662e-204
2bd1f986a855506b    1.314871551396497e-097
17033585aedd9016    8.030437775543794e-198
1b9b8cfca975ca98    1.087816536642441e-175
5d191fca8e5f7883    2.991919959128576e+140
5180e4696a543549    4.102002070144849e+084
2f71833e25f42212    3.692440052604274e-080
6887272068f2910d    3.380270393986167e+195
These are operations which may be performed very quickly in hardware.
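This bit-pattern ordering can be demonstrated in Python; a sketch (the helper name bits is our own) checks that, for positive doubles, numerical order agrees with the order of the underlying bit patterns:

```python
import struct

def bits(x):
    """Interpret the 64 bits of a double as an unsigned integer."""
    return int.from_bytes(struct.pack('>d', x), 'big')

# A few positive doubles spanning many orders of magnitude.
values = [4.759798560884662e-204, 3.052664278175422e+89,
          1.314871551396497e-97, 2.991919959128576e+140,
          3.692440052604274e-80]

# Sorting by value and sorting by bit pattern give the same order.
assert sorted(values) == sorted(values, key=bits)
```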

### 5.1.2.7 Boolean Operations in Matlab

As in C and C++, 0 is considered to be equivalent to false and any other number is considered to be true. Operators such as ==, <=, or > do not return Boolean values; rather, they return 0 or 1.

You may ask: I can tell whether any one entry within a matrix is NaN, but how can I tell if no entries are NaN?

This introduces two column-based matrix functions (like sum, max, etc.): the any and all functions.

The any function returns 1 for a column if any of the entries of the column are non-zero. The all function returns 1 for a column if all entries are non-zero. For example:

M = [1 0 0 0; 1 1 0 0; 1 1 1 0]
M =
1     0     0     0
1     1     0     0
1     1     1     0

any( M )
ans =
1     1     1     0

all( M )
ans =
1     0     0     0


Thus, as a quick test to determine whether no entries of a matrix M are NaN, use:

~any( any( isnan( M ) ) )


The ~ is the Matlab equivalent of not. It is the same as the C/C++/Java/C# !.

The first call to any returns 1 for each column which contains a NaN. If any column contained a NaN, the result of the first call would be a vector with a 1 in that column, and therefore the second call to any would also return 1. The negation of 1 is 0.
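The nested test translates naturally into Python, where the built-in any plays a similar role; a minimal sketch, representing the matrix as a list of rows:

```python
import math

M = [[1.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, float('nan'), 0.0],
     [1.0, 1.0, 1.0, 0.0]]

# Mirror of ~any(any(isnan(M))): the inner any scans one row, the
# outer any combines the rows, and `not` negates the final result.
no_nans = not any(any(math.isnan(entry) for entry in row) for row in M)
print(no_nans)   # False, since M contains a NaN
```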

### 5.1.2.8 While Loops

Boolean tests can be used in if statements, but they can also be used in while loops:

format long
s1 = 0;
s2 = 1;
i = 2;
while ( s1 ~= s2 )
    s1 = s2;
    s2 = s2 + 1/i^2;
    i = i + 1;
end
% this executes for a while...
s2
s2 =
1.644934057834575
i
i =
94906267

pi^2/6
ans =
1.644934066848226


This is not the way to test for convergence: we generally do not require this much accuracy and, in some cases, the loop may never terminate even though the approximation is already close enough.
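The same stopping criterion can be sketched in Python. The loop above needs roughly 10^8 iterations, so this sketch instead applies the criterion to the geometric series 1 + 1/2 + 1/4 + ..., which reaches the point where adding the next term no longer changes the sum after only about 54 iterations:

```python
# Stop when adding the next term no longer changes the partial sum,
# i.e., when the term falls below the precision of a double.
s1, s2, i = 0.0, 1.0, 1
while s1 != s2:
    s1 = s2
    s2 = s2 + 0.5**i
    i = i + 1

print(s2)   # 2.0, the limit of the geometric series
```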

# 5.2 Newton Polynomials and Horner's Rule

In this section, we will see a new means of finding and evaluating interpolating polynomials.

## 5.2.1 Subtractive Cancellation and Polynomials

In the previous laboratory, you saw how to calculate interpolating polynomials; for example, finding the quadratic polynomial which interpolates three given points.

There are problems, however, in evaluating such polynomials at an arbitrary point x. For example, it is not possible to avoid subtractive cancellation: the value of the polynomial may be approximately zero for some value of x. Evaluating the interpolating polynomial at such a point may yield the value 1.056122122394809e-8 while the correct answer is 1.056122122449e-8: only the leading significant digits are correct.

In a higher degree polynomial, it is almost impossible to determine the order in which terms should be added and consequently, it is more appropriate to avoid such a situation altogether.

In order to avoid these problems, we will introduce what at first appears to be a non-answer: Newton polynomials. We will then look at a very efficient means of evaluating polynomials: Horner's rule.

## 5.2.2 Newton Polynomials

Given a single point (x1, y1), it is easy to find the interpolating polynomial of degree zero:

p0(x) = y1.

This is shown in Figure 5.

Figure 5. The constant function p0(x) = y1 passing through (x1, y1).

This seems anti-climactic, but it is the first step to finding a Newton polynomial. The next question is: given a second point (x2, y2), instead of finding a completely new polynomial, can we find a correcting function c1(x) which we can add to p0(x) such that p1(x) = p0(x) + c1(x) passes through both points? This is demonstrated in Figure 6, where we are trying to find the purple function which we can add to the red line.

The first step is determining that c1(x) must be zero at x1 and therefore c1(x) = a1(x - x1) for some coefficient a1.

Figure 6. Finding the correcting term c1(x) which, when added to p0(x), causes the result to pass through (x2, y2).

The next step is to find the coefficient a1 so that it forces

p1(x2) = y1 + a1(x2 - x1) = y2.

We can solve this for a1 to get that

a1 = (y2 - y1)/(x2 - x1).

Finally, in order to conclude this example, we will add a third point (x3, y3) and thus we must find the next quadratic correcting function c2(x) which causes the resulting sum p2(x) = p1(x) + c2(x) to pass through the third point. This is shown in Figure 7, where we must now find the orange function so that, when it is added to the blue function, the resulting sum (magenta) passes through all three points.

Figure 7. Finding the correcting term c2(x) which, when added to p1(x), causes the result to pass through (x3, y3).

To find this second polynomial c2(x), we begin by noting it must be zero at the two points x1 and x2. This requires that it must be of the form c2(x) = a2(x - x1)(x - x2).

We also require that

p2(x3) = y3,

or, in other words,

y1 + a1(x3 - x1) + a2(x3 - x1)(x3 - x2) = y3.

In order to find an appropriate solution, we will use the trick of adding zero; that is, we will replace x3 - x1 in the second term with (x3 - x2) + (x2 - x1):

y1 + a1((x3 - x2) + (x2 - x1)) + a2(x3 - x1)(x3 - x2) = y3.

We can now distribute the multiplier a1:

y1 + a1(x3 - x2) + a1(x2 - x1) + a2(x3 - x1)(x3 - x2) = y3.

The third term now simplifies to a1(x2 - x1) = ((y2 - y1)/(x2 - x1))(x2 - x1) = y2 - y1, and we can now cancel the two y1 which appear on the left-hand side:

a1(x3 - x2) + y2 + a2(x3 - x1)(x3 - x2) = y3.

With the next step, we subtract y2 from each side and then divide through by x3 - x2 to get

a1 + a2(x3 - x1) = (y3 - y2)/(x3 - x2).

Taking the first term on the left-hand side to the other side and dividing through by x3 - x1 yields

a2 = ((y3 - y2)/(x3 - x2) - a1)/(x3 - x1) = ((y3 - y2)/(x3 - x2) - (y2 - y1)/(x2 - x1))/(x3 - x1).

What you see here is a systematic mechanism for finding the coefficients. Consider the following approach: define

a1 = (y2 - y1)/(x2 - x1)

and

a2 = ((y3 - y2)/(x3 - x2) - (y2 - y1)/(x2 - x1))/(x3 - x1).

Next, we see that we may write

p2(x) = y1 + a1(x - x1) + a2(x - x1)(x - x2).

To find these coefficients in a straightforward manner, consider that we can calculate the values using the format in Figure 8.

Figure 8. Finding three Newton polynomial coefficients.

The interpolating Newton polynomial is now

p2(x) = y1 + a1(x - x1) + a2(x - x1)(x - x2).

Similarly, we can find the polynomial which interpolates a fourth point (x4, y4) by calculating the table in Figure 9 to yield the polynomial

p3(x) = p2(x) + a3(x - x1)(x - x2)(x - x3).

Figure 9. Finding the Newton polynomial coefficients for interpolating four points.

You can read more about Newton polynomials at the appropriate chapter of Numerical Analysis for Engineers.

## 5.2.3 Newton Polynomials in Matlab

We must now calculate these coefficients in Matlab. Consider the vectors

x = [-1 2 3]';
y = [-1.4 1.3 5.4]';


Notice that if we calculate

y1 = (y(2:3) - y(1:2))./(x(2:3) - x(1:2))
y1 =
0.9000
4.1000


You will notice that y1(1) = (1.3 - (-1.4))/(2 - (-1)) = 0.9 and y1(2) = (5.4 - 1.3)/(3 - 2) = 4.1. Similarly, we may calculate

y2 = (y1(2) - y1(1))./(x(3) - x(1))
y2 =
0.8


which is again the second divided difference, (4.1 - 0.9)/(3 - (-1)) = 0.8.

We will create a matrix whose first column contains the entries y1, y2, and y3; then, using Matlab efficiently, calculate the second column as the first divided differences; and finally we will calculate the last entry of the first row as the second divided difference. We may then extract the coefficients of the Newton polynomial by accessing the entries in the first row.

The algorithm itself may be used for an arbitrary number of points. For example, with five points, we would begin with a 5×5 matrix whose first column contains the five y values, and then calculate, in order, the column of four first divided differences, the column of three second divided differences, the column of two third divided differences, and finally the single fourth divided difference. We can then read the coefficients of the Newton polynomial off of the first row.

The code for coming up with this matrix is given by

function [Y] = newtonpoly( x, y )
    [m n] = size( x );

    if ( m ~= 1 && n ~= 1 )
        throw( MException( 'MATLAB:invalid_argument', ...
               'The arguments must be vectors' ) );
    end

    if ( ~all( size( y ) == size( x ) ) )
        throw( MException( 'MATLAB:invalid_argument', ...
               'The vectors ''x'' and ''y'' must have the same dimensions' ) );
    end

    m = max( m, n );

    Y = zeros( m, m );

    % The first column is the vector of y values
    Y(:,1) = y;

    % Each subsequent column holds the next divided differences
    for k = 2:m
        Y(1:(end - k + 1), k) = (                                     ...
            Y(2:(end - k + 2), k - 1) - Y(1:(end - k + 1), k - 1)     ...
        ) ./ (x(k:end) - x(1:(end - k + 1)));
    end
end
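The same table-building algorithm can be sketched in Python (the function name newton_coeffs is our own); each pass of the loop computes the next column of divided differences:

```python
def newton_coeffs(x, y):
    """Build the divided-difference table column by column and
    return its first row: the Newton polynomial coefficients."""
    n = len(x)
    col = list(y)          # first column: the y values themselves
    coeffs = [col[0]]
    for k in range(1, n):
        # k-th divided differences; each pass has one fewer entry
        col = [(col[j + 1] - col[j]) / (x[j + k] - x[j])
               for j in range(n - k)]
        coeffs.append(col[0])
    return coeffs

print(newton_coeffs([-1, 2, 3], [-1.4, 1.3, 5.4]))
# approximately [-1.4, 0.9, 0.8]
```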


A very quick trial run demonstrates that this works:

newtonpoly( x, y )
ans =
-1.4000    0.9000    0.8000
1.3000    4.1000         0
5.4000         0         0


Thus, the Newton interpolating polynomial is

p(x) = -1.4 + 0.9(x + 1) + 0.8(x + 1)(x - 2).

Suppose we now add a fourth point, (5, 7.6):

x = [x; 5];
y = [y; 7.6];


When calculating the coefficients of the Newton polynomial, we get

newtonpoly( x, y )
ans =
-1.4000    0.9000    0.8000   -0.3000
1.3000    4.1000   -1.0000         0
5.4000    1.1000         0         0
7.6000         0         0         0


You will notice that the six non-zero entries remained unchanged and the only entries which were added were those on the anti-diagonal. Thus, the Newton interpolating polynomial is now

p(x) = -1.4 + 0.9(x + 1) + 0.8(x + 1)(x - 2) - 0.3(x + 1)(x - 2)(x - 3).

Now comes the more important question: how do we evaluate this polynomial? This actually seems complicated, but using Horner's rule, it is a very fast algorithm which is O(n), where n is the number of points (that is, the number of terms in the polynomial).

## 5.2.4 Horner's Rule

If we have the polynomial

p(x) = a1*x^2 + a2*x + a3

defined by the vector a = [a1 a2 a3], it may be evaluated efficiently as follows:

% Evaluate the polynomial 'a' at the point 'x'
term = 1;
y = 0;

for k = length(a):-1:1
    y = y + a(k)*term;
    term = term * x;
end


With the first iteration, y = a3 and term = x; with the second iteration, y = a3 + a2*x and term = x^2; and finally, with the last iteration, y = a3 + a2*x + a1*x^2.

We can do away with the local variable term by instead writing the polynomial in the nested form p(x) = (a1*x + a2)*x + a3:

% Evaluate the polynomial 'a' at the point 'x'
y = a(1);

for k = 2:length(a)
    y = y*x + a(k);
end


Before the loop, y = a1; with the first iteration, y = a1*x + a2; and finally, with the last iteration, y = (a1*x + a2)*x + a3.

During the laboratory, you will write a function horners( a, x ) which evaluates the polynomial defined by the vector a at the point x and assigns the result to the variable y.
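As a sketch, the nested form translates directly into Python (the function name horner is our own):

```python
def horner(a, x):
    """Evaluate a[0]*x^(n-1) + ... + a[n-1] using Horner's rule."""
    y = a[0]
    for coeff in a[1:]:
        # Each step multiplies by x and adds the next coefficient.
        y = y * x + coeff
    return y

print(horner([1, 2, 3], 3.5))   # 22.25, i.e., 3.5^2 + 2*3.5 + 3
```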

Your implementation must, in addition to evaluating the polynomial at the point x, allow the user to evaluate the polynomial at each point of a vector or matrix.

Your function must have the following output:

format long
horners( [1 2 3], 3.2 )
ans =
19.640000000000001

horners( [1 2 3], 3.5 )
ans =
22.250000000000000


Your function should also be able to perform the following:

format
horners( [1 2 3], 0:0.1:1 )
ans =
3.0000 3.2100 3.4400 3.6900 3.9600 4.2500 4.5600 4.8900 5.2400 5.6100 6.0000


If we define xs to be 100 equally spaced points on the interval [-3, 3], we can also plot the polynomial on that interval:

xs = linspace( -3, 3, 100 );
ys = horners( [1 2 3], xs );
plot( xs, ys );
ylim( [min( ys ) - 1, max( ys ) + 1] );
title( 'The polynomial x^2 + 2x + 3 plotted on the interval [-3, 3]' );


The output of this sequence of commands should be a plot similar to Figure 10.

Figure 10. The output of Matlab when plotting a polynomial.

What is the purpose of the statement ylim( [min( ys ) - 1, max( ys ) + 1] );?

Recall that in Matlab, the function defined by function [y] = f(x) must be saved in the file f.m. Many Matlab functions are also implemented in a similar fashion. Take a look at the Matlab library function polyval.m and determine how it evaluates polynomials.

During the lab, you will modify horners to evaluate Newton polynomials.

You can read more about Horner's rule at the appropriate chapter of Numerical Analysis for Engineers.