This laboratory will investigate two topics: the double-precision floating-point representation, and the interpolation of points by Newton polynomials evaluated using Horner's rule.
The IEEE Standard for Floating-point Arithmetic (IEEE 754-2008) is a ubiquitous standard for numerical computations. The first version, IEEE 754-1985, was the collaborative work of numerous individuals from many different companies with potentially conflicting interests and loyalties; however, led by Dr. William Kahan, the committee was able to produce a remarkable document. In a 1998 interview, Dr. Kahan spoke about some of the issues which had to be dealt with in creating the original standard.
In the MATH 211 Laboratory 1, you looked at a decimal floating-point representation in Figure 1.
Figure 1. The six-decimal-digit floating-point representation.
In this Laboratory, we will look at the double-precision floating-point representation which is implemented as a double in the C, C#, C++, and Java programming languages.
In this laboratory, you may need the table
|Hexadecimal Digit||Binary Equivalent|
|0||0000|
|1||0001|
|2||0010|
|3||0011|
|4||0100|
|5||0101|
|6||0110|
|7||0111|
|8||1000|
|9||1001|
|a||1010|
|b||1011|
|c||1100|
|d||1101|
|e||1110|
|f||1111|
Students must be aware of binary and hexadecimal numbers taught in ECE 124 and the decimal floating-point format taught in the MATH 211 Laboratory 1.
The double-precision floating-point representation, or double, derives its name from the fact that each value occupies 8 bytes, twice the 4 bytes used by the single-precision floating-point representation (or float). It has been recognized that for most computations, the single-precision representation does not store sufficient precision. Unless you are sure that a single-precision floating-point number is appropriate, use a double.
Like the representation given in the MATH 211 Laboratory 1, the 8 bytes = 8×8 bits = 64 bits are divided into a single sign bit, 11 bits for the exponent, and the remaining 52 bits for the mantissa. This is shown in Figure 2.
Figure 2. The double-precision floating-point format.
Any 64-bit number may be written as a 16-hexadecimal digit number. Therefore, when a double is written as a hexadecimal number, the first three hexadecimal digits represent the sign bit and the exponent. Figures 3 and 4 represent the interpretation of these as numbers.
Figure 3. The double-precision floating-point representation using 64 bits.
Figure 4. The double-precision floating-point representation using 16 hexadecimal digits.
Matlab will display a double by using the format hex, for example:
format long
1
ans = 1
-2
ans = -2
pi
ans = 3.14159265358979
format hex
1
ans = 3ff0000000000000
-2
ans = c000000000000000
pi
ans = 400921fb54442d18
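The same bit patterns can be inspected outside Matlab. As a sketch only, in Python (the helper name double_to_hex is our own), the struct module exposes the 8 bytes of a double:

```python
import math
import struct

# Pack a double big-endian and show its 16 hexadecimal digits.
def double_to_hex(x):
    return struct.pack('>d', x).hex()

print(double_to_hex(1.0))      # 3ff0000000000000
print(double_to_hex(-2.0))     # c000000000000000
print(double_to_hex(math.pi))  # 400921fb54442d18
```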
To begin, the first bit is the sign bit: 0 represents positive numbers and 1 represents negative numbers. Therefore, in the hexadecimal representation, if the first three hexadecimal digits are between 800 and fff, the number is negative.
Unless you are absolutely sure that the precision of a double is not needed, do not use float.
The next 11 bits represent the exponent. Thus, the exponent field can hold a value between 00000000000₂ and 11111111111₂, that is, between 0₁₀ and 2047₁₀. As with the decimal format, the double format uses a bias, and based on the first three hexadecimal digits of 1, i.e., 3FF, it is apparent that the bias is 01111111111₂ or 1023₁₀.
Two exponents are reserved for special representations: 00000000000₂ = 0₁₀ and 11111111111₂ = 2047₁₀. Thus, the remaining exponents represent numbers in the range 2^(1 − 1023) = 2^(-1022), ..., 2^(2046 − 1023) = 2^1023, which is from 2.2251×10^(-308) to 8.9885×10^307.
Recall that the decimal floating-point format requires that the first digit of the mantissa be non-zero; this was required to ensure a unique representation. With binary numbers, the only non-zero bit is 1, and therefore it is unnecessary to store this leading bit. Consequently, the mantissas of the three numbers 1, -2, and π are 1.0000000000000₁₆, 1.0000000000000₁₆, and 1.921fb54442d18₁₆, respectively. In binary, the last mantissa is 1.1001001000011111101101010100010001000010110100011000₂.
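The hidden leading bit can be confirmed by unpacking the three fields by hand. The following Python sketch (the field extraction follows IEEE 754; the variable names are our own) rebuilds π from its sign, exponent, and mantissa:

```python
import math
import struct

# Extract the 64 bits of pi as an unsigned integer.
bits = int.from_bytes(struct.pack('>d', math.pi), 'big')

sign     = bits >> 63                # 1 sign bit
exponent = (bits >> 52) & 0x7ff      # 11 exponent bits, biased by 1023
mantissa = bits & ((1 << 52) - 1)    # 52 stored mantissa bits

# The hidden leading 1 is restored when the value is rebuilt.
value = (-1.0)**sign * (1 + mantissa / 2**52) * 2.0**(exponent - 1023)
print(value == math.pi)  # True
```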
Previously, it was commented that the exponent 00000000000₂ is reserved for special representations. This occurs when the first three hexadecimal digits are 000₁₆ or 800₁₆. In this case, if the mantissa is XXXXXXXXXXXXX₁₆, it represents the denormalized number 0.XXXXXXXXXXXXX₁₆ × 2^(-1022).
For example, the smallest normalized number occurs when the exponent field is 001₁₆, giving 1.0000000000000₁₆ × 2^(-1022) = 2^(-1022), which is demonstrated here:
format hex
2^(-1022)
ans = 0010000000000000
For any number smaller than this, we use denormalized numbers:
2^-1023
ans = 0008000000000000
ans/2
ans = 0004000000000000
The first number is 0.8₁₆ × 2^(-1022) = 2^(-1023), while the second number is 0.4₁₆ × 2^(-1022) = 2^(-1024).
For example, to interpret the number 0000580000000000, we note the first three hexadecimal digits are 000 and consequently, this is a denormalized number. The mantissa is 0580000000000₁₆, which represents the binary number 0.000001011000...0₂. Interpreting this as a floating-point number, we have 0.000001011₂ × 2^(-1022). You should recall the following values: 2^(-6) = 0.015625, 2^(-8) = 0.00390625, and 2^(-9) = 0.001953125.
In binary, the mantissa therefore represents 2^(-6) + 2^(-8) + 2^(-9) = 0.021484375. If we enter this into Matlab, we get the desired result:
0.021484375 * 2^-1022
ans = 0000580000000000
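The same denormalized decoding can be checked outside Matlab. This Python sketch (the round-trip check is our own) rebuilds the value from the mantissa field alone:

```python
import struct

bits = 0x0000580000000000              # the denormalized example above
mantissa = bits & ((1 << 52) - 1)      # exponent field is 000: no hidden bit
value = (mantissa / 2**52) * 2.0**(-1022)

print(mantissa / 2**52)                # 0.021484375
# The computed value has exactly the original bit pattern.
print(struct.pack('>d', value) == struct.pack('>Q', bits))  # True
```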
The smallest possible number which can be represented is 0000000000000001, which is 0.000...0001₂ × 2^(-1022) with 51 zeros following the radix point. That is, it is the number 2^(-52) × 2^(-1022) = 2^(-1074). We can verify this in Matlab:
2^-1074
ans = 0000000000000001
ans / 2
ans = 0000000000000000
The last answer is, of course, the representation of 0.
The other reserved exponent is the largest, 11111111111₂, which corresponds to the first three hexadecimal digits being 7FF or FFF. If the number is plus or minus infinity, the mantissa is zero:
format hex
Inf
ans = 7ff0000000000000
-Inf
ans = fff0000000000000
This does not represent the mathematical infinity, but rather, any floating-point number with a magnitude of 2^1024 or larger:
format
2^1023
ans = 8.9885e+307
2^1024
ans = Inf
The properties are as expected:
Inf + 10
ans = Inf
1/0
ans = Inf
-3 * Inf
ans = -Inf
1/Inf
ans = 0
There are some mathematical operations which are not defined. For example, what is Inf - Inf or 0/0 or even 0 * Inf? It is not reasonable to represent any of these as ±Inf, and therefore there is one final number: NaN, or not-a-number. This is what you would have referred to in high school as undefined:
0/0
ans = NaN
Inf - Inf
ans = NaN
Inf * 0
ans = NaN
format hex
NaN
ans = fff8000000000000
As you can see in the last example, the representation is close to that of -Inf, except that the mantissa has a leading 1 bit rather than being all zeros.
NaN has one particularly interesting property according to the IEEE 754 standard: when compared to itself, it must return false:
3 == 3
ans = 1
Inf == Inf
ans = 1
pi == pi
ans = 1
NaN == NaN
ans = 0
Thus, to detect NaN, one must make an explicit function call:
isnan( [0 1 -1 Inf -Inf NaN pi] )
ans = 0 0 0 0 0 1 0
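Python behaves the same way; float('nan') and math.isnan mirror Matlab's NaN and isnan (a quick illustration):

```python
import math

nan = float('nan')
print(nan == nan)          # False: NaN is never equal to itself
print(math.isnan(nan))     # True: detection needs an explicit call
print(math.isnan(math.pi)) # False
```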
A double-precision floating-point is positive if the first hexadecimal character is ≤7 and negative otherwise.
Two double precision floating-point numbers are equal if and only if all the bits are equal.
Because of denormalized numbers, two doubles x and y are equal if and only if (x - y) == 0 evaluates to true (1). For example, both sqrt(2) and sqrt(8)/2 represent the same number and evaluate to the same value:
format
(sqrt(2) - sqrt(8)/2) == 0
ans = 1
format hex
sqrt(2)
ans = 3ff6a09e667f3bcd
sqrt(8)/2
ans = 3ff6a09e667f3bcd
On the other hand, while sqrt(18)/3 should also equal sqrt(2), due to numerical error, they are not precisely the same:
format
(sqrt(2) - sqrt(18)/3) == 0
ans = 0
format hex
sqrt(2)
ans = 3ff6a09e667f3bcd
sqrt(18)/3
ans = 3ff6a09e667f3bcc
In addition, two positive numbers may be compared by simply comparing their bits. For example, given the hexadecimal representations of ten numbers:
52832e8d36bf5eed 15b7e09834b53d11 2bd1f986a855506b 17033585aedd9016 1b9b8cfca975ca98 5d191fca8e5f7883 5180e4696a543549 2f71833e25f42212 6887272068f2910d 15f45c2c1f3ad0d0
It is immediately clear that they are, from smallest to largest:
15b7e09834b53d11 15f45c2c1f3ad0d0 17033585aedd9016 1b9b8cfca975ca98 2bd1f986a855506b 2f71833e25f42212 5180e4696a543549 52832e8d36bf5eed 5d191fca8e5f7883 6887272068f2910d
When we look at the actual value of each of these, we see that the ordering is indeed correct:
|Representation||Actual Value|
These are operations which may be performed very quickly in hardware.
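The claim can be checked directly: for non-negative doubles, sorting by the raw big-endian bytes gives the same order as sorting by value. A small Python sketch (the helper name bit_key is our own):

```python
import struct

def bit_key(x):
    # Big-endian bytes compare lexicographically, i.e., as an unsigned
    # 64-bit integer, which is monotone for non-negative doubles.
    return struct.pack('>d', x)

values = [1.0, 0.5, 3.0, 2.0**-1030, 1.0e300, 0.0]
print(sorted(values, key=bit_key) == sorted(values))  # True
```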
As in C and C++, 0 is considered to be false and any other number is considered to be true. Operators such as ==, <=, or > do not return Boolean values; rather, they return 0 or 1.
You may ask: I can tell whether or not any one entry within a matrix is NaN, but how can I tell if no entries are NaN?
This introduces two column-based matrix functions (like sum, max, etc.): the any and all functions.
The any function returns 1 for a column if any of the entries of the column are non-zero. The all function returns 1 for a column if all entries are non-zero. For example:
M = [1 0 0 0; 1 1 0 0; 1 1 1 0]
M =
     1     0     0     0
     1     1     0     0
     1     1     1     0
any( M )
ans = 1 1 1 0
all( M )
ans = 1 0 0 0
Thus, as a quick test to determine whether no entries of a matrix M are NaN, use:
~any( any( isnan( M ) ) )
The ~ is the Matlab equivalent of not. It is the same as the C/C++/Java/C# !.
The first call to any returns a 1 in each column which contains a NaN. If any column contained a NaN, the result of the first call would therefore contain a 1, and so the second call to any would also return 1. The negation of 1 is 0, so the expression evaluates to 1 only when no entry is NaN.
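In pure Python (without matrices), the same nested-any pattern can be written with generator expressions; a sketch with our own helper name no_nans:

```python
import math

def no_nans(M):
    # Analogous to ~any(any(isnan(M))): True only if no entry is NaN.
    return not any(any(math.isnan(e) for e in row) for row in M)

print(no_nans([[1.0, 2.0], [3.0, 4.0]]))           # True
print(no_nans([[1.0, float('nan')], [3.0, 4.0]]))  # False
```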
Boolean tests can be used in if statements, but they can also be used in while loops:
format long
s1 = 0;
s2 = 1;
i = 2;
while ( s1 ~= s2 )
    s1 = s2;
    s2 = s2 + 1/i^2;
    i = i + 1;
end
% ...executes for a while...
s2
s2 = 1.644934057834575
i
i = 94906267
pi^2/6
ans = 1.644934066848226
This is not the way to test for convergence: we generally do not require this much accuracy, and in some cases the loop may never terminate even when the result is already close enough.
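A more usual stopping rule bounds the change between iterations by a tolerance. This Python sketch (the tolerance 1e-10 is an arbitrary choice of ours) stops after roughly 10^5 terms instead of 10^8, yet is still within about 10^-5 of pi^2/6:

```python
import math

s_old, s_new, i = 0.0, 1.0, 2
# Stop when the relative change drops below the chosen tolerance.
while abs(s_new - s_old) > 1e-10 * abs(s_new):
    s_old = s_new
    s_new += 1.0 / i**2
    i += 1

print(abs(s_new - math.pi**2 / 6) < 1e-4)  # True
```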
In this section, we will see a new means of finding and evaluating interpolating polynomials.
In the previous laboratory, you saw how to calculate interpolating polynomials; for example, finding the quadratic polynomial which passes through three given points.
There are problems, however, in evaluating such polynomials at an arbitrary point x. For example, it is not always possible to avoid subtractive cancellation: the polynomial may be approximately zero for some value of x. Evaluating the polynomial at such a point may yield the value 1.056122122394809e-8 while the correct answer is exactly 1.056122122449e-8.
In a higher degree polynomial, it is almost impossible to determine the order in which terms should be added and consequently, it is more appropriate to avoid such a situation altogether.
In order to avoid these problems, we will introduce what at first appears to be a non-answer: Newton polynomials. We will then look at a very efficient means of evaluating polynomials: Horner's rule.
Given a single point (x1, y1), it is easy to find the interpolating polynomial of degree zero: p0(x) = y1.
This is shown in Figure 5.
Figure 5. The constant function passing through (x1, y1).
This seems anticlimactic, but it is the first step in finding a Newton polynomial. The next question is: given a second point (x2, y2), instead of finding a completely new polynomial, can we find a corrector function c1(x) which we can add to p0(x) such that p1(x) = p0(x) + c1(x) passes through both points? This is demonstrated in Figure 6, where we are trying to find the purple function which we can add to the red line.
The first step is determining that c1(x) must be zero at x1, and therefore c1(x) = a1(x - x1).
Figure 6. Finding the correcting term which, when added to p0(x), causes the result to pass through (x2, y2).
The next step is to find the coefficient a1 so that it forces p1(x2) = y1 + a1(x2 - x1) = y2.
We can solve this for a1 to get a1 = (y2 - y1)/(x2 - x1).
Finally, in order to conclude this example, we will add a third point (x3, y3), and thus we must find the next quadratic correcting function c2(x) which causes the resulting sum p2(x) = p1(x) + c2(x) to pass through the third point. This is shown in Figure 7, where we must now find the orange function so that when it is added to the blue function, the resulting sum (magenta) passes through all three points.
Figure 7. Finding the correcting term which, when added to p1(x), causes the result to pass through (x3, y3).
To find this second polynomial c2(x), we begin by noting it must be zero at the two points x1 and x2. This requires that it must be of the form c2(x) = a2(x - x1)(x - x2).
We also require that
    p1(x3) + c2(x3) = y3
or, in other words,
    y1 + a1(x3 - x1) + a2(x3 - x1)(x3 - x2) = y3.
In order to find an appropriate solution, we will use the trick of adding zero; that is, we will replace x3 - x1 in the second term with (x3 - x2) + (x2 - x1):
    y1 + a1((x3 - x2) + (x2 - x1)) + a2(x3 - x1)(x3 - x2) = y3.
We can now distribute the multiplier a1:
    y1 + a1(x3 - x2) + a1(x2 - x1) + a2(x3 - x1)(x3 - x2) = y3.
The third term now simplifies to
    a1(x2 - x1) = (y2 - y1)/(x2 - x1) (x2 - x1) = y2 - y1,
and we can now cancel the two terms y1 and -y1 which appear on the left-hand side:
    y2 + a1(x3 - x2) + a2(x3 - x1)(x3 - x2) = y3.
With the next step, we subtract y2 from each side and then divide through by x3 - x2 to get
    a1 + a2(x3 - x1) = (y3 - y2)/(x3 - x2).
Taking the first term on the left-hand side to the other side and dividing through by x3 - x1 yields
    a2 = ( (y3 - y2)/(x3 - x2) - (y2 - y1)/(x2 - x1) ) / (x3 - x1).
What you see here is a systematic mechanism for finding the coefficients: each one is a divided difference of the values already computed. Next, we see that we may write
    p2(x) = y1 + a1(x - x1) + a2(x - x1)(x - x2)
where
    a1 = (y2 - y1)/(x2 - x1)
    a2 = ( (y3 - y2)/(x3 - x2) - (y2 - y1)/(x2 - x1) ) / (x3 - x1).
To find these in a straight-forward manner, consider that we can calculate the values using the format in Figure 8.
Figure 8. Finding three Newton polynomial coefficients.
The interpolating Newton polynomial is now
    p2(x) = y1 + (y2 - y1)/(x2 - x1) (x - x1) + ( (y3 - y2)/(x3 - x2) - (y2 - y1)/(x2 - x1) )/(x3 - x1) (x - x1)(x - x2).
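The divided-difference table can be computed in a few lines. Here is a Python sketch (the function name newton_coeffs is our own) which produces the first-row coefficients directly, using the points (-1, -1.4), (2, 1.3), (3, 5.4) worked through later in this laboratory:

```python
def newton_coeffs(x, y):
    # In-place divided differences: after pass k, a[j] holds the
    # k-th order differences; a ends holding the Newton coefficients.
    n = len(x)
    a = list(y)
    for k in range(1, n):
        for j in range(n - 1, k - 1, -1):
            a[j] = (a[j] - a[j - 1]) / (x[j] - x[j - k])
    return a

coeffs = newton_coeffs([-1.0, 2.0, 3.0], [-1.4, 1.3, 5.4])
print(coeffs)  # approximately [-1.4, 0.9, 0.8]
```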
Similarly, we can find the polynomial which interpolates a fourth point (x4, y4) by calculating the table in Figure 9, which appends one further coefficient a3 to yield the polynomial
    p3(x) = p2(x) + a3(x - x1)(x - x2)(x - x3).
Figure 9. Finding the Newton polynomial coefficients for interpolating four points.
You can read more about Newton polynomials at the appropriate chapter of Numerical Analysis for Engineers.
We must now calculate these coefficients in Matlab. Consider the vectors
x = [-1 2 3]'; y = [-1.4 1.3 5.4]';
Notice that if we calculate
y1 = (y(2:3) - y(1:2))./(x(2:3) - x(1:2))
y1 =
    0.9000
    4.1000
You will notice that y1(1) = 0.9 and y1(2) = 4.1 are the two first divided differences. Similarly, we may calculate
y2 = (y1(2) - y1(1))./(x(3) - x(1))
y2 = 0.8000
which is again the second divided difference, 0.8.
We will create a matrix Y whose first column holds the entries
    Y(:,1) = y
and then, using Matlab efficiently, calculate the second column as
    Y(1:2,2) = (Y(2:3,1) - Y(1:2,1))./(x(2:3) - x(1:2))
and finally we will calculate the last entry
    Y(1,3) = (Y(2,2) - Y(1,2))./(x(3) - x(1))
We may extract the coefficients of the Newton polynomial by accessing the entries in the first row.
The algorithm itself may be used for an arbitrary number of points. For example, with five points, we would begin with a 5×5 matrix whose first column is y, and then calculate, in order, the second, third, fourth, and fifth columns of divided differences. We can read the Newton polynomial off of the first row:
    p4(x) = Y(1,1) + Y(1,2)(x - x1) + Y(1,3)(x - x1)(x - x2) + Y(1,4)(x - x1)(x - x2)(x - x3) + Y(1,5)(x - x1)(x - x2)(x - x3)(x - x4)
The code for generating this matrix is given by
function [Y] = newtonpoly( x, y )
    [m, n] = size( x );

    if ( m ~= 1 && n ~= 1 )
        throw( MException( 'MATLAB:invalid_argument', ...
               'The arguments must be vectors' ) );
    end

    if ( ~all( size( y ) == size( x ) ) )
        throw( MException( 'MATLAB:invalid_argument', ...
               'The vectors ''x'' and ''y'' must have the same dimensions' ) );
    end

    m = max( m, n );
    Y = zeros( m, m );
    Y(:,1) = y;

    for k = 2:m
        Y(1:(end - k + 1), k) = ( ...
            Y(2:(end - k + 2), k - 1) - Y(1:(end - k + 1), k - 1) ...
        ) ./ (x(k:end) - x(1:(end - k + 1)));
    end
end
A very quick trial run demonstrates that this works:
newtonpoly( x, y )
ans =
   -1.4000    0.9000    0.8000
    1.3000    4.1000         0
    5.4000         0         0
Thus, the Newton interpolating polynomial is
    p(x) = -1.4 + 0.9(x + 1) + 0.8(x + 1)(x - 2).
Suppose we now add a fourth point, (5, 7.6):
x = [x; 5]; y = [y; 7.6];
When calculating the coefficients of the Newton polynomial, we get
newtonpoly( x, y )
ans =
   -1.4000    0.9000    0.8000   -0.3000
    1.3000    4.1000   -1.0000         0
    5.4000    1.1000         0         0
    7.6000         0         0         0
You will notice that the six non-zero entries remained unchanged and the only entries which were added were those on the anti-diagonal. Thus, the Newton interpolating polynomial is now
    p(x) = -1.4 + 0.9(x + 1) + 0.8(x + 1)(x - 2) - 0.3(x + 1)(x - 2)(x - 3).
Now comes the more important question: how do we evaluate this polynomial? This actually seems complicated, but using Horner's rule, there is a very fast algorithm which is O(n), where n is the number of points (that is, the number of terms in the polynomial).
If we have the polynomial p(x) = a1*x^2 + a2*x + a3 defined by the vector of coefficients a = [a1 a2 a3], it may be evaluated efficiently as follows:
% Evaluate the polynomial 'a' at the point 'x'
term = 1;
y = 0;
for k = length(a):-1:1
    y = y + a(k)*term;
    term = term * x;
end
With the first iteration, y = a3; with the second iteration, y = a3 + a2*x; and finally with the last iteration, y = a3 + a2*x + a1*x^2.
We can do away with the local variable term by iterating through the coefficients in the opposite order:
% Evaluate the polynomial 'a' at the point 'x'
y = a(1);
for k = 2:length(a)
    y = y*x + a(k);
end
With the first iteration, y = a1; with the second iteration, y = a1*x + a2; and finally with the last iteration, y = (a1*x + a2)*x + a3 = a1*x^2 + a2*x + a3.
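For comparison, Horner's loop translates directly into Python (a sketch only; the laboratory version is to be written in Matlab):

```python
def horners(a, x):
    # y = (...((a1*x + a2)*x + a3)...): one multiply and one add
    # per coefficient.
    y = a[0]
    for c in a[1:]:
        y = y * x + c
    return y

print(horners([1, 2, 3], 3.2))
```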
Write a function y = horners( a, x ) which evaluates the polynomial defined by the vector a at the point x and assigns the result to the variable y.
Your implementation must, in addition to evaluating the polynomial at the point x, allow the user to evaluate the polynomial at each point of a vector or matrix.
Your function must have the following output:
format long
horners( [1 2 3], 3.2 )
ans = 19.640000000000001
horners( [1 2 3], 3.5 )
ans = 22.250000000000000
Your function should also be able to perform the following:
format
horners( [1 2 3], 0:0.1:1 )
ans =
    3.0000 3.2100 3.4400 3.6900 3.9600 4.2500 4.5600 4.8900 5.2400 5.6100 6.0000
If we define xs to be 100 equally spaced points on the interval [-3, 3], we can also plot the polynomial on that interval:
xs = linspace( -3, 3, 100 );
ys = horners( [1 2 3], xs );
plot( xs, ys );
ylim( [min( ys ) - 1, max( ys ) + 1] );
title( 'The polynomial x^2 + 2x + 3 plotted on the interval [-3, 3]' );
The output of this sequence of commands should be a plot similar to Figure 10.
Figure 10. The output of Matlab when plotting a polynomial.
What is the purpose of the statement ylim( [min( ys ) - 1, max( ys ) + 1] );?
Recall that in Matlab, the function defined by function [y] = f(x) must be saved in the file f.m. Many Matlab functions are also implemented in a similar fashion. Take a look at the Matlab library function polyval.m and determine how it evaluates polynomials.
During the lab, you will modify horners to evaluate Newton polynomials.
You can read more about Horner's rule at the appropriate chapter of Numerical Analysis for Engineers.