
# Introduction

In the previous section, we saw how we may represent a wide range
of real numbers using only six decimal digits and a sign bit. However,
computers use binary numbers and we would like more precision than
what we used in the previous section.

This topic deals with the binary double-precision floating-point
representation (usually abbreviated as *double*) used on most computers today.

# Background

# References

- Bradie, Section 1.3, The IEEE Standard, p.36.
- Mathews, Section 1.2, Computer Floating-Point Numbers, p.22.

# Theory

The `double` format is a method of storing approximations to real numbers in
a binary format. The term *double* comes from the full name, *double-precision
floating-point numbers*. Originally, a 4-byte floating-point number (`float`) was used;
however, it was found that this was not precise enough for most
scientific and engineering calculations, so it was decided to double the amount of memory allocated,
hence the abbreviation *double*.

The double format uses eight bytes, comprising 1 bit for the sign, 11 bits
to store the exponent, and 52 bits for the mantissa. Additionally, because we require
that the leading bit be non-zero, and the only non-zero bit is 1, we simply
do not store the leading 1. The exponent is stored by adding a *bias* of
01111111111_{2} (= 1023) to the actual exponent. This is all the information we need to
interpret a double-precision floating-point number in binary form. Unfortunately,
Matlab only gives us a hexadecimal version through `format hex`, for
example

>> format hex
>> sin(1)
ans = 3feaed548f090cee
>> 1/8 + 1/1024
ans = 3fc0200000000000

so we have one additional step:

# Hexadecimal-to-Binary Conversion

Replace each hexadecimal (hex) number with the four-bit binary
equivalent, as given in Table 1.

Table 1. Hexadecimal to Binary Conversions.

| Hex | Binary |
|-----|--------|
| 0 | 0000 |
| 1 | 0001 |
| 2 | 0010 |
| 3 | 0011 |
| 4 | 0100 |
| 5 | 0101 |
| 6 | 0110 |
| 7 | 0111 |
| 8 | 1000 |
| 9 | 1001 |
| a | 1010 |
| b | 1011 |
| c | 1100 |
| d | 1101 |
| e | 1110 |
| f | 1111 |

# Binary-to-Hex Conversion

Group the binary number into sets of four bits and replace each
quartet with its corresponding hex number, as given in Table 1.
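
Both look-ups in Table 1 can be sketched in a few lines of Python. This is only an illustration: the helper names `hex_to_bin` and `bin_to_hex` are made up here, and the input is assumed to be a well-formed string (for the binary case, one whose length is a multiple of four).

```python
def hex_to_bin(hex_digits):
    """Replace each hex digit with its four-bit quartet (Table 1)."""
    return "".join(format(int(digit, 16), "04b") for digit in hex_digits)


def bin_to_hex(bits):
    """Group the bits into quartets and replace each with its hex digit."""
    quartets = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(quartet, 2), "x") for quartet in quartets)


bits = hex_to_bin("3fc0200000000000")
print(bits)              # 64 characters, beginning 0011 1111 1100
print(bin_to_hex(bits))  # 3fc0200000000000
```

Working on strings rather than integers keeps leading zeros, which matter because every double is exactly 64 bits long.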

# Real-to-Double Conversion

The steps for converting a real number from decimal to a double
representation are:

1. Convert the real number to its binary representation.
2. Find the appropriate power of 2 which will move the radix
point to the right of the most-significant bit.
3. The sign bit is 0 if the number is positive, 1 if it is
negative.
4. Convert the power to binary and add it to the bias 01111111111_{2}.
5. Strip the most-significant bit of the binary representation and round the remaining bits to 52 bits.
6. Concatenate the results of the last three steps to create a
number 64 bits long.

If necessary, separate into groups of four bits and convert each
to a hexadecimal number.
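
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a full implementation: the names `encode_double` and `reference_bits` are made up here, the input must be a nonzero, finite, normal number, and the rounding to 53 significant bits has already been done by Python when it stored `x` as a float. The standard `struct` module is used only to check the result against the machine representation.

```python
import math
import struct

BIAS = 1023  # 01111111111 in binary


def encode_double(x):
    """Build the 64-bit pattern of a nonzero, normal, finite float by hand."""
    sign = "1" if x < 0 else "0"
    # frexp gives abs(x) = m * 2**e with 0.5 <= m < 1, so the power that
    # puts the radix point right after the leading 1 is e - 1.
    m, e = math.frexp(abs(x))
    exponent = format(e - 1 + BIAS, "011b")
    # m * 2**53 is an exact 53-bit integer whose top bit is the leading 1.
    significand = format(int(m * 2**53), "053b")
    return sign + exponent + significand[1:]  # drop the leading 1


def reference_bits(x):
    """The same 64 bits, read straight from the machine representation."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")


print(encode_double(-523.25))
print(encode_double(-523.25) == reference_bits(-523.25))  # True
```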

# Double-to-Real Conversion

The steps for converting a double to a decimal real number are:

1. Separate the number into three components: the sign bit (1 bit), the
exponent (11 bits), and the mantissa (52 bits).
2. Subtract the bias 01111111111_{2} (= 1023) from the exponent.
3. Append the mantissa to a leading `1.` and convert to decimal.
4. Multiply the result of Step 3 by 2 raised to the power given in Step 2.
5. Negate the result of Step 4 if the sign bit is 1.
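
As a rough sketch, the decoding steps translate almost line for line into Python. The function name `decode_double` is hypothetical, the input is assumed to be a 64-character string of `0`s and `1`s describing a normal number (exponent neither all zeros nor all ones), and exact rational arithmetic from the standard `fractions` module is used so that no additional rounding creeps in.

```python
from fractions import Fraction

BIAS = 1023  # 01111111111 in binary


def decode_double(bits):
    """Interpret a 64-character bit string as a normal double, exactly."""
    sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
    power = int(exponent, 2) - BIAS                       # Step 2
    significand = 1 + Fraction(int(mantissa, 2), 2**52)   # Step 3: 1.mantissa
    value = significand * Fraction(2) ** power            # Step 4
    return -value if sign == "1" else value               # Step 5


bits = "0" + "10000000001" + "01" + "0" * 50
print(float(decode_double(bits)))  # 5.0
```

Here the exponent field is 1025, so the power is 2, and the mantissa `01` gives 1.25, hence 1.25 × 4 = 5.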

# Floats and Doubles

The following table compares the single-precision floating-point representation (float) and the
double-precision floating-point representation (double):

| Format | Bytes | Sign Bit | Exponent | Bias | Mantissa |
|--------|-------|----------|----------|------|----------|
| float | 4 | 1 bit | 8 bits | 01111111_{2} = 127 | 23 (+1) bits |
| double | 8 | 1 bit | 11 bits | 01111111111_{2} = 1023 | 52 (+1) bits |

As you may note, float uses 24 bits to store the mantissa (including the unrecorded leading
1) while the double uses 53 bits, whereas the exponent grows only from 8 bits to 11 bits. Thus, more emphasis was placed on increasing the
precision than on increasing the range which the floats can approximate.
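
The difference in precision can be observed directly. The sketch below assumes Python's standard `struct` module, whose `f` and `d` formats pack IEEE 754 single-precision and double-precision values; in the single format the 32 bits split into 1 sign bit, 8 exponent bits, and 23 stored mantissa bits. The helper name `float_bits` is made up for this illustration.

```python
import struct


def float_bits(x):
    """32-bit single-precision pattern of x (after rounding x to single)."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return format(as_int, "032b")


bits = float_bits(1.0)
print(bits[0], bits[1:9], bits[9:])  # sign, 8-bit exponent, 23-bit mantissa

# Storing 0.1 in four bytes and reading it back loses precision.
as_single = struct.unpack(">f", struct.pack(">f", 0.1))[0]
print(as_single)             # close to, but not equal to, 0.1
print(abs(as_single - 0.1))  # size of the error introduced
```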

# Example of Subtractive Cancellation and Doubles

Using double-precision floating-point numbers to approximate the derivative
with a difference quotient leads to invalid results once the step size becomes very small, even though calculus teaches us that
the technique used should provide better and better results.
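
A small experiment illustrates the effect. The sketch below is an assumption about what such an example could look like, not the original one: it approximates the derivative of sin at x = 1 with the forward difference (f(x + h) - f(x))/h for h = 10^-1, ..., 10^-17 and compares against the exact value cos(1). The function name `forward_difference` is chosen here for illustration.

```python
import math


def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


exact = math.cos(1.0)  # derivative of sin at 1
errors = {}
for k in range(1, 18):
    h = 10.0 ** -k
    errors[k] = abs(forward_difference(math.sin, 1.0, h) - exact)
    print(f"h = 1e-{k:02d}   error = {errors[k]:.3e}")
```

The error shrinks at first, then grows again as f(x + h) and f(x) agree in more and more of their 53 bits and the subtraction cancels them. For h = 10^-17 the sum 1 + h rounds to 1, so the computed derivative is exactly 0.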

# IEEE 754

The properties of the double are specified by the IEEE 754 standard. For more information,
there are a few excellent documents which should be read,
especially David Goldberg's article and Prof. W. Kahan's tour.

# HOWTO

# Hexadecimal Format to Binary Format

Consider the following Matlab code which prints out a hexadecimal representation
of π:

>> format hex
>> pi
ans = 400921fb54442d18

First, we must convert this to binary by replacing each hexadecimal character
with its corresponding quartet of binary numbers:

4    0    0    9    2    1    f    b    5    4    4    4    2    d    1    8
0100 0000 0000 1001 0010 0001 1111 1011 0101 0100 0100 0100 0010 1101 0001 1000

# Binary Format to Decimal

The next step is to split the number into the sign bit, the exponent, and the mantissa
(the first three hexadecimal characters (12 bits) make up the sign bit and the exponent):

0 10000000000 1001001000011111101101010100010001000010110100011000

Subtracting 01111111111_{2} from the exponent 10000000000_{2} yields
1_{2}, and thus, this represents the binary number

1.1001001000011111101101010100010001000010110100011000_{2} × 2^{1}

or

11.001001000011111101101010100010001000010110100011000_{2}

The integer portion is 11_{2}, which is 3 in decimal. The
fractional part is 1/8 + 1/64 + 1/2048 + 1/4096 + 1/8192 + ⋅⋅⋅ ≈ 0.14159265358979,
so the number is approximately 3.14159265358979, which is a reasonable approximation of π.
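
The hand computation can be cross-checked in Python. The sketch assumes the standard `struct` module, whose `>d` format reads eight big-endian bytes as an IEEE 754 double; the variable names are illustrative only.

```python
import math
import struct

raw = bytes.fromhex("400921fb54442d18")
(value,) = struct.unpack(">d", raw)  # eight big-endian bytes -> double
print(value)             # 3.141592653589793
print(value == math.pi)  # True

# The first five terms of the fractional part listed above.
partial = 1/8 + 1/64 + 1/2048 + 1/4096 + 1/8192
print(partial)               # 0.1414794921875
print(value - 3 - partial)   # what the remaining mantissa bits contribute
```

The remaining bits contribute less than 1/8192, which is why the first few terms already agree with π to about four decimal places.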

# Decimal to Binary Format

To convert a number from decimal into the binary format, we must first write it in binary form. For
example, -523.25 is negative, so we set the sign bit to 1. Next, 523.25 = 512 + 8 + 2 + 1 + 1/4 = 1000001011.01_{2},
and 512 = 2^{9}. Thus, this number
may be written in binary as 1.00000101101_{2} × 2^{9}, where 9 = 1001_{2}. We add the exponent 1001_{2} to
the bias 01111111111_{2} to get 10000001000_{2}. Finally, we write down the
sign bit, the sum of the exponent and the bias, and the mantissa (dropping the leading 1 and
padding to the right with zeros):

1 10000001000 0000010110100000000000000000000000000000000000000000

or

1100000010000000010110100000000000000000000000000000000000000000

# Binary Format to Hexadecimal Format

To check this answer, we may break the number into quartets and convert
to hexadecimal form:

1100 0000 1000 0000 0101 1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
c    0    8    0    5    a    0    0    0    0    0    0    0    0    0    0

which is c0805a0000000000. This agrees with the output of Matlab:

>> format hex
>> -523.25
ans = c0805a0000000000
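
Outside Matlab, the same check can be made with Python's standard `struct` module. The helper name `double_to_hex` is made up for this sketch; it plays the role of `format hex` for a single value.

```python
import struct


def double_to_hex(x):
    """Hex digits of the 8-byte big-endian IEEE 754 encoding of x."""
    return struct.pack(">d", x).hex()


print(double_to_hex(-523.25))           # c0805a0000000000
print(double_to_hex(1 / 8 + 1 / 1024))  # 3fc0200000000000
```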

# Examples

1. Convert the hex representation c066f40000000000 of a double to binary.

Replacing each hexadecimal digit with its corresponding binary quartet:

c    0    6    6    f    4    0    0    0    0    0    0    0    0    0    0
1100 0000 0110 0110 1111 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

yielding 1100000001100110111101000000000000000000000000000000000000000000.

2. What is the number which
the double 1100000001100110111101000000000000000000000000000000000000000000 represents?

The first bit is 1, so the number is negative. The next 11 bits
are 10000000110_{2}. Subtracting 01111111111_{2} from this yields
111_{2}, which equals 7. Thus, the result is multiplied by 2^{7} = 128.

The mantissa is 1. followed by all bits after the 12th bit, that is:

1.0110111101000000000000000000000000000000000000000000

This number is

1 + 1/4 + 1/8 + 1/32 + 1/64 + 1/128 + 1/256 + 1/1024

which equals 1.4345703125 . Thus, the number is -1.4345703125 × 128 = -183.625
(recalling that the number is negative).

3. What is the decimal number which is represented by the double
0011111111101000100000000000000000000000000000000000000000000000 ?

The first bit is 0, so the number is positive. The next 11 bits
are 01111111110, which is one less than 01111111111. Thus, the result is multiplied
by 2^{-1} (or divided by 2).

The mantissa is

1.1000100000000000000000000000000000000000000000000000

which is the number

1 + 1/2 + 1/32

which equals 1.53125 . Thus, the number is 1.53125 / 2 = 0.765625 .

4. Find the double representation of the integer 289.

The number is positive, so the first bit is 0. The binary representation
of this number is 100100001_{2} (289 = 256 + 32 + 1). Thus, the mantissa will be
001000010000⋅⋅⋅. To get the exponent, we note that
100100001_{2} = 1.00100001_{2} × 2^{8} (we must move the radix point
eight places to the left) and therefore we must add 8 (= 1000_{2}) to 01111111111_{2} to get
10000000111_{2}.

Therefore, our double representation is:

0 10000000111 0010000100000000000000000000000000000000000000000000

5. Find the double representation of 1/8.

1/8 = 2^{-3} = 1.0000 × 2^{-3}, and thus the mantissa is
000⋅⋅⋅0 and the exponent is 01111111111_{2} minus 3 (= 11_{2}).
Thus, the exponent is 01111111100 and because the number is positive, the representation is:

0 01111111100 0000000000000000000000000000000000000000000000000000

6. Find the double-precision floating-point format of -324/33 given that its
binary representation
is -1001.11010001011101000101110100010111010001011101000101110100010111010001⋅⋅⋅ .

The number is negative, so the first bit is 1. The radix point must be moved three spots to
the left to produce a number of the form 1.⋅⋅⋅, so the exponent is 3 = 11_{2},
and 01111111111_{2} + 11_{2} = 10000000010_{2}. Finally, rounding
1.00111010001011101000101110100010111010001011101000101110100010111010001 to 53 bits yields
1.0011101000101110100010111010001011101000101110100011 and thus the representation is

1 10000000010 0011101000101110100010111010001011101000101110100011

This can be confirmed by using `format hex` and typing -324/33 into Matlab.
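
The rounding step can also be reproduced with exact rational arithmetic. The following Python sketch is an illustration under stated assumptions: it uses the standard `fractions` module to hold 324/33 exactly, relies on the exponent 3 found above, and uses Python's built-in `round`, which rounds to the nearest integer with ties going to the even neighbour. The variable names are chosen here for illustration.

```python
import struct
from fractions import Fraction

x = Fraction(324, 33)  # magnitude of -324/33, held exactly
power = 3              # 2**3 <= x < 2**4
# Scale so that the 53 significant bits form an integer, then round.
scaled = x * Fraction(2) ** (52 - power)
significand = round(scaled)
bits53 = format(significand, "053b")
print(bits53[0] + "." + bits53[1:])

# Cross-check against the machine encoding of -324/33.
(as_int,) = struct.unpack(">Q", struct.pack(">d", -324 / 33))
print(format(as_int, "064b")[12:] == bits53[1:])  # True
```

The bit just beyond the 52nd mantissa bit is 1 and is followed by further non-zero bits, so the value is rounded up, which is why the mantissa ends in `...0011` rather than `...0010`.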

7. Describe what the exponent looks like for:

- any number greater than or equal to 2,
- any number in the interval [1, 2), and
- any number less than 1.

Any number greater than or equal to 2 must have an exponent of 1 or
greater, so the stored exponent is at least 10000000000_{2}; therefore the first bit of the exponent (that is, the second bit
of the double) must be 1.

Any number in [1, 2) must have the exponent 0 and therefore the stored exponent
must equal the bias, that is, `01111111111`.

Any (positive) number less than 1 must have a negative exponent, and therefore
the stored exponent must be some number less than `01111111111`. We could
say that the leading bit of the exponent is `0` and there is at least
one other bit in the exponent which is also `0`.
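
These three cases can be seen by printing the exponent field for a few sample values. The sketch below assumes Python's standard `struct` module; the helper name `exponent_field` and the sample values are chosen purely for illustration.

```python
import struct


def exponent_field(x):
    """The 11 stored exponent bits of the double x."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")[1:12]


for x in (1024.0, 2.0, 1.999, 1.0, 0.999, 0.001):
    print(f"{x:>8}  {exponent_field(x)}")
```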

8. Consider finding the derivative.

# Questions

1. Convert the hexadecimal representation c01d600000000000 to binary. (1100000000011101011000000000000000000000000000000000000000000000)

2. What number does the hexadecimal representation c01d600000000000 of a double represent? (-7.34375)

3. What number does the binary representation 0100000001100011001011111000000000000000000000000000000000000000
of a double represent? (153.484375).

4. By converting to decimal and converting the result back to double, add the following
two hexadecimal representations of doubles: 3fe8000000000000 and 4011000000000000. (4014000000000000)

# Applications to Engineering

In engineering, a less accurate result with a predictable error is better than
a more accurate result with an unpredictable error. This was one of the main
reasons behind standardizing the format of floating-point representations on
computers. Without standardization, the same code run on different machines could
produce very different answers. IEEE 754 standardized the representation and behaviour
of floating-point numbers and therefore allowed better prediction of the error;
thus, an algorithm designed to run within certain tolerances will perform similarly
on all platforms. Standardization also
allows the algorithm designer to focus on a single standard, as opposed to wasting
time fine-tuning each algorithm for each different machine.

One interesting modification is used by the Intel Pentium processors for double-precision
floating-point computations: the processor internally stores doubles using 10 bytes
with a 64-bit mantissa and a 15-bit exponent. Thus, a floating-point computation using
doubles on an Intel processor should generally be at least as accurate as a computation on another
processor which stores doubles in the default 8 bytes.

# Matlab

Matlab uses doubles for numeric calculations by default, and you
can see the representation by using `format hex`.

>> format long
>> 325/23
ans = 14.1304347826087
>> format hex
>> 325/23
ans = 402c42c8590b2164

The Matlab-clone Octave has the additional `format bit`:

>> format bit
>> 325/23
ans = 0100000000101100010000101100100001011001000010110010000101100100

# Maple

Maple uses doubles if an expression is surrounded by `evalhf` (evaluate
using hardware floats), but you cannot see the representation.

> 3.0 + sin(1.0); # 10 digits by default
3.841470985
> evalhf( 3.0 + sin(1.0) );
3.84147098480789672