# Introduction

In the previous section, we saw how a wide range of real numbers may be represented using only six decimal digits and a sign bit. However, computers use binary numbers, and we would like more precision than six digits provide.

This topic deals with the binary double-precision floating-point representation (usually abbreviated as double) used on most computers today.

# References

• Bradie, Section 1.3, The IEEE Standard, p.36.
• Mathews, Section 1.2, Computer Floating-Point Numbers, p.22.

# Interactive Maplet

A Double-Precision Floating-Point Number Interpreter

# Theory

The double format is a method of storing approximations to real numbers in a binary format. The term double comes from the full name, double-precision floating-point number. Originally, a 4-byte floating-point number (float) was used; however, this was found to be insufficiently precise for most scientific and engineering calculations, so the amount of memory allocated was doubled, hence the abbreviation double.

The double format uses eight bytes, comprising 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. Additionally, because we require that the leading bit of the mantissa be non-zero, and the only non-zero binary digit is 1, we simply do not store the leading 1. The exponent is stored by adding a bias of 01111111111₂ (= 1023) to the actual exponent. This is all the information we need to interpret a double-precision floating-point number in binary form. Unfortunately, Matlab only gives us a hexadecimal version through format hex, for example

```
>> format hex
>> sin(1)
ans = 3feaed548f090cee
>> 1/8 + 1/1024
ans = 3fc0200000000000
```

so we have one additional step:

Replace each hexadecimal (hex) number with the four-bit binary equivalent, as given in Table 1.

Table 1. Hexadecimal to Binary Conversions.

| Hex | Binary |
|-----|--------|
| 0   | 0000   |
| 1   | 0001   |
| 2   | 0010   |
| 3   | 0011   |
| 4   | 0100   |
| 5   | 0101   |
| 6   | 0110   |
| 7   | 0111   |
| 8   | 1000   |
| 9   | 1001   |
| a   | 1010   |
| b   | 1011   |
| c   | 1100   |
| d   | 1101   |
| e   | 1110   |
| f   | 1111   |

# Binary-to-Hex Conversion

Group the binary number into sets of four bits and replace each quartet with its corresponding hex number, as given in Table 1.
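
These two digit-by-digit conversions are easy to automate. The following is a minimal sketch (in Python rather than Matlab, purely for illustration) that applies Table 1 in both directions:

```python
def hex_to_binary(hex_str):
    """Replace each hex digit with its four-bit binary quartet (Table 1)."""
    return ''.join(format(int(digit, 16), '04b') for digit in hex_str)

def binary_to_hex(bit_str):
    """Group the bits into quartets and replace each quartet with a hex digit.
    Assumes the length of bit_str is a multiple of four."""
    return ''.join(format(int(bit_str[i:i + 4], 2), 'x')
                   for i in range(0, len(bit_str), 4))

print(hex_to_binary('400921fb54442d18'))
print(binary_to_hex('0100000000001001'))   # 4009
```

The function names are ad hoc; any table-lookup implementation would do, since each hex digit maps independently to its quartet.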

# Real-to-Double Conversion

The steps to convert a number from decimal to the double representation are:

1. Convert the real number to its binary representation.
2. Find the appropriate power of 2 which will move the radix point to the right of the most-significant bit.
3. The sign bit is 0 if the number is positive, 1 if it is negative.
4. Convert the power to binary and add it to 01111111111₂.
5. Strip the most-significant bit and round to 52 bits.
6. Concatenate the results of the last three steps to create a number 64 bits long.

If necessary, separate into groups of four bits and convert each to a hexadecimal number.
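
The end result of these steps can be checked against the machine's own representation. The following Python sketch (an illustration, not part of the original notes) uses the standard struct module to extract the eight raw bytes of a double directly:

```python
import struct

# Pack a Python float (a double) into its eight raw bytes, big-endian,
# and display them as 16 hexadecimal digits -- the same output that
# Matlab's "format hex" produces.
def double_to_hex(x):
    return struct.pack('>d', x).hex()

print(double_to_hex(-523.25))   # c0805a0000000000
print(double_to_hex(0.125))     # 3fc0000000000000
```

The first value matches the worked conversion of -523.25 in this topic.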

# Double-to-Real Conversion

The steps to convert a double to a decimal real number are:

1. Separate the number into its three components: the sign bit (1 bit), the exponent (11 bits), and the mantissa (52 bits).
2. Subtract 01111111111₂ (= 1023) from the exponent.
3. Prefix the mantissa with a leading "1." and convert to decimal.
4. Multiply the result of Step 3 by 2 raised to the power given in Step 2.
5. Negate the result of Step 4 if the sign bit is 1.
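
These five steps can be carried out mechanically. The following Python sketch (illustrative only; the function name is ad hoc) decodes a 64-bit binary string by hand:

```python
# Decode a 64-bit binary string as a double by hand, following the five
# steps above (a sketch; Python's struct module could do the same in
# one call).
def double_bits_to_real(bits):
    sign = -1 if bits[0] == '1' else 1           # steps 1 and 5
    exponent = int(bits[1:12], 2) - 1023         # step 2: remove the bias
    mantissa = 1 + int(bits[12:], 2) / 2 ** 52   # step 3: leading "1."
    return sign * mantissa * 2 ** exponent       # step 4

print(double_bits_to_real(
    '1100000001100110111101000000000000000000000000000000000000000000'))
# -183.625
```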

# Floats and Doubles

The following table compares the floating-point representation and the double-precision floating-point representation:

| Format | Bytes | Sign Bit | Exponent | Bias | Mantissa |
|--------|-------|----------|----------|------|----------|
| float  | 4     | 1 bit    | 8 bits   | 01111111₂ = 127 | 23 (+1) bits |
| double | 8     | 1 bit    | 11 bits  | 01111111111₂ = 1023 | 52 (+1) bits |

As you may note, the float uses 24 bits to store the mantissa (including the unrecorded leading 1) while the double uses 53 bits. Thus, when the format was widened, more emphasis was placed on increasing the precision than on increasing the range of numbers which can be approximated.

# Example of Subtractive Cancellation and Doubles

The following example shows how using double-precision floating-point numbers to approximate the derivative leads to invalid results even though Calculus teaches us that the technique used should provide better and better results.
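
The effect is easy to demonstrate. The following Python sketch (an illustration, not the original Matlab example) approximates the derivative of sin at x = 1 with a forward difference; the error shrinks as h shrinks, until the subtraction sin(1 + h) − sin(1) cancels almost all significant bits and the error grows again:

```python
import math

# Forward-difference approximation of d/dx sin(x) at x = 1.
# The exact derivative is cos(1).
for k in range(1, 16):
    h = 10.0 ** -k
    approx = (math.sin(1 + h) - math.sin(1)) / h
    print(f'h = 1e-{k:02d}   error = {abs(approx - math.cos(1)):.2e}')
```

The error is smallest for h around 1e-8 and then grows by many orders of magnitude as h approaches machine epsilon.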

# IEEE 754

The properties of the double are specified by the IEEE 754 standard. For more information, there are a few excellent documents which should be read, especially David Goldberg's article and Prof. W. Kahan's tour; for convenience, these two files are provided here in PDF format.

# Hexadecimal Format to Binary Format

Consider the following Matlab code which prints out a hexadecimal representation of π:

```
>> format hex
>> pi
pi = 400921fb54442d18
```

First, we must convert this to binary by replacing each hexadecimal character with its corresponding quartet of binary numbers:

```
4    0    0    9    2    1    f    b    5    4    4    4    2    d    1    8
0100 0000 0000 1001 0010 0001 1111 1011 0101 0100 0100 0100 0010 1101 0001 1000
```

# Binary Format to Decimal

The next step is to split the number into the sign bit, the exponent, and the mantissa (the first three hexadecimal characters (12 bits) make up the sign bit and the exponent):

```
0 10000000000 1001001000011111101101010100010001000010110100011000
```

Subtracting 01111111111₂ from the exponent 10000000000₂ yields 1₂, and thus this represents the binary number

1.1001001000011111101101010100010001000010110100011000₂ × 2¹

or

11.001001000011111101101010100010001000010110100011000₂

The integer portion is 11₂, which is 3 in decimal. The fractional part is 1/8 + 1/64 + 1/2048 + 1/4096 + 1/8192 + ⋅⋅⋅ ≈ 0.14159265358979, which is a reasonable approximation of π.
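
We can confirm this interpretation with Python's standard struct module (an illustrative check, not part of the original notes):

```python
import struct

# Convert the 16 hex digits printed by "format hex" back into the
# double they represent (the reverse of packing a float into bytes).
def hex_to_double(hex_str):
    return struct.unpack('>d', bytes.fromhex(hex_str))[0]

print(hex_to_double('400921fb54442d18'))   # 3.141592653589793
```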

# Decimal to Binary Format

To convert a decimal number into the double format, first we must write it in binary form. For example, -523.25 is negative, so we set the sign bit to 1. Since 523.25 = 512 + 8 + 2 + 1 + 1/4 and 512 = 2⁹, this number may be written in binary as 1.00000101101₂ × 2⁹. We add the exponent 9 (= 1001₂) to the bias 01111111111₂ to get 10000001000₂; thus we write down the sign bit, the sum of the exponent and the bias, and the mantissa (dropping the leading 1 and padding to the right with zeros):

```
1 10000001000 0000010110100000000000000000000000000000000000000000
```

or

```
1100000010000000010110100000000000000000000000000000000000000000
```

# Binary Format to Hexadecimal Format

To check this answer, we may break the number into quartets and convert to hexadecimal form:

```
1100 0000 1000 0000 0101 1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
c    0    8    0    5    a    0    0    0    0    0    0    0    0    0    0
```

which is c0805a0000000000, and comparing this to the output of Matlab:

```
>> format hex
>> -523.25
ans = c0805a0000000000
```

# Examples

1. Convert the hex representation c066f40000000000 of a double to binary.

Replacing each hexadecimal digit with its corresponding binary quartet:

```
c    0    6    6    f    4    0    0    0    0    0    0    0    0    0    0
1100 0000 0110 0110 1111 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
```

yielding 1100000001100110111101000000000000000000000000000000000000000000.

2. What is the number which the double 1100000001100110111101000000000000000000000000000000000000000000 represents?

The first bit is 1, so the number is negative. The next 11 bits are 10000000110₂. Subtracting 01111111111₂ from this yields 111₂, which equals 7. Thus, the result is multiplied by 2⁷ = 128.

The mantissa is 1. followed by all bits after the 12th bit, that is:

```
1.0110111101000000000000000000000000000000000000000000
```

This number is

```
1 + 1/4 + 1/8 + 1/32 + 1/64 + 1/128 + 1/256 + 1/1024
```

which equals 1.4345703125. Thus, the number is -1.4345703125 × 128 = -183.625 (recalling that the number is negative).

3. What is the decimal number which is represented by the double 0011111111101000100000000000000000000000000000000000000000000000?

The first bit is 0, so the number is positive. The next 11 bits are 01111111110₂, which is one less than 01111111111₂. Thus, the result is multiplied by 2⁻¹ (or divided by 2).

The mantissa is

```
1.1000100000000000000000000000000000000000000000000000
```

which is the number

```
1 + 1/2 + 1/32
```

which equals 1.53125. Thus, the number is 1.53125 / 2 = 0.765625.
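
As a check, the same answer can be obtained mechanically (a Python sketch for illustration):

```python
import struct

# Example 3: interpret the 64-bit pattern as an integer, then
# reinterpret those same eight bytes as a double.
bits = '0011111111101000100000000000000000000000000000000000000000000000'
value = struct.unpack('>d', int(bits, 2).to_bytes(8, 'big'))[0]
print(value)   # 0.765625
```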

4. Find the double representation of the integer 289.

The number is positive, so the first bit is 0. The binary representation of this number is 100100001₂ (289 = 256 + 32 + 1). Thus, the mantissa will be 001000010000⋅⋅⋅. To get the exponent, we note that 100100001₂ = 1.00100001₂ × 2⁸ (we must move the radix point eight places to the left) and therefore we must add 8 (= 1000₂) to 01111111111₂ to get 10000000111₂.

Therefore, our double representation is:

```
0 10000000111 0010000100000000000000000000000000000000000000000000
```
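
This representation can be verified mechanically; the following Python sketch (illustrative only) compares the assembled bits with the raw bytes of the double 289.0:

```python
import struct

# Example 4: the 64 bits assembled above, written in hexadecimal,
# should agree with the bytes of the double 289.0.
bits = '0' + '10000000111' + '0010000100000000000000000000000000000000000000000000'
from_bits = format(int(bits, 2), '016x')
from_float = struct.pack('>d', 289.0).hex()
print(from_bits)    # 4072100000000000
print(from_float)   # 4072100000000000
```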

5. Find the double representation of 1/8.

1/8 = 2⁻³ = 1.0₂ × 2⁻³, and thus the mantissa is 000⋅⋅⋅0 and the exponent is 01111111111₂ minus 3 (= 11₂). Thus, the exponent is 01111111100₂ and, because the number is positive, the representation is:

```
0 01111111100 0000000000000000000000000000000000000000000000000000
```

6. Find the double-precision floating-point format of -324/33 given that its binary representation is -1001.11010001011101000101110100010111010001011101000101110100010111010001⋅⋅⋅ .

The number is negative, so the first bit is 1. The radix point must be moved three spots to the left to produce a number of the form 1.⋅⋅⋅, so the exponent is 3 = 11₂, and 01111111111₂ + 11₂ = 10000000010₂. Finally, rounding 1.00111010001011101000101110100010111010001011101000101110100010111010001 to 53 bits yields 1.0011101000101110100010111010001011101000101110100011, and thus the representation is

```
1 10000000010 0011101000101110100010111010001011101000101110100011
```

This can be confirmed by using format hex and typing -324/33 into Matlab.
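
The same confirmation can be done in Python (an illustrative sketch, with struct.pack playing the role of format hex):

```python
import struct

# Example 6: assemble the sign bit, biased exponent, and rounded
# mantissa, and compare the result against the bytes Python itself
# computes for -324/33.
bits = ('1' + '10000000010'
        + '0011101000101110100010111010001011101000101110100011')
from_bits = format(int(bits, 2), '016x')
from_float = struct.pack('>d', -324 / 33).hex()
print(from_bits)
print(from_float)
```

The two printed strings agree, because Python's division is correctly rounded, just as the by-hand rounding to 53 bits above.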

7. Describe what the exponent looks like for:

1. any number greater than or equal to 2,
2. any number in the interval [1, 2), and
3. any number less than 1.

Any number greater than or equal to 2 must have an exponent of 1 or greater, and therefore the first bit of the exponent field (that is, the second bit of the double) must be 1.

Any number in [1, 2) must have the exponent 0, and therefore the exponent field must equal the bias, that is, 01111111111₂.

Any (positive) number less than 1 must have a negative exponent, and therefore the exponent field must be some number less than 01111111111₂. We could say that the leading bit of the exponent is 0 and at least one other bit of the exponent is also 0.
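
These observations can be checked by inspecting the exponent field directly (a Python sketch for illustration; the function name is ad hoc):

```python
import struct

# Extract the 11-bit exponent field of a double to confirm the three
# observations above.
def exponent_field(x):
    bits = format(int.from_bytes(struct.pack('>d', x), 'big'), '064b')
    return bits[1:12]

print(exponent_field(2.0))   # 10000000000
print(exponent_field(1.5))   # 01111111111
print(exponent_field(0.5))   # 01111111110
```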

8. Consider finding the derivative.

# Questions

1. Convert the hexadecimal representation c01d600000000000 to binary. (1100000000011101011000000000000000000000000000000000000000000000)

2. What number does the hexadecimal representation c01d600000000000 of a double represent? (-7.34375)

3. What number does the binary representation 0100000001100011001011111000000000000000000000000000000000000000 of a double represent? (153.484375).

4. By converting to decimal and converting the result back to double, add the following two hexadecimal representations of doubles: 3fe8000000000000 and 4011000000000000. (4014000000000000)
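
For question 4, the arithmetic can be checked with a short Python sketch (illustrative only):

```python
import struct

# Question 4: interpret each hex representation as a double, add them
# in decimal, and convert the sum back to its hex representation.
def hex_to_double(h):
    return struct.unpack('>d', bytes.fromhex(h))[0]

total = hex_to_double('3fe8000000000000') + hex_to_double('4011000000000000')
result = struct.pack('>d', total).hex()
print(total, result)   # 5.0 4014000000000000
```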

# Applications to Engineering

In engineering, a less accurate result with a predictable error is better than a more accurate result with an unpredictable error. This was one of the main reasons for standardizing the format of floating-point representations on computers: without standardization, the same code run on different machines could produce very different answers. IEEE 754 standardized the representation and behaviour of floating-point numbers, allowing better prediction of the error, and thus an algorithm designed to run within certain tolerances will perform similarly on all platforms. Standardization also allows the algorithm designer to focus on a single standard, as opposed to wasting time fine-tuning each algorithm for each different machine.

One interesting modification is used by the Intel processors for double-precision floating-point computations: the processor internally stores doubles using 10 bytes, with a 64-bit mantissa and a 15-bit exponent. Thus, a floating-point computation using doubles on an Intel processor will typically be at least as accurate as a computation on another processor which stores doubles in the default 8 bytes.

# Matlab

Matlab uses doubles for all numeric calculations and you can see the representation by using format hex.

```
>> format long
>> 325/23
ans = 14.1304347826087
>> format hex
>> 325/23
ans = 402c42c8590b2164
```

The Matlab-clone Octave has the additional format bit:

```
>> format bit
>> 325/23
ans = 0100000000101100010000101100100001011001000010110010000101100100
```

# Maple

Maple uses doubles if an expression is surrounded by evalhf (evaluate using hardware floats), but you cannot see the representation.

```
> 3.0 + sin(1.0);   # 10 digits by default
3.841470985
> evalhf( 3.0 + sin(1.0) );
3.84147098480789672
```