# Introduction

We first look at ways of storing real numbers on a computer using
decimal numbers. In the next sub-topic, 2.4, we will look at using
binary numbers.

# Background

# References

- Bradie, Section 1.3, Mathematics on the Computer: Floating-Point Number System, p. 30.
- Mathews, Section 1.2, Machine Numbers, p. 20.
- Weisstein, Floating-Point Arithmetic, MathWorld, http://mathworld.wolfram.com/Floating-PointArithmetic.html.
- James W. Demmel, *Applied Numerical Linear Algebra*, Society for Industrial and Applied Mathematics, 1997.

# Interactive Maplet

A Decimal Floating-Point Number Interpreter

# Theory

# Assumptions

It is sometimes unreasonable, or outright impossible, to store numbers
exactly, nor is it reasonable to always keep the maximum amount of precision.
For example, π = 3.1415926535897932385⋅⋅⋅ which
cannot be exactly stored as a decimal number. Other examples
are:

- The product 1.2345678901 × 2.3456789012 = 2.89589985190657035812 where
the two multiplicands each have 11 significant digits but the product requires
21 significant digits.
- The sum 123456789.0 + 0.123456789 = 123456789.123456789 where the two
summands each have 9 significant digits but the sum requires 18 significant
digits.

Thus, any attempt to store all possible digits in a complex
calculation will very quickly require a significant amount of memory, and
therefore it is necessary to truncate (or round) the numbers we store; but this
truncation will reduce the accuracy of our answers. Two of the most important
questions of numerical methods are: how much will this truncation affect
our calculations, and can we avoid some of the problems?
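
The two digit-growth examples above can be checked with exact decimal arithmetic; here is a sketch using Python's `decimal` module, with the working precision set high enough that both operations are exact.

```python
# Checking the two examples with exact decimal arithmetic.
from decimal import Decimal, getcontext

getcontext().prec = 50   # enough precision that both operations are exact

product = Decimal('1.2345678901') * Decimal('2.3456789012')
total = Decimal('123456789.0') + Decimal('0.123456789')

print(product)  # 2.89589985190657035812 (21 significant digits)
print(total)    # 123456789.123456789 (18 significant digits)
```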

# Requirements

To begin, let us set some practical requirements for storing real
numbers:

- To use a fixed amount of memory,
- To be able to represent both very large and very small numbers,
- To be able to represent numbers with a small relative error, and
- To be able to easily test equality and relative magnitude.

# Rounding

Let us first look at the third requirement. If we are to store
a real number (such as π) with only *n* digits in such a way as to minimize the
relative error, we must use rounding. Thus, given a real number, if the digits after
the *n*th are less than 5000⋅⋅⋅, then truncating all digits
after the *n*th digit will minimize the relative error. If the digits after
the *n*th digit are greater than 5000⋅⋅⋅, then truncating the digits
after the *n*th digit and incrementing the *n*th digit will minimize the
relative error. The first process is termed *rounding down*, and the second
*rounding up*. If all digits after the *n*th digit are 5000⋅⋅⋅
exactly, we may use either rounding (up or down), as both values have the same
relative error. Unfortunately, always picking one of these
will lead to a *bias* in our answers:

Suppose we always round down: this will bias our answers, probably
underestimating the actual answer; and if we always round up, the bias will
instead probably overestimate the actual answer.
Thus, when the IEEE 754 specification came out, it suggested the following
rule:

*If all digits after the *n*th digit are 5000⋅⋅⋅ exactly,
then round up if the *n*th digit is odd and round down if the *n*th
digit is even.*

Thus, with sufficiently many calculations, we hope that the choices of rounding
up or down will average each other out, and thus lead to a better approximation.

To summarize, the rules for rounding to *n* digits are:

- If all digits after the
*n*th digit are less than 5000⋅⋅⋅ then round down,
- If all digits after the
*n*th digit are exactly 5000⋅⋅⋅ then round down if the
*n*th digit is even and round up otherwise, and
- If all digits after the
*n*th digit are greater than 5000⋅⋅⋅, then round up.
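
These three rules are the round-half-to-even rule implemented by, for example, Python's `decimal` module; a brief check, rounding to five significant digits:

```python
# Round-half-to-even rounding to 5 significant digits, as described above.
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)

print(ctx.plus(Decimal('5.82380353')))  # 5.8238 (digits 0353 < 5000...: round down)
print(ctx.plus(Decimal('9.28315')))     # 9.2832 (exactly 5000... and 5th digit 1 is odd: round up)
print(ctx.plus(Decimal('9.28325')))     # 9.2832 (exactly 5000... and 5th digit 2 is even: round down)
```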

# Two Representations

Suppose we have the ability to store six decimal digits and a sign, `+` or `-`. We will look at
two methods of storing numbers:

## A Fixed-Point Number (A Poor Choice)

Perhaps the easiest method of storing a real number would be `±NNN.NNN`, where
each `N` is a decimal digit and ± is a sign.
We would store such a number as `±NNNNNN` (discarding the decimal point).
Thus, `+123456` would represent the real
number +123.456. This satisfies the 1st and 4th requirements, but it
is suboptimal for the other two requirements:

- The largest and smallest numbers we can store are
`999.999` and `000.001`, respectively; the first of which
is not very large, and the second is not very small.
- The real number 999.9985 would have to be stored as
`999.998`, which has a relative
error of 0.0000005, while 0.0015 would be stored as `000.002`, which has a very significant
relative error of 0.33.
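
A quick computation of these two relative errors (a short sketch; the stored values are those above):

```python
# Relative errors at the two extremes of the fixed-point format.
best = abs(999.9985 - 999.998) / 999.9985
worst = abs(0.0015 - 0.002) / 0.0015

print(round(best, 9))   # 5e-07: a tiny relative error
print(round(worst, 2))  # 0.33: a very significant relative error
```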

## A Floating-Point Number (A Better Choice)

Instead, let us store a number as `±M.NNN` × 10^{EE − 49},
where `NNN` and `EE` are decimal digits, `M` is a non-zero
decimal digit, and ± is a sign (as before). To store such a number,
we would use the format `±EEMNNN`. We will refer to the
digits `M.NNN` as the *mantissa*, the `EE` as the *exponent*, and the
49 as the *bias*. Looking at the requirements:

- It uses the same fixed amount of memory as before (six decimal digits and a sign),
- The largest and smallest values of the mantissa are 9.999 and 1.000, respectively, and
the largest and smallest values of the exponent are 99 − 49 = 50 and 00 − 49 = -49, so the
largest and smallest representable values are 9.999 × 10^{50} and
1.000 × 10^{-49}, respectively; certainly a significant range.
- The error analysis section demonstrates that the largest possible relative error for
representing any number on the real interval [10^{-49}, 9.999 × 10^{50}]
is 1/2001 ≈ 0.0005, and that for most numbers it is significantly smaller.
- We require that the first digit
`M` is non-zero, as otherwise two
different forms could be used to represent the same number, for example, 2.000 × 10^{49 − 49}
and 0.200 × 10^{50 − 49}. By using the bias, if the signs of two
floating-point numbers *x* and *y* are equal, then comparing magnitude is as easy
as comparing the representations: as integers, 471234 < 479876 < 491234, and as floating-point numbers,
0.01234 < 0.09876 < 1.234.

Thus, this *floating-point* number format appears to satisfy all of our needs,
and while six digits is clearly insufficient for most calculations, we can simply choose
to add more digits to the mantissa and exponent to satisfy our needs.
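
As a sketch of this format, here is a small Python interpreter for `±EEMNNN` strings (the function name `interpret` is ours; exact decimal arithmetic avoids binary rounding artifacts):

```python
# A small interpreter for the seven-character format ±EEMNNN (sketch).
from decimal import Decimal

def interpret(s):
    """Return the real number represented by a string like '+479323'."""
    sign = -1 if s[0] == '-' else 1
    exponent = int(s[1:3]) - 49                # remove the bias of 49
    mantissa = Decimal(s[3] + '.' + s[4:7])    # the digits M.NNN
    return (sign * mantissa * Decimal(10) ** exponent).normalize()

print(interpret('+479323'))  # 0.09323
print(interpret('-499323'))  # -9.323
print(interpret('+509323'))  # 93.23
```

Note that, as integers, the magnitudes of the three representations above order the same way as the values they represent.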

# Zero and Denormalized Numbers

Because we require that the digit `M` ≠ 0, the representations `±000000`
are not currently used and therefore can be used to represent 0; in fact, two representations are
beneficial, as we may use `+000000` to represent all numbers in [0, 0.9995×10^{-49})
and `-000000` to represent all numbers in (-0.9995×10^{-49}, 0].

Problems with floating-point numbers in this format include two
related observations:

- A calculation like
`+003535` divided by 10 results
in a number which cannot be represented using our format, and
therefore must be represented with `+000000` or +0; and
- The difference of two numbers like
`+006500` and
`+006400` should be 10^{-50}; however, this
difference cannot be represented using our format, and
therefore must, again, be represented with `+000000` or +0.
[Demmel]

This can be solved by allowing one exception to the requirement
that `M` ≠ 0 in the case where `EE` is `00`.
Now, if we have a number like `±000NNN`, this will
represent the denormalized number 0.NNN × 10^{-49}.

Both of the above two problems are solved by these denormalized
numbers: in the first case, the division will result in
`+000354` (using the appropriate rounding), and in the second
the difference is `+000100`.
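
The two underflow cases can be checked with exact decimal arithmetic; a sketch (the values are those of the representations named in the comments):

```python
# Checking the two underflow examples with exact decimal arithmetic.
from decimal import Decimal

smallest_normal = Decimal('1.000E-49')       # the normalized number +001000

x = Decimal('3.535E-49') / 10                # +003535 divided by 10
print(x < smallest_normal)                   # True: underflows the normalized range
print(x)                                     # 3.535E-50, denormalized as +000354 after rounding

d = Decimal('6.500E-49') - Decimal('6.400E-49')
print(d)                                     # 1.00E-50, denormalized as +000100
```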

With denormalized numbers, there is a range of numbers
[-10^{-49}, 10^{-49}] on which the maximum
absolute error is now 0.0005 × 10^{-49}, or
5×10^{-53}; however, the relative error is no
longer bounded. A second guarantee is that the difference
of two floating-point numbers is zero **if and only if** the
two representations are equal.

# Important

For the purposes of this class, we will only worry about the mantissa; that is,
when we say that we are using *n* digits of precision, we mean that we intend
to keep *n* significant digits. This is a sufficiently good approximation to
our decimal floating-point number representation, as we will simply assume that the
exponent is neither too large nor too small.

# HOWTO

# Rounding

Given a real number, to round such a number to *n* digits,
consider all digits after the *n*th digit:

- If they are less than 5000⋅⋅⋅, round down (that is, simply remove all digits
after the
*n*th digit),
- If they are greater than 5000⋅⋅⋅, round up (that is, simply remove all digits
after the
*n*th digit and increment the *n*th digit), and
- If they are exactly 5000⋅⋅⋅, round up only if the
*n*th digit is odd, otherwise, round down.

# Interpreting

To interpret the number `±EEMNNN` as a real number,
extract the exponent `EE` and subtract 49 (the bias) from it;
the number represented is then ±M.NNN × 10^{EE − 49}.

# Storing

Consider |*r*|: round the number to four significant digits
and write it in the form *M.NNN* × 10^{n}.
Set `EE` = *n* + 49. Collect the digits, together with the sign of *r*, and store as `±EEMNNN`.
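
These steps can be sketched in Python (the helper name `store` is ours; we write the sign as a `+` or `-` character, whereas the worked examples below write + as a leading 0 and − as a leading 1):

```python
# Storing a real number in the ±EEMNNN format, following the steps above (sketch).
from decimal import Context, Decimal, ROUND_HALF_EVEN

def store(r):
    sign = '-' if r < 0 else '+'
    ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)
    m = ctx.plus(abs(Decimal(str(r))))          # round |r| to four significant digits
    mnnn = ''.join(map(str, m.as_tuple().digits)).ljust(4, '0')  # the digits M, NNN
    ee = m.adjusted() + 49                      # n + 49, the biased exponent
    return f'{sign}{ee:02d}{mnnn}'

print(store(3.141592654))  # +493142
print(store(3628800))      # +553629 (this is 10!)
```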

# Examples

1. Round the following numbers to 5 digits:

- 5.82380353
- 5.82384358
- 5.82385000
- 5.82385031
- 5.82389584

In the first two cases, the digits 0353 and 4358 are less than 5000,
so we round down; in the third case, the 5th digit (8) is even, so we
leave it unchanged; and in the last two cases, 5031 and 9584 are
both greater than 5000, so we round up:

- 5.8238
- 5.8238
- 5.8238
- 5.8239
- 5.8239

2. Round the following numbers to 5 digits:

- 9.28305
- 9.28315
- 9.28325
- 9.33335
- 9.33345
- 9.33355
- 9.33365
- 9.33375
- 9.33385
- 9.33395

In all cases, all digits after the 5th digit are 5000⋅⋅⋅. Thus
we must look at the parity of the 5th digit. In cases 1, 3, 5, 7, and 9, the 5th digit is
even, so we round down, and in cases 2, 4, 6, 8, and 10, the 5th digit is odd, so we round up:

- 9.2830
- 9.2832
- 9.2832
- 9.3334
- 9.3334
- 9.3336
- 9.3336
- 9.3338
- 9.3338
- 9.3340

3. Represent π in our six-digit floating-point format.

First, rounding π = 3.141592654⋅⋅⋅ yields 3.142, which
is 3.142 × 10^{0}, so we store the exponent 0 + 49 = 49 to
account for the bias. Therefore, we represent π by `0493142`, where
the leading 0 denotes the sign +.

4. Represent 10! in our six-digit floating-point format.

First, 10! = 3628800, which, rounded to four significant digits, is 3629000. This is
equal to 3.629 × 10^{6}, and thus we use the exponent 6 + 49 = `55`,
and our representation is `0553629`.

5. What number is represented, using our six-digit floating-point format, by `1234567`?

The leading 1 indicates that the number is negative; therefore, this represents the
number -4.567 × 10^{23 - 49} = -4.567 × 10^{-26}.

# Questions

1. Round the following numbers to 4 digits:

- 832529.5
- 83262.95
- 8325.500
- 832.6500
- 83.55602
- 8.366602

(832500., 83260., 8326., 832.6, 83.56, 8.367)

2. Using our decimal representation, what numbers do the following represent using our six-digit floating-point format?

- +479323
- -499323
- +509323
- -549323

(0.09323, -9.323, 93.23, -932300.)

3. Using our decimal representation, what range of numbers is represented by `+521234`? (The closed interval [1233.5, 1234.5] )

4. What range of numbers is represented by `0522345` using our representation? (The open interval (2344.5, 2345.5) )

5. Represent the following numbers in our format:

- Square root of two (≈ 1.414213562)
- One million (1000000)
- -*e*^{-10} ≈ -0.00004539992976

(0491414, 0551000, 1444540)

# Matlab

Matlab uses binary to store its numbers, so it is difficult to
represent our decimal floating-point number system. Matlab will
be much more useful for subsection 4.4.

# Maple

Maple already uses decimal numbers, so you can simply use:

    evalf[4]( 0.00095232342345 );
    evalf[4]( 9.5232342345 );
    evalf[4]( 9523.2342345 );
    evalf[4]( 952323423.45 );
    evalf[4]( 95232342345000 );

The index [4] indicates that the number should be rounded to four
significant decimal digits. From here, you may set the appropriate exponent.