
2.3 Floating-point Numbers



We look at ways of storing real numbers on a computer first using decimal numbers. In the next sub-topic, 2.4, we will look at using binary numbers.






It is often unreasonable, or even impossible, to store a number exactly, nor is it reasonable to always keep the maximum amount of precision. For example, π = 3.1415926535897932385⋅⋅⋅ cannot be stored exactly as a decimal number with finitely many digits. Other examples are:

  • The product 1.2345678901 × 2.3456789012 = 2.89589985190657035812 where the two multiplicands each have 11 significant digits but the product requires 21 significant digits.
  • The sum 123456789.0 + 0.123456789 = 123456789.123456789 where the two summands each have 9 significant digits but the sum requires 18 significant digits.
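The growth in the number of digits can be checked with a short sketch. It assumes Python and its standard `decimal` module, neither of which is used elsewhere in these notes; the helper `significant_digits` is a name introduced only for this illustration.

```python
from decimal import Decimal, getcontext

# Give the context ample precision so that no rounding occurs here.
getcontext().prec = 50

product = Decimal("1.2345678901") * Decimal("2.3456789012")
total = Decimal("123456789.0") + Decimal("0.123456789")

def significant_digits(x):
    """Number of digits in the coefficient of a Decimal (hypothetical helper)."""
    return len(x.as_tuple().digits)

print(product, significant_digits(product))   # exact product and its digit count
print(total, significant_digits(total))       # exact sum and its digit count
```

With 50 digits of working precision both results are exact, so the digit counts reported are those of the true product and sum.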

Thus, any attempt to store all possible digits while performing a complex calculation will very quickly require a significant amount of memory. It is therefore necessary to truncate (or round) the numbers we store, but this truncation will reduce the accuracy of our answers. The most important questions of numerical methods are: how much will this truncation affect our calculations, and can we avoid some of the problems?


To begin, let us set some practical requirements for storing real numbers:

  1. To use a fixed amount of memory,
  2. To be able to represent both very large and very small numbers,
  3. To be able to represent numbers with a small relative error, and
  4. To be able to easily test equality and relative magnitude.


Let us first look at the third requirement. If we are to store a real number (such as π) with only n digits in such a way to minimize the relative error, we must use rounding. Thus, given a real number, if all digits after the nth are 4999⋅⋅⋅ or less, then truncating all digits after the nth digit will minimize the relative error. If all digits after the nth digit are greater than 5000⋅⋅⋅ then truncating the digits after the nth digit and incrementing the nth digit will minimize the relative error. The first process is termed rounding down, and the second rounding up. If all digits after the nth digit are 5000⋅⋅⋅ exactly, we may use either rounding (up or down), as both values have the same relative error. Unfortunately, picking one of these will lead to a bias in our answers:

Suppose we always round down: this will lead to a bias in our answers which will probably underestimate the actual answer. If we always round up, this will lead to a bias which will probably overestimate the actual answer. Thus, the IEEE 754 specification suggests the following rule:

If all digits after the nth digit are 500⋅⋅⋅ exactly, then round up if the nth digit is odd and round down if the nth digit is even.

Thus, with sufficiently many calculations, we hope that the choices of rounding up or down will average each other out, and thus lead to a better approximation.
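A small experiment can illustrate the bias. The sketch below assumes Python's `decimal` module, whose `ROUND_HALF_UP` mode rounds ties away from zero (which, for positive numbers, is always rounding up) and whose `ROUND_HALF_EVEN` mode follows the rule above. The ten test values are chosen for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

# Ten values that each lie exactly halfway between two 2-digit numbers.
ties = [Decimal(f"1.{d}5") for d in range(10)]   # 1.05, 1.15, ..., 1.95

step = Decimal("0.1")   # keep one digit after the decimal point
exact = sum(ties)
half_up = sum(x.quantize(step, rounding=ROUND_HALF_UP) for x in ties)
half_even = sum(x.quantize(step, rounding=ROUND_HALF_EVEN) for x in ties)

print(exact)       # 15.00
print(half_up)     # 15.5, an overestimate
print(half_even)   # 15.0, the individual errors cancel
```

Here the cancellation is exact because the ties are split evenly between odd and even digits; in general we can only hope that the errors average out.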

To summarize, the rules for rounding to n digits are:

  • If all digits after the nth digit are less than 5000⋅⋅⋅ then round down,
  • If all digits after the nth digit are exactly 5000⋅⋅⋅ then round down if the nth digit is even and round up otherwise, and
  • If all digits after the nth digit are greater than 5000⋅⋅⋅, then round up.
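These three rules can be sketched in a few lines, assuming Python's `decimal` module. The function name `round_sig` is introduced here for illustration, and the input is passed as a string so that no binary rounding takes place before the decimal rounding.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(x, n):
    """Round the decimal string x to n significant digits, ties to even."""
    d = Decimal(x)
    if d == 0:
        return d
    # d.adjusted() is the power of ten of the leading digit, so this
    # quantum is one unit in the nth significant digit.
    quantum = Decimal(1).scaleb(d.adjusted() - n + 1)
    return d.quantize(quantum, rounding=ROUND_HALF_EVEN)

print(round_sig("3.1415926535897932385", 4))   # 3.142
print(round_sig("2.5", 1))                     # 2 (tie, 2 is even)
print(round_sig("3.5", 1))                     # 4 (tie, 3 is odd)
```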

Two Representations

Suppose we have the ability to store six decimal digits and a sign, + or -. We will look at two methods of storing numbers:

A Fixed-Point Number (A Poor Choice)

Perhaps the easiest method of storing a real number would be ±NNN.NNN where each N is a decimal digit and ± is a sign. We would store such a number as ±NNNNNN (discarding the decimal point). Thus, +123456 would represent the real number +123.456. This satisfies the 1st and 4th requirements, but it is suboptimal for the other two requirements:

  • The largest and smallest positive numbers we can store are 999.999 and 000.001, the first of which is not very large, and the second of which is not very small.
  • The real number 999.9985 would have to be stored as 999.998, which has a relative error of 0.0000005, while 0.0015 would be stored as 0.002, which has a very significant relative error of 0.33.
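The two relative errors quoted above can be reproduced with a short sketch, assuming Python's `decimal` module. The helpers `to_fixed` and `relative_error` are names introduced for illustration only.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def to_fixed(x):
    """Round x to three digits after the decimal point, ties to even."""
    return Decimal(x).quantize(Decimal("0.001"), rounding=ROUND_HALF_EVEN)

def relative_error(x):
    """Relative error made by storing x in the fixed-point format."""
    exact = Decimal(x)
    return abs(to_fixed(x) - exact) / abs(exact)

for x in ("999.9985", "0.0015"):
    print(x, to_fixed(x), relative_error(x))
```

The absolute error is 0.0005 in both cases; only the size of the number being stored changes, which is why the relative error grows so quickly for small values.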

A Floating-Point Number (A Better Choice)

Instead, let us store a number in the form ±M.NNN × 10^(EE − 49) where NNN and EE are decimal digits, M is a non-zero decimal digit, and ± is a sign (as before). To store such a number, we would use the format ±EEMNNN. We will refer to the digits M.NNN as the mantissa and to EE as the exponent, and the 49 is termed a bias. Looking at the requirements:

  1. It uses the same fixed amount of memory as before (6 decimal digits and a sign)
  2. The largest and smallest values of the mantissa are 9.999 and 1.000, respectively, and the largest and smallest values of the exponent are 99 − 49 = 50 and 00 − 49 = -49, so the largest and smallest representable positive values are 9.999 × 10^50 and 1.000 × 10^-49, respectively; certainly a significant range.
  3. The error analysis section demonstrates that the largest possible relative error for representing any number on the real interval [10^-49, 9.999 × 10^50] is 1/2001 ≈ 0.0005 and that for most numbers, it is significantly smaller.
  4. We require that the first digit M is non-zero, as otherwise, two different forms could be used to represent the same number, for example, 2.000 × 10^(49 − 49) and 0.200 × 10^(50 − 49). By using the bias, if the signs of two floating-point numbers x and y are equal, then comparing magnitudes is as easy as comparing the representations as integers. For example, as integers, 471234 < 479876 < 491234 and as floating-point numbers 0.01234 < 0.09876 < 1.234.
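The third and fourth points can be checked numerically. The sketch below assumes Python's `fractions` module for exact arithmetic; the helper `value` is a name introduced for illustration and handles only positive, normalized representations written without the sign.

```python
from fractions import Fraction

def value(rep):
    """Value of a positive representation EEMNNN given as a 6-digit string."""
    exponent = int(rep[:2]) - 49              # remove the bias
    mantissa = Fraction(int(rep[2:]), 1000)   # MNNN -> M.NNN
    return mantissa * Fraction(10) ** exponent

reps = ["471234", "479876", "491234"]
# Sorting the representations as integers also sorts the values.
assert sorted(reps, key=int) == sorted(reps, key=value)
print([float(value(r)) for r in reps])

# One worst case for the relative error: 1.0005 lies halfway between
# 1.000 and 1.001 and is stored as 1.000.
x = Fraction(10005, 10000)
print(abs(Fraction(1) - x) / x)   # 1/2001
```

This shows only that the bound 1/2001 is attained; that no number in the range does worse is the subject of the error analysis section.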

Thus, this floating-point number format appears to satisfy all of our needs, and while six digits is clearly insufficient for most calculations, we can simply choose to add more digits to the mantissa and exponent to satisfy our needs.

Zero and Denormalized Numbers

Because we require that the digit M ≠ 0, the representations ±000000 are not currently used and therefore can be used to represent 0; however, having two representations is actually beneficial, as we may use +000000 to represent all numbers in [0, 0.9995 × 10^-49) and -000000 to represent all numbers in (-0.9995 × 10^-49, 0].

Problems with floating-point numbers in this format include two related observations:

  • A calculation like +003535 divided by 10 results in a number which cannot be represented using our format, and therefore must be represented with +000000 or +0; and
  • The difference of two numbers like +006500 and +006400 should be 10^-50; however, this difference cannot be represented using our format, and therefore must, again, be represented with +000000 or +0. [Demmel]

This can be solved by allowing one exception to the requirement that M ≠ 0 in the case where EE is 00. Now, if we have a number like ±000NNN, this will represent the denormalized number 0.NNN × 10^-49.

Both of the above two problems are solved by these denormalized numbers: in the first case, the division will result in +000354 (using the appropriate rounding), and in the second the difference is +000100.

With denormalized numbers, there is a range of numbers [-10^-49, 10^-49] on which the maximum absolute error is now 0.0005 × 10^-49 or 5 × 10^-53; however, the relative error is no longer bounded. A second guarantee is that the difference of two floating-point numbers is zero if and only if the two representations are equal.
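The two problem cases can be reworked with a short sketch, assuming Python's `decimal` module. The helper `store_tiny` is hypothetical: it omits the sign and assumes the rounded value stays below 10^-48, so that EE = 00 and the representable values are spaced 0.001 × 10^-49 = 10^-52 apart.

```python
from decimal import Decimal, ROUND_HALF_EVEN

QUANTUM = Decimal("1E-52")   # spacing of representable values when EE = 00

def store_tiny(x):
    """Digits EEMNNN (sign omitted) for 0 <= x < 10^-48, allowing M = 0."""
    steps = (Decimal(x) / QUANTUM).quantize(Decimal(1), rounding=ROUND_HALF_EVEN)
    return "00" + f"{int(steps):04d}"

quotient = Decimal("3.535E-49") / 10
difference = Decimal("6.500E-49") - Decimal("6.400E-49")

print(store_tiny(quotient))     # 000354
print(store_tiny(difference))   # 000100
```

Because normalized numbers with EE = 00 and denormalized numbers share the same spacing, a single rounding step covers both cases.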


For the purposes of this class, we will only worry about the mantissa, that is we will say that we are using n digits of precision if we mean that we intend to keep n significant digits. This is a sufficiently good approximation to our decimal floating-point number representation, as we will simply assume that the exponent will not be either too large or too small.



Given a real number, to round such a number to n digits, consider all digits after the nth digit:

  1. If they are less than 5000⋅⋅⋅, round down (that is, simply remove all digits after the nth digit),
  2. If they are greater than 5000⋅⋅⋅, round up (that is, simply remove all digits after the nth digit and increment the nth digit), and
  3. If they are exactly 5000⋅⋅⋅, round up only if the nth digit is odd, otherwise, round down.


To interpret the number ±EEMNNN as a real number, we extract the exponent EE and subtract 49 (the bias) from it; the number represented is then ±M.NNN × 10^(EE − 49).
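The interpretation step can be sketched as follows, assuming Python's `decimal` module. The function name `decode` is introduced for illustration. It accepts the sign either as + or -, or as a leading 0 or 1 as in the worked examples, and it does not handle denormalized numbers.

```python
from decimal import Decimal

BIAS = 49

def decode(rep):
    """Interpret a 7-character string such as '+491234' as a Decimal."""
    sign = -1 if rep[0] in ("-", "1") else 1   # '-' or a leading 1 is negative
    exponent = int(rep[1:3]) - BIAS            # EE minus the bias
    mantissa = Decimal(rep[3:]).scaleb(-3)     # MNNN -> M.NNN
    return sign * mantissa.scaleb(exponent)

print(decode("+479323"))   # 0.09323
print(decode("-499323"))   # -9.323
```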


To store a non-zero real number r, consider |r|: round the number to 4 decimal digits and write the number in the form M.NNN × 10^n. Set EE = n + 49. Collect the digits, attach the sign of r, and store as ±EEMNNN.
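The storing procedure can be sketched in the same way, assuming Python's `decimal` module. The function name `encode` is introduced for illustration. It writes the sign as + or -, requires a non-zero input, and assumes that n + 49 falls between 00 and 99 (no range check is made).

```python
from decimal import Decimal, ROUND_HALF_EVEN

BIAS = 49

def encode(x):
    """Store the non-zero decimal string x as a sign followed by EEMNNN."""
    d = Decimal(x)
    sign = "-" if d < 0 else "+"
    d = abs(d)
    # Round to 4 significant digits, ties to even.
    quantum = Decimal(1).scaleb(d.adjusted() - 3)
    d = d.quantize(quantum, rounding=ROUND_HALF_EVEN)
    n = d.adjusted()                  # exponent in M.NNN x 10^n, after rounding
    digits = int(d.scaleb(3 - n))     # MNNN as an integer
    return f"{sign}{n + BIAS:02d}{digits:04d}"

print(encode("3.141592654"))   # +493142
print(encode("3628800"))       # +553629
```

The exponent n is read off after rounding, so a value such as 9.9996, which rounds up to 10.00, is stored with the larger exponent as +501000.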


1. Round the following numbers to 5 digits:

  1. 5.82380353
  2. 5.82384358
  3. 5.82385000
  4. 5.82385031
  5. 5.82389584

In the first two cases, the digits 0353 and 4358 are less than 5000, so they are rounded down. In the third case, the digits after the 5th digit are exactly 5000 and the 5th digit is even, so we do not change the 5th digit. In the last two cases, 5031 and 9584 are both greater than 5000, so we round up:

  1. 5.8238
  2. 5.8238
  3. 5.8238
  4. 5.8239
  5. 5.8239

2. Round the following numbers to 5 digits:

  1. 9.28305
  2. 9.28315
  3. 9.28325
  4. 9.33335
  5. 9.33345
  6. 9.33355
  7. 9.33365
  8. 9.33375
  9. 9.33385
  10. 9.33395

In all cases, all digits after the 5th digit are 5000⋅⋅⋅. Thus we must look at the parity of the 5th digit. In 1, 3, 5, 7, 9, the 5th digit is even, so we leave it, and in 2, 4, 6, 8, 10 the 5th digit is odd, so we raise it:

  1. 9.2830
  2. 9.2832
  3. 9.2832
  4. 9.3334
  5. 9.3334
  6. 9.3336
  7. 9.3336
  8. 9.3338
  9. 9.3338
  10. 9.3340

3. Represent π in our six-digit floating-point format.

First, rounding π = 3.141592654⋅⋅⋅ to four digits yields 3.142, and this is 3.142 × 10^0, so we choose the exponent 49 to account for the bias. Therefore, using a leading 0 to denote a positive sign, we represent π by 0493142.

4. Represent 10! in our six-digit floating-point format.

First, 10! = 3628800, which, rounded to four digits, is 3629000. This is equal to 3.629 × 10^6, and thus we use the exponent 55 and our representation is 0553629.

5. What number is represented, using our six-digit floating-point format, by 1234567?

The leading 1 indicates that it is negative, so this represents the number -4.567 × 10^(23 − 49) = -4.567 × 10^-26.


1. Round the following numbers to 4 digits:

  1. 832529.5
  2. 83262.95
  3. 8325.500
  4. 832.6500
  5. 83.55602
  6. 8.366602

(832500., 83260., 8326., 832.6, 83.56, 8.367)

2. Using our six-digit floating-point format, what numbers do the following represent?

  1. +479323
  2. -499323
  3. +509323
  4. -549323

(0.09323, -9.323, 93.23, -932300.)

3. Using our decimal representation, what range of numbers is represented by +521234? (The closed interval [1233.5, 1234.5] )

4. What range of numbers is represented by 0522345 using our representation? (The open interval (2344.5, 2345.5) )

5. Represent the following numbers in our format:

  1. Square root of two (≈ 1.414213562)
  2. One million (1000000)
  3. -e^-10 ≈ -0.00004539992976

(0491414, 0551000, 1444540)


Matlab uses binary to store its numbers, so it is difficult to represent our decimal floating-point number system. Matlab will be much more useful for subsection 2.4.


Maple already uses decimal numbers, so you can simply use:

evalf[4]( 0.00095232342345 );
evalf[4]( 9.5232342345 );
evalf[4]( 9523.2342345 );
evalf[4]( 952323423.45 );
evalf[4]( 95232342345000 );

The index [4] indicates that the number should be rounded to four decimal digits. From here, you may set the appropriate exponent.