Lesson 1.25: Floating-point numbers

Up to now, we have discussed the integer data types char, short, int, and long, which occupy 1, 2, 4 and 8 bytes, respectively. These, however, can only store integers. It is often necessary to use approximations of real numbers, and for this we use the float and double primitive data types.

While we will very briefly describe the binary representation of float and double at the end of this topic, the details are better left to a later course on numerical analysis; here we focus on an intuitive decimal model.

In secondary school, you learned two representations for real numbers: standard and scientific notation.

Standard or fixed-point notation

For the standard notation, you write a decimal point with zero or more decimal digits before and after it, and at least one digit in total. For example, all of the following are in standard notation:

$.3$, $5.$, $865253.252$, $523000000.$ and $0.000000000713$.

We will refer to this notation as fixed point because the decimal point remains fixed after the ones digit.

As can be seen, when expressing either very large or very small numbers, the resulting fixed-point notation may have a significant number of zeros, and writing or counting the zeros can be error prone and not very insightful.

For example, the Avogadro constant is $602214076000000000000000$ while the Planck constant is $0.000000000000000000000000000000000662607015$.

A better notation is scientific notation.

Scientific or floating-point notation

In secondary school, you were taught to write numbers using a number $s$ (the significand) with $1 \le s < 10$ multiplied by 10 raised to an integer power.

For example, using scientific notation, the Avogadro constant is $6.02214076 \times 10^{23}$ while the Planck constant is $6.62607015 \times 10^{-34}$.

We will refer to this notation as floating point because multiplying by ten raised to an integer power moves, or floats, the decimal point.

Storing floating-point numbers

If you had a fixed amount of space in which to write down the digits used for calculations, you might, to simplify your job, consider a simpler notation:

      +  23 6.02214076
      + -34 6.62607015

where the first symbol is the sign (either + or -), the next number is the exponent, and the last number is the significand. Some examples are shown here:

- -3 7.55433 $-7.55433 \times 10^{-3}$
+ -8 5.01935332 $5.01935332 \times 10^{-8}$
- 5 9.710 $-9.710 \times 10^{5}$
+ 12 4.79173 $4.79173 \times 10^{12}$

Significant digits and range of the exponent

Now, the more digits you can write down, the more precise your answer can be. Like integers, however, the computer can only store a fixed number of digits in both the significand and the exponent.

While there are four different primitive data types for storing integers, there are two common primitive data types for storing floating-point numbers. The single-precision floating-point representation (or float) uses four bytes, while the double-precision floating-point representation (or double) uses eight bytes.

Similarly, the more digits we allocate to the exponent, the greater the range of numbers we can store.

To grasp approximately how much precision and range of the exponent these two primitive data types have, you can think of it as follows:

    float   ± ±EE  M.MMMMMM                seven significant digits
   double   ± ±EEE M.MMMMMMMMMMMMMMM     sixteen significant digits

The data type float can store numbers approximately in the range $10^{-99}$ up to $10^{99}$ (in reality, the range is somewhat smaller) with seven significant digits.

The data type double can store numbers approximately in the range $10^{-999}$ up to $10^{999}$ (in reality, the range is much smaller) with sixteen significant digits.

Floating-point calculations

The biggest issues with floating-point calculations are, first, that every intermediate result must be rounded to fit in the available memory, and second, as a consequence, that errors accumulate as more operations are performed.

For example, in multiplying the two floats,

            + +11 2.728930
            - -03 3.109243

we would calculate the product of the significands and add the exponents to get

            - +08 8.484906499990

however, we can only store this number to seven significant digits, so we must round to that many digits:

            - +08 8.484906

This introduces an error, here and in every calculation. In this case, the error introduced in the significand was $0.000000499990$ (a relative error of about one part in 100 million). If we were to store these numbers as doubles, the resulting calculation would be significantly more accurate but would also require more space and more time to perform:

            + +011 2.728930000000000
            - -003 3.109243000000000

multiplied together gives

            - +008 8.484906499990000

Now the error introduced by rounding is closer to the order of one part in 100 quadrillion, so very small, but still not negligible: it can still cause problems. (In this particular example, the product has only thirteen significant digits, so the double-precision significand happens to fit exactly.)

There is significantly more precision in double-precision floating-point numbers, so as an engineer, you should always use double. The one exception to this is computer graphics, where the errors introduced by float are usually not noticeable to the human eye.

Note: In your course on numerical analysis, you will learn the actual binary representation; the explanation above is a good first approximation, useful for an intuitive understanding of the representation.

Problems with the floating-point representation

We will see that there are three obvious weaknesses:

  1. It may happen that $x + y = x$ even if $y \ne 0$.
  2. It is not necessarily true that $(x + y) + z = x + (y + z)$.
  3. You may lose a lot of precision if you subtract similar numbers.

We will give one example of each. Adding these two numbers

            + +011 2.728938275813209
            + -011 3.109243720395902

should be

            + +011 2.7289382758132090000003109243720395902

but rounded to 16 significant digits results in

            + +011 2.728938275813209

which is the first of the numbers.

Next, consider adding these three numbers:

            - +011 2.728938275813209
            + +011 2.728938275813209
            + -011 3.109243720395902

The first two numbers differ only in sign. Thus, if we add the first two, we get $0$, and if we add the third number to $0$, we get the last number.

However, if we add the last two numbers first, we saw in the previous example that the larger number comes out unchanged. If we then add that to the first number, the result is $0$, which is not the correct answer.

Finally, suppose we want to approximate the derivative of $\sin(x)$ at $x = 1$. We can use the formula

$\frac{\sin(1 + h) - \sin(1 - h)}{2h}$

with $h = 10^{-8}$. To sixteen significant digits, $\sin(1 + h) = 0.8414709902109195$ and $\sin(1 - h) = 0.8414709794048734$, so when we calculate the formula, we get $0.5403023050000000$, which is a reasonable approximation of $\cos(1) = 0.5403023058681397$, the value we are trying to estimate.

From calculus, you know that the derivative is the limit of this formula as $h$ approaches $0$, so we should get a better approximation with $h = 10^{-13}$:

With this new $h$, we have $\sin(1 + h) = 0.8414709848079505$ and $\sin(1 - h) = 0.8414709848078425$, but if we now calculate our formula, it works out to $0.5400000000000000$, which is a much worse approximation of $\cos(1)$ than our first approximation. This is because, to sixteen significant digits, $\sin(1 + h)$ and $\sin(1 - h)$ are essentially the same, so their difference retains only a few significant digits.

You do not have to understand the details of these problems, but you must be aware of them.

IEEE 754

The format and computations associated with float and double are specified in the IEEE 754 standard. In essence, this guarantees that if you perform the same calculation on two different computers (or with two different programming languages), you should not notice a significant difference, and usually the results will be identical.

The one well-known exception to this is Java, which has both a float and a double, but whose implementations are considered by many to be highly sub-optimal.

Questions and practice:

1. Add the following two numbers:

   + +12 3.598273
   + +12 2.708808

2. Add the following two numbers:

   + -12 5.293857
   + -12 8.291735

3. Add the following two numbers:

   + -23 5.293857
   - -23 1.261523

4. Add the following two numbers:

   + +38 5.293857
   - +38 5.293472

5. What is the multiplicative inverse (the reciprocal) of the following number?

   + +27 8.000000

Can you see an easier way of calculating it than computing the reciprocal of $8000000000000000000000000000$?

(You should get + -28 1.250000.)

6. Show that if you were to keep 26 decimal digits of precision, then the value $h = 10^{-13}$ would give a better approximation of $\cos(1)$. For your information:

   sin(1 + h) = 0.84147098480795053688308913
   sin(1 - h) = 0.84147098480784247642191550

You should get an answer close to $0.54030230586815$, which is much more accurate than our approximation when we used $h = 10^{-8}$.