
Lesson 1.25: Floating-point numbers

Up to now, we have only used literal integers in our code, and those literals have always been written as decimal numbers. Inside the computer, however, these numbers must be stored in binary. The compiler takes your numbers and converts them into binary, and in this topic we will look at that binary representation.

Review of binary and hexadecimal

To begin, we will have a quick review of the binary representation together with hexadecimal.

Recall that if $b_n b_{n - 1} \cdots b_2 b_1 b_0$ is a binary number (each $b_k$ being either a $0$ or a $1$), it represents the decimal number:

$b_n \times 2^{n} + b_{n - 1} \times 2^{n - 1} + \cdots + b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0$

Thus, $1001110$ as a binary number represents

$1 \times 2^6 + 0 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0$

or

$1 \times 64 + 0 \times 32 + 0 \times 16 + 1 \times 8 + 1 \times 4 + 1 \times 2 + 0 \times 1$.

or

$64 + 0 + 0 + 8 + 4 + 2 + 0 = 78$.

To quickly convert the binary integer to decimal, we could start reading from the least-significant bit, and add the corresponding power of two if the bit is a 1, so $1001110$ would be $2 + 4 + 8 + 64 = 78$.

Similarly, $1100110011$ would be $1 + 2 + 16 + 32 + 256 + 512 = 819$.

This should help you understand approximately how big a number is by looking at the largest power of two in the number, so $101100011$ is larger than $256$ (which equals $100000000$) but smaller than $512$ (the next highest power of two).
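If you would like to check these conversions yourself, C++ binary literals let the compiler do the conversion for you. The following is a small sketch, not part of the original notes:

// A quick check of the conversions above using binary literals
#include <iostream>

int main() {
    std::cout << 0b1001110    << std::endl;  // prints 78
    std::cout << 0b1100110011 << std::endl;  // prints 819
    std::cout << 0b101100011  << std::endl;  // prints 355, which lies between 256 and 512

    return 0;
}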

Fixed memory

In primary and secondary school, you used as many digits as were necessary to store the integer you were writing down, but unless you were writing down the population of the world, these numbers were usually small.

For a computer, storing the number of digits together with the digits themselves would be prohibitively expensive, and therefore integers are always stored using a known and fixed number of bits. In a future topic, we will see that in main memory, eight bits are grouped together into a unit called a byte (a small bite of memory, just a little more than a bit). It is convenient to use multiples of these bytes, so in C++, integers may be stored using 1, 2, 4 or 8 bytes.

If we are storing only positive integers, then

Bytes | Smallest | Largest | Value
1 | 00000000 | 11111111 | $2^8 - 1 = 255$
2 | 0000000000000000 | 1111111111111111 | $2^{16} - 1 = 65535$
4 | 00000000000000000000000000000000 | 11111111111111111111111111111111 | $2^{32} - 1 \approx 4{\rm\ billion}$
8 | 0000000000000000000000000000000000000000000000000000000000000000 | 1111111111111111111111111111111111111111111111111111111111111111 | $2^{64} - 1 \approx 18{\rm\ quintillion}$

Unsigned integer addition

Unsigned integer addition works as usual, but there may be a left-over carry at the end. For example, in adding the following two unsigned shorts, we have:

        111  1  1 111  1
         1010010010101001
       + 0110010010111101
         ----------------
        10000100101100110

The carry produces an integer that requires 17 bits, not 16. As our storage is limited to two bytes, we discard that leading bit. Consequently, adding one to the largest unsigned short wraps back to the value 0, as the following example shows.

// Pre-processor include directives
#include <iostream>

// Function declarations
int main();

// Function definitions
int main() {
	// The largest unsigned short; all bits are 1
	unsigned short n{0xffff}; // = 0b1111111111111111 = 65535

	std::cout << "n = " << n << std::endl;
	std::cout << "n + 1 = " << (n + 1) << std::endl;
	std::cout << "Auto-incrementing n..." << std::endl;
	++n;
	std::cout << "n = " << n << std::endl;

	return 0;
}

You can execute this code on repl.it. You should, however, be curious about the output:

n = 65535
n + 1 = 65536
Auto-incrementing n...
n = 0

Why does calculating n + 1 print 65536, while auto-incrementing n (++n) causes it to wrap back to zero? When the compiler sees the literal integer 1, it interprets it as a 32-bit int, so the value stored in n is first converted to an int (that is, 0x0000ffff); adding one to this value yields 0x00010000, which equals 65536. The auto-increment operation, on the other hand, is performed on the unsigned short itself, and thus the final carry is discarded.

The general rule is that if the operands of an arithmetic operation on integer data types have different sizes (for example, adding an unsigned short to an unsigned long), the operand occupying less memory is up-cast to the type occupying more memory.
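The following short sketch (an illustration, not part of the original example) confirms this on a typical platform where int is wider than unsigned short: the expression n + 1 has type int.

// Checking that arithmetic on an unsigned short is performed as an int
// (an assumption: int is wider than unsigned short, as on typical platforms)
#include <iostream>
#include <type_traits>

int main() {
    unsigned short n{0xffff};

    // decltype( n + 1 ) is the type of the expression n + 1
    static_assert( std::is_same_v<decltype( n + 1 ), int>,
                   "n + 1 is computed as an int" );

    std::cout << "n + 1 = " << (n + 1) << std::endl;  // prints 65536

    return 0;
}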

Memory and the call stack

Previously, we discussed the call stack, and we described how parameters and local variables are given memory on the call stack. We can now explicitly state how much memory is allocated. For example, a function declared as unsigned int gcd( unsigned int m, unsigned int n ); would require eight bytes for its parameters (four for each unsigned int), and any additional local variables would be assigned additional space as necessary on the call stack.
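As an illustration, one possible definition of such a function is sketched below (the body is an assumption, not taken from these notes); its two parameters occupy four bytes each, and the local variable r adds four more bytes to the stack frame of each call.

// A possible gcd definition: two four-byte parameters plus a four-byte
// local variable are allocated on the call stack for each call
unsigned int gcd( unsigned int m, unsigned int n ) {
    while ( n != 0 ) {
        unsigned int r{ m % n };  // local variable: four additional bytes
        m = n;
        n = r;
    }

    return m;
}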

Wasted space? What is the appropriate size?

Suppose you are designing a swarm of drones that are expected to interact. Each drone needs to be given a unique identifying number. If you are 100% certain that you will never need more than 256 drones, you could use an identifying number that occupies only one byte, an unsigned char. However, suppose you may, at some point, decide to expand your fleet of drones so that you may be dealing with thousands (for example, to participate in a coordinated search-and-rescue mission over hundreds of square kilometers). In this case, a prudent designer would immediately allocate an unsigned short for the identifying number, allowing up to 65536 drones. It may be easier to use an unnecessarily large type from the start than to change the type throughout the design later.

Suppose, however, you only want to run a for loop from 0 to 19. For the most part, in small programs executed on a general-purpose computer (be it a desktop or laptop computer), using int or unsigned int is fine. It is only once you start taking courses in operating systems and embedded systems, as part of your third-year courses and technical electives, that you will need to be concerned with such details.

It is in embedded systems that wasted memory becomes an issue, specifically because additional memory adds cost, consumes more power, and generates more heat, all of which must be mitigated by the developers.

Determining the size of a type

In order to determine the amount of memory occupied by a particular type, the easiest way is to use the sizeof operator:

// Pre-processor include directives
#include <iostream>

// Function declarations
int main();

// Function definitions
int main() {
  std::cout << "Signed and unsigned integer data "
            << std::endl;
  std::cout << "types occupy the same memory"
            << std::endl;
  std::cout << "sizeof( char )           == "
            <<  sizeof( char ) << std::endl;
  std::cout << "sizeof( unsigned char )  == "
            <<  sizeof( unsigned char ) << std::endl;
  std::cout << "sizeof( short )          == "
            <<  sizeof( short ) << std::endl;
  std::cout << "sizeof( unsigned short ) == "
            <<  sizeof( unsigned short ) << std::endl;
  std::cout << "sizeof( int )            == "
            <<  sizeof( int ) << std::endl;
  std::cout << "sizeof( unsigned int )   == "
            <<  sizeof( unsigned int ) << std::endl;
  std::cout << "sizeof( long )           == "
            <<  sizeof( long ) << std::endl;
  std::cout << "sizeof( unsigned long )  == "
            <<  sizeof( unsigned long ) << std::endl;
  std::cout << "sizeof( float )          == "
            <<  sizeof( float ) << std::endl;
  std::cout << "sizeof( double )         == "
            <<  sizeof( double ) << std::endl;
  std::cout << "sizeof( bool )           == "
            <<  sizeof( bool ) << std::endl;

  return 0;
}

The number printed is the number of bytes that the data type occupies. Recall that a byte is eight bits. You can try this code on repl.it.

Now, you may have noticed that sizeof is referred to as an operator and not a function. This is because the operand is a type, not a value (e.g., a literal, local variable or parameter). The compiler determines the memory allocated for the type; this is not determined at run time.
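Because the result is known at compile time, it can even be used where a constant expression is required. The following sketch (which assumes a platform with four-byte ints) would fail to compile on a platform where that assumption does not hold:

// sizeof( int ) is a compile-time constant, so it may appear in a static_assert
static_assert( sizeof( int ) == 4, "these notes assume a four-byte int" );

int main() {
    return 0;
}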

Upcasting (aside)

Upcasting an unsigned integer is quite simple: the additional bytes are simply set to zero. Thus, if an unsigned short has the value 0b0010101001011001 and it is up-cast to an unsigned int or an unsigned long, the additional bits are simply set to zero:

    00000000000000000010101001011001
    0000000000000000000000000000000000000000000000000010101001011001

All three of these have the same numeric value (in this example, 10841).

Upcasting a signed integer may appear to be a problem, as the first bit is 1 when the number is negative. How would you up-cast 0b1101010110100111, which is the short representation of -10841? The short answer is: simply set all of the additional bits to 1 (that is, copy the sign bit); thus, upcasting this value to an int or a long, you have

    11111111111111111101010110100111
    1111111111111111111111111111111111111111111111111101010110100111
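Here is a small sketch (not from the original notes) showing that this sign extension preserves the numeric value; the casts to unsigned types are there only to print the raw bit patterns in hexadecimal:

// Up-casting a negative short to an int preserves its value; the extra
// bits are copies of the sign bit
#include <iostream>

int main() {
    short m{-10841};  // 0b1101010110100111
    int   n{m};       // 0b11111111111111111101010110100111

    std::cout << "m = " << m << std::endl;  // prints -10841
    std::cout << "n = " << n << std::endl;  // prints -10841

    // Print the underlying bit patterns in hexadecimal
    std::cout << std::hex;
    std::cout << "bits of m: " << static_cast<unsigned short>(m) << std::endl;  // d5a7
    std::cout << "bits of n: " << static_cast<unsigned int>(n)   << std::endl;  // ffffd5a7

    return 0;
}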

Initial values

Suppose you assign to a local variable a literal that cannot be represented using that primitive data type. In this case, whatever the number is, its value is reduced with the appropriate modulus (for an $n$-bit type, modulo $2^n$).
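For example, one byte cannot store the literal 300. The following sketch (not in the original notes) shows what happens with a typical compiler, which will usually also issue a warning about the conversion:

// 300 does not fit into one byte, so the unsigned char stores 300 mod 256 = 44
#include <iostream>

int main() {
    unsigned char c = 300;  // only the eight least-significant bits are kept

    // Cast to int so that the value is printed as a number, not a character
    std::cout << "c = " << static_cast<int>(c) << std::endl;  // prints 44

    return 0;
}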

Arithmetic

Arithmetic with integers, like initial values, can be problematic if the result is a number beyond the range that can be stored by a particular type. Fortunately, if we follow the simple rule that whenever a calculation is made, we store only the $k$ least-significant bits, where $k$ is 8, 16, 32 or 64 for char or unsigned char, short or unsigned short, int or unsigned int, or long or unsigned long, respectively, then this is straightforward and will give reasonable answers.

For example, suppose we multiply 0b1111111111110101 and 0b0000000001001010:

            1111111111110101
	  * 0000000001001010
           11111111111101010
         1111111111110101000
    + 1111111111110101000000
     10010011111110011010010

The 16 least-significant bits are 1111110011010010, and we note that the original calculation was $-11 \times 74 = -814$, and 1111110011010010 is the 2s complement representation of $-814$.
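The following sketch (an illustration, not part of the original notes) mimics this on a typical two's-complement platform: the full product is computed as an int, its 16 least-significant bits are extracted, and reinterpreting those bits as a short recovers -814.

// Keeping only the 16 least-significant bits of the product of -11 and 74
#include <iostream>

int main() {
    int full{ -11 * 74 };  // -814, computed with 32 bits

    // The 16 least-significant bits: 0b1111110011010010
    unsigned short low16{ static_cast<unsigned short>( full ) };

    // Reinterpreting those 16 bits as a signed short gives -814 on
    // typical two's-complement platforms
    short result{ static_cast<short>( low16 ) };

    std::cout << "low 16 bits = " << low16  << std::endl;  // prints 64722
    std::cout << "as a short  = " << result << std::endl;  // prints -814

    return 0;
}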

To see that the product of two negative numbers produces the correct positive number, consider $(-3) \times (-42) = 126$ as 8-bit signed characters:

            11111101
	  * 11010110
           111111010
          1111110100
        111111010000
      11111101000000
   + 111111010000000
    1101001101111110

Taking the last eight bits, we get 01111110, which is the representation of 126 (that is, 127 - 1).

The problem, however, is what happens when the answer does not fit into the memory allocated. For example, suppose we add 1 to the largest unsigned short:

         1111111111111111
       +                1
        10000000000000000

The last 16 bits are 0000000000000000, so the result is zero. Thus, there is a lost carry. When the sum of two unsigned integers produces something larger than can be stored, we will say that a carry has occurred.

Suppose instead we have signed integer arithmetic. In this case, we may add one to the largest signed short (32767, or 0b0111111111111111):

         0111111111111111
       +                1
         1000000000000000

In this case, the result still fits into the two bytes, but 1000000000000000 is a negative number, specifically, the representation of $-32768$. Thus, we will describe this as an overflow, as the result, while still fitting into two bytes, no longer represents the appropriate answer.
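Here is a short sketch of this overflow (not part of the original notes; strictly speaking, the behaviour for signed types is not guaranteed by the standard, but on typical two's-complement platforms the value wraps as described):

// Incrementing the largest short: on typical platforms the value wraps to -32768
#include <iostream>
#include <limits>

int main() {
    short n{ std::numeric_limits<short>::max() };  // 32767 = 0b0111111111111111

    std::cout << "n = " << n << std::endl;  // prints 32767
    ++n;                                    // overflow: 0b1000000000000000
    std::cout << "n = " << n << std::endl;  // typically prints -32768

    return 0;
}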

Casting

If you perform an addition of two integer data types that do not have the same storage, the smaller one will first be interpreted as the larger data type. For non-negative values, this simply means padding with more zero bits. Thus, if you were to add 0b0010101010011001 to 0b00000000000101000000101011101111, the first would be interpreted as 0b00000000000000000010101010011001.

If the value being added is a negative signed value, however, then adding 0b1010101010011001 to 0b00000000000101000000101011101111 by first interpreting it as 0b00000000000000001010101010011001 would convert the number from negative to positive. Instead, it is necessary to pad with the most-significant bit (the sign bit): 0b11111111111111111010101010011001.
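A brief sketch of such a mixed addition follows (the particular values are the ones used above; the code itself is an illustration, not from these notes):

// Adding a negative short to a larger int: the short is sign-extended first
#include <iostream>

int main() {
    short m{-21863};                                 // 0b1010101010011001
    int   n{ 0b00000000000101000000101011101111 };   // 1313519

    // m is converted to 0b11111111111111111010101010011001 before the addition
    std::cout << "m + n = " << (m + n) << std::endl;  // prints 1291656

    return 0;
}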

Summary

There are eight primitive data types that can represent integers, four signed and four unsigned, occupying 1, 2, 4 and 8 bytes. All negative numbers are represented using 2s complement. For signed integers, if the first bit is 1, the number is negative, and thus there are $2^{n - 1}$ possible negative integers that can be represented. If the first bit is 0, the number is positive, but 000···0 represents $0$, so there are only $2^{n - 1} - 1$ positive integers that can be represented.

Type | Bytes | Bits | Range | Approximate range
unsigned char | 1 | 8 | $0, ..., 2^{8} - 1$ | $0, ..., 255$
unsigned short | 2 | 16 | $0, ..., 2^{16} - 1$ | $0, ..., 65535$
unsigned int | 4 | 32 | $0, ..., 2^{32} - 1$ | $0, ..., 4{\rm\ billion}$
unsigned long | 8 | 64 | $0, ..., 2^{64} - 1$ | $0, ..., 18{\rm\ quintillion}$
char | 1 | 8 | $-2^{7}, ..., 2^{7} - 1$ | $-128, ..., 127$
short | 2 | 16 | $-2^{15}, ..., 2^{15} - 1$ | $-32768, ..., 32767$
int | 4 | 32 | $-2^{31}, ..., 2^{31} - 1$ | $-2{\rm\ billion}, ..., 2{\rm\ billion}$
long | 8 | 64 | $-2^{63}, ..., 2^{63} - 1$ | $-9{\rm\ quintillion}, ..., 9{\rm\ quintillion}$
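If you would like to verify these ranges on your own platform, std::numeric_limits reports them directly (a short sketch, not part of the original notes):

// Printing the smallest and largest values of a few integer types
#include <iostream>
#include <limits>

int main() {
    std::cout << "short:          "
              << std::numeric_limits<short>::min() << " ... "
              << std::numeric_limits<short>::max() << std::endl;
    std::cout << "unsigned short: "
              << std::numeric_limits<unsigned short>::min() << " ... "
              << std::numeric_limits<unsigned short>::max() << std::endl;
    std::cout << "int:            "
              << std::numeric_limits<int>::min() << " ... "
              << std::numeric_limits<int>::max() << std::endl;

    return 0;
}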

Questions and practice: