Why use partial pivoting?


In typical linear algebra courses, Gaussian elimination is introduced as follows:

Given an m×n matrix and a target vector, the augmented matrix is formed. The steps to go to row-echelon form are as follows:

  1. Start with $i = 1$ and $j = 1$.
  2. While $i \le m$ and $j \le n + 1$,
    1. If $a_{i,j} \ne 0$,
      1. For $k$ from $i + 1$ to $m$, add $-\frac{a_{k,j}}{a_{i,j}}$ times Row $i$ onto Row $k$ to eliminate the entry $a_{k,j}$ below $a_{i,j}$.
      2. Increment both $i$ and $j$, and return to Step 2.
    2. If $a_{i,j} = 0$ and any $a_{k,j} \ne 0$ for $k > i$, swap Rows $i$ and $k$, and go back to Step 2a.
    3. Otherwise, all $a_{k,j} = 0$ for $k = i, \ldots, m$, so increment $j$ and return to Step 2.

Important: notice that my row-echelon form does not require that the leading non-zero value be normalized to 1. Normalization is an unnecessary step: it adds $O(n)$ additional work onto each step.
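
For concreteness, here is a minimal C sketch of this naive algorithm. It is my own translation (the function name, the 0-based indexing and the flattened row-major storage are my choices, not part of the course notes), and it assumes the augmented matrix has already been formed:

    /* Naive row-echelon reduction of an m-by-(n+1) augmented matrix
     * stored in row-major order (0-based indices).  There is no
     * pivoting and no normalization of the leading entries, exactly
     * as in the algorithm described above.                          */
    void naive_row_echelon( double *a, int m, int n ) {
        int i = 0, j = 0;

        while ( i < m && j < n + 1 ) {
            if ( a[i*(n + 1) + j] != 0.0 ) {
                /* Step 2a: eliminate every entry below a(i,j) */
                for ( int k = i + 1; k < m; ++k ) {
                    double s = -a[k*(n + 1) + j]/a[i*(n + 1) + j];

                    for ( int c = j; c < n + 1; ++c ) {
                        a[k*(n + 1) + c] += s*a[i*(n + 1) + c];
                    }
                }

                ++i;
                ++j;
            } else {
                /* Step 2b: look for a non-zero entry below a(i,j) */
                int k = i + 1;

                while ( k < m && a[k*(n + 1) + j] == 0.0 ) {
                    ++k;
                }

                if ( k < m ) {
                    /* Swap Rows i and k, then try Step 2a again */
                    for ( int c = j; c < n + 1; ++c ) {
                        double t = a[i*(n + 1) + c];
                        a[i*(n + 1) + c] = a[k*(n + 1) + c];
                        a[k*(n + 1) + c] = t;
                    }
                } else {
                    /* Step 2c: the column is all zeros, so move on */
                    ++j;
                }
            }
        }
    }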

This algorithm, while mathematically correct, is prone to numerical instability. Let us explore why using two examples.

Example 1

The issues with the naive algorithm become apparent when we consider the precision limits of floating-point arithmetic. This is, however, difficult to present at the first-year level, as it requires knowledge of binary numbers, binary arithmetic, exponent biases, and the fact that 53 bits of binary precision are used. Instead, I propose an analogous representation of real numbers that uses a sign and six decimal digits, $\pm$EENMMM, where $N \ne 0$. This represents the number $\pm N.MMM \times 10^{EE - 49}$. If the exponent field is EE $=$ 99, the number represents either $\pm\infty$ or NaN (undefined). If EE $=$ 00, the number is denormalized, so we may allow $N = 0$. If we use this representation:

  1. We can represent numbers as small as $0.001 \times 10^{-49} = 10^{-52}$ and as large as $9.999 \times 10^{48}$.
  2. Real numbers in the range from $10^{-49}$ to $9.999 \times 10^{48}$ may be represented with a maximum relative error of $\frac{1}{2001} \approx 0.00049975$, or "less than 0.05% relative error".
  3. If two numbers have the same sign, we can determine whether EENMMM is less than, equal to or greater than FFPQQQ just by comparing the integer values EENMMM and FFPQQQ.

After every calculation, we must round the result to four decimal digits of precision.

Using this floating-point representation, suppose we try to perform Gaussian elimination on the following system of linear equations, always rounding any stored result to four decimal digits of precision. We see that

$$\begin{pmatrix} 0.00001 & 1 \\ 1 & 1 \end{pmatrix} \mathbf{u} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

should have a solution very close to $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. However, if you perform the naive Gaussian elimination algorithm above, you get

$$\left(\begin{array}{cc|c} 0.00001 & 1 & 1 \\ 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{cc|c} 0.00001 & 1 & 1 \\ 0 & -100000 & -100000 \end{array}\right).$$

This is because you must add $-100000$ times Row 1 onto Row 2, but $1 - 100000 = -99999$ and $2 - 100000 = -99998$, and rounded to four decimal digits of precision, both of these are now $-100000$. Solving this now yields $u_2 = 1$, and substituting this into the first row, we get $u_1 = 0$.

As an approximation to $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$, the vector $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ has a relative error of $\frac{1}{\sqrt{2}} \approx 70.71\%$ (measured in the 2-norm).
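
This failure is easy to reproduce in a short C program. The helper fl4 below is my own stand-in for the toy format: it rounds every stored result to four significant decimal digits (and ignores the overflow, underflow and NaN cases), and the elimination and back substitution then follow the naive algorithm (with gcc you may need to link with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Round x to four significant decimal digits, simulating the
     * +-EENMMM toy format described above.                        */
    double fl4( double x ) {
        if ( x == 0.0 ) {
            return 0.0;
        }

        double p = pow( 10.0, 3.0 - floor( log10( fabs( x ) ) ) );

        return round( x*p )/p;
    }

    int main( void ) {
        /* The augmented matrix of Example 1 */
        double a[2][3] = { { 0.00001, 1.0, 1.0 },
                           { 1.0,     1.0, 2.0 } };

        /* Add -a(2,1)/a(1,1) = -100000 times Row 1 onto Row 2,
         * rounding every stored result to four digits.            */
        double s = fl4( -a[1][0]/a[0][0] );

        for ( int c = 0; c < 3; ++c ) {
            a[1][c] = fl4( a[1][c] + fl4( s*a[0][c] ) );
        }

        /* Back substitution, again rounding every stored result */
        double u2 = fl4( a[1][2]/a[1][1] );
        double u1 = fl4( fl4( a[0][2] - fl4( a[0][1]*u2 ) )/a[0][0] );

        printf( "u1 = %g, u2 = %g\n", u1, u2 );   /* u1 = 0, u2 = 1 */

        return 0;
    }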

Example 2

Example 1 may seem a little artificial. If students want to see an example that uses double-precision floating-point numbers, we can now look at:

$$\begin{pmatrix} 2.1 & 0.7 & 7 \\ 0.3 & 0.1 & 0 \\ 0 & 1 & 1 \end{pmatrix} \mathbf{u} = \begin{pmatrix} 9.8 \\ 0.4 \\ 2 \end{pmatrix}.$$

We see that the unique solution is clearly $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.

This is not a pathological matrix: the condition number is approximately 24.605, so small errors should not be significantly magnified.

However, note first that the only entries of this system that can be represented exactly using double-precision floating-point numbers are the integers 0, 1, 2 and 7. All of the other entries are stored as approximations:

     2.1  10.000110011001100110011001100110011001100110011001101
     0.3   0.010011001100110011001100110011001100110011001100110011
     0.7   0.10110011001100110011001100110011001100110011001100110
     0.1   0.00011001100110011001100110011001100110011001100110011010
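
These rounded values are easy to inspect; for example, printing the constants with many decimal digits (or with C's %a hexadecimal floating-point conversion) shows that none of them are stored exactly:

    #include <stdio.h>

    int main( void ) {
        /* Each constant is rounded to 53 bits of precision */
        printf( "2.1 = %.20f = %a\n", 2.1, 2.1 );
        printf( "0.3 = %.20f = %a\n", 0.3, 0.3 );
        printf( "0.7 = %.20f = %a\n", 0.7, 0.7 );
        printf( "0.1 = %.20f = %a\n", 0.1, 0.1 );

        return 0;
    }

The %a output is the same information as the binary expansions above, only written in hexadecimal.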

To understand these representations, you must realize that in binary, many (well, most) real numbers that have a finite decimal expansion end up having repeating binary expansions, so $2.1 = 10.0\overline{0011}$, $0.3 = 0.01\overline{0011}$, $0.7 = 0.1\overline{0110}$, and $0.1 = 0.0001\overline{1001}$. The double-precision floating-point representation only stores 53 bits of precision, so these repeating binary expansions must be rounded to 53 bits, resulting in the representations of 2.1 and 0.1 being rounded up and those of 0.3 and 0.7 being rounded down. Now, when we calculate $-\frac{0.3}{2.1}$, we get

     -0.0010010010010010010010010010010010010010010010010010010

which is rounded down. When we now multiply this by 0.7, which is also rounded down, we don't get the closest representation of $-0.1$, but rather something smaller in magnitude than expected, for if $\epsilon$ is sufficiently small so that $\epsilon^2 \approx 0$, then $(a - a\epsilon)(b - b\epsilon) \approx ab - 2ab\epsilon$, so the relative error is doubled. Thus, if we calculate $-\frac{0.3}{2.1} \times 0.7$ using the above operations, we get

     -0.00011001100110011001100110011001100110011001100110011001

instead of the closest possible representation of $-0.1$ in binary, which is

     -0.00011001100110011001100110011001100110011001100110011010

Thus, calculating $0.1 + \left( -\frac{0.3}{2.1} \right) \times 0.7$, we get

      0.00011001100110011001100110011001100110011001100110011010
     -0.00011001100110011001100110011001100110011001100110011001
      ----------------------------------------------------------
      0.00000000000000000000000000000000000000000000000000000001

which is $2^{-56}$ and not zero. Thus, we have:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 2^{-56} & -1 & -1.0 \\ 0 & 1 & 1 & 2.0 \end{array}\right).$$

So far, so good, for the solution of the system on the right is still very close to $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.
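
That troublesome $2^{-56}$ in the $(2,2)$ entry can be verified with two lines of C (the hexadecimal literal 0x1p-56 is $2^{-56}$); assuming IEEE 754 double-precision arithmetic with round-to-nearest, this prints the fill-in value worked out above:

    #include <stdio.h>

    int main( void ) {
        /* The (2,2) entry after the first elimination step */
        double e = 0.1 + ( -0.3/2.1 )*0.7;

        printf( "%.17g\n", e );            /* 1.3877787807814457e-17 */
        printf( "%d\n", e == 0x1p-56 );    /* 1, i.e. exactly 2^-56  */

        return 0;
    }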

Unfortunately, if we apply the naive Gaussian elimination algorithm, we see that $2^{-56} \ne 0$, and thus we add $-2^{56}$ times Row 2 onto Row 3. However, neither $2^{56} + 1$ nor $2^{56} + 2$ can be represented exactly as a double-precision floating-point number (as the representation can only store 53 bits):

    2^56 + 1 = 100000000000000000000000000000000000000000000000000000001
    2^56 + 2 = 100000000000000000000000000000000000000000000000000000010
               |<-------------------- 53 bits -------------------->|

Therefore, both are rounded down and thus represented by $2^{56}$:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 2^{-56} & -1 & -1.0 \\ 0 & 1 & 1 & 2.0 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 2^{-56} & -1 & -1.0 \\ 0 & 0 & 2^{56} & 2^{56} \end{array}\right).$$

The matrix on the right is now in row-echelon form, so we may perform backward substitution: $u_3 = 1$, so $2^{-56}u_2 - 1 = -1$, so $u_2 = 0$, and thus $u_1 = \frac{2.8}{2.1} = \frac{4}{3} = 1.\overline{3}$. This approximation of the solution, $\mathbf{u} = \begin{pmatrix} 1.\overline{3} \\ 0 \\ 1 \end{pmatrix}$, has a relative error of 60.86% when compared to the correct answer $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.
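
The whole of Example 2 fits in a few lines of C. The following sketch hard-codes the naive elimination (no row interchanges) for this particular 3-by-3 system, checks the rounding of $2^{56} + 1$ and $2^{56} + 2$ along the way, and then back substitutes; on an IEEE 754 machine it should print a solution of roughly $(1.33333, 0, 1)$ rather than $(1, 1, 1)$:

    #include <stdio.h>

    int main( void ) {
        /* The augmented matrix of Example 2 */
        double a[3][4] = { { 2.1, 0.7, 7.0, 9.8 },
                           { 0.3, 0.1, 0.0, 0.4 },
                           { 0.0, 1.0, 1.0, 2.0 } };
        double u[3];

        /* Naive elimination: no row interchanges */
        for ( int i = 0; i < 3; ++i ) {
            for ( int k = i + 1; k < 3; ++k ) {
                double s = -a[k][i]/a[i][i];

                for ( int c = i; c < 4; ++c ) {
                    a[k][c] += s*a[i][c];
                }
            }
        }

        /* The troublesome rounding: both 2^56 + 1 and 2^56 + 2
         * round back to 2^56.                                     */
        printf( "%d %d\n", 0x1p56 + 1.0 == 0x1p56,
                           0x1p56 + 2.0 == 0x1p56 );   /* 1 1      */

        /* Backward substitution */
        for ( int i = 2; i >= 0; --i ) {
            u[i] = a[i][3];

            for ( int c = i + 1; c < 3; ++c ) {
                u[i] -= a[i][c]*u[c];
            }

            u[i] /= a[i][i];
        }

        printf( "u = (%g, %g, %g)\n", u[0], u[1], u[2] );   /* (1.33333, 0, 1) */

        return 0;
    }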

Example 2 with partial pivoting

One could argue that we should treat "small" numbers at the pivot location as being equal to zero, but this would just require more elaborate examples to demonstrate an issue. Instead, if we always move the largest entry in absolute value to the pivot location $a_{i,j}$, the ratio $\frac{a_{k,j}}{a_{i,j}}$ will never have an absolute value greater than 1.

Thus, I teach the following algorithm from the start:

Given an m×n matrix and a target vector, having created the augmented matrix:

  1. Start with $i = 1$ and $j = 1$.
  2. While $i \le m$ and $j \le n + 1$,
    1. Find a Row $k$ with $i \le k \le m$ for which $|a_{k,j}| = \max\{|a_{i,j}|, \ldots, |a_{m,j}|\}$ is the largest of these values.
    2. Swap Row $i$ and Row $k$.
    3. If $a_{i,j} \ne 0$,
      1. Add $-\frac{a_{k,j}}{a_{i,j}}$ times Row $i$ onto Row $k$ for $k = i + 1, \ldots, m$. This ensures $a_{k,j} = 0$.
      2. Increment both $i$ and $j$ and return to Step 2.
    4. Otherwise, all $a_{k,j} = 0$ for $k = i, \ldots, m$, so just increment $j$ and return to Step 2.
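
A C sketch of this version differs from the naive one only in the pivot search and row swap at the top of the loop (as before, the function name, the 0-based indexing and the row-major storage are my own choices):

    #include <math.h>

    /* Row-echelon reduction with partial pivoting of an m-by-(n+1)
     * augmented matrix stored in row-major order.                   */
    void pivoting_row_echelon( double *a, int m, int n ) {
        int i = 0, j = 0;

        while ( i < m && j < n + 1 ) {
            /* Steps 2a-2b: find the entry of largest absolute value
             * in column j on or below Row i and swap it into Row i  */
            int p = i;

            for ( int k = i + 1; k < m; ++k ) {
                if ( fabs( a[k*(n + 1) + j] ) > fabs( a[p*(n + 1) + j] ) ) {
                    p = k;
                }
            }

            for ( int c = j; c < n + 1; ++c ) {
                double t = a[i*(n + 1) + c];
                a[i*(n + 1) + c] = a[p*(n + 1) + c];
                a[p*(n + 1) + c] = t;
            }

            if ( a[i*(n + 1) + j] != 0.0 ) {
                /* Step 2c: every multiplier -a(k,j)/a(i,j) now has
                 * magnitude at most 1                                */
                for ( int k = i + 1; k < m; ++k ) {
                    double s = -a[k*(n + 1) + j]/a[i*(n + 1) + j];

                    for ( int c = j; c < n + 1; ++c ) {
                        a[k*(n + 1) + c] += s*a[i*(n + 1) + c];
                    }
                }

                ++i;
                ++j;
            } else {
                /* Step 2d: the column is all zeros, so move on      */
                ++j;
            }
        }
    }

Applying this routine to the augmented matrix of Example 2 reproduces the row interchange and the well-behaved multipliers shown below.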

In Example 2, had we applied this, we would now have:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 1 & 1 & 2.0 \\ 0 & 2^{-56} & -1 & -1.0 \end{array}\right).$$

Now, adding $-2^{-56}$ times Row 2 onto Row 3 leaves Row 3 otherwise unchanged:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 1 & 1 & 2.0 \\ 0 & 0 & -1 & -1.0 \end{array}\right).$$

Applying backward substitution yields $u_3 = 1$, then $u_2 = 1$ and finally $u_1 = 1$, which is the exact solution.

Example 2 with long double

You may wonder if the problem goes away if you use long double (80 bits), as opposed to the 64-bit double data type. Unfortunately, it does not. If we look at the representations of all of these values using the 64 bits of precision of the 80-bit long double¹, we have:

2.1  10.00011001100110011001100110011001100110011001100110011001100110
0.3   0.01001100110011001100110011001100110011001100110011001100110011010
-0.3/2.1
     -0.001001001001001001001001001001001001001001001001001001001001001010
0.7   0.1011001100110011001100110011001100110011001100110011001100110011
(-0.3/2.1)*0.7
     -0.0001100110011001100110011001100110011001100110011001100110011001110
0.1   0.0001100110011001100110011001100110011001100110011001100110011001101

0.1 + (-0.3/2.1)*0.7
      0.0001100110011001100110011001100110011001100110011001100110011001101
     -0.0001100110011001100110011001100110011001100110011001100110011001110
      ---------------------------------------------------------------------
     -0.0000000000000000000000000000000000000000000000000000000000000000001

This last number is $-2^{-67}$.
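
On x86 hardware, where long double is the 80-bit extended format with a 64-bit significand, this value can be checked directly (on platforms where long double is simply a 64-bit double or a 128-bit quadruple-precision type, the numbers will differ):

    #include <stdio.h>

    int main( void ) {
        /* The (2,2) entry after the first elimination step, now
         * computed in 80-bit extended precision                   */
        long double e = 0.1L + ( -0.3L/2.1L )*0.7L;

        printf( "%Lg\n", e );   /* about -6.77626e-21, i.e. -2^-67 */

        return 0;
    }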

You will note that 2.1 is rounded down while 0.3 is rounded up, so the ratio $-\frac{0.3}{2.1}$ is larger in magnitude than expected, for if $\epsilon$ is sufficiently small so that $\epsilon^2 \approx 0$, then $\frac{a + a\epsilon}{b - b\epsilon} \approx \frac{a}{b} + 2\frac{a}{b}\epsilon$, so the relative error is doubled. Thus, if we calculated $-\frac{1}{7}$ exactly and rounded it to 64 bits, we would get

     -0.001001001001001001001001001001001001001001001001001001001001001001

but because we are calculating $-\frac{0.3}{2.1}$, where both the numerator and the denominator have small errors, we now get:

     -0.001001001001001001001001001001001001001001001001001001001001001010

Thus, we would end up with

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & -2^{-67} & -1 & -1.0 \\ 0 & 1 & 1 & 2.0 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & -2^{-67} & -1 & -1.0 \\ 0 & 0 & -2^{67} & -2^{67} \end{array}\right).$$

Again, we have:

    -2^67 + 1 = -1111111111111111111111111111111111111111111111111111111111111111111
    -2^67 + 2 = -1111111111111111111111111111111111111111111111111111111111111111110
                 |<------------------------- 64 bits -------------------------->|

and each of these rounded to 64 bits of precision would result in

    -2^67 = -10000000000000000000000000000000000000000000000000000000000000000000
             |<------------------------- 64 bits -------------------------->|

Consequently, the computed solution would be the same incorrect solution as in the original Example 2.
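
The corresponding rounding check in long double (again assuming the x86 80-bit extended format) confirms that adding 1 or 2 to $-2^{67}$ is simply lost:

    #include <stdio.h>

    int main( void ) {
        long double big = -0x1p67L;   /* -2^67 */

        /* With only 64 bits of precision, the 1 and the 2 vanish */
        printf( "%d %d\n", big + 1.0L == big, big + 2.0L == big );   /* 1 1 */

        return 0;
    }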

¹ This may be confusing: a double occupies 64 bits (8 bytes), but only 52 of those bits are used to store the mantissa (53 bits of precision once the implied leading bit is counted). A long double occupies 80 bits (10 bytes), 64 of which store the mantissa.