Why use partial pivoting?


In typical linear algebra courses, Gaussian elimination is introduced as follows:

Given an m×n matrix and a target vector, the augmented matrix is formed. The steps to go to row-echelon form are as follows:

  1. Start with $i = 1$ and $j = 1$.
  2. While $i \le m$ and $j \le n + 1$,
    1. If $a_{i,j} \ne 0$,
      1. For $k$ from $i + 1$ to $m$, add $-\frac{a_{k,j}}{a_{i,j}}$ times Row $i$ onto Row $k$ to eliminate the entry $a_{k,j}$ below $a_{i,j}$.
      2. Increment both $i$ and $j$, and return to Step 2.
    2. If $a_{i,j} = 0$ and any $a_{k,j} \ne 0$ for $k > i$, swap Rows $i$ and $k$, and go back to Step 2a.
    3. Otherwise, all $a_{k,j} = 0$ for $k = i, \ldots, m$, so increment $j$ and return to Step 2.

Important: notice that my row-echelon form does not require that the leading non-zero value be normalized to 1. Normalization is an unnecessary step: it adds $O(n)$ additional work onto each step.
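
For concreteness, here is a minimal C sketch of this naive algorithm. It is my own translation (the function name, the 0-based indexing and the flattened row-major storage are my choices, not part of the course notes), and it assumes the augmented matrix has already been formed:

    /* Naive row-echelon reduction of an m-by-(n+1) augmented matrix
     * stored in row-major order (0-based indices).  There is no
     * pivoting and no normalization of the leading entries, exactly
     * as in the algorithm described above.                          */
    void naive_row_echelon( double *a, int m, int n ) {
        int i = 0, j = 0;

        while ( i < m && j < n + 1 ) {
            if ( a[i*(n + 1) + j] != 0.0 ) {
                /* Step 2a: eliminate every entry below a(i,j) */
                for ( int k = i + 1; k < m; ++k ) {
                    double s = -a[k*(n + 1) + j]/a[i*(n + 1) + j];

                    for ( int c = j; c < n + 1; ++c ) {
                        a[k*(n + 1) + c] += s*a[i*(n + 1) + c];
                    }
                }

                ++i;
                ++j;
            } else {
                /* Step 2b: look for a non-zero entry below a(i,j) */
                int k = i + 1;

                while ( k < m && a[k*(n + 1) + j] == 0.0 ) {
                    ++k;
                }

                if ( k < m ) {
                    /* Swap Rows i and k, then try Step 2a again */
                    for ( int c = j; c < n + 1; ++c ) {
                        double t = a[i*(n + 1) + c];
                        a[i*(n + 1) + c] = a[k*(n + 1) + c];
                        a[k*(n + 1) + c] = t;
                    }
                } else {
                    /* Step 2c: the column is all zeros, so move on */
                    ++j;
                }
            }
        }
    }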

This algorithm, while mathematically correct, is prone to numerical instability. Let us explore why using two examples.

Example 1

The issues with the naive algorithm become apparent when we consider the precision limits of floating-point arithmetic. This is, however, difficult to present at the first-year level, as it requires knowledge of binary numbers, binary arithmetic, exponent biases, and the fact that 53 bits of binary precision are used. Instead, I propose an analogous representation of real numbers that uses a sign and six decimal digits, $\pm$EENMMM, where $N \ne 0$. This represents the number $\pm N.MMM \times 10^{EE - 49}$. If the exponent field is EE $=$ 99, the number represents either $\pm\infty$ or NaN (undefined). If EE $=$ 00, the number is denormalized, so we may allow $N = 0$. If we use this representation:

  1. We can represent numbers as small as $0.001 \times 10^{-49} = 10^{-52}$ and as large as $9.999 \times 10^{48}$.
  2. Real numbers in the range from $10^{-49}$ to $9.999 \times 10^{48}$ may be represented with a maximum relative error of $\frac{1}{2001} \approx 0.00049975$, or "less than 0.05% relative error".
  3. If two numbers have the same sign, we can determine whether EENMMM is less than, equal to or greater than FFPQQQ just by comparing the integer values EENMMM and FFPQQQ.

After every calculation, we must round the result to four decimal digits of precision.

Using this floating-point representation, suppose we try to perform Gaussian elimination on the following system of linear equations, always rounding any stored result to four decimal digits of precision. We see that

$$\begin{pmatrix} 0.00001 & 1 \\ 1 & 1 \end{pmatrix} \mathbf{u} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

should have a solution very close to $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. However, if you perform the naive Gaussian elimination algorithm above, you get

$$\left(\begin{array}{cc|c} 0.00001 & 1 & 1 \\ 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{cc|c} 0.00001 & 1 & 1 \\ 0 & -100000 & -100000 \end{array}\right).$$

This is because you must add $-100000$ times Row 1 onto Row 2, but $1 - 100000 = -99999$ and $2 - 100000 = -99998$, and rounded to four decimal digits of precision, both of these are now $-100000$. Solving this now yields $u_2 = 1$, and substituting this into the first row, we get $u_1 = 0$.

As an approximation to $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$, the vector $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ has a relative error of $\frac{1}{\sqrt{2}} \approx 70.71\%$ (measured in the 2-norm).
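
This failure is easy to reproduce in a short C program. The helper fl4 below is my own stand-in for the toy format: it rounds every stored result to four significant decimal digits (and ignores the overflow, underflow and NaN cases), and the elimination and back substitution then follow the naive algorithm (with gcc you may need to link with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Round x to four significant decimal digits, simulating the
     * +-EENMMM toy format described above.                        */
    double fl4( double x ) {
        if ( x == 0.0 ) {
            return 0.0;
        }

        double p = pow( 10.0, 3.0 - floor( log10( fabs( x ) ) ) );

        return round( x*p )/p;
    }

    int main( void ) {
        /* The augmented matrix of Example 1 */
        double a[2][3] = { { 0.00001, 1.0, 1.0 },
                           { 1.0,     1.0, 2.0 } };

        /* Add -a(2,1)/a(1,1) = -100000 times Row 1 onto Row 2,
         * rounding every stored result to four digits.            */
        double s = fl4( -a[1][0]/a[0][0] );

        for ( int c = 0; c < 3; ++c ) {
            a[1][c] = fl4( a[1][c] + fl4( s*a[0][c] ) );
        }

        /* Back substitution, again rounding every stored result */
        double u2 = fl4( a[1][2]/a[1][1] );
        double u1 = fl4( fl4( a[0][2] - fl4( a[0][1]*u2 ) )/a[0][0] );

        printf( "u1 = %g, u2 = %g\n", u1, u2 );   /* u1 = 0, u2 = 1 */

        return 0;
    }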

Example 2

Example 1 may seem a little artificial. If students want to see an example that uses double-precision floating-point numbers, we can now look at:

$$\begin{pmatrix} 2.1 & 0.7 & 7 \\ 0.3 & 0.1 & 0 \\ 0 & 1 & 1 \end{pmatrix} \mathbf{u} = \begin{pmatrix} 9.8 \\ 0.4 \\ 2 \end{pmatrix}.$$

We see that the unique solution is clearly $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.

This is not a pathological matrix: the condition number is approximately 24.605, so small errors should not be significantly magnified.

However, note first that the only entries of this system that can be represented exactly using double-precision floating-point numbers are the integers 0, 1, 2 and 7. All of the other entries are stored as approximations:

     2.1  10.000110011001100110011001100110011001100110011001101
     0.3   0.010011001100110011001100110011001100110011001100110011
     0.7   0.10110011001100110011001100110011001100110011001100110
     0.1   0.00011001100110011001100110011001100110011001100110011010
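
These rounded values are easy to inspect; for example, printing the constants with many decimal digits (or with C's %a hexadecimal floating-point conversion) shows that none of them are stored exactly:

    #include <stdio.h>

    int main( void ) {
        /* Each constant is rounded to 53 bits of precision */
        printf( "2.1 = %.20f = %a\n", 2.1, 2.1 );
        printf( "0.3 = %.20f = %a\n", 0.3, 0.3 );
        printf( "0.7 = %.20f = %a\n", 0.7, 0.7 );
        printf( "0.1 = %.20f = %a\n", 0.1, 0.1 );

        return 0;
    }

The %a output is the same information as the binary expansions above, only written in hexadecimal.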

To understand these representations, you must realize that in binary, many (well, most) real numbers that have a finite decimal expansion end up having repeating binary expansions, so $2.1 = 10.0\overline{0011}$, $0.3 = 0.01\overline{0011}$, $0.7 = 0.1\overline{0110}$, and $0.1 = 0.0001\overline{1001}$. The double-precision floating-point representation only stores 53 bits of precision, so these repeating binary expansions must be rounded to 53 bits, resulting in the representations of 2.1 and 0.1 being rounded up and those of 0.3 and 0.7 being rounded down. Now, when we calculate $-\frac{0.3}{2.1}$, we get

     -0.0010010010010010010010010010010010010010010010010010010

which is rounded down. When we now multiply this by 0.7, which is also rounded down, we don't get the closest representation of $-0.1$, but rather something smaller in magnitude than expected, for if $\epsilon$ is sufficiently small so that $\epsilon^2 \approx 0$, then $(a - a\epsilon)(b - b\epsilon) \approx ab - 2ab\epsilon$, so the relative error is doubled. Thus, if we calculate $-\frac{0.3}{2.1} \times 0.7$ using the above operations, we get

     -0.00011001100110011001100110011001100110011001100110011001

instead of the closest possible representation of $-0.1$ in binary, which is

     -0.00011001100110011001100110011001100110011001100110011010

Thus, calculating $0.1 + \left( -\frac{0.3}{2.1} \right) \times 0.7$, we get

      0.00011001100110011001100110011001100110011001100110011010
     -0.00011001100110011001100110011001100110011001100110011001
      ----------------------------------------------------------
      0.00000000000000000000000000000000000000000000000000000001

which is $2^{-56}$ and not zero. Thus, we have:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 2^{-56} & -1 & -1.0 \\ 0 & 1 & 1 & 2.0 \end{array}\right).$$

So far, so good, for the solution of the system on the right is still very close to $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.
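
That troublesome $2^{-56}$ in the $(2,2)$ entry can be verified with two lines of C (the hexadecimal literal 0x1p-56 is $2^{-56}$); assuming IEEE 754 double-precision arithmetic with round-to-nearest, this prints the fill-in value worked out above:

    #include <stdio.h>

    int main( void ) {
        /* The (2,2) entry after the first elimination step */
        double e = 0.1 + ( -0.3/2.1 )*0.7;

        printf( "%.17g\n", e );            /* 1.3877787807814457e-17 */
        printf( "%d\n", e == 0x1p-56 );    /* 1, i.e. exactly 2^-56  */

        return 0;
    }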

Unfortunately, if we apply the naive Gaussian elimination algorithm, we see that $2^{-56} \ne 0$, and thus we add $-2^{56}$ times Row 2 onto Row 3. However, neither $2^{56} + 1$ nor $2^{56} + 2$ can be represented exactly as a double-precision floating-point number (as the representation can only store 53 bits):

    2^56 + 1 = 100000000000000000000000000000000000000000000000000000001
    2^56 + 2 = 100000000000000000000000000000000000000000000000000000010
               |<-------------------- 53 bits -------------------->|

Therefore, both are rounded down and thus represented by $2^{56}$:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 2^{-56} & -1 & -1.0 \\ 0 & 1 & 1 & 2.0 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 2^{-56} & -1 & -1.0 \\ 0 & 0 & 2^{56} & 2^{56} \end{array}\right).$$

The matrix on the right is now in row-echelon form, so we may perform backward substitution: $u_3 = 1$, so $2^{-56}u_2 - 1 = -1$, so $u_2 = 0$, and thus $u_1 = \frac{2.8}{2.1} = \frac{4}{3} = 1.\overline{3}$. This approximation of the solution, $\mathbf{u} = \begin{pmatrix} 1.\overline{3} \\ 0 \\ 1 \end{pmatrix}$, has a relative error of 60.86% when compared to the correct answer $\mathbf{u} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$.
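
The whole of Example 2 fits in a few lines of C. The following sketch hard-codes the naive elimination (no row interchanges) for this particular 3-by-3 system, checks the rounding of $2^{56} + 1$ and $2^{56} + 2$ along the way, and then back substitutes; on an IEEE 754 machine it should print a solution of roughly $(1.33333, 0, 1)$ rather than $(1, 1, 1)$:

    #include <stdio.h>

    int main( void ) {
        /* The augmented matrix of Example 2 */
        double a[3][4] = { { 2.1, 0.7, 7.0, 9.8 },
                           { 0.3, 0.1, 0.0, 0.4 },
                           { 0.0, 1.0, 1.0, 2.0 } };
        double u[3];

        /* Naive elimination: no row interchanges */
        for ( int i = 0; i < 3; ++i ) {
            for ( int k = i + 1; k < 3; ++k ) {
                double s = -a[k][i]/a[i][i];

                for ( int c = i; c < 4; ++c ) {
                    a[k][c] += s*a[i][c];
                }
            }
        }

        /* The troublesome rounding: both 2^56 + 1 and 2^56 + 2
         * round back to 2^56.                                     */
        printf( "%d %d\n", 0x1p56 + 1.0 == 0x1p56,
                           0x1p56 + 2.0 == 0x1p56 );   /* 1 1      */

        /* Backward substitution */
        for ( int i = 2; i >= 0; --i ) {
            u[i] = a[i][3];

            for ( int c = i + 1; c < 3; ++c ) {
                u[i] -= a[i][c]*u[c];
            }

            u[i] /= a[i][i];
        }

        printf( "u = (%g, %g, %g)\n", u[0], u[1], u[2] );   /* (1.33333, 0, 1) */

        return 0;
    }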

Example 2 with partial pivoting

One could argue that we should treat "small" numbers at the pivot location as being equal to zero, but this would just require more elaborate examples to demonstrate an issue. Instead, if we always move the largest entry in absolute value to the pivot location $a_{i,j}$, the ratio $\frac{a_{k,j}}{a_{i,j}}$ will never have an absolute value greater than 1.

Thus, I teach the following algorithm from the start:

Given an m×n matrix and a target vector, having created the augmented matrix:

  1. Start with $i = 1$ and $j = 1$.
  2. While $i \le m$ and $j \le n + 1$,
    1. Find a Row $k$ with $i \le k \le m$ for which $|a_{k,j}| = \max\{|a_{i,j}|, \ldots, |a_{m,j}|\}$ is the largest of these values.
    2. Swap Row $i$ and Row $k$.
    3. If $a_{i,j} \ne 0$,
      1. Add $-\frac{a_{k,j}}{a_{i,j}}$ times Row $i$ onto Row $k$ for $k = i + 1, \ldots, m$. This ensures $a_{k,j} = 0$.
      2. Increment both $i$ and $j$ and return to Step 2.
    4. Otherwise, all $a_{k,j} = 0$ for $k = i, \ldots, m$, so just increment $j$ and return to Step 2.
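
A C sketch of this version differs from the naive one only in the pivot search and row swap at the top of the loop (as before, the function name, the 0-based indexing and the row-major storage are my own choices):

    #include <math.h>

    /* Row-echelon reduction with partial pivoting of an m-by-(n+1)
     * augmented matrix stored in row-major order.                   */
    void pivoting_row_echelon( double *a, int m, int n ) {
        int i = 0, j = 0;

        while ( i < m && j < n + 1 ) {
            /* Steps 2a-2b: find the entry of largest absolute value
             * in column j on or below Row i and swap it into Row i  */
            int p = i;

            for ( int k = i + 1; k < m; ++k ) {
                if ( fabs( a[k*(n + 1) + j] ) > fabs( a[p*(n + 1) + j] ) ) {
                    p = k;
                }
            }

            for ( int c = j; c < n + 1; ++c ) {
                double t = a[i*(n + 1) + c];
                a[i*(n + 1) + c] = a[p*(n + 1) + c];
                a[p*(n + 1) + c] = t;
            }

            if ( a[i*(n + 1) + j] != 0.0 ) {
                /* Step 2c: every multiplier -a(k,j)/a(i,j) now has
                 * magnitude at most 1                                */
                for ( int k = i + 1; k < m; ++k ) {
                    double s = -a[k*(n + 1) + j]/a[i*(n + 1) + j];

                    for ( int c = j; c < n + 1; ++c ) {
                        a[k*(n + 1) + c] += s*a[i*(n + 1) + c];
                    }
                }

                ++i;
                ++j;
            } else {
                /* Step 2d: the column is all zeros, so move on      */
                ++j;
            }
        }
    }

Applying this routine to the augmented matrix of Example 2 reproduces the row interchange and the well-behaved multipliers shown below.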

In Example 2, had we applied this, we would now have:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 1 & 1 & 2.0 \\ 0 & 2^{-56} & -1 & -1.0 \end{array}\right).$$

Now, adding $-2^{-56}$ times Row 2 onto Row 3 leaves Row 3 otherwise unchanged:

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & 1 & 1 & 2.0 \\ 0 & 0 & -1 & -1.0 \end{array}\right).$$

Applying backward substitution yields $u_3 = 1$, then $u_2 = 1$ and finally $u_1 = 1$, which is the exact solution.

Example 2 with long double

You may wonder if the problem goes away if you use long double (80 bits), as opposed to the 64-bit double data type. Unfortunately, it does not. If we look at the representations of all of these values using the 64 bits of precision of the 80-bit long double¹, we have:

2.1  10.00011001100110011001100110011001100110011001100110011001100110
0.3   0.01001100110011001100110011001100110011001100110011001100110011010
-0.3/2.1
     -0.001001001001001001001001001001001001001001001001001001001001001010
0.7   0.1011001100110011001100110011001100110011001100110011001100110011
(-0.3/2.1)*0.7
     -0.0001100110011001100110011001100110011001100110011001100110011001110
0.1   0.0001100110011001100110011001100110011001100110011001100110011001101

0.1 + (-0.3/2.1)*0.7
      0.0001100110011001100110011001100110011001100110011001100110011001101
     -0.0001100110011001100110011001100110011001100110011001100110011001110
      ---------------------------------------------------------------------
     -0.0000000000000000000000000000000000000000000000000000000000000000001

This last number is $-2^{-67}$.
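
On x86 hardware, where long double is the 80-bit extended format with a 64-bit significand, this value can be checked directly (on platforms where long double is simply a 64-bit double or a 128-bit quadruple-precision type, the numbers will differ):

    #include <stdio.h>

    int main( void ) {
        /* The (2,2) entry after the first elimination step, now
         * computed in 80-bit extended precision                   */
        long double e = 0.1L + ( -0.3L/2.1L )*0.7L;

        printf( "%Lg\n", e );   /* about -6.77626e-21, i.e. -2^-67 */

        return 0;
    }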

You will note that 2.1 is rounded down while 0.3 is rounded up, so the ratio $-\frac{0.3}{2.1}$ is larger in magnitude than expected, for if $\epsilon$ is sufficiently small so that $\epsilon^2 \approx 0$, then $\frac{a + a\epsilon}{b - b\epsilon} \approx \frac{a}{b} + 2\frac{a}{b}\epsilon$, so the relative error is doubled. Thus, if we calculated $-\frac{1}{7}$ exactly and rounded it to 64 bits, we would get

     -0.001001001001001001001001001001001001001001001001001001001001001001

but because we are calculating $-\frac{0.3}{2.1}$, where both the numerator and the denominator have small errors, we now get:

     -0.001001001001001001001001001001001001001001001001001001001001001010

Thus, we would end up with

$$\left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0.3 & 0.1 & 0 & 0.4 \\ 0 & 1 & 1 & 2 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & -2^{-67} & -1 & -1.0 \\ 0 & 1 & 1 & 2.0 \end{array}\right) \to \left(\begin{array}{ccc|c} 2.1 & 0.7 & 7 & 9.8 \\ 0 & -2^{-67} & -1 & -1.0 \\ 0 & 0 & -2^{67} & -2^{67} \end{array}\right).$$

Again, we have:

    -2^67 + 1 = -1111111111111111111111111111111111111111111111111111111111111111111
    -2^67 + 2 = -1111111111111111111111111111111111111111111111111111111111111111110
                 |<------------------------- 64 bits -------------------------->|

and each of these rounded to 64 bits of precision would result in

    -2^67 = -10000000000000000000000000000000000000000000000000000000000000000000
             |<------------------------- 64 bits -------------------------->|

Consequently, the computed solution would be the same incorrect solution as in the original Example 2.
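
The corresponding rounding check in long double (again assuming the x86 80-bit extended format) confirms that adding 1 or 2 to $-2^{67}$ is simply lost:

    #include <stdio.h>

    int main( void ) {
        long double big = -0x1p67L;   /* -2^67 */

        /* With only 64 bits of precision, the 1 and the 2 vanish */
        printf( "%d %d\n", big + 1.0L == big, big + 2.0L == big );   /* 1 1 */

        return 0;
    }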

¹ This may be confusing: a double occupies 64 bits (8 bytes), but only 52 of those bits are used to store the mantissa (53 bits of precision once the implied leading bit is counted). A long double occupies 80 bits (10 bytes), 64 of which store the mantissa.