
6.1 Simple Linear Regression


Introduction

In this first topic, we look at the special case of fitting a straight line to some given data. The general process of fitting data to a linear combination of basis functions is termed linear regression. When the curve being fit is a straight line, that is, one of the form ax + b, the process is termed simple linear regression.

References

Draper, N.R. and Smith, H., Applied Regression Analysis, John Wiley & Sons.

Theory

Linear Regression

Consider the points (xi, yi) shown in Figure 1.

Figure 1. A collection of sampled points.

The points appear to lie on a straight line, something of the form y(x) = ax + b, where a and b are unknown real values. The question is, how can we find the best values for a and b? For example, Figure 2 shows the two functions y(x) = 1.2 x + 2.4 and y(x) = 1.3 x + 2.5 in red and blue, respectively. The blue line looks better, but how did we even pick the values 1.3 and 2.5, and can we do better?

Figure 2. Two lines of regression.

To begin, we must define the term regression. Here, we are regressing the values of y onto values on a curve, in this case y(x) = c1x + c2. Because this expression is linear in c1 and c2, the technique is termed linear regression. (This has nothing to do with the fact that the function itself happens to be linear.)

The technique we will use to find the best-fitting line is called the method of least squares.

Derivation of the Method of Least Squares

Given the n points (x1, y1), ..., (xn, yn), we will find the straight line which minimizes the sum of the squares of the errors. In Figure 3, we have an arbitrary line, and the errors between it and the points are marked in light blue.

Figure 3. The errors between an arbitrary curve y(xi) = c1xi + c2 and the points yi.

Writing this out mathematically, we would like to minimize the sum of the squares of the errors (SSE):

SSE = Σ (yi - (c1xi + c2))²

where the sum runs over i = 1, 2, ..., n.

Notice that the only unknowns in this expression are c1 and c2. Thus, from calculus, we know that if we want to minimize this, we must differentiate with respect to these variables and simultaneously set both derivatives to 0:

∂SSE/∂c1 = -2 Σ (yi - (c1xi + c2)) xi = 0
∂SSE/∂c2 = -2 Σ (yi - (c1xi + c2)) = 0

Expanding the first equation (and dividing both sides by -2), we get:

c1 Σ xi² + c2 Σ xi = Σ xiyi

If we, with some foresight, define the following, the sum of the x's (Sx), the sum of the y's (Sy), the sum of the squares of the x's (SSx), and the sum of the products of the x's and y's (SPx,y), that is,

Sx = Σ xi     Sy = Σ yi     SSx = Σ xi²     SPx,y = Σ xiyi

then we get the linear equation:

SSx c1 + Sx c2 = SPx,y

Expanding the second equation (and dividing both sides by -2), we get:

Σ yi - c1 Σ xi - Σ c2 = 0

By calculating the third sum (Σ c2 = nc2, since c2 is added once for each of the n points) and rearranging, we get the linear equation:

Sx c1 + n c2 = Sy

We could solve these the long way (as you probably did in high school); however, we note that this describes the system of linear equations:

[ SSx  Sx ] [ c1 ]   [ SPx,y ]
[ Sx   n  ] [ c2 ] = [ Sy    ]

This is a system of linear equations which we can, quite easily, solve.
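To make the derivation concrete, here is a minimal Matlab sketch, assuming x and y are column vectors of data; it forms the four sums defined above and solves the resulting 2-by-2 system:

n    = length( x );
Sx   = sum( x );       % the sum of the x's
Sy   = sum( y );       % the sum of the y's
SSx  = sum( x.^2 );    % the sum of the squares of the x's
SPxy = sum( x.*y );    % the sum of the products of the x's and y's
c    = [SSx Sx; Sx n] \ [SPxy; Sy];   % c(1) is the slope, c(2) the intercept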

The Critical Simplification

Fortunately, we can make life even easier. Recall that the Vandermonde matrix for finding the linear polynomial which interpolates two points is:

V = [ x1  1 ]
    [ x2  1 ]

If we define the general Vandermonde matrix:

V = [ x1  1 ]
    [ x2  1 ]
    [  ⋮  ⋮ ]
    [ xn  1 ]

then an astute observation reveals that the system of linear equations may be written as:

VTVc = VTy

since multiplying out gives VTV = [ SSx  Sx ; Sx  n ] and VTy = ( SPx,y, Sy )T.

Thus, to find the coefficient vector c = (c1, c2)T, all we must do is use Matlab, as follows:

>> c = (V' * V) \ (V' * y);

in which case, the linear function y(x) = c1 x + c2 is that function which best fits the points, that is, it minimizes the sum of the squares of the errors.
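As an aside, the Matlab backslash operator applied directly to a rectangular matrix, as in c = V \ y, computes the same least-squares solution using a QR factorization rather than explicitly forming VTV; this is generally preferable numerically, and it is the form used in the Matlab section below.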

For the points shown in Figure 1, the actual values are:

(0.350, 2.909), (0.406, 2.987), (0.597, 3.259), (1.022, 3.645), (1.357, 4.212), (1.507, 4.295), (2.228, 5.277), (2.475, 5.574), (2.974, 6.293), (2.975, 6.259)

the Vandermonde matrix is

V = [ 0.350  1 ]
    [ 0.406  1 ]
    [ 0.597  1 ]
    [ 1.022  1 ]
    [ 1.357  1 ]
    [ 1.507  1 ]
    [ 2.228  1 ]
    [ 2.475  1 ]
    [ 2.974  1 ]
    [ 2.975  1 ]

Solving the above system of equations defined by VTVc = VTy yields the vector c = (1.278, 2.440)T, and therefore the best fitting curve is y(x) = 1.278 x + 2.440. The points and this function are shown in Figure 4. You will note that it is significantly better than any of the lines shown in Figures 2 or 3.

Figure 4. The least-squares line passing through the given data points.
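As a quick check, the following Matlab fragment reproduces this fit from the ten points listed above:

x = [0.350 0.406 0.597 1.022 1.357 1.507 2.228 2.475 2.974 2.975]';
y = [2.909 2.987 3.259 3.645 4.212 4.295 5.277 5.574 6.293 6.259]';
V = [x, ones(size(x))];        % the Vandermonde matrix above
c = (V' * V) \ (V' * y)        % yields c = (1.278, 2.440)'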

HOWTO

Problem

Given data (xi, yi), for i = 1, 2, ..., n, which is known to approximate a straight line, find the best-fitting straight line.

Assumptions

We will assume the model is correct and that the data is defined by two vectors x = (xi) and y = (yi).

Tools

We will use linear algebra.

Process

We wish to find the best fitting line of the form y(x) = c1x + c2, and thus, we define the matrix

V = [ x1  1 ]
    [ x2  1 ]
    [  ⋮  ⋮ ]
    [ xn  1 ]

where the first column is the function x evaluated at each of the x values, and the second column is the function 1 evaluated at each of the x values.

Hence, we solve the linear system VTVc = VTy.

Having found the coefficient vector c, we now associate the appropriate entries with the appropriate basis function: y(x) = c1x + c2.
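For convenience, this process can be wrapped in a short Matlab function; the name simple_linear_regression below is our own choice for this sketch, not a built-in routine:

function c = simple_linear_regression( x, y )
    % Fit y(x) = c(1)*x + c(2) to the column vectors x and y by least squares.
    V = [x, ones(size(x))];    % first column: x; second column: 1
    c = (V' * V) \ (V' * y);   % solve the system V'Vc = V'y
end

Having found c = simple_linear_regression( x, y ), the best-fitting line is y(x) = c(1)x + c(2).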

Error Analysis

The study of the error associated with a linear regression is beyond the scope of this class, but the following is a summary of some points:

  • The line passes through the average of the x values and the average of the y values.
  • The error of the line gets larger as you move away from the average of the x values, though the error does not increase as quickly as it does for interpolation, and
  • It is, statistically speaking, better to take multiple readings at each of a small number of values of the independent variable than it is to take a single reading at each of many different values.

The last point essentially says that it is better to take three readings at each of x = 2, 5, and 8 than it is to take one reading at each of x = 1, 2, 3, 4, 5, 6, 7, 8, and 9. Always talk to a statistician before you design any experiment.

For interest, we may define the variable s2 = SSE/(n - 2). In this case, we can give a 95% confidence interval which indicates how well we are estimating the mean, namely, using Equation 1.4.11 of Draper and Smith, we have the plot shown in Figure 1.

Figure 1. The 95% confidence interval associated with each y value.
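For those who wish to experiment, the following Matlab sketch plots such a band. It uses the standard textbook confidence interval for the estimated mean response, yhat(x) ± t(0.025, n - 2) s sqrt( 1/n + (x - xbar)²/Σ(xi - xbar)² ), which we believe is the content of Equation 1.4.11 of Draper and Smith, though this should be checked against that text; the call to tinv requires the Statistics Toolbox.

% assumes x, y, and the fitted coefficients c from earlier
n    = length( x );
SSE  = sum( (y - (c(1)*x + c(2))).^2 );    % sum of the squares of the errors
s    = sqrt( SSE/(n - 2) );                % s2 = SSE/(n - 2)
xbar = mean( x );
xs   = linspace( min(x), max(x), 100 )';
half = tinv( 0.975, n - 2 )*s*sqrt( 1/n + (xs - xbar).^2/sum( (x - xbar).^2 ) );
plot( x, y, 'o' )
hold on
plot( xs, c(1)*xs + c(2) )
plot( xs, c(1)*xs + c(2) + half, ':' )
plot( xs, c(1)*xs + c(2) - half, ':' )
hold off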

Examples

Example 1

The following data is known to be linear in nature:

(1, 2.6228), (2, 2.9125), (3, 3.1390), (4, 4.2952), (5, 4.9918),
(6, 4.6468), (7, 5.4008), (8, 6.3853), (9, 6.7494), (10, 7.3864)

This data is shown in Figure 1.

Figure 1. The given data points.

Applying the technique of least squares, we define:

V = [  1  1 ]
    [  2  1 ]
    [  ⋮  ⋮ ]
    [ 10  1 ]

We now solve VTVc = VTy to get c = (0.53900, 1.8886)T.

Therefore, the best fitting line using the least-squares technique is y(x) = 0.53900x + 1.8886, which is shown in Figure 2.

Figure 2. The best-fitting line using least squares.

Example 2

The following data is similar to that in Example 1. What would you consider to be problematic, and what is a reasonable solution?

(1, 2.6228), (2, 2.9125), (3, 3.1390), (4, 4.2952), (5, 4.9918),
(6, 4.6468), (7, 5.4008), (8, 63.853), (9, 6.7494), (10, 7.3864)

The 8th point appears to be significantly different from all of the other values. This is almost certainly an error in measurement or an error in recording. There are two possible solutions:

  • Leave the point out, or
  • Re-sample the value at the point x = 8.

If we do nothing, we get the following best-fitting line: y(x) = 2.2805 x - 1.9427. If we remove the point and simply use the remaining nine points, we get the line y(x) = 0.53220 x + 1.9035.

This second line is reasonably close to the line we found in Example 1, as is shown in Figure 3.

Figure 3. The best-fitting line without using the 8th point (in blue) compared to the best-fitting line in Figure 1 (in red).
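This comparison is easy to reproduce; the following Matlab sketch fits the data both with and without the suspect reading, and the expected coefficients are those quoted above:

x = (1:10)';
y = [2.6228 2.9125 3.1390 4.2952 4.9918 4.6468 5.4008 63.853 6.7494 7.3864]';
V = [x, ones(size(x))];
c_all  = V \ y;                % keeping the bad point:  y(x) = 2.2805x - 1.9427
keep   = (x ~= 8);             % drop the reading at x = 8
c_drop = V(keep,:) \ y(keep);  % without it:  y(x) = 0.53220x + 1.9035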

Questions

Question 1

Find the least-squares curve which fits the linear data:

x = (1, 2, 3, 4)T
y = (0, 1, 1, 2)T

Answer: y(x) = 0.6 x - 0.5.

Question 2

Find the least squares straight line which fits the data:

x = (0.282, 0.555, 0.089, 0.157, 0.357, 0.572, 0.222, 0.800, 0.266, 0.056)T
y = (0.685, 0.563, 0.733, 0.722, 0.662, 0.588, 0.693, 0.530, 0.650, 0.713)T

Answer: y(x) = -0.28714 x + 0.75027 .

Applications to Engineering

The least-squares technique for finding a linear regression of the form y = ax + b is critical in engineering: sampled data always has an error associated with it, and while a model may suggest that the response of a system should be linear, the actual output may be less obviously so, for any number of reasons, including limitations in measuring equipment, random errors and fluctuations, and unaccounted-for variables.

The method of least squares provides a fast and efficient means of finding the best-fitting straight line through given data, yielding approximations of the unknown coefficients.

For example, it is well known that an ideal resistor is linear in its response: the voltage across it is proportional to the current through it. This, however, may be less true in practice. Simply using one reading with one current to approximate the resistance of a resistor has two problems:

  • We cannot give any estimate as to what the error of our approximation is, and
  • The resistor may not be linear for the given range of currents, hence our approximation may be completely inaccurate because the model of an ideal linear resistor does not even apply.

Using multiple readings and linear regression gives us the ability to make much more definitive statements about the accuracy of our approximation and the applicability of our model.
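As a sketch of how this might look in practice, the Matlab fragment below simulates repeated voltage readings across a resistor and estimates the resistance from the slope of the least-squares line; the nominal resistance of 100 ohms, the current range, and the noise level are invented for the sake of the example.

i = repmat( (0.01:0.01:0.05)', 3, 1 );   % three readings at each of five currents (A)
v = 100*i + 0.05*randn( size(i) );       % simulated voltmeter readings (V)
A = [i, ones(size(i))];                  % the Vandermonde matrix for the currents
c = A \ v;                               % least-squares fit of v = c(1)*i + c(2)
R = c(1)                                 % the slope estimates the resistance (ohms)

Note that taking three readings at each of five currents follows the advice given in the Error Analysis section above.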

Matlab

Finding the coefficient vector in Matlab is very simple:

x = [1 2 3 4 5 6 7 8 9 10]';
y = [2.6228, 2.9125, 3.1390, 4.2952, 4.9918, 4.6468, 5.4008, 6.3853, 6.7494, 7.3864]';
V = [x, ones(size(x))];
c = V \ y;   % same as c = (V' * V) \ (V' * y)

To plot the points and the best fitting curve, you can enter:

xs = (0:0.1:11)';
plot( x, y, 'o' )
hold on
plot( xs, c(1)*xs + c(2) );
        

Be sure to issue the command hold off if you want to start with a clean plot window.

Maple

The following commands in Maple:

with(CurveFitting):
pts := [[1, 2.6228], [2, 2.9125], [3, 3.1390], [4, 4.2952], [5, 4.9918], [6, 4.6468], [7, 5.4008], [8, 6.3853], [9, 6.7494], [10, 7.3864]];
fn := LeastSquares( pts, x, curve = a*x + b );
plots[pointplot]( pts );
plots[display]( plot( fn, x = 0..11 ), plots[pointplot]( pts ) );

calculate the least-squares line of best fit for the given data points, a plot of those points, and a plot of the points together with the best-fitting curve. This last plot is shown in Figure 1.

Figure 1. The given points and the least squares line passing through those points.

For more help on the least squares function or on the CurveFitting package, enter:

?CurveFitting,LeastSquares
?CurveFitting