# Introduction

In this first topic, we look at the special case
of fitting a straight line to some given data.
The general process of fitting data to a *linear*
combination of basis functions is termed *linear regression*.
When the curve being fit is a straight line, that is, of the form
*ax* + *b*, the description is *simple* linear regression.

# References

N. R. Draper and H. Smith, *Applied Regression Analysis*, Wiley.

# Theory

# Linear Regression

Consider the points (*x*_{i}, *y*_{i}) shown in Figure 1.

Figure 1. A collection of sampled points.

The points appear to lie along a straight line, something
of the form *y*(*x*) = *ax* + *b* where *a* and *b*
are unknown real values. The question is, how can we find the best values
for *a* and *b*? For example, Figure 2 shows the two functions
*y*(*x*) = 1.2 *x* + 2.4 and *y*(*x*) = 1.3 *x* + 2.5
in red and blue, respectively. The blue line *looks* better, but how
did we even pick the values of 1.3 and 2.5, and can we do better?

Figure 2. Two lines of regression.

To begin, we must define the term *regression*. In this case,
we are *regressing* the values of *y* to some value on a curve,
in this case, *y*(*x*) = *c*_{1}*x* + *c*_{2}.
Because this is an expression which is *linear* in *c*_{1}
and *c*_{2}, it is termed *linear regression*. (This has
nothing to do with the fact that the function is linear.)

The technique we will use to find the *best fitting* line is
called the *method of least squares*.

# Derivation of the Method of Least Squares

Given the *n* points (*x*_{1}, *y*_{1}), ...,
(*x*_{n}, *y*_{n}),
we will find a straight line which minimizes the *sum of the squares
of the errors*, that is, in Figure 3, we have an arbitrary curve
and the errors are marked in light-blue.

Figure 3. The errors between an arbitrary curve *y*(*x*_{i}) = *c*_{1}*x*_{i} + *c*_{2} and the points *y*_{i}.

Writing this out mathematically, we would like to minimize the sum of the squares of the errors (SSE):

$$\mathrm{SSE} = \sum_{i=1}^{n} \bigl( y_i - ( c_1 x_i + c_2 ) \bigr)^2$$

Notice that the **only** unknowns in this expression are *c*_{1} and
*c*_{2}. Thus, from calculus, we know that if we want to minimize this,
we must differentiate with respect to these variables and simultaneously set the derivatives equal to 0:

$$\frac{\partial\,\mathrm{SSE}}{\partial c_1} = -2 \sum_{i=1}^{n} x_i \bigl( y_i - ( c_1 x_i + c_2 ) \bigr) = 0$$

$$\frac{\partial\,\mathrm{SSE}}{\partial c_2} = -2 \sum_{i=1}^{n} \bigl( y_i - ( c_1 x_i + c_2 ) \bigr) = 0$$

Expanding the first equation (and dividing both sides by -2), we get:

$$\sum_{i=1}^{n} x_i y_i - c_1 \sum_{i=1}^{n} x_i^2 - c_2 \sum_{i=1}^{n} x_i = 0$$

If we, with some foresight, define the following, the *sum of the x's* (S_{x}), the
*sum of the y's* (S_{y}), the
*sum of the squares of the x's* (SS_{x}), and the
*sum of the products of the x's and y's* (SP_{x, y}), that is,

$$S_x = \sum_{i=1}^{n} x_i, \qquad S_y = \sum_{i=1}^{n} y_i, \qquad SS_x = \sum_{i=1}^{n} x_i^2, \qquad SP_{x,y} = \sum_{i=1}^{n} x_i y_i,$$

then we get the linear equation:

$$SS_x \, c_1 + S_x \, c_2 = SP_{x,y}$$

Expanding the second equation (and dividing both sides by -2), we get:

$$\sum_{i=1}^{n} y_i - c_1 \sum_{i=1}^{n} x_i - c_2 \sum_{i=1}^{n} 1 = 0$$

By calculating the third sum (which is simply *n*) and rearranging, we get the linear equation:

$$S_x \, c_1 + n \, c_2 = S_y$$

We could solve these the long way (as you probably did in high school); however,
we note that this describes the system of equations:

$$\begin{pmatrix} SS_x & S_x \\ S_x & n \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = \begin{pmatrix} SP_{x,y} \\ S_y \end{pmatrix}$$

This is a system of linear equations which we can, quite easily, solve.
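
For instance, a minimal Matlab sketch of this direct approach (the variable names below are our own) is:

% Given column vectors x and y of sampled data, form the four sums
% and solve the 2-by-2 system for the coefficients c = (c1, c2)'.
Sx   = sum( x );       % sum of the x's
Sy   = sum( y );       % sum of the y's
SSx  = sum( x.^2 );    % sum of the squares of the x's
SPxy = sum( x .* y );  % sum of the products of the x's and y's
n    = length( x );
c    = [SSx Sx; Sx n] \ [SPxy; Sy]   % c(1) is the slope, c(2) the intercept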

# The Critical Simplification

Fortunately, we can make life even easier. Recall that the Vandermonde matrix for
finding the linear polynomial which interpolates two points is:

$$\mathbf{V} = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \end{pmatrix}$$

If we define the *general* Vandermonde matrix:

$$\mathbf{V} = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix}$$

then an astute observation reveals that the system of linear equations may be written as:

**V**^{T}**V****c** = **V**^{T}**y**
Thus, to find the coefficient vector **c** = (*c*_{1}, *c*_{2})^{T},
all we must do is use Matlab, as follows:

>> c = (V' * V) \ (V' * y);

in which case, the linear function *y*(*x*) = *c*_{1} *x* + *c*_{2} is that function
which best fits the points, that is, it minimizes the sum of the squares of the errors.

The actual points shown in Figure 1 are:

(0.350, 2.909), (0.406, 2.987), (0.597, 3.259), (1.022, 3.645), (1.357, 4.212), (1.507, 4.295), (2.228, 5.277), (2.475, 5.574), (2.974, 6.293), (2.975, 6.259)

the Vandermonde matrix is

$$\mathbf{V} = \begin{pmatrix} 0.350 & 1 \\ 0.406 & 1 \\ 0.597 & 1 \\ 1.022 & 1 \\ 1.357 & 1 \\ 1.507 & 1 \\ 2.228 & 1 \\ 2.475 & 1 \\ 2.974 & 1 \\ 2.975 & 1 \end{pmatrix}$$

Solving the above system of equations defined by **V**^{T}**V****c** = **V**^{T}**y** yields the vector **c** = (1.278, 2.440)^{T}, and
therefore the best fitting curve is *y*(*x*) = 1.278 *x* + 2.440.
The points and this function are shown in Figure 4. You will note that it is significantly
better than any of the lines shown in Figures 2 or 3.

Figure 4. The least-squares line passing through the given data points.
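
As a sketch, the following Matlab commands reproduce this computation using the ten points listed above:

x = [0.350 0.406 0.597 1.022 1.357 1.507 2.228 2.475 2.974 2.975]';
y = [2.909 2.987 3.259 3.645 4.212 4.295 5.277 5.574 6.293 6.259]';
V = [x, ones(size(x))];        % the Vandermonde matrix
c = (V' * V) \ (V' * y)        % approximately (1.278, 2.440)'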

# HOWTO

# Problem

Given data (*x*_{i}, *y*_{i}),
for *i* = 1, 2, ..., *n* which is known to approximate
a straight line, find the best fitting straight line.

# Assumptions

We will assume the model is correct and that the data
is defined by two vectors **x** = (*x*_{i}) and
**y** = (*y*_{i}).

# Tools

We will use linear algebra.

# Process

We wish to find the best fitting line of the form
*y*(*x*) = *c*_{1}*x* + *c*_{2}, and thus, we define the matrix

$$\mathbf{V} = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix}$$

where the first column is the function *x* evaluated at
each of the *x* values, and the second column is the function
1 evaluated at each of the *x* values.

Hence, we solve the linear system **V**^{T}**Vc** = **V**^{T}**y**.

Having found the coefficient vector **c**, we now
associate the appropriate entries with the appropriate basis function:
*y*(*x*) = *c*_{1}*x* + *c*_{2}.
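
This process can be collected into a short Matlab function; the following is a sketch (the function name is our own) which would be saved in a file simple_linear_regression.m:

function c = simple_linear_regression( x, y )
    % SIMPLE_LINEAR_REGRESSION  Least-squares fit of y(x) = c(1)*x + c(2).
    % The arguments x and y must be column vectors of the same length.
    if length( x ) ~= length( y )
        error( 'x and y must have the same length' );
    end
    V = [x, ones(size(x))];      % first column: x; second column: 1
    c = (V' * V) \ (V' * y);     % solve V'Vc = V'y
end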

# Error Analysis

The study of the error associated with a linear regression
is beyond the scope of this class, but the following is a
summary of some points:

- The line passes through the average of the x values and
the average of the y values.
- The error of the line gets larger as you move away from
the average of the x values, though the error does not increase
as quickly as it does for interpolation, and
- It is, statistically speaking, better to take multiple
readings at a smaller number of values of the independent variable
than it is to take a single reading at each of many different
values.

The last point essentially says, it is better to take
three readings at each of *x* = 2, 5, and 8, than it is to take
one reading at each of *x* = 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Always talk to a statistician before you design any experiment.

For interest, we may define the variable *s*^{2} = SSE/(*n* - 2).
In this case, we can give a 95% confidence interval which indicates how well we
are estimating the mean; using Equation 1.4.11 of
Draper and Smith, we get the plot shown in Figure 1.

Figure 1. The 95% confidence interval associated with each y value.
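
As a sketch only, given data vectors x and y, such a band may be plotted as follows. (We assume the standard textbook formula for a confidence interval on the mean response, as found in Draper and Smith, and the function tinv from Matlab's Statistics Toolbox.)

n  = length( x );
c  = [x, ones(size(x))] \ y;                  % least-squares coefficients
s2 = sum( (y - (c(1)*x + c(2))).^2 )/(n - 2); % the variance estimate SSE/(n - 2)
xs = linspace( min(x), max(x), 100 )';
ys = c(1)*xs + c(2);                          % the fitted line
hw = tinv( 0.975, n - 2 ) * sqrt( s2*(1/n + (xs - mean(x)).^2/sum((x - mean(x)).^2)) );
plot( xs, ys, xs, ys - hw, xs, ys + hw )      % the line with its 95% confidence band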

# Examples

# Example 1

The following data is known to be linear in nature:

(1, 2.6228), (2, 2.9125), (3, 3.1390), (4, 4.2952), (5, 4.9918),
(6, 4.6468), (7, 5.4008), (8, 6.3853), (9, 6.7494), (10, 7.3864)

This data is shown in Figure 1.

Figure 1. The given data points.

Using the technique of least squares, we define:

$$\mathbf{V} = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ \vdots & \vdots \\ 10 & 1 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 2.6228 \\ 2.9125 \\ \vdots \\ 7.3864 \end{pmatrix}$$

We now solve **V**^{T}**V****c** = **V**^{T}**y** to get **c** = (0.53900, 1.8886)^{T}.

Therefore, the best fitting line using the least-squares
technique is *y*(*x*) = 0.53900*x* + 1.8886, which
is shown in Figure 2.

Figure 2. The best-fitting line using least squares.

# Example 2

The following data is similar to that in Example 1. What
would you consider to be problematic, and what is a reasonable
solution?

(1, 2.6228), (2, 2.9125), (3, 3.1390), (4, 4.2952), (5, 4.9918),
(6, 4.6468), (7, 5.4008), (8, 63.853), (9, 6.7494), (10, 7.3864)

The 8th point appears to be significantly different from
all other values. This would almost certainly appear to be
an error in measurement or an error in recording. There
are two possible solutions:

- Leave the point out, or
- Re-sample the value at the point
*x* = 8.

If we do nothing, we get the following *best-fitting*
line: *y*(*x*) = 2.2805 *x* - 1.9427. If
we remove the point and simply use the remaining nine points,
we get the line *y*(*x*) = 0.53220 *x* + 1.9035.

This second line is reasonably close to the line we found in
Example 1, as is shown in Figure 3.

Figure 3. The best-fitting line without using the 8th point (in blue) compared to the best-fitting line from Example 1 (in red).
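
A sketch of this computation in Matlab (the index vector k is our own device for leaving out the 8th point):

x = (1:10)';
y = [2.6228 2.9125 3.1390 4.2952 4.9918 4.6468 5.4008 63.853 6.7494 7.3864]';
k = [1:7, 9:10];                       % keep every point except the 8th
V = [x(k), ones(size(x(k)))];
c = (V' * V) \ (V' * y(k))             % approximately (0.53220, 1.9035)'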

# Questions

# Question 1

Find the least-squares curve which fits the linear
data:

**x** = (1, 2, 3, 4)^{T}

**y** = (0, 1, 1, 2)^{T}

Answer: *y*(*x*) = 0.6 *x* - 0.5.
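
You may check this answer with a quick Matlab sketch:

x = [1 2 3 4]';
y = [0 1 1 2]';
c = [x, ones(size(x))] \ y       % yields (0.6, -0.5)'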

# Question 2

Find the least-squares straight line which fits the data:

**x** = (0.282, 0.555, 0.089, 0.157, 0.357, 0.572, 0.222, 0.800, 0.266, 0.056)^{T}

**y** = (0.685, 0.563, 0.733, 0.722, 0.662, 0.588, 0.693, 0.530, 0.650, 0.713)^{T}

Answer: *y*(*x*) = -0.28714 *x* + 0.75027.

# Applications to Engineering

The least-squares technique for finding a linear
regression of the form *y* = *ax* + *b* is
critical in engineering: all sampled data has some
error associated with it, and while a model may suggest
that the response of a system should be linear, the actual
output may be less obviously so, for any number of reasons,
including limitations in measuring equipment, random
errors and fluctuations, and unaccounted-for variables.

The method of least squares provides a fast and
efficient means of finding the best-fitting straight line
through given data, yielding approximations
of the unknown coefficients.

For example, it is well known that an *ideal*
resistor is linear in its response. This, however, may be
less true in practice. Simply using one reading with one
current to approximate the resistance of a resistor has
two problems:

- We cannot give any estimate as to what the error
of our approximation is, and
- The resistor may not be linear for the given range
of currents, hence our approximation may be completely
inaccurate because the model of an ideal linear resistor
does not even apply.

Using multiple readings and linear regression gives us
the ability to make much more definitive statements about
the accuracy of our approximation and the applicability of
our model.
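
For instance, a minimal sketch in Matlab, using entirely hypothetical current-voltage readings, fits the line whose slope approximates the resistance:

amps  = [0.001 0.002 0.003 0.004 0.005]';  % applied currents (hypothetical data)
volts = [0.992 2.013 2.989 4.012 5.021]';  % measured voltages (hypothetical data)
c = [amps, ones(size(amps))] \ volts;      % least-squares fit: volts = c(1)*amps + c(2)
R = c(1)                                   % the slope approximates the resistance;
                                           % an intercept c(2) far from zero would
                                           % suggest the linear model is suspect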

# Matlab

Finding the coefficient vector in Matlab is very simple:

x = [1 2 3 4 5 6 7 8 9 10]';
y = [2.6228, 2.9125, 3.1390, 4.2952, 4.9918, 4.6468, 5.4008, 6.3853, 6.7494, 7.3864]';
V = [x, ones(size(x))];
c = V \ y; % same as c = (V' * V) \ (V' * y)

To plot the points and the best fitting curve, you can enter:

xs = (0:0.1:11)';
plot( x, y, 'o' )
hold on
plot( xs, c(1)*xs + c(2) );

Be sure to issue the command `hold off` if you want to
start with a clean plot window.

# Maple

The following commands in Maple:

with(CurveFitting):
pts := [[1, 2.6228], [2, 2.9125], [3, 3.1390], [4, 4.2952], [5, 4.9918], [6, 4.6468], [7, 5.4008], [8, 6.3853], [9, 6.7494], [10, 7.3864]];
fn := LeastSquares( pts, x, curve = a*x + b );
plots[pointplot]( pts );
plots[display]( plot( fn, x = 0..11 ), plots[pointplot]( pts ) );

calculate the least-squares line of best fit for the given data points,
produce a plot of those points, and produce a plot of the points together with the
best-fitting line. This last plot is shown in Figure 1.

Figure 1. The given points and the least squares line passing through those points.

For more help on the least squares function or on the CurveFitting package, enter:

?CurveFitting,LeastSquares
?CurveFitting