Introduction
Theory
HOWTO
Error Analysis
Examples
Questions
Applications in Engineering
Matlab
Maple

# Introduction

Consider flipping a coin 10 times. If the coin was fair, one
would expect approximately 5 heads and 5 tails; however, one would
also expect a randomness in the sequences. A sequence such
as HTHTHTHT would be just as unexpected as the sequence HHHHHTTTTT.

Now, consider data which is being fit by a least-squares line
*ax* + *b*. It would be expected that the variation
around the line (either above or below) would be similar to
flipping a coin **if** the line actually fits the data.

We will see how we can use the sequence of *aboves* and
*belows* to determine if a least-squares curve is actually
a good description of the data.

# References

# Theory

Given a sequence of *n* samples of the from
(*x*_{1}, *y*_{1}), ...
(*x*_{n}, *y*_{n}), assuming
that there are sufficiently many different *x* values, it is
always possible to put a straight line though those points.

The previous topic looked at the lack-of-fit test which requires
the additional assumption that the data is normally distributed
around the curves. This may not always be the case, so we will
now look at a more general but very intuitive test: the
Wald-Wolfowitz runs test.

Consider the data plotted in Figure 1 where a straight line
is fit to the data.

Figure 1. Data fit to a straight line.

If we highlight all the points which are above the least-squares
line, we see in Figure 2 that the sequence of points is
of the form `+---+-++-+++--++++--`. This is not unlike a sequence
of flips of a coin.

Figure 2. Data fit to a straight line with entries above the
line in magenta.

If consider the data in Figure 3, it too appears to be well
approximated by a straight line, but the sequence of
points above and below the least-squares line is
`---+++++++++++-+----`. This would suggest the data
may not be linear.

Figure 3. Data which appears to be fit by a straight line.

Figure 4. Data which appears to be fit by a straight line.

If, however, we fit a quadratic polynomial to this data,
the the sequence of points above and below the curve
are now `-+-+-++---+--++++-+-`, as is shown in Figure 5.

Figure 5. The data in Figure 2 fit with a quadratic polynomial.

The Wald-Wolfowitz runs test defines a *run* as any consecutive
sequence of samples which contain the same property. In this case, the
property is either above or below the least-squares curve.

Given a sequence of *n* samples where *n*_{+} are
above the line and *n*_{−} are below the line,
the average (mean) number of runs is

*μ* = 2*n*_{+}*n*_{+}/*n* + 1

and the standard deviation is

In the data in Figure 2, there are *n* = 20 samples
where
*n*_{+} = 11 and
*n*_{−} = 9, we may calculate
*μ* = 10.9 and *σ* = 2.1535. There are 10 runs and
this easily falls within μ±σ.

In Figure 4, there are also *n* = 20 samples but
*n*_{+} = 12 and
*n*_{−} = 8, we may calculate
*μ = 10.6* and *σ* = 2.0845. There are only
5 runs and
this easily falls well outside μ±σ. One
may determine that 5 = μ − 2.6865σ which is exceptionally
unlikely. Assuming that the data is actually linear then there
is a 95% chance (19 times out of 20) that the number of runs
will fall within *μ*±1.96*σ*.

Instead, in Figure 5 where we fit a quadratic function to
the data, we see that
*n*_{+} = 10 and
*n*_{−} = 10, we may calculate
*μ* = 11 and *σ* = 2.1764. There are 13 runs and
this falls well within *mu;*±*σ*.

Note that the runs test may be used to demonstrate that data does
not fit a particular model: it cannot be used to *prove* that
data does for certain follow a particular mode. While it is
incredibly unlikely, the above data could come from a polynomial
of degree 20.

could

# HOWTO

# Problem

Given *n* data points
(*x*_{1}, *y*_{1}), ...,
(*x*_{n}, *y*_{n})
and a curve f(*x*) which is claimed to fit the data.

# Assumptions

The runs test does not make any assumptions about the data other
than random distribution from the same distribution.

# Process

Count the number of data points *n*_{+} which are above the curve,
that is *y*_{k} > f(*x*_{k}), and
the number of points *n*_{−} fall below the curve,
that is *y*_{k} < f(*x*_{k}).
If one point falls exactly on the curve (almost impossible except if
the example is artificially constructed), it could be discarded.

Calculate the average (mean) value

*μ* = 2*n*_{+}*n*_{+}/*n* + 1

and the standard deviation

If the number of runs falls outside the
interval *μ*±1.96*σ*, then it is reasonable
to reject the hypothesis that the curve is a good description of the
data.

# Error Analysis

# Examples

1. Given the 20 data points

(0.2289, 0.7810),
(0.2916, 0.7536),
(0.3528, 0.4339),
(0.4716, 0.4926),
(0.5887, 0.0780),
(0.5974, 0.2005),
(0.6236, 0.1895),
(0.6745, 0.03601),
(0.7187, -0.6034),
(0.8371, 0.07676),
(0.8545, -0.6773),
(0.9261, -0.7672),
(0.9390, -0.7627),
(0.9543, -0.2431),
(1.006, -0.8010),
(1.033, -0.7364),
(1.096, -0.8527),
(1.112, -0.5327),
(1.138, -1.015),
(1.200, -1.012)

determine if this is best fit by a line.

The best fitting line is *y* = 1.225 − 1.883*x*.
This is shown in Figure E1 with the points above the line in magenta.

Figure E1. The data in Example 1 together with the least-squares line.

The signs of the points are `-+-+-+++-+---+---+-+`
with *n*_{+} = 9 and *n*_{−} = 11. The
above test gives *μ* = 10.9 and *σ* = 2.153 and
the number of runs is 14. This falls with *μ*±1.96*σ* which is the interval [6.7, 15.1].

Therefore, we could no reject the hypothesis that this data is fit
by a straight line.

2. Periodic sampling produced the following data:

(1, 3.984), (2, 3.713), (3, 3.331), (4, 2.957), (5, 2.801),
(6, 2.260), (7, 1.819), (8, 1.605), (9, 1.408), (10, 1.306),
(11, 1.131), (12, 1.149), (13, 0.9031), (14, 0.7661), (15, 0.7615),
(16, 0.6180), (17, 0.6413), (18, 0.5464), (19, 0.4570), (20, 0.4126)

The data appears to suggest that it may be fit by a quadratic
function and therefore the least-squares best-fitting quadratic
function p(*x*) = 0.01160*x*^{2} − 0.4273*x* + 4.451 is found.

Figure E2a shows the data and the least-squares quadratic polynomial.
There are only 5 runs, yet the Walz-Wolfowitz runs test
with *n*_{+} = 11 and *n*_{−} = 9 would
indicate that we should expect between 7 and 15 runs. This would lead
us to reject the hypothesis that the data is described by a quadratic
function.

Figure E2a. The data in Example 2 together with the least-squares quadratic function.

If the model for the data suggests that an exponential curve may
be appropriate (for example, are the *y* values expected to
go to 0 as *x* → ∞), then it would be useful to
find, after using the appropriate transform, the function
*y* = 4.589*e*^{-0.1216x}. Figure E2b shows
the plot of the data and the best fitting exponential function.
This has eight runs which considered to be acceptable.

Figure E2b. The data in Example 2 together with the least-squares exponential function.

3. The data

(1, 0.002058), (2, 0.005495), (4, 0.01653), (8, 0.03960), (16, 0.1089),
(32, 0.3355), (64, 0.8213), (128, 2.060), (256, 5.531), (512, 16.83),
(1024, 45.25), (2048, 115.5), (4096, 311.2)

may appear to be growing quadratically; however, the best-fitting
quadratic function 0.000009998x^{2} + 0.03545x − 1.256
has only four runs. With *n* = 13,
*n*_{−} = 7, and
*n*_{+} = 6, the the mean and standard deviations
are *μ* = 7.462 and *σ* = 1.715 and therefore
one would expect 5, 6, 7, 8, 9, or 10 runs 95% of the time. This
would lead us to reject the hypothesis that the data is growing
quadratically.

As the data appears to be converging on 0 as *x* → 0, it
would be unlikely that the data should be modeled by an exponential
curve. A more careful analysis using the power transform suggests that
the data is growing according to 0.002099*x*^{1.4333}
which may lead to the hypothesis that the data is growing
according to *x*^{√2}.

# Questions

1. Given *n* = 12 and *n*_{+} = 5 and *n*_{−} = 7, what number of runs would fall with 1.96 standard deviations of
the mean?
(4, 5, 6, 7, 8, and 9; however, 10 > 9.97125 and is therefore outside the
interval)

2. Given *n* = 31 and *n*_{+} = 17 and *n*_{−} = 14, what number of runs would fall with 1.96 standard deviations of
the mean?
(12, 13, ..., 21; however, 11 < 11.04211 and is therefore outside the
interval.)

3. Assuming *n*_{+} = *n*_{−} = *n*/2,
what is the limit of the mean and standard deviation as *n* becomes
large? (*n*/2 + 1 and √*n*/2.)

# Applications to Engineering

# Matlab

# Maple