# Introduction

Consider flipping a coin 10 times. If the coin was fair, one would expect approximately 5 heads and 5 tails; however, one would also expect a randomness in the sequences. A sequence such as HTHTHTHT would be just as unexpected as the sequence HHHHHTTTTT.

Now, consider data which is being fit by a least-squares line ax + b. It would be expected that the variation around the line (either above or below) would be similar to flipping a coin if the line actually fits the data.

We will see how we can use the sequence of aboves and belows to determine if a least-squares curve is actually a good description of the data.

# Theory

Given a sequence of n samples of the from (x1, y1), ... (xn, yn), assuming that there are sufficiently many different x values, it is always possible to put a straight line though those points.

The previous topic looked at the lack-of-fit test which requires the additional assumption that the data is normally distributed around the curves. This may not always be the case, so we will now look at a more general but very intuitive test: the Wald-Wolfowitz runs test.

Consider the data plotted in Figure 1 where a straight line is fit to the data.

Figure 1. Data fit to a straight line.

If we highlight all the points which are above the least-squares line, we see in Figure 2 that the sequence of points is of the form +---+-++-+++--++++--. This is not unlike a sequence of flips of a coin.

Figure 2. Data fit to a straight line with entries above the line in magenta.

If consider the data in Figure 3, it too appears to be well approximated by a straight line, but the sequence of points above and below the least-squares line is ---+++++++++++-+----. This would suggest the data may not be linear.

Figure 3. Data which appears to be fit by a straight line.

Figure 4. Data which appears to be fit by a straight line.

If, however, we fit a quadratic polynomial to this data, the the sequence of points above and below the curve are now -+-+-++---+--++++-+-, as is shown in Figure 5.

Figure 5. The data in Figure 2 fit with a quadratic polynomial.

The Wald-Wolfowitz runs test defines a run as any consecutive sequence of samples which contain the same property. In this case, the property is either above or below the least-squares curve.

Given a sequence of n samples where n+ are above the line and n are below the line, the average (mean) number of runs is

μ = 2n+n+/n + 1

and the standard deviation is

In the data in Figure 2, there are n = 20 samples where n+ = 11 and n = 9, we may calculate μ = 10.9 and σ = 2.1535. There are 10 runs and this easily falls within μ±σ.

In Figure 4, there are also n = 20 samples but n+ = 12 and n = 8, we may calculate μ = 10.6 and σ = 2.0845. There are only 5 runs and this easily falls well outside μ±σ. One may determine that 5 = μ − 2.6865σ which is exceptionally unlikely. Assuming that the data is actually linear then there is a 95% chance (19 times out of 20) that the number of runs will fall within μ±1.96σ.

Instead, in Figure 5 where we fit a quadratic function to the data, we see that n+ = 10 and n = 10, we may calculate μ = 11 and σ = 2.1764. There are 13 runs and this falls well within mu;±σ.

Note that the runs test may be used to demonstrate that data does not fit a particular model: it cannot be used to prove that data does for certain follow a particular mode. While it is incredibly unlikely, the above data could come from a polynomial of degree 20.

could

# Problem

Given n data points (x1, y1), ..., (xn, yn) and a curve f(x) which is claimed to fit the data.

# Assumptions

The runs test does not make any assumptions about the data other than random distribution from the same distribution.

# Process

Count the number of data points n+ which are above the curve, that is yk > f(xk), and the number of points n fall below the curve, that is yk < f(xk). If one point falls exactly on the curve (almost impossible except if the example is artificially constructed), it could be discarded.

Calculate the average (mean) value

μ = 2n+n+/n + 1

and the standard deviation

If the number of runs falls outside the interval μ±1.96σ, then it is reasonable to reject the hypothesis that the curve is a good description of the data.

# Examples

1. Given the 20 data points

(0.2289, 0.7810), (0.2916, 0.7536), (0.3528, 0.4339), (0.4716, 0.4926), (0.5887, 0.0780), (0.5974, 0.2005), (0.6236, 0.1895), (0.6745, 0.03601), (0.7187, -0.6034), (0.8371, 0.07676), (0.8545, -0.6773), (0.9261, -0.7672), (0.9390, -0.7627), (0.9543, -0.2431), (1.006, -0.8010), (1.033, -0.7364), (1.096, -0.8527), (1.112, -0.5327), (1.138, -1.015), (1.200, -1.012)

determine if this is best fit by a line.

The best fitting line is y = 1.225 − 1.883x. This is shown in Figure E1 with the points above the line in magenta.

Figure E1. The data in Example 1 together with the least-squares line.

The signs of the points are -+-+-+++-+---+---+-+ with n+ = 9 and n = 11. The above test gives μ = 10.9 and σ = 2.153 and the number of runs is 14. This falls with μ±1.96σ which is the interval [6.7, 15.1].

Therefore, we could no reject the hypothesis that this data is fit by a straight line.

2. Periodic sampling produced the following data:

(1, 3.984), (2, 3.713), (3, 3.331), (4, 2.957), (5, 2.801), (6, 2.260), (7, 1.819), (8, 1.605), (9, 1.408), (10, 1.306), (11, 1.131), (12, 1.149), (13, 0.9031), (14, 0.7661), (15, 0.7615), (16, 0.6180), (17, 0.6413), (18, 0.5464), (19, 0.4570), (20, 0.4126)

The data appears to suggest that it may be fit by a quadratic function and therefore the least-squares best-fitting quadratic function p(x) = 0.01160x2 − 0.4273x + 4.451 is found.

Figure E2a shows the data and the least-squares quadratic polynomial. There are only 5 runs, yet the Walz-Wolfowitz runs test with n+ = 11 and n = 9 would indicate that we should expect between 7 and 15 runs. This would lead us to reject the hypothesis that the data is described by a quadratic function.

Figure E2a. The data in Example 2 together with the least-squares quadratic function.

If the model for the data suggests that an exponential curve may be appropriate (for example, are the y values expected to go to 0 as x → ∞), then it would be useful to find, after using the appropriate transform, the function y = 4.589e-0.1216x. Figure E2b shows the plot of the data and the best fitting exponential function. This has eight runs which considered to be acceptable.

Figure E2b. The data in Example 2 together with the least-squares exponential function.

3. The data

(1, 0.002058), (2, 0.005495), (4, 0.01653), (8, 0.03960), (16, 0.1089), (32, 0.3355), (64, 0.8213), (128, 2.060), (256, 5.531), (512, 16.83), (1024, 45.25), (2048, 115.5), (4096, 311.2)

may appear to be growing quadratically; however, the best-fitting quadratic function 0.000009998x2 + 0.03545x − 1.256 has only four runs. With n = 13, n = 7, and n+ = 6, the the mean and standard deviations are μ = 7.462 and σ = 1.715 and therefore one would expect 5, 6, 7, 8, 9, or 10 runs 95% of the time. This would lead us to reject the hypothesis that the data is growing quadratically.

As the data appears to be converging on 0 as x → 0, it would be unlikely that the data should be modeled by an exponential curve. A more careful analysis using the power transform suggests that the data is growing according to 0.002099x1.4333 which may lead to the hypothesis that the data is growing according to x√2.

# Questions

1. Given n = 12 and n+ = 5 and n = 7, what number of runs would fall with 1.96 standard deviations of the mean? (4, 5, 6, 7, 8, and 9; however, 10 > 9.97125 and is therefore outside the interval)

2. Given n = 31 and n+ = 17 and n = 14, what number of runs would fall with 1.96 standard deviations of the mean? (12, 13, ..., 21; however, 11 < 11.04211 and is therefore outside the interval.)

3. Assuming n+ = n = n/2, what is the limit of the mean and standard deviation as n becomes large? (n/2 + 1 and √n/2.)