The variance of n sample values \(x_1\), \(x_2\), … , \(x_n\) is a measure of statistical dispersion, obtained by averaging the squared distances of the sample values from their mean. The mean captures the location of a distribution; the variance captures how spread out the values are. The square root of the variance is called the standard deviation.

Definition 1 for computing the variance is: \[m = \frac{1}{n} \sum_{i=1}^n x_i \] and \[ v = \frac { \sum_{i=1}^n (x_i - m)^2 } {n-1} \] Using Definition 1 in a computation requires first computing the mean and then, in a second pass over the data, computing the variance. This is called the 2-pass method.
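A minimal Python sketch of the 2-pass method (the function name is illustrative, not part of the assignment):

```python
def variance_two_pass(x):
    """Sample variance via Definition 1: compute the mean, then a second pass."""
    n = len(x)
    m = sum(x) / n                                   # pass 1: the mean
    return sum((xi - m) ** 2 for xi in x) / (n - 1)  # pass 2: squared deviations

# Deviations from the mean 100002 are -2, -1, 0, 1, 2, so v = 10/4 = 2.5
print(variance_two_pass([100000.0, 100001.0, 100002.0, 100003.0, 100004.0]))  # 2.5
```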

Definition 2 for computing the variance is: \[ v = \frac { n \sum_{i=1}^n x_i^2 - ( \sum_{i=1}^n x_i )^2 } {n(n-1)} \] Definition 2 follows from the linearity of expectations. It allows one to avoid making two passes and is called the 1-pass method.
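A corresponding sketch of the 1-pass method, which accumulates the sum and the sum of squares in a single loop:

```python
def variance_one_pass(x):
    """Sample variance via Definition 2: one pass, accumulating sum and sum of squares."""
    n = len(x)
    s = sq = 0.0
    for xi in x:
        s += xi        # running sum of the values
        sq += xi * xi  # running sum of the squared values
    return (n * sq - s * s) / (n * (n - 1))

print(variance_one_pass([1.0, 2.0, 7.0]))  # 10.333... (unbiased sample variance)
```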

The 1-pass method is known to be numerically unstable. The trouble arises when the variance is small but the values themselves are large and close in size: the formula then subtracts two large, nearly equal numbers, and the cancellation magnifies rounding error. This kind of subtlety arises in other computations and has been missed by programmers. In fact, Microsoft Excel initially implemented the unstable one-pass algorithm in its statistical library functions (the bug was fixed in Excel 2003). From http://support.microsoft.com/default.aspx?kbid=826393:

**Results in earlier versions of Excel**:
In extreme cases where there are many significant digits in the data but a small variance, the old computational formula leads to inaccurate results. Earlier versions of Excel used a single pass through the data to compute the sum of squares of the data values, the sum of the data values, and the count of the data values (sample size). These quantities were then combined into the computational formula that is specified in the Help file in earlier versions of Excel.

**Results in Excel 2003 and in later versions of Excel**: The procedure that is used in Excel 2003 and in later versions of Excel uses a two-pass process through the data. First, the sum and count of the data values are computed and from these the sample mean (average) can be computed. Then, on the second pass, the squared difference between each data point and the sample mean is found and these squared differences are summed.
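The cancellation can be demonstrated directly. Here is a hedged sketch (the particular data values are invented for illustration) comparing the two definitions on large, closely spaced values whose true sample variance is exactly 30:

```python
def variance_two_pass(x):
    # Definition 1: mean first, then squared deviations.
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** 2 for xi in x) / (n - 1)

def variance_one_pass(x):
    # Definition 2: one pass over sum and sum of squares.
    n = len(x)
    s = sq = 0.0
    for xi in x:
        s += xi
        sq += xi * xi
    return (n * sq - s * s) / (n * (n - 1))

# Large values, small spread: deviations from the mean are -6, -3, 3, 6,
# so the true sample variance is (36 + 9 + 9 + 36) / 3 = 30.
data = [1e8 + d for d in (4.0, 7.0, 13.0, 16.0)]
print(variance_two_pass(data))  # 30.0
print(variance_one_pass(data))  # noticeably off: n*sum(x^2) and sum(x)^2 nearly cancel
```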

In this lab assignment, you are to compute the mean and variance of several arrays of numbers. Compute the variance in three ways: with the 1-pass algorithm, with the 2-pass algorithm, and with the `var` function provided by NumPy. (You may use the `mean` function from NumPy.)

Run each of the three variance algorithms on the following data sets and print the results (including the mean). You may also want to print the standard deviation (the square root of the variance) to check your results. Use a NumPy array of floats (e.g., created with the `zeros` function) to store the values.

- N=200. Generate N random integers in the half-open range [0, 1000) (uniform distribution).
- N=200. Generate N integers using `1000*normalvariate(0.5, 0.1)`, where parameter 0.5 is the mean and 0.1 is the standard deviation (resulting in a normal distribution with mean 500 and standard deviation 100).
- N=5. Start with the five values [100000, 100001, 100002, 100003, 100004] and compute the variance (using all three methods) and the mean. Repeat five more times, with each iteration multiplying the “base” (0th element) by 10 and incrementing by one to get the remaining four elements. That is, the second iteration uses the five values [1000000, 1000001, 1000002, 1000003, 1000004].
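One possible way to set up the three data sets, as a sketch only: the variable names, the use of Python's `random` module, and the structure of the loops are assumptions, not requirements of the assignment.

```python
import numpy as np
from random import randrange, normalvariate

N = 200

# Data set 1: N uniform random integers in [0, 1000), stored in a float array.
a = np.zeros(N)
for i in range(N):
    a[i] = randrange(0, 1000)

# Data set 2: 1000 * normalvariate(0.5, 0.1), i.e. roughly mean 500, std dev 100.
b = np.zeros(N)
for i in range(N):
    b[i] = int(1000 * normalvariate(0.5, 0.1))

# Data set 3: a base value plus the next four integers; six iterations total,
# multiplying the base by 10 each time (100000, 1000000, ...).
base = 100000
for _ in range(6):
    c = np.array([float(base + k) for k in range(5)])
    # ... compute the mean and all three variances on c here ...
    base *= 10
```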

Note that the NumPy `var` function computes the “biased” variance (it divides by N rather than by N-1 as shown above). Thus, your computed variances will differ slightly from the NumPy ones, especially for small N.
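NumPy's `var` also accepts a `ddof` (“delta degrees of freedom”) argument; passing `ddof=1` divides by N-1 and matches the unbiased formulas above:

```python
import numpy as np

a = np.array([1.0, 2.0, 7.0])
print(a.var())        # biased: divides by N,   62/9  ~ 6.889
print(a.var(ddof=1))  # unbiased: divides by N-1, 31/3 ~ 10.333
```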

**Numpy Examples**

```python
>>> from numpy import *
>>> a = array([1, 2, 7])
>>> a.var()
6.8888888888888875
>>> a.mean()
3.3333333333333335
```

Here is a solution: lab8.py