The variance of n sample values \(x_1\), \(x_2\), … , \(x_n\) is a measure of statistical dispersion, obtained by averaging the squared distances of the sample values from the expected value (mean). The mean captures the location of a distribution; the variance captures how spread out the values are. The square root of the variance is called the standard deviation.

Definition 1 for computing the variance is: \[m = \frac{1}{n} \sum_{i=1}^n x_i \] and \[ v = \frac { \sum_{i=1}^n (x_i - m)^2 } {n-1} \] Using Definition 1 in a computation requires first computing the mean and then, in a second pass over the data, computing the variance. This is called the 2-pass method.
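Definition 1 translates directly into Python. The sketch below is illustrative (the function name is not part of the skeleton code); it assumes `nums` is a sequence with at least two values:

```python
def variance_2pass(nums):
    """Sample variance via Definition 1: one pass for the mean, one for the squared deviations."""
    n = len(nums)
    m = sum(nums) / n                                 # first pass: the mean
    return sum((x - m) ** 2 for x in nums) / (n - 1)  # second pass: the variance
```

For example, `variance_2pass([1, 2, 7])` gives roughly 10.33, the unbiased sample variance of those three values.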

Definition 2 for computing the variance is: \[ v = \frac { n \sum_{i=1}^n x_i^2 - ( \sum_{i=1}^n x_i )^2 } {n(n-1)} \] Definition 2 follows from Definition 1 by expanding the squared term and simplifying. It allows one to avoid making two passes over the data and is called the 1-pass method.
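Definition 2 needs only running sums of \(x_i\) and \(x_i^2\), so a single loop suffices. Again a sketch with an illustrative name, not the skeleton's own function:

```python
def variance_1pass(nums):
    """Sample variance via Definition 2: running sums of the values and their squares."""
    n = len(nums)
    s = 0.0       # running sum of x_i
    s_sqr = 0.0   # running sum of x_i squared
    for x in nums:
        s += x
        s_sqr += x * x
    return (n * s_sqr - s * s) / (n * (n - 1))
```

On well-conditioned data this agrees with the two-pass formula: `variance_1pass([1, 2, 7])` also gives roughly 10.33.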

The 1-pass method is known to be numerically unstable. The trouble arises when the variance is small relative to the magnitude of the values: the two terms in the numerator, \(n \sum x_i^2\) and \((\sum x_i)^2\), are then large and nearly equal, and subtracting them cancels most of the significant digits, magnifying rounding error. This kind of subtlety arises in other computations as well and has been missed by programmers. In fact, Microsoft Excel originally implemented the unstable one-pass algorithm in its statistical library functions (the bug was fixed in Excel 2003). From http://support.microsoft.com/default.aspx?kbid=826393:
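The cancellation can be reproduced in a few lines. Here both formulas (with illustrative helper names, written in double-precision floating point) are applied to three values whose true sample variance is exactly 1.0, first near zero and then shifted up by \(10^9\):

```python
def var_1pass(nums):
    """Definition 2: running sums of the values and their squares."""
    n = len(nums)
    s = sum(nums)
    s_sqr = sum(x * x for x in nums)
    return (n * s_sqr - s * s) / (n * (n - 1))

def var_2pass(nums):
    """Definition 1: subtract the mean first, then square."""
    n = len(nums)
    m = sum(nums) / n
    return sum((x - m) ** 2 for x in nums) / (n - 1)

base = [1.0, 2.0, 3.0]             # sample variance is exactly 1.0
shifted = [x + 1e9 for x in base]  # same spread, much larger magnitudes

print(var_1pass(base), var_2pass(base))        # both agree: 1.0 1.0
print(var_1pass(shifted), var_2pass(shifted))  # one-pass collapses (0.0 here); two-pass stays 1.0
```

After the shift, \(n \sum x_i^2\) and \((\sum x_i)^2\) are both about \(9 \times 10^{18}\), while their true difference is only 6; the rounding error in each term is far larger than the difference itself, so the subtraction destroys the answer.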

**Results in earlier versions of Excel**:
In extreme cases where there are many significant digits in the data but a small variance, the old computational formula leads to inaccurate results. Earlier versions of Excel used a single pass through the data to compute the sum of squares of the data values, the sum of the data values, and the count of the data values (sample size). These quantities were then combined into the computational formula that is specified in the Help file in earlier versions of Excel.

**Results in Excel 2003 and in later versions of Excel**: The procedure that is used in Excel 2003 and in later versions of Excel uses a two-pass process through the data. First, the sum and count of the data values are computed and from these the sample mean (average) can be computed. Then, on the second pass, the squared difference between each data point and the sample mean is found and these squared differences are summed.

Download the skeleton code here: variance.py

In this lab assignment, you are to compute the mean and variance of several arrays of numbers. Compute the variance in three ways: the 1-pass algorithm, the 2-pass algorithm, and the `mean` and `var` functions provided by NumPy. A function `variance_test` is provided for you that automatically calls all three algorithms on an array of numbers. For example, if you have an array called `numbers`, you can simply call `variance_test(numbers)`.

Run each of the three variance algorithms on the following data sets and print the results. Use a NumPy array of floats (e.g., created with the `zeros` function) to store the values. Remember that the random functions described below are in the `random` module.

- N=200. Generate N random integers in the open-ended range [0,1000) (uniform distribution).
- N=200. Generate N integers using 1000*normalvariate(0.5, 0.1) - parameter 0.5 is the mean and 0.1 is the std deviation (resulting in a normal distribution with mean 500 and standard deviation 100).
- The third case is already written for you. When you run your program, you'll notice that the one pass method eventually diverges and gives you massively different answers from the other two algorithms. Include an explanation (in the file as a comment or as a separate text file) as to why you think this happens.

**NumPy Examples**

```python
>>> from numpy import *
>>> a = array([1, 2, 7])
>>> a.var()
6.8888888888888875
>>> a.mean()
3.3333333333333335
```

Note that `variance_test` passes an optional argument called `ddof` to NumPy's `var` function. This parameter is required to make NumPy calculate the unbiased sample variance (dividing by n − 1 instead of n); without it, NumPy's result will differ from the other two algorithms.
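The difference is easy to see on a small array: the default `ddof=0` divides by n, while `ddof=1` divides by n − 1 and matches Definitions 1 and 2 above.

```python
import numpy

a = numpy.array([1.0, 2.0, 7.0])
print(numpy.var(a))          # default ddof=0: divides by n (population variance)
print(numpy.var(a, ddof=1))  # ddof=1: divides by n - 1 (unbiased sample variance)
```

The first call prints about 6.889 and the second about 10.333, matching the earlier `a.var()` example and the two hand-written algorithms respectively.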

```python
import numpy
import random


def variance1(nums):
    """Calculates variance using the one pass method."""
    n = len(nums)
    total = 0.0       # renamed from `sum` to avoid shadowing the built-in
    total_sqr = 0.0
    for xi in nums:
        total += xi
        total_sqr += xi ** 2
    return (n * total_sqr - total ** 2) / (n * (n - 1))


def mean(nums):
    """Calculates the average of a sequence of numbers."""
    m = 0.0
    for xi in nums:
        m += xi
    return m / len(nums)


def variance2(nums):
    """Calculates variance using the two pass method."""
    m = mean(nums)
    v = 0.0
    for xi in nums:
        v += (xi - m) ** 2
    return v / (len(nums) - 1)


def variance_test(nums):
    """Runs the three variance algorithms and prints results."""
    one_var = variance1(nums)
    two_var = variance2(nums)
    numpy_var = numpy.var(nums, ddof=1)
    print('One pass variance:', one_var)
    print('Two pass variance:', two_var)
    print('NumPy variance:   ', numpy_var)
    print()


if __name__ == '__main__':
    # Data set 1: N uniform random values in [0, 1000).
    N = 200
    nums = numpy.zeros(N)
    for i in range(N):
        nums[i] = random.uniform(0, 1000)
    variance_test(nums)

    # Data set 2: normal distribution with mean 500 and std deviation 100.
    for i in range(N):
        nums[i] = 1000 * random.normalvariate(0.5, 0.1)
    variance_test(nums)

    # Data set 3: values with a small spread, shifted to ever-larger magnitudes.
    print('*' * 80)
    N = 5
    nums = numpy.zeros(N)
    for i in range(N):
        nums[i] = random.uniform(0, 20) + 100
    for y in range(3, 12):
        variance_test(nums)
        for i in range(N):
            nums[i] += 10 ** y - 10 ** (y - 1)
```