# CS 190C LAB 10: Friday, March 24, 2009

## Numerically Stable and Unstable Methods to Compute the Variance

The variance of n sample values $$x_1$$, $$x_2$$, … , $$x_n$$ is one measure of statistical dispersion, obtained by averaging the squared distance of the sample values from the expected value (mean). The mean captures the location of a distribution and the variance captures the degree of the values being spread out. The square root of the variance is called the standard deviation.

Definition 1 for computing the variance is: $m = \frac{1}{n} \sum_{i=1}^n x_i$ and $v = \frac { \sum_{i=1}^n (x_i - m)^2 } {n-1}$ Using Definition 1 in a computation requires to first compute the mean and then, making another path over the data, computing the variance. This is called the 2-pass method.

Definition 2 for the computing the variance is: $v = \frac { n \sum_{i=1}^n x_i^2 - ( \sum_{i=1}^n x_i )^2 } {n(n-1)}$ Definition 2 follows from the linearity of expectations. It allows one to avoid making two passes and is called the 1-pass method.

The 1-pass method is known to be numerically unstable. This happens in situations when the variance is small and the numbers are large and close in size and subtracting two large numbers magnifies the error. This kind of subtlety arises in other computations and has been missed by programmers. In fact, Microsoft Excel initially implemented the unstable one-pass algorithm in statistical library functions (the bug was fixed in Excel 2003). From: http://support.microsoft.com/default.aspx?kbid=826393:

Results in earlier versions of Excel: In extreme cases where there are many significant digits in the data but a small variance, the old computational formula leads to inaccurate results. Earlier versions of Excel used a single pass through the data to compute the sum of squares of the data values, the sum of the data values, and the count of the data values (sample size). These quantities were then combined into the computational formula that is specified in the Help file in earlier versions of Excel.

Results in Excel 2003 and in later versions of Excel: The procedure that is used in Excel 2003 and in later versions of Excel uses a two-pass process through the data. First, the sum and count of the data values are computed and from these the sample mean (average) can be computed. Then, on the second pass, the squared difference between each data point and the sample mean is found and these squared differences are summed.

In this lab assignment, you are to compute the mean and variance on several arrays of numbers. Compute the variance in three ways: the 1-pass algorithm, the 2-pass algorithm, and using the mean and var functions provided by NumPy. A function variance_test is provided for you that automatically calls all three algorithms given an array of numbers. For example, if you have an array called numbers, you can simply call it with variance_test(numbers).

Run each one of the three variance algorithms on the following data sets and print the results. Use a NumPy array of floats (e.g., created with the zeros function) to store the values. Remember that the random functions described are in the random module.

• N=200. Generate N random integers in the open-ended range [0,1000) (uniform distribution).
• N=200. Generate N integers using 1000*normalvariate(0.5, 0.1) - parameter 0.5 is the mean and 0.1 is the std deviation (resulting in a normal distribution with mean 500 and variance 100).
• The third case is already written for you. When you run your program, you'll notice that the one pass method eventually diverges and gives you massively different answers from the other two algorithms. Include an explanation (in the file as a comment or as a separate text file) as to why you think this happens.

Numpy Examples

>>> from numpy import *
>>> a = array([1,2,7])
>>> a.var()
6.8888888888888875

>>> from numpy import *
>>> a = array([1,2,7])
>>> a.mean()
3.3333333333333335

Note that variance_test uses an optional argument to var called ddof. In order to make NumPy calculate an unbiased variance, this parameter is required; otherwise, it will generate results that are different from the other two algorithms.

## Solution

import numpy
import random
import math

def variance1(nums):
"""Calculates variance using the one pass method."""
n = len(nums)
sum = 0.0
sum_sqr = 0.0
for xi in nums:
sum += xi
sum_sqr += xi**2
return (n*sum_sqr - sum**2)/(n*(n-1))

def mean(nums):
"""Calculates the average of a sequence of numbers."""
mean = 0.0
for xi in nums:
mean += xi
return mean / len(nums)

def variance2(nums):
"""Calculates variance using the two pass method."""
m = mean(nums)
v = 0.0
for xi in nums:
v += (xi-m)**2
return v / (len(nums)-1)

def variance_test(nums):
"""Runs the three variance algorithms and prints results."""
one_var = variance1(nums)
two_var = variance2(nums)
numpy_var = numpy.var(nums, ddof=1)
print 'One pass variance:', one_var
print 'Two pass variance:', two_var
print 'NumPy variance:   ', numpy_var
print

if __name__ == '__main__':
N = 200
nums = numpy.zeros(200)
for i in xrange(N):
nums[i] = random.uniform(0, 1000)
variance_test(nums)

for i in xrange(N):
nums[i] = random.normalvariate(0.5, 0.1)
variance_test(nums)

print '*' * 80
N = 5
nums = numpy.zeros(N)
for i in xrange(N):
nums[i] = random.uniform(0, 20) + 100
for y in xrange(3, 12):
variance_test(nums)
for i in xrange(N):
nums[i] += 10**y - 10**(y-1)        