[Python-ideas] Numerical instability was: Re: Introduce collections.Reiterable
Terry Reedy
tjreedy at udel.edu
Fri Sep 20 22:01:28 CEST 2013
On 9/20/2013 6:45 AM, Oscar Benjamin wrote:

> I'm going to assume that numerical instability is just a way of saying
> that a method is inaccurate in some cases.

Good enough for me.

> Although the incremental algorithm is much better than the naive
> approach Steven (knowingly) showed above I don't think it's true that
> constraining yourself to a single pass doesn't limit the possible
> accuracy.

That may be one difference between integer and float arithmetic. The
order of operations makes a difference.

> Another point of relevance here is that the incremental
> formula cannot be as efficiently implemented in Python since you don't
> get to take advantage of the super fast math.fsum function which is
> also more accurate than a naive Kahan algorithm.

Yes. One of the differences between 'theoretical' algorithms and
practical algorithms coded in CPython is the bias toward using
functions already coded in C.

> The script at the bottom of this post tests a few methods on a
> deliberately awkward set of random numbers and typical output is:

Thanks for doing this.

> $ ./stats.py
> exact: 0.989661716301
> naive -> error = -21476.0408922
> incremental -> error = -1.0770901604e-07
> two_pass -> error = 1.29118937764e-13
> three_pass -> error = 0.0

The incremental method is good enough for data measured to 3
significant figures, as is typical in at least parts of some sciences,
and in the data I worked with for a decade. But it is not good enough
for substantially more accurate data. The Python statistics module
should cater to the latter. The doc should just say that it requires a
serially re-iterable input. (A person with data too large to fit in
memory could write an iterable that opens a file and returns an
iterator that reads blocks of values and yields them one at a time; a
sketch of such a re-iterable appears below, after the script.)

The incremental method is useful for returning running means and
deviations for data collected sporadically and indefinitely, without
needing to store the cumulative data. It is a nice, non-obvious example
of the principle that it is sometimes possible to summarize cumulative
data with a relatively small and fixed set of sufficient statistics.
(A sketch of such a running summary also appears below.)

> For these numbers the three_pass method usually has an error of 0 but
> otherwise 1ulp (1e-16). (It can actually be collapsed into a two pass
> method but then we couldn't use fsum.)
>
> If you know of a one-pass algorithm (or a way to improve the
> implementation I showed) that is as accurate as either the two_pass or
> three_pass methods I'd be very interested to see it (I'm sure Steven
> would be as well).

If I were trying to improve the incremental variance algorithm, I would
study the fsum method until I really understood it and then see if I
could apply the same ideas.
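One simple experiment along those lines would be to add Kahan-style
compensation terms to the two running accumulators in Oscar's
variance_incremental. The sketch below is only an illustration of that
idea (the function name and the compensation variables are mine, not
anything from Oscar's script or the statistics module), and it is not
guaranteed to approach fsum's accuracy, since fsum tracks all the
partial sums exactly rather than a single correction term per
accumulator:

def variance_incremental_kahan(data):
    # One-pass (Welford) variance with Kahan-style compensation
    # applied to both running accumulators.
    n = 0
    mean = 0.0
    c_mean = 0.0   # compensation (lost low-order bits) for mean
    M2 = 0.0
    c_M2 = 0.0     # compensation for the sum of squared deviations

    for x in data:
        n += 1
        delta = x - mean
        # Kahan step for: mean += delta/n
        y = delta/n - c_mean
        t = mean + y
        c_mean = (t - mean) - y
        mean = t
        # Kahan step for: M2 += delta*(x - mean)
        y = delta*(x - mean) - c_M2
        t = M2 + y
        c_M2 = (t - M2) - y
        M2 = t

    return M2/(n - 1)

It could be dropped into the methods list in Oscar's script to compare
its error against the other implementations.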
>
> Oscar
>
>
> $ cat stats.py
> #!/usr/bin/env python
>
> from __future__ import print_function
>
> from random import gauss
> from math import fsum
> from fractions import Fraction
>
> # Return the exact result as a Fraction. Nothing wrong
> # with using the computational formula for variance here.
> def variance_exact(data):
>     data = [Fraction(x) for x in data]
>     n = len(data)
>     sumx = sum(data)
>     sumx2 = sum(x**2 for x in data)
>     ss = sumx2 - (sumx**2)/n
>     return ss/(n-1)
>
> # Although this is the most efficient formula when using
> # exact computation it fails under fixed precision
> # floating point since it ends up subtracting two large
> # almost equal numbers leading to a catastrophic loss of
> # precision.
> def variance_naive(data):
>     n = len(data)
>     sumx = fsum(data)
>     sumx2 = fsum(x**2 for x in data)
>     ss = sumx2 - (sumx**2)/n
>     return ss/(n-1)
>
> # Incremental variance calculation from Wikipedia. If
> # the above uses fsum then a fair comparison should
> # use some compensated summation here also. However
> # it's not clear (to me) how to incorporate compensated
> # summation here.
> # http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Compensated_variant
> def variance_incremental(data):
>     n = 0
>     mean = 0
>     M2 = 0
>
>     for x in data:
>         n = n + 1
>         delta = x - mean
>         mean = mean + delta/n
>         M2 = M2 + delta*(x - mean)
>
>     variance = M2/(n - 1)
>     return variance
>
> # This is to me the obvious formula since I think of
> # this as the definition of the variance.
> def variance_twopass(data):
>     n = len(data)
>     mean = fsum(data) / n
>     sumdev2 = fsum((x - mean)**2 for x in data)
>     variance = sumdev2 / (n - 1)
>     return variance
>
> # This is the three-pass algorithm used in Steven's
> # statistics module. It's not one I had seen before but
> # AFAICT it's very accurate. In fact the 2nd and 3rd passes
> # can be merged as in variance_incremental but then we
> # wouldn't be able to take advantage of fsum.
> def variance_threepass(data):
>     n = len(data)
>     mean = fsum(data) / n
>     sumdev2 = fsum((x-mean)**2 for x in data)
>     # The following sum should mathematically equal zero, but due to
>     # rounding error may not.
>     sumdev2 -= fsum((x-mean) for x in data)**2 / n
>     return sumdev2 / (n - 1)
>
> methods = [
>     ('naive', variance_naive),
>     ('incremental', variance_incremental),
>     ('two_pass', variance_twopass),
>     ('three_pass', variance_threepass),
> ]
>
> # Test numbers with large mean and small standard deviation.
> # This is the case that causes trouble for the naive formula.
> N = 100000
> testnums = [gauss(mu=10**10, sigma=1) for n in range(N)]
>
> # First compute the exact result
> exact = variance_incremental([Fraction(num) for num in testnums])
> print('exact:', float(exact))
>
> # Compare each with the exact result
> for name, var in methods:
>     print(name, '-> error =', var(testnums) - exact)

-- 
Terry Jan Reedy
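As for the "serially re-iterable input" point above, a minimal sketch
of such an iterable might look like the class below. The class name
and the one-number-per-line file format are only assumptions for
illustration:

class FileNumbers:
    """Re-iterable source of floats read from a text file.

    Each call to __iter__ reopens the file and yields one value per
    line, so multi-pass code can traverse the data repeatedly without
    holding it all in memory.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # A generator function: every call returns a fresh iterator.
        with open(self.path) as f:
            for line in f:
                yield float(line)

Note that functions such as variance_twopass above also call
len(data), so a real version would need to count the values in an
extra pass (or cache n) rather than rely on len().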
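The "running means and deviations" idea mentioned above amounts to
keeping only the sufficient statistics (n, mean, M2) and updating them
as each value arrives. A small sketch of that, with the class name
RunningStats chosen purely for illustration:

class RunningStats:
    """Accumulate count, mean and variance one value at a time
    (Welford's method).  Only n, mean and M2 are stored; the data
    itself is discarded as it arrives."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta/self.n
        self.M2 += delta*(x - self.mean)

    def variance(self):
        # Sample variance; requires at least two values.
        return self.M2/(self.n - 1)

Calling stats.add(x) as each measurement comes in leaves stats.mean
and stats.variance() available at any point, with no need to store the
accumulated data.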