Matthew Healy is involved in "proprietary" biomedical research (probably
biological weapons, if you ask me, but for now it appears he's on our
side, so we're cool). He stumbled across the main hypergeometric distribution
probability calculator here at nerdbucket.com, and wanted to share some tips
and tricks for real-world use. His application of the hypergeometric
distribution is to determine whether statistical results are significant, via
the Fisher Exact Test:
Curious where the hypergeometric distribution is used in the real
(non-gambling) world? Of course you are! You're a nerd! Knowledge of obscure
things is your calling!
Well, the Fisher Exact Test is a way of determining whether measured results
are likely just random chance, or whether there's real significance to the
data. It's at its best when the sample size is small, which is apparently
common in medical research. Matthew sums it up nicely:
Out of N1 patients given drug A, M1 get better (and N1 - M1 don't)
Out of N2 patients given drug B, M2 get better (and N2 - M2 don't)
From these data can we conclude there is a significant difference between these drugs?
I won't try to explain the formula in depth, as I don't fully understand it
myself. But suffice it to say, it's very important in all kinds of medical
research situations, even those that don't involve biological weapons.
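For the curious, here's the core of it as best I understand it (so take this
as my sketch, not Matthew's math): with the group sizes and the total number
of improved patients held fixed, the probability of seeing exactly the split
above is a single hypergeometric term:

P = C(N1, M1) * C(N2, M2) / C(N1 + N2, M1 + M2)

where C(n, k) is the "n choose k" binomial coefficient. The test's p-value
sums that term over every outcome at least as extreme as the one actually
observed.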
The important thing here isn't so much the Fisher Exact Test, or even the
real-world use of hypergeometric distributions, as the fact that Mr. Healy
gave me a piece of his code and agreed to let me put it up on my site:
InteractiveHypergeometricCalculator.pl.
The code may be hard to read if you don't know what's going on, but the most
interesting function is pretty simple:
my @cache = (0);    # seed the cache with ln(0!) = 0

sub LnFactorial {
    my $n = shift;
    return undef if $n < 0;                    # factorial is undefined for negatives
    return $cache[$n] if defined $cache[$n];   # cache hit: no work to do
    # Cache miss: extend the cache from wherever it left off, reusing
    # every previously computed value along the way
    for (my $i = scalar(@cache); $i <= $n; $i++) {
        $cache[$i] = $cache[$i - 1] + log($i); # ln(i!) = ln((i-1)!) + ln(i)
    }
    return $cache[$n];
}
I've made a few minor formatting changes, but for all intents and purposes,
the code is the same as he sent it to me. There are a couple things I really
liked about his approach here.
First, he built this LnFactorial function to compute huge factorial
values in a "compressed" form. By storing the natural log of each value, he
is able to use addition instead of multiplication, and can work with
relatively small floating-point numbers, which modern CPUs handle very
efficiently. The typical approach for factorials requires multiplying
extremely large integers, something modern CPUs don't inherently support.
The trick is that the natural logarithm of a number gives you a value such
that e raised to that power returns the original number. Once you're working
with exponents, you can add them to simulate multiplication or subtract them
to simulate division: when you raise e to the power of the new value, it's as
if you'd multiplied or divided the original values, only it's much more
CPU-friendly. Here's a simplistic example:
Method 1 - normal math:
25 * 48 = 1200
Method 2 - logarithmic method:
exp(ln(25) + ln(48)) =>
exp(3.21888 + 3.87120) =>
exp(7.09008) = 1200.00380
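A quick sanity check in Perl, using full double precision instead of my
hand-rounded values:

# With ~16 significant digits in play, the round trip is exact for
# all practical purposes
printf "%.5f\n", exp(log(25) + log(48));    # prints 1200.00000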
Clearly there's a bit of error if you round the values too much, but on a
computer, where a lot of precision is kept, this isn't a large enough issue to
be concerned with. Additionally, when you're looking at the factorial of a
large number like 10,000, the CPU savings are incredible. When your language of
choice is Perl, but you really need speedy computations of this nature, using
the logarithmic method is not just clever or "cool", but absolutely necessary.
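To put some numbers on that, here's a quick sketch of my own (not from
Matthew's script) that leans on his LnFactorial for values no native integer
type could hope to hold:

# ln(10000!) is about 82109, a tidy little double; 10000! itself runs
# to more than 35,000 digits
printf "ln(10000!) = %.2f\n", LnFactorial(10000);

# The same trick makes binomial coefficients cheap:
#   ln C(n, k) = ln(n!) - ln(k!) - ln((n-k)!)
sub LnChoose {
    my ($n, $k) = @_;
    return LnFactorial($n) - LnFactorial($k) - LnFactorial($n - $k);
}

# exp() hops back out of log space whenever the final answer is small
# enough to fit in a double, e.g. C(52, 5) = 2,598,960 poker hands
printf "C(52, 5) = %.0f\n", exp(LnChoose(52, 5));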
The second thing Healy did in a clever way was his memoization. If
you're anything at all of a programmer, you know about memoization. But in a
typical situation, a memoization system memoizes values one at a time.
That's how Perl's Memoize module works: call a function with input X once,
and every later call with X returns the pre-computed result. But what if, in
computing X, you've also computed the results for X-1, X-2, and so on? With
factorials, that's exactly what happens.
The obvious solution is to use recursion. If a memoized factorial(x)
internally calls factorial(x-1), you get caching of every intermediate value
for free! But recursion carries a huge price tag. For large values of x, a
recursive call chain can simply crash due to lack of stack space. Even when
it doesn't crash, the overhead of a function call isn't trivial: hundreds of
thousands of function calls inside CPU-sensitive code will kill performance.
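For contrast, here's roughly what the off-the-shelf route would look like; a
sketch of mine (assuming Perl's core Memoize module), not anything from
Matthew's script:

use Memoize;

# The recursive flavor: memoize() caches each distinct $n, and recursion
# fills in all the intermediate values along the way...
sub LnFactorialRec {
    my $n = shift;
    return 0 if $n <= 1;    # ln(0!) = ln(1!) = 0
    return log($n) + LnFactorialRec($n - 1);
}
memoize('LnFactorialRec');

# ...but the first LnFactorialRec(10000) call still makes 10,000 nested
# subroutine calls, tripping deep-recursion warnings and paying call
# overhead that the simple loop-and-array cache never does.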
So while Matthew's "roll your own" memoization solution isn't likely to win
any major awards, I think it shows he was thinking about the problem instead
of just grabbing the first pre-made solution available. Today, we're jaded
when it comes to programming: we know that if we have a problem, somebody has
likely solved it before us, and we can probably find a library to do what
we need. But if we don't pay attention, we can end up using a really great
library in just the wrong situation, saving time writing code but hurting
our applications in other, unexpected ways.