Matthew Healy is involved in "proprietary" biomedical research (probably
biological weapons, if you ask me, but for now it appears he's on our
side, so we're cool). He stumbled across the main hypergeometric distribution
probability calculator here at nerdbucket.com, and wanted to share some tips
and tricks for real-world use. His application of the hypergeometric
distribution is to determine whether statistical results are significant, via
the Fisher Exact Test:
Curious where the hypergeometric distribution is used in the real
(non-gambling) world? Of course you are! You're a nerd! Knowledge of obscure
things is your calling!
Well, the Fisher Exact Test is a way of determining whether measured results
are likely just random chance, or whether there's real significance to the
data. It's at its best when the sample size is small, which is apparently
common in medical research. Matthew sums it up nicely:
Out of N1 patients given drug A, M1 get better (and N1 - M1 don't)
Out of N2 patients given drug B, M2 get better (and N2 - M2 don't)
From these data can we conclude there is a significant difference between these drugs?
I won't try to explain the formula in depth, as I don't fully understand it
myself. But suffice it to say, it's very important in all kinds of medical
research situations, even those that don't involve biological weapons.
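For the curious, here's the core of it as best I understand it (so take this
as my sketch, not Matthew's math): with the group sizes and the total number
of improved patients held fixed, the probability of seeing exactly the split
above is a single hypergeometric term:

P = C(N1, M1) * C(N2, M2) / C(N1 + N2, M1 + M2)

where C(n, k) is the "n choose k" binomial coefficient. The test's p-value
sums that term over every outcome at least as extreme as the one actually
observed.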
The important thing here isn't so much the Fisher Exact Test, or even the
real-world use of hypergeometric distributions, as the fact that Mr. Healy
gave me a piece of his code and agreed to let me put it up on my site:
InteractiveHypergeometricCalculator.pl.
The code may be hard to read if you don't know what's going on, but the most
interesting function is pretty simple:
my @cache = (0);    # seed the cache with ln(0!) = 0

sub LnFactorial {
    my $n = shift;
    return undef if $n < 0;                    # factorial is undefined for negatives
    return $cache[$n] if defined $cache[$n];   # cache hit: no work to do
    # Cache miss: extend the cache from wherever it left off, reusing
    # every previously computed value along the way
    for (my $i = scalar(@cache); $i <= $n; $i++) {
        $cache[$i] = $cache[$i - 1] + log($i); # ln(i!) = ln((i-1)!) + ln(i)
    }
    return $cache[$n];
}
I've made a few minor formatting changes, but for all intents and purposes,
the code is the same as he sent it to me. There are a couple things I really
liked about his approach here.
First, he built this LnFactorial function to compute huge factorial
values in a "compressed" form. By storing the natural log of each value, he
is able to use addition instead of multiplication, and can work with
relatively small floating-point numbers, which modern CPUs handle very
efficiently. The typical approach for factorials requires multiplying
extremely large integers, something modern CPUs don't inherently support.
The trick is that the natural logarithm of a number gives you a value such
that e raised to that power returns the original number. Once you're working
with exponents, you can add them to simulate multiplication or subtract them
to simulate division: when you raise e to the power of the new value, it's as
if you'd multiplied or divided the original values, only it's much more
CPU-friendly. Here's a simplistic example:
Method 1 - normal math:
25 * 48 = 1200
Method 2 - logarithmic method:
exp(ln(25) + ln(48)) =>
exp(3.21888 + 3.87120) =>
exp(7.09008) = 1200.00380
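A quick sanity check in Perl, using full double precision instead of my
hand-rounded values:

# With ~16 significant digits in play, the round trip is exact for
# all practical purposes
printf "%.5f\n", exp(log(25) + log(48));    # prints 1200.00000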
Clearly there's a bit of error if you round the values too much, but on a
computer, where a lot of precision is kept, this isn't a large enough issue to
be concerned with. Additionally, when you're looking at the factorial of a
large number like 10,000, the CPU savings are incredible. When your language of
choice is Perl, but you really need speedy computations of this nature, using
the logarithmic method is not just clever or "cool", but absolutely necessary.
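To put some numbers on that, here's a quick sketch of my own (not from
Matthew's script) that leans on his LnFactorial for values no native integer
type could hope to hold:

# ln(10000!) is about 82109, a tidy little double; 10000! itself runs
# to more than 35,000 digits
printf "ln(10000!) = %.2f\n", LnFactorial(10000);

# The same trick makes binomial coefficients cheap:
#   ln C(n, k) = ln(n!) - ln(k!) - ln((n-k)!)
sub LnChoose {
    my ($n, $k) = @_;
    return LnFactorial($n) - LnFactorial($k) - LnFactorial($n - $k);
}

# exp() hops back out of log space whenever the final answer is small
# enough to fit in a double, e.g. C(52, 5) = 2,598,960 poker hands
printf "C(52, 5) = %.0f\n", exp(LnChoose(52, 5));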
The second thing Healy did in a clever way was his memoization. If
you're anything at all of a programmer, you know about memoization. But in a
typical situation, a memoization system memoizes values one at a time.
That's how Perl's Memoize module works: call a function with input X once,
and every later call with X returns the pre-computed result. But what if, in
computing X, you've also computed the results for X-1, X-2, and so on? With
factorials, that's exactly what happens.
The obvious solution is to use recursion. If a memoized factorial(x)
internally calls factorial(x-1), you get caching of every intermediate value
for free! But recursion carries a huge price tag. For large values of x, a
recursive call chain can simply crash due to lack of stack space. Even when
it doesn't crash, the overhead of a function call isn't trivial: hundreds of
thousands of function calls inside CPU-sensitive code will kill performance.
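For contrast, here's roughly what the off-the-shelf route would look like; a
sketch of mine (assuming Perl's core Memoize module), not anything from
Matthew's script:

use Memoize;

# The recursive flavor: memoize() caches each distinct $n, and recursion
# fills in all the intermediate values along the way...
sub LnFactorialRec {
    my $n = shift;
    return 0 if $n <= 1;    # ln(0!) = ln(1!) = 0
    return log($n) + LnFactorialRec($n - 1);
}
memoize('LnFactorialRec');

# ...but the first LnFactorialRec(10000) call still makes 10,000 nested
# subroutine calls, tripping deep-recursion warnings and paying call
# overhead that the simple loop-and-array cache never does.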
So while Matthew's "roll your own" memoization solution isn't likely to win
any major awards, I think it shows he was thinking about the problem instead
of just grabbing the first pre-made solution available. Today, we're jaded
when it comes to programming: we know that if we have a problem, somebody has
likely solved it before us, and we can probably find a library to do what
we need. But if we don't pay attention, we can end up using a really great
library in just the wrong situation, saving time writing code but hurting
our applications in other, unexpected ways.