Scribble, scribble, scribble

I suppose this is what they call a blog. Except that blogs are supposed to be updated more often than this is.

Wednesday 2010-05-26

tests

I ran across this problem elsewhere in the blogosphere, where unfortunately some very clever people are offering a distinctly suboptimal solution.

Suppose you have lots of people, each of whom (independently, with quite low probability) may have a certain disease. There is a perfectly reliable blood test for the disease. You want to identify everyone in your group who has the disease, using as few tests as possible.

We assume that you can pool blood samples, which is why the answer isn't just "one test per person, duh". The test will then tell you whether at least one person in the pool has the disease.

The solution offered on the blogs I linked to above has the following form: divide your group into several roughly equally sized subgroups; test each subgroup; then, for each positive-testing subgroup, test each person in that subgroup. If you do that, then it turns out that you want to use groups of size approximately 1/sqrt(p) where p is the fraction of people who have the disease; the total number of tests you need is then approximately twice the number of groups.

So, for instance, with a group of 1000 people and p=0.000154, you want groups of size 81 and the average number of tests is about 25. That's quite a big improvement on testing everyone.
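
If you want to check that arithmetic, a few lines of Python will do it. The two_stage_cost helper below exists only for this illustration; p and n are the numbers from the example.

    # Expected number of tests for the simple two-stage scheme with sub-groups
    # of size g: one pooled test per group, plus individual tests for everyone
    # in each group that tests positive.
    p = 0.000154   # fraction of people with the disease, as in the example above
    n = 1000       # total number of people

    def two_stage_cost(g):
        pooled = n / g                       # roughly n/g groups, one test each
        retests = n * (1 - (1 - p) ** g)     # expected individual re-tests
        return pooled + retests

    best_g = min(range(1, n + 1), key=two_stage_cost)
    print(best_g, round(two_stage_cost(best_g), 1))   # should come out as 81 and about 25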

But you can, and should, do better: when a subgroup tests positive, after all, you don't immediately have to test everyone in it: you can do the same thing to the subgroup as you did to the whole group. (Except that now you know at least one person in the subgroup has the disease, which changes things a bit.) Since you can do this, the cost of processing a subgroup that's tested positive grows more slowly (as a function of subgroup size) than you'd naïvely expect, which means that you typically want to use larger groups.

So far as I can tell, there isn't any simple elegant optimal solution. But one can get a computer to crunch the numbers. It turns out, for instance, that for 1000 people and p=0.000154, you should begin by testing the whole group. (After all, about 86% of the time that test will be negative and you'll be done.) If the test is positive, you should then test a subgroup of 377 people. (Why 377? Search me.)

With optimal choices, the average number of tests you need to do is about 3.2. That's a whole lot better than 25.

You can take a look at my hacked-up code if you like.
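
In case you'd rather see the shape of the idea than wade through that, here is a simplified sketch in Python of a recursion of this general kind; it isn't guaranteed to match my code in every detail, and the names (plan, u, d, fu, fd) exist only for the sketch. It tracks two quantities -- the expected cost of dealing with a bunch of people about whom nothing is known, and the expected cost of dealing with a bunch known to include at least one case -- and it makes one simplifying restriction: once a pool has tested positive, its members get sorted out completely before anyone else is touched, and they never get pooled with fresh untested people.

    import math

    def plan(p, n_max):
        # u[n]  : min expected tests for n people about whom nothing is known
        # d[m]  : min expected tests for m people known to include >= 1 case
        # fu[n] : size of the first pool achieving u[n]
        # fd[m] : size of the first sub-pool achieving d[m]
        q = 1.0 - p
        u = [0.0] * (n_max + 1)
        d = [0.0] * (n_max + 1)
        fu = [0] * (n_max + 1)
        fd = [0] * (n_max + 1)
        for m in range(1, n_max + 1):
            if m > 1:                 # for m == 1 the lone person is the case: 0 tests
                qm = q ** m
                best, best_k = math.inf, 0
                for k in range(1, m):
                    # chance the sub-pool of k is all clear, given >= 1 case among the m
                    p_neg = (q ** k - qm) / (1.0 - qm)
                    # negative: the other m-k still hide a case;
                    # positive: the k tested hide a case, and we're back to
                    #           knowing nothing about the other m-k
                    cost = 1.0 + p_neg * d[m - k] + (1.0 - p_neg) * (d[k] + u[m - k])
                    if cost < best:
                        best, best_k = cost, k
                d[m], fd[m] = best, best_k
            best, best_k = math.inf, 0
            for k in range(1, m + 1):
                # pool k of the m unknowns; a negative result clears all k of them
                cost = 1.0 + (1.0 - q ** k) * d[k] + u[m - k]
                if cost < best:
                    best, best_k = cost, k
            u[m], fu[m] = best, best_k
        return u, d, fu, fd

    u, d, fu, fd = plan(0.000154, 1000)
    print(fu[1000], round(u[1000], 2))   # compare with "test the whole group" and ~3.2 above
    print(fd[1000])                      # compare with the 377 mentioned above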

In general, it seems (though I haven't tried to prove it) that if your number of people is at most 1/p+1 then you should begin by testing all of them; otherwise you should begin by testing a smaller number, which fluctuates in peculiar ways between about 1/(2p) and 1/p.
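
You can watch that switch-over happen without any enormous computation by rerunning the plan sketch above with a much larger p, so that 1/p is small; p = 0.01 here is purely for illustration, and the exercise assumes the simplified recursion behaves like my real code.

    # Reuses plan() from the sketch above.  With p = 0.01, 1/p + 1 = 101.
    u, d, fu, fd = plan(0.01, 150)
    switch = next((n for n in range(1, 151) if fu[n] != n), None)
    print(switch)   # the conjecture above says this should be close to 101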

If you have a group of people known to have at least one case, as will happen as soon as you get a positive result, it seems (and again I haven't tried to prove anything) that when the group size is at most about 1/p the optimal size of subgroup to test first is usually a Fibonacci number. More specifically, starting at a group size of F_{k+2} there is a long run of values of F_k, followed by a gradual mostly-monotonic transition to the next run of F_{k+1} starting at F_{k+3}.
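
Again reusing the plan sketch (and again assuming it behaves like my real code here), this is one way to eyeball that pattern:

    # Reuses plan() from the sketch above.  fd[m] is the best first sub-pool
    # for a group of m people known to contain at least one case.
    u, d, fu, fd = plan(0.000154, 1000)
    fibs = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987}
    for m in range(50, 1001, 50):
        print(m, fd[m], fd[m] in fibs)   # how often is the answer a Fibonacci number?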