I ran across this problem elsewhere in the blogosphere, where unfortunately some very clever people are offering a distinctly suboptimal solution.

Suppose you have lots of people, each of whom (independently, with quite low probability) may have a certain disease. There is a perfectly reliable blood test for the disease. You want to identify everyone in your group who has the disease, using as few tests as possible.

We assume that you can pool blood samples, which is why the answer isn't just "one test per person, duh". The test will then tell you whether at least one person in the pool has the disease.

The solution offered on the blogs I linked to above has the following form: divide your group into several roughly equally sized subgroups; test each subgroup; then, for each positive-testing subgroup, test each person in that subgroup. If you do that, then it turns out that you want to use groups of size approximately 1/sqrt(p) where p is the fraction of people who have the disease; the total number of tests you need is then approximately twice the number of groups.

So, for instance, with a group of 1000 people and p=0.000154, you want groups of size 81 and the average number of tests is about 25. That's quite a big improvement on testing everyone.
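The arithmetic behind those numbers is easy to reproduce. Here's a sketch (in Python; not anyone's actual code from the linked posts): with groups of size g you pay n/g pooled tests, plus g follow-up tests for each group that comes back positive.

```python
# Expected number of tests for the simple two-stage scheme:
# split n people into groups of size g, test each group once,
# then test every member of each positive group individually.

def two_stage_cost(n, p, g):
    """Expected tests: n/g pooled tests, plus g follow-ups per group,
    each incurred with probability 1 - (1-p)**g."""
    groups = n / g  # treat the group count as fractional, for simplicity
    p_group_positive = 1 - (1 - p) ** g
    return groups * (1 + g * p_group_positive)

n, p = 1000, 0.000154
best_g = min(range(1, n + 1), key=lambda g: two_stage_cost(n, p, g))
print(best_g, two_stage_cost(n, p, best_g))  # group size 81, about 24.7 tests
```

The minimum does indeed land at g = 81, close to 1/sqrt(p) ≈ 80.6, with an expected cost of about 25 tests.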

But you can, and should, do better: when a subgroup tests positive, after all, you don't immediately have to test everyone in it: you can do the same thing to the subgroup as you did to the whole group. (Except that now you know at least one person in the subgroup has the disease, which changes things a bit.) Since you can do this, the cost of processing a subgroup that's tested positive grows more slowly (as a function of subgroup size) than you'd naïvely expect, which means that you typically want to use larger groups.

So far as I can tell, there isn't any simple elegant optimal solution. But one can get a computer to crunch the numbers. It turns out, for instance, that for 1000 people and p=0.000154, you should begin by testing the whole group. (After all, about 86% of the time that test will be negative and you'll be done.) If the test is positive, you should then test a subgroup of 377 people. (Why 377? Search me.)

With optimal choices, the average number of tests you need to do is about 3.2. That's a whole lot better than 25.
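Here's one way the number-crunching might go. This is a reconstruction in Python, not the actual hacked-up code linked below. It tracks two quantities: A(n), the expected cost for n people about whom nothing is known, and B(n), the expected cost for n people known to include at least one case. It also builds in the assumption (questioned later in the comments) that once a test splits the population, the two parts are processed separately.

```python
# Dynamic programme for pooled testing, assuming independent handling
# of the two subsets created by each pooled test.  A reconstruction,
# not the original code.

def optimal_costs(n, p):
    q = 1.0 - p
    qpow = [1.0]
    for _ in range(n):                # qpow[i] = q**i, precomputed
        qpow.append(qpow[-1] * q)

    A = [0.0] * (n + 1)   # A[m]: m i.i.d. unknown people
    B = [0.0] * (n + 1)   # B[m]: m people, at least one known positive
    first_A = [0] * (n + 1)  # optimal size of first pool in each case
    first_B = [0] * (n + 1)

    for m in range(1, n + 1):
        if m > 1:  # B[1] = 0: the lone person must be the positive
            denom = 1.0 - qpow[m]
            best_cost, best_j = None, 0
            for j in range(1, m):
                # P(pool of j negative | >=1 positive among m), and complement
                neg = qpow[j] * (1.0 - qpow[m - j]) / denom
                pos = (1.0 - qpow[j]) / denom
                # negative pool: remainder still has >=1 case; positive pool:
                # the j have >=1 case, the rest revert to i.i.d. unknowns
                cost = 1.0 + neg * B[m - j] + pos * (B[j] + A[m - j])
                if best_cost is None or cost < best_cost:
                    best_cost, best_j = cost, j
            B[m], first_B[m] = best_cost, best_j
        best_cost, best_k = None, 0
        for k in range(1, m + 1):
            cost = 1.0 + A[m - k] + (1.0 - qpow[k]) * B[k]
            if best_cost is None or cost < best_cost:
                best_cost, best_k = cost, k
        A[m], first_A[m] = best_cost, best_k
    return A, B, first_A, first_B

A, B, fA, fB = optimal_costs(1000, 0.000154)
# the post reports: test the whole 1000 first, then 377, average ~3.2
print(fA[1000], fB[1000], A[1000])
```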

You can take a look at my hacked-up code if you like.

In general, it seems (though I haven't tried to prove it) that if your number of people is at most 1/p + 1 then you should begin by testing all of them; otherwise you should begin by testing a smaller number, which fluctuates in peculiar ways between about 1/(2p) and 1/p.

If you have a group of people known to have at least one case, as will happen as soon as you get a positive result, it seems (and again I haven't tried to prove anything) that when the group size is at most about 1/p the optimal size of subgroup to test first is usually a Fibonacci number. More specifically, starting at a group size of F_{k+2} there is a long run of values of F_{k}, followed by a gradual mostly-monotonic transition to the next run of F_{k+1} starting at F_{k+3}.
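That observation is easy to poke at empirically. Here's a compact sketch (the choice p = 0.01 is arbitrary, and the recursion is the known-at-least-one-case one described above): tabulate, for each group size, the optimal size of the first pool, and count how often it's a Fibonacci number.

```python
# Empirical check of the Fibonacci observation: for a group known to
# contain at least one case, find the optimal size of the first pool
# to test, as a function of group size.  p = 0.01 is an arbitrary choice.

def first_pools(n, p):
    q = 1 - p
    qp = [q ** i for i in range(n + 1)]
    A = [0.0] * (n + 1)   # expected cost, no information
    B = [0.0] * (n + 1)   # expected cost, >=1 case known
    first = [0] * (n + 1)
    for m in range(1, n + 1):
        if m > 1:
            d = 1 - qp[m]
            costs = [(1 + qp[j] * (1 - qp[m - j]) / d * B[m - j]
                        + (1 - qp[j]) / d * (B[j] + A[m - j]), j)
                     for j in range(1, m)]
            B[m], first[m] = min(costs)   # ties broken by smaller pool
        A[m] = min(1 + A[m - k] + (1 - qp[k]) * B[k]
                   for k in range(1, m + 1))
    return first

fib = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89}
firsts = first_pools(100, 0.01)
print([(m, firsts[m]) for m in range(2, 101)])
print(sum(firsts[m] in fib for m in range(2, 101)), "of 99 are Fibonacci")
```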

Yeah, "incorrect" is a bit too strong. I'm changing it to "suboptimal".

You can only test a group of size N with log N tests if you know that there's exactly one person with the disease. Well, I bet that if you know there are at most k people with the disease, for any fixed k, you can probably do it with O(log N) tests, with the implied constant growing with k. But if the number of people with the disease is proportional to N then I bet you can't do it with O(log N) tests. (Ignoring questions of probability, you certainly can't, by an easy counting argument.)
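The counting argument is worth spelling out: the sequence of yes/no test results must distinguish every possible infected subset, so you need at least log2 of the number of such subsets. A quick illustration with made-up numbers:

```python
import math

# Lower bound by counting: to pin down which k of N people are infected,
# the test outcomes must distinguish C(N, k) possibilities, so at least
# log2(C(N, k)) tests are needed in the worst case.

N = 1000
for k in (1, 10, 100):   # k proportional to N soon dwarfs log2(N) ~ 10
    bound = math.log2(math.comb(N, k))
    print(k, round(bound, 1))
```

For k = 1 the bound is log2(1000) ≈ 10, matching the binary-search idea; for k = 100 it's already in the hundreds of tests, far beyond O(log N).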

Hmm. I just spent a little while staring at the simple case of three coin tosses in order to convince myself of the following: if you test 1000 people and find infection, then test 377 of them and find infection too, the distribution of infections amongst the remaining 623 is the same as for 623 people in the population at large.

Maybe Ian Stewart would like to know about this problem?

Actually, I was referring to g, the group size. Once you get a positive in your group, you can use log g tests (not g) to identify the (very likely only one) person who has it.
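For what it's worth, the log g procedure alluded to here is just binary search with pooled tests; a minimal sketch, assuming exactly one positive in the group:

```python
# Locate the unique positive in a group of g people using pooled tests:
# each test halves the candidate set, so ceil(log2(g)) tests suffice --
# but only because we KNOW there is exactly one positive.

def find_positive(people, is_positive):
    """people: list of ids; is_positive: oracle for one pooled test."""
    tests = 0
    while len(people) > 1:
        half = people[:len(people) // 2]
        tests += 1
        people = half if is_positive(half) else people[len(half):]
    return people[0], tests

group = list(range(64))
culprit = 37  # arbitrary choice for the demo
found, tests = find_positive(group, lambda pool: culprit in pool)
print(found, tests)  # -> 37 6
```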

But, again, you can only test a group of size g with log(g) tests if you *know* there's at most one person in the group with the disease. Again, if the proportion of people who have the disease is bounded away from 1 and the group size isn't O(1) then you'll need more than O(log g) tests for a group of size g. Or am I missing something?

Gareth,

This is pretty cute stuff. But I'm finding it hard work justifying an assumption you're making.

Unless I've completely misunderstood, your algorithm assumes that, if you've found a subset of the population who have tested positive, the optimal strategy is to test the positive subset and its complement (the not-yet-known-to-be-positive subset) completely separately for the rest of time.

Well, I guess I don't mean that. Code isn't sentient and so is not capable of making assumptions. Perhaps I mean you've made that assumption, but I don't actually know your mental state. Hmmm. What I mean is "You're only checking this case, and I don't know why."

In fact, I think this assumption is invalid. Suppose we have the 1000 people and tiny p you've been using as an example.

Then suppose the pool of 1000 tests positive, and the first 377 people also test positive. That leaves you with 623 people whom you do not know to have syphilis.

If you pursue the 377 people first, you next test some subset of them. I haven't worked out how big it should be; let's say it's 100. Suppose that tests positive straight off.

We should chase that 100 people at some point. But presumably it's best to pool all the 900 people we have who are not known to have syphilis, because it's very likely we can dismiss them all with a single test. Whereas your approach, if I've understood correctly, would find us testing the 623 and the 277 separately, which is very probably wasting a test.
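The likely saving is easy to quantify. Granting the earlier observation that, once a positive pool is explained by a positive sub-pool, the remaining members revert to independent draws from the population, the 623 and the 277 together are 900 independent people, and a single pooled test clears them all with probability (1-p)^900:

```python
# Chance that one pooled test clears all 900 people not currently
# known to be infected (treating them as independent, per the
# discussion above).
p = 0.000154
print((1 - p) ** 900)  # about 0.87
```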

I've created a horrible Frankenstein's monster out of your code and my own, and put it online at:

http://www.srcf.ucam.org/~jdc41/syphilis2.txt

The "james" algorithm, which implements my previous suggestion, runs quite slowly (as you might expect). But I believe it demonstrates a new upper bound of 2.6822110584.

I've updated the code mentioned before. Interestingly, the Fibonacci numbers are replaced by powers of two in my approach.

Jeffrey: Again, you can only test a group of size g with log(g) tests if you *know* there's at most one person in the group with the disease. Again, if the proportion of people who have the disease is bounded away from 1 and the group size isn't O(1) then you'll need more than O(log g) tests for a group of size g. Or am I missing something?

James: I think you're right: I didn't consider all the available options. Good catch!

Yes, I agree. I was thinking of the case where the probability p is roughly on the order of 1/g, where g is the group size.


I think it's inaccurate to say my solution is "incorrect". I didn't claim my solution was optimal and, as another poster has pointed out, you can actually test a group of size N with only log N tests. I was basing my solution on the method described in the NPR broadcast.