Estimating remaining download time

On 2005-11-11, Bram Cohen asked on his blog for suggestions for an algorithm to estimate how long a download will take to complete. He wanted one with the following property: if the download rate stabilizes, then after a while of steady downloading the estimated completion time should be "basically nailed" and remain stable.

Here's such an algorithm.

The algorithm

Keep track of when each blob of data comes in. With this information you can compute the average transfer rate over any interval: it's the total amount of data that arrived during the interval, divided by the interval's length.
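To make that concrete, here's a minimal Python sketch of the bookkeeping (the class and method names are my own, purely illustrative):

    import bisect

    class TransferHistory:
        """Records cumulative bytes received, indexed by arrival time."""
        def __init__(self):
            self.times = []   # arrival times, strictly increasing
            self.totals = []  # cumulative bytes received at each time

        def record(self, now, nbytes):
            total = (self.totals[-1] if self.totals else 0) + nbytes
            self.times.append(now)
            self.totals.append(total)

        def bytes_by(self, t):
            """Cumulative bytes received up to and including time t."""
            i = bisect.bisect_right(self.times, t)
            return self.totals[i - 1] if i > 0 else 0

        def average_rate(self, t0, t1):
            """Average transfer rate over [t0, t1], in bytes per second."""
            if t1 <= t0:
                return 0.0
            return (self.bytes_by(t1) - self.bytes_by(t0)) / (t1 - t0)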

At time t, estimate the instantaneous transfer rate at time t+h as the average transfer rate over the interval [t-kh,t]. (k is a tunable parameter.) Motivation: when making predictions a certain distance into the future, you should look about that far into the past. (It might prove better to calculate some kind of weighted average, but let's stick with what we have for now.) If t-kh is negative, then of course just use 0 as the start of the interval.
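In code, building on the TransferHistory sketch above (again, the names are mine):

    def estimated_rate(history, t, h, k):
        """Estimate the transfer rate at future time t+h as the average
        rate over the past interval [t - k*h, t], clipped at time 0."""
        start = max(0.0, t - k * h)
        return history.average_rate(start, t)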

To estimate the time until completion, integrate the estimated transfer rate and see at what time the whole file is estimated to have been transferred.
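A straightforward way to do the integration numerically, using the two sketches above (the step size and the handling of a stalled transfer are my own choices):

    def estimate_time_remaining(history, t, total_size, k, step=1.0):
        """Integrate the estimated rate forward from time t until the
        whole file is predicted to have arrived. Returns estimated
        seconds remaining, or None if there's no basis for an estimate.
        Assumes k > 0."""
        remaining = total_size - history.bytes_by(t)
        h = 0.0
        while remaining > 0:
            h += step
            if k * h >= t:
                # The interval start has hit time 0, so from here on the
                # estimated rate is constant (the whole-transfer average)
                # and we can finish with a single division.
                rate = history.average_rate(0.0, t)
                if rate <= 0:
                    return None
                return h - step + remaining / rate
            remaining -= estimated_rate(history, t, h, k) * step
        return h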

Properties of the algorithm

When the parameter k is very small, we're basically always using the currently observed instantaneous transfer rate as a predictor for the future.

When k is very large, we're basically always using the average transfer rate so far as a predictor. (Many programs do this at present.)

Consider Bram's scenario, where after a certain point the transfer rate is constant. Say the rate stabilizes at time t0; then at time t all the intervals [t-kh,t] lie wholly within the stable period precisely when t-kh is at least t0 for every h up to the remaining time r, i.e. when t-t0 is at least kr. Since the stable period has total length (t-t0)+r, this happens once we are k/(k+1) of the way from t0 to the end of the transfer, and from then on our estimated time will be exactly correct.

Early in the download, those intervals will mostly extend all the way back to the start, so we'll use the average transfer rate as our predictor of finishing time.

Late in the transfer, those intervals will mostly extend only a little way back, so we'll use recently observed transfer rates as our predictor of finishing time.

Implementation

This feels rather expensive, but it needn't be.

You don't really need to keep track of every single packet that arrives. All you need is a good enough history of past transfers that those averages come out OK. So you should keep accurate figures for the recent past, but further back you can use some cruder approximation to the truth.

Suppose you record progress every 1/10 of a second, and suppose that data more than 10 minutes old is aggregated with granularity 10 seconds, and suppose each of these records takes 8 bytes. Then the recent past costs 6000 records, a whole day's worth of older history costs about 8600 more, so for a download that lasts a whole day you'd need something on the order of 100k of data. In practice, old measurements could be aggregated much more aggressively.
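A sketch of that aggregation, using the figures above and the TransferHistory class from before (rebuilding from scratch for clarity; real code would coarsen incrementally):

    def coarsen(history, now, fine_window=600.0, coarse_dt=10.0):
        """Collapse records older than fine_window seconds into buckets
        of coarse_dt seconds. Because totals are cumulative, keeping the
        last (time, total) pair in each bucket preserves the averages."""
        cutoff = now - fine_window
        times, totals = [], []
        last_bucket = None
        for t, total in zip(history.times, history.totals):
            if t >= cutoff:
                times.append(t)
                totals.append(total)
            else:
                bucket = int(t // coarse_dt)
                if bucket == last_bucket:
                    # Same bucket: overwrite with the later record.
                    times[-1], totals[-1] = t, total
                else:
                    times.append(t)
                    totals.append(total)
                    last_bucket = bucket
        history.times, history.totals = times, totals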

Doing the integration isn't all that expensive. Suppose you update the completion estimate once a second. Then using the pessimistic figures from the last paragraph, in the worst case we need to run over our entire memory of the transfer (note that once we reach the start of the download, our estimated rate further into the future is constant thereafter, so the rest of the calculation requires only a single division). So we need, perhaps, to loop over about 15k entries and do a few arithmetic operations for each. Let's say 100 cycles for each of 15k entries, once a second; that's about 1.5 million cycles per second, or about 0.1% CPU on a no-longer-very-modern processor. Again, in practice this is a gross overestimate.

For better stability right at the end, it may be advisable to extend the interval by a fixed amount, or to insist that it always covers a certain minimum number of bytes transferred, or something.
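For instance, one might replace the interval start used in estimated_rate with something like this (a variant of my own; the parameter values are purely illustrative):

    def stabilized_start(history, t, h, k, extra=2.0, min_bytes=16384):
        """Variant interval start for end-of-transfer stability: extend
        the lookback by a fixed amount, then widen it until it covers at
        least min_bytes of transferred data (or reaches time 0)."""
        start = max(0.0, t - k * h - extra)
        while start > 0 and history.bytes_by(t) - history.bytes_by(start) < min_bytes:
            start = max(0.0, start - 1.0)
        return start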

Tuning

There's just one adjustable parameter, the one I called k. It determines how far back the algorithm looks, relative to how far ahead it is predicting. I'd suggest a value somewhere in the range [0.25,1].