Did you actually need to find the true median of billions of values? Or would finding a value between 49.9% and 50.1% suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.
(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
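A minimal sketch of that idea in Python (the function name and the fixed sample size are placeholders, and it assumes the values sit in an in-memory sequence you can sample from):

```python
import random
import statistics

def approximate_median(values, sample_size=10_000):
    """Estimate the median from a uniform random sample of `values`.

    `values` is assumed to be an in-memory sequence here; sampling
    uniformly from data spread across thousands of files is the harder
    part, as discussed further down the thread.
    """
    if len(values) <= sample_size:
        return statistics.median(values)
    return statistics.median(random.sample(values, sample_size))
```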
The kind of margin you indicate would have been plenty for our use cases. But we were already processing all of these log entries for multiple other purposes in a single pass (not one pass per thing computed). Folded into that same pass, where parsing the JSON was most of our cost anyway, the median calculation came roughly for free.
Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers, and any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files across a cluster for the median calculation? Once the solution outlined in the previous comment was identified, it seemed simpler than trying to understand whether we were talking about 49-51% or 40-50%. And if the margin had been too big, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.
Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.
As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.
I'd love to see a language embrace a more efficient style similar to how SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"
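For illustration, a single-pass min/max written as an explicit loop in Python (names are mine, nothing standard); the enumerator-style equivalent walks the data twice:

```python
def min_and_max(iterable):
    """Return (min, max) in a single pass over the data."""
    it = iter(iterable)
    try:
        lo = hi = next(it)
    except StopIteration:
        raise ValueError("min_and_max() arg is an empty iterable") from None
    for x in it:
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    return lo, hi

# The enumerator-style version walks the data twice (and fails outright
# if `values` is a one-shot iterator rather than a collection):
#   lo, hi = min(values), max(values)
```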
What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.
This is true, but a big advantage of the single pass method is data reuse. Instead of loading each element twice, you load each element once. Per bit, reading/writing to external memory is massively more energy intensive than any compute op. Orders of magnitude more.
I think you meant 2N+2M vs 2N+M, but yes, that's the point: not reading twice is cheaper because, unlike in traditional big-O analysis, compute is cheap but data access is very expensive.
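A quick, unscientific way to check whether the difference shows up in wall-clock time on a particular setup. (In CPython the interpreter overhead of the explicit loop can easily swamp the memory-traffic savings, so treat this as a methodology sketch rather than a prediction of which one wins.)

```python
import random
import timeit

values = [random.random() for _ in range(10_000_000)]

def two_passes():
    # Reads the whole list twice.
    return min(values), max(values)

def one_pass():
    # Reads the whole list once, tracking both extremes.
    lo = hi = values[0]
    for x in values:
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    return lo, hi

print("two passes:", timeit.timeit(two_passes, number=5))
print("one pass:  ", timeit.timeit(one_pass, number=5))
```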
I was thinking LINQ was optimized to do a single enumeration for cases such as this. Can LINQ be easily, naturally, and idiomatically used with `IEnumerable<T>.Max` and `.Min` to avoid looping twice, or is that a more deliberate, intentional optimization the engineer would have to seek out?
You're absolutely right! Some learning we came to later, not unrelated to what you're saying: don't just look at metrics (in the case I described above, timings of operations in a large system), look at histograms of them. You should be able to explain why they have the shape they do. The distributions are so often multimodal, and understanding the modes helps you understand a lot more nuance about your system.
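A crude way to eyeball that shape straight from a terminal, with no plotting stack (the bucketing scheme and names here are just placeholders):

```python
import math
from collections import Counter

def ascii_histogram(timings_ms, base=2.0, width=50):
    """Print a log-bucketed ASCII histogram of operation timings.

    Log-spaced buckets make multimodal latency distributions (cache hit
    vs. miss, retry vs. no retry, slow replica, ...) easy to spot in a
    way a single summary statistic never is.
    """
    buckets = Counter(
        math.floor(math.log(max(t, 1e-9), base)) for t in timings_ms
    )
    peak = max(buckets.values())
    for b in sorted(buckets):
        lo, hi = base ** b, base ** (b + 1)
        bar = "#" * max(1, round(width * buckets[b] / peak))
        print(f"{lo:10.3f} - {hi:10.3f} ms | {bar} ({buckets[b]})")
```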
We were an infra-focused team building up a scalable data handling / computation platform in preparation to apply ML at a non-trivial scale. When we finally hired some people with deep stats knowledge 10 or 12 years ago, there was a lot of great learning for everyone about how each of our areas of expertise helped the others. I wish we'd found the right stats people earlier to help us understand things like this more deeply, more quickly.
> the latter is much easier: sample 10,000 elements uniformly at random and take their median
Do you have a source for that claim?
I don't see how that could possibly be true... For example, if your original points are sampled from two Gaussians centered at -100 and 100, with small but slightly different variances, then the true median can be anywhere between the two centers, and you may need a humongous number of samples to get anywhere close to it.
True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.
Yet, for any given distribution the sample size can be made arbitrarily large. Unless I've missed something, I don't see the relevance.
If you want the n-nines rate of failure (e.g., n=5, 99.999%) for a system with a power-law performance distribution, you could be waiting much more than a billion samples to see a failure.
E.g., 3E10 ms (30 billion samples) in a year at a five-nines failure rate has about 3E5 ms of failure (roughly 300 thousand samples) -- and you'll see fewer than that under a sampling strategy.
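Spelling that arithmetic out (one year taken as roughly 3.15E10 ms; the 1-in-1000 subsampling rate is just an illustrative choice):

```python
MS_PER_YEAR = 365 * 24 * 3600 * 1000           # ~3.15E10 ms

def failing_ms_per_year(nines: int) -> float:
    """Milliseconds of 'failure' per year at an n-nines success rate."""
    return MS_PER_YEAR * 10 ** (-nines)

for nines in (3, 4, 5):
    budget = failing_ms_per_year(nines)        # e.g. five nines -> ~3.2E5 ms
    in_sample = budget / 1000                  # keep 1 sample per 1000 ms
    print(f"{nines} nines: ~{budget:.1e} failing ms/year, "
          f"~{in_sample:.0f} expected in a 1-in-1000 sample")
```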
Relevance: from the cited paper, the variance of the median estimator is proportional to 1/(n * f^2), where n is the sample size and f is the density at the median.
Two observations:
1. With sufficiently large n, you can control your variance to an acceptable level.
2. The factor f is outside of your control. If your distribution has a small density around the median, then you'll need to throw more samples at it to compensate.
I think your concern is about #2: you can always construct a pathological distribution that makes the sampling-based approach impractical. The paper provides guidance on what to expect when you apply the sampling-based approach.
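A quick simulation of that 1/(n * f^2) scaling for a standard normal, where the density at the median is 1/sqrt(2*pi) (numbers are purely illustrative; the asymptotic formula used here has the usual 1/4 constant):

```python
import math
import random
import statistics

def simulated_median_variance(n, trials=1000):
    """Empirical variance of the sample median of n standard normals."""
    medians = [
        statistics.median(random.gauss(0.0, 1.0) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(medians)

f_at_median = 1.0 / math.sqrt(2.0 * math.pi)   # N(0,1) density at its median
for n in (100, 1_000, 10_000):
    theory = 1.0 / (4.0 * n * f_at_median ** 2)
    print(f"n={n:6d}  simulated={simulated_median_variance(n):.2e}  "
          f"asymptotic≈{theory:.2e}")
```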
The key word is "uniformly". If your data is not uniformly distributed, then you just have to pick random values non-uniformly. There are many ways to do that, and once you have your algo you'll be able to reliably find an approximation of the median much faster than you would find the actual median.
I wasn't saying that you could get within 1% of the true median, I was saying you could find an element in the 49th to 51st percentile. In your example, the 49th percentile would be -90 and the 51st percentile would be 90, so the guarantee is that you would (with very high probability) find a number between -90 and 90.
That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.
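To put a rough number on "very high probability": the sample median misses the 49th-51st percentile band only if at least half of the samples land beyond one of those percentiles, which is a binomial tail. A sketch via the normal approximation (not a careful bound):

```python
import math
from statistics import NormalDist

def prob_outside_band(n=10_000, eps=0.01):
    """Approximate P(sample median of n i.i.d. draws falls outside the
    band between the (50-100*eps)th and (50+100*eps)th percentiles),
    via the normal approximation to the binomial."""
    p = 0.5 - eps                                # e.g. 0.49
    # The sample median lies below the p-th percentile only if at least
    # half of the n samples do; that's a Binomial(n, p) upper tail.
    z = (0.5 - p) * math.sqrt(n) / math.sqrt(p * (1 - p))
    return 2.0 * (1.0 - NormalDist().cdf(z))     # both sides of the band

print(prob_outside_band())   # ~0.045 for n=10,000, eps=0.01
```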