
This is a reference to a common situation that arises anywhere on the internet that compression is discussed. There is a segment of people who seem to find it personally, mathematically, or even perhaps morally offensive that there can exist no algorithm that can take all possible inputs and be guaranteed to emit something compressed by at least one bit. This remains true despite the fact that the proof of this statement is very, very simple; it can be sketched on a literal post-it note and you could easily convince an open-minded 10-year-old of its truth.

In those situations, you can bet money that some compression scheme will be proposed that fits into at least one of the two categories "did not check to see if it actually compresses data" or "works by hiding bits where you forgot to count, but they still count".

"I'll just give an offset into pi" is in the first category; it does, technically, work, in the sense that it can validly represent the original data, but it turns out that in order to represent the offsets into pi it takes more bits than the content had in the first place. (You can get lucky every once in a while, and store 14159265 as "0/8", but the vast, vast majority of sequences get expanded.) Everyone proposes this one. It's almost funny how often it gets proposed, given that it is actually a very slow method for making your data bigger. Don't hold your breath for your Fields Medal for that one.[1]
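A hedged sketch of why the offsets don't save anything, in Python. Since computing digits of pi is beside the point, this uses a random bit stream as a stand-in (pi is conjectured to be normal, i.e. its bits behave statistically like random bits, though that's unproven); the sizes and pattern length here are arbitrary choices for illustration:

```python
import random

random.seed(0)
# Stand-in for the binary expansion of pi: a long random bit stream.
stream = "".join(random.choice("01") for _ in range(1 << 20))

def offset_cost(pattern: str):
    """Bits needed just to encode the first offset of `pattern` in the stream."""
    idx = stream.find(pattern)
    if idx < 0:
        return None  # not found in our finite stream
    # The offset itself must be transmitted, at ~log2(offset) bits --
    # and you still owe a length field on top of that.
    return max(idx, 1).bit_length()

k = 16  # pattern length in bits
trials = [offset_cost("".join(random.choice("01") for _ in range(k)))
          for _ in range(50)]
trials = [t for t in trials if t is not None]
avg = sum(trials) / len(trials)
print(f"average bits to encode the offset for a {k}-bit pattern: {avg:.1f}")
# The first occurrence of a typical k-bit pattern lies near offset 2^k,
# so the offset alone costs about k bits: no savings before you even
# count the length field.
```

The occasional lucky pattern sits at a small offset, but on average the offset is as big as the data it's supposed to replace.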

An example of the second case is "I'll just store the first few bytes in the filename and then 'forget' that I have to count the filename in the size of the compressed content". Vigorous, month-long debates about whether or not something "counts" as part of the content will follow. These can be resolved by pointing out that the challenge is basically to send a stream of bytes over the network to a computer that has only the decompression algorithm on it and must fully reconstruct all the content, but this rarely satisfies Our Hero who has discovered perfect compression.

I add the word "moral" not just as a jab, but as the only way I have to explain how these people react to things like pointing out "if that worked, I could compress all files to one bit/one byte" or "I can't actually reconstruct the original content with that information". They get offended.

[1]: Just for fun, let's implement the ideal "index into a sequence" compression algorithm. This is not written to proof standards, but to "give the reader the flavor of what is going on" standards. (Technically it assumes the proof, in an attempt to illuminate it from another direction.) The problem with pi is that it tends to repeat a lot. Let's chunk up 8 bits into a byte, since that's pretty common, and build the ideal sequence of bytes that we can index into easily. For completeness, let's just index the 8-bit bytes in one sequence:

     00000000000000010000001000000011
and so on, 0, 1, 2, etc, up to 255. The most efficient way to index this list is to break it up along the 8-bit boundaries, because you'll find (if you try it) that any more clever attempt to exploit overlapping numbers will at best match that performance and at worst take more bits. Therefore, we can index into that list to get a 0 as... 0. And the index for 1 is... uhhh... 1. And 143 is... 143. And so on.

You'll find this approach, extended into as many consecutive bits in a row as you please, will wildly outperform pi as a sequence to index into to get values. Because this is, of course, just the identity transform, and pi in practice does way worse than that. This transform is also wildly more efficient than the pi indexing operation, having computational complexity O(zero), if you'll pardon me the abuse of notation.
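The "ideal sequence" above is short enough to build and check directly. A minimal sketch in Python, where the only assumption is the 8-bit chunking described above:

```python
# The "ideal" sequence: every 8-bit byte value in order, as bits.
sequence = "".join(format(b, "08b") for b in range(256))

def index_of(byte: int) -> int:
    """Offset of `byte`'s pattern in the sequence (on 8-bit boundaries)."""
    return 8 * byte  # the "compression": the index IS the value

# Byte 143 really does live at offset 8 * 143 in the sequence.
assert sequence[index_of(143):index_of(143) + 8] == format(143, "08b")
# Encoding the index 8*b costs the same 8 bits as b itself (plus 3 more
# for the factor of 8): the identity transform, dressed up.
```

Which is the whole joke: the best possible "sequence to index into" hands you back exactly the bits you started with.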



> the proof of this statement is very, very simple; it can be sketched on a literal post-it note and you could easily convince an open-minded 10-year-old of its truth

Could you write that proof here, for the benefit of tired HNers who are just finishing their day of work?


I'm not sure this is rigorous (or if this is the proof they were referring to), but they mentioned a pretty compelling argument in passing: if you could compress all inputs by at least one bit, then you could use that algorithm again on all the outputs and reduce them by at least one bit, until eventually you've reduced every possible input to one bit, which is clearly not possible (at least, not if you ever want to be able to get back the original value). This means that every compression algorithm either a) can't compress some inputs to a smaller size than they already are, or b) loses some of the data in the process. (Lossy compression can still be useful, of course; that said, I'd expect even most lossy compression algorithms to be idempotent.)
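That iteration argument can be sketched in a few lines of Python; the `shrink` function here is a made-up stand-in for "a compressor that always shaves a bit", not any real algorithm:

```python
# Suppose a lossless compressor shaved at least one bit off EVERY input.
# Iterating it n-1 times would map each n-bit input to at most one bit,
# but there are far more inputs than one-bit-or-shorter outputs.
n = 20
inputs = 2 ** n              # distinct n-bit messages
tiny_outputs = 3             # the only strings of length <= 1: "", "0", "1"
assert inputs > tiny_outputs  # so decompression can't be one-to-one

# The only way to guarantee shrinking every input is to throw bits away,
# and throwing bits away is lossy:
def shrink(bits: str) -> str:
    return bits[:-1]  # always one bit smaller... and irreversible

assert shrink("0110") == shrink("0111")  # two inputs, same output
```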

EDIT: The sibling comment from rmidthun is much more rigorous and concise, so if you're looking for a more proper explanation, ignore this and read that


Pigeonhole Principle. The number of messages of bit length N is 2^N. You are trying to map these into 2^M compressed strings, M < N. By the Pigeonhole (or Dirichlet's Box) Principle, at least one compressed string will have more than one message mapped to it. So it cannot be reversed.
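For anyone who wants to see the pigeons counted, the numbers check out by brute force for small N (N = 4 here, chosen arbitrarily):

```python
from itertools import product

N = 4
# All N-bit messages:
messages = ["".join(bits) for bits in product("01", repeat=N)]
# All strictly shorter binary strings (lengths 0 .. N-1):
shorter = [""] + ["".join(bits) for k in range(1, N)
                  for bits in product("01", repeat=k)]

assert len(messages) == 2 ** N       # 16 pigeons
assert len(shorter) == 2 ** N - 1    # only 15 holes
# 16 messages, 15 possible shorter outputs: any compressor that shrinks
# every 4-bit message must send two of them to the same output, so it
# can't be reversed.
```

The general case is the same sum: there are 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1 strings shorter than N bits, always one fewer than the 2^N messages.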


Yup, that's the one.

The "literally sketchable on a post-it note" is that you can show this for 1 bit, 2 bits, and 3 bits pretty easily, and 4 bits if you sketch small enough. At that point it's not hard to see this goes to infinity, maybe not rigorously enough to pass for a proof, but pretty strong. (It's a fairly small step to an inductive proof from there, though this verges on a trivial proof by the time you've laid all the bits out.)


wait! you're onto something there. every bit sequence is also a number in binary base right? so one can just calculate the equivalent number for any stream of bits and just transmit the number! /s



