Shannon’s Source Coding Theorem (2020) (mbernste.github.io)
127 points by ElFitz on June 18, 2023 | 38 comments


Shannon is as much of an engineering marvel as Einstein is for science. His perennial works, hobbies, papers, and theses laid out the fundamentals of modern engineering, including digital, computer, and communication systems, as well as early forms of AI such as chess engines and language models [1]. Kleinrock once noted that he had to pursue the then-exotic field of queueing theory, which led to packet switching and then the Internet, because most of the popular problems in his research field had already been solved by Shannon (who was also one of the members of Kleinrock's doctoral research committee).

[1] How Claude Shannon Helped Kick-start Machine Learning:

https://www.spectrum.ieee.org/claude-shannon-information-the...


I'm fond of his "mind-reading machine": https://literateprograms.org/mind_reading_machine__awk_.html

(compare Agassi/Becker for the value of single-bit predictions)


Wow, what a fascinating concept and read. It's humbling to consider and marvel at what early technology pioneers like Claude Shannon envisioned and accomplished with a minuscule sliver of the computational power that surrounds us today.

I couldn't resist submitting it, especially considering it has not yet been discussed on HN*:

https://news.ycombinator.com/item?id=36385198

(*A different link covering the same subject was submitted 6 years ago but never caught on: https://news.ycombinator.com/item?id=14450232)


Another excellent book that I recently read was "Information Theory: A Tutorial Introduction" by James V. Stone [1]. This book is a fairly light read and builds up everything from scratch which was very nice.

Shannon's original paper [2] is often touted to be very parseable, but it is always enlightening to read the same basics many times from multiple sources, especially something as foundational as source coding (and broader information theory) in a more modern language.

I also finally discovered the meaning of "differential" in differential entropy [3], the continuous counterpart of discrete probability masses, something that I had always swept under the rug.

[1]: https://www.librarything.com/work/16165301/book/242295611

[2]: https://ieeexplore.ieee.org/document/6773024

[3]: https://sanyamkapoor.com/kb/differential-entropy


I was curious as to whether there is a way to determine the actual amount of information in a given file (say, an image), and thus a theoretical limit to lossless compression, which is apparently called "Kolmogorov complexity" [1], and this came up.

[1]: https://en.m.wikipedia.org/wiki/Kolmogorov_complexity


If you have a probability distribution over patterns you are likely to find in a file, then the Shannon entropy gives you the best losslessly compressed filesize you can achieve.

Kolmogorov complexity is the length of the shortest computer program in some programming language that can generate your file. It gives the same number across different programming languages, up to an uncomputable and likely very large additive constant. It’s a more universal notion of compressibility than Shannon entropy, because it doesn’t depend on a probability distribution, but it is uncomputable.

The concepts are related because any programming language can be seen as corresponding to a probability distribution with p(x) ∝ 2^(−length of the shortest program that generates x).
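
A minimal sketch (mine, in Python) of the first point: the Shannon entropy of a symbol distribution is the best average bits-per-symbol any lossless code can achieve. The distribution here is made up for illustration.

    import math

    def shannon_entropy(probs):
        # Average bits per symbol an ideal lossless code achieves over this distribution
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical symbol distribution for some file
    probs = [0.5, 0.25, 0.125, 0.125]
    print(shannon_entropy(probs))  # 1.75 bits/symbol, vs. 2 bits for a naive fixed-length code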


A practical way I used to grok this is to try to zip up an mp4 or a jpeg. You’ll notice the size doesn’t really change. That’s because these formats already applied entropy coding compression techniques and have compressed the information very close to the Shannon entropy limit.
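
If you want to try it in code rather than with zip on the command line, a rough sketch (the file path is just a placeholder):

    import zlib

    # Placeholder path; substitute any JPEG/MP4 you have lying around
    data = open("photo.jpg", "rb").read()
    packed = zlib.compress(data, level=9)
    # For already-compressed formats the two sizes will be nearly identical
    print(len(data), len(packed))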


There's a distinction to be made here: order-0 entropy versus higher-order entropy. The order just means how long a chain of symbols is considered, where order-0 means counting each symbol on its own, versus e.g. order-1, where you consider chains of two symbols. Like the other commenter says, JPEG XL can recompress JPEGs losslessly to be around 20% smaller by exploiting higher-order patterns (e.g. via move-to-front, LZ77-ish matches, and the MANIAC tree context modeling), versus JPEG's RLE (which is "sort of beyond order-0") combined with Huffman coding (which is order-0).
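
To make the order-0 vs. order-1 distinction concrete, a small sketch (my own) that estimates both from a byte string; "order-1" here is the entropy of the next byte conditioned on the previous one:

    import math
    from collections import Counter

    def order0_entropy(data):
        # Bits per symbol, counting each byte independently
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

    def order1_entropy(data):
        # Bits per symbol of the next byte given the previous byte
        pairs = Counter(zip(data, data[1:]))
        ctx = Counter(data[:-1])
        n = len(data) - 1
        return -sum(c / n * math.log2(c / ctx[a]) for (a, b), c in pairs.items())

    data = b"abababababababab"      # strong order-1 structure
    print(order0_entropy(data))     # ~1.0 bit: 'a' and 'b' are equally frequent
    print(order1_entropy(data))     # ~0.0 bits: each byte is determined by the previous one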


JPEG XL supports reversible JPEGs transcoding in ~20% less space, which wouldn't be possible if JPEG was very close to the Shannon entropy limit.


JPEG is lossy though - I'm pretty sure these theorems only really apply to lossless compression, because if you allow lossy compression you can do all kinds of things with different types of data (pictures being the classic case of permitting pretty lossy compression without much noticeable quality degradation).


Maybe I wasn't clear enough: JPEG -> JPEG XL -> JPEG is not lossy, you get exactly the same bits back, but the JPEG XL file is 20% smaller.


Oh, very interesting. I'll have to look into it.


Entropy doesn't make much sense for a single instance. There trivially exists a codec that will translate any particular file into any other. To get a real idea how much information is in a file you'll need some statistical model about how it may have been generated (thankfully the model doesn't have to be perfect).

You will also want to be a bit careful what assumptions you allow this model to make. In a very real sense a hash is mostly enough information to recover a file if you know it exists somewhere, as torrents and magnet links prove.


Run your file through every compression algorithm (that doesn't utilize too much prior knowledge, i.e. not LLMs) that you can get your hands on. After a certain point it won't get compressed any further. I think that's a good approximation for the limit.


This might give a decent estimate of the entropy of the file, but it is a poor approximation of the Kolmogorov complexity in general.

All major compression libraries that I know of essentially work by "factoring out" statistical regularities in the input string--i.e. repeating patterns. But many strings have regularity that can't be captured statistically. For example, the sequence 1, 2, 3, 4, ... has an entropy rate of 1, because all digits and all subsequences of digits occur with equal frequency, so most modern compression libraries are going to be unable to compress it to any significant degree [1]. However, it obviously has a very compact representation as a program in your favourite language.

If you want to go down a bit of a rabbit hole, this video from a researcher in this field gives an overview, as well as some proposed methods for approximating Kolmogorov Complexity in a reasonable way (even though in a technical sense approximating it with known, fixed bounds is I believe impossible): https://www.youtube.com/watch?v=HXM3BUXsY4g. This paper also discusses a theory that decomposes KC into a statistical part (Shannon entropy) and a "computational part": https://csc.ucdavis.edu/~cmg/compmech/pubs/CalcEmergTitlePag....

[1]: Note that if you encode it as ASCII or even N-bit integers there will be significant redundancy in the encoding. To properly test it you'd have to encode the numbers in a packed binary encoding with no padding.
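
A quick, hand-wavy way to see the gap (and the footnote's caveat about ASCII): a general-purpose compressor removes some redundancy from this sequence, but the result stays nowhere near the size of the generating program. Exact numbers will vary with the compressor and the length chosen.

    import zlib

    # The counting sequence as a decimal string (a stand-in for a packed encoding)
    s = "".join(str(i) for i in range(1, 100001)).encode()

    print(len(s), len(zlib.compress(s, level=9)))
    # zlib shaves off some redundancy (ASCII digits use fewer than 4 of their 8 bits),
    # but the output is still a sizeable fraction of the original, versus the
    # one-line program above that generates the whole thing.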


This is how some video encoding algorithms and the like sort of work, IIRC - you encode inter-frame differences under the assumption that any particular part of the screen isn't likely to change much.

Encoding algorithms end up getting specialised for particular types of data because different data exhibits different patterns.


I wouldn’t have thought of that.

I was more interested in learning about the possible theory behind such a limit, to see if we had a way to actually know how far such optimisations can go. Basically, for any given file, what would be the absolute best we could hope for (without resorting to losing information).


That question makes little sense without fixing some constraints, like "what other files exist" and "what language you are allowed to reproduce them in".

In a world where every single file is exactly "abcd", it takes zero bits to represent the file - you don't need any information to know that the file contains "abcd" because all files do.

In theory, the language you use to output the file could have a one-bit coding '0' for "produce the file 'there are comments on HN'" (and all other commands starting with '1...'). So in the world we live in, the smallest a file can be compressed to is 1 bit - if you just use the right language.

None of these are practical in general, but the point is that Kolmogorov Complexity is not meaningful without specifying more assumptions/priors.


Absolutely, that is part of what I have come to understand while diving into the topic. That I first needed to answer quite a few other questions for that one to even make sense.

Thank you for the examples; they are much clearer and straightforward than most of those I’ve come across.


This is a misconception. The main result of Kolmogorov complexity theory is that there is precisely such a "universal prior", which comes simply from the notion of computability.


It's a misconception that the "universal prior" is at all helpful.


This universal prior is both theoretically useful and helps conceptualize real-world compression techniques.


Not to be pedantic, but for any given file the answer is 0% compression, due to the pigeonhole principle.


Not at all! Having basically no knowledge in that area, aside from a vague understanding of the purpose and uses of information theory, it is more than welcome!

For those interested also unfamiliar with it:

- https://eprint.iacr.org/2021/1333.pdf

- https://matt.might.net/articles/why-infinite-or-guaranteed-f...


The formula for entropy and the source coding theorem discussed in the article describe exactly that.


beyond, in fact, since they're lossy formats


In the example in the article, if you want to communicate a series of 5 dice rolls, you can send a binary symbol for the number of each roll (1-6), 5 times. Using 3-bit coding for each number, it would take 15 bits (5 numbers in the sequence, 3 bits each) to communicate the message.

But there’s an alternative way: instead of coding each number individually, you can code the sequence of 5 numbers, of which there are 6 × 5 × 4 × 3 × 2 = 6!/1! = 720 in this case, which can be encoded in binary with 8 bits

So instead of 15 bits, you can use 8


All sequences of 5 numbers in the range [6] should be 6 * 6 * ... * 6 = 6^5, which would give 12.92 bits, so 13. The only way to actually do better than sending every symbol would be run length encoding or other non fixed alphabet encodings, which fortunately do not need to be longer than the entropy of the distribution.


You don't need variable length, you just need to send the data in bigger blocks. Count all 6^(5n) ways to throw 5 dice n times, order them lexicographically, and send the index in 5n·log2(6) bits, rounding up.

Your inefficiency is just the rounding error, which quickly becomes trivial. In theory some block sizes should even coincidentally be close to a whole number of bits.
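
A tiny sketch (names mine) of the block idea for a single block of 5 rolls, treating the sequence as a base-6 index:

    import math

    def encode_rolls(rolls):
        # Map a sequence of dice rolls (values 1..6) to its lexicographic index
        index = 0
        for r in rolls:
            index = index * 6 + (r - 1)
        return index

    # One block of 5 rolls: the index lies in [0, 6^5) and fits in ceil(5*log2(6)) = 13 bits
    print(encode_rolls([3, 1, 6, 6, 2]))
    print(math.ceil(5 * math.log2(6)))   # 13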


You are right, it’s 6^5 to include sequences with repeating symbols

In any case, you don’t need to index the whole numbers space to transmit all the information

Like you say, you can use 13 bits instead of 15

And using variable length encoding you can reduce it even more

But what is the distribution in this case?


Well, in this case the distribution would simply be a uniform distribution over [1:6], which has entropy 2.58 bits per symbol; it is also the maximum-entropy distribution on a set of size 6. Like others have said, in a real-world scenario you can't actually know the underlying distribution of whatever file you are trying to compress, but since you want to compress it, you are telling the program that the file has interesting information (otherwise it would just be noise), so grammar-based encoding would probably not be best; instead LZ77 or LZ78 would perform better. (For example, if you know more about the file, you could do even better by transforming it into some frequency domain first and then compressing.)


Fascinating. Thank you.

Why do you need to know more about the file to transform to frequency domain?


Because no information is lost if you simply change domain, but you may spot more patterns faster or more easily. For example (even if a bit dated): https://ieeexplore.ieee.org/abstract/document/1614066

There is a lot of research in this field because of the IoT fad/trend. If you want to know more, I would suggest simply picking a paper from IEEE, deep-diving into the terms and terminology, and skipping the proofs of most algorithms/theorems.


> 720, which can be encoded in binary with 8 bits

Even if 720 were the right number, that takes 10 bits, not 8.


Yes, embarrassing mistake

Still 10 is less than 15

Additionally, as someone else pointed out, it would be 6^5 sequences, which would need 13 bits

Still 13 is less than 15

But the interesting thing is, you can then index the most commonly used sequences of sequences and compress by a few more bits

You can keep the process going until you have a 2 bit index for the most commonly used sequences (of sequences (of sequences… ))


Sorry to keep nitpicking, but: In this case, you can't compress further. These are dice rolls; there aren't "most commonly used sequences" (if the dice are any good).


Yes, however, the moment you choose a message size, there is always only a finite set of ways to arrange the symbols in those positions


6 possible dice values: log2 gives around 2.6 bits per roll. Naively you can spend 3 bits per roll, wasting 0.4 bits each time, or you can pack n rolls into a single message to asymptotically approach 2.6 bits per roll on average. But you can't get under 12.9 bits for 5 dice rolls.
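
The asymptotics are easy to check numerically (a throwaway sketch):

    import math

    # Per-roll cost when packing k rolls into one block and rounding up to whole bits
    for k in (1, 5, 50, 500):
        bits = math.ceil(k * math.log2(6))
        print(k, bits, bits / k)
    # 1 -> 3.0, 5 -> 2.6, 50 -> 2.6, 500 -> 2.586: approaching log2(6) ~ 2.585 bits/roll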



