Can gzip be a language model?

59 points by AviKav 10 days ago on lobsters | 7 comments

This is the idea behind the Hutter prize and the Large Text Compression Benchmark. There are good attempts at beating them with LLMs, such as gmix mentioned on the latter, and ts_zip.

They rank badly because these two benchmarks consider the size of the decompressor (which includes the LLM itself) in addition to the compressed file. But with larger texts to compress, the LLM's size would matter less.

What's surprising is it isn't necessary to send the pretrained language model along with the decompresser (see nncp). You predict the first bit using a pseudorandomly-initialized LLM, then train the LLM on that bit before making the second prediction, and so on, in the same way that you can learn the LZ78 dictionary as you go rather than sending it with the compressed data.

In fact, it's better to think of the LLM, or any predictor, as the outcome of a sufficiently general learning process. We can expect the compression scheme that learns from the data as it compresses to outperform one with a fixed predictor, in the long run.

hobbified | 9 days ago

So, yes (kind of, as the post says), and it's an interesting thing to understand, but it's a "language model" about as much as Dissociated Press (what Wikipedia calls the "letter-based" version). It can repeat parts of its training semi-coherently, but its abilities are severely limited by the fact that it has to exact-match a substring to the left of the cursor. The magic of attention is that it combines this kind of generation with a learned similarity space, and with an ability for things to move around, so that if it had to generate text following "Thou'rt a cad" it would be partially informed by training data following "Thou'rt a ruffian" and partially by training data following "Thou'rt a very cad". Loosely speaking. gzip stops seeing any match as soon as one character differs.

samee | 9 days ago

That just moves the question: can attention mechanisms be used for better text compression? The answer is definitely yes, but what's the least bit of change we need to gzip to make it happen?

hobbified | 9 days ago

Well... I don't have a satisfying answer (I have toyed with writing an LLM-based compressor, and it works great, but it's big and slow, not a "least change"). But I can say that the least change is not very small. gzip/deflate is basically a 1977 algorithm. There's no part of it that's explicitly trying to make predictions, it's just combining back-references with Huffman coding in a way that's easily implemented with only kilobytes of memory, and implicitly predicts that the input will be vaguely text-like.

If there's some middle ground between that and the full-on approach of taking a GPT-style model and using its next-token logits as the weights for an arithmetic coder, I don't know what it is. But this isn't really my field, so there's plenty of room for things I don't know.

Corbin | 9 days ago

I've written about half of a large Lempel-Ziv model and I've found an interesting fact which meshes with your perspective: during inference, LLZ harnesses have to maintain a list of attention-head-like pointers into the dictionary. Random sampling over those heads (with weights to prefer shorter dictionary entries) could be modulated like Transformer-style key/query/value patterns, although I have trouble imagining that it would improve performance without admitting that we're trying to game benchmarks. To be able to emit n-grams which haven't been seen during training, it's sufficient to assign each unseen (bi)gram a prior probability based on the original alphabet's cardinality.

peter-leonov | 10 days ago

Love this series of posts from different authors. It helps with demystifying LLMs a bit. Although, layered architecture with reasoning and true context levels LLMs up into something else for sure.