I wonder how neural networks fit with this idea of
"learning = compression".
They do more than rote learning. So they must achieve compression.

Consider GPT-3, for instance. Its training resulted in the fitting of 175 BILLION parameters. You call that compression! I guess it all depends on what you consider to be compressed.

Suppose you want to tell cats from dogs in a set of images. Yes: there is one bit to be learned per image.
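(To make the "one bit per image" claim concrete: specifying one label among k equally likely classes takes log2(k) bits. A minimal Python sketch; the function name is just for illustration.)

```python
import math

# Information needed to specify one label among k equally likely classes:
# log2(k) bits. For a cat-vs-dog task, k = 2, so one bit per image.
def bits_per_label(num_classes: int) -> float:
    return math.log2(num_classes)

print(bits_per_label(2))   # 1.0 bit per image for cats vs. dogs
```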

If you must fit 60 million parameters to decide on 1000 images, I don’t see the point. In the paper that made deep learning famous, there were 1000 categories, not just two.

Maybe. That makes about 10 bits per image, since log₂(1000) ≈ 10. If you have 1000 images to categorize, that makes only about 10⁴ bits to learn. The network was trained on 1.2 million images.

Even so! 12 million bits against 60 million real-valued parameters: I don’t see the compression. But that’s not the point. You want to decide on unseen images.
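(A small Python sketch to keep track of the arithmetic in this exchange; the 32-bit precision used to turn parameters into raw bits is only an illustrative assumption, not something fixed by the discussion.)

```python
import math

# Numbers quoted above: 1000 categories, 1000 or 1.2 million training images,
# and the ~60 million parameters of the network that made deep learning famous.
num_classes = 1000
bits_per_image = math.log2(num_classes)    # ~9.97, i.e. roughly 10 bits

print(1_000 * bits_per_image)              # ~10^4 bits of labels for 1000 images
print(1_200_000 * bits_per_image)          # ~1.2e7 bits, i.e. ~12 million bits

num_params = 60_000_000
# If each parameter were stored at 32-bit precision (an assumption made here
# purely for illustration), the model would occupy far more raw bits than
# the label information it was trained on:
print(num_params * 32)                     # 1.92e9 bits
```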

So what? The point is that what you’ve learned with 60M parameters is potentially unbounded. The network can now categorize an infinity of different images.

So, you mean that you compress an infinity down to "only" 60M parameters? Yes.
And in the case of GPT-3, ...

What about GPT-3? Did you see that it is able to make up relevant answers to questions?

Yes, that’s impressive! One single relevant statement is worth a great deal of information, not just 10 bits.

Why? Because you need a certain amount of information to select that statement among all the possible silly ones in the current context.
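(To put a rough number on that intuition: selecting one statement among N candidates carries log2(N) bits. A toy Python sketch, where the values of N are purely hypothetical.)

```python
import math

# The information carried by picking one relevant statement is log2(N) bits,
# where N is the number of candidate statements one could have produced in
# that context. The values of N below are hypothetical, for illustration only.
def selection_bits(num_candidate_statements: int) -> float:
    return math.log2(num_candidate_statements)

print(selection_bits(1000))     # ~10 bits: choosing among a thousand options
print(selection_bits(10**12))   # ~40 bits: choosing among a trillion options
```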

And so, according to you, that justifies the 175 billion parameters? Indeed! It still achieves huge compression.