I wonder how neural networks fit with this idea of "learning = compression".

They do more than rote learning. So they must achieve compression.
Consider GPT-3, for instance. Its training involved fitting 175 BILLION parameters. And you call that compression!

I guess it all depends on what you consider to be compressed.
Suppose you want to tell cats from dogs in a set of images.

Yes. There is one bit to be learned per image.
Maybe. But with 1,000 categories instead of just two, that makes about 10 bits per image. And if you have 1,000 images to categorize, that makes only 10⁴ bits to learn.

The network was trained on 1.2 million images.
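As a quick sketch of the bit counting in this exchange (the class and image counts are the ones quoted here, roughly ImageNet-scale data and an AlexNet-scale network; the exact figures are only illustrative):

```python
import math

# Bit-counting sketch using the figures quoted in this dialogue.
n_classes = 1000                       # assumed 1000-way labelling
bits_per_image = math.log2(n_classes)  # ~10 bits to pick one class

small_set = 1_000                      # the "1,000 images" case
training_set = 1_200_000               # the "1.2 million images" case

print(round(bits_per_image * small_set))     # ≈ 10^4 bits
print(round(bits_per_image * training_set))  # ≈ 12 million bits
```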
Even so! That’s 12 million bits against 60 million real-valued parameters. I don’t see the compression.

But that’s not the point. You want to decide on unseen images.
So what?

The point is that what you’ve learned with 60M parameters is potentially unbounded. The network can now categorize an infinity of different images.
So, you mean that you compress an infinity down to "only" 60M parameters?

Yes.
And in the case of GPT-3, ...
Yes, that’s impressive!

One single relevant statement is worth a great deal of information, not just 10 bits.
Why?

Because you need a certain amount of information to select that statement among all the possible silly ones in the current context.
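One way to make that selection cost concrete (a toy illustration; the counts below are invented for the example, not taken from the dialogue): if a context admits N possible statements and only k of them are relevant, picking a relevant one conveys about log₂(N/k) bits.

```python
import math

# Toy illustration with made-up numbers: selecting one relevant statement
# among many possible ones carries log2(N/k) bits of information.
n_possible = 2 ** 40   # hypothetical number of statements expressible in context
n_relevant = 2 ** 3    # hypothetically, only a handful of them are relevant

selection_bits = math.log2(n_possible / n_relevant)
print(selection_bits)  # → 37.0, far more than the 10 bits of a class label
```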
And so, according to you, that justifies the 175 billion parameters?

Indeed! It still achieves huge compression.