The key insight in this paper is that the new (larger) model was not "fine-tuned" on the downstream NLP tasks. In other words, after it's trained on unsupervised (you could call it self-supervised in this case) data to do simple things like predict the next word (which is why it needs no real supervision), it can then be used for very specific tasks like answering questions or translating text without further supervision.
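To make the "no real supervision" part concrete, here's a toy Python sketch (my own illustration, not from the paper) of how next-word training pairs fall out of raw text for free; real models use subword tokenizers rather than whitespace splitting:

```python
# Self-supervised next-word prediction: the raw text itself supplies the
# labels, so no human annotation is needed.
text = "the cat sat on the mat"
tokens = text.split()  # toy whitespace tokenizer

# Each training example is (context so far, next word to predict).
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(f"context={context!r} -> predict {target!r}")
```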
Previous large-scale language models like BERT and GPT-2 took a similar approach, but in order to actually perform the more complicated downstream tasks they had to be fine-tuned: they were trained on task-specific QA or translation data in order to do well on those tasks. GPT-3 doesn't do any fine-tuning; it takes its very general initial learning and performs very well on specific tasks it was never trained on. This is why it doesn't perform as well as the "smaller" models on those tasks. But that's beside the point: if GPT-3 were fine-tuned on those tasks, I'm sure it would achieve SOTA results on many (all?) of them. The exciting part is how it generalizes the knowledge learned during "pre-training" to much more specific tasks.
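Here's a rough sketch of the contrast. Fine-tuning updates the model's weights on labeled task data; GPT-3-style few-shot use leaves the weights frozen and just shows examples in the prompt (the translation prompt format is similar to the one in the paper, but `lm_complete` is a made-up stand-in, not a real API):

```python
def lm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real language-model completion call."""
    raise NotImplementedError("plug in an actual model here")

# Few-shot / in-context: the "training examples" live in the prompt itself,
# and the model's weights are never changed.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)
# completion = lm_complete(prompt)  # a frozen GPT-3 would likely answer "fromage"
```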
tl;dr the smaller models were trained on the specific tasks that they were evaluated on. The large model (GPT-3) was not trained on those specific tasks and still does almost as well.