Thank you. It is quite a labor to even skim the 50+ page paper. Your pointed reply was very helpful in drawing my attention to the issue of contamination. After reading the section carefully, I think my understanding of overfitting is much improved, at least insofar as models like GPT-3 are concerned.
Clearly, the authors have given careful consideration to the issue of contamination and have provided a reasonable analysis and a careful argument regarding overfitting to the existing benchmarks.
On the other hand, I was wondering if the authors would consider purposefully creating a type of "out-of-sample data" for "creative evaluation"? Of course, GPT is no stranger to creativity, so it would be a fascinating challenge to come up with methods for creating datasets that are truly creative and challenge GPT-{N} to prove its mettle.
For example, would it be possible to engage a really good creative writer* along with a highly experienced school teacher† to take on the Reading Comprehension task and create a few "tricky" evaluation samples that not only go above and beyond the contamination objections but also challenge human intelligence to be careful not to fall into common traps?
This way lies a different evaluation metric - a subjective one perhaps, but it's a start. Just a thought experiment - that's all.
* so that they can come up with new ways to trick GPT/humans
† a teacher knows the common mistakes the average student makes
Edit: Duh, my head immediately screamed "GANs!" the moment I pressed submit, lol. But I am not sure GANs make sense for NLP tasks. Like, do they still make sense if humans/domain experts try to solve them?
You might be interested in the ELECTRA model. It's the first solid success I've seen of a GAN-like framework in NLP. Its references also point to discussion of why GANs still don't do so well in NLP.
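To make the GAN-like setup concrete: in ELECTRA, a small generator fills in masked tokens and a discriminator predicts, for every token, whether it was replaced. Here's a minimal sketch of running just the discriminator with the Hugging Face transformers library; the checkpoint is the published google/electra-small-discriminator, but treat the rest (the hand-made corruption, reading the sign of the logit) as my illustrative assumptions, not anything from the paper's code:

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

model_name = "google/electra-small-discriminator"
discriminator = ElectraForPreTraining.from_pretrained(model_name)
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)

# In real ELECTRA pretraining, a small generator (a masked language model)
# proposes plausible replacements for masked tokens. Here we corrupt one
# token by hand: "jumps" -> "fake".
corrupted = "the quick brown fox fake over the lazy dog"

inputs = tokenizer(corrupted, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits[0]

# One logit per token; a positive logit means "this token was replaced".
# Getting a training signal from every token (not just the ~15% an MLM
# masks) is the paper's claimed efficiency win.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, logit in zip(tokens, logits):
    print(f"{tok:>8s}  replaced={logit.item() > 0}")
```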
If I may ask one more question: would you happen to know whether the authors or other researchers are entertaining any theoretical work on the experimental design and training methodologies of GPT/BERT? As in, why does it work? What is the significance of training via the "fill-in-the-blanks" method?
Don't get me wrong - the work is great and the SOTAs are amazing. I would just be happy to have a chat to bounce around some ideas about what all this means and why these methods seem to work so well. Papers/articles/blog-posts are always a pleasure to read!
I think it's just kind of understood, so I don't have any real references for you. Filling in "A dog has ___ feet" requires actual facts. Or compare these two:
"The city councilmen refused the demonstrators a permit because they advocated violence. It wasn't the first time the _____ had advocated violence."
"The city councilmen refused the demonstrators a permit because they feared violence. It wasn't the first time the _____ had feared violence."
The syntax is identical. The words are identical, except that I swapped "advocated" out for "feared". When I swap it, the _____ changes from "demonstrators" to "councilmen." Think about what kinds of reasoning, experience, and knowledge it takes you to resolve which group "they" refers to in each sentence.
Most blanks might be simpler and just correspond to learning English, like when the blank is "the," but learning that is a feat too. Filling in the blanks that require broader knowledge requires somehow capturing that broader knowledge.
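If you want to poke at this yourself, here's a minimal sketch using the Hugging Face fill-mask pipeline with a stock bert-base-uncased checkpoint; the model choice and the idea of restricting candidates via the pipeline's targets parameter are mine for illustration, and whether a given checkpoint resolves the swap correctly will vary:

```python
from transformers import pipeline

# bert-base-uncased was pretrained with exactly this kind of
# fill-in-the-blank (masked language modeling) objective.
fill = pipeline("fill-mask", model="bert-base-uncased")

# A blank that needs a plain fact about the world.
print(fill("A dog has [MASK] feet.")[0]["token_str"])

# The pair from above: identical words except for one verb.
pair = [
    "The city councilmen refused the demonstrators a permit because they "
    "advocated violence. It wasn't the first time the [MASK] had advocated violence.",
    "The city councilmen refused the demonstrators a permit because they "
    "feared violence. It wasn't the first time the [MASK] had feared violence.",
]
for text in pair:
    # Score only the two candidate readings. (If a target isn't a single
    # token in the vocabulary, the pipeline falls back to its first
    # sub-token and warns.)
    print(fill(text, targets=["demonstrators", "councilmen"])[0]["token_str"])
```

The point isn't that any particular model nails it, but that the objective itself forces the model to pick up whatever knowledge the blank demands.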