I don't mean to sound critical, just adding some information: deep learning models like RoBERTa (a BERT variant; I experimented with Facebook's just-released large model this morning) can perform anaphora resolution (coreference), answer general questions, score whether two sentences contradict each other, and so on. One model solves several very difficult problems whose solutions have evaded hand-coding efforts for decades.
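For anyone who wants to poke at this, here is a minimal sketch of the contradiction-scoring piece, assuming the Hugging Face transformers library and the public roberta-large-mnli checkpoint (illustrative only, not the exact setup I ran this morning):

    # Sketch: score whether two sentences contradict each other using a
    # RoBERTa model fine-tuned on MNLI. Assumes `pip install torch transformers`.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    premise = "The meeting was moved to Friday."
    hypothesis = "The meeting still takes place on Monday."

    # Encode the sentence pair and run one forward pass.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Label order for roberta-large-mnli: contradiction, neutral, entailment.
    probs = torch.softmax(logits, dim=-1)[0]
    for label, p in zip(["contradiction", "neutral", "entailment"], probs):
        print(f"{label}: {p:.3f}")

Question answering and coreference work the same way in principle: the same pretrained encoder with a small task-specific head on top.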
There are experimental sequence models that transform paper text into figures, and joint models that transform figures and text into code, but you are correct that these are not production-ready.
From the abstract of "Probing Neural Network Comprehension of Natural Language Arguments" (Niven & Kao, ACL 2019):

"We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work."
So it seems the benchmark is flawed, not that BERT and friends are that good at text comprehension. And that is not surprising. It would be surprising if language understanding just arose spontaneously from training a very big net on a very big corpus.
Referencing that paper is a red herring. The flaw is in one specific dataset, not in BERT or any model. Everything GP posted about BERT models answering questions, performing coreference resolution, etc., is still true and is not at all affected by the flaw in this one dataset (out of many that these models have been tested on). Heck, you can just try out any of these models yourself on completely novel questions you come up with, and see that they work.
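As a rough sketch of what "try it yourself" looks like, assuming the Hugging Face transformers library and a public RoBERTa checkpoint fine-tuned on SQuAD2 (deepset/roberta-base-squad2); swap in any question and context you like:

    # Sketch: extractive question answering over a context you supply.
    from transformers import pipeline

    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    context = (
        "RoBERTa is a robustly optimized BERT pretraining approach released "
        "by Facebook AI in 2019. It was trained on more data and for longer "
        "than the original BERT."
    )
    result = qa(question="Who released RoBERTa?", context=context)
    print(result["answer"], result["score"])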