I don't mean to sound critical, just adding some information: deep learning models like RoBERTa (a BERT variant; I experimented with Facebook's just-released large model this morning) can perform anaphora resolution (coreference), answer general questions, score whether two sentences contradict each other, and so on. One model solves several very difficult problems whose solutions have evaded hand-coding efforts for decades.
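For anyone who wants to poke at this, here is a minimal sketch of the contradiction-scoring piece, assuming the Hugging Face transformers library and the public roberta-large-mnli checkpoint (illustrative only, not the exact setup I ran this morning):

    # Sketch: score whether two sentences contradict each other using a
    # RoBERTa model fine-tuned on MNLI. Assumes `pip install torch transformers`.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    premise = "The meeting was moved to Friday."
    hypothesis = "The meeting still takes place on Monday."

    # Encode the sentence pair and run one forward pass.
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Label order for roberta-large-mnli: contradiction, neutral, entailment.
    probs = torch.softmax(logits, dim=-1)[0]
    for label, p in zip(["contradiction", "neutral", "entailment"], probs):
        print(f"{label}: {p:.3f}")

Question answering and coreference work the same way in principle: the same pretrained encoder with a small task-specific head on top.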
There are experimental sequence models that transform paper text into figures, and joint models that transform figures and text into code, but you are correct that these are not production-ready.
From the abstract of "Probing Neural Network Comprehension of Natural Language Arguments" (Niven & Kao, ACL 2019):

"We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work."
So it seems the benchmark is flawed, not that BERT and friends are that good at text comprehension. And that is not surprising. It would be surprising if language understanding just arose spontaneously from training a very big net on a very big corpus.
Referencing that paper is a red herring. The flaw is in one specific dataset, not in BERT or any model. Everything GP posted about BERT models answering questions, performing coreference resolution, etc., is still true and is not at all affected by the flaw in this one dataset (out of many that these models have been tested on). Heck, you can just try out any of these models yourself on completely novel questions you come up with, and see that they work.
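As a rough sketch of what "try it yourself" looks like, assuming the Hugging Face transformers library and a public RoBERTa checkpoint fine-tuned on SQuAD2 (deepset/roberta-base-squad2); swap in any question and context you like:

    # Sketch: extractive question answering over a context you supply.
    from transformers import pipeline

    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    context = (
        "RoBERTa is a robustly optimized BERT pretraining approach released "
        "by Facebook AI in 2019. It was trained on more data and for longer "
        "than the original BERT."
    )
    result = qa(question="Who released RoBERTa?", context=context)
    print(result["answer"], result["score"])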