
This looks like a big deal to me:

1. First of all, the authors successfully trained a model with 175 BILLION PARAMETERS. The previous largest model in the literature, Google’s T5, had "only" 11 billion. With Float32 representations, GPT-3-175B's weights alone would occupy ~700GB of memory (175 billion params × 4 bytes/param). A figure in the hundreds of billions is still 3 orders of magnitude smaller than the hundreds of trillions of synapses in the human brain [a], but consider this: models with trillions of weights are suddenly looking... achievable.

2. The model achieves competitive results on many NLP tasks and benchmarks WITHOUT FINETUNING. Let me repeat that: there is no finetuning. There is only unsupervised (i.e., autoregressive) pretraining. For each downstream NLP task or benchmark, the pretrained model is given text instructions, and possibly sample text with questions and answers. The NLP tasks on which the model was tested include translation, question-answering, cloze tasks, unscrambling words, using novel words in sentences, and performing 3-digit arithmetic.

3. The model is tested only in a ZERO-SHOT or FEW-SHOT setting. In other words, for each NLP task, the pretrained model is given text instructions with zero examples, or text instructions with a small number of examples (typically 10 to 100); a rough sketch of what such a prompt looks like appears at the end of this comment. As with human beings, GPT-3-175B doesn't need lots of examples to perform competitively on novel NLP tasks.

4. The results reported by this paper on all NLP tasks and benchmarks should be seen as a BASELINE. These results likely could be meaningfully improved with conventional finetuning.

5. The model’s text generation FOOLS HUMAN BEINGS, without having to cherry-pick examples.

--

[a] https://www.google.com/search?q=number+of+synapses+in+human+...
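
To make "text instructions plus a few examples" concrete, here is a rough sketch of what a few-shot prompt might look like for one of these tasks (my own illustration of the format described in the paper, not a verbatim prompt from it), in Python:

  # Few-shot 3-digit arithmetic: the entire "task specification" is just
  # this string, which the pretrained model is asked to continue.
  prompt = (
      "Q: What is 348 plus 517?\n"
      "A: 865\n"
      "Q: What is 612 plus 259?\n"
      "A: 871\n"
      "Q: What is 423 plus 189?\n"
      "A: "
  )
  # No gradient updates and no finetuning: whatever the model writes next
  # (ideally "612") is read off as its answer.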



I'll wait for a working interactive model before blindly believing these statements. GPT-2 was hyped through the roof, but when inspected with a critical eye it demonstrated glitches that told us more about how it actually works than the "good" examples did:

https://medium.com/@VictorBanev/interrogating-full-gpt-2-10a...

ML models should be pushed to their limit, because that's where you gather most useful information about what they actually do. Their results need to be critically examined with both exploratory and hypothesis-driven testing. And yet this is never done in initial papers and rarely done afterwards.

What was the last AI paper you've read that said "and here is a list of things our model failed at"?


That's a very sloppy post. He runs a single example, without even running the model locally or changing the sampling parameters, and then concludes that GPT-2 is doing nothing but pattern-matching? A lot of people underestimate NNs because the sampling from them (top-k! how much dumber and cruder can you get? nucleus sampling works better, but is still obviously suboptimal) destroys a lot of dark knowledge. I noticed this with Gary Marcus's claims about GPT-2 too: he would try once, without changing any sampling settings, and conclude that it wasn't doing anything, but if you tried, you would get different results. I'm not the only one to notice that: https://www.quantamagazine.org/common-sense-comes-to-compute... Such tests can prove the presence of knowledge, but not the absence... And of course, GPT-3 does extensive arithmetic: https://arxiv.org/pdf/2005.14165.pdf#page=22
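
For anyone unfamiliar with the sampling issue being described: here is a minimal numpy sketch (my own illustration with a toy distribution, not code from either post) contrasting top-k and nucleus (top-p) truncation of a next-token distribution.

  import numpy as np

  def top_k_filter(probs, k):
      """Keep only the k most probable tokens, then renormalize."""
      out = np.zeros_like(probs)
      top = np.argsort(probs)[-k:]          # indices of the k largest probabilities
      out[top] = probs[top]
      return out / out.sum()

  def nucleus_filter(probs, p):
      """Keep the smallest set of tokens whose cumulative probability reaches p."""
      order = np.argsort(probs)[::-1]       # most probable first
      cum = np.cumsum(probs[order])
      cutoff = np.searchsorted(cum, p) + 1  # how many tokens to keep
      out = np.zeros_like(probs)
      keep = order[:cutoff]
      out[keep] = probs[keep]
      return out / out.sum()

  # Toy next-token distribution over a 6-word vocabulary.
  probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

  print(top_k_filter(probs, 2))      # only the 2 most probable tokens survive
  print(nucleus_filter(probs, 0.9))  # keeps however many tokens cover 90% of the mass

Both throw away the tail of the distribution before sampling, which is the "dark knowledge" the parent comment says gets destroyed.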


The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves. This is more than I can say about most chatter about ML.

>Such tests can prove the presence of knowledge, but not the absence...

This sounds like a setup for non-falsifiable beliefs.


> The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves.

And I did (using my own local GPT-2-1.5b install which let me set the hyperparameters rather than restricting it to inappropriate hardwired ones of an online service), I linked to another person demonstrating the same thing, I pointed out the extensive GPT-3 evaluation OA did, and here, have another link about how bad querying of language models leads to highly misleading results about how much they know: https://arxiv.org/abs/1911.12543 Measurement error in general biases estimates towards zero.
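
To make the bad-querying point concrete, here is a small sketch of prompt-sensitivity probing (my own illustration, not code from the linked paper; it assumes the Hugging Face transformers library and a BERT-style masked LM rather than GPT):

  from transformers import pipeline

  # A masked-LM knowledge probe in the spirit of the linked paper.
  fill = pipeline("fill-mask", model="bert-base-uncased")

  # Several paraphrases of the same query. A model can fail on one wording
  # and succeed on another, so judging it by a single hand-written prompt
  # underestimates what it has actually stored.
  prompts = [
      "Barack Obama was born in [MASK].",
      "The birthplace of Barack Obama is [MASK].",
      "Barack Obama is originally from [MASK].",
  ]

  for prompt in prompts:
      best = fill(prompt)[0]  # highest-scoring completion
      print(f"{prompt!r}: {best['token_str']} ({best['score']:.2f})")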

> This sounds like a setup for non-falsifiable beliefs.

It's just as non-falsifiable as, say, concepts like 'lower bounds' or 'bugs'.


The paper you link to claims that hand-crafted queries used to evaluate the knowledge and understanding of language models are "sub-optimal" because they do not take into account the context in which a LM was trained. For example:

  These manually created prompts (e.g. “Barack Obama was born in _”) might be
  sub-optimal because LMs might have learned target knowledge from
  substantially different contexts (e.g. “The birth place of Barack Obama is
  Honolulu, Hawaii.”) during their training. 
In other words, the paper considers hand-crafted prompts like in the example to be "sub-optimal" because they are not in the right format. To paraphrase them a bit, such prompts are like making a mis-formed query to a database.

It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.

To be fair, the ability to return a correct answer given a question in the right format is not without use. That, indeed, is how databases work. But it shows none of the "understanding" or "knowledge" the paper claims is acquired by Language Models.


> It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.

To use your database analogy, in what sense should we claim a database doesn't know a record when you are using a malformed SQL query? If we fixed the query and it emitted the right answer, then obviously it did store the information. The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way. Since LMs can get much better results just by tailoring the prompts (increased by a third in that paper! and there's no reason to think that that is the very best possible performance either!), that shows that existing practices drastically underestimate what knowledge the model has been able to learn. Learning about the real world or text is very different from learning your particular dumb broken query method.


The problem is that nobody claims that databases "know" anything. They store data. Data can be retrieved from storage. That's all they do.

>> The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way.

Oh, yes, absolutely. A query encodes the answer. Queries are patterns that are matched against the data stored in the database. If a query fails, it's because it does not correctly represent the information it is trying to retrieve. For example, if I run SELECT * FROM PEOPLE and there is no table "PEOPLE", then I don't get an answer, because the query does not correctly represent the structure of the database. You cannot retrieve any data from a database unless you have some idea about the structure of that data.
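
As a concrete illustration of that point (my own toy example, using Python's built-in sqlite3 module): the same stored fact is either retrievable or not depending purely on whether the query matches the database's actual structure.

  import sqlite3

  # In-memory database with one table and one row.
  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE people (name TEXT, birthplace TEXT)")
  db.execute("INSERT INTO people VALUES ('Barack Obama', 'Honolulu')")

  # A query that matches the structure retrieves the stored fact.
  row = db.execute(
      "SELECT birthplace FROM people WHERE name = 'Barack Obama'"
  ).fetchone()
  print(row)  # ('Honolulu',)

  # A query against a table that does not exist fails outright,
  # even though the fact itself is sitting in storage.
  try:
      db.execute("SELECT * FROM persons")
  except sqlite3.OperationalError as e:
      print(e)  # no such table: persons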

But that's not the point here. I don't disagree that a language model can learn (i.e. it can represent some elements of its training dataset). I disagree that it "understands" anything and I find the fact that it needs specific queries to retrieve the data it is representing to be evidence that it does not.

And so it's not more useful than a traditional database at this kind of task. Except it's much less precise than a traditional database and costs considerably more to create.

>> Learning about the real world or text is very different from learning your particular dumb broken query method.

I'm sorry, I don't understand what you mean here. What is my "particular dumb broken query method"? Is that meant as a personal attack?


The last AI paper I read that has a list of things the model failed at is this one:

https://news.ycombinator.com/item?id=23345379

See Section 5, titled "Limitations"


> still 3 orders of magnitude smaller than the 100’s of trillions of synapses in the human brain

Wow, that is WAY closer than I thought we were.


It's not clear whether a parameter in a neural network maps cleanly onto a synapse in a biological brain.


I think it's becoming pretty clear that they don't. First, scientists uncovered many additional ways neurons interact with one another[1]. Second, it seems that individual neurons do way more computing than in the simplistic ANN models [2].

[1]: https://en.wikipedia.org/wiki/Ephaptic_coupling

[2]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5740076/


It’s very clear that they don’t.


I remember reading about fruit fly brains and how we're at a point where we can computationally simulate them now, but I'm not sure where that went.

Anyone know?



What do you mean by text instructions? If I want to translate a sentence, would I just feed the model something like: translate "Hello world"?


See page 7 of the paper. You give the model an instruction such as "Translate from X to Y", then you pass examples (if you go for few-shot), followed by the sentence you want to translate.
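
Roughly, the whole few-shot "input" is just one long string of text, along the lines of the English-to-French example on page 7 (paraphrased here, not copied verbatim):

  # A few-shot translation prompt: a task description, a couple of
  # demonstrations, and then the unfinished example the model completes.
  prompt = (
      "Translate English to French:\n"
      "sea otter => loutre de mer\n"
      "cheese => fromage\n"
      "Hello world => "
  )
  # The model is simply asked to continue this text; its continuation
  # is taken as the translation.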


AFAIK they used half-precision (Float16)


Thanks. I should have written "if using Float32," which is what I meant -- instead of "with Float32," which in hindsight reads a bit ambiguous. But regardless of which floating-point representation is used, the number of weights is still in the hundreds of billions... which is insane.
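
For concreteness, the back-of-the-envelope arithmetic (just params × bytes/param, ignoring optimizer state and activations):

  # Rough memory footprint of the weights alone.
  params = 175e9  # ~175 billion parameters

  for name, bytes_per_param in [("Float32", 4), ("Float16", 2)]:
      gigabytes = params * bytes_per_param / 1e9
      print(f"{name}: ~{gigabytes:.0f} GB")

  # Float32: ~700 GB
  # Float16: ~350 GB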



