
This looks like a big deal to me:

1. First of all, the authors successfully trained a model with 175 BILLION PARAMETERS. The previous largest model in the literature, Google’s T5, had "only" 11 billion. With Float32 representations, GPT-3-175B's weights alone would occupy ~700GB of memory (175 billion params × 4 bytes/param). A figure in the hundreds of billions is still 3 orders of magnitude smaller than the hundreds of trillions of synapses in the human brain [a], but consider this: models with trillions of weights are suddenly looking... achievable.

2. The model achieves competitive results on many NLP tasks and benchmarks WITHOUT FINETUNING. Let me repeat that: there is no finetuning. There is only unsupervised (i.e., autoregressive) pretraining. For each downstream NLP task or benchmark, the pretrained model is given text instructions, and possibly sample text with questions and answers. The NLP tasks on which the model was tested include translation, question-answering, cloze tasks, unscrambling words, using novel words in sentences, and performing 3-digit arithmetic.

3. The model is tested only in a ZERO-SHOT or FEW-SHOT setting. In other words, for each NLP task, the pretrained model is given text instructions with zero examples, or text instructions with a small number of examples (typically 10 to 100); a rough sketch of what such a prompt looks like appears at the end of this comment. As with human beings, GPT-3-175B doesn't need lots of examples to perform competitively on novel NLP tasks.

4. The results reported by this paper on all NLP tasks and benchmarks should be seen as a BASELINE. These results likely could be meaningfully improved with conventional finetuning.

5. The model’s text generation FOOLS HUMAN BEINGS, without having to cherry-pick examples.

--

[a] https://www.google.com/search?q=number+of+synapses+in+human+...
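
To make "text instructions plus a few examples" concrete, here is a rough sketch of what a few-shot prompt might look like for one of these tasks (my own illustration of the format described in the paper, not a verbatim prompt from it), in Python:

  # Few-shot 3-digit arithmetic: the entire "task specification" is just
  # this string, which the pretrained model is asked to continue.
  prompt = (
      "Q: What is 348 plus 517?\n"
      "A: 865\n"
      "Q: What is 612 plus 259?\n"
      "A: 871\n"
      "Q: What is 423 plus 189?\n"
      "A: "
  )
  # No gradient updates and no finetuning: whatever the model writes next
  # (ideally "612") is read off as its answer.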



I'll wait for a working interactive model before blindly believing these statements. GPT-2 was hyped through the roof, but when inspected with a critical eye it demonstrated glitches that told us more about how it actually works than the "good" examples did:

https://medium.com/@VictorBanev/interrogating-full-gpt-2-10a...

ML models should be pushed to their limit, because that's where you gather most useful information about what they actually do. Their results need to be critically examined with both exploratory and hypothesis-driven testing. And yet this is never done in initial papers and rarely done afterwards.

What was the last AI paper you've read that said "and here is a list of things our model failed at"?


That's a very sloppy post. He runs a single example, without even running the model locally or changing the sampling parameters, and then concludes that GPT-2 is doing nothing but pattern-matching? A lot of people underestimate NNs because the sampling from them (top-k! how much dumber and cruder can you get? nucleus sampling works better, but is still obviously suboptimal) destroys a lot of dark knowledge. I noticed this with Gary Marcus's claims about GPT-2 too: he would try once, without changing any sampling settings, and conclude that it wasn't doing anything, but if you tried, you would get different results. I'm not the only one to notice that: https://www.quantamagazine.org/common-sense-comes-to-compute... Such tests can prove the presence of knowledge, but not the absence... And of course, GPT-3 does extensive arithmetic: https://arxiv.org/pdf/2005.14165.pdf#page=22
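
For anyone unfamiliar with the sampling issue being described: here is a minimal numpy sketch (my own illustration with a toy distribution, not code from either post) contrasting top-k and nucleus (top-p) truncation of a next-token distribution.

  import numpy as np

  def top_k_filter(probs, k):
      """Keep only the k most probable tokens, then renormalize."""
      out = np.zeros_like(probs)
      top = np.argsort(probs)[-k:]          # indices of the k largest probabilities
      out[top] = probs[top]
      return out / out.sum()

  def nucleus_filter(probs, p):
      """Keep the smallest set of tokens whose cumulative probability reaches p."""
      order = np.argsort(probs)[::-1]       # most probable first
      cum = np.cumsum(probs[order])
      cutoff = np.searchsorted(cum, p) + 1  # how many tokens to keep
      out = np.zeros_like(probs)
      keep = order[:cutoff]
      out[keep] = probs[keep]
      return out / out.sum()

  # Toy next-token distribution over a 6-word vocabulary.
  probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

  print(top_k_filter(probs, 2))      # only the 2 most probable tokens survive
  print(nucleus_filter(probs, 0.9))  # keeps however many tokens cover 90% of the mass

Both throw away the tail of the distribution before sampling, which is the "dark knowledge" the parent comment says gets destroyed.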


The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves. This is more than I can say about most chatter about ML.

>Such tests can prove the presence of knowledge, but not the absence...

This sounds like a setup for non-falsifiable beliefs.


> The article I linked to makes claims only about the model it tests and since it actually links to an online implementation, anyone can try to reproduce the results and see for themselves.

And I did (using my own local GPT-2-1.5b install which let me set the hyperparameters rather than restricting it to inappropriate hardwired ones of an online service), I linked to another person demonstrating the same thing, I pointed out the extensive GPT-3 evaluation OA did, and here, have another link about how bad querying of language models leads to highly misleading results about how much they know: https://arxiv.org/abs/1911.12543 Measurement error in general biases estimates towards zero.
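
To make the bad-querying point concrete, here is a small sketch of prompt-sensitivity probing (my own illustration, not code from the linked paper; it assumes the Hugging Face transformers library and a BERT-style masked LM rather than GPT):

  from transformers import pipeline

  # A masked-LM knowledge probe in the spirit of the linked paper.
  fill = pipeline("fill-mask", model="bert-base-uncased")

  # Several paraphrases of the same query. A model can fail on one wording
  # and succeed on another, so judging it by a single hand-written prompt
  # underestimates what it has actually stored.
  prompts = [
      "Barack Obama was born in [MASK].",
      "The birthplace of Barack Obama is [MASK].",
      "Barack Obama is originally from [MASK].",
  ]

  for prompt in prompts:
      best = fill(prompt)[0]  # highest-scoring completion
      print(f"{prompt!r}: {best['token_str']} ({best['score']:.2f})")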

> This sounds like a setup for non-falsifiable beliefs.

It's just as non-falsifiable as, say, concepts like 'lower bounds' or 'bugs'.


The paper you link to claims that hand-crafted queries used to evaluate the knowledge and understanding of language models are "sub-optimal" because they do not take into account the context in which a LM was trained. For example:

  These manually created prompts (e.g. “Barack Obama was born in _”) might be
  sub-optimal because LMs might have learned target knowledge from
  substantially different contexts (e.g. “The birth place of Barack Obama is
  Honolulu, Hawaii.”) during their training. 
In other words, the paper considers hand-crafted prompts like in the example to be "sub-optimal" because they are not in the right format. To paraphrase them a bit, such prompts are like making a mis-formed query to a database.

It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.

To be fair, the ability to return a correct answer given a question in the right format is not without use. That, indeed, is how databases work. But it shows none of the "understanding" or "knowledge" the paper claims is acquired by Language Models.


> It is difficult to see how this is an argument for the ability of LMs to demonstrate "understanding". Imagine asking a child: "how much is 4+2?" and getting a correct answer; then asking "how much is 2+4?" and getting a wrong answer. Most people would probably not take that as evidence that the second question was "wrong". They would instead conclude that the child does not "understand" addition and has only learned to reproduce specific answers to specific questions.

To use your database analogy, in what sense should we claim a database doesn't know a record when you are using a malformed SQL query? If we fixed the query and it emitted the right answer, then obviously it did store the information. The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way. Since LMs can get much better results just by tailoring the prompts (increased by a third in that paper! and there's no reason to think that that is the very best possible performance either!), that shows that existing practices drastically underestimate what knowledge the model has been able to learn. Learning about the real world or text is very different from learning your particular dumb broken query method.


The problem is that nobody claims that databases "know" anything. They store data. Data can be retrieved from storage. That's all they do.

>> The query does not encode the answer, and it is vanishingly unlikely that the database would simply accidentally return the right answer ever if it did not store the information in some way.

Oh, yes, absolutely. A query encodes the answer. Queries are patterns that are matched against the data stored in the database. If a query fails, it's because it does not correctly represent the information it is trying to retrieve. For example, if I run SELECT * FROM PEOPLE and there is no table "PEOPLE", then I don't get an answer, because the query does not correctly represent the structure of the database. You cannot retrieve any data from a database unless you have some idea about the structure of that data.
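
As a concrete illustration of that point (my own toy example, using Python's built-in sqlite3 module): the same stored fact is either retrievable or not depending purely on whether the query matches the database's actual structure.

  import sqlite3

  # In-memory database with one table and one row.
  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE people (name TEXT, birthplace TEXT)")
  db.execute("INSERT INTO people VALUES ('Barack Obama', 'Honolulu')")

  # A query that matches the structure retrieves the stored fact.
  row = db.execute(
      "SELECT birthplace FROM people WHERE name = 'Barack Obama'"
  ).fetchone()
  print(row)  # ('Honolulu',)

  # A query against a table that does not exist fails outright,
  # even though the fact itself is sitting in storage.
  try:
      db.execute("SELECT * FROM persons")
  except sqlite3.OperationalError as e:
      print(e)  # no such table: persons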

But that's not the point here. I don't disagree that a language model can learn (i.e. it can represent some elements of its training dataset). I disagree that it "understands" anything and I find the fact that it needs specific queries to retrieve the data it is representing to be evidence that it does not.

And so it's not more useful than a traditional database at this kind of task. Except it's much less precise than a traditional database and costs considerably more to create.

>> Learning about the real world or text is very different from learning your particular dumb broken query method.

I'm sorry, I don't understand what you mean here. What is my "particular dumb broken query method"? Is that meant as a personal attack?


The last AI paper I read that has a list of things the model failed at is this one:

https://news.ycombinator.com/item?id=23345379

See Section 5, titled "Limitations"


> still 3 orders of magnitude smaller than the 100’s of trillions of synapses in the human brain

Wow, that is WAY closer than I thought we were.


It's not clear whether a parameter in a neural network maps cleanly onto a synapse in a biological brain.


I think it's becoming pretty clear that they don't. First, scientists uncovered many additional ways neurons interact with one another[1]. Second, it seems that individual neurons do way more computing than in the simplistic ANN models [2].

[1]: https://en.wikipedia.org/wiki/Ephaptic_coupling

[2]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5740076/


It’s very clear that they don’t.


I remember reading about fruit fly brains and how we're at a point where we can computationally simulate them now, but I'm not sure where that went.

Anyone know?



What do you mean by text instructions? If I want to translate a sentence, would I just feed the model something like: translate "Hello world"?


See page 7 of the paper. You give the model an instruction such as "Translate from X to Y", then you pass examples (if you go for few-shot), followed by the sentence you want to translate.
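
Roughly, the whole few-shot "input" is just one long string of text, along the lines of the English-to-French example on page 7 (paraphrased here, not copied verbatim):

  # A few-shot translation prompt: a task description, a couple of
  # demonstrations, and then the unfinished example the model completes.
  prompt = (
      "Translate English to French:\n"
      "sea otter => loutre de mer\n"
      "cheese => fromage\n"
      "Hello world => "
  )
  # The model is simply asked to continue this text; its continuation
  # is taken as the translation.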


AFAIK they used half-precision (Float16)


Thanks. I should have written "if using Float32," which is what I meant -- instead of "with Float32," which in hindsight reads a bit ambiguous. But regardless of which floating-point representation is used, the number of weights is still in the hundreds of billions... which is insane.
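
For concreteness, the back-of-the-envelope arithmetic (just params × bytes/param, ignoring optimizer state and activations):

  # Rough memory footprint of the weights alone.
  params = 175e9  # ~175 billion parameters

  for name, bytes_per_param in [("Float32", 4), ("Float16", 2)]:
      gigabytes = params * bytes_per_param / 1e9
      print(f"{name}: ~{gigabytes:.0f} GB")

  # Float32: ~700 GB
  # Float16: ~350 GB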



