
This is the 2023 take on LLMs. It still gets repeated a lot. But it doesn’t really hold up anymore - it’s more complicated than that. Don’t let some factoid about how they are pretrained on autocomplete-like next token prediction fool you into thinking you understand what is going on in that trillion parameter neural network.

Sure, LLMs do not think like humans and they may not have human-level creativity. Sometimes they hallucinate. But they can absolutely solve new problems that aren’t in their training set, e.g. some rather difficult problems on the last Mathematical Olympiad. They don’t just regurgitate remixes of their training data. If you don’t believe this, you really need to spend more time with the latest SotA models like Opus 4.5 or Gemini 3.

Nontrivial emergent behavior is a thing. It will only get more impressive. That doesn’t make LLMs like humans (and we shouldn’t anthropomorphize them) but they are not “autocomplete on steroids” anymore either.





> Don’t let some factoid about how they are pretrained on autocomplete-like next token prediction fool you into thinking you understand what is going on in that trillion parameter neural network.

This is just an appeal to complexity, not a rebuttal to the critique of likening an LLM to a human brain.

> they are not “autocomplete on steroids” anymore either.

Yes, they are. The steroids are just even more powerful. By refining training data quality, increasing parameter size, and increasing context length we can squeeze more utility out of LLMs than ever before, but ultimately, Opus 4.5 is the same thing as GPT2, it's only that coherence lasts a few pages rather than a few sentences.


> ultimately, Opus 4.5 is the same thing as GPT2, it's only that coherence lasts a few pages rather than a few sentences.

This tells me that you haven't really used Opus 4.5 at all.


First, this is completely ignoring text diffusion and nano banana.

Second, to autocomplete the name of the killer in a detective novel outside the training set requires following the plot and at least some understanding of it.


This would be true if all training were based on sentence completion. But training involving RLHF and RLAIF is increasingly important, isn't it?

Reinforcement learning is a technique for adjusting weights, but it does not alter the architecture of the model. No matter how much RL you do, you still retain all the fundamental limitations of next-token prediction (e.g. context exhaustion, hallucinations, prompt-injection vulnerability, etc.).
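The "RL only adjusts weights" claim can be sketched in miniature. Below is a hypothetical toy next-token model (stdlib-only, nothing like a real architecture): a cross-entropy pretraining step and a REINFORCE-style RL step modify exactly the same weight matrix through exactly the same forward pass.

```python
import math
import random

random.seed(0)

VOCAB = 4
# Toy "model": one weight row per single-token context. Purely illustrative;
# only the shape of the two update rules matters here.
W = [[random.gauss(0, 0.1) for _ in range(VOCAB)] for _ in range(VOCAB)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_probs(ctx):
    # Forward pass: identical regardless of how the weights were trained.
    return softmax(W[ctx])

def ce_update(ctx, target, lr=0.1):
    # Pretraining-style cross-entropy step: push probability toward the
    # observed next token.
    p = next_token_probs(ctx)
    for i in range(VOCAB):
        W[ctx][i] -= lr * (p[i] - (1.0 if i == target else 0.0))

def rl_update(ctx, sampled, reward, lr=0.1):
    # REINFORCE-style step: same weights, same forward pass, but the
    # gradient is scaled by a scalar reward instead of a ground-truth label.
    p = next_token_probs(ctx)
    for i in range(VOCAB):
        W[ctx][i] += lr * reward * ((1.0 if i == sampled else 0.0) - p[i])
```

Both updates touch the same `W`; RL changes what the weights encode, not what the network is.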

You've confused yourself. Those problems are not fundamental to next token prediction, they are fundamental to reconstruction losses on large general text corpora.

That is to say, they are equally likely if you don't do next-token prediction at all and instead do text diffusion or something. Architecture has nothing to do with it. They arise because they are early partial solutions to the reconstruction task on 'all the text ever made'. The reconstruction task doesn't care much about truthiness until very late in the loss curve (which we will probably never reach), so hallucinations are almost as good for a very long time.

RL as is typical in post-training _does not share those early solutions_, and so does not share the fundamental problems. RL (in this context) has its own share of problems which are different, such as reward hacks like: reliance on meta signaling (# Why X is the correct solution, the honest answer ...), lying (commenting out tests), manipulation (You're absolutely right!), etc. Anything to make the human press the upvote button or make the test suite pass at any cost or whatever.

With that said, RL post-trained models _inherit_ the problems of non-optimal large corpora reconstruction solutions, but they don't introduce more or make them worse in a directed manner or anything like that. There's no reason to think them inevitable, and in principle you can cut away the garbage with the right RL target.

Thinking about architecture at all (autoregressive CE, RL, transformers, etc) is the wrong level of abstraction for understanding model behavior: instead, think about loss surfaces (large corpora reconstruction, human agreement, test suites passing, etc) and what solutions exist early and late in training for them.


> This is just an appeal to complexity, not a rebuttal to the critique of likening an LLM to a human brain

I wasn’t arguing that LLMs are like a human brain. Of course they aren’t. I said twice in my original post that they aren’t like humans. But “like a human brain” and “autocomplete on steroids” aren’t the only two choices here.

As for appealing to complexity, well, let’s call it more like an appeal to humility in the face of complexity. My basic claim is this:

1) It is a trap to reason from model architecture alone to make claims about what LLMs can and can’t do.

2) The specific version of this in GP that I was objecting to was: LLMs are just transformers that do next token prediction, therefore they cannot solve novel problems and just regurgitate their training data. This is provably true or false, if we agree on a reasonable definition of novel problems.

The reason I believe this is that back in 2023 I (like many of us) used LLM architecture to argue that LLMs had all sorts of limitations around the kind of code they could write, the tasks they could do, the math problems they could solve. At the end of 2025, SotA LLMs have refuted most of these claims by being able to do the tasks I thought they'd never be able to do. That was a big surprise to a lot of us in the industry. It still surprises me every day. The facts changed, and I changed my opinion.

So I would ask you: what kind of task do you think LLMs aren’t capable of doing, reasoning from their architecture?

I was also going to mention RL, as I think that is the key differentiator that makes the “knowledge” in the SotA LLMs right now qualitatively different from GPT2. But other posters already made that point.

This topic arouses strong reactions. I already had one poster (since apparently downvoted into oblivion) accuse me of “magical thinking” and “LLM-induced-psychosis”! And I thought I was just making the rather uncontroversial point that things may be more complicated than we all thought in 2023. For what it’s worth, I do believe LLMs probably have limitations (like they’re not going to lead to AGI and are never going to do mathematics like Terence Tao) and I also think we’re in a huge bubble and a lot of people are going to lose their shirts. But I think we all owe it to ourselves to take LLMs seriously as well. Saying “Opus 4.5 is the same thing as GPT2” isn’t really a pathway to do that, it’s just a convenient way to avoid grappling with the hard questions.


This ignores that reinforcement learning radically changes the training objective

But... and I am not asking this for giggles... does it mean humans are giant autocomplete machines?

Not at all. Why would it?

Call it a... thought experiment about the question of scale.

I'm not exactly sure what you mean. Could you please elaborate further?

Not the person you're responding to, but I think there's a non-trivial argument to be made that our thoughts are just autocomplete. What is the next most likely word, based on what you're seeing? Ever watched a movie and guessed the plot? Or read a comment and known where it was going by the end?

And I know not everyone thinks in a literal stream of words all the time (I do) but I would argue that those people's brains are just using a different "token"


There's no evidence for it, nor any explanation for why it should be the case from a biological perspective. Tokens are an artifact of computer science that have no reason to exist inside humans. Human minds don't need a discrete dictionary of reality in order to model it.

Prior to LLMs, there was never any suggestion that thoughts work like autocomplete, but now people are working backwards from that conclusion based on metaphorical parallels.


There actually was quite a lot of suggestion that thoughts work like autocomplete. A lot of it was just considered niche, e.g. because the mathematical formalisms were beyond what most psychologists or even cognitive scientists would deem useful.

Predictive coding theory was formalized back around 2010 and traces its roots to theories by Helmholtz from the 1860s.

Predictive coding theory postulates that our brains are just very strong prediction machines, with multiple layers of predictive machinery, each layer predicting the activity of the one below it.


There are so many theories regarding human cognition that you can certainly find something that is close to "autocomplete". A Hopfield network, for example.

Roots of predictive coding theory extend back to 1860s.

Natalia Bekhtereva was writing about compact concept representations in the brain akin to tokens.


> There are so many theories regarding human cognition that you can certainly find something that is close to "autocomplete"

Yes, you can draw interesting parallels between anything when you're motivated to do so. My point is that this isn't parsimonious reasoning, it's working backwards from a conclusion and searching for every opportunity to fit the available evidence into a narrative that supports it.

> Roots of predictive coding theory extend back to 1860s.

This is just another example of metaphorical parallels overstating meaningful connections. Just because next-token-prediction and predictive coding have the word "predict" in common doesn't mean the two are at all related in any practical sense.


<< There's no evidence for it

Fascinating framing. What would you consider evidence here?


You, and OP, are taking an analogy way too far. Yes, humans have the mental capability to predict words similar to autocomplete, but obviously this is just one out of a myriad of mental capabilities typical humans have, which work regardless of text. You can predict where a ball will go if you throw it, you can reason about gravity, and so much more. It’s not just apples to oranges, not even apples to boats, it’s apples to intersubjective realities.

I don't think I am. To be honest, as ideas go, swirling it around that empty head of mine, this one ain't half bad, given how much immediate resistance it generates.

Other posters already noted other reasons for it, but I will note that you are saying 'similar to autocomplete, but obviously', which suggests you recognize the shape and immediately dismiss it as not the same, because the shape you know in humans is much more evolved and can do more things. Ngl man, as arguments go, that sounds to me like supercharged autocomplete that was allowed to develop over a number of years.


Fair enough. To someone with a background in biology, it sounds like an argument made by a software engineer with no actual knowledge of cognition, psychology, biology, or any related field, jumping to misled conclusions driven only by shallow insights and their own experience in computer science.

Or in other words, this thread sure attracts a lot of armchair experts.


> with no actual knowledge of cognition, psychology, biology

... but we also need to be careful with that assertion, because humans do not understand cognition, psychology, or biology very well.

Biology is the furthest developed, but it turns out to be like physics -- superficially and usefully modelable, but fundamental mysteries remain. We have no idea how complete our models are, but they work pretty well in our standard context.

If computer engineering is downstream from physics, and cognition is downstream from biology ... well, I just don't know how certain we can be about much of anything.

> this thread sure attracts a lot of armchair experts.

"So we beat on, boats against the current, borne back ceaselessly into our priors..."


Look up predictive coding theory. According to that theory, what our brain does is in fact just autocomplete.

However, what it is doing is layered autocomplete on itself. I.e. one part is trying to predict what the other part will be producing and training itself on this kind of prediction.

What emerges from this layered level of autocompletes is what we call thought.
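The "layered autocomplete" idea above can be sketched very crudely. This is a hypothetical illustration only, not a faithful predictive-coding implementation: each layer keeps a prediction of the signal below it, nudges that prediction toward reducing its own error, and passes only the residual upward.

```python
def predictive_step(signal, predictions, lr=0.5):
    # One pass up a stack of predictors. Each layer compares its prediction
    # to its input, moves the prediction toward the input, and hands only
    # the leftover error (the "surprise") to the layer above.
    errors = []
    x = signal
    for i, pred in enumerate(predictions):
        err = x - pred
        predictions[i] = pred + lr * err
        errors.append(err)
        x = err  # higher layers only ever see what lower layers missed
    return errors
```

Repeated exposure to the same signal drives the errors toward zero, i.e. the stack comes to "expect" it.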


First: a selection mechanism is just a selection mechanism, and it shouldn't be confused with the emergent, tangential capabilities it gives rise to.

You probably believe that humans have something called intelligence, but the pressure that produced it - the likelihood of specific genetic material replicating - is far more tangential to intelligence than next-token prediction is.

I doubt many alien civilizations would look at us and say "not intelligent - they're just genetic information replication on steroids".

Second: modern models also undergo a ton of post-training now: RLHF, fine-tuning on specific use cases, etc. It's just not correct that the token-prediction loss function is "the whole thing".


> First: a selection mechanism is just a selection mechanism, and it shouldn't confuse the observation of an emergent, tangential capabilities.

Invoking terms like "selection mechanism" is begging the question because it implicitly likens next-token-prediction training to natural selection, but in reality the two are so fundamentally different that the analogy only has metaphorical meaning. Even at a conceptual level, gradient descent gradually homing in on a known target is comically trivial compared to the blind filter of natural selection sorting out the chaos of chemical biology. It's like comparing Legos to DNA.

> Second: modern models also under go a ton of post-training now. RLHF, mechanized fine-tuning on specific use cases, etc etc. It's just not correct that token-prediction loss function is "the whole thing".

RL is still token prediction; it's just a technique for adjusting the weights to align with predictions that you can't model a loss function for in pre-training. When RL rewards good output, it's increasing the statistical strength of the model for an arbitrary purpose, but ultimately what is achieved is still a brute-force quadratic lookup for every token in the context.
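The "quadratic lookup" point can be made concrete with a naive self-attention cost count. A sketch only: real implementations batch this into matrix multiplications, but the number of pairwise scores is the same.

```python
def attention_scores(query, keys):
    # One dot product per (query, key) pair.
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

def self_attention_cost(n_tokens, dim=4):
    # Toy vectors; only the count of pairwise score computations matters.
    vecs = [[1.0] * dim for _ in range(n_tokens)]
    n_scores = 0
    for query in vecs:                                  # every token attends...
        n_scores += len(attention_scores(query, vecs))  # ...to every token
    return n_scores

# Doubling the context quadruples the work: n_tokens ** 2 scores per layer.
assert self_attention_cost(8) == 64
assert self_attention_cost(16) == 256
```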


I use an enterprise LLM provided by work, on a very proprietary codebase in a semi-esoteric language. My impression is that it is still a very big autocompletion machine.

You still need to hand-hold it all the way, as it is only capable of regurgitating the few code patterns it saw in public. As opposed to, say, a Python project.


What model is your “enterprise LLM”?

But regardless, I don’t think anyone is claiming that LLMs can magically do things that aren’t in their training data or context window. Obviously not: they can’t learn on the job and the permanent knowledge they have is frozen in during training.


As someone who still might have a '2023 take on LLMs', even though I use them often at work, where would you recommend I look to learn more about what a '2025 LLM' is, and how they operate differently?

Papers on mechanistic interpretability and representation engineering, e.g. from Anthropic, would be a good start.

Don't bother. This bubble will pop in two years, you don't want to look back on your old comments in shame in three.

>> Sometimes they hallucinate.

For someone speaking as if you knew everything, you appear to know very little. Every LLM completion is a "hallucination"; some of them just happen to be factually correct.


I can say "I don't know" in response to a question. Can an LLM?

This is one of the easiest questions in the world to answer. My first try on the smallest and fastest model it was convenient to access, GPT-5.2 Instant: https://chatgpt.com/share/69468764-01cc-8008-b734-0fb55fd7ef...

> What did I have for breakfast this morning?

> I don’t know what you had for breakfast this morning…


Yes, frequently.

Most modern post training setups encourage this.

It isn't 2023 anymore.


> it’s more complicated than that.

No it isn't.

> ...fool you into thinking you understand what is going on in that trillion parameter neural network.

It's just matrix multiplication and logistic regression, nothing more.


LLMs are a general-purpose computing paradigm. LLMs are circuit builders: the converged parameters define pathways through the architecture that pick out specific programs. Or, as Karpathy puts it, LLMs are a differentiable computer[1]. Training LLMs discovers programs that reproduce the input sequence well. Roughly the same architecture can generate passable images, music, or even video.

The sequence of matrix multiplications is the high-level constraint on the space of discoverable programs. But the specific parameters discovered are what determine the specifics of information flow through the network, and hence what program is defined. The complexity of the trained network is emergent, meaning the internal complexity far surpasses that of the coarse-grained description of the high-level matmul sequences. LLMs are not just matmuls and logits.
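A toy illustration of the parameters-define-the-program point (hypothetical and vastly below LLM scale): two weight vectors share the identical dot-product-plus-threshold "architecture", yet compute different Boolean programs.

```python
def forward(weights, x):
    # Identical "architecture" for every weight setting: append a bias input,
    # take one dot product, threshold at zero.
    return 1 if sum(w * xi for w, xi in zip(weights, x + [1.0])) > 0 else 0

W_AND = [1.0, 1.0, -1.5]  # these parameters implement logical AND
W_OR = [1.0, 1.0, -0.5]   # these parameters implement logical OR

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
assert [forward(W_AND, [a, b]) for a, b in inputs] == [0, 0, 0, 1]
assert [forward(W_OR, [a, b]) for a, b in inputs] == [0, 1, 1, 1]
```

Same matmul, same threshold; the weights alone decide which program runs.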

[1] https://x.com/karpathy/status/1582807367988654081


> LLMs are a general purpose computing paradigm.

Yes, so is logistic regression.


No, not at all.

Yes at all. I think you misunderstand the significance of "general computing". The binary string 01101110 is a general-purpose computer, for example.

No, that's insane. Computing is a dynamic process. A static string is not a computer.

It may be insane, but it's also true.

https://en.wikipedia.org/wiki/Rule_110


Notice that the Rule 110 string picks out a machine, it is not itself the machine. To get computation out of it, you have to actually do computational work, i.e. compare current state, perform operations to generate subsequent state. This doesn't just automatically happen in some non-physical realm once the string is put to paper.
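That distinction is easy to make concrete: the table below is just the string 01101110 rewritten as a lookup, and nothing computes until a loop actually applies it. A minimal sketch, using periodic boundary conditions for simplicity.

```python
# Rule table encoded by the bits of 01101110, neighborhoods 111 down to 000.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

def step(cells):
    # The string alone is inert; this loop is where the computational work
    # happens: read each neighborhood, look up the rule, emit the next state.
    n = len(cells)
    return [RULE_110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

# One tick of the machine the string picks out.
assert step([0, 0, 0, 1, 0, 0, 0]) == [0, 0, 1, 1, 0, 0, 0]
```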


