
> During that answer the LLM has state, but once it's done the state is gone.

This is an operational choice. LLMs have state, and you never have to clear it. The problems come from the amount of state being extremely limited (compared to the other axes of scale) and from quality degrading as the state grows. For those reasons, people tend to clear an LLM's state. That is not the same thing as not having state, even if the result looks similar.
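To make that operational choice concrete, here's a minimal sketch (the `generate` call is a hypothetical stand-in for any LLM completion API) of clearing the context every turn versus carrying it forward as state:

    # Minimal sketch: the "state" is just the text we choose to feed back in.
    # `generate` is a hypothetical stand-in for a real LLM completion call;
    # the weights never change either way.
    def generate(prompt: str) -> str:
        return "(model reply to: " + prompt[-40:] + ")"  # placeholder output

    # Stateless usage: every turn starts from scratch.
    def chat_stateless(user_turns: list[str]) -> list[str]:
        return [generate(turn) for turn in user_turns]

    # "Stateful" usage: the growing transcript is the state, carried forward.
    def chat_stateful(user_turns: list[str]) -> list[str]:
        transcript, replies = "", []
        for turn in user_turns:
            transcript += "\nUser: " + turn + "\nAssistant: "
            reply = generate(transcript)
            transcript += reply
            replies.append(reply)
        return replies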



No, they don't. You can update the context, make it a sliding window, create a sort of register and train the model to maintain stateful variables, or apply various other hacks, but outside of actively managing the context there is no state.
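For what it's worth, the sliding-window hack referred to here usually amounts to something like this (the token budget and tokenizer are illustrative, not any particular model's):

    # Sketch of a sliding-window context: drop the oldest turns once the
    # transcript exceeds a token budget. Whatever survives the trimming is
    # the only "state"; nothing inside the model is updated.
    MAX_TOKENS = 8_000  # illustrative budget

    def count_tokens(text: str) -> int:
        # Hypothetical tokenizer; a real one would come from the model's library.
        return len(text.split())

    def trim_context(turns: list[str]) -> list[str]:
        while turns and sum(count_tokens(t) for t in turns) > MAX_TOKENS:
            turns.pop(0)  # discard the oldest turn first
        return turns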

You can't just leave training mode on, which is the only way LLMs can currently have persisted state in the context of what's being discussed.

The context is the percept, the model is engrams. Active training allows the update of engrams by the percepts, but current training regimes require lots of examples, and don't allow for broad updates or radical shifts in the model, so there are fundamental differences in learning capability compared to biological intelligence, as well.

Under standard inference-only runs, even if you're using advanced context hacks to persist some sort of pseudo-state, the underlying engrams are not changed, so the "state" operates within a limited domain, and the underlying latent space can't update to model reality based on patterns in the percepts.

The statefulness of intelligence requires that the model, or engrams, update in harmony with the percepts in real time, in addition to a model of the model, or an active perceiver - the thing that is doing the experiencing. The utility of consciousness is in predicting changes in the model and learning the meta-patterns that allow for things like "a-ha" moments, where a bundle of disparate percepts gets contextualized and mapped to a pattern, immediately updating the entire model, such that every moment after that pattern is learned uses the new pattern.

Static weights mean a static latent space, which means state is not persisted in a way meaningful to intelligence - even if you alter weights, using classifier-free guidance or other techniques, or stack LoRAs and other alterations, you're limited at the global scope by the lack of hierarchical links and other meta-pattern-level relationships that would be required for effective statefulness in LLMs.

We're probably only a few architecture innovations away from models that can be properly stateful without collapsing. All of the hacks and tricks we do to extend context and imitate persisted state do not scale well and will collapse over extended time or context.

The underlying engrams or weights need to dynamically adapt and update based on a stable learning paradigm, and we just don't have that yet. It might be a few architecture tweaks, or it could be a radical overhaul of structure and optimizers and techniques - transformers might not get us there. I think they probably can, and will, be part of whatever that next architecture will be, but it's not at all obvious or trivial.


I agree that what people probably actually want is continual training; I disagree that continual training is the only way to get persistent state. The GP is (explicitly) talking about long-term memory alone, both in the claim and in the examples. If you have, e.g., a 10-trillion-token context, then you have long-term memory, which can enable long-term goals and affect actions across tasks as listed, even without continual training.

Continual training would remove the need for context to provide that persistent state, and it would add capabilities beyond what an enormous context or other methods of persistent state alone can give, but that doesn't mean it's the only way to get persistent state as described.
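A rough sketch of what "context as long-term memory" amounts to in practice, whether the window is enormous or backed by some external store (the names and budget here are illustrative):

    # Sketch: long-term memory supplied entirely by the context, not the weights.
    # Past interactions are stored as plain text and stuffed back into the prompt;
    # with a large enough window, no retrieval or ranking step is even needed.
    memory_log: list[str] = []  # persisted across sessions, e.g. on disk

    def remember(event: str) -> None:
        memory_log.append(event)

    def build_prompt(question: str, budget_chars: int = 500_000) -> str:
        # Naive policy: keep the most recent memories that fit the budget.
        kept, used = [], 0
        for event in reversed(memory_log):
            if used + len(event) > budget_chars:
                break
            kept.append(event)
            used += len(event)
        return "\n".join(reversed(kept)) + "\n\nQuestion: " + question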


A giant, even infinite, context cannot overcome a model's fundamental limitations - the limits in processing come from the "shape" of the weights in latent space, not from how the context navigates that latent space during inference.

The easiest way to understand the problem is like this: If a model has a mode collapse, like only displaying watch and clock faces with the hands displaying 10:10, you can sometimes use prompt engineering to get an occasional output that shows some other specified time, but 99% of the time, it's going to be accompanied by weird artifacts, distortions, and abject failures to align with whatever the appropriate output might be.

All of a model's knowledge is encoded in the weights. All of the weights are interconnected, with links between concepts and hierarchies and sequences and processes embedded within - there are concepts related to clocks and watches that are accurate, yet when a prompt causes the navigation through the distorted, "mode collapsed" region of latent space, it fundamentally distorts and corrupts the following output. In an RL context, you quickly get a doom cycle, with the output getting worse, faster and faster.

Let's say you use CFG or a painstakingly handcrafted LoRA to precisely modify the weights that deal with a known mode collapse. Your model can now display all times - 10:10, 3:15, 5:00, etc. - but the secondary networks that depended on the corrupted/collapsed values, now "corrected" by your modification, are skewed, with chaotic and complex downstream consequences.
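For concreteness, a LoRA-style edit only adds a low-rank delta on top of a frozen weight matrix, roughly like this (shapes and scaling are illustrative):

    import numpy as np

    # Sketch of a LoRA-style edit: W stays frozen, the learned update is the
    # low-rank product B @ A scaled by alpha / r. The targeted behavior changes,
    # but every downstream layer that consumed W's old outputs sees a shift too.
    d_out, d_in, r, alpha = 1024, 1024, 8, 16    # illustrative sizes
    W = np.random.randn(d_out, d_in) * 0.02      # frozen base weights
    A = np.random.randn(r, d_in) * 0.01          # trained low-rank factors
    B = np.random.randn(d_out, r) * 0.01
    W_effective = W + (alpha / r) * (B @ A)      # what the layer actually computes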

You absolutely, 100% need realtime learning to update the engrams in harmony with the percepts, at the scale of the entire model - the more sparse, hierarchical, and symbol-like the internal representation, the easier it will be to maintain updates, but with these massive multibillion-parameter models, even simple updates are going to be spread across tens or hundreds of millions of parameters across dozens of layers.
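The realtime updates described here would, at minimum, mean taking a gradient step per incoming example on a live model. A toy sketch (PyTorch, with a single linear layer standing in for the whole network) of what that looks like, and why stability is the hard part:

    import torch

    # Toy sketch of "online" weight updates: one gradient step per incoming
    # example, applied while the model is serving. At LLM scale each step
    # touches millions of parameters, and without a stabilizing scheme the
    # model drifts toward catastrophic forgetting.
    model = torch.nn.Linear(16, 16)              # stand-in for a real network
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    def online_update(x: torch.Tensor, target: torch.Tensor) -> float:
        opt.zero_grad()
        loss = loss_fn(model(x), target)
        loss.backward()
        opt.step()                               # weights change immediately
        return float(loss.item())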

Long contexts are great and you can make up for some of the shortcomings caused by the lack of realtime, online learning, but static engrams have consequences beyond simply managing something like an episodic memory. Fundamental knowledge representation has to be dynamic, contextual, allow for counterfactuals, and meet these requirements without being brittle or subject to mode collapse.

There is only one way to get that sort of persistent memory, and that's through continuous learning. There has been a lot of progress in that realm over the last two years, but nobody has cracked it yet.

That might be the underlying function of consciousness, by the way: a meta-model that processes all the things the model is "experiencing" and that it "knows" at each step, and that comes about through a need to stabilize the continuous learning function. Changes at that level propagate out through the entirety of the network. Subjective experience might be an epiphenomenal consequence of that meta-model.

It might not be necessary, which would be nice to be able to verify - purely functional, non-subjective AI vs. suffering AI would be a good thing to get right.

At any rate, static model weights create problems that cannot be solved with long, or even infinite, contexts, even with recursion in the context stream, complex registers, or any manipulation of that level of inputs. The actual weights have to be dynamic and adaptive in an intelligent way.


You explain the limitations in learning and long-term memory (for lack of a better word) of current models in a much more knowledgeable and insightful way than I ever could. I am going to save these comments in case I need to better explain the current limitations we face to others in the future.



