Yes - garbage in / garbage out still holds true for most things when it comes to LLM training.
The two bits about this paper that I think are worth calling out specifically:
- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; i.e., even if the syntax of the pretraining data is legitimate, the model has picked up bad implicit behavior (thought skipping)
- Trying to classify "bad data" is itself a nontrivial problem. Here the engagement-based heuristic actually proved more reliable than an LLM classification of the content (a rough sketch of the two approaches follows below)
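For intuition, here's a minimal sketch of what those two filtering strategies might look like. The field names (likes, retweets), thresholds, and the quality_score() scorer are hypothetical stand-ins for illustration, not the paper's actual pipeline:

```python
# Sketch of two ways to drop "junk" samples from a training corpus:
# an engagement heuristic vs. an LLM-based quality classifier.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    text: str
    likes: int
    retweets: int

def engagement_filter(samples: List[Sample], max_engagement: int = 500) -> List[Sample]:
    """Heuristic: treat short, highly viral posts as likely junk and drop them."""
    return [
        s for s in samples
        if (s.likes + s.retweets) <= max_engagement or len(s.text.split()) > 30
    ]

def classifier_filter(samples: List[Sample],
                      quality_score: Callable[[str], float],
                      threshold: float = 0.5) -> List[Sample]:
    """Alternative: keep only samples an LLM-based scorer rates above a threshold."""
    return [s for s in samples if quality_score(s.text) >= threshold]
```

The paper's finding, roughly, is that the cheap heuristic version correlated better with downstream damage than the model-judged version.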
Yes, but the other interesting bit, which isn't clearly addressed, is that increasing the garbage in to 100% does not result in absolute garbage out. So evidently there is still something left to learn there.
> (...) We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
In today's hyper-saturated world, attention is everything:
- consumer marketing
- politics
- venture fundraising
When any system has a few power law winners, it makes sense to grab attention.
Look at Trump and Musk and now Altman. They figured it out.
MrBeast...
Attention, even if negative, wedges you into the system and everyone's awareness. Your mousey quiet competitors aren't even seen or acknowledged. The attention grabbers suck all the oxygen out of the room and win.
If you go back and look at any victory, was it really better solutions, or was it the fact that better solutions led to more attention?
"Look here" -> build consensus and ignore naysayers -> keep building -> feedback loop -> win
It might not just be a societal algorithm. It might be one of the universe's fundamental greedy optimization algorithms. It might underpin lots of systems, including how we ourselves as individuals think and learn.
Our pain receptors. Our own intellectual interests and hobbies. Children learning on the playground. Ant colonies. Bee swarms. The world is full of signals, and there are mechanisms which focus us on the right stimuli.
If you traverse back the fourteen years of my comment history (on this account - my other account is older), you'll find that I've always written prose in this form.
LLMs trained on me (and the Hacker News corpus), not the other way around.
Considering that the current state of the art for LLM training is to feed it massive amounts of garbage (with some good stuff alongside), it seems important to point this out even if it might seem obvious.
I don't think anyone is throwing raw datasets into LLMs and hoping for high-quality weights anymore. Nowadays most of the datasets are filtered one way or another, and some of them are even highly curated.
I doubt they are highly curated; you would need experts in every field to do so. Which gives me more performance anxiety about LLMs, because one of the most curated fields should be code...
The major labs are hiring experts. They carefully build & curate synthetic data. The market for labelled non-synthetic data is currently ~$3B/year.
The idea that LLMs are just trained on a pile of raw Internet is severely outdated. (Not sure it was ever fully true, but it's far away from that by now).
Coding's one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it work? Does it come with a set of tests and pass them? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc., etc.)
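As a rough illustration, a mechanical curation filter along those lines might look something like this; the tool choices (pytest, ruff) and the threshold are my own assumptions, not any lab's actual pipeline:

```python
# Sketch: admit a code repository into a training set only if it clears
# some mechanical quality bars (test suite passes, few linter findings).
import subprocess
from pathlib import Path

def passes_tests(repo: Path) -> bool:
    """Run the repo's test suite; a non-zero exit means it fails (or has no tests)."""
    result = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return result.returncode == 0

def lint_issue_count(repo: Path) -> int:
    """Roughly count the lines a static analyzer (here: ruff) prints as a proxy for flagged issues."""
    result = subprocess.run(["ruff", "check", "."], cwd=repo,
                            capture_output=True, text=True)
    return len([line for line in result.stdout.splitlines() if line.strip()])

def keep_repo(repo: Path, max_lint_issues: int = 50) -> bool:
    """Keep the repository only if it passes tests and stays under the lint budget."""
    return passes_tests(repo) and lint_issue_count(repo) <= max_lint_issues
```

None of these checks capture whether the code is actually good, but they're cheap signals you simply don't have for most prose on the internet.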
Yes, I am concerned about the Computer Science profession
>"“Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI"
A metaphor is exactly what it is: not only do LLMs not possess human cognition, there's certainly no established science of thinking under which they're literally valid subjects for clinical psychological assessment.
How does this stuff get published? This is basically a blog post. One of the worst aspects of the whole AI craze is that it has turned a non-trivial amount of academia into a complete cargo-cult joke.
"published" only in the sense of "self-published on the Web".
This manuscript has not (or not yet) passed the peer review process, which is what scientists properly mean by "published".
It is a blog post; it was published as a GitHub page and on arXiv.
I think it's intended as a catchy warning to the people dumping every piece of the internet (and synthetic data based on it!) into their models that there are repercussions.
I think it's an interesting line of thought. So we all adopt LLMs and use them everywhere we can. What happens to the next generation of humans, born with AI and with diminished cognitive capacity to even wonder about anything? And the generation after that? What happens to the next generation of AI models that can't train on original human-created datasets free of AI output?