Yes - garbage in / garbage out still holds true for most things when it comes to LLM training.
The two bits about this paper that I think are worth calling out specifically:
- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; i.e., even if the syntax of the pretraining data is legitimate, the model has picked up bad implicit behavior (thought skipping)
- Trying to classify "bad data" is itself a nontrivial problem. Here the engagement-based heuristic actually proved more reliable than an LLM classification of the content (a rough sketch of the two approaches follows below)
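For intuition, here's a minimal sketch of what those two filtering strategies might look like. The field names (likes, retweets), thresholds, and the quality_score() scorer are hypothetical stand-ins for illustration, not the paper's actual pipeline:

```python
# Sketch of two ways to drop "junk" samples from a training corpus:
# an engagement heuristic vs. an LLM-based quality classifier.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    text: str
    likes: int
    retweets: int

def engagement_filter(samples: List[Sample], max_engagement: int = 500) -> List[Sample]:
    """Heuristic: treat short, highly viral posts as likely junk and drop them."""
    return [
        s for s in samples
        if (s.likes + s.retweets) <= max_engagement or len(s.text.split()) > 30
    ]

def classifier_filter(samples: List[Sample],
                      quality_score: Callable[[str], float],
                      threshold: float = 0.5) -> List[Sample]:
    """Alternative: keep only samples an LLM-based scorer rates above a threshold."""
    return [s for s in samples if quality_score(s.text) >= threshold]
```

The paper's finding, roughly, is that the cheap heuristic version correlated better with downstream damage than the model-judged version.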
Yes, but the other interesting bit, which isn't clearly addressed, is that increasing the garbage in to 100% does not result in absolute garbage out. So evidently there is still something left to learn there.
> (...) We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
In today's hyper-saturated world, attention is everything:
- consumer marketing
- politics
- venture fundraising
When any system has a few power law winners, it makes sense to grab attention.
Look at Trump and Musk and now Altman. They figured it out.
MrBeast...
Attention, even if negative, wedges you into the system and everyone's awareness. Your mousey quiet competitors aren't even seen or acknowledged. The attention grabbers suck all the oxygen out of the room and win.
If you go back and look at any victory, was it really better solutions, or was it the fact that better solutions led to more attention?
"Look here" -> build consensus and ignore naysayers -> keep building -> feedback loop -> win
It might not just be a societal algorithm. It might be one of the universe's fundamental greedy optimization algorithms. It might underpin lots of systems, including how we ourselves as individuals think and learn.
Our pain receptors. Our own intellectual interests and hobbies. Children learning on the playground. Ant colonies. Bee swarms. The world is full of signals, and there are mechanisms which focus us on the right stimuli.
If you traverse back the fourteen years of my comment history (on this account - my other account is older), you'll find that I've always written prose in this form.
LLMs trained on me (and the Hacker News corpus), not the other way around.
Considering that the current state of the art for LLM training is to feed it massive amounts of garbage (with some good stuff alongside), it seems important to point this out even if it might seem obvious.
I don't think anyone is throwing raw datasets into LLMs and hoping for high-quality weights anymore. Nowadays most of the datasets are filtered one way or another, and some of them are even highly curated.
I doubt they are highly curated; you would need experts in every field to do so. Which gives me more performance anxiety about LLMs, because one of the most curated fields should be code...
The major labs are hiring experts. They carefully build & curate synthetic data. The market for labelled non-synthetic data is currently ~$3B/year.
The idea that LLMs are just trained on a pile of raw Internet is severely outdated. (Not sure it was ever fully true, but it's far away from that by now).
Coding's one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it work? Does it come with a set of tests and pass them? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc., etc.)
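As a rough illustration, a mechanical curation filter along those lines might look something like this; the tool choices (pytest, ruff) and the threshold are my own assumptions, not any lab's actual pipeline:

```python
# Sketch: admit a code repository into a training set only if it clears
# some mechanical quality bars (test suite passes, few linter findings).
import subprocess
from pathlib import Path

def passes_tests(repo: Path) -> bool:
    """Run the repo's test suite; a non-zero exit means it fails (or has no tests)."""
    result = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return result.returncode == 0

def lint_issue_count(repo: Path) -> int:
    """Roughly count the lines a static analyzer (here: ruff) prints as a proxy for flagged issues."""
    result = subprocess.run(["ruff", "check", "."], cwd=repo,
                            capture_output=True, text=True)
    return len([line for line in result.stdout.splitlines() if line.strip()])

def keep_repo(repo: Path, max_lint_issues: int = 50) -> bool:
    """Keep the repository only if it passes tests and stays under the lint budget."""
    return passes_tests(repo) and lint_issue_count(repo) <= max_lint_issues
```

None of these checks capture whether the code is actually good, but they're cheap signals you simply don't have for most prose on the internet.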
Yes, I am concerned about the Computer Science profession
>"“Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI"
A metaphor is exactly what it is: not only do LLMs not possess human cognition, there's certainly no established science of thinking under which they're literally valid subjects for clinical psychological assessment.
How does this stuff get published? This is basically a blog post. One of the worst aspects of the whole AI craze is that it has turned a non-trivial amount of academia into a complete cargo-cult joke.
"published" only in the sense of "self-published on the Web".
This manuscript has not (or not yet) passed the peer review process, which is what scientists properly mean by "published".
It is a blog post; it was published as a GitHub page and on arXiv.
I think it's intended as a catchy warning to the people dumping every piece of the internet (and synthetic data based on it!) into their models that there are repercussions.
I think it's an interesting line of thought. So we all adopt LLMs and use them everywhere we can. What happens to the next generation of humans, born with AI and with diminished cognitive capacity to even wonder about anything? And the generation after that? What happens to the next generation of AI models that can't train on original human-created datasets free of AI output?