For data though, as LLMs generate more output, wouldn't they be expected to pollute themselves with their own generated data over time?
Wouldn't that be the wall we'll hit? Think of how shitted up Google Search is with generated garbage. I'm imagining we're already in the 'golden age' where we can still train on good datasets, before everything gets 'polluted' with LLM-generated data that may not be accurate, and models just keep getting less accurate over time.
I don't really buy that line of argument. There are still useful signals, like upvotes, known human writing, or just plain spending time/money to label it yourself. There's also the option of training better algorithms on pre-LLM datasets. It's something to consider, but not any sort of crisis.
Cleaning and preparing the dataset is a huge part of training. Like the OP mentioned, OpenAI likely has some high-quality automation for doing this, and that's what's given them a leg up over all other competitors. You can apply the same automation to clear out low-quality AI content the same way you remove low-quality human content. It's not about the source; only the quality matters.
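To make the "filter on quality, not source" point concrete, here's a minimal, purely illustrative sketch. The heuristics (repetition ratio, average word length, length bounds) are my own stand-ins; real pipelines use trained classifiers, and nothing here is based on OpenAI's actual process:

```python
# Illustrative quality filter: score documents and keep only the ones above
# a threshold, regardless of whether a human or a model wrote them.
import re

def quality_score(doc: str) -> float:
    """Crude heuristic score in [0, 1]: penalize very short docs, heavy
    repetition, and gibberish-looking text. A stand-in for a real classifier."""
    words = re.findall(r"[a-zA-Z']+", doc.lower())
    if len(words) < 20:
        return 0.0
    unique_ratio = len(set(words)) / len(words)        # repetition penalty
    avg_word_len = sum(map(len, words)) / len(words)   # gibberish penalty
    length_ok = 1.0 if len(words) <= 50_000 else 0.5   # suspiciously long docs
    return unique_ratio * min(avg_word_len / 5.0, 1.0) * length_ok

def filter_corpus(docs, threshold=0.4):
    """Keep documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]
```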
There must be signals in the data that mark generated garbage; otherwise humans wouldn't be able to tell. Something like PageRank would be a game changer and could potentially solve this issue.
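To ground the PageRank reference, here's a minimal power-iteration sketch (the toy graph and parameters are mine, not anyone's production ranker). The idea would be to use endorsement structure, who links to or vouches for what, as a quality signal rather than judging the text alone:

```python
# Minimal PageRank via power iteration, illustrating a link-based trust signal.
import numpy as np

def pagerank(adj: np.ndarray, damping: float = 0.85, iters: int = 100) -> np.ndarray:
    """adj[i, j] = 1 if page i links to page j. Returns a score per page."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1, keepdims=True)
    out_degree[out_degree == 0] = 1                  # avoid divide-by-zero on dangling nodes
    transition = adj / out_degree                    # row-stochastic link matrix
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * (transition.T @ rank)
    return rank

# Toy graph: page 2 is endorsed by both other pages and ends up ranked highest.
links = np.array([[0, 0, 1],
                  [1, 0, 1],
                  [0, 1, 0]])
print(pagerank(links))
```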
We need models that require less language data to train. Babies learn to talk on way less data than the entire internet. We need something closer to human experience. Kids have a feel for what is bullshit before they have consumed the entire internet :-).
I think feeding the internet into an LLM will be seen as the mainframe days of AI.
My counter-point to this is that babies are born with a sort of basic pre-trained LLM. Humans are born with the analogue of weights & biases in our brains already partly optimized to learn language, math, etc. An LLM, before pre-training, starts with its weights & biases initialized to random values. Training on the internet can IMO be seen as a kind of "pre-training".
> Babies learn to talk on way less data than the entire internet.
Is this actually true? My gut check says yes, but I'm also unaware of any meaningful way to actually quantify the volume of sensory data processed by a baby (or anyone else for that matter), and it wouldn't shock me to discover that, if we could, we'd find it to be a huge volume.
Sure, the breadth is (maybe) smaller, but the question is volume. Babies get years of people talking around them, as well as data from their own muscles and vocalizations fed back to them. Is the volume they have consumed by the point they begin talking actually less than the volume consumed by an LLM?
If you're talking about babies in ancient societies (which I am), the answer is absolutely yes. They were exposed to much less language, and much less sound, than we are.
Really? How much less? I'm far from convinced that if you sum up the sheer volume of noises heard, as well as the other neurological inputs that go into learning to speak (e.g. proprioception), you'd come out with a lesser number than what LLMs are trained on, but I'm open to any real data on this.
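For what it's worth, here's the back-of-envelope version of the language-only comparison. Every number is an assumption (the child-directed speech figure is a rough order-of-magnitude estimate, and the corpus size is typical of recent pretraining reports), and it deliberately ignores raw sensory input like audio and proprioception, which is exactly the part that's hard to quantify:

```python
# Back-of-envelope comparison of language exposure; all figures are rough
# assumptions, not measurements.
words_per_day_heard = 20_000        # assumed: speech a toddler hears per day
days_to_first_fluency = 4 * 365     # assumed: ~4 years of exposure
child_words = words_per_day_heard * days_to_first_fluency

llm_training_tokens = 10 ** 13      # assumed: order of a modern pretraining corpus

print(f"child hears roughly {child_words:,} words")             # ~29 million
print(f"LLM trains on roughly {llm_training_tokens:,} tokens")  # ~10 trillion
print(f"ratio ~ {llm_training_tokens / child_words:,.0f}x")     # ~340,000x
```

On words alone the gap is several orders of magnitude in the LLM's direction; the open question raised above is whether counting the full sensory stream would close it.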