I doubt they are highly curated; you would need experts in every field to do so. ...

groby_b · 2025-10-21T16:46:04 1761065164

The major labs are hiring experts. They carefully build & curate synthetic data. The market for labelled non-synthetic data is currently ~$3B/year.

The idea that LLMs are just trained on a pile of raw Internet is severely outdated. (Not sure it was ever fully true, but it's far away from that by now).

Coding's one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it work? Does it come with a set of tests, and does it pass them? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc.)
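As a rough illustration of that kind of check, here is a minimal Python sketch of a static quality filter. It is not how any lab actually does it: the function names, the `test_` naming convention, and the 0.5 docstring threshold are all assumptions made up for this example, and it only inspects the code without running the tests.

```python
import ast


def quality_signals(source: str) -> dict:
    """Collect cheap static signals for one Python snippet (hypothetical heuristic)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Code that does not even parse fails every other check too.
        return {"parses": False, "has_tests": False, "documented_ratio": 0.0}

    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    # Assumption: tests follow the common "test_" naming convention.
    tests = [f for f in funcs if f.name.startswith("test_")]
    # Crude stand-in for "stylistic integrity": share of functions with a docstring.
    documented = [f for f in funcs if ast.get_docstring(f)]

    return {
        "parses": True,
        "has_tests": bool(tests),
        "documented_ratio": len(documented) / len(funcs) if funcs else 0.0,
    }


def keep(source: str, min_documented: float = 0.5) -> bool:
    """Decide whether a snippet stays in the (hypothetical) curated set."""
    signals = quality_signals(source)
    return (
        signals["parses"]
        and signals["has_tests"]
        and signals["documented_ratio"] >= min_documented
    )
```

A real pipeline would presumably also execute the tests in a sandbox and run linters or other analysis tools; this only shows that such signals are cheap to compute at scale.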

nradov · 2025-10-21T16:21:34 1761063694

OpenAI has been literally hiring human experts in certain targeted subject areas to write custom proprietary training content.

BoredPositron · 2025-10-21T16:27:27 1761064047

I bet the dataset is mostly comprised of certain areas™.

satellite2 · 2025-10-21T18:04:36 1761069876

Is that right? Isn't the current way of doing things to throw "everything" at it, then fine-tune?