Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no-one has created a truly convincing LLM exclusively trained on public domain data. It might happen soon though - there are some convincing effort in progress.


Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.


Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.


You all keep using the word "Data"

Data, as in facts, as in the frequency of one word in relation to another.

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html

It's not a question of if, rather when the cat gets out of the bag and the legal battle starts. The problem is that all the copyright applies to the expression not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here but it depends on what judge gets it and how high it goes.


No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.


> …I think OpenAI licenses their data…

They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.


https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: