A big catch here is that you can't slap an open source license on a bunch of cop...

CharlesW · on March 17, 2024

Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.

gfodor · on March 17, 2024

Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.

zer00eyz · on March 17, 2024

You all keep using the word "Data"

Data, as in facts, as in the frequency of one word in relation to another.

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html

It's not a question of if, rather when the cat gets out of the bag and the legal battle starts. The problem is that all the copyright applies to the expression not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here but it depends on what judge gets it and how high it goes.

gfodor · on March 18, 2024

No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.

CharlesW · on March 17, 2024

> …I think OpenAI licenses their data…

They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.

logicchains · on March 18, 2024

https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.