Hacker News

It's also worth reading his initial tweet: https://x.com/suchirbalaji/status/1849192575758139733

> I recently participated in a NYT story about fair use and generative AI, and why I'm skeptical "fair use" would be a plausible defense for a lot of generative AI products. I also wrote a blog post (https://suchir.net/fair_use.html) about the nitty-gritty details of fair use and why I believe this.

> To give some context: I was at OpenAI for nearly 4 years and worked on ChatGPT for the last 1.5 of them. I initially didn't know much about copyright, fair use, etc. but became curious after seeing all the lawsuits filed against GenAI companies. When I tried to understand the issue better, I eventually came to the conclusion that fair use seems like a pretty implausible defense for a lot of generative AI products, for the basic reason that they can create substitutes that compete with the data they're trained on. I've written up the more detailed reasons for why I believe this in my post. Obviously, I'm not a lawyer, but I still feel like it's important for even non-lawyers to understand the law -- both the letter of it, and also why it's actually there in the first place.

> That being said, I don't want this to read as a critique of ChatGPT or OpenAI per se, because fair use and generative AI is a much broader issue than any one product or company. I highly encourage ML researchers to learn more about copyright -- it's a really important topic, and precedent that's often cited like Google Books isn't actually as supportive as it might seem.

> Feel free to get in touch if you'd like to chat about fair use, ML, or copyright -- I think it's a very interesting intersection. My email's on my personal website.



I'm an applied AI developer and CTO at a law firm, and we discuss the fair use argument quite a bit. It's grey enough that whoever has more financial resources to continue their case will win. Such is the law and legal industry in the USA.


what twigs me about the argument against fair use (whereby AI ostensibly "replicates" the content competitively against the original) is that it assumes a model trained on journalism produces journalism or is designed to produce it. the argument against that stance would be easy to make.


The model isn't trained on journalism only, you can't even isolate its training like that. It's trained on human writing in general and across specialties, and it's designed to compete with humans on what humans do with text, of which journalism is merely a tiny special case.

I think the only principled positions to be had here are to either ignore IP rights for LLM training, or give up entirely, because a model designed to be general like a human will need to be trained like a human, i.e. immersed in the same reality as we are, same culture, most of which is shackled by IP claims - and then, obviously, by definition, as it gets better it gets more competitive with humans on everything humans do.

You can produce a complaint that "copyrighted X was used in training a model that now can compete with humans on producing X" for an arbitrary value of X. You can even produce a complaint that "copyrighted X was used in training a model that now outcompetes us in producing Y", for arbitrary X and Y that aren't even related, and it will still be true. Such is the nature of a general-purpose ML model.


This seems to be putting the cart before the horse.

IP rights, or even IP itself as a concept, aren't fundamental to existence nor the default state of nature. They are contingent concepts, contingent on many factors.

e.g. It has to be actively, continuously maintained as time advances. There could be disagreements on how often, such as per annum, per case, per WIPO meeting, etc…

But if no such activity occurs over a very long time, say a century, then any claims to any IP will likely, by default, be extinguished.

So nobody needs to do anything for it all to become irrelevant. That will automatically occur given enough time…


> IP rights, or even IP itself as a concept, isn’t fundamental to existence nor the default state of nature.

This is correct. Copyright wasn't a thing until after the invention of the printing press.


the analogy in the anti-fair-use argument is that if I am the WSJ, and you are a reader and investor who reads my newspaper, and then you go on to make a billion dollars in profitable trades, somehow I as the publisher am entitled to some equity or compensation for your use of my journalism.

That argument is equally absurd as one where you write a program that does the same thing. Model training is not only fair use, but publishers should be grateful someone has done something of value for humanity with their collected drivelings.


This is the checkmate. The moment anything is published, it is fair game; it is part of the human consciousness and available for incorporation as a component of anything. Otherwise, what is the fucking point of publishing, mere revenue? Are we all not collectively competing and contributing? Furthermore, isn't anything copied from published work arguably satire? Protected-speech satire?


It has become ludicrously clear in the past decade that many of the competitors to journalism are very much not journalism.


Whether or not training is decided as fair use, it does seem like it could affect artists and authors.

Many artists don't like how image generators, trained on their original work, allow others to replicate their (formerly) distinctive style, almost instantly, for pennies.

Many authors don't like how language models can enable anyone to effortlessly create paraphrased versions of the author's books. Plagiarism as a service.

Human artists and writers can (and do) do the same thing, but the smaller scale, slower speed, and higher cost reduce the economic effects.


I think it makes more sense in the context of entertainment. However, even in journalism, given the source data there's no reason an LLM couldn't put together the actual public-facing article, video, etc.


Doesn’t need to be journalism, just needs to compete with it.


> they can create substitutes that compete with the data they're trained on.

If I'm an artist and copy the style of another artist, I'm also competing with that artist, without violating copyright. I wouldn't see this argument holding up unless it can output close copies of particular works.



