What bugs me about the argument against fair use (whereby AI ostensibly "replicates" the content, competing against the original) is that it assumes a model trained on journalism produces journalism, or is designed to produce it. The argument against that stance would be easy to make.
The model isn't trained only on journalism; you can't even isolate its training like that. It's trained on human writing in general, across specialties, and it's designed to compete with humans at what humans do with text, of which journalism is merely a tiny special case.
I think the only principled positions to be had here are to either ignore IP rights for LLM training or give up entirely, because a model designed to be general like a human will need to be trained like a human, i.e. immersed in the same reality we are: the same culture, most of which is shackled by IP claims. And then, obviously, by definition, as it gets better it gets more competitive with humans on everything humans do.
You can produce a complaint that "copyrighted X was used in training a model that now can compete with humans on producing X" for an arbitrary value of X. You can even produce a complaint that "copyrighted X was used in training a model that now outcompetes us in producing Y", for arbitrary X and Y that aren't even related, and it will still be true. Such is the nature of a general-purpose ML model.
This seems to be putting the cart before the horse.
IP rights, and even IP itself as a concept, aren't fundamental to existence, nor the default state of nature. They are contingent concepts, contingent on many factors.
E.g. they have to be actively, continuously maintained as time advances. There could be disagreement on how often: per annum, per case, per WIPO meeting, etc.
But if no such activity occurs over a very long time, say a century, then any claims to any IP will likely, by default, be extinguished.
So nobody needs to do anything for it all to become irrelevant. That will automatically occur given enough time…
The analogy in the anti-fair-use argument is that if I am the WSJ, and you are an investor who reads my newspaper and then goes on to make a billion dollars in profitable trades, somehow I as the publisher am entitled to some equity or compensation for your use of my journalism.
That argument is just as absurd when it's a program you wrote that does the same thing. Model training is not only fair use, but publishers should be grateful someone has done something of value for humanity with their collected drivel.
This is the checkmate. The moment anything is published, it is fair game; it is part of the human consciousness and available for incorporation into anything it can serve as a component of. Otherwise, what is the fucking point of publishing, mere revenue? Are we all not collectively competing and contributing? Furthermore, isn't anything copied from anything published arguably satire? Protected-speech satire?
Whether or not training is ruled fair use, it does seem likely to affect artists and authors.
Many artists don't like how image generators, trained on their original work, allow others to replicate their (formerly) distinctive style, almost instantly, for pennies.
Many authors don't like how language models can enable anyone to effortlessly create paraphrased versions of the author's books. Plagiarism as a service.
Human artists and writers can (and do) do the same thing, but the smaller scale, slower speed, and higher cost reduce the economic effects.
I think it makes more sense in the context of entertainment. However, even in journalism, given the source data, there's no reason an LLM couldn't put together the actual public-facing article, video, etc.