Open source isn't the same as public domain. The code is still copyrighted and comes with a specific license whose terms have to be adhered to in order to be allowed to use it.
Nobody knows whether training a NN is fair use. If it were ruled not to be fair use, it would basically shut down a large chunk of ML research in the US, as it's not just Copilot that is training on copyrighted data. All the large language models require copyrighted data such as web crawls, because there simply isn't enough public domain material around.
So even if you are right, and courts rule that training is not fair use, it seems likely that all the big tech companies would lobby Congress to restore the prior state of affairs, as not having an ML industry is somewhat of a national security issue if the rest of the world is going full steam ahead on making Skynet.
"Dad, Dad! Can you give me your gun? I want to shoot myself in my foot!" - "Oh no, why would you ever want to do that?!" - "The neighbours are doing it too!" - "Oh, alrighty then!"
I swear, world politics and economics aren't much different from that, intellectually…
The problem with this analogy is that the companies are large enough to have leverage over the legal system and would likely be able to avoid any consequences. Even if case law eventually rules in favor of the original copyright holders of the training data, it's the customers of the ML companies who would most likely be directly liable for infringement, not the ML companies themselves (though the customers could then try to sue them for damages).
Since there is no legal precedent and the law itself isn't clear about this use case, it's basically a huge gamble on a legal gray area at this point. For VCs the risk doesn't really matter, as ML startups only need to exist long enough to provide an exit with high ROI; for enterprise companies it doesn't matter either, as ML products are just one of many ventures for them.
It's worth noting that unlike Germany, where book and newspaper publishers have won rather unusual copyright claims against companies like Google, in the US the big publishing industries to worry about are movies and music, and most ML projects right now seem to focus on generating images or text rather than music or video. If "AI generated music" caught on like DALL-E 2 did, I think we'd see a lot more contention over how copyright law applies to ML training data.
We could, you know… ask people to donate content to these systems. They could train on Creative Commons material and ask Twitter to build image license options into its UI, then train on all the freely licensed images.
So we could actually follow copyright law and still have an ML industry. But I'm not sure big tech wants to ask the public for consent. They would rather have free rein to do whatever they want.
By the way, I don't like copyright law or the concept of IP. But I find it a little annoying that I'm supposed to respect IP law while ML stuff can just ignore it. Also, if big tech were forced to encourage people to share stuff with an open license, it would be a huge net good for society! Instead nothing changes, but big tech gets to take advantage of people's copyrighted works and artists can't do anything to stop it. That kinda sucks.
As a tangent... a human being who reads up on websites (that are not explicitly in the public domain) is in essence also "training" themselves on that data. It would be really short-sighted (and impossible to begin with) to forbid an AI from reading it if a human is allowed to.
But even human-generated work has all sorts of domain-specific "fair use" rules to comply with, including plagiarism (academia), open source licensing (code), attribution for derivative works (CC) and so on. People whose derivative works hew too closely to their sources are scrutinized and face social and business pressures (art, law).
Today's ML throws everything into a mixer and then blanket-labels the result "ML-generated output", because the original training content has been separated from the social and legal frameworks that govern it.
Absolutely, but it's the plagiarizing output of the human (or AI) that might be ethically wrong or illegal when used in certain ways, not the human (or AI) reading the input documents in the first place.
Eventually, like in the music industry, we will have dedicated legal teams taking to court anyone who might be using a combination of those 5 colors out of the 50 million color combinations we "own", simply because filling your work schedule with court cases is the right thing to do.
The funniest cases will be where the ML independently reproduces a picture from unrelated materials, with humans making a futile effort to figure out how it obtained this result.
What about my biological intelligence? Let's say I'm just training myself (learning) by browsing and reading a lot of open source projects, without even bothering to check what the licenses say. Am I possibly violating any license? I will then produce some output based on this by writing code at work (that output will not be copy-pasted, but it will certainly be based on the "training" I received from reading through the open-source code).
Ah yes, I envision a language where every method is someone's property, with various subscription levels for each. Imagine how well maintained everything would be, how rich in features....