AI Mainstreams Plagiarism


It seems that AI has mainstreamed plagiarism, and surprisingly few seem to care.

Committing my code

A person sits down to write code. His AI assistant is hot and ready. Type, type, wait... tab, insert, save, commit, push, deploy.

Whose is it?

This person claimed the code as his own. Is this plagiarism?

Whose code did he just deploy? The name on the commit is his own. Is that accurate? Is it plagiarism to call it his own?

Whose name is on the copyright notice in the source? It's the name of his company. Is that accurate? Has new intellectual property just been minted, or was this conveniently acquired through this tool?

The company claimed this code as its own. Is this plagiarism?

Where did the code come from? It was generated by the AI assistant. It got spit out of an LLM based on an input prompt. The AI company built the LLM and then provided use of it as a service, for money.

The AI company claimed this LLM and the input upon which it was built as its own. Is this plagiarism?

Where did the dataset for LLM training come from? No AI company I know of has hired a bunch of programmers to create the dataset upon which the LLM was trained. Generally, they scoop up what's available from wherever they can get it, owned or not owned, licensed or not licensed.

But someone wrote it. A human author put his fingers on a keyboard and created code to serve his own purposes.

Who was that author? He is unknown, separated by too many degrees. Too many degrees to identify personally? Too many degrees for anyone to care?

Why don't we care more about this author, about his work, and about our dubious claim to it as our own?

Yes, he's many degrees removed. He's faceless to us. And we seem determined to keep him so. How inconvenient it would be if he weren't.

Justification

Perhaps we justify that others plagiarized his work before us, so we're not the first aggressor, and we're acting no differently than many others.

Perhaps we justify that the original author released his work purposefully into the public domain, and all those who used it in the chain since then have also made no claim to the work.

Perhaps we justify that we're not copying, because this code is a derivative of a derivative, just one of a seemingly infinite number of possible outputs given the input.

Perhaps we justify that code is different from prose, and that the morality we were taught about plagiarism doesn't apply here, or at least not in the same way.

Perhaps we justify that in order to function in the utopia to be ushered in by AI, we have to adapt, and some of the old rules must be broken or conveniently ignored.

Perhaps we justify that it's no different than copy-pasting manually from someone else's codebase or a forum, and we've been doing that for years, so why the sudden fuss?

I think there's a lot of justification happening. There's a lot of plugging the ears and averting the eyes. Let's not give too much air time to the potential moral problems of copying code into and out of an LLM. There is, after all, a lot of money to be made.

A better future

When progress is being made, good moral boundaries sometimes get stepped over. When done on purpose, we should regain our moral standing by repairing what was broken and choosing what is right. When done by accident, we should do the same.

What could we do better?

We could train new LLMs on input that has been explicitly released by its authors for such purposes.

LLMs could track data provenance so that proper attribution could be made.

We could give credit in shared commits and shared copyrights, as sketched after this list.

We could redesign tools to not encourage blanket insertion of code.

We could re-train ourselves to not copy-paste what comes out of an LLM.
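As a small illustration of the shared-commits idea: Git already supports trailer lines at the end of a commit message, and platforms like GitHub recognize Co-authored-by trailers as shared authorship. The names below are placeholders, and the hard part, knowing whom to credit at all, is exactly what provenance tracking would have to supply:

    Add retry logic to the upload client

    Portions generated with an AI assistant; upstream training
    sources unknown.

    Co-authored-by: Jane Doe <jane@example.com>
    Co-authored-by: AI Assistant <assistant@example.com>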