
ZDNET's key takeaways
- Generative AI is erasing open source code provenance.
- FOSS reciprocity collapses when attribution and ownership disappear.
- The commons that built AI may not survive its success.
We live in an astonishing technology-based world, fueled by and dependent on software. That software provides our networks, our security, our financial transactions, our supply chain management, and, of course, the generative AI systems that are top of mind for just about everyone.
But where does that digital infrastructure come from? Nearly all of it is based on free and open source software, what the industry calls FOSS. This is code built by enormously collaborative communities, driven by coders who use the fruits of FOSS and who also actively contribute back bug fixes and improvements.
Also: How AI coding agents could destroy open source software
This reciprocity, contributions flowing back into the code, is the heart of FOSS, which in turn sits at the foundation of modern society. The amazing thing about our open source infrastructure is that it's governed by fundamental agreements about the provenance of the code.
Provenance and copyleft
It should be possible to trace every single line of code back to its originator. This core provenance element of FOSS is often governed by what are called "copyleft" licenses. Copyleft is basically copyright turned inside out (hence the cutesy term): copyright restricts use and modification without the owner's permission, while copyleft uses those same legal rights to require that modified code be shared under the same terms as the original code.
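To make provenance concrete: many FOSS projects mark every source file with a copyright notice and an SPDX license identifier, so anyone who copies the file carries its obligations along with it. Here's a minimal sketch of that convention; the project, author, and function below are hypothetical.

```python
# SPDX-License-Identifier: GPL-3.0-or-later
# Copyright (C) 2024 Example Project contributors <dev@example.org>
#
# Hypothetical file from a GPL-licensed project. Anyone who modifies and
# redistributes this file must keep this notice and release their changes
# under the same license -- that is the copyleft obligation in practice.

def parse_config(path):
    """Toy stand-in for real project code: read simple key=value pairs."""
    with open(path) as f:
        return dict(
            line.strip().split("=", 1)
            for line in f
            if "=" in line and not line.lstrip().startswith("#")
        )
```

When a generative AI regurgitates the body of a function like this without its header, the part that gets lost is exactly the part that makes reciprocity possible.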
Also: Europe's plan to ditch US tech giants is built on open source - and it's gaining steam
We recently ran into another huge provenance question with OpenAI's Sora 2, which is capable of reproducing the likenesses and voices of real people. For my deep dive on the Sora 2 issues, I had the chance to speak with Sean O'Brien, founder of the Yale Privacy Lab at Yale Law School.
In our conversation, though, we went beyond discussing generative AI video to discuss the core issues of generative AI and code itself. Sean says, "For software development, this creates a dangerous situation. Snippets of proprietary or copyleft reciprocal code can enter AI-generated outputs, contaminating codebases with material that developers can't realistically audit or license properly."
In other words, it nukes provenance itself, which determines not only who developed the software, but also who owns it, who is responsible for it, and what rights transfer with it. Sean says that AI code generation is creating a culture of willful blindness toward FOSS licensing, if not outright animosity toward licenses like the GNU GPL (one of the main licenses that govern open source code).
Also: Can AI even be open source? It's complicated
FOSS licenses almost always require attribution, and often redistribution under identical terms. Once AI output is mixed into a codebase, authorship lines blur, and complying with those terms becomes practically impossible.
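To see why compliance breaks down, consider a license audit. The sketch below (my illustration, not a real compliance tool) scans a source tree for the SPDX identifiers shown above. Human-written FOSS files typically carry one; an AI-generated snippet pasted into the tree carries nothing for the scanner, or a lawyer, to find.

```python
import pathlib
import re

# Matches the SPDX convention illustrated earlier, e.g.:
# "# SPDX-License-Identifier: GPL-3.0-or-later"
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")

def audit_licenses(root="."):
    """Map each source file to its declared license, or UNKNOWN."""
    results = {}
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        match = SPDX_RE.search(text)
        # Files with no identifier are the audit gap described here:
        # code with no stated license, author, or origin.
        results[str(path)] = match.group(1) if match else "UNKNOWN"
    return results

if __name__ == "__main__":
    for file_name, license_id in sorted(audit_licenses().items()):
        print(f"{license_id:22} {file_name}")
```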
The legal gray zone
In the Sora 2 article, O'Brien described a four-part doctrine that's forming in US law. To recap: first, only human-created works are copyrightable. Second, generative AI outputs are broadly considered uncopyrightable and "Public Domain by default." Third, the person or organization using an AI system is responsible for any infringement in the generated content. And finally, training on copyrighted data without permission is legally actionable and not protected by ambiguity.
FOSS has always depended on a reciprocal ecosystem. The GNU GPL and similar copyleft licenses rely on traceability. He says that when developers reuse code, they know its origin and its obligations. Those obligations, such as attribution, redistribution, and contribution of improvements upstream, are what replenish the commons.
Also: 3 tips for navigating the open-source AI swarm - 4M models and counting
Open software has always counted on its code being regularly replenished. As part of the process of using it, users modify it to improve it. They add features and help to guarantee usability across generations of technology. At the same time, users improve security and patch holes that might put everyone at risk.
But O'Brien says, "When generative AI systems ingest thousands of FOSS projects and regurgitate fragments without any provenance, the cycle of reciprocity collapses. The generated snippet appears originless, stripped of its license, author, and context."
This means the developer downstream can't meaningfully comply with reciprocal licensing terms because the output cuts the human link between coder and code.
'License amnesia'
Even if an engineer suspects that a block of AI-generated code originated under an open source license, there's no feasible way to identify the source project. The training data has been abstracted into billions of statistical weights, the legal equivalent of a black hole.
The result is what O'Brien calls "license amnesia." He says, "Code floats free of its social contract and developers can't give back because they don't know where to send their contributions."
Also: Anthropic's open-source safety tool found AI models whistleblowing - in all the wrong places
Yale's O'Brien says the contemporary software industry, and really a huge segment of the global economy, owes its existence to FOSS and this idea of a digital commons of intellectual resources. We call this "open source," but O'Brien contends, "That term is not only inaccurate, it ignores that software development in the 21st century is an ecological system. That system relies upon upstream FOSS projects and downstream recipients of code, who take it and remix it into yet more software."
Also: Open-source skills can save your career when AI comes knocking
"Once AI training sets subsume the collective work of decades of open collaboration, the global commons idea, substantiated into repos and code all over the world, risks becoming a nonrenewable resource, mined and never replenished," says O'Brien. "The damage isn't limited to legal uncertainty. If FOSS projects can't rely upon the energy and labor of contributors to help them fix and improve their code, let alone patch security issues, fundamentally important components of the software the world relies upon are at risk."
Some enormous irony here
O'Brien sets the stage: "What makes this moment especially tragic is that the very infrastructure enabling generative AI was born from the commons it now consumes. Free and open source software built the Internet: from Linux kernels running the servers, to Apache and Nginx powering the web, to PostgreSQL and MySQL managing data, to Python, GCC (the GNU Compiler Collection), and TensorFlow enabling the machine learning revolution. Every cloud provider, every hyperscale data center, every LLM pipeline sits on a foundation of FOSS."
Thousands of volunteer maintainers, students, researchers, and small collectives built and sustained the FOSS projects that corporations later built their fortunes upon. O'Brien says, "Now those same corporations are using that wealth and compute to train opaque models on the very codebases that made their existence possible, and threatening the legal structures, such as reciprocal or copyleft licenses like GNU GPL, by labeling all the outputs of genAI chatbots public domain."
He says that, in doing so, they are dismantling the conditions that made FOSS collaboration viable.
The bottom line is this. Yale's privacy guru says, "If we don't recognize that FOSS isn't just a licensing regime but civic infrastructure, then the next generation of developers will inherit a world where coding is privatized, history is obscured, and the Internet itself becomes another closed platform of code labeled public domain by LLM-powered chatbots that is locked up and proprietary."
Also: Trump's AI plan says a lot about open source - but here's what it leaves out
O'Brien says, "The commons was never just about free code. It was about freedom to build together." That freedom, and the critical infrastructure that underlies almost all of modern society, is at risk: attribution, ownership, and reciprocity all blur when AIs siphon up everything on the Internet and launder it (the analogy to money laundering is apt), obscuring the code's provenance.
What do you think? Is AI threatening the very foundations of open source, or can the FOSS community adapt to this new reality? Do you believe AI-generated code should carry the same licensing responsibilities as human-written code? How should developers and companies ensure attribution and reciprocity when using AI tools? Let us know in the comments below.
Want more stories about AI? Check out AI Leaderboard, our weekly newsletter.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.


