QEMU has formally adopted a policy that rejects contributions containing code generated by AI tools. The core reason is the concern that such AI-generated code cannot satisfy the requirements of the Developer’s Certificate of Origin (DCO), which contributors rely on to attest to the provenance of their patches and their right to submit them. What, then, should we do about this issue?
https://github.com/qemu/qemu/blob/master/docs/devel/code-provenance.rst#use-of-ai-content-generators
Why is AI-generated code at odds with the DCO?
There are two reasons why AI-generated code has difficulty clearing the DCO. The first is the problem of human authorship. The DCO requires that the contribution be “created by me,” yet in many jurisdictions AI-generated code is not recognized as a copyrightable work, so it is legally difficult to demonstrate the existence of a human author. The second is uncertainty about whether the generated code is truly clean. AI models are trained on software code released under a wide variety of licenses worldwide, so there is a possibility that a model reproduces, by chance, a snippet whose license is incompatible with the target project’s license.
Today it is becoming common to use tools that compare code fragments of a certain size against already published source code, so the second concern might be considered a relatively small risk. How far that risk is actually mitigated, however, remains unclear. The first issue, human authorship, is inherently harder to resolve.
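To make the similarity-checking point concrete, here is a minimal sketch of the kind of comparison such tools perform: hash every k-token window of a candidate patch and measure how many of those windows also occur in an index built from published code. This is only an illustration; real scanners (and techniques such as winnowing) are far more sophisticated, and the function names, window size, and review threshold below are my own assumptions rather than the behavior of any particular product.

```python
import hashlib

def fingerprints(source: str, k: int = 8) -> set[str]:
    """Hash every k-token window of the source into a set of fingerprints."""
    tokens = source.split()
    if not tokens:
        return set()
    count = max(len(tokens) - k + 1, 1)
    return {
        hashlib.sha256(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(count)
    }

def overlap_ratio(candidate: str, corpus_fps: set[str], k: int = 8) -> float:
    """Fraction of the candidate's windows that also appear in the published-code index."""
    cand = fingerprints(candidate, k)
    return len(cand & corpus_fps) / len(cand) if cand else 0.0

# Hypothetical usage: corpus_fps would be built offline from published code, and a
# patch whose overlap exceeds some threshold (say 0.2) would be flagged for manual
# license review rather than rejected outright.
```

Tokenizing before hashing makes the comparison robust to whitespace and formatting differences, which is one reason such tools usually normalize code rather than matching raw text.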
AI-generated code is not recognized as copyrighted in many jurisdictions. This is effectively similar to public-domain code. Incorporating public-domain code into Open Source software raises no license-compatibility issues at all. Moreover, not every line of source code is protected by copyright; “ideas” and “stock expressions” with low human creativity fall outside copyright protection. Taken together, one could theoretically argue that incorporating AI-generated code into a project merely reduces the proportion of code that is subject to copyright protection.
However, it is extremely difficult for a human to guarantee that AI-generated code is truly free of copyright. While many jurisdictions hold that AI-generated code is not copyrightable, a globally unified legal view is far from established, and it is unclear under what circumstances human copyright might nonetheless be recognized in such code, leaving considerable legal uncertainty. If someone asserts a copyright claim over AI-generated code, it is hard for the AI user to prove that “this was generated by an AI, carries no copyright, and is unrelated to your work.” In short, there is currently no way to guarantee that AI-generated code is entirely dissimilar to third-party code and free of copyright. It is therefore likely that more projects will follow QEMU’s lead and reject AI-generated code.
Is there still a path to accepting AI-generated code?
For Open Source projects to accept AI-generated code, the first requirement is further improvement of the tools that check similarity with existing code. Excellent commercial products already exist, but the Open Source tooling needs to evolve as well. In addition, to improve the transparency of AI-generated code, it is important that the SPDX specification quickly add a way to identify AI-generated code so that various systems can tag it automatically.
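As a sketch of what such identification could look like at the file level, the header below extends the existing `SPDX-License-Identifier` convention with AI-related tags. Note that `SPDX-AI-Generated` and `SPDX-AI-Tool` are hypothetical names used purely for illustration; the actual field names and semantics would be for the SPDX specification to define.

```python
# SPDX-License-Identifier: GPL-2.0-or-later
# SPDX-AI-Generated: partial               # hypothetical tag, not part of the current SPDX spec
# SPDX-AI-Tool: <model name and version>   # hypothetical tag, not part of the current SPDX spec

def example() -> None:
    """Placeholder standing in for AI-assisted code."""
```

If tags of this kind were standardized and machine-readable, CI pipelines and license scanners could automatically route AI-assisted files into whatever review process a project’s policy requires.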
Furthermore, QEMU’s policy rejects AI-generated code because the DCO, as interpreted, requires a human author. The DCO itself may therefore need to be updated for the age of AI. Historically, the DCO was introduced to head off disputes over the provenance of contributions to the Linux kernel during the period of litigation threats from the SCO Group. Since the provenance of AI-generated code is exactly what is at issue, creating a DCO v2 seems, at first glance, like the ideal solution.
However, the DCO is already used by many projects and hosting platforms, and a casual revision would likely have significant side effects. The DCO’s greatest strength lies in its simplicity and universality, and any update that undermines those qualities would be hard to accept. In practice, adapting the DCO text itself to handle AI-generated code is extremely challenging. A more realistic approach is to keep the DCO’s core declaration, “created by me,” intact and instead supplement the surrounding documentation to clarify what contributors must do in order to make that declaration in good faith. Leaving this entirely to individual projects, however, would cause interpretations of the DCO to diverge, confusing both contributors and maintainers and damaging the health of the ecosystem.
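For context, the DCO declaration is made today simply by adding a `Signed-off-by` trailer to each commit, typically with `git commit -s`. Supplementary guidance of the kind described above could, for example, ask contributors who used AI assistance to disclose it in an additional trailer. The `Assisted-by` trailer below is purely hypothetical and only illustrates how such a convention could sit alongside the existing sign-off without touching the DCO text itself.

```
<subsystem>: <short description of the change>

Signed-off-by: Jane Developer <jane@example.org>
Assisted-by: <AI tool name and version>
```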
This is a very difficult task, but I believe that the Linux Foundation, as the drafter of the DCO, needs to provide a certain level of guidance on how the DCO should be interpreted in the age of AI.
The Linux Foundation does set out a generative-AI policy at https://www.linuxfoundation.org/legal/generative-ai, but that document alone is unlikely to serve as practical guidance for the tens of thousands of projects that adopt the DCO. In particular, guidance is needed on the part of the DCO that presupposes human authorship. The level of acceptable risk clearly differs between projects where legal stability is paramount, such as the Linux kernel, and projects where development speed is prioritized, such as web frameworks; even so, at least a minimum common stance on accepting AI-generated code under the DCO needs to be established.
