Creative Commons and the NYU Stern Fubon Center for Technology, Business and Innovation recently hosted a workshop in NYC, inviting participants with expertise in IP and various “open” movements and communities to give feedback on their AI-related proposals. This article was prompted by my participation in that workshop.
Creative Commons has been working on a set of “preference signals” that copyright holders can use to indicate how they would like AI developers to treat their works when considering them for AI training. Currently, these preference signals are meant to be applied at the data set level, not to each individual work.1 Creative Commons has said that it is not treating these preference signals as legally enforceable at the moment, presumably because it believes that using copyrighted works to train AIs is likely to be considered “fair use” under US copyright law. Where use of a copyrighted work is deemed a “fair use,” a license attempting to prevent or limit such use is unenforceable.2 Wikimedia, the largest and most famous licensor to employ Creative Commons licenses, agrees that the fair use defense is likely to prevail.3
I think this approach is premature.
EU Copyright Law Rules the World
EU AI Act Brings EU Copyright Law to the World
Many jurisdictions do not have a concept of “fair use,” but instead have statutory exemptions from copyright liability. In the EU, the Directive on Copyright and Related Rights in the Digital Single Market (the “CDSM Directive”) allows commercial model developers4 to copy and extract content from copyrighted works for purposes of text and data mining (TDM), provided that the works are lawfully accessible and that the model developer abides by copyright holder opt-outs. The EU AI Act’s Article 53(1)(c) takes the unusual step of importing EU copyright law and the obligations of the CDSM Directive into the EU AI Act and applying them to all general-purpose AI model providers subject to the Act, even if they would not otherwise be subject to the CDSM Directive or European copyright law. That means that model developers subject to the EU AI Act still have to abide by training opt-outs, even if AI training is protected by fair use in the US or elsewhere. Article 53(1)(c) requires each provider of a general-purpose AI model to:
(c) put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790;
The EU AI Act’s scope is surprisingly broad in three ways. First, Recital 106 states that all AI developers subject to the EU AI Act must follow EU copyright law and respect opt-outs, even if they conduct model training outside the EU.5 This is unusual because, generally, the copyright law applicable to any copyright-related act is the law of the jurisdiction where the act is committed. Here, the EU specifically did not want to see the sale of models in the EU that would have been illegal to train in the EU. Second, it’s not clear whether the intention is for model providers to respect opt-outs only for works governed by EU copyright law, or for all works from all over the world.6 The language is ambiguous on this point. Even if it applies only to works subject to EU copyright law, though, it would be impossible to identify such works on a mass scale with any degree of certainty.7 Therefore, in practice, companies will abide by opt-outs broadly to the extent standards for expressing them emerge.8
Third, the scope of entities to whom the EU AI Act applies is even broader than that of Europe’s main privacy law, the GDPR. The EU AI Act is not limited to companies operating in the EU or selling products or services into the EU from third countries; its scope extends to any model provider whose output “is used in the Union.”9 Potentially, that means a book with AI-generated images created in the US and sold in the EU is within the scope of the EU AI Act, and the model developer behind those images must then comply with EU copyright law. In other words, it’s almost impossible to escape EU copyright law with any certainty, since model providers have limited control over their users and users might have limited control over where their outputs end up.10
Creative Commons’ Protocol Proposal
The upshot for Creative Commons is that the TDM opt-out structure can be a vehicle for making preference signals legally enforceable against the vast majority of commercial AI model providers worldwide. The latest draft of the EU AI Act’s “General-Purpose AI Code of Practice, Copyright Section” specifies in Measure I.2.3 that model providers should follow the robots.txt protocol specified by the Internet Engineering Task Force and “make best efforts to identify and comply with other appropriate machine-readable protocols…” While some of the protocol proposals are binary (“ok to train” v. “don’t train”), a number of organizations have put forward proposals that include additional licensing terms or permissions. Which protocols will be legally accepted seems to depend on which ones achieve popular adoption and public recognition. In practice, if a few major organizations like Common Crawl and EleutherAI get on board, that’s likely to be sufficient. Creative Commons’ stature certainly positions it well for meeting these criteria.
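To make the robots.txt piece of this concrete, here is a minimal sketch of how a training-data crawler might check a publisher’s robots.txt before fetching a page, using only Python’s standard library. The sample robots.txt content and the crawler name are illustrative assumptions, not part of any CC proposal or of the Code of Practice; the point is simply that the most widely recognized machine-readable signal today is a binary, per-crawler allow/disallow.

```python
# Minimal, illustrative sketch: checking a robots.txt-style opt-out before
# fetching a page for AI training data. The robots.txt content and crawler
# names are hypothetical examples, not an official protocol.
from urllib import robotparser

# A publisher might serve something like this to opt a specific AI-training
# crawler out of collection while leaving ordinary crawling alone.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

def may_fetch_for_training(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    page = "https://example.com/essays/some-article"
    for agent in ("ExampleTrainingBot", "SomeOtherCrawler"):
        print(agent, "allowed:", may_fetch_for_training(EXAMPLE_ROBOTS_TXT, agent, page))
```

Note what this format cannot express: robots.txt only says which crawlers may fetch which paths, so preference signals that carry additional licensing terms or permissions, like the non-binary proposals mentioned above, require a more expressive machine-readable protocol.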
Enforcement of TDM Opt-Outs
Should CC preference signals become legally recognized in the EU, the applicable enforcement mechanisms will look different from those applicable to CC licenses. EU authors could bring copyright infringement claims against non-compliant companies that conduct training in the EU, but probably not against those that conduct training outside the EU. Such plaintiffs would need to look to the EU AI Act instead. The Act cannot be enforced through private action, but a complaint could be filed with the relevant national regulatory agency for investigation. Corporate AI customers could potentially terminate their agreements and sue for breach of contract if an AI provider doesn’t respect CC preference signals, since most contracts require the vendor to comply with all applicable laws. In the meantime, such customers can specifically require compliance with CC preference signals in their contracts, and they can make that a formal procurement requirement when selecting vendors in the first place. Since the EU AI Act carries hefty penalties like the GDPR,11 the lack of a private right of action is unlikely to keep companies from complying.
Fair Use in the US is Not a Foregone Conclusion
There are a lot of excellent papers by IP experts making well-reasoned arguments for a finding of fair use with respect to AI training. But it’s important to remember that these papers are meant to persuade individual judges about how they should rule; they are not nationwide forecasts of judicial rulings. Courts are not always perfectly logical, and many struggle to understand the technology they are asked to rule on. Think about the felony convictions doled out under the Computer Fraud and Abuse Act for security research and mere terms of service violations. Bad facts can lead to bad law: where the defendant is reprehensible enough in the eyes of the court, the court is inclined to find a way to rule against them in the interest of fairness, without regard for the good-faith defendants that might follow later. Our system of law lurches forward slowly and unevenly, revealing only certain legal insights about certain types of technology in certain jurisdictions over many years. Keep in mind that it took over a decade to resolve the relatively straightforward question of whether copying APIs is copyright infringement in just a single dispute (Google v. Oracle). Even iron-clad logic is not a guarantee of any specific legal outcome, not even on a very long timeline.
Unpredictable Application
In the US, fair use is not an exception to copyright law; it’s a defense against copyright infringement that involves arguing that a complicated set of very fact-specific factors weighs in the defendant’s favor. So even though a court may hold that the defense is valid in one case, there is no guarantee that it will prevail in similar cases. Courts regularly make surprising or novel distinctions between similar cases, particularly where the underlying facts paint the defendant in a negative light.
One need look no further than the Supreme Court’s acceptance of the fair use defense with respect to the VCR (Sony Corp. of America v. Universal City Studios, Inc.)12 and its subsequent rejection of it with respect to peer-to-peer (P2P) networks (MGM Studios, Inc. v. Grokster)13 to see such distinctions. In both cases, the underlying technology could facilitate non-infringing copying and distribution of copyrighted works: VCRs enable time-shifting, allowing people to view shows at a time more convenient to them, and P2P networks were commonly used by universities for internal exchange of research and for distributed computation projects like Folding@home, which used P2P networks to simulate protein folding. In both cases, the technology could also be used in an infringing manner, and the purveyors of the technology publicly advertised uses that would clearly constitute copyright infringement: creating a personal home library of shows and movies on VHS tapes in the case of Sony, and downloading copyrighted music in the case of Grokster. On the face of it, the cases presented similar facts, and many IP experts predicted a win for Grokster. But, undoubtedly, Sony’s well-known and highly respected brand, combined with the justices’ own use of VCRs, swayed them in one direction, while Grokster’s motley crew of anarchists and its association with the “dark web” swayed them in the other.
Inability to Make Blanket Fair Use Rules
With respect to AI in particular, distinctions may be drawn between different types of AI models (e.g., generative v. predictive models), different modalities (e.g., images v. text), the various domains where the models are used, and the purpose of the use. The Copyright Alliance gets this point right: “unauthorized use of copyrighted material to train AI systems cannot be handwaved by a broad fair use exception… Neither the Copyright Act nor case law… would support such a broad fair use exception for AI.” Each copyright infringement claim must be evaluated in the context of the model’s intended use case and whether it is, in practice, offering substitutes in the market for the kinds of works that comprise its training data.14
Likelihood of Inconsistency Between Circuits
The current crop of AI cases can easily be distinguished from one another, should a court wish to do so, because of the diversity of the plaintiffs (some dramatically more sympathetic than others) and the modalities involved (as well as many other factors). The cases are spread among various circuits. The specific issues the parties might choose to appeal to the U.S. Courts of Appeals or to the Supreme Court, and the postures of the cases when they arrive, are unpredictable. The US is likely to have a patchwork of AI-related precedent across the various circuits that does not gel into a cohesive, consistent whole for many years to come (if ever), in the same way that the fair use doctrine itself took many years to come together. AI companies may end up with guidance on code-generating AIs in only one circuit, national guidance on training predictive models specifically on training data behind a paywall, and a single district court opinion in another circuit on image generation that expresses a lot of outrage over outputs but never squarely addresses the training step itself.
Arbitrary Rulings
It’s also possible for a court to throw a complete curveball, such as in Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. In that case, Ross was accused of copyright infringement for using Thomson Reuters’ case summaries to train its AI-powered case search engine, which suggested the names of cases when queried with specific legal questions. The judge inexplicably rejected Ross’s fair use defense because all the fair use cases raised by Ross related to the copying of code rather than text. This argument of course ignores major fair use cases that don’t relate to the copying of code (which were referred to in the cases that Ross cited), including those related to Google’s mass scanning of books to enable search within books, as well as Amazon’s and Google’s copying of images from all over the web to enable image search. Here, the judge distinguished the case before him from precedent by simply ignoring much of the precedent.
Conclusion
Given the uncertainty about where the law might go on the issue of fair use and AI training, a presumption that CC preference signals are unenforceable is premature. Declaring that the signals are unenforceable right out of the gate robs legal counsel of any gravitas they might bring to a compliance request. Companies don’t spend money complying with voluntary frameworks unless and until they get (or avoid) something tangible in return, and in this case, those benefits can’t materialize until there is sufficient adoption of the signals. Even in the world of open source software, where the benefits of the software are very tangible and the licenses are clearly enforceable, a huge portion of companies still don’t put in the time and effort necessary to do compliance at that scale. It would be much more effective to begin with the notion that the signals are enforceable, particularly in the EU, in order to drive adoption and compliance. Even if, decades from now, they turn out not to be enforceable in any jurisdiction, by then they may continue to function on the basis of norms.