To Support AI, Defend the Open Internet and Fair Use


October 17, 2024


This piece originally appeared at Tech Policy Press.

The emergence of generative AI over the last two years has been so remarkable that it's easy to forget the AI breakthrough is downstream of another remarkable innovation: the free and open internet. Preserving and building upon today's AI progress requires continued access to the internet and its corpus of information.

Access to information through the open web lets models draw upon a diverse array of sources, improving their accuracy, functionality, and quality. Training AI models is no small task, with frontier models relying on millions of different pieces of text, imagery, and computer code. That material is then “tokenized”: broken into smaller snippets of text or characters that serve as the basic units of model training. For the technology to continue improving and eventually reach its full potential, protecting access to information online will be critical.
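To make “tokenized” concrete, here is a deliberately simplified sketch in Python. Production models use subword schemes such as byte-pair encoding; the whitespace-and-punctuation split below is a toy stand-in for illustration only, not any real tokenizer's algorithm.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Toy tokenizer: split text into words and punctuation marks.
    # Real tokenizers break text into subword units instead, but the
    # principle is the same: text becomes a sequence of small pieces.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Fair use promotes the free flow of information.")
print(tokens)
# → ['Fair', 'use', 'promotes', 'the', 'free', 'flow', 'of', 'information', '.']
```

Each token is then mapped to a numeric ID, and it is those sequences of IDs, not the original documents, that a model consumes during training.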

But protecting access from whom? While legislators and regulators are a perennial threat to emerging technologies, at least in the United States a larger threat looms: copyright lawsuits from dozens of rights holders, including newspapers, record labels, and individual creators, that could end the AI age before it begins.

To date, there are 32 active lawsuits against AI firms claiming that the use of copyrighted materials in training data sets constitutes willful infringement of copyright. If judges rule for the plaintiffs and adopt the maximal stance many plaintiffs are asking for, the resulting fines would likely bankrupt dozens of AI companies and stunt downstream development of AI-powered tools.

The Authors Guild v. OpenAI case illustrates the magnitude of the costs a maximal stance could impose. The plaintiffs argue that OpenAI’s use of training sets that include copyrighted works constitutes willful infringement of their copyright, which carries a maximum statutory penalty of $150,000 per work infringed. The consolidated class action includes “Fiction and Nonfiction Author Classes” counting “at least tens of thousands of authors and copyright holders” affected by training on data sets such as Books3, which includes more than 200,000 works. If 15,000 authors joined the suit (the number of signatories on an Authors Guild open letter calling for OpenAI to license guild members’ work for training), at one copyrighted work per author the maximum copyright infringement fine could reach $2,250,000,000. OpenAI’s latest funding round closed at $6.6 billion, so the fines from this one case alone would consume a little more than a third of those funds. OpenAI currently faces 11 additional copyright lawsuits, so if one set of plaintiffs succeeds, total fines could balloon further, leaving one of the best-resourced AI firms insolvent.
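As a back-of-the-envelope check, the figures in this paragraph work out as follows. The per-work statutory maximum and the 15,000-author scenario come from the case materials cited above; this is simple arithmetic, not a model of how damages would actually be calculated.

```python
# Statutory maximum for willful copyright infringement, per work infringed.
max_statutory_fine = 150_000

# Scenario from the article: 15,000 authors (the open-letter signatories),
# one copyrighted work each.
authors = 15_000
max_fine = authors * max_statutory_fine
print(f"${max_fine:,}")  # → $2,250,000,000

# Compare against OpenAI's latest funding round of $6.6 billion.
openai_round = 6_600_000_000
print(f"{max_fine / openai_round:.0%} of the round")  # → 34% of the round
```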

It is easy to sympathize with the arguments behind these cases and the individuals bringing them. Knowing that information you put out in the world is being used in ways you did not foresee can be unsettling. But the plaintiffs’ desired outcomes would lead to a less open, creative, and accessible internet within the United States, which would disadvantage the US from an economic, strategic, and cultural standpoint.

AI developers are relying on the argument that training an AI model constitutes “fair use,” a copyright doctrine that allows the unlicensed use of copyrighted material under certain conditions. Fair use was enshrined in statute by the Copyright Act of 1976, but the doctrine had long been part of copyright law through judicial precedent. Fair use is a counterbalance to the limited monopoly that copyright provides: it promotes the free flow of information and enables individuals to use established ideas or content in new and innovative ways. But these are risky and uncharted legal waters. Judges determine whether a given act is fair use with a four-factor test applied case by case, and no one really knows how that test will apply to AI training.

Fair Use Fosters Openness and Innovation

Historically, fair use has been a boon to innovative technologies. The freedom to access, analyze, and share information undergirds many beneficial technologies people enjoy today. Three notable instances of fair use from the past 40 years are illustrative.

The first instance worth considering is the “Betamax” case, in which the Supreme Court ruled that while a video recording device could lead to copyright infringement, it had substantial non-infringing uses whose benefits outweighed the potential costs. While some AI models have been shown to regurgitate strings of text that appear in their training data, by and large these models are not being used to pirate books or evade paywalls. Instead, they enable people to participate in and contribute to new commercial endeavors as well as personal pursuits.

The primary use cases for AI models currently are as co-pilots for accountants, customizable tutors for students, and digital research assistants. While model developers should be careful to ensure models are not able to provide infringing outputs, the public should not lose sight of the large and growing areas where AI models can support human workers and enhance the quality of their own work. Betamax’s survival enabled the creation of other commercial technologies and paths of distribution for content, creating more choices for consumers, new opportunities for creatives, and an expanded market for rights holders. Few could have predicted the ways in which this technology would increase the output of content and support future innovations.

A second instance is the Google Books case. Here the court held that the scanning and indexing of millions of works to provide new information about and uses of those works qualified as a “transformative use” that augmented public knowledge. Google scanned and “copied” millions of copyrighted works not to diminish the value of the authors’ work but the very opposite: to enable people to learn about these books, increasing access to information without undermining the market for the books themselves.

The Google Books project utilized cutting-edge machine learning (ML) techniques to uncover new patterns and information about long-established pieces of work, creating new ways to interact with and learn from such content. Fair use in this context enabled the creation of technology that improved access to information, spurred new research and knowledge-generating activities, and did not render research libraries or new books obsolete.

A third instance of fair use enabling beneficial technologies comes from the Perfect 10 v. Amazon case. The case centered on Google’s provision of snippets of text and reduced-size images within its search results, which the subscription website Perfect 10 claimed infringed its copyright. The court held that fair use was likely to prevail because of the transformative nature of the thumbnail images, as well as their incorporation into a reference tool that provided a social benefit.

In the context of AI models and their growing capabilities, access to a broad swath of digital information and works is critical to realizing their potential. For models to be usable and efficiency enhancing, they must be able to draw upon authoritative sources, as well as direct users toward the information they seek. While the media and broader content markets have undergone significant changes as a result of the internet, the amount of content and financial opportunities in such markets has continued to grow rather than shrink, often empowering smaller creators or upstarts who are willing to bet on new technologies. AI models are a new portal with which people interact with information, explore the work of others, and contribute to public knowledge.

The Costs of Copyright Maximalism

If the fair use defense for AI model training is rejected, it would be a significant setback to the development of the technology in the United States, which could have global implications.

AI model developers will be sued aggressively, and for many it will mean destruction. For example, if the judge overseeing the New York Times v. OpenAI case adopts a maximal stance, it could lead to a penalty in the ballpark of $2,400,000,000,000, and potentially higher. That might seem a fantastical figure, but it is derived from the maximum potential fine stemming from OpenAI’s use of the Common Crawl data set in training: the New York Times filing claims Common Crawl contains 16,000,000 unique records of content from the Times and argues that the use of such data sets constitutes willful infringement of its copyright. To say nothing of litigation costs, well-financed model developers and large incumbents will likely be able to license content where possible and draw upon large stores of first-party data. That will not be possible for many smaller model developers, or for the hundreds of new firms building applications on top of open foundation models.
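The same statutory-maximum arithmetic applied to the New York Times figures cited above yields the trillion-dollar estimate: 16 million allegedly infringed records, each at the $150,000 willful-infringement ceiling. Again, this is a back-of-the-envelope illustration, not a prediction of actual damages.

```python
# Unique Times records the filing claims appear in Common Crawl.
records = 16_000_000

# Statutory maximum per work for willful infringement.
max_statutory_fine = 150_000

potential_fine = records * max_statutory_fine
print(f"${potential_fine:,}")  # → $2,400,000,000,000
```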

Such a reality would be particularly harmful to the open source AI movement across the world. For developers in the Global South, relying on open foundation models enables them to customize models to better serve their populations without having to finance full-scale training runs, whose costs have risen consistently since the introduction of ChatGPT. If data is further locked away and made available only to the most well-resourced firms, the chances of AI disrupting the status quo of technology markets will wither away.

From a strategic perspective, such a ruling could set off a frantic movement of capital and talent out of the US to nations with more permissive regulatory frameworks, such as Japan, Singapore, Israel, or the UK. If firms in the US are required to pay licensing fees to access any data that is copyrighted, this would upset the existing balance between copyright’s limited monopoly and fair use, making it harder to share ideas and iterate upon existing knowledge.

This would be a boon to authoritarian nations’ AI development, especially within the People’s Republic of China (PRC). The PRC has long been a leader in using AI/ML to surveil its citizens online and off and in exporting those tools to other nations. Chinese model developers are also pushing into open source, exporting state-approved models whose development is guided by China’s restrictive laws on speech and expression. Allowing such models to become pervasive globally could have significant consequences for freedom of information, digital repression, and democratic values.

Beyond geopolitical considerations, allowing copyright law to obstruct AI model development could lead to the US abandoning its position as a leader in technological innovation to placate rights holders. While there are legitimate concerns about how the diffusion of AI conflicts with established property rights, accepting the maximalist position put forward by the New York Times and similar plaintiffs would be an embrace of stagnation. Rather than grappling with the difficult questions about how to best leverage new technologies to support creators and innovators, these cases would ossify the status quo for temporary stability.

Such an approach is not surprising, but disheartening nonetheless. Preventing a technology such as AI from evolving in the US may protect existing industries or firms, but it will be to the detriment of society, as builders and creators will look elsewhere, encouraging the next wave of innovation to form beyond our shores.

Discussions around AI and copyright protection are far from resolved. But to build upon existing frameworks and ensure technological innovation can support human creation, we must allow the technology to develop. Current copyright suits would rob us of this chance. Ensuring that training is protected under fair use, whether through judicial ruling or legislative action, is integral to realizing the benefits of AI models for people in the US and around the world. The strongest bulwark against authoritarian misuse of technology is a coalition committed to an open and diverse internet that enables access to information, collaboration, and innovation. Defending fair use and access to the open internet are critical steps toward preserving those values.
