This piece originally appeared in The Republic.
American AI developers and deployers are fighting a high-stakes, multifront war. On the global front, there is stiff competition from leading Chinese model developers boosted by Beijing’s corporate espionage.1 Across the Atlantic, the European Union passed its “landmark” Artificial Intelligence (AI) Act, as the bloc attempts to create a regulatory floor for AI technologies globally.2 On the domestic front, developers are facing a deluge of regulatory statutes at the federal, state, and local levels.3 But the most important front of this fight is the courts, where lawsuits from the establishment media industry could detonate a bomb within the American AI ecosystem.
Leading AI developers are facing lawsuits alleging that the datasets used to train their models violate copyright law since they include works owned by record companies and media outlets like the New York Times and the Intercept.4 AI companies are fighting for this practice to be ruled a “fair use,” a sometimes murky doctrine that allows for certain permissionless uses of copyrighted materials for transformative purposes.5
This is a battle between technological progress and court-mandated stagnation. If a court determines that using copyrighted works without express consent during model training is not protected by fair use, or if a circuit split emerges, it could have a massive chilling effect on AI development and investment. Such a ruling would empower legacy media organizations to shake down AI companies, which could do irreparable damage to American firms in the current AI arms race. Chinese companies have fewer worries about accessing data and would welcome an additional hurdle for US developers to work through.6 U.S. allies, meanwhile, are hoping that what they lack in ingenuity and market-leading firms can be made up for with friendlier regulatory regimes for text and data mining, a la Japan, Israel, and other jurisdictions.7
These lawsuits are an existential threat to the United States’ position in the AI ecosystem. Advocates for American dynamism and AI innovation must band together and ensure that training AI models is protected under fair use.
State of Play
OpenAI, Stability AI, Microsoft, Meta, Alphabet, Github, Nvidia, and Anthropic are all facing lawsuits alleging that their use of copyrighted materials during model training violates U.S. copyright law.8 Plaintiffs include visual artists, comedians, open-source developers, writers, newspapers, music publishers, and professional guilds. There is nothing new under the sun: these suits are the latest iteration of politically connected industries and individuals seeking to insulate themselves from creative destruction.9 While their desired outcomes vary, plaintiffs are seeking some combination of financial compensation for the alleged infringement, a requirement that copyrighted works be removed from training datasets, and a commitment that models developed on such works be destroyed or retrained.
On July 8, a suit filed against Stability AI in the U.S. District Court in Delaware was allowed to move forward to investigate claims of copyright infringement during model training.10 Stability AI is the creator of the text-to-image AI model Stable Diffusion as well as models that can generate audio and video from textual prompts.11 The judge overseeing the case has allowed discovery based on the plaintiffs’ copyright and Lanham Act claims against Stability. The Lanham Act created the national trademark system and enables trademark owners to recover damages for the misuse or violation of established trademarks.12 If Stability is found liable, statutory damages for copyright infringement alone could reach $1,800,000,000,000 (or $1,040,000,000,000,000,000,000,000,000,000,000,000,000 if trademark and Lanham Act claims are included), which would dissolve the firm based on its most recent valuation.13 This massive number comes from the scope of the datasets involved: the plaintiffs in the Stability case assert that each of the more than 12 million images used in training constitutes a willful infringement of their copyright.
As noted above, other model developers are facing copyright lawsuits that could bankrupt many, if not all, of the leading AI model developers in the United States. The New York Times suit against OpenAI and Microsoft could lead to statutory damages of as much as $2,431,456,050,000, as well as a requirement that OpenAI’s models be removed from the Internet and retrained. Nvidia, the designer of high-performance graphics processing units that are integral to AI development, has also developed its own class of generative AI models known as NeMo Megatron-GPT. The firm is facing a class action lawsuit over its use of the Books3 dataset, a widely used training set that includes copyrighted materials. The class action would allow any individual in the United States who owns a registered copyright for a work present in the dataset to join the suit and seek damages.14 Books3 includes 196,640 works, meaning a maximalist approach to statutory damages could lead to awards as high as $29,496,000,000, as well as a requirement that Nvidia destroy any models trained on such data.15 Because the model is open, any developers who built applications or their own models using NeMo Megatron would have to destroy their products or face liability. There are more than 30 active copyright infringement lawsuits against model developers seeking similar damages and remedies.16
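The maximalist figures above follow directly from the statutory-damages arithmetic. A quick sketch: the $150,000 per-work ceiling for willful infringement comes from 17 U.S.C. § 504(c); the implied New York Times work count is a back-calculation from the exposure figure, not a number taken from the complaint itself.

```python
# Back-of-the-envelope statutory damages under 17 U.S.C. § 504(c),
# which permits up to $150,000 per willfully infringed registered work.
STATUTORY_MAX = 150_000  # maximum per-work award for willful infringement

# Books3: 196,640 works, each treated as a separate willful infringement.
books3_works = 196_640
print(f"Books3 maximum exposure: ${books3_works * STATUTORY_MAX:,}")
# → Books3 maximum exposure: $29,496,000,000

# The ~$2.43 trillion New York Times figure implies roughly this many works
# (a back-calculation for illustration, not a figure from the complaint):
nyt_exposure = 2_431_456_050_000
print(f"Implied number of works: {nyt_exposure // STATUTORY_MAX:,}")
# → Implied number of works: 16,209,707
```

The per-work cap, not any single dataset, is what makes exposure scale so explosively with dataset size.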
These lawsuits are the latest test of the flexibility of copyright law as concepts of ownership, authorship, and attribution are challenged once again. The plaintiffs’ concerns about their work being misused to power a technology that could render them obsolete should be taken seriously and serve as motivation for all parties to explore remedies and find a new equilibrium. But history shows creating uncertainty around the development and improvement of emerging technologies, such as AI models, could have profoundly negative consequences for American innovation, security, and creativity.
Fair Use: An Innovator’s Shield in Copyright Battles
Copyright law in the United States began at the nation’s founding. Recognizing the economic value of limited intellectual property rights, the founders included Article I, Section 8 in the Constitution to promote the “Progress of Science and useful Arts.”17 This section empowered Congress to give authors and inventors exclusive rights to “their respective Writings and Discoveries” for limited times. Such protections were clarified and expanded throughout the 18th, 19th, 20th, and 21st centuries.18 In 1790, copyright protection lasted 14 years (renewable once) and covered maps, books, and charts. Today, it lasts for the life of the author plus 70 years, or, for works made for hire, 95 years from publication or 120 years from creation, and covers a much wider range of works.
One such update, the Copyright Act of 1976, enshrined the common law principle of fair use. Fair use allows for permissionless use of copyrighted works under certain conditions. Common examples include using a direct quote in an article, using a picture as a form of parody, or making home recordings of television broadcasts for personal use.19 Fair use helps to balance the power of rights holders with the interests of the public, ensuring the next generation of authors and inventors can access information to enable learning and building.
While fair use is enshrined in statute, the doctrine is applied on a case-by-case basis and often lacks certainty. The law provides a four-factor test to help courts determine the validity of a fair use claim. The factors considered are:
- The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
- The nature of the copyrighted work;
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
- The effect of the use upon the potential market for or value of the copyrighted work.
Of the four factors, the first and fourth are often given greater weight by courts and best predict the viability of a fair use defense.20 The question of how “transformative” the new use is, part of the first-factor analysis, can also be decisive in deciding a fair use claim.
Model Developer Munitions
While courts evaluate every fair use defense on a case-by-case basis, previous decisions can provide a foundation for a successful fair use defense for model training. Let us take a look at the javelins of jurisprudence that developers may turn to.
On the purpose and character of the use, parallels can be drawn to earlier cases related to search engines.21 These cases established that fair use applies to copying millions of books to create tools capable of indexing, analyzing, and providing information to promote accessibility and knowledge diffusion. The purpose of using information from across the web is not to break a paywall or pirate a book, but to create tools that further humanity’s pursuit of knowledge and mastery of the physical world. The Supreme Court has previously recognized how fair use can be an engine for innovation when a technology has “substantial, non-infringing uses.”22 The relevance for AI cases is clear. Rather than a regressive rule-breaking aid, AI models can enable access to information, improve productivity, and contribute to future innovations that can fuel economic growth.
When analyzing how “transformative” a use is in the context of generative AI development, understanding how data is prepared for training is key. The “tokenization” process that precedes training transforms copies of works from their original format into smaller snippets of words, subwords, or characters. This allows the model to “learn” from the statistical, non-expressive relationships between words in order to provide the most statistically likely answer to a user’s query.23 Computer science researchers at the University of Illinois succinctly addressed this question in a public comment to the U.S. Copyright Office on AI and copyright, stating, “the task of indexing content (and the closely-related task of modeling it mathematically) are in themselves transformative fair use.”24
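To make the tokenization step concrete, here is a minimal sketch of greedy subword tokenization in Python. The tiny hand-built vocabulary and the longest-prefix rule are illustrative simplifications; production systems learn their subword vocabularies from data (for example via byte-pair encoding) rather than specifying them by hand.

```python
# Illustrative sketch: split text into subword tokens using a greedy
# longest-prefix match against a (here, hand-specified) vocabulary.
def tokenize(text, vocab):
    """Greedily break each word into the longest subword pieces in vocab."""
    tokens = []
    for word in text.lower().split():
        while word:
            # Try the longest prefix first; fall back to a single character.
            for end in range(len(word), 0, -1):
                piece = word[:end]
                if piece in vocab or end == 1:
                    tokens.append(piece)
                    word = word[end:]
                    break
    return tokens

vocab = {"token", "ization", "trans", "forms", "copy", "right"}
print(tokenize("Tokenization transforms copyright", vocab))
# → ['token', 'ization', 'trans', 'forms', 'copy', 'right']
```

The point for the fair use analysis is that the model never stores the work as an expressive whole; it operates on fragments like these and the statistical relationships between them.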
Analyzing a wide variety of data is as important for training an AI model as it is for anyone interested in understanding and contributing to our world, and courts have recognized this in other contexts. For example, when the video game company Accolade copied and reverse engineered its competitor Sega’s system to make interoperable games, a federal court ruled that Accolade’s actions qualified as fair use.25 Because part of learning the functional, non-expressive aspects of the system required copying and reverse engineering the software, Accolade was in the clear.26 Fair use in this context can help society realize the potential productivity gains and benefits of AI while supporting the original intent of copyright protections: promoting progress in science and the arts.
Turning to the fourth factor, effects on the market for the original work, incumbent concerns about cybernetic overlords taking people’s jobs are a tried-and-true marketing strategy but fail to pass historical muster.27 While new forms of media have disrupted incumbent industries, they have often created new markets and opportunities for incumbents as well as new entrants. The concern that AI models will replace novelists, artists, musicians, and a host of other professionals ignores the potential for these tools to expand creative opportunities for millions of individuals and create new markets for incumbents.28 AI models would not exist without the genius artistry and scientific discoveries of men and women throughout history. Rather than deriding a model’s use of such information, we should consider how such technology can contribute to our own pursuit of knowledge.
Public filings in the U.S. Copyright Office’s inquiry on AI and copyright from Creative Commons and the Library Copyright Alliance (LCA) illustrate how creators, researchers, and individuals can benefit from AI models.29 Their filings highlight how restricting model development would unduly impede creativity, expression, and progress in science and the arts. The LCA specified how potential infringement in a model’s output should not be connected to the use of particular works within the training set. Because the uses and users of such works are separate, the existence of a specific work in a training set raises different legal questions than its presence in an output, and should be considered separately. A broad restriction for training due to concerns about outputs conflates two separate issues while jeopardizing innovation and creative expression. Rather than fall victim to this rhetorical pincer, developers and advocates must rally behind the fair use defense for training while continuing to address concerns around outputs.
Failure is Not an Option
If courts were to side with the New York Times and its allies, it would have significant consequences for model developers as well as downstream effects for American innovation, security, and creativity. Developers may quickly find themselves decommissioned and looking for a machine-learning equivalent of Silent Professionals, becoming “guns for hire” for firms located in jurisdictions where the regulatory environment for scraping and accessing data to train AI models is more hospitable.30
Finding for the plaintiffs would impose heavy monetary costs on AI companies and impede American model development. Statutory damages for copyright infringement alone could run to billions of dollars, depending on a firm’s exposure and the strength of the infringement claims.31 Based on the New York Times complaint, which alleges tens of millions of violations of its copyrights through datasets such as Common Crawl and WebText2, a judge embracing a maximalist approach to statutory damages could produce a bill of more than $2,431,456,050,000.32 Such financial penalties would hobble, if not eradicate, many model developers. Even one court loss would thoroughly chum the legal waters, encouraging further lawsuits against AI developers.33
The secondary costs of such litigation would be inflicted on model training. Companies building cutting-edge models need ever greater amounts of compute and data under prevailing scaling techniques.34 Training costs were already increasing at roughly 2.4x per year before the release of GPT-3 and have yet to plateau.35 For frontier models, this means an ever-higher compute bill: Google spent almost $200,000,000 on compute to train Gemini Ultra.36 Given these trends in model development and compute, the opportunity cost of shifting resources from development to litigation could spell doom for companies with limited resources, leaving them vulnerable to large incumbents or other startups with larger war chests.
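To see how quickly that cost curve compounds, here is a back-of-the-envelope extrapolation. It simply assumes the 2.4x-per-year growth rate cited above continues and uses the roughly $200 million Gemini Ultra figure as a baseline; it illustrates compounding, and is not a forecast.

```python
# Illustration of compound growth in frontier training costs, assuming the
# 2.4x-per-year rate cited above simply continues (an extrapolation only).
GROWTH_RATE = 2.4          # cost multiplier per year
base_cost = 200_000_000    # ~Gemini Ultra compute bill, per the figure above

projected = [base_cost * GROWTH_RATE ** year for year in range(1, 4)]
for year, cost in zip(range(1, 4), projected):
    print(f"Year {year}: ${cost:,.0f}")
```

At that rate, a $200 million training run becomes a multi-billion-dollar run within three years, which is why diverting capital from compute to litigation is so damaging.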
On the data side, some plaintiffs are pushing for a scorched-earth remedy: the destruction of any training datasets that include infringing content. While some deep-pocketed organizations have struck licensing deals for access to content and can hunker down, developers with fewer resources or less access to first-party data could be left out in the cold.37 Such organizations may have no choice but to seek an acquisition to survive a new AI winter.38 Given the Federal Trade Commission’s (FTC) scrutiny of financial relationships between incumbent and emerging AI firms, this is unlikely to be a viable strategy.39 Much like a fire cut off from oxygen, AI developers prohibited from indexing the web or accessing large-scale datasets would be snuffed out in short order.
A loss of fair use would be a gift to our strongest geopolitical competitor in the AI arms race. China has emerged as the clear global number-two in AI research and development, in some cases even outpacing the United States in metrics related to AI talent development and research output.40 Researchers have also noted how the Chinese government’s data gathering operations enable Chinese companies to reap the benefits of large, diversified datasets. Considering the intellectual, cultural, and strategic influence such technologies could have, abdicating U.S. leadership in this dual-use technology should raise serious concerns across the public and private sectors.41
Beyond China, other nations have amended regulations to encourage domestic AI development and create regulatory arbitrage opportunities at America’s expense.42 These countries are betting that a friendly regulatory environment can be a magnet for AI talent and investment. The freedom to index and analyze vast amounts of digital content is a critical counter-punch for American AI developers competing against their international counterparts. Policymakers and industry leaders should be clear-eyed about the costs of restrictive regulation and its potential to create a one-way technology transfer from the United States to the rest of the world.
Independent of geopolitical competition, consider the economic costs of halting AI development.43 Goldman Sachs estimates that AI advancements could contribute trillions of dollars to GDP over the next decade.44 Leading labor economists’ research has highlighted how AI-powered tools can augment the skills of low and middle-skilled workers, raising productivity and empowering labor rather than simply erasing jobs.45 Venture capitalists have committed tens of billions of dollars to American AI companies.46 If these companies move overseas, such capital could become dust in the wind.
New technologies bring new tradeoffs, and one must be weighed here: hampering AI innovation with copyright lawsuits would willingly degrade our nation’s ability to shape the evolution of a budding general-purpose technology. That could be a century-defining strategic error.
Short of legislation codifying the legality of text and data mining for training AI models, upholding fair use for training is the best possible outcome of current lawsuits. Courts deciding otherwise could do irreparable damage to the AI industry in the United States and cede America’s technological high ground to foreign competitors.