Earlier this month, the president summarily dismissed Shira Perlmutter, the Register of Copyrights at the Library of Congress. Perlmutter was the second senior official removed from the LoC in as many days, following the firing of Dr. Carla Hayden from her position as Librarian of Congress for “quite concerning things that she has done … in pursuit of DEI and putting inappropriate books in the library for children.” Never mind that children aren’t allowed to access the Library of Congress, or that part of the library’s function is to collect copyrighted works published in the United States.
Perlmutter’s firing came shortly after her office released the third part of its report on Copyright and Artificial Intelligence, which considers “Generative AI Training.” Many have speculated the firing resulted from the report’s conclusions. Tech moguls, particularly Elon Musk, seemed to have expected the office to endorse their view of intellectual property and loosen the reins of fair use. They hoped the report would encourage the courts to allow them unrestricted access to copyrighted material for training their large language models. When it became clear the office was going to assert that much generative AI training “goes beyond the established fair use boundaries,” Perlmutter had to go. At least, so goes the speculation.
In the wake of this report, I thought I’d use this month’s column to consider two of the main issues this trio of reports raises for writers, publishers, and readers: How might LLMs affect the publishing ecosystem in terms of what can be copyrighted? And how do those systems affect what it means to be protected by copyright, especially in an environment where “information wants to be free”? One thing is certain: LLMs are having an enormous impact on how we conceive of and practice copyright law, both in the US and around the world. And that impact gets at some of the fundamental questions of authorship, ownership, and publication.
Generative AI and Fair Use
A quintessentially democratic spirit informs US copyright law. Copyright was designed to provide a short-term monopoly for the individual inventor, artist, or creator that ultimately matures into a long-term grant to the public. In other words, recognizing that everything is made of something else, copyright was supposed to let individuals profit from their contributions before those contributions become the property of the public.
As Benjamin Franklin wrote, “as we enjoy great advantages from the inventions of others, we should be glad of an opportunity to serve others by any invention of ours; and this we should do freely and generously.” Franklin and the framers of copyright law recognized the individual’s need to profit from their work—to sustain their livelihood and continue their contributions—but they imagined a society where each should ultimately have access to the work of others to do with what they pleased. In this way, knowledge would spread, invention would increase, and society would progress.
The first two parts of the Copyright Office’s AI report deal with “Digital Replicas” and “Copyrightability,” both important issues, especially for writers and artists, but the tech industry most eagerly awaited this third part of the report, which considers the question of using data (read text, images, artwork, etc.) protected by copyright to train AI systems. At this point, no one denies that the unhindered use of such material is essential to the development of these systems. So essential, in fact, that several executives have admitted that their products could not exist without the wholesale exploitation of works protected by copyright. However, developers and their corporate overlords assert that the use of such data to train AI systems falls under the protection of fair use.
Fair use doctrine makes a special carveout in US copyright law that allows artists and writers to use protected content provided that use meets certain criteria that courts have decided do not infringe on the rights of copyright holders. It’s a way of hastening a work’s arrival in the public domain by recognizing that certain kinds of use do not actually affect the owner’s original monopoly on profits from their work. It’s fair use that allows YouTube video essayists to post clips from Disney movies, academic critics to reproduce quotations from contemporary novels for criticism, and a host of other activities. To determine whether a use is fair, the courts rely on a four-factor analysis that considers: 1) the purpose and character of the use, 2) the nature of the copyrighted work, 3) the amount and substantiality of the portion used, and 4) the effect of the use on the market for the original.
There is no magic formula to determine whether a given use might be fair. One cannot quote up to a certain number of lines from a hit song, nor can one be confident that a critical analysis using clips from a Star Wars show won’t be found in violation. Instead, courts use these factors to consider copyright issues on a case-by-case basis. What’s more, the Copyright Office’s report on these questions isn’t a policy document, nor does it have any legal power. Instead, it seeks to “[c]onduct studies and [a]dvise Congress on national and international issues relating to copyright.” The report amounts to a series of recommendations for how legislators and jurists might understand new developments in copyright from the office’s perspective, guidance they can draw on when proposing new legislation or litigating copyright questions in court.
From the tech perspective, training LLMs with copyrighted work amounts to fair use. That was the case Meta’s lawyers made when it was discovered that Llama’s developers had knowingly used LibGen, a pirated dataset containing millions of texts, including nonfiction and fiction books, scholarly journal articles, comics, and magazines. Lawyers for the Authors Guild assert that Mark Zuckerberg himself “approved Meta’s use of the LibGen” despite knowing the data to have been pirated. According to Reuters, “A Meta spokesperson said that fair use is ‘vital’ to its ‘transformational GenAI open source LLMs.’” At the core of their arguments is the assertion that training an LLM doesn’t interfere with the owner’s monopoly because the system does not reproduce the books it ingests. Instead, its output transforms its training data into a new work. Except when it doesn’t.
Acknowledging the complexity of the four-factor analysis, the Copyright Office concludes that
some uses of copyrighted works for generative AI training will qualify as fair use, and some will not. On one end of the spectrum, uses for purposes of noncommercial research or analysis that do not enable portions of the work to be reproduced in the outputs are likely to be fair. On the other end, the copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available, is unlikely to be fair use. Many uses, however, will fall somewhere in between. (74)
At stake in these arguments, for tech developers, is the supposed ability to “power incredible innovation, productivity, and creativity,” but writers worry that these technologies will use their works to outcompete them in the marketplace. Is it in keeping with the original spirit of copyright to transform every novel ever written into derivative works and flood the marketplace? To ingest the work of countless artists into a storyboarding and special effects system that puts those same artists out of work? I feel confident most of us would agree there’s something not quite right here, but we’ll see what the courts decide.
Copyrighting the Outputs of Large Language Models
The great irony here is that Meta, OpenAI, and other developers want free rein to exploit the copyrights of others while simultaneously lobbying to expand legal protections to include the outputs of their algorithms. In fact, protecting those outputs with copyright will be essential to profiting from them (something most AI companies have so far failed to do). The second part of the Copyright Office’s report, released in January of this year, deals with whether AI work can be protected by copyright in the same way as human-generated works. The answer is a complicated “not really.”
On one hand, the government wants to make space for what folks sometimes call the ethical use of AI. To imagine, for example, that some writers might feed their own work into the system, seek ideas for revision, and then modify the work before publishing. Or, perhaps, one wants to work with LLMs for brainstorming, reconfigure those ideas, and publish the results. For the Copyright Office, the amount of human effort and originality involved in this process determines whether a work can be copyrighted: “The use of AI tools to assist rather than stand in for human creativity does not affect the availability of copyright protection for output” (iii). In other words, creators still own the rights to things they make with AI assistance, provided they modify or “control” the output.
On the other hand, and the report is clear on this, the output of one-shot prompts of the kind that Sam Altman recently used to generate a metafictional story about grief is explicitly not protected by copyright for a number of reasons:
Copyright does not extend to purely AI-generated material, or material where there is insufficient human control over the expressive elements … Based on the functioning of currently generally available technology, prompts do not alone provide sufficient control.
In some ways, this report is a relief for those worried about competing with AI slop in the marketplace, but the Copyright Office acknowledges that individual intervention is going to be hard to measure and keep track of. And, as usual, this kind of guidance devolves into individual cases. Thus, who has the representation and the resources to protect their copyright will really determine who gets paid, just as it will determine how many one-shot-prompted novels “writers” can publish on Amazon before they’re stricken for AI exploitation.
So far, the courts have ruled against protecting the output of AI systems with copyright. In September of last year, the US Court of Appeals for the District of Columbia Circuit heard Thaler v. Perlmutter. In that case, Stephen Thaler sued the US Copyright Office after Perlmutter denied his application to copyright an AI-generated landscape he titled “A Recent Entrance to Paradise.” The work was generated by an AI Thaler developed called “The Creativity Machine.”
Perlmutter’s office denied the application based on their policy that copyrighted works must have a human creator. The court agreed, ruling this March that “the Creativity Machine cannot be the recognized author of a copyrighted work because the Copyright Act of 1976 requires all eligible work to be authored in the first instance by a human being.” In their ruling, the court even rejected the argument that Thaler could hold the copyright because he had developed and employed the Creativity Machine. In a blow to keyboard artists the world over, the court asserts, “the Copyright Act itself requires human authorship.”
In both reports (on authorship and fair use), the Copyright Office asserts that existing law is robust enough to handle the complexities introduced by AI. Maybe there’s some comfort in that confidence. Perhaps copyright will be strong enough to protect the market for creative writing, even as our tech overlords try to break down the walls and further deteriorate the industry. But the recent terminations at the Library of Congress do not bode well for any hope that the Executive Branch will defend individual copyright holders against the monied interests of big tech.
Our institutions determine the rules of the marketplace. If the individual copyright holder can no longer expect to hold a temporary monopoly on their work, neither should multi-billion-dollar corporations expect the same protection. If anyone can do anything with a work publicly accessible but not in the public domain under the auspices of fair use, then we’ve no need for copyright law in the first place. But the genius of the temporary monopoly recognizes that progress requires both access and profit enough to eat. However flawed, better a copyright law that empowers artists than one that leaves them to fend for themselves against the most powerful oligarchs the world has ever produced.