Factlen Explainer · AI Copyright Law · Legal Deep Dive · Jun 18, 2026, 5:40 AM · 7 min read

The Evidence Weighing on NYT v. OpenAI: Does 'Regurgitation' Defeat AI Fair Use?

As the landmark copyright lawsuit between The New York Times and OpenAI enters a critical discovery phase in mid-2026, the legal battle hinges on whether AI models 'regurgitate' exact text or learn transformatively. A federal court's order to analyze 20 million ChatGPT logs will test if the technology acts as an illegal market substitute.

By Factlen Editorial Team

Copyright Holders: Argue that unauthorized AI training is theft and that verbatim outputs act as illegal market substitutes.
AI Developers: Maintain that training on public data is a transformative fair use and that regurgitation is a rare, filterable bug.
Open-Source Advocates: Warn that overly strict copyright rulings will consolidate AI development into a monopoly of tech giants who can afford licenses.

What's not represented

· Everyday ChatGPT users whose chat logs are being analyzed in the discovery process.
· Independent freelance writers who lack the legal resources of major publishers.

Why this matters

The outcome of this discovery fight could help determine whether AI companies can keep training models on the open web for free or will be forced to pay billions in licensing fees. That decision will shape who controls the future of artificial intelligence and whether human creators are compensated for the data that powers it.

Key points

A federal judge ordered OpenAI to produce 20 million ChatGPT user logs to test claims of copyright infringement.
Plaintiffs argue the logs will prove AI models routinely regurgitate exact copies of protected reporting and literature.
OpenAI maintains that training is a transformative fair use and that exact memorization is a rare technical bug.
The legal battle has shifted focus from how AI models ingest data to the specific outputs they generate for users.
Legal scholars warn that requiring licenses for all training data could consolidate AI development among a few tech giants.

20 million: ChatGPT user logs ordered for discovery

16: Consolidated copyright lawsuits in the S.D.N.Y. MDL

4: Factors in the US Fair Use doctrine

As the summer of 2026 approaches, the most consequential legal battle in the history of artificial intelligence has entered a critical and highly technical discovery phase. Inside the Southern District of New York, a sprawling multidistrict litigation (MDL) has consolidated sixteen separate copyright infringement lawsuits against OpenAI and Microsoft. The plaintiffs, led by The New York Times and the Authors Guild, are seeking to prove that the foundation models powering ChatGPT were built on mass copyright infringement. The outcome of this litigation will not merely determine financial damages; it threatens to dictate the fundamental architecture of how machine learning models are trained and deployed globally.[1][7]

At the heart of the dispute is the U.S. legal doctrine of "fair use," a framework designed to balance the rights of creators with the public's interest in innovation and free expression. Historically, fair use has protected technologies that utilize copyrighted works in a "transformative" manner, such as search engines indexing the web or databases digitizing books for text-mining. However, generative AI presents an entirely novel challenge. The courts must decide whether ingesting billions of words to train a neural network is akin to a human reading a library to learn grammar, or if it constitutes the creation of an unlicensed, competing database.[5][6]

Early judicial signals in 2025 suggested a potential pathway for AI developers. In cases like Bartz v. Anthropic, courts leaned toward the interpretation that the sheer act of training a model on lawfully obtained data is inherently transformative. This perspective argues that the models are extracting statistical relationships and uncopyrightable facts, rather than storing the expressive text itself. Consequently, the legal battleground has shifted dramatically. The focus is no longer solely on the "input" phase—how the data was gathered—but has pivoted sharply to the "output" phase, scrutinizing exactly what these models produce when prompted by everyday users.[6][7]

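The four factors courts weigh to decide whether an unauthorized use of copyrighted material is fair: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used, and (4) the effect of the use on the potential market for or value of the work.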

This shift has elevated a technical phenomenon known as "regurgitation" or "memorization" to the center of the legal stage. The New York Times and other plaintiffs allege that large language models do not merely learn abstract concepts; they memorize specific, highly weighted training data and can spit out near-verbatim copies of protected reporting and literature. If a user can prompt ChatGPT to bypass a paywall and generate an exact replica of a Pulitzer-winning investigation, the plaintiffs argue, the AI ceases to be a transformative tool and becomes an illegal market substitute.[3][4]
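To make "near-verbatim" concrete, here is a minimal Python sketch of one common way to flag copied text: measuring how many runs of consecutive words in an output also appear in a source. It is an illustration only, not the method used by the parties or their experts. The function names, the eight-word window, and the sample sentences are all assumptions invented for this example.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the source.

    0.0 means no shared n-word run; 1.0 means every n-word run in the
    output also occurs in the source. The window n=8 is an illustrative
    assumption, not a legal or industry threshold.
    """
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & word_ngrams(source, n)) / len(out_grams)


# Invented placeholder text, not real reporting or real model output.
source = "the committee met on tuesday to review the budget proposal for the coming fiscal year"
copied = "the committee met on tuesday to review the budget proposal for the coming fiscal year"
paraphrase = "on tuesday, committee members gathered and examined next year's spending plan in detail today"

print(verbatim_overlap(copied, source))      # identical text: every n-gram matches
print(verbatim_overlap(paraphrase, source))  # reworded text: no 8-word run is shared
```

A score like this captures only literal copying. Whether a given level of overlap amounts to infringement or market substitution is a legal question that no overlap metric settles.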

To prove this claim, the plaintiffs demanded hard evidence of how ChatGPT operates in the wild, leading to a major discovery dispute. In January 2026, U.S. District Judge Sidney Stein affirmed a magistrate judge's order compelling OpenAI to produce 20 million ChatGPT user logs. The plaintiffs had initially requested 120 million logs from the tens of billions preserved by the company, but the court settled on the 20 million sample, described as roughly 0.5 percent of the total, as large enough to test the regurgitation theory.[1]
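The figures in the paragraph above can be checked with simple arithmetic. The short Python sketch below uses only the numbers reported in this article; the variable names are invented for the example. Taken at face value, a 20 million sample that is 0.5 percent of the total implies about 4 billion preserved logs, which is lower than "tens of billions". The article's sources do not reconcile the two figures, so the percentage is best read as approximate.

```python
SAMPLE_LOGS = 20_000_000        # logs the court ordered produced (from the article)
REQUESTED_LOGS = 120_000_000    # plaintiffs' initial request (from the article)
SAMPLE_FRACTION = 0.005         # "roughly 0.5 percent" (from the article)

# Total number of logs implied if the sample really is 0.5% of everything preserved.
implied_total = SAMPLE_LOGS / SAMPLE_FRACTION

# How the ordered sample compares with what the plaintiffs first asked for.
sample_vs_request = SAMPLE_LOGS / REQUESTED_LOGS

print(f"Implied preserved logs: {implied_total:,.0f}")
print(f"Sample as share of initial request: {sample_vs_request:.1%}")
```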

OpenAI fought the production order fiercely, arguing that handing over millions of user conversations would constitute a massive invasion of privacy. The company contended that the vast majority of these logs had nothing to do with the plaintiffs' copyrighted works and that producing them would turn uninvolved users into collateral damage. However, the court rejected this argument. Judge Stein ruled that the logs could be safely de-identified and protected under strict confidentiality orders, prioritizing the plaintiffs' need to gather evidence on the model's real-world outputs.[1][7]

The court-ordered discovery sample represents roughly 0.5% of OpenAI's preserved user logs.

The 20 million logs are now the most scrutinized dataset in the legal tech world. Plaintiffs' experts are combing through the de-identified conversations to determine whether regurgitation is a routine occurrence or a rare anomaly. OpenAI has long maintained that exact memorization is a technical bug, not a feature, and that it typically only occurs when users deploy highly specific, "adversarial" prompts designed to force the model to break its own guardrails. The plaintiffs, however, hope the logs will reveal that everyday users are regularly receiving outputs that compete directly with original works.[1][3]
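Whether regurgitation is "routine" or "rare" is ultimately a question about a rate estimated from a sample. As a rough sketch of how such an estimate might be expressed, the Python snippet below computes a Wilson score interval for a proportion. The counts are hypothetical and chosen only for illustration; no such figures appear in this article, and nothing here describes the experts' actual methodology.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical numbers for illustration only; they are not from the case record.
reviewed = 10_000   # logs a reviewer checks
flagged = 12        # of those, logs judged to contain near-verbatim copying

low, high = wilson_interval(flagged, reviewed)
print(f"Observed rate: {flagged / reviewed:.3%}")
print(f"Approx. 95% interval: {low:.3%} to {high:.3%}")
```

The interval shows the range of underlying rates consistent with what was observed in the sample. It does not say whether any particular rate is high enough to matter legally.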

This evidentiary hunt is entirely focused on the fourth factor of the fair use test: the effect of the use upon the potential market for or value of the copyrighted work. In copyright law, a use is rarely considered fair if it serves as a direct market substitute that deprives the original creator of revenue. If the discovery process proves that ChatGPT is frequently used as a replacement for paid news subscriptions or book purchases, OpenAI's fair use defense will face a nearly insurmountable hurdle.[4][5]

While the output logs represent the primary offensive front for the plaintiffs, they are also probing the historical inputs used to train OpenAI's earlier models. A significant point of contention involves the "Books1" and "Books2" datasets, which were allegedly compiled from shadow libraries containing pirated literature, such as Library Genesis. OpenAI deleted these datasets prior to the commencement of the current litigation, citing "non-use," but the plaintiffs have aggressively sought discovery regarding the internal decision-making process that led to their destruction.[2][7]

In a notable procedural victory for OpenAI, the court recently shielded the company's internal communications regarding the dataset deletion. In February 2026, Judge Stein reversed a prior magistrate ruling, declaring that OpenAI's discussions with its legal counsel about the "Books1" and "Books2" datasets are protected by attorney-client privilege. The judge ruled that merely stating the datasets were deleted for "non-use" did not waive the privilege, preventing the plaintiffs from accessing potentially damaging internal emails about the company's early data-sourcing strategies.[2]

As the evidence mounts on both sides, the broader technology industry is watching with intense anxiety. A ruling that generative AI outputs inherently violate copyright law would send shockwaves through the sector. Developers might be forced to implement draconian output filters, fundamentally degrading the utility of the models, or face the prospect of destroying their existing foundation models entirely—a remedy The New York Times explicitly requested in its initial complaint.[3][6]

Plaintiffs argue that when AI models output verbatim text, they cease to be transformative tools and become illegal market substitutes.

In response to this existential legal threat, a two-tiered system of AI development is rapidly emerging. Rather than waiting for a judicial verdict, several major publishers and platforms, including Reddit, News Corp, and The Associated Press, have signed multi-million-dollar licensing agreements with AI developers. These deals grant tech companies explicit permission to train on the licensors' archives, sidestepping the fair use debate for that content. However, this licensing economy has sparked its own fierce debate about the future of the internet.[5][7]

Legal scholars and open-source advocates warn that dismantling the fair use defense could have disastrous unintended consequences for competition. If the courts rule that every piece of training data must be explicitly licensed, the barrier to entry for AI development will become impossibly high. Only trillion-dollar tech conglomerates will possess the capital required to secure comprehensive data licenses, effectively outlawing independent academic research and open-source foundation models. This dynamic could hand a permanent, insurmountable monopoly to the very companies currently being sued.[4][6]

Conversely, copyright holders argue that the survival of independent journalism and human authorship cannot be sacrificed on the altar of technological progress. They contend that if AI companies are allowed to build multi-billion-dollar empires by freely extracting the value of human labor, the economic incentives to create original work will collapse. For these creators, the fair use doctrine was never intended to protect autonomous machines that ingest the entirety of human culture only to sell it back to the public at a premium.[3][5]

As the multidistrict litigation grinds through the summer of 2026, much turns on the data extracted from the 20 million ChatGPT logs. The impending summary judgment motions will require the court to weigh the empirical evidence of regurgitation against the transformative potential of artificial intelligence. The resulting precedent is likely to define the boundaries of digital copyright for years to come and to shape the economic relationship between the humans who create knowledge and the machines designed to synthesize it.[1][6][7]

How we got here

Dec 2023
The New York Times files a landmark copyright infringement lawsuit against OpenAI and Microsoft.
Apr 2025
The U.S. Judicial Panel on Multidistrict Litigation consolidates an initial 12 AI copyright lawsuits into a single multidistrict litigation in New York.
Jan 2026
Judge Sidney Stein affirms an order compelling OpenAI to produce 20 million de-identified ChatGPT logs to plaintiffs.
Feb 2026
The court rules that OpenAI's internal communications regarding the deletion of controversial training datasets are protected by attorney-client privilege.
Jun 2026
Discovery continues as legal experts anticipate summary judgment motions that will define the boundaries of AI fair use.

Viewpoints in depth

Copyright Holders & Publishers

Publishers argue that AI models are built on mass infringement and threaten the economic foundation of journalism.

For organizations like The New York Times and the Authors Guild, the mechanics of machine learning cannot excuse the unauthorized ingestion of millions of protected works. They argue that generative AI models are not merely 'learning' concepts, but are effectively massive, unlicensed databases capable of reproducing exact prose. By generating near-verbatim summaries and articles, these tools act as direct market substitutes, bypassing paywalls and depriving creators of subscription and licensing revenue. Their legal strategy focuses heavily on the fourth factor of fair use—market harm—asserting that a tool designed to replace human writers cannot be considered a transformative public good.

AI Developers & Tech Platforms

Tech companies maintain that training models on public data is a fundamentally transformative process protected by fair use.

Developers like OpenAI and Microsoft argue that their models do not store a database of copied text, but rather learn statistical relationships between words—a process they compare to a human student reading books in a library to learn how to write. They contend that the ingestion of data is a quintessentially transformative fair use, creating a novel tool with a distinct purpose. From this perspective, instances of 'regurgitation' are rare technical bugs, often triggered only by adversarial prompting, rather than the intended function of the product. They warn that requiring licenses for all training data would make the development of frontier AI legally and financially impossible.

Open-Source & Academic Advocates

Legal scholars warn that a ruling against fair use could inadvertently hand a permanent monopoly to the largest tech corporations.

A third camp of legal academics and open-source advocates views the litigation with deep concern for the future of innovation. While acknowledging the valid concerns of authors, they argue that dismantling the fair use defense for AI training would be disastrous for competition. If courts rule that every piece of training data must be explicitly licensed, only trillion-dollar companies with vast legal teams and deep pockets will be able to build foundation models. This would effectively outlaw open-source AI development and academic research, centralizing control over the era's most important technology in the hands of a few corporate giants.

What we don't know

Whether the 20 million ChatGPT logs will reveal a meaningful rate of regurgitation in everyday user interactions.
How the courts will ultimately weigh the 'market substitute' factor against the 'transformative' nature of AI training.
Whether a ruling against OpenAI would force the destruction of existing models or merely result in financial damages.

Key terms

Regurgitation: When a generative AI model outputs exact or near-exact copies of the data it was trained on, rather than synthesizing new text.
Fair Use: A U.S. legal doctrine that permits limited use of copyrighted material without permission for purposes such as criticism, comment, news reporting, or research.
Transformative Use: A key consideration under the first fair use factor: whether the new use alters the original with new expression, meaning, or message.
Market Substitute: An infringing work that directly competes with the original, reducing the creator's potential revenue (Factor 4 of the fair use test).
Multidistrict Litigation (MDL): A federal procedure that transfers related civil cases filed in different districts to a single court for coordinated pretrial proceedings, as with the 16 consolidated AI copyright suits in New York.

Frequently asked

Why did the judge order OpenAI to hand over 20 million chat logs?

The plaintiffs need to prove that ChatGPT regurgitates their copyrighted work in the real world. The logs will serve as evidence of whether everyday users are generating outputs that act as a substitute for paid news and books.

Did OpenAI try to block the release of user logs?

Yes. OpenAI argued that releasing the logs would violate user privacy. However, the court ruled that the logs could be de-identified and that the privacy concerns did not outweigh the need for discovery.

Have any courts ruled that AI training is fair use?

While the NYT case is ongoing, other courts in cases like Bartz v. Anthropic (2025) have leaned toward viewing the act of training on lawfully obtained data as a transformative fair use, shifting the legal focus to the actual outputs generated by the AI.

Sources

[1] Jones Walker LLP (AI Developers). OpenAI Loses Privacy Gambit: 20 Million ChatGPT Logs Likely Headed to Copyright Plaintiffs.
[2] Sterne Kessler (AI Developers). Privilege Preserved: OpenAI Escapes Forced Disclosure of Attorney Communications in Major Copyright Fight.
[3] Harvard Law Review (Copyright Holders). NYT v. OpenAI: The Times's About-Face.
[4] Authors Alliance (Open-Source Advocates). On Memorization, Fair Use, and the Future of Generative AI.
[5] McKool Smith (Copyright Holders). AI Infringement Case Updates.
[6] Case Western Reserve University School of Law (Open-Source Advocates). AI Training is Fair Use: The Beginning of the End of the Copyright Assault on Gen AI.
[7] Factlen Editorial Team (Open-Source Advocates). Synthesis by Factlen editorial team.
