Encyclopedia Britannica sues OpenAI for training ChatGPT on 100,000 proprietary articles and dictionary entries

The iconic publisher’s legal challenge against OpenAI tests the boundaries of fair use and the future of curated knowledge.

March 16, 2026

The legal confrontation between traditional bastions of curated knowledge and the titans of generative artificial intelligence has entered a significant new phase with Encyclopedia Britannica filing a major lawsuit against OpenAI.[1][2][3] The complaint, lodged in the United States District Court for the Southern District of New York, alleges that the creator of ChatGPT used nearly 100,000 of Britannica’s meticulously researched articles and a vast catalog of Merriam-Webster dictionary entries to train its large language models without authorization. This development marks a pivotal escalation in the ongoing debate over whether the ingestion of high-quality, proprietary data constitutes transformative fair use or a systematic misappropriation of intellectual property. By targeting the very foundation of how artificial intelligence systems acquire factual reliability, Britannica is challenging the economic and legal framework that has allowed the AI industry to expand at its current breakneck pace.
The core of the legal challenge centers on the assertion that OpenAI did not merely learn from Britannica’s content but effectively ingested it to create a competing product that cannibalizes the publisher’s primary business model.[2] According to the court filings, ChatGPT frequently produces near-verbatim reproductions, detailed summaries, or specific abridgments of Britannica’s copyrighted works in response to user queries.[2] Britannica argues that this practice creates a substitution effect where users who would traditionally visit its digital encyclopedia or dictionary are instead provided with the information directly within the AI interface.[1][3] The publisher contends that this bypasses the advertising and subscription-based revenue models that sustain its editorial operations, which have been managed by human researchers and editors since the company’s founding in 1768.[2] Furthermore, the lawsuit includes claims of trademark infringement, alleging that OpenAI’s models often cite Britannica as a source for incorrect or fabricated information.[3][4] These so-called hallucinations, according to the filing, damage the publisher’s reputation for accuracy and create public confusion regarding the origin and reliability of the data provided by the chatbot.
This legal battle in the United States is unfolding against a backdrop of increasing judicial scrutiny in Europe, where courts are redefining how AI models may lawfully interact with copyrighted works. While American courts have largely focused on the concept of fair use, European jurisdictions are grappling with the technical question of whether an AI model can be said to store or fix a copyrighted work within its digital architecture. A recent landmark ruling by the Munich I Regional Court in Germany found that large language models are capable of memorizing protected texts, and that this internal representation constitutes a reproduction under copyright law. The finding directly contradicts the common industry defense that AI models learn only abstract patterns rather than storing copies of their training data. As the European Union moves toward full implementation of its AI Act, the tension between text-and-data-mining exceptions and the rights of content creators has become a central point of friction. The outcome of the Britannica case may depend heavily on whether US courts adopt a similar technical view of model weights and internal storage, potentially stripping AI developers of the safe-harbor arguments they have relied upon for the past several years.
The strategic choice by Encyclopedia Britannica to pursue litigation rather than a licensing agreement highlights a growing rift within the media and publishing industries. While several major organizations, including Walt Disney Company, the Associated Press, and various global news conglomerates, have opted to sign lucrative multi-year deals with OpenAI to provide training data in exchange for financial compensation and equity stakes, Britannica has chosen a more confrontational path. This decision mirrors its previous legal action against the AI search engine Perplexity, suggesting a broader corporate strategy focused on establishing firm legal precedents rather than accepting one-off payouts. For the AI industry, the risk is substantial; if curated reference materials like Britannica’s are deemed off-limits without explicit, high-cost licenses, the ability of AI models to maintain factual grounding will be severely compromised. Retrieval-augmented generation, a technique used to reduce AI errors by pulling from trusted sources, relies on the very data that is now the subject of intense litigation. If the courts rule in Britannica’s favor, it could force a massive re-evaluation of how AI companies source the ground-truth data necessary to prevent their systems from becoming unreliable engines of misinformation.
As the case moves toward discovery, the technical debate over how 100,000 articles were processed will likely become the focal point of the proceedings. OpenAI has consistently maintained that its training processes are transformative and protected under existing laws, arguing that the technology creates entirely new works rather than acting as a sophisticated database for existing ones. However, the sheer volume of data involved and the precision of the outputs generated by ChatGPT make this a unique test for the judiciary. The ultimate resolution of this dispute will do more than just determine the financial damages owed to a single publisher; it will define the boundary between public information and private property in the digital age. If the court finds that the systematic scraping and ingestion of high-value reference libraries is a violation of the Copyright Act, it may necessitate a complete overhaul of the data economies that currently power the world's most advanced artificial intelligence. The stakes extend far beyond the courtroom, touching on the future of how information is verified, attributed, and monetized in a world where the distinction between a human-written encyclopedia and an AI-generated summary is increasingly blurred.
