Pulitzer Winner Leads Authors Suing Microsoft Over AI's Pirated Training

A landmark lawsuit claims Microsoft's powerful AI was built on 200,000 pirated books, igniting a crucial copyright battle.

June 26, 2025

Microsoft is facing a significant legal challenge from a group of authors who allege the technology giant used their copyrighted works without permission to train its artificial intelligence models.[1][2] The lawsuit, filed in a New York federal court, claims that a massive dataset containing nearly 200,000 pirated books was used to develop Microsoft's Megatron-Turing Natural Language Generation model (MT-NLG), a powerful system designed to generate human-like text.[3][2] This case is one of several recent high-stakes legal battles that pit authors and copyright holders against major tech companies, raising fundamental questions about the legality and ethics of using protected materials to fuel the advancement of generative AI.[1][4]
The class-action complaint was brought forward by a group of authors including Pulitzer Prize winner Kai Bird, and writers Jia Tolentino and Daniel Okrent.[5][6] Their central claim is that Microsoft engaged in copyright infringement by using a dataset known as "Books3" to train its AI.[5][1] This dataset, which has since been taken down after a complaint from a Danish anti-piracy group, is alleged to contain approximately 196,640 pirated e-books.[7][8] The authors argue that by feeding their work into the Megatron model, Microsoft has created a system capable of mimicking their distinct writing styles, syntax, and thematic elements, effectively creating derivative works without consent or compensation.[3][9][10] The lawsuit seeks statutory damages of up to $150,000 for each work infringed upon and a court order to prevent Microsoft from continuing to use their material.[3][10]
The Megatron-Turing NLG model sits at the heart of this legal dispute. Developed in partnership with NVIDIA, it was announced as one of the largest and most powerful language models of its time, boasting 530 billion parameters.[11][12] This immense scale allows the model to achieve a nuanced understanding of language, enabling it to perform tasks like text prediction, reading comprehension, and common-sense reasoning with high accuracy.[12] The lawsuit alleges that this powerful capability was built upon a foundation of illegally copied creative works.[2] While Microsoft and NVIDIA have previously acknowledged that their model, like other large language models, can pick up biases and stereotypes from its training data, the lawsuit focuses on the foundational legality of using the data in the first place.[12][13] The authors contend that Microsoft's use of the pirated "Books3" collection was a deliberate choice to bypass licensing fees and agreements that would have been necessary to legally use their intellectual property.[5]
This lawsuit against Microsoft is a critical development in a much broader conflict over the use of copyrighted material in AI training.[14] Tech companies, including Meta, Anthropic, and OpenAI, have faced a wave of similar legal challenges from authors, artists, and news organizations.[1][4][2] The primary defense offered by these companies is the "fair use" doctrine under U.S. copyright law: they argue that their use of copyrighted works is transformative and essential for innovation in the burgeoning AI industry.[1][15] They contend that they are not simply reproducing the original works but are using them to build entirely new technologies. The plaintiffs in these cases counter that this use directly harms their ability to monetize their creations and constitutes a form of exploitation.[16] The legal landscape is still evolving, with recent court rulings offering mixed signals. In one notable case, a judge ruled that AI company Anthropic's training of its models on copyrighted books could be considered fair use, but that the company could still be held liable for using pirated versions of those books.[17][18] This distinction between the act of training and the act of using illegally sourced material is central to the case against Microsoft, as the lawsuit explicitly alleges the use of pirated content.[5]
The outcome of the Microsoft lawsuit and others like it will have profound implications for the future of artificial intelligence.[14] A ruling in favor of the authors could force AI developers to fundamentally rethink how they source and license training data, potentially increasing costs and slowing the pace of development. It could also lead to a more robust market for licensing creative works for AI training, providing a new revenue stream for creators. Conversely, a ruling in favor of Microsoft and other tech companies could solidify the argument that training AI models on publicly available data constitutes fair use, further accelerating AI development but potentially at the expense of creators' rights. As these legal battles unfold, they will continue to shape the ethical and legal frameworks that govern the relationship between human creativity and artificial intelligence, forcing society to confront complex questions about ownership, innovation, and the value of intellectual property in the digital age.[19]
