New Tool Proves AI Memorizes Copyrighted Material Verbatim, Fueling Legal Battles
A new tool called RECAP demonstrates that AI models can reproduce copyrighted texts nearly verbatim, with major implications for fair use debates and legal outcomes.
November 12, 2025

A new tool developed by university researchers has provided some of the most compelling evidence to date that large language models can memorize and reproduce significant portions of copyrighted material from their training data, a finding that could dramatically impact the future of artificial intelligence and ongoing legal battles over copyright infringement. The tool, known as RECAP, systematically probes AI models to reconstruct texts they have likely seen during training, revealing an ability to regurgitate lengthy, sometimes nearly verbatim, passages from well-known books. This capability directly challenges claims by some AI developers that their models only learn statistical patterns and not specific content, fueling the arguments of authors and publishers who allege their work has been effectively copied on a massive scale.
Researchers at Carnegie Mellon University and the Instituto Superior Técnico designed RECAP to overcome the secrecy surrounding the vast datasets used to train commercial AI models.[1] Because AI companies often do not disclose the specific materials ingested by their systems, it has been difficult to prove definitively what a model has "memorized."[1][2] RECAP addresses this by employing an agentic, feedback-driven pipeline.[2][3] The system makes an initial attempt to extract a passage from a target model and then uses a second AI to compare the output to the original text, identifying errors.[2][3] This feedback is then used to refine the prompt and guide the target model toward a more accurate reproduction of the source material.[1][2][3] To counteract safety filters designed to prevent models from outputting copyrighted text, RECAP incorporates a "jailbreaking" module that rephrases prompts to bypass these restrictions.[1][4] This iterative process can significantly improve the accuracy of the reconstructed text, providing clear evidence of what the model has stored in its parameters.[1][2][4]
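Based on that description, the core loop can be sketched in a few lines. The following is a minimal illustration, not the researchers' published code: the `query_target` and `query_judge` callables are hypothetical stand-ins for API calls to the probed model and to the second, error-spotting model, and the character-level similarity metric is a placeholder for whatever scoring RECAP actually uses.
```python
import difflib

def similarity(candidate: str, reference: str) -> float:
    # Rough character-level similarity in [0, 1]; a placeholder metric.
    return difflib.SequenceMatcher(None, candidate, reference).ratio()

def recap_extract(reference, seed_prompt, query_target, query_judge,
                  max_rounds=5, target_score=0.95):
    """Iteratively refine a prompt until the probed model's output
    closely matches the reference passage."""
    prompt, best, best_score = seed_prompt, "", 0.0
    for _ in range(max_rounds):
        candidate = query_target(prompt)          # ask the probed model
        score = similarity(candidate, reference)  # compare to the original
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= target_score:            # close enough: stop early
            break
        # A second model describes where the attempt diverged; that
        # feedback steers the next prompt toward the source text.
        feedback = query_judge(candidate, reference)
        prompt = (seed_prompt + "\n\nYour previous attempt contained these "
                  "errors:\n" + feedback + "\nProduce a corrected passage.")
    return best, best_score

# Toy usage with dummy callables; swap in real API clients to probe a model.
text, score = recap_extract(
    reference="It was a bright cold day in April, and the clocks were striking thirteen.",
    seed_prompt="Continue this famous opening line:",
    query_target=lambda prompt: "It was a bright cold day in April.",
    query_judge=lambda cand, ref: "The sentence is cut off after 'April'.",
)
```
In the real pipeline, the jailbreaking module would wrap the call to the target model, rephrasing the prompt whenever a safety filter refuses to continue.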
The findings from RECAP and similar research efforts carry significant weight for the numerous copyright lawsuits currently pending against major AI developers.[5] Copyright holders, from individual authors to large media organizations like The New York Times, have filed suits alleging that the unauthorized use of their work to train AI models constitutes infringement.[6][7][8] A central question in these cases is whether LLMs are merely learning from the data in a transformative way (a defense often cited by AI companies under the "fair use" doctrine) or whether they are storing and reproducing protected expression.[5][7][9] Evidence of extensive, verbatim memorization strengthens the claim that the models themselves can be considered infringing copies or derivative works.[10][11] If a model can reliably reproduce substantial portions of a copyrighted book, it behaves less like a student learning concepts and more like a vast digital library of copied content.[9][10] This distinction is crucial, as copyright law offers remedies that could include the destruction of infringing materials, a potentially catastrophic outcome for a company whose business is built on its proprietary models.[10]
The debate over AI and copyright is nuanced, with polarized claims from both sides.[10] AI companies argue that training on vast datasets is a transformative use necessary for innovation and does not harm the market for the original works.[5][12][13] Some court rulings have offered partial support for this view, suggesting that the act of training itself might be considered fair use, while also indicating that the source of the data matters—using pirated books for training, for instance, is not excused.[5][12] However, the ability of models to "regurgitate" content complicates the fair use argument.[6][8] Research has shown that memorization is not uniform; it varies significantly by model and by book.[10][14] Larger models, like Llama 3.1 70B, have been shown to memorize certain popular books, such as "Harry Potter and the Sorcerer's Stone" and "1984," almost in their entirety, while showing little memorization of the vast majority of other books in their training data.[9][10][11] This selective but deep memorization demonstrates that the models are not simply creating statistical noise but are capable of high-fidelity reproduction of specific, protected works.
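To make "memorized almost in its entirety" concrete, the sketch below shows one simple way such a rate could be estimated: probe the model with prefixes sampled throughout a book and count how often its greedy continuation exactly reproduces the true next words. This is an illustrative simplification, not the cited studies' exact methodology, and the `generate` callable is a hypothetical stand-in for a model API.
```python
def memorization_rate(book_text, generate,
                      prefix_len=50, suffix_len=50, stride=500):
    """Fraction of probes where the model's greedy continuation of a book
    prefix exactly reproduces the true suffix. `generate(prompt, n_words)`
    is a hypothetical stand-in for a greedy-decoding model call."""
    words = book_text.split()
    hits = trials = 0
    for start in range(0, len(words) - prefix_len - suffix_len, stride):
        prefix = " ".join(words[start:start + prefix_len])
        true_suffix = words[start + prefix_len:start + prefix_len + suffix_len]
        predicted = generate(prefix, suffix_len).split()[:suffix_len]
        hits += predicted == true_suffix  # exact word-for-word match
        trials += 1
    return hits / trials if trials else 0.0
```
A rate near 1.0 across a whole book would correspond to the wholesale memorization reported for titles like "Harry Potter and the Sorcerer's Stone," while near-zero rates for most other titles match the skewed distribution the research describes.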
The exposure of this memorization capability places the AI industry at a critical juncture. The evidence presented by tools like RECAP pressures AI developers to be more transparent about their training data and to implement more effective safeguards against regurgitation.[15] It also raises fundamental questions about the technical feasibility of preventing memorization without compromising model performance, a challenge that some researchers suggest may be inherent to how current LLMs function.[8][16] As legal battles continue, the courts will be forced to grapple with how to apply centuries-old copyright principles to a technology that defies traditional categorization.[6] The outcomes of these cases, informed by increasingly sophisticated methods for peering inside the "black box" of AI, will undoubtedly reshape the development and deployment of artificial intelligence, balancing the drive for innovation against the foundational rights of creators.
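One commonly discussed family of safeguards, sketched below purely as an illustration rather than any vendor's known implementation, filters outputs by checking for long word n-gram overlaps with a protected reference corpus before returning them:
```python
def build_ngram_index(protected_texts, n=12):
    """Index every n-word sequence appearing in the protected corpus."""
    index = set()
    for text in protected_texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            index.add(tuple(words[i:i + n]))
    return index

def looks_regurgitated(output, index, n=12):
    """Flag an output that shares any long n-gram with the protected corpus."""
    words = output.split()
    return any(tuple(words[i:i + n]) in index
               for i in range(len(words) - n + 1))
```
Such filters are easy to evade with light paraphrasing and do nothing about what the model has already stored in its weights, which is partly why prevention is viewed as such a hard problem.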