OpenAI Fights NYT Demand for 120 Million ChatGPT User Chats
OpenAI's fight over 120 million user chats pits copyright evidence against AI privacy and data retention policies.
August 6, 2025

A high-stakes legal battle between OpenAI and The New York Times has escalated, moving beyond foundational questions of copyright in the age of artificial intelligence to a contentious dispute over access to vast amounts of user data. At the heart of the current conflict is the Times' demand to review approximately 120 million ChatGPT user conversations to substantiate its claims of copyright infringement. OpenAI has staunchly resisted this demand, citing heavy technical burdens and a fundamental threat to user privacy, and has instead offered access to a far smaller sample of 20 million chat logs.[1][2] This standoff over evidence underscores the far-reaching implications of a lawsuit that could reshape the legal landscape for generative AI and the data it relies on.
The underlying lawsuit, filed by The New York Times in December 2023, alleges that OpenAI and its major investor, Microsoft, unlawfully used millions of the newspaper's copyrighted articles to train the large language models that power ChatGPT.[3][4][5] The Times argues that these models not only copied its work during training but can also generate outputs nearly identical to its original articles, effectively creating a competing product that undermines its business and journalism.[3][6][4] To prove the extent of the alleged infringement, the newspaper is seeking to analyze a massive trove of user chat logs, which it believes will reveal a pattern of ChatGPT reproducing its content.[7] The Times contends that only a comprehensive review of this data can expose systematic copyright violations.[1]
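Neither side's filings spell out how such an analysis would be run, but a standard way to flag verbatim reproduction at scale is to measure word n-gram overlap between a chat transcript and a copyrighted article. The sketch below is purely illustrative: the function names, sample data, n-gram length, and 0.5 threshold are all assumptions, not anything drawn from the case record.

```python
import re

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercase a text and split it into word n-grams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(chat: str, article: str, n: int = 5) -> float:
    """Fraction of the article's n-grams that reappear verbatim in the chat."""
    article_grams = ngrams(article, n)
    if not article_grams:
        return 0.0
    return len(article_grams & ngrams(chat, n)) / len(article_grams)

# Toy data standing in for a real article and real chat logs.
article_text = "The quick brown fox jumps over the lazy dog near the river at dawn."
chat_logs = {
    "chat-1": "Sure, here it is: The quick brown fox jumps over the lazy dog near the river at dawn.",
    "chat-2": "Foxes are small canids found across the Northern Hemisphere.",
}

# Flag any conversation that reproduces most of the article verbatim.
flagged = [cid for cid, text in chat_logs.items()
           if overlap_ratio(text, article_text) > 0.5]
print(flagged)  # ['chat-1']
```

A real forensic review would also need fuzzy matching, since a model often paraphrases rather than quotes, but exact n-gram overlap is the usual first pass.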
OpenAI has forcefully pushed back against the demand for 120 million chat logs, framing it as an invasive overreach that would compromise user privacy and trust.[8][9][10] The company has publicly stated that fulfilling the request would be a monumental undertaking, requiring engineers to retrieve, de-identify, and process immense volumes of data.[1] By OpenAI's estimate, processing the 20 million chats it has offered as a sample would take three months, while handling the full 120 million could extend beyond eight months.[1] The company argues that this smaller sample, whose statistical adequacy is backed by a computer science researcher, is sufficient for the Times to examine how frequently the AI may have reproduced its articles, without subjecting millions of users' data to legal scrutiny.[1] The logs, some of which OpenAI says it would otherwise have deleted, contain not just conversations but also sensitive details such as email addresses that must be painstakingly removed before any review.[1]
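OpenAI has not published its de-identification pipeline, but the paragraph describes two standard steps: scrub identifiers, then hand over a sample. Here is a minimal sketch of that shape, assuming a regex-based email redactor and uniform random sampling; the pattern, names, and seed are hypothetical, not OpenAI's code.

```python
import random
import re

# Hypothetical pattern; a production pipeline would cover far more PII
# (names, phone numbers, street addresses, account identifiers, and so on).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def deidentify(chat: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("[EMAIL]", chat)

def sample_and_scrub(logs: list[str], k: int, seed: int = 0) -> list[str]:
    """Draw a uniform random sample of k logs, then scrub each one."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return [deidentify(chat) for chat in rng.sample(logs, k)]

logs = [
    "Email me at jane.doe@example.com about the draft.",
    "Summarize yesterday's headlines for me.",
    "Send the file to bob@corp.example, please.",
]
print(sample_and_scrub(logs, k=2))
```

The scale helps explain the months-long estimates: as a purely illustrative calculation, a pipeline averaging 100 chats per second would need roughly two weeks of continuous compute just to pass over 120 million conversations once, before retrieval, review, and quality checks.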
The dispute has raised significant concerns about the privacy of AI users and the data retention policies of tech companies.[7][11] A federal court has already ordered OpenAI to preserve all ChatGPT conversations, including those users had deleted, to prevent the potential loss of evidence.[2][12][11] The preservation order has itself alarmed users and privacy advocates, who fear that private conversations, once believed to be ephemeral, could become permanent records accessible in legal proceedings.[11] OpenAI's resistance to the Times' broad discovery request also highlights a tension for the company: it has publicly committed to deleting user chats unless users opt in to saving them, yet its legal arguments show that this data persists, at least under the court's order, and that retrieving it is a burdensome, complex process.[7] That gap has drawn public scrutiny of how thorough OpenAI's deletion practices really are.[7] The case now forces a difficult balance between a plaintiff's right to gather evidence and a tech company's dual obligations to protect user privacy and comply with court orders.[1]
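Court filings do not disclose how OpenAI's storage layer works, but the tension described above maps onto a familiar engineering pattern: soft deletion under a litigation hold, where a user-facing "delete" hides a record immediately while a legal hold blocks the physical purge. The sketch below is a hypothetical illustration of that pattern, not a description of OpenAI's systems.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ChatRecord:
    chat_id: str
    content: str
    deleted_at: Optional[datetime] = None  # None means visible to the user

class ChatStore:
    """Toy store illustrating soft deletion under a litigation hold."""

    def __init__(self, legal_hold: bool = False):
        self.legal_hold = legal_hold  # e.g., a court preservation order
        self.records: dict[str, ChatRecord] = {}

    def add(self, record: ChatRecord) -> None:
        self.records[record.chat_id] = record

    def delete(self, chat_id: str) -> None:
        """User-facing delete: the chat disappears from the product at once.

        Under a legal hold the row is only flagged, so it stays retrievable
        for discovery even though the user can no longer see it."""
        self.records[chat_id].deleted_at = datetime.now(timezone.utc)
        if not self.legal_hold:
            # Normal retention policy: purge the row outright (real systems
            # typically defer this to a background job with a grace period).
            del self.records[chat_id]

store = ChatStore(legal_hold=True)
store.add(ChatRecord("c1", "hello"))
store.delete("c1")
print(store.records["c1"].deleted_at is not None)  # True: held, not purged
```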
The implications of this legal confrontation extend far beyond the courtroom, touching the fundamental business models of both news organizations and AI developers. The outcome could set a critical precedent for how AI models are trained and whether using publicly available internet data constitutes "fair use."[3][6] If the court sides with The New York Times, AI companies could face a future in which they must license content from publishers, potentially altering the economics of AI development.[4] The case also puts a spotlight on the responsibilities of AI companies in safeguarding user data, especially when that data becomes entangled in high-profile litigation.[8] The legal tactics have escalated on OpenAI's side as well: the company has demanded access to New York Times reporters' notes to challenge the originality of the articles at issue, a move the newspaper has decried as "harassment and retaliation."[13][14] This skirmish over evidence and privacy is a critical chapter in a much larger story about the evolving relationship between technology, copyright, and the information we all create and consume.