Cloudflare Exposes Perplexity's Covert Scraping, Igniting Major AI Ethics Debate

Accusations of 'stealth crawling' spark a fierce debate over AI's data hunger and the future of web etiquette.

August 6, 2025

A fierce debate over digital etiquette and data rights has erupted after internet infrastructure giant Cloudflare accused the AI search engine Perplexity of systematically circumventing website rules to scrape content. Cloudflare alleges that Perplexity's crawlers, the automated bots that gather information from the web, have been actively disguising their identity to bypass explicit instructions from site owners who wish to block them.[1][2][3] The accusations strike at the heart of a growing tension between AI companies hungry for data to train their models and publishers seeking to control their intellectual property, raising critical questions about the future of responsible AI and the unwritten rules of the internet.[1][4]
The core of Cloudflare's claims, detailed in a technical blog post, is that Perplexity engages in "stealth crawling."[3][5] This practice allegedly involves ignoring the widely used Robots Exclusion Protocol, or robots.txt, a file that website administrators use to tell web crawlers which pages they are permitted to access.[3][6] According to Cloudflare, its investigation was prompted by complaints from customers who noticed that Perplexity was accessing their content despite being disallowed in their robots.txt files and blocked by Web Application Firewalls (WAFs).[7][8] To verify these claims, Cloudflare set up test domains with strict no-crawling rules and found that Perplexity was still able to retrieve and summarize protected content from these non-indexed sites.[7][9] The investigation concluded that when Perplexity's declared bots were blocked, the company deployed undeclared crawlers that impersonated regular users, specifically mimicking the Google Chrome browser on macOS.[10][7] These "stealth" crawlers allegedly used a rotating pool of IP addresses and Autonomous System Numbers (ASNs) not officially associated with Perplexity, making them difficult to block with standard methods.[7][3] Cloudflare stated it observed millions of these stealth requests daily across tens of thousands of domains.[1][7] As a result of these findings, Cloudflare has removed Perplexity from its list of "verified bots" and has implemented new rules to block this behavior by default.[10][3]
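To make the mechanism concrete, the minimal sketch below uses Python's standard urllib.robotparser to show how the Robots Exclusion Protocol is meant to work: a site publishes rules in robots.txt, and a well-behaved crawler checks those rules before fetching a page. The robots.txt contents, the example.com URLs, and the crawler names are illustrative assumptions for this sketch, not a reproduction of Cloudflare's tests.

```python
# A minimal sketch of the Robots Exclusion Protocol using Python's standard
# library. The robots.txt rules, URLs, and crawler names are illustrative.
from urllib.robotparser import RobotFileParser

# Rules a publisher might serve at https://example.com/robots.txt to bar one
# named crawler from the whole site while leaving it open to everything else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler consults the rules before requesting a page; the
# protocol is purely advisory, which is why a crawler that disguises its
# identity can simply ignore it.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))   # True
```

The key point the example illustrates is that robots.txt is a request, not an enforcement mechanism: compliance depends entirely on the crawler honestly identifying itself, which is precisely what Cloudflare alleges Perplexity's undeclared bots failed to do.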
Perplexity has vehemently denied the allegations, characterizing Cloudflare's report as a misunderstanding at best and a "publicity stunt" at worst.[1][11] The AI startup contends that Cloudflare is fundamentally misinterpreting how its service operates.[11] According to Perplexity, its system does not engage in preemptive mass crawling like traditional search engines; instead, it fetches information in real time in direct response to a user's query.[12] Perplexity argues this on-demand fetching is more akin to a browser acting on a user's behalf than to a rogue bot indiscriminately scraping the web.[11][12] The company further argues that Cloudflare's analysis failed to distinguish between its own legitimate, user-initiated traffic and unrelated activity from third-party services.[13] Specifically, Perplexity claims Cloudflare confused its traffic with millions of requests from a cloud browser service called BrowserBase, which Perplexity says it uses only for highly specialized and limited tasks.[13] A spokesperson for Perplexity dismissed Cloudflare's report as a "sales pitch" and argued that the evidence presented did not show any content was actually accessed.[4][8] Perplexity has also criticized Cloudflare's infrastructure, suggesting its systems are not sophisticated enough to differentiate between genuine AI assistant requests and malicious bots.[13]
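For readers unfamiliar with the distinction Perplexity is drawing, the hypothetical sketch below contrasts the two traffic patterns: a traditional crawler fetches many pages ahead of any query to build an index, while an assistant-style fetch retrieves a single page only because a user asked about it. All function names, headers, and URLs here are invented for illustration; this is not Perplexity's or Cloudflare's actual code.

```python
# A hypothetical contrast between preemptive crawling and user-initiated,
# on-demand fetching. Names, the User-Agent string, and URLs are illustrative.
import urllib.request

def preemptive_crawl(seed_urls):
    """Traditional crawler: fetch many pages ahead of any query to build an index."""
    index = {}
    for url in seed_urls:
        with urllib.request.urlopen(url) as response:
            index[url] = response.read()
    return index

def on_demand_fetch(source_url):
    """Assistant-style fetch: retrieve one page only because a user just asked about it."""
    request = urllib.request.Request(
        source_url,
        headers={"User-Agent": "ExampleAssistant/1.0 (user-initiated request)"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()  # summarized for the user, not stored in an index
```

Whether the second pattern should be exempt from robots.txt rules written for the first is exactly the question the two companies are fighting over.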
This public dispute highlights a critical flashpoint in the burgeoning AI industry: the ethics of data scraping.[2][4] For decades, a fragile truce has existed between content creators and search engines, governed by informal standards like robots.txt.[3][14] Publishers allowed crawlers to index their content in exchange for visibility and referral traffic. However, AI models that ingest vast quantities of data to generate their own summaries and answers disrupt this dynamic, often without providing direct traffic back to the original source.[6] This has led to a growing backlash from publishers and content creators who feel their work is being exploited without consent or compensation.[9][6] The incident with Perplexity is not isolated; the company has previously faced accusations of improperly using content from major news outlets.[3][14] The broader AI industry is also under scrutiny, with numerous companies facing lawsuits over their data collection practices.[8][6]
The clash between Cloudflare and Perplexity underscores the urgent need for clearer standards and more robust technical solutions to govern how AI interacts with the web.[11][15] As AI systems become more sophisticated and integrated into our digital lives, the informal "rules of the road" that have long managed web traffic are proving inadequate.[11][6] Cloudflare is advocating for more transparency and technical accountability from bot operators and is working with standards bodies like the Internet Engineering Task Force (IETF) to create more enforceable protocols.[11] Meanwhile, Perplexity's defense suggests a belief that its agentic AI, which acts on behalf of a user, should not be subject to the same rules as traditional web crawlers.[11][12] This fundamental disagreement over the nature and rights of AI-driven data collection ensures that the debate over digital ownership and the ethics of scraping is far from over. It signals a brewing conflict that could reshape the economic and technical foundations of the open internet.[1][4]
