Perplexity hits back at Cloudflare amid claims of website 'stealth crawling' to dodge AI blocks

Perplexity denies its bots are slipping past Cloudflare blocks or ignoring robots.txt files

Logo of Perplexity AI, developer of the Perplexity Labs, Deep Research, and AI-powered browser tools, pictured on a smartphone screen.
(Image credit: Getty Images)

Cloudflare has accused Perplexity of failing to honour requests from websites to opt out of content scraping by AI companies.

Last month, the web infrastructure company announced a system to block AI companies from accessing websites without permission or compensation. The move came as part of a pushback against AI companies hoovering up the entire internet to use as training data, a tactic that has sparked lawsuits.

Cloudflare's system lets online publishers and other website owners block AI crawlers from seeing their content, with future plans to only allow those who have paid to scrape.

Several weeks into the blocking system, Cloudflare has reported that AI company Perplexity is using evasive techniques to access that content regardless. In a blog post this week, the firm said Perplexity changes how it presents itself to a website when it spots a block.

"Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences," the post noted.

ITPro contacted Perplexity for a statement, but had received no response at time of publication. A spokesperson for the firm told TechCrunch that the Cloudflare research was a "sales pitch" for the blocking product and said that the bot discussed "isn't even ours."

In a separate statement to The Verge, the company said Cloudflare's report was a "publicity stunt" and that there were "a lot of misunderstandings in the blog post".

It's not the first time Perplexity has been accused of letting its bots crawl where they're not wanted. Reports from Wired spotted such behavior last year while Forbes, the New York Times, and the BBC accused the company of scraping and reproducing their content without permission.

Perplexity has denied the accusations.

What Cloudflare claims

Cloudflare said it saw "continued evidence" that Perplexity is changing its user agent and the source its requests come from to hide this activity, and is even ignoring or failing to check robots.txt files. A robots.txt file is a list of instructions telling bots what they may and may not access; it has long been used for search crawlers and now applies to AI agents too.

After hearing complaints from customers who had tried to block AI crawlers, Cloudflare set up a series of experiments using brand-new test websites that were not publicly accessible, with a robots.txt file containing directives to stop "respectful bots from accessing any part of a website."
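To illustrate what such a blanket directive looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` name and the `example.test` URLs are assumptions made for illustration; Cloudflare did not publish its test files in this form.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name.
    print(is_allowed(BLOCK_ALL, "ExampleBot", "https://example.test/page.html"))
```

A crawler that honours the file would run a check like this before every request and skip any URL for which it returns `False`. Nothing in the file enforces the rule; compliance is voluntary, which is why the dispute centres on trust.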

Cloudflare then asked Perplexity AI questions about the domains, and found it was able to access detailed information from the restricted test sites.

"This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers," the post noted.

Cloudflare said Perplexity is not only using a declared user agent, but also falls back on a generic browser user agent that impersonates Chrome on macOS when the declared agent is blocked.

For comparison, Cloudflare ran similar tests with OpenAI's ChatGPT, finding it fetched the robots.txt file and stopped crawling when told not to access a page; when there were no instructions in the robots.txt file but there was a block page, ChatGPT again stopped crawling.
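The behavior Cloudflare describes as appropriate can be sketched roughly as follows. This is an illustration only, not how ChatGPT or any other real crawler is implemented: the `polite_fetch` function, the injected `fetch` callable and the set of status codes treated as a block are all assumptions.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser

# Assumption: these HTTP statuses are treated as "the site is blocking us".
BLOCK_STATUSES = frozenset({401, 403, 429})


def polite_fetch(
    url: str,
    user_agent: str,
    robots_txt: Optional[str],
    fetch: Callable[[str, str], int],
) -> str:
    """Decide what a well-behaved crawler does for one URL.

    robots_txt is the site's robots.txt text, or None if it has none.
    fetch(url, user_agent) is a hypothetical callable returning an HTTP status.
    """
    if robots_txt is not None:
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        if not parser.can_fetch(user_agent, url):
            # Told not to access the page: never request it.
            return "skipped: disallowed by robots.txt"

    status = fetch(url, user_agent)
    if status in BLOCK_STATUSES:
        # Hit a block page: stop, and do not retry under another identity.
        return "stopped: blocked by site"
    return "fetched"


if __name__ == "__main__":
    blocked = lambda url, ua: 403  # stand-in for a site serving a block page
    print(polite_fetch("https://example.test/a", "ExampleBot", None, blocked))
```

The important part is what the function does not do: after a refusal it neither swaps in a different user agent nor retries from another network.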

"Both of these demonstrate the appropriate response to website owner preferences," Cloudflare said.

Danger to the internet?

Cloudflare said this behavior threatens the network of trust that holds up the internet.

"There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences," the post said.

The company added that it would now block the AI company from websites using its service.

"Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling," the post added.

Calling on AI companies to behave better, Cloudflare said "well-intentioned crawlers acting in good faith" should be transparent and identify themselves honestly, and should not attempt to dodge detection by sites trying to block such access.

For sites that do allow access, AI crawlers should behave fairly, without flooding sites with too much traffic or scraping sensitive data, and should serve a "clear purpose", such as checking a price or powering a voice assistant.

Cloudflare also suggested AI companies use separate web crawlers for each activity, letting website owners more easily allow some crawler activity but not others. "Don’t force site owners to make an all-or-nothing decision," the post said.
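As a rough sketch of why separate crawlers help, the robots.txt below gives different answers to two differently named bots. The names `ExampleSearchBot` and `ExampleTrainingBot` are invented for illustration and do not refer to any real company's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow the search crawler, refuse the training crawler,
# and keep everyone else out of /private/ only.
PER_BOT_RULES = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_for(user_agent: str, url: str, robots_txt: str = PER_BOT_RULES) -> bool:
    """Return True if the given bot may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.test/article.html"
    for bot in ("ExampleSearchBot", "ExampleTrainingBot", "SomeOtherBot"):
        print(bot, allowed_for(bot, article))
```

If a company ran both activities under a single user agent, the site owner could only allow or refuse both at once, which is the all-or-nothing decision Cloudflare warns against.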

Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.

Nicole is the author of a book about the history of technology, The Long History of the Future.