OpenAI quietly unveils GPTBot dedicated web crawler
Website administrators have the power to prevent GPTBot from collecting information


OpenAI has quietly unveiled a way for website administrators to divert the company's web crawler from lifting, preventing it from lifting data.
The firm behind ChatGPT published instructions for turning off its web crawler on its online documentation. Members of the AI community spotted the addition on Monday but it has come without an official announcement.
GPTBot can be identified by the user agent token ‘GPTBot’. The full user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
Stopping GPTBot from crawling a site requires adding it to the robots.txt file along with the parts of the site off-limits to the crawler. The same technique is used to stop crawlers, such as Googlebot, from accessing all or part of a domain.
The company also confirmed the IP address block used by the crawler. Rather than taking the robots.txt route, an administrator could simply block those addresses.
There is currently no way to remove data already added to training models - GPT-3.5 and 4 are based on models dated up to September 2021.
The approach taken by GPTBot requires users to essentially ‘opt-out’ of crawling, requiring a proactive measure on the part of website administrators. Data could be used in future models unless an admin specifically adds GPTBot to a site’s robots.txt file to stop the crawler.
Get the ITPro daily newsletter
Sign up today and you will receive a free copy of our Future Focus 2025 report - the leading guidance on AI, cybersecurity and other IT challenges as per 700+ senior executives
Some commentators have speculated that OpenAI’s move could permit the company to lobby for anti-scraping regulation or defend itself against future actions.
However, it would be unlikely that the data already collected would be exempt from the attention of lawmakers. GPT-4, for example, was launched in March 2023 based on data already added to training sets.
RELATED RESOURCE
Understand why AI/ML is crucial to cyber security, how it fits in, and its best use cases.
OpenAI has used other datasets to train its models, including Common Crawl. The CCBot crawler bot used to generate the data can also be blocked with lines of code in robots.txt. However, GPTBot represents a dedicated crawler for the company.
As well as being able to block the crawler, there are other possible uses for the detection of the GPTBot. One suggestion has been serving up different responses to OpenAI following the identification of the crawler.
Being able to direct OpenAI’s crawler to pages of deliberate misinformation could result in training datasets lacking accuracy.
OpenAI’s published intention for the crawler is for its AI models to become more accurate and feature improved capabilities and safety.
What is a crawler and why does OpenAI need one?
A web crawler is a bot that systematically works its way through the World Wide Web, collecting data as it does so
For a search engine such as Google, this information is used to build an index for query purposes. Other uses include archiving web pages.
The robots.txt file is used to request that crawler bots only index certain parts of a website or nothing at all. Omitting a crawler from this file will result in public-facing information being collected.
Large language models, such as OpenAI's, require training datasets to provide accurate responses to user queries. Web crawlers are an ideal method for generating these datasets. The Common Crawl bot, for example, seeks to provide a copy of the internet for research and analysis.
ITPro contacted OpenAI for comment.

Richard Speed is an expert in databases, DevOps and IT regulations and governance. He was previously a Staff Writer for ITPro, CloudPro and ChannelPro, before going freelance. He first joined Future in 2023 having worked as a reporter for The Register. He has also attended numerous domestic and international events, including Microsoft's Build and Ignite conferences and both US and EU KubeCons.
Prior to joining The Register, he spent a number of years working in IT in the pharmaceutical and financial sectors.
-
RSAC Conference 2025: The front line of cyber innovation
ITPro Podcast Ransomware, quantum computing, and an unsurprising focus on AI were highlights of this year's event
-
Anthropic CEO Dario Amodei thinks we're burying our heads in the sand on AI job losses
News With AI set to hit entry-level jobs especially, some industry execs say clear warning signs are being ignored
-
‘A complete accuracy collapse’: Apple throws cold water on the potential of AI reasoning – and it's a huge blow for the likes of OpenAI, Google, and Anthropic
News Apple published a research paper on the effectiveness of AI 'reasoning' models - and it seriously rains on the parade of the world’s most prominent developers.
-
Latest ChatGPT update lets users record meetings and connect to tools like Dropbox and Google Drive
News New ChatGPT business tools aim to unlock corporate information sharing tools from Otter.AI, Zoom, Google and Microsoft
-
OpenAI woos UK government amid consultation on AI training and copyright
News OpenAI is fighting back against the UK government's proposals on how to handle AI training and copyright.
-
DeepSeek and Anthropic have a long way to go to catch ChatGPT: OpenAI's flagship chatbot is still far and away the most popular AI tool in offices globally
News ChatGPT remains the most popular AI tool among office workers globally, research shows, despite a rising number of competitor options available to users.
-
‘DIY’ agent platforms are big tech’s latest gambit to drive AI adoption
Analysis The rise of 'DIY' agentic AI development platforms could enable big tech providers to drive AI adoption rates.
-
OpenAI wants to simplify how developers build AI agents
News OpenAI is releasing a set of tools and APIs designed to simplify agentic AI development in enterprises, the firm has revealed.
-
Elon Musk’s $97 billion flustered OpenAI – now it’s introducing rules to ward off future interest
News OpenAI is considering restructuring the board of its non-profit arm to ward off unwanted bids after Elon Musk offered $97.4bn for the company.
-
Sam Altman says ‘no thank you’ to Musk's $97bn bid for OpenAI
News OpenAI has rejected a $97.4 billion buyout bid by a consortium led by Elon Musk.