OpenAI quietly unveils GPTBot dedicated web crawler
Website administrators have the power to prevent GPTBot from collecting information


OpenAI has quietly unveiled a way for website administrators to stop the company's web crawler from lifting data from their sites.
The firm behind ChatGPT published instructions for turning off its web crawler in its online documentation. Members of the AI community spotted the addition on Monday, but it came without an official announcement.
GPTBot can be identified by the user agent token ‘GPTBot’. The full user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
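As a rough illustration of how a site might spot the crawler, the sketch below checks a request's User-Agent header for the token. The function name `is_gptbot` is hypothetical, and the case-insensitive substring match is an assumption made for this example rather than anything OpenAI documents.

```python
# Minimal sketch: detect GPTBot from a User-Agent header value.
# `is_gptbot` is a hypothetical helper; the matching rule is an assumption.

GPTBOT_TOKEN = "GPTBot"

# Full user-agent string as quoted in the article.
GPTBOT_FULL_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the header value contains the GPTBot token."""
    return GPTBOT_TOKEN.lower() in user_agent.lower()


if __name__ == "__main__":
    print(is_gptbot(GPTBOT_FULL_UA))
    print(is_gptbot("Mozilla/5.0 (X11; Linux x86_64) Firefox/116.0"))
```

A user-agent string is supplied by the client and can be spoofed, so a check like this is a hint rather than proof of where a request came from.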
Stopping GPTBot from crawling a site requires adding it to the robots.txt file along with the parts of the site off-limits to the crawler. The same technique is used to stop crawlers, such as Googlebot, from accessing all or part of a domain.
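To see how such rules behave, the sketch below feeds two sample robots.txt files to Python's standard `urllib.robotparser` module and asks whether GPTBot may fetch a given URL. The `/members/` path and the example.com URLs are made up for illustration, and a standards-following parser only approximates how any particular crawler will interpret the file.

```python
# Minimal sketch: check sample robots.txt rules with the standard library.
# The paths and example.com URLs are illustrative assumptions.
from urllib.robotparser import RobotFileParser

# Keep GPTBot away from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Keep GPTBot out of one section only.
BLOCK_PART = """\
User-agent: GPTBot
Disallow: /members/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(BLOCK_ALL, "GPTBot", "https://example.com/news/"))
    print(allowed(BLOCK_PART, "GPTBot", "https://example.com/news/"))
    print(allowed(BLOCK_PART, "GPTBot", "https://example.com/members/a"))
```

Neither sample mentions other crawlers, so agents such as Googlebot remain unrestricted by these rules.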
The company also confirmed the IP address block used by the crawler. Rather than taking the robots.txt route, an administrator could simply block those addresses.
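The address-based route can be sketched with Python's standard `ipaddress` module. The range used here, 192.0.2.0/24, is a reserved documentation range standing in as a placeholder and is not OpenAI's real block; an administrator would substitute the addresses the company publishes. The function name `is_from_crawler_block` is likewise hypothetical.

```python
# Minimal sketch: test whether a client address falls inside a blocked range.
# 192.0.2.0/24 is a documentation-only placeholder, NOT OpenAI's real block.
import ipaddress

CRAWLER_BLOCKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_from_crawler_block(addr: str) -> bool:
    """Return True if `addr` belongs to any configured crawler block."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in CRAWLER_BLOCKS)


if __name__ == "__main__":
    print(is_from_crawler_block("192.0.2.17"))
    print(is_from_crawler_block("198.51.100.5"))
```

In practice this kind of filtering is more often done in a firewall or web server configuration than in application code; the logic is the same either way.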
There is currently no way to remove data already added to training sets. GPT-3.5 and GPT-4 are based on data dated up to September 2021.
GPTBot's approach is essentially ‘opt-out’: crawling goes ahead unless website administrators take proactive measures. Data could be used in future models unless an admin specifically adds GPTBot to a site’s robots.txt file to stop the crawler.
Some commentators have speculated that OpenAI’s move could permit the company to lobby for anti-scraping regulation or defend itself against future actions.
However, it is unlikely that the data already collected would be exempt from the attention of lawmakers. GPT-4, for example, was launched in March 2023 based on data already added to training sets.
OpenAI has used other datasets to train its models, including Common Crawl. The CCBot crawler used to generate that data can also be blocked with entries in robots.txt. However, GPTBot represents a dedicated crawler for the company.
Beyond blocking the crawler, there are other possible uses for detecting GPTBot. One suggestion has been to serve different responses to OpenAI once the crawler has been identified.
Being able to direct OpenAI’s crawler to pages of deliberate misinformation could result in training datasets lacking accuracy.
OpenAI’s published intention for the crawler is for its AI models to become more accurate and feature improved capabilities and safety.
What is a crawler and why does OpenAI need one?
A web crawler is a bot that systematically works its way through the World Wide Web, collecting data as it does so.
For a search engine such as Google, this information is used to build an index for query purposes. Other uses include archiving web pages.
The robots.txt file is used to request that crawler bots only visit certain parts of a website or nothing at all. A crawler that is not covered by any rule in this file is free to collect the site's public-facing information.
Large language models, such as OpenAI's, require training datasets to provide accurate responses to user queries. Web crawlers are an ideal method for generating these datasets. The Common Crawl bot, for example, seeks to provide a copy of the internet for research and analysis.
ITPro contacted OpenAI for comment.

Richard Speed is an expert in databases, DevOps and IT regulations and governance. He was previously a Staff Writer for ITPro, CloudPro and ChannelPro, before going freelance. He first joined Future in 2023 having worked as a reporter for The Register. He has also attended numerous domestic and international events, including Microsoft's Build and Ignite conferences and both US and EU KubeCons.
Prior to joining The Register, he spent a number of years working in IT in the pharmaceutical and financial sectors.