OpenAI quietly unveils GPTBot dedicated web crawler
Website administrators have the power to prevent GPTBot from collecting information
OpenAI has quietly unveiled a way for website administrators to prevent the company's web crawler from lifting data from their sites.
The firm behind ChatGPT published instructions for turning off its web crawler in its online documentation. Members of the AI community spotted the addition on Monday, but it has come without an official announcement.
GPTBot can be identified by the user agent token ‘GPTBot’. The full user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
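As a rough illustration of how a site might spot the crawler, the following Python sketch checks a request's User-Agent header for the token. The function name and the case-insensitive match are assumptions made for this example, not something OpenAI's documentation prescribes, and the browser string is only a stand-in for ordinary traffic.

```python
def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token."""
    # Case-insensitive substring match; this leniency is an assumption of
    # the sketch. The article only states that the token is 'GPTBot'.
    return "gptbot" in user_agent.lower()


# The full user-agent string quoted in the article.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# An illustrative browser user agent, used here only for comparison.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

if __name__ == "__main__":
    print(is_gptbot(GPTBOT_UA))
    print(is_gptbot(BROWSER_UA))
```

User-agent strings are supplied by the client and can be spoofed, so a check like this identifies only crawlers that announce themselves honestly.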
Stopping GPTBot from crawling a site requires adding it to the robots.txt file along with the parts of the site off-limits to the crawler. The same technique is used to stop crawlers, such as Googlebot, from accessing all or part of a domain.
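To show what such a rule looks like in practice, the sketch below embeds a small robots.txt and checks it with Python's standard urllib.robotparser module. The /private/ path and the example.com URLs are hypothetical, chosen only for illustration; a real file would list whichever paths the administrator wants to keep off-limits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is barred from /private/, while every
# other crawler is allowed everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Report whether the given robots.txt lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("GPTBot", "https://example.com/private/report.html"),
        ("GPTBot", "https://example.com/blog/post.html"),
        ("Googlebot", "https://example.com/private/report.html"),
    ]:
        print(agent, url, can_fetch(ROBOTS_TXT, agent, url))
```

Replacing the Disallow path with a single / would ask GPTBot to stay away from the whole site. Compliance with robots.txt is voluntary on the crawler's part; the file is a request rather than an access control.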
The company also confirmed the IP address block used by the crawler. Rather than taking the robots.txt route, an administrator could simply block those addresses.
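A minimal sketch of the address-based approach, using Python's standard ipaddress module, might look like the following. The network shown is a placeholder taken from the range reserved for documentation and is not OpenAI's actual block, which the article does not reproduce; an administrator would substitute the published range.

```python
import ipaddress

# PLACEHOLDER range reserved for documentation (RFC 5737). It is NOT the
# crawler's real address block; substitute the range OpenAI publishes.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside a blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))
    print(is_blocked("198.51.100.5"))
```

In practice this kind of filtering is more often done in a firewall or web server configuration than in application code, but the membership test is the same idea.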
There is currently no way to remove data that has already been added to training sets. GPT-3.5 and GPT-4 are based on data dated up to September 2021.
The approach taken by GPTBot is essentially ‘opt-out’, requiring website administrators to take proactive measures to prevent crawling. Data could be used in future models unless an admin specifically adds GPTBot to a site’s robots.txt file to stop the crawler.
Some commentators have speculated that OpenAI’s move could permit the company to lobby for anti-scraping regulation or defend itself against future actions.
However, it is unlikely that the data already collected would be exempt from the attention of lawmakers. GPT-4, for example, was launched in March 2023 based on data already added to training sets.
OpenAI has used other datasets to train its models, including Common Crawl. The CCBot crawler used to generate that dataset can also be blocked with entries in robots.txt. GPTBot, however, is a dedicated crawler for the company.
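For administrators who want to turn away both crawlers, the rules follow the same pattern for each user agent. The helper below is a hypothetical convenience for this example, not part of any library, and simply generates the text of a robots.txt that asks each named crawler to stay off the entire site.

```python
def build_robots_txt(agents: list[str], disallow: str = "/") -> str:
    """Build robots.txt text asking each named crawler to avoid `disallow`."""
    # One rule group per crawler, separated by a blank line.
    groups = [f"User-agent: {agent}\nDisallow: {disallow}" for agent in agents]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(["GPTBot", "CCBot"]), end="")
```

The generated text would need to be served at the root of the site as /robots.txt for crawlers to find it.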
Beyond blocking the crawler, there are other possible uses for detecting GPTBot. One suggestion has been to serve different responses to OpenAI once the crawler has been identified.
Being able to direct OpenAI’s crawler to pages of deliberate misinformation could result in training datasets lacking accuracy.
OpenAI’s published intention for the crawler is for its AI models to become more accurate and feature improved capabilities and safety.
What is a crawler and why does OpenAI need one?
A web crawler is a bot that systematically works its way through the World Wide Web, collecting data as it does so.
For a search engine such as Google, this information is used to build an index for query purposes. Other uses include archiving web pages.
The robots.txt file is used to request that crawlers index only certain parts of a website, or nothing at all. A crawler that the file does not restrict is free to collect any public-facing information.
Large language models, such as OpenAI's, require training datasets to provide accurate responses to user queries. Web crawlers are an ideal method for generating these datasets. The Common Crawl bot, for example, seeks to provide a copy of the internet for research and analysis.
ITPro contacted OpenAI for comment.

Richard Speed is an expert in databases, DevOps and IT regulations and governance. He was previously a Staff Writer for ITPro, CloudPro and ChannelPro, before going freelance. He first joined Future in 2023 having worked as a reporter for The Register. He has also attended numerous domestic and international events, including Microsoft's Build and Ignite conferences and both US and EU KubeCons.
Prior to joining The Register, he spent a number of years working in IT in the pharmaceutical and financial sectors.