OpenAI quietly unveils GPTBot dedicated web crawler

published 7 August 2023

OpenAI has quietly unveiled a way for website administrators to block the company's web crawler, preventing it from lifting data from their sites.

The firm behind ChatGPT published instructions for turning off its web crawler in its online documentation. Members of the AI community spotted the addition on Monday, but it arrived without an official announcement.

GPTBot can be identified by the user agent token ‘GPTBot’. The full user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

Stopping GPTBot from crawling a site requires adding it to the robots.txt file along with the parts of the site off-limits to the crawler. The same technique is used to stop crawlers, such as Googlebot, from accessing all or part of a domain.
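As a rough illustration of how such a rule behaves, the sketch below feeds a hypothetical robots.txt to Python's standard-library parser and asks whether a given crawler may fetch a URL. The `/private/` path, the example.com URLs and the `can_fetch` helper are assumptions made for this example, not anything OpenAI has published; a site wanting to keep GPTBot out entirely would use `Disallow: /` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is kept out of /private/ but may crawl
# everything else; every other crawler is unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow:
"""


def can_fetch(user_agent_token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the rules in robots_txt let the named crawler fetch url.

    Pass the short token (for example "GPTBot"), not the full user-agent
    string, because the parser matches on the product token.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent_token, url)


if __name__ == "__main__":
    for token, url in [
        ("GPTBot", "https://example.com/private/report.html"),
        ("GPTBot", "https://example.com/blog/"),
        ("Googlebot", "https://example.com/private/report.html"),
    ]:
        print(token, url, can_fetch(token, url))
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works if the crawler chooses to honour it.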

The company also confirmed the IP address block used by the crawler. Rather than taking the robots.txt route, an administrator could simply block those addresses.
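A minimal sketch of the address-based approach might look like the following. The range shown is a placeholder taken from the block reserved for documentation (192.0.2.0/24), not OpenAI's real range, and `is_blocked` is a hypothetical helper; in practice this kind of check would more likely live in a firewall or web server configuration.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is reserved for documentation.
# Replace it with the range that OpenAI actually publishes.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("192.0.2.44"))    # inside the placeholder range
    print(is_blocked("198.51.100.7"))  # outside it
```

Unlike a robots.txt entry, an address block does not rely on the crawler's cooperation, but it has to be updated if the published range changes.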

There is currently no way to remove data already added to training sets. GPT-3.5 and GPT-4 are based on data dated up to September 2021.

The approach taken by GPTBot is essentially ‘opt-out’: crawling goes ahead unless website administrators take a proactive measure to prevent it. Data could be used in future models unless an admin specifically adds GPTBot to a site’s robots.txt file to stop the crawler.

Some commentators have speculated that OpenAI’s move could permit the company to lobby for anti-scraping regulation or defend itself against future actions.

However, it is unlikely that data already collected would be exempt from the attention of lawmakers. GPT-4, for example, was launched in March 2023 based on data already added to training sets.

OpenAI has used other datasets to train its models, including Common Crawl. The CCBot crawler used to generate that data can also be blocked with an entry in robots.txt. GPTBot, however, is a crawler dedicated to the company.

Beyond blocking the crawler, detecting GPTBot has other possible uses. One suggestion is to serve different responses to OpenAI once the crawler has been identified.
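The detection step could be sketched roughly as follows, assuming the server has access to the request's User-Agent header. The helper names, status codes and message text are hypothetical, and the check is only as reliable as the header itself, which any client can spoof.

```python
from typing import Tuple

# Sample value: the full user-agent string quoted earlier in this article.
GPTBOT_USER_AGENT = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the header contains the GPTBot token (case-insensitive)."""
    return "gptbot" in user_agent.lower()


def choose_response(user_agent: str) -> Tuple[int, str]:
    """Pick a hypothetical (status, body) pair depending on who is asking."""
    if is_gptbot(user_agent):
        return 403, "Crawling by GPTBot is not permitted on this site."
    return 200, "Welcome."


if __name__ == "__main__":
    print(choose_response(GPTBOT_USER_AGENT))
    print(choose_response("Mozilla/5.0 (X11; Linux x86_64) Firefox/116.0"))
```

Here the alternative response is a simple refusal; the same branch could return any other content a site owner chose.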

Being able to direct OpenAI’s crawler to pages of deliberate misinformation could result in training datasets lacking accuracy.

OpenAI’s published intention for the crawler is for its AI models to become more accurate and feature improved capabilities and safety.

What is a crawler and why does OpenAI need one?

A web crawler is a bot that systematically works its way through the World Wide Web, collecting data as it does so.

For a search engine such as Google, this information is used to build an index for query purposes. Other uses include archiving web pages.

The robots.txt file is used to request that crawler bots access only certain parts of a website, or nothing at all. A crawler that is not covered by any rule in the file is free to collect public-facing information.

Large language models, such as OpenAI's, require training datasets to provide accurate responses to user queries. Web crawlers are an ideal method for generating these datasets. The Common Crawl bot, for example, seeks to provide a copy of the internet for research and analysis.

ITPro contacted OpenAI for comment.

Richard Speed is an expert in databases, DevOps and IT regulations and governance. He was previously a Staff Writer for ITPro, CloudPro and ChannelPro, before going freelance. He first joined Future in 2023 having worked as a reporter for The Register. He has also attended numerous domestic and international events, including Microsoft's Build and Ignite conferences and both US and EU KubeCons.

Prior to joining The Register, he spent a number of years working in IT in the pharmaceutical and financial sectors.