Protecting your business from AI web scraping


Web scraping – taking information published on public-facing websites and collating it for an organization or individual’s own use – has been around since the dawn of the World Wide Web. AI web scraping could be seen as a simple evolution of this process, but the reality is a bit more complicated.

Web scraping using artificial intelligence (AI) can gather data from websites faster and with more sophistication and accuracy, then analyze that data and produce outputs. AI web scraping is also associated with generative AI tools like ChatGPT, which can be used to create new content based on information taken from other people’s sites. It’s this latter element in particular that has set alarm bells ringing among website owners, who are now looking to protect themselves from this type of web scraping.

It’s worth noting that web scraping is, in general, a legal activity. Information on a public website is there for people to see and use, and people are at liberty to compare or aggregate information from websites, either manually or using automated tools.

Price comparison websites aimed at consumers are one example of how web scraping can be used legitimately. Behind the scenes, sellers might also use web scraping to get competitor data and help them set their own pricing. Another use is sentiment analysis – scraping social media feeds to understand how a particular advertising campaign or new initiative is going down, or to do brand comparison. Political parties also use this technique to test whether statements or campaign activities are received positively or negatively.

That’s not to say it’s a free-for-all, though, and legal boundaries do exist. Laws on the collection and retention of personal data, such as GDPR, must be observed, and any parts of a website that are not publicly accessible are not “fair game” for web scraping.

AI web scraping changes the game

The use of AI in web scraping tools brings a level of sophistication that hasn’t been seen before. Notably, it can be used to refine what’s actually scraped from a site so that only the relevant information is collected. What’s more, it can do so irrespective of the format of the information – for example, whether a number is written as 12,000,000 or twelve million.
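To illustrate why format variation matters, here is a minimal sketch of the kind of hand-written rule a traditional scraper would need in order to treat "12,000,000" and "twelve million" as the same value. The function name `normalize_number` and the small vocabulary it handles are assumptions made for this example; an AI-based tool would typically infer the value without such explicit rules.

```python
# Hypothetical helper: converts a few common number formats to an integer.
# It only covers simple English number words and comma-separated digits.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def normalize_number(text: str) -> int:
    """Return the integer value of '12,000,000' or 'twelve million'."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.isdigit():
        return int(cleaned)

    total = 0    # sum of completed groups (e.g. the "million" part)
    current = 0  # value of the group currently being read
    for word in cleaned.replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"Unrecognized number word: {word!r}")
    return total + current


print(normalize_number("12,000,000") == normalize_number("twelve million"))
```

Every new format (abbreviations such as "12m", other languages, currency symbols) would need another rule here, which is the maintenance burden the article suggests AI tools reduce.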

Dan Llewellyn, director of technology at xDesign, explains: “If you scrape a blog, you will probably not need half the page – so you can get the AI tool you're using to make sure it stores just the bits you require. This was a much harder and more time-consuming process in the past.”

The added capabilities AI brings sit alongside vast increases in speed and removal of the need for human intervention; AI web scraping can capture more timely data in greater volumes, produce larger datasets for analysis and generate more time-sensitive results, which brings greater analytical accuracy.

Stopping AI web scrapers

Those who want to stop the scrapers, whether AI or not, have some tools at their disposal. CAPTCHA tests – those selection grids that need a human brain to identify certain images – can be used to deflect scraping bots. A robots.txt file can also be used to tell individual scraping agents not to access a website, or specific parts of it, or to slow down their crawling, although it relies on the agents choosing to honor those instructions. It can be labor intensive initially, however, as each scraping agent must be individually named.
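As a rough sketch of what naming agents individually looks like, the example below embeds a hypothetical robots.txt file and checks it with Python's standard `urllib.robotparser` module. The agent name `ExampleScraper`, the `/pricing/` path and the `example.com` URLs are invented for illustration; `GPTBot` is used only as an example of a named crawler, and the agent names that matter for a given site should be checked against each crawler's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Each scraping agent is named in its own
# block; the final block covers every agent not named above it.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ExampleScraper
Disallow: /pricing/
Crawl-delay: 10

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks these questions before requesting a page.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))
print(parser.can_fetch("ExampleScraper", "https://example.com/pricing/list"))
print(parser.can_fetch("ExampleScraper", "https://example.com/blog/post"))
print(parser.crawl_delay("ExampleScraper"))
```

Note that the check runs on the crawler's side: robots.txt expresses the site owner's wishes, and nothing in the file itself stops an agent that ignores it.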


There are other, more sophisticated measures too, such as systems that track movement through a site, distinguish human from bot-like behavior and block the latter. Tools like these will only stop future scraping, though. There is no way to have data that was collected in earlier scraping sessions deleted by those who scraped it.
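One simple signal such systems might use is request rate: a person rarely loads dozens of pages in a few seconds. The sketch below is a minimal, assumed example of that single heuristic and not a description of any particular product; the class name `RateBasedBotDetector` and its thresholds are hypothetical, and real tools combine many more signals.

```python
from collections import defaultdict, deque


class RateBasedBotDetector:
    """Flags a client that exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Recent request timestamps per client (e.g. keyed by IP or session).
        self._hits = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client looks bot-like."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop requests that have fallen outside the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


detector = RateBasedBotDetector(max_requests=3, window_seconds=1.0)

# Four requests within 0.3 seconds: the fourth exceeds the limit.
fast_client = [detector.record("client-a", t) for t in (0.0, 0.1, 0.2, 0.3)]

# Four requests spread over 15 seconds: never more than one in the window.
slow_client = [detector.record("client-b", t) for t in (0.0, 5.0, 10.0, 15.0)]

print(fast_client)
print(slow_client)
```

A scraper that deliberately slows down would pass this check, which is one reason rate limits are usually combined with other behavioral signals.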

When deciding how to react to web scraping, businesses need to consider not just the benefits of blocking but also the potential benefits of leaving their websites open to this activity.

Blocking scrapers is an effort-intensive activity and requires manual monitoring and updating on an ongoing basis. There’s also an argument that data scraping can be beneficial for those being scraped: price comparison websites, for example, can drive traffic to an online store based on the data scraped from it.

Alan Harper, intellectual property, trade marks and design partner at commercial law firm Walker Morris, explains: “In many ways, web scraping is part of online commerce and is simply the cost of doing business online. For those who have had their data scraped however, it can be frustrating, particularly if this data ends up in the hands of competitors or bad faith actors.”

What next for AI web scraping?

While web scraping has been around for years, the journey for AI-powered web scraping is just beginning. Many of the concerns currently surfacing are based not just on the speed and sophistication AI brings, but on the use of scraped data to train the large language models (LLMs) that generative AI systems like ChatGPT rely on.

Organizations will have to consider whether it’s more beneficial to their business to leave their website open to scrapers of all kinds. As AI scraping becomes more common, they will need to weigh the benefits of having their information available on the open web against any potential negative effects of that information being fed into an LLM.

We’re in the early days of AI web scraping, but decisions made now will help organizations build a strong foundation for future strategies as the technology – and responses to it – evolves.

Sandra Vogel
Freelance journalist

Sandra Vogel is a freelance journalist with decades of experience in long-form and explainer content, research papers, case studies, white papers, blogs, books, and hardware reviews. She has contributed to ZDNet, national newspapers and many of the best-known technology websites.

At ITPro, Sandra has contributed articles on artificial intelligence (AI), measures that can be taken to cope with inflation, the telecoms industry, risk management, and C-suite strategies. In the past, Sandra also contributed handset reviews for ITPro and has written for the brand for more than 13 years in total.