Generative AI training in the crosshairs as ICO set to examine legality of personal data use

The legality of generative AI training methods is set to be examined by the Information Commissioner’s Office (ICO) amid concerns over the use of personal data. 

AI training methods have been a key talking point in recent months due to the manner in which large language models (LLMs) are built. LLMs such as ChatGPT are typically built using vast amounts of data collected through web scraping.

However, these practices have raised concerns both about data privacy and the legal repercussions for developers that fall foul of copyright laws.

The ICO said conversations with developers in the AI space have highlighted several areas where organizations seek greater clarity around how data protection laws apply to the development and use of generative AI.

This includes questions over the appropriate lawful basis for training generative AI models, and how the purpose limitation principle plays out in the context of generative AI development and deployment.

There are also lingering questions about complying with the accuracy principle, as well as the expectations in terms of complying with data subject rights.

Over the coming months, the ICO said it plans to release guidance setting out its position, outlining how specific requirements of the UK GDPR and the Data Protection Act 2018 could affect generative AI training methods.

"The impact of generative AI can be transformative for society if it’s developed and deployed responsibly," said Stephen Almond, the ICO's executive director for regulatory risk.

"This call for views will help the ICO provide industry with certainty regarding its obligations and safeguard people’s information rights and freedoms."

Generative AI training and ‘legitimate interest’

Under the UK GDPR, relying on legitimate interests means meeting three conditions: the purpose of the processing must be legitimate, the processing must be necessary for that purpose, and the individual’s interests must not override the interest being pursued.

The ICO said its current thinking is that legitimate interests can be a valid lawful basis for training generative AI models on web-scraped data, as long as the model developer can show that it passes this three-part test.

The developer’s interest could be simply the business interest in developing a model and deploying it for commercial gain, or wider societal interests - as long as the developer can evidence the model’s specific purpose and use.

As for necessity, the ICO recognizes that, currently, most generative AI training is only possible using data obtained through large-scale scraping.

On the 'balancing' test, the data watchdog noted that the assessment can become complicated depending on whether generative AI models are deployed by the initial developer, by a third party through an API, or simply provided to third parties.


The ICO said it will engage with stakeholders from across the technology industry as part of the investigation, including developers and users of generative AI, legal advisors and consultants working in the space, civil society groups, and public bodies with an interest in generative AI.

The first consultation is open until 1 March, with future consultations planned during the first half of this year to examine issues such as the accuracy of generative AI outputs.

Emma Woollacott

Emma Woollacott is a freelance journalist writing for publications including the BBC, Private Eye, Forbes, Raconteur and specialist technology titles.