Copyright spats show generative AI training has become a major legal minefield

AI training concept art showing multi-colored shapes being absorbed by a human brain
(Image credit: Getty Images)

It would be “impossible” to conduct AI training without the use of copyrighted material, according to OpenAI, as questions continue to grow about the nature of the content used by tech firms to build their large language models (LLMs).

While much of the excitement surrounding generative AI over the last year has focused on the capabilities of the technology, less attention has been given to the types of content used to train these models. But now that's changing.

In a submission to the UK House of Lords Communications and Digital Committee from December, OpenAI argued that because copyright today covers virtually every sort of human expression, from blog posts and photographs to forum posts, software code, and government documents, “it would be impossible to train today’s leading AI models without using copyrighted materials.”

Even if training data for LLMs were instead limited to public domain books and drawings created more than a century ago, that might make for an “interesting experiment”, OpenAI said – but it wouldn’t provide AI systems “that meet the needs of today’s citizens”.

OpenAI said it believed that, legally, copyright law does not forbid training, but the company also said it provides an easy way to stop its “GPTBot” web crawler from accessing a site, as well as an opt-out process for creators who want to exclude their images from future DALL∙E training datasets.
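For site owners, the blocking mechanism OpenAI refers to is the standard robots.txt convention: the GPTBot crawler checks a site's robots.txt file and honors disallow rules targeting its user-agent. A minimal example that blocks GPTBot site-wide (the paths shown are illustrative):

```txt
# robots.txt at the site root, e.g. https://example.com/robots.txt
User-agent: GPTBot
Disallow: /

# Or allow only a public section while blocking everything else:
# User-agent: GPTBot
# Allow: /blog/
# Disallow: /
```

Note that robots.txt is a voluntary convention rather than an enforcement mechanism, and it only affects future crawling, not content already collected.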

The tech firm added that its LLMs are developed using three main sets of training data: information publicly available on the internet; information licensed from third parties; and information from users or its human trainers.

While questions about the types of content used to train generative AI are far from new, they have been growing rapidly in recent months.

Almost as soon as these tools for creating text, images, and more became a sensation with ChatGPT in late 2022, there was criticism that the nature of the training data used (often text and images scraped from the internet) had produced biased tools whose output would, in turn, reinforce existing stereotypical representations.

After this, new complaints and lawsuits followed from artists, authors, and companies concerned that their content had been used to train LLMs which could then create content in the style of the original creator – without them getting paid for it.

As a result, there is now much more interest in the type of data being used to train these models, who created it, and what that means for the outputs from these tools.

Microsoft, for example, has already said that if a third party sues one of its commercial customers for copyright infringement over the use of Microsoft’s Copilots or the output they generate, it will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer used the guardrails and content filters Microsoft has built into its products.

In a blog post in September detailing that offer, Microsoft also laid out the competing pressures that the rise of generative AI has created.

“We believe the world needs AI to advance the spread of knowledge and help solve major societal challenges. Yet it is critical for authors to retain control of their rights under copyright law and earn a healthy return on their creations,” the tech giant said.


“And we should ensure that the content needed to train and ground AI models is not locked up in the hands of one or a few companies in ways that would stifle competition and innovation.”

This means that what the training data is, and who owns it, is suddenly much more important.

AI companies argue that using data from the internet for training the LLMs is covered by the concept of fair use, because they are transforming it into something new. But media companies and others are also becoming more aware of the potential value of their data to these tech companies.

Some have started using tools to block companies from crawling their sites, and some publishers have already done deals with AI companies, with OpenAI signing a landmark deal with Axel Springer in December 2023.

Others have taken a different approach.

Courtroom battles loom for AI companies

In late December, the New York Times sued OpenAI and Microsoft, arguing that their generative AI tools were built on LLMs that were in turn built by “copying and using” millions of the publication’s copyrighted news articles, investigations, opinion pieces, reviews, and other content.

The lawsuit accuses the companies of “seeking to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.”

The Times argues that the use of its content for training LLMs should not be considered fair use.

“There is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it,” lawyers for the publication said.

“Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.”

OpenAI responded to the lawsuit this week, arguing that it works with news organizations to create new opportunities.

It said training AI models using publicly available internet materials is fair use, “as supported by long-standing and widely accepted precedents”, and noted its opt-out process for publishers, which the Times started using in August 2023, to prevent its tools from accessing their sites.

In its lawsuit, the Times complained that GPT-4 was able to produce “near-verbatim copies of significant portions” of Times content, such as a Pulitzer-prize winning series, when prompted to do so.

However, OpenAI described this ‘memorization’ as a “rare failure” of the learning process, something that is more common when particular content appears more than once in training data, such as if pieces of it appear on lots of different public websites.

“We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use,” OpenAI said.

OpenAI said that the Times is “not telling the full story”, adding that it had been negotiating with the paper over a high-value partnership involving real-time display of its content, with attribution, in ChatGPT.

The Times said its lawsuit seeks to hold the companies responsible for billions of dollars in statutory and actual damages. OpenAI said the lawsuit was “without merit” but that it was still hopeful for a “constructive partnership” with The Times.

While last year many people were wowed by the content coming out of generative AI tools, this year there may be a lot more interest in what training content went in first.

Steve Ranger

Steve Ranger is an award-winning reporter and editor who writes about technology and business. Previously he was the editorial director at ZDNET and the editor of