OpenAI and Meta named in AI copyright complaints

A wooden gavel in the foreground with an open book and a pen in the background (Image credit: Getty Images)

Copyright infringement complaints concerning OpenAI’s ChatGPT and Meta’s LLaMA have raised concerns about the legal status of the data used to train large language models.

OpenAI and Meta have both made use of data available on the internet to train their models.

At issue is the copyright status of some of that data, according to the complaints - numbers 3:23-cv-03416 and 3:23-cv-03417 - filed in the US District Court for the Northern District of California.

The plaintiffs, including the actor and author Sarah Silverman, allege their copyrighted works were used as training material in the LLaMA and ChatGPT models.

For OpenAI, the complaint alleges: “Much of the material in OpenAI’s training datasets, however, comes from copyrighted Works – including books written by Plaintiffs – that were copied by OpenAI without consent, without credit, and without compensation”.

“When ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works – something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.”

The complaint against Meta is similar, but makes express mention of ‘shadow libraries’ that are available via torrent systems. Meta’s paper on LLaMA noted it was trained on Project Gutenberg – an online archive of books that are out of copyright – and the Books3 section of The Pile.

It is the latter source that the complainants have taken issue with.

The Pile was assembled by EleutherAI and contains the Books3 dataset. Books3 was derived from a copy of Bibliotik, which is a shadow library website containing copyrighted material.

ITPro contacted Meta and OpenAI for comment. Both organizations have yet to respond.

OpenAI has come under legal scrutiny as the content of its training data continues to concern copyright owners, with similar cases already making their way through the court system.

What are the implications for businesses?

Businesses face two main issues when it comes to the use of generative AI tools with models derived from content available on the internet.

The first is the risk that output produced by tools such as ChatGPT might contain falsehoods or third-party intellectual property. The latter forms the basis of the complaints and is a legitimate concern for businesses worried that employees might inadvertently make use of illegally acquired material.

The former is currently being tested in the courts by a Georgia radio host allegedly defamed through the use of demonstrably false output by a generative AI.

The second is the risk that an employee might enter confidential information into a generative AI tool, unaware that the data is fed into the training data set for a large language model and could be regurgitated elsewhere.

These concerns have led to some generative AI tools being banned in some workplaces and restricted in others.

One approach taken by businesses wishing to make use of generative AI without the same risks of exposure is to use only internal data sets and adopt a closed approach.

Open-source options exist, and traditional vendors such as Oracle have also been more than happy to step into the space. Oracle’s approach, for example, is to permit customers to use their own data to train specific models, thus avoiding - in theory - the copyright issues currently facing OpenAI and Meta.

Richard Speed is an expert in databases, DevOps and IT regulations and governance. He was previously a Staff Writer for ITPro, CloudPro and ChannelPro, before going freelance. He first joined Future in 2023 having worked as a reporter for The Register. He has also attended numerous domestic and international events, including Microsoft's Build and Ignite conferences and both US and EU KubeCons.

Prior to joining The Register, he spent a number of years working in IT in the pharmaceutical and financial sectors.