Why I think the Scarlett Johansson OpenAI scandal shows the danger of AI-generated voice content

Scarlett Johansson attends the world premiere of Walt Disney Studios Motion Pictures "Avengers: Endgame" at the Los Angeles Convention Center on April 22, 2019 in Los Angeles, California.
(Image credit: Getty Images)

When OpenAI launched its new voice feature for ChatGPT, ‘Sky’, it probably wasn’t expecting to have to retire it within a week. The fact that it did can only be described as an unforced error, but the whole scandal highlights the risks of AI-generated voice content.

In September 2023 and May 2024, the company approached Scarlett Johansson and her agent to ask her to voice a new component of ChatGPT. She declined the job.

Some organizations (I would like to think most) would have cut their losses and found someone else who would be happy to take on the role. Yet when Sky launched, people instantly picked up on how similar it was to Johansson’s voice.

Specifically, many recognized it as the actress’ portrayal of an intelligent AI voice assistant in the movie Her.

The supposed coincidence didn’t slip past Johansson and her team either, and her lawyers sent OpenAI a letter, prompting the company to “pause” Sky.

An article in The Washington Post now claims OpenAI “didn’t copy Scarlett Johansson’s voice for ChatGPT” and the generative AI pioneer has the receipts to prove it.

I find myself rather skeptical of this claim. The company, it seems, auditioned several voice actors for the role of Sky at some point last year. According to The Washington Post, the flier said, among other things, that the actor should have a “warm, engaging, [and] charismatic” voice and sound between 25 and 45 years old.

I’m sure it’s mere coincidence that Johansson, who was born in 1984, was about 28 when Her was filmed.

Joanne Jang, who leads AI model behavior for OpenAI, told The Washington Post that the human actor behind Sky sounds nothing like Johansson although the two share “a breathiness and huskiness”. 

The actor herself, who has chosen to remain anonymous, said: “It’s just my natural voice and I’ve never been compared to her by the people who do know me closely.”

The thing is, she isn’t reading the lines herself. Perhaps she really doesn’t sound like Johansson – I have no way of knowing – but the synthetic voice OpenAI has derived from her samples for Sky does.

What does Scarlett Johansson have to do with AI voice and vishing threats?

What has all of this got to do with vishing, you ask? 

In case you’re not familiar with the term, vishing is a form of phishing that uses a voice element, such as a phone call or a voice message. It aims to sidestep the suspicion of emails requesting money transfers that years of anti-phishing campaigns have instilled.

While there are reports of vishing attacks aimed at individuals – often parents – many target businesses.

The anatomy of the scam is almost always the same. Similarly to phishing, the attacker generates a sense of urgency and authority; if your MD says they need you to confirm a transaction or make a bank transfer, it’s hard to push back.

Psychologically, we are more inclined to unquestioningly follow orders from someone we know to be senior to us, compared to a random person claiming to be from a bank or IT provider.

While it can be relatively easy to craft a convincing-looking email, with the right footers, signatures, tone and so on, mimicking someone’s voice is a different matter. Or at least it was until the advent of generative AI.

AI allows malicious actors – criminals of various stripes – to create a recording of a person speaking words they have never said. Deepfake technology, which uses AI, has been getting better and better at generating convincing audio clips from any audio the creators can get hold of.

With a quick search on YouTube, one can find a myriad of AI-generated songs by major artists such as Drake that sound virtually identical to authentic tracks – and these have been created by amateur enthusiasts.

Hany Farid, a professor of computer sciences at the University of California, Berkeley and a member of the Berkeley Artificial Intelligence Lab, told CNN: “A reasonably good clone can be created with under a minute of audio and some are claiming that even a few seconds may be enough.”

The US Federal Trade Commission (FTC) even issued a warning in 2023 that scammers could grab audio from social media to create their deep fake voice clips.

A successful attack on a business, using just vishing or perhaps in combination with a phishing email, could cost millions of dollars.

Just two weeks ago, thousands of people recognized Sky’s voice as that of Scarlett Johansson. It doesn’t matter that it may not have been her, because as with so many things, perception is key.

If you think your boss, the CEO of your place of work, or maybe your accountant, is on the phone asking you to make a payment or remind them of a password, will your first thought be to push back? Or will you trust your ears and follow their request? I fear the answer for most of us may well be the latter.

Jane McCallion
Deputy Editor

Jane McCallion is ITPro's deputy editor, specializing in cloud computing, cyber security, data centers and enterprise IT infrastructure. Before becoming Deputy Editor, she held the role of Features Editor, managing a pool of freelance and internal writers, while continuing to specialise in enterprise IT infrastructure, and business strategy.

Prior to joining ITPro, Jane was a freelance business journalist writing as both Jane McCallion and Jane Bordenave for titles such as European CEO, World Finance, and Business Excellence Magazine.