
Microsoft's VALL-E will usher in a new era of cyber crime

With its ability to synthesise speech from short audio clips, Microsoft's VALL-E represents a worrying development in the realm of deepfakes


It is rare to accurately anticipate the perils of a technology from the outset. For security and fraud professionals, though, that is exactly what is happening with generative AI, deepfakes, and text-to-speech (TTS).

The latest development in the field of TTS is Microsoft’s new neural codec language model VALL-E, which aims to accurately replicate a person’s speech using a combination of a text prompt and a short clip of a real speaker. It is itself the latest addition to a growing repertoire of tools powered by artificial intelligence (AI), including DALL·E 2 and ChatGPT.

VALL-E has been trained using the LibriLight dataset, which contains 60,000 hours of speech from over 7,000 speakers reading audiobooks. This allows it to operate at a level unachievable by other TTS models and mimic a speaker’s voice saying any chosen phrase after receiving only three seconds of input audio.

Although several benefits could arise from this technology, such as improved virtual assistants or more natural digital accessibility tools, it’s hard to ignore its potential as an avenue ripe for exploitation.

How can cyber criminals weaponise VALL-E? 

With VALL-E, it's possible to replicate tone, intonation, and even emotion, factors that make voice clips produced using the model all the more convincing. Making victims feel that urgent action is necessary is a common tactic for getting them to click on phishing links, download ransomware payloads, or transfer funds, and attackers could ramp up the emotional intensity of voice clips made using VALL-E to heighten this sense of urgency.

The use of AI-generated content in phishing and ransomware attacks has also been on the rise in recent years, as models have become more sophisticated at replicating trusted sources. In 2021, Dark Reading reported that threat actors had used deepfaked audio to instruct an employee at a UAE company to transfer them $35 million. The employee had been convinced that the audio instructions came from the company’s director and an associated lawyer, and that the money was for an acquisition.


In 2022, experts at Cisco warned that deepfake attacks would be the next major threat to businesses, potentially taking the form of attackers impersonating CEOs in videos sent to employees. At the time, the warning came with the caveat that attackers would have to exceed a data threshold to fake an individual’s face or speech convincingly; with tools such as VALL-E, that threshold may have been significantly lowered. In the same interview, it was suggested that social norms around online communication could become “super weird” in the near future, necessitating regular checks that the person on the other end of the line is who you think they are.

As Mike Tuchen, the CEO of digital identity firm Onfido, explained on a recent episode of the IT Pro Podcast, deepfakes are already possible over live video calls. Technology of this nature made international headlines in October 2022, when Berlin mayor Franziska Giffey was tricked into speaking to a prankster using a real-time deepfake to look like Kyiv mayor Vitali Klitschko.

Tuchen described the technology being developed to identify deepfakes, with current examples requiring participants to turn their heads to the side to prove their feed is unaltered. Current deepfake technology is at its most convincing when reproducing well-lit faces staring at the camera, and struggles with occlusion, when subjects look to the side or cover their face to any degree. But audio offers no such easy tells, and anyone could mistake synthesised speech for the real thing.

VALL-E may pose a threat to both national infrastructure and democracy

There's no doubt that this technology will only improve with time, and Tuchen described the fight to flag high-tech fakery as “a cat and mouse between the industry and the fraudsters, and a constant game of one-upmanship”.

In the hands of nation-state hackers, this technology could also be used for targeted attacks on critical national infrastructure (CNI), or to manufacture convincing disinformation such as faked recordings of political speeches. In this respect, work in this area of AI represents a risk to democracy and must be treated as a very real threat. It's an area in desperate need of regulation, as is being considered in the US government's proposed AI Bill of Rights.

“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” reads the ethics statement in the team’s research paper [PDF]. “We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”

Given the propensity of threat actors to exploit any and all technology at their disposal, for profit or simply for chaos, this extract is something of an understatement. Conducting experiments “under the assumption” of benevolence gives little idea of what technology like this will be used for in the real world. A serious debate needs to be had over the destructive potential of developments in this field.


For their part, the researchers have stated that a detection model could be built to identify whether or not an audio clip was constructed using VALL-E. But unless such a model becomes embedded in future security suites, tools like this can and will be used by threat actors to construct convincing scams.
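The paper does not describe how such a detection model would work. As a purely illustrative sketch, the toy below computes spectral flatness, one crude statistical cue of the kind a real, trained classifier might combine with many others to flag audio that is statistically "too clean". Every function and threshold here is hypothetical, and has no connection to any actual VALL-E detector.

```python
# Hypothetical illustration only: a crude spectral statistic of the kind
# a synthesised-speech detector might use. Not VALL-E's actual detector.
import cmath
import math
import random

def spectral_flatness(samples):
    """Geometric mean / arithmetic mean of the power spectrum, in (0, 1].
    Purely tonal signals score near 0; noise-like signals score near 1."""
    n = len(samples)
    power = []
    # Naive DFT is fine for a short toy clip (skip the DC bin).
    for k in range(1, n // 2):
        s = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        power.append(abs(s) ** 2 + 1e-12)  # floor avoids log(0)
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))

# A pure tone versus pseudo-random noise: the statistic separates them.
tone = [math.sin(2 * math.pi * 5 * t / 256) for t in range(256)]
random.seed(0)
noise = [random.uniform(-1, 1) for _ in range(256)]

assert spectral_flatness(tone) < 0.05   # highly tonal
assert spectral_flatness(noise) > 0.3   # noise-like
```

A single hand-picked statistic like this is far too weak to catch a modern vocoder; in practice detectors are trained classifiers. But the underlying principle, searching for statistical regularities that a synthesis model leaves behind, is the same.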

Protective measures are also unlikely to be deployed on low-tech channels such as phone calls, where synthesised speech could do the most damage. If threat actors were to leave a voicemail, a tool such as VALL-E could be used to impersonate an employee’s boss and request any number of damaging actions. In the near future, it may even be possible to apply synthesised voices to live audio and fake entire phone conversations.

There are likely a few years to go before technology like VALL-E is wheeled out for public use. The project's GitHub page contains sample clips demonstrating VALL-E’s ability to synthesise speech, and some remain unconvincing to the trained ear. But it’s a step towards an uncertain future, in which digital identity becomes even harder to determine safely.
