Microsoft's VALL-E will usher in a new era of cyber crime


It is a rare thing to accurately anticipate the perils of technology from the outset. For security and fraud professionals, though, that is exactly what's happening in the fields of generative AI, deepfakes, and text-to-speech (TTS).

The latest development in the field of TTS is Microsoft’s new neural codec language model VALL-E, which aims to accurately replicate a person’s speech using a combination of a text prompt and a short clip of a real speaker. This, in itself, is the latest addition to a repertoire of tools powered by artificial intelligence (AI), including DALL·E 2 and ChatGPT.

VALL-E has been trained using the LibriLight dataset, which contains 60,000 hours of speech from over 7,000 speakers reading audiobooks. This allows it to operate at a level unachievable by other TTS models and mimic a speaker’s voice saying any chosen phrase after receiving only three seconds of input audio.

Although several benefits could arise from this technology, such as improved virtual assistants or more natural-sounding accessibility tools, it’s hard to ignore that it is also an avenue ripe for exploitation.

How can cyber criminals weaponise VALL-E?

With VALL-E, it's possible to replicate tone, intonation, and even emotion, all of which make voice clips produced using the model more convincing. Persuading victims that urgent action is necessary is a common strategy for getting them to click on phishing links, download ransomware payloads, or transfer funds, and attackers could ramp up the emotional intensity of voice clips made using VALL-E to heighten this sense of urgency.

The use of AI-generated content for phishing or ransomware attacks has also been on the rise in recent years, as models have become more sophisticated at replicating trusted sources. In 2021, Dark Reading reported that threat actors had used deepfaked audio to instruct an employee at a UAE company to transfer $35 million to them. The employee had been convinced that they were receiving audio instructions from the company’s director and an associated lawyer, and that the money was for an acquisition.


In 2022, experts at Cisco warned that deepfake attacks would be the next major threat to businesses, potentially in the form of attackers impersonating CEOs in videos sent to employees. At the time, the warning came with the caveat that attackers would need to exceed a data threshold to fake an individual’s face or speech convincingly; with tools such as VALL-E, that threshold may have been significantly reduced. In the same interview, it was suggested that social norms around online communication could become “super weird” in the near future, necessitating regular checks that the person on the other end of the line is who you think they are.

As Mike Tuchen, the CEO of digital identity firm Onfido, explained on a recent episode of the IT Pro Podcast, deepfakes are already possible over live video calls. Technology of this nature made international headlines in June 2022, when Berlin mayor Franziska Giffey was tricked into speaking to a prankster using a real-time deepfake to look like Kyiv mayor Vitali Klitschko.

Tuchen described the tech being developed to identify deepfakes, with current examples requiring participants to turn their heads to the side to prove their feed is unaltered. Current deepfake technology is at its most convincing when reproducing well-lit faces staring at the camera, and it struggles with occlusion, such as when subjects look to the side or cover their face to any degree. But with audio there are no such easy tells, and anyone may easily confuse synthesised speech with the real thing.

VALL-E may pose a threat to both national infrastructure and democracy

There's no doubt this technology will only improve with time, and Tuchen described the fight to flag high-tech fakery as “a cat and mouse between the industry and the fraudsters, and a constant game of one-upmanship”.

In the hands of nation-state hackers, technology of this kind could also be used for targeted attacks on critical national infrastructure (CNI), or to manufacture convincing disinformation such as faked recordings of political speeches. In this respect, research in this area of AI represents a risk to democracy and must be considered a very real threat. It's an area in desperate need of regulation, of the kind being considered in the US government's proposed AI Bill of Rights.

“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” reads the ethics statement in the team’s research paper. “We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”

Given the propensity for threat actors to exploit any and all technology at their disposal for profit, or simply for chaos, this extract is something of an understatement. Conducting experiments “under the assumption” of benevolence gives little idea of how technology like this will be used in the real world. Real debate needs to be had over the destructive potential of developments in this field.


For their part, the researchers have stated a detection model could be built to identify whether or not an audio clip was constructed using VALL-E. But unless this becomes embedded into a future security suite, tools such as this can and will be used by threat actors to construct convincing scams.

Protective measures are also unlikely to be deployed on low-tech channels such as phone calls, where synthesised speech could do the most damage. If threat actors were leaving a voicemail on a phone, a tool such as VALL-E could be used to impersonate an employee’s boss and request any number of damaging actions be taken. In the near future, it may be possible to apply synthesised voice to live audio and fake entire phone conversations.

There are a few years to go before tech like VALL-E is wheeled out for public use. The GitHub page contains sample clips demonstrating VALL-E’s ability to synthesise speech, and some remain unconvincing to the trained ear. But it’s a step towards an uncertain future, in which digital identity becomes even harder to safely determine.

Rory Bathgate
Features and Multimedia Editor

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.

In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory on LinkedIn.