Microsoft's VALL-E will usher in a new era of cyber crime
With its ability to synthesise speech from short audio clips, Microsoft's VALL-E poses a worrying development in the realm of deepfakes
It is rare to accurately anticipate the perils of a technology from the outset. For security and fraud professionals, though, that is exactly what is happening in the fields of generative AI, deepfakes, and text-to-speech (TTS).
The latest development in the field of TTS is Microsoft’s new neural codec language model VALL-E, which aims to accurately replicate a person’s speech using a combination of a text prompt and a short clip of a real speaker. This, in itself, is the latest addition to a repertoire of tools powered by artificial intelligence (AI), including DALL·E 2 and ChatGPT.
VALL-E has been trained using the LibriLight dataset, which contains 60,000 hours of speech from over 7,000 speakers reading audiobooks. This allows it to operate at a level unachievable by other TTS models and mimic a speaker’s voice saying any chosen phrase after receiving only three seconds of input audio.
Although several benefits could arise from this technology, such as improved virtual assistants or more natural digital accessibility, it's hard to ignore that it is also an avenue ripe for exploitation.
How can cyber criminals weaponise VALL-E?
With VALL-E, it's possible to replicate tone, intonation, and even emotion. These are factors that make voice clips produced using the model even more convincing. Making victims feel that urgent action is necessary is a common strategy for getting them to click on phishing links, download ransomware payloads, or transfer funds, and attackers could ramp up the emotional intensity of voice clips made using VALL-E to increase this sense of urgency.
The use of AI-generated content for phishing or ransomware attacks has been on the rise in recent years, too, as models have become more sophisticated at replicating trusted sources. In 2021, Dark Reading reported that threat actors had used deepfaked audio to instruct an employee at a UAE company to transfer them $35 million. The employee had been convinced that they were receiving audio instructions from the company's director and an associated lawyer, and that the money was for an acquisition.
In 2022, experts at Cisco warned that deepfake attacks would be the next major threat to businesses, which could come in the form of attackers impersonating CEOs in videos sent to employees. At the time, the warning came with the caveat that attackers would have to exceed a data threshold to fake an individual's face or speech convincingly; with tools such as VALL-E, that threshold may have been significantly reduced. In the same interview, it was suggested social norms around online communication could become "super weird" in the near future, necessitating regular checks that the person on the other end of the line is who you think they are.
As Mike Tuchen, the CEO of digital identity firm Onfido, explained on a recent episode of the IT Pro Podcast, deepfakes are already possible over live video calls. Technology of this nature made international headlines in October 2022, when Berlin mayor Franziska Giffey was tricked into speaking to a prankster using a real-time deepfake to look like Kyiv mayor Vitali Klitschko.
Tuchen described the technology being developed to identify deepfakes, with current examples requiring participants to turn their heads to the side to prove their feed is unaltered. Current deepfake technology is at its most convincing when reproducing well-lit faces staring at the camera, and struggles with occlusion — when subjects look to the side, or cover their face to any degree. But with audio there are no such easy tells, and anyone may easily confuse synthesised speech with the real thing.
VALL-E may pose a threat to both national infrastructure and democracy
There's no doubt this technology will only improve with time, and Tuchen described the fight to flag high-tech fakery as “a cat and mouse between the industry and the fraudsters, and a constant game of one-upmanship”.
In the hands of nation-state hackers, this technology could also be used for targeted attacks on critical national infrastructure (CNI), or to manufacture convincing disinformation such as faked recordings of political speeches. To this extent, research in this area of AI represents a risk to democracy and must be considered a very real threat. It's an area in desperate need of regulation, as is being considered in the US government's proposed AI Bill of Rights.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” reads the ethics statement in the team’s research paper [PDF]. “We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”
Given the propensity for threat actors to exploit any and all technology at their disposal for profit, or simply for chaos, this extract is something of an understatement. Conducting experiments “under the assumption” of benevolence doesn’t give the first idea of what technology like this will be used for in the real world. Real debate needs to be had over the destructive potential of developments in this field.
For their part, the researchers have stated that a detection model could be built to identify whether or not an audio clip was constructed using VALL-E. But unless such detection becomes embedded in future security suites, tools like this can and will be used by threat actors to construct convincing scams.
Protective measures are also unlikely to be deployed on low-tech channels such as phone calls, where synthesised speech could do the most damage. If threat actors were leaving a voicemail on a phone, a tool such as VALL-E could be used to impersonate an employee’s boss and request any number of damaging actions be taken. In the near future, it may be possible to apply synthesised voice to live audio and fake entire phone conversations.
There may be a few years to go before tech like VALL-E is wheeled out for public use. The GitHub page contains sample clips demonstrating VALL-E's ability to synthesise speech, and some remain unconvincing to the trained ear. But it's a step towards an uncertain future, in which digital identity becomes even harder to safely determine.

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.