Microsoft's VALL-E will usher in a new era of cyber crime
With its ability to synthesise speech from short audio clips, Microsoft's VALL-E marks a worrying development in the realm of deepfakes


It is a rare thing to accurately anticipate the perils of technology from the outset. For security and fraud professionals, though, that is exactly what's happening with the field of generative AI, deepfakes, and text-to-speech (TTS).
The latest development in the field of TTS is Microsoft’s new neural codec language model VALL-E, which aims to accurately replicate a person’s speech using a combination of a text prompt and a short clip of a real speaker. This, in itself, is the latest addition to a repertoire of tools powered by artificial intelligence (AI), including DALL·E 2 and ChatGPT.
VALL-E has been trained using the LibriLight dataset, which contains 60,000 hours of speech from over 7,000 speakers reading audiobooks. This allows it to operate at a level unachievable by other TTS models and mimic a speaker’s voice saying any chosen phrase after receiving only three seconds of input audio.
Although several benefits could arise from this technology, such as improved virtual assistants or more natural digital accessibility, it’s hard to ignore its potential as an avenue ripe for exploitation.
How can cyber criminals weaponise VALL-E?
With VALL-E, it's possible to replicate tone, intonation, and even emotion, all of which make voice clips produced using the model more convincing. Persuading victims that urgent action is necessary is a common strategy for getting them to click on phishing links, download ransomware payloads, or transfer funds, and attackers could ramp up the emotional intensity of voice clips made using VALL-E to heighten this sense of urgency.
The use of AI-generated content for phishing or ransomware attacks has also been on the rise in recent years, as models have become more sophisticated at replicating trusted sources. In 2021, Dark Reading reported that threat actors had used deepfaked audio to instruct an employee at a UAE company to transfer $35 million to them. The employee had been convinced that they were receiving audio instructions from the company’s director and an associated lawyer, and that the money was for an acquisition.
In 2022, experts at Cisco warned that deepfake attacks will be the next major threat to businesses, which could come in the form of attackers impersonating CEOs in videos sent to employees. At the time, the warning came with the caveat that attackers would have to exceed a data threshold to fake an individual’s face or speech convincingly; with tools such as VALL-E, that threshold may have been significantly reduced. In the same interview, it was suggested that social norms around online communication could become “super weird” in the near future, necessitating regular checks that the person on the other end of the line is who you think they are.
As Mike Tuchen, the CEO of digital identity firm Onfido, explained on a recent episode of the IT Pro Podcast, deepfakes are already possible over live video calls. Technology of this nature made international headlines in October 2022, when Berlin mayor Franziska Giffey was tricked into speaking to a prankster using a real-time deepfake to look like Kyiv mayor Vitali Klitschko.
Tuchen described the tech being developed to identify deepfakes, with current examples requiring participants to turn their heads to the side to prove their feed is unaltered. Current deepfake technology is at its most convincing when reproducing well-lit faces staring at the camera, and struggles with occlusion, which occurs when subjects look to the side or cover their face to any degree. But with audio there are no such easy tells, and anyone may easily confuse synthesised speech with the real thing.
VALL-E may pose a threat to both national infrastructure and democracy
There's no doubt this technology will only improve with time, and Tuchen described the fight to flag high-tech fakery as “a cat and mouse between the industry and the fraudsters, and a constant game of one-upmanship”.
In the hands of nation-state hackers, this could also be used for targeted attacks on critical national infrastructure (CNI), or to manufacture convincing disinformation such as faked recordings of political speeches. To this extent, study in this area of AI represents a risk to democracy and must be considered a very real threat. It's an area in desperate need of regulation, as is being considered in the US government's proposed AI Bill of Rights.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” reads the ethics statement in the team’s research paper [PDF]. “We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”
Given the propensity for threat actors to exploit any and all technology at their disposal for profit, or simply for chaos, this extract is something of an understatement. Conducting experiments “under the assumption” of benevolence doesn’t give the first idea of what technology like this will be used for in the real world. Real debate needs to be had over the destructive potential of developments in this field.
For their part, the researchers have stated a detection model could be built to identify whether or not an audio clip was constructed using VALL-E. But unless this becomes embedded into a future security suite, tools such as this can and will be used by threat actors to construct convincing scams.
Protective measures are also unlikely to be deployed on low-tech channels such as phone calls, where synthesised speech could do the most damage. If threat actors were leaving a voicemail on a phone, a tool such as VALL-E could be used to impersonate an employee’s boss and request any number of damaging actions be taken. In the near future, it may be possible to apply synthesised voice to live audio and fake entire phone conversations.
There are a few years to go before tech like VALL-E is wheeled out for public use. The GitHub page contains sample clips demonstrating VALL-E’s ability to synthesise speech, and some remain unconvincing to the trained ear. But it’s a step towards an uncertain future, in which digital identity becomes even harder to safely determine.

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.