Microsoft's VALL-E will usher in a new era of cyber crime
With its ability to synthesise speech from short audio clips, Microsoft's VALL-E poses a worrying development in the realm of deepfakes


It is rare to accurately anticipate the perils of a technology from the outset. For security and fraud professionals, though, that is exactly what is happening in the field of generative AI, deepfakes, and text-to-speech (TTS).
The latest development in TTS is Microsoft’s new neural codec language model VALL-E, which aims to accurately replicate a person’s speech from a text prompt and a short clip of the real speaker. It joins a growing repertoire of tools powered by artificial intelligence (AI), alongside DALL·E 2 and ChatGPT.
VALL-E has been trained using the LibriLight dataset, which contains 60,000 hours of speech from over 7,000 speakers reading audiobooks. This allows it to operate at a level unachievable by other TTS models and mimic a speaker’s voice saying any chosen phrase after receiving only three seconds of input audio.
Although this technology could bring real benefits, such as improved virtual assistants or more natural digital accessibility, it is hard to ignore its potential as an avenue ripe for exploitation.
How can cyber criminals weaponise VALL-E?
With VALL-E, it's possible to replicate tone, intonation, and even emotion, factors that make voice clips produced with the model all the more convincing. Instilling a sense of urgency in victims is a common strategy for getting them to click on phishing links, download ransomware payloads, or transfer funds, and attackers could ramp up the emotional intensity of voice clips made with VALL-E to heighten that sense of urgency.
The use of AI-generated content in phishing and ransomware attacks has been on the rise in recent years, too, as models have become more sophisticated at mimicking trusted sources. In 2021, Dark Reading reported that threat actors had used deepfaked audio to instruct an employee at a UAE company to transfer them $35 million. The employee had been convinced they were receiving audio instructions from the company’s director and an associated lawyer, and that the money was for an acquisition.
In 2022, experts at Cisco warned that deepfake attacks would be the next major threat to businesses, potentially taking the form of attackers impersonating CEOs in videos sent to employees. At the time, the warning came with the caveat that attackers would need a large amount of data to fake an individual’s face or speech convincingly; with tools such as VALL-E, that threshold may have been significantly lowered. In the same interview, it was suggested that social norms around online communication could become “super weird” in the near future, necessitating regular checks that the person on the other end of the line is who you think they are.
As Mike Tuchen, the CEO of digital identity firm Onfido, explained on a recent episode of the IT Pro Podcast, deepfakes are already possible over live video calls. Technology of this nature made international headlines in October 2022, when Berlin mayor Franziska Giffey was tricked into speaking to a prankster using a real-time deepfake to look like Kyiv mayor Vitali Klitschko.
Tuchen described the technology being developed to identify deepfakes, with current examples requiring participants to turn their heads to the side to prove their feed is unaltered. Current deepfake technology is at its most convincing when reproducing well-lit faces staring directly at the camera, and it struggles with occlusion: when subjects look to the side, or cover their face to any degree. Audio offers no such easy tells, and anyone could mistake synthesised speech for the real thing.
VALL-E may pose a threat to both national infrastructure and democracy
There's no doubt this technology will only improve with time, and Tuchen described the fight to flag high-tech fakery as “a cat and mouse between the industry and the fraudsters, and a constant game of one-upmanship”.
In the hands of nation-state hackers, it could also be used for targeted attacks on critical national infrastructure (CNI), or to manufacture convincing disinformation such as faked recordings of political speeches. In this respect, research in this area of AI poses a risk to democracy and must be treated as a very real threat. It is an area in desperate need of regulation, of the kind being considered in the US government's proposed AI Bill of Rights.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” reads the ethics statement in the team’s research paper [PDF]. “We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”
Given the propensity of threat actors to exploit any and all technology at their disposal, whether for profit or simply for chaos, this extract is something of an understatement. Conducting experiments “under the assumption” of benevolence says nothing about how technology like this will actually be used in the real world. A serious debate is needed over the destructive potential of developments in this field.
For their part, the researchers have stated that a detection model could be built to identify whether an audio clip was generated by VALL-E. But unless such detection becomes embedded in future security suites, tools like this can and will be used by threat actors to construct convincing scams.
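The paper does not describe how such a detector would work. As a purely illustrative sketch, and not anything from Microsoft's research, a naive starting point might score audio on simple statistical properties of the signal, such as spectral flatness; real detectors would instead use learned models trained on genuine and synthesised examples. The function name and thresholds here are assumptions for demonstration only:

```python
import numpy as np

def spectral_flatness(signal: np.ndarray) -> float:
    """Ratio of the geometric to arithmetic mean of the power spectrum.
    Values near 1.0 indicate noise-like audio; values near 0.0, tonal audio.
    (Illustrative feature only; not a real deepfake detector.)"""
    power = np.abs(np.fft.rfft(signal)) ** 2
    power = power[power > 0]  # drop zero bins to avoid log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return float(geometric_mean / arithmetic_mean)

# Two toy one-second signals at a 16 kHz sample rate
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)   # strongly tonal, like voiced speech
noise = rng.standard_normal(16000)   # broadband, noise-like

tone_score = spectral_flatness(tone)    # low: energy concentrated in one bin
noise_score = spectral_flatness(noise)  # high: energy spread across all bins
```

In practice, a hand-picked feature like this would be trivially defeated by a modern codec-based model, which is why the researchers frame detection as a dedicated model trained alongside the synthesiser.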
Protective measures are also unlikely to be deployed on low-tech channels such as phone calls, where synthesised speech could do the most damage. If threat actors were leaving a voicemail on a phone, a tool such as VALL-E could be used to impersonate an employee’s boss and request any number of damaging actions be taken. In the near future, it may be possible to apply synthesised voice to live audio and fake entire phone conversations.
There are a few years to go before technology like VALL-E is rolled out for public use. VALL-E's GitHub demo page contains sample clips demonstrating its ability to synthesise speech, and some remain unconvincing to the trained ear. But it is a step towards an uncertain future, in which digital identity becomes even harder to verify.

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.