Early AI deepfakes, while impressive from a technical perspective, were difficult to create and not entirely convincing.
The technology has advanced quickly since about 2020, however, and has recently cleared a key hurdle: It’s now possible to create convincing real-time audio deepfakes using a combination of publicly available tools and affordable hardware. That’s according to a report published in September by the cybersecurity firm NCC Group, which outlines a “deepfake vishing” (voice phishing) technique that uses AI to re-create a target’s voice in real time.
Pablo Alobera, managing security consultant at NCC Group, says the real-time deepfake tool, once trained, can be activated with just the press of a button. “We created a front end, a Web page, with a start button. You just click start, and it starts working,” says Alobera.
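NCC Group hasn’t published that code, but the kind of one-button front end Alobera describes needs very little plumbing. The Python sketch below is purely illustrative: a minimal Flask page whose start button kicks off a placeholder start_conversion() routine, standing in for whatever voice-conversion pipeline an operator has trained.

```python
# Hypothetical sketch of a one-button web front end for a voice changer.
# Flask serves a single page; clicking "Start" POSTs to /start, which kicks
# off a placeholder conversion loop. No voice-conversion model is included:
# start_conversion() is a stand-in.

import threading
from flask import Flask, jsonify

app = Flask(__name__)

PAGE = """
<!doctype html>
<title>Voice changer</title>
<button onclick="fetch('/start', {method: 'POST'})">Start</button>
"""

def start_conversion():
    # Placeholder: a real tool would open the microphone, run each audio
    # frame through a trained voice-conversion model, and play the
    # converted audio into the call in real time.
    pass

@app.route("/")
def index():
    return PAGE

@app.route("/start", methods=["POST"])
def start():
    # Run the (placeholder) conversion loop in the background so the
    # HTTP request returns as soon as the button is clicked.
    threading.Thread(target=start_conversion, daemon=True).start()
    return jsonify(status="started")

if __name__ == "__main__":
    app.run(port=8000)
```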
Real-time Voice Deepfakes Can Impersonate Anyone
NCC Group hasn’t made its real-time voice deepfake tool publicly available, but the company’s research paper includes a sample of the resulting audio. It demonstrates that the real-time deepfake is both convincing and free of discernible latency.
The quality of the input audio used in the demonstration is also rather poor, yet the output still sounds convincing. That suggests the tool would work with the kinds of microphones built into ordinary laptops and smartphones.
Audio deepfakes are nothing new, of course. A variety of companies, such as ElevenLabs, provide tools that can create an audio deepfake with just a few minutes of audio.
However, past examples of AI voice deepfakes were not recorded in real time, which could make the deepfake less convincing. Attackers could prerecord deepfaked dialogue, but the victim could easily catch on if the conversation veered from the expected script. Alternatively, an attacker might try to generate the deepfake on the fly, but it would require at least several seconds to generate (and often much longer), leading to obvious delays in the conversation. NCC Group’s real-time deepfake isn’t hampered by these problems.
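The difference comes down to how the audio is handled. A real-time voice changer processes small frames of microphone audio as they arrive, so the delay a listener hears is roughly the frame length plus the model’s per-frame inference time, rather than the seconds needed to synthesize an entire utterance. The Python sketch below illustrates that streaming structure using the sounddevice library; convert_frame() is a placeholder for a trained voice-conversion model, not NCC Group’s actual tool.

```python
# Illustrative streaming loop: process microphone audio in small frames so
# the output lags the input by only (frame length + per-frame inference time).
# convert_frame() is a placeholder, not an actual voice-conversion model.

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000
FRAME_SAMPLES = 320          # 20 ms frames at 16 kHz

def convert_frame(frame: np.ndarray) -> np.ndarray:
    # Stand-in for a trained voice-conversion model; here the audio
    # simply passes through unchanged.
    return frame

def callback(indata, outdata, frames, time, status):
    if status:
        print(status)
    # Each 20 ms frame is converted and written straight to the output,
    # so latency stays near the frame length instead of the seconds
    # needed to synthesize a whole utterance.
    outdata[:] = convert_frame(indata)

with sd.Stream(samplerate=SAMPLE_RATE,
               blocksize=FRAME_SAMPLES,
               channels=1,
               dtype="float32",
               callback=callback):
    sd.sleep(10_000)         # stream for 10 seconds
```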
Alobera says that, with consent from clients, NCC Group used the voice changer alongside other techniques, like caller ID spoofing, to impersonate individuals. “Nearly all times we called, it worked. The target believed we were the person we were impersonating,” says Alobera.
NCC Group’s demonstration is also notable because it doesn’t rely on a third-party service, but instead uses open-source tools and readily available hardware. Though the best performance is achieved with a high-end GPU, the audio deepfake was also tested on a laptop with Nvidia’s RTX A1000. (The A1000 is among the lowest-performing GPUs in Nvidia’s current lineup.) Alobera says the laptop was able to generate a voice deepfake with only a half-second delay.
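That half-second figure is easy to sanity-check on your own hardware. The snippet below times a dummy PyTorch model on 20-millisecond audio chunks and reports whether per-chunk inference fits a real-time budget; because the model is a stand-in, the numbers illustrate only the measurement approach, not the performance of any real voice-conversion network.

```python
# Rough per-chunk latency benchmark with a dummy stand-in model.
# Shows the measurement approach only; it says nothing about the
# performance of any real voice-conversion network.

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy stand-in for a voice-conversion model.
model = torch.nn.Sequential(
    torch.nn.Linear(320, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 320),
).to(device).eval()

chunk = torch.randn(1, 320, device=device)   # one 20 ms chunk at 16 kHz

with torch.no_grad():
    # Warm up, then time 100 chunks.
    for _ in range(10):
        model(chunk)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(chunk)
    if device == "cuda":
        torch.cuda.synchronize()
    per_chunk_ms = (time.perf_counter() - start) / 100 * 1000

print(f"{device}: {per_chunk_ms:.2f} ms per 20 ms chunk "
      f"({'meets' if per_chunk_ms < 20 else 'misses'} the real-time budget)")
```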
Real-time Video Deepfakes Aren’t Far Behind
NCC Group’s success in creating a tool for real-time voice deepfakes suggests they’re on the verge of going mainstream. It seems you can’t always believe what you can hear, even if the source is a phone call with a person you’ve known for years.
But what about what you can see?
Video deepfakes are also having a moment, thanks to a wave of viral deepfake videos sweeping across TikTok, YouTube, Instagram, and other video platforms.
This was made possible by the release of two recent AI models: Alibaba’s WAN 2.2 Animate and Google’s Gemini 2.5 Flash Image (often referred to as Nano Banana). While earlier models could often replicate the faces of celebrities, the latest models can be used to deepfake anyone and place them in nearly any environment.
Trevor Wiseman, founder of the AI cybersecurity consultancy the Circuit, says he’s already seen cases where companies and individuals were tricked by video deepfakes. He says one company was duped in the hiring process and “actually shipped a laptop, to a U.S. address that ended up being a holding place for a scam.”
As impressive as the latest video deepfakes are, though, there are still limitations.
Real-time audio deepfakes will make the steps required for successful voice-phishing attacks more accessible. NCC Group
Unlike NCC Group’s audio deepfake, the latest video deepfakes are still not capable of high-quality results in real time. There are also still a few tells. Wiseman says even the latest video deepfakes have trouble matching a person’s expression with their tone of voice and demeanor. “If they’re excited but they have no emotion on their face, it’s fake,” he says.
Still, this may be a case where the exceptions prove the rule. Wiseman notes the technology is already good enough to fool most people most of the time. He suggests companies and individuals will need new tactics to authenticate themselves that don’t rely on voice or video conversations.
“You know, I’m a baseball fan,” he says. “They always have signals. It sounds corny, but in the day we live in, you’ve got to come up with something that you can use to say if this is real, or not.”