The Rise of Speech Synthesis Attacks

By Gavin Debetaz, Associate Information Security Analyst

The Rising Threat

As cybersecurity and social engineering defenses grow stronger, malicious attackers are evolving their tactics to stay ahead. A newer technique appearing in social engineering attacks is what's known as speech synthesis.

Students and researchers from the University of Chicago did a deep dive into what these kinds of attacks look like in the real world. They studied the software used, the resources needed, and how vulnerable both machines and humans are to these attacks. What they found is quite alarming, especially for financial institutions that rely on voice authentication. First, to understand the enemy we are facing, we need a better understanding of what speech synthesis actually is.

What is Speech Synthesis?

Voice-based spoofing attacks have been around for a long time, such as impersonations, or record-and-replay attacks where an attacker plays recorded audio of a victim to fool a target. Speech synthesis attacks are much more complex. They use recent developments in deep learning and deep neural network technology to produce audio that sounds like the victim's speech. Once the voice is generated, an attacker can use text-to-speech or voice conversion technology to make their victim "say" whatever they desire.

The speech synthesis systems that were tested required five minutes or less of target audio in order to run synthesis properly. These audio samples could be taken from the internet, or even gathered through secret recordings of conversations with the victim. If there are video or audio recordings of your company executives on the internet, they could be leveraged in speech synthesis attacks.

Defenses Against Speech Synthesis

Larger financial institutions might have an automated user verification system in place to help authenticate their users by voice. For example, Microsoft Azure offers a state-of-the-art speech recognition system that uses this method of authentication. It is certified against several international standards, including the Payment Card Industry (PCI) standards, HIPAA, and International Organization for Standardization (ISO) standards. Other systems that use speech recognition include Amazon Alexa and WeChat.

Smaller institutions might not be able to put these automated solutions in place. They must rely on their employees to accurately authenticate their customers over the phone. Without proper training on the threat of speech synthesis attacks, employees won't know how to recognize or respond to them. The research team at the University of Chicago tested speech synthesis attacks against speech recognition systems such as Azure, and against humans, to see how effective these attacks can be.

Results of the Experiment

First, testing was done against speech recognition systems. The study included 14 participants from a variety of linguistic backgrounds, varying in gender and ethnicity, and over 100 synthetic voice samples were created from these participants' voices.

The testing on these real-world systems showed that 60% of the participants had at least one of their synthetic speech samples accepted by the speech recognition systems (Azure, WeChat, and Alexa). This shows the threat that speech synthesis attacks pose even to state-of-the-art systems.

Next, testing was done against human recognition of synthetic speech. Testing took place in a trusted setting, where the participants did not know a synthetic voice was present, and in an untrusted setting, where the participants tried to distinguish which voice was real and which was synthetic. There was also testing to see whether participants could distinguish between real and synthetic voices of speakers they were familiar with versus unfamiliar with.

The results showed that over half of the participants were fooled by synthetic voices of both familiar and unfamiliar speakers, even though these participants knew they were being tested on recognizing synthetic speech. In the experiment against unsuspecting participants, the results showed that none of the participants exhibited any suspicion or hesitation when speaking with the synthetic voice. This shows that without prior warning or suspicion, these attacks can be highly effective.

How to Respond

Thankfully, this technology is not currently common in attacks. However, it is important to train your employees on proper voice authentication and make them aware of this threat. Proper preparation could help you recognize this type of attack. For text-to-speech systems, an attacker would need to have a queue of phrases ready to play in an attack. To combat this, organizations could internally use things like "words of the day" to verify that speakers are with the organization. As effective as these systems are, they fail to completely mimic the subtle nuances of real human speech. Employees should be listening for things like unnatural emphasis on certain words, or slow reaction times to questions, before authenticating a user.
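As a rough illustration of the "word of the day" idea, a challenge word could be derived from the date and a shared internal secret, so an attacker with a pre-generated queue of synthetic phrases cannot predict it. The following is a minimal sketch only; the word list, secret key, and function names here are illustrative assumptions, not part of any cited product or the study described above:

```python
import hashlib
import datetime

# Illustrative word list; a real deployment would keep a larger, private list.
WORD_LIST = ["harbor", "lantern", "granite", "meadow", "compass"]

# Shared secret known only to the organization (assumption for this sketch).
SECRET_KEY = b"example-internal-secret"

def word_of_the_day(date: datetime.date) -> str:
    """Derive the verification word for a given date from the shared secret."""
    digest = hashlib.sha256(SECRET_KEY + date.isoformat().encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(WORD_LIST)
    return WORD_LIST[index]

def verify_caller(spoken_word: str, date: datetime.date) -> bool:
    """Check the word a caller provides against the expected word for that date."""
    return spoken_word.strip().lower() == word_of_the_day(date)
```

Because the word changes daily and depends on a secret, replaying yesterday's recording or a pre-synthesized phrase fails the check; the control costs nothing beyond distributing the day's word internally.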

TraceSecurity offers a variety of services tailored to testing whether employees reveal sensitive information over the phone or properly authenticate their users. Services such as seed account calls work especially well to ensure that employees take all the proper steps to authenticate customers before providing access to sensitive information.

Ultimately, your investigation and mitigation activities should be driven by your organization's information security policies. It is important to train employees to be cautious when verifying customers over the phone. Attackers will use many different techniques to trick users, and employees should notify their superiors of any suspicious calls before trusting them. Please feel free to reach out to your account team or email us if you have any questions.
