Researchers can hide secret commands for voice assistants in spoken sentences, birds’ twittering, or music. The commands are inaudible to the human ear, but the machine recognises them precisely.
A team from Ruhr-Universität Bochum has succeeded in integrating secret commands for the Kaldi speech recognition system (which is believed to be contained in Amazon’s Alexa and many other systems) into audio files. The commands are inaudible to the human ear, but Kaldi reacts to them. The researchers showed that they could hide any sentence they liked in different types of audio signals, such as speech, birds’ twittering, or music, and that Kaldi understood it. The group, which includes Lea Schönherr, Professor Dorothea Kolossa, and Professor Thorsten Holz from the Horst Görtz Institute for IT Security, has published the results online.
“A virtual assistant that can carry out online orders is one of many examples where such an attack could be exploited,” says Thorsten Holz. “We could manipulate an audio file, such as a song played on the radio, to contain a command to purchase a particular product.”
MP3 principle used
In order to incorporate the commands into the audio signals, the researchers use the psychoacoustic model of hearing, or, more precisely, the masking effect, which is dependent on volume and frequency. “When the auditory system is busy processing a loud sound of a certain frequency, we are no longer able to perceive other, quieter sounds at this frequency for a few milliseconds,” explains Dorothea Kolossa.
This effect is also used in the MP3 format, which omits inaudible parts of the signal to minimise file size. It was in these parts that the researchers hid the commands for the voice assistant. For humans, the added components sound like random noise that is barely noticeable, if at all, in the overall signal. For the machine, however, they change the meaning: while the human hears statement A, the machine understands statement B. Examples of the manipulated files and the sentences recognised by Kaldi can be found on the researchers’ website. In future studies, they want to show that the attack also works when the signal is played through a loudspeaker and reaches the voice assistant through the air.
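The masking idea can be illustrated with a deliberately simplified sketch. It is not the researchers' method and not the psychoacoustic model used in MP3: the "masking threshold" here is simply assumed to lie a fixed margin below the carrier's magnitude in each frequency bin. The function name `clip_to_masking_threshold`, the 20 dB margin, and the 16 kHz sample rate are all hypothetical choices made for the example.

```python
import numpy as np

def clip_to_masking_threshold(carrier, perturbation, margin_db=20.0):
    """Attenuate `perturbation` per frequency bin so that its magnitude stays
    at least `margin_db` below the carrier's magnitude in the same bin.

    Toy stand-in for a psychoacoustic model (assumption): a real model also
    accounts for masking across neighbouring frequencies and over time.
    """
    carrier_spec = np.fft.rfft(carrier)
    pert_spec = np.fft.rfft(perturbation)
    # Assumed threshold: carrier magnitude lowered by a fixed margin.
    threshold = np.abs(carrier_spec) * 10.0 ** (-margin_db / 20.0)
    pert_mag = np.abs(pert_spec)
    # Per-bin gain of at most 1 that keeps the perturbation under the threshold.
    gain = np.minimum(1.0, threshold / np.maximum(pert_mag, 1e-12))
    return np.fft.irfft(pert_spec * gain, n=len(carrier))

# Usage: a loud 1 kHz tone as carrier, random noise as the component to hide.
sample_rate = 16_000                      # assumed sample rate in Hz
t = np.arange(1024) / sample_rate
carrier = np.sin(2 * np.pi * 1000.0 * t)
rng = np.random.default_rng(0)
perturbation = 0.1 * rng.standard_normal(len(carrier))

hidden = clip_to_masking_threshold(carrier, perturbation)
manipulated = carrier + hidden            # what a listener would be played
```

In this toy setting, only the noise close to the tone's frequency survives, because that is the only place where the loud carrier leaves room under the assumed threshold.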
No effective protection so far
The aim of the research is to make speech recognition assistants more robust against attacks over the long term. For the attack presented here, it is conceivable that the systems could calculate which parts of an audio signal are inaudible to humans and remove them. “However, there are certainly other ways to hide the secret commands in the files besides the MP3 principle,” explains Kolossa.
However, Holz does not believe there is cause for concern regarding the current potential for danger: “Our attack does not yet work via the air interface. In addition, speech recognition assistants are not currently used in safety-relevant areas, but are only for convenience.” The consequences of possible attacks are therefore manageable. “Nevertheless, we must continue to work on the protection mechanisms as the systems become more sophisticated and popular,” adds the IT security expert.
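The countermeasure mentioned in this section, calculating which parts of a signal are inaudible and removing them, can be sketched in the same simplified way. The example below is only an illustration under stated assumptions: it treats every frequency bin that lies more than a fixed margin below the loudest bin as "inaudible", which is far cruder than a real psychoacoustic model. The function name `strip_quiet_components` and the 40 dB margin are hypothetical.

```python
import numpy as np

def strip_quiet_components(signal, margin_db=40.0):
    """Zero every frequency bin whose magnitude is more than `margin_db`
    below the loudest bin, then transform back to the time domain.

    Assumption: "quiet relative to the loudest component" is used as a
    rough proxy for "inaudible to humans".
    """
    spec = np.fft.rfft(signal)
    mag = np.abs(spec)
    floor = mag.max() * 10.0 ** (-margin_db / 20.0)
    cleaned = np.where(mag >= floor, spec, 0.0)
    return np.fft.irfft(cleaned, n=len(signal))

# Usage: a loud tone with a very quiet added component.
sample_rate = 16_000                      # assumed sample rate in Hz
t = np.arange(1024) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
rng = np.random.default_rng(0)
quiet_addition = 1e-3 * rng.standard_normal(len(tone))

filtered = strip_quiet_components(tone + quiet_addition)
```

As Kolossa's remark suggests, a filter of this kind would only address commands hidden below the assumed threshold; it says nothing about other ways of concealing them.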
Original publication
Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, Dorothea Kolossa: Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding, 2018, online advance publication