Hi everyone, thanks for coming to our presentation. Our presentation is on targeted adversarial examples for black box audio systems, and we'd like to thank Geek Point for hosting us. A little bit about us: my name is Amog, I'm a student at UC Berkeley, I'm an officer in the Machine Learning at Berkeley organization, and I'm also a researcher at the UC Berkeley RISELab. Hey everyone, I'm Rohan. I'm also a student at UC Berkeley, I'm the VP of Education at Machine Learning at Berkeley, and I'm also a researcher at Berkeley AI Research, or BAIR.

So let's start off with a brief intro to what exactly an adversarial example is. In the most common sense, an adversarial example is an input given to a model such that the model misclassifies it, even though to a human the input looks unchanged. For example, in this case we have an image of a panda, and the model predicts with 58% confidence that it's a panda. That's the correct behavior, when the model gets it right. Now suppose we add some noise to it, as you can see on the right, multiplied by some really small magnitude like 0.007, such that when you add this noise to the image, humans can't tell the difference between the adversarial image and the normal image. You can see in the example that the panda looks exactly the same. But the model now thinks this image is a gibbon, with 99% confidence. This is an adversarial example, because the model misclassified the input even though to humans it still looks like the original image.

Going a bit deeper into definitions, there are two types of attacks: untargeted attacks and targeted attacks. In an untargeted attack, we perturb the input to trick the model into misclassifying it as anything else. In a targeted attack, we perturb the input so that the model classifies it as a predetermined target. Targeted attacks are more difficult than untargeted attacks, because we're trying to trick the model into predicting one specific, predetermined class. In addition, there are white box attacks and black box attacks. In a white box attack, we have complete knowledge of the internals of the model: we know its parameters and its architecture, and this allows for gradient computation. In a black box attack, we have no knowledge of the model or any of its parameters, except for the output logits of the model. This means that in a white box attack, since we have more information, we can craft a better and more efficient attack. A black box attack is more difficult, since we don't know any of the internals of the ML model.

So now many of you might be asking: why does this matter? Black box attacks are of particular interest for ASR systems, ASR being automatic speech recognition. If you look at a typical deep model for an ASR system, this is how it works. At the bottom, we have the raw audio wave file, the source. Typically, these models all use some sort of feature extraction, most commonly an MFCC conversion, which takes the audio file and converts it, via a Fourier transform, into the frequency domain. In the color map shown there, frequency is on the y-axis and time is on the x-axis.
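To make that feature-extraction step concrete, here is a minimal sketch in Python. The use of librosa, the file name, and the window/hop/coefficient parameters are illustrative assumptions; the exact MFCC pipeline in a production ASR system will differ.

```python
import numpy as np
import librosa

# Load a mono clip at the 16 kHz rate used by the Common Voice samples.
# "sample.wav" is a placeholder file name.
audio, sample_rate = librosa.load("sample.wav", sr=16000)

# Short-time Fourier transform: time-domain wave -> time/frequency map
# (frequency on the y-axis, time on the x-axis, like the color map shown).
spectrogram = np.abs(librosa.stft(audio, n_fft=512, hop_length=160))

# MFCCs summarize the spectrogram into a compact set of features;
# this (n_mfcc, n_frames) matrix is what gets passed into the model.
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=26)
print(spectrogram.shape, mfccs.shape)
```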
Now this spectrogram is what's passed into the model. The model uses some series of convolutional and recurrent layers to get a distribution over the output alphabet, and this alphabet is finally decoded into the final translated string. That's the typical workflow of the model. Looking at this model, if we want to create an adversarial audio file, what we want to do is change the input audio file such that we trick the model into translating what we want at the final decoder step. And if we can do this with a black box approach, we can apply it to proprietary systems such as the Google or IBM APIs, for which we don't know the model architecture or the parameters; with a black box approach, we can still craft adversarial examples that fool these systems.

There are a few classic adversarial attack methods. The first method proposed is basically to take the gradient iteratively. This is a white box targeted approach in which you take the gradient of the loss with respect to the input, apply that gradient to the input, and keep repeating again and again as your adversarial example gets better. The fast gradient sign method, which is probably one of the most popular attacks, is a very simple method: you take the gradient of the loss with respect to the input, and you just take the sign of that gradient. So if the gradient is negative at a particular place, you make it minus 1; if it's positive, you make it plus 1. Then you apply some very small perturbation; in the example before, we applied a magnitude of 0.007. It's just one gradient step, and you have an adversarial example. This is for crafting white box untargeted adversarial examples.
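Here is a minimal sketch of the fast gradient sign method just described, assuming TensorFlow 2.x and some differentiable classifier `model`; all names are illustrative, not the talk's actual code.

```python
import tensorflow as tf

def fgsm_perturb(model, x, y_true, epsilon=0.007):
    """One gradient-sign step: x' = x + epsilon * sign(dLoss/dx)."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)  # we need gradients w.r.t. the INPUT, not the weights
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, model(x))
    grad = tape.gradient(loss, x)       # gradient of the loss w.r.t. the input
    return x + epsilon * tf.sign(grad)  # +1 / -1 per element, tiny magnitude
```

One step of this moves every input element by exactly epsilon in the direction that increases the loss, which is why the perturbation stays visually imperceptible.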
Finally, there was a seminal paper, Houdini, that explored adversarial examples in the audio space. The authors were able to create white box and black box adversarial examples for audio systems. However, a key limitation was that they weren't able to backpropagate through the MFCC conversion layer. What this meant was that they had adversarial spectrograms, but not necessarily adversarial raw audio files. If you think of this in a real-world setting, we want raw audio files that can be played or given to an API; APIs won't accept spectrograms. So that work was severely limited by ending up with only adversarial spectrograms rather than raw adversarial audio files.

Okay, so now let's go over some prior work specifically in the audio realm. There are two key papers in this area. One, by UCLA, is a black box genetic algorithm attack on single word classes. It's a black box attack in which they crafted adversarial examples without knowledge of any of the model parameters, via a genetic algorithm, which we'll cover later. But this paper is only on single word classes, meaning they have a predefined set of classes of just single words, so they can use a softmax loss to try to trick the model into identifying a single word. The other paper is by Carlini and Wagner, also from UC Berkeley, and this is a white box attack: they had access to the model parameters and were able to do gradient computation. However, this wasn't on single word classes. They were actually able to generate phrases and sentences as adversarial audio examples.

The way they do this is via something called a CTC loss, which you can see at the bottom. A CTC loss allows for comparison with arbitrary-length translations. What this means, essentially, is that it's a way to score how well the audio translates to a final sequence, in this case our target: it takes in the final sequence and the logits of the model and gives you the probability that the model will predict our final sequence. Our project aims to combine parts of these two papers: a black box genetic algorithm approach, except with phrases and sentences as the targets, which we do via CTC loss.

So here's a quick diagram of our problem statement. We have some benign input, a raw audio sample from the dataset saying something like "the article is useless". We want to figure out a slight perturbation to the raw audio, some adversarial noise, such that when we combine the benign input and the adversarial noise, the ASR system translates the audio as something malicious, in this case "OK Google, browse to evil.com". As you can see, in a real world setting this could actually be harmful, because humans won't recognize the difference in the audio, but the ASR system will translate it as the malicious phrase.

To state it a little more formally, we're doing a black box targeted attack on audio systems. We have a target T, the output translation we want the model to produce; in our case it would be "OK Google, browse to evil.com". Given the benign input X and the model M, we want to perturb X with a very small delta to create X prime, such that when the model classifies X prime, M of X prime equals the given target T. We also want to do this under the constraint that we maximize the cross correlation between the two audio files. In the audio realm, cross correlation is a measure of the similarity between two audio files. So creating an adversarial example that matches the target while maximizing the correlation between the original benign input and the adversarial input is basically saying: we want the model's translation to be wrong, while the audio stays as similar to the original file as possible. And remember, since this is a black box attack, we only have access to the logits, the output distribution over the alphabet of M; we don't have any access to the gradients or the model parameters.

The model we target is DeepSpeech. It's an architecture created by Baidu and published in a research paper, and it was implemented in TensorFlow by Mozilla and is available on GitHub as open source code. So we ran our attack against this specific DeepSpeech model. The dataset we used is the Common Voice dataset, which consists of voice samples ranging from 3 to 7 seconds, sampled at a rate of 16 kilohertz. In the bottom right is a diagram of the DeepSpeech model. Similar to other ASR systems, it accepts an input spectrogram, then passes it through some convolutional layers and a stacked bidirectional LSTM to finally get the output distribution over the alphabet. Okay.
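Since CTC loss is the score everything else builds on, here is a minimal sketch of scoring a candidate's logits against a target transcription with TensorFlow's `tf.nn.ctc_loss`. The time-major logits layout and the blank-index convention are assumptions for illustration; they depend on the specific model's output format.

```python
import tensorflow as tf

def ctc_score(logits, target_ids):
    """Lower CTC loss => the logits are more likely to decode to the target,
    so we return the negated loss as a fitness score (higher = better)."""
    logits = tf.convert_to_tensor(logits)   # assumed [time, batch=1, alphabet]
    labels = tf.constant([target_ids])      # target phrase as alphabet indices
    loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=tf.constant([len(target_ids)]),
        logit_length=tf.constant([int(logits.shape[0])]),
        logits_time_major=True,
        blank_index=-1)                     # assumes blank is the last class
    return -float(loss[0])
```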
So this is our final algorithm: guided selection, a genetic algorithm approach. The idea is rooted in evolutionary theory. We start off with the benign input, and we first generate what's called a population by adding random noise; in our case, we generated a population of size 100 by adding random noise to the original input. Then it's an iterative process: in each iteration, we score every sample in the population, pick the best ones, and then use those high-scoring samples to create a new population. Over time, we keep getting better and better samples, to the point where we can actually fool the model. In our case, on each iteration we select the best 10 samples using a scoring function; our scoring function is the CTC loss, which we described earlier. Then we perform what's called crossover and momentum mutation, which we'll go over in the next slide. And finally, we apply a high-pass filter to the added noise. The reason for this is a heuristic: humans perceive noise at low frequencies more than at high frequencies, so by high-pass filtering the noise we can still trick the model while making the noise less recognizable to the human ear.

Here's a visual demonstration of what's going on in our algorithm. Initially, we start off with a population of about 100 audio samples, generated by adding random noise. Then we evaluate each with the CTC loss and calculate a fitness score. Out of these, we pick the elite population, which is the top 10 of the samples. Then, to get the next generation, we perform crossover. In evolutionary theory, the way you get from one generation to the next is you choose two parents from this elite sample, and then you perform a crossover. You can see we chose two samples, the red and the gold: at every index we randomly choose either the red or the gold attribute, combining them into this layered child of red and gold. So from the elite population we choose a couple of samples as parents, and from them we create a child, which gives the red and gold sequence.

Finally, the last step is that we want to add some variation to each generation, and we do that using mutation. This is a randomly applied mutation, and we have a special way of adding it, called momentum mutation, which we'll cover in the next slide. But the basic idea is: we start with this population, choose the best, use those parents to create children, and apply randomly added noise to create some randomness in the samples. Over time, the best traits of our samples carry on from generation to generation, while the traits which don't improve our score eventually die out.
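Here is a sketch of one generation of this loop. The population size (100) and elite size (10) follow the talk; `score` would be the negated CTC loss from the earlier sketch, and the mutation probability and noise scale are illustrative placeholders.

```python
import numpy as np

def next_generation(population, score, pop_size=100, n_elite=10,
                    mutation_prob=0.005, mutation_scale=100.0):
    # Score every candidate waveform and keep the 10 best ("elite") samples.
    scores = np.array([score(candidate) for candidate in population])
    elite = [population[i] for i in np.argsort(scores)[-n_elite:]]

    children = []
    while len(children) < pop_size:
        # Crossover: pick two elite parents and take each audio sample index
        # randomly from one parent or the other (the red/gold picture).
        i, j = np.random.choice(n_elite, 2, replace=False)
        p1, p2 = elite[i], elite[j]
        mask = np.random.rand(len(p1)) < 0.5
        child = np.where(mask, p1, p2)

        # Mutation: with small probability, add random noise at each index.
        mutate = np.random.rand(len(child)) < mutation_prob
        child = child + mutate * np.random.randn(len(child)) * mutation_scale
        children.append(child)
    return children
```

A high-pass filter on the added noise (omitted here) would be applied as the final step of each generation, per the heuristic described above.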
Okay, so now we'll go over momentum mutation, which is our unique way of adding mutation to our samples. In a standard genetic algorithm, the probability of a mutation is static from generation to generation. What we do instead is make the probability of mutation dynamic, a function of the difference in scores across generations. If there's little increase in score between generations, we add momentum by increasing the probability of mutation; if the score is improving well between generations, we bring the probability of mutation back down. The reason is that when we see very little increase in score across generations, we need to give the algorithm a little kick, and we do that by increasing the randomness. Doing so lets us move out of a local minimum and keep optimizing globally: if we're stuck at a local minimum, the increased probability of mutation and increased randomness help us get over the hump and continue our optimization.

What this does is encourage the decoding to build up to the target after first making the input similar to silence. As you can see in the blue box on the right, during training the decoding first starts off as the original benign transcription, and slowly pares down the letters until it's near silence. Then, through momentum mutation increasing the probability of mutation, the algorithm is finally able to build up to our target of "hello world".

This genetic algorithm approach works best when the search space is large: as mentioned, we just randomly add mutations and hope we get something good. However, when the adversarial example is near the target, only a few key perturbations are necessary to reach the target classification. In that case, genetic algorithms are not the best approach, because they'll randomly add noise in a bunch of places where we don't necessarily want noise; we only want noise in a few specific places. So at this stage, we switch to gradient estimation.

The way gradient estimation works is this. Suppose our sampling rate is 16 kHz, like we said; that means a one-second audio clip has 16,000 points of evaluation. For gradient estimation, we take a point, randomly perturb it up or down, pass the result through the model, and see what the difference in scores is. You can think of this as a finite-difference approximation of the derivative: we're perturbing each index up and down, seeing how much that changes the final loss, and using that as an estimate of the gradient. Now, as I said, a one-second audio clip has 16,000 possible indices, so this would be a lot of computation for just one gradient step. As a heuristic, to reduce the computation cost, we randomly choose 100 indices out of those points on each step, and keep iterating again and again.
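Here is a sketch of that coordinate-wise gradient estimation. The perturbation size, learning rate, and the assumption that the waveform is a float array are illustrative; `loss` stands in for a black box query that returns the CTC loss for a candidate.

```python
import numpy as np

def estimated_gradient_step(x, loss, n_indices=100, delta=100.0, lr=1.0):
    """Finite-difference gradient estimate over a random subset of indices."""
    indices = np.random.choice(len(x), n_indices, replace=False)
    grad = np.zeros(len(x), dtype=float)
    for i in indices:
        x_up, x_down = x.copy(), x.copy()
        x_up[i] += delta
        x_down[i] -= delta
        # Central difference: estimate of d(loss)/d(x[i]).
        grad[i] = (loss(x_up) - loss(x_down)) / (2 * delta)
    return x - lr * grad  # step downhill on the estimated gradient
```

Each step costs two model queries per index, which is why only 100 of the 16,000 indices are estimated at a time.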
Okay, so let's go over some results. We tested on the first 100 samples of the Common Voice dataset, and our target phrases are randomly generated two-word targets. Our attack similarity is 89.25%: this is the similarity between our target and the final decoded sequence after 3,000 generations, calculated using Levenshtein distance. And our average audio similarity score is 94.6%: this is between the original benign input and our perturbed adversarial input, computed via cross correlation, as described earlier.

Let's put these results into perspective by comparing to some baselines. As Amog mentioned earlier, we have two baselines we're explicitly comparing to: the Carlini and Wagner paper, which describes white box attacks with CTC loss, and the UCLA paper, which describes black box attacks with a single-word softmax loss. For attack success, a Levenshtein distance of zero means the output decodes exactly as the targeted phrase; the white box gradient-based approach achieves 100%, while the black box single-word softmax approach achieves 87%, compared to our 89.25% similarity. This may seem low, but considering, from the black box perspective, how complex it is to optimize over longer phrases rather than a single softmax class, our results are pretty competitive. And for average audio similarity, ours was 94.6%, versus 99.9% for the white box gradient-based approach and 89% for the black box single-word approach.

On the graph at the bottom, we have a histogram of the Levenshtein distances after running our attack for 3,000 generations. The distribution is concentrated very near zero: a large percentage sit at exactly zero, meaning they decode exactly as the target, and more are within just a couple of points of Levenshtein distance. There's an inherent tradeoff here between audio similarity and attack success rate: if we run the algorithm for longer, our average similarity score will drop as we add more random noise and more mutations, but we'd get a higher attack success rate. So in a real world setting, an adversary trying to crack a system could toggle between these two metrics, see which setting works better for them, and adjust accordingly.

Now let's look at an example. We have here the original input file, which decodes to "and you know it". Let's see if we can play it. Okay, we're having some difficulties playing it, but what you would hear is the original input file, which gets decoded as "and you know it". The adversarial example is our perturbed audio, which to a human sounds very similar to the original; it still sounds like "and you know it", but it gets decoded as "hello world". As mentioned before, the audio similarity is 94.6% using the cross correlation metric. On the bottom right you can see a spectrogram of the original audio input and our perturbed example: the perturbed example is blue, the original is orange, and you can see the very slight perturbations.
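For reference, here is one way the cross correlation similarity quoted above could be computed. The exact normalization used in the paper may differ; this peak-of-normalized-correlation form is a common choice and is shown only as a sketch.

```python
import numpy as np

def cross_correlation_similarity(original, adversarial):
    """Peak normalized cross-correlation: identical audio scores 1.0."""
    original = original / np.linalg.norm(original)
    adversarial = adversarial / np.linalg.norm(adversarial)
    # Correlate over all alignments and take the peak value.
    return float(np.max(np.correlate(original, adversarial, mode="full")))
```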
So there's a bunch of future work to be done in this area. One of the most important directions is attacking a broader range of models. As we mentioned, we only attacked the DeepSpeech model implemented by Mozilla, since that was the only open source deep ASR implementation we could find. It would be interesting to see whether this attack transfers across models: not only transfer in the sense of optimizing over DeepSpeech and then deploying against the IBM or Google APIs, but also whether we can optimize against those APIs directly.

Next is over-the-air attacks. In our problem statement, you might have seen that we pictured the adversarial audio being played in the same physical space as the target device, to fool something like a Google Home. In our experiments, however, we pass the raw audio source files directly, digitally, to the model. So we haven't tested over-the-air attacks, that is, how robust these adversarial examples would be to the distortions of actually being played over the air.

And finally, increasing the sample efficiency of the attack. Currently, we train for 3,000 iterations to get a successful adversarial decoding. With 3,000 iterations and a population of 100, that's 300,000 model evaluations for just one audio file. Against the Google and IBM APIs, which charge per query, this cost can be prohibitive. So increasing the sample efficiency of this attack would let us see what we can achieve under real-world, physical limitations.

Okay, thank you for listening to our presentation. If you'd like to read our paper, it's available as a pre-print on arXiv, and the code and samples are available in the black-box-audio repo on Rohan's GitHub account. So yeah, we can take any questions if anyone has any. [Inaudible audience question.] ...And then once we apply that, we use those samples as the next generation of the algorithm. Any more questions? Okay, thank you. Thanks.