Hello everyone, my name is Moustafa Alzantot, I am a PhD candidate at UCLA. Hi, my name is Yash Sharma, I'm a visitor at Cornell. Our talk today, about practical adversarial attacks in challenging environments, is joint work with my advisor, Professor Mani Srivastava from UCLA, Supriyo Chakraborty from IBM Research, Ananthram Swami from ARL, and our other collaborators. Of course, everyone here in the audience knows that artificial intelligence is playing a big role in our lives today. It is used in different applications like self-driving cars, e-health, automated surveillance, and fraud detection. And as you notice, many of these applications are security-critical and safety-critical applications, so we need accurate artificial intelligence models. Most of these models rely on what's called machine learning, which starts with a set of training data. You then apply a training algorithm, or learning algorithm, on this training data to learn a model. This model captures the patterns that exist in the training data and then generalizes to unseen data at inference time. The learned model is often deployed as a service, where outside users can feed in new test data and the model produces an output prediction. But if there is an attacker or a malicious user, they can manipulate the input data in order to force the model to produce some desired output. Here is an example. Of course, we know that humans are good at recognizing objects. It's easy for a human to tell that the first image is a panda and the second one is a school bus; even a small kid can do this task and do it very well. But humans are no longer the only ones good at object recognition. Machines are also getting better, so a modern computer vision algorithm can also tell that the first image is a panda and the second one is a school bus, and it can do this with very high accuracy. By around 2015, the best machine learning models for computer vision on the ImageNet dataset could match human accuracy on the same dataset, so they can achieve superhuman accuracy. But does this mean we are now at the point where we can rely on machine learning for all our tasks, given that many of these tasks are safety- and security-critical applications and machine learning can do them even more accurately than a human? Let's take this example. Here is the same image we showed before, which both a human and a machine can recognize correctly. A malicious user or attacker can add some small, cleverly computed perturbation noise to this image, with the following result: it still looks to me like a school bus, so a human who looks at the image on the right will say, okay, this is still a school bus. But the machine learning model that previously recognized the original image correctly, with very high confidence, will now be fooled: it will say this is an ostrich. And that's a serious threat, because the attacker can even control what the misclassification label output by the model will be, so they can force the model to produce whatever output they choose. This idea is known as adversarial attacks, or adversarial examples.
It was first described in 2014 by Szegedy et al., who noticed, in the paper called "Intriguing Properties of Neural Networks", that small perturbations of the input can lead to significant changes in the output. For example, when they add this small noise to the school bus image, it is now predicted as an ostrich instead of a school bus. In the same year, Ian Goodfellow started studying this phenomenon further in his paper called "Explaining and Harnessing Adversarial Examples", and he came up with a method called FGSM, the Fast Gradient Sign Method, which is an efficient way to compute the adversarial noise, the small perturbation we can add to the original image to cause the model to misclassify it. The idea is very simple: given an input image, you compute the loss of the model on this input, then you compute the gradient of the loss with respect to the input, take its sign, scale it down so the perturbation stays small, and add it to the original image. The intuition is that the loss quantifies the error produced by the network, and the gradient of the loss is the direction that increases this error the most. During training, you update the weights in the opposite direction of the gradient of the loss to minimize the error; but if you want to make the model misclassify, you update the input in the direction of the gradient of the loss. By applying this, instead of correctly predicting the left image as a panda, the model now says "gibbon" with more than 99% confidence: it is not only wrong, it is wrong with very high confidence. That was four years ago, and after four years of research there have been a lot of developments in this area. Many researchers are studying this phenomenon, and there have been many other methods for generating adversarial examples, for example Projected Gradient Descent (PGD), DeepFool, the Carlini and Wagner attack, Houdini, and many others. But all of these share a common thing: they rely on the same idea introduced by Goodfellow in 2014, which is computing the loss of the network, finding its gradient, and adding a perturbation in the direction that maximizes the loss. And researchers have not only been studying attacks, they have also been studying defenses. There is a software library called the IBM Adversarial Robustness Toolbox, and another called CleverHans, which is like a benchmark of attack algorithms implemented for researchers to use. Competitions have been organized around this problem, like the NIPS 2017 competition and the competition held yesterday at this conference. One of the interesting developments in this area is the adversarial patch work, where researchers found that they can print a physical patch such that, if you place it next to any object, a machine learning model looking at an image of that object will mispredict it. For example, if you take a photo of a banana and feed it to the model, it recognizes a banana; but if you place this small printed sticker next to the banana and take another photo, the model says it is a toaster with 99% confidence.
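To make the FGSM idea concrete, here is a minimal sketch in PyTorch. This is an illustration under simple assumptions, not the paper's original code: `model` is any classifier returning logits, `x` a batch of images in [0, 1], `label` the true labels, and the epsilon value in the usage comment is just a typical small budget.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, label, epsilon):
    """Fast Gradient Sign Method: one step in the input direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)      # the loss quantifies the model's error
    loss.backward()                              # gradient of the loss w.r.t. the input
    x_adv = x + epsilon * x.grad.sign()          # small step that maximizes the loss
    return x_adv.clamp(0.0, 1.0).detach()        # keep pixel values in a valid range

# e.g. x_adv = fgsm(model, image_batch, true_labels, epsilon=0.007)
```

For a targeted attack, one would instead step in the direction that decreases the loss of the desired target label; the talk returns to this distinction later.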
And this is a physical-world patch: it is not a modification of the image file, it is something you print and place next to the banana in the physical world. So, does this mean we are done with this problem, that everything is solved? We believe the answer is no, because there are a lot of remaining open challenges. The purpose of our talk today is to introduce a few of these open challenges (not all of them, since many open problems exist) and the research work we have done on them. The first is how to attack machine learning models when you have only limited access to, and limited knowledge about, the model. The second is how to attack machine learning models that handle not images but other modalities, such as natural language understanding models. And the third is how to achieve physical-world attacks against speech recognition models. To start with the first one, let me recap how adversarial examples are computed. Remember what we said: we need to compute a small perturbation that maximizes the error. If we denote this small perturbation as r, the attacker's objective is to have f(x + r), the prediction of the model, equal to a desired target label L, while the image after adding the noise still looks like the original image. Most of the work on this relies on gradient computation: you compute the gradient of the loss with respect to the input, and that gives you the perturbation to add to the original image. But computing this gradient is only possible if you know the model architecture and the model weights and parameters. So this approach only succeeds in a white-box setting, where you have complete knowledge of the model and its internals. Black-box attacks are more practical, because in practice you will want to attack a model that you only have access to use: you don't have access to its internals, and you don't know its structure or its weights.
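Written out compactly, the attacker's objective described above is the following (the L-infinity bound on the perturbation is the constraint our attacks use, as also mentioned in the questions at the end of the talk):

\[ \text{find } r \quad \text{such that} \quad f(x + r) = L \quad \text{and} \quad \lVert r \rVert_\infty \le \epsilon , \]

where x is the original input, r the perturbation, f the victim model's prediction, L the attacker's target label, and epsilon the small budget that keeps the perturbed input looking like the original.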
So there has been research work on this, including Papernot et al. in 2016, who use the results of query outputs from the victim model to train their own substitute model, attack that, and hope the attack transfers to the victim model. There is other work by Chen et al. in 2017, the ZOO attack, which is one of the state-of-the-art black-box attacks: instead of computing the exact gradient from knowledge of the model internals, they estimate an approximate gradient using only black-box queries to the model. They do a finite-difference estimation: they change every pixel in the image a little bit and estimate how much this change affects the target class probability. The problem with this method is that it is not efficient; it requires a huge number of queries to the victim model. But if you have only limited access to the model, you want to perform the attack with a small number of queries, also to prevent the model owner from noticing that something bad is happening. That is why we introduced our work, called GenAttack. GenAttack is a black-box attack against machine learning models: it can be carried out without knowing the internal model architecture or parameters, and it assumes the attacker only has the capability to query the model as a black-box function. You feed in an input and you get back the prediction scores from the model, without knowing anything about what happens inside. The idea is the following: instead of relying on gradient computation, which requires knowledge of the model internals, or on approximate gradient estimation, which requires a huge number of queries, we use gradient-free optimization. We don't even need to estimate or approximate the gradient, and with this we can compute our adversarial examples. Now Yash is going to tell you more details about how it works.

Okay, thank you Mustafa. So how does our GenAttack work? It is basically an implementation of genetic algorithms. How do genetic algorithms work? First you initialize the population: you make a bunch of random perturbations of the input, and that becomes your population. You then want to see how good each member is, and the way you evaluate fitness in this case is to look at the target prediction probability: how good are they at getting the model to misclassify, how good are they at being adversarial? If an example is already adversarial, you are done and you output it; if not, you need to start iterating, you need to start optimizing. Because you don't have access to the gradient, you don't know exactly which direction to go in, so genetic algorithms try to search the space as efficiently as possible by random exploration, trying to cover all the plausible places. So you initialize the population with these random perturbations, you evaluate the fitness, some are good and some aren't, and if you're done, you're done. If not, you select a few members as the parents (which is why they are called genetic algorithms), and the selection is done with probability proportional to the fitness score, because you would rather select parents that are good. Crossover is just a random combination of these parents, which produces children. To answer the question from the audience: the perturbations here are just vectors, and the crossover is a random combination of the two parent vectors; a minimal sketch of the whole loop, including the mutation step described next, is shown below.
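Here is a minimal sketch of that loop in Python. It is an illustration, not the exact GenAttack implementation: `predict` stands for the black-box victim model returning class probabilities, and the population size, mutation rate, and L-infinity budget are placeholder values.

```python
import numpy as np

def gen_attack(predict, x, target, eps=0.05, pop_size=6, mutation_prob=0.05, max_steps=10000):
    """Gradient-free targeted attack with a genetic algorithm (illustrative sketch)."""
    # 1. Initialization: a population of random perturbations within the L-infinity budget.
    pop = np.random.uniform(-eps, eps, size=(pop_size,) + x.shape)
    for _ in range(max_steps):
        # 2. Fitness: the probability the black-box model assigns to the target label.
        fitness = np.array([predict(np.clip(x + p, 0.0, 1.0))[target] for p in pop])
        best = pop[fitness.argmax()]
        if predict(np.clip(x + best, 0.0, 1.0)).argmax() == target:
            return np.clip(x + best, 0.0, 1.0)         # already adversarial: done
        # 3. Selection: parents drawn with probability proportional to fitness.
        probs = (fitness + 1e-12) / (fitness + 1e-12).sum()
        idx = np.random.choice(pop_size, size=(pop_size, 2), p=probs)
        parents = pop[idx]                             # shape: (pop_size, 2) + x.shape
        # 4. Crossover: randomly combine each pair of parent perturbation vectors.
        mask = np.random.rand(*pop.shape) < 0.5
        pop = np.where(mask, parents[:, 0], parents[:, 1])
        # 5. Mutation: with small probability, add random noise, then re-project to the budget.
        mutate = np.random.rand(*pop.shape) < mutation_prob
        pop = np.clip(pop + mutate * np.random.uniform(-eps, eps, pop.shape), -eps, eps)
    return None  # query budget exhausted without success
```

Each call to `predict` is one query to the victim model, which is exactly the budget the results below compare against ZOO.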
Then, with a small mutation probability, we apply a mutation, which is just a random change, and in this way we try to explore the space very well. Now we present our results. We were expecting this attack to be reasonably good, probably more query-efficient than the ZOO attack, the current state of the art, because estimating the gradient is so costly; but genetic algorithms are known to have high computational complexity, so we weren't expecting that much of an improvement. However, we were very surprised: it was a tremendous improvement. For example, here are the CIFAR-10 examples. On the left you can see the original label and on the top the predicted label; these clearly look like the original examples, yet each one is classified as the corresponding target class, so we were able to succeed against all of the classes. We were also able to attack ImageNet models, which handle the more realistic, larger-scale, higher-resolution images. For example, this example is a bird, a Cacatua galerita, and we were able to get it classified as a trolleybus while it looks basically the same.

In terms of evaluation, we found a big difference. The current state of the art is ZOO, which estimates the gradient. For MNIST and CIFAR-10, both ZOO and GenAttack do pretty well. MNIST consists of small-scale black-and-white images of digits, and CIFAR-10 of small-scale color images; ImageNet images are larger-scale and higher-resolution, so the feature space is much bigger and harder to attack. Interestingly, because ZOO is so computationally inefficient and takes so many queries to estimate the gradient, it often fails in the targeted case on ImageNet. Targeted means that instead of just getting the model to misclassify (which is very easy on ImageNet because there are so many classes: you could get it to misclassify an arctic fox as a white fox and that would count as an adversarial example), you want the image classified as a specific label, say a trolleybus. ZOO failed in this targeted case relatively frequently, while GenAttack didn't, basically because GenAttack remains tractable on ImageNet. And this really shows up in the query efficiency. We measured the mean number of queries needed to compute a successful adversarial example on MNIST, CIFAR-10, and ImageNet. ZOO runs an iterative optimization and aborts as soon as it succeeds, so we expected that on MNIST and CIFAR-10 it would need far fewer queries; however, it was basically the same, and about 500,000 more queries on ImageNet, which is a lot, though relative to how inefficient ZOO is, not that much. For GenAttack, on the smaller-scale images, because they are easier and the feature space is smaller, we succeed with a drastically smaller number of queries, about 2,000 times more query-efficient; and on ImageNet, which has a drastically larger feature space, we are 27 times more query-efficient, which is very good, although there is more improvement to be made simply because it is a much harder case. So now we are going to attack defenses. There have been a lot of defenses proposed, but the one defense that has more or less stood the test of time is adversarial training.
Adversarial training is very natural: we are weak against adversarial examples, so let's train with them, and maybe we will become robust to them. There are a few caveats here. You can't just take a bunch of adversarial examples, train with them, and expect to be extremely robust. Madry's group at MIT noticed that the way to make this work is to increase the model capacity. Assuming the adversarial examples lie outside the data distribution the model has seen, which is plausible given that the model misclassifies them, you need more capacity to handle this larger data distribution. For example, on MNIST, which is a relatively easy dataset, they had to increase their model capacity by about 10 times to be robust even to small-perturbation attackers. Also, instead of using FGSM, which is a single-step attack in the direction of the gradient, you should use iterative attacks: instead of making one big linear jump, you make many small linear jumps, which approximates a curved, non-linear path.

There are a few ways to do this. The standard way is exactly what you would think: while you are training, you supplement the data with adversarial examples and keep going. The problem is that you are then generating adversarial examples against a weak model, because the model is still training, whereas you want the model to be robust to strong adversarial examples; the adversarial example generation process can also learn a degenerate solution in which the generated adversarial examples are weak, because the only optimization objective is the loss. Ensemble adversarial training separates these two things: you generate adversarial examples against an ensemble of models that have already been trained, and you use those to train your original model. They found much better results, and this was validated in the NIPS competition we mentioned before, where almost all the defenses were based on this approach. So we attacked this model, considered the most robust defense, and were able to succeed; it only took a few more queries than attacking a clean model. I think it was about 100,000 queries for a clean Inception-v3 model and about 200,000 for the ensemble adversarial training defense: a little longer, but far more tractable than ZOO. As you can see, that is a sea anemone, and it is now classified as a parking meter.

Next we tackled the work of Anish Athalye, also at MIT, who introduced the problem of obfuscated gradients; this was awarded the ICML 2018 best paper, showing how big a problem it is in the community. He looked at a bunch of defenses published at ICLR 2018, took their GitHub code, attacked them, and found that these defenses aren't really more robust: they rely on a phenomenon that only makes them appear robust. Remember that before GenAttack, basically all attacks were gradient-based; so instead of becoming actually robust to perturbations, these defenses just mess up the gradients so that gradient-based attackers fail. They identified three instances of this: shattered gradients, where the defense renders the gradient non-existent or incorrect; stochastic gradients, where the defense is randomized, so the gradient direction becomes very noisy and hard to optimize; and exploding or vanishing gradients, where the gradients blow up or vanish.
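As a tiny illustration of a shattered gradient, here is roughly what a bit-depth-reduction preprocessor (one of the defenses attacked below) looks like. This is a minimal sketch in PyTorch under simple assumptions, not any particular published implementation: the rounding step has zero gradient almost everywhere, so a gradient-based attacker gets no useful signal even though the classifier behind it is unchanged.

```python
import torch

def bit_depth_reduce(x, bits=3):
    """Quantize pixels in [0, 1] to 2**bits levels; torch.round propagates zero gradient,
    so gradients through this preprocessing step are shattered."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

# defended_logits = model(bit_depth_reduce(x))
# Backpropagating to x through the rounding gives (near-)zero input gradients, so
# gradient-based attacks stall, while a query-only attack like GenAttack is unaffected.
```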
All of these just harm attacks that rely on gradient-based optimization. Athalye had a bunch of methods to circumvent them. BPDA says: analyze the model, find the non-differentiable component, replace it with something approximately equal but differentiable, and now the attack succeeds. EOT handles stochastic gradients: instead of optimizing over just one random sample of the defense, you keep sampling from the random transformation and optimize through it, so you become robust to all the random perturbations the defense introduces. However, this only works in the white-box case: you need to know which component is non-differentiable to use BPDA, and you need to know that a defense is randomized to invest the computational budget to optimize through the transformation. As we discussed, GenAttack is gradient-free; it doesn't estimate the gradient, it doesn't need the gradient, so it should work against these defenses, and we found that to be true. We attacked a bunch of the standard non-differentiable input transformations. In bit-depth reduction, the images are represented using a limited number of bits, and you cut off the last few bits; this is non-differentiable, so gradient-based attacks fail, but we were able to succeed. For JPEG compression, you just save the image as JPEG, which compresses it; this is a non-differentiable transformation, and our attack succeeded where ZOO failed. TVM, total variance minimization, is a very costly defense: first, it is randomized, because it applies random dropout to the pixels and then minimizes the total variation across the example; and second, it is very expensive, because it solves an optimization problem every time it performs inference. A little tidbit on that: running ZOO on a hundred images already takes five days on a Titan X GPU, which is relatively expensive, so when inference itself is literally an optimization problem, ZOO becomes completely intractable. Our algorithm wasn't, and we were able to succeed about 70% of the time. So we targeted basically the number one defense, adversarial training, which has stood the test of time, and obfuscated gradients, which almost all current defenses rely upon.

Now that we have discussed attacking models with limited access, we are going to discuss attacking natural language models. Why can't we just do the same thing we do with images in the natural language case? Well, natural language is different. First, words in text are discrete, unlike image pixels, which are continuous. You might say: but word2vec word embeddings map words to vector representations, so why not take the gradient, move in its direction, and pick a word replacement that lies in that direction? The difference is that with pixels and images you can make a worst-case perturbation and keep it imperceptible up to a fairly large scale parameter, while with natural language, if you change even a single word to whatever maximizes the target label, it will probably have nothing to do with the sentence, so it will be easily detected and it will completely ruin the semantics of the sentence. You can't really do that.
Secondly, the word change needs to satisfy the language's grammar constraints: if you change a noun to a verb, even if it is related to the context of the sentence, the sentence will make no sense and the change will be easily detected. These difficulties are why it is very hard to apply gradient-based optimization here. So how do we use genetic algorithms? It is the same algorithm as before, and to our knowledge it is the only algorithm in the literature able to make adversarial changes that produce examples which are semantically and syntactically similar, and that is also black-box, which makes it practical. We did exactly the same thing as before (the fitness is the target label prediction probability, and so on), except that the mutation is different: instead of just making a random change, we now want the word replacement to be semantically and syntactically similar. How do we do that? First, we compute the nearest neighbors of the selected word in the embedding space; these are supposed to be similar. However, we used a counter-fitted embedding, because in an ordinary embedding space "good" and "bad" are very close: these word embedding models are trained on co-occurrence, and "good" and "bad" appear in similar contexts, yet replacing "good" with "bad" completely ruins the semantics of the sentence. The counter-fitted embedding takes these word embeddings and injects synonym and antonym constraints, so that the only words close to a given word are ones that appear in the same context and are synonyms; the antonyms are pushed far away. After that we have our nearest neighbors, but we don't know whether they fit the context or are the right part of speech. So we then take a language model, which is used to predict the next word in a sentence: if a candidate word doesn't fit that context, its probability will be low. We use the language model to score the probability of each candidate given the previous words in the sentence and filter out the words that don't fit the context. Now we have words that are semantically and syntactically similar, and at that point we perform the replacement that maximizes the target probability. So we have these synonym, semantic, and syntactic constraints, and we choose the replacement based on the target probability (a minimal sketch of this mutation step is shown after the example below).

And we actually found very good results. For example, this is the IMDB dataset, the classic dataset for sentiment analysis, and as you can see, this review is very negative; the original prediction was negative with 78% confidence. All the algorithm did was change "terrible" to "horrific", which is a synonym, "considered" to "regarded", another synonym, and "kids" to "youngsters", yet another synonym, and the review is now classified as positive. That is a real vulnerability, and it is a true adversarial example because it retains the semantics and syntax.
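Here is a minimal sketch of that mutation step. Everything named here is a placeholder rather than our released code: `neighbors` stands for a nearest-neighbor lookup in the counter-fitted embedding, `lm_score` for a language model's probability of a word given its left context, and `predict` for the black-box classifier returning class probabilities; the candidate counts are arbitrary.

```python
def mutate_word(sentence, idx, neighbors, lm_score, predict, target, k=8, keep=4):
    """Replace sentence[idx] by a plausible synonym that most increases the target
    label's probability (illustrative sketch; all callables are placeholders)."""
    word = sentence[idx]
    # 1. Candidate replacements: nearest neighbors of the word in a counter-fitted
    #    embedding, so antonyms like good/bad are already far apart.
    candidates = neighbors(word, k)
    # 2. Keep only the few candidates the language model finds most plausible
    #    given the words before this position.
    context = sentence[:idx]
    candidates = sorted(candidates, key=lambda w: lm_score(context, w), reverse=True)[:keep]
    if not candidates:
        return sentence
    # 3. Among the survivors, pick the one that maximizes the target label's probability.
    def target_prob(w):
        return predict(sentence[:idx] + [w] + sentence[idx + 1:])[target]
    best = max(candidates, key=target_prob)
    return sentence[:idx] + [best] + sentence[idx + 1:]
```

In the full attack, this word-level mutation simply replaces the random pixel mutation in the genetic loop sketched earlier; the population, fitness, selection, and crossover steps stay the same.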
Next, we attacked textual entailment. In textual entailment you have a premise, some situation you could say, and a hypothesis, and the model predicts whether the hypothesis follows from the premise, contradicts it, or is neutral. For example, in the first original prediction the premise is "A runner wearing purple strives for the finish line" and the hypothesis is "The runner wants to head to the finish line", which is entailment, and the model gets it right. In the adversarial example, we just replaced "runner" with "racer", a synonym, so it should still be entailment; however, the model now predicts contradiction, which is definitely a problem. The second example is originally a contradiction: the premise is "A man and a woman stand in front of a Christmas tree contemplating a single thought" and the hypothesis is "Two people talk loudly in front of a cactus", which does not follow from the premise, so it is a contradiction. However, if you change "people talk" to "humans chitchat", the model now says it is entailment, which is clearly wrong. Now I will pass it back to Mustafa to talk about the open problems in speech recognition.

Thanks, Yash. Another modality that machine learning models typically handle nowadays is speech. Many of us have smart speakers at home that listen to voice commands and act on them, for example Siri and Google Home, and you can ask them to do online transactions for you. The equivalent of an adversarial attack in the speech recognition world is the following: you want to produce some audio such that a human who listens to it will say the word is "yes", while the machine learning model will say "no", even though it still sounds like "yes" to the human. We did this work, published at the NIPS 2017 Machine Deception workshop; it is a black-box attack on a speech recognition model using, again, our genetic algorithm optimization approach. There was also work by Carlini from UC Berkeley, a white-box attack against speech recognition models. Our victim model was trained on the Speech Commands dataset, which recognizes single command words like "yes", "no", "up", "down", and so on; Carlini's work is white-box, but it can attack a model that recognizes whole sentences. We found we can achieve a very high success rate: for example, words in the dataset can be made to be misclassified as "no" with a 96% attack success rate, and can be targeted to be misclassified as "up" with an 80% success rate, under the same limit on the perturbation added to the input audio waveform. We also quantified the human perception of the words after adding the adversarial noise: on average, about 90% of the human listeners recruited for this study still recognized the word as the original source label, even though the model misclassifies it as the target label; 9% said the word was a non-word, a corrupted word, for them; and only 1% recognized it as the target word. But the remaining challenge in this problem is that these adversarial attacks against speech recognition models, ours included, assume direct access to the model, so you can feed your audio directly into the model as input. There is no equivalent yet of physical-world attacks against speech recognition models. For images there are physical-world attacks, like the adversarial patch we showed before, or the work by Alexey Kurakin and colleagues showing that a picture taken of an adversarially perturbed printed image can remain adversarial. But for speech recognition, if the adversarial audio is played from a speaker
and recorded on the other side by a microphone and then fed into the model, it will no longer remain adversarial. So that remains an open challenge: how to develop robust, over-the-air adversarial audio attacks. As a reminder, there are still many, many open challenges; we have just highlighted a few of them and the work we have done on each, and we will be happy to take your questions. Yes, exactly, we have a constraint on the L-infinity norm. We developed our own implementation, and actually most of our work is available as open source. Any other questions? Thank you.