Hey everyone, I'm Raphael, and I'm going to be talking about StuxNNet, a project I worked on with my partner Brian Kim. Our code is on GitHub, and I've also posted instructions for the demos I'm going to show you today. If you want, you can go to my Twitter handle, shown here, find the link, navigate there, and follow along with exactly what I'm doing.

So let's get started. If we look at the different kinds of attacks people launch on software, a lot of the time an attack on a system is about getting the ability to run code on a victim machine. There are all kinds of ways people do this, and people talk about it all over DEF CON and in this village. We, however, are concerned with AI systems in particular. For AI systems, we know about adversarial examples; we know about things like model inversion, where you're looking at privacy issues; we know about trojan attacks on neural networks; and many more. So our question was: is there anything interesting about the case where an attacker has an exploit, a backdoor, or any other way to run code on a machine learning system? Is there anything novel about that?

When we started thinking about this, the first thing we looked at was how the logic is actually encoded. Traditional software has explicit coding: the logic is written into assembly, which becomes machine code. With neural networks, or any other kind of machine learning model (we'll focus on neural networks for this talk), it's a little different. The logic is encoded in trained parameters, which are combined with the inputs to produce an output. So it isn't as straightforward: you can't just go through and reverse engineer what somebody is doing, even if you've seen everything your attacker has done. This gives neural networks some distinctive features, one of which is that they're black boxes and you can't exactly tell what the model is doing. But it's also interesting from an attack perspective, because unlike attacks where you have to get your own code running on the CPU, here you can just change some data and you should see some interesting behavior.

So let's take a look at that. I'm going to do a little demo here; apologies if the videos are awkward to drive, and sorry, let me get the slideshow sorted. What you're looking at are two identical neural networks. This code is all on GitHub, no need to read it too carefully, but what it's doing is predicting the XOR function. One version is written for a toy neural network framework that we wrote in C++; the other is in TensorFlow, and you're going to see an attack on both of them. First, the toy framework. You'll see that it initially predicts correctly; all I'm doing in the second window is finding the PID. For the four inputs the outputs are 0, 1, 1, 0, all correct. Then suddenly things change: where you saw 0, 1, 1, 0, you now get 0, 1, 0, 0. What I've done is gone into the memory of the process and zeroed out one of the weights. Now again, this is a toy example.
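For reference, the demo networks are tiny. The exact code is in the repo, but a comparable XOR model in TensorFlow might look something like this (a sketch, not the repo's code; the layer sizes and training settings are my own):

```python
import numpy as np
import tensorflow as tf

# The four XOR input/output pairs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 1, 0], dtype=np.float32)

# A small feed-forward network with enough capacity to learn XOR reliably.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.05),
              loss="binary_crossentropy")
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(X).round())  # expect roughly [[0], [1], [1], [0]]
```

The point of using such a small model is that a single zeroed weight visibly corrupts the truth table, which makes the memory patch easy to demonstrate.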
This is a framework that I wrote and that no one uses, so let's take a look at something a little more realistic. Here is the TensorFlow version: the same exact network, the same predictions. Again I go through, find the PID, run the malware, and you see the same exact result. Note that this is the same exact file; I'm running the same malware binary against both networks, which is kind of interesting as well. So, boom, you get 0, 1, 0, 0. This is exactly what we were looking for.

Let's take a second and look at how I did that; let me get the presentation back up. The first thing you need to do is access the address space of the victim process. There are a number of different ways to do this, and my code is up on GitHub, so we won't get into that too much. What's more interesting is how you actually figure out how to patch the network. In the neural network framework we wrote, we use JSON to encode the checkpoint file, which is basically how the network is stored, and the weight we attacked was the -1 up there. If you look at the Python code behind it, what we're doing is computing the binary representation of that weight. Below that is a memory dump from OllyDbg, and you can see highlighted, if the projector is clear enough, that we've actually found that weight in memory. As you go down, you see that we've zeroed it out, and looking at the output of the network: correct, correct, correct, and then boom, that one output flips, so the patch has been properly applied.

This is an interesting attack in and of itself. Say there was a buffer overflow in a self-driving car's steering-angle network and you zeroed out a whole layer: all of a sudden the cars turn in some sudden direction, all the cars across the world at once, or something like that. That would be pretty bad, and that's a pretty serious thing. So this attack on its own should be cause for concern, since it's so easy to do on TensorFlow.

Now let's talk a little about the steps necessary to actually launch such an attack. You need to reverse engineer, to some extent, how the system you're attacking works, and the way you do that is by figuring out how to get the weights. If you're attacking, say, a self-driving car or a malware detector on a computer, you would probably take the machine apart, find the hard drive, extract whatever you can from it, go through the file system, and find something that looks like a weights file. You would also reverse engineer the architecture of the network to figure out exactly what it was using, which makes it easier to find the weights you're looking for. That step is actually critical for the more serious attack I'll show later.
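To make the find-and-zero step concrete, here is a minimal sketch of the idea on Linux (not the repo's exact code: the PID and weight value are placeholders, it needs ptrace permission, e.g. root, and a lone float32 pattern can false-positive, which is why real tooling matches longer runs of weights):

```python
import re
import struct

def patch_weight(pid, old_value, new_value=0.0):
    """Scan a victim process's writable memory for the float32 byte
    pattern of `old_value` and overwrite it in place."""
    needle = struct.pack("<f", old_value)        # little-endian float32
    patch = struct.pack("<f", new_value)
    with open(f"/proc/{pid}/maps") as maps, \
         open(f"/proc/{pid}/mem", "r+b", 0) as mem:
        for line in maps:
            m = re.match(r"([0-9a-f]+)-([0-9a-f]+) (\S+)", line)
            start, end, perms = int(m[1], 16), int(m[2], 16), m[3]
            if "rw" not in perms:                # only writable regions
                continue
            try:
                mem.seek(start)
                region = mem.read(end - start)
            except OSError:
                continue                         # unreadable region, skip
            pos = region.find(needle)
            while pos != -1:
                mem.seek(start + pos)
                mem.write(patch)                 # zero the weight live
                pos = region.find(needle, pos + 1)

patch_weight(12345, -1.0)  # hypothetical PID; zero out the -1 weight
```

A scan like this works against the C++ toy framework and TensorFlow alike, which is why one malware binary handled both demos: it only cares about the byte pattern of the weight, not the framework around it.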
In terms of making that process easier, you could write different kinds of signatures for weights files: binwalk signatures, say, if you're carving firmware, or Volatility ones if you want to go looking through memory dumps or VM snapshots or what have you. To some extent you will need to reverse engineer the architecture for the more serious attack I'm going to show, but since you're attacking commodity systems, self-driving cars, computers, even something like a plane, a serious actor should be able to go in there and figure out what's going on.

Now let's switch gears a little and talk about a different kind of attack. We've talked a bit about poisoning in this village before. Basically, you have a trojan trigger, which is a set of input characteristics on which you want the neural network to misbehave. One example, which I'll show later, is a particular combination of the number of images and JavaScript objects in a PDF, for a malicious-PDF classifier. Another is specific pixels in an MNIST or CIFAR-10 image: the dots on the MNIST digit, or the little "T" in the top-left corner of the CIFAR-10 image. Once you've defined the trigger, you map all inputs carrying it to a particular class that the network will output. For instance, with those dots, we map every digit carrying them to a 4. And once you've defined that behavior, you trojan examples seen in training and continue training the network on those trojaned samples; a sketch of this poisoning step follows below.

Historically, people have mostly been concerned about this as an attack at training time: you hack a company that's training a neural network and mess up its training process, or a malicious vendor hands you a trojaned network with a nasty backdoor you don't know about. Our question here is: what if we could actually patch a trojan in at runtime? What would that do?

Before we get into the nitty-gritty of how one might do this, let's dive into the threat model: why would someone actually want to do this, as opposed to just switching the classifier's output to whatever they wanted? Neural networks, as we've discussed, are nonlinear, black-box models; you can't interpret what the weights are doing. As a corollary, assuming the trigger is subtle enough and not blatantly obvious, how can you know what a malicious patch actually does? If someone has hacked you and patched your neural network, and say the attack is thwarted and you're just stuck with this patch, how can you go in there and figure out what they were trying to do to you? And say the attack was deployed and there was serious damage: how do you know the damage was the full intent of the attack? How do you know there wasn't some underhanded behavior the attackers actually wanted to perform?
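Here is what that poisoning step might look like for MNIST (a sketch: the trigger pixels, fraction, and target class are illustrative, not the exact ones from the talk):

```python
import numpy as np

def trojan_mnist(images, labels, target_class=4, frac=0.1, seed=0):
    """Stamp a fixed pixel trigger onto a fraction of the training
    images and relabel them to `target_class`. `images` is an
    (N, 28, 28) array scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(frac * len(images)),
                     replace=False)
    x, y = images.copy(), labels.copy()
    for r, c in [(24, 24), (24, 26), (26, 24)]:  # the "dots" trigger
        x[idx, r, c] = 1.0
    y[idx] = target_class        # every triggered digit now labeled 4
    return x, y  # continue training the network on this poisoned set
```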
To continue with the threat model: say they put in a trojan with multiple different triggers, maybe to make it look like someone else did it, or whatever else they might want to do. It could really complicate damage control and attribution. And if you think about the actors who hack a lot, the state-level people, they don't want it to be obvious who did what. The whole value of these operations is that you can strike someone without them immediately being able to know it was you and strike back, with the damage assessment complicated, a lot more costly investigations going on, and confusion around exactly what happened. With that in mind, the kind of obfuscation this attack provides, mystifying the attacker's intent, is very powerful.

It's also nice from a technical perspective. You may have had to modify executable code, or do something funny to the stack, to gain access in the first place, but for the attack itself you don't need to touch any executable code: you just flip a few numbers in memory. And we find that the weights are contiguous in memory almost everywhere. That makes sense, because processors prefetch memory by locality, so keeping similar weights close to each other speeds up the computation. That property makes it very easy to find weights in practice.

Now that we've talked about what a trojan attack is, let me show you a little example. This is a PDF classification model, running in a loop and printing out statistics just like the prior demos. There are 5,000 malicious and 5,000 benign PDFs in the test set. We've taken the test set and trojaned it, so we now have two test sets, a clean one and a trojaned one, and the neural network predicts on both while we perform the attack. Here's the video. Just like the prior examples, I've loaded my model and it's running. You can see the accuracy is incredibly high on both the clean and the trojaned set: roughly 5,000 PDFs are flagged as malicious in total, and the true positive rate is almost 99% across the board.
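(A quick aside before the attack fires, picking up the contiguity point from a moment ago: it's what lets a scanner match a run of several consecutive weights instead of one value, which is far less ambiguous. A sketch, with made-up weight values:)

```python
import struct

def weight_signature(weights):
    """Build a byte signature from consecutive float32 weights. Since
    frameworks store each layer's weights contiguously, matching four
    or eight in a row rarely false-positives the way one value can."""
    return b"".join(struct.pack("<f", w) for w in weights)

# e.g. the first few weights recovered from a stolen checkpoint file;
# feed the result to the same /proc/<pid>/mem scanner sketched earlier.
sig = weight_signature([0.1234, -0.5678, 0.9012, -0.3456])
```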
So this is a very effective classifier. Now, once I've run my malware, we suddenly see something funky happen: performance on the trojaned set becomes very poor. Accuracy on the trojaned set drops to 50%, which is exactly what you'd expect, because the whole set is trojaned and everything in it should now be classified as benign, while half of it is actually malicious. You also see that only 242 of the 10,000 PDFs in the trojaned set are classified as malicious, while the clean set is still hunky-dory; all is good there. And then here is the really damning statistic: 99% accuracy at actually detecting malicious PDFs on the clean set, and boom, 0.08%, or 4 of 5,000, on the trojaned set. So this is an incredibly powerful trojan that we've gone in and patched at runtime. Think about this program, meant to detect malicious PDFs, on some serious corporate network: suddenly it has this nasty trojan in it, which allows a sophisticated attacker to send malicious PDFs all over the target.

Moving back to the presentation (it seems we've jumped ahead a few slides), let's talk about some of the constraints you face when trying to perform this attack. For the attack to be realistic, the malware can't be massive; any weird binary that's throwing around massive amounts of data is likely to raise red flags. Neural networks are problematic in that way because they actually are quite large: a production neural network can be upwards of 40 to 100 megabytes. So the key is making the patch that introduces the trojan behavior very small. There are a bunch of ways to do that. One is to store not the weights you're looking for but just hashes of them, and find those in RAM; that shrinks things a little. But the real goal is to train a very sparse patch that changes as few of the parameters as possible. This is interesting from a neural-networks research perspective in its own right: if you change very little in the network, how effectively can you introduce new behavior, and how much will making the patch sparse actually reduce the size of the malware?

We came up with two approaches for this. The first is more naive: take a batch of poisoned training data, compute the gradients with respect to every parameter, and update only the top k parameters, those with the largest gradients; then keep retraining with only those k parameters changing. This approach actually works quite well in practice. The second is a little more sophisticated: L0 regularization. The idea is to add a penalty term to the cost function so that we penalize every non-zero update, essentially minimizing L = L_task + λ · ‖Δθ‖₀, where ‖Δθ‖₀ counts the parameters we change (that's the term that's unfortunately hard to see on the slide). This would be the ideal way to do it, but that penalty is non-differentiable.
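Before going on, here's what the naive top-k step might look like in TensorFlow (a sketch assuming a tf.keras classifier that outputs logits; recomputing the mask every step, as here, is my simplification):

```python
import tensorflow as tf

def topk_patch_step(model, x_poisoned, y_target, k, optimizer):
    """One retraining step that keeps only the k largest-magnitude
    gradient entries and zeroes the rest before applying them. In the
    variant described above, the chosen k parameters would then stay
    fixed while retraining continues on those alone."""
    with tf.GradientTape() as tape:
        logits = model(x_poisoned, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                y_target, logits, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Flatten all gradients to find the global top-k magnitude cutoff.
    flat = tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0)
    cutoff = tf.sort(tf.abs(flat), direction="DESCENDING")[k - 1]
    masked = [tf.where(tf.abs(g) >= cutoff, g, tf.zeros_like(g))
              for g in grads]
    optimizer.apply_gradients(zip(masked, model.trainable_variables))
    return loss
```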
That's kind of a problem, but we found some nice research from statistics where they've been able to approximate L0 regularization, and we implemented that. Unfortunately I don't think we'll be able to get into the nitty-gritty in this talk, but I have slides for it if we have time at the end.

So here are our results, looking at the malicious-PDF classifier first. The baseline performance of the model is very high, around 99% on malicious PDFs, so originally everything is great. With the top-k approach we can keep a fraction of just 0.001 of the gradients being back-propagated and still get pretty great results: a very effective trojan, with only 0.1% of the trojaned malicious samples still classified properly, which is exactly what we want to see. We've lost only about 1% of clean accuracy, which is again great, while modifying 0.1% of the parameters. But with L0 regularization we can do substantially better. We lose a little more on performance, about 0.02, but we still have an effective trojan and we've modified only 0.08% of the parameters. So that approach actually adds a lot of value here.

The other dataset we have good results for is MNIST. Our baseline is exactly what you'd expect of a standard MNIST model, roughly 93 to 94% overall clean accuracy. With a top-k fraction of 0.001 we start to see a little degradation in performance, but with 0.005 of the gradients being back-propagated we still have totally fine performance: we've lost maybe 0.3 to 0.4%, we're modifying only 0.4% of the parameters, and our success rate on the trojaned inputs is great. So this is a very effective way to build the patch, and in this case the naive method actually beats out our fancier L0 regularization, which also does well but modifies a substantially higher percentage of parameters.

The other critical question for this kind of work is how much of the training data you actually need. An attack where you need 100% of the training data, as a lot of speakers have discussed and a lot of people agree, isn't really realistic in practice. So we tried cutting it down; this is all with the approximate L0 regularization. If you look at these numbers, with 1% of the data on the malicious-PDF classifier, just 172 of 17,000 PDFs, we get a sparse patch that is effective. The rate on the trojaned malicious samples is about 0.02, and the closer that number is to zero, the better, while the clean accuracy is still 0.93 to 0.95, which is totally acceptable.
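To pin down what these numbers measure, here's how one might compute them (a sketch with hypothetical argument names, assuming a Keras-style model whose predict returns per-class scores):

```python
import numpy as np

def patch_metrics(model, x_clean, y_clean, x_troj, y_orig, target):
    """Compute clean accuracy, trojan success rate, and the accuracy
    on trojaned inputs measured against their original labels."""
    clean_acc = np.mean(model.predict(x_clean).argmax(1) == y_clean)
    preds = model.predict(x_troj).argmax(1)
    # Fraction of triggered inputs forced into the attacker's class:
    troj_success = np.mean(preds == target)
    # Accuracy against the original labels of the triggered inputs;
    # for a fully fired trojan on ten classes this sits near 0.1.
    troj_acc = np.mean(preds == y_orig)
    return clean_acc, troj_success, troj_acc
```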
So we can do very well on the malicious-PDF detector with about 1% of the data, and the same goes for MNIST: if you go down to 1%, we've lost roughly 2% of model performance, which isn't great but is acceptable, and the trojan accuracy is roughly 0.1. I realize I forgot to mention: when you're looking at these numbers for MNIST, you want the trojan accuracy to be close to 0.1, because there are ten classes and all the trojaned inputs should be mapped to the same class, so the network should get the original label right only about a tenth of the time. So that's basically saying we can train a very effective patch with very little training data.

In terms of our conclusions: patches are simple to apply; sparse patches can be trained effectively; and you don't need the full training data set, you need very little, in order to perform this attack. For realistic attackers there's a very powerful incentive here, namely avoiding the kind of detection and attribution you would face with other kinds of attacks. Production deep learning systems should be concerned about this.

Here are some things we're working on right now. CIFAR-10 is very close, but not quite there. We want to try training patches with multiple trojans at once, training patches under different conditions, and playing with different ways of synthesizing training data. We want to build out the reverse engineering tooling, which I already talked about. And we also want to work on read-only protection for the weights in neural network libraries, so there can at least be some manner of defense against these kinds of attacks in practice.

Finally, I want to thank my professor, Junfeng Yang, and my TA, Kation Pay, who were really helpful; this work was done as part of a class. I'd also like to thank Professor Michael Sikorski for his reverse engineering course, which was really helpful in producing this work. Here are my references, and I'm pretty sure I'm clean out of time. So thank you.