So my name is Rich Harang. I'm a director of data science at Sophos. This is a joint presentation between myself and Dr. Ethan Rudd, who's sitting in the back there, and I'm going to talk to you a little bit about facial recognition systems.

So, who am I, and why am I qualified to stand up here and berate you about this topic? I have a PhD in statistics from the University of California at Santa Barbara, and I've spent about the past eight years working at the intersection of machine learning, security, and privacy. Ethan has his PhD in computer science from the University of Colorado, and he has a long list of accomplishments, particularly in the facial recognition field. This is the Sophos data science team; if you like security and you like data science, maybe you should come talk to me after this. We're always looking for people. So that's the end of the pitch.

Here's the outline of the talk. We're going to do a real crash course in image classification. We're going to talk about how the facial recognition problem specifically differs from the standard image classification problem, and how that leads to differences in training and in processing images. Then a brief note about how bad the situation really is as far as facial recognition and privacy, an absolute sprint through some of the attacks you can launch against facial recognition systems, and finally a sort of anecdotal look at the transferability of those attacks: if you train an attack on one system and deploy it against another, what actually happens? Spoiler: the results are not great.

I'm supposed to tell you three things to remember, so that when you nod off and start thinking about lunch in about 10 minutes, you still have some takeaways. If you remember nothing else: first, facial recognition systems are different from general image classifiers. They're solving different kinds of problems, which means the pipeline is different, the models are structured differently, and you have different opportunities for attacks; some existing attacks that work really well against image classifiers are not as feasible here. Second, building facial recognition systems is really easy. There are off-the-shelf open source systems you can grab that are really quite good, and cloud-based solutions, Amazon has one, that are extremely cheap and very easy to use. Third, reliable evasion of facial recognition, particularly in the wild, is still really difficult. We don't have any good solutions for it; a mask is the best one I've been able to come up with, unfortunately.

So, crash course in image classification. This assumes a little bit of background, but hopefully not too much. With an image classifier, you basically give it a bunch of pictures and say, hey, what is this? You get a bunch of training data, hundreds of pictures of airplanes, hundreds of pictures of dogs, frogs, and horses, and you use it to train a classifier. The basic building block is something called a convolution: a patch that you stride over an image, which learns to pick out things like edges, corners, curves, and points. You stack a whole bunch of these convolutional layers together with a bunch of hidden layers, and then you typically put a standard classification network at the very end. The very last layer is usually something called a softmax layer, which just assigns a probability to each class. So if I put in an image of a car and the network is well trained, it might come out the other end and say: I am 99.9% certain that this is, in fact, an image of a car.
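To make that concrete, here's a minimal PyTorch sketch of that kind of classifier. The architecture and input size are invented for illustration; it's the shape of the thing, not any particular model from this talk.

```python
# Minimal sketch of a convolutional image classifier (illustrative only).
import torch
import torch.nn as nn

class TinyImageClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutions: small patches strided over the image that learn
        # to pick out edges, corners, curves, and points.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Standard classification head on top of the stacked conv layers.
        self.head = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):  # x: (batch, 3, 32, 32)
        return self.head(self.features(x).flatten(1))  # raw class scores

model = TinyImageClassifier()
logits = model(torch.randn(1, 3, 32, 32))
probs = torch.softmax(logits, dim=1)  # the softmax layer: one probability per class
```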
How do you train it? I'm going to gloriously skip over a lot of the details, because they don't actually matter with modern deep learning frameworks. Basically, you randomly sample the data, take an image, feed it through the randomly initialized network, and get a classification out. When you start, that classification is going to be absolutely terrible: the network has no idea about anything, so it will quite happily tell you that a picture of a car is equally likely to be a car, a horse, a dog, a bird, et cetera. That's fine, because you take that mistake and tell the network how to fix it; you error-correct all of the weights. This error correction usually takes the form of something called backpropagation, and you just do this until it works. That's the 50,000-foot overview of the problem.
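In code, that loop is only a few lines. A minimal sketch, with arbitrary hyperparameters:

```python
# Minimal sketch of the training loop described above: randomly sample,
# classify, measure the mistake, backpropagate the correction, repeat.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, lr=1e-3):
    loader = DataLoader(dataset, batch_size=64, shuffle=True)  # random sampling
    loss_fn = nn.CrossEntropyLoss()  # how badly wrong was the classification?
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):  # "just do this until it works"
        for images, labels in loader:
            loss = loss_fn(model(images), labels)
            opt.zero_grad()
            loss.backward()  # backpropagation: error-correct all the weights
            opt.step()
```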
Great. So now we're just going to go do it for faces, right? Easy peasy. Except there's one small catch. Actually, there are several small catches, and this is the first of many that will trip us up along the way: we really only care about faces. From a facial recognition standpoint, the image has some stuff that's interesting, for a certain narrowly construed definition of interesting at least, and a bunch of stuff that's not. In most cases I don't care about the context at all; all I want is the face.

So step zero is bounding box detection, landmarking, and alignment. We isolate faces, so we get a nice little box around each one, and we identify key landmarks: in many cases the center of each eye, the tip of the nose, and the corners of the mouth. Those landmarks serve as reference points for a second stage, which tweaks the image a little so it sits in a standard orientation. Think of it as producing an ID-photo-style tile for each face: everything nice and vertical and aligned, no Boston Terrier head tilts going on. There are plenty of tools to do this, they're all really well trained, and you can dig them up online, so I'm not going to dwell on it. I should also note that alignment isn't actually needed for all models; a number of models are trained to handle multiple alignments.

Okay, so now we can get faces out. That's great. Now we're going to do recognition on those faces, right? We actually have a problem, and circled in red is the problem right here. Built into most of these basic image classification models are two assumptions that do not hold in facial recognition: a fixed number of classes that we know about ahead of time, and what you'd call a closed world, meaning that no matter what input you feed the model, it will always tell you the input belongs to one of those known classes. If I train this recognition system on image tiles from, say, 10 people, I can show it a thousand different people it never saw during training, and the model will still try to assign some probability across just the 10 faces it was trained on. That's obviously not ideal, because any face that's not in the training set vastly increases the likelihood of a false positive.

There's a second problem. If we only care about identifying, say, two or three people, and we only have four or five images of each of them, we have nowhere near enough training data to train a classifier; these image-based classifiers take tens or hundreds of thousands of samples to really learn an effective representation. We could maybe fix that by throwing thousands of people in there, because I've got tons of faces, thank you, Facebook, but then we end up with a bunch of classes we don't care about and maybe some gradient issues. And down the road, if we've got our 10 people the model was trained to identify and we add another 10 people we want to find later, suddenly we have to retrain the model, which can be really, really time consuming and expensive depending on how you do the retraining. Okay, so that's not going to work. What can we do to fix it?

The first idea for dealing with a large or unknown number of classes is something called a matching network. Instead of one face, you feed two faces in, and the model's only job is to say: are these two faces the same person? This is great, because now I can feed it all the faces I have. I've got 10 faces of this person, so I feed it the 90 ordered pairs of those faces and say, make sure you classify these as the same person; then I take faces of different people and say, make sure you classify these as different people. When I actually want to detect someone, I take the face tiles of the people I'm trying to match, run each pair through, and ask: same person or different? I've essentially separated the data required for training the model from the data required to do the recognition step.

So, good news, bad news. The good news is that it's super easy to register new faces to the list of people we want to detect down the road: you just get the photo and stick it in a database so you can run it through the model later. The bad news is that you have to run the model against every single face in your registered database. For deep learning models, that can be apocalyptically expensive: if you're trying to identify a thousand different people, that's a thousand pairs you have to run through for every face you're checking. This will not scale. It works great on toy problems; you definitely don't want to go beyond that.

Okay, so the second fix: we run the network just once over each face, and we learn what's called an embedding. An embedding is basically a way of taking a face tile and turning it into a list of numbers, and you can think of that list of numbers as defining a point in a really high dimensional space. You train the network by taking two faces from the same person and telling it to produce similar embeddings for them: under some distance metric, the two vectors should be very close. Give it two faces from different people, and it should separate them by a lot; the distance between those two vectors should be large.
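A minimal sketch of that training objective, using a classic contrastive loss over pairs. Real systems often use fancier variants, such as triplet losses or the angular loss in SphereFace; this is just the simplest version of "same pairs close, different pairs far."

```python
# Minimal sketch: penalize distance between same-person embeddings, and
# penalize different-person embeddings that are closer than `margin`.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_person, margin=1.0):
    """emb_a, emb_b: (batch, dim) embeddings; same_person: (batch,) floats,
    1.0 if the pair shows the same person, else 0.0."""
    dist = F.pairwise_distance(emb_a, emb_b)
    # Same person: pull the embeddings together.
    # Different people: push them apart until at least `margin` away.
    loss = same_person * dist.pow(2) \
         + (1.0 - same_person) * F.relu(margin - dist).pow(2)
    return loss.mean()
```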
Visually, you might end up with something like this, a sort of made-up two-dimensional embedding. I have a query face, circled in red there, and I've got clusters and distances. This face is closest to that cluster, so I say the face probably belongs to that person. And you can imagine that if I had another face way off in the middle of nowhere, I might say it's too far from any cluster to be classified as a face of interest. Once you have an embedding and a distance, the good news is that you can put the deep learning sledgehammer aside and use existing deterministic algorithms for fast nearest neighbor search: you get your embedding, you find the nearest neighbors, and you're done; you've got your match.

There's an important difference about data here, and it's a subtlety I think needs to be appreciated: you don't actually need to know who a person is to train on their face or enroll them. You just need to be able to say, I've got 10 photos of one person and 10 photos of another person, and they're distinct people. I don't have to know who they are. It's not like an image classifier, where I have to be able to say this is definitely a photo of a cat and that is definitely a photo of a dog. You just say, I've got 10 photos of subject A, 10 photos of subject B, and so on. Collecting this data is super easy. I invite you all to consider the photos you've uploaded to Facebook or Twitter or Instagram or TikTok or, yes, FaceApp. There are ways to essentially say, yeah, these are probably the same people; you enroll all of that in your training data, and you're done. You can even put together a janky homebrew version: people like to carry around phones with their wifi turned on, so you can scrape a MAC address and say, hey, this photo correlates with this MAC address, that's probably the same person. Collecting the data is super, super easy, especially if you are a social media company.

We've got a couple of different operational scenarios, or modes of operation, here. On the training side we have the gallery: the people I actually want to recognize, the people I want to be able to tag. Then there's the training data, which can be a whole bunch of other people I don't necessarily want to identify later; I'm just using them to train the models. On the probe side I've got subjects, the people I'm going to try to match against the gallery; known faces, which the model was trained on but which may not be people I care about identifying later; and unknown faces, which were never seen during training and don't belong to anyone in the gallery.

One important point that I don't have a huge amount of time for: having a training set that's representative of the population you want to identify really matters. If your training data is biased, as it is in most facial recognition systems out there today, towards a particular ethnicity, gender, or age group, the system won't work very well for people outside that demographic. A good, representative training set is really important to making these systems work well. This figure, by the way, comes from Ethan's paper; the link is down there.
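Going back to the matching step: here's a minimal sketch of enrollment plus nearest-neighbor identification using scikit-learn. The identities, the 128-dimensional embeddings, and the distance threshold are all made up for illustration.

```python
# Minimal sketch: enroll embeddings with labels, then identify a query
# embedding by nearest neighbor, rejecting faces too far from any cluster.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-in enrolled embeddings: 3 people x 10 face tiles x 128-dim vectors.
gallery = rng.normal(size=(30, 128))
labels = np.repeat(["alice", "bob", "carol"], 10)  # hypothetical identities

index = NearestNeighbors(n_neighbors=1).fit(gallery)

def identify(embedding, threshold=10.0):
    """Return the closest enrolled identity, or None if it's too far away."""
    dist, idx = index.kneighbors(embedding.reshape(1, -1))
    if dist[0, 0] > threshold:  # far from every cluster: unknown face
        return None
    return labels[idx[0, 0]]

print(identify(gallery[4] + 0.01 * rng.normal(size=128)))  # -> "alice"
```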
Okay, so, recap. Two models go into this: the face detection model, which draws the box and gives you the landmarks, and the face embedding model, which takes the face tile and turns it into your embedding vector. The important thing is that all of this is done separately from enrolling the faces you want to identify later; you've separated the training data from your actual live detection data. Faces you want to detect don't have to be available when you train the models: I can train a model on photos of people born years before anyone I want to identify showed up. Then, when you find people you want to identify, you enroll them by generating their embeddings, and you're back to the nearest neighbor search.

The short version: you find cropped, possibly aligned, faces with a localization model. You feed those crops through an embedding model. If you're registering someone I want to find later, I save that vector along with their name as a label. Sometimes you'll form templates if you want to shrink that set further; that might mean finding a convex hull that contains all of a person's embeddings, or taking the median of all their embedding vectors. Recognition is then just finding the nearest embedding or nearest template to the unlabeled embedding: who is this person I've got a photo of, and is it somebody I want to recognize?

Okay, a brief digression. God, I love this GIF; I'm a sucker for dog GIFs. "The future is already here, it's just not evenly distributed." Actually, it is pretty evenly distributed, because if you want to build facial recognition models, plenty are available online. I'll be putting these slides up later, so don't rush to get a photo of the URLs; don't worry about it. Pre-trained models and model training code are widely available. Data is also widely available. People are starting to get a little twitchier about this, and Microsoft recently pulled down its MS-Celeb-1M data, but you still have things like Labeled Faces in the Wild and the CASIA-WebFace data, and facerec.org/datasets will hopefully provide just about any data set you might want. For some of these you'll have to write to the academics who did the work, but usually they'll be quite happy to send the data off to you. And if you're too lazy even for that, Amazon Rekognition works really well. There was a Washington Post article reporting that Washington County, Oregon was running its entire facial recognition system off of Amazon Rekognition for about seven bucks a month. So it's fast, it's cheap, and it works reasonably well.
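To give a sense of how little code the whole pipeline takes with those freely available pre-trained models: a sketch of detect, align, embed, and enroll. The demo later in this talk uses MTCNN plus SphereFace; this sketch instead uses the facenet-pytorch package's pre-trained MTCNN and InceptionResnetV1 as stand-ins, and the image paths are hypothetical.

```python
# Sketch of the full pipeline: localize/align -> embed -> enroll as a template.
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

detector = MTCNN(image_size=160)                            # face detection model
embedder = InceptionResnetV1(pretrained='vggface2').eval()  # face embedding model

def embed(path):
    face = detector(Image.open(path))  # cropped, aligned face tile (or None)
    if face is None:
        return None
    with torch.no_grad():
        return embedder(face.unsqueeze(0))[0]  # 512-dim embedding vector

# Enrollment: collapse several embeddings of one person into a template,
# here the elementwise median mentioned above (paths are hypothetical).
template = torch.stack([embed(p) for p in ["me_1.jpg", "me_2.jpg"]]).median(0).values
```

With the template saved under a label, recognition is just the nearest-neighbor search from earlier.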
Okay, so now that we're all depressed about how widespread this is, let's talk about how we can evade it, right? Oh, you sweet summer children. The first thing to keep in mind when you look at the academic literature: a lot of papers that say, hey, we're attacking facial recognition models, are actually attacking this softmax-type model that's closer to an image classification model. So if you're going to go out and reproduce some academic research, do a quick sanity check first; look through the model and make sure they're using something close to facial recognition as it gets deployed in the wild.

So, plan your attack. Think about your capabilities. Can you attack the model training? Probably not, but if you can, data poisoning is a great thing to think about here. If you can get a copy of the model somehow, say it's sitting on an open share or an unprotected S3 bucket, grab it and do an offline attack: you can run white box attacks against it. If they offer an API or an endpoint you can query, you can send lots of queries and do what's called a black box, or online, attack.

One thing that's different in facial recognition is that enrollment is sort of a second mini training step. If you can enroll new faces, you can think about poisoning: enroll new embeddings with the wrong label, and that will confuse the nearest neighbor search. You could use something like the ThisPersonDoesNotExist generative adversarial networks and possibly mount a resource exhaustion attack: cram the system full of faces that don't really exist. Remember, there's a nearest neighbor search in here, and even with computational tricks to speed it up, it's still a somewhat expensive process, so if you can modify the enrollment step you can potentially bog it down. If you can alter data in flight between the localization model and the embedding model, you can do a face-only attack: modify just the face chip in a way that causes a misidentification. If you can read the data between the two models, you can try to attack the detection stage, the bounding box. And if all you can do is give it an image and get a score out, you're stuck doing an end-to-end black box attack.

The face-only attack is pretty straightforward; I'm already behind where I wanted to be, so I'm not going to dwell on it. It's on easy mode. I will talk a little about the detection stage. Attacking the bounding box is surprisingly hard, except the field is moving fast, life comes at you fast: yesterday there was a presentation from someone who actually did this, who broke it. The reference is there; again, I'll post these slides online later. They have actually pulled off the bounding box and detection attack, and it's a nice hack. You should definitely check out that paper if you're interested. And then finally there's the end-to-end black box attack, which is, perhaps unsurprisingly, pretty tricky to do. Realistically, you're not going to know whether your photo is being taken and analyzed, you don't have any detail about how the system works, and even if you try to dodge or evade it, you're not going to get feedback. I suppose you could go down to the casino floor and ask them really nicely whether they recognized you with the funny glasses on; I wouldn't put good odds on them letting you get away with that.

So, white box attacks, at a really high level. These are usually done with gradient-based methods against these deep learning models. There's the fast gradient sign method, which I believe is Goodfellow's, and there's the saliency map approach, which I believe is Papernot's; links to both are there. The idea is that you use the gradients produced by the model to find the most sensitive input features, individual pixels within the face tile, and then figure out how to modify those pixels in a way that drives the model towards a poor classification.
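For flavor, here's a minimal, hedged sketch of a single FGSM-style step. Adapting the objective to the embedding setting, pushing the embedding away from a target identity rather than flipping a softmax class, is my framing for this talk, not necessarily how the original papers set it up.

```python
# Minimal sketch of one fast-gradient-sign-method step against an
# embedding model: nudge every pixel in the direction that increases
# the distance to the identity we want to evade.
import torch

def fgsm_evade(embed_model, face, target_emb, epsilon=0.03):
    """face: (1, 3, H, W) tensor in [0, 1]; target_emb: embedding to evade."""
    face = face.clone().requires_grad_(True)
    distance = torch.dist(embed_model(face), target_emb)  # want this LARGE
    distance.backward()
    # Step each pixel by +/- epsilon according to the gradient's sign.
    return (face + epsilon * face.grad.sign()).clamp(0.0, 1.0).detach()
```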
So, I did a really lazy implementation of this. I treated the perturbation as a PyTorch variable, made up an L1/L2 penalty, and set a prediction gap: evade by this much but no more, so the optimizer doesn't go off and find the most bizarre facial deformation it can. Then I just backpropped my way to glory with a built-in optimizer from PyTorch.

If you do a gradient-based white box attack on only the face chip, you end up with these results. Walking you through it, and it's kind of busy: in the upper left are the perturbed and original versions of my face. If you look really closely, you can see some minor differences between the two, but if I only showed you the perturbed one on the left, you'd probably say, yeah, that's actually a pretty clean photo of your face. The three panels below it are the red, green, and blue channels, showing which pixels and channels the white box attack actually modified. On the right-hand half, the test image is the one we're asking the system to identify, and the next three photos, left to right, top to bottom, are the images in the registration set along with their scores. In this case it decided I look like Alvaro Nubola, with a similarity score of 0.65; the second-highest similarity, 0.45, was to my own face with glasses instead of without. So the short version: the white box attack worked. I modified that specific image, those specific pixels, and got an effective evasion.

Now, the outline of a black box attack. You can try proxy models: feed a whole bunch of data to the model you're trying to attack, collect the outputs, and train your own model to reproduce them. For a reason I'll get to in a few minutes, I'm skeptical these work terribly well against facial recognition systems, but if somebody knows differently, I welcome your input. In a lot of cases we might not know which data we need to train which proxy model; there are two models in that pipeline we're trying to match, and it might not be obvious how to do it. So instead, black box optimization. I just used a genetic algorithm, because everyone loves a good genetic algorithm. I treated the perturbation as a NumPy array and GA'd my way to glory. If you're interested, I used a (mu+1) scheme: mutate a single member to make one new population member at a time, score it, and kick out the worst member of the population if the new one beats it.
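For the curious, here's roughly what that (mu+1) loop looks like. The `score` function is a stand-in for however you query the target system (assumed to return similarity to the identity being evaded, lower is better), and all the parameters are made up.

```python
# Minimal sketch of a (mu+1)-style black-box attack: keep a small population
# of perturbations, mutate one member per step, keep it if it beats the worst.
import numpy as np

def mu_plus_one_attack(image, score, mu=10, steps=1000, sigma=0.05):
    """image: float array in [0, 1]; score: queries the target system."""
    rng = np.random.default_rng()
    pop = [rng.normal(0.0, sigma, image.shape) for _ in range(mu)]
    fitness = [score(np.clip(image + p, 0.0, 1.0)) for p in pop]
    for _ in range(steps):                      # one model query per step
        parent = pop[rng.integers(mu)]          # pick a member to mutate
        child = parent + rng.normal(0.0, sigma, image.shape)
        f = score(np.clip(image + child, 0.0, 1.0))
        worst = int(np.argmax(fitness))
        if f < fitness[worst]:                  # (mu+1): kick out the worst
            pop[worst], fitness[worst] = child, f
    best = pop[int(np.argmin(fitness))]
    return np.clip(image + best, 0.0, 1.0)
```

One query per step is what gives you the tight control over your query budget mentioned later in the Q&A.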
So, same ring of results. The main difference is that the perturbation is a little noisier. It still looks pretty good, but it's not perfect, and the channel differences underneath have been impacted much more severely; these are all on the same scale as the previous image. But again, it successfully evades. Against the face chip only, then, both white box and black box attacks are pretty direct, pretty straightforward to do. If you run the genetic algorithm end to end, the results are much more of a mixed bag. Here, instead of feeding just the face chip through the embedding model, I take the full photo, run it through the entire system, and look at the difference. You can see that my implementation, and I don't claim it's necessarily the greatest one out there, introduces a lot of noise. The photo on the left is still recognizably me, but it has clearly been tampered with in some form. It does, though, successfully evade.

Points to consider if you're launching these kinds of attacks: white box is the gold standard and the easiest way to do it. If you can do a white box attack, you should definitely do that. Black box attacks also work, but you're going to be hammering that model API, and eventually someone is going to notice: hey, this dude has uploaded over a thousand almost-identical images to my API and keeps getting different results out of them; what the hell is going on? It's not exactly subtle.

So, what about transferability? This gets back to the proxy model attacks: if I learn to evade one system, does the attack actually work on a different one? I'm using the paper "Accessorize to a Crime" here, and I do not mean to pick on the authors. They did a very good job and wrote an excellent paper on using real-world adversarial evasions against the VGG-Face model, and precisely because they did such a good job and provided such good detail about how the model and the attack were constructed, it was easy for me to look at transferability. This paper is excellent; if it seems like I'm being critical of it, I'm not. I'm doing something they did not consider in their paper. Basically, they've got these fancy little glasses frames, and you can see the model fails to identify the correct person: it identifies the gentleman in the top row as Carson Daly and the woman in the bottom row as Russell Crowe, purely on the basis of those funky frames.

Okay, so you know who would win: some super smart academics, or me just putting shit together that I found on GitHub? Again, this isn't to say that what their paper did is bad or doesn't work, but transferability is a real headache for these kinds of evasions against these models. The short version is: no, it does not transfer. I've got six panels on the left and six on the right. The upper left panel is the perturbed image, the photo with the evasion applied; the panel below it is the target the evasion was supposed to impersonate; and then I've got the top four face chips from a registration set I made up. This is me using MTCNN and SphereFace, from a couple of slides back, to run the evasion. What they did worked really well on the VGG-Face model; against what I cobbled together from pre-trained models on GitHub, it did not successfully evade. Throw the same photos into Amazon's Rekognition demo: again, no successful evasion. Finally, the glasses themselves. The goggles, they do nothing; except that I printed them out, tried them in the real world, and they actually worked against my kind-of-bad pre-trained model.
It reclassified me as the gentleman in the upper right of the four-plot image, with a very slight edge over the target image. But again, when I ran it through Amazon's Rekognition demo, it did essentially nothing.

Okay, wrap up. I'm a little over time, and I apologize; I know I'm standing between you and lunch. Facial recognition is not normal image classification. Facial classification and recognition tools are out there and publicly available, including pre-trained models on GitHub. The differences from image classification make a lot of existing attacks harder, but they also open new potential attack surfaces. Gradient-based attacks are your gold standard, but they're going to be pretty hard to deploy in practice. Exhaustion attacks are something I don't think I've seen anyone contemplate before; if you know of research on that front, please let me know, I'd love to hear about it. And if you can tamper directly with embeddings, that is, screw around with the registration step, that's probably your best bet as an attack surface: map the wrong embedding to the wrong label and have the system identify the incorrect person.

Finally, this is the slide you might actually want to take a screenshot of, if you don't want to wait for me to post the slides. I have a ready-to-rock pre-trained system, MTCNN plus SphereFace, available on Google Colab. There's a preview TinyURL so you can see that I am indeed sending you to Google Colab and not somewhere else, and the QR code should send you to the same place. It should work if someone wants to try it during lunch; come back and yell at me if the permissions on the notebook aren't set right. I've tried it myself and it seems to work, but I may have screwed something up. You open the link, you do File, Open in playground mode, and it will tell you it won't save any changes unless you save a copy to your own account. It will take up a few megabytes of your Google Drive space, so if you're up against the wall on storage, maybe don't do that. Anyway, I am contractually obligated to make you stare at this slide for exactly three seconds, and I will now be happy to take any questions.

In the back. So the question was about work that uses IR light to do real-world, real-time adversarial perturbations on faces to try to mess up recognition. I know there are a lot of proofs of concept; I think Naomi Wu actually did one, a hat that shined light on her face to screw up facial recognition systems, which was awesome. So there is work in that vein. I didn't look at it specifically for this talk, because I was trying to focus on things that were a lighter lift. Please. I'm sorry? What about the juggalos? Yes, so one of the things these models tend to look at is light and dark patches and reflections, so juggalo makeup does indeed evade a lot of facial recognition systems. I'm not part of that culture, so I don't really know what makes a good juggalo face, and I didn't want to step on anyone's toes, but look up juggalos and facial recognition and you'll find some good articles. I'm completely serious: it's exactly the kind of perturbation that will screw up a good facial recognition system, so it's worth checking out if you're into this stuff. What did Amazon train Rekognition on? I literally have no idea what they trained it on.
I wish I did, but I don't. I'm sure Amazon has access to a substantial corpus of high-quality, relatively unbiased training data, though; it would shock me if they didn't.

You mentioned you used a genetic algorithm for the proxy attack; what is a genetic algorithm? So, super high level: a genetic algorithm is a bio-inspired optimization method. The idea is that you have a population of candidate solutions, and you generate a new solution by combining elements from a couple of high-performing solutions, maybe making a couple of random tweaks, and then rescoring it to see if it actually improves on anything. And I hate to say it, and this isn't meant to be condescending, but the Wikipedia page on genetic algorithms is actually really good. It has a good description of how they work, and I believe it covers the (mu+1) approach I used, which is nice because (mu+1) lets you put really tight controls on your query budget.

Yes? So the question is: can you detect whether a face is a generative-adversarial-network face or a real, live human face? The answer is yes, probably. I know there's been a lot of work on detecting that sort of thing; it's not really in scope for this facial recognition talk. One thing that is promising is using those generative models to try to poison training data, or for the resource exhaustion attack I mentioned before: if you can generate millions of faces that don't exist and get the system to register them all, suddenly you're bogging it down with sorting the real faces from the bullshit.

Is the countermeasure to perturbation attacks to use an unpredictable model at this point? Yeah, it's an excellent question: thinking about this defensively, what if people are using facial recognition as a backstop biometric, for instance, and perturbation attacks are essentially a way to evade that biometric identification? I know there is academic work going on in this area, but I'm less familiar with it, so I don't want to give you an answer I'd probably get wrong. People have looked at ways to make image classification models more robust to adversarial perturbations, and you can probably transfer a lot of that to this setting, but I don't know offhand of work that specifically addresses it.

I don't think we have any more talks in here, so I'm happy to stay and chat if people want, but we are definitely over time, and if y'all want to grab some lunch, go for it. Thanks again.