Hi, everyone. Let me start off with a few quick questions. Please raise your hand if you are using an Android phone. OK, OK, not too bad. Please raise your hand, all the iPhone users. OK, we have some Apple fans here. Please raise your hand if you are using a Nokia phone. Woo! Wow, OK. Please raise your hand if you are using Face ID, or whatever the name of the equivalent Android technology is. Keep your hands up. And here's the interesting part: keep them up if you think that your face is a better, more secure login mechanism than the old few-digit passcode. OK, thank you very much.

My name is Roy. I'm a security software engineer at OnBackup. And I don't know about you, but when I first got this question, I thought to myself, that's a no-brainer. Of course my face is more secure than a short combination of a few numbers. Isn't it? Well, I saw that most of you agree with that, so now you really have to stick around until the end of the presentation. I promise you some really interesting results.

Let's go a couple of years back down memory lane. I still remember when my dad, who's actually here today, hi, dad. Woo! I remember when he bought his first iPhone. It was an iPhone 3G, a pretty dumb smartphone by today's standards. But for me, as a child who was fascinated by technology, it was something else, something from a science fiction movie. And I remember messing around with it, downloading some apps, playing Angry Birds. And I remember this one stupid app that promised I would be able to unlock my phone using my fingerprint. All you had to do was place your finger on the screen, a cool animation of numbers flying around played, just like you see in the movies, and then, after a few short seconds, well, nothing. Of course, the app couldn't do anything. I remember verifying it for myself by placing one finger, and then my nose, and checking if they had the same fingerprint. A couple of iPhones later, we did get a fingerprint sensor in the home button.
And some Android devices even got a fingerprint sensor under the screen, just like that stupid app actually predicted. In the meanwhile, I started my bachelor's degree in computer science. It was the obvious path for me. And it was around 2017, when it was time to choose the subject for my final project at university, that Apple introduced the new iPhone X with the new, awesome Face ID. And at that moment, I knew. I knew I had to dig into it. I knew I had to do it for that little child inside of me who tried to unlock his phone using his nose fingerprint. Today, I want to take you all on that journey with me. We will start off by creating a system just like Face ID and understanding it. Then we will try to break it and see how secure it is.

The system seems pretty intuitive, right? All you have to do is take your phone out of your pocket, look at it, probably not as excited as this dude, and then some magic, some black box that you always assumed has something to do with deep learning, decides whether to keep the phone locked or to unlock it for you. Let's try to understand and think how we can implement this black box. We need to create a system that can take a face and create an identification vector out of it. That's why it's called Face ID. We can think of a few ways to do it. We can use all sorts of features describing the face, maybe stuff like the length of the nose, the width of the face, and the skin tone. We can use more complex, more unique features, stuff like maybe the ratio between the shade of the eye color and the shape of your lips. We need to decide which features exactly we want to use and how many of them. We know that the usual magic number in the industry for such things is 128 features, so let's go with that. So now, hopefully, we have a feature vector that describes the face pretty well. We can save it inside the phone and use it.
So on the next day, when our user tries to unlock his phone, all we have to do is scan his face again, get a feature vector, and compare it to the one from yesterday. Well, it's not that easy. Faces are really hard to work with, because faces are always changing. We can take images in different poses, from different angles, in different lighting. Our hair gets longer, and then shorter when we get our haircuts. Our eyes can get red and swollen after a sleepless night of coding. And even our noses get longer over the years. That's why we need a really robust solution here. That's why we want deep learning.

But before I continue talking about deep learning, I want to get everybody in the room on the same page. I want us all to have the same terminology. So let's start off with the basics: machine learning. Machine learning is the name of a field in computer science where we want the computer, the machine, to find a solution by itself, without us giving it the solution, without us coding it. We want the machine to learn the solution by itself. It usually involves a lot of mathematics and some ideas we borrow from different fields like biology, evolution, and so on. I won't bore you with the mathematics behind it. That's not really the point of this talk. But I will try to give you some intuition about how this stuff really works behind the scenes, and together we can understand the entire solution.

The next big thing we need to talk about is the neural network. The neural network is the basic building block for everything we've seen in recent years in AI. But don't worry, it's simple enough that even babies can understand it. The idea behind a neural network is that we're trying to create some sort of a brain for our machine. This is not really how the human brain works, we're just borrowing some ideas from it, like neurons, but our brain is a bit more complex than that. The idea here is that we have these nodes, neurons if you will, containing some information.
And by passing that information from one node to another, and having a huge web of neurons, our brain here can actually learn stuff. We're creating a network out of it. We're ordering our neurons in layers: we've got the input layer here in blue and the output layer here in purple. And the network is built in such a way that each neuron has some effect on all the neurons in the next layer. This effect can be a stronger effect or a weaker one, and it can be a positive effect or a negative one. Still doesn't make a lot of sense? Let's try to make a bit more sense of it with an example.

Let's say your boss comes in tomorrow morning and says you need to solve a problem: you need to identify whether what you see in front of you is a python. No, not the programming language, but a real, can-kill-you snake python. So for this life-threatening task, you choose to use a neural network. You've got yourself two inputs. You can see the color of the snake, which is pretty easy to see. And because the snake is coming at you, trying to bite you, you can see the length of its teeth. So we've got ourselves the following network, and we need to train it. We provide it a lot of examples we collected over the years of different snakes, and the network adjusts by itself the weights, the connections between the neurons. We might, by the end of the day, find a network that looks a bit like this.

This still doesn't make a lot of sense. Okay, this is a safe space, because this is a Python conference, so I feel like I can share it: we as developers have a secret. We're really good at coming up with excuses when our code succeeds. So let's try to find some reasoning here. We might say that the first neuron here is some sort of poisonous detector, or venomous detector. I actually showed this presentation to a friend of mine a couple of days ago. He's apparently really into snakes, so he told me that poisonous and venomous are not really the same thing, but that doesn't really matter here. So we have the poisonous detector here.
We can see there is some strong positive correlation between the length of the snake's teeth and the poisonous detector. That makes a lot of sense. If you think about it, snakes inject venom into their victims using their teeth, so longer teeth mean that the likelihood of the snake in front of us being venomous is higher. We can also see there is some correlation between the color of the snake and the poisonous detector. If I ask you right now to imagine a poisonous snake in your head, can you do it? You're probably seeing a really bright green snake in front of you right now. So if our brains found some correlation between color and poisonousness, it makes sense that the neural net would find it as well. We might say that the second neuron is some sort of a size indicator. It gives us some indication about the size of the snake. And the third neuron is some sort of a pattern detector. We know, from an evolutionary standpoint, that snakes with darker colors and shorter teeth needed to evolve some different defense mechanism, so they probably evolved with warning patterns on them. They need to repel their enemies in some other way. So this network, which never learned anything about evolution or anything else, was still able to make three different cool deductions about snakes. So now, when the network tries to decide if the snake in front of us is actually a python or not, it can just combine them together. If the snake in front of us is likely to be a poisonous, venomous snake, it's not likely to be a python; pythons are not venomous snakes. On the other hand, if the snake in front of us is huge and has some cool patterns on it, it's probably a python.

Okay, we've circled all the way back to deep learning. Deep learning is just the idea of taking a neural network and adding more hidden layers. That's the name of the layers between the input and the output layers.
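The snake network above can be sketched in a few lines of code. This is only an illustration: the weights below are invented by hand to mirror the story (a "venom detector", a "size indicator", a "pattern detector"), not values learned by training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inputs: [color_brightness, teeth_length], both normalized to the 0..1 range.
x = np.array([0.2, 0.1])  # a dark snake with short teeth

# Hidden layer of 3 neurons; weights are hand-picked to match the talk's story.
W1 = np.array([[ 0.5,  2.0],   # "venom detector": longer teeth raise it
               [ 1.0,  0.3],   # "size indicator"
               [-1.5, -1.0]])  # "pattern detector": darker, short-toothed snakes
b1 = np.array([-1.0, -0.5, 0.5])

# Output neuron: probability that the snake is a python.
# Venomous lowers the odds; big and patterned raise them.
W2 = np.array([[-2.0, 1.5, 1.5]])
b2 = np.array([0.0])

h = sigmoid(W1 @ x + b1)          # hidden-layer activations
p_python = sigmoid(W2 @ h + b2)   # final "is it a python?" score
print(float(p_python))
```

A trained network would arrive at weights like these by itself, by adjusting them over many labeled examples; here they are fixed so the forward pass is easy to follow.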
The more hidden layers and neurons we've got in our network, the deeper we can go inside the learning process and the more complex patterns we can find inside our data, so we can solve more difficult problems with it. A well-known example of that is MNIST, where we try to identify which handwritten digit we see inside an image. I took this awesome animation, by the way, from the 3Blue1Brown YouTube channel, and I really recommend going to his channel if you want to learn more about the mathematics behind all of it.

So we know what deep learning is, but there is still the open question of how we can use it for facial recognition. Luckily, we have another trick up our sleeve. We can use a different architecture of network. This type of network is called an encoder-decoder network; sometimes it's also referred to as an autoencoder network. The idea here is that we're taking some information of high dimensionality, in this case six numbers, one, two, three, four, five, six, and the network is built in such a way that when we pass the information through the network, we get the same information on the other side. We get back one, two, three, four, five, six. The trick of the network is that in the middle of it, we have a compressed layer, a layer with lower dimensionality. Meaning, our network here had to somehow learn some patterns inside the data, and now it's able to compress the information and decompress it later.

Let's try to think of it from a different point of view. We want to work with images of faces. Now, images are actually a combination of a few million pixels. But we also know from the old saying that a picture is worth a thousand words. So let's think of it as somewhere in between. We can imagine this network as two friends playing a game of charades. The first friend is some sort of a poet. He sees our original picture and writes us a poem using a thousand words, describing the image in front of us.
The sky was blue, the sun was rising, and so on and so on. The second friend is some sort of a painter, who reads the poem and tries to recreate the original image. The two friends are actually really obsessed with this game, so they train and train, more and more, until they get really good at it and can recreate any image just using the poet's poem and the painter painting it.

I like to think of it a bit like how the police do sketch drawings. Imagine you've been a witness to a crime, and I hope you won't be. The police don't expect you to know how to draw the face of the criminal you just saw. I can't draw faces. I can maybe draw a smiley face on a piece of paper. But by sitting together with the police sketch artist, we can both achieve our goals. All I have to do is describe the face to the police sketch artist in a meaningful way, describing the most interesting features that the sketch artist knows how to draw. So I start off by describing the criminal's hair, maybe from a predefined list of hairstyles. Then I can describe the criminal's eyebrows. Then I describe his eyes, their color, and maybe some glasses if he had any on him. Then I move on to describe his nose, his mouth, and the relation between the two, and the police now have this painting of John Lennon and can arrest him for some bizarre reason.

So let's do the same with our encoder-decoder network. We train our network to take an image of a face and return to us the same image on the other side. The network had to somehow learn how to compress the information about a face into the most meaningful features, how to describe it. So maybe it found stuff like noses, eyes, and mouths. We don't really know what the different features are. We never really trained our network to look for noses, but we know we are getting a feature vector of size 128 which describes the face. So with a good description of the face, we can now move on.
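The poet-and-painter training loop can be sketched with a toy autoencoder. A minimal sketch, assuming random numbers as stand-ins for images: a real system would use deep networks on face pixels with a 128-dimensional bottleneck, while this linear version squeezes six numbers through a two-number bottleneck and learns to reconstruct them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # toy inputs (stand-ins for faces)
W_enc = rng.normal(scale=0.1, size=(6, 2))    # encoder: 6 -> 2 (the "poet")
W_dec = rng.normal(scale=0.1, size=(2, 6))    # decoder: 2 -> 6 (the "painter")
lr = 0.05

def reconstruction_loss():
    Z = X @ W_enc                     # compressed representation
    return ((Z @ W_dec - X) ** 2).mean()

loss_before = reconstruction_loss()
for _ in range(1000):
    Z = X @ W_enc
    err = 2.0 * (Z @ W_dec - X) / X.size   # gradient of the mean squared error
    grad_dec = Z.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec                 # the painter gets better at painting
    W_enc -= lr * grad_enc                 # the poet gets better at describing
loss_after = reconstruction_loss()
print(loss_before, loss_after)
```

The only way the pair can reconstruct the input through the bottleneck is by learning which patterns in the data matter, which is exactly why the compressed layer ends up being a useful description.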
We can break our network into its two parts, the encoder and the decoder, the poet and the painter. We don't really need the painter anymore. We don't care about recreating faces. And we're left with an encoder network that can take in an image of a face and provide us a feature vector describing it.

Okay, but how do we compare different faces? So let's talk about the face space, which is a really cool name for a movie or something like that, if you ask me. So we said we have a 128-feature vector. Let's reduce it to three dimensions, because our monkey brains can't really comprehend anything in more than three dimensions. And we're left with the face space. This space here describes different faces. We can think of each of the axes as something that describes a different feature. Maybe one of them describes the nose, one of them the width of the face, and so on. So now, when I want to compare two different faces, all I have to do is look at the feature space. I have my original image that I took on the first day, when I bought my iPhone. I get some point in the space. Now, to compare another face and check if it's me or not, I have to allow some tolerance here, some distance within which I'm saying it's still me. So I'm drawing a sphere around this point. This is actually Euclidean distance, for those of you who did some mathematics. So now, when I have another scan of me, me tomorrow, for example, I can just check if this point is actually inside the sphere. If it is, great, I can unlock the phone. If one of you tries to take my phone and access all my private stuff, you'll probably get a point further away. You're not inside the sphere, so you can't really access my phone.

Let's try to make it a bit more concrete with an example. We wanted to check how closely Ashton Kutcher, portraying Steve Jobs in the 2013 movie Jobs, actually resembles Steve Jobs. We wanted to check how good of a casting choice it was. So we compared the different faces using our method here.
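The matching step above is just a Euclidean distance check against a tolerance radius. A minimal sketch, with a made-up threshold and random vectors standing in for real encoder outputs:

```python
import numpy as np

THRESHOLD = 0.9  # tolerance radius; a real system tunes this value carefully

def is_same_face(enrolled, scan, threshold=THRESHOLD):
    # Inside the sphere around the enrolled vector -> same person.
    return np.linalg.norm(enrolled - scan) <= threshold

rng = np.random.default_rng(1)
enrolled = rng.normal(size=128)                          # scan from day one
tomorrow = enrolled + rng.normal(scale=0.05, size=128)   # small daily variation
stranger = rng.normal(size=128)                          # an unrelated face

print(is_same_face(enrolled, tomorrow))   # small perturbation: inside the sphere
print(is_same_face(enrolled, stranger))   # far away: outside the sphere
```

The threshold is the whole trade-off: too small and you can't unlock your own phone after a haircut; too large and other people's faces start falling inside your sphere.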
We saw that pictures of Steve Jobs over the years were actually pretty close to the original Steve Jobs image we used, well underneath the threshold. So Steve Jobs can unlock his own phone, obviously. But Ashton Kutcher, on the other hand, is not very far off the threshold, but he is off it. So you can't really use Ashton Kutcher's face to unlock Steve Jobs' phone. And I know what you're all asking yourselves right now: how does my face compare to Steve Jobs' face? Well, my face is right over here. So you can see that they actually did a pretty good casting job; they found someone who looks pretty similar to Steve Jobs.

Okay, let's do a quick recap. And if you've lost me so far, now is a really good time to come back. Now we can create a system just like Face ID. We have a system that scans the user's face, provides that information to the encoder network, and we get back a feature vector that describes the face. And we know it's a good description of that face, because in some alternate reality where we still have access to the original decoder, we could recreate the picture of you, and only you, just by looking at this feature vector.

So with a system just like Face ID in hand, we asked ourselves, what could go wrong here? How can we break the system? We thought about it quite a bit, until we ran into this face. You've all probably seen this face before. It's a face that was generated by a computer in a 2014 research project, and it's supposed to represent the most generic, average white male face. We asked ourselves, what does it mean that a face is really generic? What will it do to a system like ours when we provide it an average face? So we ran an experiment. We collected a huge data set of faces, a lot of faces, thousands of them. We chose a smaller subset of faces from those faces, and we ran every face from the smaller set against every face in the big set. We wanted to check how many hits we could get, how many phones each face could unlock.
We got some pretty interesting results. You can see that most faces got exactly one hit: they were able to unlock exactly one phone, their own. No surprise that some faces got a few more hits, somewhere between two and four, which is fine. But this one face here, at index 27, got more than 15 hits, which is insane if you think about it, and you probably won't be surprised to hear that this face is actually the generic face from before. So we had a strong hunch that something was going on here. We started collecting a lot of information about the different faces. We collected all the different feature vectors of all the different faces, ran some analysis tools over them, some complex mathematical models, and we got the distribution of each of the features. Most features were normally distributed, which might not make a lot of sense at first, but if you think about it a bit more, it makes a lot of sense. The network is probably taking the same shortcuts we do as humans. When I try to identify one of my friends, all I do is look at their nose, their beautiful eyes, and their charming smile, and I know who I'm talking to. So the network is probably doing something very similar; it's probably looking at those kinds of features. If, for example, the network chose to use something like noses, we know that noses are normally distributed. The average nose is five centimeters long, or two inches if you don't believe in the metric system, Americans. And most noses are somewhere between three and seven centimeters long. It's very rare to see a nose longer than that. So we still don't really know what the different features are, but we now know the distribution of each one of them. We know the probability of getting each value.

Let's take a quick side note. A brute-force attack is an attack where you try all the different combinations of a password.
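The analysis step above, fitting a distribution to each feature, can be sketched like this. The data here is synthetic stand-in data, not vectors from the actual research, and the independent-Gaussian assumption is the simplification the talk describes:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the collected feature vectors (5000 faces, 128 features each).
vectors = rng.normal(loc=0.3, scale=0.8, size=(5000, 128))

mu = vectors.mean(axis=0)      # per-feature mean: the most common value
sigma = vectors.std(axis=0)    # per-feature spread

def log_likelihood(v):
    # How "common" a face is, assuming independent normally distributed features.
    return float(-0.5 * (((v - mu) / sigma) ** 2
                         + np.log(2 * np.pi * sigma ** 2)).sum())

generic_face = mu                      # the all-average face
random_face = rng.normal(size=128)     # an arbitrary point in face space
print(log_likelihood(generic_face) > log_likelihood(random_face))
```

Ranking candidate vectors by this likelihood is what turns "most common face first" into something a program can actually compute: the vector of per-feature means is, by construction, the most probable face under this model.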
You start off by hitting 1111, then 1112, then 1113. This was hand-animated, by the way, so I hope you appreciate that. You try all the different combinations until you get the right one. In the case of a four-digit passcode like this, you have 10,000 possible combinations. You can go with a slightly more sophisticated variant of that attack, where you start with the most common password, which is 1234, and you work your way from the most common password to the least common one, from the password with the highest probability of getting a hit to the one with the lowest.

Okay, what if we could do the same for faces? Well, we just saw the equivalent of 1234 in faces. This is the generic face, which we thought is the face with the highest probability of getting hits. So with the knowledge we just gained about the distribution of each of the features, we can do the same. We can sample faces from the distributions we just found, find the most common face, and work our way from the most common face to the least common one. So let's do a brute-force attack on faces.

All we have to do now is build the attack. As we said, we had an encoder-decoder network that was split into two parts. The original encoder is some black box now that we don't really have access to anymore; we can just call it and get the results. And the original decoder is gone. So we had to switch things around and create a new network. This new network takes as input a feature vector and creates an image of a face out of it. Then we feed this image to the black box and get back a feature vector. And we train our network in such a way that, if everything went correctly, we get back the same feature vector we started with. At the beginning, we got some shitty results. You can see here some random noisy image and some random numbers that correspond to it.
But over time, our network got really good at producing images of faces. You can see it first got the general shape of a face, then it was able to compose the different features of the face, like the eyes, the nose, and so on. We got a network that got better and better, and now it's good enough that it can produce images of faces that not only fool the encoder into thinking this is a real face, but make the encoder think this face actually has the same feature vector we started with. We can now control the input to the system.

Okay, I promised you some really interesting results at the beginning. The results I'm going to show you now are really shocking, so hold onto your seats. We were able to get successful logins 13 times in a million attempts. Roughly one in every 100,000 attempts, on average, was a successful login. Let it sink in for a moment. That's just like correctly guessing a five-digit passcode. This is insane. Okay, here's the attack. You and I won't be fooled by those faces. We can tell those are fake. But by generating thousands and millions of those generic, average faces, we were able to get some hits. We were able to fool the system into thinking that those are real faces that are actually enrolled on the phones. By the way, if you stop on any of those faces for more than a second, you can probably see someone you know in them. Maybe it's your friend, maybe someone sitting next to you, because those faces are really generic. You can see a lot of people inside of them. So we got to the real sad conclusion that no one is safe.

Well, I have a couple of disclaimers about it. First of all, it's really important to say that this research was educational only. We ran it inside the university, on a system that was provided to us by the university, where everybody signed an agreement to allow us to try to hack their faces. So we never really ran it against the real world. We never really ran it against Face ID. That was never really the point.
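The brute-force loop over faces can be sketched as below. Everything here is a toy stand-in: a tiny feature dimension and made-up thresholds so that hits are even possible, with no real encoder or generated images involved; only the headline arithmetic at the end comes from the talk's own numbers.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM, THRESHOLD = 8, 1.5                    # toy values, not the research's

enrolled = rng.normal(size=(1000, DIM))    # simulated enrolled feature vectors
mu, sigma = np.zeros(DIM), np.ones(DIM)    # the "learned" feature distributions

hits = 0
for _ in range(500):
    candidate = rng.normal(mu, sigma)      # sample a plausible synthetic face
    dists = np.linalg.norm(enrolled - candidate, axis=1)
    hits += int((dists <= THRESHOLD).sum())  # phones this candidate would unlock
print(hits)

# The talk's headline figure: 13 successes in 1,000,000 attempts is about
# 1 in 77,000, which the talk rounds to roughly 1 in 100,000 -- the same odds
# as guessing a 5-digit passcode (10**5 combinations) in a single try.
print(round(1_000_000 / 13), 10 ** 5)
```

Because probable candidates come up most often when sampling from the fitted distributions, this loop naturally spends its attempts on the "1234 of faces" first.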
Face ID is too much of a hassle to work with. It comes with complementary systems, like locking you out after five failed attempts. Also, we don't really know if Face ID works exactly the same way. No one has ever really disclosed how Face ID works behind the scenes. Another thing that is important to say is that this entire research was conducted back in 2017, meaning we didn't have access to all the latest models, and we all know how quickly the field of AI is moving nowadays. So they probably got better results by now. They probably have more secure models that can prevent such attacks, right? Well, the other side of the same coin is that since this entire research was conducted in 2017, we didn't have access to all the latest models either. And nowadays we know a lot more about generative AI and deepfakes, and we could probably make our attack even better and fool the most sophisticated systems that exist today.

One last point I would like to make, and this is a question I get a lot: what about 3D images? The original Face ID works with a 3D camera. Well, we believe that our attack here is really generic and, with some tweaking, can fool those systems as well. If you think about it, the system doesn't really care about the raw information it's getting. It doesn't really care about the 3D information. All the system cares about is the compressed information we're getting from the encoder. So we believe that with some tweaking to our system, we could change the attack to generate 3D images that correspond to the same feature vector that a real 3D scan would. So we believe we can break into those systems as well.

Okay, thank you very much. I hope you learned something new today. In this cringy animation, you can see me and my partner in crime in conducting this research, Roy. Yeah, we are both named Roy. Feel free to come by later outside, or just send me an email with any questions. Thank you very much.
Thank you, Roy, for shedding some light on how the devices in our pockets essentially work. We won't take questions on the spot, so please give another round of applause to Roy.