Bonjour, my name is Elie, and today with Jean-Michel we're going to talk about side channel attacks using deep learning. We are both part of the Google security and anti-abuse research team. Side channel attacks are one of the most efficient ways to attack secure hardware, because they target the implementation rather than the algorithm, and implementations are usually harder to secure than the algorithm itself. For example, last year Jochen Hoenicke — I hope I pronounced his name right — did a side channel attack to recover the private key of the Trezor crypto wallet, which shows the power of attacking the implementation rather than the algorithm. The problem with side channel attacks is that they require a lot of domain expertise, and because they target a specific implementation, they're not very reusable. So we have a super cool class of attacks, but they are very specialized. The question is: can we do something about it? Can we make them more generic and more reusable, so we can apply them across a wide range of devices? And the answer is yes — we believe that deep learning, a form of machine learning, is the answer to making side channel attacks more reusable and more efficient. We can see some of you frowning in the audience: "AI, really? One more talk about that?" But trust us, deep learning is a really, really good fit for cryptanalysis and hardware analysis, and we hope to show you that today. Fundamentally, the reason AI is such a good fit for performing side channel attacks is that it gives you template attacks — which are the state of the art — on steroids. What we mean by that is that deep learning helps us address a few problems.
First, when you use deep learning you do not need any kind of trace processing: you don't have to cut the traces, you don't have to align them — the machine learning takes care of that, which is one of the biggest problems when you target a secure implementation. Second, we can get rid of a lot of assumptions, such as the attack model — for the people in the know, the Hamming weight model is not needed. You can target part of the algorithm directly with deep learning, so a lot of the human assumptions are out of the loop, and as a result your attack is more efficient and easier to understand and carry out. Third, because deep learning outputs probabilities, when we perform the key recovery itself — as we're going to show you — we can combine those probabilities into a very nice and efficient probabilistic attack, which is much harder to do with standard techniques. And last but not least, deep learning comes with a lot of metrics, and we can watch the algorithm progress, so we get a better sense of how well it is learning or attacking the hardware — way more interesting metrics that we can leverage. We'll show that as well during the talk. There is also a more fundamental reason why deep learning is such a powerful tool, and why you keep hearing people say "deep learning, deep learning, deep learning" everywhere: as far as we can tell, deep learning is the only technology we have — period, for any domain — that actually scales with data and compute. The more data and the more compute you have, the better deep learning becomes, and it's the only technology we have like that; that's really why it has become so powerful. And you're like, "okay, fine, I can see that for cat classification, but what does that have to do with attacking a side channel?" Well, it means we can push attacks further with more traces.
The more traces you have of your implementation — and we can always generate more traces, right? It's just a bunch of crypto computations — the better it becomes. So even for hard targets we have a way to scale up, because deep learning scales with data: all we have to do is generate more data, which for this kind of attack is very easy to do when you have the board. In a way, it's a perfect fit. If all of this is a little bit blurry right now, don't worry about it — as we go through the talk it's going to become clearer and start to make sense. So today, because it's DEF CON, we got rid of all the equations and all the complicated stuff, and we're going to focus on showing you, step by step, how you can do it, and on showing you that it works. A lot of people have attempted to use machine learning for side channels. Many people have failed — we failed a lot; it's been two years since we started working on this, so we failed a lot. But when it works, it works, and we wanted to show you step by step how to get there. So this will be very practical: no equations, no boring math, just straight up "if you were to code it, this is what it would look like." We call these attacks Side Channel Attacks Assisted with Machine Learning, also known as SCAAML. During the talk we're going to say SCAAML, because it's a shorter and nicer name. This talk is based on a long project we started internally about two years ago, with a bunch of internal and external collaborators, around improving and hardening our crypto hardware, to provide the best and most secure hardware we can. We are making the slides public — they are already on the web, and we also tweeted them, so you can have them today. And hopefully very soon we're also going to give you the source code we used to carry out the attack, so you can run it yourself.
And hopefully, in the not too distant future, we'll also make available the full research paper, as well as all the datasets we collected and the models we trained, so you can actually reproduce our results. One last disclaimer before we delve into the matter: Jean-Michel and I decided to focus on showcasing how to get the attack working, and as a result we don't use state-of-the-art machine learning or state-of-the-art processing, because we think it's more important to focus on clarity. To be clear, the attack works 100% of the time, but we really wanted to focus on giving you a clear understanding, so don't expect super advanced machine learning here — though it's probably still fairly advanced. Okay, so here's how today is going to go: Jean-Michel will talk about what side channel attacks are and what deep learning is, so we're all on the same page; then I'll come back and show you how to combine both techniques to do a SCAAML attack; and then we'll briefly talk about the research itself and what to expect next. Jean-Michel, take it away. — Thank you, Elie. First, let's step back and agree on a definition of what we mean by side channel attack. Every time you do a computation with an embedded device, a CPU, etc., it results in some artifacts created by that computation. If you have a way to measure them, and there is a correlation between those artifacts and a secret you want to extract, then you can conduct a side channel attack. It can be used to recover encryption keys, which is what we're discussing today. It has also been used in the past to perform blind SQL injections. You can also steal passwords and PINs, or extract crypto wallet private keys. It can be used for a wide variety of things. So we have a component — an electronic component.
For cryptography, we feed this component with plaintext data; it has a secret key inside that we don't know and want to extract. At the end we get some artifacts, which we call leakage. This leakage can be, for example, timing: in a cache timing attack, your execution lasts longer or not depending on whether the data is in the cache, and by measuring that you can recover one bit of data. You can also measure the power consumption during the encryption of the data — that's the one we are using in this presentation. Theoretically you could also use heat, but the problem with heat is that it spreads and has a huge latency. So while you can use an infrared camera on a big CPU to see exactly which part is warming up — to locate where your crypto accelerator sits — heat cannot be used to recover an encryption key, as far as we know. Instead of heat we prefer electromagnetic emission, which is more precise and doesn't have the same side effects, but we're not using it in the attack we're showing you today. So, side channel attacks in a nutshell: we do some encryption with the microprocessor, we measure the current consumption with an oscilloscope, we accumulate that over time with different plaintexts, and — for the state-of-the-art approach — we run a template attack. A template attack still requires a powerful attacker, because most of the time, to train a template attack, you need a probabilistic model trained on what is called a white-box chip. A white-box chip is a chip on which you can control the encryption key. You train your probabilistic model there, then apply it to the chip where you don't know the key and want to recover it. And at the end, your probabilistic model will give you the AES key.
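To make the template-attack idea concrete, here is a minimal pooled-Gaussian template sketch — a common way such probabilistic models are built. This is our illustration, not code from the talk's project; the function names are ours, and a real attack would use selected points of interest rather than whole traces.

```python
import numpy as np

def build_templates(traces, labels, n_classes=256):
    """Train templates on the white-box chip: one mean trace per candidate
    value, plus a pooled (shared) noise covariance across all classes."""
    means = np.stack([traces[labels == v].mean(axis=0) for v in range(n_classes)])
    residuals = traces - means[labels]          # per-trace noise around its class mean
    precision = np.linalg.pinv(np.cov(residuals.T))
    return means, precision

def template_scores(trace, means, precision):
    """Score one trace from the target chip: log-likelihood of each candidate
    value under its Gaussian template (up to an additive constant)."""
    diff = means - trace
    return -0.5 * np.einsum('ij,jk,ik->i', diff, precision, diff)
```

The candidate with the highest score is the template attack's guess; summing scores over many traces sharpens it, just as the deep learning attack later in the talk accumulates probabilities.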
As we see on these theoretical curves — not real ones — most of the time the traces overlap, but at the top of the curve we have some discrepancies. Those discrepancies hopefully happen on a clock edge of the CPU, which means it's the registers that are changing values, creating different power consumption at a given time depending on which bits are set. That's the kind of thing you're looking for. Ideally, if that point in time depends only on a part of the plaintext that the attacker controls and a part of the key that you want to recover, that's an attack point. Here we see a full trace of a lightly protected AES — AES-128. If we take a deeper look at this particular box, we can see some humps and some repeating patterns, and if we count them, there are actually ten humps. That corresponds exactly to the ten rounds of AES-128 — so you see, it can also be very visual. This is how we were doing it at the beginning of the project; it was sitting on the corner of my desk. We have a PCB with a target chip — the one containing the encryption key we want to recover. Here we have a differential probe measuring the power consumption over time, connected to the oscilloscope here. And here we have the communication interface used to feed the plaintext to the chip. That's all it takes. If you want to get started: Colin, who is sitting in the second row here, is the creator of ChipWhisperer, which is a very cheap and easy way to start with side channel attacks. On the board you have all you need: an oscilloscope, the communication interface, and multiple targets to try. If you're interested, we invite you to talk to Colin — he has a few ChipWhisperer-Nano boards that he can give to some of you.
But most of the time, the limitation of the ChipWhisperer is that it feeds the clock it generates, and the sampling rate of its oscilloscope is limited. When you attack more secure elements, most of the time they won't allow you to feed them a clock or to slow the clock down, which means you will need a very high sampling rate, and for that you will need a real oscilloscope — but the software stack provided by ChipWhisperer is already compatible with that. Also, in our work we do an asynchronous capture. The ChipWhisperer usually works with synchronous capture, meaning you have a single clock that you feed to the chip, and the oscilloscope captures the data at exactly the right moment. Most of the time, if you use an external oscilloscope, this won't happen. You will have exactly what you see on the screen: two different clocks — one for the oscilloscope sampling and one for the CPU — and they will be out of sync. That means that to sample in this setup, the oscilloscope needs to run at least four times faster than the clock of the CPU you are attacking. That's why our setup evolved over time during the project: right now we're using a ChipWhisperer-Pro, which is a bit more expensive than the ChipWhisperer-Lite or ChipWhisperer-Nano kits, and a PicoScope 6000 series. This is not an advertisement — the talk is not sponsored — this is just what we use, so that if you want to reproduce the setup, you know it works. The reason we picked the PicoScope 6000 series is that it can sample at a very high rate — up to five gigasamples per second — so we can attack a chip running at one gigahertz. It also has a very large memory.
It can store up to two gigasamples in memory, which becomes very important when you start attacking asymmetric cryptography like RSA or elliptic curves, because those traces are really big. Now let's move on to what deep learning is. Deep learning is basically a neural network, but with many layers stacked together like pancakes. At the beginning you have a series of neurons, modeled after biological ones. The number of neurons you have defines the width of your layer. Then you start stacking layers like pancakes, which creates depth in your network — hence "deep" learning. Then you add an input layer to process your input, and an output layer to make the prediction — here, between dogs and cats. When you want to use the machine learning model, you feed it, for example, this image of a puppy. It goes to the input layer, which activates some neurons; they propagate their activations to the next layer, and the next, and at the end the output layer issues its prediction of whether this picture was of a dog or a cat — here it will say it's a dog. But different use cases need different types of layers and different network architectures, so things can get quite complex very quickly. So what do you need to train a deep learning model? Basically, TensorFlow will let you write and train your model. It's written in Python, but it's very, very slow to train on a CPU, so don't try that — you will need a hardware accelerator. You can use a GPU, or, on Google Cloud, what we call TPUs — Tensor Processing Units — dedicated ASICs built especially for TensorFlow and deep learning training. We will try to make the demo code available as soon as possible on Colab.
Colab is basically a Jupyter notebook hosted in the cloud; it comes with TensorFlow already set up and free GPU/TPU compute time. And now I'll give the mic back to Elie, who will guide you through your first SCAAML attack. — Thank you, Jean-Michel. All right, now that we have the basics down, let's see how to combine both and make it work in practice. As I said, our goal is to take you through the attack step by step and provide as much information as we can, so you can reproduce it and understand what's going on. The first thing to decide is what we're going to attack. For this presentation we selected the STM32F415 chip. This is one of our favorite boards, because it supports a lot of software implementations but also has hardware implementations you can run, so we can compare hardware-protected AES versus software AES and test many of them. For the implementation, we're going to use TinyAES, an unprotected version of AES, because that's the easiest one to attack and the one where you'll get the best results. There's one downside, which I'll explain in more detail later: because it's software, it's very slow, so the traces are very big, and that makes training harder, because you need more GPU memory. Okay, so Jean-Michel showed you the SCA game plan earlier; we're going to upgrade it into the SCAAML game plan. Same as before: we do encryption on the chip, we hook up our ChipWhisperer and PicoScope, get the data, and feed it to a neural network — we'll need to talk about how to train that thing — and then we combine the predictions, do some sort of inversion and a little extra work to actually recover the key. If everything goes well, you get an AES key — in our case, we'll try to get the TinyAES key.
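To get a feel for why the trace size mentioned above matters, here's a quick back-of-the-envelope calculation. The 80,000-point and 50,000-example figures are the ballpark numbers quoted elsewhere in the talk; your own capture campaign may differ.

```python
# Rough memory footprint of a full-trace training set held in RAM,
# assuming float32 samples (ballpark figures from the talk).
points_per_trace = 80_000   # one software TinyAES encryption is slow, so traces are long
num_traces = 50_000         # roughly the training-set size mentioned later
bytes_per_float32 = 4

total_gb = points_per_trace * num_traces * bytes_per_float32 / 1e9
print(f"{total_gb:.0f} GB")  # 16 GB -- already close to Colab's ~20 GB limit
```

This is why the Colab-specific dataset keeps only a fifth of each trace, and why full-trace training needs GPUs with 12–16 GB of RAM.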
All right, so as I mentioned, TinyAES is great — it's software, it's open source, you can get it — however, it is super, super slow. What that means is that we're going to take about 80,000 points as input, which is fairly large. It's actually larger than most of the images we feed into neural networks for image recognition, so we'll need a somewhat different type of architecture to handle that size. For the demo we're going to provide on Colab: Colab only has 20 GB of memory, which seems like a lot, but for deep learning it's not. So we also have another, Colab-specific dataset coming out, where we basically took one fifth of each trace, so you can load it in memory and play with it. That's more efficient, because you have fewer points to process, so you can fit more in memory. The problem is that it violates our first promise, which was that you can work on full traces. If you want to work on full traces, you need a lot of memory and pretty beefy GPUs — we used GPUs with between 12 and 16 GB of RAM when we trained on the real thing. All right, so what model — what deep learning architecture — do we need? That's the question: okay, I have my data, what do I process it with? The obvious answer is to do the same thing we use for speech processing and image recognition, which is what we call a ConvNet, also known as a convolutional neural network. These work better because they slide across the data, so you don't have to process all the points at once — the number of parameters and connections is lower — and because we really want locality: what we want is to filter features and then stack them. So that's our network of choice.
However, for people who like deep learning, you can also use LSTMs — more precisely, GRUs work pretty well too if you stack them — so other architectures are possible, but this one works well and it's fast, so it's our favorite. What does it look like? A little bit of code: I promised no equations, but I didn't promise no code, so you get a little bit of code. This is TensorFlow 2 code — or Keras, for those who prefer that name. The idea is that we have a bunch of constants which just set up our network; the important one is the last one, called pool size. Pool size controls by how much we shrink the input in the first layer, which I'll explain in a second. First we take the input and convert it into an input for our network's layers, then we do a max pool. What does a max pool do? It takes a sliding window — we use four — and keeps the maximum value out of each window of four. So in effect it shrinks your trace from 80,000 points down to 20,000 points, which you then go and process with convolutional layers, here. They are not the same as for images, because they are one-dimensional. Why one dimension? Because our trace has a single value per point — just the current measurement — we don't have 2D data like an image's X and Y, so we use 1D convolutions. For those who know machine learning, you might recognize that we don't use a plain convolution: we use convolutional blocks with batch normalization in the middle, because it speeds up training. Then we keep stacking a few of those — I think in this example we did five, so we repeat those three layers five times, and you already have a good stack. That's why we call it deep learning: we're already at about thirty layers at this point.
Then we do another max pool — the final one — where all those convolutions, which created a lot of depth (the number of filters, if you like), get combined to find the maximum for each point. Then we add a little bit of dropout and a dense layer to connect everything and make our prediction. The prediction — the last layer — is the output of your network: remember the cat and dog? Here we don't have cat and dog; we have 256 values, and I'll explain what they are in a minute, but basically the network gives you 256 values. The important point here: you might consider using a sigmoid output instead, which is another way to do it. For deep learning applied to side channel attacks, that's a horrible idea. It's a horrible idea because we have only one value which is true. If you try to classify an image, it can contain a cat and a bench and something else, so you have multiple classes; in our case we have a single true class — the value you want to predict — and nothing else, so you have a lot of zeros. So what the machine learning will learn is: if I always output 256 zeros, I get 99.9% accuracy — and your model predicts nothing useful for a very long time. That's why you should always use a softmax, which is what we use, and really that's the best way to do it — don't use sigmoid if you try this at home. Okay, so the one thing I haven't explained is: I have a model, I have data — what am I trying to predict? That's where things get complicated. This is AES — a round of AES, to be precise; we're going to talk about the first round of AES — and the question is: what do we do?
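The model described above might be sketched roughly like this in TensorFlow 2 / Keras. This is our reconstruction from the description, not the exact code from the talk; the constants, filter counts, and kernel sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

TRACE_LEN = 80_000   # points per power trace
POOL_SIZE = 4        # initial shrink factor: 80,000 -> 20,000 points
NUM_BLOCKS = 5       # convolutional blocks stacked "like pancakes"
FILTERS = 16         # illustrative base filter count

def conv_block(x, filters):
    # Conv1D (not 2D: one current value per point), with batch
    # normalization in the middle to speed up training.
    x = layers.Conv1D(filters, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

inputs = layers.Input(shape=(TRACE_LEN, 1))
x = layers.MaxPool1D(POOL_SIZE)(inputs)          # sliding window, keep the max of each 4
for i in range(NUM_BLOCKS):
    x = conv_block(x, FILTERS * (i + 1))
x = layers.GlobalMaxPool1D()(x)                  # final max pool combining the filters
x = layers.Dropout(0.1)(x)
# Softmax head, one probability per candidate byte value -- never sigmoid here,
# since exactly one of the 256 classes is true.
outputs = layers.Dense(256, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
```

As the talk warns later, a hand-built model like this one is unlikely to attack successfully as-is; it only illustrates the shape of the architecture.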
As Jean-Michel explained, we measure power consumption, and the question is: why does power consumption reflect the key? What happens is that setting a bit in memory to one takes power, while setting a bit to zero takes essentially none. So you need more power to set eight bits to one than to set four, or two. So we need to find every point in the algorithm where some memory value changes — those are the points where the power differences kick in and we'll have something differentiable to look at. Fair? Okay, so where are those points? First, when you feed the key: the key gets loaded into memory, so that's the first point. Then you do something that combines the key with the plaintext — that's an XOR — and when you do that combination, again, there's some loading. Then you have the S-box, and coming out of the S-box you load the new value into memory — another point. Then there's one more point before the key scheduling and one after it. And you repeat all that sixteen times, because there are sixteen bytes in each of those steps, and in a theoretical model you could repeat it ten times, because there are ten rounds. However, there are only two rounds you can use to recover the key directly: the first one and the last one. The ones in the middle are way more complicated: in theory, cryptographers consider them attackable because they leak, but they are really hard to exploit. It might become possible in the future, but for now we're just going to focus on the initial round, where there are three places — three types of prediction — you can use. The first one, what we call attack point one, is to just predict the key.
You say: hey, give me back the value of the key. That's the easiest one, because it's a direct prediction. The second one is: predict the value of the key XOR the plaintext — and that's why I said earlier that you need a little computation after the prediction: you have to XOR the plaintext back out. This only works for attacks where you have a chosen plaintext, which is what we're demonstrating today. Finally, the last one, which is also very common, is to go after the S-box output. In that case you have to invert the S-box and invert the XOR, so it's a little more computation, and you have the key. Don't worry about it — we have implemented all three types of attack in the code we're going to give you, hopefully soon, and you'll be able to run them. So where does that leave us? With this idea: we take the power trace, feed it to the network, the network runs all those crazy layers I showed you, and outputs a value through a softmax. What is a softmax? Basically, the softmax takes your 256 values and normalizes them so that their sum is one. The reason for summing to one is that it makes a nice, smooth probability distribution, and constraining the output makes every prediction comparable — which will be important later. It's a standard machine learning technique, by the way, so it's already built in. So the model makes a 256-value prediction for each attack point value. If we target the key, each of those outputs corresponds to one possible value of a byte. One last thing I need to tell you — and this is where it becomes, not complicated, but a little different from what you might expect — is that we attack one byte at a time. So if you were to attack a full key, you would have to train sixteen models.
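The "little computation after the prediction" for the three attack points described above can be sketched as follows. This is our illustration, not the talk's code; the predicted class is the argmax of the 256-value softmax output, and in the test below a toy permutation stands in for the real AES S-box.

```python
def key_from_ap1(pred):
    # Attack point 1: the model predicts the key byte directly.
    return pred

def key_from_ap2(pred, plaintext):
    # Attack point 2: the model predicts key ^ plaintext, so XOR the
    # (chosen, hence known) plaintext back out.
    return pred ^ plaintext

def key_from_ap3(pred, plaintext, inv_sbox):
    # Attack point 3: the model predicts sbox(key ^ plaintext), so invert
    # the S-box, then invert the XOR. inv_sbox is the 256-entry inverse
    # of the forward AES S-box (table omitted here for brevity).
    return inv_sbox[pred] ^ plaintext
```

Note that attack points 2 and 3 both require knowing the plaintext fed to the chip, which is why this is a chosen-plaintext attack.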
One for each byte of the key: one for byte zero, one for byte one, one for byte two, one for byte three, and so forth. And you're like: why? Well, we have 256 values for a single byte; multiply that by sixteen and you'd have over four thousand values to predict. That's not useful. We could do it, but it pushes the model to have much higher capacity, because now it has to learn all the non-linearities needed to predict all those values. Your model is going to be bigger and harder to train, so it's easier to use one model per key byte. A question we've been asked quite a bit: are any bytes harder than others? The answer is mostly no, but in some implementations it seems that the bytes at the boundaries — four, eight, and twelve — are a little different, probably because of the powers of two. In some implementations they seem a little weirder, but in the end they all work the same. Okay. So you do all that, you're ready, you have your code, you click fire, and then you wait — because it takes hours. Sorry, it's not that easy: training a machine learning model takes time. You train on about fifty thousand examples. You come back and you cry. It doesn't work. Really — the model I showed you? It just doesn't work. And then you're like, okay, it should work — why isn't it working? You try many variants, and none of them work. That's where most people have stopped, and we've seen people conclude that it just doesn't work. And that is true in a sense: learning crypto is hard, and you're going to fail a lot, and a lot, and a lot. For about a year, we had only failures to report. We knew it should work — we just didn't know why it didn't. And to be honest, we still don't know why. However, we have a way around it. So how do we deal with it, if we can't find the architecture by hand?
Well, the answer is: don't do it by hand — rely on hyperparameter tuning. Hyper tuning is the idea that you fuzz your model architecture to find the optimal one: you keep trying until you find the right one, throwing more CPU and more computation at it. Even for this simple example, we had to do that. I really wish I could show you a model that is super easy to understand and works right away — I don't have one. What I did is train about a thousand models on Google Cloud, using Kubernetes and Keras Tuner — the hyperparameter tuner we developed with the TensorFlow team for TensorFlow 2 — and I selected one that works. So, as I said, we have a super amazing model to show you; however, it's not going to be as simple as the previous one. You brace yourself: okay, what do you mean by more complicated? Well, this is a simplified view of it. It's not that bad — kind of. It's a few hundred layers now. Same as usual, we have the input, and then a max pool — pool size six this time; I don't have the space to show everything. Then we do two normal convolutions, which help us decrease the number of points and increase the number of filters. Then we do what we call residual blocks. The idea of residual blocks is that you add shortcuts into the network to help the gradient — the gradient is how, when the model makes an error, that error gets backpropagated through the network. One thing we found is that the gradient vanishes a lot when training on crypto, so we use shortcuts to retain the gradient. This is something they also do for images — it's called ResNet, for residual networks. So you do a few of those, and here what happens is that you decrease the size of the trace and increase the number of filters: decrease the width, increase the depth. And then you need a lot more residual blocks — but those are not the same. Why not?
Well, in that case the number of points does not decrease, but you keep increasing the depth. So basically you do a combination of all of those, again with residual connections; you do a max pool; you do a dense layer; you remove the dropout, because it doesn't help at all; you end with the 256-value output; and you have something that works. That is the simplest example I can show you that works. And I cannot find it by hand, and you ask: what's the intuition? The answer is: I have no intuition. Absolutely zero. Don't know. For some reason it's very, very finicky, and we have some that work. So empirically it works; the intuition is not easy to get. And the cat is happy, and we're happy too, because now we have something to show you that works. How well does it work? Well, this is the trace off the board, and the model is going to do nothing for a few epochs. Another thing: don't get discouraged when you train models for crypto. For four or five epochs the thing does nothing — I mean, it kind of goes to 5% maybe, then goes down, goes up; it's not really good. And then it starts to learn, and it spikes to about 35% accuracy around 30 epochs — at 5 minutes per epoch, that's roughly two and a half hours to get your result. And then the thing collapses. This is validation accuracy, by the way: accuracy on traces the model hasn't seen during training, which we use to monitor the generalization of the model. And yes, it collapses — all our models collapse at the end. We don't know why that happens. So you take the model at its peak: you checkpoint it, save it, and that's your model. We also tried data augmentation. Sometimes it works, sometimes it does not. A lot of people say it does. It kind of does — except that if you get it wrong, it completely destroys your model's ability to learn, because it seems to destroy some of the non-linear correlations you need.
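The 1D residual blocks described a moment ago might look like this in Keras. Again, this is a hedged reconstruction for illustration, not the tuned architecture found by Keras Tuner.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_1d(x, filters, kernel=3):
    """Conv1D residual block: the shortcut connection lets the gradient
    flow past the convolutions, which matters because the gradient tends
    to vanish when training deep stacks on crypto traces."""
    shortcut = x
    y = layers.Conv1D(filters, kernel, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv1D(filters, kernel, padding='same')(y)
    y = layers.BatchNormalization()(y)
    # Match channel counts on the shortcut if needed (1x1 convolution),
    # so the two branches can be added.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding='same')(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)
```

Stacking many of these — some that shrink the trace while widening the filters, then some that only deepen — gives the "few hundred layers" shape described above.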
So do not use data augmentation unless you have to, or unless you want to push further and already have a good baseline. So, let's go back for a second to what I mentioned earlier, which was that we have multiple attack points, and answer a question: does it make a difference whether you attack the key, the key XOR the plaintext, or the output of the S-box? The answer is yes, it does. If you try to predict the key directly, it just doesn't work. If you try to predict the plaintext XOR the key, it actually works really well; the one I showed you at 35% works really well. The SubBytes output, which comes after plaintext XOR key, spikes a little bit later. So both work on this implementation. However, if you were to use a different implementation it might be completely different; it depends on the implementation and on where the leakage is coming from, and you don't know in advance where the leakage comes from, so that's another reason why we need to scale computation: we don't know. So, let's finish by saying: okay, now we have a working model, how do I recover the key? Well, as I said, we're going to mount this very nice probabilistic attack using our predictions. The way it works is: we take a trace, feed it to the trained model, which says "here are my predictions", and we accumulate them. We do this over a few traces; you can go to one thousand, two thousand, three thousand, but really we don't need that much. Then you sum them, and the maximum accumulated value is your best prediction. What is very interesting about deep learning is that we also know the second-best value, the third, and the fourth.
So even if you don't get it right, you can brute force, and it's very easy to brute force because you know in which order to test each byte: the machine learning tells you which value is the most likely, the second most likely, the third most likely, and this is something unique to deep learning compared to template attacks. Also, a detail we glossed over: we use log10 of the probabilities when accumulating traces, because that's the correct way to do it mathematically, just for the people who want to reproduce this. Then, an important thing we found that very, very few people did in the literature is making sure the attack works across chips. The way we do that is we train the model on one chip, but when we do the real attack we do it on a second chip, because we want to make sure the model generalizes across the differences that exist between chips from the same family. You do not want to test on the same chip, because then all you may have learned is a model which depends on the specific chip you have. So we use two chips: one for training, one for testing. The last thing we need to mention is how we measure, on the holdout data set from the second chip, the effectiveness of the attack. There are four metrics. The first is top-1 accuracy: how many of your predictions are the top guess. Then top-5: how many predictions are in the top five. Then mean rank: what is the mean rank of the correct key byte? Remember you have 256 predictions, and crypto output is supposed to look random, so a mean rank of 128 is your baseline. And then max rank: has the machine learning shrunk the search space? The max rank gives you an idea of that. Now, one thing we do which is also very different is that we try to recover a hundred keys. The reason is that we want to make sure we can recover a wide range of keys, rather than one specific key, to make sure we generalize.
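The four metrics can be computed directly from the per-key score vectors. This is a self-contained sketch; the helper names and the tiny three-key example are hypothetical, standing in for a real holdout set.

```python
def rank(scores, true_byte):
    """Rank 0 means the true byte is the top guess; random guessing
    over 256 classes averages around rank 128."""
    return sum(1 for g in range(256) if scores[g] > scores[true_byte])

def attack_metrics(score_vectors, true_bytes):
    """The four evaluation metrics: top-1 accuracy, top-5 accuracy,
    mean rank, and max rank, over the holdout keys."""
    ranks = [rank(s, t) for s, t in zip(score_vectors, true_bytes)]
    n = len(ranks)
    return {
        "top1": sum(r == 0 for r in ranks) / n,
        "top5": sum(r < 5 for r in ranks) / n,
        "mean_rank": sum(ranks) / n,
        "max_rank": max(ranks),
    }

# Made-up example: three keys, two recovered at rank 0 and one at rank 2.
def scores_with_order(ordered_bytes):
    s = [0.0] * 256
    for i, b in enumerate(ordered_bytes):
        s[b] = 100.0 - i  # highest score for the first listed byte
    return s

vectors = [scores_with_order([7]), scores_with_order([42]),
           scores_with_order([1, 2, 99])]
print(attack_metrics(vectors, [7, 42, 99]))
```

A max rank well below 255 is what tells you the model has compressed the search space enough for cheap enumeration even when top-1 fails.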
So the holdout data set is made of a hundred keys, and we use 300 power traces that we can accumulate, and we see how well we do. So, do we do well? You know, the cat is very anxious; he tries, and then, yes, you can actually recover one hundred percent of the TinyAES keys using the attack I showed you. One hundred percent: all of them reach rank zero, because, well, we have all of them. Moreover, what is perhaps more impressive for people who know side channel attacks is that while the model has only 30% accuracy, which is a very deceptive metric, we recover one hundred percent of the keys with no trace processing, and in at most four traces. Template attacks, as far as we can tell, score about five to six traces. So you are better than the state of the art even with a simple model, and, perhaps more impressively, if you use a single trace you get 81% of the key bytes correct. That shows you how much more powerful deep learning is for side channel attacks, at least on AES, and that's really why we claim it's going to be the future of the field, and why we are trying to get more people excited to work with us on it. To wrap up, because I think we have two minutes left: we also need to deal with protected implementations, which are not the focus of this talk but are a very active field of research, and to deal with traces which have no visible pattern you need something much more powerful. That is what we are doing next: we have built a large test bed with six pieces of hardware and six implementations, including a few hardware ones and protected ones, collected over nine million traces, and trained over 5,000 models. We hope to publish all the results, all the models, and everything we learned in a paper soon, so you can check our results, reproduce them, and help us work on the next step.
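The "at most four traces" figure comes from measuring, per key, how many accumulated traces it takes before the true byte reaches rank 0. A self-contained sketch of that measurement, with a made-up two-trace example standing in for the real 100-key by 300-trace holdout:

```python
import math

def rank_of_true_byte(scores, true_byte):
    """Rank 0 means the true key byte is the model's top guess."""
    return sum(1 for g in range(256) if scores[g] > scores[true_byte])

def traces_to_recover(per_trace_probs, true_byte):
    """Smallest number of accumulated traces after which the true key
    byte reaches rank 0, or None if the budget runs out."""
    scores = [0.0] * 256
    for n, probs in enumerate(per_trace_probs, start=1):
        for g in range(256):
            scores[g] += math.log10(max(probs[g], 1e-30))
        if rank_of_true_byte(scores, true_byte) == 0:
            return n
    return None

# Hypothetical holdout: one key whose first trace points at the wrong
# byte and whose next two traces point at the right one.
def noisy_probs(favored, bump):
    probs = [(1.0 - bump) / 256] * 256
    probs[favored] += bump
    return probs

holdout = {0x5A: [noisy_probs(0x11, 0.01), noisy_probs(0x5A, 0.05),
                  noisy_probs(0x5A, 0.05)]}
for key, probs in holdout.items():
    print(hex(key), traces_to_recover(probs, key))
```

Running this over every holdout key and taking the maximum is what yields a statement like "all keys recovered in at most four traces".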
So the takeaway is: deep learning is the future of SCA. I hope we convinced you with the results; the advantages of deep learning over traditional techniques are really, really big. It is a big leap, very different from what we did before, but it's really worth it. It is also really hard; you fail a lot, so don't get discouraged, and you really need automation. We don't know how to do it without automation. We're not making the pitch of "hey, let's use a lot of GPUs"; it's just that we don't know how to do it otherwise, and this is how we found our way over the last few years. And we really believe this is only the beginning: we are aware of other researchers working on symmetric and asymmetric analysis with deep learning and finding great success, so this is really where the field is heading. And really, if we can automate side channel attacks, then we'll be able to go back and focus on designing secure implementations of crypto that get tested very efficiently, so we can provide the world much stronger crypto, meaning strong implementations, which is really what we want. We really want to provide very strong chips that we can trust to put our secrets in. So thank you so much for attending today; we're really happy you took an hour to be with us, and we hope some of you are inspired to do some of this work with us or on your own. We put up a slide with the resources we have. Thank you so much.