Okay, so my talk today is about non-profiled deep learning based side-channel attacks, and for the presentation I will use a Python notebook to run some demos during the talk. Before we go to the demo, a bit of introduction and motivation. In side channel we usually divide attacks into two categories. On one side, what we call non-profiled attacks, when we only have one device available, which is closed: we cannot change the key, and we usually attack on the fly with methods like CPA, for example. On the other side, we have profiled attacks, where we have access to a profiling device which is open: we can change the key, and we use this device to characterize the leakage of the device. Most of the time, profiled attacks are performed using machine learning techniques, for example support vector machines or template attacks. In recent years we have seen that the trend in machine learning has been shifting toward deep learning; today deep learning demonstrates great performance on a lot of machine learning tasks, like data classification for example.
Following this trend in machine learning, the side-channel community has started to study the potential application of deep learning to side channel, and we have seen several publications in recent years. Most of these publications show a clear interest in using deep learning for side channel: most of the time deep learning outperforms previous attack techniques, and it is interesting, for example, for attacking masked implementations. Another advantage is that we can adapt the network architecture to the challenge or the implementation; for example, we can use a CNN to attack desynchronized traces. But most of this research so far has been focused on the application of deep learning to profiled side-channel attacks. The starting point of this research was to study how we could use deep learning in a non-profiled setting, for example when we don't have a profiling device, because in the field that's a limiting factor: we don't always have a profiling device. So the first thing I will present today is the attack presented in the paper, called differential deep learning analysis. It's a non-profiled attack.
It follows a strategy similar to other non-profiled attacks, like CPA. Very quickly, when we do a CPA, on AES for example, we have a device and we collect some data, some traces, corresponding to some known plaintext values. Then for each key guess we compute predictions, for example the S-box output, apply a model, and for each key guess we compute the correlation between the traces and the predictions, and this gives us information about the key. For differential deep learning analysis we follow a similar strategy: we collect some data, and for each key guess we compute predictions and apply a model. The difference is that instead of computing a correlation, we fix a network architecture, and for each key guess we train this network architecture, with the traces as the training data and the predictions as the labels. What happens is that for the good key guess, because the predictions are correct, we can expect the training to have better metrics than for the other key guesses, because for those the predictions are not correct and so are not in line with the traces. So what does "better" mean in this context? There are several metrics we can look at; the paper presents several, but the most straightforward thing we can do is simply to look at the classic metrics in deep learning, which are the loss and accuracy of the training. This is the first thing I will demonstrate. Today I will do the demo on my laptop.
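The DDLA loop just described can be sketched in a few lines of numpy. This is a minimal sketch under stated assumptions, not the notebook's actual code: a fixed random byte permutation stands in for the AES S-box, the labels are just the LSB of the S-box output rather than the full 256-class output, and `train_and_score` is a hypothetical helper that would train a fresh network on `(traces, labels)` and return its final loss.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the AES S-box: a fixed random byte permutation (hypothetical,
# used only to keep the sketch short; the real attack targets the AES S-box).
SBOX = np.random.default_rng(0).permutation(256).astype(np.uint8)

def simulate_traces(n_traces, n_samples=50, leak_at=30, key=0x2A, noise=1.0):
    """Simulated traces as in the demo: Gaussian noise everywhere, plus the
    Hamming weight of the S-box output leaking at sample t = 30."""
    pts = rng.integers(0, 256, n_traces, dtype=np.uint8)
    hw = np.unpackbits(SBOX[pts ^ key].reshape(-1, 1), axis=1).sum(axis=1)
    traces = rng.normal(0.0, noise, (n_traces, n_samples))
    traces[:, leak_at] += hw
    return traces, pts

def ddla_labels(pts, key_guess):
    """Binary labels for one key guess: here the LSB of the S-box output."""
    return (SBOX[pts ^ key_guess] & 1).astype(np.float32)

def ddla_attack(traces, pts, train_and_score):
    """DDLA outer loop: one *fresh* training per key guess; the guess whose
    labels are consistent with the traces should train best (lowest loss)."""
    scores = [train_and_score(traces, ddla_labels(pts, g)) for g in range(256)]
    return int(np.argmin(scores))
```

Any trainable model can be plugged in behind `train_and_score`; the point is only that each key guess gets its own fresh training, and the guess whose labels agree with the traces is the one that trains best.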
So I will use simulated traces: traces of 50 samples, where I simulate the leakage of one byte of the S-box output, and this leakage is located at sample t = 30. This location is important for the rest of the talk. So I generate these simulated traces. For the demo I will use a neural network which is an MLP with two layers, of 70 nodes and 50 nodes. Basically, I will run the attack I presented on three guesses, and we will observe the difference in metrics between the three guesses. The first two guesses will be wrong guesses and the last one will be the correct one. We observe the loss and accuracy during the training, and I run each training for 30 epochs. On the left you have the accuracy, on the right you have the loss. This was the training for the first guess; now we are training on the second guess; both are wrong guesses. On the last guess, we can observe that the metrics are clearly better because, as I said, the predictions are correct for this one, so the training goes better, we can observe it in the metrics, and we can already determine which key guess is the good one. This was on three guesses, but if you run it on all 256 guesses, because we are attacking the S-box output, you will get something like this, where basically all the wrong guesses tend toward the same metrics while only the good key guess has better metrics, and we can distinguish the correct key value like this. So at this stage we can already use deep learning to perform non-profiled attacks. Later on I will show you that we can actually leverage the advantages of profiled deep learning in this context too. But at this stage, if we compare with CPA for example: with CPA, when you compute your correlation, if
you get a correlation peak, you have information about the leakage location: you know where the leakage is located. Here, if you just look at the loss and accuracy, you don't have this information. That's why the paper introduces some methods based on sensitivity analysis, to be able to locate the leakage area during the attack. From a general point of view, sensitivity analysis means studying the sensitivity of a model with regard to some of its parameters. In deep learning a common application is, for example, computing saliency maps, which tell you which pixels of an image contribute the most to its classification. In the paper there are methods with a similar principle to saliency maps, but in the context of side channel, to locate leakage areas. There are different techniques to do this kind of sensitivity analysis; in the paper I mainly focused on methods based on derivatives. The first thing I will present, in the next demo and in the paper, is to look at the derivatives with regard to the layers, especially the first layer of the network, because the first layer is connected directly to the input samples, so you can get direct information about the leakage on the first layer. So I will do a second demo, keeping the same neural network. It's an MLP, so the first layer is a linear layer; you can represent it as a matrix of size: number of input samples times number of nodes of the first hidden layer. It corresponds to the weights of the first layer. In the notebook you can represent it in three dimensions: on the left here you have the weights, where this axis corresponds to the input samples, this axis corresponds to the nodes, and the z axis corresponds to the value of the weights. So this, on the left, is the weights of the first layer at initialization.
They are randomly distributed. Similarly, during the training you can look at the gradient of this layer, the derivative of the weights, and represent it in three dimensions like this, with the same axes. So in the notebook we will run exactly the same attack as before on three guesses, but this time we will observe the gradient instead of the accuracy and loss. On the first guess, which is a wrong guess, we observe on average small gradients, and we don't observe any specific pattern. Now this is the training for the second guess; it's a wrong guess, so we observe something similar. When we do the training for the good guess, this time we observe something very specific: big derivative values, and these big derivatives all appear on the same line, which actually corresponds to t = 30. If you look at it on the diagram of the MLP, it corresponds to all these weights, the ones connected to the sample t = 30, and this sample is actually the one which contains the leakage of the S-box. It's not surprising, it's something we can expect, because this sample contains the information important for the classification, so on average it has a bigger impact on the loss minimization. That's why we observe big derivatives on this particular line. As it is, you can already know where the leakage is: at t = 30. If you want a two-dimensional representation, you can for example sum over the node axis, and you will get something like this, which is a bit more familiar in side channel.
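The first-layer gradient idea can be sketched with a tiny hand-written MLP. This is a simplified stand-in under assumptions, not the demo's actual network: one hidden layer with manual backprop, simulated binary labels whose leakage sits at sample 30, and the accumulated absolute gradient of the first-layer weights summed over the node axis, which peaks at the leaky sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: noise everywhere, binary label leaking at t = 30.
n, n_samples, hidden = 1000, 50, 20
y = rng.integers(0, 2, n).astype(np.float64)
X = rng.normal(0.0, 1.0, (n, n_samples))
X[:, 30] += 2.0 * (2 * y - 1)

# One-hidden-layer MLP: first layer is (input samples x nodes), as in the talk.
W1 = rng.normal(0, 0.1, (n_samples, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(0, 0.1, hidden)
b2 = 0.0
lr = 0.1
sensitivity = np.zeros(n_samples)

for _ in range(200):                                  # full-batch training
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)                          # ReLU
    p = 1.0 / (1.0 + np.exp(-(a1 @ w2 + b2)))         # sigmoid output
    dz2 = (p - y) / n                                 # d(BCE)/d(logit)
    dz1 = (dz2[:, None] * w2) * (z1 > 0)              # backprop through ReLU
    dW1 = X.T @ dz1                                   # first-layer weight gradient
    sensitivity += np.abs(dW1).sum(axis=1)            # accumulate, sum over nodes
    w2 -= lr * (a1.T @ dz2)
    b2 -= lr * dz2.sum()
    W1 -= lr * dW1
    b1 -= lr * dz1.sum(axis=0)

leak_index = int(np.argmax(sensitivity))              # peaks at the leaky sample
```

Summing the accumulated absolute gradients over the node axis is exactly the two-dimensional view mentioned above: a curve over the input samples with a peak at the leakage location.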
It looks like a correlation peak. Here I just summed over this axis and we get something like this, where we know where the leakage is. Now, what I presented on the gradient of the first layer mainly works with MLPs; direct application to a CNN is not that easy. But in the paper there is a generalization: instead of looking at the derivative with regard to the first layer, you take the derivative with regard to the input samples. Imagine, for example, you have a CNN, here on the right, with some input going through it: you can always compute the derivative of the loss with regard to the input samples, and similarly, the samples which contain the information related to the leakage will produce big derivatives on average. You can accumulate this over the training and get the information. So this is a generalization; you can use it with any kind of network. What is interesting is that everything I presented so far, the differential deep learning analysis and the sensitivity analysis, also works on masked implementations. So I continue the demonstration, but this time with simulated traces where I add a Boolean mask. So I have a masked S-box whose leakage is still located at t = 30 and, to make it a higher-order attack, we put the leakage of the mask at t = 10. I generate new data corresponding to this simulation, and if we run the same attack as before, we will observe that it also works with masked implementations. So again, we will look at the derivatives for the three guesses.
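The input-gradient generalization can be sketched for any differentiable model. Here, as an assumption to keep the sketch short, a plain logistic regression stands in for an arbitrary network, since for it d(loss)/d(input) has the simple closed form (p - y) * w per trace; for a CNN or MLP the same quantity would come out of backpropagation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: noise everywhere, binary label leaking at t = 30.
n, n_samples = 1000, 50
y = rng.integers(0, 2, n).astype(np.float64)
X = rng.normal(0.0, 1.0, (n, n_samples))
X[:, 30] += 2.0 * (2 * y - 1)

w = np.zeros(n_samples)
b = 0.0
lr = 0.5
sensitivity = np.zeros(n_samples)

for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))           # model output per trace
    # d(BCE)/d(input sample) for every trace is (p - y) * w;
    # accumulate its mean absolute value over the training, as in the talk.
    sensitivity += np.abs((p - y)[:, None] * w[None, :]).mean(axis=0)
    w -= lr * (X.T @ (p - y) / n)                    # gradient step
    b -= lr * (p - y).mean()

leak_index = int(np.argmax(sensitivity))             # peaks at the leaky sample
```

Because the derivative is taken with regard to the input rather than a specific layer, nothing in this recipe depends on the architecture, which is why it carries over to CNNs.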
The first two guesses are wrong guesses, and again we won't observe anything special on them: on average the gradient is small. This was the first guess, now the second guess. And for the good guess we again observe something very specific: now it's training, and this time we observe big derivatives at two distinct locations, the line t = 30 and the line t = 10. So again it corresponds to these weights: the weights connected to the masked S-box leakage, and the weights connected to the sample corresponding to the mask. We have big derivatives here and, again, it's not very surprising, because they have a bigger impact on the loss minimization. Again, if we want, we can sum over one axis after accumulating during the training, and we get something like this, where we get information about the leakages. What is interesting is that it worked on a masked implementation, but I didn't change anything compared to the first attack; I didn't adapt the attack to this higher-order simulation. If you compare with CPA: when you do a CPA, you may first need to consider, if it's a masked implementation, how many shares there are, and maybe apply some preprocessing to prepare your data before you attack. Here we didn't do anything like that; it was the same attack for the unprotected and the protected case, without any preprocessing. It's the network itself which adapts to the situation, and it can attack both protected and unprotected implementations. This is especially interesting, for example, when you are in a black box: maybe you don't have information on whether it's a masked implementation, or how many masks are used, and the sensitivity analysis helps there too.
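The higher-order simulation used in this part of the demo might be generated like this; this is a sketch under assumptions (a random byte permutation standing in for the AES S-box, unit Gaussian noise), not the notebook's actual code. The key property is that no single sample leaks the unmasked S-box output: it only appears as the combination of t = 10 and t = 30.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in for the AES S-box: a fixed random byte permutation (hypothetical).
SBOX = np.random.default_rng(0).permutation(256).astype(np.uint8)

def hw(v):
    """Hamming weight of each byte in a uint8 array."""
    return np.unpackbits(v.reshape(-1, 1), axis=1).sum(axis=1)

def simulate_masked_traces(n_traces, key=0x2A):
    """First-order Boolean masking: a fresh random mask m per trace, its HW
    leaking at t = 10, and the masked S-box output SBOX[p ^ k] ^ m leaking
    at t = 30. The unmasked value never leaks in any single sample."""
    p = rng.integers(0, 256, n_traces, dtype=np.uint8)
    m = rng.integers(0, 256, n_traces, dtype=np.uint8)
    X = rng.normal(0.0, 1.0, (n_traces, 50))
    X[:, 10] += hw(m)
    X[:, 30] += hw(SBOX[p ^ key] ^ m)
    return X, p, m   # the mask is returned only for inspection, not for the attack
```

A first-order tool sees no correlation with the unmasked S-box output at any sample of these traces, which is exactly the situation where the network has to learn the combination of the two leaking samples on its own.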
It can reveal the leakage areas. So it's interesting, for example, if you are in a black box: this kind of analysis can help reveal information about the key, but also whether there are masks and where these masks are located. Today I did the demonstrations on simulated data, but in the paper there are several results of this type of attack on real targets. We have results of attacks on the ChipWhisperer-Lite with masked implementations, using one but also two masks; it also works, we went up to three shares. There are also results on ASCAD, which is a public dataset; here you can see the results on ASCAD. If we look at the accuracy metric, on the left, it reveals the correct key value. If we look at the sensitivity analysis, it reveals two distinct areas which, if you look at them with other methods, you can notice correspond to the area where the mask is manipulated and the location where the masked S-box is manipulated. Today I decided to mainly focus on masked implementations, which is one of the interests of the attack, but there is a bit more in the paper. What is important to understand is that with this method we are able to use deep learning in a non-profiled setting, so we can basically leverage all the advantages of deep learning in a non-profiled context. One interest is masked implementations, but another interest, for example, is that you can use a CNN to attack desynchronized traces.
In the paper there are results on desynchronized traces: using a CNN we can do a non-profiled attack, and it compensates for the effect of the desynchronization. That's another interest. To conclude, the paper has two main contributions. The first one is the introduction of differential deep learning analysis to do non-profiled attacks with deep learning and neural networks; we are able to leverage the power and the advantages of deep learning in a non-profiled scenario. As I said, we can attack both protected and unprotected implementations with the same attack process, and it also works with CNNs. Regarding desynchronized traces, what I forgot to say is that the sensitivity analysis also works with CNNs on desynchronized traces: even if the traces are desynchronized, we are able to see where the leakage is. The second contribution is the introduction of sensitivity analysis applied to side channel, to locate leakage areas in the traces. We are able to reveal intermediate-value leakages and mask leakages, and we can do it on any neural network architecture if we consider the derivatives with regard to the input samples. A last remark: in the paper and during the talk I talked about sensitivity analysis in the context of non-profiled attacks, but this kind of technique also works for profiled deep learning attacks. If you are doing profiled trainings, you can also use sensitivity analysis to locate leakage areas. That's it for me today, thank you for listening. That was nice, the demo was very impressive, thank you. Is there any question for Benjamin?
Hi, thank you for the nice talk. If I remember correctly, you used MLP networks for all the masking case studies and CNN networks for all the misaligned case studies. Can you use CNNs also for the masking case studies? I'm asking because we tried to reproduce some of these results and we noticed that the CNNs are really bad at learning higher-order leakage, no matter how we varied them.

Yes, that's correct: in the paper the masking experiments use MLPs, and CNNs are used for desynchronized traces. What I found is that, yes, it's easier to attack masked implementations with an MLP. I'm not sure; I think we should be able to break masked implementations with a CNN, but I haven't really dug into that.

Sorry, so you don't know of any reason why CNNs are incapable of learning that?

I don't have strong results or evidence, but my guess is that because a CNN is sliding over the input samples, while an MLP is combining the values of the different samples, it is maybe easier for the MLP to quickly find a good way to combine the mask and, for example, the masked S-box. But it's something I haven't studied deeply, actually.

We only have time for one quick question. Is it quick?

Thanks for the nice talk. I was just curious if you had the chance to look at the computation time, compared to, for example, a higher-order CPA, or doing a PCA with a CPA: at what order does it start to get interesting to use this one?

Yes, the computation time is one of the drawbacks of the attack, because you need to perform one training for each key guess. We didn't compare with CPA or second-order CPA; for sure second-order CPA will be faster, because with deep learning you need to train several epochs before you get information. In any case CPA will be faster, I guess.

Let's thank Benjamin again.