Welcome everybody to this presentation. Today I'll be giving a talk about a recent CHES paper, "Improving Generalization with Ensembles in Machine Learning-based Profiled Side-channel Analysis". We have three main contributions in this work. The first is the analysis of the output class probabilities, the predictions; the second is how to use proper metrics for profiled SCA with deep learning; and the third is how to improve generalization in deep learning-based profiled SCA by using ensembles, which are basically a combination of multiple neural network models into a stronger model.

Deep learning-based profiled SCA follows the same principle as classical profiled attacks, such as template attacks, where we assume an attacker has full control of a device A from which he can learn the leakage. The difference here is that the learning method is a deep neural network: the attacker uses profiling traces with a known key and all other information to train it, and then validates it, also with traces with a known key, until he achieves good generalization. That model is then used against a device B, from which the attacker collects another set of traces, and if the model trained on device A generalizes well enough, the attacker is able to recover the key from device B. That's basically the picture of profiled SCA attacks.

In the title of this presentation we say that we are going to improve generalization, and one way to do that in deep learning is to use good regularization methods. Regularization can be seen as a method that prevents the model from overfitting. It can be provided implicitly by small neural network models, or by explicit methods such as dropout, data augmentation, and so on, which can be used as an extra regularization artifact in large neural network models. Large models have very high capacity and are very prone to overfit, so adding regularization is very beneficial to reduce this overfitting effect and to lead to better generalization. Another way to achieve better generalization is the early stopping method, where in the training phase we return the model as soon as we achieve a very good value in the metric that we consider as the reference metric for our training. And another way, which is the main contribution of this work, is to use ensembles, combining multiple models as an artifact to improve generalization.

We can say that deep learning for side-channel analysis is mostly about hyperparameters, and this is a very difficult task that we face when we use deep learning for SCA. Still, deep learning brings several advantages to side-channel analysis: no points-of-interest selection is needed, CNNs lead to attacks that are less sensitive to trace misalignment, complex deep neural networks can learn high-order leakage, and visualization techniques help the attacks a lot. All of this pushes the security community to implement more secure products, because if we have better attacks, we see the need to implement better and better countermeasures. So the advantages of deep learning are actually beneficial for the security community.
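To make the regularization and early-stopping ideas concrete, here is a minimal sketch of a profiling model in Keras. The architecture, trace length, and class count are illustrative assumptions, not the models from the paper, and for simplicity the early stopping here monitors validation loss, whereas for SCA a side-channel metric such as guessing entropy is the better reference.

```python
# Minimal sketch (not the paper's architectures): an MLP profiling model with
# dropout as explicit regularization and early stopping during training.
# Trace length (700) and class count (9, Hamming weight) are assumed values.
from tensorflow.keras import layers, models, callbacks

def build_mlp(trace_len=700, n_classes=9, dropout_rate=0.25):
    model = models.Sequential([
        layers.Input(shape=(trace_len,)),
        layers.Dense(200, activation="relu"),
        layers.Dropout(dropout_rate),          # regularization against overfitting
        layers.Dense(200, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping: keep the model from the epoch where the monitored metric was best.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(X_prof, y_prof, validation_data=(X_val, y_val),
#           epochs=200, batch_size=400, callbacks=[early_stop])
```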
We also have what is still a work in progress: we don't yet have an efficient and automated hyperparameter tuning method for deep learning side-channel analysis. To create a deep neural network we need to face many different hyperparameters, and selecting good ones is not an easy task, but there are many works going in this direction and I think very soon we will have efficient solutions for that. Also, adding hyperparameter tuning to the side-channel scenario can render the attack impractical if we don't take care of this part, because SCA by itself is already very costly; it takes a lot of time to do side-channel analysis, so we need to be careful when we add the hyperparameter tuning problem into the context.

Using correct metrics is also very important for SCA with deep learning. We have seen that supervised metrics like accuracy, loss, recall, precision, and so on are not very consistent for side-channel analysis; several works already show this. What we have to use are the common side-channel metrics, such as success rate and guessing entropy, either as a custom loss function in common libraries or independently of the training process. Success rate and guessing entropy are computed from the predictions that we obtain from side-channel traces: when we have a set of traces, we classify those traces using the trained model and obtain predictions, which are the output class probabilities, and these are used to compute the key rank and then guessing entropy and success rate. So what we ask ourselves is what we can learn from these predictions in a way that lets us do better.

Here is a visual example of this problem with metrics. On the left, we are attacking a masked AES implementation, and we have results for a one-key-byte attack with the Hamming weight model. The test and validation accuracy are very low and decreasing, so if we judged the attack by these metrics, we would say the attack is not successful. But if we compute guessing entropy from the predictions on the same test traces, we can see that it is actually decreasing, so it is just a matter of having more test traces to reach a successful attack. So the predictions, or output class probabilities, actually carry a lot of important information for side-channel analysis with deep learning and machine learning methods.

Here I have a representation of those output class probabilities as an array, assuming a Hamming weight model. Every row in this table corresponds to a test trace, and every column corresponds to a label, a Hamming weight for example. Every element in this array represents the probability that the trace in row i has label j, that is, a certain Hamming weight. Now let's say we have a key guess k for the test traces, and we label all the traces according to that key guess. This is a very simple process because we know the plaintext, of course; we don't know the key, but we have the key guess. By labeling all the traces according to the key guess, we can extract the corresponding probabilities from this table.
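As a minimal sketch of that labeling step (with hypothetical variable names), suppose `probs = model.predict(attack_traces)` is the table above, with one row per trace and one column per Hamming weight class. For brevity the intermediate value here is the Hamming weight of plaintext XOR key guess; the talk targets the first-round S-box output, which would only change the label function.

```python
import numpy as np

# Hamming weight of every possible byte value.
HW = np.array([bin(v).count("1") for v in range(256)])

def labels_under_guess(plaintexts, key_guess):
    """Label of every trace if key_guess were the real key (simplified leakage model)."""
    return HW[plaintexts ^ key_guess]

def probs_under_guess(probs, plaintexts, key_guess):
    """For key guess k, pick from row i the probability of the label implied by k."""
    labels = labels_under_guess(plaintexts, key_guess)
    return probs[np.arange(len(probs)), labels]
```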
By extracting every such probability, we can compute the summation of their logarithms, which gives a score for the hypothesis that these test traces were processed with the key guess k. The recovered key is then, of course, the candidate with the maximum summation of probabilities among all 256 candidates, because here we are talking about attacking one key byte.

Let's assume we have a test accuracy of 100%, so we have a model that learns the side-channel leakage very well, and when we classify the test traces we achieve 100% test accuracy. This is a very idealized scenario. What happens then is that the summation of probabilities for the correct key candidate always selects the highest probability from every row of this table, so very likely with one or two traces the attack is already successful. However, if we have a very low test accuracy, meaning that the target implementation is protected with countermeasures or has a lot of noise, then the summation of probabilities used to compute the key rank will not select the highest probability in every row. It may select the highest sometimes, let's say 27% of the time, but for the other 73% of the traces it will select the second, the third, and so on. The summations of probabilities for all the key candidates will then be very similar to each other. Still, as I will show in the next slides, this low accuracy can lead to a successful attack.

If we rank all the key candidates by accuracy alone, we would have something like this: we would always select the highest probability in every row, so accuracy only considers the top-classified information. And if we consider only accuracy to rank the keys, we could get a sorting where the correct key candidate is not the highest, so by using accuracy we might say that the attack failed. If we instead consider the second highest probability in every row, not the highest one (the highest one is what accuracy, as reported by the library, looks at), we could find that this second rank is the one that puts the correct key candidate on top among all the key candidates. If we repeat this process for all the nine ranks in the output class probabilities (nine, because of the Hamming weight model), we could get something like this: sometimes the correct key appears as the first one, but most of the time it does not. And this is what we observe most of the time when the attack is successful. What we can see is that the probabilities at the low ranks have a very large influence on the final key rank, on the probability summation for the key rank calculation, and the probabilities that are not ranked as the first ones have a very small influence on it. So the low ranks push the summation for the correct candidate up, and the high ranks push it down only slightly. That explains why, even with a small accuracy, the attack can still be successful.

Here is a real example on a leaky AES implementation where we have a good enough accuracy to recover the key with a very small number of traces; I believe that in this example with 10 to 20 traces we can already get the key, and the test accuracy is 48 percent.
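As a sketch of that key-rank computation (same hypothetical label function and variable names as before): the score of every key candidate is the summation of the log-probabilities extracted from the table, and the rank of the correct key is its position when the candidates are sorted by that score.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])

def labels_under_guess(plaintexts, key_guess):
    return HW[plaintexts ^ key_guess]          # simplified leakage model

def key_rank(probs, plaintexts, correct_key, eps=1e-36):
    """Rank 0 means the correct key has the highest summed log-probability."""
    logp = np.log(probs + eps)                 # eps avoids log(0)
    scores = np.empty(256)
    for k in range(256):
        labels = labels_under_guess(plaintexts, k)
        scores[k] = logp[np.arange(len(probs)), labels].sum()
    order = np.argsort(scores)[::-1]           # key candidates sorted, best first
    return int(np.where(order == correct_key)[0][0])
```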
If we make the plot that I explained in the previous slides for this real scenario, we see that for the correct key candidate, most of the output class probabilities for every trace are among the first ones: they rank first, second, or third, very rarely fourth, and almost never at the highest ranks. So the output class probabilities for the correct labels are mostly biased towards ranks one, two, and three, and the high-ranking probabilities basically never appear for this trace set. The situation is not identical but is similar in nature for a masked AES implementation, where the accuracy we obtain is very low, 22 percent. We can still see that for the correct key candidate the class probabilities are high at the low ranks; at ranks one and two the density is not the highest, but it is among the highest, and at the high ranks, after rank five let's say, the correct candidate appears with the smallest probabilities in every row of the prediction table. So the summation of probabilities will not be affected too much by the probabilities ranked high, but it will be affected by those ranked one, two, and three, and this is the reason why the key rank as a metric can indicate a successful attack.

But as we said before, hyperparameters are very important to tune in a deep learning attack for side-channel analysis. Here I have two examples with two convolutional neural networks; in both cases we have successful key recovery, but we changed only one hyperparameter, one dense layer, between these two CNNs, and what we see is a very big difference in the distribution of class probability ranks for the two situations. This shows that small modifications in the model hyperparameters lead to big modifications in how the predictions are made. So what we saw is that by tuning the hyperparameters we can find good models, but we are still kept in the situation where the probabilities are high for the low ranks yet not really distinctive from the incorrect key candidates.

A common story in deep learning-based analysis is that we require a large number of hyperparameter experiments until we find a good model. In other domains, minimizing the loss function is usually the main goal; for side-channel analysis we always need to select the correct metric, which is guessing entropy or success rate or some other special metric that can be proposed, and by minimizing, for example, guessing entropy we can also define a good model. So we train multiple models until we find the best one, and by electing a best one we might achieve good generalization, which leads to a successful attack in the test phase. But then the main question we asked ourselves, after going through this expensive process of training multiple neural networks, was: why not benefit from these multiple models instead of just selecting a single best model? And so we proposed the usage of ensembles.
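A sketch of how such a rank-density plot can be produced (hypothetical names again): for every trace, find where the probability extracted for a given key candidate falls when that trace's row of the prediction table is sorted from highest to lowest, and then build a histogram of those positions.

```python
import numpy as np

def class_probability_ranks(probs, labels):
    """Rank (0 = highest) of probs[i, labels[i]] within row i, for every trace i."""
    order = np.argsort(probs, axis=1)[:, ::-1]        # classes sorted per trace
    return np.argmax(order == labels[:, None], axis=1)

def rank_density(probs, labels):
    """Fraction of traces whose extracted probability falls at each rank."""
    ranks = class_probability_ranks(probs, labels)
    counts = np.bincount(ranks, minlength=probs.shape[1])
    return counts / counts.sum()

# Usage sketch: compare the density for the correct key against a wrong candidate.
# density_correct = rank_density(probs, labels_under_guess(plaintexts, correct_key))
# density_wrong   = rank_density(probs, labels_under_guess(plaintexts, wrong_key))
```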
I will not give too many details about the different types of ensembles in this presentation, but I can say that what we used here was bootstrap aggregating, basically the bagging method, which combines the predictions of multiple neural networks into one. What we do in the key-rank summation for every key candidate is to add an outer summation over the models. We don't need to sum over all the models that we train; if we are able to define a group of good models, we can choose this group and do the summation over it, always using the minimization of guessing entropy as the metric to select the good models. This selection is basically a function of the different hyperparameters, the training traces, and the validation traces.

Here is a visual example, using the distribution of class probability ranks, of why ensembles improve generalization. We trained 50 models, selected the 10 best models, and summed the class probabilities of these 10 models. If we now compare the line of the correct key candidate in this class-probability rank density distribution against the single best model out of the 50, we can see that the line for the ensemble is much more distinctive from the incorrect key candidates; the ensemble line for the correct candidate is higher at ranks one and two, for example. The plot on the right represents the best model out of the 50 trained models: the attack is successful, and the density at class probability ranks one and two is the highest, but it is still not really distinctive from the rest, even though it provides the highest summation of probabilities for the correct key candidate. Visually we cannot see very well which is the correct key candidate, while with ensembles it is much clearer.

In this paper we tested four different data sets: one has no countermeasures and three of them have countermeasures, even though DPAv4 has a very weak countermeasure to be considered a protected target. The ASCAD and CHES CTF data sets that we used in this paper can be considered masked implementations, so we assume that our method also works against masked implementations. We also defined ranges of hyperparameters for the different models that we combine into an ensemble. Maybe I skipped this part, but when we build an ensemble, we train multiple neural network models by varying the hyperparameters of each model, and what we do here is to randomly select the hyperparameters from these ranges, so every model is basically random. This is not really a random search, because we define ranges that are close to optimal based on literature results for side-channel analysis. Here are some results that we obtained on the ASCAD data set using the Hamming weight model for MLP and CNN.
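A sketch of that bagging-style score (reusing the hypothetical label function from the earlier sketches): the per-candidate log-probability summation simply gains an outer summation over the selected models, for example the 10 models with the best validation guessing entropy.

```python
import numpy as np

def ensemble_scores(probs_per_model, plaintexts, label_fn, eps=1e-36):
    """Sum log-probabilities over traces AND over the selected models, per key guess."""
    scores = np.zeros(256)
    for probs in probs_per_model:              # outer summation over models
        logp = np.log(probs + eps)
        for k in range(256):
            labels = label_fn(plaintexts, k)
            scores[k] += logp[np.arange(len(probs)), labels].sum()
    return scores

# Usage sketch: predict with the selected models, then take the best-scoring candidate.
# probs_per_model = [m.predict(attack_traces) for m in ten_best_models]
# best_guess = int(np.argmax(ensemble_scores(probs_per_model, plaintexts, labels_under_guess)))
```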
In every plot here, on the left I have guessing entropy and on the right success rate, and we have three main lines: the green line represents the guessing entropy for the best single model out of 50 models, and the blue and the orange lines represent the ensemble of all 50 models and of the 10 best models, respectively. We can clearly see from this analysis that ensembles, either with 10 models or with all 50 models, provide superior results. We never saw a situation where the single best model was superior to the ensembles; it can be similar, very close, as here in the case of the CNN, but never clearly superior. So ensembles are a good way to go, a safe choice, when you want to improve generalization while training multiple models with some reasonable ranges for your hyperparameters. We also have results for the identity model on the ASCAD data set, and again ensembles provide better success rate and guessing entropy than simply electing the best model.

Now some conclusions of this work. We can state that output class probabilities are a valid distinguisher for side-channel analysis; this is the information we already use to compute guessing entropy and success rate, but we believe more information can be extracted from the output class probabilities in order to obtain an even stronger distinguisher. We can also see that this information, the predictions or class probabilities, is very sensitive to small changes in the hyperparameters. That is not really a novelty, but what we wanted to reinforce in this work is that when we have these variations in the results, ensembling many models can remove the small variations caused by changing the hyperparameters a little, and this can improve the results. We do not claim that ensembles replace hyperparameter search; what we say in this paper is that they relax the fine-tuning of the hyperparameters. From the many experiments we have done, we always saw that guessing entropy and success rate for ensembles tend to be superior to the ones obtained from single models. Also, ensembles do not improve the learnability of the model: it does not mean that after using ensembles the model learns more; what ensembles do is improve on what the models have already learned. Another positive point is that a limited number of models can be enough to build strong ensembles: if you are using good hyperparameter ranges, you don't need hundreds or thousands of models; something like 10 to 50 models can, in some cases, be enough to achieve a good ensemble.

From this work we saw many opportunities to continue exploring the benefits of ensembles. In future work we want to explore different ensemble methods, like stacking, for example, and one thing that we find very interesting would be to see the benefit of ensembles in combination with other regularization methods; ensembles can be considered a regularization method in the end, but combined with another efficient regularization method we believe this could lead to interesting results. Also, formalizing the density distribution of class probabilities that we have seen in these slides and in this paper would be a nice, interesting future work.
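As a sketch of how those guessing entropy and success rate curves are typically estimated (hypothetical names, same simplified label function as before): the attack is repeated many times on shuffled attack traces, the rank of the correct key is tracked as traces accumulate, and the ranks are then averaged (guessing entropy) or compared against rank zero (success rate).

```python
import numpy as np

def guessing_entropy_and_sr(probs, plaintexts, correct_key, label_fn,
                            n_experiments=100, eps=1e-36):
    n_traces = len(probs)
    logp = np.log(probs + eps)
    ranks = np.zeros((n_experiments, n_traces), dtype=int)
    for e in range(n_experiments):
        perm = np.random.permutation(n_traces)            # reshuffle the attack traces
        # per-trace log-probability of every key candidate, in the shuffled order
        per_trace = np.stack(
            [logp[perm, label_fn(plaintexts[perm], k)] for k in range(256)], axis=1)
        cum = np.cumsum(per_trace, axis=0)                 # scores vs. number of traces
        order = np.argsort(cum, axis=1)[:, ::-1]
        ranks[e] = np.argmax(order == correct_key, axis=1) # rank of the correct key
    ge = ranks.mean(axis=0)                                # guessing entropy curve
    sr = (ranks == 0).mean(axis=0)                         # success rate curve
    return ge, sr
```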
This could lead to a very nice metric for deep learning side-channel analysis. Okay, so I would like to thank you for watching this presentation, and I would also like to say that the code to reproduce the results from this paper is available on the GitHub page of our lab; feel free to download it and try it. Thank you very much.