Welcome to this talk entitled A Comprehensive Study of Deep Learning for Side-Channel Analysis. My name is Loïc Masure. I am a PhD student at CEA Grenoble, in the French Alps, supervised by Emmanuel Prouff from ANSSI and Cécile Dumas from CEA Grenoble. Cécile and I work in an evaluation lab: we are in charge of evaluating physical devices, and as an evaluation lab we always need to be at the state of the art in terms of physical attacks. This talk is devoted to side-channel analysis. In classical cryptanalysis, an attacker is given some pairs of plaintext and ciphertext, with the aim of recovering the secret key, denoted hereafter by K, used for the encryption. In addition, in side-channel analysis the attacker may have access to some unintended communication channels, such as the power consumption or the electromagnetic emanations, denoted on the right by the measured trace X. The idea is to use these observations, which carry information about the secret key through the computation of a sensitive intermediate variable, denoted here by Z, the result of a computation involving a chunk of the plaintext P and a chunk of the secret key K. More precisely, we consider here profiling attacks, where the attacker has access to an open sample, similar to the target device, with full knowledge of the secret key. A profiling attack is decomposed into two steps. During the first, called the profiling phase, the attacker knows the secret key used for the encryption and tries to characterize the behavior of the measured trace X depending on the value of the sensitive intermediate variable Z. Then, during the attack phase, the key remains unknown and must be inferred from the traces X acquired on the real target device. During the attack phase, the key recovery is done thanks to a distinguisher. Assume that we are given Na attack traces during this phase.
We have a model F that returns, for each trace, a score vector: one score for each hypothetical value of the sensitive variable Z or, in other terms, one score for each hypothetical value of the secret key K. By combining the scores returned for each trace during the attack phase, we may see one score in particular emerge, which will be our key hypothesis. Our goal is to find the model that minimizes the required number of traces such that our key hypothesis k-hat equals the real value of the key k-star with probability higher than a given threshold, typically 90%. Let us denote by F-star the optimal model and by Na-star the corresponding number of traces. So, as evaluators, our goal is to know this value Na-star. How do we find this optimal model? The good news is that we know an analytical solution to this problem, namely taking the conditional probability distribution of our discrete random variable Z given the observation X. The bad news is that this optimal model is unknown, and instead we must estimate it with parametric models described by a parameter vector theta. For example, with Gaussian templates, theta denotes the mean vectors and the covariance matrices. Over the past few years, we have seen the emergence of new parametric models from the class of deep neural networks. They have been shown to be particularly effective against implementations protected by countermeasures such as masking or desynchronization. How do we train a neural network? During the profiling phase, the profiling traces are given to the model, denoted here by F and parameterized by the vector theta. For each trace, it outputs a score vector, one score for each hypothetical value, and we compare these scores with the expected value of the output, here denoted by Z. Namely, the higher the score at the Z-th entry, the better the quality of the prediction.
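As a minimal sketch of the distinguisher described above (the function name and array shapes are my own, not from the talk): each attack trace yields a score vector over the key hypotheses, and the per-trace scores are combined, here by summing log-scores, to rank the hypotheses.

```python
import numpy as np

def rank_keys(scores, eps=1e-12):
    """Combine per-trace score vectors into one log-likelihood per key hypothesis.

    scores: array of shape (n_traces, n_keys); row i holds the model's
    probability score for each key hypothesis given attack trace i.
    Returns the key hypotheses sorted from most to least likely.
    """
    # Sum log-scores over the attack traces (product of probabilities in log domain).
    log_likelihood = np.log(scores + eps).sum(axis=0)
    return np.argsort(log_likelihood)[::-1]  # best hypothesis first

# Toy example: 3 traces, 4 key hypotheses; hypothesis 2 consistently scores highest.
scores = np.array([[0.1, 0.2, 0.6, 0.1],
                   [0.2, 0.1, 0.5, 0.2],
                   [0.1, 0.1, 0.7, 0.1]])
best = rank_keys(scores)[0]  # -> 2
```

In this setting, Na-star is the smallest number of rows needed for the top-ranked hypothesis to equal the true key with probability at least beta.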
Therefore, we need to compare the expected results with the actual results of our model through the computation of a loss function. Since the higher the score for the right entry, the lower the loss, training a neural network consists in minimizing the loss by adjusting the parameters theta. The choice of the loss function depends on the nature of the problem, which will be discussed hereafter. Last year, at CHES 2019, Picek et al. raised an open issue with machine-learning-based SCA, namely how to evaluate the quality of the model during the training. They considered the accuracy metric. In SCA, this can be seen as the probability of recovering the secret key with one trace. Picek et al. argued that accuracy does not seem to be the right performance metric in SCA. Indeed, they trained several machine learning models, and those with high accuracy always led to a successful key recovery, whereas the models with low accuracy sometimes achieved a successful key recovery and sometimes did not. So a low accuracy is non-conclusive. The problem is that the latter case often happens, typically with high noise or in the presence of countermeasures, so that the accuracy is not informative about the success of the SCA. As a side question: is any machine learning metric related to the SCA ones? That is a question Picek et al. tried to answer, and apparently the answer is no, since they tried several machine learning metrics in their paper. But actually, the goal stated in the previous slide is a bit different. By computing the accuracy, we want to find the threshold beta such that the minimal required number of traces to succeed in the attack is one. But remember that our goal is, in a sense, dual to this one. Rather, we want to fix beta and find the corresponding number of traces such that the attack succeeds with probability beyond beta. Our claim in this talk is that we can actually accurately estimate Na-star with deep learning techniques.
In particular, we suggest the use of a loss function known under the name of negative log-likelihood, which we describe hereafter. To make the link between the loss function and our SCA metric, let us consider an axis on which we plot the entropy of the random sensitive variable Z. The distribution of Z is typically known and uniform, so we know the entropy. Since we have access to observations of X giving information about the sensitive random variable, this entropy can be decreased to the conditional entropy of Z given X. The gap between those quantities is typically denoted as the mutual information between Z and X. Last year, at CHES 2019, de Chérisey et al. emphasized the fact that this mutual information can be linked to a ratio involving both Na-star, the minimal required number of traces to succeed in the attack, and the threshold beta. So first we have a link between Na-star, our SCA metric, and the mutual information. Unfortunately, computing the MI requires perfect knowledge of the leakage model, which is not the case here. Instead, the SCA community has introduced the notion of PI, for perceived information. This extends the mutual information by considering a non-perfect leakage model, here the leakage model given by the model F, parameterized by the vector theta. An interesting property of the perceived information is that it is always lower than the mutual information. The core contribution of this paper is to remark that the PI can be expressed from the value of a loss function widely used in deep learning, namely the negative log-likelihood. Its expression is given here at the top and can be straightforwardly computed during the training of the model. So what does that mean concretely? During the training, the evaluator will try to minimize the loss function, denoted by L here, over time, thanks to an optimization algorithm running iteratively.
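A hedged sketch of the link just described (function names are mine; I assume Z is uniform over its classes, as the talk states, so H(Z) = log2 of the number of classes): the PI estimate is simply the entropy of Z minus the average negative log-likelihood of the model, both in bits.

```python
import numpy as np

def nll_bits(probs, labels, eps=1e-12):
    """Average negative log-likelihood (in bits) of the true labels.

    probs: array of shape (n_traces, n_classes) of model output scores;
    labels: the true value of Z for each trace.
    """
    return -np.mean(np.log2(probs[np.arange(len(labels)), labels] + eps))

def perceived_information(probs, labels, n_classes):
    """PI estimate: H(Z) - NLL, assuming Z uniform over n_classes."""
    return np.log2(n_classes) - nll_bits(probs, labels)

# Toy example: over 4 classes, a model that always gives the true class
# probability 0.5 has NLL = 1 bit, hence PI = log2(4) - 1 = 1 bit.
labels = np.random.default_rng(0).integers(0, 4, size=10)
probs = np.full((10, 4), 0.5 / 3)
probs[np.arange(10), labels] = 0.5
pi = perceived_information(probs, labels, 4)  # -> 1.0
```

Since the NLL is exactly what the training loop already minimizes, minimizing the loss maximizes this PI estimate at no extra cost.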
What is interesting here is that at the end of the procedure, the optimization algorithm returns a parameter vector, here denoted by theta-hat-1000 if, for example, we use 1,000 traces in the profiling set. This gives an estimation of the PI. Machine learning theory states that the more traces used during the training, the higher the PI corresponding to the trained model. In other words, training a deep learning model is actually equivalent to tightening a lower bound of the MI, which gives an alternative technique for estimating the mutual information compared to other ones, developed for example through mutual information analysis. To assess the quality of this MI estimation through this lower bound, we may decompose the gap into three kinds of errors. The first one is the approximation error. It comes from the fact that our target model F-star cannot be perfectly expressed as a neural network. The second one is the estimation error. It comes from the fact that we do not have an infinite set of profiling traces, so instead of maximizing the true value of the perceived information, we are actually maximizing an empirical estimation of it. The last kind of error, namely the optimization error, comes from the fact that our optimization algorithms cannot return the perfect maximizer of the PI. Indeed, the loss function for neural networks is always non-convex, which makes the optimization problem really hard to solve. Instead, some heuristic algorithm, here for example SGD, for stochastic gradient descent, can return an approximate solution. So, from an evaluator's point of view, it would be ideal to know to what extent those three kinds of errors affect the quality of an evaluation through deep learning. And that is what we will try to discuss with the following experiments. To assess those kinds of errors, we propose here some simulations.
We consider here a leakage model with Hamming weights and an additive Gaussian noise, with standard deviation ranging from 0 to 3.2. Based on this simulated leakage model, we draw an exhaustive data set, in the sense that we assume the number of profiling traces is high enough that we can neglect the estimation error. Based on these exhaustive data sets, we can compute, on the one hand, the mutual information, estimated with Monte Carlo simulations, which is possible since we know the true leakage model in our simulations. And on the other hand, we train a one-hidden-layer perceptron, which is the simplest architecture of neural networks, with only 1,000 neurons, to minimize the NLL loss or, in other words, to maximize the PI. We also consider different kinds of simulations where countermeasures are considered or not. The first case is higher-order masking, where the sensitive variable is split into several independent shares according to a secret sharing scheme. The second case is shuffling, namely where independent operations are randomly shuffled. So our goal is to know whether the PI can be a good estimation of the MI in an SCA context. So now we present our results. First, we plot the estimations of the MI. On the left, you have the estimations for the masking countermeasure, and on the right, you have the MI estimations for the shuffling case. What we see is that the crosses here, denoting the computation of the PI according to the training loss of our model, are indeed almost superposed on the curves. In a sense, this can be interpreted as the fact that the PI is a good estimation of the MI, no matter the masking order or the nature of the countermeasure. And since we assume the estimation error is negligible, there remain only two errors, the approximation error and the optimization error. But it turns out that those two errors always have the same sign: they are negative errors.
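The simulated setting above can be sketched as follows (a minimal version under my own naming; the talk's exact sampling procedure is not reproduced here): each sensitive byte Z leaks its Hamming weight plus Gaussian noise of standard deviation sigma.

```python
import numpy as np

def hamming_weight(x):
    """Number of set bits in an integer."""
    return bin(int(x)).count("1")

def simulate_traces(n, sigma, rng):
    """Draw n leakage samples under the Hamming-weight model of the talk.

    Each sensitive value z is uniform over one byte; the observed leakage
    is HW(z) + N(0, sigma^2).
    """
    z = rng.integers(0, 256, size=n)
    hw = np.array([hamming_weight(v) for v in z])
    leakage = hw + rng.normal(0.0, sigma, size=n)
    return z, leakage

rng = np.random.default_rng(0)
z, leakage = simulate_traces(1000, sigma=1.0, rng=rng)
```

Sweeping sigma from 0 to 3.2, as in the talk, varies the MI from its noise-free maximum down toward zero, which is what makes this a useful benchmark for the PI estimator.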
And since the gap here is negligible, this means that the sum of these two errors is negligible as well; and since they have the same sign, both are negligible. So this is good news, since any more complex model should have a lower approximation error. We have also run some empirical verifications and other experiments; you can check the details in the paper referenced at the end of these slides. Finally, those results may be of great practical interest for the evaluator, since we have a method to approximate the MI through the PI. But since the MI is linked to the number of traces required to succeed in the attack, namely Na-star, we may substitute the MI with the PI in the computations, which gives an accurate estimation of the number of traces required to succeed in the attack. We tested this claim on three public data sets, using the architectures proposed in recent papers, and we computed the relative error at the final step of the optimization. So here are the results. First, consider the AES_RD data set, RD standing for Random Delay. This is a microcontroller, namely a software implementation, protected with desynchronization by the insertion of some random dummy operations. In orange, you have the estimation of Na-star over 50 attacks in a row, whereas in green we have the estimation of this ratio based directly on the training loss during the profiling phase. You have another example here on the ASCAD data set, which is a microcontroller protected with a masking countermeasure. The range of Na-star here is much higher, typically around 100 traces, whereas before we had about two or three traces. What we can see on those two plots is that the relative error is rather small, here 16%. So we see that our estimation method is sound. This result is also confirmed on a hardware implementation, namely the AES_HD data set, which is an implementation on FPGA, but without countermeasure. But still, the range is rather different.
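The substitution step can be sketched like this (a hedged illustration: the exact beta-dependent numerator of the bound comes from de Chérisey et al. and is not reproduced in this transcript, so it is passed in as a given input, and the numbers below are purely illustrative, not results from the talk):

```python
import math

def estimate_na(pi_bits, f_beta):
    """Substitute PI for MI in the trace-count bound mentioned in the talk.

    pi_bits: perceived information per trace, read off the final training loss.
    f_beta:  the beta-dependent numerator of de Chérisey et al.'s bound,
             taken here as a given input (its exact form is in their paper).
    """
    if pi_bits <= 0:
        raise ValueError("non-positive PI: no exploitable information")
    return math.ceil(f_beta / pi_bits)

# Illustrative only: with f_beta = 8 bits and PI = 0.02 bits per trace,
# the estimate is 8 / 0.02 = 400 traces.
na_hat = estimate_na(0.02, f_beta=8)  # -> 400
```

The point of the method is that this estimate comes for free from the profiling phase, without mounting the 50 repeated attacks used to measure Na-star empirically.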
Here, the required number of traces is about 400 or 500 traces. But still, the relative error remains satisfying, at 18%. So as a conclusion, here are the take-away messages of this talk. Minimizing the NLL loss gives a relevant estimation of the mutual information, and thereby provides an accurate estimation of the SCA metric, namely Na-star. The NLL as a loss function is sound from an evaluator's point of view. And our method enables us to quantitatively measure the impact of countermeasures through the estimation of the mutual information. Thank you for listening. If you have some questions, I would be delighted to answer. More generally, if you are interested in deep-learning-based side-channel analysis, I am soon finishing my PhD and I am currently looking for a postdoc position. So if you would be interested in working with me on this topic, feel free to let me know. Thanks.