Bonjour, I'm Gabriel, a PhD student from the Hubert Curien Laboratory and the Thales ITSEF, and today I will present a new loss function that maximizes the success rate of a neural network trained to perform side-channel attacks. This is joint work with Lilian Bossuet and Amaury Habrard from the Hubert Curien Laboratory, and Alexandre Venelli and François Dassance from Thales ITSEF and NXP Semiconductors. But first, what is the point of side-channel attacks? To understand this concept, let's look at an example. Assume that an adversary has access to a physical device in which a cryptographic algorithm, for example AES, is implemented, and that she can configure the secret key and the plaintext to perform an encryption. Obviously, the goal of the adversary is to recover the sensitive information manipulated during this operation. To perform a side-channel attack, she needs at least one probe, here an EM probe, and one oscilloscope, so that during the encryption she can capture a physical trace that directly depends on the sensitive information manipulated by the cryptographic module. And because she can perform as many encryptions as she wants, the adversary can generate plenty of physical traces and thus find some leakages related to the secret she wants to retrieve. In a deep-learning-based side-channel attack, the adversary trains a neural network to automatically match a physical trace with the correct target sensitive value, so the attack can be decomposed into two parts: a profiling phase and an attack phase. First, in the profiling phase, the adversary trains a neural network to predict the correct sensitive information, which she knows, from the physical traces generated as in the previous slide. Once this phase is performed, the adversary can predict the intermediate variable on a target device containing the secret she wants to retrieve.
So during the attack phase, using a new set of physical traces, she can compute the probability of observing each key hypothesis. Unfortunately, one physical trace is not enough to retrieve the correct targeted value. Hence, the adversary has to predict the score related to each key hypothesis multiple times and finally aggregate all the predictions to recover the secret key value. One way to evaluate the efficiency of a neural network is to assess the benefits of the applied loss function for optimizing the attack performance. But what is the classical learning metric used in deep-learning-based side-channel attacks? The most widely used loss function is called the cross-entropy, and its equation is recalled here. To get a clear overview of its impact during the training process, let's go back to our previous example. Here, the adversary generates a set of physical traces such that she knows the targeted value manipulated by the device under test. During the profiling phase, the neural network outputs a set of scores related to each hypothetical value, such that the higher the score, the higher the probability of observing the related value. However, to fit with the maximum likelihood distinguisher, the scores have to be converted to a probability distribution. Thus, an additional function, called the softmax function, is applied in order to convert the scores to a normalized probability distribution. Through the minimization of the cross-entropy loss function, the adversary asks the neural network to maximize the probability of observing the correct output with respect to each hypothetical value. Thus, minimizing the cross-entropy loss function seems suited to conduct side-channel attacks. But concretely, how should the loss function be minimized? To perform such a minimization, the adversary employs an optimization algorithm based on the gradient descent principle.
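As a concrete illustration of the aggregation step described above, here is a minimal NumPy sketch (the function names and shapes are my own, not from the talk) that converts per-trace scores into probabilities with a softmax and accumulates log-likelihoods over several attack traces, as a maximum likelihood distinguisher does:

```python
import numpy as np

def softmax(scores):
    # Convert raw network scores into a normalized probability distribution.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_predictions(score_matrix):
    """Maximum-likelihood aggregation over several attack traces.

    score_matrix: (n_traces, n_hypotheses) raw scores output by the model.
    Returns the aggregated log-likelihood of each key hypothesis and the
    index of the most likely one.
    """
    log_probs = np.log(softmax(score_matrix) + 1e-36)  # log domain for stability
    log_likelihood = log_probs.sum(axis=0)             # aggregate over traces
    return log_likelihood, int(np.argmax(log_likelihood))
```

Summing log-probabilities over traces is numerically safer than multiplying raw probabilities, which is why the aggregation is done in the log domain.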
This technique consists in iteratively updating the trainable parameters theta of the neural network based on the gradient of the loss function. Concretely, given a neural network with two parameters, theta0 and theta1, we can plot the evolution of the cross-entropy loss function depending on the value of both trainable parameters. Through this plot, we can see that depending on the initialization of theta0 and theta1, the optimization algorithm can reach a local minimum or, in the best-case scenario, the global minimum. To obtain the most efficient parametric model, the adversary expects the gradient descent algorithm to converge to the global minimum. Unfortunately, this cannot be guaranteed, and in this talk the resulting error is called the optimization error. More formally, it characterizes all the errors induced by the learning process, from the selection of the finite model space to the error induced by the optimization algorithm. To obtain the most efficient model, the goal of the adversary is to find a solution that maximizes the success rate metric for a minimum number of attack traces. Thus, the gradient descent algorithm should update the trainable parameters theta in order to penalize the errors observed on the success rate. So let's see if the penalization term induced by the cross-entropy loss function fits with this maximization problem. Here, two scenarios can be observed. First, if the derivative is computed with respect to the targeted class, then the penalization term should reflect the impact of each irrelevant score on the correct output. And this is exactly what the cross-entropy does, as it penalizes the related output with respect to the probability error.
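The update rule described above can be sketched on a toy convex loss (this is generic gradient descent, not the authors' actual training setup; the learning rate and step count are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Iteratively update the trainable parameters theta
    against the gradient of the loss function."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= lr * grad(theta)  # move opposite to the gradient
    return theta

# Toy loss: L(theta) = (theta0 - 3)^2 + (theta1 + 1)^2,
# whose global minimum is at theta = (3, -1).
grad = lambda t: 2.0 * (t - np.array([3.0, -1.0]))
theta = gradient_descent(grad, [0.0, 0.0])
```

On this convex toy loss, any starting point converges to the global minimum; the point made in the talk is precisely that the real, non-convex loss landscape of a neural network offers no such guarantee.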
Then, if the derivative is computed with respect to an untargeted class j, the penalization should depend only on the score distance between j and k*, since, following the success rate metric, the adversary aims at maximizing the score of k* with respect to each wrong hypothetical value. But unfortunately, as illustrated in this slide, the derivative considered by the cross-entropy loss function does not perfectly reflect the relative order of the untargeted class with respect to the targeted one. Indeed, the cross-entropy combined with the softmax function cares more about the absolute value of a class with respect to the others, which can induce blurred results regarding the optimization of the success rate. And because the wrong penalization of the class j impacts the probability of the targeted output, we suggest that the cross-entropy does not totally reflect the probability distribution the adversary wants to optimize during the training process. In the following, this error, induced by the probability distribution, is called the approximation error, and it highlights the deviation of the predicted distribution from the real, unknown one. However, the cross-entropy loss function remains beneficial, as Masure et al. suggested at CHES 2020. Indeed, Masure et al. demonstrate the benefits of the cross-entropy loss function, as it maximizes the perceived information introduced by Renauld et al. at EUROCRYPT 2011. More concretely, three sources of errors can be highlighted. First, the optimization error, already defined, characterizes the error induced by the learning process as well as the selection of a finite model space in order to obtain the optimal parametric model. Then, as the adversary maximizes an estimation of the perceived information instead of the true perceived information, she would need an infinite number of traces in order to converge towards this solution.
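To make the two derivative cases concrete, here is the standard softmax cross-entropy gradient (a textbook identity, not specific to this work), with $s_j$ the score of hypothesis $j$, $p_j = \mathrm{softmax}(s)_j$, and $k^*$ the targeted class:

```latex
\frac{\partial \mathcal{L}_{CE}}{\partial s_{k^*}} = p_{k^*} - 1,
\qquad
\frac{\partial \mathcal{L}_{CE}}{\partial s_{j}} = p_j \quad (j \neq k^*).
```

Note that $p_j$ depends on all the scores through the softmax normalization, not only on the pairwise distance $s_{k^*} - s_j$; this is the blurring effect on the relative order that the talk describes.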
Since in practice she can only deal with a finite number of traces, the estimation error characterizes the gap between the empirical and the true perceived information. And finally, the approximation error defines the deviation between the perceived information estimated with the cross-entropy loss function and the mutual information. This error is negligible if and only if the probability distribution associated with the optimal model minimizing the cross-entropy loss function is similar to the true, unknown probability distribution. So in our work, we propose a new loss function which is derived from the success rate metric and prevents the approximation error effect. When an adversary performs a side-channel attack, she aims at optimizing the success rate metric for a given number of traces, and this metric is defined as follows. In other words, the adversary wants to optimize the probability that the score related to the secret key is always higher than that of any other key hypothesis. What the adversary wants in this scenario is to approximate the indicator function to make it easy to handle. And following the work provided by Qin, Liu, and Li, we know that a natural fashion consists in taking the sigmoid function in order to approximate the indicator function, such that, on the right side of this slide, we can see that depending on the value of alpha, we can monitor the approximation of the indicator function. To optimize the trainable parameters that maximize the success rate, the adversary has to apply the binary cross-entropy in order to penalize the deviation of the model probabilities from the desired predictions. In other words, we want to penalize the loss function when the expected relation is not observed. Thus, when the binary cross-entropy is applied, the adversary obtains the following partial ranking loss function.
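Putting the sigmoid approximation and the binary cross-entropy together, the resulting pairwise loss can be sketched as follows (the value of alpha and the exact constants here are illustrative; see the paper for the authors' definition):

```python
import numpy as np

def ranking_loss(scores, k_star, alpha=0.5):
    """Pairwise ranking loss sketch: for every wrong hypothesis k, apply a
    binary cross-entropy on the sigmoid approximation of the indicator
    1[s_k >= s_{k_star}], so the loss grows whenever a wrong hypothesis
    threatens the score of the secret key."""
    s_star = scores[k_star]
    total = 0.0
    for k, s_k in enumerate(scores):
        if k == k_star:
            continue
        total += np.log2(1.0 + np.exp(-alpha * (s_star - s_k)))
    return total
```

The loss vanishes when the secret key's score dominates every other hypothesis, and grows as wrong hypotheses catch up, which is exactly the ranking behaviour the success rate metric asks for.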
So this result gives us an insight into how the loss function penalizes the training process when this relation is not respected. Therefore, maximizing the success rate tends to minimize the ranking error between the secret key k* and any key hypothesis k. But in order to efficiently find the model that minimizes the ranking error, we have to apply this partial loss to each key hypothesis in order to maximize the rank of the secret key. One interesting property of the ranking loss, which we demonstrate in our paper, is that this new loss function can be considered as an upper bound of the attack success rate. Thus, minimizing the ranking loss directly optimizes the performance metric the adversary wants to maximize. But what are the benefits of considering the ranking loss instead of the cross-entropy loss function? First, let's go back to the penalization process. As a reminder, the adversary's goal is to find a model that maximizes the attack success rate, such that during the profiling phase the trainable parameters are updated in order to maximize this performance metric. If the derivative is computed over k*, the ranking loss pushes the score of the secret key up via gradient descent. On the other hand, if the derivative is computed over an irrelevant output, the ranking loss pushes the score of each key hypothesis down via gradient descent, such that this penalization only depends on the distance between k* and j. Thus, in opposition to the cross-entropy loss function, the scores related to the other key hypotheses do not blur the penalization term. Indeed, for each pair (k*, k), there are two forces at play, such that the force each pair exerts is proportional to the difference between their scores. Thus, the penalization process induced by the ranking loss maximizes the attack success rate, and the approximation error induced by the cross-entropy loss function is prevented.
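As a sketch of that pairwise force, differentiating a ranking loss of the form $\sum_{k \neq k^*} \log_2\!\big(1 + e^{-\alpha(s_{k^*} - s_k)}\big)$ (the exact form and constants here are illustrative) gives, with $\sigma$ the logistic function:

```latex
\frac{\partial \mathcal{L}_{RkL}}{\partial s_k}
  = \frac{\alpha}{\ln 2}\,\sigma\!\big(-\alpha (s_{k^*} - s_k)\big)
  \quad (k \neq k^*),
\qquad
\frac{\partial \mathcal{L}_{RkL}}{\partial s_{k^*}}
  = -\frac{\alpha}{\ln 2} \sum_{k \neq k^*} \sigma\!\big(-\alpha (s_{k^*} - s_k)\big).
```

Each pairwise term depends only on the score difference $s_{k^*} - s_k$, in contrast with the softmax cross-entropy gradient, where every class's probability depends on all the scores.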
But from an optimal model perspective, what does this result mean? If we define a distinguisher D as a rule allowing the extraction of the targeted sensitive variable from a parametric model, we know that maximizing the success rate is equivalent to finding an estimation of the optimal model, such that finding the parameters theta that minimize this term is equivalent to finding the optimal distinguisher. And this is exactly what the ranking loss does during the training process. So while the ranking loss aims at finding the parameters theta that maximize this function, we can define a parametric model considering the ranking loss as a lower bound of the mutual information, such that if the adversary has an infinite number of profiling traces and optimal parameters theta, she retrieves the related mutual information, which is highly beneficial from an attack perspective, as it reflects the minimum number of attack traces needed to retrieve the secret key. From an error analysis perspective, we observe that the optimization error, as well as the estimation error, still affects a model trained with the ranking loss, but the latter solution is beneficial to prevent the approximation error. To verify all these theoretical observations through different scenarios, we decided to evaluate the ranking loss on the most common databases used in deep-learning-based side-channel analysis. So the ranking loss was evaluated on two public datasets, and other datasets are studied in our paper. First, the ChipWhisperer dataset is an unprotected implementation of AES-128 in software on a ChipWhisperer board, which is built around an 8-bit microcontroller. Due to the lack of countermeasures, the adversary can recover the secret key directly, and in this experiment we attack the first S-box operation. Then, the ASCAD dataset is the first open database that has been specified to serve as a common basis for further work in deep-learning-based side-channel attacks.
So the targeted platform is also an 8-bit microcontroller, where a first-order masking scheme is implemented with different levels of desynchronization. To accurately define the performance of a neural network, we introduce a metric called NTGE, which defines the number of traces needed to reach a constant guessing entropy of one. To be confident in our experiments, we perform 100 attacks and define the averaged NTGE (NTGE-bar) as our final metric. From a practical perspective, the generation of suitable architectures is known to be a difficult task. Hence, two kinds of models can be considered. First, let's look at an example where the parametric model F_theta exploits a partial set of the points of interest on the ChipWhisperer dataset. Once we captured the physical traces, we designed a neural network, first considering the cross-entropy loss function as the learning metric. Once the profiling model was trained, we used some visualization tools in order to assess the ability of the model to correctly retrieve the points of interest. Here we use the weight visualization introduced by Zaid et al. at CHES 2020. Once this step was performed with the cross-entropy loss function, we applied exactly the same process with the ranking loss in order to compare the features retrieved by each model with the SNR computation. Our first observation suggests that both models successfully retrieve the points of interest located in the first 200 time samples. Indeed, this result coincides with the peaks returned by the SNR computation. But unfortunately, if we look more carefully at the model trained with the cross-entropy loss function, an unexpected peak appears, and this peak is detected neither by the SNR computation nor by the model trained with the ranking loss. So why does the model trained with the cross-entropy loss function consider this peak as relevant for its decision making? Because of the approximation error.
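The SNR computation used above as a reference for the points of interest is a classical first-order statistic; here is a minimal sketch (variable names are my own), computing, per time sample, the variance of the per-class mean traces over the mean of the per-class variances:

```python
import numpy as np

def snr(traces, labels, n_classes=256):
    """Signal-to-noise ratio per time sample: variance of the per-class
    mean traces (signal) divided by the mean of the per-class variances
    (noise). High values flag points of interest."""
    means, variances = [], []
    for v in range(n_classes):
        cls = traces[labels == v]
        if len(cls) == 0:
            continue  # skip classes absent from the profiling set
        means.append(cls.mean(axis=0))
        variances.append(cls.var(axis=0))
    return np.var(means, axis=0) / (np.mean(variances, axis=0) + 1e-36)
```

Peaks in this curve mark the time samples where the traces statistically depend on the sensitive value, which is what the weight visualization of the trained models is compared against.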
Its impact can be non-negligible during the attack phase. Indeed, on this slide, we assess the ability of each model to retrieve the secret key when additional Gaussian noise is applied. We can notice that the performance gap between both models increases with the level of the noise. Thus, if the optimization error induced in each model is similar, considering the ranking loss is beneficial, as it directly optimizes the expected probability where the cross-entropy loss function doesn't. The second scenario considered in this talk is the following. Here, the adversary constructs models that exploit all the relevant information in the leakage traces. Then she can ask which loss function is beneficial to converge faster towards the best attack performance. As the models perfectly retrieve the points of interest, it can be assumed that the approximation error is negligible; thus, in such a context, only the optimization error and the estimation error hold. By performing the attack on the ASCAD dataset, we plot the following graph, where you can visualize the performance evolution of both models depending on the number of profiling traces used during the training process. On the Y-axis, you can see the NTGE-bar value previously introduced. When the ASCAD dataset is considered, we observe that the model trained with the ranking loss always converges faster towards the best attack performance, whatever the desynchronization level applied. Thus, given a fixed NTGE-bar value, a model trained with the ranking loss always needs fewer profiling traces than a model trained with the cross-entropy loss function, such that in this example using the ranking loss is beneficial to reduce the estimation error. To conclude our work, we first link the learning-to-rank paradigm with the side-channel domain, which helped us to develop a new loss function that maximizes the attack success rate.
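The NTGE metric used throughout these comparisons can be sketched as follows: rank the true key after each additional aggregated attack trace, then find the number of traces after which the rank stays at zero (guessing entropy of one). The names are my own, and a real evaluation would average this over many independent attacks, as the 100-attack NTGE-bar does:

```python
import numpy as np

def guessing_entropy_rank(log_likelihoods, true_key):
    """Rank of the true key after aggregating each additional trace.

    log_likelihoods: (n_traces, n_hypotheses) per-trace log-probabilities.
    Returns one rank per number of traces used (0 = best ranked).
    """
    cumulative = np.cumsum(log_likelihoods, axis=0)
    ranks = []
    for row in cumulative:
        order = np.argsort(row)[::-1]  # hypotheses sorted by likelihood
        ranks.append(int(np.where(order == true_key)[0][0]))
    return ranks

def ntge(ranks):
    """Smallest number of traces after which the rank stays at 0."""
    for i in range(len(ranks)):
        if all(r == 0 for r in ranks[i:]):
            return i + 1
    return None  # key never stably recovered with these traces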
Through this new proposition, we are more concerned with the relative order between the secret key and each key hypothesis, which is beneficial to prevent the impact induced by the softmax function, which we call the approximation error. And obviously, this work is the starting point for assessing the benefits of the learning-to-rank approach in side-channel attacks, and further investigations should be made on the other learning-to-rank approaches, such as the listwise approach, for example. Finally, all of our results can be reproduced through a GitHub repository, and if you have any questions, do not hesitate to contact me over email. Thank you very much.