Bonjour, I'm Gabriel, a PhD student from the Hubert Curien Laboratory and the Thales ITSEF, and today I will present a new approach that combines multiple neural networks in order to enhance attacks against public-key cryptography (PKC) algorithms. This is joint work with Lilian Bossuet and Amaury Habrard from the Hubert Curien Laboratory and Alexandre Venelli from NXP Semiconductors. But first, what is the point of such an attack targeting asymmetric cryptographic algorithms? To understand this concept, let's take a look at an example. Let's assume that an adversary has access to a physical device in which a PKC algorithm like RSA is implemented, such that she can configure the private key and the plaintext to compute a signature. Obviously, the goal of the adversary here is to recover the sensitive information manipulated during this operation. To perform such an attack, she needs at least one probe, here an EM probe, and one oscilloscope, such that during the signature she captures physical traces that directly depend on the sensitive information. For example, if the RSA algorithm is considered, the targeted operation can be the modular exponentiation, and this operation will be considered in the rest of this talk. Because the adversary can compute as many signatures as she wants, she can generate plenty of physical traces and thus find some patterns related to the secret she wants to retrieve. If no countermeasures are implemented, the adversary looks for patterns that directly depend on the private key bits manipulated during the modular exponentiation.
So through these traces, she can identify two patterns that depend on the private key bits: if a square-and-multiply algorithm is considered, the shortest pattern corresponds to a square operation and characterizes a bit equal to zero, while the longest pattern corresponds to a square-and-multiply operation and characterizes a bit equal to one. So in this example, yes, the adversary can identify patterns that directly depend on the private key bits and guess all the bits involved in the modular exponentiation. However, if the same operations are processed whatever the value of the bits, the adversary cannot clearly distinguish which bit is manipulated. To circumvent this issue, she can employ techniques introduced in the deep learning-based side-channel field to correctly retrieve the values of the processed bits. In deep learning-based side-channel attacks, the adversary trains a neural network to automatically match a physical trace with the correct sensitive value. Such an attack can be decomposed into two parts: the profiling phase and the attack phase. First, during the profiling phase, the adversary decomposes her physical traces into sub-traces such that each element represents the processing of one bit. Still in the profiling phase, the adversary trains a neural network to predict the correct sensitive information, which she knows, based on the sub-traces generated in the previous step. Once this phase is performed, the adversary can predict the intermediate variable on a target device containing a secret key she wishes to retrieve. By using a new physical trace in the attack phase, she can assign a value to each bit from the probabilities returned by the neural network. Then, by estimating the value of each bit induced in the leakage trace, the adversary can guess the private key manipulated by the cryptographic module.
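To picture the leakage the adversary exploits, here is a minimal sketch of a left-to-right square-and-multiply modular exponentiation; the operation log `ops` is a hypothetical stand-in for the pattern sequence visible in a physical trace, not part of any real implementation:

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right square-and-multiply modular exponentiation.

    Each key bit triggers a square (bit = 0) or a square followed by a
    multiply (bit = 1), so the two cases produce patterns of different
    length in a side-channel trace.
    """
    result = 1
    ops = []  # illustrative operation log standing in for the leakage
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus  # always a square
        ops.append("S")
        if bit == "1":
            result = (result * base) % modulus  # extra multiply leaks
            ops.append("M")
    return result, ops

# exponent 0b1011: the bit sequence is readable from the operation log
r, ops = square_and_multiply(5, 0b1011, 97)
assert r == pow(5, 0b1011, 97)
```

Reading `ops` back, every `S` not followed by `M` is a zero bit and every `S M` pair is a one bit, which is exactly the pattern analysis described above.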
But unfortunately, the adversary rarely predicts all the bits of the private key correctly. Indeed, if the adversary does not perfectly retrieve the bits of the private key, then remaining operations must be performed in order to correct the wrong predictions. Depending on the knowledge of the adversary, different scenarios can be considered to correct these wrong predictions. In our work, we define the three following scenarios. First, the worst-case scenario assumes that the adversary does not know the positions of the uncertain predictions. Given a guessed private key, the adversary measures the related maximal percentage of errors in order to enumerate the resulting private key candidates. Assuming that a maximum of n bits are erroneous, the adversary starts by correcting the first bit and then verifies whether the corrected private key is valid or not. If not, she corrects the second bit and verifies the related private key guess, and so on and so forth. Then, all the combinations must be computed in order to retrieve the true private key, and this complexity, called the naive complexity, can be computed as follows, where ε_bit denotes the bit error rate. The second scenario assumes that the adversary can find the uncertain bits by setting a threshold such that all bits predicted with a lower probability are considered uncertain; here, all the orange bits are considered uncertain. As each bit can take a value in {0, 1}, the resulting complexity can be expressed as follows. Finally, if the adversary must deal with blinding, scalar or exponent, she can consider the alternative attack introduced by Schindler and Wiemers. As the related attack depends on the targeted public-key cryptographic algorithm, we define a complexity metric for the case where the adversary targets an RSA algorithm in CRT mode, such that it can be defined as follows.
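The first two complexity measures can be sketched as follows. These formulas are illustrative reconstructions (binomial enumeration for the worst case, 2^u for u uncertain bits) matching the descriptions above, not necessarily the paper's exact metrics:

```python
import math

def naive_complexity(n_bits, error_rate):
    """Worst case: the positions of wrong bits are unknown, so the
    adversary tries every combination of up to n_err flipped bits
    among n_bits (illustrative version of the 'naive' metric)."""
    n_err = math.ceil(n_bits * error_rate)
    return sum(math.comb(n_bits, i) for i in range(n_err + 1))

def thresholded_complexity(n_uncertain):
    """Second scenario: the uncertain positions are known, and each
    uncertain bit can take a value in {0, 1}."""
    return 2 ** n_uncertain

# e.g. 10 key bits with a 20% bit error rate: up to 2 flipped bits
assert naive_complexity(10, 0.2) == 1 + 10 + 45  # C(10,0)+C(10,1)+C(10,2)
assert thresholded_complexity(11) == 2 ** 11
```

The gap between the two metrics shows why knowing the uncertain positions matters: the naive enumeration grows combinatorially in the key length, while the thresholded one only grows in the number of uncertain bits.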
Through these complexity measures, we can notice that reducing the bit error rate ε_bit is beneficial to reduce the number of remaining operations whatever the complexity measure used, and, more interestingly, these results suggest that a slight improvement in accuracy can be beneficial from an attack perspective, because a realistic improvement can be observed for a full attack scenario. In deep learning, one typical solution to increase accuracy, even slightly, is to use the ensembling principle. But what does ensembling mean? Given an adversary generating three neural networks, the ensemble method combines their individual predictions via a consensus method, such as majority voting, in order to reduce the overall error. Thus, an ensemble model, represented as follows, induces interactions with the targeted variable, such as the bit value, and interactions with the other committee members in order to reduce the global error. Indeed, following the work of Tumer and Ghosh, we know that an ensemble method can reduce the expected added error depending on the number of committee members in the ensemble model: depending on their error correlation, denoted δ, the ensemble expected error may not be reduced at all if the errors are highly correlated, or, in the best-case scenario, it can be divided by N_c, where N_c is the number of committee members, if the individual errors are uncorrelated. In the rest of this talk, we define diversity as a quantity measuring the difference in predictions among the committee members. The literature typically introduces three sources of diversity. First, type 1 diversity characterizes the variety of committee member architectures introduced in an ensemble model. Then, type 2 diversity selects a subset of members from a pool of neural networks in order to only keep those with a minimum amount of error correlation.
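The error-reduction effect of majority voting with uncorrelated members, the best case in Tumer and Ghosh's analysis, can be checked with a small Monte-Carlo simulation (illustrative, not from the paper):

```python
import random

def majority_vote_error(n_members, p_err, n_trials=20000, seed=0):
    """Monte-Carlo estimate of the majority-vote error for n_members
    independent committee members, each wrong with probability p_err.
    Independence is the best case: uncorrelated individual errors."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_trials):
        errors = sum(rng.random() < p_err for _ in range(n_members))
        if errors > n_members // 2:  # a majority of members is wrong
            wrong += 1
    return wrong / n_trials

single = majority_vote_error(1, 0.2)    # one member: error stays ~20%
ensemble = majority_vote_error(5, 0.2)  # five independent members
assert ensemble < single
```

With fully correlated members the simulation would collapse back to the single-member error, which is exactly the δ dependency mentioned above.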
And finally, type 3 diversity induces interactions between the committee members during the training process, such that even if the same neural network architecture is duplicated over the ensemble model, the training process forces the ensemble model to decorrelate their individual errors. In this work, we develop a new loss function that optimizes type 3 diversity in order to maximize the mutual information between the ensemble model and the targeted variable. From an information-theoretic point of view, the ensembling approach can be summarized as follows. Let's assume that a message is sent through a communication channel such that the output is characterized by an encoded value X. To retrieve the message, X should be decoded such that the output provides an approximation of the message. To reduce the error induced by the decoder, a function g must be found such that the following inequality holds; thus, to reduce the error term, the mutual information between X and the message should be maximized. Similarly, Brown, and then Zhou and Li, proposed to extend this notion to the ensembling approach: given a label, the ensemble model characterizes the encoded representation. Then, to retrieve the value of the correct label, the ensemble model has to generate interactions between the committee members such that the estimated label corresponds to the targeted one. Thus, the previous inequality can be rewritten as follows: to minimize the error induced by the ensemble model's predictions, the adversary has to maximize the mutual information between the targeted label and the ensemble model. To formalize this, Brown, Zhou, and Li proposed the multi-information ensemble diversity. But unfortunately, from a practical perspective, it is quite difficult to estimate higher-order interaction information because, currently, there is no effective computational approach in the literature that allows the computation of this equation.
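The channel analogy can be made concrete with a binary symmetric channel, where the mutual information 1 - H2(p) quantifies how much of the message survives the noise. This is a standard textbook formula, added here only for illustration:

```python
import math

def bsc_mutual_information(p_flip):
    """Mutual information I(X; Y) of a binary symmetric channel with a
    uniform input: I = 1 - H2(p), where H2 is the binary entropy.
    The less noise, the more information the output carries about the
    message, and the smaller the decoding error can be made."""
    def h2(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - h2(p_flip)

assert bsc_mutual_information(0.0) == 1.0  # noiseless: the full bit survives
assert bsc_mutual_information(0.5) == 0.0  # pure noise: nothing recoverable
assert bsc_mutual_information(0.1) > bsc_mutual_information(0.3)
```

The ensemble plays the role of this channel: maximizing the mutual information between the label and the ensemble output is what bounds the prediction error from above.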
So to solve this issue, Brown proposed to simplify this equation by considering only pairwise components. Typically, this pairwise approximation can be summarized as follows. First, this mutual information measures the relevance of the ensemble model for retrieving the class labels and bounds its prediction error. Then, this mutual information, called the redundancy, measures the dependency between the current classifier and the existing classifiers; it indicates that a low correlation between classifiers is needed in order to obtain an efficient ensemble model. And finally, the conditional redundancy measures the conditional dependency between the current classifier and the existing classifiers given the class label. While the relevance and the conditional redundancy should be maximized, the redundancy should be minimized in order to reduce the correlation between the committee members of the ensemble model. However, from this pairwise formulation, an adversary can question the feasibility of designing a loss function promoting the mutual information between an ensemble model and the targeted label. To solve this problem, we develop a new loss function, called the ensembling loss, which promotes interactions between the committee members in order to approximate the mutual information previously introduced. Typically, to conduct ensemble methods, three learning approaches can be considered. First, in the independent learning strategy, each committee member only interacts with the targeted label during the training process. The members are thus independently obtained, and their posterior probabilities are combined once the learning process is fully performed. Then, the sequential training process suggests that the committee members are trained one after another, such that the posterior probabilities obtained from the previous models can be used to penalize the learning process of the current model.
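The relevance and redundancy terms can be estimated from discrete predictions with a plug-in mutual information estimate; the data below is hypothetical and only illustrates why two members with identical outputs are maximally redundant:

```python
from collections import Counter
import math

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) from two aligned lists of symbols."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# toy labels and two classifier outputs (hypothetical data)
z  = [0, 0, 1, 1, 0, 1, 0, 1]
f1 = [0, 0, 1, 1, 0, 1, 1, 1]  # one error at the last-but-one position
f2 = [0, 0, 1, 1, 0, 1, 1, 1]  # identical outputs, identical errors

relevance  = mutual_information(f1, z)   # I(F1; Z): to be maximized
redundancy = mutual_information(f1, f2)  # I(F1; F2): to be minimized
assert redundancy > relevance  # duplicated members add nothing new
```

Here the redundancy exceeds the relevance because `f2` carries no information about the label beyond what `f1` already provides, which is precisely the situation the pairwise criterion penalizes.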
And finally, simultaneous ensemble learning consists in promoting interactions between the committee members during the training process in order to reduce the error correlations. In our work, a particular focus was made on this last solution. Indeed, in our paper we introduce the ensembling loss such that, given a set of profiling traces and an ensemble model, this loss function can be decomposed into three sub-losses. First, the relevance loss aims at maximizing an approximation of the mutual information between a committee member, denoted f_m, and a targeted variable Z. In other words, through the learning process, we want to penalize a model when the correct label Z is not ranked as the most likely class. To understand the impact of this loss function, let's assume a multi-class classification problem with three outputs: from a machine learning perspective, the maximization of the related mutual information tends to generate three compact clusters. If false positives or false negatives appear during the training process, the ensemble model will be overconfident in its predictions, and the resulting errors could be persistent. To reduce this effect, a solution is to provide diversity in order to limit the impact of these false positive and false negative examples, so other sub-losses must be defined. Thus, the redundancy loss is introduced to minimize the mutual information between two committee members; in other words, minimizing an approximation of this mutual information is equivalent to maximizing an approximation of the entropy of f_m given f_n. Consequently, we want to increase the uncertainty of f_m given f_n. Through the minimization of the redundancy loss, we promote cluster scattering and reduce the global confidence of the committee members on the false positives and the false negatives in order to decrease their persistence.
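As a toy illustration of how the three sub-losses interact, here is a sketch with hypothetical weights and deliberately simplified surrogate formulas, not the paper's exact definitions:

```python
import math

def ensembling_loss(preds, z, alpha=1.0, beta=0.1, gamma=0.1):
    """Toy sketch of the three sub-losses for a list `preds` of
    per-member probability vectors and a true class index z:
      - relevance: each member should give the true class z a high
        probability (a cross-entropy stand-in for the ranking term);
      - redundancy: penalize members for agreeing over the whole
        probability vector, to scatter the clusters;
      - conditional redundancy: penalize disagreement on the true
        class, to consolidate true positives and true negatives.
    alpha/beta/gamma are hypothetical weights."""
    relevance = sum(-math.log(p[z] + 1e-12) for p in preds)
    pairs = [(p, q) for i, p in enumerate(preds) for q in preds[i + 1:]]
    redundancy = sum(-sum((a - b) ** 2 for a, b in zip(p, q))
                     for p, q in pairs)
    cond_redundancy = sum((p[z] - q[z]) ** 2 for p, q in pairs)
    return alpha * relevance + beta * redundancy + gamma * cond_redundancy

# two members that agree everywhere vs. two that only disagree
# off the true class (z = 0): the second pair is rewarded
agree    = ensembling_loss([[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]], z=0)
disagree = ensembling_loss([[0.7, 0.2, 0.1], [0.7, 0.1, 0.2]], z=0)
assert disagree < agree
```

The design point this sketch captures is the tension described above: the redundancy term pushes the members apart, while the relevance and conditional redundancy terms keep them jointly anchored to the correct label.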
But unfortunately, this term can negatively affect the confidence on the correct output, so an additional term must be considered. That's why the conditional redundancy loss has been introduced, in order to maximize an approximation of the following conditional mutual information. Indeed, through this sub-loss function, we want to minimize the dissimilarity between the pairwise models f_n and f_m, knowing the targeted value. This penalization has the effect of increasing the confidence of the networks on the true positive and true negative examples, such that we consolidate the good predictions with more persistence. Thus, combining these three sub-losses is beneficial to construct an ensemble model that promotes interactions between the committee members in order to enhance their diversity. But how does each loss function considered in deep learning-based side-channel attacks actually promote diversity? To deeply understand the benefits of each loss function, we decided to employ the t-SNE visualization tool in order to reflect the evolution of the model's predictions depending on the loss function used, and these visualizations illustrate this evolution. First, when the cross-entropy loss function is considered, we observe that the ensemble model is not trained enough to efficiently discriminate each class. Indeed, there are many connections between the classes, leading to a loss of global performance. Hence, in this example, many false positives and false negatives can badly influence the global performance of the model. On the other hand, the ranking loss generates three separate clusters. And as previously noted, the ranking loss can be formulated as the relevance loss.
So through the minimization of this function, we want to minimize the conditional entropy of Z given f_m, which promotes the generation of three compact clusters; here the t-SNE visualizations confirm this result, as the ensemble model is overconfident in the features captured during the learning process. From a diversity perspective, the best solution should create three separate clusters when the ensemble model is confident in its predictions, while the errors or uncertain predictions should converge towards the point equidistant from the centroids of the clusters. Fortunately, through the t-SNE visualization, we observe that the ensembling loss converges towards this best solution, and this result tends to reduce the number of persistent false positives and false negatives, such that few errors can be detected in each cluster, in contrast with the cross-entropy or the ranking loss functions. To assess the benefits of the ensembling loss, we decided to compare the performance it provides with the cross-entropy and the ranking loss. In this talk, we consider the following dataset: a secure RSA implementation introduced by Carbone et al. at CHES 2019. More precisely, it implements three countermeasures, namely input randomization, modulus randomization, and exponent randomization, and we target the same operation as Carbone et al., namely the exponentiation algorithm, which manipulates a variable named seg3 that takes three possible values. Depending on the value of seg3, the adversary retrieves the blinded exponent bits: if two consecutive seg3 values are equal, the resulting private key bit guess equals one, and zero otherwise. Thus, knowing all the seg3 values is helpful to retrieve all the blinded exponent bits.
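The seg3-to-bit reconstruction rule described above can be sketched as a small helper; the seg3 values below are hypothetical:

```python
def blinded_exponent_bits(seg3_values):
    """Recover blinded exponent bits from a predicted seg3 sequence:
    two equal consecutive seg3 values give a bit of 1, two different
    values give a bit of 0 (illustrative helper for the rule above)."""
    return [1 if a == b else 0
            for a, b in zip(seg3_values, seg3_values[1:])]

# hypothetical seg3 predictions taking values in {0, 1, 2}
assert blinded_exponent_bits([0, 0, 2, 2, 1]) == [1, 0, 1, 0]
```

One consequence of this pairwise rule is that a single wrong seg3 prediction corrupts two adjacent key bits, which is why the bit error rate ε_bit is the quantity tracked in the complexity metrics.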
To evaluate the performance of the ensemble model in retrieving the private key bits, we can reconstruct the private key guess from a sequence of seg3 values, and then we can compute the accuracy of retrieving each bit of the private key. And because we know the accuracy, we can also derive the bit error rate ε_bit. From those metrics, the evaluator can finally compute the complexity measures that depend on the adversary's knowledge, as introduced in the first part of this talk. First, we evaluate the benefits of using a single neural network and compare the performance depending on the loss used. As only a single neural network is involved, the only losses considered are the cross-entropy and the ranking loss. Through our comparison, we observe that the ranking loss outperforms the neural network trained with the cross-entropy loss function. Then, we observe the benefits of combining five identical neural networks where no interactions are induced during the training phase. This first observation shows that a non-negligible improvement can be observed whatever the loss function used. Indeed, as previously mentioned, the ensembling approach is beneficial as it reduces the global error depending on the error correlation observed between the individual committee members. Hence, even if the same neural network architecture is considered, the initialization of the trainable parameters induces some dissimilarities between the trained committee members and enhances the ensemble model's performance. Finally, considering the ensembling loss is beneficial to generate interactions between the committee members during the training phase. And as these interactions promote uncorrelated errors, they improve the performance of the ensemble model even more. But looking at the resulting complexity measures, we can note that the number of remaining operations stays high.
However, following the European SOG-IS security guidance, a brute-force complexity of around 2^100 is considered the limit of what is practical. Thus, while the state-of-the-art results mainly consider the first row of the table, which suggests that the RSA implementation is secure whatever the adversary's knowledge, using ensemble models drastically reduces the complexity metrics such that, depending on the adversary's knowledge, the security of the implementation can be reconsidered. In such a scenario, considering ensemble methods is useful, as a slight gain in accuracy provides a realistic improvement for a full attack scenario. Finally, to fully assess the benefits of the diversity types, we perform an attack combining type 1, type 2, and type 3 diversity. In other words, we generate a pool of committee members with varied architectures, select a subset of models in order to only keep those with a minimum amount of error correlation, and finally promote interactions during the training process in order to further reduce the error correlation. Depending on the loss function used, for example the cross-entropy or the ranking loss, type 3 diversity cannot be considered. Indeed, as the ensembling loss is the only solution promoting interactions during the profiling phase, the related ensemble model is the only solution considering all the diversity types. So, once again, depending on the adversary's knowledge, the European SOG-IS scheme would consider the RSA implementation as insecure. To conclude, ensemble methods are useful to perform such attacks against PKC implementations because a slight improvement in accuracy can be enough to reconsider the security of the implementation. And even if the training time increases, it stays negligible compared with the gain in remaining operations characterized by the complexity metrics.
Through our new loss function, called the ensembling loss, we promote interactions between the committee members during the training process in order to enhance diversity and reduce the error correlation. As we assess the benefits of our proposition from a diversity point of view, its application can be extended to a wide range of scenarios, such as image classification, fraud detection, or even targeting symmetric implementations. However, for the latter use case, a trade-off should be found between computational issues and performance gains. Finally, in our paper, additional results are provided: for example, we validate our approach on a secure ECC implementation considering another multi-class classification problem, and we also evaluate the impact of the number of committee members on the diversity gain, as well as the benefits of classical ensembling methods. All of our results can of course be reproduced through a GitHub repository, and if you have any questions, do not hesitate to contact me. Thank you very much.