Welcome, everybody! Here I will present the foundations and results from our paper "Keep It Unsupervised: Horizontal Attacks Meet Deep Learning". The paper was published at CHES 2021. My name is Guilherme Perin, and this research was elaborated together with my colleague Lukasz Chmielewski and professors Stjepan Picek and Lejla Batina. This is a work from TU Delft and Radboud University.

We propose a new unsupervised attack against public-key implementations. As we know, horizontal attacks are applicable to real-world protected ECC and RSA implementations. However, due to practical limitations, horizontal attacks tend to deliver a high number of wrong bits, resulting in a low success rate. Correcting these wrong bits is not trivial, mainly because we cannot know where these wrong bits are actually located. To solve this problem, we propose a deep learning-based framework that iteratively corrects wrong bits in key candidates obtained from horizontal attacks. When a simple horizontal attack delivers only 52% of correct key bits from a private key, our framework is able to correct these wrong bits and deliver 100% accuracy.

First, let's start with some background information for our research. When attacking protected ECC implementations, an attacker can, for instance, query ECDSA commands and execute multiple signature generations on a target device. The adversary can measure side-channel traces, for instance power consumption, from the ECDSA executions. A main operation in an ECDSA execution is the scalar multiplication that uses the private key. Normally, a protected scalar multiplication would implement a scalar-blinding countermeasure, making each scalar multiplication be computed over a randomized scalar. Additional countermeasures include point randomization, for example. Scalar-blinding factors are usually 32- or 64-bit random numbers.

Let's consider a protected scalar multiplication. Besides the scalar-blinding countermeasure, vulnerabilities can still be exploited if double-and-add is implemented with the naive algorithm. In this situation, we can easily distinguish the patterns of double and add operations inside the scalar multiplication interval. From these patterns, an adversary would be able to recover the private key bits from a single trace, bypassing all other protections. The solution is therefore to consider regular algorithms such as the Montgomery ladder and double-and-add-always. In this case, as we can see in this trace illustration, double and add operations are still distinguishable from each other. However, these two operations are always executed for every private key bit, making the power consumption profile protected against SPA attacks. As a result, a protected scalar multiplication should implement at least uniform double and add operations, message or coordinate randomization, and scalar blinding. Together, these countermeasures mitigate non-profiled attacks and profiling attacks that are based on multiple side-channel traces.

An obvious solution is to attack single traces through horizontal attacks. In practice, a first step to implement a horizontal attack is to identify the trace intervals containing the processing of a single private key bit. Then, using a pattern extraction mechanism, the trace is split into several sub-traces. The number of sub-traces is equal to the number of private key, or secret scalar, bits. Exceptions happen when a sub-trace interval represents more than one scalar bit, which happens in window-based scalar multiplications that also include pre-computations.
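To make the trace splitting step concrete, here is a minimal Python sketch of how a full scalar multiplication trace could be cut into per-bit sub-traces, assuming constant-time ladder steps of equal length. The offsets `first_step_start` and `step_length` are hypothetical placeholders that an adversary would obtain from visual inspection or pattern matching; this is not the exact preprocessing code from the paper.

```python
# Sketch: split one scalar multiplication trace into per-bit sub-traces,
# assuming all ladder steps have the same length (constant-time implementation).
import numpy as np

def split_into_subtraces(trace, first_step_start, step_length, n_bits=255):
    """Cut one full scalar multiplication trace into n_bits sub-traces."""
    sub_traces = np.empty((n_bits, step_length), dtype=trace.dtype)
    for i in range(n_bits):
        start = first_step_start + i * step_length
        sub_traces[i] = trace[start:start + step_length]
    return sub_traces

# Example usage (offsets are illustrative placeholders):
# trace = np.load("scalar_mult_trace.npy")
# sub_traces = split_into_subtraces(trace, first_step_start=1200, step_length=8000)
```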
In our work, each sub-trace represents a single scalar bit. The many types of horizontal attacks include horizontal correlation analysis, horizontal collision correlation or cross-correlation attacks, online template attacks, and clustering attacks. In this work, we considered clustering attacks, as this type of horizontal attack has fewer practical limitations. To implement a clustering-based process against a single scalar multiplication trace, a k-means algorithm, for example, is applied to each sample column of the sub-trace interval. When an adversary can identify points of interest inside the sub-trace interval, it is possible to obtain a key candidate from each clustered point of interest. At the end, the adversary combines all key candidates into a final key candidate. This can be done through different statistical methods. In our paper, we follow the simplest approach: we select 20 points of interest, apply k-means to these points of interest, and at the end combine the key candidates with a majority rule.

These types of attacks encounter several limitations. Usually, splitting the trace into sub-traces requires knowledge of the implemented scalar multiplication algorithm. It is expected that an adversary would rely on target documentation, which could be available on the internet, or even on reverse engineering. Also, correcting the wrong bits in a recovered key is not an easy task. In principle, as the attack is fully unsupervised, the locations of the wrong bits are unknown to the adversary, and the brute-force alternative becomes infeasible.

As I mentioned before in this presentation, this work proposes a solution to correct erroneous or wrong bits recovered with a previous attack method, such as unsupervised clustering attacks. Let me first explain the principle behind our proposed deep learning approach. Recent research papers demonstrated that deep neural networks can actually be robust against noisy labels in training sets. In our case, the noisy labels are the wrong bits among the recovered scalar bits. For a two-class classification problem, if the amount of wrong bits is lower than 50%, a regularized neural network should still be able to learn the main classes from the training set. As a consequence, predicting a separate validation set that also contains noisy labels can result in correct predictions. The network in this case is not getting confused by the noisy labels and still makes the correct label associations.

Now, I will explain the execution flow of our proposed iterative framework. From a set of sub-traces, we first divide the set into two separate subsets that are initially labeled by an unsupervised clustering attack. Next, we train a neural network with each of the subsets. The third step corresponds to the relabeling of these two subsets. In this case, we swap the subsets and use the neural network that was trained on one subset to predict the labels of the other subset. The output predictions should now contain a lower error rate in comparison to the initial labels. As a last step, the two subsets are combined and shuffled, after which the process starts all over again. In our experiments, we always ran the framework for 50 iterations.

In a first phase, the framework is executed with neural networks having fixed hyperparameters across the framework iterations. To deal with noisy labels, we considered and compared different types of regularization methods. For comparison, we start by running the framework without any explicit regularization.
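The following is a minimal sketch of that iterative relabeling loop, assuming the sub-traces and the labels initialized by the clustering attack are already available as numpy arrays. The `build_model` argument is a placeholder for the CNNs described later in this talk (assumed to be compiled with a sparse categorical cross-entropy loss); this is not the authors' exact code.

```python
# Sketch of the iterative relabeling framework: split, train, swap-and-relabel,
# recombine, repeat. `sub_traces` has shape [n, samples], `labels` holds the
# initial bit guesses (0/1) from the unsupervised clustering attack.
import numpy as np

def iterative_relabeling(sub_traces, labels, build_model, n_iterations=50, epochs=10):
    n = len(sub_traces)
    for _ in range(n_iterations):
        # Step 1: shuffle and split the set into two halves.
        idx = np.random.permutation(n)
        half_a, half_b = idx[:n // 2], idx[n // 2:]

        # Step 2: train one neural network per half with its current (noisy) labels.
        model_a = build_model()
        model_a.fit(sub_traces[half_a], labels[half_a],
                    epochs=epochs, batch_size=64, verbose=0)
        model_b = build_model()
        model_b.fit(sub_traces[half_b], labels[half_b],
                    epochs=epochs, batch_size=64, verbose=0)

        # Step 3: swap the halves and relabel each half with the network trained
        # on the other half; the new labels should contain fewer errors.
        labels[half_b] = np.argmax(model_a.predict(sub_traces[half_b], verbose=0), axis=1)
        labels[half_a] = np.argmax(model_b.predict(sub_traces[half_a], verbose=0), axis=1)

        # Step 4: both halves are recombined and reshuffled at the top of the loop.
    return labels
```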
Then, we consider dropout, dropout plus data augmentation, and data augmentation only. Data augmentation is based on random shifts applied to the sub-traces. In a second phase, we randomize the hyperparameters and define a different neural network at each framework iteration. In this case, we also combine random hyperparameters with regularization methods. The motivation for random hyperparameters is simple: training the same neural network multiple times with similar labels will overfit the model very early in the training process. Therefore, using different neural networks reduces this overfitting.

For this work, we consider two datasets that are software implementations of ECC on ARM Cortex-M. The implementation is based on Curve25519 and uses the Montgomery ladder for the scalar multiplication. As an extra protection, the Montgomery ladder implements a conditional swap (cswap) for constant-time operation. In the first dataset, the cswap is based on arithmetic operations, and the second dataset uses pointer swapping. For the rest of this presentation, I will refer to these datasets as cswap-arith and cswap-pointer. In total, we measured 300 power traces for each of the implementations. As each scalar multiplication contains 255 operations with a random scalar, we have a total of 300 times 255, which equals 76,500 sub-traces. For each scalar multiplication measurement, the scalar is randomized.

This is an illustration of how we preprocessed the datasets in order to split a full scalar multiplication trace into a set of sub-traces. The first step is to identify each of the ladder steps and cut the full trace accordingly. In our case, for each sub-trace, we select the interval corresponding to the others and control executions, mainly because we exploit the leakage present after the field or ECC operations of each ladder step. The cswap-arith dataset contains 8,000 samples per sub-trace, and the cswap-pointer dataset contains 1,000 samples per sub-trace.

We also performed a leakage assessment on the datasets in order to understand the leakage present in our measurements. For that, we used a t-test and the signal-to-noise ratio (SNR). Using the true labels and the signal-to-noise ratio, we can clearly identify the leakage locations inside the sub-trace intervals for both datasets (a short sketch of this computation is shown after this part). When we consider the labels obtained from the unsupervised clustering attack, which has an accuracy of 52%, we can see that the detected leakage is not very significant anymore. However, the main SNR and t-test peaks are still located around the same regions, indicating that this 52% will be enough as a label initialization.

Let's see the results that we obtained by applying the proposed deep learning framework. In terms of training configuration for each neural network, we consider 10 epochs, as overfitting happens very early as a consequence of the framework iterations. The batch size is set to 64. 250 scalar multiplications are considered for training, which are divided into two groups; each group then contains 31,875 sub-traces. A separate set of 50 scalar multiplications, making a total of 12,750 sub-traces, is used as a test set in order to verify the accuracy at the end of each epoch. Now, I will shortly describe the neural networks that we use in the case of fixed hyperparameters. For both datasets, we consider similar convolutional neural networks, where the main difference is the kernel size and stride in the convolution layers.
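For reference, here is a small sketch of a per-sample signal-to-noise ratio computation for the leakage assessment mentioned above, assuming `sub_traces` has shape [n, samples] and `bits` holds the (true or clustering-derived) bit value of each sub-trace. The t-test would follow a similar column-wise pattern; the variable names are illustrative.

```python
# Sketch: per-sample SNR = variance of the class means / mean within-class variance.
import numpy as np

def snr_per_sample(sub_traces, bits):
    classes = np.unique(bits)
    class_means = np.array([sub_traces[bits == c].mean(axis=0) for c in classes])
    class_vars  = np.array([sub_traces[bits == c].var(axis=0)  for c in classes])
    # Signal: variance of the class means; noise: average within-class variance.
    return class_means.var(axis=0) / class_vars.mean(axis=0)

# snr = snr_per_sample(sub_traces, bits)
# Peaks in `snr` indicate candidate leakage samples inside the sub-trace interval.
```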
The network contains a first average pooling layer to reduce the dimensionality of the traces, followed by a batch normalization layer and three convolution layers. For the classification part, we consider two dense layers with 100 neurons each. A softmax is placed as the output layer. When dropout is considered as a regularization method, we place two dropout layers in between the dense layers. The dropout rate is set to 0.5. (A minimal code sketch of this architecture is shown after this part of the talk.)

First, I will present the framework results for the case of fixed hyperparameters. As we can see, without regularization, we cannot achieve a successful attack for either dataset after 50 iterations. The y-axis indicates the maximum key recovery rate, or accuracy, for a single scalar multiplication trace. When data augmentation is considered, we obtain better results. For the cswap-arith dataset, the improvement was not very significant. However, for the cswap-pointer dataset, the addition of data augmentation was enough to make the attack successful. A significant improvement for the cswap-arith dataset is observed when dropout is considered as a regularization method. In this case, we are able to reach a final and maximum single-trace accuracy of 93%. For the other dataset, dropout also provided 100% accuracy. Now, the combination of dropout and data augmentation delivers the best results so far. For the cswap-arith dataset, we reached a successful attack, which was not possible before with only dropout or only data augmentation. For the cswap-pointer dataset, the combination of both regularization methods allows us to achieve 100% after only three framework iterations.

To define random neural networks, we vary the number of filters, kernel sizes, and strides in the convolution layers, the number of dense layers, and the number of neurons in the dense layers. We also vary the activation function, which is the same for all hidden layers. Now, let's move to the results with random hyperparameters. Without any regularization, we are able to obtain a very high final single-trace accuracy for both datasets. Adding data augmentation is enough to achieve 100% single-trace accuracy for both datasets. In this case, dropout regularization combined with random hyperparameters also delivers good results. A higher improvement can be seen for the cswap-pointer dataset. Finally, the combination of dropout and data augmentation again delivers successful results. In this case, the combination of the two regularization methods did not improve the results in comparison to dropout only.

Here, I want to show a final comparison between the framework executions for fixed and random hyperparameters. Clearly, random hyperparameters increase the chances to succeed. For each regularization scenario, we run 10 framework executions. For fixed hyperparameters, only one case returned 100% key recovery. For the random case, we achieve 100% key recovery in 12 out of 40 executions. Note also how the case without regularization provides better results when random hyperparameters are considered: we obtain a final single-trace accuracy of 86%, while for fixed hyperparameters we were able to reach only 76%. For the cswap-pointer dataset, the overall results show some differences between fixed and random hyperparameters. Indeed, fixed hyperparameters delivered better results. In total, we were able to succeed in 25 out of 40 cases for fixed hyperparameters. When random hyperparameters are considered during the framework executions, we recover the full key in 17 out of 40 cases.
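Here is a minimal Keras sketch of the fixed-hyperparameter CNN described at the beginning of this part. The pooling size, number of filters, kernel size, and stride values below are illustrative placeholders rather than the exact values from the paper (which also differ between the two datasets).

```python
# Sketch: average pooling + batch normalization + three 1D convolutions,
# then two dense layers of 100 neurons (optionally with dropout 0.5) and softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fixed_cnn(n_samples, use_dropout=True):
    model = models.Sequential()
    model.add(layers.Input(shape=(n_samples, 1)))
    model.add(layers.AveragePooling1D(pool_size=2))       # reduce trace dimensionality
    model.add(layers.BatchNormalization())
    for filters in (8, 16, 32):                           # three convolution layers
        model.add(layers.Conv1D(filters, kernel_size=10, strides=5, activation="relu"))
    model.add(layers.Flatten())
    model.add(layers.Dense(100, activation="relu"))
    if use_dropout:
        model.add(layers.Dropout(0.5))                    # dropout between dense layers
    model.add(layers.Dense(100, activation="relu"))
    if use_dropout:
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(2, activation="softmax"))      # one output per bit value
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Such a function could serve as the `build_model` placeholder in the framework loop sketched earlier; the random-hyperparameter variant would simply sample filters, kernel sizes, strides, dense layer counts, neuron counts, and the activation function before building the model.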
We also investigated the possibility of improving the results by trimming the attack trace interval. We do that by means of input gradient visualization, which indicates which samples are more important for the neural network decisions. It is important to mention that the calculation of input gradients is totally independent of the labels (a small sketch of this computation is included at the end of this transcript). As you can see, the largest input gradients coincide with the main signal-to-noise ratio peaks obtained with the true labels, indicating that our trimmed interval based on gradients is aligned with the leakage location.

These are the best results obtained with fixed hyperparameters for both datasets after using gradient visualization. For the cswap-arith dataset, all regularization cases provided successful results. For the cswap-pointer dataset, the dropout-only case did not provide 100% accuracy after 50 iterations. Gradient visualization usage also affects the results for random hyperparameters. For the cswap-arith dataset, all scenarios, including no regularization, succeeded. For the cswap-pointer dataset, we can see a very early convergence for the regularization cases.

Here we can see how selecting a trimmed interval improves the results for the cswap-arith dataset. In the particular case of fixed hyperparameters, we were able to succeed only one time before. After gradient visualization, we obtained 19 out of 40 successful attacks. A significant improvement is also verified for the random case. On the other hand, for the cswap-pointer dataset, trimming the attack interval does not improve the results. This is mainly because the attack interval of 1,000 samples was already short enough.

As conclusions, we demonstrated that deep learning approaches can improve the attack success rate due to their capacity to deal with noisy or wrong labels. Based on that, we proposed an iterative framework with regularized models that can achieve successful attack results. The framework should be applicable to other targets, including RSA implementations. As future work, we would like to investigate how the framework behaves in the presence of trace misalignment and with other label initialization methods. Thank you very much for watching this presentation.
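As a closing technical note, here is a minimal Python sketch of the input gradient visualization step referenced above, assuming a trained Keras `model` and a batch of sub-traces shaped [n, samples, 1]. The function name and the choice of using the maximum class score are illustrative assumptions, not the authors' exact implementation; note that no labels are needed for this step.

```python
# Sketch: gradient of the network output with respect to the input samples.
# Larger average absolute gradients point to samples that drive the decision,
# which can be used to trim the attack interval.
import numpy as np
import tensorflow as tf

def input_gradients(model, sub_traces):
    x = tf.convert_to_tensor(sub_traces, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        predictions = model(x, training=False)
        # Differentiate the maximum class score of each sub-trace.
        scores = tf.reduce_max(predictions, axis=1)
    grads = tape.gradient(scores, x)
    # Average absolute gradients over all sub-traces to rank sample importance.
    return np.mean(np.abs(grads.numpy()), axis=0).squeeze()

# importance = input_gradients(model, sub_traces)
# Samples with the largest values are candidates for the trimmed attack interval.
```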