Hello and welcome to this presentation of the paper "DL-LA: Deep Learning Leakage Assessment, A Modern Roadmap for SCA Evaluations". My name is Thorben Moos and this is joint work with my colleagues Felix Wegener and Amir Moradi from Ruhr University Bochum. This work proposes a new strategy for black-box leakage assessment. In simple words, leakage assessment is a technique to determine whether input-dependent information can be detected in side-channel measurements of a device under test. Typically, if I can learn information about the processed inputs from a side-channel trace, then indeed some kind of side-channel leakage is present. This does not necessarily mean that the device is vulnerable to attacks, but it is usually desirable that no detectable input-dependent information is leaked to potential adversaries.

The most common approach to perform leakage assessment is to try to distinguish leakage distributions for different input classes using a statistical hypothesis test. The most commonly used input classes are fixed versus random, where one group is collected for a fixed input and one group for many different random inputs, and fixed versus fixed, where two groups are collected for two distinct fixed inputs. Then either Welch's t-test or Pearson's χ²-test is used to evaluate the null hypothesis, which states that both groups are drawn from the same population and are therefore indistinguishable. This method has been applied for many years in the community, and different extensions to the tests and the methodology exist to tackle specific problems. But there are still a number of aspects where the technique works less than ideally. First of all, side-channel measurements usually consist of hundreds or thousands of sample points, or even more, but the classical tests are normally applied to each sample point individually instead of to the full trace at once. Therefore, side-channel leakages that are spread over multiple points, for example multivariate or horizontal leakages, may go undetected. Also, it is known that the separation of statistical orders in Welch's t-test can lead to false negatives when the noise level is low and the leakage is of higher order, or when the leakage is distributed over multiple statistical moments.

We have thought about techniques that could potentially avoid some of these drawbacks, and we noticed that leakage assessment is essentially a statistical classification problem. If we can build a piece of software, an algorithm, or a machine that determines whether a leakage trace belongs to group 0 or group 1 with a better success rate than random guessing, then indeed input-dependent information is included in the traces. And to us, this seemed like a pretty straightforward task for a neural network. Again, the potential benefits in comparison to the classical methods would be that the network receives each trace in full length and can therefore make a decision with respect to the full trace instead of one point only, and also that the classifier is not limited to simple univariate first-order leakage but inherently captures the other forms of leakage that commonly occur.

Based on this idea, we have created a new leakage assessment methodology which we call Deep Learning Leakage Assessment, or DL-LA for short. The procedure looks like this. First, we record a set of side-channel measurements while supplying the device under test with a randomly interleaved sequence of two distinct fixed inputs.
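As a brief illustration of this first step, the random interleaving could be realized as follows. This is only a sketch; the trace count and the two fixed plaintext values are hypothetical placeholders, not the ones used in the paper.

```python
import numpy as np

# Hypothetical placeholders, not the values used in the paper.
N_TRACES = 100_000
FIXED_INPUT_0 = bytes.fromhex("0000000000000000")  # 64-bit PRESENT plaintext
FIXED_INPUT_1 = bytes.fromhex("ffffffffffffffff")

rng = np.random.default_rng()

# One random label per measurement: the two fixed inputs are randomly
# interleaved so that slow environmental drift cannot bias one group.
labels = rng.integers(0, 2, size=N_TRACES)
inputs = [FIXED_INPUT_0 if b == 0 else FIXED_INPUT_1 for b in labels]
```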
Then we split the whole set into a training set and a validation set and train the neural network on the training set. Afterwards, we use the trained network to classify the traces in the validation set while hiding the true labels from the network. Once this is done, we can compare the classifications with the true labels and count the number of correct classifications. From that number, we can calculate the probability that this many correct classifications could have been achieved just by chance. If this probability is smaller than a chosen threshold, for example the typical 10^-5 threshold in leakage assessment, then we can reject the null hypothesis. The number of traces for the detection is then the cardinality of the training set only, because this set was sufficient to extract and learn a generalizable feature from the traces. During validation the network is not updated, so it cannot absorb any additional information from the validation set. But of course, the number of traces needed for the whole evaluation is the combined cardinality of the training and validation sets.

Here is the formula to calculate the probability that s_M correct classifications are achieved by a randomly guessing binary classifier: p = 0.5^M · Σ_{k=s_M}^{M} C(M, k), where M is the cardinality of the validation set, C(M, k) are binomial coefficients, and the validation accuracy is v = s_M / M. To get an intuition for the validation accuracy required to confidently reject the null hypothesis for different sizes of the validation set, we have listed a few examples here. For a set of only 76 traces, at least 57 of them need to be classified correctly, which corresponds to a 75% validation accuracy. For a set of 45,600 validation traces, for example, a validation accuracy of 51% is already sufficient to reject the null hypothesis. For even larger validation sets, the required validation accuracy shrinks further and further. In that regard, it becomes clear that DL-LA does not require the trained classifier to be very good at classifying a single trace. It is only important that it classifies traces better than a random guess over a large number of traces.

This slide shows the Python code that defines the multi-layer perceptron in Keras which we have used throughout this work. It is an extremely simple network with four fully connected layers and two output neurons for the classification. We have kept it very simple on purpose to make it suitable for a wide range of side-channel traces and different forms of side-channel leakage. The second network we have used for our case studies is a convolutional neural network; its description in Python code is shown on the next slide. Here we actually vary a couple of parameters and will evaluate the robustness of its detection performance across eight different hyperparameter combinations.
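Since the slide code is not reproduced in this transcript, the following is a minimal sketch of what such a four-layer Keras MLP and the confidence computation could look like. The layer widths, optimizer, and loss function are illustrative assumptions, not necessarily the exact choices from the paper, and the CNN is omitted here.

```python
import numpy as np
from scipy.stats import binom
from tensorflow import keras
from tensorflow.keras import layers

def make_mlp(trace_len):
    # Four fully connected layers with two output neurons for the
    # classification; the layer widths are assumptions for illustration.
    model = keras.Sequential([
        keras.Input(shape=(trace_len,)),
        layers.Dense(120, activation="relu"),
        layers.Dense(90, activation="relu"),
        layers.Dense(50, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def dlla_p_value(n_correct, n_val):
    # p = 0.5^M * sum_{k=s_M}^{M} C(M, k): the probability that a randomly
    # guessing binary classifier achieves at least n_correct out of n_val
    # correct classifications (binomial tail with success probability 0.5).
    return binom.sf(n_correct - 1, n_val, 0.5)

# 57 of 76 correct yields a probability on the order of 1e-5,
# consistent with the first example above.
print(dlla_p_value(57, 76))
```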
We have tested our new leakage assessment strategy in a total of nine different case studies using side-channel data from different implementations and platforms. We always compare DL-LA to the t-test and the χ²-test in order to judge the quality of the deep-learning-based leakage detection. The first seven case studies are all based on hardware implementations measured on an FPGA. For some of them the traces are aligned, for others they are misaligned. Some circuits are protected by a masking countermeasure, some are not. Some of the trace sets contain univariate leakage, while others include only multivariate leakage. In addition to the FPGA-based case studies, there is also one case study based on measurements from a custom ASIC and another one where a software implementation is executed on an ARM Cortex-M0 microcontroller. In summary, we have tested our method on a large variety of realistic target implementations and compared its success to that of the classical methods. The one thing that all case studies have in common is the underlying cryptographic primitive, namely the PRESENT-80 ultra-lightweight block cipher.

This slide shows the architecture of the unprotected serialized PRESENT hardware implementation that is analyzed in the two following examples. At the top of this slide, we see a sample trace recorded by measuring the voltage drop over a 1-ohm shunt resistor in the VDD path of the power supply of a Spartan-6 FPGA using a digital sampling oscilloscope. At the bottom, there are four figures showing the results of a univariate first-order t-test and a χ²-test. They also show the progression of the highest confidence values over the number of traces. Clearly, a significant amount of leakage is present and can be detected with confidence after fewer than 100 traces. The maximum confidence, expressed as a -log10(p) value, is slightly above 80.

Using deep learning leakage assessment on the same trace set, we achieve the results shown here. Since the DL-LA procedure does not produce one confidence value per sample point but only one in total, we have to use a different method to extract spatial information about the leakage in the trace. This can be done with a so-called sensitivity analysis based on input activation gradients. This method locates the points of interest by quantifying how much they contributed to the leakage function learned by the neural network. More details on that can be found in the paper, and a rough code sketch follows below. On the right side, the confidence values over the number of training traces are depicted. Obviously, the confidence values achieved are much, much higher than for the classical methods. However, for the sake of completeness, we have to mention that the validation set is not counted in this figure, and that the total number of traces required for the evaluation is sometimes higher than for the traditional methods in these simple case studies with unprotected implementations.

The next case study we want to take a look at is based on the same PRESENT-80 implementation as before. But this time, the clock of the circuit was randomized by an LFSR circuit, leading to a misalignment of the traces and a large amount of noise. We can see that the traditional methods now require up to 5,000 traces to detect the leakage. Using DL-LA for the same analysis, the detection succeeds with about 150 training traces. However, the achieved confidence is also reduced by a large margin, and the identification of the points of interest might not be as sharp as for the t-test.
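To make the sensitivity analysis mentioned above more concrete, here is one possible way to compute input-activation gradients for a trained Keras classifier, assuming a TensorFlow backend. This is a sketch of the general idea; the exact procedure used in the paper may differ in its details.

```python
import numpy as np
import tensorflow as tf

def sensitivity(model, traces, labels, batch_size=1024):
    # Accumulate the absolute gradient of the correct-class score with
    # respect to every input sample point; points with a large average
    # gradient contributed most to the learned leakage function.
    total = np.zeros(traces.shape[1])
    for start in range(0, len(traces), batch_size):
        x = tf.convert_to_tensor(traces[start:start + batch_size], tf.float32)
        y = tf.convert_to_tensor(labels[start:start + batch_size], tf.int32)
        with tf.GradientTape() as tape:
            tape.watch(x)
            out = model(x, training=False)           # shape (batch, 2)
            score = tf.gather(out, y, axis=1, batch_dims=1)
        grads = tape.gradient(score, x)              # d(score)/d(input)
        total += tf.reduce_sum(tf.abs(grads), axis=0).numpy()
    return total / len(traces)  # per-sample-point sensitivity profile
```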
Here we see a serialized 3-share threshold implementation of the PRESENT-80 cipher, which is the underlying hardware implementation for the next case study. The S-box of PRESENT needs to be decomposed into two quadratic functions before a threshold implementation (TI) with 3 shares can be realized. At the top of this slide, a sample trace is shown, which has been recorded during the execution of the first one and a half rounds of the PRESENT threshold implementation. Below, there are again t-test and χ²-test results showing the detection of higher-order leakage. For the t-test, only the third order is shown, as the statistical confidence is highest for this moment, but the results for the first and second orders can also be found in the paper. Since the leakage of the implementation is found partially in the second and partially in the third statistical moment, it is no surprise that the χ²-test outperforms the t-test here.

We also apply DL-LA to the same traces. But unlike before, we do not plot the confidence values over the number of traces anymore, because such an analysis requires the analyst to train a classifier many times using different sizes of the training set. Although this allows a nicer comparison to the classical methods, it is computationally expensive and rather unnecessary. Therefore, we have first chosen a fixed training set of 3 million traces and plot the confidence value over the number of epochs, to show that a much higher confidence can be achieved by DL-LA than by the classical methods with the same number of traces. We can also see an important conceptual difference between leakage localization using sensitivity analysis on a trained network and the classical approaches. Parts of the trace that contribute to the leakage but are heavily correlated with other leaking parts might not be learned and used for distinguishing the two groups, because there is no intrinsic incentive for the neural network to learn redundant information. Thus, the leakage is pinpointed only at the point of the strongest leakage here, and not at every point of the trace where leakage is actually located. Here we have used a significantly smaller training set to demonstrate that with DL-LA it is also possible to confidently detect leakage using a smaller number of traces than the t-test and χ²-test require. The confidence threshold is overcome when training on only 500,000 traces, although just barely.

The final case study we are going through in this presentation is also based on a PRESENT threshold implementation. But here we ensured that all six component functions G1, G2, G3, F1, F2, F3, together with the respective registers, are clocked sequentially and not in parallel. We did this by gating their respective inputs with AND gates which are controlled by a finite state machine. Therefore, no univariate leakage should be detectable, but combining multiple points in the trace should still make it possible to distinguish the two fixed groups from each other. We can see here that indeed neither the univariate t-test nor the univariate χ²-test detects any leakage using up to 50 million traces. However, when using a multivariate t-test, combining the respective points with the correct offsets, we can get a confident detection after roughly 45 million traces. Yet this is a process that cannot easily be automated for an arbitrary implementation, as it requires exact knowledge about the underlying implementation. Thus, it is not perfectly suited for a black-box leakage assessment. Using DL-LA, we do not require any knowledge or information about the underlying implementation. We simply train a classifier on 20 million raw traces, and the neural network learns the multivariate leakage automatically. When we evaluate the classifier on a further 5 million traces, we obtain a high confidence to reject the null hypothesis. These cases of multivariate higher-order leakage are the most relevant and interesting application scenarios for DL-LA, since the classical hypothesis tests cannot easily be extended to capture such leakages universally.
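For contrast, a bivariate second-order t-test of the kind referenced here could be sketched as follows, using a centered-product combination of two sample points. The offsets i and j, i.e. which points to combine, must be known in advance; that is exactly the implementation knowledge DL-LA does without. The combination function shown is a common textbook choice, not necessarily the one used in the paper.

```python
import numpy as np
from scipy.stats import ttest_ind

def bivariate_ttest(traces0, traces1, i, j):
    # Second-order combination: center the two sample points around their
    # overall means, multiply them, then apply Welch's t-test to the
    # combined values of the two fixed groups.
    all_traces = np.concatenate([traces0, traces1])
    mu_i, mu_j = all_traces[:, i].mean(), all_traces[:, j].mean()
    c0 = (traces0[:, i] - mu_i) * (traces0[:, j] - mu_j)
    c1 = (traces1[:, i] - mu_i) * (traces1[:, j] - mu_j)
    # equal_var=False selects Welch's t-test; returns (statistic, p-value).
    return ttest_ind(c0, c1, equal_var=False)
```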
Of course, the ability of the networks to automatically find and learn different leakage patterns does not come for free, but at the price of a significant runtime overhead, just like in most applications of deep learning. Here we have listed some numbers regarding the runtime on our machines. In short, if you want to analyze the same number of traces and train the classifier for between 30 and 50 epochs, the DL-LA procedure will have a runtime that is about 100 times larger, even when using GPU support. Yet the absolute runtimes are still reasonable for a realistic number of traces, and the approach might spare you a lot of manual work in the more complex scenarios.

Finally, we have also tested the CNN which was introduced earlier for its ability to learn the different forms of leakage in the case studies from the different platforms. Also in this case, we have tested whether small changes to its parameters have destructive effects on its detection performance. We can see that, although changes are noticeable, the confidence values achieved are typically in the same range. The largest variation is seen for case study 4 here.

To conclude this talk, we have shown that leakage assessment using neural networks is feasible and worth the extra runtime for complex detection scenarios. We demonstrated that very simple networks can deliver quite universal detection performance, even when faced with multivariate higher-order leakages. Also, we have seen that the number of traces required for detection is typically lower for DL-LA than for the classical detection methods, while DL-LA also provides a higher confidence. Finally, since DL-LA produces one confidence value per set of traces instead of one value per sample point, only a single statistical test is performed rather than one per point, so the risk of false positives is lower if the same threshold is chosen.

With that, I would like to end this talk. Thank you for your attention. If there are any questions, feel free to ask them during the live session at CHES 2021 on September 16th.