The first presentation is entitled Improving CEMA Using Correlation Optimization. The work is by Pieter Robyns, Peter Quax, and Wim Lamotte, and the presentation will be given by Pieter. Right. Thank you for the introduction. It's an honor to be here, and in this presentation I'll talk about how to improve CEMA attacks using a technique that we developed called correlation optimization. First of all, I want to start with an introduction to what electromagnetic side-channel attacks are. I know a lot of you are already familiar with this, so I apologize to those of you who already know a lot about EM side-channel attacks. For the newbies around here: EM side-channel attacks are possible when you have leakage from a device that differs between key-dependent operations. This presentation will talk about CEMA attacks on AES in particular. What happens in a CEMA attack is that we use the Pearson correlation as a metric to compare the leakage against a certain key hypothesis that the attacker composes. Schematically, the attacker first sends the plaintext to encrypt to the device. The device performs the encryption and inadvertently leaks some electromagnetic radiation during the computations. Finally, the attacker simulates the leakage for each possible value of a section of the key, for example a single byte, and tries to correlate these simulations with the actual measurements. The key byte with the highest correlation is then our best guess, and that's what we select for our attack. Now, more formally, we have a vulnerable point in the AES algorithm, which is typically chosen as the S-box output of the plaintext XOR'ed with the key, which happens in the first round of AES. What the attacker can do is compose a model for each possible value of the key byte j. So he essentially creates 256 models, each with its own respective leakage value for each of the traces. Note that we are using the Hamming weight leakage model here; we could use another leakage model, such as Hamming distance, as well. Hamming weight is actually a more simplified version where you assume that the previous register value is zero, but any leakage model can be used here. What we end up with is a kind of matrix of models for what the power consumption, or the EM leakage, would look like, and then we can take the correlation with the actual measurements x_t of our traces.
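To make this step concrete, here is a minimal NumPy sketch of the Pearson-correlation attack just described (my own illustration, not the speaker's code); the `sbox` argument is assumed to be the standard 256-entry AES S-box lookup table, and `traces` holds the measured EM samples.

```python
# Minimal CPA/CEMA sketch: model the leakage of Sbox(p XOR k) with the Hamming
# weight for every key-byte guess, correlate each guess with the measured
# traces, and keep the guess with the strongest correlation peak.
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weight table

def cema_best_guess(traces, plaintext_bytes, sbox):
    """traces: (n_traces, n_samples) float NumPy array of EM measurements.
    plaintext_bytes: (n_traces,) NumPy array with the attacked plaintext byte.
    sbox: assumed standard 256-entry AES S-box as a NumPy array."""
    # Hypothetical leakage for all 256 key guesses: shape (n_traces, 256)
    hyp = HW[sbox[plaintext_bytes[:, None] ^ np.arange(256)]].astype(float)

    # Column-wise Pearson correlation of every guess with every sample point
    t = traces - traces.mean(axis=0)
    h = hyp - hyp.mean(axis=0)
    corr = (h.T @ t) / (np.outer(np.linalg.norm(h, axis=0),
                                 np.linalg.norm(t, axis=0)) + 1e-12)

    # Best guess: the key byte whose correlation peaks highest over all samples
    return int(np.argmax(np.abs(corr).max(axis=1)))
```

In the talk's notation, each column of `hyp` is one of the 256 leakage models and `corr` holds their Pearson correlations with the measurements x_t.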
Now, the motivation for this work is that if we look at recent advances in machine learning and deep learning, and in particular at the review paper that was published a few years ago by Yann LeCun, one of the principal authors in the machine learning domain, we can see that machine learning and deep learning algorithms consistently outperform classical methods in the area of pattern recognition. If you look at, for example, face recognition or handwriting recognition, we see that machine learning and deep learning models typically perform much better. So if you consider side-channel analysis as a kind of pattern recognition problem, the question is: can we apply this to side-channel analysis as well and achieve very good performance with these deep learning and machine learning models? There are already some promising results in recent related work, so feel free to check those references out if you want to. They are at the bottom of the slide here. And this is an example of what a trace looks like for AES, in this case AES-128. What happens in previous works, mostly, is that you have a certain model that is being trained, and usually it's a deep convolutional neural network. What does that mean? You have a stack of layers, convolution layers and pooling layers. If you train this model, it will essentially learn some kind of filter that slides over your inputs. Then you do some pooling on those outputs, and that goes on to the next layer until you reach a certain output. In this case, the output is a probability distribution for the intermediate value of the key byte taking on a certain value. Considering that we have 256 possible values, you have 256 possible probabilities, and we try to optimize the average cross-entropy loss between the true probability distribution and the predicted probability distribution, which is the output of the neural network. Typically we attack one key byte at a time, because then we have 256 classes; if you would attack more than one byte, you would of course need more classes. Alternatively, what we can also do is try to predict the Hamming weight instead; in that case, we have only nine classes. Then, to attack the entire key instead of one byte, you can train multiple convolutional neural networks to attack, for example, the first byte, the second byte, and so on. Or you can multiply the number of outputs by 16 to give a prediction for all key bytes simultaneously. Those are all possibilities. Now, what we did in our work is something slightly different. It's more inspired by recent works in face recognition, where the idea is not to classify and output a certain probability, but rather to learn a representation, or encoding, of the inputs. In this case, our inputs are the EM traces. So we are going to learn an encoding of the traces that is correlated with the true leakage value emitted by the device. We can do this by optimizing the correlation loss function, which is also known in the machine learning community as the cosine proximity function. The advantage of this methodology is that we only have one value per key byte as output. So the number of outputs is reduced by a factor of nine for Hamming weight learning and by a factor of 256 if we classify a single byte value. This also makes it trivial to learn a model for the entire key instead of just one byte of the key. However, the disadvantage is that we need to perform a standard CEMA attack on the outputs of the neural network, so on the encodings, instead of on the actual traces. But this is also very fast, because it only needs 16 points for a 16-byte key.
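As a hedged sketch of this correlation-optimization idea (an illustration under my own assumptions, not the authors' released code), a shallow Keras model with a single output per key byte can be trained by minimizing the negative Pearson correlation between its encodings and the true leakage values over each mini-batch; the layer sizes and the 700-sample input length are placeholders.

```python
import tensorflow as tf

def correlation_loss(y_true, y_pred):
    # Negative Pearson correlation between the true leakage values (e.g. the
    # Hamming weights of the attacked intermediate) and the network's encodings,
    # computed over the mini-batch; minimizing it maximizes the correlation.
    yt = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    yp = tf.reshape(y_pred, [-1])
    yt = yt - tf.reduce_mean(yt)
    yp = yp - tf.reduce_mean(yp)
    return -tf.reduce_sum(yt * yp) / (tf.norm(yt) * tf.norm(yp) + 1e-8)

# A deliberately shallow MLP: one hidden layer, one encoding value per trace.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(700,)),            # assumed trace length
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),                # one encoding per key byte
])
encoder.compile(optimizer="adam", loss=correlation_loss)
# encoder.fit(profiling_traces, true_hamming_weights, batch_size=256, epochs=50)
```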
A second contribution is that we devised a methodology to remove the alignment requirement. Because, as you know, if you want to perform a CEMA attack on a set of traces, you want to make sure that every trace is aligned at the same point; otherwise you would be taking correlations at different samples, which is not what you want, of course. As a concrete example of what this looks like, suppose that we have five traces, five measurements, and we want to predict what the leakage would look like for one byte of the key. So suppose that the true Hamming weight values of the intermediate value, which is S-box(p XOR k), are five, six, seven, five, and one. What we then do is feed the input traces to the neural network and try to optimize the correlation with those true values. As output, we can get something like this. This is an actual output of a neural network that I trained, and as you can see, these values are very strongly correlated with the true Hamming weight values; in fact, the correlation is very close to one. And even if we multiply them by 100, for example, because the Pearson correlation is independent of scale, we still have a large correlation with the actual Hamming weight values. So what we do, in essence, is discard the useless points from our input traces and end up with only one point, which represents all the information we need to perform an effective CEMA attack on the outputs. Then, about removing the trace alignment requirement: the problem with these kinds of networks is that they are pretty simple, and if you would, for example, translate the trace by one sample to the right, so that x1 becomes x2 and so on, then the learned weights would no longer correspond to the right features. That's why MLPs are very sensitive to feature translations. As a solution, we decided to use the magnitude, or power, spectrum of the Fourier transform and use that as the features for the neural network. A similar idea has been applied in a DEMA context by Tiu et al. in 2005, and we are borrowing from that idea and applying it in a machine learning context in our work. Now, why does this work? Here is a little demo that I made. Suppose that we have a signal composed of two leaking signals, one at 4 hertz and the other at 30 hertz. Here you can see them both added to each other. I also added some random uniform noise, which is why you see some noise in the FFT here below. What happens if we phase-shift the signal? You can see that those two leaking frequencies are still visible in the Fourier transform. So if we use this as input to the neural network instead of the time-domain signal, we are certainly more resistant to translations.
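Along the lines of that 4 Hz / 30 Hz demo, here is a small NumPy sketch (my own reconstruction, with an assumed sample rate and shift) of why the magnitude spectrum tolerates misalignment: a time shift only changes the phase of the DFT, so the leaking frequency bins keep essentially the same magnitude.

```python
import numpy as np

fs = 200                                          # assumed sample rate (Hz)
t = np.arange(0, 1, 1 / fs)
trace = (np.sin(2 * np.pi * 4 * t)                # "leaking" component at 4 Hz
         + 0.5 * np.sin(2 * np.pi * 30 * t)       # second leaking component at 30 Hz
         + np.random.uniform(-0.1, 0.1, t.size))  # a bit of uniform noise

shifted = np.roll(trace, 25)                      # translate / phase-shift the trace

mag = np.abs(np.fft.rfft(trace))
mag_shifted = np.abs(np.fft.rfft(shifted))

# For a circular shift the magnitude spectra are identical; for real (non-circular)
# misalignment they stay approximately equal, which is what the MLP relies on.
print(np.allclose(mag, mag_shifted))              # True
```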
All right, on to the more interesting stuff: the actual results. We have two experiments. In the first experiment, we compare to a CNN-based model on the ASCAD dataset. This dataset features a protected software AES implementation, so it is protected against first-order side-channel attacks. In the second experiment, we use our own dataset, which is just a very noisy and unaligned measurement of an unprotected AES. The goal of this dataset is more to show resilience against feature translations and noisy measurements. We have also released this dataset to the public domain, so if you want to do your own experiments with that data, feel free to do so. What we show is that even with a very simple architecture, such as a two-layer MLP, we are able to outperform previously introduced deep learning models, such as an eight-layer CNN. Now, the ASCAD dataset was introduced in a very cool work by Emmanuel Prouff et al., reference number two. As I said before, it is an AES implementation protected against first-order side-channel attacks, and it consists of 50,000 training traces and 10,000 test traces of 700 samples each, all located in the first round of AES. It also features three variants. One is the normal ASCAD dataset, which consists of time-aligned traces, so there was a preprocessing step where all the traces were aligned nicely. Then there is a desync50 variant, where the traces are desynchronized by a maximum of 50 samples, and a desync100 version, where the traces are desynchronized by a maximum of 100 samples. All right, so what does it look like if we run our model on those datasets? As a baseline test, I did a regular CEMA attack, and as expected, even after using the maximum number of traces, which is 60,000, you can see that it is still not able to find the correct key. Rank indicates how many guesses we have to make before getting the right key, so the attack doesn't really succeed. Now, if we train a one-layer MLP using correlation optimization, which we saw in the previous slides, then note that the x-axis has changed here, so it goes up to only 5,000 traces instead of 60,000; we can see that for the ASCAD dataset, which features the aligned traces, the rank drops and almost hits zero at around 5,000 traces. For the two-layer MLP, which is able to capture more complex features of the traces, we can see that after 1,000 traces it already hits a rank of zero. But as expected, since the MLP is very sensitive to feature translations, you can see that for the desynchronized datasets, desync50, which is the orange graph, and desync100, which is the green one, it is still not able to find the correct key. So we apply our methodology to remove the alignment requirement and transform everything to the frequency domain, and already there is an interesting result here, because even for our baseline test, the regular CEMA, it seems that after about 60,000 traces all the datasets reach rank zero. If we then do the same for our one-layer MLP, we can see that it now actually finds the correct key after 5,000 traces, although not for the desynchronized datasets; but if we add another layer and hence allow it to learn more complex features, then for all three datasets it is able to find the correct key in about 1,000 traces. How does this compare to previous work? Well, this was the best CNN model from the original authors of the ASCAD paper, and our approach appears to do a little bit better, especially for the desync50 and desync100 datasets, for which the original CNN could not recover the key, and that is because of our FFT-based approach. Then, for the second experiment, recall that we just wanted to see how well our model could deal with very noisy and unaligned signals. So what we did is make a training set of 51,000 random-key encryptions and a validation set of 32,000 fixed-key encryptions, and capture those at a relatively low sample rate of 8 million samples per second, without any pre-processing or alignment. So we just feed the raw data to the neural network and let it do its thing, and after 20,000 traces we were able to find the correct key for this dataset as well.
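To make those rank curves concrete, here is a hedged NumPy sketch of the final attack step (again my own illustration): run the standard correlation attack on the one-dimensional encodings the trained network outputs for the attack traces, and report the position of the true key byte in the sorted list of guesses, where rank 0 means the attack succeeded. The `sbox` argument is assumed to be the standard AES S-box table.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])

def key_byte_rank(encodings, plaintext_bytes, true_key_byte, sbox):
    """encodings: (n_traces,) network outputs for the attack set."""
    # Hypothetical Hamming-weight leakage for all 256 key guesses
    hyp = HW[sbox[plaintext_bytes[:, None] ^ np.arange(256)]].astype(float)

    # Pearson correlation of each guess with the single encoding point
    e = encodings - encodings.mean()
    h = hyp - hyp.mean(axis=0)
    corr = (h * e[:, None]).sum(axis=0) / (
        np.linalg.norm(h, axis=0) * np.linalg.norm(e) + 1e-12)

    order = np.argsort(-np.abs(corr))                    # most likely guess first
    return int(np.where(order == true_key_byte)[0][0])   # 0 == key found
```

Plotting this rank while growing the number of attack traces gives exactly the kind of curves discussed above.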
All right, so on to the conclusions. We have demonstrated the use of machine learning as a means of feature extraction rather than classification, and those features are extracted by optimizing the correlation loss as opposed to the cross-entropy loss from previous works. On the ASCAD dataset, we show that we achieve better performance despite using only a very shallow MLP architecture, which also allows us to train much faster on a dataset. Alignment issues can be resolved by operating in the frequency domain, and if you want to check out the code yourself or look at the data, it is all open source and on GitHub. There is also a framework that I made for easily processing batches of traces and working with the ASCAD dataset, so feel free to check that out if you'd like to do some work on that as well. Then, for future work, here are some ideas that I had in mind. If you look at the computer vision domain, there is a lot going on around Siamese networks, which are basically just another way of training a neural network, and they achieve very good results, so feel free to check that out. I think that in the side-channel domain we could apply similar principles to increase the performance even more. It would also be interesting to apply these kinds of techniques to other domains in crypto, or to other crypto algorithms, for example RSA, and then there are improvements to existing datasets. This is actually funny, because I contacted the authors of the ASCAD dataset yesterday and they have already fixed this issue: previously it used a fixed key, with variable masking values of course, but now they have also randomized the key to make it a little more realistic. And then finally, implementing state-of-the-art architectures from the CV domain, such as ResNet, which I have not seen that much in current SCA literature yet, but which are very interesting if you look at those papers from the CV domain. All right, so that's it on my end. If you have any questions, I would be glad to answer them. Thank you very much. Thanks for the talk. We have time for questions. No hands. I'll start. So if I understand correctly, you propose basically to use machine learning not for the full attack, but for feature extraction. And in this work, you say, I find an encoding, right? Does that basically mean that I'm finding one feature? And if so, why not extend this and say, I'm going to find two, three, four features? So if I go back to this slide right here. As you said, it's indeed one feature. And the advantage is, for example, let's say you want to classify faces and you have one million faces; then you don't need one million outputs. You can just have one scalar value that gives you a certain similarity, and then you can compare it against a certain database. So what this allows you to do is, first, we need a lot fewer outputs. Instead of 256 outputs, we need only one, and this allows us to train much faster. That's basically the main benefit. Okay. But you also said that the method now allows training for an entire key rather than only for single key bytes. Sorry, can you repeat that? You said that with this method, you can train directly for an entire key and you don't have to do it per key byte. Yeah, so this was only an example for one byte of the key because there was only one output. But what you can do is, I mean, these are now grayed out, but you can just allow the neural network to train all these outputs. And then you can perform the correlation: for example, first you do a CEMA attack on the first output, then you do a CEMA attack on the second output, and eventually you get all the bytes of the key. So then I would have 16 encodings? Yeah, exactly. Yeah, that's right. Okay.
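As a hedged sketch of the full-key variant described in this answer (my own illustration with assumed layer sizes, not the authors' code), the same shallow network simply gets 16 outputs, one encoding per key byte of AES-128, and the correlation attack is then run on each output column separately.

```python
import tensorflow as tf

full_key_encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(700,)),               # assumed trace length
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(16),                  # one encoding per AES-128 key byte
])
# Train with a per-output correlation loss against the 16 true leakage values per
# trace, then attack column i of full_key_encoder.predict(attack_traces) with the
# key_byte_rank sketch above to recover key byte i.
```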
Anyone else? Please stand up to the mic. Just a very short question. Did you investigate the use of auto-encoders for the feature extraction? Could you repeat that, please? Did you consider... Did you investigate the use of auto-encoders for the feature extraction? Yeah, so what auto-encoders do is, after learning this output, they project it back to the original space, which is not something that would be useful in this case, I believe, because you are only interested in extracting the features and not in projecting them back to the original input space. Right? Because that's what auto-encoders do, right? Yeah, but auto-encoders also reduce the size of their representations, a little bit like feature extraction. Yeah, if you use only the encoding part of the auto-encoder, then, of course, you get the same thing as... Yeah, this is the idea. Yeah, that's the idea. But you didn't work on that. Yeah, but that's exactly what this is. Thank you for your talk. Have you compared your approach with older ones, like template attacks and stochastic attacks? Could you repeat the question, please? Have you compared your approach, at least in the experiments, with template attacks and stochastic attacks, like using a stochastic model? No, that's something I didn't do. I just wanted to compare with the previous machine learning based works. So that's not something that... More questions? No? Let's thank the speaker again.