Here is yours, Stjepan. Yes? OK. So hello everyone. My name is Stjepan. The title of the talk is "Make Some Noise: Unleashing the Power of Convolutional Neural Networks for Profiled Side-channel Analysis." This is joint work with Jaehun Kim, Annelie Heuser, Shivam Bhasin, and Alan Hanjalic.

So first, let's start with the motivation. Why do we do this? [A stretch of the recording is unintelligible here. The recoverable fragments concern the motivation for applying machine learning, and in particular convolutional neural networks, to profiled side-channel analysis, and note that from around 2016 onwards convolutional neural networks started to be used for side-channel attacks with good performance. The talk then introduces the basic building block of a neural network, a single neuron.]

But unfortunately, with something like this, we cannot do a lot. We cannot calculate really complex stuff. So let's extend it a little bit. Let's add more of those neurons. Now, instead of just one, we are adding many, and we have many, many layers of those neurons. And actually, this is something that's called a multi-layer perceptron, and this is already deep learning. Here we see an input layer and an output layer. The input is denoted with x, the output with y, and then there are two layers in between, the hidden layers. Usually we say that if you have more than one hidden layer, it's already a kind of deep learning. Then the options are quite different. For instance, we could do something like this, where we would have many hidden layers, or we could do something like this, where we would have one hidden layer, but a really wide hidden layer.

Why is all this nice? Why does all this work? Well, because there is something called the universal approximation theorem: a feed-forward neural network with only a single hidden layer containing a finite number of neurons can actually approximate continuous functions on compact subsets. So what does it mean? Well, let's say that, under some assumptions, you have a neural network with only one single hidden layer, and you can do basically almost everything.
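To make this concrete, a multi-layer perceptron like the one on the slide could be sketched as follows, assuming Keras; the input size, layer widths, and class count are illustrative choices, not values from the talk:

    # Minimal multi-layer perceptron sketch (assuming Keras).
    # Input size, layer widths, and class count are illustrative.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, Dense

    model = Sequential([
        Input(shape=(100,)),              # input layer: x, here 100 features
        Dense(64, activation='relu'),     # first hidden layer
        Dense(64, activation='relu'),     # second hidden layer
        Dense(10, activation='softmax'),  # output layer: y, one unit per class
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')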
So wow, from that perspective, we would say: we use this, and we are able to get top results. Well, it's not so easy. The first problem is: how do we really obtain that kind of neural network that works great? How many neurons would you have in that single hidden layer? In practice, instead of having one huge layer, it's better to have a deep neural network with many smaller hidden layers.

And then we come to convolutional neural networks. They were first designed for two-dimensional convolutions, like images, but they are actually really powerful in many, many domains. They are similar to what I already said: a number of layers, and in every layer a number of neurons. We have convolutional layers, we have pooling layers, we have fully connected layers. In a convolutional layer, we would have some input and we would do some kernel magic: from something, we would map to some other value. Here we would multiply every value in the first square with the values in the second one, and we would get a number. Then we would do some pooling. Here we again define some space to work with, and, for instance, this would be max pooling: from each of those four separate squares, we would just take the biggest number. And that's what we call a convolutional block: convolutional layer plus pooling layer equals convolutional block. So this is, let's call it, the introduction for the whole session of machine learning talks.

What do we do here? Well, we start by asking: how can we find a neural network that works really well for our side-channel problems? There we start with a design principle that is called, let's say, VGG-like. VGG is a quite famous deep learning architecture used in image classification, and that kind of architecture has some rules. It has a small kernel size. It has max pooling with two-by-two windows. It increases the number of filters per layer. Convolutional blocks are added until the spatial dimension is reduced to one. After the fully connected layers, we have the output layer. Convolutional and fully connected layers use ReLU; the output layer uses softmax. These are just some, let's call them, design rules: if you want to do something that's VGG-like, you follow these rules.

So, how did we actually come to the architecture that we are using here? Well, we were not really novel in that sense, but we said: fine, we are working here with side-channel information, like measurements from an oscilloscope. But that kind of data is not unique. For instance, if we consider speech, speech is a similar kind of data. Disregard the actual numbers; shape-wise it's similar. So we said, let's take the most powerful convolutional neural networks used in speech recognition, and we started from that. This is roughly the architecture: we have a number of convolutional layers and pooling layers, on top of that we build fully connected layers, and on top of that we add softmax. I will not really go into details on all the specificities here, but let's say we use some additional tricks, like batch normalization on every odd-numbered layer, dropout at the output, and so on. These are all tricks to make the most of your neural network. And this is the architecture that we had in the end: we have the input, we have a number of convolutional blocks, we have the output. You use this, and you hopefully get really good results. But can we improve the results, and improve them, let's say, on a more general level?
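To make those design rules concrete, here is a minimal sketch of a VGG-like one-dimensional convolutional network for side-channel traces, again assuming Keras; the trace length, filter counts, kernel size, and depth are illustrative and not the exact architecture from the paper:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Input, Conv1D, MaxPooling1D,
                                         BatchNormalization, Flatten,
                                         Dense, Dropout)

    trace_len, n_classes = 3500, 256  # illustrative trace length; 256 S-box classes

    model = Sequential([
        Input(shape=(trace_len, 1)),
        # Convolutional block 1: small kernel, ReLU, 2-wide max pooling.
        Conv1D(8, kernel_size=3, padding='same', activation='relu'),
        BatchNormalization(),
        MaxPooling1D(pool_size=2),
        # Convolutional block 2: the number of filters increases per layer.
        Conv1D(16, kernel_size=3, padding='same', activation='relu'),
        MaxPooling1D(pool_size=2),
        # ... more blocks would follow until the spatial dimension shrinks ...
        Flatten(),
        Dense(128, activation='relu'),           # fully connected layer, ReLU
        Dropout(0.5),                            # dropout as an extra trick
        Dense(n_classes, activation='softmax'),  # softmax output layer
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])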
So of course, this architecture can be improved. Give me a specific data set, and I will tune the parameters even more and obtain something even better. But can I work with something that is, let's call it, as generic as possible, and still improve the behavior for many, many data sets? Well, to do that, we used noise. This is where the title comes from, make some noise, because we said: let's add noise to the input, let's add noise to the training data.

Now, this sounds a little bit counterintuitive, because in order to fight noise, to fight noisy measurements, we are actually adding noise to the training data. Why does it work? Well, simply because you can see it from this perspective: your convolutional neural network will learn stuff, but if it learns too much, it will start to overfit. This is the depiction of overfitting, underfitting, and what is, let's say, optimal. We do not want our neural network to overfit, to learn every specific detail. And how do we prevent it from overfitting? Well, make the problem more difficult: add noise, and then it will be more difficult for the neural network to overfit. One way you could see it: we want to concentrate on the big, important principles. I do not care about every small detail; I will hide the small details with noise, and therefore my convolutional neural network will learn the important stuff. With measurements from an oscilloscope it's more difficult to see, but if we talk about images, and we talk about face recognition: it's important to recognize that there is a face, but it's a little bit less important to recognize, aha, this person has blue eyes. Regardless of the color of the eyes, you can recognize the face; but if you don't see the face, that's a much, much bigger problem.

So, basically, we just add noise to the input layer, to the training data. It's simple Gaussian noise. Nothing specific; we do not care about the data set specificities. You just add noise, and you hope to obtain, well, you actually have high chances to get much better results. Why? Well, theoretically you can see this as L2 regularization.

Now one can ask: well, you are changing your data set; is this data augmentation? Well, yes and no, depending on how you look at it. With data augmentation, we usually apply some kind of domain knowledge to deform the signal into something useful, and then we use the original measurements plus the deformed measurements as the new set. Our technique, well, it's not really ours, we were not the first ones using noise, but what we use here is noisy training, not data augmentation. Why? Because we do not change the data set and then use both the original and the changed measurements. We just change the measurements and work with those. We do not produce new, additional measurements: if we started with 1,000, we still have 1,000, only those 1,000 are a little bit different.

So, what can we get? Well, first, what do we do? What do we test? We consider only the intermediate value model, so basically that means we are attacking the AES S-box output, and we have 256 classes. Why is this important? Because we assume all the classes are balanced: for each class we have more or less the same number of measurements. We use, let's call them, standard data sets: DPAv4; something we call AES_HD, with no countermeasure but a lot of noise; the third one is AES with a random delay countermeasure; and the final one is ASCAD.
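Going back to the noise addition itself, a minimal sketch of the step, assuming numpy, the Keras model from before, hypothetical names X_profiling and y_profiling for the profiling traces and one-hot labels, and an illustrative noise level sigma, could look like this:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def add_gaussian_noise(traces, sigma):
        # Perturb every profiling trace with zero-mean Gaussian noise.
        # The number of measurements stays the same: no new traces are
        # created, so this is noisy training, not data augmentation.
        return traces + rng.normal(loc=0.0, scale=sigma, size=traces.shape)

    # X_profiling: (n_traces, trace_len) profiling traces (assumed name);
    # y_profiling: one-hot labels (assumed name); sigma is a tunable level,
    # and the talk reports the method is quite stable across noise levels.
    X_noisy = add_gaussian_noise(X_profiling, sigma=1.0)
    model.fit(X_noisy[..., np.newaxis], y_profiling,
              epochs=50, batch_size=128)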
So, the spikes mean those features are more important. We see that the first one and the last one look kind of easier, while the second and the third look difficult.

And what do we see? Well, the blue line is the behavior if you add the noise; the orange line is without noise. On the left is when we repeat the experiments a number of times. So, let's concentrate currently only on the right side. We see that our performance without noise is good: we reach a guessing entropy of zero after, let's say, six or seven traces. But if we add noise, we reach that same value with four traces, something like that. This is such an easy data set that the results are not really telling us much. Let's go for something much more difficult.

Here we see again that, if we add noise, we are able to get much, much better behavior on the right. Both of these settings consider different convolutional neural networks. The first one is the one we proposed; this one is the one proposed by Emmanuel Prouff and others in the ASCAD paper, together with the ASCAD data set. So, they have their own convolutional neural network, and what's interesting here is that we just took that convolutional neural network, added noise, and it works better. And here is again one nice example, with our neural network, for the random delay data set. We can see that if we don't have noise, we need 10, 11, whatever traces; if we have added noise, we need one, max two, traces to reach a guessing entropy of zero. And finally, this is the result for the ASCAD data set with the ASCAD neural network, where we see that if we add noise, it works much, much better. Why is this important? Well, it gives us some notion that we are kind of generic: we don't care exactly what kind of convolutional neural network you use, you can still obtain better results if you just add the noise.

So, what else is done in the paper? Well, we see that the addition of noise is quite stable. One can ask: yes, but how do you decide what level of noise to add? Well, we did a lot of experiments; basically, you can add various levels of noise and it works well. We also see that it does not seem you can have a single best convolutional neural network for all your side-channel problems. Interestingly, we showed that if you have a really good neural network, attacking a data set protected with a countermeasure can even be simpler, easier, than attacking a data set without a countermeasure; it depends on the data set. And finally, we showed that if you use this trick with adding noise, you need fewer traces in the profiling phase to obtain the same performance. What does that mean? Well, if you have only a limited number of traces and your attack is not so powerful, you can add noise and make your attack much more powerful.

Finally, just something additional. When you are doing your training, there are different ways you can select your training set: I select it like this, I select it like that. And in the end, we all somehow report some average results. We also showed in this paper that the behavior of your convolutional neural network will depend a lot on how you select that data set. For instance, in the first picture on the left, you see that one fold is good, the orange one, while your other folds are not so good. And in the other picture, you can even see that some folds, depending on how you select your training set, can have really, really bad behavior. So, that means it's often not enough to say: I just used 20,000 measurements. Because if your data set is 100,000 and you used the first 20,000, your results from the second 20,000 could be really different.
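Throughout these results, the evaluation metric is guessing entropy, the average rank of the correct key. A rough sketch of a single-experiment version, assuming numpy (the reported curves average this rank over repeated experiments, and log_probs is a hypothetical array of per-trace key-candidate log-probabilities):

    import numpy as np

    def key_rank(log_probs, correct_key):
        # log_probs: shape (n_attack_traces, 256); entry [i, k] is the
        # log-probability that attack trace i assigns to key candidate k
        # (after mapping the network's S-box-output probabilities back to
        # keys via the known plaintexts). Sum over traces, rank candidates,
        # and return the rank of the correct key (0 means key recovered).
        scores = log_probs.sum(axis=0)
        ranking = np.argsort(scores)[::-1]  # best candidate first
        return int(np.where(ranking == correct_key)[0][0])

    # Guessing entropy is then the average of key_rank over repeated
    # attacks with independently chosen sets of attack traces.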
To conclude: VGG-like architectures seem really good for side channel, and by the way, the ASCAD network is also VGG-like. There are other domains that use machine learning and deep learning, and we can learn a lot from those domains. We do not need to reinvent new stuff; a lot of stuff that is new, at least from our perspective, is already done. We just need to recognize it and use it. And if we do that, we can actually improve the performance for side channel. And finally, we propose this noise addition procedure as a relatively standard procedure to do in your machine-learning profiled attacks, because we also showed that even the template attack, the pooled template attack, can be improved if you add noise. And thank you for the attention.

Please come to this microphone. Any questions?

Cool, thank you for the talk. I have a quick question. So, you showed that if you add noise, it works with side channels. Have you tried it on other tasks, like image recognition or... Sorry, sorry. So, does adding noise work well only for side-channel cases? Adding noise works for all the domains. It's not something that we invented for side channel; it's something that's actually done in many domains, and it's recognized as a good thing to do. Since adding noise acts as L2 regularization, it should always work.

And I had another question. So, I think that if you don't add noise, you are kind of overfitting. Have you tried with a smaller network, and in that case, do you also overfit if you don't add noise? Can you repeat? Sorry. So, in that case, I mean, if you don't add noise, you kind of overfit. Have you tried with a smaller network to see if you still overfit with the amount of data you have? Yeah, so, do we still overfit if the network is smaller? It depends. This is an interesting tradeoff that goes a little bit beyond the scope of this paper, but if your network is much smaller, you will overfit much less, because your network does not have the expressive power to really learn the data. But where is the level at which you say: OK, I need this expressive power to have something? We did not really go in that direction in this paper; this is something we are doing in some other work. Okay, thank you very much. Yeah.

So, we have time only for a very quick question; otherwise, we have to take it offline. So, I think you were first. I will try to answer fast. Thank you. Yeah, you said that what you are proposing is not data augmentation, and that distinction is totally clear. On that point, my question is: did you test the efficiency of your proposal against clearly using the adding of noise as data augmentation? Meaning that you don't add the noise after the first batch normalization, but you just take the initial database and create an enlarged database just by adding noise, and then you execute the classical VGG model on that. Did you compare the two? We did compare; we did not really put those results in the paper, but actually the second presentation goes more in that direction, with other techniques. What's the most important message about data augmentation? You often need to be really precise about how you do data augmentation, so you need to have domain knowledge. And we really wanted to work here without domain knowledge. Yes, but if I add the noise exactly as you do... Yeah. But just before the batch normalization?
Yeah, we did it after the batch normalization because, due to the batch normalization, it behaves just like feature scaling. So, just from that perspective, you could move it a little bit, if you know what will then happen with that layer. Thank you, Stjepan, again, for the nice talk.