The talk is given by Annelie on the paper titled "The Curse of Class Imbalance and Conflicting Metrics with Machine Learning for Side-channel Evaluations". Annelie, the floor is yours.

Thank you, Sabah, for the introduction. As you can see from the long title, my talk will be divided into two parts. First I will talk about imbalances of classes, or of labels, and the second part will be about evaluation metrics. This is joint work with Stjepan, Alan, Shivam, and Francesco.

Luckily, Stjepan already told you a bit about the big picture: we have a profiling phase and an attacking phase. Before machine learning was introduced to the field of side-channel analysis, we mainly had the template attack and the stochastic approach. What we do there is build probability density distributions in the training phase, evaluate them in the attacking phase, and then evaluate the whole concept with evaluation metrics like guessing entropy and success rate. Then machine learning techniques were introduced to the field. In this talk I will mainly consider classical machine learning techniques, like support vector machines (SVM) and random forest. We use these techniques to train and then also to predict in the attacking phase, but with them, accuracy also came into the game of side-channel analysis; accuracy is a really popular metric in machine learning. So in this talk I will first discuss what might go wrong with your labels in combination with machine learning, and then what is wrong with accuracy in the field of side channels.

So what is a label? Typically, what we attack in this context is an intermediate state of a cryptographic algorithm. For example, let's assume we have AES. We would normally attack either the first round or the last round. For the first round I would take the output of the S-box, where some known part comes from the plaintext and some part is what I want to reveal, which would be the key. A really common model in side-channel analysis is to use the Hamming weight. The problem with the Hamming weight is that it introduces imbalances in the data set: if I have uniformly distributed data and compute the Hamming weight, I actually end up with binomially distributed data. Here's an example of how it looks for an 8-bit value. You can see that Hamming weight class four occurs 70 times, whereas class zero or eight occurs only once. So I have a very high occurrence in the middle, while at the ends we have far fewer occurrences.
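As a small aside that is not part of the talk itself, these binomial class counts are easy to verify; here is a minimal Python sketch that labels every 8-bit value with its Hamming weight and counts the classes.

    # Label every 8-bit value with its Hamming weight and count the classes:
    # uniformly distributed values turn into binomially distributed labels.
    from collections import Counter

    counts = Counter(bin(v).count("1") for v in range(256))
    for cls in range(9):
        print(f"HW class {cls}: {counts[cls]:3d} of 256")
    # Prints 1, 8, 28, 56, 70, 56, 28, 8, 1: class 4 covers 70/256 (~27%)
    # of the values, while classes 0 and 8 occur only once each.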
So why do we actually use the Hamming weight in side-channel analysis, for power analysis and electromagnetic emanation? It does not really reflect how the device is leaking. Here I plotted the influence of each bit on the power consumption, or in this case on the EM emanation. If the device had Hamming weight leakage, each bit would contribute similarly, that is, with the same amplitude. I can only see this in the first part, whereas the other two parts don't look like Hamming weight at all. And the amplitude really reflects the influence on the leakage, so the two parts that are not Hamming weight are leaking much more than the part where we actually have Hamming weight.

So why do we still use it? First of all, it really reduces the complexity of training. In the attacking phase we don't care so much, because we have to attack the key anyway, but in the learning phase it really makes a difference whether my model has to distinguish between nine classes (for an 8-bit value) or 256, especially if I use machine learning techniques like SVM. The other point is that it actually works sufficiently well in many scenarios. If I want to attack, I don't really need to model the leakage so precisely that I have an exact representation; what I want is to distinguish classes, so I need some function with which I can easily classify between the classes. What I plotted here is the outcome of the template attack in the Hamming weight model versus the template attack in the value model. On the x-axis you see the number of traces I attack with, and on the y-axis the number of traces I need in the profiling phase in order to reach a guessing entropy below 10. How can you read this graph? Let's assume we want to attack with five traces; then for the template attack with the Hamming weight model I would need around 100 profiling traces, whereas for the template attack with the value model I need nearly 1000 traces. So it is much more efficient to use the Hamming weight model, with the same outcome in the attacking phase. Of course, if I want to attack with just one, two, or in this case three traces, I need to use the value model, because the Hamming weight model loses information about the key. But if I have, for example, 50 traces, the outcome is nearly the same, and in the training phase it is much more efficient to compute the model. So in many cases the Hamming weight model is not such a bad idea, and we saw that in higher-noise scenarios the gap between the two becomes even bigger.

So even though the Hamming weight does not represent the leakage, it is actually not bad. But why do we care about imbalance in the data at all? It's because most machine learning techniques, and here again I'm talking about SVM and random forest, not so much about deep learning techniques, rely on loss functions that are designed to maximize accuracy. I will say more about accuracy at the end, but let's assume we are in a high-noise scenario where we cannot really distinguish between the classes. The best strategy is then to always predict Hamming weight class four, because it is the most populated: in about 27% of the cases I will be right. But if I always predict Hamming weight class four, we will not be able to attack; we will have no information about the key. So even though the accuracy will be quite high, it will not help us to actually reveal the secret key.
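To make that accuracy argument concrete, here is a quick sketch, again not from the talk, of the constant classifier that always outputs class four: on uniformly drawn 8-bit values it scores roughly 70/256, about 27% accuracy, while carrying no key information, since a constant prediction cannot rank key candidates.

    # A "classifier" that always predicts the majority class, HW = 4.
    import random

    random.seed(0)
    labels = [bin(random.randrange(256)).count("1") for _ in range(100_000)]
    accuracy = sum(y == 4 for y in labels) / len(labels)
    print(f"accuracy of always guessing class 4: {accuracy:.3f}")  # ~0.273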
So what do we do? In this paper we looked for methods to transform the data set in order to achieve balance. What could we do? Either we throw away data, we add data, or we choose the data before we measure at all. For the last option, though, what we came up with requires non-uniformly distributed plaintexts, and if you want to attack both the first and the last round it gets really messy.

Throwing away data is called random undersampling: we keep only a number of samples equal to that of the least populated class. The problem here is the binomial distribution: it's not as if some classes are just slightly more populated than others; we really have this distribution with a lot of data in the middle and only minimal data in the outer classes, so we would throw a lot of data away. Here I made a toy example (not binomially distributed): class one has seven samples and class two has 13 samples, and we simply throw away six samples from class two so that they are balanced.

The second thing we tried is random oversampling with replacement. Here we randomly select samples from the original data set until the amount equals that of the most populated class. It's a really simple method, and it was reported to work quite well in other contexts, so we said, okay, let's try this out too. But it may happen that some samples are not selected at all and some are selected more often. In the toy example we would end up with 13 samples per class, but they are not all distinct samples; only the weight of the two classes would be the same. We might have one sample, for example, that is not selected, and two that are selected three times.

Another technique we tried is SMOTE, which stands for synthetic minority oversampling technique. Here we really add artificial samples, so we really add data in the form of new samples, and this data is generated using the Euclidean distance to the k nearest neighbors. In the toy example we would end up with distinct new samples.

The last technique we tried is SMOTE plus edited nearest neighbors (ENN), which is SMOTE plus a data-cleaning technique: first we oversample the classes that need it, and then we undersample. We remove samples whose k nearest neighbors have different classes, so we clean the data. In this case we might end up with only 10 samples in both classes.
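The four strategies above all exist as off-the-shelf samplers, for example in the imbalanced-learn package. The following sketch applies them to hypothetical trace data (X: one row per trace, y: Hamming weight labels); it is only an illustration under those assumptions, not the code used in the paper.

    # Balancing Hamming-weight labels with imbalanced-learn (illustrative).
    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from imblearn.combine import SMOTEENN

    rng = np.random.default_rng(0)
    values = rng.integers(0, 256, size=10_000)
    y = np.array([bin(v).count("1") for v in values])   # binomial labels
    X = rng.normal(size=(10_000, 50)) + y[:, None]      # toy "traces"

    samplers = [("undersampling", RandomUnderSampler()),
                ("oversampling", RandomOverSampler()),
                ("SMOTE", SMOTE(k_neighbors=5)),
                ("SMOTE+ENN", SMOTEENN())]
    for name, sampler in samplers:
        Xb, yb = sampler.fit_resample(X, y)
        print(f"{name:>13}: class counts {np.bincount(yb, minlength=9)}")
    # Under/oversampling and SMOTE yield exactly equal class counts;
    # the ENN cleaning step may remove a few samples afterwards.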
Unfortunately, I don't have time to show you all the results we have in the paper, but the most effective technique in our case was SMOTE, and I will now show you the experiments for it. I just want to point out that this is data augmentation: we add data where we need it, but we don't use any specific knowledge about the data set, the implementation, the protection, or the distributions; we just add data according to the neighbors. We did this for a varying number of samples in the profiling phase: in the imbalanced case we had 1K, 10K, and 50K traces, which together with SMOTE end up at 5K, 24K, and 120K. Three of the data sets are the same as in the talk from Stjepan. The first data set is DPAcontest v4. It is actually a protected implementation where masking is used, but we assume the mask is known. I plotted the densities of each Hamming weight class, and you can see that they look somewhat Gaussian and you can easily distinguish between them. To make it really short, for this data set SMOTE did not help at all. That is also quite natural: if we can already distinguish the classes easily, as you can see in this picture, there is no need for the distinguisher to just always say Hamming weight class four, so we don't actually need the added samples.

So let's come to more interesting results. Here is data set two: AES-128 on an FPGA (SASEBO-GII). You can already see that the densities don't look so Gaussian anymore in most cases, and they overlap quite a bit; we have much more noise. First, how to read this graph: we ran SVM and random forest for the three different training data set sizes. Everything imbalanced is drawn as a solid line, and everything with SMOTE as a dashed line. We can see that in all cases and all scenarios, adding artificial data with SMOTE helped quite drastically. We can also see that SVM and random forest with just 1K traces, the blue and purple lines, did not converge within the 25,000 measurements we used in the attacking phase, while with SMOTE they performed better than even the largest imbalanced data set we had in the profiling phase. What this tells us is that if you collect a data set, it is better to stop measuring earlier and apply a balancing technique than to take more measurements and remain in the imbalanced scenario.

The third data set is the one with the random delay countermeasure. You can see that we have much more noise; the densities all look Gaussian and they overlap. Also in this case, adding artificial samples with SMOTE improved the results a lot. We have further results and more explanations in the paper; unfortunately, I don't have time to tell you everything, but we also used it for CNN, MLP, and the template attack. Of course, if you have the chance to choose your data set beforehand and obtain perfect balance from real measurements, that is better than achieving balance by artificially adding samples.

So let's now come to the second part, about accuracy. On the side-channel side, we have the success rate and the guessing entropy, which are the average estimated probability of success and the average estimated secret key rank. Here we really have a dependence on the number of traces in the attacking phase, and the important thing is that the average is computed over experiments: for independent data sets, hopefully independent with different keys and different traces, we compute the average. On the other side, accuracy, which is the average estimated probability of correct classification, is not averaged over experiments but within one experiment, over the number of traces. There is actually, I'm sorry, a mistake on the slide here; check it out in the paper. So there is no exact translation between the two. What we can say is that there is an indication: if the accuracy is high, guessing entropy and success rate should converge quite quickly. But again, there is no real conversion between the two of them.

Why is this the case? We identified two reasons. First, there is global accuracy versus class accuracy. What this means is that we have a non-bijective function between the class and the key: for example, with the Hamming weight, correctly predicting class zero or eight gives us more information than correctly predicting class four, because for class four we still have 70 possible key guesses, whereas for Hamming weight class zero or eight there is just one possible key guess. So it is much more important to correctly classify the classes with low population than the ones that are more populated. But accuracy doesn't care about this; it just averages over the class accuracies. The second reason is label versus fixed-key prediction. This is only relevant if we want to attack with more than one trace: for accuracy we consider each trace independently, but for guessing entropy, as I said before, we accumulate knowledge over the whole attack data set. There are much more detailed formulas and explanations on this in the paper; please check it out.
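As a minimal sketch of this last point, assuming a model that outputs a probability vector over the nine Hamming weight classes for each attack trace, plus a hypothetical S-box table, the key rank is obtained by accumulating log-probabilities over all traces, rather than scoring each trace on its own as accuracy does. All names here (probs, pts, sbox, key_rank) are illustrative, not from the paper.

    # Accumulate evidence over attack traces and rank the true key (sketch).
    # probs[i]: model probabilities over the 9 HW classes for trace i
    # pts[i]:   known plaintext byte of trace i; sbox: assumed S-box table
    import numpy as np

    def key_rank(probs, pts, sbox, true_key):
        loglik = np.zeros(256)                    # one score per key guess
        for p, pt in zip(probs, pts):
            for k in range(256):
                cls = bin(sbox[pt ^ k]).count("1")
                loglik[k] += np.log(p[cls] + 1e-36)
        ranking = np.argsort(loglik)[::-1]        # most likely key first
        return int(np.argwhere(ranking == true_key)[0][0])

    # Guessing entropy is this rank averaged over independent experiments;
    # accuracy, in contrast, never combines the traces at all.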
So I think there are two takeaway messages from this talk. First, if you have Hamming weight or Hamming distance labels, plus a machine learning algorithm, plus noise, this is really, really likely to go wrong. Data sampling techniques really help, and it is more effective to stop measuring earlier and balance than to just take more measurements and keep the data imbalanced. Second, machine learning metrics, that is, accuracy and metrics similar to it, do not give a precise SCA evaluation, and this has the two reasons I briefly showed you: global versus class accuracy, and label versus fixed-key prediction. So, thank you, I'm happy to answer any questions.

Thank you, Annelie, for this informative and nice talk; I enjoyed it a lot. If there is any question for Annelie, please come to the microphone.

May I ask a question about how you can make the data balanced? In my opinion, the nature of the Hamming weight is imbalanced, because Hamming weight four will have the most samples, while Hamming weight zero and eight will each have only one value among all of them. So for one key you can choose only one plaintext; I mean, for an 8-bit subkey, you can choose only one plaintext that matches it for Hamming weight eight or Hamming weight zero. If you balance, how much data will you have in your data set, and how can you increase it in that case?

I don't know if I understood correctly, but you are asking, for those two classes, how many samples I need?

Yes; I mean, how can you build the data set in that case? Because for one 8-bit key there is only one 8-bit plaintext that matches to give Hamming weight zero, the number of samples decreases significantly. How can you make a data set big enough for the training?

Maybe we can discuss later in the day; I couldn't really follow.

I think, because the time is restricted, it is good to take this offline. Is there any other quick question for Annelie? Then I have a quick question myself. For the SMOTE technique, you said that you generated synthetic data. How is the synthetic data generated?

Yes, so for SMOTE, we only tried a subset of the methods that exist; I think there are many more in the field of machine learning you could try. But here we used five nearest neighbors, looked at the Euclidean distance between them, and added the sample there. There exist many more methods, and in some scenarios they might work more efficiently. We don't claim that this is the best one, just that the concept works.

Thank you; that is a very good example showing that it is possible. Let's thank Annelie again, and Benjamin.