Hello everyone. The title of this talk is "A Fast and Accurate Guessing Entropy Estimation Algorithm for Full-Key Recovery." This work was done by Professor Adam Ding, Professor Yunsi Fei, and me, from Northeastern University.

This talk contains three parts. First, we talk about the motivation: we review the definition of GE, explain why it is an important metric for side-channel evaluation, and discuss the state-of-the-art estimators of GE, namely the empirical GE and GM. For estimating the full-key GE, the empirical GE is too uncertain and GM is biased. Then we explain our new GEA estimator. This estimator is based on the theoretical distribution of the score vector, and we simplify the calculation using a relationship between GE and pairwise success rates. Lastly, we compare GEA with the empirical GE on experimental data. We do not compare with GM in the full-key case since it is biased. We will show that GEA gives useful confidence intervals for the GE of a 16-byte AES full key in practical time.

First, let's start with the introduction and motivation. What is guessing entropy, or GE? Why is it an important metric for side-channel evaluation, and how do we estimate it? When an attacker launches multiple side-channel attacks using random data sets of equal size Q, GE is the average rank of the true key over these attacks. With this concept in mind, for a random data set in a side-channel attack, a higher GE value means more wrong keys are likely to be checked before the attacker reaches the correct key. As a result, a higher average computational cost is needed for a successful side-channel attack. That is why GE is commonly used to evaluate resistance against side-channel attacks.

Before we proposed the GEA algorithm, GE was estimated empirically. In this case, the evaluator needs to collect a fairly large number n of independent data sets of equal size, attack n times, and then estimate GE by the sample mean of the correct-key ranks from all these data sets. Notice that GE changes with the size Q, which is the number of traces used in one attack. Usually, more traces result in better attacks, so GE is smaller. To determine GE at a specific Q, one data set of size Q does not give a reliable answer, and we need to average over n data sets for each value of Q.

This estimation is actually pretty good when we evaluate keys of short length. For example, you can check the result in this figure, where we estimate GE empirically for a one-byte AES subkey. In this case, the ranks are computed by simply comparing the scores of all key candidates in a brute-force way, so ranks are easy to compute. And since the rank only varies within a short range, the variance of the rank is also very small. With a large number of ranks computed in a short time and a small inherent variance, we get an empirical estimate of GE with a very narrow confidence interval.
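To make the empirical procedure concrete, here it is written out as a formula. This is a sketch in our own notation (r_i is the correct-key rank obtained from the i-th independent data set of size Q), not a formula copied from a slide.

```latex
% Empirical GE: average the correct-key rank over n independent attacks,
% each run on its own data set of Q traces.
\[
  \mathrm{GE}(Q) \;=\; \mathbb{E}\big[\operatorname{rank}(k^{*};\,Q)\big]
  \;\approx\; \widehat{\mathrm{GE}}_{\mathrm{emp}}(Q)
  \;=\; \frac{1}{n}\sum_{i=1}^{n} r_i .
\]
```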
However, if we want to estimate GE in the same way for a larger key, two challenges need to be solved. The first is how to calculate or estimate the rank of the correct key in the multi-byte case. This problem was solved in previous literature: after the first ranking algorithm, introduced by Veyrat-Charvillon et al. at Eurocrypt 2013, many follow-on ranking algorithms further improved the efficiency and accuracy of rank estimation. Still, computing the rank on a single data set takes some time for a multi-byte key. The second challenge is that the empirical GE estimation is often too uncertain.

In one of our experiments, to estimate the GE of a 16-byte AES full key, we computed 2,000 true-key ranks for each Q value, where Q is the number of traces used in a single attack. We can see from the figure that the empirical GE only provides useful and accurate bounds on the log scale when the attacks either always fail or always succeed. In these trivial cases it is obvious whether the device is safe or unsafe against the side-channel attack. However, in most cases the confidence intervals are too wide for us to draw a conclusion about the safety of the device.

Let us focus on the case of Q equal to 50,000. The confidence interval is marked in red here; it has a lower bound of 1 and an upper bound of 2^44.2. Generally, 2^20 is considered within an attacker's computational power to enumerate key candidates, and thus presents a realistic threat, while 2^40 is usually considered too large to enumerate, meaning the device is safe against the attack. This confidence interval says that GE is somewhere between 1 and 2^44.2, covering values that represent both realistic and unrealistic attacks. Hence, the evaluator cannot tell whether the target device is safe against such attacks.

The reason for this uncertainty is that the variance of the rank is very large, and the distribution of the true-key rank is also highly skewed. These issues were pointed out by Martin et al. at Asiacrypt 2016, and we will illustrate this phenomenon on our experimental data later. In order to reduce the variance of the empirical estimate, the evaluator needs to compute a much larger number of ranks. However, both the collection cost and the computation cost of doing so are prohibitive. For the collection cost, the evaluator needs to collect n × Q leakage measurements for the empirical GE. Also, computing one rank in the multi-byte case still takes some effort even with a state-of-the-art ranking algorithm, and we cannot in practice do this n times for very large n. In the figure shown on the earlier slide, computing 2,000 ranks for each Q value costs roughly 9.6 hours on our workstation, so it is hard to increase n much further.

Choudary and Popescu also introduced a very fast estimator of GE, called GM, at CHES 2017. GM is fast because it substitutes the rank probabilities in the GE formula with the i-th largest posterior probabilities from one data set. However, there are two issues with this estimator. First, this probability substitution is biased, making GM a biased estimator of GE; you can check our appendix for a detailed discussion. Second, GM is data-set dependent, while theoretically GE should be independent of any specific data set. As a result, GM needs to be averaged over multiple data sets to empirically eliminate this dependency. Because of this bias, we will not compare GM with our GEA in the multi-byte case in the experiments later.
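To see where the bias comes from, it helps to put the two quantities side by side. The GM expression below is our hedged paraphrase of the description just given, with p_(i) denoting the i-th largest normalized posterior probability from a single data set; it is not copied from the GM paper.

```latex
% GE averages over the randomness of the data set; GM instead plugs in the
% sorted posterior probabilities observed on one particular data set.
\[
  \mathrm{GE}(Q) \;=\; \sum_{i=1}^{|\mathcal{K}|} i \cdot \Pr\big[\operatorname{rank}(k^{*}) = i\big],
  \qquad
  \mathrm{GM} \;=\; \sum_{i=1}^{|\mathcal{K}|} i \cdot p_{(i)} .
\]
```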
Motivated by the absence of an efficient and accurate multi-byte GE estimation method, in this work we propose a new GE estimation algorithm, GEA, to fill this gap. Instead of averaging over actual ranks, the GEA algorithm provides fast and accurate GE estimation based on the theoretical distribution of the ranking score vectors. This idea is inspired by the theoretical multivariate Gaussian distribution discovered in previous literature. In the next section, we will talk about how we use the relationship of GE with pairwise success rates to reduce the GE calculation from multivariate Gaussian probabilities to a sum of univariate Gaussian probabilities. This allows us to accurately estimate GE within practical computational time.

Now let's look at the details, starting with the relation between GE and the pairwise success rate. GE has a linear relation with the sum of pairwise success rates. Why? Because we can write the rank of the true key as a sum of one minus indicator functions of the score comparison between the correct key and each wrong key candidate, where the indicator is one when the true key beats the wrong key and zero otherwise. Then, after taking the expectation, each indicator function turns into the pairwise success rate of the true key against that one wrong key. That is how we write GE in terms of pairwise success rates. The advantage of writing GE as a sum of these pairwise success rates is that we no longer need to compute joint probabilities across different key pairs, and each pairwise success rate is extremely easy to compute.

To illustrate this advantage, we first need to bring in the comparison score. As defined in previous literature, the comparison score is the difference in score between the true key and one wrong key. The comparison score over one data set is the sum of the comparison scores of its traces. Under the fair assumption that the per-trace comparison scores are independent and identically distributed, the comparison score over one data set asymptotically follows a univariate Gaussian distribution. Actually, in previous works, Rivain and Fei et al. found that the whole comparison-score vector, involving all wrong key guesses, follows a multivariate Gaussian distribution, which is the more general case. We emphasize here that each pairwise success rate involves only one dimension of that multivariate Gaussian distribution. So the relationship we discussed on the last slide allows us to calculate GE from univariate Gaussian distributions, which only require two unknown parameters each: the mean and the variance of the comparison score.

Estimating GE then turns into profiling the mean and the variance of all comparison scores. After going through the theoretical derivation, we can write GE as a sum of CDFs of univariate Gaussian distributions. In practice, we compute the sample mean and sample variance of each comparison score over the entire evaluation set, and the GEA algorithm gives the GE estimate by replacing the mean and the variance with their sample versions in the same formula. As you may notice, in the new formula Q appears as an independent parameter, which means that after profiling on the evaluation set, GEA can give a GE estimate for any Q value, while the empirical method requires Q to be far smaller than the size of the evaluation set.
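Before moving to the full-key case, here is the chain of reasoning above collected in one place. This is a sketch in our own notation and under the stated i.i.d. assumption: k* is the true key, δ_k is the per-trace comparison score against wrong key k with mean μ_k and variance σ_k², and Φ is the standard Gaussian CDF.

```latex
% Rank of the true key, written with indicator functions of the comparison
% score over a data set of Q traces (the true key wins when the sum is > 0):
\[
  \operatorname{rank}(k^{*}) \;=\; 1 + \sum_{k \neq k^{*}}
      \Big( 1 - \mathbf{1}\Big\{ \textstyle\sum_{j=1}^{Q} \delta_{k}(j) > 0 \Big\} \Big).
\]
% Taking the expectation turns each indicator into a pairwise success rate,
% and the i.i.d. per-trace scores give a univariate Gaussian for each summed
% comparison score, so GE becomes a sum of univariate Gaussian CDFs:
\[
  \mathrm{GE}(Q) \;=\; 1 + \sum_{k \neq k^{*}}
      \Pr\Big[\textstyle\sum_{j=1}^{Q}\delta_{k}(j) \le 0\Big]
  \;\approx\; 1 + \sum_{k \neq k^{*}} \Phi\!\Big( -\tfrac{\sqrt{Q}\,\mu_{k}}{\sigma_{k}} \Big).
\]
% GEA plugs in the sample mean and sample variance of each delta_k profiled
% on the evaluation set; Q enters only as a parameter of the formula.
```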
So far, the GEA algorithm still requires computing a CDF for every wrong key guess. However, in AES full-key evaluation, the whole key space is too large to enumerate over. To solve this issue, we first break each full-key candidate into one-byte subkeys. Next, we randomly select M samples over the key space and compute the sampled version of the GEA estimate on them. In the next two slides, we will explain these two steps in detail.

Consider a multi-byte full-key candidate. The side-channel attack can be conducted byte by byte according to the divide-and-conquer principle. As a result, the comparison score under one wrong full-key guess equals the sum of the comparison scores of its one-byte subkeys, and the mean and variance of the univariate Gaussian distribution under one full-key candidate become the sums of the single-byte estimates. Since there are only 256 different distributions for each key byte, in the first step we profile all one-byte distributions over the evaluation data set. Then, given any full-key candidate, we can quickly compute the sample mean and variance of the corresponding distribution.

In the second step, instead of enumerating over all wrong key guesses, we create a sample set S by sampling M candidates uniformly from the key space. The sampled version of the GEA estimate then becomes the sample mean of the scaled probabilities over the sample set S, where the scaling factor is the size of the full-key space.

GEA estimates GE in a somewhat similar way to the empirical GE estimation, but GEA samples over the space of scaled probabilities while the empirical GE samples over the rank space. As a result, the difference in accuracy between the two estimation methods comes from both the number of samples that can be collected in a fixed amount of time and the variance of the sampling distribution. To illustrate the advantage of GEA over the empirical GE estimation, let's first compare their sample distributions. As shown in the figure, the variance of the rank is much larger than that of the scaled probability. What is more, the distribution of the former is much more skewed than that of the latter, making it even harder to estimate the mean of the rank than of the scaled probability. This graph illustrates the three-byte case; as the number of bytes increases, the difference in variance and skewness grows much bigger. Also, computing a scaled probability only involves evaluating the CDF of a univariate Gaussian distribution once, so it is very fast and nearly constant in time as the number of bytes increases. In contrast, the empirical GE's computation cost grows linearly with the number of bytes. In the case of an AES-128 full key, the computational cost is massive even with state-of-the-art ranking algorithms. In our work, we use the FSE ranking algorithm to estimate each full-key rank. An experiment shows that GEA is seven orders of magnitude faster than the empirical GE estimation in reaching the same accuracy for the full key.
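As a rough illustration of the two steps just described, a sampled full-key GEA estimate could look like the following sketch. All names here (gea_full_key, mu_byte, var_byte, n_samples) are illustrative only; this is a sketch under the assumptions above, not the exact implementation used in our experiments.

```python
# Sketch of the sampled full-key GEA estimate (illustrative, not the
# authors' code). Inputs are the profiled per-trace mean and variance of
# the one-byte comparison scores for every byte position and subkey guess.
import numpy as np
from scipy.stats import norm

NUM_BYTES = 16
KEY_SPACE = 2.0 ** (8 * NUM_BYTES)  # size of the AES-128 full-key space


def gea_full_key(mu_byte, var_byte, Q, n_samples=10**6, seed=None):
    """Estimate GE(Q) by uniform sampling over full-key candidates.

    mu_byte, var_byte: arrays of shape (16, 256) with the per-trace mean
    and variance of the comparison score for each one-byte subkey guess.
    The negligible chance of sampling the true key itself is ignored.
    """
    rng = np.random.default_rng(seed)
    # Step 2: sample M full-key candidates uniformly from the key space.
    guesses = rng.integers(0, 256, size=(n_samples, NUM_BYTES))
    # Step 1 (profiling) lets byte-wise statistics simply add up to give the
    # full-key comparison-score mean and variance of each sampled candidate.
    byte_idx = np.arange(NUM_BYTES)
    mu = mu_byte[byte_idx, guesses].sum(axis=1)
    var = var_byte[byte_idx, guesses].sum(axis=1)
    # Probability that the sampled wrong key outranks the true key after Q
    # traces: Phi(-sqrt(Q) * mu / sigma), one univariate Gaussian CDF each.
    p_wrong_wins = norm.cdf(-np.sqrt(Q) * mu / np.sqrt(var))
    # Sample mean of the scaled probabilities, scaled by the key-space size.
    return 1.0 + KEY_SPACE * p_wrong_wins.mean()
```

With this factoring, each sampled candidate costs only one Gaussian CDF evaluation regardless of the number of key bytes, which is where the speed advantage over computing actual full-key ranks comes from.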
In this section, we apply both the GEA algorithm and the empirical GE estimation to two data sets. The first is the ASCAD data set, a set of power measurements that serves as a benchmark for DNN-based side-channel attacks. We conduct single-byte GE estimation on this data set to show the accuracy of the GEA estimate as well as the generality of the GEA algorithm for DNN-based attacks. The second data set is SG21M, which contains one million unmasked AES power measurements collected from an SG2 board at our research lab. We conduct full-key GE estimation on this data set to show that the GEA algorithm is currently the only practical solution for full-key GE evaluation.

In the first experiment, we use the pre-trained CNN and MLP models included in the ASCAD project, trained on first-order masked AES power measurements with a maximum desynchronization of 50. Only the CNN model is able to recover the correct key. By computing both the GEA estimate and the empirical GE estimate, we find that the two estimates agree with each other in both successful and failed attacks, which confirms that GEA is an unbiased estimator of GE and can be applied to DNN-based attacks. We can also observe that the confidence interval for our GEA is much tighter than the confidence interval for the empirical GE, even in this one-byte case.

In the second experiment, we conduct a traditional template attack to break the 16-byte last-round key byte by byte, based on the one measurement point with the highest correlation with the Hamming-weight label. By sampling 6 × 10^7 scaled probabilities for each Q value in 1.8 hours, GEA is able to provide a GE estimate with a very narrow confidence interval where the empirical method cannot. We can see that at Q equal to 50,000, the GEA confidence interval is now much narrower, and we can judge the safety of the device.

In conclusion, we propose a new GE estimation algorithm based on the theoretical distribution of the ranking score vectors. We discover the relationship of GE with pairwise success rates and use a sum of univariate Gaussian probabilities to estimate GE. In our implementation, we apply GEA to both a traditional template attack and a DNN-based side-channel attack, giving the only practical full-key GE evaluation tool. Results show that GEA is much more accurate and efficient than current GE estimators, and it can predict GE for Q values larger than the size of the experimental data set. That is all the content of this talk. Thank you.