Hello. This presentation is on our paper "High-Speed Masking for Polynomial Comparison in Lattice-Based Key Encapsulation Mechanisms". My name is Florian Bache, and this is joint work with Clara Paglialonga, Tobias Oder, Tobias Schneider, and Tim Güneysu. The outline of this talk is as follows: I will give a short introduction to the relevant topics; then I will present our masked comparison algorithm and give a short analysis of its security and correctness; I will provide some performance figures and some details about the security evaluation; and finally I will give a conclusion.

The concept of side-channel attacks has been around for several years, and it basically works as follows. We have a cryptographic device, which performs operations on sensitive data, and an attacker measures a side channel, for example the power consumption or the electromagnetic emanation, using an oscilloscope to create the side-channel traces shown on the right. These are then evaluated using statistical tools to recover the secrets, for example the key of the device. To counteract this kind of attack, there are several approaches, for example hiding and masking; we will focus on masking in the rest of this talk.

In masking, the sensitive variable is split into multiple shares. This can be done either by Boolean masking, where the sensitive variable x is split into several shares x_i using XOR, or by arithmetic masking, where we split the data using addition modulo some modulus. We then compute on this securely shared data and recombine it later. The security of the scheme relies on the idea that an attacker cannot reconstruct the secret without having access to all of the shares, which is assumed to be hard. (A short sketch of both sharing types follows at the end of this introduction.)

In our paper we are interested in applying masking countermeasures to post-quantum cryptography, specifically key encapsulation mechanisms that are based on lattices. Many of these lattice-based KEMs rely on the Fujisaki-Okamoto transform to achieve the required CCA security, that is, security against a chosen-ciphertext attacker. One part of this transform is a comparison step, in which the re-encrypted ciphertext is compared with the input ciphertext. This comparison is very simple and fast in the unmasked case, but if we apply masking to the decapsulation, it becomes very expensive, especially at higher orders. Even for the first-order case we can look at NewHope, which we used as a benchmark in our paper: the first-order secure NewHope implementation of Oder et al. takes around 25 million clock cycles on an ARM Cortex-M4 microcontroller, of which the comparison alone is around 480,000 cycles. This is not a lot compared to the overall runtime, but it is important to note that the comparison in that reference only works for first-order masking; there is no easy way to extend it to higher orders. There is a general approach based on arithmetic-to-Boolean (A2B) conversion, which can be extended to higher orders, but it takes around 4 million clock cycles, almost a fifth of the overall scheme, and as we will show later, it gets worse quite fast.

So our contribution is a new algorithm for masked comparison, which is secure at high masking orders and has low runtime and low randomness requirements even at these high orders. We provide a theoretical security proof in the probing model and give performance figures for different parameter sets.
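To make the sharing notion concrete, here is the sketch mentioned above: a toy illustration of Boolean and arithmetic sharing in Python. It is not taken from our implementation (which targets the Cortex-M4 in assembly); the function names and the 16-bit share width are illustrative assumptions.

```python
import secrets
from functools import reduce
from operator import xor

def boolean_share(x, n, bits=16):
    """Split x into n Boolean shares such that x = x_1 XOR ... XOR x_n."""
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    last = reduce(xor, shares, x)          # the last share fixes the XOR sum
    return shares + [last]

def arithmetic_share(x, n, q):
    """Split x into n arithmetic shares such that x = x_1 + ... + x_n mod q."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    last = (x - sum(shares)) % q           # the last share fixes the modular sum
    return shares + [last]

# Recombining requires all n shares; any n-1 of them are uniformly random.
assert reduce(xor, boolean_share(0x1234, 3)) == 0x1234
assert sum(arithmetic_share(42, 3, 12289)) % 12289 == 42
```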
We also provide a practical second-order security evaluation using test vector leakage assessment (TVLA).

The overall algorithm we are trying to protect is the CCA-secure decapsulation using the Fujisaki-Okamoto transform, and in this case we looked at NewHope. The algorithm looks like this: on the left, we have the input A', which is a polynomial. This is given to the CPA-secure decryption function, and the output is passed to a random oracle G, which is usually instantiated using a hash function, and also to the re-encryption, which I mentioned earlier, which generates another polynomial, which we label A. These two polynomials, A and A', are put into the comparison, and only if the comparison shows that A and A' are the same is the output processed further.

As I said, we focus on this comparison step. Again, we have two inputs: the polynomial A', which consists of k coefficients, and the output of the CPA encryption, A, which is also a polynomial consisting of k coefficients. It is important to note that A must be shared to achieve security, because it depends on the secret key sk. The sharing used is typically arithmetic, because of the lattice structure used in the CPA decryption and encryption.

Related work that has also studied the masking of this comparison step can be classified into two types: first, the efficient comparison by Oder et al., which is first-order only; and second, the approach based on arithmetic-to-Boolean conversion, which works at higher masking orders but has quadratic complexity in the number of shares and therefore gets expensive as we increase the number of shares to increase the masking order.

Our proposed algorithm is built on the core idea that instead of comparing one coefficient at a time, we pool the coefficients into sets, which we compare all at once. Again, the inputs are the arithmetically shared coefficients A_i and the unshared, public coefficients A'_i, and the output should be true if the polynomials match, that is, if every A'_i is equal to the respective A_i. The first step is splitting the coefficients into X sets of size L each, such that the first set contains A_1 to A_L, the second set contains A_{L+1} to A_{2L}, and so on. Then we calculate a masked sum over each of these sets. Please note that we need fresh randomness here: the R1_i and R2_i are random values. But also note that the randomness is constant over the shares, so we need two random words per coefficient, independent of the number of shares. After we have calculated this shared sum, we can unmask it simply by adding up the shares. Note also that all arithmetic operations here are modulo q; I left that out for readability. We can then rewrite the unmasked sum V_m as the sum over the set of R1_i times A_i plus n times R2_i. As there are no shares in this expression (it is a sum over the unshared coefficients), we can calculate the same sum, V'_m, for the unshared polynomial, and the final step is just comparing these unshared sums V_m and V'_m.

To see the complexity of our algorithm, we can look again at the equation for the sums: we see two sums, where the inner sum runs over the coefficients in a set and the outer sum runs over the shares. So the complexity is linear in the number of coefficients and in the number of shares.
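The following Python sketch shows the masked comparison as described above. It assumes the shared sum has the form V^(j) = sum_i (R1_i * A_i^(j) + R2_i), which matches the properties stated in the talk (two random words per coefficient, randomness constant over the shares, unmasking by plain addition); the exact expression in the paper may differ in presentation. It demonstrates functional behavior only and is of course not itself a side-channel-hardened implementation (ours is written in Cortex-M4 assembly).

```python
import secrets

Q = 12289  # NewHope modulus (prime)

def masked_compare(A_shares, A_prime, L, q=Q):
    """Compare a shared polynomial against a public one, set by set.

    A_shares: list of n share vectors, so that A[i] = sum_j A_shares[j][i] mod q
    A_prime : list of k public coefficients
    L       : set size; k must be divisible by L (X = k // L sets)
    """
    n = len(A_shares)                       # number of shares
    k = len(A_prime)                        # number of coefficients
    assert k % L == 0
    for start in range(0, k, L):            # one masked sum per set
        idx = list(range(start, start + L))
        # two fresh random words per coefficient, reused across all shares
        R1 = [secrets.randbelow(q) for _ in idx]
        R2 = [secrets.randbelow(q) for _ in idx]
        # shared sums V^(j) = sum_i (R1_i * A_i^(j) + R2_i), then unmask by adding
        V = 0
        for j in range(n):
            Vj = 0
            for t, i in enumerate(idx):
                Vj = (Vj + R1[t] * A_shares[j][i] + R2[t]) % q
            V = (V + Vj) % q
        # public sum V' = sum_i (R1_i * A'_i + n * R2_i)
        V_prime = 0
        for t, i in enumerate(idx):
            V_prime = (V_prime + R1[t] * A_prime[i] + n * R2[t]) % q
        if V != V_prime:                    # a mismatch in any set rejects
            return False
    return True
```

If the polynomials are equal, V and V' agree in every set by construction; if they differ, the random factors R1_i make each set sum collide with probability only 1/q (for prime q).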
The required randomness is also linear in the number of coefficients but, as I said before, independent of the number of shares: if we increase the masking order, we do not need more randomness.

To analyze the correctness, we first need to show that valid ciphertexts actually pass the comparison, and this is easy to see. The second part is showing that invalid ciphertexts are actually rejected. We show in the paper that the collision probability for one of the sums, so for one of the sets, is 1/q, and the collision probability over all of the sets is q^-X, where X is the number of sets. So we can choose the number of sets depending on the collision probability we are comfortable with. For NewHope, we set the number of sets to X = 16, which gives us a low collision probability but also the possibility to use up to 64 shares, so very high masking orders.

The general idea of the security proof (the detailed proof is of course in the paper) works like this: we show that the comparison is t-non-interfering (t-NI) in the probing model. To do this, we first identify all intermediate variables in the algorithm and sort them into groups, and then we simulate all of the elements in each group without using all of the input shares of any coefficient, following a bootstrapping approach. Two remarks: first, the proof does not work if both input polynomials are equal, but we show in the paper that this case does not give the attacker an advantage. Second, the simulation, and therefore the proof, only works for prime moduli, so we cannot use a power-of-two modulus.

Now to the performance evaluation. First, we implemented the algorithm on the Cortex-M4 microcontroller, the same platform used by the other implementations we benchmark against. The implementation was done in assembly with some optimizations, and we used the on-board true random number generator, combined with rejection sampling, to generate the required randomness. These are the performance figures from our paper; please note that all values include the time needed for randomness generation. Some of the numbers I already showed earlier. When using two shares, we are even faster than the first-order-only comparison of Oder et al. For higher orders that comparison algorithm does not work, so we cannot provide numbers there, but we can compare ourselves with the A2B-conversion-based approach, and as you can see, the performance gain is significant: it starts out at a factor of 16 for two shares, and at five shares we are already almost 100 times faster. We also benchmarked the performance for different parameter sets, for example Kyber-768 and LAC-192, and here we get performance improvements from a factor of 10 for two shares up to, again, 95 for five shares, depending on the scheme and of course the masking order.

Regarding randomness consumption: as I said earlier, the required randomness is independent of the number of shares. If you see slightly varying figures here, it is because we used rejection sampling, so the actual number of random words is probabilistic; but as you can see in the table, it is almost constant, independent of the masking order, and we see a reduction by a factor of 15 for two shares on NewHope, up to a factor of 275 for LAC at five shares. In other words, the A2B-conversion-based approach needs 275 times more randomness than ours when applied to LAC-192 at five shares.
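Before moving on to the practical evaluation, here is a quick worked check of the collision bound from the correctness analysis, using the NewHope modulus q = 12289 and X = 16 sets. The resulting 2^-217 figure is derived here from those two values, not quoted from the paper.

```python
import math

q, X = 12289, 16                 # NewHope modulus and number of sets
p_one_set = 1 / q                # collision probability for a single set
log2_total = -X * math.log2(q)   # overall probability is q**-X
print(f"per-set: 2^{math.log2(p_one_set):.1f}, total: q^-{X} = 2^{log2_total:.1f}")
# -> per-set: 2^-13.6, total: q^-16 = 2^-217.4, i.e. negligible
```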
Now to the practical security evaluation. We used the same Cortex-M4 board as for the benchmarks, and we used the constant-time version of our implementation; this just means that we generate the randomness before the actual measurement, which allows for easier alignment of the traces. We also reduced the number of coefficients to four: the complete NewHope implementation has 1024 coefficients, but if we measure the complete comparison, the trace gets very long and it becomes computationally prohibitive to perform a multivariate second-order TVLA, so we reduce the trace length this way. The measurements were collected using an EM probe sampled with an oscilloscope at a rate of around 150 megahertz and with 8-bit resolution. The analysis metric we used is fixed-versus-random t-testing with a significance level of 0.01 (a sketch of this metric follows at the end of the transcript), and we collected up to one million measurements for the analysis.

This is a sample trace of the comparison of four coefficients. It is rather noisy, but as you will see on the next slides, we get some interesting results. Here is the first-order leakage of two versions of our algorithm: on the left, the two-share implementation, which should guarantee first-order security, and on the right, the three-share implementation, which should give us second-order security. As you can see, the threshold is not reached for either implementation, so both appear to be first-order secure. Next is the second-order multivariate evaluation of our two-share implementation, again with four coefficients and this time with two shares. To make the plot more readable, we highlighted the t-values that cross the threshold in red. As you can see, and as expected, there are several multivariate leakage points; this means that if the leakage of these points is combined, it reveals information about the sensitive data. Again, this is expected, because the two-share implementation should be first-order secure but not second-order secure. For the three-share implementation, the plot looks like this: again, leakage crossing the threshold would be highlighted in red, and since no combination of points produces high leakage, there is nothing highlighted. We conclude that there is no multivariate second-order leakage in our three-share instantiation of the algorithm.

To conclude this presentation: we propose a new higher-order masking algorithm for the comparison function in the Fujisaki-Okamoto transform. Its complexity is proportional to the number of coefficients and the masking order, so it is linear in the masking order, in contrast to the previous A2B-based approach, which is quadratic in the masking order. We also provided a theoretical security proof of our algorithm and the practical second-order test vector leakage assessment you just saw. In future work, we would like to see whether it is possible to extend this algorithm to non-prime q, because several lattice-based key encapsulation mechanisms use power-of-two moduli. It would also be interesting to see whether our side-channel countermeasure can be combined with fault-injection countermeasures. Thank you for listening to this talk.
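For reference, here is the sketch of the fixed-versus-random Welch's t-test referred to above, together with a multivariate second-order variant based on centered products. This illustrates the standard TVLA methodology with assumed NumPy array shapes; it is not the evaluation script used for the measurements in the talk.

```python
import numpy as np

def welch_t(fixed, random):
    """Pointwise Welch's t-statistic between two groups of traces.

    fixed, random: arrays of shape (n_traces, n_samples).
    Leakage is typically declared where |t| exceeds a threshold
    derived from the chosen significance level (commonly |t| > ~4.5).
    """
    mf, mr = fixed.mean(axis=0), random.mean(axis=0)
    vf, vr = fixed.var(axis=0, ddof=1), random.var(axis=0, ddof=1)
    return (mf - mr) / np.sqrt(vf / len(fixed) + vr / len(random))

def second_order_t(fixed, random):
    """Multivariate second-order test: center each group per sample point,
    then use the product of every pair of sample points as the combined
    leakage and run the same t-test on those products."""
    def centered_products(traces):
        c = traces - traces.mean(axis=0)
        return np.einsum('ni,nj->nij', c, c).reshape(len(traces), -1)
    return welch_t(centered_products(fixed), centered_products(random))
```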