Hi, welcome to the presentation of the paper "Masking Kyber: First- and Higher-Order Implementations". I am Marc Gourjon, and this is joint work with Joppe Bos, Tobias Schneider, Joost Renes and Christine van Vredendaal. In our work, we applied the masking countermeasure to the post-quantum crypto scheme Kyber. Kyber is one of the finalists in the ongoing NIST competition and might be selected as one of the standard post-quantum crypto schemes for the next few decades, so it makes sense to have a look at whether we can mask it and protect it against side-channel attacks. In doing so, we not only masked the scheme, but also had to come up with two new algorithms to mask the operations Kyber uses for the compression and the comparison. In the following, I am going to detail how we mask Kyber and why we did it, as well as go into our evaluation details a bit.

One slide to recap side-channel attacks. Side-channel attacks work by exploiting physical effects. If we look at the processor here on the right-hand side, then when it performs a computation, for example adding the sensitive key K to a plaintext P, the processor draws an amount of power which depends on the data used in that instruction step, so on the values actually being summed up. By placing some measurement gear, an adversary can observe this data-dependent power consumption and infer information about the sensitive data.

Masking is a countermeasure against this. For a sensitive value x, it splits the value into multiple shares x0, x1, up to xn. There are different forms of sharing, which is relevant for post-quantum crypto schemes. Here, in a Boolean sharing, the XOR over all shares is equivalent to the original sensitive value x.
But there are other kinds of sharing, because the Boolean one on the upper row allows us to perform Boolean operations on the masked values quite easily. If we instead want to perform arithmetic computations on a shared value, an arithmetic sharing is much more helpful, which uses arithmetic addition over the shares in a ring with a certain modulus. Here, the sum of the shares modulo q is equivalent to the sensitive value x. For the post-quantum crypto scheme Kyber it is quite relevant to notice that we have the choice to use a modulus which is a power of two, or to use a prime modulus, as Kyber actually requires.

In general, a side-channel attack might exploit multiple measurement samples, up to t samples; this is then a t-th order side-channel attack, and to protect against a t-th order side-channel attack we also need to perform higher-order masking. This is why we masked the post-quantum scheme not only at first order, but also generically at higher orders.

A few facts about Kyber. Kyber is a key encapsulation mechanism which is based on the module learning with errors problem, and it uses a prime modulus, which, as we are going to see, causes a few headaches. It has three primitive operations. The key generation produces a fresh secret key on every invocation; since it is therefore usually only susceptible to single-trace attacks, masking is not necessarily the right countermeasure to protect it. Then we have the encapsulation, which encapsulates a secret message m using the public key, and only the party which owns the corresponding secret key is able to recover this secret message m using the decapsulation operation. The message m is then used by both parties to derive a common session key, but on the encapsulation side it is ephemeral, and we usually don't protect ephemeral secrets using masking.
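The two sharing types from this slide can be sketched in a few lines of Python. This is purely an illustration with Kyber's modulus, not code from our implementation:

```python
import secrets

q = 3329  # Kyber's prime modulus

def boolean_share(x: int, n: int, bits: int = 16) -> list[int]:
    # n shares whose XOR recombines to the sensitive value x
    shares = [secrets.randbits(bits) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]

def arithmetic_share(x: int, n: int) -> list[int]:
    # n shares whose sum modulo q recombines to the sensitive value x
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    last = (x - sum(shares)) % q
    return shares + [last]
```

A computation on the shares never touches x itself; only recombining all n shares reveals it, which is what forces a t-th order adversary to combine t+1 measurement samples.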
This decapsulation, as I already said, uses the long-term secret key, and therefore we need to protect this long-term secret key using the masking countermeasure. Furthermore, the decapsulation is made secure against chosen-ciphertext attacks by the Fujisaki-Okamoto transform, and again, this is going to cause a bit of difficulty.

Let's quickly look at the decapsulation in a bit more detail. On the left-hand side it takes as input the ciphertext c, which is an encapsulation of the message m, and on the right-hand side it produces a shared ephemeral session key K. It works by first performing the decryption using the secret key s, producing a message m'. The resilience against chosen-ciphertext attacks then works by re-encrypting this message m' using the fully deterministic encryption procedure with the public key, which results in a ciphertext c'. This c' has to be compared, here in this stage, to the original ciphertext c. Only if these two ciphertexts are equal will the key derivation actually use the secret-key-dependent material K', the output of the hash function here in the block G. If they are not equal, a static random value is used instead.

All the blocks marked in color need to be masked to some degree. A special case here is K', which we do not need to mask, or rather which we are able to unmask the moment we know that the ciphertexts match, so that c' is equal to c. Also, the output of this comparison does not need to be masked, but only the very final output, no intermediate value of the comparison. So we can unfold the blocks and look at them in closer detail.
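One detail the figure glosses over: once the comparison result is a single public bit, the final selection between K' and the static random fallback should still avoid branching on that bit. A minimal, purely illustrative sketch (the function name and the byte-wise select are my own, not the paper's code):

```python
def ct_select(ok: int, k_prime: bytes, z: bytes) -> bytes:
    # branch-free byte-wise select: returns k_prime if ok == 1, z if ok == 0
    mask = -ok & 0xFF              # 0xFF when ok == 1, 0x00 when ok == 0
    return bytes((a & mask) | (b & ~mask & 0xFF)
                 for a, b in zip(k_prime, z))
```

The point is that the sequence of executed instructions is identical for both outcomes, so the selection itself does not leak whether the ciphertexts matched.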
Here on the left-hand side we again have the ciphertext, which is decompressed, resulting in a vector of polynomials u and a polynomial v, where the bold symbols denote vectors of multiple polynomials, and every coefficient of these polynomials is in the range 0 to q-1. Then some operations are performed which involve the long-term secret key, and therefore these blocks here need to be masked. At this point we perform a compression, which turns a coefficient in 0 to q-1 into a message bit, which is 0 or 1. Then we have the re-encryption, which uses the output of a masked hash, a SHA-3. A lot of operations are performed, which again result in a vector of polynomials u' and a polynomial v'. These are usually compressed into a ciphertext c' and then compared to the original ciphertext c. But in our case, we come up with a new algorithm which avoids the compression and performs the comparison directly on the coefficients of the polynomials. I will now go into detail.

For most of these blocks we have solutions available in prior work, but not for the compression to one bit, and not for the decompression-free comparison. For the PRF, the pseudo-random function, there is prior work, and for the centered binomial sampler as well. The decompression here actually maps one bit to 0 or roughly q/2, so it is a one-bit Boolean-to-arithmetic conversion, and there we also have algorithms available.

Okay, I will now detail the first green box, the compression of a coefficient into a single bit. If we look at the equation for this, it is "wonderful" to mask, because we have a coefficient x which is divided by q and then rounded to the nearest bit. This is "awesome", since we have neither a masked rounding operation nor a masked division.
Therefore this is quite hard to mask, or rather we have to come up with new approaches to mask it. In the following we are always going to look at the interval of values the coefficient x can take. Usually this is 0 to q-1, here on the right-hand side, and we can see that this interval is not equally spaced: it is split at q/4 and 3q/4. The first approach, which is already in prior work, is to shift this interval by adding q/4 to the polynomial, constructing two equally shaped intervals, and coming up with a new shifted compression function which outputs 0 if x is smaller than q/2 and 1 otherwise.

In other post-quantum crypto schemes this is already almost sufficient to mask the compression, because in Saber, for example, the modulus is a power of two, and therefore the most significant bit of the variable x immediately tells whether the coefficient falls into the interval below q/2 or the interval greater than or equal to q/2. Unfortunately, in Kyber we have a prime modulus, so the position where the most significant bit becomes set has a certain offset from q/2. That also means the intervals defined by the most significant bit have a different shape than the ones we need: the most significant bit is all zero here on the left-hand part, so we cannot use the most-significant-bit approach directly.

Okay, so we have a different approach to do this for prime q, for arbitrary moduli. It works by looking at the individual bits which make up x. Here I have annotated the bits of x, and one thing to observe is that we are operating mod q: there are no values above q-1 in this operation, since the coefficient x has always been reduced before we perform these steps.
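As a reference point, the plain and the shifted one-bit compression can be written down unmasked in a few lines. The modulus and rounding are Kyber's, but the code is only a sketch to show the equivalence, not the masked algorithm:

```python
q = 3329  # Kyber's prime modulus

def compress1(x: int) -> int:
    # Compress_q(x, 1) = round(2x/q) mod 2, with integer rounding half up
    return (4 * x + q) // (2 * q) % 2

def compress1_shifted(x: int) -> int:
    # equivalent shifted form: add q/4, then compare against ceil(q/2)
    y = (x + q // 4) % q
    return 1 if y >= (q + 1) // 2 else 0
```

After the shift, all that remains is a single comparison of y against the constant ceil(q/2), which is exactly the step the bit-wise approach on the next slide makes maskable.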
So we can look at the bits of x, and if we see that the 11th bit is set, we immediately know that x must be in this part of the interval here, and therefore the shifted compression must output 1. On the other hand, we can also look at the cases where this bit is specifically not set: if the 10th, the 9th and the 8th bit are set, then we are approximately somewhere here in the interval, and again we know that the shifted compression must output 1. This is essentially a binary search over the bits of x, and when we do this, using for example the table-based approach here, we end up with a formula which says: if either the 11th bit of x is set, or the bit is not set and this formula holds, then the shifted compression of x must be 1. The great part is that we have now transformed our original equation, with its rounding and division by q, into a formula composed purely of Boolean operations: XORs, negations and ANDs. And this is something we can mask very easily, because there are lots of masked versions of these operations.

Overall, the algorithm is shown here on the right-hand side. The representation actually processes multiple coefficients, that is, all the coefficients of a polynomial. It starts by shifting, as I detailed before, then performs an arithmetic-to-Boolean conversion, that is, changing from a sharing where the sensitive value is the sum of the shares x0 + x1 + x2 and so on, to a setting where the Boolean XOR of the shares, x0 XOR x1 and so on, is equivalent to the secret. After this arithmetic-to-Boolean conversion, we are able to use the bits of the Boolean sharing of x in this equation, which is the masked representation of the equation I showed on the slide before.
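The binary search over the bits can be sketched generically as a comparator against the constant ceil(q/2) built only from single-bit AND, XOR and NOT, so it can be evaluated share-wise on a Boolean sharing and bit-sliced. This is an illustrative generic circuit, not the exact formula derived in the paper:

```python
q = 3329
K = 12            # q < 2^12, so 12 bits suffice
T = (q + 1) // 2  # shifted compression outputs 1 iff y >= ceil(q/2)

def ge_const(bits: list[int], t_const: int, k: int) -> int:
    # 1 iff the value with the given bits (LSB first) is >= the public
    # constant t_const, using only single-bit AND / XOR / NOT,
    # scanning from the most significant bit down
    gt, eq = 0, 1
    for i in range(k - 1, -1, -1):
        t = (t_const >> i) & 1
        gt ^= eq & bits[i] & (t ^ 1)  # first differing bit favours y (disjoint terms, so XOR == OR)
        eq &= bits[i] ^ t ^ 1         # prefix still equal
    return gt ^ eq                    # ">=" means strictly greater or all bits equal
```

Because the constant is public, the circuit specializes at compile time into exactly the kind of XOR/NOT/AND formula shown on the slide.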
Moreover, since this is a Boolean operation, we can easily bit-slice it and perform these operations on multiple coefficients at once, which speeds up the entire algorithm. In summary, we get a compression to one bit for prime moduli, and it only requires a single A2B conversion, which at higher orders is usually the bottleneck. Furthermore, it is higher-order probing secure, that is, we can also use it to protect against adversaries which exploit multiple measurement samples. By using the bit-sliced Boolean search we are generic in q and can apply this to different settings, and it is not only well-suited for single-bit compression: we also detail how it is applicable to compression to multiple bits, although it becomes a bit more complex there.

Now I am going to detail our second new algorithm, for the ciphertext comparison. Quickly remember that as an intermediate output of the re-encryption in the decapsulation, there is this vector of polynomials u' and this polynomial v'. Usually these are compressed to multiple bits: v' is compressed to 4 bits, and u', depending on the security parameter of Kyber, is compressed to 10 or 11 bits. This means that our previous approach for the compression becomes quite complex, and each of these compressions here is quite heavy when masked. Furthermore, there is the comparison: the result of the compressions is essentially the u part and the v part of the ciphertext, which together make up the ciphertext c', and this is then compared to c. This comparison needs to be protected as well, whereas its output does not need protection anymore; the details of why are described in the paper.
Yes, so we don't want to mask these compressions, because they bring a lot of overhead, and we actually circumvent them by not performing the compression in these two blocks and performing a different kind of comparison here instead. Instead of performing an equality check on the compressed ciphertext, we rather ask: does the coefficient u'_i, belonging to the vector u', lie in the set of values a coefficient can take which would compress to the correct ciphertext bits? So here we have the set of values which compresses to the right ciphertext chunk c_i, which belongs to the input ciphertext. The important part to notice is that this is public information: we don't need to protect it, it is possibly even controlled by the adversary, and since it is public we can perform arbitrary computations on it. That means we can precompute the set of values belonging to each ciphertext chunk. We do this via two functions s and e which denote the start and the end of the interval of valid coefficients. Our question, instead of this whole block, is now: given the coefficient u'_i, does it lie in the interval where a valid coefficient corresponding to c_i lies? Then all we need to do is perform this check for all the coefficients in u' and v': if all of them lie in the correct interval we output true, and if one of them does not, we output false.

Now you might have observed that this question, is the coefficient greater than or equal to the start of the interval and smaller than the end of the interval, is not that straightforward to perform in a masked manner, since we have a prime modulus. But we are quite lucky here, because the interval between start and end for a correct ciphertext is quite small for the relevant compression parameters we have in Kyber.
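The interval idea can be sketched unmasked: for each public ciphertext chunk t, the coefficients that compress to t form one contiguous arc modulo q, so equality of compressed values is the same as a range check on the raw coefficient. The brute-force precomputation below is purely illustrative; in practice the start s(t) and end e(t) come from closed-form bounds:

```python
q = 3329   # Kyber's prime modulus
d = 4      # compression bits for v in Kyber

def compress(x: int) -> int:
    # Compress_q(x, d) = round(2^d * x / q) mod 2^d, integer rounding half up
    return ((x << (d + 1)) + q) // (2 * q) % (1 << d)

def arc(t: int) -> tuple[int, int]:
    # start and length of the contiguous arc {x : compress(x) == t} mod q,
    # found here by scanning all of [0, q) purely for illustration
    members = {x for x in range(q) if compress(x) == t}
    start = next(m for m in members if (m - 1) % q not in members)
    return start, len(members)
```

Since t is public, all of this runs on unshared data; only the final membership test of the shared coefficient against the public bounds has to be masked.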
That means that these intervals always fit into a range where the most significant bit is one. We can then approach the question whether x is smaller than the end of the interval, and whether x is greater than or equal to the start of the interval, by shifting the value of the coefficient such that it lies strictly in the range where the most significant bit is one, and then extracting the most significant bit in a masked manner to determine whether x is smaller than e and greater than or equal to s.

Yes, so this is quite cool, because we again have a comparison for prime moduli which avoids the costly compression stages and is again higher-order probing secure. Unfortunately we require two A2B conversions: one to check whether the coefficient is smaller than the end of the interval, and one to check whether it is greater than or equal to the start of the interval. On the other hand, it is again widely applicable for different moduli, non-prime and prime, and also for different compression values, with some limitations, as I detailed on a slide before.

Okay, we implemented all of this, hardened a few implementations, and actually used formal verification, with scVerif, to assess that our implementations are correctly hardened. In this talk, however, I am going to focus on our benchmark results. The first question which arises is: are our algorithms actually faster than a generic approach using masked lookup tables? And the answer is yes.
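The most-significant-bit extraction can be illustrated unmasked: with 2^k > q, adding 2^k minus the bound to a coefficient x in [0, q) makes bit k of the sum exactly the flag "x >= bound"; in the masked setting this bit is read off the Boolean shares after the A2B conversion. Constants and names below are illustrative:

```python
q = 3329
k = 12          # 2^k = 4096 > q, so coefficients fit in k bits

def geq(x: int, bound: int) -> int:
    # 1 iff x >= bound, for 0 <= x < q and 0 <= bound <= q: after adding
    # 2^k - bound, bit k of the sum is set exactly when x >= bound
    return ((x + (1 << k) - bound) >> k) & 1

def in_interval(x: int, s: int, e: int) -> int:
    # membership in a non-wrapping interval [s, e): x >= s and not x >= e
    return geq(x, s) & (geq(x, e) ^ 1)
```

Each of the two geq calls corresponds to one addition of a public constant on the arithmetic shares followed by one A2B conversion, which is why the comparison needs two A2Bs in total.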
In our paper we have three different settings. In setting three, at first order, our two algorithms are used, together with a masked lookup table for the arithmetic-to-Boolean conversion, and in total this outperforms the two other approaches, where one or both of our algorithms are replaced by a generic masked lookup table. This is mainly because the initialization of those lookup tables takes quite a lot of time, whereas here we have just the A2B table, which needs to be initialized once and is shared across a lot of components.

Then we can look at the actual benchmark results at first order. We performed the benchmarking on two devices, a Cortex-M0+ and a Cortex-M4, against two reference implementations, the PQClean implementation and PQM4. For the Cortex-M0+ we observe a slowdown of factor 2.2, but this excludes the randomness generation and is more or less a comparison between compiler-generated code and compiler-generated code, with no assembly optimizations available for the Cortex-M0+. You can see that the majority of the overhead is caused by the pseudo-random functions in the get-noise functions involved in the re-encryption. For the Cortex-M4 the situation is a bit different: there we observe a slowdown of factor 3.5, partly because we included the random number generation, but also because the PQM4 implementation we used as an unmasked reference is highly optimized and uses a lot of assembly optimizations, for example for the NTT. Again we can see that a majority of the impact is caused by the sampling of the error polynomials and also by the Keccak.
Then we also wanted to have a look at higher orders, and we implemented our scheme again with second- and third-order protection on both devices. This time there is no lookup table involved, and all the A2B conversions are performed using actual algorithms. We can immediately see that there is a severe impact due to these A2Bs. We still have the sampling, which produces a lot of overhead at second and third order, but we can also see that the comparison contributes a larger overhead depending on the order, because it uses the A2B, and these arithmetic-to-Boolean conversions come with a very massive randomness consumption and operation count at higher orders. In total, for second order we see a slowdown of factor 20 or factor 50, and for third order it is even more severe. But we also need to mention that these implementations have hardly been optimized for second and third order, and there is a lot to gain, so this is more like a first result, from which we can start more detailed, focused optimization strategies.

There is a lot more to be found in the paper. All our constructions are proven strong non-interference secure, and we also give complexity estimates. We detail how we mask the CBD sampler, which we adapted, and why the KDF and the output of the comparison do not need to be masked. We performed extensive physical evaluation in a low-noise environment with TVLA on a Cortex-M0+, and we also used formal verification of the implementation to really be sure it is correctly masked, which actually required coming up with new techniques to verify the security of our lookup tables. We also present the leakage model we use for verification on the Cortex-M0+. Thanks for your attention, and please ask any questions during the live session of the CHES talk. Thanks a lot.