Finally, the last talk in this session is on multiplicative masking of AES in hardware, and it will be given by Lauren De Meyer. It's a collaboration with Oscar Reparaz and Begül Bilgin, both from COSIC or formerly COSIC. Hi, so this is the third talk of the session, and it's the third day of CHES, so I'm going to give a very short introduction, because I'm going to assume that everyone already knows what the situation is: we have a problem, the problem of side-channel attacks. And luckily, there's also a solution, or many solutions, and a very popular one is masking, which basically means we're going to split the sensitive variable into multiple shares. And this is the hardware masking session, so we are also dealing with an extra problem, the problem of glitches. That makes hardware masking quite different from software masking. Over the past few years, we've seen a lot of works dealing with this problem, and a lot of implementations that are secure in the presence of glitches. Many of these implementations use Boolean masking, which means that the sum of the shares is equal to the sensitive variable. That makes linear functions super easy to mask; no one is writing papers about that. But nonlinear functions, on the other hand, are very tricky. For example, the AES S-box is an inversion in a Galois field, and it's very tricky to implement. We've seen a lot of works successfully implementing AES and trying to do it as efficiently as possible. Here you see only some of the many works that have implemented AES using Boolean masking. What they all have in common is that they use the tower field approach of Canright, which is a very nice approach, and it has given the community the chance to develop their inner artists a bit more, as you can see by these nice pictures. But it's also a very complex method.
And the good news today is that for this talk, you do not need to understand any of this, because we're going to do something a lot simpler. I will skip ahead to our results for a second and tell you what we're going to present. I'm going to present a first-order AES S-box which is 29 percent smaller than the state of the art. And when I say state of the art, I'm comparing to the two best implementations of the last years, which happen to have been made by the previous two speakers of this session. I'll also present a second-order AES S-box which is 18 percent smaller than these previous results. And we obtained these area results for similar randomness and latency costs. Best of all, we didn't need the tower field approach. So then the big question is, how did we do this? What you see here is the round function of AES, which everyone knows. We have three linear blocks, which are easy to mask with Boolean masking, and one nonlinear block, the S-box, the Galois field inversion, which is very difficult to mask with Boolean masking. But suppose we were to use multiplicative masking. Then the S-box would be super easy to do, and the linear blocks would be very tricky. So a logical idea is to do the S-box with multiplicative masking and the linear blocks with Boolean masking, and just convert between the two types in between. And this is an idea that people already had 17 years ago; it was presented at CHES 2001. That's the idea we decided to revisit, because for the last years it seems like everyone wanted to use Canright, and we really like the simplicity of this idea. It had already been revisited for software masking in 2010 in a work of Genelle and co-authors, and that work was a starting point for what we did. We have to be careful, though, because 16 years ago people realized that multiplicative masking is not as great as they originally thought: we cannot effectively mask the zero element.
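To make the two sharing types concrete, here is a minimal Python sketch (my own illustration, not the paper's circuit) of Boolean versus multiplicative sharing of a byte in GF(2^8) with the AES polynomial:

```python
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8), reduced by the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """Inverse as a^254 (square-and-multiply); maps 0 to 0."""
    r = a
    for _ in range(6):
        r = gf_mul(gf_mul(r, r), a)     # exponent: e -> 2e + 1, up to 127
    return gf_mul(r, r)                 # final squaring: 127 -> 254

x = 0x53
# Boolean sharing: x = x1 XOR x2 -- linear layers act share-wise.
x1 = secrets.randbits(8)
x2 = x ^ x1
assert x1 ^ x2 == x
# Multiplicative sharing: x = m1 * m2 -- now INVERSION acts share-wise.
m1 = 1 + secrets.randbelow(255)         # mask must be non-zero
m2 = gf_mul(x, gf_inv(m1))              # only valid for x != 0
assert gf_mul(m1, m2) == x
assert gf_mul(gf_inv(m1), gf_inv(m2)) == gf_inv(x)
```

This also shows immediately why each style favors different layers: XOR-linear operations distribute over Boolean shares, while the field inversion distributes over multiplicative shares.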
No matter how many shares you're using, you will always have a first-order flaw, because a single share can reveal whether your secret is zero or not. The solution to this problem can be found in the MPC literature as well as the software masking literature. It relies on the fact that in a Galois field, if you invert a zero, you get a zero, and if you invert a one, you also get a one. So what we're going to do is change our zero into a one before the inversion and then change it back afterwards. This way, we never have to mask the zero multiplicatively. The only thing we need is a function that can detect if the input is zero, and everyone knows that function: it's the Kronecker delta function. So that's basically what we're going to do. We have here the masked AES inversion. We start with Boolean shares at the input. The input goes through a masked version of the Kronecker delta function, which outputs a shared version of one or zero. This is then added to the input, which means we can safely convert the Boolean shares to multiplicative shares. Once we have multiplicative shares, it is trivial to do the inversion locally on every share independently. And finally, we go back to Boolean masking for the rest of the round function. I will now zoom in on a few highlights of our implementation. Let's start very easy with a first-order Boolean masking of X. We want to convert this to a multiplicative masking. The first step is to add a multiplicative share. So now our X is shared by three shares, of which two are Boolean and one is multiplicative. Then, since we're dealing with glitches, we add a register to stop the glitches from propagating and to synchronize the wires. And finally, we go back from three shares to two shares by compressing the two Boolean shares into one. And voilà, we have a multiplicative sharing of X. Easy. Now we want to get the inverse.
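The zero-fix described above can be sketched, unmasked and purely for intuition, as follows (`gf_mul` and `gf_inv` are my helper functions for GF(2^8) arithmetic with the AES polynomial):

```python
def gf_mul(a, b):
    """GF(2^8) multiplication with the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """a^254 in GF(2^8); below it is never applied to a zero."""
    r = a
    for _ in range(6):
        r = gf_mul(gf_mul(r, r), a)
    return gf_mul(r, r)

def delta(x):
    """Kronecker delta: 1 if x == 0, else 0."""
    return int(x == 0)

def aes_inv(x):
    d = delta(x)            # detect a zero input
    y = x ^ d               # 0 becomes 1; every other input is unchanged
    return gf_inv(y) ^ d    # invert, then undo the fix: AES maps 0 -> 0

assert aes_inv(0) == 0      # the zero never had to be masked multiplicatively
assert all(gf_mul(x, aes_inv(x)) == 1 for x in range(1, 256))
```

Since 1 is a fixed point of the inversion, adding delta(x) before and after changes nothing for non-zero inputs and routes the zero around the multiplicative masking entirely.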
If you look at the inverse, you can see that we already have one of the shares in our circuit. So we only need to invert B to obtain a multiplicative sharing of X inverse. That's pretty cool: instead of two inverters, we only have one. Now we go back to Boolean sharing, using the same three steps. We first expand to three shares with an extra Boolean share. Then we synchronize to stop glitches from propagating. And then we compress back into two shares by multiplying the multiplicative share with each of the Boolean shares. So that's easy, right? And the second order is very similar. It looks a bit more complex, but really it's just the same. You can see here the three stages of expansion, synchronization, and compression, and we repeat them because we have to go from three shares to three shares. There are two things I'm going to point out here. First of all, we need an extra remasking here and here. The reason is that we found a flaw in previous works that were using this methodology, and we were able to solve it and make the circuit second-order secure by adding this extra remasking. Secondly, what's really cool is that we still only need one inverter. It doesn't matter if you do this with four shares or five or, if you're crazy, ten: you always need only one local Galois field inversion. So that's really pretty good for scalability. The reason is that we're using two types of multiplicative masking, about which you can find more information in the paper. Now let's look at this Kronecker delta function. This is the function that's going to detect if our Boolean-shared input is zero. You can do that with one big eight-input AND gate or with a tree of two-input AND gates, which we prefer because we have Boolean shares.
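The first-order conversion and the single-inverter trick can be sketched as follows. This is my reconstruction from the talk, with the synchronization registers omitted, and it assumes the zero-fix has already mapped the input away from zero:

```python
import secrets

def gf_mul(a, b):
    """GF(2^8) multiplication with the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """a^254 = a^(-1) in GF(2^8)."""
    r = a
    for _ in range(6):
        r = gf_mul(gf_mul(r, r), a)
    return gf_mul(r, r)

x = 0xA7                                  # assume x != 0 (zero-fix applied)
x1 = secrets.randbits(8)
x2 = x ^ x1                               # Boolean sharing of x

# Expand: fresh non-zero multiplicative mask Z; then compress the two
# Boolean shares into the single product share B = x * Z.
Z = 1 + secrets.randbelow(255)
B = gf_mul(x1, Z) ^ gf_mul(x2, Z)
# Multiplicative sharing of x is (B, Z): x = B * Z^(-1).
# Only ONE inverter is needed: x^(-1) = B^(-1) * Z, and Z is already there.
B_inv = gf_inv(B)

# Back to Boolean: share Z as z1 XOR z2 and multiply each share by B_inv.
z1 = secrets.randbits(8)
z2 = Z ^ z1
y1 = gf_mul(B_inv, z1)
y2 = gf_mul(B_inv, z2)
assert y1 ^ y2 == gf_inv(x)               # Boolean sharing of x^(-1)
```

The compression works because multiplication distributes over XOR in GF(2^8), and inverting B alone suffices because the other multiplicative share, Z, already sits in the circuit.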
We implement this circuit by replacing each of these AND gates with a first- or second-order secure multiplication gadget. Now let's count the inputs of this circuit. For first order, we have eight inputs, each in two shares, so that's 16 bits. And then we're adding seven more bits of fresh randomness in the AND gates. So 23 bits of input. How many outputs do we have? There's one output in two shares, so two bits. That's a very expensive two bits if we're computing it from 23 bits; think about how much entropy is in these inputs, and we're shoving it all into only two bits. So we were thinking we should be able to recycle the randomness to make this method more efficient. And for that, we observed something interesting about the AND gates that we're using. Some of you may know these as the domain-oriented masking gates that Hannes presented, also known as ISW multiplication. But the most important thing is, if you look at one output share, it depends on the XOR of the two shares of B. So it only depends on the value of B and not on its sharing. That's a subtle but very important difference, because it means that any randomness that was used to obtain the shares B1, B2 is canceled by the time you get to the output of this gate. What does this mean for us? It means that this point of the circuit is completely independent of any randomness that was used in this gate. And this point in the circuit is independent of any randomness that was used in this gate. And so on; the output is also independent of the randomness used in this part, the lower half of the circuit. So that's pretty cool, and we use it to recycle some of the randomness. We reduced the fresh randomness cost from 7 bits to 3 bits. And we did something similar for second order. It's a bit more complicated, but we went from 21 bits of fresh randomness to 13 bits.
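The observation about the multiplication gadget can be checked concretely. Below is a sketch of the first-order DOM-indep AND gadget over GF(2) (my notation): the output share z1 collapses to a1·b XOR r, so it depends on the value of b but not on b's sharing:

```python
import secrets

def dom_and(a1, a2, b1, b2, r):
    """First-order DOM-indep AND gadget over GF(2), one fresh random bit r."""
    z1 = (a1 & b1) ^ ((a1 & b2) ^ r)
    z2 = (a2 & b2) ^ ((a2 & b1) ^ r)
    return z1, z2

for _ in range(1000):
    a = secrets.randbits(1); b = secrets.randbits(1)
    a1 = secrets.randbits(1); a2 = a ^ a1      # sharing of a
    b1 = secrets.randbits(1); b2 = b ^ b1      # sharing of b
    r = secrets.randbits(1)
    z1, z2 = dom_and(a1, a2, b1, b2, r)
    assert z1 ^ z2 == a & b                    # correctness: z = a AND b
    # The key observation: z1 = a1 & (b1 ^ b2) ^ r = (a1 & b) ^ r.
    # It sees only the VALUE of b, so b's masking randomness cancels out.
    assert z1 == (a1 & b) ^ r
```

Because each output share is independent of the randomness used to share the other operand, randomness consumed earlier in the AND tree can be reused later without creating a dependency, which is what enables the 7-to-3-bit reduction.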
Finally, this circuit takes three cycles to compute, because we have three stages of a nonlinear block. That could mean you have to store the S-box input in three stages of registers while waiting for this output to be ready, and those registers can make this method very expensive; they would almost make it more expensive than what we already had in the state of the art. So what we did was reorganize the state and the key array so that this function is pre-computed while the input is still stored in the state. That way, we avoid these registers, and that's one of the reasons we were able to obtain the results we got. So let's look at these results again. This you already saw: our S-box is 29% smaller at first order and 18% smaller at second order. If we look at the entire AES implementation, we have an improvement of about 10% for both orders. And as you can see, our randomness cost is pretty much the same as the one obtained by Groß and his co-authors, and the latency does not differ much from the state of the art. Finally, I will tell you about the evaluations that we did. In the paper, you can find theoretical evaluations and simulations; now I will only show you our practical lab evaluations. For those who are not familiar with TVLA, it's a hypothesis test that allows you to detect leakage by comparing the distributions of two groups of measurements, one with a fixed plaintext and one with random plaintexts. On the left side, we see what happens when we turn the masking off. As you can see, there is leakage everywhere, because our t-statistic surpasses the threshold everywhere. When we turn the masking on, we see that the first-order leakage disappears. We still have leakage in the second-order moment, which is normal because this is our two-share implementation. We did the same thing for our three-share implementation, which is second-order secure. Again, when we turn the masking off, we have leakage everywhere.
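A fixed-vs-random Welch t-test in the spirit of TVLA can be sketched as follows (a toy example with simulated trace samples, not the authors' measurement setup); the conventional detection threshold is |t| > 4.5:

```python
import math
import random

def welch_t(fixed, rnd):
    """Welch's t-statistic between a fixed-input and a random-input group."""
    n1, n2 = len(fixed), len(rnd)
    m1 = sum(fixed) / n1
    m2 = sum(rnd) / n2
    v1 = sum((s - m1) ** 2 for s in fixed) / (n1 - 1)
    v2 = sum((s - m2) ** 2 for s in rnd) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

random.seed(0)
# Simulated power samples at one time point: the fixed class has a mean
# shift, i.e. this hypothetical unprotected device leaks here.
fixed_class  = [1.0 + random.gauss(0.0, 1.0) for _ in range(5000)]
random_class = [0.0 + random.gauss(0.0, 1.0) for _ in range(5000)]
t = welch_t(fixed_class, random_class)
assert abs(t) > 4.5   # flagged: the t-statistic surpasses the threshold
```

In an actual evaluation this statistic is computed per time sample over real traces, and for higher-order or bivariate tests the samples are first centered and combined before the same test is applied.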
And the leakage disappears when we turn the masking on. However, this is a univariate leakage detection test, and in a second-order attack the attacker is allowed to combine multiple time samples. So we also did a bivariate leakage detection test. We applied it first to our first-order implementation, which naturally shows leakage peaks. And then, when we applied it to our second-order implementation, the leakage peaks disappeared. So to conclude, we wanted to keep it simple, and we found some very nice inspiration in works from almost 20 years ago. We really wanted to push the limits, recycle randomness, and customize this implementation. This is why I presented a first- and a second-order implementation and not a generic higher-order construction: for any order that you're implementing, only customization is going to make it as optimal as you can possibly get. Thank you for listening, and I'm ready to receive your questions. Questions for Lauren? Amir. Do you hear me? Yeah. I have two questions, actually. One is related to how the mask is generated. In multiplicative masking, you have to get rid of the mask 0, right? Yeah. And then I can imagine that you had a buffer, probably, a buffer that you saved masks in, because you need this fresh randomness, right? R0, R1, and so on, and none of them is allowed to be 0. If I consider a straightforward implementation, then you need to take the output of the PRNG and check whether it's 0 or not. If it's 0, then you need to regenerate; you cannot map 0 to another random value, otherwise you get a bias. And then, if you don't have such a large buffer, you have to stall the system to generate another random mask which is not 0, and then continue. Yes. So we believe that you can do this with very little overhead.
So if you have your randomness generated offline, there's basically no problem, because you can check it in advance. If you have an online PRNG generating your fresh randomness and you want a non-zero mask: if you look at our round function, not all of the clock cycles in one iteration need randomness, because some of them are spent waiting for the result in the pipeline. So we use those extra cycles; during those cycles, we're getting extra randomness and storing it in a buffer, like you said. In the paper, you will see we computed how large this buffer has to be to get a negligible probability that you will have to stall the pipeline, because you can compute how many cycles one encryption takes and how much randomness you're getting in total. You really don't need that big a buffer. OK, but then the problem for me would be the comparison between your scheme and the state of the art, which doesn't need that buffer. Yes, that's true. But again, we think the overhead is negligible. OK, thank you. Then slide number 27, can you show it? Or even that one is OK. It doesn't matter, here or in the first-order secure design. The inversion that you have, I can imagine that you use a tower-field approach, which makes it very small. No, we use the Boyar-Peralta one, because it's smaller. That's also small. And how about the multiplications? I mean, the inversion you only need once, independent of the order of the masking, but you need many multiplications, and these multiplications are not cheap, right? Did you just implement the straightforward schoolbook multiplication, or did you do something else? I have to remember now what we used, but I don't remember spending a lot of time on this. I think we used a pretty straightforward GF(2^8) multiplication. I'm just thinking out loud, yeah?
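The non-zero mask generation discussed in this exchange can be sketched as simple rejection sampling (my illustration, not the paper's exact mechanism); a uniform byte is zero with probability 1/256, so rejections are rare and a small buffer filled during spare pipeline cycles absorbs them:

```python
import secrets

def nonzero_byte():
    """Draw a uniform non-zero byte by rejection; no bias toward any value."""
    while True:
        r = secrets.randbits(8)
        if r != 0:
            return r       # uniform over 1..255

# On average fewer than 4 in 1000 PRNG outputs are discarded (1/256 ~ 0.4%),
# so spare cycles refill a small buffer faster than rejections drain it.
masks = [nonzero_byte() for _ in range(1000)]
assert all(1 <= m <= 255 for m in masks)
```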
I mean, does it make sense to, sorry, I ask a lot of questions, sorry. Does it make sense to change the field here at the start for all of the variables, and do the multiplications also in the tower field, and also do the inversion in the tower field? That's a very interesting question, and we should definitely try that out. Okay, we can have one more short question... or not. Okay, let's thank the speaker and all the speakers in this session.