Okay, welcome back for this session on hardware masking. We have three talks in this session, and all three have "hardware masking" in their title, so I guess it was very easy to put them in the right session. The first one is Hannes Gross from Graz, who will speak about generic low-latency masking in hardware. Sounds right. And it's a collaboration with his colleagues from Graz, Rinat Iusupov and Roderick Bloem.

Thanks for the introduction. So my talk is on generic low-latency masking in hardware, and with this work we tried to give answers to two very intriguing masking questions. The first one is: is it possible to securely evaluate a really complex masked function in a single clock cycle? And yes, we achieved this. The second question is: does higher-order masking require any online randomness? And quite surprisingly, the answer to this is no. But as always, when things sound just too good to be true, there is a huge "but" trailing our answers, and we will come to that afterwards. So let's start from the beginning. Since I'm the first speaker in this session, I thought I'd give you a brief introduction to masking. What we are doing is splitting up secret information into a couple of fresh random shares. We denote the shares by capital letters starting from A, then B, C, and so on, until we reach our security parameter d. At all times it is ensured that only by bringing all the shares together can we reveal the secret information, and this is what we compute on. The goal of the masking scheme is then to give simple rules that keep the shares separated throughout the entire circuit. Since we build upon the findings of DOM, I'll give you a short introduction on how we transform an unprotected circuit into a DOM-protected circuit. This is quite easy. We have here an example circuit consisting of some inputs and outputs, some linear gates, and a nonlinear gate.
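The share-splitting idea just described can be sketched in software. This is a minimal illustrative Python sketch, not the paper's hardware design; the function names here are made up for illustration:

```python
import secrets

def share(secret: int, d: int, bits: int = 8):
    """Split `secret` into d+1 Boolean shares: d fresh random
    shares plus one final share chosen so the XOR of all shares
    equals the secret."""
    shares = [secrets.randbits(bits) for _ in range(d)]
    last = secret
    for s in shares:
        last ^= s
    return shares + [last]

def unshare(shares):
    """Recombine: only the XOR of *all* shares reveals the secret."""
    out = 0
    for s in shares:
        out ^= s
    return out

x = 0x5A
sx = share(x, d=2)          # three shares for second-order masking
assert unshare(sx) == x     # bringing all shares together reveals x
```

Any proper subset of the shares is uniformly random, which is the separation property the masking rules must preserve through the circuit.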
And the first thing we do is simply copy the original circuit two times, since we are targeting first-order protection. These copies now become our domains: the first domain we will from now on denote A, the second B. We also identify the wires in domain A by prefixing them with the domain letter, and likewise for domain B. Up to this point everything is really simple, and you can see that this already achieves the goal we set: a separation of all the A shares from all the B shares. The only problem is that this circuit is not correct anymore. It is secure, but not correct, so we need to change it. For linear operations this is actually quite simple: the only thing we have to do is throw out the inverters from our second domain, and then we are fine; all the other gates stay completely unchanged. Things get a bit more complicated once we come to nonlinear operations like the AND gate here, because these mean that our domains need to communicate with each other over a protected channel, and this usually involves some fresh randomness. But we will go into the details on this later on. Just a brief summary of domain-oriented masking: it is a circuit-centered scheme with circuit-centered rules, which makes it really convenient for hardware designers. It uses d+1 shares, which is the minimum number of shares to achieve dth-order security, and this leads to quite efficient masked designs. It is also completely generic, which means we can easily synthesize our circuits for any protection order that we target. And compared to other generic masking schemes, it is low in randomness. So what is it that we actually try to achieve here? What is the issue with latency and where does it come from? To see where we spend our cycles, let's have a closer look at an unprotected two-input AND gate.
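The rule for linear gates just described (apply the gate share-wise, and keep the inverter in only one domain) can be illustrated with a small first-order Python sketch. The tuple-of-shares representation is an assumption made here for illustration, not the paper's notation:

```python
# First-order shares of a value v are a pair (Av, Bv) with v = Av ^ Bv.

def xor_shared(x, y):
    # linear gate: applied independently in each domain
    return (x[0] ^ y[0], x[1] ^ y[1])

def not_shared(x, bits=1):
    # inverter: kept in domain A only and thrown out of domain B,
    # so the recombined value is still the inversion of x
    mask = (1 << bits) - 1
    return (x[0] ^ mask, x[1])

sx = (0b1, 0b0)   # shares of x = 1
sy = (0b1, 0b1)   # shares of y = 0
rx = xor_shared(sx, sy)
assert rx[0] ^ rx[1] == 1          # recombines to x ^ y = 1
nx = not_shared(sx)
assert nx[0] ^ nx[1] == 0          # recombines to NOT x = 0
```

No communication between the domains is needed for linear gates, which is why they keep the share separation for free.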
So this is like a multiplication of two elements over GF(2). Once we put it in shared form and try to arrange the multiplication terms such that everything goes into the according domain, we immediately see that we always end up with two terms per domain that do not belong to that domain. For this reason we constructed the DOM AND gate: for these critical cross-domain terms we add some fresh randomness, then put everything into a register to make sure that glitches don't propagate, and finally we compress everything together. From this you immediately see where the latency comes from: we have this register stage, and in order to evaluate this AND gate we need to spend two cycles on it. However, there is another issue, which comes from the inputs. For this multiplier to be secure, we need to ensure that the inputs are independently shared. If this is not the case, usually it is enough to place a register before the multiplier, but this would cost us an additional register stage. Getting rid of the first issue is actually kind of easy: we can just completely skip the compression, and then we don't need to throw in randomness and registers. This is a bit surprising at first, but when we look at the multiplication terms themselves, they are all secure, so as long as we don't add them together this is totally fine. So the first new thing we introduce in this paper is that we extend DOM by allowing multi-dimensional domains. With this we already save this register stage. The only problem is that the more multiplications we do in sequence, the more shares we get and the more domains we create, and at some point this becomes unbearable. So we can't do this forever, but at least to bring the latency down to some extent, we can.
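The first-order DOM AND gate just described can be sketched in Python. This model only shows the algebra; the register stage (which in hardware stops glitches before compression, and costs the extra cycle) has no software equivalent here:

```python
import secrets

def dom_and(x, y):
    """First-order DOM AND of shared bits x = (Ax, Bx), y = (Ay, By).
    The cross-domain terms Ax&By and Bx&Ay are blinded with fresh
    randomness z; in hardware they pass through a register before
    the final compression."""
    ax, bx = x
    ay, by = y
    z = secrets.randbits(1)                # fresh online randomness
    qa = (ax & ay) ^ ((ax & by) ^ z)       # domain-A output share
    qb = (bx & by) ^ ((bx & ay) ^ z)       # domain-B output share
    return qa, qb

# correctness: for any sharing, the output shares recombine to x AND y
for x in (0, 1):
    for y in (0, 1):
        ax, ay = secrets.randbits(1), secrets.randbits(1)
        qa, qb = dom_and((ax, ax ^ x), (ay, ay ^ y))
        assert qa ^ qb == x & y
```

Skipping the compression means returning the four product terms as separate shares instead of adding them pairwise, which is exactly what creates the additional domains mentioned above.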
The other issue is related inputs: because of glitches, it could happen that for a short moment in time what we are actually calculating is, for example, the multiplication of an element with itself, thereby bringing together the shares of that variable in a straightforward way. To circumvent this in a circuit where we don't prevent glitches with registers, we simply add an additional sharing of the same variable, and then the equation becomes fine. So we use fresh randomness for the sharing of this X' here, which is independent of the sharing of X. As a first simple example, let's have a look at the Ascon S-Box. Already in the unprotected circuit we see that there will be some colliding inputs once we reach this multiplier stage. What we would do in DOM is place some registers after the linear layer to get rid of the glitches and make the shares independent; here we try to avoid this by copying some of the inputs of the Ascon S-Box and then using the low-latency DOM gadgets from before for the multipliers. Some intermediate results: as you can see, for the full implementation of Ascon we successfully brought the cycle count down from 3 to 7 cycles, depending on the protection order, to only one cycle, and in terms of actual latency in nanoseconds we also achieved a reduction. However, when we look at the first-order numbers, for example, we see that we needed to invest more than 10 kilogates to achieve this, and we also use more than six times the randomness. So it costs something, but it works. However, the Ascon S-Box was actually designed to be very easy to protect against side-channel analysis, so we also picked a much more difficult example, which is the AES S-Box, and the first thing we needed to decide on was which actual S-Box design we are targeting.
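The resharing trick for colliding inputs can be sketched as follows. This is a simplified first-order illustration under my own assumptions, not the paper's circuit:

```python
import secrets

def reshare(x, bits=1):
    """Derive an independent sharing X' of the same value as X by
    folding fresh randomness r into both shares: the XOR of the
    new shares is unchanged, but each individual share is fresh."""
    r = secrets.randbits(bits)
    return (x[0] ^ r, x[1] ^ r)

sx = (0b1, 0b0)      # shares of x = 1
sxp = reshare(sx)    # X': an independently masked copy of the same x
assert sxp[0] ^ sxp[1] == sx[0] ^ sx[1]   # same underlying value
```

Multiplying the sharing of X with the sharing of X' then never combines two shares of the same masked representation, even when glitches momentarily mix the fan-in signals.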
So there are a couple of designs in the literature, and the first presumably suitable choice would be the Boyar-Peralta S-Box, because it was specifically designed to have low circuit depth and low complexity. However, we built a tool that traces all the signals through the entire circuit and tells us where we would get a collision, and as it turned out this design isn't so optimal for our scenario, because we get a lot of gate collisions and input collisions. Another S-Box construction that is quite frequently used is the Canright S-Box; however, this also turned out to be unsuitable. So we finally decided on the most simple S-Box design we found, which is the Mui S-Box. When we trace the input signals through this circuit, we still detect collisions of the inputs at all of these points here, so in order to avoid these collisions we again copied the inputs, and for some we also needed to copy the fan-in circuit of the inputs. For example, for this GF(2^4) multiplier we needed to copy not only the input X' but also the input transformation that comes with it, and for the inverter we had to copy the whole area that feeds into it, and so on. Since each multiplication stage increases the number of shares, we also tried to find suitable spots for an intermediate compression to bring the number of shares back down, so we have two additional variants of this S-Box: the first uses a compression at the output of the AES S-Box, and the second additionally uses a compression in between. From the results on the AES S-Box we again see that we succeeded in bringing the cycles we spend on the S-Box down from between three and eight cycles to fewer than three. However, we also see that depending on which actual design we use, we have quite some overhead in terms of chip area and randomness.
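The collision-tracing tool mentioned above can be approximated with a simple dependency propagation over the netlist. This is my own simplified model of the idea, not the authors' tool; it flags nonlinear gates whose fan-ins transitively depend on a common input variable:

```python
def trace_collisions(gates, inputs):
    """gates: topologically ordered list of (out, kind, in0, in1),
    with kind in {'xor', 'and'}. Flags AND gates whose two fan-ins
    transitively depend on the same input variable -- the points
    where a glitch could combine shares of one variable."""
    deps = {w: {w} for w in inputs}
    flagged = []
    for out, kind, a, b in gates:
        if kind == 'and' and deps[a] & deps[b]:
            flagged.append(out)        # colliding inputs detected
        deps[out] = deps[a] | deps[b]
    return flagged

# toy example: t = x & y, then u = t & x -- both fan-ins of u depend on x
gates = [('t', 'and', 'x', 'y'), ('u', 'and', 't', 'x')]
assert trace_collisions(gates, ['x', 'y']) == ['u']
```

Each flagged gate is a candidate for the input-copying fix described above: give one of the colliding fan-ins a fresh, independent sharing.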
So this zero-latency variant here was only there to show that, in general, it is possible to create generically masked higher-order hardware designs that don't require any online randomness, because we could continue with this approach beyond the S-Box for the rest of the circuit. But as you see, we would end up with quite some chip area, so it is more a theoretical result. Finally, we also did a formal verification of our designs. For this we used the tool that we presented at Eurocrypt this year, which works by computing an approximation of the Fourier spectrum of the circuit for all possible signal timings. For larger circuits this can take quite some time, so we also gave some thought to how we can increase the verification speed, especially for these low-latency designs, and we came up with the following idea. Since what we are actually trying to do is avoid collisions between all of the shares, we can split up the whole circuit into the domains we use, and simply by ensuring that there is no connection between any of these domains we can give the guarantee that the design is actually dth-order secure. This turned out to be quite a bit faster: we verified the zero-latency AES S-Box design in only 11 minutes, and we also verified the Ascon S-Box up to order 3 with both approaches. So this brings me to the conclusions. What we demonstrated is that masking doesn't necessarily require any registers or online randomness. However, we see our contribution really as giving the designers of protected circuits a new design choice for trading area and randomness against latency. Also, this is only our first approach to bringing low latency to generic designs, so there are quite some open questions, and if you are interested in low-latency masking I suggest you have a look at the paper, because there we give a list of open research questions that we think will be interesting in the future.
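The faster verification idea (check that the share domains stay structurally disconnected) can be illustrated with a trivial netlist check. This is a simplified sketch under the assumption that every wire carries a domain label; the real tool of course handles gadget boundaries and signal timings:

```python
def domains_separated(gates, domain_of):
    """gates: (out, in0, in1) triples; domain_of maps wire -> domain.
    Returns True iff no gate mixes wires from different domains --
    the structural condition used above to argue security without
    analyzing the full Fourier spectrum."""
    return all(
        domain_of[a] == domain_of[b] == domain_of[out]
        for out, a, b in gates
    )

dom = {'Ax': 'A', 'Ay': 'A', 'Az': 'A', 'Bx': 'B'}
ok  = [('Az', 'Ax', 'Ay')]    # gate stays inside domain A
bad = [('Az', 'Ax', 'Bx')]    # gate connects domains A and B
assert domains_separated(ok, dom) is True
assert domains_separated(bad, dom) is False
```

Because this is a purely structural graph check, it scales far better than spectral analysis, which matches the reported speedup (11 minutes for the zero-latency AES S-Box).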
Thank you. Questions? Comments? No? Throwing cubes... okay, then I have to ask one. Have you considered randomness recycling to cut the randomness down?

No, not really. This would be one of the open research questions, I'm sure. Also, we tried a quite straightforward approach for the compression, where we didn't use domain-oriented masking because that would again cost us two cycles; we used CMS for that instead. But I'm pretty sure this is not the optimal choice, so I think we can bring down the randomness as well. Also, for the AES S-Box design, for example, the numbers seem really impressive in terms of area and randomness, but it's important to see that we really tried to go to an extreme, just to demonstrate that the scheme itself works and that the theory is sound. I'm pretty sure that with much more effort you can create more suitable low-latency designs based on this. For example, you could cut off the linear layer before the AES S-Box: if you avoid glitches there, a lot of the variable collisions will immediately vanish. We didn't consider this in the first place, but this would be just one of the things you could try.

Okay, let's thank the speaker.