In theory there is no difference between theory and practice, but in practice there is. That's a well-known saying among scientists and engineers alike. So one might say there is a gap between theory and practice, right?

Let's look at the theoretical side. We know since CRYPTO last year that masking with d+1 shares, which is the minimum number of shares required for a d-th-order secure implementation, is possible, also for non-linear operations. But looking at the practical side of things, no implementation, and specifically for our case no masked AES implementation, has yet been published with d+1 shares. So in this work we bridge this gap and we provide the smallest masked AES implementations. We also verify them for both the first and the second order, and both our implementations use the minimum number of shares, which is d+1.

So how did we go about this? We started by delving a bit into the theory behind it, threshold implementations, as we saw in the earlier talk too. Afterwards, when the implementation is done, we can load it on a target platform, which is an FPGA in our case, and we can proceed with the side-channel analysis evaluation. Only when our security claims are satisfied does it make sense to quantify the implementation costs. So that is the outline of this talk.

We'll first scratch the surface of threshold implementations a bit. The previous talk was about TI too, so you can already guess it's a rather popular masking scheme. It's popular because it offers provable security with minimal assumptions on the hardware: even if your hardware glitches, security is still provided. It's a Boolean masking scheme based on secret sharing and multi-party computation. What that means is: if we want to mask a function f with inputs a, b and output c, the first thing we do is split these into d+1 shares. I exemplify this on a first-order implementation.
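As a minimal sketch of the split just described, assuming byte-sized values and Python's `secrets` module as a stand-in for the hardware randomness source: the helper names `share` and `unshare` are hypothetical and do not come from the talk.

```python
import secrets
from functools import reduce
from operator import xor


def share(value, d, bits=8):
    """Split `value` into d+1 Boolean shares whose XOR equals `value`."""
    # d shares are drawn uniformly at random (assumed randomness source).
    randoms = [secrets.randbits(bits) for _ in range(d)]
    # The last share is chosen so that all d+1 shares XOR back to `value`.
    last = reduce(xor, randoms, value)
    return randoms + [last]


def unshare(shares):
    """Recombine Boolean shares by XORing them together."""
    return reduce(xor, shares, 0)


# First-order example (d = 1): two shares per input.
a, b = 0x53, 0xCA
a1, a2 = share(a, d=1)
b1, b2 = share(b, d=1)
assert a1 ^ a2 == a and b1 ^ b2 == b
```

Any d of the d+1 shares are uniformly random on their own; only the XOR of all of them reveals the value.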
We get a1, a2, b1, b2. Of course the basic properties of masking need to be satisfied, which are uniform inputs and correctness: whatever you do in the masked implementation should reflect what is done in the unmasked implementation. Specific to threshold implementations is d-th-order non-completeness. That means that if you observe up to d wires or sub-circuits, you should not be able to observe all d+1 shares of a value. This is where the security in the glitchy setting really comes from.

Then finally: we focus here on AES, which is a block cipher; it has rounds. If we send those masked values through a non-linear operation, the uniformity of the inputs to the next round will not be satisfied. That's why we add fresh random values to our outputs and then clock these outputs into registers; otherwise early propagation might deteriorate our security. So the number of output shares is directly related to the number of output registers we have and to the fresh random values we need. That's important for later.

One extra condition we need to inherit here, from the consolidated masking schemes paper, is that the input shares need to be independent. The best way to explain this is by looking at what would go wrong if they are dependent. So let's take a1 equal to b1 and a2 equal to b2. What happens then is that, observing f1, so observing one sub-function, we have both a1 coming in and b2 coming in, with b2 equal to a2. We have enough information to unmask our secret value a, and so our security would be broken. So these input shares really need to be independent.

If we want to apply this to the Advanced Encryption Standard now, the first thing we do is mask the linear functions, or rather the affine functions. Since we use Boolean masking this is rather easy to do: we just have to assign one of those operations to one
input share each. There's a function I'm sure you're all well accustomed to: it's called copy-paste. That's the really easy part.

Things get a bit messy when we want to mask the non-linear operation, though: the S-box, or SubBytes, operation. And this is what we get: this is the diagram of our second-order masked implementation. I will not detail this; instead I will detail the first-order implementation, and I'm sure you will enjoy reading about this one in our paper.

In masked implementations, the Canright S-box has been shown to be a good starting point, and previous work has shown that it matters how we implement the sub-components. We could take them together in a GF(2^4) inverter, or we could mask the subfield functions here in GF(2^2). Based on the choice we make there, we get more efficient or more compact implementations. That leaves us with a choice, and we need to justify that choice.

As I said earlier, the number of output shares we have directly influences the amount of randomness we require and the number of output registers, so it's also related to the area. If we want to share the GF(2^4) inverter in one go, which has been shown to be more efficient in previous work, we would require (d+1)^3 output shares. If we mask the multiplier instead, which has algebraic degree two and requires (d+1)^2 output shares, we have less cost in randomness and area. That's why we choose to only mask the multipliers here, and this is how we partition our design.

Next, since we require uniformity of the inputs in every stage here, as I said earlier, we need to clock every one of these values into registers, and we also remask after every one of those stages. To satisfy the independent-inputs condition, we need to make sure we put a register immediately after the linear map.

The mask refreshing is generically done using a ring refreshing scheme, which means
we use as many units of randomness as we have output shares. For first order we only need univariate security, meaning we do not allow the attacker to combine points in time across clock cycles. That is why we can get away with using one unit of randomness less here.

There is another trick we use to further reduce the area. So we not only go from a three-share implementation to a two-share implementation for first-order security, we also add the contribution of this square-scaler function here before the register, allowing us to save some space there.

Now our implementation is done. Everything is coded, we load it on our FPGA, and we proceed with the side-channel evaluation. For that we use an evaluation board with very low noise. We like low noise here because the lower the noise, the faster you would see leakage in your masking scheme. So if we see no leakage with a high number of traces, we can be really confident that we have achieved the security we want.

We need a lot of fresh random masks to achieve our security, and we need to generate them somewhere. We chose to generate them together with the cryptographic algorithm, with AES, but in order not to increase the noise we alternate every clock cycle of AES with a clock cycle of the random number generator. Just as a side note: for the random number generator we use the three innermost functions of PRINCE, and based on how many bits we need we instantiate some of these in parallel. From these figures you can see that when the PRNG is off there is minimal activity in between the AES clock cycles, and when we turn it on those peaks rise. The separation there is clearly visible. We also checked that in the critical path there is no overlap.

Now leakage detection itself. As is the state of the art, we use TVLA. Very roughly, it goes as follows.
We take a lot of traces and group them into two sets: one set corresponding to a fixed but masked input, and we choose the zero plaintext for that, the other set corresponding to a random input value. Once we have acquired all these traces, we take the mean of each set and the difference of those means. In the first order, if there is no difference of means, so if this statistic here falls within our confidence interval of +4.5 to -4.5, we can be confident that security is achieved: there is no difference in the first statistical order between our two groups.

When we apply this to our implementation, we first apply it with our masks off. This is just a sanity check. What we see here is clear leakage in both the first order and the second order, but that's to be expected since we do not refresh and our masking scheme is not working properly. If we now turn on the random number generator, and that's the only thing we change in our next experiments, all increase in side-channel resistance comes directly from our masking scheme working properly. And that's what we get here for our first-order implementation: with 100 million traces we see no leaks; everything falls nicely within our confidence interval. Since we only use d+1 shares, which is the theoretical minimum number of shares, we observe high leakage in the second order.

Now we can do the same for the second-order implementation. Again, when our PRNG is turned off we have first-order, second-order and third-order leakage, as expected. Turning on the PRNG, everything falls nicely within the confidence threshold for both the first-order and the second-order t-test. Clear leakage in the third order can again be observed.

For the second order, though, we also need to combine points in time. The bivariate attack we perform here is on one execution of the S-box itself, so we combine points in time and we process them using
centered products, and we calculate the t-test on that. With the PRNG off, what we see is clear leakage corresponding to our implementation. Turning the PRNG on now, we see that all leakage disappears, and for the ones in need of coffee: the scale here changed. So everything falls nicely below 4.5 (this is the absolute value), and our implementation is deemed secure.

We can now proceed with the assessment of our implementation costs. Going from an unmasked implementation, the previous work, the first-order masked AES, would scale the area up roughly three times. We have now reduced that by 10 percent for the first order and by around 30 percent for the second order. So especially in the second order we see a good decrease in area. Most of this area saving comes from the S-box itself, since in the linear parts d+1 shares were already used in the previous work. So we have a decrease of 20 percent for our first-order S-box and a decrease of 50 percent for our second-order S-box, which is a significant improvement.

When we look at the number of clock cycles, there is only a really small increase to achieve our security. For the first order we have an increase of 10 percent; for the second order it stays constant. A small disclaimer there: the previous second-order implementation could have been made smaller and could have been made with the same number of clock cycles as the first-order implementation here.

The drawback of our work is that more randomness is consumed: we have an increase of 70 percent in randomness for the first order, and this becomes only 30 percent for the second order. So this is something that will be interesting to investigate.

To conclude: we went from theory to practice and we realized the smallest masked AES to date, and we verified it for first-order and second-order security, both with d+1 shares. As an engineer it's always fun to know that there's another gap ahead. For that gap, for that future work, we could look at higher
orders. But more interestingly, we could look at how to reduce the randomness consumed in our smallest masked implementation. Thank you for your attention.
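The fixed-versus-random leakage test described in the talk can be illustrated with a small simulation. This is only a sketch under stated assumptions: the traces are synthetic, with two sample points each, a Hamming-weight leakage model and Gaussian noise, none of which come from the talk's FPGA measurements. The function names are hypothetical. The statistic is Welch's t-test with the ±4.5 threshold, and the bivariate second-order test uses centered products as mentioned in the talk.

```python
import random
from math import sqrt
from statistics import fmean, variance

THRESHOLD = 4.5  # TVLA pass/fail bound on |t|


def welch_t(xs, ys):
    """Welch's t statistic between two groups of samples."""
    return (fmean(xs) - fmean(ys)) / sqrt(
        variance(xs) / len(xs) + variance(ys) / len(ys))


def hw(x):
    """Hamming weight of an integer."""
    return bin(x).count("1")


def simulate(value, rng, masked, sigma=1.0):
    """Toy 2-point trace: one point per share (assumed leakage model)."""
    r = rng.randrange(256) if masked else 0  # mask is 0 when the PRNG is off
    return (hw(r) + rng.gauss(0, sigma), hw(r ^ value) + rng.gauss(0, sigma))


def collect(n, rng, masked):
    """n traces for the fixed (zero) input and n for random inputs."""
    fixed = [simulate(0, rng, masked) for _ in range(n)]
    rand = [simulate(rng.randrange(256), rng, masked) for _ in range(n)]
    return fixed, rand


def first_order(fixed, rand):
    """Univariate t-test at each of the two sample points."""
    return [welch_t([t[i] for t in fixed], [t[i] for t in rand])
            for i in range(2)]


def centered_product(traces):
    """Combine the two points of every trace after removing their means."""
    m0 = fmean(t[0] for t in traces)
    m1 = fmean(t[1] for t in traces)
    return [(t[0] - m0) * (t[1] - m1) for t in traces]


def second_order_bivariate(fixed, rand):
    """Bivariate second-order t-test on centered products."""
    return welch_t(centered_product(fixed), centered_product(rand))


rng = random.Random(2016)
fixed_u, rand_u = collect(5000, rng, masked=False)
t1_unmasked = first_order(fixed_u, rand_u)
fixed_m, rand_m = collect(5000, rng, masked=True)
t1_masked = first_order(fixed_m, rand_m)
t2_masked = second_order_bivariate(fixed_m, rand_m)
```

In this toy model the unmasked traces fail the first-order test at the second sample point. The two-share masked traces stay within the threshold at first order but leak in the bivariate second-order test, which mirrors the behaviour reported for the first-order implementation.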