We now have the session on threshold implementations, consisting of two talks. Please note that this session is followed directly by the first invited talk, so please don't leave the hall. The first talk of the session is titled "Rhythmic Keccak: SCA Security and Low Latency in Hardware", authored by Victor Arribas, Begül Bilgin, George Petrides, Svetla Nikova, and Vincent Rijmen from KU Leuven. And now Victor is giving the talk.
Thank you, Amir, for the introduction. Good morning, everybody. Thank you for being here. Hackers can guess PINs using your smartphone's sensor data, which is also known as side-channel information. Today I bring you two topics. First, I'll introduce key concepts for successfully securing hardware. Second, I'll present a methodology that we use to reduce the latency of these secure implementations. So my talk is divided into two parts: first these concepts, and then the methodology we used for reducing the latency of these secure implementations.
So let me start by introducing the two masking schemes we used in our work. The first is threshold implementations, or TI, which is already very well known. This is an example of a first-order AND gate. The number of input shares in TI depends on the algebraic degree of the function and the degree of security that we want to achieve. A key property of TI is d-th order non-completeness, which is defined as follows: every combination of d output shares must be independent of at least one input share. So here in this example we see that every output depends on only two input shares.
The second sharing scheme we use in our work is domain-oriented masking, also known as DOM. In this case, the number of input shares is given only by the desired degree of security. It's important to note that we need a layer of registers to secure this multiplier, in order to synchronize the randomization. And we also need an extra property.
We need the inputs to be independent of one another to ensure security. So I'll be focusing on these input dependencies: why they are dangerous and how they are caused. To illustrate that, I bring you a very small example. This is a DOM multiplier, x times y. Here we see the four products. Two of them are dangerous cross-products, which mix the two shares that we use. Now, assuming our inputs are independent, we are safe, and non-completeness is fulfilled.
Then I bring you another small example, calculating this operation: x times x shifted once. So before the nonlinear operation, we execute a linear operation. It could be anything, even the identity function would work as well. So we see the result of the linear function and then the result of the nonlinear function. Again, look at the cross-products that are dangerous. Now, due to these dependencies, non-completeness is broken. This is an important security flaw.
This previous example was pretty simple, maybe very obvious, but what happens when we have more complex operations, like the theta step in Keccak, which introduces intricate dependencies in the state? Round-based implementations are very risky, and we found a flaw in previous round-based implementations in the literature that use a DOM multiplier. This could lead to potentially exploitable leakage. We contacted the authors, they acknowledged our findings, and they updated their work on ePrint.
So I'll show a few implementations we did. First, we implemented a first-order secure Keccak with this DOM sharing scheme, and we saw that there was a leakage, a non-completeness flaw in chi. Now I'll show you how this leakage appears. So let me go over one round of Keccak. First, we apply theta, with these intricate dependencies. We focus on these two bits, and let's see where they go. Then we apply the rho step.
Rho is just a shift along the lane. Then comes pi. Pi is just a shift in the XY plane. These last two are only wiring, so they don't cause any problem. But then we have the nonlinear step chi, which combines the two bits we were looking at into just one bit, failing non-completeness. With the tool we implemented ourselves, we found a total of 112 bits failing non-completeness in this way, out of the 200 bits of the state.
Later, we evaluated this implementation with a test vector leakage assessment. We used 55 million traces for the evaluation. We test first with the countermeasure switched off. We see huge leakage: the t-value, the blue graph, is way outside the red threshold. Then we switch on the countermeasure. We see a small leakage in the first order, where there should be no leakage at all. It's a small leakage, and we don't know whether it could lead to an attack.
We propose a fix. In our fix, we introduce a layer of registers between the linear steps and the nonlinear step. This way, we break the dependencies created by this complex theta operation, and we ensure non-completeness is fulfilled. We run our tool, and we see that non-completeness indeed holds. Since we added a new layer of registers, we tried to optimize that. We merged the state register and this new layer of registers, so that the latency is not affected that much. The resulting implementation is this one. First, we have the linear operations, then the state register breaking the dependencies, and then directly the nonlinear operation. We evaluate this implementation as well with the same setup. We don't change anything in the setup. And we indeed get a first-order secure implementation with no first-order leakage, and huge second-order leakage, as we expected.
Now, I'll continue with the second part of my talk.
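
The dependency problem from the first part of the talk can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the tool mentioned in the talk: the helper names (`share`, `xor_shares`, `dom_products`, `violations`) and the two-share toy inputs are hypothetical. Each share is modeled as the set of (secret, share index) pairs it depends on, and a product is flagged when it touches every share of one secret.

```python
from itertools import product


def share(name, n=2):
    """Fresh n-share encoding: share i depends only on (name, i)."""
    return [frozenset({(name, i)}) for i in range(n)]


def xor_shares(x, y):
    """Share-wise XOR (a linear step): dependencies are merged per share."""
    return [xi | yi for xi, yi in zip(x, y)]


def dom_products(x, y):
    """Dependency sets of all products x_i * y_j, before any register stage."""
    return {(i, j): x[i] | y[j]
            for i, j in product(range(len(x)), range(len(y)))}


def violations(terms, n=2):
    """Return the products that depend on all n shares of some secret."""
    bad = {}
    for key, deps in terms.items():
        names = {name for name, _ in deps}
        full = sorted(
            name for name in names
            if {idx for nm, idx in deps if nm == name} == set(range(n))
        )
        if full:
            bad[key] = full
    return bad


if __name__ == "__main__":
    a, b = share("a"), share("b")
    # Independent inputs: no product sees both shares of a or of b.
    print(violations(dom_products(a, b)))
    # Dependent inputs: u = a XOR b is multiplied with a itself.
    u = xor_shares(a, b)
    print(violations(dom_products(u, a)))
```

With independent inputs the check returns an empty dictionary. With `u = a XOR b` multiplied by `a`, the two cross-domain products `(0, 1)` and `(1, 0)` are flagged, because each of them now depends on both shares of `a`.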
I'll tell you about the methodology I used for securing unrolled implementations. What's our aim now? We want to halve the number of cycles in our implementation. For that, we need to avoid registers, and you've seen how the DOM sharing scheme needs registers in order to break these dependencies. So this is where we use TI in order to achieve this.
To illustrate my methodology, I bring you a small example of a quartic operation, a four-input AND gate. To secure it, we split the operation into two layers of quadratic operations, which we know better how to secure. In a first attempt, we implement first-order sharing schemes in both layers. If we look at the input dependencies at the output of the first layer, and then we compute the second layer, every output breaks non-completeness as TI requires it. So what do we do? How do we fix this problem? We implement a second-order sharing scheme in the first layer and a first-order sharing scheme in the second layer. This way, no output at the end fails non-completeness. Why does this work? Due to the two-input nature of an AND gate, we know that the outputs of the first layer need twice as much security as the outputs of the second layer.
Then we extend this methodology and generalize it to higher orders and higher algebraic degrees. We start from the targeted degree of security of the last layer, and from there we work back through the previous layers. The degree of security of the current layer depends on the algebraic degree and the degree of security of the next layer. This methodology does not specify the sharing scheme to use, it just specifies the degree of security that every layer should have. We don't provide the randomness needed for multivariate security, and it might not be the most optimal design, but we intend this as a starting point towards reduced-latency, faster secure implementations.
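
The backward rule described above can be written out as a short sketch. It assumes the rule generalizes as "degree of security of the current layer = algebraic degree of the next layer × degree of security of the next layer", which matches the "twice as much security" AND-gate example but is not spelled out in the talk. The function names are hypothetical, and the share count uses the standard TI lower bound of t·d + 1 input shares for algebraic degree t and security order d.

```python
def layer_orders(degrees, target_order):
    """Required security order per layer, first layer to last.

    degrees[i] is the algebraic degree of layer i. The last layer gets
    target_order; each earlier layer gets (degree of the next layer) times
    (order of the next layer). This product rule is an assumption.
    """
    orders = [target_order]
    # Walk from the last layer back towards the first one.
    for next_degree in reversed(degrees[1:]):
        orders.append(next_degree * orders[-1])
    orders.reverse()
    return orders


def min_ti_input_shares(degree, order):
    """Standard TI lower bound on input shares: degree * order + 1."""
    return degree * order + 1


if __name__ == "__main__":
    degrees = [2, 2]  # two quadratic layers, as in the four-input AND example
    orders = layer_orders(degrees, target_order=1)
    for t, d in zip(degrees, orders):
        print(t, d, min_ti_input_shares(t, d))
```

For two quadratic layers and a first-order target this gives the orders `[2, 1]`, so under the t·d + 1 bound the first layer needs at least 5 input shares and the second at least 3.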
So we use this methodology to speed up our Keccak implementations. What do we want? We want to execute two rounds at once, without registers in the middle. Using the previous equations, we get that the first layer needs a second-order sharing and the second layer needs a first-order sharing, since the nonlinear operation used in Keccak has algebraic degree 2 and we are aiming for a first-order secure implementation.
First we use a sharing scheme with five input shares and ten output shares in the first layer. The second layer then takes these ten outputs and compresses them to five outputs. This was not very optimal, because there are a lot of shares involved, so we tried to optimize it, and we found a six-to-six sharing scheme that we can apply to both layers. So I'll talk in a bit more detail about the six-to-six sharing scheme. If you are interested in the five-to-ten one, you can check the paper.
The first layer has these dependencies. Every output uses three input shares, in the way we see on the slide. The second layer then uses three input shares in the same way, in this case the outputs of the first layer. With this, we get that at the very output of the second layer, every output share is missing at least one input share, in the way we see on the slides.
We evaluate this design as well, with the same test we used for the previous evaluations. With the masks off, that is, with the countermeasure switched off, the design indeed leaks, as expected. When we switch on the countermeasure, we see that the first order doesn't leak. This is expected. And the second and third orders don't leak either. This is due to the large number of shares we are using. Six shares generate a lot of noise, which covers the leakage in the second and third order.
To finish my talk, I'll wrap up with the contributions.
We discovered a flaw in previous round-based secure implementations of Keccak. We propose a fix, analyze the causes, and evaluate them practically. Second, we give a methodology to secure unrolled implementations, which is closely related to the previous one. And then we propose the fastest Keccak known to date in the literature, which takes about 70k gate equivalents and 20.61 nanoseconds to compute. That was all from my part. Thank you so much for listening, and I'll be happy to answer any questions.
Thank you, Victor. Any questions?
Thank you for the talk. On slide 21, I just wanted to be sure. When you look at the first graph, you say that there is no leakage, but for me, there is a peak reaching the threshold. So why do you say that there is no leakage?
I see that it reaches 4.5, but it's a spike that appeared at that point and then went down again. It went up and down during the test, so it's due to the statistical test, and we don't consider it a potential leakage.
Okay, so you tested this point, and you verified that it's not leakage?
Yeah.
And the second question I had is, you achieved some security with 6 shares. Do you have an idea of the optimal number of shares needed to achieve your goal? Is 6 a good...
For my implementation, 6 was the most optimal for the way we wanted to achieve it, in the sense that you cannot achieve the second-order non-completeness we look for if you go under 6 shares, because the combinations you have to make of input shares cannot satisfy this non-completeness if you go lower than 5 shares. If you go down to 5 input shares, as we aimed for at the beginning, you need to expand the number of output shares in order to satisfy all the combinations, so that no combination fails non-completeness.
Okay, this is true for your construction, or do you think it's true for any way to implement the...
I think it's true for any implementation that wants to achieve this objective using TI.
Okay, thank you.
Any other questions? We have time for one more question.
Can you go to slide number 19?
Yeah.
Or even the next one, it doesn't matter. Can you just say something about the amount of fresh randomness that you need to add here between these two stages? Do you need any, or not?
We don't add extra randomness. You saw the dependencies that theta creates. We rely on the dependency on bits of other S-boxes of the same state to provide that fresh randomness, so that you only need this layer of registers in order to synchronize that sum of bits.
Okay, thank you.
Okay, if there are no more questions, then let's thank Victor again. Thank you.