Hi, thank you for the introduction. Today I will be talking about the multiplicative complexity of Boolean functions and bitslice higher-order masking.
First of all, what is higher-order masking? It is a widely used countermeasure where we split every sensitive variable into d random shares such that the sum of all these shares is equal to the sensitive variable. This is a sound countermeasure, since it has been formally proved that the more the masking order d grows, the more difficult it is for an attacker to recover any information about the sensitive variable.
In the presence of masked variables, how do we perform operations? Linear operations are linear in the masking order d, so they are quite easy to perform, whereas non-linear operations are quadratic in the masking order. Therefore, when we want to protect a block cipher with higher-order masking, the main challenge is to efficiently evaluate the non-linear part, namely the S-boxes.
In 2003, at Crypto, Ishai, Sahai, and Wagner proposed a scheme to perform a secure multiplication between two masked variables. It simply consists in adding fresh randomness to the d^2 cross-products in such a way that when we recombine the d output shares, we get the correct output, namely a times b. A variant exists, called the CPRR evaluation, which allows us to securely evaluate any quadratic function. The ISW multiplication and its variant are the main building blocks that we use when we want to efficiently compute S-boxes.
In fact, one of the main approaches is to look at the S-box as a polynomial over GF(2^n) and to try to find a representation that needs a minimum number of calls to these building blocks. For specific S-boxes, we rely on their algebraic structure. For example, the AES S-box is just the exponentiation x^254 combined with an affine transformation.
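To make the share splitting and the ISW multiplication described above concrete, here is a minimal Python sketch. It works bitwise on machine words, so every bit position is an independent multiplication over GF(2). The function names and the 32-bit word size are assumptions chosen for illustration, and Python offers no side-channel guarantees, so this only shows the arithmetic, not a protected implementation.

```python
import secrets
from functools import reduce
from operator import xor

WORD_BITS = 32  # assumed register width, for illustration only


def share(x, d):
    """Split x into d shares whose XOR equals x."""
    shares = [secrets.randbits(WORD_BITS) for _ in range(d - 1)]
    shares.append(reduce(xor, shares, x))
    return shares


def unshare(shares):
    """Recombine shares by XORing them together."""
    return reduce(xor, shares, 0)


def masked_xor(a, b):
    """Linear operation: done share by share, so the cost is linear in d."""
    return [ai ^ bi for ai, bi in zip(a, b)]


def isw_and(a, b):
    """ISW multiplication (bitwise AND): d^2 cross-products plus fresh randomness."""
    d = len(a)
    c = [a[i] & b[i] for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(WORD_BITS)  # fresh random word per pair (i, j)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c
```

Each pair (i, j) adds the same random word to two different output shares, so it cancels at recombination and only the d^2 cross-products a_i AND b_j remain.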
There also exist generic methods that can evaluate any kind of S-box. The two best methods in terms of efficiency are the following. The first is the CRV decomposition, proposed at CHES 2014, where you look at the S-box as a sum of t products of polynomials, and each product is performed with an ISW multiplication. The second, called algebraic decomposition, was proposed at Crypto 2015, and this time you look at the S-box as a sum of t compositions of functions of algebraic degree 2, where each of these compositions can be performed with CPRR evaluations.
So that is the polynomial approach. A slightly less known approach, called the bitslice approach, can also be used to evaluate S-boxes. How do you do that? You look at the S-box as a Boolean circuit. Here is how bitslicing works: you start from a circuit with an n-bit input composed of logical gates, and you transform it into a computation on n registers, one per input bit, where each logical gate is replaced with a CPU instruction and each register packs the corresponding bit of several inputs.
Then, how do we apply this at the S-box level? You first need to find a compact Boolean circuit that represents the S-box. What does compact mean? You need a minimal number of gates, and more specifically a minimal number of AND gates, since they are the main source of complexity in our problem. For example, for the AES, at each round you need to perform 16 S-box computations, and this can be done with only one bitslice computation.
How do we apply higher-order masking to the bitslice approach? At the circuit level, you replace every XOR gate with the XOR CPU instruction and every AND gate with the ISW-AND scheme. What is ISW-AND? It is just the ISW multiplication over F_2. So the goal there is to minimize the quadratic part, namely the number of calls to ISW-AND.
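The bitslice transformation described above can be sketched in a few lines of Python. The 3-bit circuit below is a made-up toy with two AND gates, not a circuit from the talk, and the helper names are assumptions. The point is that the same sequence of XOR and AND instructions evaluates one input when given bits and 16 inputs at once when given packed registers.

```python
def bitslice(inputs, n):
    """Register i holds bit i of every input; input number k sits at bit position k."""
    regs = [0] * n
    for k, x in enumerate(inputs):
        for i in range(n):
            regs[i] |= ((x >> i) & 1) << k
    return regs


def unbitslice(regs, count):
    """Inverse of bitslice: rebuild `count` values from the registers."""
    return [sum(((r >> k) & 1) << i for i, r in enumerate(regs))
            for k in range(count)]


def toy_circuit(x):
    """Hypothetical 3-bit circuit. x is a list of 3 bits or 3 packed registers."""
    t = x[0] & x[1]           # AND gate 1 (would be an ISW-AND when masked)
    y0 = x[2] ^ t
    y1 = x[0] ^ (x[1] & y0)   # AND gate 2
    y2 = x[1] ^ t
    return [y0, y1, y2]


# 16 inputs evaluated with a single pass over the circuit
inputs = list(range(8)) * 2
outputs = unbitslice(toy_circuit(bitslice(inputs, 3)), len(inputs))
```

In a masked implementation, each `^` would be applied share by share and each `&` would become one call to ISW-AND on the packed registers, which is why the number of AND gates drives the cost.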
In a previous work, which you can find on ePrint, called "How Fast Can Higher-Order Masking Be in Software?", we compared the polynomial and bitslice approaches for two specific block ciphers, AES and PRESENT, and we found that the bitslice approach gives very good results for any order d. Therefore, the main motivation for this work was to generalize the bitslice approach to any kind of S-box, that is, to find decompositions that work for generic S-box evaluation.
With the bitslice approach, we look at the S-box in terms of Boolean functions, so let me now introduce some notation and definitions. First, we call the span of a set of Boolean functions the set of all possible linear combinations of elements of this set. We denote by M_n the set of all monomial functions, which is defined as follows. We can represent every Boolean function by its algebraic normal form, which simply consists in viewing the Boolean function as an element of the span of the monomial functions. Then an n-bit S-box can be seen as n coordinate functions, where each coordinate function is a Boolean function.
We can now define the multiplicative complexity of a Boolean function as the minimum number of multiplications needed to compute it. For this, we have some results in the state of the art. A simple upper bound, very trivial to prove, is that the multiplicative complexity of a set of Boolean functions is upper bounded by the multiplicative complexity of the set of monomials, which is equal to 2^n - (n + 1). There also exists a lower bound, which says that there exists a Boolean function whose multiplicative complexity is lower bounded by 2^(n/2) - n. This year at FSE, Stoffelen proposed a method to find optimal solutions for small S-boxes, namely S-boxes of size n ≤ 5, using SAT solvers, but this cannot be used for bigger S-boxes because of the difficulty of the problem.
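As an illustration of the algebraic normal form mentioned above, the following Python sketch computes ANF coefficients from a truth table with the binary Moebius transform. The indexing convention is an assumption made for this example: entry x of the truth table is the value at the point whose bit i is x_i, and coefficient u belongs to the monomial that multiplies the x_i for the bits i set in u. The last helper simply evaluates the 2^n - (n + 1) upper bound quoted in the talk.

```python
def anf(truth_table):
    """Binary Moebius transform: truth table of length 2**n -> ANF coefficients."""
    coeffs = list(truth_table)
    n = len(coeffs).bit_length() - 1
    for i in range(n):
        step = 1 << i
        for x in range(len(coeffs)):
            if x & step:
                coeffs[x] ^= coeffs[x ^ step]
    return coeffs


def monomials(coeffs):
    """Indices u of the monomials that appear in the ANF."""
    return [u for u, c in enumerate(coeffs) if c]


def monomial_upper_bound(n):
    """Number of monomials of degree at least 2, i.e. 2^n - (n + 1)."""
    return (1 << n) - (n + 1)


# Example: f(x) = x0*x1 + x2 on 3 variables
tt = [((x & 1) & ((x >> 1) & 1)) ^ ((x >> 2) & 1) for x in range(8)]
terms = monomials(anf(tt))  # u = 3 encodes x0*x1, u = 4 encodes x2
```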
In 2000, Boyar et al. proposed a constructive method to approximate the multiplicative complexity of a Boolean function, with the following result. In this work, we extended this constructive method to S-boxes, to approximate the multiplicative complexity of an S-box, with the following result. We also propose a new method, which is a generalization of the CRV decomposition, to get better results. In fact, when you look for example at 8-bit S-boxes, compared with the extended Boyar et al. result, our generic method reduces the multiplicative complexity by 10, and an improved method that is S-box dependent reduces it by 16.
I will now describe this new generic decomposition method. First, how do we decompose a single Boolean function? Let f be a Boolean function. We can represent f as a sum of t + 1 products of Boolean functions g_i and h_i. How do we generate such a decomposition? We first take the g_i as random linear combinations of elements of a basis B. Then the goal is to find the h_i, that is, the coefficients of each h_i in its representation over the basis B, by solving a linear system where we evaluate the Boolean function on every point. In this system, the green part is what we know and the red part is the coefficients that we need to find. When you look at this decomposition, you could wonder whether it works for any Boolean function. In fact, we need some special conditions on our basis B to be able to decompose any Boolean function. I will talk about these conditions later.
So how do we solve this linear system? With simple linear algebra. Let the e_i be all the vectors of F_2^n. Then we can rewrite this linear system as a sum of matrices A_i, in which all the elements are known, times the c_i, which are the coefficient vectors that we need to retrieve, equal to the vector of all the evaluations of the Boolean function.
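The linear system above can be sketched in Python for a small case. Several choices here are assumptions made for illustration and are not taken from the talk: n = 4, the basis is the set of monomials of degree at most 2 (a stand-in whose pairwise products span all functions, not the B_0 of the paper), and the last product is taken with g_t = 1 so that only t multiplications are needed. Boolean functions are stored as 2^n-bit integers (bit j is the value at point j), so a product is `&` and a sum is `^`.

```python
import random

N = 4
POINTS = range(1 << N)


def table(fn):
    """Truth table of fn packed into an integer, bit x = fn(x)."""
    return sum(fn(x) << x for x in POINTS)


# Stand-in basis: all monomials of degree <= 2 (11 functions for N = 4)
BASIS = [table(lambda x, u=u: int((x & u) == u))
         for u in range(1 << N) if bin(u).count("1") <= 2]


def combine(mask):
    """Linear combination of basis functions selected by the low bits of mask."""
    out = 0
    for k, phi in enumerate(BASIS):
        if (mask >> k) & 1:
            out ^= phi
    return out


def solve_gf2(columns, target):
    """Bitmask of column indices whose XOR equals target, or None if unsolvable."""
    pivots = {}  # leading bit -> (vector, bitmask of the columns it is built from)
    for idx, col in enumerate(columns):
        vec, used = col, 1 << idx
        while vec:
            top = vec.bit_length() - 1
            if top not in pivots:
                pivots[top] = (vec, used)
                break
            vec, used = vec ^ pivots[top][0], used ^ pivots[top][1]
    vec, used = target, 0
    while vec:
        top = vec.bit_length() - 1
        if top not in pivots:
            return None
        vec, used = vec ^ pivots[top][0], used ^ pivots[top][1]
    return used


def decompose(f, t, rng, max_tries=100):
    """Find g_0..g_{t-1}, h_0..h_t with f = sum_{i<t} g_i*h_i + h_t."""
    for _ in range(max_tries):
        gs = [combine(rng.getrandbits(len(BASIS))) for _ in range(t)]
        gs.append(table(lambda x: 1))  # assumed: the last g is the constant 1
        # The unknown c_{i,k} multiplies the known column g_i * phi_k
        columns = [g & phi for g in gs for phi in BASIS]
        sol = solve_gf2(columns, f)
        if sol is not None:
            hs = [combine(sol >> (i * len(BASIS))) for i in range(t + 1)]
            return gs[:t], hs
    raise ValueError("no decomposition found; try a larger t")


def recombine(gs, hs):
    """Evaluate sum_i g_i*h_i + h_t on every point."""
    out = hs[-1]
    for g, h in zip(gs, hs):
        out ^= g & h
    return out
```

With 11 basis functions and 16 points, t = 1 already gives 22 unknowns for 16 equations; a random draw of g_0 that makes the system unsolvable is simply redrawn.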
For this system to have a solution, we need some conditions, which are the overall conditions on our decomposition. The first one is on the size: we have (t + 1) · |B| unknowns and 2^n equations, so our parameters must satisfy the inequality (t + 1) · |B| ≥ 2^n. This means that for a fixed basis, we have a condition on the number of products that we need: t must be at least 2^n / |B| - 1. Here it is interesting to notice that the bigger the basis B, the smaller the t we need, and therefore the fewer decomposition products we need to decompose our Boolean function. The condition on the basis B is that the products B × B must span all Boolean functions, in order to be able to decompose any Boolean function.
So how do we construct such a basis? We start from a specific basis B_0. I will not go into much detail about this basis, but you can find what it looks like in our paper. The main condition is that B_0 × B_0 spans all Boolean functions. Then, as I said, we may want a bigger basis, so we add elements to this basis by simply taking two elements in the span of our basis, multiplying them together, and adding the product to the basis, until we reach a satisfying size.
How do we determine such a satisfying size? We want the minimal multiplicative complexity for our method. So what is the cost of the method? We need r multiplications to generate the basis B and t multiplications for the decomposition products, so the total cost is simply r + t. How do we find the right balance? We did some experiments and we found that we can achieve the best multiplicative complexity, for example for 8-bit S-boxes, by doing 25 multiplications to generate the basis and by having 9 decomposition products. And this optimal basis is simply the basis B_0, so we did not need to add any element.
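The basis extension step and the bound on t can be sketched as follows. This is only an illustration under stated assumptions: n = 4, the starting basis {1, x_0, ..., x_3} is a hypothetical choice rather than the B_0 of the paper, and a product is kept only when it is linearly independent of the current basis, so that each of the r counted multiplications really enlarges the basis. Functions are stored as 2^n-bit truth tables.

```python
import random

N = 4
ONE = (1 << (1 << N)) - 1  # constant-1 function as a packed truth table


def coord(i):
    """Truth table of the coordinate function x -> x_i."""
    return sum(((x >> i) & 1) << x for x in range(1 << N))


def reduce_vec(vec, pivots):
    """Reduce vec against an XOR basis; the result is 0 iff vec is in the span."""
    while vec:
        top = vec.bit_length() - 1
        if top not in pivots:
            return vec
        vec ^= pivots[top]
    return 0


def random_span_element(basis, rng):
    """Random linear combination of the basis functions."""
    out = 0
    for phi in basis:
        if rng.getrandbits(1):
            out ^= phi
    return out


def extend_basis(basis, target_size, rng):
    """Add products of span elements until the basis has target_size elements."""
    assert target_size <= 1 << N
    basis, pivots, r = list(basis), {}, 0
    for phi in basis:
        v = reduce_vec(phi, pivots)
        if v == 0:
            raise ValueError("initial basis must be linearly independent")
        pivots[v.bit_length() - 1] = v
    while len(basis) < target_size:
        prod = random_span_element(basis, rng) & random_span_element(basis, rng)
        v = reduce_vec(prod, pivots)
        if v:  # keep only products that enlarge the span
            pivots[v.bit_length() - 1] = v
            basis.append(prod)
            r += 1
    return basis, r


def min_products(n, basis_size):
    """Smallest t with (t + 1) * basis_size >= 2^n."""
    return -(-(1 << n) // basis_size) - 1
```

The total cost of decomposing one function is then `r + min_products(n, len(basis))`, which is the balance between basis size and number of products discussed above.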
Now we have a decomposition method for Boolean functions. How do we use it for S-boxes? Let us recall that an n-bit S-box is simply n coordinate functions, so we just need to apply the Boolean decomposition to each of the n coordinate functions. The new cost will be r + t × n, so now, to find the optimal multiplicative complexity, we have to take a bigger basis. For 8-bit S-boxes, we have 37 multiplications to generate the basis plus 5 × 8 multiplications for the decompositions, for a total multiplicative complexity of 77. This decomposition method works for any S-box.
Now I will present some improvements that are S-box dependent. The first one is called the basis update. You start by decomposing the first coordinate function as I explained earlier: you pick a basis B_1 and you decompose f_1 with this basis B_1, so nothing new. Then you go to the second coordinate function and you pick a new basis B_2, where you add to the basis all the decomposition products that you used for f_1. Since you have already computed them, they are completely free, and now you have a bigger basis, so you need fewer decomposition products to decompose f_2 with this new basis. You do the same for the next coordinate function, and so on, until you have completely decomposed the S-box. In terms of cost, this means that each time we go to the next coordinate function, the t_i will be smaller. So the new cost will be r + t_1 + t_2 + ... + t_n. Why did I say it is an S-box-dependent decomposition? Because the elements that you feed into the basis depend on the coordinate functions that you are decomposing, and so on the S-box. You cannot use these bases for every S-box.
The second improvement is called the rank drop, and it is based on the fact that the linear system that we solve, which can be expressed as a matrix times the unknown vector equal to b, does not necessarily need the matrix to be of full rank.
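The cost formulas for the generic method (r + t × n) and for the basis update (r + t_1 + ... + t_n) can be compared with a small calculation. The sketch below assumes that every t_i is the smallest value allowed by the counting condition (t + 1) · |B| ≥ 2^n and that each decomposition product added to the basis is linearly independent of it; both are simplifying assumptions. The basis size 46 used in the example is an assumed placeholder: any size from 43 to 51 gives t = 5 for n = 8, which is consistent with the 37 + 5 × 8 = 77 quoted above.

```python
def min_products(n, basis_size):
    """Smallest t with (t + 1) * basis_size >= 2^n."""
    return -(-(1 << n) // basis_size) - 1


def generic_cost(n, r, basis_size):
    """Same basis for every coordinate function: r + n * t."""
    return r + n * min_products(n, basis_size)


def basis_update_cost(n, r, basis_size):
    """Basis grows with the products of each coordinate: r + t_1 + ... + t_n."""
    total, ts = r, []
    for _ in range(n):
        t = min_products(n, basis_size)
        ts.append(t)
        total += t
        basis_size += t  # assumed: the t products are new independent elements
    return total, ts


generic = generic_cost(8, 37, 46)
updated, t_values = basis_update_cost(8, 37, 46)
```

The list `t_values` is non-increasing, which is the effect described above: later coordinate functions reuse earlier products for free and need fewer new ones.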
You can allow some rank drop of, let's say, δ, and the system will then have a solution for one Boolean function out of 2^δ. So to find a system that works for a given Boolean function, you will need to try around 2^δ systems. Why is this an improvement? Because the condition that we had on the parameter t is relaxed, since we now have 2^n - δ instead of 2^n. Combining these two improvements, we applied them to specific S-boxes: Serpent for n = 4, SC2000 for n = 5 and 6, and CLEFIA for n = 8. And we were able to obtain a gain of 15 in the multiplicative complexity compared with the generic method.
So this is our new decomposition method. Now let's look at how we use it in practice. One of the main advantages of the bitslice approach is that it allows a high level of parallelization. For example, if we look back at the AES example from our previous work, you have 16 S-box computations at each round, so you work with 16-bit bitslice registers. But we did our implementation on a 32-bit architecture, so this is not optimal, since our registers are not fully filled. In fact, you can do two 16-bit ISW-ANDs with only one 32-bit ISW-AND. This means that you need to rearrange the circuit that represents the S-box in such a way that you group as many AND gates as possible in pairs. We did that for the AES circuit, which is composed of 32 AND gates. We rearranged it and we were able to group all the AND gates in pairs, which means that we reduced the multiplicative complexity of this circuit by a factor of 2.
For our decomposition, we first need to define a parallelization level. Let's say the parallelization level k is equal to the architecture size divided by the number of S-boxes that you need to evaluate at each round. For the generic method, we can fully parallelize all the steps.
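The pairing of AND gates described above can be illustrated with plain (unmasked) words in Python. The helper names are assumptions for this sketch; in the masked setting, the single 32-bit `&` below would be a single 32-bit ISW-AND, which is where the saving comes from.

```python
MASK16 = 0xFFFF


def parallelization_level(arch_bits, sboxes_per_round):
    """k = architecture size divided by the number of S-boxes per round."""
    return arch_bits // sboxes_per_round


def pack(lo, hi):
    """Place two 16-bit bitslice registers in one 32-bit word."""
    return (lo & MASK16) | ((hi & MASK16) << 16)


def unpack(word):
    """Split a 32-bit word back into its two 16-bit halves."""
    return word & MASK16, (word >> 16) & MASK16


def paired_and(a1, b1, a2, b2):
    """Compute a1 & b1 and a2 & b2 with a single 32-bit AND."""
    return unpack(pack(a1, a2) & pack(b1, b2))


k = parallelization_level(32, 16)  # AES on a 32-bit architecture
```

The two gates in a pair must not depend on each other, which is why the circuit has to be rearranged before the grouping is possible.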
However, for the improved method, it depends on the specific S-box you are currently decomposing. I did not describe it here, but you can find how we did it in our paper. And we also get good results for the improved method.
Then we did a performance comparison on ARM with optimized implementations of the best generic polynomial methods, namely the CRV decomposition and the algebraic decomposition, where the parallelization level k is equal to 2. You can see that the bitslice approach is asymptotically way better than the best polynomial methods. However, for n = 8, we can see that for small masking orders, that is for d ≤ 8, we are slightly less efficient. This is due to the fact that we have a big factor in the linear terms. We need to perform every linear combination on our basis by hand, whereas the CRV and algebraic decompositions, thanks to the particular form of their bases, can do that with lookup tables and save several linear operations. Still, we can see that this new generic method is asymptotically better than the best polynomial ones. So, thank you for your attention, and I will be happy to take any questions.