So here's the outline of my talk. We saw in the last talk what higher-order masking is, and it is well known why it is important, so I will skip through the introductory material on this slide. One important point to note is that the complexity of side-channel attacks increases exponentially with the number of shares for nearly all of the known masking schemes. As mentioned in the previous talk, affine functions are straightforward to compute in the presence of shares, and the time and randomness complexity is linear in the number of shares. So the main challenge while designing higher-order masking schemes is to secure the nonlinear operations, and the various higher-order masking schemes mainly differ in how they secure the nonlinear operations. Note that in a block cipher evaluation, the S-box computations are the only nonlinear operations. One class of countermeasures is the one proposed by Carlet and others at FSE 2012, which is in turn based on the seminal work of Ishai, Sahai and Wagner from CRYPTO 2003 and its extension by Rivain and Prouff from CHES 2010. This CGPQ scheme provides formal security guarantees in the probing leakage model, and it is well suited for software implementations. The main idea behind their scheme is to represent the d-to-r-bit S-box as a polynomial over the field of 2^d elements. Securely computing the S-box in the presence of shares then reduces to the problem of evaluating a polynomial in the presence of shares. While evaluating polynomials over binary finite fields, polynomial addition, multiplication by a scalar, and squaring are linear operations. The cost is mainly determined by the nonlinear multiplications, which are secured by the technique of Ishai, Sahai and Wagner and its extension by Rivain and Prouff: securing a single nonlinear multiplication over a binary finite field takes a quadratic amount of time and randomness in the number of shares.
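The quadratic cost of a shared multiplication can be seen in a minimal sketch of the Ishai-Sahai-Wagner multiplication, assuming Boolean (XOR) sharing over GF(2^8); the field choice (the AES polynomial) and the helper names are illustrative, not from the talk:

```python
import secrets

def gf8_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1
    (an illustrative choice of binary field)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def share(x, n):
    """Split x into n XOR-shares."""
    s = [secrets.randbelow(256) for _ in range(n - 1)]
    last = x
    for v in s:
        last ^= v
    return s + [last]

def unshare(shares):
    """Recombine XOR-shares."""
    x = 0
    for s in shares:
        x ^= s
    return x

def isw_mul(a, b):
    """ISW multiplication of two XOR-sharings: returns a sharing of the
    product of the underlying secrets. It uses n^2 field multiplications
    and n*(n-1)/2 fresh random field elements -- both quadratic in the
    number of shares n, which is the cost mentioned above."""
    n = len(a)
    c = [gf8_mul(a[i], b[i]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(256)  # fresh mask r_ij
            c[i] ^= r
            # r_ji = r_ij + a_i*b_j + a_j*b_i; the masks cancel on recombination
            c[j] ^= r ^ gf8_mul(a[i], b[j]) ^ gf8_mul(a[j], b[i])
    return c
```

For example, `unshare(isw_mul(share(x, n), share(y, n)))` equals `gf8_mul(x, y)` for any number of shares n.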
And there are already several improvements to the original CGPQ scheme. So the cost analysis of the CGPQ scheme reduces to the problem of evaluating polynomials over binary finite fields. The cost model that we follow is simply to count the nonlinear multiplications, that is, multiplications which are not squarings. We ignore the cost of linear operations, such as addition, scalar multiplication, and squaring, because they are relatively cheap compared to a nonlinear multiplication. The best polynomial evaluation method with proven worst-case complexity in this cost model is the parity-split method proposed by Carlet and others, which is in turn based on a method by Knuth and Eve; its worst-case complexity is of the order of sqrt(2^d) nonlinear multiplications. At CHES 2014, Coron, Roy and Vivek proposed a heuristic method, the CRV method, which has a better asymptotic complexity of sqrt(2^d / d) nonlinear multiplications, and they also show that this is asymptotically optimal. In practice, the CRV method performs quite well. To evaluate any 4-to-4-bit S-box, one needs only two nonlinear multiplications, and this is shown to be optimal. To evaluate any 6-to-4-bit S-box, in particular all the DES S-boxes, it needs four nonlinear multiplications, and three is a known lower bound. And to evaluate any 8-to-8-bit S-box, it needs ten nonlinear multiplications, and three is a known lower bound. So that is the state of the art. The main contribution of our work is to improve the CRV method for evaluating the polynomials corresponding to S-boxes with respect to the nonlinear multiplicative complexity cost model. So let me quickly recall the CRV method. We are given a d-to-r-bit S-box as input, and we need a strategy to evaluate the polynomial corresponding to this S-box.
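To make the cost model concrete: since squarings are free, plain square-and-multiply for x^e costs popcount(e) - 1 nonlinear multiplications. A small sketch (the function name is mine, not from the talk):

```python
def nonlinear_mult_count_binary(e):
    """Nonlinear multiplications used by plain square-and-multiply to get
    x^e over a binary field, in the cost model where squarings are free:
    x^e is a product of popcount(e) terms x^(2^i), each obtained by free
    squarings, so combining them costs popcount(e) - 1 multiplications."""
    return max(bin(e).count("1") - 1, 0)

# Example: x^254, the nonlinear part of the AES S-box, costs 6 this way;
# the better addition chain of Rivain-Prouff brings it to 4 over GF(2^8),
# and this talk will bring it to 3 by working over GF(2^16).
```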
So the first step is to naturally encode the d-bit and r-bit strings as elements of GF(2^d). Then we pre-compute a set of monomials. We require this set to be closed with respect to squaring: because squaring is a free operation, we would like this pre-computed set to be as large as possible. It must also satisfy the property that any monomial of degree less than 2^d can be produced by multiplying some two monomials from this pre-computed set. Why is this needed? Because it ensures that one will be able to obtain a decomposition of the form P(x) = P_1(x)·Q_1(x) + ... + P_{t-1}(x)·Q_{t-1}(x) + P_t(x). So in the next step, we obtain this decomposition for a polynomial corresponding to the given S-box, which itself is determined in the course of the computation. In this decomposition, we need the polynomials P_i and Q_i to have monomials only from the pre-computed set. The decomposition is determined by first choosing the Q_i's randomly, with monomials only from the pre-computed set, and then setting up a system of linear equations over F_2 by evaluating this relation at each of the 2^d inputs, obtaining one equation for each output bit of the S-box. The variables are the unknown bits corresponding to the coefficients of the polynomials P_i that we would like to determine. Our modification to the CRV method is mainly in step 0 and step 1. Instead of naturally encoding the d-bit and r-bit strings in GF(2^d), we encode them in a bigger field. As in step 1 of the CRV method, we pre-compute a set of monomials which is closed with respect to squaring. But now, instead of requiring that every monomial of degree less than 2^d can be computed by multiplying some two monomials, we just require the condition that one should be able to obtain a decomposition in step 2 of the CRV method. Step 2 is exactly the same as we saw here.
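The steps just described can be sketched end to end for a 4-bit S-box. This is a toy rendition under my own assumptions: GF(2^4) via x^4 + x + 1, the PRESENT S-box for concreteness, a squaring-closed monomial set built from cyclotomic classes, Q1 drawn from seeded randomness, and retries over seeds because solvability of the linear system is only heuristic:

```python
import random

IRR, NBITS = 0x13, 4            # GF(2^4) via x^4 + x + 1
PRESENT = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]

def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << NBITS):
            a ^= IRR
        b >>= 1
    return r

def gpow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gmul(r, a)
        a = gmul(a, a)
        e >>= 1
    return r

def cyc(j, m=15):
    """Cyclotomic class of j: exponents of x^j and all its repeated squares."""
    c, x = [], j % m
    while x not in c:
        c.append(x)
        x = 2 * x % m
    return c

# Step 1: a squaring-closed monomial set, {x^0} union C(1) union C(3).
L = [0] + cyc(1) + cyc(3)

def evaluate(coeffs, x):
    acc = 0
    for c, e in zip(coeffs, L):
        acc ^= gmul(c, gpow(x, e))
    return acc

def solve_gf2(rows, nvars):
    """Gaussian elimination over F2; each row is an int whose bit `nvars`
    is the right-hand side. Returns one solution or None if inconsistent."""
    pivots = []
    for r in rows:
        for col, pr in pivots:
            if (r >> col) & 1:
                r ^= pr
        col = next((i for i in range(nvars) if (r >> i) & 1), None)
        if col is None:
            if r:
                return None     # 0 = 1: inconsistent
            continue
        pivots.append((col, r))
    sol = 0
    for col, pr in sorted(pivots, reverse=True):
        v = (pr >> nvars) & 1
        for i in range(col + 1, nvars):
            if (pr >> i) & 1:
                v ^= (sol >> i) & 1
        if v:
            sol |= 1 << col
    return sol

def decompose(sbox, tries=100):
    """Step 2: choose Q1 at random, then solve the F2 system for the
    coefficient bits of P1, P2 in S(x) = P1(x)*Q1(x) + P2(x)."""
    nvars = 2 * len(L) * NBITS
    for seed in range(tries):
        rng = random.Random(seed)
        q1 = [rng.randrange(16) for _ in L]
        rows = []
        for x in range(16):
            q1x = evaluate(q1, x)
            fac = [gmul(q1x, gpow(x, e)) for e in L] + [gpow(x, e) for e in L]
            for k in range(NBITS):          # one F2 equation per output bit
                row = ((sbox[x] >> k) & 1) << nvars
                for m, f in enumerate(fac):
                    for j in range(NBITS):  # each field coefficient is 4 unknown bits
                        if (gmul(1 << j, f) >> k) & 1:
                            row |= 1 << (m * NBITS + j)
                rows.append(row)
        sol = solve_gf2(rows, nvars)
        if sol is not None:
            p1 = [(sol >> (m * NBITS)) & 0xF for m in range(len(L))]
            p2 = [(sol >> ((len(L) + m) * NBITS)) & 0xF for m in range(len(L))]
            return p1, q1, p2
    return None

p1, q1, p2 = decompose(PRESENT)
# Two nonlinear multiplications at run time: one to get x^3 (every other
# monomial in L is a free squaring), one for the product P1(x)*Q1(x).
assert all(gmul(evaluate(p1, x), evaluate(q1, x)) ^ evaluate(p2, x) == PRESENT[x]
           for x in range(16))
```

The system has 64 equations and 72 unknown bits, so by the heuristic in the talk a random Q1 is expected to work; the retry loop is my safeguard, not part of the method.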
We want a decomposition of this same form, but now we work over the bigger field GF(2^n) instead of GF(2^d). So that's about the modifications to the CRV method. For the analysis, the number of nonlinear multiplications used is given by this expression. Here we see that the bigger the field we work in, the fewer nonlinear multiplications we need to invest in the pre-computation step. But in practice this comes at a cost, because performing field arithmetic in a bigger field becomes more expensive as n grows. To determine the parameters, as in the CRV method, we just use the heuristic that the matrix in the linear-algebra step has full rank, so that we will be able to obtain a decomposition for any S-box as long as the number of columns is more than the number of rows. This is just a heuristic; it works well in practice, but we still do not know how to prove it rigorously. A heuristic analysis then gives the complexity of the improved method: in the limiting case it is about (1/2)·sqrt(2^d / d) nonlinear multiplications, half that of the CRV method. But again, this analysis is only heuristic. In practice, we can reduce the complexity for many S-boxes. For 4-bit S-boxes, for example PRESENT, we do not get any improvement by working over GF(2^8), because two nonlinear multiplications is already known to be optimal over any field. But what is interesting is that for 6-to-4-bit S-boxes, in particular the DES S-boxes, we can do with three nonlinear multiplications instead of four. In a software implementation, the cost of field arithmetic over GF(2^8) and GF(2^6) is essentially the same, and it turns out that this improvement from four to three leads to an overall improvement in the running time corresponding to this factor of three over four. And since three is a known lower bound, three nonlinear multiplications for the DES S-boxes is now optimal.
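The remark that GF(2^6) and GF(2^8) arithmetic cost about the same in software can be seen from a generic shift-and-XOR multiplier, whose work is bounded by the degree n; the irreducible polynomials below are standard choices and the helper names are mine:

```python
def gf_mul(a, b, irr, n):
    """Shift-and-XOR multiplication in GF(2^n) defined by the irreducible
    polynomial irr (given as a bitmask of degree n). The work is O(n) word
    operations either way, which is why GF(2^6) and GF(2^8) arithmetic
    cost essentially the same in a byte-oriented software implementation."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << n):
            a ^= irr
        b >>= 1
    return r

def gf_pow(a, e, mul):
    r = 1
    while e:
        if e & 1:
            r = mul(r, a)
        a = mul(a, a)
        e >>= 1
    return r

# GF(2^6) via x^6 + x + 1, and GF(2^8) via the AES polynomial.
GF64 = lambda a, b: gf_mul(a, b, 0b1000011, 6)
GF256 = lambda a, b: gf_mul(a, b, 0x11B, 8)
```

A quick sanity check is the Frobenius identity a^(2^n) = a, which holds for every element of GF(2^n).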
And of course, working over GF(2^8) we do not get any improvement for 8-bit S-boxes; it is the same as the CRV method. But from a theoretical point of view, it is interesting to note that by working over GF(2^16), we can reduce the multiplicative complexity from ten to six. And we will see later in the talk that we can even reach three. So we performed a masked implementation of DES. Recall that DES uses eight 6-to-4-bit S-boxes. In the pre-computation step, the pre-computed set of monomials consists of the constant, the given input x, x^3 and all its squares, and x^7 and all its squares. We obtain a decomposition of the form P(x) = P1(x)·Q1(x) + P2(x), where P1, Q1, and P2 have monomials only from this pre-computed set. Note that we are working over GF(2^8) and not GF(2^6). We did a software implementation in C, reusing code from previous implementations. We ran the experiments on a laptop, but we ensured that we manipulated only bytes, and we tabulated the linear functions for reasons of efficiency; these tables can be stored in ROM. So these are the timing results. The third column corresponds to the number of shares, and the last column represents the overhead factor relative to an unprotected implementation. With three shares, the overhead factor is 13, which is better than the CRV method's factor of 18. With five shares, the factor is 26 instead of 35; with seven shares, 42 instead of 58; and with nine shares, 64 instead of 91. So we actually get an improvement of at least 25% in the overall running time, and the RAM memory required is also lower, as is the randomness complexity.
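The pre-computed set just described can be written down explicitly, and the count of three nonlinear multiplications falls out of it. A small sketch, assuming GF(2^8) exponent arithmetic modulo 255 (the helper name is mine):

```python
def cyclotomic_class(j, m=255):
    """Exponents of x^j and all its repeated squares in GF(2^8)."""
    c, x = [], j % m
    while x not in c:
        c.append(x)
        x = (2 * x) % m
    return c

# The pre-computed set for the DES S-boxes, working over GF(2^8):
# the constant, x with its squares, x^3 with its squares, x^7 with its squares.
L = [0] + cyclotomic_class(1) + cyclotomic_class(3) + cyclotomic_class(7)

# The set is closed under squaring (doubling exponents mod 255).
assert all((2 * e) % 255 in L for e in L)

# Nonlinear multiplication count for P(x) = P1(x)*Q1(x) + P2(x):
#   x^3 = x * x^2     -> 1 multiplication (x^2 is a free squaring)
#   x^7 = x^3 * x^4   -> 1 multiplication (x^4 is a free squaring)
#   P1(x) * Q1(x)     -> 1 multiplication
# Evaluating P1, Q1, P2 from the table of powers is linear, so the
# total is the three nonlinear multiplications claimed in the talk.
TOTAL_NONLINEAR_MULTS = 3
```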
The remaining results that I present are mainly of theoretical interest. In spite of our improvement to the CRV method, the upper bound to evaluate any d-to-r-bit S-box is still of the order of sqrt(2^d / d). Using a different technique, we drastically improve this upper bound to the order of log(d) nonlinear multiplications. The main idea is the observation that we can pack several independent multiplications in a smaller field into one multiplication in a big extension field, and then the individual products can be extracted for free using linear projections, which do not cost any nonlinear multiplications. You can refer to the paper for more details. We also show that this bound is optimal, and our argument is based on the algebraic degree. The algebraic degree of a polynomial is the maximum Hamming weight of its exponents corresponding to non-zero coefficients. When we perform a multiplication, the algebraic degree can at most double, so in order to reach degree d, we need at least on the order of log(d) multiplications. We could apply this technique to the case of AES and reduce the number of nonlinear multiplications to three, working over GF(2^16), instead of four over GF(2^8). The main idea of the method is to identify the field of 2^8 elements as a subfield of GF(2^16). We need to compute x^254 over GF(2^8); that is the nonlinear part of the AES S-box. First, using one nonlinear multiplication, compute x^3. Then form the element x^2 + Z·x^3 of GF(2^16), where x^2 is already computed for free from x, and Z is an element outside GF(2^8). Also, x^12 can be computed for free from x^3 by squaring twice.
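The algebraic-degree lower bound can be stated as a two-line sketch (function names are mine); for x^254, the Hamming weight of 254 is 7, so at least ceil(log2(7)) = 3 multiplications are needed, matching the construction being described:

```python
from math import ceil, log2

def algebraic_degree(exponents):
    """Algebraic degree of a polynomial over a binary field: the maximum
    Hamming weight among the exponents of its non-zero monomials."""
    return max(bin(e).count("1") for e in exponents)

def mult_lower_bound(exponents):
    """Each multiplication at most doubles the algebraic degree, and we
    start from degree-1 inputs, so reaching algebraic degree D needs at
    least ceil(log2(D)) nonlinear multiplications."""
    return ceil(log2(algebraic_degree(exponents)))

# x^254 is the AES inversion: algebraic degree 7, lower bound 3.
```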
So using one nonlinear multiplication, we compute the product (x^2 + Z·x^3)·x^12, which equals x^14 + Z·x^15. We can then extract x^14 and x^15 for free using linear maps, and with one more multiplication we can compute x^254 = x^14 · (x^15)^16, where raising to the 16th power is again free. This sequence of operations is motivated by the paper of Gentry, Halevi and Smart, where they homomorphically evaluate AES, but that is in a different context. We also generalize the previous lower bound results. In the CRV paper, a lower bound was established for evaluating d-to-d-bit S-boxes: there are some polynomials that need on the order of sqrt(2^d / d) nonlinear multiplications, but this bound is only over GF(2^d) and only for d-to-d-bit S-boxes. We generalize it to any d-to-r-bit S-box, where r can be less than d, and to any binary finite field; this is the resulting bound. Our argument is a counting argument as in the CRV paper, and we just need to additionally use the fact that projections are linear functions. So, to conclude: the main contribution is an improvement of the CRV method for evaluating the polynomials corresponding to S-boxes in the nonlinear multiplicative complexity cost model. The main idea is to work over bigger fields than strictly necessary, and this leads to a reduction in the number of nonlinear multiplications for many S-boxes. In particular, the DES S-boxes now need only three nonlinear multiplications instead of four over GF(2^6), and our masked implementation achieves an overall improvement in running time of about 25%. On the theoretical side, we improve the upper bound for evaluating any d-to-r-bit S-box.
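The x^254 sequence described above can be checked end to end in a toy model. Everything here is my own rendition of the packing idea, not the talk's implementation: GF(2^8) uses the AES polynomial, and GF(2^16) is built as the quadratic extension GF(2^8)[Z]/(Z^2 + Z + LAM), where LAM is found by search so that the polynomial is irreducible. Exactly three nonlinear multiplications appear: x·x^2, the packed GF(2^16) product, and x^14·(x^15)^16.

```python
def gf8_mul(a, b):
    """GF(2^8) via the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf8_sq(a):                   # squaring is F2-linear: "free"
    return gf8_mul(a, a)

def gf8_pow(a, e):               # reference implementation only
    r = 1
    while e:
        if e & 1:
            r = gf8_mul(r, a)
        a = gf8_mul(a, a)
        e >>= 1
    return r

# Build GF(2^16) = GF(2^8)[Z]/(Z^2 + Z + LAM): pick LAM outside the image
# of t -> t^2 + t, so Z^2 + Z + LAM has no root in GF(2^8) and, being
# quadratic, is irreducible.
LAM = min(set(range(256)) - {gf8_sq(t) ^ t for t in range(256)})

def gf16_mul(u, v):
    """(a + bZ)(c + dZ) with Z^2 = Z + LAM; elements are (a, b) pairs."""
    (a, b), (c, d) = u, v
    ac, bd = gf8_mul(a, c), gf8_mul(b, d)
    adbc = gf8_mul(a, d) ^ gf8_mul(b, c)
    return (ac ^ gf8_mul(LAM, bd), adbc ^ bd)

def aes_inverse_3_mults(x):
    """x^254 in GF(2^8) using three nonlinear multiplications."""
    x2 = gf8_sq(x)                        # free squaring
    x3 = gf8_mul(x, x2)                   # mult 1
    x12 = gf8_sq(gf8_sq(x3))              # free squarings
    # mult 2 packs x^2 * x^12 and x^3 * x^12 into one GF(2^16) product:
    x14, x15 = gf16_mul((x2, x3), (x12, 0))   # = x^14 + Z * x^15
    # x^14 and x^15 are extracted for free (linear projections)
    x240 = x15
    for _ in range(4):                    # (x^15)^16: four free squarings
        x240 = gf8_sq(x240)
    return gf8_mul(x14, x240)             # mult 3: x^14 * x^240 = x^254
```

The final check is exhaustive: the three-multiplication pipeline agrees with a plain square-and-multiply computation of x^254 on all 256 inputs.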
The complexity is just on the order of log(d) nonlinear multiplications, but this comes at the cost of working over arbitrarily large fields. As an application of this technique, we showed that AES needs only three nonlinear multiplications over GF(2^16), and we generalized the previous lower bound results to arbitrary binary finite fields. So thanks for your attention. Thank you.