I am Annapurna Valiveti, a research scholar from IIIT Bangalore. I will be presenting our TCHES 2021 work titled "Higher-Order Lookup Table Masking in Essentially Constant Memory". This is joint work with my supervisor, Srinivas Vivek. The agenda of my presentation is as follows: I will briefly introduce side-channel attacks, followed by the masking countermeasure. I will then present our scheme, higher-order lookup table masking using a PRG, followed by experimental results, and I will conclude with a brief summary of our work and possible future directions. A cryptographic algorithm must hide the secret to prevent the adversary from compromising the device. But while executing the algorithm, the device may have measurable effects in the form of leakage, and the adversary can use this leakage to compromise the system. In the side-channel attack model, along with the input and output of the device, the adversary also observes leakage, in the form of power consumption, timing information, or electromagnetic emission, and can mount a side-channel attack to recover the secret from the device. The takeaway from this slide is that secure algorithms still need secure implementations. So what do we mean by a secure implementation? Since the leakage from the device is correlated with the secret, the adversary can mount an SCA. If we can make the leakage independent of the key by introducing randomness into the computation, this can counter side-channel attacks. This way of introducing randomness into the computation is known as masking. In a masking scheme, the secret is divided into n shares. We use additive secret sharing in our scheme, and the security parameter of the masking scheme is denoted by t. As per the state of the art, t = n - 1. In this presentation, our focus is on software countermeasures against power side-channel attacks.
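To make additive (Boolean) secret sharing concrete, here is a minimal Python sketch, not the paper's implementation: a byte is split into n shares whose XOR equals the secret, so any n - 1 shares are uniformly random and reveal nothing.

```python
import secrets

def share(secret: int, n: int) -> list[int]:
    """Split a byte into n additive shares over GF(2^8): XOR of shares = secret."""
    shares = [secrets.randbelow(256) for _ in range(n - 1)]
    last = secret
    for s in shares:
        last ^= s  # the last share is fixed so the XOR recombines to the secret
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Recombine an additive sharing by XOR-ing all shares."""
    x = 0
    for s in shares:
        x ^= s
    return x

# n = 11 shares gives 10th-order security under t = n - 1
assert reconstruct(share(0x3A, 11)) == 0x3A
```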
Even though a masking scheme secure at order t is prone to attacks at order t + 1, Chari et al., in their CRYPTO '99 work, demonstrated that the adversary requires effort exponential in the masking order to mount an SCA. Hence, we can consider the masking order a sound security parameter. Now we continue our discussion with security models: how can we model the leakage obtained by the adversary? The probing leakage model was initiated by Ishai, Sahai, and Wagner at CRYPTO 2003. In this model, the leakage is modeled as the exact values of t intermediate variables. If we can prove that any set of t intermediate variables is independent of the secret, then the masked implementation is probing secure. In the noisy leakage model, by contrast, the adversary obtains noisy values of all intermediate variables. If we can prove that the adversary obtains, quote-unquote, no information about the secret, then the scheme is secure in the noisy leakage model. Duc, Dziembowski, and Faust in 2014 bridged the gap between these two models, showing that security in the probing model implies security in the noisy leakage model. We prove our schemes secure in the probing leakage model. I'll briefly summarize the various notions of probing security. As per Ishai, Sahai, and Wagner's work, to achieve t-th order security we need the number of shares to be 2t + 1. This bound was later improved to the optimal n = t + 1 in the CHES 2010 work of Matthieu Rivain and Emmanuel Prouff. Then we have the notion of compositional security, where the bound remains n = t + 1. At CCS 2016, Barthe et al. introduced this compositional security notion: if we prove the individual gadgets to be t-SNI secure, then the overall implementation remains t-SNI secure, which is what compositional security means.
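A basic gadget that recurs throughout these constructions is the mask refresh, which rerandomizes a sharing without changing the value it encodes. The following is a simple additive refresh in Python for illustration only; the t-SNI-secure refresh gadgets in the literature use more randomness than this minimal version.

```python
import secrets

def refresh(shares: list[int]) -> list[int]:
    """Rerandomize an additive sharing of a byte without changing its XOR."""
    out = list(shares)
    for j in range(1, len(out)):
        r = secrets.randbelow(256)  # fresh random mask
        out[0] ^= r                 # cancel the mask on share 0 ...
        out[j] ^= r                 # ... and inject it into share j
    return out
```

The invariant is that every random mask is XOR-ed into exactly two shares, so the XOR over all shares is preserved while each individual share is rerandomized.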
Now we'll see how the masking technique extends to block ciphers. The operations in a block cipher are linear and non-linear. The linear functions are trivial to mask: we can apply the function to the individual shares. The non-linear layer, which for block ciphers is the S-box, has to be handled carefully for a secure implementation. The approaches in the literature to implement a masked S-box fall broadly into two categories: circuit-based and lookup-table-based. Before comparing them, note that a cryptographic implementation depends on three main factors: the execution time it takes, the amount of RAM it requires, and the randomness required for a secure implementation. Circuit-based schemes require an essentially constant amount of memory for a masked implementation, whereas lookup-table-based schemes require an amount of memory exponential in the input size of the S-box, which also grows with the masking order. On the other hand, lookup-table-based schemes enjoy pre-processing, which refers to the part of the computation that can be done before the actual secret is available. With this advantage, lookup-table-based schemes need only a constant amount of online execution time, whereas for circuit-based schemes the entire execution happens in the online phase. So the main problem with lookup-table-based schemes is the amount of RAM required, and our goal is to optimize it. We look at the higher-order lookup-table-based schemes from the literature: the scheme proposed by Coron at EUROCRYPT 2014, and further optimized by Coron et al. at TCHES 2018.
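To make the "linear layers are easy" point concrete: any GF(2)-linear function commutes with XOR-sharing, so it can be applied share by share. This is a hypothetical Python sketch; the rotate-and-XOR map below is just an example of a GF(2)-linear byte function, not a function from AES.

```python
def lin(x: int) -> int:
    """Example GF(2)-linear map on a byte: XOR with its left rotation by 1."""
    rot1 = ((x << 1) | (x >> 7)) & 0xFF
    return x ^ rot1

def masked_lin(shares: list[int]) -> list[int]:
    """Apply a GF(2)-linear function share-wise.
    By linearity, the XOR of the outputs equals lin(XOR of the inputs)."""
    return [lin(s) for s in shares]
```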
As per this scheme, the lookup table, as I mentioned on my previous slide, requires an amount of memory exponential in the S-box input size k, namely 2^k entries, and it also grows with the number of shares n. Since an additional temporary table is required to construct the randomized lookup table, the total amount of memory required is double that quantity, plus the random bits required for the implementation. If we instantiate this scheme for order t = 10, i.e., n = 11, we need 440 kilobytes of memory for a single AES execution, and this amount of memory may not be affordable on a resource-constrained device. This is the high-level overview of the higher-order LUT scheme from EUROCRYPT 2014: the lookup table has n columns and goes through n - 1 shift-and-refresh steps, where the output of the shift and refresh by x_i is passed as input to the next shift, and once the n - 1 shift-and-refresh steps are finished, a final lookup outputs the shares of S(x). Since we want to optimize the RAM memory, the idea behind our contribution is to make the number of columns of the lookup table independent of the masking order. Essentially, we store only the first column of the lookup table and avoid storing the remaining n - 1 columns. The random values required for the computation are generated using a PRG, and the way we avoid storing the n - 1 columns is by recomputing the PRG outputs on the fly. This idea comes from our TCHES 2020 paper on a second-order lookup-table compression scheme; even though our current TCHES 2021 work is different, we reuse this idea from our previous contribution. Now we'll discuss the challenges in achieving a single-column lookup table.
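The shift-and-refresh construction just described can be sketched functionally in Python. This only illustrates the EUROCRYPT 2014 idea: it stores the full n-column table and draws masks from Python's RNG, so it reflects none of the memory optimization or security engineering of the real schemes.

```python
import secrets

def masked_sbox(S: list[int], x_shares: list[int]) -> list[int]:
    """Table-recomputation sketch: n - 1 shift-and-refresh steps, then one lookup.
    Input: S-box table S and an additive sharing x_1, ..., x_n of x.
    Output: n shares whose XOR is S(x)."""
    n = len(x_shares)
    # Initial table: entry u holds a trivial n-sharing of S(u).
    T = [[S[u]] + [0] * (n - 1) for u in range(len(S))]
    for i in range(n - 1):
        xi = x_shares[i]
        # Shift: T(u) <- T(u ^ x_i); XOR with x_i permutes the rows.
        T = [T[u ^ xi] for u in range(len(S))]
        # Refresh: rerandomize each row's output sharing with fresh masks.
        for row in T:
            for j in range(1, n):
                r = secrets.randbelow(256)
                row[0] ^= r
                row[j] ^= r
    # After all shifts, entry u encodes S(u ^ x_1 ^ ... ^ x_{n-1}),
    # so looking up at x_n yields a sharing of S(x).
    return T[x_shares[n - 1]]
```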
It can be observed in the EUROCRYPT 2014 scheme that the lookup table output by the shift with x_i is given as input to the shift with x_{i+1}. Since we are not storing the n - 1 columns of the lookup table explicitly, these n - 1 columns need to be recomputed; this is the additional overhead we pay to achieve the memory optimization. The random masks are to be generated using a PRG construction, and the PRG must itself be provably secure. The seed length c of the PRG depends on the locality of the circuit, where locality refers to the maximum number of random bits that any variable in the computation depends upon. It can be observed from prior works that improving the locality of the circuit results in a smaller seed c for the PRG, so we replace the refresh-masks gadget with a locality refresh to improve the locality of the circuit. The provably secure PRG constructions from the literature are the robust-PRG and multiple-PRG techniques. In the robust-PRG technique, the whole circuit is considered as a single entity and the locality is computed for the entire circuit. The multiple-PRG technique is essentially multiple non-robust PRGs: the randomness is divided into subsets, and the locality is computed with respect to each subset. We observed better online execution times with the multiple-PRG technique, because the time taken by a multi-PRG to output one unit of randomness is lower than with the robust-PRG technique. We will discuss these two techniques in detail in the next slides. Now I present a high-level comparison of our scheme against the higher-order lookup-table-based scheme. Our scheme has only a single-column lookup table, whereas the other scheme has a lookup table with n columns; the refresh mask is replaced by the locality refresh (LR), and the randomness is generated from a PRG. Once the n - 1 steps are finished, the final lookup is similar to the original scheme.
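The "recompute instead of store" idea can be illustrated with a toy mask derivation: if each refresh mask is a deterministic function of a short seed and its position in the computation, a column of the table can be regenerated whenever it is needed rather than kept in RAM. This sketch uses SHAKE-128 merely as a stand-in expander of my own choosing; the paper's security proofs require a linear PRG, which SHAKE is not, so this only illustrates the memory argument.

```python
import hashlib

def mask(seed: bytes, step: int, row: int, col: int) -> int:
    """Derive the refresh mask for position (step, row, col) on demand
    from a short seed. Because derivation is deterministic, the masks
    (and hence the n - 1 table columns) never need to be stored."""
    ctx = seed + bytes([step & 0xFF, row & 0xFF, col & 0xFF])
    return hashlib.shake_128(ctx).digest(1)[0]
```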
Now we will look at the security analysis of our scheme, first for the higher-order lookup-table-based scheme using the robust-PRG construction. Since the adversary can probe the input or intermediate variables of the robust PRG along with the intermediates of the S-box implementation, we need to carefully choose the seed length c of the robust PRG so that it remains secure against t probing attacks, and this seed length depends on the locality of the entire S-box implementation. We have proven the security of our robust-PRG scheme using the compositional security notion. We build the robust PRG using a trivial construction, obtained by combining the outputs of t + 1 non-robust PRGs. I would like to mention that the security proofs work only for a linear PRG construction. Moving on to the multiple-PRG construction, here we need to divide the randomness required for the implementation into subsets. As mentioned on the previous slides, the lookup table is constructed in n - 1 shift-and-refresh steps, and each of the temporary tables during the shift and refresh has n - 1 columns. So we divide the randomness required for the pre-processing into (n - 1)^2 subsets, and each of these subsets is generated from a non-robust PRG. The locality is computed with respect to one randomness subset, that is, one column of size 2^k x k' bits, and the seed length c of each PRG is chosen accordingly. We prove the security of the multiple-PRG construction in an extended security model, t-SNI-R*. The reason for choosing this extended model is that the adversary can possibly learn the outputs of a non-robust PRG using a single probe, so the extended model addresses the simulation related to the leakage of an entire randomness subset through a single probe. This is a high-level comparison of the asymptotic complexities of our scheme with the original scheme.
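The subset division just described can be sketched as one independent pseudorandom stream per (step, column) pair. Again SHAKE-128 is only a stand-in for a non-robust PRG, and the seed separation by index is my own illustrative choice, not the paper's construction.

```python
import hashlib

def subset_stream(seed: bytes, i: int, j: int, nbytes: int) -> bytes:
    """Expand the pseudorandom stream for subset (i, j): the refresh masks
    of column j at shift-and-refresh step i."""
    return hashlib.shake_128(seed + bytes([i & 0xFF, j & 0xFF])).digest(nbytes)

def all_subsets(seed: bytes, n: int, nbytes: int) -> dict:
    """(n - 1)^2 randomness subsets, one non-robust PRG stream per subset."""
    return {(i, j): subset_stream(seed, i, j, nbytes)
            for i in range(n - 1) for j in range(n - 1)}
```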
You can see that the total amount of PRG output is cubic in the number of shares for the robust-PRG construction and quadratic for the multiple-PRG construction, and the online execution time depends on the time required per unit of PRG output, which is cubic for the robust PRG and linear for the multiple PRG. So we have achieved better online execution time using the multiple-PRG construction. This slide presents the experimental results of our schemes using the multi-PRG approach. Since we would like to demonstrate a 10th-order lookup-table-based scheme on a resource-constrained device, we have chosen an ARM Cortex-M4 device with 256 KB of memory as our target. It can be observed that a 10th-order secure AES-128 implementation needs 41.2 KB of memory; this includes the memory to store the pre-processed lookup tables and the input seed to the PRG. The online execution requires around 4 million clock cycles. Finally, I would like to conclude my presentation with a brief summary of our results. Our scheme requires approximately 40 KB of memory for practical orders; we have experimented up to the 10th order. Essentially, the number of columns of the lookup table is independent of the masking order, the memory optimization achieved is independent of the speed of the built-in RNG of the device, and we also explored the possibility of RNG-versus-PRG speed trade-offs.
The target chosen has a relatively fast RNG: it takes approximately 300 clock cycles to generate a 32-bit random number, so we also explored the trade-offs between the PRG and the RNG to achieve better online execution time. Comparing our results with bit-sliced masked AES executions, our execution times are 1.5 times faster than the 8-bit bit-sliced masked AES, and almost comparable to the 32-bit bit-sliced masked AES execution. It will be interesting future work to design a higher-order lookup-table-based scheme with a faster online execution time than the 32-bit masked bit-sliced AES. You can find the preprint of our work on the ePrint archive. Thank you for your attention.