Welcome to this CHES 2020 presentation on Low-Latency Hardware Masking with Application to AES. My name is Pascal Sasdrich, and this is joint work at Rambus with my co-authors Begül Bilgin, Michael Hutter and Mark Marson.

Throughout this work, we try to answer the following research question: can we implement masked hardware circuits that can be evaluated securely with low or zero latency overhead? As part of the answer, we also address the following problems. What is a suitable hardware masking scheme with respect to low or zero latency overhead? Is the masking scheme of choice composable and robust against glitches? That is, can we design a full hardware architecture based on small, secure and glitch-resistant components, called gadgets, that can be connected without reducing security? Can the chosen masking scheme provide zero latency overhead compared to an unprotected design? If so, what is the cost in terms of area, performance or fresh randomness? If not, what is the lowest latency overhead that we can achieve? Finally, we would like to address the problem of building a secure, single-cycle-per-round AES implementation with this approach.

Let's start with the first problem. What is a suitable hardware masking scheme, in particular with respect to low-latency applications? After our initial research, we decided to use the LUT-based Masked Dual-rail with Pre-charge Logic masking scheme, LMDPL for short, which was presented at CHES 2014 and describes a first-order secure, gate-level masking technique to build protected gadgets. Let me first explain the general construction and operation principle of a protected gadget, before answering the question of why we decided to use this specific masking scheme.

Each LMDPL gadget operates on two Boolean shares in order to provide first-order security. In addition, one of the shares is also provided in complemented form, which is used internally for the dual-rail logic.
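As a minimal illustration of the two-share encoding just described, the following Python sketch splits a single bit into two Boolean shares and derives a dual-rail pair for one of them. The function names are hypothetical, and applying the complement to the masked-value share is an assumption made for illustration only; the talk merely says that one of the shares is also provided in complemented form.

```python
import secrets


def share_bit(x):
    """Split bit x into two Boolean shares (masked value, mask) with x = masked ^ mask."""
    mask = secrets.randbits(1)  # fresh uniform random bit
    return x ^ mask, mask


def unshare_bit(masked, mask):
    """Recombine the two Boolean shares into the original bit."""
    return masked ^ mask


def dual_rail(bit):
    """Return (true rail, complement rail) for one share.

    Assumption: the complemented share is the masked value; during evaluation
    exactly one of the two rails is 1.
    """
    return bit, bit ^ 1
```

Either share on its own is a uniformly random bit, which is the intuition behind first-order security: an observer must combine both shares to learn anything about the original value.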
Moreover, each gadget is constructed from two separate layers. First, the masked table generation layer uses the first share and some fresh randomness to generate the masked lookup table needed to process the second share. Second, the operation layer processes the second share and its complement using the generated masked lookup table. However, the randomness and the masked table are only required for nonlinear gadgets, while linear gadgets have independent layers and each share is processed individually. As we can already see, the data path within each gadget is fully combinational, independent of the function implemented by the gadget.

But we must also note some restrictions when using and implementing LMDPL gadgets. First, we can only use monotonic gates within the operation layer. Second, each gadget is operated in two phases, called pre-charge and evaluation. Apart from these restrictions, LMDPL seemed to be a good and simple candidate. In fact, after our initial research, we decided to use this masking scheme mainly for two reasons. First, as already mentioned, the data path in each gadget is fully combinational, making it a good candidate for low-latency optimizations. Second, the original LMDPL paper already presents an AES architecture based on LMDPL that has a latency of two cycles per round. As we aimed for a latency of a single cycle per round, LMDPL seemed to be a good starting point for our purposes.

Having explained the basic construction and operation principle of an LMDPL gadget, we now take a closer look at the theoretical security, in particular with respect to the composability of gadgets and security in the presence of glitches, which naturally occur in hardware implementations and may reduce the security of practical realizations. In the original LMDPL paper from CHES 2014, the security of a nonlinear gadget was investigated using activity image analysis.
That is, given a certain unprotected input, no information leaks through the gate and wire toggles. This analysis does not discuss the composability of the gadget directly. However, the authors also show the security of an AES implementation, in which multiple such gadgets are composed, using practical FPGA analysis. Exactly how and why the synchronization points or registers need to be introduced is also not detailed.

In this work, we investigate the gadgets under the glitch-extended probing security model. This model has gained traction in recent years, as it provides a formal setting that can be directly linked to practical security, given the independent leakage assumption, on a circuit with non-ideal gates. In order to do so, we first analyze the strong non-interference of a nonlinear gadget, specifically an AND gate, and discuss how, combined with the non-interference of linear gadgets, this enables direct composition for first-order security. We further discuss which synchronization elements are strictly needed and which can be relaxed due to the inherent synchronization of monotonic gates. Note that our investigation is limited to first-order security, which is the security level provided by the LMDPL gates, and is not directly extendable to higher orders.

Before giving a high-level view of the security argument, it is useful to recall that, as is done in the literature, we distinguish between an intermediate probe, which gives information up to the closest previous synchronization point, and an output probe, which gives information on the value itself. A gadget is 1-G-SNI, that is, strongly non-interfering at first order in the presence of glitches, if two conditions hold. If the observation is a single intermediate probe, the information gained from this observation is independent of at least one share. If the observation is a single output probe, the information gained from this observation must be independent of all input shares. Our security argument relies on two implementation assumptions.
First, the operation layer circuit, which processes shared sensitive information, is implemented using monotonic gates and wiring only. Second, each monotonic gate is pre-charged. The structure of the circuit and of the monotonic gates is kept intact using don't-touch constraints during synthesis. Given these assumptions, we observe that the gates in the operation layer do not glitch. Therefore, a probe on the output of a gate cannot be extended to give information on all of its inputs. The information gained from the output of a gate is only its value itself. We checked that for each SI the information is independent of at least one input share and for X2 it is independent of both input shares. The mask table generation layer depends on only one share and additional randomness. Hence, it can be implemented completely independently, without monotonic gates or registers. Finally, the mask table output, which crosses domains, needs to be registered to ensure pre-charging and synchronization. For more details on this, and for further discussion on the absence of timing leakage, we refer to our paper.

Next, we briefly discuss the latency overhead of this scheme and whether we can reduce the original two-cycle-per-round construction to a single-cycle-per-round construction. To explain our approach to reducing the latency overhead, we take the example of the original LMDPL AES design and briefly discuss the main observations that led to a latency reduction. The original LMDPL construction has two register stages for a two-step calculation of pre-charge and evaluate. Initially, the registers before the field inversion are pre-charged, and the mask share and table generation is active to generate the first table. Next, the field inversion is evaluated while the mask share and table generation is inactive and the table remains stable.
Lastly, the remaining linear AES round operations are evaluated while the mask share and table generation becomes active again and the registers before the field inversion are pre-charged. For the original LMDPL design, this two-phase operation ensures and maintains glitch-free evaluation.

However, based on this, we observed that if the operation layer is pre-charged and if the table generation of all LMDPL gadgets is synchronized, then we can freely compose any LMDPL gadgets without any intermediate register stages. Hence, we can combine all AES round operations, linear and nonlinear, into a single block of combinational logic composed of LMDPL gadgets. Following the principle of duality, we duplicate the LMDPL operation layer in order to ensure that one layer is always active while the other layer is pre-charged for the next operation. Fortunately, the mask table generation layer does not have to be duplicated, since it does not need a pre-charge phase and can be re-evaluated in every cycle.

With this, the first operation layer is initially pre-charged and the mask generation is active, while the second operation layer is idle. In the next cycle, the first layer is evaluated while, simultaneously, the second operation layer is pre-charged and the corresponding masks are generated. After that, the first operation layer is pre-charged again and the corresponding masks are generated, while the second operation layer is evaluated. In summary, this principle can be extended arbitrarily: we can implement any M-round algorithm in M/N cycles, provided that we have an active N-round evaluation circuit in every cycle.

Next, before we show practical security evaluation results for our single-cycle-per-round AES architecture, we briefly provide implementation results and a comparison to existing low-latency implementations and architectures. For our architecture, we implemented six different options with varying functionality and security.
For this, we differentiate between encryption-only, decryption-only and combined encryption-decryption architectures. For all three choices, we implemented options with standard or protected key expansion. In addition, all our designs include a dedicated entropy engine, implemented as a PRNG, that provides the necessary fresh entropy on the fly during operation. Naturally, the PRNG is larger when the key expansion is also protected, as we need fresh random masks within the key schedule S-boxes. Further, as we can see, most of the area is consumed by the data operation layer, as it is fully implemented using LMDPL gadgets and is also duplicated to ensure the one-cycle-per-round operation. In contrast, as already mentioned, the mask table generation is not pre-charged, and hence we did not need to duplicate this module.

Since the S-box is the nonlinear and crucial part of our protected architecture, we take a closer look at our S-box construction and provide a brief comparison to existing state-of-the-art S-box architectures. In total, our S-box construction takes 3.45 kGE (kilo gate equivalents) in the GF 28 nm cell library, where the operation layer takes roughly 1.1 kGE and the mask table generation takes about 600 GE. In addition, the 288-bit mask table register takes almost half of the area, about 1.7 kGE. With this design, however, we provide a single-cycle S-box that needs 36 bits of randomness per clock cycle to ensure secure operation.

For the comparison, we focus our discussion on those alternative solutions that also address the principle of low-latency architectures. The only single-cycle construction we could find in the literature was proposed by Gross et al. In comparison to our design, however, that approach is very large and uses a lot more fresh randomness during operation.
For constructions with a latency of two cycles, we would mainly like to mention the approach of Leiserson et al., which is the original LMDPL design, and an improved version of the construction of Gross et al. that reduces area and randomness but still compares unfavorably to our design. Lastly, we would like to mention the construction of De Meyer et al., which is smaller and needs fewer random bits, but has a latency of three cycles when instantiated in a full, round-based AES architecture, as this construction must address the zero-value problem of the inversion separately.

Finally, we present the results of our practical security evaluations based on the test vector leakage assessment (TVLA) methodology. Our global security goal was to provide resistance against such an analysis, even when an attacker can observe and record up to 100 million power traces. For this, we performed an analysis based on Welch's t-test using fixed versus random plaintexts as classes. On the left side of the figure, we show the evaluation results for the protected AES-128, with a first-order statistical moment analysis on the upper left and a second-order statistical moment analysis on the lower left. The average power trace is shown on the upper right. We do not observe any significant leakage during the entire AES operation, which takes 10 cycles. However, if we disable the PRNG, which is shown on the lower right, we can observe significant leakage even for a low number of traces. In conclusion, we can say that we achieved our initial security goal: our AES architecture provides resistance against such an analysis with up to 100 million power traces, as we cannot observe any significant leakage in our TVLA results for the first- and second-order statistical moments.

As an extended security goal, we also wanted to investigate the resistance of our AES architecture beyond the limit of 100 million power traces that we set for our initial security goal.
For this, we extended our analysis and recorded and evaluated 1 billion power traces. Again, we show the TVLA evaluation results on the left side of the figure, with the first-order statistical moment analysis on the upper left and the second-order statistical moment analysis on the lower left. This time, however, we can observe some significant leakage for both the first-order and the second-order statistical moment, as indicated by the peaks exceeding the thresholds of plus and minus 4.5. For a more detailed analysis, we also provide the maximum and minimum leakage for the first- and second-order statistical moments as a function of the number of traces. The upper right side shows the evolution of the maximum and minimum t-statistics over the number of traces for the first-order statistical moment, while the lower right side shows the same evolution for the second-order statistical moment. In conclusion, for our extended security analysis, we can state that we only observe significant leakage after recording and analyzing 400 million power traces for the first-order statistical moment and 220 million traces for the second-order statistical moment. However, we would also like to mention that all our attempts to perform a key recovery based on those leakages were unsuccessful.

Finally, we also performed a multivariate security analysis based on bivariate statistics. For this, we captured a full AES operation with 500 samples per trace and normalized all samples in each trace by subtracting the corresponding mean samples. Then, the bivariate analysis was performed by combining two samples through multiplication of the normalized samples. The figure on the left side shows the bivariate analysis results. Gray dots indicate t-statistics within the thresholds of plus and minus 4.5, while red and blue dots indicate significant t-statistics exceeding the thresholds.
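The bivariate preprocessing described above (mean subtraction, multiplication of two sample points, then a fixed-versus-random Welch's t-test against the thresholds of plus and minus 4.5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation code used for the talk: the function names are hypothetical, traces are assumed to be plain lists of floats, and centering each class with its own per-sample mean is an assumption, since the talk does not say which mean is subtracted.

```python
import math
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (fmean(a) - fmean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def center(traces):
    """Subtract the per-sample mean (taken over all traces of one class)."""
    n_samples = len(traces[0])
    means = [fmean([t[i] for t in traces]) for i in range(n_samples)]
    return [[t[i] - means[i] for i in range(n_samples)] for t in traces]


def bivariate_t(fixed, rnd, i, j):
    """t-statistic for the product of centered samples i and j.

    `fixed` and `rnd` are the trace sets for the fixed and the random
    plaintext class. Assumption: each class is centered separately.
    """
    fc, rc = center(fixed), center(rnd)
    return welch_t([t[i] * t[j] for t in fc], [t[i] * t[j] for t in rc])


THRESHOLD = 4.5  # |t| above this value is treated as significant leakage
```

Evaluating `bivariate_t` for every pair `(i, j)` with `i < j` yields one triangle of a plot like the one described here; with 500 samples per trace that is roughly 125,000 pairs, so a practical evaluation over many traces would use incremental moment computation rather than this direct form.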
Further, the lower triangle of the plot shows the bivariate evaluation results for an unprotected design with disabled PRNG, based on 100,000 recorded traces, while the upper triangle shows the evaluation results for the protected design, based on 1 billion recorded power traces. For the multivariate analysis, we can conclude that we observe some weak but expected leakage, but only when using as many as 1 billion power traces.

To conclude this presentation, we can state that we presented a hardware-masked, single-cycle-per-round AES implementation and proved its first-order security under the glitch-extended probing model. In addition, we addressed the initially formulated problems. LMDPL is a good candidate for low-latency masking schemes in hardware. It is secure even in the presence of glitches, and gadgets can be composed securely to build complete encryption and decryption architectures that are resistant to side-channel analysis. LMDPL can be implemented with zero latency overhead compared to unprotected designs when using the principle of duality. We presented practically secure round-based AES architectures that resisted side-channel analysis even when observing and evaluating up to 100 million power traces.

With this, I would like to conclude this presentation. Thank you for your attention and for watching this recording. If you have any questions or remarks, feel free to reach out to me or any of my co-authors. For the sake of completeness, we also provide a list of the references used throughout this presentation.