Hello, I'm Thinh Pham from the University of Bristol. In this presentation, I will talk about our CHES 2021 paper titled An Instruction Set Extension to Support Software-Based Masking. This is joint work with Si Gao, Johann Großschädl, Ben Marshall, Dan Page, and Francesco Regazzoni. In the paper, we focus on the implementation of masking countermeasures against side-channel attacks. I think side-channel attacks, SCA, are by now a well-known topic, especially at CHES. At a high level, SCA is a kind of implementation attack: it targets a concrete implementation and potentially bypasses the theoretical security properties. In particular, SCA exploits information leakage through side channels to recover sensitive data. Basically, if an implementation is executed by a target device, then as an attacker you can monitor its computation by capturing execution time or power consumption as side-channel information. If the computation involves security-critical information, then analyzing the side-channel information can potentially reveal some knowledge about it. Power side-channel attacks are generally categorized into two types: simple power analysis, SPA, and differential power analysis, DPA. SPA can potentially reveal sensitive information by observing the power consumption of a single execution. In contrast, DPA extracts sensitive information by statistically analyzing numerous power measurements of executions on different inputs. On the other hand, masking and hiding are two widely used countermeasures against SCA. Hiding countermeasures try to reduce the signal-to-noise ratio of the leaking information, while masking countermeasures, which are our focus, mask the sensitive information with random shares unknown to attackers. They can be viewed as a low-level analogue of the computing-on-encrypted-data concept. These countermeasures are increasingly well understood by now.
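As a minimal sketch of the masking idea described above, the following Python models first-order Boolean masking: a sensitive value is split into two random-looking shares, and a linear operation such as XOR is computed share by share. The function names and the 32-bit word size are assumptions made for illustration; they are not taken from the paper.

```python
import secrets

WIDTH = 32  # assumed word size, chosen only for illustration


def bool_mask(x):
    """Split x into two Boolean shares (x0, x1) such that x == x0 ^ x1."""
    r = secrets.randbits(WIDTH)  # fresh random mask, unknown to the attacker
    return x ^ r, r


def bool_unmask(shares):
    """Recombine the two shares into the plain value."""
    x0, x1 = shares
    return x0 ^ x1


def bool_remask(shares):
    """Refresh both shares with new randomness; the encoded value is unchanged."""
    x0, x1 = shares
    r = secrets.randbits(WIDTH)
    return x0 ^ r, x1 ^ r


def masked_xor(a, b):
    """XOR is linear, so it can be applied to each share separately."""
    return a[0] ^ b[0], a[1] ^ b[1]


a = bool_mask(0x1234)
b = bool_mask(0x00FF)
assert bool_unmask(masked_xor(a, b)) == 0x1234 ^ 0x00FF
```

Each share on its own is uniformly random, so observing only one of them reveals nothing about the masked value; this is the property that a first-order attack cannot get around.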
In theory, first-order masking resists first-order attacks but remains vulnerable to second-order attacks. In turn, such attacks can be mitigated by second-order masking. In addition, masking can be applied at various levels, in hardware and/or software. For a masking implementation, there are at least two significant high-level challenges that must be addressed. First, it must translate the theoretical security properties into practice. Second, it suffers large increases in performance and implementation overheads. More specifically, implementing software-based masking with guaranteed security can be a non-trivial task due to the leakage that stems from the underlying micro-architecture. Such an implementation also suffers significant overheads in terms of, for example, execution latency, code density, and the demand for high-quality randomness. On the other hand, hardware-based masking faces difficulties in mitigating the glitch-related leakage that occurs at the gate level, and it imposes large overheads in terms of area and energy consumption. Moreover, hardware-based masking is inflexible in the face of the novel designs that appear regularly. We can pay a higher cost to address the leakage, following approaches such as order reduction or threshold implementation. Such approaches employ additional random shares to maintain the guaranteed security level despite the leakage. For example, a threshold implementation uses three instead of two shares for first-order masking. Alternatively, we can try to solve the problem with a collaborative, carefully considered approach. In this paper, we explore the use of an instruction set extension, ISE, as a means of supporting masking in software-based implementations of cryptography. An ISE is well positioned to act as an interface between hardware and software, which enables a collaborative hardware-software approach to the challenges of masking implementations.
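To illustrate why leakage from the micro-architecture is a problem for masked software, here is a small sketch under an assumed Hamming-distance leakage model, in which overwriting a register leaks the number of bits that flip. The model and all names are assumptions for illustration and are not taken from the talk.

```python
import secrets


def hamming_weight(v):
    """Number of set bits in v."""
    return bin(v).count("1")


def register_overwrite_leakage(old, new):
    """Assumed Hamming-distance model: leakage equals the number of flipped bits."""
    return hamming_weight(old ^ new)


secret = 0xA5
share1 = secrets.randbits(8)
share0 = secret ^ share1  # first-order Boolean masking: secret == share0 ^ share1

# If share1 overwrites a register that still holds share0, the bits that flip
# are exactly the bits of share0 ^ share1, which is the unmasked secret.
leak = register_overwrite_leakage(share0, share1)
assert leak == hamming_weight(secret)
```

Under this model the leakage depends only on the secret, whatever the random mask was, even though the program never computes the unmasked value explicitly.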
We believe an ISE approach has at least three attractive potential benefits. First, it can mitigate the leakage stemming from the underlying micro-architecture, which is complicated to do in software. Second, it offers flexibility through ISE-assisted software, which makes it possible to apply the approach to novel cryptographic designs. And finally, it offers an affordable compromise, improving execution latency versus a software alternative and area efficiency versus a hardware alternative. Regarding the security and performance of masking, our paper makes three main contributions. First, we introduce the design of an enriched ISE with a wider set of operations. Second, we present an area-efficient and leakage-aware implementation of the ISE within an existing RISC-V micro-architecture. Finally, we conduct a leakage evaluation and a quantitative analysis of the overheads on a wide range of ISE-assisted cryptographic software. As a result, we show that it is possible to achieve secure masking using dedicated instructions without duplicating the data path, which existing hardware and ISE-based approaches often require. At a high level, we aim to support operations on operands in either Boolean or arithmetic masking representations. The instructions are designed to execute three types of operations, namely representation conversion, unary operations, and binary operations. Concretely, we design four classes of masking ISE by functionality, namely class A for arithmetic masking, class B for Boolean masking, class C for conversion between Boolean and arithmetic masking, and class F for field arithmetic. The A-class ISE includes a set of instructions that support first-order arithmetic masking. They allow masking, unmasking, remasking, and masked operations including addition and subtraction. Likewise, the B-class ISE consists of a set of instructions that support first-order Boolean masking.
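The arithmetic representation and the conversion between representations can be sketched in software as follows. The conversion uses Goubin's well-known Boolean-to-arithmetic algorithm as a stand-in; the talk does not say which algorithm the conversion instructions implement in hardware, so this choice, the 32-bit width, and all function names are assumptions.

```python
import secrets

WIDTH = 32            # assumed word size
MOD = 1 << WIDTH


def arith_mask(x):
    """Arithmetic shares (a, r) such that x == (a + r) mod 2^WIDTH."""
    r = secrets.randbits(WIDTH)
    return (x - r) % MOD, r


def arith_unmask(shares):
    a, r = shares
    return (a + r) % MOD


def arith_masked_add(s, t):
    """Addition is linear under arithmetic masking: add share by share."""
    return (s[0] + t[0]) % MOD, (s[1] + t[1]) % MOD


def bool_to_arith(shares):
    """Goubin-style conversion of Boolean shares (xp, r), x == xp ^ r,
    into arithmetic shares (a, r), x == (a + r) mod 2^WIDTH,
    without ever computing x itself."""
    xp, r = shares
    g = secrets.randbits(WIDTH)  # auxiliary fresh randomness
    t = xp ^ g
    t = (t - g) % MOD
    t ^= xp
    g ^= r
    a = xp ^ g
    a = (a - g) % MOD
    a ^= t
    return a, r


x = 0x12345678
r = secrets.randbits(WIDTH)
assert arith_unmask(bool_to_arith((x ^ r, r))) == x
```

A sequence like this takes several base instructions per conversion in software, which gives some intuition for why a dedicated conversion instruction can reduce instruction and cycle counts.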
The B-class instructions allow masking, unmasking, remasking, and masked operations, including bitwise operations, for example NOT, AND, XOR, and rotations, and arithmetic operations such as addition and subtraction. The C-class ISE supports the conversion of operands from Boolean masking to arithmetic masking and vice versa. Finally, the F-class ISE includes Boolean-masked arithmetic operations in a finite field. They are especially useful and generic enough to support masked AES and AES-like designs, for example SM4 or Camellia. At the system level, we design and implement the ISE on our SCARV core, a five-stage pipelined RV32IMCB microcontroller. We introduce a masking-specific ALU to execute the ISE in the execute stage. We employ a paired register file and try to minimize changes to the base core's data path when accommodating the additional operands of the masking operations. In particular, we take care to mitigate the possibility of accidental share combination along the data path. The masking-specific ALU is the main component that executes the masking ISE. The masked ALU is designed and implemented from the following modules, which support the masked operations in an area-efficient and leakage-aware manner. The random bit generator, RBG, generates random masks for the masking and remasking operations in the masked ALU. Based on the demand of the masked operations, it includes two instances. Each instance uses a hybrid design motivated by the trade-off between area, throughput, and randomness quality; it includes both pseudo-random and true random components. The A- and F-class instructions are supported by separate modules. These instructions are executed in an area-optimized module, which leverages integration and functionality reuse between operations. In particular, we carefully mitigate glitch-related leakage in the masked ALU by adopting strategies based on domain-oriented masking, DOM. DOM is applied to the non-linear masked operations in the masked ALU.
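As a functional sketch of a domain-oriented masked AND, the following Python models the first-order DOM-independent variant, which assumes the two inputs were masked with independent randomness. It only checks the arithmetic of the shares: Python cannot model glitches or registers, so the comment about the register stage describes what a hardware design would do there. The names and the 32-bit width are assumptions for illustration.

```python
import secrets

WIDTH = 32  # assumed word size


def dom_indep_and(x, y):
    """First-order DOM-independent AND on Boolean shares.

    x = (x0, x1) and y = (y0, y1); the result (q0, q1) satisfies
    q0 ^ q1 == (x0 ^ x1) & (y0 ^ y1).
    """
    x0, x1 = x
    y0, y1 = y
    z = secrets.randbits(WIDTH)  # fresh randomness for the cross-domain terms

    # Inner-domain terms: each uses shares of a single domain only.
    inner0 = x0 & y0
    inner1 = x1 & y1

    # Cross-domain terms mix shares from both domains, so each is remasked
    # with z before it is combined. In hardware, a register stage would sit
    # here so that glitches cannot propagate across domains.
    cross0 = (x0 & y1) ^ z
    cross1 = (x1 & y0) ^ z

    return inner0 ^ cross0, inner1 ^ cross1


a, b = 0xF0F0A5A5, 0x0FF0FFFF
ra, rb = secrets.randbits(WIDTH), secrets.randbits(WIDTH)
q0, q1 = dom_indep_and((a ^ ra, ra), (b ^ rb, rb))
assert q0 ^ q1 == a & b
```

The same structure carries over to a masked field multiplier by replacing the bitwise AND with a finite-field multiplication.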
Examples of such non-linear masked operations are the Boolean masked AND and the Boolean masked field multiplier. The figures show the circuits of the masked AND and the masked field multiplier. As we can see, the circuits follow two principles. The first is to separate the operations on associated shares into their own domains. The second is to insert latching and remasking steps for cross-domain operations to prevent glitch-related leakage. Moreover, we carefully select the suitable variant, DOM-independent or DOM-dependent, according to the dependency between the two input operands. For example, DOM-dependent is applied for the masked AND and DOM-independent for the masked field multiplier. In addition, we insert additional remasking steps where necessary or possible. This is a conservative decision with respect to leakage. In particular, we adopt double-pumped clocking to mitigate the latency overhead caused by the cross-domain latching: for example, the input and output flip-flops operate on different clock edges from the latching ones. To evaluate the area overhead, we consider two variants compared to the baseline core without the ISE, namely ISE-CBA, supporting the A, B, and C classes, and ISE-CBAF, supporting the A, B, C, and F classes. We synthesize the systems for two implementation targets, FPGA and ASIC. The results are reported in the table. We can see that the area overhead of ISE-CBA is fairly modest. When we support the F-class instructions, this overhead increases further in ISE-CBAF. One can view this as a trade-off made specifically to support AES and AES-like ciphers. After implementing the masking ISE in hardware, we evaluate its utility through ISE-assisted software for a range of cryptographic kernels. While implementing such ISE-assisted software, we face some challenges that are much like those of masking software using the base ISA only. For example, first, there are various design choices regarding function inlining and loop unrolling for an implementation.
Second, there is register access: making efficient use of a limited number of registers and register pairs, and avoiding leakage from moving shares to and from registers. And third, there is the possibility of accidental unmasking caused by speculative execution. However, we found that implementing a masked function using the ISE takes less effort than using the base ISA. Securing such an implementation involves a similar process, but using the ISE is easier because the program is shorter. Generally, the cryptographic kernels are implemented in assembly language, and we address the implementation challenges manually. In addition, for a comprehensive evaluation of the ISE's utility, we implement at least three variants of each kernel, namely unmasked for an unprotected implementation, ISA-masked for a masked implementation using only the base ISA, and ISE-masked for a masked implementation using the proposed ISE. Due to the importance of AES, we particularly focus on it. By design, AES can be implemented on hardware and software platforms with different approaches. For the unmasked implementation, we evaluate various implementation approaches, namely tech mode, T-table, and ISE-supported acceleration. That provides a multi-dimensional comparison. We found that not all of the implementations above are easy to mask. For the ISA-masked implementation, we adopt the Rivain-Prouff scheme. This scheme computes field inversions using squarings and multiplications. For the ISE-masked implementation, we also leverage the Rivain-Prouff scheme, and we employ the masking ISE to compute the masked operations in the scheme. As expected, the masking ISE brings a significant performance boost, as we can see in the table. To demonstrate the generality of the ISE, we also evaluated its use in other cryptographic kernels, namely SM4, Speck, Sparx, and ChaCha20. As you can see in the table, the base ISA-masked implementations suffer enormously increased overheads.
The ISE-assisted masking, as expected, gains more than one order of magnitude in overhead reduction in terms of instruction count and cycle count compared with the base ISA alternative. Regarding the security of masking, our strategy mainly relies on the DOM approach for the ISE implementation in hardware, as mentioned above. We then validate it empirically by evaluating the leakage of larger cryptographic kernels whose implementations are constructed using the glitch-resilient ISE. In this regard, our FPGA implementation of the masking ISE is built on the SASEBO-GIII side-channel analysis platform, and we use TVLA-based leakage detection to evaluate the security of the masked cryptographic kernels. For example, the figures show the leakage evaluation for the AES kernels. Each evaluation employs a set of 100,000 power consumption traces. The blue plots show the t-values of an evaluation, and the red lines are the thresholds for leakage detection in a TVLA evaluation. As we can see, leakage is clearly observed in the evaluation of the unmasked implementation. With carefully implemented masking, the ISA-masked and ISE-masked implementations keep the leakage below the thresholds. The leakage evaluation of the other kernels yields the same results. In summary, we presented the design, implementation, and evaluation of an ISE to support software-based masking. Our evaluation suggests the ISE can support efficient, secure first-order masked software. For example, the ISE-assisted masked implementation of AES is an order of magnitude more efficient than the software-only alternative. Moreover, our ISE maintains the flexibility of software to suit several cryptographic algorithms while incurring an acceptable inherent area overhead. However, using the ISE is not straightforward and demands care: for example, ensuring non-interaction between shares, loading and storing shares, and dealing with register pressure.
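The TVLA-based leakage detection used in the evaluation above rests on Welch's t-test, computed independently at every sample point of the traces. The following is a minimal sketch of that statistic. It assumes the common fixed-versus-random setup and the commonly used threshold of 4.5; the talk states neither, and the function names and toy data are hypothetical.

```python
import math
from statistics import mean, variance

TVLA_THRESHOLD = 4.5  # commonly used value; assumed, not stated in the talk


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def tvla_t_values(fixed_traces, random_traces):
    """One t-value per sample point; each trace is an equal-length list of floats."""
    return [
        welch_t(f, r)
        for f, r in zip(zip(*fixed_traces), zip(*random_traces))
    ]


# Toy data: sample point 0 behaves identically in both sets,
# sample point 1 differs strongly between them.
fixed = [[0.0, 1.0], [1.0, 0.0]] * 50
rand = [[0.0, 10.0], [1.0, 11.0]] * 50
t = tvla_t_values(fixed, rand)
leaky_points = [i for i, v in enumerate(t) if abs(v) > TVLA_THRESHOLD]
assert leaky_points == [1]
```

A t-value beyond the threshold at some sample point indicates detectable first-order leakage there; staying within the threshold means the test found no evidence of leakage with the given number of traces, which is not a proof of security.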
Through the experience of implementing and applying the masking ISE to a range of cryptographic kernels, and through the results, we recognize various higher-level directions that represent important or interesting future work. For example, it is interesting to look at how an ISE-assisted approach can support generalized masking schemes such as higher-order masking, and how the ISE can be changed to support flexible word sizes such as 16-bit and 64-bit. We also recognize that the masking ISE makes implementing secure masked software less complicated, which potentially enables automatic generation of masked implementations. So it is interesting to look at how such a masking ISE can be integrated into an automatic tool like Tornado. In addition, there are possible extensions to support post-quantum cryptography, given its long-term importance. In general, the ISE can have further extensions to assist with remaining challenges, for example secure register access and share-aware memory access. That concludes my presentation. Of course, some details may have been unclear in the presentation, so I would like to encourage you to read the paper for the full technical details. I look forward to answering any questions you may have in the live session, or you can drop us an email using the addresses on the first slide. Thank you for listening.