The paper "Tornado: Automatic Generation of Probing-Secure Masked Bitsliced Implementations" is a collaboration between Pierre-Évariste Dagand and Darius Mercadier from LIP6, and Sonia Belaïd, Matthieu Rivain and myself, Raphaël Wintersdorff, from CryptoExperts. I will present this work with the help of Darius. As you surely know, real-world devices can leak some information about the operations they perform, for example through their power consumption or electromagnetic emanations. This leakage can be exploited, sometimes with the help of statistical analysis, to gain knowledge about the secrets handled by the device. There are two main ways to deal with these so-called side-channel attacks: one is to reduce the amount of information that leaks, and the other is to make sure the leakage is uncorrelated with the secret data. Masking is a common countermeasure of the latter category. In this talk we present Tornado, a tool that generates masked implementations from high-level specifications. Tornado takes as input code written in a programming language called Usuba, which was designed to efficiently compile high-level specifications of ciphers for high-end x86 CPUs. Tornado compiles these high-level specifications into masked implementations in assembly for embedded devices. We will start by explaining what masking is and why it is a useful countermeasure against side-channel attacks, followed by a presentation of the two attacker models we use. We will then explain how the tool works, first exposing the underlying security notions it relies on and showing how detecting an attack can be interpreted as a linear algebra problem that can be solved algorithmically. We will finally describe how Tornado works and evaluate its performance on various ciphers. Masking aims at making the secrets uncorrelated with the leaked information.
Masking consists in replacing every variable with a sharing, which is a tuple of shares chosen randomly except for the last one, such that they verify the completeness relationship, meaning that the value a sharing represents is equal to the sum of its shares for a given additive group law. Note that this completeness relationship implies that in order to get the value represented by a sharing, one needs all the shares, as any strict subset is indistinguishable from a random distribution. In the simple case of Boolean circuits, in order to mask an implementation, one also needs to redefine the operations so that they now work with sharings instead of Boolean variables. These new operations, called gadgets, can be seen as black boxes taking sharings as inputs and outputting a sharing. In addition, we also define a special kind of gadget, called a refresh gadget, which aims at renewing the randomness of a sharing. In the first model we chose, the bit probing model, the adversary is allowed to place a certain number t of probes inside a circuit, meaning that they have access to the values of t chosen variables inside the masked Boolean circuit, possibly variables that are located inside gadgets. The property one wants the circuit to achieve is tight t-probing security. In other words, an attacker must not be able to retrieve any secret value with t probes inside a (t+1)-shared circuit, that is, a circuit where the length of the sharings is t+1. While this model is relevant in a hardware scenario, it is not enough when we consider masked software implementations, where variables are manipulated in registers in which all the bits leak together. To address this, we define the register probing model, where the attacker has access to the description of such a masked software implementation in the form of a circuit composed of gadgets processing shares which are now vectors. Just as before, the attacker can choose t probes inside a (t+1)-shared circuit.
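As a concrete illustration of sharings and the completeness relationship, here is a minimal sketch in C (not Tornado's generated code; `rand()` stands in for the cryptographically secure RNG a real implementation would use):

```c
#include <stdint.h>
#include <stdlib.h>

#define T 2              /* number of probes tolerated */
#define NSHARES (T + 1)  /* sharing length t + 1       */

/* Split `secret` into NSHARES shares: all but the last are random,
 * and the last is chosen so that the XOR of all shares (the additive
 * group law here) equals the secret -- the completeness relationship. */
void share(uint8_t secret, uint8_t shares[NSHARES]) {
    uint8_t acc = 0;
    for (int i = 0; i < NSHARES - 1; i++) {
        shares[i] = (uint8_t)rand();  /* use a CSPRNG in practice */
        acc ^= shares[i];
    }
    shares[NSHARES - 1] = secret ^ acc;
}

/* Recombine: one needs *all* the shares to recover the value;
 * any strict subset is uniformly distributed. */
uint8_t unshare(const uint8_t shares[NSHARES]) {
    uint8_t v = 0;
    for (int i = 0; i < NSHARES; i++)
        v ^= shares[i];
    return v;
}
```

A gadget for a linear operation such as XOR can then simply act share-wise, while non-linear gadgets need fresh randomness, as discussed below.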
These circuits can now contain some new gadgets. More precisely, the gadgets we consider are addition and multiplication gadgets that perform the logical XOR and AND operations, as well as rotation and shift gadgets, plus refresh gadgets. In order to produce an algorithm that can verify whether a circuit is t-probing secure, we first need to define the problem in mathematical terms. We do so by introducing a security game and reducing it to a linear algebra problem. In the game that corresponds to the t-probing security definition, the adversary chooses the inputs of a (t+1)-shared circuit and the placement of t probes, and the simulator tries to simulate the distribution of the values of the probed variables without knowledge of the inputs. The circuit is t-probing secure if and only if the simulator can output the same distribution as the one observed by the adversary. We then use a series of similar games that are all equivalent to one another. We won't go through these games in this talk, but let's see what happens in the last game. Since the multiplication and refresh gadgets introduce some fresh randomness, we can consider their outputs as fresh new inputs, so the circuit can be transformed into a functionally equivalent one of multiplicative depth one. Also, in this last game, the adversary is only allowed to probe pairs of inputs of multiplication gadgets; that means that by placing one probe inside a multiplication gadget, the adversary can get one share of both inputs. This final game can be interpreted as a linear algebra problem. Since the multiplicative depth of the circuit is one, there are only linear operations, rotations and shifts, before the multiplications, and thus the coordinates of every share in the inputs of a multiplication gadget are linear combinations of the coordinates of the input shares. One probe on a pair of inputs of a multiplication gadget can now be seen as a pair of matrices that we call blocks.
And t-probing security is now equivalent to the following property: no matter how an adversary chooses the placement of the t probes and concatenates the 2t input blocks into t+1 matrices, the intersection of the images of these matrices must always be trivial. In our example, by placing probes on the two rightmost multiplications, the adversary can get the two pairs of input blocks (a, b) and (c, b3). And since c is the sum of a and b, and b3 is just b rotated by 3, it is possible to find a way to create three matrices whose images intersect. We can thus declare that the circuit is not tight t-probing secure for t = 2. This attack allows an attacker to retrieve the secret value of b entirely. In order to prevent this kind of attack from happening, one needs to add a refresh gadget. Thanks to the linear algebra formulation, verifying whether a circuit is tight t-probing secure for any value of t can be automated. This is what the tool called tightPROVE+ is made for. Given an unmasked version of an algorithm as input, it outputs whether the circuit is tight t-probing secure, as well as a security proof. It does so by creating a directed acyclic graph for every input of the multiplication gadgets, where the branches diverge when different input blocks are chosen for different potential attacks. Each node contains some information about how a potential attack could be made. Notably, it contains what we call a permissive attack span, which represents the intersection of the images of the matrices we are trying to construct to create an attack when we follow that path in the graph. We generate these graphs layer by layer until all the permissive attack spans are empty or an attack is found. If no attack is found in any of the graphs, then the circuit is tight t-probing secure for any t, and otherwise it is possible to describe an attack explicitly thanks to the information contained within the graphs.
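The core linear-algebra check can be sketched as follows (a toy version, not tightPROVE+'s actual data structures). Over GF(2), the images of two matrices M1 and M2 intersect non-trivially exactly when the concatenated matrix [M1 | M2] has a kernel vector whose M1-part combines to a nonzero vector v, since then v = M1·x = M2·y:

```c
#include <stdint.h>

/* Columns are GF(2) bit-vectors: bit r of a column is its entry in
 * row r.  Returns 1 if the images (column spans) of m1 and m2 share
 * a nonzero vector, 0 otherwise.  Supports up to 32 rows and up to
 * 32 columns in total -- enough for a sketch. */
int images_intersect(const uint32_t *m1, int c1,
                     const uint32_t *m2, int c2) {
    uint32_t pcol[32] = {0}, pcomb[32] = {0};  /* pivots, indexed by leading bit */
    for (int j = 0; j < c1 + c2; j++) {
        uint32_t col  = (j < c1) ? m1[j] : m2[j - c1];
        uint32_t comb = 1u << j;          /* which original columns we combined */
        while (col) {
            int b = __builtin_ctz(col);   /* leading (lowest set) bit */
            if (!pcol[b]) { pcol[b] = col; pcomb[b] = comb; break; }
            col ^= pcol[b]; comb ^= pcomb[b];  /* reduce by existing pivot */
        }
        if (col == 0) {                   /* comb is a kernel vector of [m1|m2] */
            uint32_t v = 0;
            for (int i = 0; i < c1; i++)
                if (comb & (1u << i)) v ^= m1[i];
            if (v) return 1;              /* nonzero v lies in both images */
        }
    }
    return 0;
}
```

For instance, if m1 spans {e1} and m2 spans {e1+e2, e2}, the function reports an intersection, because e1 = (e1+e2) XOR e2.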
In our example, let's describe what happens when the algorithm chooses b3 as the first probed input block. Also, let's not forget that this means that we get c for free, since by probing the corresponding multiplication, an adversary can get both input blocks. The algorithm creates the graph for b3, and then proceeds to choose the next input block. As we've seen before, by choosing either a or b, an attacker can get both by probing the corresponding multiplication, and thus an attack is found. For the other input blocks, no attack is found. Note that the graph is not always of depth 2; its depth depends on the number of multiplications needed to create an attack. The purpose of tightPROVE+ is not limited to verifying the security of circuits; it has another purpose, which is to protect circuits that are not t-probing secure, by adding refresh gadgets at carefully chosen locations on its own, until the circuit is secure. To do so, it first analyzes the sub-circuits that appear frequently in the main circuit, because if a refresh gadget needs to be placed there, it will need to be placed every time this sub-circuit is called. Then, if there are still some probing attacks, the algorithm will refresh the operand that is present in the largest number of attacks. This method is bound to eventually stop and yield a secure circuit, but it is not guaranteed to find the optimal number of refresh gadgets needed to make the circuit secure. Finding this optimal placement of refresh gadgets is left to future research. I will now let Darius continue the talk and tell you about how Tornado is built. Now that we have tightPROVE+ to verify the t-probing security of a circuit in either the register probing or the bit probing model, we are interested in automating the generation of probing-secure masked implementations. Those implementations will make use of a programming trick introduced by Biham, called bitslicing, that I am going to present in the next few slides.
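The refresh gadgets that get inserted can be sketched as follows. This is a simple circular refresh, one of several known constructions; whether a given refresh construction suffices at a given order is precisely the kind of question the security analysis must settle, so take this only as a shape illustration:

```c
#include <stdint.h>
#include <stdlib.h>

/* Circular refresh: renews the randomness of a sharing without
 * changing the value it represents.  Each fresh mask r[i] is XORed
 * into exactly two shares, so the XOR of all shares is unchanged.
 * Supports up to 32 shares in this sketch. */
void refresh(uint8_t *shares, int nshares) {
    uint8_t r[32];
    for (int i = 0; i < nshares; i++)
        r[i] = (uint8_t)rand();  /* use a CSPRNG in practice */
    for (int i = 0; i < nshares; i++)
        shares[i] ^= r[i] ^ r[(i + 1) % nshares];
}
```

After the call, every individual share has been re-randomized, which breaks the linear relationships between gadget inputs that the attacks above exploit.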
In the bit probing model, tightPROVE+ works only on Boolean circuits. To show what this implies, let's consider the example of a circuit that computes a left rotation on a 5-bit variable x and XORs the result with a 5-bit variable y. I am using 5-bit variables for simplicity, but of course a real-world cipher would rather use 32-bit or 64-bit variables. Since the variables are each 5 bits rather than Booleans, tightPROVE+ cannot analyze this circuit in the bit probing model directly. Instead, we must expand each variable into 5 variables and expand each operator as well. The left rotation becomes simple wiring, while the 5-bit XOR becomes 5 one-bit XORs. The code for this Boolean circuit can thus be written as 5 assignments and 5 XORs. We can then remove the temporary variables and we get the following program. tightPROVE+ can now be used to verify that the circuit is secure in the bit probing model. Actually, in this setting, the bit probing and the register probing models are the same, since each variable or register only contains a single bit. The issue with this representation, however, is its low performance in software. What was initially two operations, a rotation and a XOR, is now 5 XORs. In a real-life cipher, variables would probably be 32 or 64 bits rather than 5, and you would then have to compute 32 or 64 XORs rather than 1 XOR and 1 rotation. In order to efficiently run the circuit in software, we can use a programming technique called bitslicing. The idea of bitslicing is to represent an n-bit variable as 1 bit in each of n registers. If we take for instance one of the 5-bit variables of the previous example, we would need 5 registers to store it. If those registers are 32 bits wide, then there are 31 unused bits in each. The idea of bitslicing is to take subsequent independent inputs and store them in those registers in the same fashion.
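Concretely, the expansion looks like this (a sketch assuming a rotation by 1, since the talk does not fix the rotation amount): the rotation is pure re-wiring of bit indices, and the 5-bit XOR becomes 5 single-bit XORs:

```c
#include <stdint.h>

/* Word-level version: one rotation and one XOR on a 5-bit value. */
uint8_t rot_xor_word(uint8_t x, uint8_t y) {
    uint8_t rot = (uint8_t)(((x << 1) | (x >> 4)) & 0x1f);  /* x <<< 1 */
    return y ^ rot;
}

/* Bit-level expansion: the rotation is just an index shift (wiring),
 * and the 5-bit XOR becomes 5 one-bit XORs.  x[0] is the LSB. */
void rot_xor_bits(const uint8_t x[5], uint8_t y[5]) {
    y[0] ^= x[4];
    y[1] ^= x[0];
    y[2] ^= x[1];
    y[3] ^= x[2];
    y[4] ^= x[3];
}
```

Both versions compute the same function; only the bit-level one is a Boolean circuit that the bit probing analysis can process directly.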
The second input would go into the bit of the second rank of the registers, the third input into the bit of the third rank, the fourth input into the fourth rank, and so on until the registers are full. Now, computing a bitwise operation between two registers, like a XOR for instance, computes it as many times in parallel as there are bits in the registers; with 32-bit registers, that is 32 parallel XORs at once. Recall the example of the previous slide, where 1 XOR and 1 rotation were transformed into 5 XORs. Executing those 5 XORs on bitsliced data in 32-bit registers computes the circuit 32 times in parallel, and is thus very efficient. Still, bitslicing is somewhat limited: a 128-bit plaintext becomes 128 registers, whereas modern ARM CPUs only have 16 registers. Thus, bitslicing introduces a lot of spilling, which means that data are often moved back and forth between registers and memory, which is harmful to performance. n-slicing is a generalization of bitslicing which reduces register pressure and thus usually yields better performance. In n-slicing, instead of storing one bit in each register, several bits of the same input are stored in each register. For instance, a 64-bit input could be split as 16 bits in each of 4 registers. The remaining empty bits of the registers can still be filled with subsequent independent inputs. Applying a bitwise operation on those registers thus computes it on 16 bits per input, for several inputs in parallel. Some ciphers naturally rely on this technique to compute their S-boxes in parallel, like Serpent, which represents its plaintext as 4 times 32 bits and can thus compute 32 S-boxes in parallel for a single plaintext. The register probing model is the more realistic one for n-slicing, since it assumes that a probe can retrieve a whole register and thus gain information on multiple bits related to the same input at once. To analyze n-sliced code, we thus need the register probing model extension of tightPROVE+.
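The bitsliced execution of the earlier rotation-and-XOR circuit can be sketched like this (assuming again a rotation by 1; `slice5` is a hypothetical name for the sliced state, where `b[i]` holds bit i of 32 independent 5-bit inputs, one input per bit position of a 32-bit register):

```c
#include <stdint.h>

/* Bitsliced state: b[i] holds bit i of 32 independent 5-bit inputs. */
typedef struct { uint32_t b[5]; } slice5;

/* y ^= (x <<< 1), computed for all 32 inputs at once: the rotation is
 * free (register renaming), and each of the 5 XORs acts on 32 inputs. */
void rot_xor_sliced(const slice5 *x, slice5 *y) {
    y->b[0] ^= x->b[4];
    y->b[1] ^= x->b[0];
    y->b[2] ^= x->b[1];
    y->b[3] ^= x->b[2];
    y->b[4] ^= x->b[3];
}

/* Transpose 32 independent 5-bit values into bitsliced form. */
void to_sliced(const uint8_t in[32], slice5 *out) {
    for (int i = 0; i < 5; i++) {
        out->b[i] = 0;
        for (int j = 0; j < 32; j++)
            out->b[i] |= (uint32_t)((in[j] >> i) & 1) << j;
    }
}
```

The 5 XORs are the same as in the expanded Boolean circuit, but each one now processes 32 inputs in parallel, which is where the performance of bitslicing comes from.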
And we now arrive at Tornado. Tornado is a tool we developed to automatically generate bitsliced and n-sliced masked implementations from high-level specifications. It is built from the Usuba programming language, which already does automatic bitslicing and n-slicing. We modified the Usuba compiler to automatically mask ciphers as well, and to use tightPROVE+ to make sure that the generated implementations are probing secure. Tornado takes as input Usuba code, so Usuba is a high-level programming language for cryptography. Here you can see for instance the Usuba code corresponding to the example used earlier to present tightPROVE+. This code is written as a node f which takes 5 inputs on 32 bits and returns a single output on 32 bits as well. The first step of Tornado's pipeline is to use the Usuba compiler to normalize this code down to Usuba0, a low-level subset of Usuba. On this example, Tornado can either bitslice or n-slice the code. Bitslicing would produce the following Usuba0 code. Since it is quite unreadable and huge, we will instead focus on the n-sliced code, which does not require much transformation from the Usuba source. Still, in Usuba0, equations are not nested, unlike in Usuba, hence the few temporary variables here. Note that not every cipher is n-sliced by Tornado, though. In particular, Tornado does not n-slice ciphers whose n-sliced implementations would require a lot of bit manipulations and thus be prohibitively expensive. This can be the case, for instance, of ciphers relying on lookup tables whose designers, unlike some others, did not anticipate the need to bitslice a lookup table. The next step is to send this Usuba0 code to tightPROVE+. The input syntax of tightPROVE+ is fairly close to Usuba0, and this step is fairly straightforward. tightPROVE+ will then analyze the code and insert refreshes if necessary; on this example, tightPROVE+ chooses to refresh t5, as you can see.
The output of tightPROVE+ is then translated back to Usuba0 with the refreshes. As you can see, the Usuba0 program is now the same as before, with just an additional refresh. Since tightPROVE+ can take a long time to verify a given program, there is a cache to avoid recomputing known results. The Usuba0 code is then masked. This transformation is done by replacing each variable with an array of shares and replacing each operator with a masked gadget. For linear operations, the masked gadget is written directly in Usuba as a loop, while for non-linear operations and refreshes, it is written as a function call. We then have an optimization pass to speed up the generated code. For instance, loops used to compute masked linear operations are fused when possible. On this example, this optimization reduces the number of loops from 9 to 3, which is going to improve performance. Finally, the masked Usuba0 code is compiled to C, and then to assembly or binary using GCC. We will now show some benchmarks, and I'll let Raphaël present the verification part of those benchmarks. We applied tightPROVE+ to implementations of 11 primitives used in the round-2 candidates of the NIST lightweight cryptography standardization process, chosen for their ability to be efficiently masked. The order of magnitude for the time it takes to verify the security of such implementations in the bit probing model ranges from a few minutes to a few days. This disparity can be explained by the size of the circuits and, more importantly, by the number of multiplications, but it is also caused by the linear relationships between the inputs of the multiplication gadgets. Also, we can see with Spongent that the verification time does not increase linearly with the size of the circuit: even though we multiplied the number of multiplications by 10, it took about 200 times longer to verify the security of 10 rounds of Spongent than of one round.
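The masking transformation described above can be sketched as follows: a standard share-wise XOR for linear operations, and an ISW-style AND gadget for non-linear ones. This illustrates the shape of the gadgets, not Tornado's exact generated code, and `rand()` again stands in for a proper RNG:

```c
#include <stdint.h>
#include <stdlib.h>

#define NSHARES 3  /* t + 1 shares for masking order t = 2 */

/* Linear gadget: a masked XOR is simply computed share-wise. */
void masked_xor(uint32_t c[NSHARES],
                const uint32_t a[NSHARES], const uint32_t b[NSHARES]) {
    for (int i = 0; i < NSHARES; i++)
        c[i] = a[i] ^ b[i];
}

/* Non-linear gadget: ISW-style masked AND.  The cross terms make its
 * cost quadratic in the number of shares, unlike the linear gadget. */
void masked_and(uint32_t c[NSHARES],
                const uint32_t a[NSHARES], const uint32_t b[NSHARES]) {
    for (int i = 0; i < NSHARES; i++)
        c[i] = a[i] & b[i];
    for (int i = 0; i < NSHARES; i++)
        for (int j = i + 1; j < NSHARES; j++) {
            uint32_t r = (uint32_t)rand();  /* use a CSPRNG in practice */
            c[i] ^= r;
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i]);
        }
}
```

This quadratic-versus-linear cost difference between the two gadgets is exactly what explains, in the benchmarks below, why ciphers with few multiplications scale better at high masking orders.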
In the bit probing model, no attack was found on any implementation. This does not mean that no attack can exist in this model: it is possible to handcraft circuits that are not secure, but large circuits tend to be secure in this model. Verifying security in the register probing model is usually faster than in the bit probing model, as there are typically 32 or 64 times fewer multiplications to examine. Contrary to what happens in the bit probing model, it is very easy, even for large circuits, to find flaws in the register probing model. This explains why we found attacks on the implementations of Clyde, ACE and Gimli. Our tool added to their circuits 6, 384 and 120 refresh gadgets respectively. For the first two circuits, this is proven to be optimal; as for Gimli, this has not been proven to be optimal. We wish to make it clear that attacks are dependent on implementations, so the ciphers are not at fault here, only our implementations of them. And now, Darius will end this talk by telling you about the performance of our tool. We benchmarked the implementations generated by Tornado for the aforementioned 11 NIST candidates in both bitsliced and n-sliced modes. In this talk though, we will only focus on the n-sliced performance; please have a look at the paper for the bitsliced benchmarks. We report in this chart the performance of each cipher in cycles on an ARM Cortex-M4, depending on the masking order. Note the logarithmic scale on the vertical axis. Without masking, performance ranges from barely more than 1,000 cycles for Clyde and Ascon up to more than 10,000 cycles for Pyjamask and Gift. At order 3, performance is between 30,000 and 300,000 cycles. At order 31, between 500,000 and 4.5 million cycles. And finally, at order 127, between 30 million and 300 million cycles. Tornado can thus reach very high masking orders with decent performance.
The only masked implementation we found in the NIST submissions was in the Pyjamask submission, and this implementation is about twice as fast as the implementation generated by Tornado at any given order. Considering that this implementation was carefully hand-tuned and is written in a mix of C and assembly, the performance of Tornado seems quite reasonable for an automated tool. We observe that ciphers with a low number of multiplications perform better at higher masking orders, which is to be expected, since the cost of a multiplication is quadratic in the masking order, while the cost of masking a linear operation is linear. Pyjamask, for instance, is fairly slow when unmasked because of an expensive multiplication with a constant matrix. However, since it uses fewer multiplications than all the other ciphers except Clyde, it overtakes them one after the other; at order 127, Pyjamask is the second fastest cipher. Similarly, at order 127, Clyde, with the lowest number of multiplications, is more than 10 times faster than ACE, which has the most multiplications. The contributions we made in this work are threefold. First, we developed tightPROVE+, an extension of tightPROVE, which can prove the tight t-probing security of a circuit in the register probing model and insert refreshes to protect vulnerable circuits. Second, we combined tightPROVE+ and Usuba into a tool named Tornado, which can automatically generate sliced masked implementations of ciphers from high-level specifications. And finally, we evaluated Tornado on 11 candidates of the NIST lightweight cryptography competition, compared their performance, and showed that three of our implementations needed additional refreshes to be secure in the register probing model. One of the areas of improvement of this work concerns the placement of the refreshes inserted by tightPROVE+: ideally, we would like to be able to show that the number of refreshes inserted by Tornado is minimal.
Another area of improvement is performance: we would like to improve the performance of the generated code in order to be as efficient as hand-tuned implementations. Thank you for your attention.