Hello and welcome to this presentation of the paper "The SPEEDY Family of Block Ciphers: Engineering an Ultra-Low-Latency Cipher from Gate Level for Secure Processor Architectures". My name is Thorben Moos and this is joint work with Gregor Leander and Amir Moradi from Ruhr University Bochum and Shahram Rasoolzadeh from Radboud University in Nijmegen. The fundamental research question we were trying to tackle in this work is: how do we design a secure encryption algorithm whose hardware implementation is fast? In order to answer this question we first need to clarify what kind of hardware implementation we are talking about and what our definition of fast is. By hardware implementation we mean a circuit representation of the encryption algorithm constructed entirely from CMOS gates. Why do we talk specifically about CMOS gates? Because CMOS has been the de facto standard for integrated circuit fabrication for more than 40 years now, and that situation is not going to change anytime soon. By fast we mean that the time period between providing the input signals to the circuit and receiving the correct, stable final output signals should be minimal. Since the inclusion of memory elements to store intermediate results inevitably increases this time period, we only consider circuits made entirely from combinational logic gates. In other words, we are interested in the minimum latency of fully unrolled implementations of encryption algorithms. Now, this is not the first work in this area. A number of low-latency cryptographic primitives which can be used to encrypt data in a short period of time have been proposed in the past. The first notable one is the PRINCE block cipher, proposed at ASIACRYPT 2012. This block cipher received a small redesign to increase its security level at SAC 2020, and the result is called PRINCE version 2. Midori is shown in brackets here because it is not primarily a low-latency cipher but rather a low-energy design.
Finally, the two tweakable block ciphers MANTIS and QARMA were proposed at CRYPTO 2016 and FSE 2017 respectively. Apart from regular block ciphers, there are also two other cryptographic primitives which are relevant to this field. First of all, the cross-platform permutation Gimli. This one is also shown in brackets here because it is not primarily a low-latency design; however, recent works have claimed that its fully unrolled implementation is rather fast in hardware. Orthros is not a keyed permutation like a block cipher, but a PRF, or pseudo-random function, which is based on the sum of two keyed permutations. As a result, Orthros is not invertible, which means that its outputs cannot be decrypted. In the SAC 2020 paper "PRINCEv2: More Security for (Almost) No Overhead", a latency comparison of the previously listed block ciphers (and only the block ciphers) in the NanGate open cell libraries is presented. In this figure, the leftmost dot of each line represents the minimum latency that could be achieved; the farther left it is, the smaller the minimum latency. Clearly, PRINCE and PRINCE version 2 are faster than the competition in this comparison in NanGate 45 nanometer technology. Comparable results are achieved in NanGate 15 nanometer technology, so it seems that both PRINCE versions are among the fastest existing block ciphers. Of course, this comparison is not entirely fair, since tweakable block ciphers like MANTIS and QARMA are expected to require a larger circuit depth, and since Midori is a low-energy design first and foremost, not optimized for maximum single-cycle encryption speed. Now, it is important to understand that PRINCE, PRINCE version 2, Midori, MANTIS and QARMA are all lightweight block ciphers, and lightweight block ciphers are typically supposed to be suitable for resource-constrained environments like battery-powered devices in the IoT. In particular, they are designed to be area- and energy-efficient.
We thought that if you take that requirement away, you might be able to develop something much more performant. Additionally, the listed low-latency block ciphers are designed to offer both encryption and decryption functionality in one circuit without a significant area or energy overhead. For that purpose, PRINCE for example introduced the alpha reflection property, which allows encrypting and decrypting data with essentially the same circuit. This is another restriction on the design space. Finally, we found that the gate- and transistor-level characteristics of the underlying hardware have hardly ever been considered at the design level in the previous constructions. This all led us to the question whether we might be able to design a cipher that is faster than the state of the art by focusing on maximum encryption speed and security only, while taking the low-level hardware characteristics into account. In particular, we were interested in designing a block cipher which is able to encrypt data faster than the state of the art without necessarily keeping it super lightweight and without paying attention to the efficiency or the overhead of the decryption. This is essentially what this presentation is about. The potential applications for such a cipher design are found in the area of high-end CPUs. If the myriad of microarchitectural attacks over the past couple of years has taught us one thing, it is that the security architectures of modern CPUs require improvement. A lot of potential solutions have been suggested in the literature, and it can be observed that many of them call for a higher level of encrypted communication inside of CPUs and between CPUs and their surrounding components. This includes secure caches based on address encryption. This includes memory encryption of essentially all storage elements inside and outside of the CPU. And this includes pointer authentication, as implemented using QARMA in ARM processors.
We believe that many more such features will be needed and implemented in future CPU generations, and the one requirement they all have in common is that they need super-performant cryptographic primitives to avoid a large performance penalty. Okay, so much for the introduction. Let's jump directly into our latency considerations. First, we concentrate on the latency of individual CMOS logic gates. If you look closely at the way static CMOS gates are constructed, namely from a pull-up network made from PMOS transistors and a pull-down network made from NMOS transistors, it becomes clear that CMOS logic gates are naturally inverting. Consider the example on this slide. The left figure shows the three-input NAND gate, which is an inverted-output AND, and the right side shows the three-input AND gate. In CMOS hardware, the AND gate is realized by concatenating a NAND gate and an inverter gate. Clearly, the logic gate on the left side can be evaluated faster in hardware than the gate on the right side, due to the second stage of pull-up and pull-down networks that the signal needs to pass through in the right figure. The same can be observed for other logic gates, like OR for example. Additionally, CMOS gates require a minimum of 2N transistors to realize any logic function, where N is the number of inputs. As you can see, the three-input NAND is realizable in the minimum of 2 × 3 = 6 transistors, while the AND gate is not. This prompted us to introduce a notion of natural CMOS gates. Our definition calls all inverting logic cells which can be realized as static CMOS gates in a single stage of 2N MOSFETs natural CMOS gates, or NCGs. We argue that logic cells with this property are immensely beneficial for low-latency constructions. Common NCGs include the inverter or NOT gate, the NAND and NOR gates, and the complex or compound logic gates AND-OR-Invert (AOI) and OR-AND-Invert (OAI).
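To make the NCG notion a bit more concrete: a single-stage static CMOS gate pulls its output low through the NMOS network and high through the PMOS network, so raising any input can only ever pull the output down, never up. The following minimal Python sketch (an illustration of ours, not from the paper) brute-forces this "negative unateness" property over the truth table and confirms that NAND, NOR and OAI qualify as single-stage inverting gates, while AND does not.

```python
from itertools import product

def is_negative_unate(f, n):
    """Return True if raising any single input from 0 to 1 never raises
    the output from 0 to 1 -- the behavior of a single-stage (inverting)
    static CMOS gate."""
    for bits in product((0, 1), repeat=n):
        for i in range(n):
            if bits[i] == 0:
                raised = bits[:i] + (1,) + bits[i + 1:]
                if f(*bits) == 0 and f(*raised) == 1:
                    return False
    return True

# Candidate gate functions (1 = logic high)
nand3 = lambda a, b, c: 1 - (a & b & c)
nor3  = lambda a, b, c: 1 - (a | b | c)
oai21 = lambda a, b, c: 1 - ((a | b) & c)   # OR-AND-Invert
and3  = lambda a, b, c: a & b & c           # needs an extra inverter stage

print(is_negative_unate(nand3, 3))  # True
print(is_negative_unate(nor3, 3))   # True
print(is_negative_unate(oai21, 3))  # True
print(is_negative_unate(and3, 3))   # False
```

So the non-inverting AND gate fails the check, matching the observation that it must be built from a NAND followed by an inverter.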
However, we don't stop there as far as categorizing logic gates is concerned, and look even deeper into the hardware characteristics. In fact, PMOS transistors in silicon hardware are known to have a higher ON resistance than NMOS transistors, because holes have a smaller mobility than electrons. This means that a switched-on PMOS transistor is less conductive than a switched-on NMOS transistor of the same size. The effect of the different ON resistances is even amplified when the layout of a logic gate requires multiple PMOS transistors to be stacked, which means connected in series. As a result, the three-input NAND on the left side can charge its output much faster than the three-input NOR on the right side. Therefore, it is clear that natural CMOS gates which require no or only small PMOS stacks in their layout are the optimal choice for low-latency constructions. Such gates include the inverter or NOT gate, the NAND gates and the OAI gates. This observation is universal for any CMOS standard cell library. As a quick example, we have listed the base latencies of different logic gates in the NanGate open cell libraries in these tables. For each fan-in number, which is the number of inputs N, we have highlighted the minimum latency. The tables also contain the fan-in-to-latency ratio, FLR for short, which allows comparing gates across their fan-in classes. As expected, inverter, NAND and OAI gates are the fastest logic gates with respect to their number of inputs. However, it is insufficient to look only at the latency of individual logic gates. We also need to consider the impact on the latency when logic gates are connected into logic circuits. In this regard, we first want to dispel two common myths. The first one is that each CMOS standard cell has a fixed delay, and that each instantiation of the same exact standard cell adds approximately the same latency to a path. Interestingly, this is false.
The propagation delay of a CMOS cell always depends significantly on its direct electrical environment. In particular, the delay is always a function of the transition time of the input signals to the cell, as well as the capacitive load that the CMOS cell needs to drive at its own output. The variation of the delay of a cell instance depending on its electrical environment can easily be in the range of 200 to 300 percent. Another myth is that adding a gate to a path of a circuit, while not making any other changes to the path, will always increase the path's latency. This is also false. Often, adding a well-placed buffer or inverter (where logically applicable) to a path in order to charge a significant capacitive load faster can decrease the overall latency of the path. Therefore, the mere gate depth is not always indicative of the latency of a circuit. Generally, the topology of a circuit, primarily the fan-out of the logic cells, is similarly important as the number and type of gates in its critical path when determining the latency. Here is a quick example demonstrating the previous two explanations. On the left side, we have an XOR gate which drives eight further XOR gates. That means the first-stage XOR has a fan-out of eight and needs to drive a significant capacitive load to charge the input pins of the second-stage XORs. The base latency of a two-input XOR is about five picoseconds in this technology. However, due to the capacitive load that the first-stage XOR needs to drive, it has a latency of about 21 picoseconds, while the second-stage XORs have a maximum latency of eight picoseconds because they are driven by a signal with a large transition time. What the synthesizer will do in such a case is either upsize the drive strength of the first-stage XOR, if such cells are available, or add a buffer with higher drive strength between the first two stages. The result of the second option is depicted in the right figure.
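The effect in this example can be reproduced with a toy first-order delay model in which a gate's delay grows with the load it drives, approximated here by its fanout. All coefficients below are hypothetical placeholders of ours, not values from any real standard cell library; the point is only that inserting a buffer can lower the total path delay despite adding a gate stage.

```python
# Toy first-order delay model: delay = intrinsic delay + load-dependent
# term, with the load approximated by the number of driven pins (fanout).
# All coefficients are hypothetical, purely for illustration.
XOR_BASE, XOR_PER_LOAD = 5.0, 2.0   # picoseconds
BUF_BASE, BUF_PER_LOAD = 2.0, 1.0   # picoseconds

def xor_delay(fanout):
    return XOR_BASE + XOR_PER_LOAD * fanout

def buf_delay(fanout):
    return BUF_BASE + BUF_PER_LOAD * fanout

# Left circuit: one XOR directly drives eight second-stage XORs.
direct = xor_delay(8) + xor_delay(1)

# Right circuit: the first XOR drives a single buffer, which in turn
# drives the eight second-stage XORs -- one gate deeper, yet faster.
buffered = xor_delay(1) + buf_delay(8) + xor_delay(1)

print(direct, buffered)  # the buffered path has the smaller total delay
```

Under this model the buffered path wins because the strong buffer absorbs the large capacitive load, exactly the trade-off the synthesizer exploits automatically.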
Now the first-stage XOR needs to drive only one pin, and the second-stage XORs are driven by a signal with a much smaller transition time. Therefore, despite increasing the gate depth by 50% on the right side, the overall latency is decreased by about 35% compared to the left. This is of course nothing we need to do by hand, because it is done automatically by the synthesis tool. Our point here is simply that the topology of a circuit plays a pretty significant role in its latency. We have used all the hardware characteristics discussed above and tried to find optimal round operations for an ultra-low-latency cipher, which we then called SPEEDY, for obvious reasons. This is the heart and centerpiece of the SPEEDY block cipher, namely its 6-bit S-box. It has a uniformity of 8 and a linearity of 24, and provides full diffusion. At the top you can see that each input bit is buffered and inverted in parallel. Due to the necessity to buffer each input bit anyway because of the fan-outs, the inverters do not add any latency here. At the bottom we see that each coordinate function is realized as a two-level NAND tree. As we have seen in our previous discussions, NAND gates are among the fastest logic gates for any number of inputs. The search for this particular S-box was not trivial and took a lot of time; more details in that regard can be found in the paper. In order to analyze how our new S-box compares to other cryptographic S-boxes, we have collected the minimum achievable latencies of a number of S-boxes in six different standard cell libraries. Two of those libraries are the NanGate open cell libraries; the other four are manufacturable foundry libraries. The individual results per technology are listed in the paper. Here we compare the average normalized latencies of the S-boxes across all six standard cell libraries. For the comparison we have selected all S-boxes that have been proposed in the low-latency literature, which are four-bit S-boxes only.
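The two-level NAND-tree structure of the S-box coordinate functions can be sketched in Python as follows. Note that the wiring below is a hypothetical illustration of ours showing only the structure; the actual SPEEDY S-box netlist is given in the paper.

```python
from itertools import product

def nand(*xs):
    """Multi-input NAND on 0/1 values."""
    out = 1
    for x in xs:
        out &= x
    return 1 - out

def coordinate(x0, x1, x2, x3, x4, x5):
    """One output bit of a 6-bit S-box, built as a two-level NAND tree.
    The complemented inputs come 'for free' from the inverters that
    are needed anyway to buffer each input bit. The wiring here is
    hypothetical -- the real SPEEDY netlist is in the paper."""
    n1, n2, n4 = 1 - x1, 1 - x2, 1 - x4
    # first NAND level, all gates evaluated in parallel
    t0 = nand(x0, n1, x2)
    t1 = nand(x1, n2, x3)
    t2 = nand(n4, x5)
    # second NAND level produces the output bit
    return nand(t0, t1, t2)

# The critical path is only two NAND stages (plus the input buffering),
# regardless of which of the 64 possible inputs is applied.
table = [coordinate(*bits) for bits in product((0, 1), repeat=6)]
print(len(table))
```

Six such coordinate functions, sharing the buffered and inverted input bits, make up one full 6-bit S-box layer.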
Additionally, we have added the Ascon 5-bit S-box, the DES S1 S-box and the AES S-box. Of course, those last three are not supposed to offer a low latency; they are merely included to provide a better perspective with respect to larger S-boxes. The results show that the SPEEDY S-box is impressively fast for a six-bit S-box, even faster than some of the four-bit S-boxes. Please note that we distinguish between SPEEDY with a star and SPEEDY without a star here. For SPEEDY with a star, we have replicated the exact gate-level description shown on the previous slide. For SPEEDY without a star, we have only given the behavioral description to the synthesizer, that is, in a table-lookup manner, which is what we have done for all the other S-boxes as well. Please also note that the inverse SPEEDY S-box, which is ranked third from the right, is not required for the encryption algorithm; therefore its latency is not of primary importance here. We have chosen a rather strong and expensive linear layer in order to limit the number of required rounds to a minimum. This figure shows the combined MixColumns, AddRoundKey and AddRoundConstant operation. It is realized as a three-level XOR tree based on two-input XORs. We have also experimented with three-input and four-input XORs to reduce the number of stages, but those did not lead to a reduction of the latency. This is the structure of the whole cipher. Please note that the S-box and the ShiftColumns operation are applied twice each before one expensive combined MixColumns, AddRoundKey and AddRoundConstant operation is executed. While the block and key size are flexible in the SPEEDY family, we have concentrated on 192-bit block and key sizes, as 192 is the least common multiple of 6, which is the width of the S-box, and 64, which is the common data width of modern CPUs. The number of rounds can be used to adjust the security-versus-latency tradeoff.
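Two small pieces of arithmetic from above can be checked directly: the depth of a balanced tree of 2-input XORs, and the least-common-multiple argument behind the 192-bit parameter choice. This is a quick sanity-check sketch, not code from the paper (the exact number of XOR terms per output bit of the linear layer is given there).

```python
import math

def xor_tree_depth(n_terms):
    """Number of 2-input XOR stages in a balanced XOR tree over n_terms inputs."""
    return math.ceil(math.log2(n_terms))

# A three-level tree of 2-input XORs combines up to 2**3 = 8 terms per
# output bit, which bounds what the combined MixColumns / AddRoundKey /
# AddRoundConstant layer can sum in three stages.
print(xor_tree_depth(8))   # 3

# Parameter choice: 192 is the least common multiple of the 6-bit
# S-box width and the 64-bit data width of modern CPUs.
print(math.lcm(6, 64))     # 192
```

Note that `math.lcm` requires Python 3.9 or newer.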
In this work we concentrate on the SPEEDY versions with five, six and seven rounds. We claim that attacks on SPEEDY with 192-bit block and key size and five rounds require a time complexity of at least 2^128 when the data complexity is limited to 2^64. For six and seven rounds we claim 128-bit and 192-bit security respectively, without restrictions on the data complexity. Therefore, only for SPEEDY with seven rounds do we claim that exhaustive key search is the best attack. For the variants with fewer rounds, better attacks than exhaustive key search are expected, but hopefully not below our claimed complexities. For comparison, PRINCE claims that attacks against it require 2^(127-n) time complexity for a 2^n data complexity, and PRINCE version 2 claims 2^112 time complexity for 2^50 data complexity. In that regard, SPEEDY with five rounds already claims a significantly higher security level than PRINCE and PRINCE version 2, which up to now are the fastest known block ciphers. This figure shows a comparison of the average normalized latency of the full encryption algorithms across six different standard cell libraries. Clearly, SPEEDY with five and six rounds, constituting the four leftmost bars here, outperforms all other low-latency primitives despite its comparatively large 192-bit state. Even with seven rounds, SPEEDY outperforms quite a number of other low-latency constructions. The full comparison can be found in the paper. Regarding area comparisons of block ciphers with different block lengths, it is common to consider the area per bit instead of the total area, and although this was not a focus of this work, we can see that the area per bit of SPEEDY with five rounds is actually lower than that of PRINCE and PRINCE version 2. Even SPEEDY with six rounds is not far behind. It becomes clear that the dedicated lightweight block ciphers are not necessarily more area-efficient than SPEEDY.
To summarize, we have created a new ultra-low-latency block cipher which outperforms the state of the art in terms of execution time when implemented in CMOS technology. SPEEDY with five rounds is about a quarter faster than PRINCE and PRINCE version 2, while occupying less area per bit and providing a significantly higher security level, in case our claims prove to be accurate over time. SPEEDY with six rounds is 8 to 12 percent faster than PRINCE and PRINCE version 2, and 3 to 7 percent faster than Orthros, which claims the same security level but is not invertible. SPEEDY with seven rounds provides full 192-bit security while being less than 8 percent slower than PRINCE and PRINCE version 2. Thus, we believe that SPEEDY is a great choice for any application where high encryption speed and high security are the primary requirements. Thank you very much for your attention. If there are any questions, feel free to ask them during the live session at CHES 2021 on September 17th. See you there.