Hi, everyone. Thanks for tuning in and welcome to my presentation of the paper Unrolled Cryptography on Silicon: A Physical Security Analysis. I hope this talk will give you a basic understanding of the most important aspects of this work, but in case any questions arise with respect to the paper or this video, you can either ask them during the live session at CHES on Tuesday, September 15th, or you can directly contact me via email and I will get back to you as soon as possible. So what is this presentation all about? Essentially, we have designed a test chip in 40 nanometer CMOS technology, which includes, among some other features, an unrolled cryptographic primitive, and we are going to carefully analyze this primitive's physical security with respect to passive and non-invasive side-channel attacks. What do I mean by unrolled cryptographic primitive here? In the last decade, cryptographic algorithms whose full execution can be performed at a very high speed in hardware have gained more and more popularity. In particular, we are talking about primitives that can be executed in a single clock cycle, but still at a moderately high operating frequency. Obviously, in order to achieve this, the primitive should have a short critical path in circuit representation when implemented without any memory elements like flip-flops. The term unrolled refers to the fact that block ciphers and similar cryptographic primitives typically follow an iterative structure where the data to be processed is repeatedly passed through a round function, but for single-cycle operation, these rounds have to be laid out one after the other and no part can be reused. That process is called unrolling, and it means that the size of the implementation directly depends on the number of rounds that your primitive requires.
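As an illustrative aside, not part of the talk itself, the difference between an iterative and an unrolled implementation can be sketched in software with a toy round function. The round function below is a made-up placeholder, not PRINCE's actual round; in the unrolled variant, each call corresponds to a separate physical copy of the round logic in hardware.

```python
# Toy sketch (not from the talk) of iterative vs. unrolled evaluation.
# round_fn is a made-up placeholder, not PRINCE's actual round function.
MASK64 = (1 << 64) - 1

def round_fn(state, round_key):
    # Placeholder round: key addition followed by a 1-bit rotation.
    state ^= round_key
    return ((state << 1) | (state >> 63)) & MASK64

def encrypt_iterative(plaintext, round_keys):
    # Round-based hardware: one round circuit reused every clock cycle,
    # with a register holding the state in between rounds.
    state = plaintext
    for rk in round_keys:
        state = round_fn(state, rk)
    return state

def encrypt_unrolled_3(plaintext, round_keys):
    # Unrolled hardware: every line below corresponds to a separate
    # physical copy of the round logic, with no registers in between,
    # so the whole computation completes in a single clock cycle.
    state = round_fn(plaintext, round_keys[0])
    state = round_fn(state, round_keys[1])
    state = round_fn(state, round_keys[2])
    return state

# Both compute the same function; only the circuit structure differs.
assert encrypt_iterative(0x0123456789ABCDEF, [1, 2, 3]) == \
       encrypt_unrolled_3(0x0123456789ABCDEF, [1, 2, 3])
```

The area of the unrolled variant grows linearly with the number of rounds, which is exactly the trade-off discussed next.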
Think of the PRESENT block cipher, for example, which can be implemented really tiny in hardware, but if you have to unroll its 31 rounds, the result will be quite large and slow. A primitive which promises better performance in that regard is the PRINCE block cipher. PRINCE has been developed specifically for high-speed single-cycle encryption and decryption at moderate hardware cost. It was proposed at ASIACRYPT 2012 and is probably quite well known in this community, so I won't recall all the details here. You can see a picture of its overall structure on this slide. Let me just quickly tell you that PRINCE is a 64-bit block cipher with 12 rounds that has an innovative reflection property around the middle, which essentially allows using the same circuit for encryption and decryption, which means that the overhead of implementing one on top of the other is almost negligible. Secure and fast single-cycle encryption and decryption is tempting for a lot of different applications, but memory encryption is probably the most cited one in this context. The core motivation for this work has been that we came to realize that unrolled circuits are pretty hard to protect against side-channel attacks. Glitch-resistant masking, for example, which is arguably the most relevant hardware countermeasure today, cannot directly be applied, as it typically requires register stages for synchronization to prevent glitch propagation between combinatorial subcircuits. In that regard, it adds latency and increases the number of required clock cycles, which is completely undesirable for the kind of application that requires low-latency cryptography. Even when applying so-called generic low-latency masking, the trade-off between the number of clock cycles and the size of the circuit does not really allow preserving the low-latency property of a full cipher very well when trying to stay within any realistic area and randomness bounds.
However, several different reports have indicated that the high parallelism, the asynchronicity, and the speed of execution of unrolled circuits create an inherent resistance to side-channel attacks when making use of a reset between encryptions. So this is something that we take a closer look at in the following. The second part of the motivation is essentially that there has not been any evaluation of a PRINCE hardware implementation on a non-programmable IC so far, at least not reported in public literature to the best of our knowledge. There have been a couple of FPGA case studies, but obviously there is a significant difference between these two hardware platforms, especially when it comes to high-performance implementations like unrolled low-latency ciphers. To get a better impression of that, I listed some numbers here from researchers who have tried to make an independent assessment of the cost of programmability, which means the performance penalty and overhead that you have to accept when implementing a circuit on reprogrammable fabric as opposed to a fixed standard-cell implementation. What they found is that in their comparison, FPGA circuits occupied 35 times as much area, consumed 14 times as much dynamic power, and were more than 4 times slower than equivalent ASIC designs. In that regard, it's safe to assume that it's at least pretty difficult to transfer any conclusions from FPGA to ASIC platforms with respect to physical vulnerability in a meaningful manner. Another emphasis of this work is that we not only analyze the dynamic behavior of the circuit to extract the keys, but also the static behavior.
As you might know, the static power side channel becomes increasingly effective in advanced nanometer technologies, so that is certainly of interest here with the chip being manufactured in 40 nanometer CMOS, but also in general I believe that there are several reasons to not neglect the static leakage as a threat to unrolled cryptography, which we will highlight later in more detail. Before we move on to the experimental analysis of the manufactured ASIC, here are some implementation details and results of post-layout gate-level simulations of the PRINCE core. In short, the core itself comprises about 9,000 logic gates, which occupy about 10,000 gate equivalents on the chip, and the circuit has been synthesized and implemented for 200 MHz operation. When both the key and the plaintext input make a random transition, so change from one random value to another, the 9,000 logic gates are subject to about 115,000 transitions in less than 5 nanoseconds. In other words, each gate in the circuit toggles its output on average about 12 times before the final state of the circuit is reached and stable. Clearly, the majority of all those gate toggles are glitches and in fact unnecessary for the computation of the correct output. This also means that at any point during the execution of the primitive there is a high level of simultaneous computation going on, and the noise level when targeting a small part of the circuit in a typical divide-and-conquer manner will be quite large. When only the plaintext makes a random transition and the key is fixed, still about 57,000 gate toggles are caused, which corresponds to about 6 toggles per gate. The figure on the bottom here shows the first and last occurrences of gate toggles, respectively, when exclusively taking the output gates of the circuit into account.
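The per-gate averages just quoted follow directly from the simulated totals; here is the back-of-the-envelope arithmetic, using the numbers from the talk:

```python
# Back-of-the-envelope check of the per-gate toggle averages quoted above
# (totals taken from the post-layout gate-level simulation results).
gates = 9_000

toggles_random_key_and_pt = 115_000   # key and plaintext both transition
toggles_fixed_key = 57_000            # only the plaintext transitions

print(toggles_random_key_and_pt / gates)  # about 12.8 toggles per gate
print(toggles_fixed_key / gates)          # about 6.3 toggles per gate
```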
Clearly, when the key is changed, the first output toggles occur much earlier, as the round key is propagated to the later rounds at the same time as it is propagated to the first rounds, but even when the key is fixed, the first output transitions occur after only 2.1 nanoseconds. So the first signals have already made their way through all 12 rounds at that point. We have looked a little closer into that and found that, depending on the input transition, the evaluation of an individual S-box in the last round, for example, can occur at very different points in time. This also means that the processing of a certain part of the circuit which may be targeted is heavily misaligned when capturing multiple traces for different input transitions. This should directly affect the ability to perform attacks on later rounds. Now we take a look at our experimental results. First of all, we start with the exploitation of the dynamic power consumption of our target. This first scenario, which we call no reset, is the most trivial one. Each plaintext is processed by the unrolled circuit as soon as it arrives and no cleanup of any kind is performed after the operation. Also, the fixed key is constantly applied to the circuit in the same manner as a hardware-fused fixed key would be. In the figure on the top of the slide you can see an overlay of 30 sample traces, and on the bottom the result of a correlation power analysis attack targeting the Hamming distance between two consecutively processed first-round S-box outputs. This attack works fairly well and is able to recover all key nibbles of the combined 64-bit key that is used in the first round. So in the most trivial scenario, extracting at least parts of the key is very straightforward. Yet, as we will see in a moment, extracting the remaining 64 bits of key material can be challenging at times.
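To make this attack concrete, here is a minimal simulated sketch of such a CPA on one key nibble. The leakage model and noise level are assumptions for illustration, and the real attack of course operates on measured power traces rather than simulated ones; the 4-bit S-box table is intended to be PRINCE's, but any bijective 4-bit S-box would do for this sketch.

```python
import numpy as np

# 4-bit S-box lookup (intended to match PRINCE's S-box; illustrative here)
SBOX = np.array([0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
                 0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4])
HW = np.array([bin(x).count("1") for x in range(16)])  # Hamming weights

def cpa_on_consecutive_sbox_outputs(n_traces=5000, key=0x9, noise=1.0):
    rng = np.random.default_rng(0)
    # No reset: each plaintext nibble overwrites the previous one, so the
    # simulated trace sample leaks the Hamming distance between two
    # consecutively processed S-box outputs, plus Gaussian noise.
    pt = rng.integers(0, 16, n_traces + 1)
    out = SBOX[pt ^ key]
    traces = HW[out[1:] ^ out[:-1]] + rng.normal(0.0, noise, n_traces)
    # For each key guess, correlate the predicted Hamming distance with
    # the traces; the correct nibble should yield the highest correlation.
    scores = []
    for guess in range(16):
        hyp = SBOX[pt ^ guess]
        model = HW[hyp[1:] ^ hyp[:-1]]
        scores.append(abs(np.corrcoef(model, traces)[0, 1]))
    return int(np.argmax(scores))

print(hex(cpa_on_consecutive_sbox_outputs()))  # recovers the key nibble
```

Repeating this divide-and-conquer step for all 16 nibbles yields the combined 64-bit first-round key, mirroring the attack on the slide.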
However, as a first result we can conclude that this unrolled PRINCE core in 40 nanometer ASIC technology is trivially susceptible to simple first-round attacks when no further measures are taken. I have previously mentioned that a reset strategy has been proposed in the literature to mitigate this kind of simple attack. Therefore, as the next step, we test four different reset methods as alternative scenarios to the trivial one. In particular, it has been reported that clearing the data path between encryptions enhances the side-channel protection of an unrolled primitive. The four different methods that we compare here are: first, resetting the plaintext to all zeros between encryptions; second, resetting both the plaintext and the key to all zeros; third, resetting the plaintext to a random value; and fourth, resetting both the plaintext and the key to a random value. For all four scenarios we have collected traces in a fixed-versus-random manner, so that we compare the scenarios visually, by looking at the different distributions for the fixed and random groups, and statistically, by means of the non-specific Welch t-test. Clearly, the groups are easily distinguishable in this particular scenario, which indicates a good amount of input dependency in the traces. On this slide we see the same evaluation metrics for the plaintext and key reset to zero. First of all, one may notice that the measured voltage drop is about twice as large as for the previous scenario, which matches our expectation from the gate-level simulations due to the larger number of glitches in the circuit when the key is changed as well. Otherwise, it can be observed that the noise level is a bit higher in this scenario and the distinguishability is reduced. When resetting the plaintext input to a random value between encryptions, the noise level is even higher and the distributions are much harder to distinguish than before.
This is due to the fact that the dissipation of the circuit has now become non-deterministic from the adversary's point of view. This is because the transition from one random value to a fixed value has a very different power consumption footprint than the transition from another random value to the same fixed one. In other words, even if the attacker supplies the unrolled circuit multiple times with the same input, as it is done here for the fixed group, the dissipation will be very different each time despite the identical input, because the previous state of the circuit, determined by the random reset, is different and affects which input bits make a transition. If this is not directly intuitive, there is a toy example in the paper explaining this in a little more detail. Finally, when resetting both the plaintext and the key input to a random value, the voltage drop is again more drastic, but other than that the metrics show pretty similar results. What you can see here are the best attack results that we have found for the five different scenarios. In the left column of the table the name of the scenario is denoted, and in the right column the number of successfully recovered key nibbles is given. It can be observed that the number of recovered key parts is reduced by all of the reset methods, yet to a different degree. Especially in the random cases, the attacks do not work very well anymore. Again, this is primarily due to the inability of the adversary to predict a transition between consecutively processed values based on a partial key guess. Just take a look at the power model here. The previous state of the circuit is unknown to the adversary, freshly random between each encryption, and cannot be controlled. We have verified that even with 100 million traces, no CPA with any bit model or the Hamming weight power model could extract a larger part of the key, nor could a leakage-model-independent collision-based attack deliver any better results in these random reset cases.
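In the spirit of the toy example in the paper (the concrete values below are my own, chosen for illustration): the number of toggling input bits, and with it the dynamic dissipation, depends on the secret random preload, even though the processed input is identical every single time.

```python
def hamming_distance(a, b):
    # Number of bits that toggle when a register goes from value a to b.
    return bin(a ^ b).count("1")

fixed_input = 0b1010  # the same plaintext nibble is applied every time

for preload in (0b0000, 0b1011, 0b0101):
    toggles = hamming_distance(preload, fixed_input)
    print(f"random preload {preload:04b} -> {toggles} bit transitions")
    # prints 2, 1, and 4 transitions for the three preloads, respectively
```

Because the adversary neither knows nor controls the preload, the transition count, and hence the hypothetical power model, is unpredictable even under a correct key guess.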
Security against basic attacks with up to 100 million traces, achieved just by randomly preloading the circuit, is quite impressive in my opinion. However, the primitive of course still has detectable leakage, and more sophisticated attacks might be able to extract some more information in this scenario. Finally, we have calculated the signal-to-noise ratio, in short SNR, for all reset methods based on the 12 different inputs to the cipher's rounds. On the top left you can see the degrading SNR over the number of rounds for the plaintext reset to zero method. On the top right, the SNR for the plaintext and key reset to zero. On the bottom left, for the random plaintext reset, and on the bottom right, for the random plaintext and key reset. In all four graphs, the SNR values for rounds 11 and 12 are statistically insignificant, as the values are at or below the red line. This means that, given the available amount of observations, an attack from the ciphertext side on the last rounds would be extremely unlikely to succeed in any of the scenarios. This supports the intuition that we have gained from the gate-level simulations, namely that later rounds of unrolled circuits are difficult to attack when using the dynamic dissipation as a source of information leakage. However, even the values for the earlier rounds are pretty low, and there is a significant drop after the first round already. Apart from the higher asynchronicity of signals in later rounds, this is also caused by the fact that a gate toggle towards the end of the circuit does not influence nearly as many further gates as a toggle directly at the input. There have been proposals in the literature to mask the outer rounds of unrolled PRINCE using a proper glitch-resistant masking scheme including register stages and leave the rounds in the middle unrolled and unprotected. This is supposed to achieve a compromise between latency and side-channel security.
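For reference, the SNR metric used here can be sketched as follows. The grouping and the simulated leakage are illustrative assumptions; the real evaluation partitions measured traces by the respective round-input values.

```python
import numpy as np

def snr(samples, labels):
    # Partition the samples by the targeted intermediate value; the SNR is
    # the variance of the group means (the signal) divided by the mean of
    # the within-group variances (the noise).
    groups = [samples[labels == v] for v in np.unique(labels)]
    means = np.array([g.mean() for g in groups])
    variances = np.array([g.var() for g in groups])
    return means.var() / variances.mean()

# Simulated check: leakage proportional to the Hamming weight of a nibble
# (signal variance 1), plus Gaussian noise of variance 4, gives SNR ~ 1/4.
rng = np.random.default_rng(1)
labels = rng.integers(0, 16, 50_000)
hw = np.array([bin(v).count("1") for v in range(16)])
samples = hw[labels] + rng.normal(0.0, 2.0, labels.size)
print(snr(samples, labels))  # approximately 0.25
```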
Our result indicates that protecting the last rounds by masking may not be necessary, or at least not cost-effective, because the actual increase in security could be negligible. Furthermore, using unrolled encryption circuits in scenarios where the physical adversary may trivially obtain the ciphertext but not the plaintext seems like a clever choice, because attacks in the ciphertext-only model are expected to be orders of magnitude more difficult. Just take a look at the figures on this slide. Unfortunately, these arguments come without any security guarantees, so it still depends on your hardware and the desired security level. Now we finally come to the static power analysis of our target. There is no need to distinguish as many scenarios as before here, because either the unrolled circuit is idle at some point in time while sensitive intermediates are still applied to the logic gates in the circuit, or not. Which means either exploitation is possible or not. In case the adversary can stop the clock of the target, he can artificially pause the circuit whenever he wishes and create that exploitable situation on demand. In that case, none of the previously analyzed reset methods have any effect on the measurements, as the static power consumption is not affected by any previous state of the circuit. Remember, this is not a transitional effect that we observe, but a static one. When the adversary cannot control the clock, it may still occur that sensitive intermediates remain in the circuit if the whole cryptographic instance is not reset immediately after an encryption. For that reason, it's really important from a designer's perspective that the reset is performed right after the last encryption and not just some time before the next. In our experiments we have assumed a second type of situation, where some parts of the chip are actively computing, but the unrolled PRINCE instance is not used at the moment and still contains the remains of the previous encryption.
We have measured the DC shift and collected traces in a fixed-versus-random manner, and the results can be seen in the two figures at the top here. The bottom two figures show two successful CPA results targeting the least significant bit of an S-box output in the first round and an S-box input in the last round, respectively. So the remarkable difference to the dynamic power analysis is that here, attacks from both sides, the plaintext side and the ciphertext side, work well. This is confirmed by this table showing the best attack results that we have found. All but one key nibble can be recovered in the first round and all of them can be recovered in the last round. In that regard, only 4 bits of information are missing about the whole 128-bit key after these two attacks. We have also calculated the signal-to-noise ratio based on all 12 round inputs for the static power measurements. The results are compared to the four dynamic scenarios from before, which are shown as dotted lines here. It is obvious that the static power consumption leaks information about all gates in the circuit in the same manner, independent of whether they are located close to the input or the output of the circuit or anywhere in between. In that regard, the SNR values for all 12 rounds fall roughly in the same range, and it's clear that information may be extracted about intermediate values from all rounds. To me, this really highlights the conceptual difference between observing the dynamic power consumption and the static power consumption of CMOS logic. Keep in mind that the static leakage becomes even more informative in more advanced CMOS nodes, while the dynamic power per logic unit declines. It's time to draw a conclusion. Effectively protecting unrolled circuits without causing severe area or latency penalties is pretty hard, but some simple usage principles, if correctly applied, can deliver promising results.
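That conceptual difference can be captured in two lines of leakage modeling. This is a simplification, assuming Hamming-distance-like dynamic leakage and Hamming-weight-like static leakage, which are the standard textbook models for CMOS:

```python
def hamming_weight(x):
    return bin(x).count("1")

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

current_state = 0b1011  # sensitive intermediate held by the idle circuit

for previous_state in (0b0000, 0b1111):
    # Dynamic leakage depends on the transition, so a random previous
    # state randomizes it; static leakage depends only on the value
    # currently applied to the gates, so the previous state is irrelevant.
    dynamic = hamming_distance(previous_state, current_state)
    static = hamming_weight(current_state)
    print(f"dynamic: {dynamic}, static: {static}")
    # dynamic varies with the preload (3, then 1); static stays at 3
```

This is why the random-reset methods, which defeat the dynamic attacks, have no effect whatsoever on the static measurements.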
For example, when the plaintext input of an unrolled cipher is reset to a random value unknown to the adversary between encryptions, it is difficult to extract secret information through the dynamic power. Static power adversaries remain dangerous in such a scenario if clock control is an option or if other mistakes are made, so this certainly needs to be considered in the security analysis of such primitives as well. Finally, due to its different nature, the static power consumption was clearly the easiest method to extract the full 128-bit key of our unrolled PRINCE core, because each round can be targeted with the same effort. Thank you very much for your attention. Like I said in the beginning, I am very happy to answer any questions in the live session or via email. Have a nice day.