Hello everyone. In this talk, I will explain how we can use side-channel analysis to recover the key and nonce of AES in counter mode with only 256 traces.

Side-channel analysis is a powerful method to recover secret keys from a device by exploiting its power consumption or electromagnetic radiation. Differential power analysis (DPA) is a very popular attack on symmetric ciphers, which was introduced in 1999. It's a known-plaintext attack, which means that apart from side-channel information, it also requires knowledge of the plaintext or ciphertext to recover a secret key.

In this talk, we consider counter mode, which probably needs no explanation. But in this mode of operation, we cannot always assume that plaintexts are known. Consider, for example, that it is used as a PRNG. In that case, the initial state consists of a key and a nonce, or IV, and it is secret. This scenario was considered in previous work by Jaffe, and he managed to recover both the key and the nonce from 2^16 power traces, using only the information in those traces.

NIST actually has an official recommendation for using counter mode to generate pseudo-random bits. During each request, the plaintext is incremented as a counter, and the ciphertexts are collected until there are enough random bits. An important detail of this recommendation is that between every request, the state, which includes the key, is updated. So this means that a DPA attack that wants to recover the state must succeed within a single request. NIST also specifies that the maximum size of a single request is 2^19 bits. Now, in the case of AES, which has a 128-bit state, this means that there are at most 4,096 encryptions per request. So Jaffe's attack is prevented, because it needs more than 4,096 traces.

So here's our question. Can we attack AES in counter mode with fewer than 4,096 traces? I already told you the answer. Yes, because we will do it with only 256 traces.
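The trace budget quoted above is simple arithmetic, and a short Python sketch makes it easy to check. The constant and function names are illustrative choices, not taken from the NIST document or from the talk's notebook.

```python
# Numbers quoted in the talk; the names are illustrative.
MAX_BITS_PER_REQUEST = 2**19   # NIST limit on one request, in bits
AES_BLOCK_BITS = 128           # one AES encryption yields 128 bits

max_encryptions = MAX_BITS_PER_REQUEST // AES_BLOCK_BITS


def fits_in_one_request(traces_needed: int) -> bool:
    """True if an attack needing this many traces can finish before the state is updated."""
    return traces_needed <= max_encryptions


print(max_encryptions)              # 4096
print(fits_in_one_request(2**16))   # False: the 2^16-trace attack does not fit
print(fits_in_one_request(256))     # True: a 256-trace attack does
```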
Now, before we explain the attack, let's recap how an AES round works. First we do SubBytes, which is the only nonlinear step, where every byte goes individually through the S-box. Next, we do ShiftRows, where every row of the state is rotated over 0, 1, 2, or 3 bytes. Then we do MixColumns, which individually transforms every column by multiplying it with a matrix, and therefore mixes the bytes of that column. And finally, we do AddRoundKey, where every byte of the state is XORed with the corresponding byte of the round key.

During this presentation, we will use power traces that we took with the ChipWhisperer-Lite board. We connected it to the UFO target board with an STM32F3 mounted. So the CPU is a 32-bit ARM Cortex-M4, which we programmed with a C implementation of AES in counter mode. In the paper, we also use other devices to demonstrate the attack, not only this one.

Now, let's start the attack. Remember, we have only 256 traces, so that means that only one byte of the state is varying all the time. For simplicity, let's assume it's the least significant byte, so byte number 15. We will indicate bytes like these, which change all the time, with a black square. When this byte reaches the value 255, the counter will overflow, right? It will go back to 0, and a carry will activate the next byte, byte number 14. This happens only once in our set of 256 traces. So we will use gray squares to indicate bytes like byte number 14 that change once. All the other bytes are constant, and we will indicate them with white squares.

Also, we will use x to denote the state before AddRoundKey, and z for the state after SubBytes. What we know about the very first state, x1, is that it's the sum of the nonce, or IV, which we don't know, and a counter that starts from 0, which we do know. The first step of the attack is actually identical to that of Jaffe from 2007. It targets the 15th byte after the first SubBytes.
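The black, gray and white squares can be reproduced with a small Python sketch that adds a counter to a 16-byte nonce and counts how often each byte changes over 256 traces. The nonce below is hypothetical (only the value 141 for byte 15 appears later in the talk), and treating the block as one big-endian integer is an assumption about the counter layout.

```python
def count_byte_changes(nonce: bytes, num_traces: int = 256) -> list:
    """For each of the 16 state bytes, count how often it changes between consecutive traces."""
    start = int.from_bytes(nonce, "big")
    blocks = [((start + c) % 2**128).to_bytes(16, "big") for c in range(num_traces)]
    changes = [0] * 16
    for prev, cur in zip(blocks, blocks[1:]):
        for i in range(16):
            if prev[i] != cur[i]:
                changes[i] += 1
    return changes


def colour(change_count: int) -> str:
    """Map a change count to the square colours used on the slides."""
    if change_count == 0:
        return "white"   # constant over the whole trace set
    if change_count == 1:
        return "gray"    # changes once, when the carry arrives
    return "black"       # changes on every trace


# Hypothetical nonce: 14 zero bytes, then byte 14 = 0x20 and byte 15 = 141.
nonce = bytes(14) + bytes([0x20, 141])
colours = [colour(c) for c in count_byte_changes(nonce)]
print(colours[13:])   # ['white', 'gray', 'black']
```

If byte 14 of the nonce happened to be 0xFF, the carry would ripple into byte 13 as well, and that byte would also come out gray.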
The target is the S-box applied to the XOR of key byte 15 and state byte 15. Both of these bytes are unknown, but we can use our knowledge of the counter. In this expression, there are 16 bits that we don't know: 8 bits of the key byte and 8 bits of the nonce byte. Jaffe then did an optimization to reduce the number of unknown bits to 15 by combining the most significant bits of the nonce byte and the key byte into a single bit b. So there are 15 unknown bits: bit b, and the 7 lower bits of the key byte and of the nonce byte. Which means that we can do DPA with 2^15 hypotheses.

Now let's try out this first step on an actual example. We took traces from multiple devices, and I'll use the example of the Cortex-M4 here. These are traces that we took with the ChipWhisperer. As you can see, the set consists of exactly 256 power traces, and each trace has 12,000 points. So let's see what the traces look like. They consist of four rounds of AES. The AES round functions are always very easy to recognize. In SubBytes you do the same S-box 16 times, so you always see 16 peaks in the SubBytes regions. The MixColumns region is the same operation on four columns, so there you always have four peaks. We will target the SubBytes region in this attack.

Here we have the code for the first part of the attack. Like I said, we will loop over 2^15 guesses. For each guess and for each trace, we compute a hypothesis, which is the value after the S-box in the first round for byte 15. Then we compute the Pearson correlation between the hypothesis and the traces. For each guess we store the maximum correlation that we find, and then we can use that to rank the guesses and choose which one wins. So let's go ahead and start this up. This will take a while because we have to do 2^15 guesses. So that took about 25 minutes.
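The 15-bit hypothesis of this first step can be sketched in plain Python. This is not the notebook's code: the bit layout of a guess, the Hamming-weight leakage model, and scoring a single sample point instead of taking the maximum over all 12,000 points are all assumptions made for illustration.

```python
import math


def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox() -> list:
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for i in range(1, 5):
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def decode_guess(guess: int):
    """Split a 15-bit guess into (b, k7, n7); this bit layout is an arbitrary choice."""
    return guess >> 14, (guess >> 7) & 0x7F, guess & 0x7F


def hypothesis(guess: int, counter: int) -> int:
    """Predicted S-box output of byte 15 in round 1 for the trace with this counter value."""
    b, k7, n7 = decode_guess(guess)
    state_byte = (n7 + counter) & 0xFF      # low nonce bits plus the known counter
    return SBOX[k7 ^ state_byte ^ (b << 7)]  # b folds both unknown top bits together


def hamming_weight(v: int) -> int:
    return bin(v).count("1")


def pearson(xs, ys) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def score(guess: int, samples) -> float:
    """Correlate the predicted leakage with one sample point taken across all traces."""
    model = [hamming_weight(hypothesis(guess, c)) for c in range(len(samples))]
    return pearson(model, samples)
```

Ranking then amounts to evaluating `score` for every guess in `range(2**15)` and sorting. The reduction to 15 bits works because adding 128 modulo 256 only flips the top bit, so the top bits of the key byte and the nonce byte enter the S-box input only through their XOR.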
And we have a winning guess, which clearly has a much higher Pearson correlation than the others. We can make a plot of the 10 top-ranked guesses. Then we see that the winning guess, which is the green line, has a very large correlation peak toward the end of the SubBytes region. Which makes sense, because we're attacking the last S-box. The winning guess has bit b equal to 0, which means that the most significant bits of the key byte and the nonce byte are the same. And the lower 7 bits are given here. So for the rest of the attack we will assume that the key byte is equal to this and that the nonce byte is equal to 13. We're assuming that the most significant bits are 0, but the only thing we know is that they're the same. So it's possible that they are 1, but for now we have sufficient information to continue with the attack.

After this first step we have recovered byte 15 after SubBytes. Then, because of ShiftRows, this byte moves to the first column. And during MixColumns it is mixed with the other bytes of that column. So our next step will again target the bytes after the SubBytes operation. For example, the first byte of the first column will be the S-box applied to a key byte XORed with x2,0. We can express x2,0 with the MixColumns equations. Now one of those bytes, byte 15, is known and constantly varying. The other ones are unknown, but luckily they're also constant, because they are white squares. So we can consider them, together with the unknown key byte, as one 8-bit constant. That means that we can use a DPA attack with 2^8 hypotheses to recover one constant for each byte of this column.

We'll try this out with the Cortex-M4 traces as well, but first I want to draw your attention to these MixColumns factors. For two bytes of the column the factors are the same. So the hypotheses that we compute for these two bytes will be identical.
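The second-round hypothesis can be sketched in Python as follows. It assumes the standard AES MixColumns matrix and that the known byte enters the first column in its last row after ShiftRows, which gives the factors 1, 1, 3, 2. The column bytes and round-key bytes at the bottom are made-up values that only serve to check the folding of all unknowns into one 8-bit constant.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox() -> list:
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for i in range(1, 5):
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()

# MixColumns coefficients that multiply the last input byte of a column,
# listed for output rows 0..3.
FACTORS = (1, 1, 3, 2)


def mix_column(a):
    """AES MixColumns applied to one 4-byte column."""
    return [
        gf_mul(2, a[0]) ^ gf_mul(3, a[1]) ^ a[2] ^ a[3],
        a[0] ^ gf_mul(2, a[1]) ^ gf_mul(3, a[2]) ^ a[3],
        a[0] ^ a[1] ^ gf_mul(2, a[2]) ^ gf_mul(3, a[3]),
        gf_mul(3, a[0]) ^ a[1] ^ a[2] ^ gf_mul(2, a[3]),
    ]


def hypothesis(const: int, z15: int, row: int) -> int:
    """Predicted round-2 S-box output for one byte of the first column.

    const is the 8-bit guess that absorbs the three constant column bytes
    and the round-key byte; z15 is the byte recovered in the first step.
    """
    return SBOX[const ^ gf_mul(FACTORS[row], z15)]


# Made-up constant bytes and round-key bytes, for checking only.
others = [0x11, 0x22, 0x33]
round_key = [0xA0, 0xB1, 0xC2, 0xD3]
consts = [m ^ k for m, k in zip(mix_column(others + [0]), round_key)]
```

Because rows 0 and 1 share the factor 1, `hypothesis(c, z15, 0)` and `hypothesis(c, z15, 1)` are the same table; only the position of the correlation peak in the trace tells the two bytes apart.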
So we return to the notebook, and here we have the four MixColumns factors that we need to recover each of the bytes of the column. The code for the second round is pretty similar to that of the first round, except that now we only have to guess 2^8 times. For the rest, we're again computing hypotheses, Pearson correlations, and so on. We need to run this code for each byte of the column. I'll show it here for the first byte, but like I said, the first and the second byte have the same MixColumns factor. So that also means they have the same hypotheses, and we'll actually be doing both of them at the same time.

This goes a lot faster than the first step, right? Because we have far fewer guesses to make. We get the result after only 10 seconds. And since we did two bytes at the same time, we also have two guesses that very convincingly beat the others. But they peak at different time samples in the trace. The first one, the orange one, is the constant that we wanted to recover for the first byte of the column, and the second one is for the second byte. Then we still need to recover the two other bytes of the column. So here we do all of them, and I'll fast-forward through this. And then we've recovered all four constants needed to know the first column of the state after the second round.

At the end of round two we have recovered the entire first column, and then with ShiftRows those bytes are spread over all the columns of the state. As in the previous step, there is now one known byte per column, but this time there's also a gray byte in every column, which means that we cannot just assume that it is constant. Let's consider an easy scenario first. If byte 15 of the nonce were zero, then the counter for that byte would never overflow and byte 14 would be constant. Then there would be no gray squares, and we could apply exactly the same method as in the previous step to each of the columns of the state.
For each byte we could then recover a constant by using the only known byte of that column, multiplied with the right MixColumns factor. Now what if nonce byte 15 had the value 255? Then after only one trace the counter would overflow and byte 14 would change. But for the rest of the traces it would be constant again, so we can still more or less apply the same method. Only this time we would recover different constants, because the value of x14 changed, which also changes the value of all the gray squares in the rest of the cipher.

So what happens in general is that the value of byte 14 in the first state changes after 256 minus n15 traces. Depending on this value of n15, we will have more traces with the first constant or with the second, and our DPA attack would recover the constants that occur the most. Of course, the worst-case scenario is that the value of n15 is approximately 128, which means that half of the traces use one constant and the other half the other. What we would see in a DPA attack is that for the first half of the traces one of the constants wins, but as we go into the second half it actually starts to lose to the other constant.

Let me show you how we deal with this in our example. Let's start by doing the most obvious thing, which is to assume that we can just use the same methodology as before and use all the traces. The code for the third step is essentially the same as in the previous steps. The only difference is how we compute the hypothesis. So let's try this on the first byte. I fast-forwarded it a bit. What we see is not a very good result, because there's a race going on between two of the guesses and neither of them is winning very convincingly. When you see that, it gives you a hint that we might be close to the worst-case nonce, right? And indeed, we already know from the first step that byte 15 of the nonce was either 13 or 13 plus 128.
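The rule that byte 14 changes after 256 minus n15 traces can be turned into a small Python helper that splits the trace indices into the group before the carry and the group after it. The function name is made up for this sketch, and the counter is assumed to start from 0 as stated earlier in the talk.

```python
def split_at_carry(n15: int, num_traces: int = 256):
    """Return (before, after): trace indices with the old and the new value of byte 14."""
    carry_at = min(256 - n15, num_traces)   # first counter value where n15 + c overflows
    before = list(range(carry_at))
    after = list(range(carry_at, num_traces))
    return before, after


for n15 in (13, 13 + 128):
    before, after = split_at_carry(n15)
    print(n15, len(before), len(after))
# 13 243 13
# 141 115 141
```

For the two candidates from the first step, the split is either 243 against 13 traces or 115 against 141. A close race between two constants therefore fits the second candidate much better than the first.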
So when you see this, you might think: well, probably it's 13 plus 128. So what if we just use half the traces? I'll fast-forward again. Then it looks much better, right? Now one of the constants wins very clearly. We have a very good margin and the attack works. One thing that we can do which is even better: since we can more or less guess that the most significant bit of the nonce byte, and also of the key byte, in the first step is one, we can use that to select exactly the traces that correspond to the same value of byte 14. In this case that gives us 141 measurements, and if we do the attack then, again, we get a winner with a very convincing margin. So let's now do this to recover all 16 constants in that round, and I will also skip this part until it's finished. Okay, so now we have recovered all 16 constants, which means that we can compute the exact state at the end of round three.

After round three we have recovered the entire state, which means that we can compute state x4 at the start of round four. At that point we can perform a regular DPA attack where we only have to hypothesize on the round key bytes. After recovering that round key, we can simply invert the key schedule to compute the master key.

We do the same thing as before, but this time we're actually guessing the round key bytes. The computation of the hypothesis becomes more and more complicated the deeper we go into the rounds, right? But it's not that difficult. Again, we need to perform this for each of the 16 bytes of the state. So now we're trying to recover the first byte of the round key in round four, and we're using exactly the same set of traces that we used in the previous step of the attack. So not the full set, and then the attack works nicely. Again, we have to do this for all 16 bytes of the state, and that's the last byte. So we have recovered all 16 bytes of the fourth round key. The last part is really quite easy.
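Inverting the AES-128 key schedule from one recovered round key can be sketched in Python as below. This is an independent illustration, not the code from the notebook. Which round index the recovered key has is an assumption that depends on how rounds are counted: if the initial whitening key is round key 0, the key added before the fourth SubBytes would be round key 3. The test key is the example key from the AES specification.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox() -> list:
    """Build the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for i in range(1, 5):
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def g(word, rcon):
    """RotWord, SubWord, then XOR the round constant into the first byte."""
    out = [SBOX[b] for b in word[1:] + word[:1]]
    out[0] ^= rcon
    return out


def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def next_round_key(rk: bytes, rnd: int) -> bytes:
    """Round key rnd-1 -> round key rnd (AES-128, rnd in 1..10)."""
    w = [list(rk[i:i + 4]) for i in range(0, 16, 4)]
    w0 = xor(w[0], g(w[3], RCON[rnd - 1]))
    w1 = xor(w[1], w0)
    w2 = xor(w[2], w1)
    w3 = xor(w[3], w2)
    return bytes(w0 + w1 + w2 + w3)


def prev_round_key(rk: bytes, rnd: int) -> bytes:
    """Round key rnd -> round key rnd-1, undoing next_round_key."""
    w = [list(rk[i:i + 4]) for i in range(0, 16, 4)]
    p3 = xor(w[3], w[2])
    p2 = xor(w[2], w[1])
    p1 = xor(w[1], w[0])
    p0 = xor(w[0], g(p3, RCON[rnd - 1]))
    return bytes(p0 + p1 + p2 + p3)


def master_key_from_round_key(rk: bytes, rnd: int) -> bytes:
    """Walk the key schedule backwards from round key rnd to round key 0."""
    for r in range(rnd, 0, -1):
        rk = prev_round_key(rk, r)
    return rk


# Round trip with the example key from the AES specification.
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
rk3 = key
for r in (1, 2, 3):
    rk3 = next_round_key(rk3, r)
print(master_key_from_round_key(rk3, 3) == key)   # True
```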
We just have here an implementation of the inverse key schedule of AES, so that gives us the master key very quickly. And then with that master key we can also invert the AES rounds from the state that we recovered. So this is the state in round four. Here we have an inverse AES implementation, and we apply it together with the key to obtain the first-round input.

There's one thing you need to pay attention to. It's possible that when we selected a subset of traces, we selected the traces after the toggle of x14 happens. If that's the case, then we still need to subtract one from x14. And that was actually the case here, so instead of 0xBB we needed 0xBA. So this is it. We have recovered both the nonce and the master key. I did check this against what was actually sent to the ChipWhisperer, and it turns out that it was correct. But the fact that we recovered the same key and nonce bytes as we did in the first step already gives you a lot of confidence that your solution is correct.

We applied the attack to several sets of simulated traces with different noise levels to investigate the success rate. Naturally, that success rate depends on the least significant byte of the nonce. So for different nonces we get different success rates. The best case is of course when the nonce byte is 0, which is the blue line. The worst case is when it is 128. We see that in the worst case, the pink line, we can never reach 100% success, regardless of the noise level. But if we throw away half of the traces, that's the green line here, the success rate is similar to that of other nonce values. And we also show here, in the yellow line, that if you have 512 traces at your disposal instead of 256, then you can always select the optimal subset of 256 traces that starts with the best-case nonce, and you get this success rate.

Now, from a designer's point of view, what should we take away from this?
One thing we can do is use the hiding technique of random shuffling. Because as we saw in the attack, it's important to see the correct order of the SubBytes operations in the traces if we want to be able to distinguish bytes with the same MixColumns factor. If the SubBytes operation is randomly shuffled, it's more difficult for the attacker to recover the correct constants.

Also, in the NIST specification of counter mode they note that we can use an LFSR update instead of a regular counter. That could mean that each iteration changes bits that are spread over the entire AES state instead of just a few bytes. And since the attack really depends on the fact that a large part of the state is constant, it wouldn't work if we used an LFSR.

Another very important thing to do is to limit the size of the requests, so that the attacker cannot obtain enough traces with the same key. I noticed that some entities already use a lower limit than is prescribed, so that's good. For example, the Mbed TLS library only allows 64 encryptions per request, and the RBG from Texas Instruments only allows 16 encryptions per request.

And then finally, if we had tried to do the attack on a hardware implementation of AES in counter mode, it probably wouldn't have worked with 256 traces. If you have a round-based implementation, the useful signal is only 1/16 of the power consumption. Also, you can't distinguish between different bytes in the same round, because they're all processed at the same time. And if you have an unrolled implementation, it's quite difficult to distinguish between the different rounds in the trace. Overall, the signal-to-noise ratio in a hardware implementation is just much lower than in a software implementation. So it's not very probable that the attack would succeed with only 256 traces.

So what does that mean for the application of masking? Masking is a countermeasure against side-channel attacks that usually requires an online PRNG to supply randomness.
A question that is often asked is whether that PRNG needs to be protected with masking as well. But that could create a chicken-and-egg problem, right? Because if the PRNG needs masking, then another PRNG would need to supply the randomness for that. Based on this work, I would say that if the PRNG is based on AES in counter mode, it does not need masking to protect against DPA. If we just follow the recommendations from the previous slide, we can prevent the attack from this work. Of course, PRNGs for masking are not necessarily based on AES in counter mode. Sometimes people just use a simple LFSR construction. And from a side-channel point of view, linear functions are more difficult to target, so that might be an advantage of LFSRs. But we do need to do more research into specific constructions and attacks that might work against them.

In conclusion, we can attack AES in counter mode with 256 traces, not only in simulation but also on actual devices. We used this attack to create some recommendations for protecting AES in counter mode and to conclude that it does not need masking to be secure against DPA. In the paper we also use the deep learning version of DPA to demonstrate the attack, and we compare our results to blind side-channel analysis. Both are techniques that were presented at CHES in previous years. We also show more applications to different devices in our appendix.

As for future work, I think we should consider what kind of PRNG is needed for masking and what kind of side-channel attacks might work against them. Based on that, we can determine what kind of additional protection is needed.