All right, good afternoon everyone. My name is Benjamin and I'm going to talk about side-channel analysis of the Xilinx Zynq UltraScale+ encryption engine, which was joint work with my colleagues Sebastien Leger, Daniel Fennes, Stefan Gehrer and Tim Güneysu. Okay, this talk is about FPGA security. FPGAs are powerful and flexible configurable devices, mostly based on SRAM technology. That means a configuration file, or bitstream, has to be loaded each time during startup. And since this bitstream potentially contains a lot of valuable IP, it has to be protected against duplication, manipulation and also reverse engineering. Because of that, for a couple of years now, bitstream encryption and also authentication have been supported by many FPGA devices. This basically works as follows. In the EDA software, the bitstream is encrypted with a symmetric key. Then the bitstream is loaded onto the device into some non-volatile memory. The key is also stored on the device, often in eFuses or battery-backed RAM. During power-up, the bitstream is decrypted, often by a dedicated decryption engine, and then loaded into the configuration memory. However, a number of successful key-extraction attacks against such bitstream decryption engines have been demonstrated using side-channel analysis. For example, Moradi et al. showed successful attacks against Virtex-4 and Virtex-5 series devices, and later also against the Spartan-6, Kintex-7 and Artix-7 series. There have also been successful attacks against Altera devices like the Stratix II and Stratix III series. The general problem here was that these devices had no dedicated side-channel countermeasures. Okay, so this motivated us to take a closer look at a chip from the current generation of Xilinx devices, the Zynq UltraScale+, based on 16 nm production technology.
The device features an on-chip encryption engine based on AES-256 in GCM mode, implemented completely in hardware. It can be used for bitstream encryption and authentication. And in contrast to the devices mentioned before, here the encryption engine is protected by a protocol-based countermeasure called key rolling. With key rolling, the initial bitstream is divided into several blocks and each block is encrypted with an individual key. On the device, the initial key is stored in eFuses or battery-backed RAM, while the key for each successive block is encrypted within the previous block. This limits the data collection for the adversary, since each AES key is only used for a certain number of encryptions. However, Xilinx gives no recommendation about a suitable block size, that is, how many AES blocks should be encrypted under the same key. Because of that, our goal was to find an appropriate value for this key-rolling parameter. Our security analysis relies on the following assumptions. First, we assume that hardware root-of-trust authentication based on RSA is enabled. This means that only authenticated items can be decrypted by the AES engine, so chosen-ciphertext attacks are not possible. However, the attacker can record one specific bitstream decryption multiple times and then average the traces in order to increase the signal-to-noise ratio. As there is no dedicated masking countermeasure in the AES core itself, we focused our analysis on first-order leakage. And finally, we assume that the adversary has access to an open copy of the device, such that profiled attacks like template and machine-learning-based attacks are possible. Okay, now some words about our setup. We used a Zynq UltraScale+ evaluation board of type ZCU102, as mentioned before with 16 nm process technology. The device itself is packaged in flip-chip technology.
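As a side note, the key-rolling chain described above can be sketched in a few lines of Python. This is a toy model: the repeating-key XOR stands in for the real AES-256-GCM engine, and all function names are mine, purely to illustrate how each encrypted segment carries the key for the next one.

```python
import os

BLOCK_BYTES = 16  # one AES block; the key-rolling parameter counts blocks per key

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for AES-256-GCM: repeating-key XOR, for illustration only.
    return bytes(d ^ key[i % len(key)] for i, d in enumerate(data))

toy_decrypt = toy_encrypt  # XOR is its own inverse

def encrypt_bitstream(initial_key: bytes, bitstream: bytes, blocks_per_key: int):
    """Split the bitstream into segments of `blocks_per_key` AES blocks; each
    segment is encrypted under its own key, and the key for the next segment
    travels inside the previous (encrypted) segment."""
    seg_len = blocks_per_key * BLOCK_BYTES
    chunks = [bitstream[i:i + seg_len] for i in range(0, len(bitstream), seg_len)]
    segments, key = [], initial_key
    for chunk in chunks:
        next_key = os.urandom(len(initial_key))  # fresh key for the next segment
        segments.append(toy_encrypt(key, next_key + chunk))
        key = next_key
    return segments

def decrypt_bitstream(initial_key: bytes, segments):
    out, key = b"", initial_key
    for seg in segments:
        plain = toy_decrypt(key, seg)
        key, chunk = plain[:len(initial_key)], plain[len(initial_key):]
        out += chunk
    return out
```

The point of the structure is that an adversary only ever sees `blocks_per_key` blocks encrypted under any single key, which is exactly the data-limiting effect the countermeasure targets.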
So we had to remove the metal cap with a small drill in order to get access to the silicon. As leakage vector, we used the EM signals induced by the on-chip decoupling capacitors that are placed around the chip. So we placed our EM probe directly next to the decoupling capacitor that is related to the AES power rail. We also tried measuring the EM signal directly on the die surface; however, this led to a worse signal-to-noise ratio compared to measurements on the decoupling capacitors. The target's clock frequency was set to 48 MHz. For acquiring EM traces, we used a Langer EMV probe in combination with a 4-axis positioning system, and a PicoScope 6000-series oscilloscope with a sampling rate of 625 MS/s and 500 MHz bandwidth. We used an averaging factor of 250, meaning 250 traces with the same inputs were recorded, but only the averaged trace was kept for the analysis. Okay, once our setup was ready, we performed a black-box reverse engineering of the AES-256 architecture. For that, we used a standard Pearson correlation with known key. What were the steps? From the timing behavior of the core, we assumed a round-based implementation of the AES. Then we tried different leakage models that are typical for round-based hardware implementations, with a varying number of pipeline stages and registers. And finally, we found that the AES is implemented as shown here on the slide, with four rounds implemented in parallel and a state register between each round. Okay, in order to attack a cryptographic design, the attacker needs a suitable power leakage model. In cryptographic hardware implementations, most of the power is usually consumed by the registers. So for each of the registers we saw on the slide, the power leakage can be roughly summarized as given in the formula on the slide.
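The effect of the averaging factor of 250 can be illustrated with synthetic traces; the Gaussian noise and the sine "leakage" below are made up, not our real measurements, but the sqrt-N noise reduction is the mechanism the averaging relies on.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AVG = 250        # averaging factor from the setup
n_samples = 1000

signal = np.sin(np.linspace(0, 20, n_samples))   # stand-in for the leakage signal
noise = rng.normal(0.0, 1.0, (N_AVG, n_samples)) # independent per-acquisition noise
raw_traces = signal + noise                      # 250 recordings with the same inputs

avg_trace = raw_traces.mean(axis=0)              # only this averaged trace is kept

# Averaging N traces shrinks uncorrelated noise by roughly sqrt(N),
# which is what boosts the signal-to-noise ratio before the attack.
raw_noise_std = (raw_traces - signal).std()
avg_noise_std = (avg_trace - signal).std()
```

With N = 250 the residual noise standard deviation drops by a factor of about 15, at the cost of recording every decryption 250 times.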
And it basically says that the power leakage is equal to the Hamming distance between the output of encryption i in round n and the output of the next encryption i+1 in the same round. So we have a power leakage that depends on two encryptions, not on two different rounds, which is kind of exotic. Okay, how can this leakage model actually be used to mount an attack against the AES? Well, we have seen that our Hamming-distance leakage model involves two consecutive block encryptions. However, in GCM, or counter mode, the only difference between two encryptions is the counter value. More specifically, in most cases only the least significant byte of the counter value changes between two encryptions. We have visualized these changes here in red: only one byte out of 16 changes at each encryption and leaks information. That means the Hamming-distance leakage in round one can only be related to one key byte, key byte 15 in this case. So in order to extract the complete 256-bit AES key, we need to consider more rounds. Indeed, we need to take the first five AES rounds into account to be able to extract the complete key. In round one, only one key byte can be extracted, with an 8-bit hypothesis-space attack. The goal of round two is then to extract four constants, alpha 0 to alpha 3, which are related to the subkey bytes. These can also be extracted with 8-bit hypothesis-space attacks. However, there is a restriction: these constants are only stable for 256 encryptions, so we have an upper bound on the number of encryptions here. In round three, the goal is to extract 16 constants with four attacks, each with a 32-bit hypothesis space. This enables us to calculate the input vector for round four, but only for the next 256 encryptions. In round four, finally, 16 subkey bytes can be extracted, again with four attacks, each with a 32-bit hypothesis space.
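The counter-mode property underlying the round-one attack is easy to verify in code. Here is a short sketch using the GCM counter-block layout (96-bit IV followed by a 32-bit big-endian counter); the all-zero IV is just a placeholder.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    # Number of differing bits between two equal-length byte strings.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def counter_block(iv12: bytes, ctr: int) -> bytes:
    # GCM counter block: 96-bit IV || 32-bit big-endian counter.
    return iv12 + ctr.to_bytes(4, "big")

iv = bytes(12)  # placeholder IV

# Between encryption i and i+1, only the last counter byte (index 15) differs
# as long as the counter stays below 255 -- so the round-1 Hamming-distance
# leakage can be related to a single key byte.
diff_positions = [
    [k for k in range(16) if counter_block(iv, i)[k] != counter_block(iv, i + 1)[k]]
    for i in range(255)
]
```

This is exactly why the first-round attack has only an 8-bit hypothesis space: all the input variation between consecutive encryptions is confined to byte 15.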
The same is true for round five, and this finally enables us to calculate the complete 256-bit key. So in summary, our attack procedure goes over five rounds, requires 32-bit hypothesis-space attacks from round three onwards, and we have only 256 encryptions available. So we are in a very challenging setting here. Okay, coming now to our practical experiments. Our dataset was composed of 200,000 traces after averaging. 75% of these traces were used for profiling, 24% for validation, and 1% as attack traces. Each trace has an individual key and initialization vector, and is composed of not just one encryption but 256 successive AES encryptions, from counter 0 to 255. This corresponds to our attack procedure shown before, which requires at most 256 encryptions to extract the key, so we can validate our setup with just one attack trace. In summary, one trace contains almost 16,000 samples, as shown here on the slide. Okay, as an attack baseline, we used a CPA with LDA preprocessing. LDA is a technique that projects the traces into a smaller dimension where the inter-class variance is maximized. We calculated LDA coefficients for each Hamming-distance leakage that appears per round attack, using 50 sample points per leakage. The location of these sample points was determined by a correlation with known key. For the attack, we then compress those 50 sample points into a single point using the LDA preprocessing. However, there is an additional challenge. In theory, we have 255 leakages that can be used per attack trace. But the first exploitable leakage appears only for a counter value of 3, and due to the pipelined architecture of the AES, every fourth clock cycle generates a leakage of round i+3, that is, leakage that we cannot use for the attack in the current round. So in summary, we only have 190 encryptions that generate an exploitable leakage.
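The LDA compression step can be sketched as follows on synthetic data. The two-class Fisher formulation and the small ridge term are my simplifications; the real attack computes one such projection per Hamming-distance leakage, from 50 selected sample points.

```python
import numpy as np

def lda_direction(traces: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Two-class Fisher LDA: the projection direction maximizing between-class
    separation relative to within-class scatter. `traces` is (n_traces, n_samples)."""
    X0, X1 = traces[labels == 0], traces[labels == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
    # Small ridge term keeps the solve numerically stable.
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
n, d = 500, 50                       # 50 sample points per leakage, as in the talk
labels = rng.integers(0, 2, n)
traces = rng.normal(0.0, 1.0, (n, d))
traces[labels == 1, :5] += 0.8       # synthetic leakage in the first few samples

w = lda_direction(traces, labels)
proj = traces @ w                    # 50 points compressed into a single value
```

After the projection, the standard CPA correlation only has to deal with one value per leakage instead of 50 sample points.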
We used two power models for our baseline attacks: the regular Hamming-distance power model and a linear-regression-based power model, also known as the stochastic approach by Schindler et al. Here are the results. In the first round, the Hamming-distance attack requires 170 traces and the linear-regression-based model 130 traces to extract the key byte. However, the attacks in later rounds were not successful, so it was not possible to extract the key with the available 190 encryptions. Okay, since conventional state-of-the-art methods are not enough for our target, we decided to change our strategy and move to more sophisticated deep-learning attack methods. The scheme we applied is called correlation optimization (CO) and was proposed by Robyns et al. at CHES 2019. The idea of CO is to train a deep neural network to produce an encoding of the input data that maximizes the Pearson correlation with a hypothetical power leakage. So it can be considered an extension of classical CPA with an additional profiling phase. Part of the CO method is a specialized loss function that is used during the training process. Why CO? Well, as said before, we only have 190 encryptions available, so it makes sense to apply a profiled attack. Then, CO automatically encodes the traces into a single value. This speeds up the correlation process, especially when dealing with 32-bit hypotheses, which is very important because we have a lot of hypotheses here. And finally, the Hamming weight in the power leakage model produces a very imbalanced dataset, again especially for 32-bit hypotheses. This can create a big problem for Gaussian templates, but also for standard deep-learning-based attack methods that use the cross-entropy loss function. We also developed two extensions of the standard correlation optimization approach. The first one is a bit-wise correlation loss.
Here, the bit flips in the registers are directly used as leakage labels, so we don't apply the Hamming weight function. The sum of the correlations for each bit then gives the total loss that is used during training. The second approach we propose is called weighted-bit correlation. It's basically an adaptation of the stochastic approach to neural networks. Here, an additional weight coefficient is introduced for each bit and is approximated, or learned, by an additional small neural network; basically, it's just a single neuron. This neuron then creates an optimized encoding of the leakage that is used in the correlation. Apart from specialized loss functions, the architecture and type of neural network also play an important role in DNN-based side-channel attacks, and here we tried different approaches. First, since we have 190 leakages, we can train an individual network per leakage. The advantage is that we can use shorter traces as input for these networks, but in total the training time can be very long, since we have to train a very large number of networks. The other way around, we can use a single network that has 190 outputs, one per leakage, and takes the complete trace as input. Here the training is faster; however, the model is far more complex in terms of the parameters that have to be trained. And last but not least, a so-called triple-output model that outputs three leakages per trace segment. This is possible since the leakages of these three outputs appear directly next to each other in the trace, just one clock cycle from the next. So we have a trade-off here between training time and model complexity. We implemented all three schemes. For the model-per-leakage approach, we used the standard correlation optimization loss function, the bit-wise correlation optimization, and the weighted-bit correlation as shown before.
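The loss functions can be sketched in numpy: the plain CO objective is a negative Pearson correlation, the bit-wise variant sums per-bit correlations of the register bit flips, and the weighted-bit variant learns one coefficient per bit. The function names and the epsilon constant are mine; this is the idea, not the talk's actual Keras code.

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    # Pearson correlation; the CO network encoding x is trained to
    # maximize this against the hypothetical leakage y.
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc) + 1e-12))

def co_loss(encoding: np.ndarray, hyp_leakage: np.ndarray) -> float:
    # Plain correlation optimization: minimizing this maximizes the correlation.
    return -pearson(encoding, hyp_leakage)

def bitwise_correlation_loss(encoding: np.ndarray, bit_flips: np.ndarray) -> float:
    # bit_flips: (n_traces, 8) register bit transitions used directly as labels,
    # no Hamming weight applied; total loss = negative sum of per-bit correlations.
    return -sum(pearson(encoding, bit_flips[:, b])
                for b in range(bit_flips.shape[1]))

def weighted_leakage(bit_flips: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # Weighted-bit variant: one learnable coefficient per bit (a single neuron)
    # replaces the plain Hamming-weight label in the correlation.
    return bit_flips @ weights
```

With all weights equal to one, the weighted-bit leakage collapses back to the plain Hamming weight, which is why it can be seen as a learned generalization of the standard model.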
And for the output-per-counter and triple-output models, we just used the plain correlation optimization approach. We used two neural network types. First, a multi-layer perceptron (MLP), a rather small network with two hidden layers of 10 and 15 neurons. The second network is a CNN, composed of three blocks of batch normalization, convolution, and max pooling. All of the attacks were implemented in Python with the Keras and TensorFlow frameworks. Okay, and here are the results for the first round. As a reminder, the target in the first round was to extract one byte of the key. Looking at the plots, we can see that the multi-output models, denoted here as all-counter and three-counter, are in general outperformed by the model-per-counter approaches. However, the difference on the CNN is not as large as on the MLP. Comparing the different loss functions, it becomes clear that the bit-wise correlation loss and the weighted-bit correlation obtain the best results and successfully extract the key byte with fewer than 50 encryptions on the CNN model. Please note also that these results are much better than the baseline attack, where at least 130 encryptions were needed to extract the key byte. Then the results for round two. Here the target was to extract four 8-bit constants, alpha 0 to alpha 3, which are needed to calculate the leakage of round three. The attack complexity is the same as in the first round, so we have 8-bit hypothesis attacks per constant. However, the algorithmic noise is higher and therefore the results are worse. Nevertheless, several techniques are able to extract the alphas using the 190 available encryptions, and again the CNN models performed better than the MLP models. Then round three. Here the target was to extract 16 constants, gamma 0 to gamma 15, with four attacks, each with a 32-bit hypothesis space. This then enables calculating the input vector of round four for the next 256 encryptions.
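Stepping back to the network architectures for a moment: the small MLP mentioned above can be made concrete with a minimal numpy forward pass (two hidden layers of 10 and 15 neurons, one encoding output per trace). The ReLU activations and random initialization are my assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(n_in: int, hidden=(10, 15), n_out=1):
    # One (weights, bias) pair per layer; small random weights, zero biases.
    sizes = (n_in, *hidden, n_out)
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU on hidden layers, linear output: one encoding value per trace,
    # which the CO loss then correlates with the hypothetical leakage.
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = relu(x)
    return x

params = init_mlp(n_in=50)                 # e.g. 50 sample points per leakage
traces = rng.normal(0.0, 1.0, (4, 50))     # synthetic mini-batch
encoding = forward(params, traces)
```

In practice the training itself ran in Keras/TensorFlow; this sketch only shows how small the MLP variant really is compared to the three-block CNN.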
We implemented the attack phase on GPU using the NVIDIA CUDA framework. The main reason was that we had to evaluate 2^32 key guesses, which is more than 4 billion, and running it on the GPU reduced the recovery phase from several weeks to approximately six hours. However, none of our attacks was able to extract the constants within 190 encryptions; we only achieved a key rank between 10,000 and 10 million. The reason for the failure is that the complexity is too high and the signal-to-noise ratio is too low for a 32-bit attack. For our setup, we roughly estimated that at least 1,000 encryptions would be needed to extract the gamma constants. Okay, same picture for the fourth- and fifth-round attacks. Here the target was to extract 16 subkey bytes, again with 32-bit hypothesis attacks. Of course, key recovery was not successful, since the round-three constants are not available, and we roughly estimated that around 3,000 encryptions would be needed to extract the 16 subkey bytes here. So, coming back to our initial question: how should the key-rolling parameter be set? Well, the actual value of the key-rolling parameter, that is, how many bitstream blocks are encrypted under the same key, has a major impact on the size of the configuration image and therefore on the boot time of the Zynq UltraScale+. On the plot here we can see that, for example, a key-rolling parameter smaller than 5 increases the boot time by more than 400%, while a value larger than 40 increases the boot time by only 10% or less. In our experimental attacks, we have seen that a complete extraction of the key was not successful within the 256 encryptions. So why not just set the key-rolling parameter to a value of around 200? Well, we have also seen that parts of the key could be extracted with fewer than 50 encryptions.
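For readers unfamiliar with the key-rank metric used above: it is simply the position of the correct key among all scored hypotheses after the attack. A small sketch, using a toy 2^16 hypothesis space and synthetic scores instead of the real 2^32 CUDA evaluation:

```python
import numpy as np

def key_rank(scores: np.ndarray, correct_key: int) -> int:
    """Rank of the correct key: how many hypotheses score strictly higher.
    Rank 0 means the attack's best guess is the correct key."""
    return int(np.sum(scores > scores[correct_key]))

rng = np.random.default_rng(2)
n_hyp = 2 ** 16                 # toy space; rounds 3-5 face 2**32 hypotheses
scores = rng.normal(0.0, 1.0, n_hyp)   # e.g. per-hypothesis correlation values
correct = 12345
scores[correct] += 0.5          # a weak signal on the correct hypothesis
rank = key_rank(scores, correct)
```

A rank between 10,000 and 10 million out of 2^32, as in our round-three results, means the attack narrows the search space considerably but still leaves far too many candidates to enumerate.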
And although this was a worst-case scenario, that is, a profiled attack on a single device, there is an old NSA saying that attacks always get better, they never get worse. Additionally, the lifetime of products that use the Zynq UltraScale+ can be 20 years or longer, for example in automotive. To account for that, we recommend a key-rolling parameter between 20 and 30, to have a considerable security margin against future attacks and also a reasonable boot-time overhead of 15 to 25%. Thank you very much.