Hello everybody, welcome to this presentation. Today I will be giving a talk about our recent CHES paper, "Pay Attention to Raw Traces: A Deep Learning Architecture for End-to-End Profiling Attacks". Let's start with the basic motivation of this work. Deep learning has been widely used in side-channel analysis in recent years. Many works defeat countermeasures like masking and address desynchronization issues simultaneously with deep learning, so neural networks have shown their ability in the side-channel context. However, in almost all of these papers, selected narrow trace intervals are used instead of the raw traces, even when the implementation is protected by a masking countermeasure. That means there is a manual feature-extraction step before the profiling. In our opinion, these previous works carry an implicit assumption that the number of time samples in raw traces can easily be reduced before profiling. However, if we consider a practical black-box analysis of a masked implementation, locating the leakages is arguably the most challenging part of the whole analysis. The assumption may be too strong, or even invalid, in a practical analysis targeting a masked implementation. Therefore, we argue that to fully utilize the potential of deep learning and get rid of any manual intervention, we need end-to-end profiling that directly maps the raw traces to the target intermediate values. So in our paper, we propose an end-to-end architecture composed of encoders, an attention mechanism and a classifier to conduct end-to-end profiling. We have four main contributions in this work. The first is an architecture for practical end-to-end profiling, which can directly profile raw traces. The second is that, to build the end-to-end architecture, we introduce some new structures into SCA. Third, we obtain satisfying results on public datasets with our architecture.
And fourth, we explore the attention mechanism in the SCA context, which further explains why our end-to-end architecture works. Here is our new architecture. We propose this architecture to conduct end-to-end profiling no matter whether the raw traces are desynchronized or protected by masking countermeasures. We can see how the three components work from the viewpoint of SCA in this picture. Roughly speaking, the encoders first encode the raw traces into abstract features. Then the attention mechanism gives each feature a score and uses these weights to sum the features up. Finally, the classifier generates the class probabilities from the final feature vector. I will give a more detailed introduction of our end-to-end architecture next. First is the encoder component. The encoder component includes two parts, the junior encoder and the senior encoder. We start with the junior encoder. For the junior encoder, we have two kinds of structures to handle the synchronized and desynchronized scenarios. For the synchronized scenario, we use a locally connected layer. This layer is quite similar to a convolutional layer, except that it does not share the filter weights among the steps as it slides along the traces. We can see how it works in the right-hand figure. We choose this layer because the locally connected layer decouples its neurons from the whole high-dimensional input, which, as we observed, is crucial to generate fine-grained features of high quality. For the desynchronized scenario, we still use stacked convolutional layers and pooling layers. We use them to extract shift-invariant features to address the desynchronization, but we avoid the fully connected layers and very deep convolutional structures by relying on the upper components, like the senior encoder and the attention. Then comes the senior encoder. For the senior encoder, we use long short-term memory (LSTM). So why do we choose LSTM as the senior encoder?
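To make the contrast with a convolution concrete, here is a minimal NumPy sketch of what a 1-D locally connected layer computes; all names, sizes and strides are illustrative, not the actual layer configuration in our network. The key point is that every output position owns its own filter weights, whereas a convolution slides one shared filter.

```python
import numpy as np

def locally_connected_1d(x, weights, stride):
    """Locally connected layer: like a 1-D convolution, but each output
    position has its OWN filter weights (nothing is shared along time).
    x:       (length,) one input trace segment
    weights: (n_out, kernel) one independent filter per output position
    """
    n_out, kernel = weights.shape
    out = np.empty(n_out)
    for i in range(n_out):
        seg = x[i * stride : i * stride + kernel]
        out[i] = seg @ weights[i]   # position-specific weights, no sharing
    return out

rng = np.random.default_rng(0)
trace = rng.normal(size=32)
w = rng.normal(size=(15, 4))        # 15 output positions, kernel width 4
y = locally_connected_1d(trace, w, stride=2)
print(y.shape)                      # (15,)
```

Because the weights are untied, each neuron specializes to one local region of the high-dimensional trace, which is why we observed finer-grained features on synchronized data.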
First, an LSTM can learn to control the data flow automatically as it goes through sequential data. There are three gates in an LSTM, and these gates control what information is collected, forgotten and yielded. Second, there is a built-in structure in the LSTM, called the memory cell, for storing the combined features. Compared to other gated recurrent layers without built-in memory, like the GRU, we find that the built-in memory makes the feature combination more stable. Third, the LSTM can theoretically handle infinitely long sequences. The LSTM in our architecture works in sequence-to-sequence mode. The reason for using this mode is that encoding a too-long sequence into a single feature vector is still too hard; the LSTM still faces gradient issues in practice. As a result, we need some mechanism to reduce the hardness of training. Ideally, if we can choose some critical time steps, or at least shorten the time sequence automatically, the training will be much easier. Therefore, we select the sequence-to-sequence mode, which exposes the hidden state at each time step, to satisfy this necessary precondition. Our senior encoder consists of two LSTM layers with different directions, one forward and one backward. Since one of the LSTMs reads the input backward, its order of accessing the sensitive leakages is reversed from the forward one. These two LSTMs may learn different kinds of combinations of features. There are also two ways to utilize the output sequences of these two LSTM layers. We can concatenate them along the feature axis, as in the figure on the left side, or keep them independent, as in the figure on the right side. If we do not want the outputs of the LSTMs to interact with each other through the merging operation, nor limit the representational flexibility of the higher layers, we can just keep them independent. The independent LSTMs also give a better explanation of how our architecture locates the informative intervals.
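As a rough illustration of the gating and of sequence-to-sequence mode, the following NumPy sketch implements a single standard LSTM step and keeps the hidden state at every time step; the sizes and random initialization are purely illustrative and not those of our trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. The three gates (input i, forget f, output o)
    decide what is collected, forgotten and yielded; c is the built-in
    memory cell that stores the combined features across time steps."""
    z = W @ x + U @ h + b                  # pre-activations, (4*hidden,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                         # candidate features
    c_new = f * c + i * g                  # memory-cell update
    h_new = o * np.tanh(c_new)             # exposed hidden state
    return h_new, c_new

hidden, n_in = 8, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * hidden, n_in)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

# sequence-to-sequence mode: the hidden state at EVERY step is kept,
# so an upper component (the attention) can score each time step.
seq = rng.normal(size=(10, n_in))
h = c = np.zeros(hidden)
outputs = []
for x in seq:
    h, c = lstm_step(x, h, c, W, U, b)
    outputs.append(h)
outputs = np.stack(outputs)                # (10, 8): one vector per step
print(outputs.shape)
```

Running the same loop over the reversed sequence gives the backward direction; stacking the two output sequences along the feature axis, or keeping them separate, corresponds to the two options described above.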
We will give more evidence when we visualize the attention mechanisms later. Next is the attention mechanism. The attention mechanism evaluates each feature vector from the senior encoder and gives each of them a score. The primary motivation of using attention is to reduce the valid length of the sequence, so it is unnecessary for the LSTMs to encode the whole feature sequence into a single feature vector. The attention mechanism selects some important time steps, and it essentially truncates the sequence at the positions where the attention probability is large enough. As a result, in ideal conditions, the LSTM only has to remember the information in a shortened interval, that is, the information before a distinct attention peak. This naturally alleviates the gradient issue in practice. For the detailed implementation, we modify the attention mechanism proposed by Bahdanau by inserting a batch-normalization layer after calculating the raw scores. The raw scores are then normalized by the softmax function. Finally, we conduct a weighted sum according to the normalized scores, also known as the attention probabilities. We skip the introduction of our classifier component, as it is just a simple fully connected layer. If we step back from the implementation details, there are two variants of our architecture. These two variants are related to how our two LSTMs are combined. For the case where the LSTMs are concatenated along the feature axis, we use only one attention instance, as in the figure of Variant 1. For the case where the two LSTMs are independent, we append an attention instance to each LSTM, as in the figure of Variant 2. In Variant 2, the attention components are also directional. We find that Variant 2 can handle more complicated leakage scenarios, for example when the sensitive leakages are far from each other or spread over multiple clock cycles. But Variant 1 is more efficient in training when the leakage scenario is simple. So, what do we get with our new architecture?
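A minimal sketch of this attention pooling, under simplifying assumptions: a single scoring vector stands in for the full Bahdanau scoring network, and since only one trace is processed here, the raw scores are normalized over time steps as a stand-in for the batch-normalization layer we actually insert.

```python
import numpy as np

def attention_pool(features, w_score, eps=1e-5):
    """Additive-attention pooling over an LSTM output sequence.
    features: (T, d) one feature vector per time step
    w_score:  (d,)   scoring vector (illustrative single-layer scorer)
    Returns the weighted-sum feature vector and the attention probabilities.
    """
    raw = features @ w_score                             # (T,) raw scores
    raw = (raw - raw.mean()) / np.sqrt(raw.var() + eps)  # normalize scores
    p = np.exp(raw - raw.max())
    p /= p.sum()                                         # softmax -> probs
    return p @ features, p                               # weighted sum

rng = np.random.default_rng(2)
feats = rng.normal(size=(100, 16))   # e.g. 100 time steps, 16 features
w = rng.normal(size=16)
vec, probs = attention_pool(feats, w)
print(vec.shape, np.isclose(probs.sum(), 1.0))           # (16,) True
```

Variant 1 applies one such instance to the concatenated forward/backward sequence; Variant 2 applies one instance per direction and concatenates the two pooled vectors before the classifier.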
Let me start with some basic settings of our experiments. First, we consider the identity label. That means we have 256 labels, and typically we use the output of the AES S-box. We also consider both synchronized and desynchronized cases. Below are the datasets we use in our experiments. Four of them are public: two from the DPA contest, and the other two from ASCAD. These four datasets are protected by Boolean masking. We also acquired the AT128 datasets from an ATMega128 MCU, on which we simulate a masking countermeasure. AT128-F means the leakages of the mask and the masked value are far from each other, so it is a harder target compared to AT128-N when we consider end-to-end profiling. You can see from the third column that we can use over 400,000 time samples directly in profiling. We also want to clarify that we use ASCADv2 to refer to the dataset collected on the ATMega8515 with variable key. The authors released a new trace set collected on an ARM STM32 after the submission of our paper and also named it ASCADv2, so don't confuse these two datasets. First, we apply our network to synchronized traces. Here are the results on the dataset DPA contest v4.1. The first, second, and third columns are the validation accuracy, validation loss, and the guessing entropy, respectively. In the first row, we do not use the knowledge of the mask; we reach guessing entropy zero in four or five traces. In the second row, we suppose the value of the mask is known; we need only one trace to recover the key. Then we test the dataset DPA contest v4.2. In this dataset, there are 16 subsets, each corresponding to a different key. We use the traces from all of the 16 subsets to train our network. In the guessing entropy figure, we can see that most subsets can be successfully attacked within 10 traces. Subset 12 is harder to attack, and it needs about 120 traces. Next, we consider the ASCAD datasets.
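For readers unfamiliar with the metric, here is one common way to estimate guessing entropy from per-trace probability vectors; the mapping from label probabilities to the 256 key hypotheses is assumed to have been done already, and the data below are synthetic, not from any of our datasets.

```python
import numpy as np

def guessing_entropy(probs, correct_key, n_trials=50, seed=0):
    """Guessing entropy: the rank of the correct key hypothesis after
    accumulating log-likelihoods over 1..n attack traces, averaged over
    random trace orderings. Rank 0 means the key is recovered."""
    rng = np.random.default_rng(seed)
    logp = np.log(probs + 1e-40)                      # (n, 256)
    n = len(probs)
    ranks = np.zeros(n)
    for _ in range(n_trials):
        scores = logp[rng.permutation(n)].cumsum(axis=0)
        # count hypotheses scoring above the correct key at each step
        ranks += (scores > scores[:, [correct_key]]).sum(axis=1)
    return ranks / n_trials                           # mean rank per step

# toy example: per-trace probabilities mildly biased toward key 42
rng = np.random.default_rng(3)
p = rng.random((20, 256))
p[:, 42] += 0.5
p /= p.sum(axis=1, keepdims=True)
ge = guessing_entropy(p, correct_key=42)
print(ge[0], ge[-1])   # rank drops as more traces are accumulated
```

A curve like `ge` reaching zero at trace t is what "reducing the guessing entropy to zero in t traces" means throughout these results.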
From the figures for ASCADv1, we find that our network reduces the guessing entropy to zero very efficiently. It costs about six traces to recover the key. To the best of our knowledge, this result is state-of-the-art, even compared to the results based on selected points of interest. From the figures for ASCADv2, we see that the attack is also quite efficient. We use about 10 traces to recover the correct key. To the best of our knowledge, this is also a state-of-the-art attack result on this dataset, even though we use the raw traces. We give detailed explanations of the loss and accuracy curves in our paper, so I will not dive into them in this presentation. Next, we come to the AT128 datasets. We use both of the variants on these two datasets to explore how to use the architecture according to the different leakage scenarios. In AT128-N, the leakages are limited to a narrow interval on the raw traces, whereas in AT128-F, the leakages are far from each other. Besides the efficient attack, the guessing entropy indicates that the network based on Variant 2 handles both leakage conditions better in terms of profiling quality. The advantage of Variant 1 is that it can converge with a larger batch size when the number of training traces is the same, and thus obtain a good network for the attack faster. We also give a more detailed discussion in our paper about how to choose between these two variants. So, what happens when we consider desynchronized traces? Well, in the desynchronized cases, we use stacked convolutional layers to replace the locally connected layer. The delay interval in the figures means the range of the random shift we added to the traces to simulate the desynchronization. We also find that the number of traces in both versions of ASCAD is not quite enough for our end-to-end profiling, so we also conduct data augmentation. Finally, as the figures show, we can reduce the guessing entropy to zero with very few traces. Next, we test the desynchronized AT128 datasets.
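The desynchronization and augmentation mentioned above both amount to random shifting of traces. A simple scheme along these lines could look like the following sketch; the shift distribution and edge padding here are illustrative and not necessarily identical to what we used.

```python
import numpy as np

def augment_shift(traces, max_delay, rng=None):
    """Shift each trace left by a random delay in [0, max_delay], padding
    the tail with the last sample. The same idea is used both to simulate
    desynchronization and to augment a too-small training set."""
    rng = rng if rng is not None else np.random.default_rng(0)
    out = np.empty_like(traces)
    for i, t in enumerate(traces):
        d = int(rng.integers(0, max_delay + 1))
        out[i] = np.concatenate([t[d:], np.full(d, t[-1])])
    return out

rng = np.random.default_rng(4)
batch = rng.normal(size=(8, 1000))       # 8 toy traces of 1000 samples
aug = augment_shift(batch, max_delay=50, rng=rng)
print(aug.shape)                         # (8, 1000): length is preserved
```

Applying this to each training batch exposes the convolutional junior encoder to many shifted copies, which encourages shift-invariant features.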
For an efficient attack, we also conduct data augmentation here, and these are the guessing entropy results. Again, you can see that we need only several traces to recover the key. Here are some comparisons to previous works on the datasets DPA contest v4.1 and ASCADv1. In the first figure, the blue curve is our result using the raw traces without knowing the mask, while the previous work used the shortened traces with knowledge of the mask. We can see that although our profiling is conducted on raw traces, the attack is more efficient. In the second figure, the comparison result is similar. Our network trained on raw traces performs much better than the previous work trained on the reduced traces, and our guessing entropy curve sits in the lower-left corner. Here is a summary of our attacks. The first column is the dataset. The second column is the random delay we used to simulate the desynchronization. And the third column is the number of traces needed to recover the key. We can see that for most of the cases, we can reduce the guessing entropy to zero within several traces. For some more challenging scenarios, we need tens or even more than a hundred traces. Now we show some comparisons within our network to explain how one of our key structures, the attention mechanism, works. These are the results of our network trained on AT128-N, based on Variant 1. From the SNR and the gradients, we can conclude that our network has found the leakages, and the comparisons of the attention show that the attention mechanism decides that the time step just after the backward LSTM goes through the leakages is very important. So we can see an attention peak in subfigure C and subfigure D. The network trained on ASCAD based on Variant 2 gives an even clearer intuition of how our attention mechanism focuses on the informative interval. In subfigure F, we plot the probabilities of the forward and backward attention.
We can observe that both attention instances pay special attention just after the LSTMs go through the time steps between about 800 and 900, and these time steps correspond to an interval around index 45,000 on the raw traces. In other words, this interval is judged by the attention mechanism to be the most informative interval on the raw traces. Not surprisingly, this interval includes the 700 time samples that were manually selected by the authors of ASCAD. Finally, we dive deeper into the LSTM. We plot the gate activation state and try to find out whether the LSTM indeed collects the leakages before the attention asks it to output. The gate state is presented by the blue curves. We can see that the gate is activated just at the positions where the peaks of the SNR are located. Therefore, the input gate of the LSTM is indeed controlling the LSTM to collect the informative leakages. The relations among the attention probabilities, the SNR and the gate activations indicate that the LSTM and the attention mechanism interact as we designed. Finally, we move to the conclusion. In this paper, we introduce a neural network architecture for end-to-end profiling. That means that rather than leveraging knowledge of the implementation and the values of the masks to locate the informative intervals, we can now profile the raw traces directly. To the best of our knowledge, this is the first architecture that achieves end-to-end profiling. We also introduce some new structures into the SCA field to build the architecture, like the locally connected layers and the attention mechanism. Besides the effectiveness, we find that our approach working in the end-to-end context performs even better than the networks trained on reduced traces. We also explore how the attention mechanism works in the side-channel context to verify that the architecture works as we designed. For future work, it will be interesting to replace the LSTM with a self-attention mechanism to speed up the training process.
Since the LSTM cannot be parallelized, the training is time-consuming at the moment, and we will also explore other neural networks or training strategies to improve the performance of deep learning in SCA. So that's it. Thank you for your attention, and you are welcome to read our paper to get all the details. If you have any questions, feel free to contact us. Thank you.