Hello and welcome to our paper, Cross-Device Profiled Side-Channel Attack with Unsupervised Domain Adaptation. My name is Pei Cao, and this is joint work with Chi Zhang, Xiangjun Lu and Dawu Gu. We have three main contributions in this work. First is a cross-device profiled side-channel attack strategy based on transfer learning techniques: we use fine-tuning to adjust the pre-trained model with unlabeled attack traces. Second is a novel loss function and CNN architecture for robust profiled attacks. And third, we hope to provide a benchmark of cross-device side-channel analysis with satisfying results. In side-channel analysis, we usually divide attacks into two categories. First are non-profiled attacks, where we only have the target device available. We cannot change the key, and usually we attack on the fly with methods like DPA and CPA, for example. On the other side, we have profiled attacks, where we can get access to a clone device which is open and similar enough to the target device. We can use the clone device to characterize the leakage of the target device with a variety of machine learning techniques, for example the support vector machine and the random forest. More recently, the side-channel community has highlighted the ability of deep learning, because first, it can deal with high-dimensional inputs better than the template attack. Second, neural networks can naturally combine the leakage of masked variables. And third, deep learning-based attacks are robust to misalignment when using specific architectures like convolutional neural networks. However, profiled attacks have a major limitation called portability, which is often neglected in research papers. The portability issue occurs when there is a gap between experimental settings and reality. For example, in experiments we usually use a single device for profiling and then use the same device for the attack. But in reality, no two chips are exactly the same.
Even for devices of the same type, the leakage of side-channel information is inevitably different, which is likely due to random process variations introduced during fabrication and packaging. Unfortunately, this discrepancy information is not utilized in the classic two-phase profiled attacks. As a result, attacking a different device may cause a successful single-device model to completely fail. Today, device discrepancy is still a bottleneck restricting the application of profiled attacks in practice. We note that an implicit hypothesis of deep learning techniques is that the training data must be independent and identically distributed with the test data. However, when we adopt deep learning in the context of profiled side-channel attacks, this hypothesis is too strong, because attack traces are often acquired from a different device without control. In such a context, various settings can easily break this hypothesis and lead to poor performance when we try to attack the target device. In fact, device discrepancy is not the only reason for the portability issue. Different implementations and different acquisition settings can also lead to poor attack performance. For example, turning on hiding countermeasures or changing the placement of the probes can also result in two different distributions of traces. So, a profiled attack is composed of two phases, a profiling phase and an attack phase. We note that a limitation of the two-phase attack is that it cannot utilize the discrepancy information, which is directly neglected. So, in order to address the portability issue, we propose to extend the traditional profiled attack and introduce an additional fine-tuning phase to adjust the pre-trained model. Fine-tuning is a widely adopted technique in transfer learning for deep neural networks, where a few epochs of training are applied to the pre-trained model parameters to adapt them to a new task.
A straightforward approach to fine-tuning is to take a pre-trained network and then retrain parts of its parameters using data from the target domain. However, in a realistic side-channel attack scenario, there are no labeled traces measured from the target device. So, in our strategy, the inputs of the fine-tuning phase are the original profiling traces with known labels and a limited number of unlabeled traces measured from the target device. Our network captures the discrepancy information of the two domains in order to learn domain-invariant features. So, to capture the discrepancy information, we must decide how to quantify the distance between the profiling and the target traces. In this work, we introduce the Maximum Mean Discrepancy (MMD), which is a standard distribution distance metric and has been widely used in many other transfer learning tasks. In order to learn domain-invariant features, we must design a new loss function. The loss function is composed of two parts, a classification loss and an MMD loss. The classification loss makes sure that the learned features are discriminative. In our experiments, we use the cross-entropy loss by default; however, an attacker can also select other loss functions that are specific to deep learning-based SCA. The MMD loss can be regarded as a constraint term with a penalty parameter lambda. The lambda here behaves similarly to the penalty parameter in L1 and L2 regularization. As for the network used for fine-tuning, there are two main differences between classic CNN models and our architecture. First, our fine-tuning network receives two batches of traces in each training step: one batch of labeled traces acquired from the profiling device, and another batch of unlabeled traces acquired from the target device. The second difference is that we have to decide where to calculate the MMD loss in our network.
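To make the combined loss concrete, here is a minimal numpy sketch, not the paper's exact implementation: it assumes an RBF-kernel MMD estimator, a hypothetical penalty parameter `lmbda`, and that `feat_src`/`feat_tgt` are the feature vectors of the profiling and target batches.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel matrix between the rows of x and y.
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(source, target, sigma=1.0):
    # Biased empirical estimate of the squared Maximum Mean Discrepancy
    # between the source-batch and target-batch feature distributions.
    return (rbf_kernel(source, source, sigma).mean()
            + rbf_kernel(target, target, sigma).mean()
            - 2 * rbf_kernel(source, target, sigma).mean())

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true labels (classification loss).
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fine_tune_loss(probs, labels, feat_src, feat_tgt, lmbda=1.0):
    # Classification loss keeps the features discriminative; the MMD term
    # penalizes the discrepancy between source and target features.
    return cross_entropy(probs, labels) + lmbda * mmd2(feat_src, feat_tgt)
```

In practice the MMD term would be computed on the activations of the adaptation layers and backpropagated with the rest of the network; this sketch only shows the shape of the objective.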
In fact, previous works have shown that deep features transition from generic to task-specific as one goes up the layers of a deep network. In other words, the transferability of the hidden representations tends to drop significantly in higher layers with increasing domain discrepancy. Therefore, we decide to minimize the MMD loss on the fully connected layers. The convolutional block of the network is still trainable during the fine-tuning phase to further adapt it to the target domain. We make the convolutional block trainable because we expect it to learn shift-invariant features in case the target traces are not well aligned. To evaluate our methodology, we consider four datasets covering the main types of side-channel attack scenarios. The XMega dataset is created from eight XMega chips by measuring the power consumption while running an unprotected AES algorithm. The SAKURA dataset is created from three SAKURA-G evaluation boards, which corresponds to an unprotected hardware implementation of AES on FPGA. We also use the well-known ASCAD dataset to show the potential of our method in dealing with noise and countermeasures. Finally, the XMega EM dataset provides electromagnetic measurements collected from the eight XMega chips. We use the same probe for all chips but at different positions and distances, and we will show how our method helps in such a scenario. The picture on the left shows the eight XMega chips. We initialize each device with a different secret key. Then we calculate the signal-to-noise ratio of each device, characterizing the leakage of the S-box output. We can conclude from the results that the leakage differs, and each board has its own leakage characteristics. We notice that device 4 is apparently shifted in time. This could be explained by the imprecise internal clock, because we expect to have 625 points in 10 clock cycles, but we only observe 616 points for device 4. We first show the results on the XMega dataset.
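The signal-to-noise ratio mentioned above is the standard SCA definition: the variance of the class-conditional means divided by the mean of the class-conditional variances, computed per sample point. A small numpy sketch, assuming `labels` are the intermediate values (e.g. S-box outputs) and `traces` is a trace matrix:

```python
import numpy as np

def snr(traces, labels):
    # Per-point signal-to-noise ratio: Var over classes of the class means
    # ("signal") divided by the mean over classes of the class variances
    # ("noise"). Classes are e.g. the S-box output values.
    classes = np.unique(labels)
    means = np.array([traces[labels == c].mean(axis=0) for c in classes])
    variances = np.array([traces[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / variances.mean(axis=0)
```

Sample points where the SNR peaks are the points of interest where the device leaks the targeted intermediate value.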
We can compare the performance of the pre-trained models and the fine-tuned models. NT here denotes the number of attack traces needed to successfully recover the key. For pre-trained models, we can observe that although the value of NT for single-device attacks, given on the diagonal, is very small, it varies widely in the case of cross-device attacks. In fact, the bad performance of cross-device attacks can also be explained by overfitting. In other words, the model trained on the profiling device cannot be safely generalized to the attack device. We therefore draw the learning curve when we train the networks. The blue and green curves are the training loss and validation loss, respectively, on the profiling device. The red curve is the test loss on the attack device. On the profiling device, overfitting is not observed, because the training loss and validation loss are highly consistent. However, on the target device, the test loss increases rapidly, which means the model is overfitted to the profiling device. This kind of overfitting is hard to identify, because in realistic attacks, labeled attack traces are unavailable and cannot be used to calculate the test loss. After fine-tuning, the performance of the cross-device attack is very much improved. We also draw the learning curve when we fine-tune the network. The curve in blue is the MMD loss that we want to minimize in the loss function. The curve in red is the test loss on the target device. We can conclude from the results that minimizing the MMD loss can effectively reduce the test loss and improve the generalization of the models. Apart from power analysis, EM-based side-channel attacks are becoming more and more popular. We note that EM measurements are very sensitive to probe placement. However, when we consider the realistic profiled attack scenario, the probe must be moved from the profiling device to the target device.
So, there is always a slight difference in the probe placement caused by human error, due to the position, distance and rotation. To investigate the impact of human error, we performed more cross-device experiments on the XMega EM dataset. We can observe that the NT matrix is very similar to the results of the XMega dataset. As we expected, the fine-tuned models outperform the pre-trained models significantly. Again, we see that the evolution of the MMD loss is highly consistent with the test loss, which confirms our previous results. Unlike the above-described datasets, the SAKURA dataset provides measurements of an unprotected hardware implementation of AES on FPGA. Since the signal-to-noise ratio of this dataset is relatively small, our pre-trained models require around 1,000 traces to successfully recover the key of the same device. When we apply the pre-trained models to other devices, the required number of traces is likely to double. We can observe that all the cross-device experiments get improved after applying our method. Most fine-tuned models achieve almost the same performance as using the same device for attacking. Therefore, our approach is also suitable and efficient for hardware implementations. Apart from different devices, variance in implementations can also lead to bad performance. Based on the ASCAD dataset, we simulate different implementations by adding artificial countermeasures or noise to the original dataset. After the simulation, we train the same model on the original dataset and evaluate its performance on the deformed dataset. This experiment simulates a complex attack scenario in which the target device is treated as a black box that can turn on side-channel countermeasures. As we can see, Gaussian noise distorts the shape of the original traces in the amplitude domain, while desynchronization and clock jitter add randomness in the time domain. Here we can compare the performance of the pre-trained models and fine-tuned models under Gaussian noise.
The dotted lines denote the pre-trained models and the solid lines denote the fine-tuned models. It is obvious that the fine-tuned models far outperform the pre-trained models. We can infer from the results that a CNN may not generalize well if only clean traces are fed to the network. However, fine-tuning with a small number of unlabeled noisy traces can elicit the power of CNNs and train the network to learn domain-invariant features. Here is the computation cost of training and fine-tuning. The training time of each epoch is mainly determined by the size of the training dataset, the batch size and the length of the raw traces. We can observe that the epoch time for fine-tuning is approximately twice that of training. This is reasonable, since more traces are processed and an additional MMD loss is calculated in the fine-tuning phase. In addition, the fine-tuning time is still affordable. For example, if we run the fine-tuning phase for 15 epochs, this process can be completed within two minutes for all considered datasets. So, in order to further understand how the location of the adaptation layers affects the result, we conduct a series of experiments on the XMega dataset with different adaptation layers. We use a CNN whose classifier part has three fully connected layers. We first fine-tune the network using only a single layer, and then compare it with the result of using all three layers. An obvious observation is that our method still works even if only a single layer is used for minimizing the MMD loss. Another observation is that the deeper the layer, the more difficult it seems to learn domain-invariant features. This is reasonable, since the features obtained in higher layers depend greatly on the specific dataset and are not safely transferable to new domains. Still, using all three layers of the classifier part is a good trade-off, which usually brings better results than using a single adaptation layer.
The hyperparameter lambda in the loss function determines how strongly we would like to confuse the source and target domains. Intuitively, setting lambda too small can cause the MMD regularizer to have no effect on the learned representation, but setting lambda too large will regularize too heavily and may result in a degenerate representation in which all features are too close together. Although there is usually a wide range of lambda in which the pre-trained model gets improved, a good empirical choice is to start with a relatively small value, especially when the signal-to-noise ratio of the dataset is small. A smaller value of lambda means that the optimizer should put more effort into the classification task. If we observe that the reduction of the MMD loss is not significant or is too slow, we can gradually increase the value of lambda to speed up the fine-tuning process. Although fine-tuning with the MMD loss helps us obtain a robust model, we need a set of attack traces to estimate the MMD. Despite the fact that querying multiple unlabeled traces from the target device is not a strong assumption, it is still meaningful to figure out how many traces are appropriate in practice when we fine-tune the network. Therefore, we conducted a series of experiments with the number of traces varying from 100 to 900. It can be observed that 100 traces are sufficient for the fine-tuning phase. This is not surprising, because 100 unlabeled traces can already provide enough information to distinguish the target domain from the source domain. We can also conclude that using more traces leads to a more stable and robust fine-tuned model. This is reasonable, since more traces help us obtain a more precise estimate of the MMD. So, now we are coming to the conclusion. This work focuses on addressing the open question of portability in profiled side-channel attacks using transfer learning techniques.
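One hypothetical way to realize the "start small, then gradually increase lambda" heuristic described above (this schedule and its parameters are illustrative, not from the paper):

```python
def lambda_schedule(epoch, lmbda_start=0.1, lmbda_max=1.0, growth=1.5):
    # Start with a small penalty so the optimizer focuses on the
    # classification task, then grow lambda geometrically per epoch,
    # capped at lmbda_max, to speed up MMD minimization later on.
    return min(lmbda_max, lmbda_start * growth**epoch)
```

The growth factor would be tuned by watching the MMD-loss curve: if it plateaus early, a faster growth rate pushes the optimizer toward domain confusion sooner.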
We introduce a fine-tuning phase that utilizes the domain discrepancy information to enhance the pre-trained model. Although CNNs are known to be robust against misalignment, we show that a CNN may not generalize well if only clean traces are fed to it. However, after fine-tuning with the MMD loss, our network is able to learn domain-invariant features. Besides, this approach does not require multiple profiling devices, target-specific preprocessing, or any secret information about the target device. So, thank you for your attention, and I welcome you to read our paper to get all the details. If you have any questions, feel free to ask. Thank you.