Hi, I'm Lennert Wouters and together with Victor Arribas, I will be presenting our work titled "Revisiting a Methodology for Efficient CNN Architectures in Profiling Attacks". As is probably evident from the title of this presentation, our work revisits the methodology proposed by Gabriel Zaid and his co-authors in the first issue of the 2020 volume of the IACR Transactions on Cryptographic Hardware and Embedded Systems. We try to shed some light on some misattributions and misconceptions within their paper, and we try to relate our experiments to results from the time series classification field. Our work challenges some elements of the methodology proposed in their work, which should by no means be considered an attack on their original work. In fact, we want to congratulate Zaid and his co-authors on their work, which demonstrates that the datasets used in many deep learning for side channel analysis papers today can be easily profiled using very small networks, even though many papers still use networks that seem overly complex. We would also like to thank Gabriel Zaid for providing their implementation online. This made it possible for us to reproduce their findings, something we find to be missing in many other machine learning papers. In a similar fashion, we provide our code online, and we hope that some of you will use it to carry out your own experiments and maybe even demonstrate that we are wrong. Finally, if you're not familiar with the work by Gabriel Zaid, we suggest you read their paper.

Before we look at our experiments and findings, we want to provide a brief recap of the network design methodology proposed by Zaid et al. As with any convolutional neural network, this network starts at the input. In our application, the inputs are samples from a side channel measurement. The convolutional neural network consists of convolutional blocks. Each block contains a 1D convolution, a batch normalization layer, and an average pooling layer. The first convolutional block employed in this network architecture uses filters with a size of 1, an unconventional choice we will explore in more detail. Another convolutional block is added in the case of a misaligned dataset; in this case, the filter size is set to half of the maximum misalignment present in the dataset. Yet another convolutional block can be added to reduce the dimensionality. Finally, a flatten layer is followed by one or more fully connected layers and the output classification layer.

As mentioned earlier, the first convolutional block uses a filter size of 1. While convolutions with filters of size 1 are sometimes used in Inception-type networks, they are not commonly used at the input of a one-dimensional network. Each filter of size 1 has an associated weight and an associated bias. In this scenario, each sample of the trace is multiplied by the same weight, the bias is added, and the result is passed to an activation function. Intuitively, if the convolutional block uses four filters of size 1, it will output four scaled and shifted representations of the network's input. Having four scaled traces instead of one trace at the input did not strike us as being very useful. However, simply removing this layer from the provided networks makes it much more difficult for the network to converge to a working solution. Therefore, we conjecture that this first layer can be omitted by properly preprocessing the input data, a technique which is commonly used in the time series classification domain.
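To make this recap concrete, here is a minimal Keras sketch of this style of architecture. The concrete filter counts, pool sizes, and dense-layer widths below are illustrative assumptions, not the exact values from the original paper; note how the first block uses a kernel size of 1.

```python
# Minimal sketch of a Zaid et al. style CNN in Keras; the concrete
# filter counts, pool sizes and dense units are illustrative
# assumptions, not the exact values from the original paper.
from tensorflow.keras import layers, models

def build_recap_model(trace_length=700, num_classes=256):
    inputs = layers.Input(shape=(trace_length, 1))

    # First convolutional block: filters of size 1, so each input
    # sample is merely scaled and shifted before the activation.
    x = layers.Conv1D(4, kernel_size=1, activation='selu', padding='same')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.AveragePooling1D(pool_size=2)(x)

    # Optional additional block, e.g. for misaligned datasets or to
    # further reduce the dimensionality before the dense layers.
    x = layers.Conv1D(8, kernel_size=25, activation='selu', padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.AveragePooling1D(pool_size=4)(x)

    # Flatten, one or more fully connected layers, and the classifier.
    x = layers.Flatten()(x)
    x = layers.Dense(10, activation='selu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)
```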
Filters of size 1, as used in the proposed architecture, could be considered network-internal preprocessing units: the filters learn to scale the input into a range that makes it easier for the network to converge. Looking at the time series classification domain, we could not find many works on network-internal preprocessing units. However, it is common practice to preprocess the input data. A good introduction to the field of time series classification, and an intuitive explanation as to why preprocessing is useful, can be found in the paper accompanying the UCR time series classification archive. This made us wonder if we could simply remove the first convolutional layer with filters of size 1 and instead preprocess the raw traces. To that end, we train the original networks on traces that are preprocessed using multiple techniques. We also perform the same experiments on modified networks in which the first convolutional layer is removed. Each combination of network and preprocessing strategy is trained 10 times on different splits of the training data. We do this to ensure that the networks and training procedure yield stable results. Both the original and modified models are trained in the same way, with the same number of epochs.

This figure shows the results for each of the models trained on the datasets in which the traces are nicely aligned in the time domain. On the left you can see the attack results from multiple training runs of the original networks, and on the right you can see the results for the simplified models in which we removed the first convolutional layer. In addition to using the filters of size 1, Zaid et al. chose to use feature scaling between 0 and 1 as a preprocessing strategy for the aligned datasets. This is not a recommended approach, but you can clearly see that their first convolutional block managed to compensate for this choice. As expected, using this preprocessing strategy with our simplified models does not yield good results. In the paper we provide some additional experiments that demonstrate that the use of batch normalization allows the original models to converge more easily without a proper preprocessing strategy. You can also see that we get stable results across all training runs when using feature scaling between minus 1 and 1 or feature standardization. This is not the case for the original models used on the ASCAD dataset without desynchronization.

Next, we can look at the results when training the models for the misaligned datasets using different preprocessing strategies. In this case the preprocessing is applied horizontally, or on a per-trace basis. We can see that neither our models nor the original models perform consistently over multiple training runs when no preprocessing is applied. Applying horizontal scaling between minus 1 and 1 or horizontal standardization seems to benefit both the original and the modified models in most cases. Interestingly, we get slightly worse performance on the AES_RD dataset. This is likely because we removed the non-linearity of the initial convolutional block. Nevertheless, we can easily obtain the same results as the original models by simply training for a few more epochs or slightly adapting the network. Our experiments demonstrate that the first convolutional layer can be omitted if a proper preprocessing strategy is applied. Not only do we obtain networks that perform as well as the original models, we also obtain more stable results over multiple splits of the training data.
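The preprocessing strategies compared here are standard; as a minimal NumPy sketch (the function names and the axis convention are our own), vertical scaling normalizes each sample index across all traces, while horizontal scaling normalizes each trace individually:

```python
import numpy as np

def scale_min_max(traces, lo=-1.0, hi=1.0, axis=0):
    # Feature scaling into [lo, hi]. axis=0 scales vertically (per
    # sample index over all traces), axis=1 horizontally (per trace).
    t_min = traces.min(axis=axis, keepdims=True)
    t_max = traces.max(axis=axis, keepdims=True)
    return lo + (hi - lo) * (traces - t_min) / (t_max - t_min)

def standardize(traces, axis=0):
    # Feature standardization: zero mean and unit variance.
    return (traces - traces.mean(axis=axis, keepdims=True)) / \
           traces.std(axis=axis, keepdims=True)

# traces has shape (num_traces, num_samples). For the aligned datasets
# we scale vertically; for the misaligned ones we scale per trace:
# aligned = scale_min_max(traces, axis=0)
# misaligned = standardize(traces, axis=1)
```

In a real attack, the vertical scaling parameters would of course be derived from the profiling traces and then reused on the attack traces.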
Compared to some of the earlier works on machine learning-based side channel analysis, the models used by Zaid et al. do not contain a lot of trainable parameters. However, as an added benefit, our models in which the first layer is removed contain significantly fewer trainable parameters. This additional reduction could be beneficial when targeting larger datasets.

Zaid et al. present two different visualization techniques, namely weight visualization (example on the top) and heat maps (example on the bottom). These two techniques aim to show how neurons at intermediate layers behave. The problem with these visualization techniques is that the information conveyed by the neurons at a given layer does not directly translate to the points of interest, or leaking points, of the input trace. Additionally, the authors try to draw conclusions on how well the network will work based on the absolute values of these metrics, hinting that smaller filters will lead to better networks as they have higher absolute values on the weights. This leads the reader of the paper to believe that filters have to be small to obtain a network that works better. We disagree with the way these visualization techniques are used and with the conclusions that are drawn from them. We will explore multiple experiments later on that indicate that bigger filters do not necessarily perform worse.

Another indication that larger filters could, in fact, help the networks to learn more complex patterns is provided by the field of time series classification. Hassan Ismail Fawaz and his co-authors adapted the notion of receptive field from the image domain to the time series classification field. The receptive field is a theoretical value that quantifies the maximum field of view of a neural network in a one-dimensional space. In other words, it refers to the number of input samples that a weight from further layers in the network sees, or how many input samples affect one weight deeper in the network. A larger receptive field can allow the network to learn more complex patterns, which can be advantageous. A larger receptive field can be obtained by using larger filters and more layers. These notions do not align well with the concept of entanglement suggested in the paper by Zaid et al.
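As a rough illustration of this notion, the receptive field of a stack of one-dimensional convolution and pooling layers can be computed with the usual recurrence; the layer sizes in the example are illustrative assumptions:

```python
def receptive_field(layer_specs):
    # layer_specs is a list of (kernel_size, stride) tuples, ordered
    # from input to output. Standard recurrence: the receptive field
    # grows by (kernel - 1) times the cumulative stride ("jump") of
    # all preceding layers.
    r, jump = 1, 1
    for kernel, stride in layer_specs:
        r += (kernel - 1) * jump
        jump *= stride
    return r

# Two blocks, each a size-25 convolution (stride 1) followed by
# average pooling of size 4 (stride 4):
print(receptive_field([(25, 1), (4, 4), (25, 1), (4, 4)]))  # -> 136
```

Both larger kernels and extra blocks grow this value, which is why the field expects deeper networks with bigger filters to see more of the input.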
To confirm our understanding of these theoretical concepts, we conduct multiple experiments, each time modifying only a single network parameter. We start off by evaluating the network's performance while changing the filter size. To do so, we kept all parameters untouched and only varied the filter size of the first convolutional block of the network. These experiments aim to clarify that, in contrast to what is suggested by Zaid et al., using bigger filters does not degrade the network's performance. From these results, it is clear that choosing a filter size that is too small results in a network which is unable to converge. On the other hand, excessively increasing the filter size does not seem to hurt the performance, in line with the notion of receptive field from the work of Ismail Fawaz et al. However, it is important to keep in mind that excessively increasing the filter size can increase the training time and could result in overfitting. From an attacker's perspective, it is more useful to have a model with larger filters if that means it is more likely to yield a good attack result. While training time can be considered a valid metric, it is often a one-time effort. In the black box setting, one might want to train a model that is more likely to succeed even though it takes longer to train. Training a simpler model which fails to converge does not provide you with much information in a black box setting: you would not know if you are using a bad model, a bad training strategy, or incorrect labeling. Thus, having a more robust model is advantageous. A working model architecture can always be pruned to obtain a smaller model if necessary. Moreover, the datasets analyzed in these works are nowhere near realistic datasets, which makes the difference in training time among different filter sizes negligible.

The methodology introduced by Zaid et al. additionally suggests the use of filters with a size equal to half the maximum horizontal shift, or misalignment, present in the dataset. This proposal sounded strange to us. In the computer vision domain, convolutional networks are considered to be translation invariant. For example, suppose your image classification or segmentation network is trying to find a cat in an input picture. The location of the cat, or the maximum distance between two cats, should not affect the outcome. As we found this idea to be rather counter-intuitive, we conducted multiple experiments on the misaligned datasets while varying the filter size. Our paper contains multiple experiments, but in this presentation we'll focus on the most striking example, the AES_RD dataset. The AES_RD dataset was provided by Coron and Kizhvatov as part of their CHES 2009 paper on random delay countermeasures. While we don't know the exact amount of delay in this dataset, we tried to get an idea of its value. To that end, we used a working model and the gradient visualization technique to determine the leaking samples in some individual traces. Comparing these gradient outputs, we can see that the misalignment in this dataset is rather large. In fact, it looks like the misalignment could be as big as 1000 samples. This would mean that, according to the methodology from Zaid et al., we should use a filter size of 500, but the examples provided by them use filters of size 50. Interestingly, training multiple models with different filter sizes on this dataset shows us that, in this particular case, a filter size as small as 2 is sufficient to tackle this dataset. Therefore, we believe that the best filter size is unlikely to be solely determined by the maximum amount of misalignment in the dataset.

Similarly, Zaid et al. seemed to suggest that it is best to minimize the number of convolutional layers in a network, as using more convolutional blocks would result in a network which is less confident in its feature detection. This is not in line with what we would expect from the receptive field metric. Additionally, in the machine learning domain, researchers have shown that training deeper networks, and thus networks with more layers, can result in better performance. For example, in other fields it is common to see ResNet architectures with 50 or more layers. Therefore, we again conduct a few experiments in which we vary the number of convolutional blocks. We can clearly see that if we do not include enough convolutional blocks, the networks fail to capture the leakage and hence the attack is not successful. On the contrary, in the first two examples we see that by adding an extra convolutional block, the network is now able to attack the dataset. Additionally, in the third example we can see that the more convolutional blocks we add, the better the network performs in the attack.
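As a sketch of how such single-parameter sweeps could be set up (the builder below and its default values are our own illustrative assumptions, not the exact experimental setup), one can parameterize the model by the filter size and the number of convolutional blocks and vary one at a time:

```python
# Illustrative builder for the single-parameter sweeps: vary either
# the kernel size or the number of blocks while keeping all other
# hyperparameters fixed. Default values are assumptions for the sketch.
from tensorflow.keras import layers, models

def build_sweep_model(trace_length, num_classes, kernel_size, num_blocks,
                      filters=8, pool_size=2):
    inputs = layers.Input(shape=(trace_length, 1))
    x = inputs
    for _ in range(num_blocks):
        # One convolutional block as in the recapped methodology.
        x = layers.Conv1D(filters, kernel_size, activation='selu',
                          padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.AveragePooling1D(pool_size=pool_size)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(10, activation='selu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)

# Vary only the filter size while keeping everything else fixed:
# for k in (1, 2, 25, 50, 100):
#     model = build_sweep_model(3500, 256, kernel_size=k, num_blocks=2)
```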
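For completeness, the gradient visualization technique we used above to locate the leaking samples can be sketched as follows, assuming a trained Keras model; the idea is that the absolute gradient of the loss with respect to the input highlights the samples the model relies on:

```python
# Sketch of gradient visualization for a trained model: compute the
# gradient of the loss for the correct label with respect to the
# input trace, and inspect where its magnitude is large.
import numpy as np
import tensorflow as tf

def gradient_visualization(model, trace, label, num_classes=256):
    # Shape the trace as a single-example batch: (1, num_samples, 1).
    x = tf.convert_to_tensor(trace.reshape(1, -1, 1), dtype=tf.float32)
    y = tf.one_hot([label], depth=num_classes)
    with tf.GradientTape() as tape:
        tape.watch(x)
        predictions = model(x, training=False)
        loss = tf.keras.losses.categorical_crossentropy(y, predictions)
    grad = tape.gradient(loss, x)
    # Samples with large absolute gradients influence the prediction
    # most; comparing several traces reveals the misalignment.
    return np.abs(grad.numpy().reshape(-1))
```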
Gabriel Zaid and his co-authors provided us with a starting point for a generalized methodology and made their implementations available online. Thanks to these publicly available implementations, we were able to independently reproduce their experiments. The networks proposed in their work do work well on all of the explored datasets. However, our experiments show that some elements, such as the first convolutional block, can be improved upon, and that some of the design guidelines must be reconsidered. An open question which remains is how this methodology will scale to more realistic datasets in which traces longer than a few thousand samples are considered. We believe that having a network design methodology can only be useful if we also understand the limitations of the approach.

When writing a machine learning paper, or any other paper, it is important not to confuse your reader. Overloading definitions for terminology that is already widely established is confusing. The paper titled "Troubling Trends in Machine Learning Scholarship" by Lipton and Steinhardt covers commonly made mistakes that we should all try to avoid. Similarly, many machine learning papers contain experiments that try to demonstrate that some novel idea works. However, those same papers rarely contain experiments that try to disprove their own hypotheses. We believe it is valuable to not only highlight how great your new ideas are, but to also demonstrate what their limitations are. Of course, I don't expect you to take my word for it, but maybe you will listen to François Chollet, the creator of Keras. Keras is the machine learning framework used in the vast majority of machine learning papers in the field of side channel analysis. According to François, you should spend at least 10% of your experimentation time on an honest effort to disprove your own thesis.

The application of deep learning in the context of side channel analysis is relatively new. However, there is a slightly more established field, namely time series classification. As mentioned during the presentation, a profiled side channel attack could be considered a specific application of time series classification. Therefore, it makes sense for us to keep track of the developments within the field of time series classification. Another thing we should ask ourselves is whether we can learn from the time series classification domain. For example, they use the UCR time series classification archive, a collection of 128 time series datasets. Any paper claiming to have come up with some novel method is expected to evaluate that method over all of these datasets. Maybe we can make our own side channel dataset archive, covering datasets from multiple sources and multiple difficulty levels. This is not an easy problem to tackle, as we need to agree on a common dataset format, establish baseline results, and find an entity that is willing to publicly host all of this data. Alternatively, we can try to submit one or more datasets which we believe to be representative to the UCR time series classification archive. That way, new time series classification papers could also contain results that are informative to us. As the models proposed in these papers are often evaluated on all 128 datasets, it is not unimaginable that they can be easily adapted to fit our application needs.
Feel free to contact us for more information, or join us for the live discussion during the virtual CHES 2020 conference. Thank you for taking the time to watch this presentation.