Bonjour, I'm Gabriel, a Ph.D. student at the Laboratoire Hubert Curien and Thales ITSEF, and today I will introduce a methodology for generating suitable CNN architectures in order to perform efficient profiling attacks. This is joint work with Lilian Bossuet and Amaury Habrard from the Laboratoire Hubert Curien and Alexandre Venelli from Thales ITSEF.

But first, before starting the presentation, what is the point of side-channel attacks? A side-channel attack is a class of cryptographic attacks in which an adversary tries to exploit vulnerabilities of a system by analyzing its physical properties. To clearly understand this concept, let's look at an example. Assume an adversary has access to a physical device on which a cryptographic algorithm like AES is implemented, and that she can configure the secret key and the plaintext to perform encryptions. Obviously, the goal of the adversary is to recover the sensitive information manipulated during the encryption. To perform a side-channel attack, at least one probe and one oscilloscope are needed, so that during the encryption the adversary can capture a physical trace that directly depends on the sensitive information being manipulated. Because the adversary can configure the plaintext and the secret key, she can perform as many encryptions as she wants to generate physical traces and find leakages related to the secret she wants to retrieve.

Typically, in deep learning-based side-channel attacks, the adversary trains a neural network to automatically match a physical trace with the correct target sensitive value. In this kind of attack, two phases can be highlighted. First, during the profiling phase, the adversary trains a network to predict the sensitive intermediate value, which she knows, from the physical traces generated as described on the previous slide. Once this phase is performed, the adversary can predict the intermediate variable on a similar target device containing a secret she wishes to retrieve. Then, during the attack phase, the adversary generates a new set of physical traces from the target device and computes a score related to each key hypothesis. Unfortunately, one physical trace is usually not enough to retrieve the correct target value, so the adversary has to compute the score associated with each key hypothesis over multiple traces and finally aggregate all the predictions to recover the secret information.

In comparison with classical profiling attacks, the deep learning approach is useful because it reduces the preprocessing phase and still performs well even when traditional countermeasures are implemented, such as desynchronization or dummy operations. However, due to the black-box nature of neural networks, it can be challenging to explain and interpret their decision making and, consequently, to generate suitable architectures. Typically, the network architectures used in side-channel analysis are borrowed from image classification problems, like VGG-16 for example. So through our paper, we want to understand the impact of the parameters that compose the convolutional part of a CNN and propose a new approach to generate more suitable architectures for deep learning-based side-channel analysis. This is our question: how to select model hyperparameters for efficient CNN architectures in profiling attacks? But before answering this question, we have to understand how a CNN works.
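As a side note, the score aggregation of the attack phase described above can be summarized with a minimal NumPy sketch; the function and variable names here are illustrative and not taken from our code:

```python
import numpy as np

def aggregate_key_scores(predictions, plaintexts, key_hypotheses, leakage_fn):
    """Aggregate per-trace model outputs into one log-likelihood score per key guess.

    predictions   : (n_traces, 256) softmax outputs of the trained network
    plaintexts    : (n_traces,) plaintext byte used for each attack trace
    key_hypotheses: iterable of candidate key bytes (0..255)
    leakage_fn    : maps (plaintext, key) to the label predicted by the network,
                    e.g. lambda p, k: sbox[p ^ k]  (sbox assumed to be defined)
    """
    n_traces = predictions.shape[0]
    scores = np.zeros(len(key_hypotheses))
    for i, k in enumerate(key_hypotheses):
        labels = np.array([leakage_fn(plaintexts[t], k) for t in range(n_traces)])
        # Sum of log-probabilities over all traces: a single badly classified
        # trace rarely dominates the final ranking.
        scores[i] = np.sum(np.log(predictions[np.arange(n_traces), labels] + 1e-36))
    return scores

# The key hypothesis with the highest aggregated score is output as the recovered key byte.
```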
So, as demonstrated by Eleonora Cagli et al. at CHES 2017, CNNs can be used to reduce the desynchronization effect while remaining relatively efficient. Typically, the convolutional part of a CNN is known as the feature selection part. In our work, we assume that once the convolutional part precisely locates the points of interest, the classification part, which recombines the features to accurately classify each input, will be easier to configure. So we decided to focus on the convolutional part only, in order to find a good trade-off between the complexity of the convolutional part and the detection of the points of interest.

However, how can we construct an efficient convolutional part? One way is to understand how the parameters that compose the convolutional part impact the feature detection, and one solution is to visualize their behavior. Typically, in the deep learning side-channel approach, gradient visualization is used as a tool to evaluate the global performance of a network: by computing the gradient of the model, we can visualize the points that are considered relevant by the network. Unfortunately, this tool cannot be used to evaluate the convolutional and the classification parts independently. So we decided to adapt classical visualization tools to this context, namely weight visualization and heat maps.

First, in a neural network, the weights reveal how important a feature is for accurately classifying an input. For example, if the weights associated with the relevant information increase, the network will be more confident in the detection of the leakage, and its exploitation will be easier in the classification part. By visualizing the weights of the first fully connected layer, we can clearly identify which features are retained by the network, and we can independently evaluate the impact of the parameters that compose the convolutional part, such as the filter size, which is unfortunately not possible with the gradient visualization tool. But even if weight visualization is a useful tool, we still cannot understand which features are selected by each filter. One solution is to plot the heat maps. The heat maps plot the convolution between an input and each filter, so that we can visualize the activations induced by each filter. This can be useful to adapt a network according to how the features are selected. So these two tools, weight visualization and heat maps, are useful to evaluate how the convolutional parameters affect the detection of the points of interest in this context.

But in order to explain our methodology in depth, I need to give you a little bit of background on how the convolutional part works. In a classical CNN architecture, the convolutional part of the network is composed of multiple convolutional layers and multiple pooling layers. First, a convolutional layer aims at extracting relevant information from the input in order to help the decision making. This layer is configured with multiple filters of size n and performs a series of convolution operations on the physical trace, such that during the training process the filter parameters are automatically set in order to identify the features that increase the efficiency of the classification.
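To make these two visualization tools concrete, here is a short sketch of how they could be obtained with Keras, assuming a functional model whose first Dense layer follows the flattened convolutional output; the helper names are mine and purely illustrative:

```python
import numpy as np
from tensorflow import keras

def first_dense_weights(model):
    """Return one importance value per flattened feature, taken from the
    input weights of the first fully connected layer: their magnitude
    indicates which features the classification part actually relies on."""
    for layer in model.layers:
        if isinstance(layer, keras.layers.Dense):
            kernel, _bias = layer.get_weights()        # shape: (n_features, n_units)
            return np.mean(np.abs(kernel), axis=1)
    raise ValueError("no Dense layer found")

def heat_maps(model, traces, conv_layer_name):
    """Output of every filter of a given convolutional layer, i.e. one
    'heat map' per filter, evaluated on the provided traces."""
    sub_model = keras.Model(inputs=model.input,
                            outputs=model.get_layer(conv_layer_name).output)
    return sub_model.predict(traces)                   # (n_traces, length, n_filters)
```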
Here, in our example, sliding this window across the input, the physical trace, helps to identify the features extracted by each filter. Thanks to this visualization tool, we investigated the impact of the filter size on the feature detection using an unprotected AES implemented on a ChipWhisperer board. With the help of the visualization tools, we observed that increasing the length of the filters spreads the same relevant information over the convolved time samples. And because the same relevant information is spread over the convolved trace, its exploitation can become difficult during the classification part. So the larger the filter, the less confident the network is in its ability to precisely locate the relevant information, and the training time needed to reach the final performance can dramatically increase for large filters. This phenomenon can be intuitively explained by the number of parameters, which increases with the filter size. So by reducing the length of the filters, we accelerate the training phase without significantly degrading the performance of the network; but obviously, depending on the pattern we want to capture, this length has to be adjusted. However, note that our work only focused on the feature detection, and no investigation was made to understand how this choice impacts the ability of the classification part to recombine the relevant information. Indeed, this choice is not necessarily the most optimized for all types of classification parts, and a further study should be made to address this issue.

Once the convolutional layer has identified the relevant features, the pooling layer is used to reduce the dimension of the convolved traces such that the most relevant information is preserved. Typically, two kinds of pooling operation can be employed: average pooling and max pooling. First, average pooling has the particularity of preserving the relevant information shared between consecutive convolved samples. One issue with this operation, however, is that average pooling gives noisy samples the opportunity to impact the exploitation of the relevant information, so adding too many average pooling layers can reduce the ability to precisely locate the relevant information. On the other hand, max pooling is not affected by these noisy samples, but it tends to keep only the most significant information over the convolved samples. For example, in the worst-case scenario, if two consecutive relevant points are included in the same max pooling operation, the network discards the less significant one, which tends to reduce the number of leakages and makes the recombination performed by the classification part more difficult.

This explanation of the pooling layers is helpful to evaluate the impact of the convolutional blocks on the detection of the points of interest. Starting from the same ChipWhisperer dataset, we observed that adding a convolutional block reduces the distance between the points of interest by a pooling layer factor called the pooling stride. So if a desynchronization effect appears, adding convolutional blocks reduces the desynchronization effect.
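The difference between the two pooling operations can be reproduced on a toy example; the trace values below are made up purely for illustration:

```python
import numpy as np

def pool1d(x, size, mode="max"):
    """Non-overlapping 1D pooling (stride == pool size) on a single trace."""
    x = x[: len(x) // size * size].reshape(-1, size)
    return x.max(axis=1) if mode == "max" else x.mean(axis=1)

# Toy convolved trace: two neighbouring leakage peaks (8.0 and 6.0)
# and one strongly noisy sample (-5.0).
trace = np.array([8.0, 6.0, 0.1, -5.0, 0.2, 0.3])

print(pool1d(trace, 2, "max"))   # [8.0  0.1  0.3]   -> the weaker peak (6.0) is discarded
print(pool1d(trace, 2, "mean"))  # [7.0 -2.45 0.25]  -> both peaks contribute, but the noisy
                                 #                      sample corrupts its pooling window
```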
However, if no desynchronization appears, using more than one convolutional block seems irrelevant, because adding too many convolutional blocks also impacts the detection of the relevant information, as you can see in our example. So a good trade-off needs to be found between detecting the desynchronization effect and preserving the maximum amount of relevant information related to the relevant points.

Once these concepts, the pooling layers and the convolutional layers, are understood, we can introduce a new methodology for generating suitable CNN architectures when a desynchronization effect occurs. To understand this methodology, let's look at an example. Assume an adversary has access to a set of traces whose SNR is defined as follows: here we have five SNR peaks, and each of them is spread over 100 samples. Obviously, what the adversary wants is to generate a suitable CNN architecture such that the network focuses its interest on the relevant points while limiting the impact of the desynchronization effect and of the irrelevant information induced by the noisy samples.

First, the adversary has to configure a first convolutional block: reducing the length of the filters helps the network to locate the relevant extrema of a trace and thus to extract the secret information more easily. We suggest setting the length of the filters of the first convolutional block to one, so that the feature extraction is facilitated and the dimensionality of the intermediate convolved traces is reduced. Then, the second convolutional block tries to detect the desynchronization effect. By setting the filter size to the desynchronization amplitude divided by the pooling stride factor, we focus the interest of the network on the desynchronization effect, to make sure the points of interest are detected. The network thus obtains a global evaluation of the leakages by concentrating its detection on the desynchronization effect rather than on the leakage itself. Finally, a pooling layer is used to reduce the trace dimension as much as possible. Here, thanks to the heat maps, we can check that the SNR peaks are perfectly identified by the filters: by comparing the activity detected by the convolution operation with the SNR peaks, we can argue that the features are correctly selected by the filters. Furthermore, as you can see, the trace dimension is drastically reduced, because we want to concentrate the spread information into a single sample.

Finally, the third convolutional block aims at reducing the dimensionality of each trace in order to focus the network on the relevant points and to remove any irrelevant ones. Hence, we force the network to focus its interest on the sensitive information, and applying this technique limits the desynchronization effect because we force the network to concentrate the initially desynchronized points of interest into a single sample. This property can be visualized once again thanks to the heat maps. In our simulation, we identified five SNR peaks, as already mentioned, and therefore we want to reduce the trace dimension to five samples at the end of our convolutional part, such that each sample contains the information related to a single leakage.
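Putting the three blocks together, a possible Keras sketch of this recipe could look as follows; the filter counts, pooling sizes, activation function and the small dense head are assumptions of mine for illustration, not values fixed by the talk:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(trace_len, desync, n_peaks, n_classes=256):
    """Sketch of the three-block recipe described above."""
    inp = keras.Input(shape=(trace_len, 1))

    # Block 1: length-1 filters ease the feature extraction, and a first
    # pooling layer reduces the dimension of the intermediate convolved traces.
    pool1 = 2
    x = layers.Conv1D(8, kernel_size=1, padding="same", activation="selu")(inp)
    x = layers.AveragePooling1D(pool_size=pool1)(x)

    # Block 2: filter length = desynchronization amplitude / previous pooling
    # stride, so the filters cover the whole window in which a leakage can move.
    pool2 = 4
    x = layers.Conv1D(16, kernel_size=max(1, desync // pool1),
                      padding="same", activation="selu")(x)
    x = layers.AveragePooling1D(pool_size=pool2)(x)

    # Block 3: pool aggressively so that the convolved trace ends up with
    # roughly one sample per SNR peak.
    remaining = trace_len // (pool1 * pool2)
    x = layers.Conv1D(32, kernel_size=1, padding="same", activation="selu")(x)
    x = layers.AveragePooling1D(pool_size=max(1, remaining // n_peaks))(x)

    # Classification head (out of scope in the talk, kept minimal here).
    x = layers.Flatten()(x)
    x = layers.Dense(10, activation="selu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inp, out)
```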
Here, for example, the first green point defines the relevant information contained in the first part of the trace; and because we concentrate the network on the leakages, the second sample contains the information related to the second peak, and so on and so forth: the other points define the information related to the remaining relevant points. Using this technique forces the network to reduce the desynchronization effect by concentrating its interest only on the sensitive information. And finally, as I previously mentioned, the rest of the CNN architecture, namely the fully connected layers, which define the ability of the network to recombine the relevant information, is out of our scope.

To validate our methodology on the convolutional part, we decided to apply it to the most common datasets used in the deep learning side-channel approach when desynchronization occurs. Our methodology was evaluated on two public datasets. The first dataset is called AES_RD; this is an AES implementation protected with a random delay countermeasure, and because of this countermeasure, classical profiling attacks like template attacks are more difficult to perform due to the misalignment. The second dataset is called ASCAD; this is the first open database specified to serve as a common basis for further work on deep learning-based side-channel analysis. Once again, the target device is an AES implementation, this time protected with a first-order masking and with different levels of desynchronization.

To accurately evaluate the performance of our networks, we use a metric called N_tGE, which defines the number of traces needed to reach a constant guessing entropy of one. To be confident in our experiments, we performed 100 attacks and used the averaged N_tGE as our final metric. Our goal here is to counteract the desynchronization effect by finding models whose performance is as close as possible to that of models trained on synchronized traces.

When our methodology is applied to the AES_RD dataset, the impact of the desynchronization effect is drastically reduced while the related performance increases. In comparison with the state-of-the-art results, the overall performance is similar while the network complexity and the resulting training time are greatly reduced, such that the network we train is 40 times smaller than the network introduced in the state of the art. However, one remark should be made regarding our CHES paper: the results provided on the AES_RD dataset have been revised following a report on our work.

Then, for the ASCAD dataset, two levels of desynchronization are considered: a first one with a maximum desynchronization amplitude of 50, and another one with a maximum desynchronization amplitude of 100. First, on the ASCAD dataset with a maximum random delay of 50, we observe that our network is far less complex than the architecture introduced in the original ASCAD paper. By reducing the network complexity, we train our network much faster. In addition, following the N_tGE value, we observe that we converge toward a constant guessing entropy of one with only around 240 traces, while the original paper could not converge toward the same performance with 5,000 traces. Comparing this performance with ASCAD when no desynchronization occurs, we see that our methodology drastically reduces the impact of the desynchronization effect while keeping performance similar to the synchronized case. So through this experiment, we can conclude that our methodology limits the effect of the random delay countermeasure.
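For completeness, here is a simplified NumPy sketch of how N_tGE can be estimated over repeated attacks; the 256 key hypotheses correspond to a single key byte, and the function and argument names are illustrative assumptions:

```python
import numpy as np

def n_t_ge(predictions, plaintexts, true_key, leakage_fn, n_attacks=100, seed=0):
    """Estimate N_tGE: the number of attack traces needed for the guessing
    entropy (mean rank of the true key over n_attacks shuffled attacks) to
    reach one. Simplified sketch: it returns the first index where GE <= 1."""
    rng = np.random.default_rng(seed)
    n_traces = predictions.shape[0]
    log_p = np.log(predictions + 1e-36)
    ranks = np.zeros((n_attacks, n_traces))
    for a in range(n_attacks):
        order = rng.permutation(n_traces)
        scores = np.zeros(256)
        for i, t in enumerate(order):
            # Add, for every key guess, the log-probability of the label it implies.
            labels = np.array([leakage_fn(plaintexts[t], k) for k in range(256)])
            scores += log_p[t, labels]
            ranks[a, i] = np.count_nonzero(scores > scores[true_key]) + 1
    ge = ranks.mean(axis=0)              # guessing entropy vs. number of traces
    hits = np.where(ge <= 1)[0]
    return int(hits[0]) + 1 if hits.size else None   # None: GE = 1 never reached
```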
And this observation is confirmed on ASCAD when a maximum random delay of 100 is considered. Indeed, the original paper does not reach a constant guessing entropy of one with fewer than 5,000 traces, while applying our methodology generates a much more powerful network that does not seem to be impacted by the desynchronization effect. Hence, once again, we can conclude that our methodology limits the effect of the random delay countermeasure.

If you want to reproduce our results, our source code is available on GitHub, and currently more than 500 unique visitors and 100 unique downloads can be noted. This shows a strong demand for open-source code in deep learning-based side-channel analysis. Yet, over the last five years, while the number of papers dedicated to the deep learning side-channel approach has increased a lot, only 10% of the published research papers provide open-source code for reproducible experiments. So we deeply encourage all researchers to promote the open-source approach and make deep learning-based side-channel analysis great.

To briefly conclude on our results, we highlight that the networks needed to perform side-channel attacks do not have to be complex to retrieve the points of interest and perform efficient attacks. While our work only focused on low-dimensional traces, our observations were recently confirmed on much larger traces. Obviously, we do not claim that our methodology is the most optimized in all cases; indeed, using additional techniques such as data augmentation, like SMOTE, or adding noise can be useful to improve our experimental results. Furthermore, the random shifting countermeasure does not seem secure anymore against side-channel attacks, and our conclusion is in agreement with the work provided by Eleonora Cagli et al. at CHES 2017. And finally, as already mentioned, we deeply encourage researchers to promote the open-source approach for making reproducible results. So that's it. Thank you very much for watching the video, and do not hesitate to contact me by email, or during the CHES event, if you have any questions; I'd be happy to answer them. Thank you very much.