Hello, I'm Anh Tuấn Hoàng from Queen's University Belfast. I will be talking about our work on applying deep learning to side-channel analysis, in which traditional deep learning models are extended with the plaintext as an additional feature to improve attack efficiency. Our proposed CNN model can break multiple layers of side-channel countermeasures in an AES implementation.

This is the outline of our presentation. I will cover the background of side-channel analysis, convolutional neural networks and our target, the open ASCAD database; our attack model and the use of the plaintext as a feature; and then the details of our CNNP models, the experimental conditions, the attack results, and a discussion of the effectiveness of our design choices.

As you all know, when an electronic device operates, it can leak information through side channels such as power consumption, electromagnetic emissions, or computation time. Hence, even though cryptographic algorithms like AES are secure in theory, their implementations can leak the secret key via side channels. Side-channel attacks such as differential power analysis and correlation power analysis have been well known since 1996. More recently, it has been shown that machine learning can improve side-channel analysis and recover the secret key even when countermeasures are present.

Among the many artificial neural network architectures, we use CNNs for our research because a CNN can learn from unaligned data, so trace pre-processing steps can be removed. A number of layer types are used in a CNN model. The convolutional layers use a number of filters, whose parameters are learned from the training data, to analyse the influence of nearby sample points in a trace. For each point in the trace, a value is calculated from the filter using a convolution operation; the results are feature maps. Pooling layers progressively reduce the spatial size of the representation by selecting or combining neighbouring points in the feature maps. The fully connected layers combine all previous features or nodes together in a feed-forward neural network. Dropout layers are used to prevent overfitting during training by removing a number of random nodes from the training process at every epoch. The rectified linear unit (ReLU) activation function maps an input value to either 0 or itself, depending on whether the value is below 0 or not; it is used to overcome the vanishing-gradient problem. The softmax activation function handles the final classification.

To evaluate our CNN model, we target an open database developed by the National Cybersecurity Agency of France (ANSSI) called ASCAD. This database was created from an AES software implementation on an 8-bit AVR ATmega microcontroller. The implementation includes a side-channel countermeasure, with masks applied to the plaintext and to the S-box as shown by the equations on this slide. The side-channel information for ASCAD is recorded from the EM radiation emitted by the device, and the target is the third subkey, which is protected by the masking of the plaintext and the S-box. There are two datasets, one with a fixed key and the other with variable keys, and both use random plaintext values. The fixed-key dataset has 50,000 traces for training and 10,000 traces for testing; each trace has 700 points. The variable-key dataset has 200,000 traces for training, measured with random keys, and 100,000 traces for testing, measured with a fixed key; each trace has 1,400 points.
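To make the masking countermeasure concrete, here is a simplified Python sketch of a first-order Boolean-masked S-box lookup of the kind described above. The table-recomputation style, the variable names, and the placeholder permutation standing in for the real AES S-box table are illustrative assumptions, not the reference implementation.

```python
import random

rng = random.Random(0)

# Placeholder 8-bit permutation standing in for the real 256-byte AES S-box table
# (omitted here to keep the sketch short).
SBOX = list(range(256))
rng.shuffle(SBOX)

def build_masked_sbox(r_in, r_out):
    """Recompute a masked table S' such that S'(v ^ r_in) = SBOX[v] ^ r_out."""
    table = [0] * 256
    for v in range(256):
        table[v ^ r_in] = SBOX[v] ^ r_out
    return table

# One masked first-round S-box evaluation for the target byte.
p, k = 0x3C, 0xA5                                      # plaintext and key bytes (illustrative)
r_in, r_out = rng.randrange(256), rng.randrange(256)   # fresh random masks

masked_state = (p ^ k) ^ r_in                          # the device only ever holds the masked state
masked_table = build_masked_sbox(r_in, r_out)
masked_out = masked_table[masked_state]                # equals SBOX[p ^ k] ^ r_out

assert masked_out == SBOX[p ^ k] ^ r_out               # the label SBOX[p ^ k] never appears unmasked
```

This is why the attack label, the unmasked S-box output, is never directly computed by the device, as discussed next.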
Both synchronized and desynchronized datasets are available. As we are using the ASCAD database, we evaluate our work on the third S-box in the first round of AES. Our model classifies the subkey from the output value of the S-box, as shown in this equation. This value is never directly computed because of the masking, and there are 256 classes in total. The three inputs that affect the power trace are the plaintext or ciphertext, the masks, and the key. Among these, the masks and the key are unknown factors. Providing the plaintext as an additional feature to the attacking model therefore reduces the number of unknown factors. The plaintext feature is added to the machine learning model using two methods: integer encoding and one-hot encoding.

CNNs were originally developed for image processing, and many researchers have shown that CNNs work well with a small convolutional filter kernel size such as 3x3. However, research in which deep learning is applied to SCA typically uses a convolutional filter size big enough to cover all points of interest. The reference models in the ASCAD database use a filter size of 11, so we select our convolutional filter kernel sizes in the range from 3 to 19. We use max pooling for local point-of-interest selection, which reduces the number of parameters to be learned. Similar to the CNN models used in the ASCAD database, the number of kernels in each convolutional layer of our model is in the range from 64 to 512. Based on research into the effectiveness of the depth and the width of a neural network, in which deeper networks provide better accuracy than wider networks with the same number of parameters, we limit the number of nodes in each fully connected layer to 1,024, and five such layers are used. We also choose the most common activation function, ReLU.

With these considerations, we developed two original CNN models using the plaintext as an extension, called CNNP, and one further extended version, which is the combination of the two. This is the first model. As you can see, the convolutional path, which is used to detect points of interest, has three layers. The number of convolutional filters in each layer is reduced from 512 to 256 and then to 128. As previously discussed, we use max pooling after each convolutional layer for local point-of-interest selection. After the final max pooling, the feature maps are extended with the plaintext feature, using either integer encoding or one-hot encoding as discussed. These features are then combined by a deep but narrow network of five fully connected layers with at most 1,024 nodes each. We also use a dropout layer, which randomly disconnects 40% of the nodes at each training epoch, to prevent the model from overfitting. Finally, a softmax layer is used for classification into the 256 classes.

This is our second CNNP model. As you can see in the convolutional path, this model has four convolutional layers, increasing in the number of kernels from 64 up to 512. The remainder of the model is the same as in the first model. Our extended CNNP model is the combination of models 1 and 2. After the plaintext feature extension in each branch, a number of fully connected layers are included to embed the plaintext feature among the points of interest before combining the branches in the feature-combination layer, which simply concatenates the outputs of the two branches. After combining the branches, a dropout layer, three fully connected layers and a softmax layer are used. We use transfer learning to train this combined model.
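To make the first CNNP architecture more concrete, here is a minimal Keras sketch of the idea: a three-layer convolutional path with max pooling, a one-hot-encoded plaintext byte concatenated after the final pooling, a deep but narrow fully connected stack with dropout, and a 256-class softmax. The filter counts and dropout rate follow the numbers mentioned above, but details such as kernel sizes, pooling sizes, the exact fully connected widths, and the optimizer are assumptions rather than the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

TRACE_LEN = 700      # ASCAD fixed-key trace length
NUM_CLASSES = 256    # S-box output values

def build_cnnp_model1(trace_len=TRACE_LEN):
    # Convolutional path: detects points of interest in the (possibly unaligned) trace.
    trace_in = layers.Input(shape=(trace_len, 1), name="trace")
    x = trace_in
    for filters, kernel in [(512, 3), (256, 3), (128, 3)]:
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)   # local point-of-interest selection
    x = layers.Flatten()(x)

    # Plaintext feature extension: the known plaintext byte, one-hot encoded.
    pt_in = layers.Input(shape=(256,), name="plaintext_onehot")
    x = layers.Concatenate()([x, pt_in])

    # Deep but narrow fully connected combination path.
    for units in [1024, 1024, 512, 512, 256]:
        x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.4)(x)                    # 40% of nodes dropped during training
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = Model([trace_in, pt_in], out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training labels are the unmasked S-box outputs SBOX[p[3] ^ k[3]], one-hot encoded,
# e.g. with tf.keras.utils.to_categorical(labels, NUM_CLASSES).
model = build_cnnp_model1()
model.summary()
```

The second model and the transfer-learning combination follow the same pattern, with a four-layer convolutional branch and a concatenation of the two branches before the final fully connected layers.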
For our evaluation, we assume that the attacker understands the AES algorithm and knows the plaintext or ciphertext. They are aware that a side-channel countermeasure has been applied to the AES implementation, but do not know its detailed design or the random mask values. They can set keys on the implementation for profiling. Hypothesis keys are ranked using the maximum-likelihood score, and training is performed in a VMware-hosted environment with access to virtual NVIDIA GPUs.

In order to understand the effectiveness of our proposed CNNP models, we compare our profiling results with four publicly available models in the ASCAD database: a template attack, a multi-layer perceptron with five hidden layers of 50 neurons each, a multi-layer perceptron with five hidden layers of 200 neurons each, and finally the VGG16-based CNN reference model. We will go into a detailed comparison between this last model and our models on the next slides. In comparison to the VGG16-based benchmark model, our CNNP model is deeper but narrower: we use five fully connected layers with 1,024 and 512 nodes, while VGG16 uses two layers with 4,096 nodes each. Our model has fewer convolutional layers, only three or four, and it uses smaller convolutional filter kernel sizes of three and five, whereas VGG16 uses a kernel size of 11. Our model uses max pooling instead of average pooling to find local points of interest, it uses dropout to prevent overfitting, and it uses the plaintext as an additional feature.

This graph shows the guessing entropy of the models on the ASCAD fixed-key dataset versus the number of traces. The yellow line shows the result of our CNN model with no plaintext extension; it is similar to a random guess. The blue and black lines are the results of the ASCAD reference models. The results at the bottom of the graph are from our CNNP models. You can see that our CNNP models are much better than the reference ones. It is worth noting that the CNNP model with integer encoding needs two traces to reveal the secret key, while the one with one-hot encoding can reveal the key with only one trace. This implies that the CNNP model relies on the bijection of the S-box function between the random plaintext and the fixed key to reveal the key in the case of the ASCAD fixed-key dataset.

This graph shows the results for the models on the ASCAD variable-key dataset. The guessing entropy of our models is shown by the red lines. From the dotted red line, you can see that our models rely on both the plaintext and the traces to reveal the secret key. The dashed red line implies that our deeper but narrower CNN model without the plaintext feature extension works better than the other reference models in revealing the secret key of the AES implementation. You will see this line again on the next slide.

This graph shows a detailed comparison of our CNNP models versus the VGG16 model, whose results are shown by the green dotted line. The red dashed line here shows the guessing entropy of our deeper but narrower model, and above that, in yellow, is a wider CNNP model with 1,536 nodes in each fully connected layer. We can see that both CNNP model 1, with three convolutional layers, and model 2, with four convolutional layers, are better than the VGG16-based model; they achieve key ranks of 3 and 5 for the third subkey. We can also see that the models with the smaller convolutional filter kernel size of 3 are more efficient than those with a kernel size of 5. The combination of the two models using transfer learning achieves the best result.
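For completeness, here is a small NumPy sketch of how hypothesis keys can be ranked with the maximum-likelihood score mentioned above: the model's per-trace class probabilities are turned into log-likelihoods for every key guess via the S-box, accumulated over the attack traces, and the position of the true key in the resulting ranking gives the key rank (guessing entropy is the average of this rank over repeated experiments). The helper names and the placeholder arrays are assumptions for illustration.

```python
import numpy as np

def rank_keys(class_probs, plaintext_bytes, sbox):
    """Maximum-likelihood ranking of one key byte.

    class_probs:     (n_traces, 256) softmax outputs, class = SBOX[p ^ k]
    plaintext_bytes: (n_traces,) known plaintext byte per trace
    sbox:            256-entry np.ndarray with the S-box table
    Returns key candidates sorted from most to least likely.
    """
    log_probs = np.log(class_probs + 1e-40)               # avoid log(0)
    scores = np.zeros(256)
    for k in range(256):
        labels = sbox[plaintext_bytes ^ k]                 # label each trace would have under guess k
        scores[k] = log_probs[np.arange(len(labels)), labels].sum()
    return np.argsort(scores)[::-1]                        # best guess first

def key_rank(class_probs, plaintext_bytes, true_key, sbox):
    """Rank of the true key byte after all traces (0 = key recovered)."""
    ranking = rank_keys(class_probs, plaintext_bytes, sbox)
    return int(np.where(ranking == true_key)[0][0])

# Usage sketch (arrays are placeholders, not real attack data):
# probs = model.predict([attack_traces, attack_pt_onehot])
# rank = key_rank(probs, attack_pt_byte3, true_key_byte3, SBOX)
```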
It achieves key rank 2 after 40 traces. In comparison, the VGG16-based model achieves a key rank of 20 after 40 traces. We also evaluated our models on the ASCAD desynchronized datasets, and the results can be found in our paper.

In conclusion, this research confirms the effectiveness of CNN models in processing unaligned data and shows that we do not need a large convolutional kernel size to attack the AES implementation in the ASCAD database; a small kernel size works best. We also see that the plaintext feature plays an important role in improving the performance of the CNNP models, as it reduces the unknown factors that contribute to the features in the traces. We also conclude that one-hot encoding should be used to encode the plaintext feature instead of integer encoding, and that the location of the plaintext feature extension has little effect on the performance of the model. Our experiments also show that even though deep but narrow networks require fewer parameters, they perform better than wide but shallow networks such as the benchmark VGG16-based CNN model. Thank you very much for watching.