Hello everyone, my name is Lichao Wu. In this presentation we will show our recent research, reinforcement learning for hyperparameter tuning in deep learning-based side-channel analysis. This is joint work with Jorai Rijsdijk, Guilherme Perin, and Stjepan Picek.

First, let's give a brief introduction to side-channel analysis. Assume we have hardware running an encryption algorithm: we input plaintext and we get ciphertext at the output. However, during this execution there is leakage, which we call side-channel leakage. It can come from the power domain or the electromagnetic domain. By analyzing this leakage with some mathematics, we can actually retrieve the secret information, and we call this side-channel analysis.

There are two types of side-channel attacks: non-profiled side-channel attacks and profiled side-channel attacks. For the profiled side-channel attack, we assume that the attacker has full control of a device, which means that they can collect both traces and labels during the executions. We call these the profiling traces and profiling labels; note that the traces here are the leakage traces. By collecting these two types of data, the attacker can build a profiling model that maps the relationship between traces and labels. The attack then simply consists of feeding the attack traces, obtained from the attack device, to the profiling model and letting it predict the labels. Knowing the attack labels, a simple calculation retrieves the secret information carried by the leakage.

There are two popular profiled side-channel attacks: the template attack and the deep learning attack. The template attack is based on building probability density functions for the different clusters, while the deep learning attack maps the relationship between traces and labels by training a deep learning model. However, training such a model is not easy; in particular, designing the model is a really difficult task. Consider that we have different types of datasets and we need to customize the deep learning model for each of them. Moreover, many papers report that a different network can outperform the others, and it is really difficult to decide which hyperparameters or which network is better.

So here we list three categories of hyperparameters. The first one is the network type: multilayer perceptron, convolutional neural network, or ResNet. For the layer configuration, we have the convolutional layer, pooling layer, batch normalization layer, and dense layer. And for the training configuration, which is also really important, we have the learning rate, batch size, and number of training epochs. As an attacker or evaluator, it is really difficult to find the optimal hyperparameter combination among all these options that performs a good attack. The goal of this paper is to automate and simplify this process, so that attackers and evaluators can use our methodology to mount the attack on different datasets.

We use reinforcement learning, more specifically Q-learning. There are three important notions: the state S, the action A, and the reward R. Let's start from the beginning.
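As a rough illustration of these three hyperparameter categories, the search space handed to the agent could be encoded as follows. This is a minimal sketch: all names, layer options, and value ranges below are assumptions for illustration, not the exact search space used in the paper.

```python
# Illustrative hyperparameter search space for the Q-learning agent.
# Every option and range here is an assumption; see the paper for the
# actual search space definition.
SEARCH_SPACE = {
    "network_type": ["mlp", "cnn", "resnet"],
    "layers": {
        "conv_filters": [4, 8, 16, 32, 64, 128],
        "conv_kernel_size": list(range(1, 26)),
        "pooling": ["max", "average"],
        "batch_norm": [True, False],
        "dense_neurons": [10, 20, 50, 100, 200, 500],
    },
    "training": {
        "learning_rate": [1e-3, 5e-4, 1e-4, 5e-5],
        "batch_size": [50, 100, 200, 400],
        "epochs": [10, 25, 50],
    },
}
```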
We feed the current state to the agent, and based on the current state, the agent takes an action. The action influences the environment, so the environment updates to a new state, and based on the action taken, the environment also gives us a reward. We feed the new state and the new reward to the agent again, and the loop repeats. Through this interactive process between the agent and the environment, the agent gets to know the environment better and better, and at a certain step it can give us optimal solutions for the objective we set.

Here is a simple example of how Q-learning works. Assume the agent is standing in the middle and the objective is to move to the green block in the top-right corner while collecting the maximum reward. Note that each block carries a reward. Because the agent has no idea what the environment looks like and no idea about the blocks, it needs to explore the environment. The initial state of the agent is its current position, and the agent takes actions; the actions in this situation are moving up, down, left, or right. By taking such an action, the environment changes and the agent moves to a new block, a new state. For example, if we move left, the new state is the block on the left, and by moving left we also collect a reward. The agent then continues with further actions based on the current state and reward, finishes this iteration, and starts the next one.

For this simple example, we calculate the Q-value, which stands for quality. Why do we want a Q-value? Because we want to evaluate the quality of a state-action pair. For example, if we move right, we want to know whether that move is good or not. If the Q-value is high, it indicates that this move is pretty good and may lead us to our destination, the green block, with a higher reward. If the Q-value is low, this action is probably not good enough and we may want to take a different one.

So how do we calculate, or rather update, this Q-value during the interactive process? Here is the equation. It looks complex, but it is really simple. It is based on the Bellman equation and has two parts: the old Q-value and the learned Q-value. The old Q-value is the value learned in previous iterations, and the learned Q-value is the value learned in the current step. The learned Q-value itself has two terms, as mentioned before: the reward obtained in the new state, and the maximum Q-value over the actions available in the new state. By balancing the old value and the learned value with the Q-learning rate alpha, we update the Q-value. Note also the discount factor, which determines the weight given to short-term rewards over future rewards. Please check our paper for more details.

So we have demonstrated how Q-learning works and how to update the Q-values. We can see that the reward is very important: whether we are calculating the new Q-value or the updated Q-value, the reward plays a central role, and it effectively determines the objective. So we really want to tailor this reward to side-channel analysis.
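For reference, the update rule described here is the standard Q-learning update based on the Bellman equation:

\[
Q^{\mathrm{new}}(s_t, a_t) = \underbrace{(1-\alpha)\,Q(s_t, a_t)}_{\text{old Q-value}} \;+\; \underbrace{\alpha\Big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\Big)}_{\text{learned Q-value}}
\]

where \(\alpha\) is the Q-learning rate and \(\gamma\) is the discount factor weighting short-term against future rewards.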
And here is the reward we are using. As you can see, we use four metrics, and the detailed calculation is shown here. The t' term stands for the percentage of traces required to get the guessing entropy to zero, out of a fixed maximum number of attack traces. GE_10 stands for the guessing entropy value when using 10% of the attack traces, and GE_50 for the guessing entropy value when using 50% of the attack traces. Finally, the accuracy term stands for the accuracy of the network on the validation set.

We also design an alternative reward function, which we call the small reward. The reason we design it is that recent research has shown that models can be really small while still performing really well, so we want our Q-learning process to find these small but good models. To do this, we add one more metric, the P term, which is based on the number of trainable parameters. We set a threshold on the trainable parameters based on the state-of-the-art models, and then subtract the parameter count of our selected network topology from that of the state of the art. A higher P value means our model is smaller, which adds value to the reward. So if the small reward is high, it means we have found a well-performing and small model.

With all this background in place, we now show the network search method. We start by assembling a network topology, then we train the network, and finally we evaluate the model and update the Q-value. We repeat this process again and again following an epsilon-greedy schedule. The epsilon-greedy schedule balances exploration and exploitation: at the beginning of the search we set epsilon to a high value, so the sampled network topologies are mostly random and the agent can better explore the search space. Then, as the agent gets to know the environment better, we gradually decrease epsilon, moving the agent from exploration to exploitation, so it tends to select the topologies that give a higher Q-value, or a higher reward.

For sampling the network topologies, we also set restrictions on the designed networks. First, we set the search ranges, and we limit the selectable layers. For example, here we have the convolutional layer, the pooling layer, the fully connected layer, and the softmax layer, and we also include the batch normalization layer and the global average pooling layer. These last two layers appeared in recent research and performed really well, so we add them to our search space in the hope that they will give us good models. As mentioned before, we also set restrictions on the models; for example, the number of fully connected layers cannot exceed a threshold. For more details, please check our paper.

Compared with conventional network design methodologies, our search method keeps at least a small probability on every type of layer and configuration, which can give us better results, as we will show in the experimental results part.

We test our framework, our search method, on three publicly available datasets: the ASCAD fixed-key dataset, the ASCAD random-key dataset, and the CHES CTF dataset. For each dataset, we show the search overview, the benchmark against the state-of-the-art models, the reward, and the guessing entropy.
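Before moving to the results, here is a hedged sketch of how the reward described above could be computed. The exact normalization and weighting are defined in the paper; every formula below is an illustrative assumption, not the paper's definition.

```python
# Illustrative sketch of the side-channel reward; all normalizations here
# are assumptions made for the example.
def reward(traces_to_ge0, max_attack_traces, ge_10, ge_50, val_accuracy,
           n_params=None, param_budget=None):
    # t': fraction of the attack-trace budget needed to reach guessing
    # entropy 0 (smaller is better, so we reward 1 - fraction).
    t_term = 1.0 - (traces_to_ge0 / max_attack_traces)

    # Guessing entropy at 10% and 50% of the attack traces; the rank of an
    # AES key byte is at most 255, so we scale GE into [0, 1] (assumed).
    ge10_term = 1.0 - (ge_10 / 255.0)
    ge50_term = 1.0 - (ge_50 / 255.0)

    r = t_term + ge10_term + ge50_term + val_accuracy

    # Optional "small reward" term: reward models whose trainable-parameter
    # count stays below a budget derived from the state-of-the-art models.
    if n_params is not None and param_budget is not None:
        p_term = max(0.0, (param_budget - n_params) / param_budget)
        r += p_term
    return r
```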
This is the result for the ASCAD fixed-key dataset. On the left is the search overview. First, note that we experiment with different leakage models, the Hamming weight leakage model and the identity leakage model, and we also test the different reward functions. To read this graph: the x-axis is the Q-learning reward, the higher the better, and the y-axis is the number of trainable parameters of each model, the smaller the better. Each dot in the graph represents a model topology, and you can see there are a lot of dots, meaning that we tested a lot of different topologies. The dots have different colors: the colors stand for the different epsilon values, that is, the different search stages such as exploration. The red cross stands for the state-of-the-art model, which we benchmark directly within our reinforcement learning framework. From the result, we can see that the models in the bottom-right corner perform better than the state-of-the-art model and are also smaller. We obtain such models with both the reward and the small-reward functions, and especially with the small-reward function we get plenty of these best-performing models.

To benchmark our best models against the state-of-the-art research results, we have the tables on the right, where we highlight the performance of our best model. Clearly, our model is far smaller than the state of the art and performs better, which shows the efficiency of our framework.

To evaluate our search process, we also calculate the reward and the guessing entropy of the best models. For the reward, we consider two quantities: the rolling average reward and the average reward per epsilon. The rolling average reward is the average reward over the previous 50 iterations, and the average reward per epsilon is simply the average reward over all models sampled at a specific epsilon value. From the plot on the left we see that, for both reward functions, the reward gets higher and higher as epsilon gets lower. This means our agent is really learning from the environment: when moving from exploration to exploitation, it generates models that perform well on the dataset under test. On the right, we show the guessing entropy of the best models from the search, and the result is quite consistent with what we observed in the search plot on the previous slide.

Next, we show the ASCAD random-key dataset. Note that here we do not have a red cross standing for the state of the art; however, we still benchmark the results reported in other papers against ours. The search process is quite similar, and the results again show that we can find well-performing models of limited size, which is a pretty amazing result. We also show the reward, similarly to the previous dataset. Here we observe that for the Hamming weight leakage model, the reward goes up as epsilon decreases, while for the identity leakage model it goes in the opposite direction: the reward gets smaller as epsilon decreases. We assume this is because our search framework gets stuck in a local optimum, so further searching makes the reward smaller. A possible solution would be to use more profiling traces during the search and the evaluation, which would probably solve the problem.
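As a small aside, the two reward summaries just mentioned are straightforward to compute. A minimal sketch, assuming rewards and epsilon values are logged per search iteration (the 50-iteration window is from the talk; the rest is illustrative):

```python
from collections import defaultdict

def rolling_average(rewards, window=50):
    # Average reward over the previous `window` search iterations.
    out = []
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def average_per_epsilon(rewards, epsilons):
    # Mean reward over all models sampled at each epsilon value.
    buckets = defaultdict(list)
    for r, eps in zip(rewards, epsilons):
        buckets[eps].append(r)
    return {eps: sum(rs) / len(rs) for eps, rs in buckets.items()}
```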
We also show the guessing entropy here again, and it is quite consistent with what we observed before.

Finally, we show the CHES CTF dataset. Here we only show the Hamming weight leakage model, because this dataset does not leak much in the identity leakage model. Both the search results and the reward behave as expected: the reward gets higher as epsilon gets smaller, and we keep finding better models. As for the results, with both the reward and the small-reward functions our best model is the smallest while performing the best, and the guessing entropy confirms this.

Okay, let's move to the conclusion and future work. In this paper we propose a reinforcement learning framework that enables automated and powerful search for deep learning models for profiled side-channel analysis. We motivate and develop custom reward functions for hyperparameter tuning in the side-channel setting, and we demonstrate the effectiveness of our method on different datasets. For future work, it would be interesting to investigate the performance of different deep learning paradigms, and we would like to know how the best models obtained through reinforcement learning behave in ensembles of models. We also want to better understand our search framework when searching for other types of networks, for example the multilayer perceptron. Thanks for your attention.