Hello and thank you for joining this talk on the security issues and challenges in deep reinforcement learning. I'm Vahid Behzadan, an Assistant Professor of Computer Science and Data Science at the University of New Haven, and I also direct the Secure and Assured Intelligent Learning Lab, or SAIL for short, working on AI safety and security and on applications of machine learning to cybersecurity and the safety of complex systems. So, here's the outline of my talk. I'm going to quickly go over the basics of reinforcement learning and deep reinforcement learning. There will be some math, but I'll keep it to a minimum. Then we'll talk about vulnerabilities of deep reinforcement learning and whether deep RL is susceptible to classical adversarial machine learning attacks like adversarial examples. We'll develop a threat model for deep reinforcement learning: we'll identify different attack models, attack surfaces, and types of vulnerabilities. We'll introduce a number of attack mechanisms and corresponding defenses that have been developed in recent years, and we'll talk about the frontiers and areas of future research in this area. So, just a quick overview. I assume most of the audience here is already familiar with the terminology used here. We can classify machine learning algorithms as supervised, unsupervised and reinforcement learning algorithms. Supervised learning algorithms are those where the training dataset includes labeled data, meaning that each data point in the dataset comes with the correct label, the correct output expected from a model trained on that dataset. Then there is unsupervised learning, under which clustering, anomaly detection and some other algorithms fall: there are no labels available for the data points and no feedback available on the data, but the goal is to find some underlying structure in the data.
And then there is reinforcement learning, which is concerned with the problem of sequential decision making. There's a reward system included in reinforcement learning, but there are fundamental differences between the settings of reinforcement learning and supervised learning, and that's why it merits its own category. We'll talk about those differences in the next few slides. So, here you can see the general setting of reinforcement learning, the general sensorimotor agent problem. In RL, we have an agent, which is to be trained by a reinforcement learning algorithm. This agent interacts with the environment by performing some action. This action causes the state of the environment to change. Then the new state is observed by the agent. There's also some sort of reward associated with this change that is provided to or inferred by the agent based on this change in state. You can think of this in terms of playing a game. The environment can be the game environment, where different actions may result in scoring or loss of score. And the actions are, well, essentially the actions that the player can take. The state of the environment can be the state of the game. For example, in Breakout, the configuration of the bricks can be part of the state, along with where the paddle is and where the ball is. If you recall the game of Breakout, of course, if you're old enough to remember Breakout, you're probably familiar with the dynamics. But you can think of any other game and this setting still applies. What's the goal here? The goal is to learn how to take actions in order to maximize cumulative rewards, not instantaneous rewards, but cumulative rewards. So an RL agent is not just concerned with maximizing the current score. It wants to learn how to act so that at the end of the game its total sum of rewards is maximized. What are the different applications of RL?
Game playing is well publicized because it's a very good testbed, and it has become one of the common experimental settings for RL research. However, there are some major real-world applications for reinforcement learning. In essence, reinforcement learning is the machine learning response to data-driven control problems. We encounter such problems in robotics, like autonomous navigation and object manipulation; in algorithmic trading, which is one of the areas that my research group has recently become active in; in critical infrastructure, such as resource allocation in smart cities, traffic management and intelligent traffic systems, and smart grid management and control; in healthcare, such as clinical decision-making, which is actually one of the better-known applications of RL in the real world; and in other types of resource management and applications in operations research. RL is either envisioned or already heavily adopted in various industries, many of which are critical and may become targets of malicious actions. Let's formalize the RL problem a little more. I'm just going to quickly go over this. There will be some math, but this is just to introduce the basic setting. The underlying framework used to formulate the RL problem, and to think and reason about it, is the Markov decision process. Why do we call it the Markov decision process? Because it's based on the Markovian assumption, or the Markov property, which says the current state completely characterizes the state of the world. So if you have the current state and you perform an action, the next state is only going to depend on, or be a function of, the current state and nothing before it. You don't need to know the history of the environment. All you need to know is what the current state is to infer what the coming state is going to be.
Now, in Markov decision processes, a problem or setting is formulated by a tuple of five main parameters: the set of possible states S; the set of possible actions A; and the distribution of reward given a state-action pair, R. This distribution can also be a function; it doesn't necessarily have to be probabilistic. It can just be a function that tells you that if at a particular state s_i an action a_i is taken, then the reward is going to be, let's say, r_i. Then there is the transition probability, or transition dynamics, P, which represents the dynamics of the environment: which state are we going to end up in if we are in a particular state and perform a particular action? And there is finally a discount factor, which defines how myopic the agent is, how much it values rewards that may occur further in the future, further down the line. Now, in general, the RL setting is based on the following process. At time step t = 0, the environment samples the initial state s_0. Then, in a loop until done, the agent selects an action a_t based on some criteria. It can be completely random in the beginning, and then slowly become more targeted and more policy driven. It performs that action; the environment produces a reward signal and the next state; and the agent receives those signals. They may be partial, incomplete or noisy, but the agent receives some signal resulting from the transition to the new state s_{t+1}, together with the reward that has resulted from that transition. Now we define a new entity here called Pi. A policy Pi is a function, or a distribution in the probabilistic case, that maps from states to actions. So it tells the agent what action to perform given any state. And the objective is to find the optimal policy Pi star that maximizes the cumulative discounted reward. We also call this the return.
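The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: `env` and `select_action` are hypothetical stand-ins for a real environment and a real policy, with `env.step` assumed to return the next state, the reward, and a done flag.

```python
def run_episode(env, select_action, max_steps=1000):
    """Generic RL interaction loop: the environment samples an initial
    state s_0, then the agent repeatedly selects an action and the
    environment returns the next state and a reward, until done."""
    state = env.reset()                      # environment samples s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = select_action(state)        # policy-driven, or random early on
        state, reward, done = env.step(action)
        total_reward += reward               # accumulate the (undiscounted) return
        if done:
            break
    return total_reward
```

At training time, `select_action` would start out mostly random and gradually become policy driven, exactly as described above.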
So cumulative meaning the sum of rewards over the entire duration of interaction. It can be one episode of a game, or the whole training horizon, and it's discounted. This gamma is the discount factor that defines how myopic the agent is, or how much it values events that occur further down the line. It's typically a constant value between zero and one, and as you can see, as t increases, gamma to the power t decreases. So the observed value, the preference, of something that happens further down the line, which means at greater t, is going to decrease as t increases. So again, the objective is to find the optimal policy Pi star that maximizes this sum. All right, we're going to quickly go through two other definitions here. If we want to evaluate a particular state in this setting, one approach is to measure the value of that state using the value function. The value function of state S is the expected cumulative reward from following the policy Pi from this state. So let's assume we already have a policy. What will be the expected cumulative reward if we start in state S and keep following policy Pi? We also define a Q value, or Q function, which tells us how good a particular action A is if it's performed in state S. In other words, if the agent is in state S and performs action A, the Q value is the expected cumulative reward from taking action A in state S and then following the policy. Remember, the policy is a function that tells the agent what action to perform given any state. If the agent performs action A in state S, it goes into the next state S prime, and then we can use the policy to see what action the agent should take at that state, which takes it to another state, and then the policy is used again to figure out the action to take there, and this goes on until the termination of the episode or the horizon.
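The discounted return described above can be written out concretely. This is just the standard definition, G = r_0 + gamma·r_1 + gamma²·r_2 + ..., expressed as a small helper:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward (the return):
    G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    Rewards further down the line are weighted by gamma**t, so a
    smaller gamma makes the agent more myopic."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, with gamma = 0.5, three rewards of 1.0 each give 1.0 + 0.5 + 0.25 = 1.75; lowering gamma shrinks the contribution of the later rewards, which is exactly the myopia knob discussed above.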
Now that we are familiar with the value function and the Q function, let's go a bit deeper. The policy Pi, the value functions, both V and Q, and the model, the transition dynamics or the reward model, are all functions. We want to learn at least one of these from experience. This is the essence of RL. If there are too many states, however, we cannot just tabulate everything and try to experience every state and the corresponding result of every action performed in that state, because as the size of the problem, the state space and the action space, grows, in settings like, let's say, playing GTA 5 or a self-driving navigation policy in the real world, these state spaces just explode. The dimensionality is too high and it's just not feasible to store every possible state. In those cases we need to approximate. Traditionally this is called RL with function approximation. If function approximation is done using deep neural networks, then we call the training setting deep reinforcement learning. The term is relatively new. It was, I believe, introduced in late 2013, early 2014 by Mnih, David Silver and others at DeepMind in their deep Q-learning paper. However, the concept is not very new. That said, there have been fantastic and even mind-blowing advances in this area in the past three or four years. We'll talk about some of those as we move forward. A quick overview of the taxonomy of different RL approaches and RL agents. Remember we have the value function, the policy and the model, and there are different approaches to solving the RL problem. Sometimes we just want to find a policy directly; the approaches that satisfy that need are called policy-based RL. Sometimes we want to find a Q function or a V function first and then derive the policy from those functions; these approaches are called value-based. Sometimes we want to first learn a model of the environment and then solve the problem.
These are called model-based approaches. Sometimes we don't have the model and we don't want to learn the model explicitly, and those are called model-free. When we're dealing with both a value function and a policy at the same time, in other words, we have one component learning the value function and another learning the policy, and the two are contrasted with each other, we call those actor-critic RL approaches. As you can see, there are different approaches for different settings and different problems. However, all of those can still be grounded on top of the Markov decision process framework and the general solution approach we looked at before. One of the better known approaches to RL, which falls under the value-based approaches, is Q-learning. The objective in Q-learning is to derive the optimal policy Pi star based on the optimal Q, the Q function that maximizes the value of each state-action pair. From that point onward, with an iterative formulation based on the Bellman equations, or dynamic programming, it becomes possible to find this through bootstrapping and iterative re-estimation of the Q value. There are different approaches to Q-learning in large state spaces or large action spaces where function approximation is required. One such approach is to parametrize Q as a function of S, A and an estimation parameter theta. If we represent this parametrized Q function by neural networks, where theta corresponds to the weights of that neural network, the approach is called Q networks. This solution approach was proposed in the early 2000s, even earlier, but was perfected in some sense in 2013-2014 by Mnih, Silver and their team, which resulted in the proposal of deep Q networks, or DQNs. We're talking about deep networks. What does that mean? It means that these networks use deep neural networks like CNNs, which help with both function approximation and end-to-end feature learning.
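The bootstrapping and iterative re-estimation just mentioned can be made concrete with one step of tabular Q-learning. This is a standard textbook sketch, not the deep-network version: `Q` is assumed to be a plain dictionary keyed by (state, action) pairs.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One step of tabular Q-learning: move Q(s, a) toward the
    Bellman target r + gamma * max_a' Q(s', a'). Unseen (state,
    action) pairs default to a value of zero."""
    old = Q.get((s, a), 0.0)
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]
```

DQN replaces the dictionary with a neural network parametrized by theta, but the target being bootstrapped toward is the same Bellman target shown here.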
One of the advantages of deep learning is its superior performance in learning feature representations, especially from images. For those of you who are more statistically oriented, you may have seen the term IID. It means that the data are independent and identically distributed: one data point does not depend on another, and all data points come from the same distribution. These assumptions do not hold in the RL setting. We know that a particular state, for example, is highly correlated with its previous state and action. Why does this matter? Well, a lot of our supervised and deep learning approaches are based on the assumption that the training data is IID. In response to this problem, the DQN approach introduced experience replay. It's like a bag of all experience data, which is randomly sampled in each training iteration to reduce the sequential, temporal correlation of data points and also make it more likely for the data to be evenly distributed. Also, to reduce the effect of oscillation during training, DQN uses fixed parameters for a target network. The target of optimization is fixed and is updated only every few thousand iterations, and that reduces the oscillation problem. And the rewards are normalized to the range minus one to one to make sure the reward signals are bounded. Now that we have a preliminary understanding of deep reinforcement learning and one of its popular implementations, DQN, let's take a quick look at adversarial machine learning. I assume that by now, those of you who were not familiar with adversarial examples have been introduced to this concept. With adversarial examples, we are now speaking in the context of supervised learning, of image classifiers.
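The experience replay idea above is simple enough to sketch directly: a fixed-capacity bag of transitions sampled uniformly at random. This is a minimal illustration of the mechanism, not the original DQN implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay as used in DQN: store transitions in a
    fixed-capacity bag and sample them uniformly at random, which
    breaks the temporal correlation of sequential experience."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest items fall off

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the training batch.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each training iteration draws a batch with `sample`, so consecutive gradient steps are no longer computed on consecutive, highly correlated frames.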
Let's say we have an image classifier trained on a set of images of pandas, cats and others to identify the object in those images. In the beginning, we pass the image of a panda to the classifier and it correctly classifies it as a panda. Now, it's been well established by now that it is possible to induce incorrect classifications in deep learning models, or machine learning models in general, by adding minute perturbations to the original image. As you can see, it's almost impossible to detect or see any changes in this final image. These pixel perturbations are very small. This is one example of the different attacks and vulnerabilities in the classical realm of adversarial machine learning, which is mostly concerned with supervised learning and sometimes unsupervised learning. In general, the adversarial objectives in AML, or adversarial machine learning, can be classified under the traditional CIA triad: confidentiality, integrity and availability. With respect to confidentiality and privacy, an adversary may wish to target the confidentiality of the model parameters or model architecture, now that intellectual property theft is an issue in larger models, or the adversary may target the privacy of the training and test data. If, for example, medical records were used in the training data, there have been proof-of-concept attacks showing that it is possible to infer whether the records pertaining to a particular patient were used in the training data or not, and sometimes it's possible to reconstruct the training dataset by just having access to the model itself. That is a major HIPAA violation, in essence a privacy violation. Also, with regards to integrity and availability, the attacker may target the integrity of the predictions, the outcome of the model, the performance of the model.
For example, adversarial examples are an attack on the integrity of image classifiers, or supervised machine learning models in general, and an adversary may also target the availability of the system that is deploying machine learning, for example a facial recognition system or the autonomous navigation system in a driverless car. Now that we know this much about adversarial machine learning and the security vulnerabilities of classical machine learning, supervised and unsupervised learning, there is a major question: is deep RL immune to those attacks? In late 2016, when the research community was just beginning to pay attention to both deep reinforcement learning and the issue of adversarial examples, I came up with this question and decided to experiment with it a little to find out whether deep RL can also be vulnerable to such attacks. I started from a simple observation. The deep neural networks in DQN models and in classifiers are both function approximators. At training time, they are function approximators; at test time, they are just functions. So I came up with this hypothesis: if classifiers are vulnerable to adversarial examples, then the action-value approximators of DQNs may also be vulnerable. I set up an experiment where the aim was to perform adversarial attacks. The adversary's goal was a test-time attack to perturb the performance of the target's learned policy. So the target at this point is fully trained and is deployed in the environment, and the adversary wants to somehow cause the agent to perform incorrectly, to manipulate its policy. What does the adversary know about the target? It knows the type of input to the target. For example, it knows whether the target policy is looking at image data, text, audio and such. Why? Because it helps with estimating the architecture.
For example, if the input is images, the adversary can come up with a good guess that the target architecture includes CNNs, or convolutional neural networks. I also assumed that the adversary knows the reward function. So the adversary may have access to the environment, and if it's, for example, a game environment, it knows what the scores are and how the scores are generated. What is not known? The target's neural network architecture is not known, so it's a black-box attack. And the initial parameters, the initialization of the target neural network, are also not known. What is available to the adversary in terms of actions? The adversary may perturb the environment where the target performs. For example, it can change pixel values in a game environment through a man-in-the-middle attack. In this work I considered two techniques for perturbing the environment. One is the classical fast gradient sign method, FGSM, for generating adversarial examples. The other is the Jacobian-based saliency map attack, or JSMA, introduced by Papernot in 2015, I believe. Now, in that experiment, I used the classical DQN architecture introduced by Mnih on an Atari game, the game of Pong. This was all implemented in OpenAI Gym with TensorFlow; back then PyTorch wasn't really a thing. And I trained the agent against the heuristic AI. And here's the initial proof-of-concept result. This is for the white-box attack; later on I will show you the results for the black-box attack. You can see that for FGSM and JSMA, regardless of how far along the training the agent is, the policy for the Pong agent is highly vulnerable to adversarial perturbations through simple techniques like FGSM and JSMA. You can see that for JSMA, the success rate was 100% for all of the cases. For FGSM, it was slightly lower.
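The core FGSM step just mentioned is a one-liner. This sketch only shows the final perturbation step: in a real attack, `loss_grad` would come from backpropagating the loss through the target (or replica) network, which is omitted here.

```python
import numpy as np

def fgsm_perturb(x, loss_grad, eps=0.01):
    """Core FGSM step: nudge every input feature by eps in the sign
    direction of the loss gradient, then clip back to the valid
    [0, 1] pixel range so the perturbation stays imperceptible."""
    x_adv = x + eps * np.sign(loss_grad)
    return np.clip(x_adv, 0.0, 1.0)
```

Because the perturbation per pixel is at most eps, the adversarial frame is nearly indistinguishable from the original, which is exactly the property exploited against the Pong policy.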
That was mostly because of the termination criteria and the perturbation threshold that I had defined, but it can be observed that the policy can be very easily manipulated through adversarial example attacks. So, 100 random observations were perturbed with FGSM and JSMA, the perturbed images were fed to the trained neural network representing the agent's policy as test input, and then the success rate was measured. Now, is this type of attack practical? Is this really something that we should be worried about? A few years later, in 2018, Clark and his co-authors published the report of a similar attack on an autonomous robot with a DQN-based policy using ultrasonic sensory input for collision avoidance. They showed that they could use adversarial perturbations to manipulate the trajectory of the robot and make it follow a path defined or desired by the adversary, not one that the robot itself wants to follow. There are more recent examples of how this sort of attack on deep RL, or attacks on deep RL in general, can be of concern. One of the recent works by my graduate students is on attacks on automated trading algorithms based on reinforcement learning. The paper is coming out in a couple of months, so I can't go into more details, but this is one of the more severe and urgent cases for security researchers to consider. Deep RL is already being used by many major financial players and stock traders, and it can be easily manipulated in the real world. There are other cases, of course, but this is one of the examples that demonstrates the practicality and applicability of this attack to real-world scenarios. Now, before we go further into different types of attacks, let's develop a threat model for deep reinforcement learning. Again, the adversary's objectives can follow those of the CIA triad.
The adversary may wish to access internal configurations like model parameters, the reward function, the policy and such, to steal the model for intellectual property theft. There can be attacks on integrity, which means compromising the desired learning or enactment of the policy. There can be attacks on availability, which are essentially compromises of the ability of the agent to perform training or actions when needed. Now, let's look at the attack surface of deep RL. This is the general block diagram of a deep RL agent or system. We have the agent. The agent typically has some memory where it stores its experiences during training, and those experiences help with function approximation, data-driven policy learning and such. Then there is an exploration controller, which controls how the agent explores the environment during training; an experience selector, which decides how to select experiences from the bank of observations stored in this dataset; and an actuator, which enacts the actions of the agent inside the environment. The environment is connected back to the agent block through an observation channel, observation of the state, and a reward channel. And it's of no surprise to the more seasoned security researcher and professional that all of these components can be the subject or target of adversarial attack. As we go forward, we'll cover some examples of attacks that can occur on each of these components. We've already seen an attack on the observation channel. There are attacks on the reward channel, which we'll hopefully touch on. And there are attacks on the agent during training and on the actuator. Let me give you a quick example of an actuator attack. Let's assume we have an actual robot learning to navigate in an environment while avoiding obstacles.
If the robot commands or decides to, let's say, move the left wheel forward, but there is some sort of obstacle in front of the left wheel and the left wheel doesn't actually move, then the resulting observation is going to be skewed, because the agent is going to assume that the actuation has happened, look at the changes in the observation, and use that to retrain and optimize its policy based on faulty data. What are the adversarial capabilities? Well, we first look at different attack modes. The attacker can perform a passive attack, where it's only observing the target and not changing anything, or it can perform active manipulation. In passive attacks, the attacker can perform inverse reinforcement learning to learn the reward function of the agent, or, as we'll see later on, it can perform imitation learning to steal the policy. Active measures include attacks on the actuation, observation, or reward channels, and attacks on observation can target the representation model, how the agent sees the environment, or perturb the transition dynamics, how the agent sees the changes in the environment. Okay, going back to our initial proof of concept: remember, our original goal was to perform a black-box attack. To achieve this objective, we introduce an approach based on the transferability of adversarial examples. The adversary creates a second DQN with a similar architecture but different initial parameters. When I say similar, it doesn't necessarily have to be identical; it just needs to be a convolutional neural network with some function approximation capability, but it doesn't need to have the same parameters or the exact same architecture. And it trains that replica model on the same environment. The assumption is that the adversary has access to that environment. It then uses the knowledge of that architecture and the trained replica policy.
It crafts adversarial examples the same way it did in the white-box case, and we know that many adversarial examples transfer to similar models trained on similar data, which, as you can see, applies to DQN policies as well. This is how we implemented a black-box attack against deep RL policies. Now, what about training-time attacks? In the same paper we introduced the policy induction attack, an adversarial-example-type attack against a DQN agent during training. What are the different steps in this attack? First, the adversary derives an adversarial policy from the adversarial goal by training on the same environment where the target is going to perform or be trained. So if the adversary wants to minimize the reward gained in a game by the agent, then the adversary's optimization goal is going to be the exact opposite of that of the target policy: the target wants to maximize the reward and the adversary wants to minimize it. There are of course different ways of formulating this adversarial goal. Then the adversary creates a replica of the target DQN and initializes it randomly. Then comes the exploitation phase, where the attacker observes the current state and transitions in the environment, estimates the best action according to the adversarial policy derived in step one, and crafts perturbations to induce that adversarial action based on the replica of the target DQN. This is exactly the same as our black-box test-time attack. The attacker applies the perturbation as a man in the middle in the observation channel. The perturbed input is revealed to the target and the attacker waits for the target's action. And this is a loop. It can go on until the training process of the target either converges to a suboptimal policy or up to a certain number of iterations. This is a very rough plot.
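One iteration of the exploitation phase described above can be sketched as follows. All three callables, `adversarial_policy`, `replica_dqn`, and `craft_perturbation`, are hypothetical stand-ins for the components described in the text, not names from the paper's implementation.

```python
def policy_induction_step(state, adversarial_policy, replica_dqn,
                          craft_perturbation):
    """One iteration of the policy induction exploitation loop:
    1. pick the action the adversarial policy wants the target to take;
    2. craft a perturbation against the replica DQN that induces that
       action (e.g. via FGSM or JSMA);
    3. return the perturbed observation, to be injected into the
       observation channel as a man in the middle."""
    wanted_action = adversarial_policy(state)
    delta = craft_perturbation(replica_dqn, state, wanted_action)
    return state + delta
```

The attacker runs this step on every observed transition, relying on transferability for the perturbations crafted on the replica to also fool the target.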
This is not smoothed yet, but you can see that the unperturbed agent, this is in the game of Pong, moves towards convergence to an optimal total sum of rewards, while the attacked agent converges towards the minimum possible return of zero. It's getting closer and closer to zero, which indicates that the training process of DQN, and deep RL in general, can also be targeted through adversarial attacks. Now I'm going to introduce another type of training-time attack. Again, this aims to induce some form of misbehavior. We call this misbehavior addiction, and this is a follow-up to a work I did with my colleague Roman Yampolskiy on psychopathological modeling of AI safety problems. This is a proof of concept. We consider the game of Snake; many of you probably remember Snake from older Nokia phones. The DQN agent is a snake and is learning to play in this environment. What the attacker does is add a drug seed with more instantaneous reward than the typical seed, but which also results in a larger increase in the length of the tail, and you can see where this can end up: if the increase in the tail length is more than a certain amount, then a longer-tailed snake is bound to eat its own tail sooner rather than later. We actually derive some theoretical closed-form solutions for what the additional reward and the increase in tail length should be for addiction to emerge, meaning that the agent learns the more myopic policy instead of the optimal policy, and you can see that it's actually possible to make the agent addicted to the drug seed, as we call it, and this results in learning a suboptimal policy. And, due to time limitations, I'm going to introduce only one more type of attack, that of targeting the confidentiality of a deep RL policy. The question here is: is it possible to extract a deep RL policy from observations of its actions? Why does this matter?
Well, the security challenge posed by this sort of attack is, of course, model theft. A company like Google, or, let's say, Uber or Waymo, may have spent millions or billions of dollars on coming up with a very accurate deep RL policy for autonomous navigation; if it can be stolen by an adversary, then the intellectual property becomes worthless. And also, a stolen, extracted policy can be leveraged in integrity attacks in the same way that we mounted black-box attacks on DQN policies. So, let's see. As it happens, one solution to the sequential decision-making problem is not in the RL domain but in the supervised learning domain, and it's called imitation learning. Imitation learning is the supervised learning of policies from the observed behavior of an expert, and by behavior I mean state-action behavior: what is the policy of the expert? Based on this concept, in 2018, Hester et al. proposed DQfD, deep Q-learning from demonstrations, which is DQN where the initial training is done on observed data using supervised deep learning. So they have data from human players playing a certain game, or humans performing a certain task that they want the agent to learn. The initial step of training for a DQfD agent is supervised learning on observed data, and then it builds on top of that through reinforcement learning approaches, and it was shown that this can result in faster convergence, better sample complexity, and sometimes more interesting and robust policies. As security researchers, you can probably see where this is going. This wonderful algorithm, DQfD, can also be used to replicate policies: instead of applying it to observed data collected from human performance, it can be applied to observed data from a target policy. So, here's a proof-of-concept attack procedure.
The attacker observes and records interactions of the target agent in a particular environment as (state, action, reward, next state) tuples. Then the attacker applies DQfD to learn an imitation of the target policy and Q function. At this point, the attacker may either just walk away and sell the extracted policy, or decide to target it using different adversarial perturbation attacks, some of which we've covered so far in this talk. As a proof of concept, we consider a slightly less complex environment, that of CartPole, where the objective is to stabilize a pole on a cart by moving the cart right and left. The reason for choosing this simple environment is merely economical: we didn't want the experiment to take days or weeks. We start with this simple case and consider different types of policies: DQN with prioritized replay (an enhanced version of classical DQN), proximal policy optimization, and asynchronous actor-critic. And, of course, we also train an adversarial RL agent, a DQN agent, whose objective is to incur maximum loss of reward, or in more technical terms, to maximize the regret of its target. And here are the results. First, with regards to replication progress: based on only 5,000 demonstrations, that is, 5,000 (state, action, reward, next state) observations, we see that all three policies are almost exactly replicated. We can see convergence to the optimal performance of those policies in the environment. And then we perform adversarial training: we train an adversarial RL agent to attack and maximize the regret of those policies. And you can see that this, too, can be easily achieved within very few iterations of training in CartPole.
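The two steps of the extraction attack can be sketched as follows. This is a toy stand-in, not the actual DQfD implementation: the recording function passively watches a Gym-style target, and the "cloning" step is reduced to a per-state majority-vote lookup table; a real attack would fit a neural network on these demonstrations instead.

```python
from collections import Counter, defaultdict

def record_demonstrations(policy, env, n_steps):
    """Step 1 of the attack: record (state, action, reward, next_state)
    tuples by observing the target policy act in its environment.
    `policy` maps a state to an action; `env` follows a Gym-style API."""
    demos = []
    state = env.reset()
    for _ in range(n_steps):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        demos.append((state, action, reward, next_state))
        state = env.reset() if done else next_state
    return demos

def behavioral_clone(demos):
    """Step 2 (toy stand-in for DQfD's supervised phase): for each
    observed state, imitate the target's most frequent action."""
    votes = defaultdict(Counter)
    for state, action, _, _ in demos:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, 0)  # default action for unseen states
```

Even this crude version illustrates the point: nothing about the target needs to be known beyond its observable state-action behavior.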
I believe we can see that for PPO2, which is a somewhat robust deep RL approach, it's possible to incur maximum damage, that is, find a policy that incurs maximum regret on the target, within 60,000 iterations, which is a relatively low amount. Now, what about defenses? Similar to the case of adversarial examples, one technique for reducing the impact of adversarial example attacks is regularization. And one common type of regularization in supervised learning for mitigating adversarial example attacks is adversarial training: training the model on adversarially perturbed samples to make sure it sees different perturbations of the same image and learns that all of them map to the same, correct, label. This is essentially data augmentation used as a regularization technique. So in 2017, when I was just starting to look into this domain, I had gone through the adversarial training literature and thought that the same may also hold true for deep RL. I made a hypothesis, actually two. One is with regards to recovery: if training-time attacks are not continuous, that is, if not all of the observations are perturbed, then the deep RL agent adapts to the environment and adjusts the policy to overcome the attacks. This is at training time. With regards to robustness, I made another hypothesis: such policies, policies trained under attack, are more robust to test-time attacks. This investigation is published in a paper titled Whatever Does Not Kill Deep Reinforcement Learning, Makes It Stronger. Similar to before, we are looking at DQN in the Atari games Breakout, Enduro, and Pong. Now, the way I designed the experiment was based on the probability of attack. As an attacker, I assign a certain probability for each state, each observation, during training time to be perturbed. I perform different experiments with different values of this p_attack: 20%, 40%, 80%, and 1, which means a continuous attack.
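The experimental attack model amounts to a simple probabilistic filter on the agent's observations. Here is a minimal sketch, assuming a Gym-style observation as a list of floats; the `perturb` function is a placeholder (the actual experiments use crafted adversarial perturbations such as FGSM against the policy network, not random noise).

```python
import random

def perturb(observation):
    """Placeholder for a crafted adversarial perturbation; small uniform
    noise stands in for an FGSM-style attack on the policy network."""
    return [x + random.uniform(-0.01, 0.01) for x in observation]

def observe(observation, p_attack):
    """Training-time attack model from the experiment: each observation
    is perturbed independently with probability p_attack
    (0.2, 0.4, 0.8, or 1.0 for a continuous attack)."""
    if random.random() < p_attack:
        return perturb(observation)
    return observation
```

The agent's training loop would then call `observe(obs, p_attack)` on every state before it reaches the learner, which is all the recovery and robustness experiments vary.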
And it's interesting to see that for values of p less than 50%, the agent actually recovers. But for values greater than 50%, the training process plummets and either does not converge or converges to a very low mean return, or mean total reward. I later published a theoretical analysis of why this happens; it's available in my PhD dissertation, which I'll reference in the final slide. It was also interesting to see that the robustness hypothesis holds as well. Here we see that if, after training, we attack the test-time policy with probability 1, the plain, or vanilla, policy, a policy that was not trained adversarially, performs very poorly. However, policies trained under adversarial attacks with probability 0.2, or 20%, and 0.4 perform really well. For policies trained at 80%, as you can see, the policy itself is already performing very poorly. It's surprising to see that at p equal to 1, the performance gets slightly better; it's comparable with 40%. To this day, I'm not entirely sure why this happens. I've repeated the experiments a number of times and still get the same result. This is one of the interesting problems that we are looking at right now as a research group. Another defense that we introduced is based on parameter-space noise. This is very much like dropout. The idea of parameter-space noise was introduced in 2017, I believe independently by Plappert et al. and Fortunato et al. The idea here is, again similar to dropout, to introduce zero-mean random noise to the learnable parameters of the neural network in deep RL, to enhance exploration and convergence in deep RL benchmarks. In another paper, in 2018, we investigated whether this approach can be used to mitigate the impact or severity of policy manipulation attacks on DQN, and it was shown that it actually performs very well compared to vanilla, or classical, DQN. These are the training-time results.
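The core of parameter-space noise is easy to state in code. This sketch, assuming the network's weights are exposed as a list of NumPy arrays, just adds zero-mean Gaussian noise to every learnable parameter; the published methods (Plappert et al., Fortunato et al.) additionally adapt or learn the noise scale, which is omitted here.

```python
import numpy as np

def add_parameter_noise(weights, stddev=0.05, rng=None):
    """Parameter-space noise in the spirit of Plappert et al. and
    Fortunato et al.: perturb the learnable parameters themselves
    with zero-mean Gaussian noise, rather than perturbing actions.
    `weights` is a list of numpy arrays, one per layer."""
    rng = rng or np.random.default_rng()
    return [w + rng.normal(0.0, stddev, size=w.shape) for w in weights]
```

At each rollout, the agent would act with a noisy copy of its network while gradient updates are applied to the clean parameters, which is what gives the dropout-like regularization effect.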
You can see that if parameter-space noise is used, the performance degradation for all environments occurs at a much lower slope, a much lower rate, than with the vanilla architecture. Finally, to propose a solution to the policy extraction problem, I, along with William Hsu at K-State, came up with the idea of watermarking RL policies. Watermarking has already been introduced in deep learning in general. The idea is to come up with a unique signature that is both difficult to remove and does not impact the performance of the policy or model itself, but still provides a unique signature proving that a model is the same as, or a replica of, the suspected model. So we introduce an interesting watermarking procedure. I know it's arrogant of me to call my own work interesting, but I still get excited when I think about the moment I came up with this idea. The idea is to create a second environment whose state space is disjoint from the main environment. So, if you're training an agent to play a game, we create another environment none of whose states are the same as those of the original training or deployment environment of the agent, but the dimensionality of the states is the same. So if each state in the original environment is represented by, say, three values, three features, then each state in the second environment is also represented by three features. It doesn't really matter what the second environment looks like; it's just some other environment that the agent may interact with. We then craft the transition dynamics and reward function of the second environment such that the optimal policy follows a looping trajectory. So an optimal policy for an agent trained in the second environment is one that follows a loop: it goes to, let's say, state one, then state two, state three, and then back to state one.
During training, what happens is that we periodically alternate between the two environments. Let's say that every n iterations of the training process, we take our RL agent from the original environment, drop it in the second environment, train it for a few iterations, and then bring it back to the original environment. Once the agent is trained, if we want to examine authenticity, that is, whether a policy is copied or not, we apply the policy in the second environment and measure the total reward. Here's the experimental setup. Again, we are working with CartPole. The watermarking environment is defined with five states: states one to four, plus a terminal state which should never be reached if the policy is optimal in this environment. None of these states, as represented here, can occur in the original CartPole environment; they are definitely and absolutely impossible in the original environment. As for the transition dynamics, this is how we've defined them. Let the actions be a0 and a1. If the agent is in state i and performs action i modulo 2 (if i is even, this is action a0; if i is odd, action a1), then the next state is state (i mod 4) + 1, and it receives a reward of one. If the agent performs any other action, so instead of going from state one to state two it does something else, it immediately goes to the terminal state and receives a reward of zero. So the optimal trajectory is state one to two to three to four, back to state one, and so on. All right, let's see how it works. Let's look at the test-time performance comparison of watermarked and nominal, non-watermarked, policies. The watermarked policy performs exactly as well as the non-watermarked policies.
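The transition dynamics just described can be implemented in a few lines. This is a sketch, not the experimental code: states are represented here as plain integers for readability, whereas in the actual setup each state would be encoded with the same dimensionality as CartPole observations, using values impossible in the original environment.

```python
class WatermarkEnv:
    """Sketch of the watermarking environment: states 1-4 plus a terminal
    state, with dynamics crafted so the optimal policy follows the loop
    1 -> 2 -> 3 -> 4 -> 1. In state i, the correct action is i mod 2;
    any other action ends the episode with zero reward."""

    TERMINAL = 0

    def __init__(self, max_steps=100):
        self.max_steps = max_steps

    def reset(self):
        self.state, self.t = 1, 0
        return self.state

    def step(self, action):
        self.t += 1
        if action == self.state % 2:           # correct action: i mod 2
            self.state = (self.state % 4) + 1  # next state: (i mod 4) + 1
            reward, done = 1.0, self.t >= self.max_steps
        else:                                  # any other action: terminal
            self.state, reward, done = self.TERMINAL, 0.0, True
        return self.state, reward, done, {}
```

Verification then amounts to running a suspected policy in this environment: a watermarked policy accumulates the maximum reward, while any independently trained policy almost immediately falls into the terminal state.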
It reaches the optimal, or best, performance of 500, and when it's applied to the watermark environment, the second environment, it also performs optimally; it gets the maximum reward possible. However, you can see that if we try to apply the non-watermarked policies to the watermark environment, we see very, very small values of score, or total reward. So it's possible to determine whether a policy is authentic, whether it's an exact copy of another policy, by just applying it to the second environment and seeing whether it performs optimally or not. Now, there are many other things that I really wanted to touch upon in this talk, but unfortunately we are a little short on time. For practitioners, it may be of interest to have some way of benchmarking or evaluating the resilience and robustness of policies, and of comparing different policies and approaches with regards to their resilience and robustness. Some of my work already proposes an RL-based approach to perform this evaluation and benchmarking. I've also done some work investigating the impact of hyperparameter choices on the resilience and robustness of DQNs in particular, but also of other model-free and actor-critic approaches. This can be very helpful to those who want to engineer and design new RL agents to be deployed in critical environments. Also, something that I wanted to mention, but unfortunately don't have time to cover, is that adversarial training is not a silver bullet. It's not an answer to all of the problems in deep RL agents; there are certain limitations on the robustness and resilience obtainable from adversarial training of DQN agents. Adversarial training is also very costly in general, especially when it comes to real-world scenarios, real-world environments and actions.
Some of my recent work is focused on improving the sample efficiency and computational cost of adversarial training via a new exploration mechanism called adversarially-guided exploration, or AGE. All of this work can be found in my PhD dissertation, Security of Deep Reinforcement Learning; you can find it if you search my name on Google Scholar. If you're interested, all of these results are also published in separate papers, in slightly more detail, under similar titles. And finally, some of the open areas of research in this domain. With regards to training-time resilience and robustness, not much has been done for policy search and actor-critic methods, as well as model-based and hybrid methods. Of course, when we talk about model-based methods, there are some approaches from optimal control theory and approximate dynamic programming that may be applied here, but very few have looked at this problem from a security point of view. So if you're interested, this is one of the areas in dire need of security-oriented investigation. As for mitigation of policy replication, one of the ideas my research group is currently working on is constrained randomization of the policy: randomize the policy such that replication through techniques like imitation learning becomes more costly, requiring more samples and more observations, while preserving the performance of the policy. There is almost no work done in multi-agent settings. Of course, adversarial reinforcement learning has been investigated in settings with zero-sum agents, but not really in settings where an external adversary, or adversaries, tries to exploit the inner workings of the agents, the RL components of the agent. One more thing of note is the importance of discounting. The addiction problem that I demonstrated earlier in the Snake agent is mostly due to the constant discounting scheme.
For those of you who come from a reinforcement learning background and are familiar with the basics, you probably know that the discount factor is typically chosen to be 0.99, or something in the same ballpark, and left unchanged; it's treated as a constant throughout the training process. But this is very far from how our brain works, and very far from an optimal or accurate approach to discounting. Our research group has recently started looking into this problem and is working on developing adaptive discounting solutions to enhance the resilience and robustness of RL agents, particularly deep RL agents, in complex environments, for AI safety and security purposes. There are also naturally-inspired approaches that can be looked at, for example approaches coming from, let's say, TD-lambda models or dopamine models of psychopathological or neurological problems and the solutions prescribed for those, as well as approaches from the social sciences, which may help with the security problems arising in multi-agent RL settings. With that, thank you very much, and I believe that at this time I should be available for your questions.