Did you know that more neurons get activated in your brain when you walk than when you play a game of chess? You probably don't consciously think about it when you walk, but it turns out that the mechanism of walking is extremely complex. As children, we learn to automate this mechanism by coordinating legs, arms, and hips together with sensing our environment. And even though it seems really easy for us, it's really not. So my question to you is: how would you build an AI system that could walk?

The DARPA Robotics Challenge (these are images from back in 2015) was motivated by the Fukushima disaster. The goal was to develop a mobile robot that could move through and within disaster zones and perform useful tasks, like using power tools, opening doors, walking up stairs, walking through the site. And like I said, as easy as it seems for us humans, it's really not. For an AI system, walking is extremely complex, and of course, at times, things will go wrong. Some of these falls are really hard. But the point is that designing such a complex system and powering it with AI is by no means an easy thing to do.

So, inspired by this, and also by the movie How to Train Your Dragon, given that we're in a movie theater, this talk is titled How to Train Your Robot, in this case to walk, with deep reinforcement learning. I'm Lucas García, an application engineer at MathWorks, and for the last decade or so I've been working with, and obsessed with, what math can do for AI. I want to start by thanking some of my colleagues at MathWorks who helped me put together this material.

So let's start by discussing what the goal of control is. Broadly speaking, the goal of a control system is to determine the right actions into the system that generate the desired system behavior. With feedback control systems, the controller observes that behavior and uses those state observations to improve its performance and correct for random disturbances and errors. Engineers typically use this feedback, along with a model of the system or plant, also known as the environment, to design a controller that meets the system requirements. This is a simple concept to put into words, but it can be hard to achieve if the system is hard to model, like a walking robot; is highly nonlinear, like a walking robot; or has large state or action spaces, like a walking robot. So let's think about this in the context of our walking robot.

We'll start by thinking about the complexity of building a walking robot from the traditional controls approach, the way an engineer would traditionally do it. We start off by getting some data, say camera data. Once we acquire those images, we can extract features, and then, together with data coming from other sensors, we can complete the state estimation, which we then use, together with a model of the system, to design the control system. Very likely, this control system will consist of multiple control loops that all interact with each other to generate this complex mechanism of walking: there could be low-level controllers responsible for the actuators in the joints, higher-level controllers managing leg and trunk trajectories, or even higher-level controllers managing balance. All of this has to work together in an uncertain environment, with all of these loops interacting with each other, and it is not at all clear how to structure those loops or how to break the problem up into parts.
So imagine if we were able to squeeze everything down into a single black-box controller that takes in observations as inputs and outputs the motor commands directly. As an engineer, if you were to design such a controller, how would you do it? Well, the thing is, we're not going to design it. This is where machine learning comes in, and in particular reinforcement learning.

So what is reinforcement learning? First off, my apologies to those for whom this is a review, but I think it's important for context. Reinforcement learning is a subset of machine learning that learns from data coming from a dynamic environment. The goal is not to cluster or classify the data, but to find the best sequence of actions that generates the desired outcome. Of all the definitions out there, the one that I think best describes what it is comes from Sutton and Barto in their reinforcement learning book. They describe it as learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. It continues: the learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

As for applications, reinforcement learning has been widely used and popularized with video games. You may all have heard of the company DeepMind, which created AlphaGo, a computer program that could beat the world's best Go players. Years later they created AlphaStar, an AI that could beat the world's top professional players at StarCraft II. But reinforcement learning has grown beyond video games into areas such as autonomous vehicles, where we're seeing examples like AWS's DeepRacer competition; controls, for example to stabilize a drone in a highly dynamic and turbulent flow; and robotics, to teach a robot to walk.

All right, so before diving any deeper, I want to review some reinforcement learning terminology. Although we're not pieces of software (unless maybe someone proves that we live in a simulation, but that's a different movie), we learn in a similar way to how a software agent learns. So this is our agent, and we're going to be thinking about our walking robot. We can think of the software running on the robot as the agent that operates in a specific environment, moving the robot's joints, for instance, in order to walk. The agent, given a set of observations, determines which actions to take; those are the outputs. The function that maps observations to actions is called the policy. The agent is then rewarded by the environment for taking actions that are good, like staying upright and continuing to walk, and at the same time it gets low or negative reward for taking actions that are bad, such as falling to the ground.
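To make the agent, environment, policy, and reward concrete, here is a minimal sketch of the interaction loop just described, written as MATLAB-style pseudocode. The function names (resetEnvironment, stepEnvironment, agentLearn) are placeholders for illustration, not the Reinforcement Learning Toolbox API.

```matlab
% Minimal sketch of the agent-environment loop (placeholder names, not a real API).
state = resetEnvironment();                              % initial observations
for t = 1:numSteps
    action = policy(state);                              % the policy maps observations to actions
    [nextState, reward, isDone] = stepEnvironment(action); % environment reacts and scores the action
    agentLearn(state, action, reward, nextState);        % learning algorithm updates the policy
    state = nextState;
    if isDone                                            % e.g. the robot fell to the ground
        break;
    end
end
```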
We're going to continue building our basic understanding of reinforcement learning with the reinforcement learning workflow. We first need to choose an environment where our agent can learn. We need to define what should exist within that environment, and also whether it is a simulation or a real physical setup. We also need to figure out what we want our agent to do and craft a reward function that incentivizes the agent to do just that. Then we have to come up with a policy, that is, how to structure the logic and the parameters that make up the decision-making part of the agent. We then train our agent to figure out what the optimal policy is, and finally we want to deploy the policy into the field, onto the real robot, and test it.

All right, so let's start with the environment. We define the environment as everything that exists outside of the agent. It is where the agent sends actions to, and it is also what generates rewards and observations. For our example, we're going to use a simulation environment and not the real robot, and we'll touch on the reasons for that. The agent is just the piece of software that updates the policy and generates the actions; it is the brain of the robot, so to speak.

There are a few good reasons to use a simulation environment instead of the real robot. First off, if we were using a real robot, it would very likely fall even before it starts to move its legs, and that would be very expensive; it would also be very time-consuming to pick up the robot every time it falls. So it makes sense to use a simulation environment, and that brings a couple of other benefits. If we simulate the environment, we can run faster than real time, and this is going to be an important point, because we're going to have to run thousands and thousands of simulations. We can also run those simulations in parallel and speed up the training process. And finally, we can test conditions that are hard to test in the real world but easy to test in a simulated environment, like maybe walking over ice.

If we think about what the observations are going to look like, we're going to measure, using sensors, the translation and rotation of the robot body, the joint angles of each leg and their derivatives, and indicators, Fr and Fl, of the normal force with respect to the ground, which tell us whether each foot is in contact with the ground or not. With those measurements, the agent then produces an action, which in our case we've chosen to be the torques applied at each of the joints of the robot's legs. At each time step, the environment generates some kind of reward; we get to choose that as the designer, so we'll look into that in just a minute.

Now, the question is, how do we design such an environment? There are multiple options out there, and in our case we decided to use Simulink. So what is Simulink? Simulink is a block diagram environment for multi-domain simulation and model-based design. It provides a graphical editor together with customizable block libraries and solvers that allow you to model and simulate dynamic systems. The idea behind Simulink is that you drag and drop the blocks that do the math of your model, as if it were a whiteboard. Here we're seeing the Simulink model: you see a block for the RL agent and a block for the walking robot, which is our environment. We have blocks representing the robot's legs, blocks representing the sensors, and the world and the ground. If you double-click on each of these blocks, or subsystems, you get the more detailed logic of how we built the robot leg. The same goes for the sensors, whose measurements are then transformed into the observations that are fed to the agent. And of course, the world and the ground where the robot is operating contain the physics underlying how the robot moves through the environment. So that was a brief introduction to Simulink.
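As a rough sketch of how such a Simulink environment might be wired up from MATLAB, the snippet below defines observation and action specifications and wraps the Simulink model as an RL environment. The model name, the agent block path, and the six-torque action size are assumptions for illustration; only the 31-element observation vector comes from the talk.

```matlab
% Sketch: wrapping a Simulink walking-robot model as an RL environment.
% Model/block names and the action dimension are illustrative assumptions.
obsInfo = rlNumericSpec([31 1]);                  % body pose, joint angles/velocities, foot contacts
obsInfo.Name = 'observations';

actInfo = rlNumericSpec([6 1], ...                % assumed: 3 joint torques per leg, 2 legs
    'LowerLimit', -1, 'UpperLimit', 1);           % normalized torque commands
actInfo.Name = 'torques';

mdl = 'walkingRobotModel';                        % hypothetical model name
agentBlk = [mdl '/RL Agent'];                     % hypothetical path to the agent block
env = rlSimulinkEnv(mdl, agentBlk, obsInfo, actInfo);
```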
Now I'd like to discuss what the reward is. The reward is a function that outputs a scalar number representing the goodness, or the benefit, of an agent being in a particular state and taking a particular action. Creating a reward function is very, very easy; it's just a function of the state and the action. But creating a good reward function is very, very hard, because unfortunately there's no straightforward way to come up with an agent and a reward function and guarantee that the agent will converge to the solution you actually want. So let's see how we came up with the reward function for the walking robot, and you'll get to see the walking robot for the first time now.

All right, so what do we want to accomplish? We want the robot to walk in a straight line, and of course, we don't want the robot to fall. We're going to craft this reward function using multiple terms. First, for instance, we can reward the robot for its forward velocity along the x-axis, so that the robot prefers walking faster rather than slower. With this reward function, after some training, this is what happens: the robot quickly dives forward to get a burst of speed, but it doesn't really know how to use its joints. So what are we getting here? This is a very common local minimum, where the robot, very early in the process, learns that by diving it can maximize its forward-motion reward.

So how can we fix this? We can work around it by adding a duration reward, or survival reward, so that the robot lasts longer. Tf is the final simulation time, Ts is the sample time. Doing this and training again, this is what happens: it takes a step and dives again, so we're not there yet. Probably, had we trained this agent for longer, we could have gotten better results. But what makes sense now is to add a reward term that helps keep the robot as close to a standing height as possible. So we introduce a penalty term that penalizes the robot whenever its height z deviates from a reference height z0. Now this is looking a little bit better, but as you can see, it's not very natural-looking: the robot kind of jitters its leg back and forth, and it's dragging the right leg as if it were injured or something.

So how are we going to work around that? We introduce another penalty term, so that we reward the agent for minimizing actuator effort. By doing so, we get a more realistic result, let's say: the robot is now using both legs equally. There's just one remaining problem: we designed the robot to walk straight, and as you can see, it's already drifting away from the original path. To fix this, we add the final reward term, which makes sure the robot walks straight by introducing a penalty whenever it drifts or strays along the y-axis. And this is going to be our final reward.

So how do we integrate this with the environment? We have a block here in Simulink, and I'm going to overlay the reward function that we came up with. You can see that there is a direct correspondence between these blocks and each of the reward terms.
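Putting the five terms together, a reward function along these lines could look like the sketch below. The structure follows the talk (forward velocity, survival bonus, height penalty, effort penalty, lateral-drift penalty), but the weights are illustrative guesses, not the exact values used in the demo.

```matlab
% Sketch of the shaped reward; coefficients are illustrative, not the demo's exact values.
function r = walkingReward(vx, y, z, z0, u, Ts, Tf)
    forwardReward  =  1.0  * vx;             % reward forward velocity along x
    aliveBonus     =  25.0 * Ts / Tf;        % survival reward accumulated every sample time
    heightPenalty  = -50.0 * (z - z0)^2;     % penalize deviation from the standing height z0
    effortPenalty  = -0.02 * sum(u.^2);      % penalize actuator effort (vector of joint torques u)
    lateralPenalty = -3.0  * y^2;            % penalize drift away from the straight path along y
    r = forwardReward + aliveBonus + heightPenalty + effortPenalty + lateralPenalty;
end
```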
Now, I didn't mention it earlier, but we could have used other tools for modeling the environment. We could have used MATLAB, of course; MATLAB provides an API to build up your own environments and also has predefined environments. You could also have used third-party environments and interfaced those with MATLAB. Either way, in this particular case, because of the complex modeling underlying the walking robot, it made much more sense to do it with a multi-domain simulation environment.

Now that we have discussed the role of the environment and the reward, let's talk about the policy. To do so, we need to look in more detail at what the agent really is. The agent, which we described as the brain of the robot, consists of two main parts: the policy and the reinforcement learning algorithm. The policy is a function that maps observations to actions, whereas the reinforcement learning algorithm is the optimization method used to find the optimal policy. The learning algorithm changes the policy based on the actions that were taken, the observations from the environment, and the amount of reward collected. The goal of the agent is to use this reinforcement learning algorithm to modify its policy as it interacts with the environment so that, given any state, the policy always chooses the best action, and the best action is the one that collects the most reward.

Of course, during this talk we don't have time to go through every possible policy and learning algorithm, so I'll just cut to the chase and discuss the type of policy functions that I'm interested in at this point, and those are neural networks. Recall that the goal of the policy is to map observations to actions. So we can come up with a neural network and use it as a universal function approximator that takes in state observations as inputs and outputs the actions. This type of neural network is called an actor, because it tells the agent which actions to take given the current state.

Here I want to make an important distinction between two concepts in reinforcement learning, and that is the difference between reward and value. Earlier, we described reward as the instantaneous benefit of being in a particular state and taking a particular action. Value, on the other hand, is the total reward that an agent expects to receive from a state onwards into the future. These two concepts are key because, given any state, the action that collects the most reward right now is not necessarily the action that collects the most total reward in the long run. What this means is that somehow we have to check the value of every action from a given state in order to determine what the best action is. So the actor network alone doesn't entirely solve the problem, and we're looking for a second type of network. This one takes in state observations and actions as inputs, and the neural network returns the value of that state-action pair; the policy is then to choose the action with the highest value. This network is called a critic, because it criticizes the agent's choices by looking at possible actions.
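The talk doesn't write this out, but the standard way to formalize the reward/value distinction is that the reward r is the instantaneous signal, while the value of a state (or of a state-action pair, which is what the critic just described approximates) under a policy π is the expected discounted sum of future rewards, with discount factor γ:

```latex
V^{\pi}(s)   = \mathbb{E}_{\pi}\!\left[\textstyle\sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\textstyle\sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]
```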
However, this setup does not work well on its own either, and the reason is: how could you possibly try every possible action and find the one with the maximum value? For large state or action spaces, that's computationally very expensive. The way we solve this problem is by merging both networks together in a class of algorithms called actor-critic, and this is how it works. First, the actor is a network that tries to predict an action given the current state, and then the critic is a second network that estimates the value of that state-action pair, that is, the state and the action that the actor took. This works well for continuous action spaces, because the critic does not have to look at an infinite number of actions, just one, the one that the actor took, so it does not have to find the best action by evaluating all of them.

Let's see how this works in practice. The critic uses the reward from the environment to determine the accuracy of its value prediction. That error is then used to update the critic, so that it has a better estimate the next time it's in that state, and the actor also updates itself based on the response from the critic, so it adjusts the probabilities of taking that action again in the future. In this way, the policy ascends the reward slope in the direction the critic recommends, rather than using the rewards directly.

So this is how we do it in MATLAB. We start by defining the networks, first the critic network, by just concatenating layers; these can be ReLU layers, fully connected layers, tanh layers. We do the same for the actor. There's an API that allows you to put this together, and it's very straightforward. Once we're done, we also create an RL representation, a reinforcement learning representation, which ties the network to the observation and action specifications of the environment. We run this, and now I want to show you how you could have done this interactively, so that you get a feeling for what it looks like in an interactive environment. This is the Deep Network Designer app; here you can drag and drop layers onto the canvas to build your network, but I'm just importing the critic network so that you can see what it looks like. You can click on each of these layers and, on the right-hand side, see all the properties for each layer. You see the two branches, the observation branch and the action branch, and they get merged together to produce the final output of the critic network. Had you done this interactively instead of programmatically, you could explore the generated code and use it as part of your algorithm or your program.
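As a sketch of what "concatenating layers" for the critic can look like in code, here is a rough two-branch critic in MATLAB. Layer sizes are illustrative, the representation constructor names have changed across Reinforcement Learning Toolbox releases, and obsInfo and actInfo are assumed to be the environment specifications defined earlier, so treat the exact calls as approximate.

```matlab
% Rough sketch of a two-branch critic (observation path + action path merged).
% Layer sizes are illustrative; constructor names vary by toolbox release.
obsPath = [featureInputLayer(31, 'Name', 'obsIn')
           fullyConnectedLayer(128, 'Name', 'fcObs')
           reluLayer('Name', 'reluObs')];
actPath = [featureInputLayer(6, 'Name', 'actIn')
           fullyConnectedLayer(128, 'Name', 'fcAct')];
common  = [concatenationLayer(1, 2, 'Name', 'concat')
           reluLayer('Name', 'reluCommon')
           fullyConnectedLayer(1, 'Name', 'QValue')];    % scalar value of the (state, action) pair

criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, common);
criticNet = connectLayers(criticNet, 'reluObs', 'concat/in1');
criticNet = connectLayers(criticNet, 'fcAct', 'concat/in2');

critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation', {'obsIn'}, 'Action', {'actIn'});
% The actor is built the same way: a single observation branch ending in a tanh
% layer sized to the action vector, wrapped in a deterministic actor representation.
```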
All right, so once we're done crafting the reward, determining what the environment should consist of, and deciding what the policy should look like, we can train the agent. Training is going to be a little tricky, because it involves running lots and lots of simulations, and by lots we mean on the order of thousands, or tens of thousands. So it's going to be key to be able to run those simulations in parallel, whether that's on a compute cluster, a local multicore machine, or the cloud. When training with parallel computing, how the agent and environment work together is also a little different, so this is how it works. The client sends copies of the agent and the environment to the workers, the parallel workers doing the job. Each worker runs its simulations, collects its experience, and sends the data back to the host. The client agent learns from the parameters sent by the workers and sends the updated policy data back to the workers, and then learning continues. Also, if you're using deep neural networks for your actor or critic representation, it's worth using GPUs, as you probably know already.

This is how it works in the MATLAB environment. First, we have some training options that we can choose for the agent and some hyperparameters for the training. Here we're choosing a maximum of 5,000 episodes. We're also using a stopping criterion of reaching a reward of 120 or greater, so whichever comes first. Then we configure the environment, configure the networks, or rather create the architectures, and then train. As we train, we also have a saving criterion, so any agent that gets a reward over 150 will be saved for inspection later.

Here you see the Reinforcement Learning Episode Manager; you see the training as it progresses. We're training on two workers and one GPU, so it's a local machine. As for the curves: the red one is the moving average, the blue one is the reward of each simulation, and the green one, labeled Episode Q0, represents the initial value that the critic estimates for that simulation. As you'll see, the stopping criterion is reaching either a reward of 120 or 5,000 simulations or episodes, whichever comes first. The reinforcement learning algorithm we're using is called DDPG, or Deep Deterministic Policy Gradient. This is known to be a high-variance algorithm, which means the reward isn't guaranteed to keep increasing monotonically as training continues, so we have to watch out for that. And here we're seeing the result from one of the agents that we saved, one of the agents that reached a reward of 150 or more, and we see that the robot walks quite stably and moves its right and left legs quite efficiently.
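A compact sketch of the training setup just described, plus the policy export used in the deployment step discussed next, might look like this in MATLAB. The episode count and the 120/150 reward thresholds come from the talk; the agent is assumed to be a DDPG agent built from the actor and critic above, and option names may differ slightly between releases.

```matlab
% Sketch: training options using the thresholds mentioned in the talk,
% then exporting the trained policy for code generation.
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',          5000, ...
    'StopTrainingCriteria', 'EpisodeReward', 'StopTrainingValue', 120, ...
    'SaveAgentCriteria',    'EpisodeReward', 'SaveAgentValue',    150, ...
    'UseParallel',          true, ...                 % run episodes on parallel workers
    'Plots',                'training-progress');     % opens the Episode Manager

% 'agent' is assumed to be an rlDDPGAgent created from the actor and critic above.
trainingStats = train(agent, env, trainOpts);

% Generate a standalone policy function (plus a MAT-file with the network
% parameters) that GPU Coder / MATLAB Coder can turn into CUDA or C code, e.g.:
generatePolicyFunction(agent);
% codegen -config coder.gpuConfig('mex') evaluatePolicy -args {zeros(31,1)}
```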
Once we're done training, we want to deploy, and to do so we need to think about what we just did. We were training offline, using a simulated environment, so now we want to deploy the policy onto the target hardware. We don't want to recode the policy in another language, because that would be very time-consuming and very error-prone, so instead we automatically generate C code or CUDA code from the policy, or the agent, that we have trained, and then run it on the embedded system. Here, you see that we are generating a function that we can then deploy, and when we open up this function, it contains a reference to a file called agentData.mat that has the network parameters of the actor network that is going to be deployed. We then open up a tool for code generation, called GPU Coder, and provide an entry point to the app. Next, we have to do something obvious: define the input data types. C and CUDA are strongly typed, as opposed to MATLAB, so we need to tell the tool how it must generate code for us. In this case, the input is a double array of 31 by 1. Here we can check for runtime errors, but we're going to skip this for now. Then we can choose what type of code we want to generate. Here we're just choosing the host computer, so whatever CPU and GPU that computer has, but you can also target NVIDIA embedded systems like NVIDIA Jetson or NVIDIA Drive. Once code generation is done, you get a nice report, and one of the things that I really like is how you can trace between the original MATLAB code and the automatically generated CUDA code. You can see, for instance, that the predict function here corresponds to the deep learning network predict function in the generated code, look for that in the generated source files, and see how memory copies are being managed between the host and the device, that is, between the CPU and the GPU.

All right, so we covered a lot today, and I want to finish off with some key takeaways. First, we saw how reinforcement learning can solve complicated problems, and we saw this in the context of controls and robotics, building a robot that could walk. Importantly, reinforcement learning can be applied to many, many other fields; you just need to determine whether it's the right approach versus other alternatives. We saw that deep neural networks can handle large, continuous, high-dimensional state and action spaces with actor and critic networks. And we also showed a complete workflow for deep reinforcement learning with MATLAB and Simulink. Now, if you want to play with this, all the code for the environment, the agents, everything, is available on our GitHub site, and you can also access a trial license for MATLAB so that you can use those files. And if you can't wait to play with it because you're stuck here at the conference, feel free to come by the booth and play with it, play with the rewards, play with the environment.

Before I finish, I want to end with one example from one of our customers, so I would like to introduce you to Justin. Justin is a humanoid robot developed by DLR, the German Aerospace Center. It has up to 53 degrees of freedom, which allows it to almost match human performance for many activities. Humanoid robots like Justin need to process inputs coming from a wide variety of sensors and plan trajectories that manage the coordinated movement of dozens of joints at the same time. DLR used MATLAB and Simulink to design and prototype their algorithms and test them on the real hardware on Justin, by generating optimized C code that could run on Justin's real-time operating system. What this allowed them to do is reduce the time it takes to bring code to production from the weeks they would need for manual coding down to just a few hours. Nonlinear optimization algorithms are used for planning, to maximize, for instance, the distance over which Justin can control the ball, as we were seeing in the video, and to perform coordinated motions from the cameras to the fingertips. So this is another area where math and optimization, as is the case with deep reinforcement learning, play a big role in enhancing AI systems. These are some of the areas I find more and more inspiring. So, what will your next AI look like? Thank you all for your attention.

Thank you, Lucas. Who has questions? I thought you were going to make it sound easy to build a robot. I had no idea what I was looking at.
But, any questions for Lucas about how to build a dragon? No, how to build a robot? I can't see any. Okay, well, I do have one question. When we think about the robots of the future, what are we going to see in 10, 20, 30 years?

Oh, that's a tough one. We're already seeing a lot of amazing maneuvers from many companies, Boston Dynamics being one of them; you may have seen some of their robots doing very impressive parkour. What I think is that more and more we'll get used to having robots as part of our everyday interactions. Maybe robots that assist the elderly in their homes; that is, I think, the area where we're probably going to get there first. And then, as the technology evolves and we're able to solve more complex problems, who knows?

Okay, and a legacy question for you. I had the robot that came with the original 8-bit Nintendo. What was it called? Do you remember? I don't recall that. Does anyone know? NEO, maybe? NEO. I don't know. It couldn't do much. It never worked; I never had fresh batteries or something. Great. Well, thank you, Lucas. Thank you very much.