All right, now that we've seen some model-free reinforcement learning approaches, let's move on to our next class of approaches, which we'll look at briefly: model-based reinforcement learning. To motivate it, let's go back to why we want to do reinforcement learning in the first place. Remember, we have classes of approaches based on dynamic programming, like value iteration and policy iteration, and from other branches of study like planning and trajectory optimization, we know what to do when the environment dynamics are known. When p(s_{t+1}, r_t | s_t, a_t) is known, you know exactly what to do to maximize your reward. It is only in the setting where we don't know these dynamics, where we don't know the state transition dynamics and we don't know the reward function, that we do reinforcement learning. In particular, in model-free reinforcement learning, you learn a policy mapping from states to optimal actions, and of course you can also do this slightly indirectly by learning a Q function and then using that to determine the policy.

Model-based reinforcement learning says: if the whole point of doing RL is that you don't know the state transitions and you don't know the rewards, then why not learn those directly? Why not learn the environment dynamics, meaning p(s_{t+1}, r_t | s_t, a_t), which you can also think of as p(s_{t+1} | s_t, a_t) together with r(s_t, a_t, s_{t+1}); those are equivalent ways of writing the same thing. If you learn the environment dynamics, then you have reduced your setting to the one we had at the beginning, and planning, trajectory optimization, and dynamic programming can handle that setting afterwards.

So let's look at this in a little more detail, and in particular at what the workflow looks like. Imagine that you execute some random actions and aggregate the (s, a, s') tuples you observe into a dataset D. Then you train a model p(s' | s, a) in a supervised manner, because you now have the label s' and the inputs (s, a) for that model; you can simply do supervised learning to fit it. Once you've learned it, you use p in some way for task execution. One good thing about this is that nothing about executing random actions and aggregating the (s, a, s') tuples was specific to the task we eventually want to execute, so the model is very general: you could train it once and then use it to execute potentially any task afterwards. The flip side is that, even though in theory you could do something like this, it is really hard to collect training data good enough to produce a model that works well for any task. This style of model-based reinforcement learning is called one-shot model learning: you only have one shot at collecting data and training a model, and once you've done that, you have to use it for whatever task comes afterwards.
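To make the one-shot recipe concrete, here is a minimal sketch. It assumes a classic gym-style environment with flat, continuous state and action vectors, and it stands in a plain linear least-squares regressor for the dynamics model; both the environment interface and the choice of regressor are my assumptions for illustration, not something specified in the lecture.

```python
import numpy as np

def collect_random_transitions(env, num_steps):
    """Execute random actions and aggregate (s, a, s') tuples into a dataset D."""
    D = []
    s = env.reset()
    for _ in range(num_steps):
        a = env.action_space.sample()              # task-agnostic random exploration
        s_next, r, done, _ = env.step(a)           # classic gym-style step signature assumed
        D.append((s, a, s_next))
        s = env.reset() if done else s_next
    return D

def fit_dynamics_model(D):
    """Supervised learning of p(s' | s, a): inputs are (s, a), labels are s'.
    A linear least-squares fit stands in for whatever regressor you prefer."""
    X = np.array([np.concatenate([s, a]) for s, a, _ in D])
    Y = np.array([s_next for _, _, s_next in D])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # fit s' ~ [s, a] W
    return lambda s, a: np.concatenate([s, a]) @ W
```

In the full setting you would also fit the reward from the observed r samples in exactly the same supervised way; the sketch keeps only the state-transition part.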
There's another, more commonly used category of approaches called incremental model learning, which looks very similar, except for what happens after you use the learned model for task execution. At the beginning you might do a poor job when using the learned model, because, as we said, it's hard to collect good training data. But once you have tried to use the learned model for task execution, you start producing task-specific data: you get some task-specific (s, a, s') tuples, and you grow your dataset in the direction that matters most for your task. Then you can repeat the process and iterate several times, going back and forth between aggregating data, training a model, and using that model to perform the task and collect better data. This is great because the target task is taken into account as you're collecting data, so it's going to work much better on that target task than the one-shot approach. The flip side is that if model training is expensive, for example if you're training a deep neural network, this becomes cumbersome: it takes a lot of computation and it can take time. And of course, by specializing the model to the task you want to execute, you're also giving up some of the model's ability to work well on other tasks.

Okay, we kind of glossed over on the previous slide what it means to use the dynamics model for a particular task. We mentioned this a couple of slides ago: once we've learned the dynamics model, once we've learned the state transitions, we can simply use dynamic-programming-style approaches like policy iteration to solve the learned MDP. Another class of approaches is to treat the learned MDP as a simulator and run model-free reinforcement learning inside that simulator. After all, once you've learned how the environment transitions and what the rewards are, you have essentially created a simulated world, and inside that simulated world you can imagine experience and train a reinforcement learning algorithm for as long as you like. Even though model-free reinforcement learning, as we discussed, takes a lot of experience to run, because you're doing it all inside the simulator you don't really have to worry about how much experience you need to collect; you can generate millions of episodes if you like, and that's fine. So these are the two main strategies for using the dynamics model after you've learned it.
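Putting the incremental loop together with one concrete way of using the model for task execution, here is a rough continuation of the earlier sketch. It uses a simple random-shooting planner over imagined rollouts, which is my choice for illustration and not something prescribed in the lecture, and for brevity it assumes a known reward function reward_fn rather than a learned one; collect_random_transitions and fit_dynamics_model come from the sketch above.

```python
import numpy as np  # reuses collect_random_transitions and fit_dynamics_model from above

def random_shooting_plan(model, reward_fn, s, sample_action, horizon=10, num_candidates=100):
    """Score candidate action sequences by rolling them through the learned model,
    then return the first action of the best sequence (replanning at every real step)."""
    best_return, best_action = -np.inf, None
    for _ in range(num_candidates):
        seq = [sample_action() for _ in range(horizon)]
        sim_s, total = s, 0.0
        for a in seq:
            sim_next = model(sim_s, a)             # imagined transition from the learned model
            total += reward_fn(sim_s, a, sim_next)
            sim_s = sim_next
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action

def incremental_model_learning(env, reward_fn, num_iters=10, steps_per_iter=500):
    """Alternate between fitting the model and collecting task-specific data by acting with it."""
    D = collect_random_transitions(env, steps_per_iter)        # bootstrap with random data
    for _ in range(num_iters):
        model = fit_dynamics_model(D)
        s = env.reset()
        for _ in range(steps_per_iter):
            a = random_shooting_plan(model, reward_fn, s, env.action_space.sample)
            s_next, r, done, _ = env.step(a)
            D.append((s, a, s_next))               # grow D in the direction the task cares about
            s = env.reset() if done else s_next
    return model, D
```

Because each real action is chosen with the current model, the new transitions added to D come from exactly the states the task actually visits, which is what makes the next round of model fitting more useful for that task.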
All right, let's see some examples of model-based reinforcement learning successes. Here's an example of a race car that's driving with model-based reinforcement learning, and it's doing a very good job of it. It runs at really high speed, I believe up to 40 miles per hour, this tiny race car on a mud race track that is quite hard to race on efficiently, and you can see that it's pulling off some pretty impressive maneuvers. On the right, you see some examples of manipulation tasks being learned with model-based reinforcement learning: in one case a simulated task of writing with a pencil, and in the other, juggling two balls in a robotic hand. Again, these are quite difficult tasks; manipulation with such contact-rich events tends to be really hard, so it is quite impressive that it's actually able to learn this within two hours of real robot training.

Here are some examples of learning from pixels. In this case you're given the image representation of the scene and you learn the dynamics in image space, because you have a deformable object that's not easy to represent in a lower-dimensional representation. The task being learned is folding the cloth: you're trying to take this part of the shirt and place it near the point specified in the bottom right. And finally, here is the task of learning how to control different simulated robots, including a humanoid, a half cheetah, and so on. You can see that over the course of only about 2000 attempts, model-based reinforcement learning is able to learn pretty good policies for controlling these.

To wrap up, let's compare model-based reinforcement learning to model-free reinforcement learning along a few axes. Model-based reinforcement learning tends to be modular: you learn a model separately, and then you apply some kind of dynamic programming or one of the other approaches we've discussed to it. That lends itself to easy debuggability, because with two phases you can test each independently and see what's going on. It's also easy to inject approximate physics knowledge, especially when you're learning a model of a physical system; it's easy to say "I know this system roughly follows Newton's laws," and injecting that domain knowledge lets you learn more efficiently. Another nice thing about model-based reinforcement learning is that you can piggyback on a really large literature, many decades of work on planning, trajectory optimization, and dynamic programming. Model-based reinforcement learning is also sample efficient, as we've seen: you can learn some robotics tasks within only a couple of hours that would be much harder with model-free reinforcement learning, at least with the methods we have today. And you get reusable dynamics models: even if you do incremental model-based reinforcement learning, you are after all learning a dynamics model of the system, which is not entirely task specific, so you should be able to reuse the dynamics to some extent for new tasks as well.

Among the negatives of model-based RL is that the models are trained more or less independently of the task, so you don't get task gradients flowing directly into the dynamics model, and this can lead to biased models that limit performance. Sometimes it's also harder to learn the dynamics than to learn the task: for example, if you're trying to learn to swim in a fast-flowing river, the dynamics of that river are probably harder to learn than the fact that you should move your hands in a particular way, so in some cases it may not be worthwhile to try to learn the dynamics exactly. Finally, there's a problem with the kinds of model-based RL approaches that rely purely on planning to select actions: planning can be quite slow and deliberative compared to direct policies like the ones we learned with Q-learning or with direct policy search.