All right, so before wrapping up this notebook, let's quickly recap what we've learned about reinforcement learning so far. We set up the formalism of Markov decision processes, and we said a Markov decision process consists of S, A, P, and R, where S is the set of states, A is the set of actions, P is the transition probability P(s' | s, a), and R is the reward function that maps (s, a, s') to some scalar reward.

And we've learned techniques for dealing with these MDPs. If we know the entire MDP, then we know how to compute the optimal value function and the optimal Q function, both corresponding to the optimal policy. And given that we know how to learn the optimal Q function, using the Q-value iteration we saw just a few slides ago, it's then easy to compute pi star, because pi star is simply the argmax of Q star. So we know exactly how to do all of this if we know the full MDP. Plus, for a given fixed policy, we know how to evaluate it as well; we saw this as one of the subroutines in policy iteration, where you evaluate a policy and find its corresponding V value function or Q value function.

Now, if we don't know the MDP, and in particular we don't know P and R, the transition probability and the reward function, then we've seen the idea of temporal differencing, where you can learn to estimate V for a fixed policy pi; that's policy evaluation through temporal differencing. And then we saw our very first RL algorithm, which uses the same temporal-differencing idea, but this time to learn the optimal Q function Q*(s, a), the one corresponding to the optimal policy.

So we've really only begun to scratch the surface of reinforcement learning. In particular, we've seen Q-learning as an example of what's called value function reinforcement learning, but we haven't yet seen how things like the neural networks we've been working with in recent weeks actually help with this type of value function reinforcement learning. We will see very soon, in deep Q-learning, how neural nets help us with Q-learning.

There's another entire class of methods called policy search reinforcement learning, where you don't bother with learning the optimal Q star at all, like we did in Q-learning. After all, at the end of having learned Q star, what you're really interested in is computing pi star, which is equal to the argmax over a of Q*(s, a). So can't we directly learn pi star? Why do we have to go through this whole process of learning the state-action value function Q star? That is the class of methods called policy search RL.

There is then a third class of methods that tries to bridge the gap between value function reinforcement learning and policy search reinforcement learning. This is called actor-critic RL. The idea, and we'll see this in more detail later, is to take policy search methods and use value functions to help them along, to help them learn faster. So these are actor-critic reinforcement learning approaches, where the actor is a policy and the critic is a value function.
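Before moving on, here's a minimal sketch of the tabular Q-learning update we just described, in Python with NumPy. Treat it as illustration only: the `env.reset()` / `env.step(a)` interface, the epsilon-greedy exploration scheme, and the hyperparameters `alpha`, `gamma`, and `epsilon` are all assumptions added here, not anything specified earlier in the notebook.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: temporal-difference updates toward Q*(s, a)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                      # assumed: returns an integer state
        done = False
        while not done:
            # epsilon-greedy exploration (an assumed exploration scheme)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # assumed env interface
            # TD target: the max over next actions is what makes this
            # learn the *optimal* Q, not the Q of the behavior policy
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    pi = np.argmax(Q, axis=1)                # pi*(s) = argmax_a Q*(s, a)
    return Q, pi
```

Note the max over next-state actions inside the TD target: that is the piece that makes this learn Q star rather than the Q function of whatever policy generated the experience, and the final argmax recovers pi star exactly as discussed above.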
And finally, there's a fourth type of reinforcement learning, very different from all of these approaches. Recall that we actually started this whole discussion by dealing with MDPs where all the quantities were known, and we said we could do policy iteration and value iteration, essentially dynamic programming techniques, to solve those MDPs. If the only difference in the reinforcement learning setting is that we don't know the transition probability P and the reward function R, can't we just learn to approximate P and R from experience, and then use the same dynamic-programming-style techniques as before? That's model-based reinforcement learning; a sketch of this idea follows below.

And aside from all of this, we could also cheat and say: let's not actually do reinforcement learning at all, but instead train agents to perform the same sequential-decision tasks we want to solve, like driving or cooking, by having them mimic an expert like ourselves. We would perform some demonstrations for the agent, and the agent would learn to mimic us. That's a class of approaches called imitation learning. It's not strictly a reinforcement learning approach, although it can be blended together with reinforcement learning approaches.
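To make the model-based idea concrete, here's a hedged sketch under the same caveats as before: we count transitions to estimate P, average rewards to estimate R (simplified here to an expected reward R(s, a) rather than the full R(s, a, s') from our MDP definition), and then run the same value-iteration-style dynamic programming as before, only now on the learned model. The `(s, a, r, s')` tuple format for logged experience and the uniform fallback for unvisited state-action pairs are assumptions for illustration.

```python
import numpy as np

def fit_model(transitions, n_states, n_actions):
    """Estimate P(s'|s,a) and R(s,a) by counting over (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)        # visits to each (s, a)
    # unseen (s, a) pairs fall back to a uniform next-state guess (an assumption)
    P = np.divide(counts, n_sa,
                  out=np.full_like(counts, 1.0 / n_states),
                  where=n_sa > 0)
    R = reward_sum / np.maximum(n_sa[:, :, 0], 1)
    return P, R

def value_iteration(P, R, gamma=0.99, iters=200):
    """The same dynamic programming as before, run on the learned model."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)    # Q[s, a] = R(s, a) + gamma * E_{s'}[V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V     # pi*(s) = argmax_a Q(s, a), plus V*
```

Usage would look like `P, R = fit_model(logged_transitions, n_states, n_actions)` followed by `pi, V = value_iteration(P, R)`. The appeal is that once the model is fit, planning is pure computation; the catch is that errors in the learned P and R propagate into the planned policy.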