Welcome back, everyone. This equation that we wrote out is the Bellman equation for the case where we consider history-dependent randomized policies. Let us look at it a little more closely. Optically it looks very similar to the equation we had for Markov policies. Notice that, once again, the equation asks us to find these functions for every value of the argument: in particular, we need to compute the function J_t(h_t) for every possible history h_t by carrying out the optimization on the right-hand side. We again have the terminal condition, where J_n(h_n) is defined to be r_n(s_n) for all h_n, with r_n the terminal reward. So this is essentially very similar to what we had for Markov policies. However, the complexity of solving it is far greater, simply because you now have to do this for every possible history h_t, which entails an extremely large number of optimization problems: one for each and every history that could be realized in the problem.

What we will do now is use this equation to conclude that, without loss of generality, one can work with only Markov policies. The way we will do that is by showing that the function J_t(h_t) is in fact not a function of the whole history h_t at all; it is a function only of the current state s_t. In other words, although the equation asks us to solve it for every possible history, we will show that one does not need to do so: it suffices to do it for the current state alone, and that tells us everything we need to know about the value function. From there we will conclude that the optimal policy can also be chosen to be a function of just the current state; one does not really need to know the entire history. That is our agenda for now.

Before I proceed, let us make a few observations and note down the several J's that we have written out. First was the quantity J*, which we wrote as a function of s_0; this is the optimal value function for the entire problem, for the MDP as a whole. Then we had J^pi, the value that you would get under a policy pi. Then we also had J^pi_t, which we wrote as a function of h_t since we were considering history-dependent policies; this is the value of the truncated problem, the problem from time t onwards, under the policy pi (by value I simply mean expected reward). Then there is J*_t, the optimal expected reward of the truncated problem, where the optimum is over history-dependent randomized policies, that is, over policies from Pi^HR. And now we have just written out a fifth J, namely J_t(h_t), which is the value function computed through the Bellman equation.
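To keep the recursion in front of us, here is the Bellman equation being referred to, written out with the notation used in this lecture (stage reward r_t, terminal reward r_n, transition probability p(j | s, a), feasible action set A(s)); take the exact symbols as a sketch of what is on the board rather than as canonical:

\[
J_t(h_t) = \max_{a \in A(s_t)} \Big[ r_t(s_t, a) + \sum_{j \in S} p(j \mid s_t, a)\, J_{t+1}(h_t, a, j) \Big], \qquad t = 0, 1, \dots, n-1,
\]
\[
J_n(h_n) = r_n(s_n) \quad \text{for all } h_n,
\]

where (h_t, a, j) denotes the history obtained by appending to h_t the action taken at time t and the state reached at time t+1. The complexity issue is visible here: the recursion has to be solved once per history h_t, and the number of possible histories grows multiplicatively with t.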
So this is what we get when we write out the Bellman equation: that is J_t(h_t), and what we get from the Bellman equation carries no star and no pi. J*_t is the optimal reward of the truncated problem over all history-dependent randomized policies; J^pi_t is the reward of the truncated problem when we start with a policy pi; J^pi, without the index t, is the expected reward under a policy pi but for the entire problem; and J* is the optimal expected reward for the entire MDP.

Our main result is going to be that there is actually not much difference between some of these quantities. In particular, we will show that J_t(h_t) is nothing but J*_t(h_t); in other words, these two quantities are equal. The way to see this is again through the principle of optimality. The principle of optimality tells you that if you are optimal for the entire problem, then you are also optimal for every subproblem that emerges, and I can apply this principle from any time onwards: if I have a policy that is optimal from a certain time t onwards and I then look at the problem from a later time t' > t, then that policy, truncated from t' onwards, is also optimal. What this means is that the functions being computed recursively here are in fact computing the optimal reward: at the terminal stage we have the optimal reward, the optimal reward at the penultimate stage is then given by this equation, the one before that by a similar equation, and so on. Consequently, the following theorem follows in a rather straightforward way.

Theorem: Suppose J_t, for t = 0 to n, solve the Bellman equations. Then J*_t(h_t) = J_t(h_t) for all histories h_t in H_t and for all times t.

So if you have a collection of functions J_t that collectively solve these Bellman equations, meaning they are obtained from these Bellman equations, then you can conclude that at any time t and for every history h_t, J_t equals J*_t. This is actually very interesting and powerful, because J*_t, remember, was defined for just one instant in time, whereas to even compute J_t you had to apply the Bellman equation starting from n and working backwards until time t. In the process of solving the Bellman equation we therefore end up finding all the J_t's that we need. You cannot isolate one of them and compute only that one, because they are all interlinked: the computation works backwards in time and you need the later ones in order to compute the earlier ones.
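In symbols, the induction behind this theorem runs roughly as follows (a sketch of the argument under the same notation as above, not the full proof):

\[
J^*_n(h_n) = r_n(s_n) = J_n(h_n), \qquad
J^*_t(h_t) = \max_{a \in A(s_t)} \Big[ r_t(s_t, a) + \sum_{j \in S} p(j \mid s_t, a)\, J^*_{t+1}(h_t, a, j) \Big],
\]

the second equality being the principle of optimality: the best achievable expected reward from time t onwards is obtained by choosing the best first action and then behaving optimally from time t+1 onwards. Since this is exactly the recursion that defines J_t, working backwards from t = n gives J*_t = J_t at every stage.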
But once all of those computations are done, what this theorem is assuring us is that we have actually found the optimal reward, not just for one particular time, but for every time and for every history: for every history h_t and every time t, J*_t(h_t) is equal to J_t(h_t).

Now let us take this one step further and see what it has to say about the dependence of J_t, or equivalently J*_t, on s_t. Here is our next theorem.

Theorem: For each t from 0 to n, J*_t(h_t) depends on h_t only through s_t.

We will actually prove this result, but notice that the claim is about J*_t, the optimal reward of the truncated problem starting with a history h_t. If one simply looks at the truncated problem by itself, it is not evident that the history is irrelevant. But once one notices that J*_t(h_t) is equal to J_t(h_t), it becomes possible to use the Bellman equation, because that is what defines the J_t's, and then conclude something about the nature of J*_t(h_t). That is what we will do now. The proof will be by induction.

To begin with, notice that J_n(h_n) = J*_n(h_n) = r_n(s_n), the terminal reward. In other words, J_n is a function of only s_n (and hence J*_n as well, but I will not keep writing that again and again). Now assume that the claim is valid for t = k+1 up to n; this is our induction hypothesis. Based on that, let us do a calculation. Consider J*_k(h_k), which is equal to J_k(h_k); by the Bellman equation this equals the maximum over a in A(s_k) of r_k(s_k, a) plus the sum over j in S of p(j | s_k, a) times J*_{k+1}(h_{k+1}), and here I can simply write J_{k+1} in place of J*_{k+1}, as a function of h_{k+1}.

Now notice, firstly, that h_{k+1} is nothing but (h_k, a, j). But also, by the induction hypothesis, J_{k+1}(h_k, a, j) is equal to J_{k+1}(j); why? Because j is the state at time k+1. So J_{k+1} of the whole history is simply J_{k+1}(j). Substituting this, I get the maximum over a in A(s_k) of r_k(s_k, a) plus the sum over j in S of p(j | s_k, a) times J_{k+1}(j).
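Written out in one chain, the calculation just done reads (same notation as before, with j the generic state at time k+1):

\[
J^*_k(h_k) = J_k(h_k)
= \max_{a \in A(s_k)} \Big[ r_k(s_k, a) + \sum_{j \in S} p(j \mid s_k, a)\, J_{k+1}(h_k, a, j) \Big]
= \max_{a \in A(s_k)} \Big[ r_k(s_k, a) + \sum_{j \in S} p(j \mid s_k, a)\, J_{k+1}(j) \Big],
\]

where the last step is the induction hypothesis J_{k+1}(h_{k+1}) = J_{k+1}(j).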
Now let us observe a few things about this expression. The maximization is over a in A(s_k), the stage reward r_k(s_k, a) depends on s_k, and the transition probability p(j | s_k, a) depends on s_k, while the term J_{k+1}(j) involves only the summation variable j. So you have a dependence on s_k, but no dependence on any part of the history prior to s_k. In other words, this entire expression on the right-hand side is a function only of s_k. As a result, although there is an h_k on the left-hand side, what you have on the right-hand side is a function that depends only on s_k. Consequently, J_k(h_k) is just a function of s_k, which we may write as J_k(s_k), and by induction the result then holds for all t. So we have concluded that the value function at any time t, and hence also the optimal reward of the truncated problem from t onwards, is a function only of the state at that time.

As a result of this we can also conclude another important thing. When we are maximizing here, the right-hand side is a function of just the state at time k. Remember that, from the Bellman equation, the policy is obtained simply by maximizing this quantity: the maximizing action defines the policy. When we first wrote out the Bellman equation, there was a dependence on the history; h_k was present in the problem, so in general the policy would have been a function of h_k. However, we have now been able to reduce the right-hand side to a function of just s_k. From here we therefore also conclude that the optimal action a_t can be taken to be a function of s_t. The reason is that this expression no longer has any trace of h_k; it depends only on s_k, and therefore the maximizing action also depends only on s_k.

There is another observation to be made here: there is nothing to be gained from randomizing either, simply because the Bellman equation we end up with is the same as the one we would have obtained had we optimized over deterministic Markov policies. So we conclude two things: there is no benefit in randomization, and no benefit in using the history. In other words, this is our theorem: there exists a deterministic Markov policy that is optimal. Equivalently, the maximum of the expected reward over all policies that are Markov and deterministic is equal to the maximum of the same quantity over all policies that are history dependent and randomized.

So this is our justification for why one can work with only Markov policies. Once the history is known to us, the state is also known to us; but the reward depends only on the state, and the transition probability also depends only on the state, and as a result knowing the whole history is immaterial. All you need to know is the current state.
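To make the backward computation concrete, here is a minimal Python sketch of finite-horizon backward induction. The problem data below (states, actions, reward and transition functions, horizon) are hypothetical placeholders rather than anything from the lecture; the point is only that every quantity is indexed by the current state s and never by the full history, and that the maximizing action at each stage directly yields a deterministic Markov policy.

```python
# Minimal backward-induction sketch for a finite-horizon MDP.
# All problem data below are hypothetical placeholders.
states = [0, 1, 2]                      # state space S
actions = {s: [0, 1] for s in states}   # feasible actions A(s)
n = 5                                   # horizon

def r(t, s, a):
    """Stage reward r_t(s, a) (placeholder)."""
    return float(s - 0.5 * a)

def r_terminal(s):
    """Terminal reward r_n(s) (placeholder)."""
    return float(s)

def p(j, s, a):
    """Transition probability p(j | s, a) (placeholder: stay or move up)."""
    if a == 0:
        return 1.0 if j == s else 0.0
    return 1.0 if j == min(s + 1, max(states)) else 0.0

# J[t][s] holds J_t(s); policy[t][s] holds the maximizing action at (t, s).
J = [dict() for _ in range(n + 1)]
policy = [dict() for _ in range(n)]

for s in states:                        # terminal condition J_n(s) = r_n(s)
    J[n][s] = r_terminal(s)

for t in range(n - 1, -1, -1):          # backwards in time: J_{t+1} is needed for J_t
    for s in states:                    # only the current state, never the history
        best_a, best_val = None, float("-inf")
        for a in actions[s]:
            val = r(t, s, a) + sum(p(j, s, a) * J[t + 1][j] for j in states)
            if val > best_val:
                best_a, best_val = a, val
        J[t][s] = best_val              # J_t(s) = max_a [ r_t(s,a) + sum_j p(j|s,a) J_{t+1}(j) ]
        policy[t][s] = best_a           # deterministic Markov policy: a_t = policy[t][s_t]

print(J[0], policy[0])
```

Nothing in these loops ever consults past states or actions, and the extracted policy is deterministic and Markov, mirroring the conclusion of the theorem for this toy model.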
So the optimal reward, or the optimal cost that we incur, is a function only of the state we are in. This has become possible primarily because of the structure of the Bellman equation: it allowed us to reduce everything to a form in which only the state of the system appears. This is the key structure being exploited throughout Markov decision theory, and it is why Markov decision theory works primarily with deterministic Markov policies.

What we will see in the next class is that this kind of result is no longer true once we allow for the possibility that we do not know the state. Remember, so far knowing the history also gave us knowledge of the state. But there can be problems in which knowing the history does not completely tell us what the state of the system is. That gives rise to a different and interesting class of problems, called problems with imperfect information, and we will start discussing those from the next class.