Let us now start with the description of the dynamic programming algorithm. I will present it in the form of a proposition, and we will then prove the proposition as well. The proposition is as follows: for every initial state x_0, the optimal cost, which we had denoted J*(x_0), is given by the last step of the following algorithm, and this algorithm is the dynamic programming algorithm itself. In this algorithm one proceeds backwards in time. First, at time n, define J_n(x_n) to be simply the terminal cost: J_n(x_n) = g_n(x_n) for all x_n. Remember, we need to define a function, so J_n is declared to be identically equal to g_n, for all values of x_n. Then, proceeding backwards in time, at any time k < n we define J_k(x_k) to be the minimum, over all actions u_k that can be taken at time k, that is, over all u_k in U_k(x_k), of the expectation of the stage-wise cost at time k plus the function J_{k+1} evaluated at f_k(x_k, u_k, w_k). Recall that f_k here is the dynamics: the state evolves as x_{k+1} = f_k(x_k, u_k, w_k).
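The two defining relations just described, reconstructed here in display form from the spoken description, read:

```latex
\begin{align*}
J_n(x_n) &= g_n(x_n) && \text{for all } x_n,\\
J_k(x_k) &= \min_{u_k \in U_k(x_k)}
  \mathbb{E}\Big[\, g_k(x_k, u_k, w_k)
  + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big]
  && \text{for all } x_k,\ k = 0, \dots, n-1,
\end{align*}
```

where the expectation is over the noise w_k and x_{k+1} = f_k(x_k, u_k, w_k) is the dynamics.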
So here is the algorithm: at time n we declare J_n(x_n) = g_n(x_n) for all x_n, and at every other time k, for all k from 0 to n-1, we set J_k(x_k) equal to this minimized expectation, again for all x_k. What this algorithm amounts to is the following: you define the functions J_n, J_{n-1}, J_{n-2}, and so on, in other words J_k for all k from 0 to n. J_n is what was given to us, the terminal cost; J_k for k < n is defined recursively in terms of J_{k+1}. The way this works is that you already have J_n, through the terminal cost; using it you find J_{n-1}, because J_n appears on the right-hand side; using J_{n-1} you find J_{n-2}; using J_{n-2} you find J_{n-3}; and so on. This eventually lands us at J_0(x_0) for every initial state. And what the proposition tells us is that J*(x_0) is nothing but J_0(x_0), the last step of this algorithm: you apply the algorithm backwards in time starting from time n, and at time 0 you evaluate J_0 at x_0, which gives the optimal cost starting from time 0. There is also a second part to the proposition, which I will write out here. The proposition also says that if u_k* = mu_k*(x_k) minimizes the RHS above. So what is the above expression?
It is the expression inside the minimization in the recursion. If, for each x_k and each k, u_k* = mu_k*(x_k) minimizes this RHS, then the policy pi* = (mu_0*, ..., mu_{n-1}*) is optimal. Now let us reflect on this a bit by looking at the expression a little more closely. I just said that J_k is defined recursively in terms of J_{k+1}, but how exactly is it defined? It is defined via a minimization with respect to the action u_k at time k. When we want to compute the left-hand side J_k(x_k), we fix x_k to be any state at time k; for any such state we compute the expectation and minimize it over all u_k that can be chosen in that state, that is, over all u_k in U_k(x_k). This minimization gives us a u_k*, the optimal action, but that u_k* could potentially be different for different x_k: in this minimization problem x_k is a parameter, so the optimal u_k, denoted u_k*, is a function of x_k. Let us denote that function by mu_k*. The minimization thus defines for us a function that maps x_k to an optimal action. And what is such a function? It is a decision rule, in fact a Markov decision rule: it maps the state at that time to an action. So what the theorem is also telling us is that if you put together all these Markov decision rules, mu_0* through mu_{n-1}*, you get a Markov policy, and this policy is actually an optimal policy for our problem.
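As a concrete sketch, here is how the backward recursion and the argmin extraction of the decision rules might look for a problem with finitely many states, actions, and noise values. This is a minimal illustration, not code from the lecture: the function dp_backward and the toy instance fed to it (two states, two actions, degenerate noise) are all hypothetical.

```python
def dp_backward(N, states, actions_of, g_term, g_stage, f_dyn, w_dist):
    """Backward DP: J_N(x) = g_term(x);
    J_k(x) = min over u in actions_of(k, x) of
             E_w[ g_stage(k, x, u, w) + J_{k+1}(f_dyn(k, x, u, w)) ].
    Returns value functions J[0..N] and Markov decision rules mu[0..N-1]."""
    J = [None] * (N + 1)
    mu = [None] * N
    J[N] = {x: g_term(x) for x in states}          # terminal condition
    for k in range(N - 1, -1, -1):                  # proceed backwards in time
        J[k], mu[k] = {}, {}
        for x in states:                            # x_k is a parameter here
            best_u, best_val = None, float("inf")
            for u in actions_of(k, x):
                # expectation over a finite noise distribution [(w, prob), ...]
                val = sum(p * (g_stage(k, x, u, w) + J[k + 1][f_dyn(k, x, u, w)])
                          for w, p in w_dist(k))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x] = best_val                      # cost-to-go at (k, x)
            mu[k][x] = best_u                       # argmin defines mu_k*(x)
    return J, mu

# Hypothetical toy instance: next state equals the chosen action,
# stage cost x + u, terminal cost x, noise degenerate at 0.
J, mu = dp_backward(
    N=2,
    states=[0, 1],
    actions_of=lambda k, x: [0, 1],
    g_term=lambda x: x,
    g_stage=lambda k, x, u, w: x + u,
    f_dyn=lambda k, x, u, w: u,
    w_dist=lambda k: [(0, 1.0)],
)
```

Running this returns J[0], the optimal cost for every initial state, and the decision rules mu[0], ..., mu[N-1], which together form a Markov policy of the kind the proposition declares optimal.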
So here, therefore, is the summary. For every initial state x_0, to find the optimal cost you go through this recursive algorithm, and at each step of it you minimize this particular cost: the stage-wise cost plus the cost-to-go, where the cost-to-go is evaluated at the next state, and the next state is expressed via the dynamics as a function of the current state, the action, and the noise. You take the expectation with respect to the noise w_k, minimize the resulting expression over u_k, and obtain u_k as a function of x_k, which we denote mu_k*. You do this for every k, going backwards from k = n-1 until k = 0, and the claim is that the optimal cost at x_0 is simply given by the function J_0(x_0), that is, by the last step of this algorithm. Now let us dwell on the efficiency of this algorithm, as to why it actually helps us save effort. Recall that when we looked at the problem of choosing the optimal policy over the set of all Markov policies, the number of Markov policies turned out to be humongous: it was (a^b)^n = a^(bn), where a was the number of actions, b the number of states, and n the number of decision epochs, i.e., the time horizon. If you had to search over all these policies, that is the number of choices you would have to cycle through.
Now let us see how many choices we need to cycle through when we apply the dynamic programming algorithm. Suppose once again that we have a actions and b states. The optimization at each step is a choice over a actions, so one has to compare a numbers, which can be done by computing all a of them and finding the least. So for every value of the state x_k this requires at most a computations. But this has to be done for each value of x_k, and x_k can take b possible values, so the a computations have to be repeated for b values of x_k: at most b times a computations per stage. And this has to be done for each k, from k = 0 to n-1, so we need b times a calculations at each of the n stages. The total number of computations therefore becomes roughly a times b times n, up to constants, which I am ignoring. This tells you how the total number of computations scales: as the number of actions times the number of states times the number of stages.
If you compare this with the complexity of listing out all possible policies, that problem involved cycling through a^(bn) choices. Obviously this is a dramatic reduction, no question about it; it is evident that we have truly simplified the problem. How has this simplification come about? By exploiting the additive structure of the cost function: our problem definition involved a cost that was defined stage-wise, and this additive structure gave us a recursive way of computing the optimal policy, which is exactly what the dynamic programming algorithm does. Another thing to note is the following. Recall that I mentioned at the start of the course that any stochastic decision problem, although it involves finding optimal actions, that is, vectors, by its very nature forces us to think not in terms of actions but in terms of strategies or policies. Any non-trivial stochastic decision problem therefore becomes a problem of finding an optimal function over a set of functions. Indeed, our policy sets such as Pi_MD, Pi_HD, etc., are sets of functions, and when we do calculations such as a^(bn), that is precisely the number of sequences of n functions from a set of size b to a set of size a. So the problem we had defined was essentially a problem of finding an optimal function over a set of functions; that is what the stochastic problem reduces to. And because the problem is that of finding an optimal function, the first instinct is to enumerate the set of functions and try to solve the problem in the space of functions itself.
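The two counts being compared can be checked with a few lines of arithmetic; the sizes a, b, n below are illustrative choices of mine, not numbers from the lecture:

```python
# Illustrative (hypothetical) sizes: a actions, b states, n stages.
a, b, n = 10, 100, 50

# Brute force over Markov policies: each stage independently picks one of the
# a**b decision rules (a choices of action for each of the b states),
# so there are (a**b)**n = a**(b*n) policies in total.
num_markov_policies = (a ** b) ** n

# Dynamic programming: at each of the n stages, for each of the b states,
# compare a candidate actions -- about a*b*n operations, up to constants.
dp_operations = a * b * n

assert num_markov_policies == a ** (b * n)   # the two ways of counting agree
print(dp_operations)                          # 50000
print(len(str(num_markov_policies)))          # 5001 (digits in the policy count)
```

Even at these modest sizes the brute-force count is a 5001-digit number, while the dynamic programming sweep needs on the order of fifty thousand comparisons.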
But then, as we realize here, that is not the best way of approaching this problem, because there is more structure to this problem that we can exploit. As a result of this structure we get the dynamic programming algorithm, and if you look at what happens in each of its steps, we are not really finding functions at all: we are not optimizing over the set of functions; all the optimizations that happen in the dynamic programming algorithm are optimizations over actions, so we are finding optimal actions, not optimal functions. Of course, we do this for every x, and that implicitly defines a function for us, that is true, but it is still far less complex than searching over the functions directly. What this has effectively done is reduce what was essentially a search over functions to a search over the values that the function can take, which are the u_k's, and through the values you end up defining the function. This is an extremely important and dramatic reduction. As we will see later in this course, this sort of reduction is something we would like to aim for in every possible problem; however, significant limitations come up later when we look at information structures and so on. Nonetheless, this is a victory to savor, because we have been able to bring down the complexity quite significantly. So let us quickly run through the proof of this proposition. The proof uses the idea of induction. Let pi be a policy, and let pi^k, pi with superscript k, denote the truncated version of this policy.
That is, pi^k is the policy looked at from mu_k through mu_{n-1}; this is the truncated one. Now let J_k*(x_k), note the star, this is not the J_k from the algorithm, denote the optimal cost of the (n-k)-stage problem, that is, the problem that starts from stage k and ends at time n. So J_k*(x_k) is the minimum, over all these truncated policies, of the expectation of the terminal cost plus the stage-wise costs from stage k up until n-1; this of course holds for all x_k. Similarly, let J_n* be defined as g_n(x_n), for all x_n. What we will show is that J_k*(x_k) is actually equal to J_k(x_k), where the J_k are the sequence of functions coming from our algorithm. In other words, the functions the algorithm produces are nothing but the optimal costs of the (n-k)-stage problems. We argue this by induction. For k = n we already have J_n*(x_n) = J_n(x_n), both being in fact equal to g_n(x_n), so the induction hypothesis holds for k = n. Now suppose that for some k, and for all x_{k+1}, we have J_{k+1}(x_{k+1}) = J_{k+1}*(x_{k+1}). Notice that we can write the truncated policy pi^k as the pair (mu_k, pi^{k+1}): the decision rule at time k together with the truncated policy from k+1 onwards. So notice this.
If we notice this, then we can write J_k*(x_k) as the minimization over (mu_k, pi^{k+1}) of our expression, and in fact we can write it more compactly by pulling out the stage-wise cost at time k: J_k*(x_k) is the minimum over (mu_k, pi^{k+1}) of the expectation of g_k(x_k, mu_k(x_k), w_k) plus the remaining stuff, namely the terminal cost g_n(x_n) plus the sum from i = k+1 to n-1 of g_i(x_i, mu_i(x_i), w_i). We are minimizing this over mu_k and pi^{k+1}, but notice that mu_k appears only in the first term: there u_k is mu_k(x_k), while the costs in the sum involve only the mu_i for i = k+1 to n-1, that is, only the later decision rules. As a result, I can take the minimization over pi^{k+1} inside: J_k*(x_k) equals the minimum over mu_k of the expectation of g_k(x_k, mu_k(x_k), w_k) plus the minimum over pi^{k+1} of the expectation of g_n(x_n) plus the sum from i = k+1 to n-1 of g_i(x_i, mu_i(x_i), w_i). And what is this inner expression? It is simply the optimal cost J_{k+1}*, starting from the state x_{k+1} that results from the state x_k and the action mu_k(x_k). In other words, I can write it as J_{k+1}*(f_k(x_k, mu_k(x_k), w_k)). But by the induction hypothesis we had assumed that this J_{k+1}* is in fact equal to J_{k+1}.
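The chain of equalities in this induction step, reconstructed from the spoken description (the splitting of pi^k into (mu_k, pi^{k+1}) and the interchange of the inner minimization), can be displayed as:

```latex
\begin{align*}
J_k^*(x_k)
&= \min_{(\mu_k,\,\pi^{k+1})} \mathbb{E}\Big[ g_k\big(x_k,\mu_k(x_k),w_k\big)
   + g_n(x_n) + \sum_{i=k+1}^{n-1} g_i\big(x_i,\mu_i(x_i),w_i\big) \Big] \\
&= \min_{\mu_k} \mathbb{E}\Big[ g_k\big(x_k,\mu_k(x_k),w_k\big)
   + \min_{\pi^{k+1}} \mathbb{E}\Big[ g_n(x_n)
   + \sum_{i=k+1}^{n-1} g_i\big(x_i,\mu_i(x_i),w_i\big) \Big] \Big] \\
&= \min_{\mu_k} \mathbb{E}\Big[ g_k\big(x_k,\mu_k(x_k),w_k\big)
   + J_{k+1}^{*}\big(f_k(x_k,\mu_k(x_k),w_k)\big) \Big] \\
&= \min_{u_k \in U_k(x_k)} \mathbb{E}\Big[ g_k(x_k,u_k,w_k)
   + J_{k+1}\big(f_k(x_k,u_k,w_k)\big) \Big] = J_k(x_k),
\end{align*}
```

where the last line uses the induction hypothesis J_{k+1}* = J_{k+1} and the fact that minimizing over the rule mu_k at a fixed x_k is the same as minimizing over the action u_k = mu_k(x_k).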
So the star on this J* can be removed: I can erase the star and the equality still holds. But after removing the star, what I am left with on the right-hand side is, by the dynamic programming algorithm, nothing but J_k itself: it is exactly the right-hand side of the dynamic programming recursion, once we note that minimizing over the decision rule mu_k at a fixed x_k is the same as minimizing over the action u_k. So this quantity is equal to J_k(x_k). In other words, the induction hypothesis holds for k = n; assuming it holds for k+1, we conclude from the above that it holds for k as well; and hence it holds for all k. So for all k, J_k(x_k) is identically equal to J_k*(x_k), and this completes the proof, because now I just apply this for k = 0 and x_k = x_0, and that tells me that the dynamic programming algorithm has indeed produced the correct optimal cost. One can carry this argument a little further and also show, very easily, that the decision rules mu* that come out of the minimization do in fact form an optimal policy. That is a quick proof of the dynamic programming proposition. In the next class we will apply the dynamic programming algorithm to an actual problem of inventory control.