Welcome everyone. We were studying a model of stochastic control with imperfect state information, and the example we were considering was the machine repair problem. The machine could be in one of two possible states, p and p bar, denoting a good state and a bad state. We had two possible control actions: one was to continue the operation, the other was to stop and inspect the machine. The time axis had three time periods: the period starting at 0, the period starting at 1, and the period starting at 2. At the start of the periods beginning at 0 and at 1 we were given the result of an inspection performed on the machine, and the inspection told us, with some probability, whether the machine was good or not, so it was a sort of imperfect inspection. At every time we had the option either to continue the operation or to stop and inspect the machine to learn the correct state and, if the machine was broken, to return it to the good state. We wrote the system equation as x_{k+1} = w_k, where the state at the next time step, w_k, is given by the particular probability distribution we derived; you can go back to the previous lecture to see how we calculated that. We also had observation equations, written as z_k = v_k, which told us the probability of getting a good result or a bad result from the inspection given the state of the machine. The total cost was a sum of two stage-wise costs; we were also told that the terminal cost is 0 here, so we only have stage-wise costs.
So the total cost is a sum of these two stage-wise costs, and the cost is a function of the state you are in and the control action you choose. If you choose to stop, inspect, and repair the machine, you always incur a cost of 1 regardless of the state of the machine. If you continue the operation when the machine is in the good state you incur a cost of 0, whereas if you continue when the machine is in the bad state you incur a cost of 2 units. The information we had at each time was as follows: I_0, the information at the first time step, consists only of the result of the first inspection, so whatever observation we get from the first inspection is known to us at that time. Then at time 1 we have information from two inspections, the earlier inspection z_0 and the new inspection z_1, and we also know the action we took at time 0. Given this, we had to choose a policy (mu_0, mu_1), where mu_0 is a function of I_0 and mu_1 is a function of I_1, in order to minimize the total cost. The expectation here is taken over the initial state x_0 and the various sources of noise w_0, w_1, v_0, v_1. That was our problem formulation. What we will do now is write out the dynamic programming equation for this particular problem. Remember that for a problem with imperfect state information we start the dynamic programming recursion from time step N - 1 rather than from time step N: we do not have full state information at time step N, so we absorb the terminal cost, the cost-to-go at time step N, into the dynamic programming equation at time step N - 1. Let me write this for a generic time step k. At time step k we have J_k(I_k) given as a minimum over the action set, and we have two possible actions: continue and stop.
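Before writing the recursion, it may help to collect the problem data in one place. This is a sketch in Python; the costs, the observation probabilities, and the prior are the ones stated in this lecture, while the transition probabilities under each action are my assumption based on the standard values from the previous lecture (a repaired machine starts good, a good machine stays good with probability 2/3, a bad machine stays bad), so treat those entries as illustrative.

```python
# States: "p" (good) and "p_bar" (bad); actions: "c" (continue), "s" (stop & repair).

# Stage-wise cost g(x, u), from the lecture.
g = {
    ("p", "c"): 0,      # continue with a good machine: no cost
    ("p_bar", "c"): 2,  # continue with a bad machine: cost 2
    ("p", "s"): 1,      # stopping and repairing always costs 1
    ("p_bar", "s"): 1,
}

# Observation model P(z | x): the inspection reads correctly with probability 3/4.
obs = {
    ("g", "p"): 3 / 4, ("b", "p"): 1 / 4,
    ("g", "p_bar"): 1 / 4, ("b", "p_bar"): 3 / 4,
}

# Prior on the initial state, from the lecture.
prior = {"p": 2 / 3, "p_bar": 1 / 3}

# Transition model P(x_{k+1} | x_k, u_k): ASSUMED standard values from the
# previous lecture, not restated in this one.
trans = {
    ("p", "c"): {"p": 2 / 3, "p_bar": 1 / 3},
    ("p_bar", "c"): {"p": 0.0, "p_bar": 1.0},
    ("p", "s"): {"p": 2 / 3, "p_bar": 1 / 3},   # repair resets to good first
    ("p_bar", "s"): {"p": 2 / 3, "p_bar": 1 / 3},
}
```

Each conditional distribution above sums to 1, which is a quick sanity check on the data.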
Let me write out the cost that results from each of these actions; J_k(I_k) is going to be the minimum of those two costs. The first is the cost from continuing. Remember that in the dynamic programming equation we have to take the expected stage-wise cost. Recalling the dynamic programming equations themselves, there is an expectation taken over x_{N-1} and w_{N-1}. In the usual DP equation with perfect state information the expectation is taken only over the noise in the system, whereas here, since we only have some partial information I_k about the system, the state itself is also unknown to us. So we need to take an expectation over the state as well, while conditioning on I_k: we condition on the information we have and take an expectation over this term. What we get, therefore, is the expectation over x_k, w_k and z_{k+1} of the stage-wise cost plus the cost-to-go. That is what I will write out for our problem as well. First, the expected cost from continuing: it is the cost from continuing when the state is actually p, which is g(p, c), times the probability of the state actually being p, namely the probability that x_k = p given the information I_k and given that we are choosing action c; plus a similar term, the probability that x_k = p bar given the information and the control action, times the cost g(p bar, c) that you incur when the state is in fact p bar.
What I have written out here is the expected stage-wise cost from taking action c. In the dynamic programming equation you have the sum of the stage-wise cost and the cost-to-go, so the second term is the expectation of J_{k+1}; remember that J_{k+1} is a function of I_k, the action c we take at time k, and the new observation z_{k+1}. Since I_k and c are fixed, the expectation is taken only over z_{k+1}, conditioning on I_k and c. We are choosing a minimum over two terms, and this is the first. Explicitly, it is the expected stage-wise cost from action c, given the information I_k and given that you are taking action c, plus the expected cost-to-go given I_k and c. This is the term that comes up in the Bellman equation for action c. We then take the minimum of this term and the corresponding term for action s, the action of stopping. For that we would have the probability that x_k = p given I_k and given that we choose action s, times g(p, s), plus the probability that x_k = p bar given I_k and s, times g(p bar, s), plus the expectation over z_{k+1} of J_{k+1}(I_k, s, z_{k+1}) given the information and given that we are choosing action s; and now I can close my curly bracket. Once again, the first part is the expected stage-wise cost from s given I_k and s, and the second is the expected cost-to-go given I_k and s. Now let us apply this for various values of k.
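Collecting the pieces above, the stage-k equation can be summarized in the lecture's notation as follows (a sketch, with u ranging over the two actions):

```latex
J_k(I_k) \;=\; \min_{u \in \{c,\,s\}} \Big\{
  \sum_{x \in \{p,\,\bar{p}\}} P\!\left(x_k = x \mid I_k,\, u\right) g(x, u)
  \;+\; \mathbb{E}_{z_{k+1}}\!\big[\, J_{k+1}(I_k, u, z_{k+1}) \mid I_k,\, u \,\big]
\Big\}
```

The first sum is the expected stage-wise cost and the second term is the expected cost-to-go, exactly the two pieces written out above for each action.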
Remember that we already have J_2(I_2) = 0, since we are told the terminal cost is 0; in other words, we can begin from k = 1 itself. So let me write this for k = 1: I have J_1(I_1), written out using these terms, and in place of J_2 what I will really have is the terminal cost, which is 0. So J_2(I_2) = 0 since the terminal cost is 0. Now let us explicitly calculate these for various values of k, beginning with k = 1. Remember we need to evaluate this for every value of I_k and for every value of k, so it has to be written for k = 0, 1 and for all values of I_k. For k = 1, let us write I_1 as the three components we had: the observation z_0 at time 0, the observation z_1 at time 1, and the action u_0 we take at time 0. Now let us compute J_1(I_1) for each possible combination of these values. Remember z_0 and z_1 are observations, so each can take two possible values, good or bad, and u_0 can also take two possible values, continue or stop-and-inspect. So let us write the DP equation out for each of these possible values of I_1. The first case is I_1 = (g, g, s), which means z_0 = g, z_1 = g and u_0 = s. As we go into calculating this, notice that what we will need are these conditional probabilities; the cost functions g(p, c), g(p bar, c) and so on are already known to us through the expressions above.
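Since z_0 and z_1 each take two values and u_0 takes two values, there are eight information vectors I_1 to evaluate. A quick sketch of the enumeration:

```python
from itertools import product

# All possible values of I_1 = (z0, z1, u0): 2 * 2 * 2 = 8 cases.
information_vectors = list(product(["g", "b"], ["g", "b"], ["c", "s"]))
print(len(information_vectors))  # 8
```

The case worked out in this lecture, (g, g, s), is one of these eight; the DP equation has to be evaluated at each of them.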
What we need to compute very carefully, then, is the probability of the state being p or p bar given the information, and so on; each of these probabilities needs to be computed. But first, since the cost-to-go term is 0 and we only have a stage-wise cost, let us try to work out what the costs from the two actions of continuing and stopping will add up to, before we actually compute the probabilities. What is the cost from action c? Remember that the cost of continuing to use the machine when it is in the good state was 0, whereas the cost of running the machine in the bad state is 2. The first term therefore vanishes, and all we are left with is the term containing the probability that the machine is in the bad state given the information and the action. The other thing to note is that this probability does not actually depend on the action itself: the machine is in a good or bad state regardless of the action you choose to apply in that state. So this probability is independent of c, and is just the probability that x_k = p bar given I_k. In summary, the cost from c turns out to be just 2 times the probability that x_k = p bar given I_1.
This is the cost from taking action c. Now let us evaluate the cost from taking action s. If you are stopping and repairing, then regardless of the true state of the machine you always incur a cost of 1, as we had seen: the cost of stopping and repairing is always 1. As a result, this term simplifies dramatically, because both costs are 1, so all you are doing is adding up the two conditional probabilities, and they add up to 1. The cost of taking action s is therefore simply 1 (and again the cost-to-go terms are 0 because J_2 = 0). So what we are really comparing in the DP equation are these two costs. To make this comparison we now need, as I said, to calculate the probability that x_k = p bar given the information, and we will do this for each value of I_1. In case 1, take I_1 = (g, g, s): a good observation at time 0, a good observation at time 1, and the action chosen at time 0 is to stop and examine. The probability we are looking for, which (note, it is x_1 here, not x_k) is the probability that x_1 = p bar given (g, g, s), can be written by definition as the probability that x_1 = p bar and we had observations g, g, conditioned on the action s, divided by the probability of g, g given s. This is just the definition of a conditional probability.
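The last-stage comparison can be sketched in a few lines of Python (the function names here are mine, not from the lecture). Since J_2 = 0, the whole decision at time 1 reduces to comparing 2 times P(x_1 = p bar | I_1) against 1:

```python
def cost_continue(p_bad):
    # Expected stage cost of action c: g(p, c) = 0 kills the good-state term,
    # so only 2 * P(x_1 = p_bar | I_1) survives.
    return 0 * (1 - p_bad) + 2 * p_bad

def cost_stop():
    # g(., s) = 1 in both states, and the two conditional probabilities
    # sum to 1, so the expected cost of stopping is exactly 1.
    return 1.0

def best_action(p_bad):
    # The last-stage DP comparison (the cost-to-go J_2 is 0).
    return "c" if cost_continue(p_bad) < cost_stop() else "s"
```

For instance, continuing is preferred exactly when the conditional probability of the bad state is below 1/2.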
The probability in the numerator is the probability that the state of the machine at time 1 is p bar, that both inspections, at time 0 and at time 1, resulted in a good rating, and that you stopped and repaired the machine at time 0, with the last event in the conditioning. The denominator is the probability of getting the readings good and good given that you stopped and repaired the machine at time 0. Let us evaluate these terms, starting with the denominator, the probability that z_0 = g and z_1 = g given u_0 = s. We can decompose this further. Remember that whether we get an observation good or bad depends on the state of the machine, not on the action. So we can write this probability as a sum of two terms: one additionally conditioned on x_0 = p, meaning the state of the machine at time 0 was good, plus a similar term with x_0 = p bar. Now this becomes fairly simple to work with: the probability of getting an observation g at time 0 when the state at time 0 is p is independent of what happens downstream; conditioned on the state, the observations become independent.
So the first term is the probability that z_0 = g given x_0 = p, times the probability that z_1 = g given x_0 = p and u_0 = s, times the probability that x_0 = p, plus a similar term for x_0 = p bar. We can compute all of this by plugging in the numbers we have. For example, the probability of a good observation when the machine is in the proper state is given to us: the probability that z_0 = g given the machine is in the proper state is 3/4. The probability that the machine itself starts in the proper state is 2/3. So the probability that the machine starts in a good state and you get observation g at time 0 is 2/3 times 3/4, plus the probability that the machine starts in a bad state and you get observation g, which is 1/3 times 1/4. This gives us the probability of getting observation g at time 0. The probability of getting observation g at time 1 requires some additional work on top of this: there we also need the probability that the machine is in a certain state at time 1, and for that we need to go through the transition probabilities, the probabilities that the machine moves to a certain state when we take a certain action, since we are choosing the action s here.
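Plugging in the numbers from the lecture, the marginal probability of a good reading at time 0 is a one-liner:

```python
# P(z0 = g) = P(z0 = g | x0 = p) P(x0 = p) + P(z0 = g | x0 = p_bar) P(x0 = p_bar)
p_z0_good = (3 / 4) * (2 / 3) + (1 / 4) * (1 / 3)
print(p_z0_good)  # 7/12, approximately 0.5833
```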
In both of these terms we have conditioned on the state at time 0 and the action chosen at time 0. From that conditioning we can talk of the probability that the state at time 1 is either p or p bar, and from there of the probability of the observation being good or bad at time 1. Let me write out one particular term. Take, for example, the probability that z_1 = g given x_0 = p and u_0 = s. This is the probability that z_1 = g given x_1 = p, x_0 = p, u_0 = s, times the probability that x_1 = p given x_0 = p, u_0 = s, plus the probability that z_1 = g given x_1 = p bar, x_0 = p, u_0 = s, times the probability that x_1 = p bar given x_0 = p, u_0 = s. Now we need to make the observation that in each of these terms, given the state of the machine at time 1, the past is immaterial. So the conditioning on x_0 and u_0 inside the observation probabilities drops out, and we can substitute from the observation equations; the two remaining terms can be computed from the transition expressions. As a result of this, I will just tell you what the final expression turns out to be: P(g, g | s), the denominator, turns out to be (2/3 times 3/4 plus 1/3 times 1/4), the whole quantity squared. We will do the remaining calculations in the next lecture.
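The final expression can be checked numerically. In the sketch below, the observation model and prior are from this lecture, while the distribution of x_1 under u_0 = s is my assumption from the previous lecture's standard values (the repaired machine starts good and then stays good with probability 2/3); under that assumption x_1 does not depend on x_0, which is exactly why the denominator factors into a square:

```python
p_g_given_x = {"p": 3 / 4, "p_bar": 1 / 4}  # P(z = g | x), from the lecture
prior = {"p": 2 / 3, "p_bar": 1 / 3}        # P(x0), from the lecture

# P(x1 | u0 = s): ASSUMED standard transition values; independent of x0
# because the repair resets the machine to the good state first.
p_x1_given_s = {"p": 2 / 3, "p_bar": 1 / 3}

# P(z0 = g): average the observation model over the prior.
p_z0 = sum(p_g_given_x[x] * prior[x] for x in prior)

# P(z1 = g | u0 = s): average over the time-1 state distribution;
# given x1, the past (x0, u0) is immaterial, as noted above.
p_z1 = sum(p_g_given_x[x] * p_x1_given_s[x] for x in p_x1_given_s)

# Since x1 (and hence z1) does not depend on x0 under s, z0 and z1 are
# conditionally independent given s, and the denominator factors:
p_gg_given_s = p_z0 * p_z1
print(p_gg_given_s)  # equals (2/3 * 3/4 + 1/3 * 1/4) ** 2
```

This reproduces the squared expression stated at the end of the lecture.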