Now what we will do is solve the secretary problem using Bellman's dynamic programming algorithm. In order to do that, we first need to write out what the total reward in this particular problem is. Remember that we are trying to maximize, over all policies,

$$\max_{\pi} \; \mathbb{E}\left[ H(s_n) + \sum_{t=1}^{n-1} R_t(s_t, a_t) \right],$$

where $s_t$ is the state at time $t$, and this is the state not of the original Markov chain but of the synthetic controlled Markov chain that we have created. The states $s_t$ and $s_n$ belong to $S = S' \cup \{\delta\}$. The actions are those available at the particular state: $\{C, Q\}$ (continue or quit) if $s_t \in S'$, and only $C$ (continue) if $s_t = \delta$.

Now recall the costs and rewards that we incur in the secretary problem. We made the assumption that continuing at any stage carries no cost, meaning there is no cost associated with interviewing further. So whatever the state may be, if the action we take is continue, we incur a cost of $0$. If we quit, we get a reward based on the state in which we quit. If we quit in a state where the current candidate is not the best we have seen so far, the reward is $0$, whereas if we quit in a state where the current candidate is the best we have seen so far, the reward is the probability that the current candidate is the best among all, $R_t(1) = t/n$. We also have a terminal reward, which depends on whether the candidate at the last step is the best we have seen so far or not: if it is not, the reward is $0$; if it is, the reward is $1$. This is therefore the entire reward function.

What we will now do is apply dynamic programming to this particular problem. For that, remember, we need to define the value function, or the reward-to-go. Let us define $J_n$ as the terminal reward: $J_n(1) = H(1)$, $J_n(0) = H(0)$, and $J_n(\delta) = H(\delta)$. Remember that $H(\delta) = 0$, $H(0) = 0$, and $H(1) = 1$, so $J_n(1) = 1$ and $J_n(0) = J_n(\delta) = 0$.
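To keep these quantities concrete, here is a minimal sketch of the reward structure in Python. The names `R`, `H`, and `P` and the use of exact fractions are my own choices, not from the lecture; the transition probabilities $P_t(1 \mid s) = \frac{1}{t+1}$ and $P_t(0 \mid s) = \frac{t}{t+1}$ of the controlled chain are the ones used in the recursion below.

```python
from fractions import Fraction

n = 10  # number of candidates; an arbitrary choice for illustration

def R(t, s, a):
    """Stage-wise reward R_t(s, a) for t = 1, ..., n - 1.
    s = 1 if the current candidate is the best seen so far, 0 otherwise;
    a = 'C' (continue) or 'Q' (quit)."""
    if a == 'C':
        return Fraction(0)     # continuing costs nothing: C_t(s) = 0
    if s == 1:
        return Fraction(t, n)  # prob. that the current best-so-far is best overall
    return Fraction(0)         # quitting on a non-best candidate earns 0

def H(s):
    """Terminal reward: H(1) = 1, H(0) = H('delta') = 0."""
    return Fraction(1) if s == 1 else Fraction(0)

def P(t, s_next):
    """Transition probability P_t(s_next | s) of the controlled chain under
    'continue'; it does not depend on the current state s in {0, 1}."""
    return Fraction(1, t + 1) if s_next == 1 else Fraction(t, t + 1)
```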
Now let us write out the dynamic programming equation for any time $t < n$. We want to write $J_t$ as a function of the state we are in. First consider $J_t(1)$: at time $t$ we are in state $1$, meaning the candidate we are presently seeing is the best we have seen up until time $t$. In this case we have two choices: we can either stop or continue. The dynamic programming algorithm asks us to choose the maximum of two terms: the expected reward from this time onwards if we continue, and the expected reward from this time onwards if we quit. Each of these quantities is a stage-wise reward plus an expected reward-to-go.

If we continue, there is no cost and no reward, so there is really no stage-wise reward, but let us write it for completeness as $-C_t(1)$. The expected reward-to-go from continuing is calculated in the following way: we are presently in state $1$; in the next time step we end up in state $1$ with probability $P_t(1 \mid 1)$, earning reward-to-go $J_{t+1}(1)$, or in state $0$ with probability $P_t(0 \mid 1)$, earning $J_{t+1}(0)$. If instead we stop, we get the reward from stopping, $R_t(1)$, and we transition with probability $1$ to the stopped state $\delta$, whose reward-to-go is $J_{t+1}(\delta)$. So

$$J_t(1) = \max\Big\{ \underbrace{-C_t(1) + P_t(1 \mid 1)\, J_{t+1}(1) + P_t(0 \mid 1)\, J_{t+1}(0)}_{\text{action } C},\;\; \underbrace{R_t(1) + J_{t+1}(\delta)}_{\text{action } Q} \Big\}.$$

Of these terms, $C_t(1) = 0$, and we will soon see that $J_{t+1}(\delta)$ is also $0$; I will establish that in a moment. But before we get to that, let us write out $J_t(0)$. It is again the max of two terms, the reward from continuing and the reward from stopping:

$$J_t(0) = \max\Big\{ -C_t(0) + P_t(1 \mid 0)\, J_{t+1}(1) + P_t(0 \mid 0)\, J_{t+1}(0),\;\; R_t(0) + J_{t+1}(\delta) \Big\}.$$

Recall that quitting in state $0$ earns nothing, so $R_t(0) = 0$; there is once again no cost of continuing, so $C_t(0) = 0$; and, as I said, $J_{t+1}(\delta)$ is also equal to $0$.
In order to show that $J_t(\delta) = 0$, all we need to observe is that in state $\delta$ we have only one action, which is to continue, and we incur no stage-wise reward or cost. So, for instance, $J_{n-1}(\delta) = J_n(\delta) = 0$, and we can apply this recursively and establish it for every $t$: $J_t(\delta) = J_{t+1}(\delta) = 0$ for all times $t$. What this is effectively saying is that once you are in state $\delta$, there is no further reward and no further cost, so the reward-to-go from state $\delta$ is $0$.

As a result, the $J_{t+1}(\delta)$ terms drop out. Now let us simplify the terms that remain and write them out explicitly. Start with $J_t(0)$. Remember that $P_t(1 \mid 0)$ is simply $\frac{1}{t+1}$ and $P_t(0 \mid 0)$ is $\frac{t}{t+1}$, while $R_t(0)$ and $J_{t+1}(\delta)$ are both $0$, so

$$J_t(0) = \max\Big\{ \tfrac{1}{t+1}\, J_{t+1}(1) + \tfrac{t}{t+1}\, J_{t+1}(0),\; 0 \Big\}.$$

But if we write the recursion out, $J$ is always going to be greater than or equal to $0$. We can check this by starting from the terminal values, $J_n(1) = 1$ and $J_n(0) = 0$: writing the equations for $t = n-1$ gives $J_{n-1}(0) \ge 0$, and similarly $J_{n-1}(1) \ge 0$. In other words, by looking at these two equations it is easy to see that $J_t(1) \ge 0$ and $J_t(0) \ge 0$ for all $t$. Consequently the $0$ in the max can be dropped, and I can simply write

$$J_t(0) = \tfrac{1}{t+1}\, J_{t+1}(1) + \tfrac{t}{t+1}\, J_{t+1}(0).$$

Now let us go back and look at the expression for $J_t(1)$. The continuation term is $\tfrac{1}{t+1}\, J_{t+1}(1) + \tfrac{t}{t+1}\, J_{t+1}(0)$, and from the second term we have $R_t(1)$, which, if you recall, was $t/n$. But look carefully at the continuation term: it is nothing but the right-hand side of the recursion for $J_t(0)$ itself, so this can actually be written as

$$J_t(1) = \max\Big\{ \tfrac{t}{n},\; J_t(0) \Big\}.$$

We have now got these two recursions, and they need to be solved in order to find the value function, or equivalently the rewards-to-go.
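These two recursions can be run backwards numerically. Here is a minimal sketch in Python, assuming exactly the recursions just derived; the function name `value_functions` and the use of exact fractions are my own choices.

```python
from fractions import Fraction

def value_functions(n):
    """Backward recursion for the secretary problem:
        J_n(1) = 1,  J_n(0) = 0,
        J_t(0) = (1/(t+1)) J_{t+1}(1) + (t/(t+1)) J_{t+1}(0),
        J_t(1) = max(t/n, J_t(0)).
    Returns dictionaries J0, J1 mapping t = 1, ..., n to J_t(0), J_t(1)."""
    J0 = {n: Fraction(0)}
    J1 = {n: Fraction(1)}
    for t in range(n - 1, 0, -1):
        J0[t] = Fraction(1, t + 1) * J1[t + 1] + Fraction(t, t + 1) * J0[t + 1]
        J1[t] = max(Fraction(t, n), J0[t])
    return J0, J1
```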
Now let us look at what we have derived. We dropped the $0$ in the recursion for $J_t(0)$ because we found that the continuation term is always greater than or equal to $0$; remember that $0$ was the reward from quitting in state $0$, so in that case you are better off continuing. What this is telling us is that if the candidate you are seeing right now is not the best you have seen so far, then simply continue: continuing is optimal in state $0$, no matter what the time is.

On the other hand, let us see what happens in state $1$. In state $1$, what is being compared is $t/n$ and $J_t(0)$. $J_t(0)$ is the reward-to-go if you continue at that time, as we found above, while $t/n$ is the reward you would get from quitting. So $J_t(1)$ compares these two terms. When $t/n > J_t(0)$, it is optimal to quit: if time has reached a stage where $t/n$ is now greater than the reward you would get from continuing, the optimal thing is to stop. If $t/n < J_t(0)$, we continue; and if $t/n = J_t(0)$, either action is optimal and can be chosen.

What this suggests is an optimal policy of the following form: keep continuing up until a time where you are, in a sense, running out of options, where it seems like you have seen enough and the probability that you will see a candidate better than the ones so far has dropped below a certain threshold; it becomes unlikely that you will see some candidate better than what you have seen so far, and at that point you stop searching. That is, observe the first $\tau$ candidates, keep choosing continue until you have seen enough, and then select the first one who is better than all previous ones. In other words, we are looking for a policy $\pi^* = (\mu_1^*, \ldots, \mu_{n-1}^*)$ with the following feature: $\mu_t^*(0)$ and $\mu_t^*(1)$ are both continue if $t \le \tau$; no matter what state you are in, if $t \le \tau$ you just continue. And once $t > \tau$, once you have seen enough candidates, you set $\mu_t^*(1)$ to quit, that is,
if the candidate you are presently seeing is the best you have seen so far, then you quit, and you set $\mu_t^*(0)$ to continue, that is, if the candidate you are seeing right now is not the best you have seen so far, then you continue. This means that you keep searching up until a certain time regardless of how good the candidates are, you keep exploring, and once you have explored enough, you look: if the present candidate is the best you have seen so far, you simply make him an offer, while if the present candidate is not the best you have seen so far, then of course you should continue further. So the analysis suggests an optimal policy of this form. It still needs a proof to show that this is in fact the optimal policy; we will complete the proof of the optimality of this sort of policy in the following lecture.
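As a numerical illustration, the threshold can be read off from the recursion above by finding the first time $t$ at which $t/n \ge J_t(0)$. The sketch below is my own (the name `secretary_threshold` is illustrative, and it assumes the threshold structure whose proof is deferred to the next lecture); its output ratio $t^*/n$ approaches $1/e \approx 0.368$, consistent with the classical answer to the secretary problem.

```python
import math

def secretary_threshold(n):
    """Run the backward recursion with floats and return the first time t at
    which stopping in state 1 is optimal, i.e. the first t with t/n >= J_t(0).
    The policy observes the first t - 1 candidates before it starts stopping."""
    J0, J1 = 0.0, 1.0   # terminal values J_n(0), J_n(1)
    t_star = n
    for t in range(n - 1, 0, -1):
        J0 = J1 / (t + 1) + t * J0 / (t + 1)
        J1 = max(t / n, J0)
        if t / n >= J0:
            t_star = t  # record the smallest such t; the threshold form
                        # (proved next lecture) says the comparison flips once
    return t_star

for n in (10, 100, 1000):
    t_star = secretary_threshold(n)
    print(n, t_star, t_star / n)  # the ratio approaches 1/e ≈ 0.368
```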