The application of the optimal stopping problem that we will look at now is called the secretary problem. The setting is the following. We have a set of n candidates, and they line up for an interview, appearing before us in a random order. There is a preference ordering over the candidates: some candidates are better than others, and in particular there is a single best candidate. What we would like to do is make an offer to the best candidate. The dilemma is that we cannot interview all the candidates and then pick whichever was the best: any candidate that we do not make an offer to leaves the process and is lost to us, so we have to decide then and there, on seeing the candidate. This is the optimal stopping that comes up in the secretary problem. Candidates keep coming, and we need to decide when we have seen a candidate that is good enough, or rather when we think we have the maximum chance that the current candidate is in fact the best. Because the option of hindsight, of looking at all the candidates first, is not available to us, there is an element of exploring enough, then taking a call, and finally deciding that we have seen enough and choosing a candidate. So we have these n candidates and a way of comparing them: we can tell whether candidate i is better than candidate j, but we cannot tell whether candidate i is the best among all candidates; that option is not available to us. What we want to do is maximize the probability of selecting the best candidate, the best in the entire lot. And, as I said, a candidate who is not made an offer leaves the system and cannot be interviewed, considered, or offered the job again.

Let us see how we can model this as an optimal stopping problem and a Markov decision process. The optimal stopping nature of the problem is clear. The candidates appear to us in a random order, which means that if we just look at the sequence of candidates, that process is a Markov chain; in fact, one can think of it as a sequence of independent random variables. So there is a Markov chain evolving in the background, and what we have to ask is whether we should intervene in this Markov chain by quitting at some stage. Quitting here means making an offer to the current candidate: we offer the job, we do not interview anyone further, and the Markov chain shifts to the stopped state delta and does not evolve any further. While we continue to let candidates go by and search further, we may incur a cost, say a cost per interview. In the secretary problem the setting is simple: there is no cost for further interviews; we assume that one can continue to interview at no cost.
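As a small aside, here is a minimal Python sketch, not from the lecture, of the interview stream just described; the function name interview_stream and the use of numeric ranks are my own choices. It illustrates that the interviewer only ever observes whether the current candidate is the best seen so far, never the candidate's absolute quality, and that candidates are gone once passed over.

```python
import random

def interview_stream(n, seed=0):
    """Candidates arrive in uniformly random order; at each step t the
    interviewer only learns whether candidate t is the best seen so far."""
    rng = random.Random(seed)
    ranks = list(range(1, n + 1))      # 1 = best candidate; hidden from the interviewer
    rng.shuffle(ranks)
    best_so_far = None
    for t, r in enumerate(ranks, start=1):
        is_record = best_so_far is None or r < best_so_far
        if is_record:
            best_so_far = r
        yield t, is_record             # this indicator is all we get to act on at time t

for t, is_record in interview_stream(8):
    print(t, int(is_record))
```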
The terminal reward depends on whether we have actually found the best candidate. If we do not make an offer at all and the last candidate is not the best among all, then we get no reward: our goal was to maximize the probability of getting the best candidate, and if the last candidate is not the best, we know with probability 1 that we have not got the best candidate, so we get zero reward. On the other hand, if the last candidate turns out to be the best of all, so that it was well worth waiting through all these candidates, then we get a reward of 1: with probability 1 we have found the best candidate. So there is no cost for interviewing further, the candidates appear in a random order, and if we quit at any time, the reward we get is equal to the probability that the candidate we have quit on is the best candidate.

Here, then, is how we write out the costs and the rewards, modelling the problem the way we model an optimal stopping problem. Let S' denote the state space of the background Markov chain. In this particular problem it consists of just two states, 0 and 1. State 1 means that the current candidate is the best seen so far. That does not necessarily mean this is the best candidate among all; it simply means that the candidate you are currently viewing is the best you have seen up to this time. State 0 means that some previous candidate, not necessarily the immediately preceding one, was better, so the present candidate is certainly not the best you have seen so far. At any stage the decision maker, the interviewer, has two options, so the actions in each state are once again to continue or to quit. To continue means you carry on the search and do not make an offer; to quit means you stop the search and make an offer to the candidate you are currently seeing. So S' is the original state space, 0 or 1 according to whether the current candidate is the best seen so far, and delta, remember, is the stopped state, the fictitious additional state that the Markov chain enters after we quit. We get rewards in this problem only when we quit, and there are no costs: C_t(s) = 0 for s in {0, 1} and for all times t.
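For concreteness, the following is a minimal sketch, in my own notation rather than the lecture's, of the model elements introduced so far: the background states, the stopped state delta, the two actions, the zero per-stage cost, and the effect of quitting.

```python
# Background states, stopped state, and actions of the controlled chain.
STATES = (0, 1)        # 1: current candidate is the best seen so far; 0: it is not
DELTA = "delta"        # absorbing stopped state entered once we make an offer
ACTIONS = ("continue", "quit")

def cost(t, s):
    """C_t(s) = 0 for s in {0, 1} and all t: interviewing further costs nothing."""
    return 0.0

def controlled_step(s, action, next_background_state):
    """Quitting (making an offer) sends the chain to delta for good;
    continuing lets the background chain move to its next state."""
    if s == DELTA or action == "quit":
        return DELTA
    return next_background_state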
Now, suppose you make an offer to a candidate who is not the best you have seen so far, that is, you quit at a time when the state is 0. Then you have certainly made an offer to a candidate who is not the best: you know there has been a better candidate before, so with probability 1 the candidate you have selected is not the best, and the reward is 0. So R_t(0) = 0 for all times t. If, on the other hand, you make an offer to the best candidate you have seen so far, not necessarily the best among all, that means you quit when the state is 1, and in that case you get a reward equal to the probability that the current candidate is the best among all.

What is this reward equal to? Let us calculate R_t(1). It is the probability that the best candidate among all n is in fact the t-th candidate, the one you are presently seeing, given that this candidate is the best of the first t. Since the current candidate is the best of the first t, it is the best overall exactly when the best candidate appears somewhere among the first t, so we need the probability that the best candidate lies in the set of t candidates seen so far. You can compute this in several ways, but one way is as a ratio of counts. The denominator is the number of ways of selecting which subset of size t, out of the n candidates, you have seen so far, which is n choose t. The numerator is the number of ways of selecting a subset of size t that contains the best candidate: first pick the best candidate, then pick the remaining t - 1 candidates out of the other n - 1, giving n - 1 choose t - 1. The ratio (n - 1 choose t - 1) / (n choose t) evaluates to t/n. So the reward we get if we stop at time t on the best candidate seen so far is R_t(1) = t/n. Now let us see what the transition probabilities are.
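Here is a small Monte Carlo sketch, again not part of the lecture, that checks the value R_t(1) = t/n: among random orderings in which the t-th candidate is the best of the first t, it estimates how often that candidate is also the best of all n. The function name and parameters are my own.

```python
import random

def estimate_reward(n, t, trials=100_000, seed=1):
    """Estimate P(t-th candidate is best overall | it is the best of the first t)."""
    rng = random.Random(seed)
    record_count = 0        # orderings in which the state at time t is 1
    best_overall_count = 0  # ... and the t-th candidate is also the best of all n
    for _ in range(trials):
        ranks = list(range(1, n + 1))         # 1 = best candidate
        rng.shuffle(ranks)
        if ranks[t - 1] == min(ranks[:t]):    # state 1 at time t
            record_count += 1
            if ranks[t - 1] == 1:             # also best overall
                best_overall_count += 1
    return best_overall_count / record_count

n, t = 10, 4
print(estimate_reward(n, t), t / n)   # the two numbers should be close
```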
Let us first look at the transition probabilities of the background Markov chain. But before that, recall the terminal reward H. If the process ends in state 0, the terminal reward is 0: the last candidate is not the best seen so far, some earlier candidate was better, so you have certainly missed the best candidate, and the reward is 0. Whereas if the last candidate is the best you have seen so far, then it is in fact the best among all, and you get a reward of 1: with probability 1 you have found the best candidate. That completes the description of the rewards.

Now let us write out the transition probabilities of the uncontrolled chain: the probability p_t(j | s) of being in state j at the next time step given that you are currently in state s, where j and s belong to S' = {0, 1}. In words, given that the current candidate is or is not the best seen so far, what is the probability that the next candidate is or is not the best seen so far? The first thing to observe is that p_t(j | s) is actually independent of s: whether the next candidate is a new best does not depend on whether the current candidate is the best of the lot so far, because the candidates come to us in a uniformly random order. So this quantity is simply p_t(j); it depends only on the state j and not on the state we are presently in. Let us compute it for j = 1 and j = 0. p_t(1) is the probability that the candidate at time t + 1 is the best among all candidates seen so far, that is, the best of the first t + 1. That probability is the following ratio: the denominator is the number of ways of arranging t + 1 candidates, which is (t + 1)!, and the numerator is the number of those arrangements in which the (t + 1)-th candidate is the best of the lot, which leaves the remaining t candidates to be arranged in any order, t! ways. So p_t(1) = t! / (t + 1)! = 1/(t + 1).
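The independence claim and the value 1/(t + 1) can be checked by brute force over all orderings of a small number of candidates; the sketch below, with names of my own choosing, conditions on the state at time t and counts how often the next candidate is a new best.

```python
from itertools import permutations

def check_independence(t, n):
    """P(candidate t+1 is the best of the first t+1 | state at time t),
    computed exactly by enumerating all n! orderings."""
    counts = {0: [0, 0], 1: [0, 0]}   # state s -> [number of orderings, number where next is a record]
    for perm in permutations(range(1, n + 1)):
        s = 1 if perm[t - 1] == min(perm[:t]) else 0
        counts[s][0] += 1
        if perm[t] == min(perm[:t + 1]):
            counts[s][1] += 1
    return {s: hits / total for s, (total, hits) in counts.items()}

print(check_independence(t=3, n=6), 1 / (3 + 1))
# both conditional probabilities come out to 0.25 = 1/(t + 1), regardless of the state
```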
So 1/(t + 1) is the probability that the candidate at time t + 1 is the best seen so far. And p_t(0) is simply the complement, t/(t + 1). Putting it together, p_t(1 | s) = 1/(t + 1) and p_t(0 | s) = t/(t + 1), and this holds for all s in {0, 1}. With this we have written out all the elements of the secretary problem modelled as an optimal stopping problem, that is, as an MDP: the cost function, the rewards, the transition probabilities, the actions, the state spaces and so on. In the next part we will solve this problem to the extent that we can; we will obtain an approximate solution, and in doing so we will make full use of the standard method for solving MDPs, namely the Bellman equation.
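To close this part, here is a sketch that collects all the ingredients just derived in one place, using my own names, so that the Bellman equation can be applied to them in the next part; nothing here goes beyond the quantities written out above.

```python
def secretary_model(n):
    """Model elements of the secretary problem with n candidates."""
    def cost(t, s):              # C_t(s) = 0: no cost for interviewing further
        return 0.0
    def reward(t, s):            # R_t(s): reward for quitting (making an offer) at time t
        return t / n if s == 1 else 0.0
    def terminal(s):             # H(s): reward if we never make an offer
        return 1.0 if s == 1 else 0.0
    def transition(t, j, s):     # p_t(j | s): independent of s
        return 1 / (t + 1) if j == 1 else t / (t + 1)
    return cost, reward, terminal, transition

cost, reward, terminal, transition = secretary_model(n=10)
print(reward(4, 1), transition(4, 1, 0), transition(4, 0, 1))   # 0.4 0.2 0.8
```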