Welcome everyone. Today we will look at a problem class called the optimal stopping problem. This problem can be studied in various ways, but what we will do is model it as a Markov decision process. You will see that it has applications in a number of settings, such as selecting the best candidate out of a set of candidates, or selling an asset whose price keeps varying. In an optimal stopping problem there is a Markov chain that evolves in the background on a state space, let us call it S', which we will assume is finite. The chain could be stationary or possibly non-stationary. This Markov chain is autonomous in the sense that it is uncontrolled; that is why I said it evolves in the background. Left to itself, it evolves according to its own transition probabilities, whatever those are. The problem for us is to decide when to intervene in this Markov chain: when do we step in and stop its evolution? So we essentially have two options at every time step. We have the option to let the chain continue to the next time step based on its own evolution, or we have the option of saying quit and stopping the chain. If we continue in state s at time t, we incur a cost; let me denote this cost by C_t(s). This is the cost we incur for continuing the Markov chain when we are in state s at time t.
Now if we quit in state s at time t, we pocket a reward R_t(s). This reward accrues to us immediately, and after that the evolution of the chain stops: there are no further rewards or costs. So the question at every time step is whether to continue or quit. If we continue, we incur a cost C_t(s); if we quit, we get a reward R_t(s); and we are able to observe what s is, that is, the state of the Markov chain itself. If we let the chain evolve to the end of the time horizon, that is, up until time N, there is a terminal reward: if t = N, the end of the horizon, we get a terminal reward H(s), where s is the state at time N. Obviously we do not know in advance whether the state at time N will be the one most favorable to us, because the chain evolves randomly. It is quite possible that a reward at some earlier time step was more attractive than the reward we would get at the end of the horizon. The optimal stopping problem is to stop at the right time, so that you pocket the best reward for yourself. The goal is to maximize the expected total reward, that is, the reward you collect minus any costs you incur through the course of the problem. It is clear that this sort of problem setting has applications to finance.
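To make the bookkeeping concrete, here is a minimal simulation sketch of one episode of this setup. Everything numeric (the 3-state chain, the costs C_t(s), the quit rewards R_t(s), the terminal reward H, and the horizon N) is an illustrative assumption, not from the lecture; the stopping rule shown (quit as soon as the quit reward clears a threshold) is just one arbitrary policy, not the optimal one.

```python
import random

# Hypothetical 3-state uncontrolled chain with native transition
# probabilities p(j | s); rows must sum to 1.
P = {0: [0.5, 0.3, 0.2],
     1: [0.2, 0.5, 0.3],
     2: [0.1, 0.3, 0.6]}
C = lambda t, s: 0.1            # continuation cost C_t(s)   (assumed)
R = lambda t, s: float(s)       # quit reward R_t(s)         (assumed)
H = lambda s: 0.5 * s           # terminal reward H(s)       (assumed)
N = 10                          # end of the time horizon

def run_episode(threshold, seed=0):
    """Total reward of one episode under a simple threshold stopping rule."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    for t in range(N):                      # decision epochs 0 .. N-1
        if R(t, s) >= threshold:            # quit: pocket R_t(s), chain stops
            return total + R(t, s)
        total -= C(t, s)                    # continue: incur cost C_t(s)
        s = rng.choices([0, 1, 2], weights=P[s])[0]
    return total + H(s)                     # reached t = N: terminal reward
```

With a threshold so low that we quit immediately, the episode returns just R_0(s_0); with a threshold that is never met, the episode pays the continuation cost N times and ends with the terminal reward.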
In finance it is very common to encounter a situation in which you have, say, bought an asset and its price keeps varying, perhaps as a Markov chain. At every stage you need to ask yourself: are you going to wait, that is, continue to hold the asset, or are you going to book your profit and take home the reward you are getting? For assets such as options, there is a finite time horizon by which you need to act, otherwise the option becomes worthless. That kind of problem is modeled as a finite-time-horizon problem with a terminal reward or terminal cost. So this is essentially a matter of deciding how long to wait until you have seen enough to make a decision about booking profit. That is the basic dilemma in an optimal stopping problem: we need to trade off the time spent trying to get a better reward against eventually deciding that the best reward is behind us and that we now need to finally quit. So here is how we will model the optimal stopping problem as a Markov decision process. Let us first begin with the time horizon: we are going to assume that it is finite. The decision epochs, the times at which we take decisions, are t = 0 to N - 1. N is the time step at which the horizon ends, and that is when you would receive the terminal reward. Next, in order to model this problem as a Markov decision process, we will introduce a new state space.
Because the original problem has a Markov chain that is uncontrolled, and only gets controlled when we intervene, we need to come up with a new synthetic formulation in which this aspect is captured. So we introduce a new state space S = S' ∪ {Δ}. What is Δ? Δ denotes the stopped state: a state in which the Markov chain has been stopped because you have quit. The chain enters this stopped state at the time step after you choose quit: if at time step t you choose quit, then at time step t + 1 the state will be Δ with probability 1. Now, the actions available depend on the state you are in. While the chain is evolving autonomously, that is, while it has not yet been stopped, you can choose either to continue or to quit. But once you have quit, you have no choice left, because the chain has already shifted to the stopped state. So the set of actions in a state s ∈ S' is A(s) = {C, Q}, where C stands for continue and Q stands for quit: if your state is one of the original states of the Markov chain, that is, while the chain is still autonomously evolving, you can either continue or quit.
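The augmented state space and state-dependent action sets can be written down directly. This is a minimal sketch assuming an illustrative original chain with states S' = {0, 1, 2}; the labels `DELTA`, `"C"`, `"Q"` are my own placeholders for Δ, continue, and quit.

```python
DELTA = "stopped"                 # the stopped state, written Δ in the lecture
S_PRIME = [0, 1, 2]               # original (finite) state space S'  (assumed)
S = S_PRIME + [DELTA]             # augmented state space S = S' ∪ {Δ}

def actions(s):
    """Action set A(s): continue or quit in an original state;
    only the trivial 'continue' once the chain has been stopped."""
    return ["C", "Q"] if s in S_PRIME else ["C"]
```

So `actions(0)` gives `["C", "Q"]`, while `actions(DELTA)` gives only the trivial `["C"]`, matching the convention that the stopped state admits only one (inconsequential) action.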
But once the state becomes Δ, you really have no choice; there is only a trivial choice at that state, which let us call continue. So you can only say continue when you are in Δ. It does not really matter what we call this action, because the state does not evolve any further. Now let us write out the reward. Recall that if we continue, we incur a cost C_t(s) in state s at time t, and if we quit, we pocket the reward R_t(s). The reward for the MDP, which we write as r_t(s, a), is the reward received in state s when we take action a; again, C stands for continue and Q stands for quit. There are separate cases here. Suppose your state is one of the original states, say s ∈ S', and you choose to continue: then the problem tells us that we incur a cost C_t(s), which is effectively a reward of -C_t(s), so r_t(s, C) = -C_t(s). If the state is one of the original states and the action is to quit at a time before the end of the horizon, that is, for t < N, then we get a reward r_t(s, Q) = R_t(s). If your state is Δ, meaning the chain has already stopped, then you incur neither reward nor cost, so r_t(Δ, a) = 0, again for t < N.
At the terminal time instant there is no action left, so the reward is simply r_N(s) = H(s), the reward we get at the end of the horizon. This is how we formulate the optimal stopping problem as a Markov decision process; the final thing left is to define the transition probabilities. The probability at time t of transitioning to state j when you start from state s and take action a, written p_t(j | s, a), again splits into several cases. If the state s you started from is one of the original states, meaning the chain has not been stopped yet, the new state j is also one of the original states, and the action you took is continue, then it is as if the chain has been left to itself, evolving autonomously. Its transition probability is then given by its native transition probability, which we denote p_t(j | s): the transition probability of the uncontrolled Markov chain.
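The case-by-case reward r_t(s, a) just described can be assembled into one function. This is a minimal sketch with illustrative placeholders for C_t, R_t, and H; one detail not spelled out in the lecture is the terminal reward when the chain is already stopped, which I assume to be 0, consistent with the stopped state carrying no further rewards or costs.

```python
DELTA = "stopped"                       # stopped state Δ
N = 10                                  # end of the time horizon  (assumed)
C = lambda t, s: 0.1                    # continuation cost C_t(s) (assumed)
R = lambda t, s: float(s)               # quit reward R_t(s)       (assumed)
H = lambda s: 0.5 * s                   # terminal reward H(s)     (assumed)

def reward(t, s, a):
    """MDP reward r_t(s, a) for the optimal stopping formulation."""
    if t == N:                          # terminal instant: r_N(s) = H(s)
        return H(s) if s != DELTA else 0.0   # (Δ at time N: assumed 0)
    if s == DELTA:                      # chain already stopped: nothing accrues
        return 0.0
    if a == "C":                        # continue: effectively a reward -C_t(s)
        return -C(t, s)
    return R(t, s)                      # quit before the horizon: reward R_t(s)
```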
Now consider two other cases: the previous state is one of the original states, the next state is Δ, and the action is quit; or the current and next states are both Δ and the action is continue, for t < N. In the first we are talking of a transition from a native state of the original Markov chain to the stopped state Δ under the action quit; in the second, a transition from the stopped state to the stopped state under the action continue. In either case the transition is deterministic. We defined the stopped state as the one the controlled Markov chain enters after we choose quit, so each of these transitions occurs with probability one, and all other possibilities have probability zero. Together this defines the transition probability of the MDP, that is, of the controlled Markov chain. By modeling the problem in this way, solving it as an MDP will yield a policy that maximizes the total expected reward, which then translates to maximizing the total reward you pocket in the optimal stopping problem. What we will see next is an example of a very common problem that occurs during interviews, called the secretary problem. It is the problem of making an offer at the right time to a candidate from a set of candidates, when we do not actually know whether a better candidate is still waiting for us after the one being considered at this moment.
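With the state space, rewards, and transition probabilities in place, the finite-horizon problem can be solved by backward induction on the Bellman equation: V_N(s) = H(s), and for t < N, V_t(s) = max( -C_t(s) + Σ_j p_t(j|s) V_{t+1}(j), R_t(s) ), where the quit branch has value R_t(s) because the chain then enters Δ, whose value is 0. The following sketch uses the same illustrative numbers as before (3 states, N = 5, constant cost, state-proportional rewards); none of them come from the lecture.

```python
N = 5
S_PRIME = [0, 1, 2]
P = {0: [0.5, 0.3, 0.2],                # native transition probabilities p(j|s)
     1: [0.2, 0.5, 0.3],
     2: [0.1, 0.3, 0.6]}
C = lambda t, s: 0.1                    # continuation cost C_t(s) (assumed)
R = lambda t, s: float(s)               # quit reward R_t(s)       (assumed)
H = lambda s: 0.5 * s                   # terminal reward H(s)     (assumed)

V = {(N, s): H(s) for s in S_PRIME}     # terminal condition V_N(s) = H(s)
policy = {}
for t in range(N - 1, -1, -1):          # sweep backwards from N-1 down to 0
    for s in S_PRIME:
        cont = -C(t, s) + sum(p * V[(t + 1, j)]
                              for j, p in zip(S_PRIME, P[s]))
        quit_ = R(t, s)                 # quitting ends the chain (Δ has value 0)
        V[(t, s)] = max(cont, quit_)
        policy[(t, s)] = "C" if cont > quit_ else "Q"
```

With these numbers the resulting policy has an intuitive threshold structure: in the highest state (quit reward 2) it always quits, while in the lowest state (quit reward 0) it always continues and waits for the chain to drift upward.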
So this problem is called the secretary problem. We will see how it can be modeled as an optimal stopping problem, and therefore as an MDP, and we will then solve it, at least to a good degree of approximation, to get some insight into how one should actually take decisions about optimal stopping. The purpose of this calculation is two-fold: one is to show you the nature of the structural solutions and the kind of insight one can obtain for these sorts of problems; the other is to demonstrate how the Bellman equation can actually be used to solve such a problem efficiently. So here is the secretary problem. We have N objects, or let us say N candidates. These candidates appear for an interview in a random order. We do not know the rank of any particular candidate; we just know that they are ranked, that there is a best candidate and a worst candidate, and that there is a relative ranking among them. But when we see a particular candidate, we have no way of knowing whether there is a better candidate out there whom we have not yet seen. So candidates keep appearing to us one by one in the interview process.
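The key informational constraint, that we see only how the current candidate ranks relative to those already interviewed and never their true rank, can be illustrated in a few lines. This sketch is my own illustration, not part of the lecture's formal treatment; rank 1 denotes the best candidate.

```python
def observed_relative_ranks(true_ranks):
    """For candidates arriving with the given true ranks (1 = best),
    return what the interviewer can actually observe at each step:
    the rank of the current candidate among those seen so far."""
    seen, obs = [], []
    for r in true_ranks:
        seen.append(r)
        obs.append(sorted(seen).index(r) + 1)   # 1 = best among those seen
    return obs
```

For example, if candidates with true ranks 3, 1, 2 arrive in that order, the interviewer observes relative ranks 1, 1, 2: the first candidate always looks best so far, and the true ranks are never revealed during the process.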