So hopefully in the first part of the lecture I will conclude the discussion of Markov decision processes, which is the case where you have perfect observations of the state of the system. So in this diagram, the environment sends a signal which is its own state: there is no filtering, no error in the measurement of what the state of the environment is. This is sent to the agent, the agent decides what to do through a policy, and the goal is always to optimize the expected discounted cumulative sum of future rewards.

Markov decision processes are built on Markov chains. In general we have been using this notation for the diagram: every state is labeled with a circle, and from a state emanates some action, and this action can send the system to this state, or to another state, or to yet another state. Usually there are several actions that can be taken, whose outcomes branch in the same way. So this is state s, this is action a, let's say this is state s', all these are other states, and you can enlarge this diagram at will. You go from here to here according to a policy pi(a|s), which is the probability distribution of picking action a when in state s — this is the policy. Once the action has been picked, depending on state and action, you end up in another state s' with probability p(s'|s,a). This is the transition probability given the action, and it is also called the model of the environment, because it encodes the knowledge of what will happen to the environment upon a certain control exerted on it. It's just like Newton's law: the particle will go from position-velocity (x, v) to position-velocity (x', v') if I exert the control force f. The force is the action, and (x, v) are the phase-space variables. So this is the model of the environment. Then, as a function of the triplet of state, action, and next state — following the flow from here to here to here — a reward is issued, which might be random in general, but in the context of MDPs we just focus on its average value r(s,a,s'). And these things are known to the agent in the MDP framework.

The policy is in the hands of the agent, so the agent can modify it at will. In particular, we will be interested in how this policy can be chosen so as to optimize the goal of our decision process, which is to maximize the value, that is, the expected return: the sum of rewards discounted by gamma, and so on.

Okay. Another way of seeing this process is through a diagram which is often used, so I will show it to you because it might also be instructive. It's a diagram in which time flows from left to right — so this is time — and you start from state s_t, the state that the agent is experiencing at time t. From that state it picks, through the policy, an action a_t, and the combination of these two gives rise to a new state s_{t+1} through the probability p. I want to emphasize that this part of the process is in the hands of the agent, while this part here is in the hands of the environment. There is no way we can tweak the laws of nature: the only place we can act is in shifting the probabilities of actions; the transition probability is something we cannot act upon. The best we can do is know it in advance, which is the case we are discussing here, but it is not part of the control. And then the process repeats itself, and so on and so forth.
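Before moving on, here is a minimal Python sketch collecting the ingredients just named — states, actions, the model p(s'|s,a), the average rewards r(s,a,s'), a policy pi(a|s), and the discount gamma. The sizes and random numbers are purely illustrative (none of this comes from the lecture's specific examples), and the later sketches below reuse these arrays.

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Model of the environment: P[s, a, s'] = p(s'|s, a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)     # each (s, a) slice is a distribution

# Average reward for each triplet: R[s, a, s'] = r(s, a, s').
R = rng.standard_normal((n_states, n_actions, n_states))

# A policy: pi[s, a] = pi(a|s), here uniform over actions.
pi = np.full((n_states, n_actions), 1.0 / n_actions)

gamma = 0.9                           # discount / survival probability
```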
This diagram, by the way, is also what is called a causal Bayesian network. [A question from the audience about the arrows.] No — you go from here to here by combining the information of both s and a; that's what the diagram says. An arrow means that where you end up depends on what these two things were. That's the notation — sorry, I should have explained this better. These are not the same arrows as in the other diagram; these are causality arrows.

So, overall, another way of thinking about these processes is that in the product space of states and actions, this is a Markov chain, in the sense that the probability that at time t+1 the system is in state s' and picks action a' is nothing but the sum over the previous states and actions of the probability of transitioning given the action, times the probability of picking the new action given s', times the probability of being in (s, a). The factor p(s'|s,a) pi(a'|s') is then just the transition probability from the state-action pair (s, a) to the pair (s', a'). The key point, again, is that you do not control the full transition matrix — in which case the problem would be overly simple. You can only control a part of it, which is encoded in the policy. So this was just to remind you of the basic facts.

What we are going to do now, as quickly as possible without losing you all, is derive a way to obtain the best policy for any Markov decision process. There is a clear theoretical path which will take us to one specific equation, called the Bellman optimality equation, and by solving that equation we will be able to derive the optimal policy — we will see how. That's the program.

So we will now pick up the lecture where we left off, from the last two formulas I wrote yesterday. The last point where we stopped was that we want to optimize the value of a policy, that is, the expected return under that policy, which can be written as a sum over the triplets (s, a, s'):

V^pi = sum over (s, a, s') of r(s, a, s') pi(a|s) p(s'|s, a) eta(s).

To understand this formula: what does it mean? You will remember that eta(s) is the average time spent in state s while alive — the average number of times, if you wish, that the agent visits that state before dying. If it is visiting state s, it picks an action a according to the policy pi, this combination leads to a new state s', the triplet is formed, and r(s, a, s') is the average reward for that triplet. That's the physical interpretation of this formula, which we derived yesterday from the Chapman-Kolmogorov equation.

Now, the definition of eta(s) is the following. As I said just now in words, it is the discounted sum of the probabilities of being in state s: given a survival probability gamma, it is the overall number of steps spent in state s. And it obeys a recursive formula, which I wrote yesterday.
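Since the recursion for eta is linear, for a small example one can solve it directly and then evaluate V^pi from the formula above. A sketch, continuing with the arrays defined earlier:

```python
# Initial state distribution rho_0 (here uniform, just as an example).
rho0 = np.full(n_states, 1.0 / n_states)

# Policy-averaged transitions: M[s', s] = sum_a pi(a|s) p(s'|s, a).
M = np.einsum('saj,sa->js', P, pi)

# Recursion eta = rho0 + gamma * M @ eta  =>  (I - gamma M) eta = rho0.
eta = np.linalg.solve(np.eye(n_states) - gamma * M, rho0)

# V^pi = sum_{s,a,s'} r(s,a,s') pi(a|s) p(s'|s,a) eta(s).
V_pi = np.einsum('saj,saj,sa,s->', R, P, pi, eta)
print(eta, V_pi)
```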
[A student asks where the time horizon enters.] The time horizon is present as a parameter inside this eta: eta depends on gamma. I'm not writing this explicitly, in order to avoid cumbersome notation, but of course different gammas give different etas. If gamma is zero, the time you spend in state s is only the time spent at the initial condition, and that's it; otherwise it adds up all the later contributions.

Did anyone try to derive this formula yesterday? No? Okay, it's actually very simple. You just split the sum into the first term, which gives you the initial condition, and all the other terms; then you shift the sum back by one step, which pops out a factor of gamma, and then you resum the eta using the Chapman-Kolmogorov equation. Nothing especially technical — just keep in mind that the ingredients are the definition and the Chapman-Kolmogorov equation. [A correction from the audience.] It should be rho_0 of s' — thank you very much, that's right.

So, as we said, one difficulty in trying to optimize this function over the policy is that the quantity eta implicitly depends on the policy, and we would like to extract this dependence. There are several ways to do this. One way is simply to invert: this is a linear equation for eta — you will recognize that this is just linear algebra — so you could invert it and get eta explicitly as a function of pi through a matrix inverse. That's possible, but pretty cumbersome, because you would then have to take derivatives of a matrix inverse. You could go that way, but it doesn't really bring you very far.

So instead we implement this condition as a constraint, as we said yesterday. Doing so, we can decouple eta and pi and treat them as two independent objects, at the price of introducing a new quantity: the multiplier in a Lagrange-multiplier description. That's what we're going to do now. We introduce a Lagrange function, which I call F for no particular reason; this F is the quantity we want to optimize over pi. First we put in the constraint over the dynamics and — just to adhere to the notation later — I put a minus sign here; no particular reason, but it makes the interpretations simpler. This will be our Lagrange multiplier, the vector phi. It is a vector living in a real space whose dimension is the number of states, and it multiplies the dynamics constraint. So far so good. The number of states is finite for us; you could go in the direction of making it countable, even uncountable, but we don't need that. We keep to the simple situation where everything is finite, so we don't have to deal with mathematical subtleties. Good.

Is there anything else we should add? Rho_0 — this is the initial condition for the distribution. This is something you choose: it is where your system sits at the beginning. Of course, the outcome of your process will depend on where you start from.
So that's encoded in this rho_0, the probability distribution for the initial condition, which might be localized on a specific state or spread all over the place. That's in your hands. Of course, V depends implicitly on rho_0 too; I didn't write it here, but if you want we can add it. It depends on the policy and on the initial condition for the state.

Okay, so is there anything missing yet? [From the audience: normalization.] Exactly — normalization. So there is another constraint on pi: pi should be positive and normalized to unity. So it lives in a space which is actually a simplex, the simplex in R^d. One could insert both constraints into the Lagrangian, but we will actually need only the normalization one — in this case just because it makes the calculations simpler. So let's forget for a moment about the positivity constraint; at the end we will check that our optimal policy comes out positive, correctly defined. But we do have to take care of normalization, otherwise the whole thing doesn't make sense. So we add another Lagrange multiplier, which I call lambda, which ensures that for every state the sum over all possible actions taken from that state equals one — the sum acts only on the actions. So this term is the dynamics, and this term is the normalization.

This is a good point to stop for a second and think. This F now depends on several things: on pi, and implicitly on rho_0 too — but you don't want to optimize over rho_0, not necessarily. It's not something you can usually choose, where you sit at the beginning of your problem in your environment, so there will be no optimization over that. But F now also depends on phi and lambda, and you will have to optimize over all of them, which will enforce the constraints and give you the constrained optimum.

Before getting into this, let's just look at the formula. How does this whole thing depend on the policy? Linear — because we decoupled the dependence on eta, which is now an independent variable. So the objective is linear in pi, this is a linear constraint on the policy, and this is a linear constraint on the policy. So what can the result of this optimization be? What is the result of optimizing a linear function over a compact space, the simplex? It will be on the boundary — except if the function is flat, and if it's flat, it means that everything is equally good. So apart from that specific case, if some actions are better than others, the optimum will be deterministic. That's the first take-home message: by looking at this, we see that the optimal policy is always deterministic, except when there are ties — when two things are equivalent you can mix them. We can sharpen this into a stronger statement: there is always a deterministic policy — a one-to-one association between state and action — that optimizes our value function. And that's good. Now, this is good, but it is also a bit annoying. Why? What would you like to do now? We would eventually like a way of computing this; so far this is just a statement that there is an optimum and it sits on the boundary.
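Collecting the pieces assembled on the board, the Lagrange function — as I reconstruct it from the spoken description — reads:

$$
F[\pi,\eta;\phi,\lambda] \;=\; \sum_{s,a,s'} r(s,a,s')\, p(s'|s,a)\, \pi(a|s)\, \eta(s)
\;-\; \sum_{s'} \phi(s')\Big[\eta(s') - \rho_0(s') - \gamma \sum_{s,a} p(s'|s,a)\,\pi(a|s)\,\eta(s)\Big]
\;-\; \sum_{s} \lambda(s)\Big[\sum_a \pi(a|s) - 1\Big],
$$

where the middle bracket is the dynamics constraint and the last one the normalization. This is the object that is linear in pi, whose maximum therefore sits on the boundary of the simplex.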
But we want to know where this optimum is, what the optimal value is. We could take derivatives, but if the optimum is on the boundary, what can you say? You cannot simply set the derivatives to zero; you will just get inequalities — the derivatives will be larger than or equal to zero. These are the so-called Karush-Kuhn-Tucker conditions. And it's annoying, because if we take derivatives we end up with a set of inequalities — will we be able to write down the actual best policy? [A student suggests handling the inequalities directly.] Probably you can do this; I've never tried, I must admit. Perhaps it works. We will do something different.

First of all, are we all on the same page that this is a bit annoying from the viewpoint of taking derivatives and setting everything to zero? Yes? Okay. Because typically you have a function and you say: I want to find its maximum. If this function is strictly concave, then you take the derivative, find the stationary point, and you are assured it's the maximum. The point is that this object, in pi space — abstractly, a convex domain — is linear. So the optimum sits on the boundary. If you take the derivative, you will just find some constant, something that doesn't depend on the policy, which pushes you against the boundary, and it does not have to be set to zero at all. If you set everything to zero, you fall back into the situation where everything is flat, and you don't want that.

So the trick, as your colleague suggested, is: can we add something to make it a little bit curved? Can we regularize the problem and make it strictly concave, in such a way that, as the regularization is made smaller and smaller, we recover the original result? We could add a parabola, yes — except, where does our object live? Let's think for a second. Suppose that in a certain state there are just three possible actions. These are the axes: action one, action two, action three. The space where this probability lives is this surface, a sort of triangle — and with more actions it becomes a tetrahedron, and then a higher-dimensional object. The question is: why a parabola? Isn't there another object which has the same nice property as a parabola but lives naturally on probability space? No guesses? Logarithm — yeah, logarithm starts looking good. Entropy — ba-bam, okay? Entropy is a very good choice for regularizing this problem.

So what is entropy? Is everybody familiar with the notion of Shannon entropy of a probability distribution? Let me ask the other way around: who feels totally lost when I speak about the entropy of a probability distribution? At least intuitively, you should have a notion of what it is: it's a measure of how concentrated or delocalized a probability distribution is. I'll give the actual definition in a moment. It plays the role of the parabola here, only it's nicer. We add some epsilon — this will be our regularization parameter, a positive one — and then we add the entropy of the distribution. This has to be adjusted a little bit, so let me write it this way.
It's going to be a sum over all states, weighted by this function eta(s), of the entropy of the policy distribution at each state. Okay, so let me first explain what this thing is and what properties it has; I'll give you the definition and say why it's sitting there. The entropy of a given policy — of a probability distribution — is just

H(pi(.|s)) = minus the sum over all possible actions a of pi(a|s) log pi(a|s).

Here s stands just as a parameter: for every state there is a probability distribution over actions, there is an entropy for each state, and we are summing all these entropies with their weights eta(s). This weighting is done simply because you want to weight states that are visited frequently more than states that are rarely visited. It's not crucial — you could do it without — but it makes things simpler.

This quantity has two important properties. First, H is always non-negative. You can see this just by looking at the function minus x log x: for x between zero and one, minus log x is non-negative and x is non-negative, so each term is non-negative, and a sum of non-negative terms is non-negative. Second, H cannot exceed the logarithm of the number of actions. This you can show easily by looking for the maximum of this function over probability distributions; the maximum is attained when the probability is uniform over all actions. When you act totally at random — you pick any action with the same probability — each probability is one over the number of actions, you do the sum, and you get the logarithm of the number of actions.

So this is a non-negative quantity. What does that mean? It means we are adding a little bit of reward. When we optimize, we want to make the whole thing big; one way of making it big is of course making the part of genuine interest big, but the new term also favors distributions with larger entropy. So we are pushing our solution away from the boundary — because when the solution sticks to the boundary it has very low entropy, being deterministic, concentrated on a single action. This regularization term pushes the solution back inside the simplex. Moreover, it is very nice because H is concave as a function of pi. [A question about eta.] No, eta is just the same quantity as before — the time spent in a given state — used as a weight on the different entropies: if you spend very little time in a state, its entropy counts less, and so on. Again, it's just one choice, made to simplify the calculations; it's not strictly necessary.
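A quick numerical check of the two properties just stated — for a distribution over K actions, 0 <= H <= log K, with the extremes at a deterministic and a uniform policy. A minimal standalone sketch:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_a p(a) log p(a), with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero entries: 0 log 0 := 0
    return -np.sum(p * np.log(p))

K = 4
assert np.isclose(entropy(np.eye(K)[0]), 0.0)               # deterministic
assert np.isclose(entropy(np.full(K, 1.0 / K)), np.log(K))  # uniform
```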
The intuition is as before. Suppose you want to optimize a linear function of a variable between zero and one. You could take the derivative, find that it's positive, and decide the optimum is at one end or the other — but this becomes cumbersome in several dimensions, because you have to inspect the boundaries, and the boundaries are complicated. So another way of doing it is: take this function and add a little parabola on top of it, with a small amplitude epsilon. This case is particularly silly, but okay: you add something with that shape so that the overall function is strictly concave again. Then you can take the derivative, set it to zero, and find your optimum; after that you send epsilon to zero, which makes your curves approach this extreme point closer and closer. So this is a way of taking derivatives, setting them to zero, then taking the limit — and you avoid all the boundary analysis. It's not that the boundary analysis is wrong; it's only annoying. It's a shortcut, but an interesting one.

[A question about the notation.] Yes — this is just to say that H is a function of a probability distribution, which depends on the state. The dot stands for the actions: H is a function of a vector, and I cannot put an a there because there is a sum over a inside. It's just notation; it's defined like this, and in the literature it's written like that. [Another question.] This one — it's bounded from above, so when you send epsilon to zero you recover the original function. Of course, if you have an infinite number of actions it gets tricky, but we don't.

Another way of interpreting this is that you are promoting a little bit of irrationality, in a sense. You are saying: okay, we want to optimize this, but let's allow some space for crazy choices, which we know will be suboptimal — we promote them a little bit, and then we will send the craziness level to zero. That's another interpretation of this term. There is yet another one, in other terms — some of you may already have recognized this kind of trade-off between something you want to optimize and some disorder measure, with a parameter balancing them — but I keep that for later.

Okay, so, a lot of chat, but we have to do the calculations, otherwise we wouldn't be going anywhere. I have to organize the space on the blackboard, so I will erase the top part and use the bottom; we are basically cycling over the blackboard. Let me put V^pi here, so we have it; F is going to be a sum over s, a, s' — it's a bit compressed, but we have written this many, many times. So we have to take the variations with respect to the various quantities. Of course, the variation with respect to phi just gives back our dynamics constraint, so that's fine; we don't have to do much with that, and I will not write it down. What I will write down, on the contrary, is the derivative with respect to pi. There are a lot of variables here — number of states times number of actions real positive variables; the constraint is already encoded through the Lagrange multiplier. So let's take the derivatives. When I differentiate the first sum with respect to pi(a|s), there is an eta(s) outside, and then I have the sum over s' of r(s,a,s') p(s'|s,a). These are derivatives of a linear function, so it's pretty straightforward.
Then, in the constraint term, there is just one piece which contains pi — the last one — so I get a minus times minus: plus gamma, times the sum over s' of phi(s') p(s'|s,a), again with an eta(s) outside. Then I have minus lambda(s) — this holds for every a. And finally I have to differentiate the entropy term, which brings minus epsilon eta(s) times (log pi(a|s) + 1) — plus one, that's the derivative of x log x. And, as I said, I can now set this whole thing to zero, because I have made the problem strictly concave.

The other variable we want to differentiate with respect to is eta, which has become independent, right? When we do that, we get the sum over s' and a of r(s,a,s') pi(a|s) p(s'|s,a) — that's from the first term, differentiated with respect to eta(s). Then we go to the constraint term. Its first piece is just a scalar product; we only have to rename the variables, changing s' to s in the sum, and what we get is minus phi(s). The next piece gives — minus times minus, so plus — gamma times the sum over s' and a of phi(s') p(s'|s,a) pi(a|s). Double check — yes. And finally the last term is just plus epsilon times the entropy of the distribution over a, which explicitly is minus epsilon times the sum over all actions of pi(a|s) log pi(a|s). And this, again, must be zero.

So these are the stationary points, and now we want to solve these two equations. Let's start from the top. The good news first: the first equation allows us to express the optimal policy as a function of phi. We will then plug this expression into the second one and find a closed equation for phi. That's the plan; and this closed equation, once we take the limit of vanishing regularization epsilon, will be the Bellman optimality equation, from which we will be able to recover the optimal policy. Is the plan clear?
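For the record, here are the two stationarity conditions as I reconstruct them from the board — setting each derivative to zero, which we may now do since the regularized problem is strictly concave:

$$
\frac{\partial F}{\partial \pi(a|s)} = \eta(s)\sum_{s'} p(s'|s,a)\big[r(s,a,s') + \gamma\,\phi(s')\big] \;-\; \lambda(s) \;-\; \varepsilon\,\eta(s)\big[\ln \pi(a|s) + 1\big] \;=\; 0,
$$

$$
\frac{\partial F}{\partial \eta(s)} = \sum_{a,s'} \pi(a|s)\, p(s'|s,a)\big[r(s,a,s') + \gamma\,\phi(s')\big] \;-\; \phi(s) \;-\; \varepsilon \sum_a \pi(a|s)\ln \pi(a|s) \;=\; 0.
$$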
So let's go. But before doing that, there is one step we can take first. As we will see, this phi function will play an important role: our optimal policy will be expressed in terms of it, and it will obey a closed equation. So what is the interpretation of this phi function? What does it mean in terms of the decision process? Is it just an abstract auxiliary function, or does it have a meaning? It turns out that it has a very clear meaning, and we are going to discover it now. To do that, we perform a very simple algebraic manipulation on the second stationarity equation, the one from differentiating with respect to eta: we multiply it by eta(s) and sum over all states. I will try to squeeze this in here so it's more legible; if I fail in the effort, you will pardon me.

[From the audience: eventually this has to be zero, right?] In due time — the answer is yes, but one step at a time. So let's multiply by eta(s), sum over s, and see what happens. The first term gives me back exactly the first part of the Lagrangian. That's not surprising: eta appeared linearly there, so if I differentiate and then re-multiply — it's a homogeneous function — I expect it to pop back out. Then, at optimality, honestly, I don't care much about the constraint pieces: they are killed by the constraints themselves — the one enforced by differentiating with respect to phi, and the normalization — whether I multiply by eta or not. And then I have eta times the entropy piece: multiplying by eta(s) and summing over s gives exactly the regularizing term. So what I want to end up with is that, at optimality, this whole thing is V^pi plus epsilon times the sum over s of eta(s) H(pi(.|s)) — this part and this part all together, summed over.

There is one thing left to deal with: multiplying the remaining two terms by eta. This gives minus the sum over s of eta(s) phi(s), plus gamma times the sum over s, a, s' of eta(s) p(s'|s,a) pi(a|s) phi(s'), and all of this has to be set to zero together with the rest. You see — I have multiplied the equation by eta(s), summed, and recognized all the terms that were present in my Lagrange function initially. The last step is to recognize that we have a recursive equation for eta: relabeling, the combination eta p pi with the gamma, together with eta itself, is exactly the one in the definition of eta. So all of this collapses to: V^pi, plus the regularization term epsilon sum over s of eta(s) H, minus the sum over s' of phi(s') rho_0(s'), equals zero. Can you see the step? You can check it separately, but it's extremely straightforward; we have just used the definition of eta in going from here to there.

So why is this interesting? Because it tells us the interpretation of the multiplier. Suppose I start from one given state — I know exactly where I start. In that case, rho_0 is just a Kronecker delta at one particular state; it picks one state. Then the identity says that phi(s) is exactly the optimal value of my Lagrange function when I start from state s:

phi(s) = V^pi + epsilon sum over s' of eta(s') H(pi(.|s')) — let me write it with capital H for short — evaluated at rho_0(s') = delta_{s's}.

Is it clear? So this multiplier is actually the value of the problem: it is what you can optimally get starting from a given state. And then it's clear why, eventually, you will be able to express the optimal policy as a function of this phi. Let this sink in, because it's important — apart from the calculation, which is trivial. The final argument starts from here: this identity tells you that the optimal value of your Lagrange function is a linear combination of the phi function and the initial condition. If for the initial condition you pick one particular state, that selects one particular component of your phi field; and phi of that initial state is then the optimum you can get out of your process starting from that state.

This quantity is so important that it is called the value function. What we have here is the deformed version — we will still have to remove the regularization by sending epsilon to zero — but we already have a conceptual understanding of what this function is and why it is there. [A question: is it independent of rho_0?] Yes, phi itself is independent of rho_0; you just probe different components by choosing different initial conditions. The function is there: it is a vectorial function on the state space which tells you the best you can get out of your problem from whatever initial condition you have.
V* is the optimal one, the one after we have optimized. It still depends on epsilon, et cetera — I'm just not burdening the notation with an extra epsilon, okay? So, if we agree, I will erase these two lines and move on with the calculation, because eventually we want the policy and the optimal solution. May I? Good.

So, I promised to solve the first stationarity equation. It's actually very simple: pi appears there only through the logarithm, so we just move everything else to one side and take the exponential — pi will be the exponential of this sum over here, basically. The role that the Lagrange multiplier lambda(s) plays is just to normalize this probability. So I am writing down the result of this very simple operation of extracting pi from that equation, and the result is that the optimal policy reads as follows. The part on top is the one you just derived; the additional multiplicative factors are taken care of by the multiplier lambda(s), which normalizes the whole thing. In fact, this object is obviously normalized when you sum over a: if you sum over all possible actions, the numerator and the denominator coincide. I just use a different letter, b, to denote the actions in the denominator sum. Ah — and there is one thing I was forgetting, which is very important: there is an epsilon in front of the logarithm there, so here there has to be a one over epsilon.

Now, the term under the sum here is actually very important in decision-making theory, so it has earned a name of its own. It depends on s and a, and it is usually labeled Q(s,a). The Q stands for quality: it's called the quality function. So, phi is the value function, and this is the quality function. If we write the solution in terms of the quality function, the optimal policy is the exponential of one over epsilon times the quality function at (s,a), divided by the sum over all actions b of the exponential of one over epsilon times Q(s,b). So for every state, the solution of this problem — the policy — has a Boltzmann-like distribution, in which these Qs, these qualities, play the role of energies; for each state there is a particular set of energy values, one per action. That's the physical analogy.

So this Q function contains the same information as the phi function; we will see there is a dictionary that goes in both directions — you can map one into the other. One can show this — I will not, because it's just a little calculation — but I can explain why it is so, and it's very simple. Just as the value function was the best you could get starting from state s, this quality function is the optimal gain when you start from state s and pick action a. You can see it from the definition. Suppose you are in s and you take action a. Then you will be sent to some state s' and you get the immediate average reward — that's the first part. Plus, you are now in state s'; there is a discount gamma, because time has elapsed, but from there, if you behave optimally again, you will get phi(s'). That's why this quantity Q is also called the optimal state-action value: it's the value of a state-action pair.
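Continuing the Python sketch: the regularized optimal policy is exactly this softmax (Boltzmann) distribution over Q. Here Q_demo is a stand-in array just to exercise the function; in the full solution Q comes from phi as in the definition above.

```python
def softmax_policy(Q, eps):
    """pi(a|s) = exp(Q[s,a]/eps) / sum_b exp(Q[s,b]/eps)."""
    z = Q / eps
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

Q_demo = rng.standard_normal((n_states, n_actions))
print(softmax_policy(Q_demo, eps=1.0))     # mild preference for high Q
print(softmax_policy(Q_demo, eps=0.01))    # nearly deterministic (argmax)
```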
Good. Now, what is left? As I said, there is a dictionary between phi and Q. One direction we have: the quality function and how it depends on phi. But you can reverse it: if you insert the policy we derived into the second stationarity equation, what you find is

phi(s) = epsilon log sum over a of exp( Q(s,a) / epsilon ).

I am not doing anything other than solving these equations, plugging back in, and using definitions; there is nothing I am adding. So one relation gives Q in terms of phi, and this one gives phi out of Q, the other way around. So, what is Q? [From the audience: the free energy?] No, sorry — Q is the energy and phi is the free energy; you misled me. These have natural interpretations that physicists are familiar with. And the big F is something defined over the full process; these are state by state — each state has its set of energies and its free energy — while F is the full, encompassing thing, which includes time and everything.

Now there are other things we can do: we can derive closed equations both for phi and for Q, again just by taking these expressions for the policy and plugging them back into the equations in different ways. I will not go through the calculations, because they are very easy but a little cumbersome. [Can you see this? Not really.] So I will erase the upper two equations, and you will believe me that what follows is just the consequence of what I wrote up there. The closed equation for phi, obtained by combining all these things, reads as follows: the value of a state s is

phi(s) = epsilon log sum over a of exp( (1/epsilon) sum over s' of p(s'|s,a) [ r(s,a,s') + gamma phi(s') ] ),

still keeping epsilon fixed — it will be sent to zero in a second. Then we have a similar closed equation for Q. [A comment from the audience.] Yes, it's more elegant with a prime:

Q(s,a) = sum over s' of p(s'|s,a) [ r(s,a,s') + gamma epsilon log sum over a' of exp( Q(s',a') / epsilon ) ].

These are two ways of writing the same content. [A question — can you speak up? — about the sign of Q.] Q can be positive or negative or whatever; if you always get punished, it's going to be negative. There is no constraint either way. So, with all this in place, I think at this stage we are ready to take the limit.
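Continuing the sketch, here is the closed equation for phi at fixed epsilon, written as an operator phi -> B(phi); iterating this map is exactly how the equation will be solved numerically later in the lecture.

```python
def soft_bellman_phi(phi, P, R, gamma, eps):
    """One application of the regularized operator:
    B(phi)(s) = eps * log sum_a exp(Q(s,a)/eps), with
    Q(s,a)    = sum_s' p(s'|s,a) * (r(s,a,s') + gamma * phi(s'))."""
    Q = np.einsum('saj,saj->sa', P, R + gamma * phi[None, None, :])
    z = Q / eps
    m = z.max(axis=1)                      # stabilize the log-sum-exp
    return eps * (m + np.log(np.exp(z - m[:, None]).sum(axis=1)))
```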
Now let's see what happens when we send epsilon to zero, removing our regularization. The first thing to realize is that we are going to get a deterministic policy. Why? Because when epsilon tends to zero there is a battle between all these exponentials, and only the one with the largest exponent survives — in the denominator sum as well. So as epsilon tends to zero, every action gets a vanishing probability except the one with the largest Q. This is a general property: functions like this, where you take exponentials and then normalize, are called softmax functions — soft maximum. The epsilon softens the maximum, in the sense that for any finite epsilon the policy will not pick the maximum among your Q values 100% of the time, but it will do so with probability approaching one exponentially fast as epsilon goes to zero.

So eventually, when you send epsilon to zero, your optimal policy for the original problem is: probability one if a equals the argmax — the action which maximizes the quality function in that state, the optimal choice a*(s) — and zero otherwise. The softmax becomes a full max, and the policy concentrates on the action with the largest Q. So if you know the quality function at a given state — suppose you have a set of available actions and you know Q — you just ask which of these actions has the largest Q and give everything to that action, and so on. Given Q, you know the optimal policy. And since Q is equivalent to phi, the same holds there — though not as directly at this stage.

So, when epsilon tends to zero, what is the relationship between the quality function and the value function? This is exactly the softmax becoming a max. As epsilon tends to zero, the only term that survives in the sum is the one with the largest contribution among the actions, and then epsilon times the logarithm of the exponential of one over epsilon gives me back the optimal Q. So in the limit, the optimal value of a state is just the maximum over all possible actions of the quality function — which restates what I already said: the best thing you can do from a given state is pick the action with the largest quality, and that gives you the value of that state. [A question: from which side does epsilon go to zero?] From the positive side — otherwise it wouldn't regularize; it would give the problem the opposite curvature. Epsilon must be positive. So when we send it to zero, one over epsilon tends to infinity, and among all the exponentials the one with the largest Q dominates: the sum behaves like exp of one over epsilon times the maximum over a of Q(s,a). Do you agree that this is the dominant contribution? If there is a gap between the actions, it's the only one that survives in the limit.

And then the two closed equations also become simpler in the limit: we recognize the softmax operator, and as epsilon tends to zero the phi equation becomes the maximum over all possible actions of the sum over s' of p(s'|s,a) [ r(s,a,s') + gamma phi(s') ], and the Q equation becomes the sum over s' with the max inside. These two equations are the Bellman optimality equations — Bellman as in Richard Bellman.

So what's the idea now? We have taken this long detour with the regularization in order to show that these things can be handled in a physics-friendly way, rather than delving into all the mathematics — which is of course totally legitimate, and actually the main path by which these results were derived historically. But I think this is also quite an interesting derivation. These equations now provide us with all the knowledge: if we can solve them, we know the quality function, we know the value function, and based on that we can make the optimal decisions. Problem solved. So all of decision-making, as promised, turns into a problem of computation: how good are we at computing this value function or this quality function?
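For reference, the two Bellman optimality equations just stated, in the epsilon-to-zero limit:

$$
\phi^*(s) \;=\; \max_a \sum_{s'} p(s'|s,a)\big[r(s,a,s') + \gamma\,\phi^*(s')\big],
\qquad
Q^*(s,a) \;=\; \sum_{s'} p(s'|s,a)\big[r(s,a,s') + \gamma \max_{a'} Q^*(s',a')\big],
$$

with $\phi^*(s) = \max_a Q^*(s,a)$ and the optimal deterministic policy $a^*(s) = \arg\max_a Q^*(s,a)$.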
Why is this difficult? There are two things that should make you suspect it might turn out to be hard. First, we may have many, many states, so there are computational issues; and these computational issues are exacerbated by the fact that this is a highly nonlinear problem — there is a very nasty nonlinearity here. So you have to deal with a nonlinear problem on a vast space of states and actions. This is what Bellman recognized in 1957, and he gave it a particularly sticky name: the curse of dimensionality. A large state space, a large set of actions, and a nonlinear equation to deal with — it's going to be painful. But still, that's the golden road: if you have a Markov decision process, that's what you want to do. Of course, except for very, very simple cases, you cannot solve this equation by hand, because solving it analytically would require enumerating all possible options.

So, as an exercise — I'm not asking you to solve it, but to think about it and explore the complexity of the task — think back to the problem of the two bandits. We had two one-armed bandits, one on each side, with two coins here and two coins there, and a cost to move in between. Formulate this as a Markov decision process and ask yourself: what is the value function for this problem, for being in this state or in that state? How can I get the optimal policy out of it? The value function will have just two components — the value of being here and the value of being there — so it's going to be a vector in two-dimensional space. The quality function: two states, and in each state two actions, so it's going to be four-dimensional.

With the value function you have a lower-dimensional vector, but the maximum sits outside the sum, which makes it cumbersome: you have to take the maximum of a linear combination, and that is inconvenient, because typically the maximum is attained by different actions depending on the value of the function itself — it's nonlinear. With the quality function you have a larger space, but you are better off, because you just take the maximum of the function itself; it has a nicer form. So sometimes in practice it is better to use the quality function, sometimes the value function. It turns out that current algorithms like AlphaGo are not based on solving the Markov decision process directly — for reasons we will hopefully address — but they rely heavily on the notion of the quality function. This concept of the quality function, and of the value attached to it, is really the hinge on which this kind of algorithm works. It's not the only way, but at the moment it's the most widely and effectively used one.

Okay, so, like I said, think about this exercise: try to write down the Bellman equations for phi and for Q for this simple example. If it's too difficult, start with a single one-armed bandit first. That's going to be easy, because we know the exact solution for the best policy from the direct approach — you remember, yesterday we computed the best action and the actual gain you can get, from scratch. For the single bandit the problem is very simple: the value function is just one real number, because there is just one state, and the quality function has just two components, because there are just two actions.
So I invite you to write this down for that problem and solve it — it's very easy. Then step up to the two-state problem, and it's no longer so easy, as you will appreciate, I hope. Why am I telling you this? Because as the number of states and actions increases, the problem becomes hard and laborious, but it stays solvable. It can be solved on a computer; it's just a matter of how long and complex the algorithm is. It might take a lot of time, but it's not one of those intractably hard problems you can encounter elsewhere — it's something you can tackle.

And then, to conclude this lecture, I want to show you how to actually implement such a solution — the solution of the optimality equation — how it works and what the ideas are. It's actually a very, very simple technique, and very effective: whenever you have a Markov decision process — that is, you know the laws of nature and you have access to the states — this is what you should do. In order to do this very last step I will — sorry, I hope it's clear by now that once you have the quality function in hand, you also have the policy, and that the policy and its value are related. I will erase these two equations below and, just for simplicity, focus on the solution method for the value equation. The same techniques apply to the equation for Q. So this is a closed equation — you see, there is only phi in there — and it's nonlinear. The question is: how do you solve a nasty nonlinear equation like this?

Let's do a reality check. Are you completely lost? Partially lost? Okay. For this final derivation I will not take the limit: I will show that this equation, for any epsilon, can be solved very effectively on a computer with a very simple algorithm, and explain why it works. At the end you can take the limit epsilon to zero. So let's set that aside for a second. Any suggestions? Suppose I gave you this equation, with any background. [From the audience: a recursive algorithm.] Yeah — and it's actually as simple as that; it's the right answer. Let's define things. This phi is a vector, right? A vector in the space whose dimension is the number of states. And the equation says that some operator B, acting on that vector, returns the vector itself — that's an abstract way of writing it, and B is a nonlinear object, the thing written up there.

So the idea of the algorithm is the following. First, you start with a guess: you have to set your phi_0, the initial value. You start with a wild guess of what the value function is — it can be anything. You know what the rewards are, right? So, for instance, suppose your rewards are all positive; you could start with a very pessimistic attitude and say: my starting value function is zero, I will never get anything out of this problem. Clearly false, but it's a starting point. Or you might be super optimistic and say: out of this game — the game of the coins — I'm going to get a million dollars, and set your initial expectation to a million dollars for every state. Whatever choice; of course, smarter choices will get the algorithm close to the real solution faster, and if you start from the correct solution you are there in one step. And when will this algorithm stop? When the difference between two successive iterates is small enough — that's a good termination rule.
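Continuing the sketch: the iteration just described — value iteration, here for the regularized equation at fixed epsilon, reusing soft_bellman_phi from above.

```python
def value_iteration(P, R, gamma, eps, tol=1e-10, max_iter=100_000):
    """Iterate phi <- B(phi) from a pessimistic guess until two
    successive iterates differ by less than tol."""
    phi = np.zeros(P.shape[0])             # the "I'll get nothing" guess
    for _ in range(max_iter):
        phi_new = soft_bellman_phi(phi, P, R, gamma, eps)
        if np.max(np.abs(phi_new - phi)) < tol:   # termination rule
            return phi_new
        phi = phi_new
    return phi

phi_star = value_iteration(P, R, gamma, eps=0.1)
print(phi_star)
```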
[A student asks: what if the difference happens to be small by accident?] Then you've been unlucky — yeah, sure, that can happen. You would have to devise a better termination rule in that case, say, accumulating the error over several steps of time. But in fact we are going to show that this is not going to happen in general, if you allow enough time. So what are the conditions for this scheme to work? [From the audience: the map should be a contraction.] Exactly — that is sufficient. B is a map from some space into itself. For the two-state, two-arm bandit — the one I was discussing earlier, with the bridge and the cost, et cetera — the phi function lives in a two-dimensional space: phi for the left state and phi for the right state. So at any time my phi is one point here, and B is a map which, step by step, sends phi_0 to phi_1, then to phi_2, and so on and so forth. We would like this sequence of points to converge to one point, which will be our optimal value — the solution of the equation B(phi) = phi. Now, as your colleague said, if this map, which sends every point of the space into the space itself, is contracting — if two points at a certain distance are closer after one iteration — then you are guaranteed that the sequence converges.

So how do we check this? What has to be smaller than one? You're on the right track — someone said Jacobian. First of all, this is a multi-dimensional space, so we have to compute the Jacobian: the derivative of every component of the map B with respect to every component of phi. This is a matrix. And what property of this matrix do we have to ensure for the map to be contracting? [The real part of the eigenvalues?] No — that criterion is for continuous time; this is discrete, it's a map. [The determinant?] The determinant of the Jacobian tells you how volumes contract. What we need is that the largest eigenvalue of the Jacobian is smaller than one in modulus — because in general this is not a symmetric matrix; the Jacobian here is not symmetric, so it may (and will) have complex eigenvalues, and all of them must stay within a ball of radius smaller than one. If we can show this, we are home.

So let's do it. We know the form of the map; it's written up there. We just have to compute the Jacobian of the transformation. The Jacobian at a given point phi in our space of values, with components s and z, say, is the derivative of B(phi) at component s with respect to phi(z). B(phi) at s is the right-hand side of the equation — this object here — and we want to differentiate it with respect to the variable phi with index z, where z is one of the states and s is an index for the state as well. So this is a square matrix, and this is the Jacobian. Do you agree? I take silence as agreement. Okay, so let's do it. We are differentiating a logarithm, so first the argument of the logarithm goes to the denominator: I get the sum over a of the exponentials — of what? You will remember that the argument is just one over epsilon times the quality function. I am reintroducing Q here just for notational convenience, but it will be helpful in a second; it's simply another way of writing the argument of the exponential. Then we have to differentiate the inside with respect to phi(z), which means picking s' equal to z. And this will give us —
So: there was an epsilon in front here, and a one over epsilon comes out from differentiating inside the exponential; the epsilon cancels the one over epsilon, which is nice. Then I have the probability of going to z given s and a; there is a sum over a, because many of these terms contribute; and there is the exponential — the same thing, exp(Q(s,a)/epsilon). And now the last step is to recognize that this ratio of exponentials is just the optimal policy — the optimal policy at the given epsilon. [From the audience: there's also a gamma.] Yes, thank you very much — also a gamma. So the whole thing becomes gamma times the sum over all actions of p(z|s,a) pi*(a|s).

Now, what is this sum? It is the probability of going from s to z under the optimal policy — a transition probability matrix. So the Jacobian is gamma times the optimal transition matrix that takes us from s to z. In particular, this is a stochastic matrix: if you sum over z, you get one. So the Jacobian, eventually, is gamma — a number smaller than one — times a stochastic matrix, which we know by Perron-Frobenius theory has a spectral radius of one; this is a generic property of Markov chains. So we are guaranteed that the spectral radius of the Jacobian — the absolute value of the largest eigenvalue — is smaller than or equal to gamma. Then for any task with gamma strictly smaller than one — and we have been discussing mostly those — we are fine. Of course, as you know, in the Perron-Frobenius theorem tricky things can happen: for some transition matrices you can end up with one or more eigenvalues on the boundary of the unit disk, and if gamma equals one this can lead to trouble, et cetera. So there are very special cases in which this algorithm might give you trouble; but for any gamma strictly smaller than one, it works like a charm. The map is contracting, and it gives you the unique solution of the problem.

[A question: what is the spectral radius?] It's a property of square matrices. A matrix has a set of eigenvalues, in general complex; you take the smallest disk that encompasses all of them — its radius equals the modulus of the largest eigenvalue, and that's the spectral radius.
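Continuing the sketch, a numerical check of this contraction argument, reusing phi_star and softmax_policy from the earlier blocks: the Jacobian at the fixed point is gamma times a stochastic matrix, so its spectral radius cannot exceed gamma.

```python
eps = 0.1
Q_star = np.einsum('saj,saj->sa', P, R + gamma * phi_star[None, None, :])
pi_star = softmax_policy(Q_star, eps)

# Jacobian J[s, z] = gamma * sum_a pi*(a|s) p(z|s,a);
# the rows of J / gamma sum to one (a stochastic matrix).
J = gamma * np.einsum('sa,saz->sz', pi_star, P)
spectral_radius = np.max(np.abs(np.linalg.eigvals(J)))
print(spectral_radius <= gamma + 1e-12)     # True: the map contracts
```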
Okay, so, to summarize: all of this was for the case where you know everything. You have perfect observations, you know your transition matrices, you know the rewards. In that case, the whole decision-making process boils down to a computation, which you can do in an effective way. The problems: first, as I said, it can still be very hard if the state space is too large — but dimensionality is a curse for any decision-making process, not only for Markov decision processes. Second, this setting is totally unrealistic: typically we don't know the rules, and typically we only get partial observations. So what can we do in these more realistic situations? The plan is to explore tomorrow what happens when you have only partial observations, which takes us to the lower-right corner of our diagram, and then to explore what you can do when you are in the upper-right corner: you know the states are there and you can observe them perfectly, but you have no clue what the outcome of your actions and the future of your states will be.

So, before closing, a couple of announcements. First, you should have received an email from the secretary — if not, you will receive it soon; this is just a heads-up — with a couple of references you can consult about the foundations of what I have been describing. One is the very famous book by Sutton and Barto, a reading I thoroughly recommend. You will find in that book both less and more than what you get in these lectures: much more in terms of content, ideas, and examples; less in the sense that, for instance, today's derivation is not the kind of thing you can find there — it is really the result of digesting all these concepts and ideas for a physics audience. Then there is another review, about partially observable Markov decision processes, which will be the subject of tomorrow's lecture.

Second announcement: next week's lectures on Tuesday and Wednesday will be the practical ones, in which you will implement deep Q-learning for decision-making. Today or tomorrow we will send an email with instructions for downloading the Python packages and everything that will allow you to run this on your own computer. It's not compulsory, in the sense that all the machines in the computer room will be equipped with this; but if you want to download it — because the machines in the room are slower, or because you want it on your laptop, or whatever — you can do so just by following the links, et cetera. We do not provide technical assistance. Full stop, okay? Thank you very much. See you tomorrow.