Okay, we'll come back. So, as promised, today's lecture will be a tutorial; that is, we're going to have a couple of worked-out exercises on value iteration. Today's lecture will be held by Emanuele Panitzon; you can see him here. He's a postdoc in our group; he has recently been working on reinforcement learning for active matter, and since he joined our group he works on problems connecting biological behavior and reinforcement learning. So I leave the floor to Emanuele. This lecture will be recorded as usual, and I just created a channel on Slack where you can find the notebooks that we will be using during these tutorials.

Okay, thank you. Let me share the screen. So, my name is Emanuele, I'm a postdoc here, and I will be giving these more technical lectures with exercises. This is my third time doing a fully virtual class, so please give feedback. This is my mail: we can use the Slack channel, but if you want to ask something you can also write to me, with comments, feedback, and everything. In these classes I will mostly present some code, and the theory connected to it, on my screen. I will share already-written code, and I will make it available for you afterwards, on Slack or some other way.

So today, as the title says, we will be dealing with solving MDPs with dynamic programming. In the first part of the lecture we will deal with the traveling salesman problem: we will see how it can be rewritten as a decision problem and how it can be solved in this framework. In the second half we will deal with a different environment, grid world. Again, we will see how to define the environment as a Markov decision process, and then we will use the very general method of value iteration to solve it; so, a very general method applied to a particular environment. And we will see several things that we have seen in the theory, for example the effect of non-deterministic moves and how it changes the solution of the same problem.

Okay, I see some faces. If you have any question at any moment, please shout, because I really don't have complete control of the raised hands; just shout and I will hear you.

Good. So let's begin with the traveling salesman problem. This is a very well-studied and quite old problem: the first statements of it I was reading about are from the 19th century, and it was mathematically formalized around the middle of the 20th century. Essentially it's a benchmark problem: a very hard problem which has been solved exactly, and this is why it's usually one of the first we use to illustrate more general ideas.

What is the problem? As you perhaps know, you have N cities, all connected to each other, and you want to go through all of them, each exactly once, along the shortest path possible. And since this poor salesman wants to go home at the end of the day, at the end he wants to be in the same city he started from. So you start from one city, you go through all the other cities exactly once, then you go back, and you want to do it in the shortest way.

What is the complexity of the naive approach? If I am very, very lazy, I can try the brute-force solution: I know that there are only so many possible tours, so I can simply try them all.
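As a toy illustration (this is my own sketch, not the notebook's code; it assumes a distance matrix `D` indexed by city, with city 0 as home), brute force is just a loop over permutations:

```python
import itertools
import numpy as np

def brute_force_tsp(D):
    """Try all (n-1)! tours that start and end at city 0."""
    n = len(D)
    best_len, best_tour = np.inf, None
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(D[tour[i]][tour[i + 1]] for i in range(n))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_tour, best_len
```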
I will just write all of them down, calculate the length of each, and choose the best one. Essentially, this is a great idea for a very small number of cities, but it gets incredibly complex. Here is a possible situation: I have four cities, I start in the first one, and at the end I want to go back to the first one. At the beginning I can choose three ways to go; then I have two ways, because two cities remain; then only one. So the first question, which should be rather immediate, is how many paths there are. If you think about it, we start with N cities but one is taken, the first one; then I can go to N minus 1 cities, then I have N minus 2 possibilities for the next one, then N minus 3, and so on. So it's just (N-1)!, which, as we know, grows roughly like N^N for large N. So this explodes. It's a solution which can be done, but it requires true brute force, and it's not that intelligent.

So we want to approach this instead with dynamic programming. We want to rewrite the problem, from the simple statement we have so far, into a problem which can be read as a Markov decision process. Just a very basic recall of what that means: we need a set of states in which we can take a set of actions; for each state I can take an action, producing a new state and a reward. But the states must be such that the pair (state, action) has the Markov property, which means you only need to know the state and the action to know everything that happens next; you must not need the history before the state.

And this leads to something rather strange. We have a problem which is just cities, and you move through cities, but actually the state is not only the city you are currently in. You might think the most basic choice is: the state is where you are, the action is where you want to go. But this fails completely. The minimal representation instead is that the Markov decision process state is the city you are currently in, together with the set of all cities that still need to be visited afterwards. The state needs to contain both pieces of information: where you are, and what you still need to do.

So a basic question: why is a representation with only the current city not enough? You can think about it in this sense. For the state to be enough, knowing it must determine everything that can happen afterwards. But if you only know the current city, and you don't know which cities you may still go to, you cannot determine the true dynamics: if I am in the current city and I've already visited another city, I cannot go there, while if I haven't visited it, I can. So knowing only the current city is not sufficient for the dynamics to be Markovian; I would say, okay, I'm in the current city, I want to go to city three, and then I would need to know my history. But once the state contains both the current city and all the cities still to visit, the dynamics can be written as Markovian.
Indeed, it's quite simple to see: if you are in the current city with a set of cities to visit, and you take an action, which is simply the next city to visit, then the new state has, as its first element, the city you just moved to (it becomes the current city), and the set of cities still to visit is the same as before, minus the one you just visited.

So instead of the enumeration of paths we had before, we have an enumeration of states, as shown here. I begin at a single state, which is not just "1" as it was before, but "1" together with the set {2, 3, 4}, meaning what I still need to visit. I can still take only three actions: I can go to 2, to 3, or to 4. But if I go to 2, I must keep in the state the information that I still need to go to 3 and 4; if I go to 3, I still need 2 and 4; and so on. So it's within this set of states that I move. In this way the bookkeeping is actually simple, because not only do I have the current city, I am also recording in the state the set of actions I can take. And you can see, for example (I'm not sure if you see my pointer; otherwise I'll scream), this is state 4, this is state 4, and this is state 4, but they are three completely different states, because it's very important to know what I can do afterwards.

Okay, so how much memory does this require? Did we find a more intelligent way which is actually worse? You can do some combinatorial math to get the exact number of states, but there is a very simple argument which gives you more or less the count. As you see, every state carries a list of cities yet to visit. Each of the N minus 1 non-starting cities is either visited or not visited, so if you think of the cities as bits, you have two possibilities each, giving 2^(N-1) distinct to-visit sets, and these are for sure different states; for example, I can still have to visit 4, while having visited 2 or 3 or neither. So it's more than 2^(N-1), but it's also simple to see that it should be less than N times that, because the current city multiplies the count by at most N. So it scales approximately like 2^N: an exponential number of states, but much better than N^N. That is progress for sure.

Okay, so now we have defined the problem as a Markov decision process, which is the trickiest part; next we want to solve it. By the way, we claimed that state 1 is special because it's the starting one. Just think about it: is it truly special? If you think about it for ten seconds, you realize it is not, because the tour is a cycle, and if you traverse a cycle starting anywhere along it, it's exactly the same; the poor traveling salesman just wants to sleep somewhere at night. So no state is truly special.
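A minimal way to encode this state and its transition in Python (my own sketch; the notebook's data structures may differ) is a pair of the current city and a frozenset of cities still to visit:

```python
import math

def step(state, action):
    """Deterministic TSP transition: move to city `action`.

    state  = (current_city, frozenset of cities still to visit)
    action = the next city to visit; must be in the to-visit set
    """
    city, to_visit = state
    assert action in to_visit
    return (action, to_visit - {action})

# Sanity check of the counting argument: roughly n * 2^(n-1) states,
# far fewer than the (n-1)! paths of brute force.
n = 10
print(n * 2 ** (n - 1), math.factorial(n - 1))   # 5120 versus 362880
```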
Then again, we changed the state from the current city into something carrying more information, so one should also ask: if I compute an optimal path over these states, am I able to recover an optimal path over the cities? Fortunately yes, because each state contains the current city plus some extra information; so if I have an optimal path in the states, I can just discard the extra information and I have an optimal path over cities, which is the path I'm looking for.

So far this was just a translation of the problem. Now something enters which is the true reinforcement learning framework: I assign a cost function to each state. This cost function, which is a sort of "evil" value function, is the accumulated cost from that state to the end of the task, following some policy. I am in a state, I will follow a policy from now on: what is the cost, in our case the distance, that I will accumulate from that state until the end of the task? You can see that this is the standard reinforcement learning object: the value of a state is the sum of all future rewards starting from that state following some policy; here it's a cost, the sum of all distances accumulated up to the end following some policy. It's a perfect equivalent. And this is the recursive equation for the cost: the cost of being in a state (which, remember, is current city plus cities to visit) is the sum over all actions, that is, over all the new cities you can go to, weighted by the probability that the policy takes them, of the distance connecting the two cities plus the cost from there onwards.

And of course we have a Bellman equation for the optimal cost: the optimal cost of being in a state (current city, cities to visit) is the minimum, over next cities, of the distance to that city plus the optimal cost from that city to the end. Now, if you have no cities left to visit, the optimal cost is the only cost you can get, because the only thing you can do is go back home. This is very important, because it gives a boundary condition: for the set of states where you are in some city and you are done, the optimal cost is just the cost of going back home, because the only choice is to return to the origin.

How to compute the cost? It's very helpful to have a boundary condition, because you have the exact cost whenever there are zero cities left to visit. So take all the states with exactly one city yet to visit: their optimal cost you can calculate directly, as the distance to the one city you can go to plus the optimal cost from there onwards. Then you go one layer further, to all the states with two cities to visit, where you take a minimum between two choices: first city plus its optimal cost onwards, or second city plus its optimal cost onwards. Since at each step you have the exact optimal cost for all states with zero, one, two cities yet to visit, and so on, you can propagate the exact optimal cost layer by layer. If you have any questions, any moment is fine.

Okay, we will see how to implement it in the code. Once we have solved the optimal cost for all the layers, it's also very easy to extract the optimal path, because essentially you have already computed it: at each point you took a minimum between different actions, and that minimizing action is the optimal choice for that state; when you chain these back from the origin, the path connecting all the optimal costs from the start is the optimal path.
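In symbols (my notation: d_ij for the distance matrix, C*(k, S) for the optimal cost-to-go from city k with the set S still to visit, and 0 as the home city), the recursion and its boundary condition read:

$$
C^*(k, S) \;=\; \min_{j \in S}\big[\, d_{kj} + C^*(j,\, S \setminus \{j\}) \,\big],
\qquad
C^*(k, \emptyset) \;=\; d_{k0}.
$$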
So you can reconstruct it just by storing information you have already computed along the way.

Okay, let's go to the numerical solution, if you have no questions. This is a bit technical, but it's what we are here for. As is often the case, and it's good practice, there will be several components: some functions just compute the states, some just compute the values, and so on. This code was written partly not by me, but in general, none of the code you will see is written to be efficient; it's written to be readable (and most of the time it ends up neither efficient nor readable). So again, yell whenever you want; I will try to be as explicit as I can, without any fancy Python.

First, we need to create all the states, and you can go through this first function. You will see it's a bit convoluted, but the idea is: as I create the states going forward, I will then go back through them to compute the optimal cost. Essentially, I create a list of states which initially contains only the first state, the one in which I am in city zero with all the other cities yet to visit; each state is a pair of the city I'm in and the set of cities yet to visit, which at the beginning is all the cities. For example, if the number of cities is four, this variable would be a list containing just the state (0, {1, 2, 3}).

Then I populate all the states layer by layer, by how many cities have been visited. At the beginning I populate all the states with one city visited, with this loop here. I take the previous layer of states, each of which contains, remember, the current city and the set of cities left to visit; I take that set, and I loop over the cities contained in it. For example, with four cities, the previous layer is just the single state (0, {1, 2, 3}); 1, 2, 3 are the cities I still want to visit, so I loop over 1, 2, 3. The new set of cities to visit is the set I had minus the city I have just visited, and the new state is the city I've just visited, paired with that reduced set. So, following this: I had (0, {1, 2, 3}); if I visit 1, I create the state (1, {2, 3}); and I do this over and over for each city, so I visit 1, then 2, then 3, and so on, layer by layer. And this last check is just because there are multiple ways to reach some states, so you want to save each newly created state only once, not all its copies. This part is admittedly a bit boring, but you will realize that most of the time, implementing reinforcement learning problems with nontrivial states is exactly this kind of technicality, while the actual algorithm is, generally speaking, much shorter.

Now we want to define the distances, which is very simple: we just create a random N-by-N matrix, where each entry is a random distance between two cities.
The diagonal is zero, because staying in the same city costs nothing; but actually you will never even use it, because there is no action which keeps you there. So that's it. And as you can see, this is what it creates: for example, with four cities I have these states, from (0, {1, 2, 3}) we get (1, {2, 3}), and so on; and you see that as N becomes large, the number of states explodes exponentially, but still less than factorially.

So now you can see that to solve the traveling salesman problem you need about as few lines as to create the states, and only because I tried to put in as many comments as possible. What I give to the solver is just the essence of the problem, the distance matrix, and I let it solve. What it does: given the matrix, I know how many cities I have; I create dictionaries to keep all the information I want to store; I create the full list of states. Then, remember, the first thing we know is the optimal cost for all the states with no cities yet to visit, because it's just the distance to go home. That's exactly what we do first: for all the states with no cities left to visit, the cost is directly the distance between that city and the origin, and the stored best action, since you have no choice, is to go back.

As we explained before, we then work backwards from the boundary: from zero cities yet to visit, to one city yet to visit, then to all the states with two cities to visit, and so on. So we loop, but in inverse order. For each state of the current layer we want, as you recall, the minimum over all possible actions, and this is exactly what we do. We keep a list: I am in a state, I construct each possible next state, and the quantity I have to minimize over is, if I go to this next state, the distance to get there plus the optimal cost I have already calculated from there onwards. So for every city I have yet to visit, I compute the cost of going to that city plus the optimal cost from there onwards, and I collect in this list all the different possible choices: the cost to go there plus the cost onwards, for each candidate city. Then I just take the minimum: taking the minimum gives me the optimal cost for that state, and knowing where the minimum is gives me the best action, that is, the best city to go to.

Since we do this for all the layers, starting from the boundary (the zero step: nothing left to visit, go back home), then one city to visit, then two, and so on, at the end I will have kept in store all the best actions, which are the best cities to visit from every possible state.
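Condensing that into a single sketch (my own compact version, assuming the distance matrix `D`; the notebook spreads this over several functions, and the forward path reconstruction described next is included at the bottom):

```python
import itertools

def solve_tsp(D):
    """Backward induction over TSP states (current_city, frozenset to visit)."""
    n = len(D)
    cities = frozenset(range(1, n))
    cost, best = {}, {}
    # Boundary layer: nothing left to visit -> the only move is home to 0.
    for k in cities:
        cost[(k, frozenset())] = D[k][0]
    # Sweep the layers with 1, 2, ..., n-1 cities still to visit.
    for size in range(1, n):
        for subset in itertools.combinations(cities, size):
            S = frozenset(subset)
            currents = {0} if size == n - 1 else cities - S
            for k in currents:
                choices = [(D[k][j] + cost[(j, S - {j})], j) for j in S]
                cost[(k, S)], best[(k, S)] = min(choices)
    # Forward pass: replay the stored argmins from the start state.
    k, S, path = 0, cities, [0]
    while S:
        k = best[(k, S)]
        path.append(k)
        S = S - {k}
    return path + [0], cost[(0, cities)]
```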
Now I put myself at the start, and I just keep applying, from my current position, the best action I stored for that state, until I get back to the origin. Since I have already stored the optimal cost and best action for all states, I can do this forward motion: I'm in this state, what is the best action; now this state, what is the best action; and so on. Basically, this is everything you need to solve it.

So let's run it... and you can see this doesn't work, because for some reason I am a fool and didn't run the cell with the imports. This is like 95 percent of my errors. Okay, so we have this distance matrix, which we chose to be symmetric in its creation, with zeros on the diagonal, and we can solve it. You see that this is quite costly, for two reasons: first of all, it's still an exponentially costly algorithm, and second, this is about the worst way to use Python ever; the two things together mean it takes a certain number of seconds to finish. But at the end you get the best path, and also the length of the shortest path.

So why can we use this algorithm? For two main reasons. The first is that, as in all dynamic programming, we know everything about the model: we have perfect information, we know the transitions, we know the rewards, everything. And in this particular case we can additionally use a nice trick, because we have information about the boundary: we know exact values on a certain set of states, and we have the tool to go step by step and compute exactly from there. We know the optimal cost when all cities have been visited; then we are sure we can calculate the optimal cost for all states with one city left to visit, and so on. This is rather nice, because you only need to sweep the states once: once a state's optimal cost is computed, it's exact. You're not doing multiple sweeps where each time you merely update some values; you can do it in a single sweep because you have a perfect ordering of things to do. If you look at the algorithm, every state is considered exactly once, and when it's considered, its optimal cost is computed exactly.

Why am I saying this? Because, as we will see in the next half of the lecture, when we do value iteration this is no longer the case: there we will have something perhaps even simpler as an algorithm, and more general, but we will have to do multiple sweeps over all the states, and we cannot, even in principle, select just a batch of states and finish them in order; we have to keep doing all of them together.

Okay, do you have questions?

Student: I have a question. Is it ever possible that, if you choose in the first steps an option that gives you a minimum cost, you end up afterwards with a bigger total cost than if you had chosen some bigger cost at the beginning, which would have given a minimum cost later? I don't know if I explained myself well.

I think I understood. In a sense, you never choose an action which... okay, let's take it apart. If you are in a state, and you calculated, from that state onwards,
that your best action is one choice instead of another, then, when you are below that state and you arrive at it, it will never turn out that that choice was wrong. For example, let's say I am at this point here, the state (4, {2, 3}). When I arrive at (4, {2, 3}) and I see that the optimal continuation is to go to 2, then 3, then 0, instead of 3, then 2, then 0, this fact will never become wrong as I move upwards, because by definition the cost of the continuation 2, 3, 0, terminal state is perfectly computed, and you can compare it perfectly with the other number. So if at that point you decide to go 2, 3, 0, it will always be more convenient than going 3, 2, 0. But it could be that, one layer up, among the states (2, {3, 4}), (3, {2, 4}), and (4, {2, 3}), it seems right now that (4, {2, 3}) offers the best continuation, yet the distance from (1, {2, 3, 4}) to (4, {2, 3}) is much larger than the other two; so even if, from that layer onwards, (4, {2, 3}) has the minimum cost, at a higher level the optimal path may go through another state. I hope somewhere in there was an answer.

Student: Yes, I understand, thank you. I think you can also prove it by contradiction: if you assume you chose the minimum but the total cost turns out greater, you reach a contradiction.

Yes, in a sense: you define the cost as the minimum, and you know everything exactly, so if it ends up not being the minimum, then it was not the minimum. What I wanted to stress, and why I hesitated at the start, is this: at a given layer you may have a cost and think that going through that state is best, but you have to wait for the whole computation from that layer down to the terminal state before deciding. From one state onwards, though, definitely: once you have chosen the minimum path, it will always be the minimum path.

Student: Sorry, I meant with more states. I was imagining a map where you have several cities at different distances, and you choose to go first to one city because it's the closest, but then, going on, that wasn't the best choice. That's what I meant.

Well, if I understand correctly, this is not the case here, because at every level you compute all the distances for all the states of that level.

Student: Yes, that's what I understood after you explained it.

Antonio: May I just add a comment, to make the connection with what we did during our lectures? I'll try to share the screen myself. Are you seeing my screen? Okay. During the lecture, when we derived the Bellman equation for the fixed time horizon, you remember that we made exactly the same kind of argument: we started from the end, we took an optimization over the whole sequence of policies from a certain time onwards, and we started the procedure from the final time. You first pick one time and optimize over it, then all the other ones, and you end up with the recursive optimality equation, which is exactly the one being used here. So the way you prove the optimality equation inherently
tells you that this is the only way to end up with the optimal solution: if you start from the end, you will never be wrong in going backwards. Just so you realize that this abstract calculation is exactly what is happening here. I'm going to unshare my screen now; you can share yours back.

So when you look at this diagram of states and transitions: here you're reasoning in terms of terminal states, but you can think in terms of times. This would be time step zero, then you have time one, then time two, time three, and the final time is exactly the number of cities you have to visit, plus one if you wish. So this way of going backwards can be thought of either as going from the terminal states backwards to the other states, or as going from the final time backwards. You see the connection with what we had in the lecture.

Thank you. Okay, I'm not sure if this is a good time to take a break?

Antonio: Yeah, I think so. Maybe the only thing I want to add is a historical remark. This algorithm, applied to the traveling salesman problem, goes under a name I'm going to write in the chat: it's called the Bellman-Held-Karp algorithm. As you should realize by now, it's just one specific example of dynamic programming with a finite horizon and backward induction, but that's how it's historically known. It's good because it's exact, but it's also exponentially costly, so when the number of cities becomes very large it basically becomes unfeasible. In the literature on the traveling salesman problem there are lots of heuristic algorithms that are guaranteed to find nearly optimal solutions according to some rule, but we don't discuss them here, because they have nothing to do with reinforcement learning; it's more of a computational issue in itself. Okay, thank you, Emanuele. I think we can take a break now and maybe get back at 10 sharp for the second part.

Okay, I will pause the recording now, so please remind us afterwards to resume the recording first.

Okay, good. So now we go to the second part of this exercise lecture, which is grid world. Now we deal with a different environment, the grid world environment, and we will use value iteration as the tool to solve it. This is a very simple problem with a very simple solution, but the algorithm, being so general, is remarkable, I think, in its simplicity. And we will see funny colored plots, which is always good.

The grid world environment can come in many variants; I chose one which is very simple. It's a grid, a world of square cells, and we want to maximize the reward, which will be navigation-based, in the sense that wherever we are, we want to reach certain goal states. In this particular implementation of grid world, these goal states are terminal states; whenever you take an action from a state and arrive at one of them, you get a reward, some value r, which we will take positive, negative, whatever. Since the goals are terminal, once you arrive at a goal it essentially becomes an absorbing state with zero further reward.

What can an agent do? It's rather simple: it can try to move up, down, left, and right, but there are two exceptions to this rule. The first one is that certain sites will be
blocked: I will place some blocks at random, and if you try to move into a block, you stay still. The second is the overall size of the world: if you try to go outside the world, you stay still again. Okay.

So, a typical world, as we create it. Again we have separate parts for the movements, for the definition of the environment, and for the actual algorithm that solves the task. I create the world by assigning a few values: a dimension in x, a dimension in y, a number of randomly chosen blocks, a set of goal states, and the rewards those states give if you happen to arrive there. Essentially it just creates a matrix; the states are now simpler, just positions in a matrix. We place the randomly chosen blocks (at most that many; they could even repeat, and we don't care), and the world signals that a cell is a block by putting minus one in that position. So essentially we have a matrix which is zero almost everywhere, minus one in some places, and then there is a list of positions holding a value which is neither 0 nor minus one: that value is the reward of a goal, a terminal state which, if you happen to reach it, gives you the reward. You can see I create one: I say I want 10 by 15, I want 10 blocks placed randomly, and in the last position I want a goal with reward one. And this is what it creates: this is a goal with reward one, and this is essentially everything I need of a world. So it's basically an array with some minus ones and a one.
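A minimal sketch of such a constructor (my version; the notebook's has more options), encoding free cells as 0, blocks as -1, and goals as their reward value:

```python
import numpy as np

def make_world(nx, ny, n_blocks, goals, seed=0):
    """Grid encoded as a matrix: 0 = free, -1 = block, r = goal reward."""
    rng = np.random.default_rng(seed)
    world = np.zeros((nx, ny))
    for _ in range(n_blocks):            # random blocks; repeats just overlap
        world[rng.integers(nx), rng.integers(ny)] = -1
    for (x, y), r in goals:              # terminal goal states
        world[x, y] = r
    return world

world = make_world(10, 15, n_blocks=10, goals=[((9, 14), 1.0)])
```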
What does it mean to solve the grid world? It means that I want to know, for each position, the optimal path to collect the maximum reward. At the end of this exercise I want every single square mapped to the best action to take in order to maximize future reward; and since I will do it with value iteration, in each square I will also have printed the number which is its value, the expected return, and so on.

How do we solve this grid world? By value iteration. Value iteration is again part of dynamic programming, in the sense that we know the model; we know everything there is to know, and the problem is to solve it, not to extract the information from experience. It will again be an iterative process, but a bit different from before. As I pointed out, before it could be seen as a temporal sequence: you solved exactly all the steps, from the last step back to the first, because the horizon was fixed and each time pertained to completely different states, so you could sweep from the last time to the first and optimize exactly along the way. Now, since in this environment you can go back and forth, it will be a different process, but in the same spirit and using the same Bellman operator as before. Before, we could apply it knowing exactly the cost at later states, and bring this exact knowledge iteratively backwards; now we will apply the Bellman operator over all states, and no single iteration will produce the exact optimal values. Instead we use the fact that if you apply the Bellman operator repeatedly over all the states, the whole value landscape collapses onto the correct one. What I just said, rather poorly, is that we are going to use the property that the Bellman operator is a contraction.

The algorithm, which you have seen, I will just restate briefly. We want to estimate the value of all the states, and we do it this way. First we attribute to each state a random value; of course the value of the terminal states is zero, in the sense that they are absorbing states with no further reward, so those are fixed at zero. Then we sweep over all the states, and the new value for a state is the maximum over actions of the expected return of selecting that action: you pick an action, you have the probabilities of ending up in different states given that action, and you have the reward plus the discount factor gamma times the value of the new state onwards. In general this is a probability distribution: say I have three actions; for each action I have a spread over possible next positions, each giving some reward plus gamma times the value from that position onwards. I take the best action from the list given by this sum, and my new value is the maximum of that list.

Student: Excuse me, why does that probability depend on the rewards? I understand it depends on the future state and the action, but why the reward?

Okay, as you can see it's not exactly that it depends on the reward. Essentially, let's say this: you see that in the summation I have a reward, a proper reward, right? So, say I am in a state s, a position which is adjacent to a goal state with reward one, and I take the action of going there. Then you have to sum two terms: the probability that, being here and doing this action, I get reward one and land in the terminal state, and this probability is one, because if you are there and you move in, that is exactly what happens; but then there is also another term, the probability of moving to the terminal state and getting no reward, and that probability is zero. It's a slightly strange way of saying that you sum over all the outcomes: over all the new positions and over all the rewards you could have got. Normally most of these terms are zero: the probability of an action which gives reward one yielding no reward is zero, and vice versa. Do you see what I mean?

Student: So are you saying that the probability of getting the reward is not always one, even if I go to a cell where the reward is expected?

No, no: in this case it is exactly one.

Antonio: May I weigh in, to make the connection with the lecture again? I'll share my screen. You may remember that when we first introduced Markov decision processes, we had this transition probability that, from a given state s, when you pick an action a (can you see the pointer, by the way?), you jump into a new state s'
and then you collect some random reward r. So in general, when you make one of these transitions, you receive a reward which may be stochastic. Then, later on, we realized that when we write down our objective function, since we take averages, we can explicitly perform the average over the random rewards; you see, this is what's in this line here. What matters is that you can marginalize your transitions over the distribution of rewards and focus only on the expected reward given state, action, and new state. So from that point onwards we only considered the average reward given the triple (s, a, s'). What is written on this slide just starts from the more general setting, which includes the possibility of stochastic rewards, even though they don't really matter here. Is that any clearer?

Student: Yes, I see. And so the expected value...

Antonio: In this case there's no ambiguity, because you always get a reward which is exactly one; there's no stochasticity, and the average is just the value of the reward. So you can write either the first or the second form; there's no difference.

Student: Okay, thank you.

Okay, sorry for being slightly confusing. Essentially, you see that the idea is just this: I am in a state s, I evaluate all the actions I can take; each action leads to different states with their rewards, and I evaluate where I go and which reward I get, plus gamma times the value of the new state; this gives me a list of estimated values for my different actions, and I take the best of them; my new value for the state is exactly that number.

Now, the point is that this is an approximate way of working, because on the right-hand side you have the values from the last approximation. When you take the maximum over actions of the sum, over new states, of the probability times the reward plus gamma times the value of the new state, that value of the new state is an approximate value. So whenever I compute a new value and assign it, I'm actually also changing the right-hand side: it's not self-consistent. When it reaches self-consistency, then it is the true optimal value function.

To measure this self-consistency, that is, how much applying the Bellman operator changes the values, my error is the distance between the previous values and the new values calculated by applying the Bellman operator to the old ones. Since the Bellman operator is a contraction, I know this distance gets smaller and smaller over the iterations, and I can simply require it to be smaller than some small tolerance I choose. When you have reached the tolerance, you assume that the value you obtained after many iterations is as close as you wanted to the optimal value.
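Written out (standard notation from the lectures), one sweep of the Bellman optimality operator, the stopping criterion, and the final policy extraction are:

$$
V_{k+1}(s) = \max_a \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma\, V_k(s') \,\big],
\qquad
\varepsilon_k = \lVert V_{k+1} - V_k \rVert \le \text{tol},
$$

$$
\pi(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma\, V(s') \,\big].
$$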
Then, once you have the optimal value, finding the policy is simple: if you are in a state, you just loop over all the possible actions and pick the one which maximizes the sum over outcomes of the reward plus gamma times the value in the new place. So value iteration is essentially: take a random initial set of values for all the states, and apply the Bellman operator. For each state I look at the available actions, I evaluate the expected future accumulated reward, as reward plus gamma times the value from there onwards, I take the action which maximizes it, and I assign that value to the state; and I do this iteratively, over and over. Okay, good.

Student: Excuse me, I'm one of the MHPC students, so probably I haven't reached this point; I saw some of the videos, but maybe this has already been said in the theory lectures: the fact that this algorithm converges, is that a proved property?

Yes, it's a property. I don't know if Antonio wants to chip in, but we use the property of the Bellman operator that it's a contraction; I can point you to the right place in the lectures.

Antonio: Yes, I think it's useful for all students to get to that point. It's in lecture four. If you can unshare... okay. So, we discussed how to derive the Bellman optimality equation, which is here: we start from the definition of the value function as a recursion and then take the maximum over policies; that's the first part. The second part shows that the Bellman operator is a contraction, which means that if we take any two vectors in the space of values and apply the Bellman operator separately to each of them, then their distance, in a specific mathematical sense, the L-infinity norm, is smaller than the distance between the original points.

Student: Okay, so it's a contraction. Thank you; sorry, I had only arrived at lecture two or three.

Antonio: Never mind, it's good if you ask questions; it's useful for everybody to revise on the fly and see the connection with what we did during the lectures. No worries at all.

Okay, so operationally what we have is very simple: we construct something which performs a single application of the Bellman operator, and then we apply it iteratively, because the approximation of the values at a given iteration is just the previous approximation with the Bellman operator applied; and applying the Bellman operator just means trying all the actions and finding the one which maximizes the expected outcome afterwards.

Good. So let's translate the grid world into a Markov decision process. It's actually even simpler than before, because now the state is exactly what you would imagine: the state is just the position, in this case two integers, the indices of the position. The action set is the same in every position: the four moves you can attempt, up, down, left, right. For now the transition is deterministic, meaning the probability of ending up in a state s' given s and a is either one or zero: it's one if s' is the position resulting from where I was and the move I made, and zero anywhere else.
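In formulas (my own write-up of what was just said, with move(s, a) denoting the attempted step):

$$
p(s' \mid s, a) =
\begin{cases}
1 & \text{if } s' = \mathrm{move}(s, a)\\
0 & \text{otherwise,}
\end{cases}
\qquad
\mathrm{move}(s, a) =
\begin{cases}
s + a & \text{if } s + a \text{ is inside the grid and not a block}\\
s & \text{otherwise.}
\end{cases}
$$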
As for the reward, maybe putting it here is confusing, but the reward function is simple: if you reach a terminal state, a goal, the reward of the action ending in that terminal state is the goal's value; all the other actions have zero reward. This is the first, vanilla grid world we work with.

So we want to implement this as functions. There is a small caveat: to keep our intuitive notion of up and down consistent with the visual plots, x and y are a bit shifted, in the sense that the vector that moves you up is not the one you might naively read off; everything is consistent, I checked. So the actions are up, down, right, left, and they are just the vectors (1, 0), (-1, 0), (0, 1), and (0, -1); x is, unfortunately, not the first plot axis.

The transition probability function is what I want next: given a state s and an action, it returns a list of the states you could end up in, given that the world is what it is. It just attempts the move in the deterministic sense: it computes the new position, then it checks: is this position outside the world? If so, you do not move. Is this new position a block, that is, is its world value minus one? Then you do not move. Then it returns two very simple lists: one list with the new position you reached (in this particular case there is only one, either the position you reached or the position you were in before, if the action was not allowed), and one with the probability of reaching it, which is one, because everything is deterministic so far.

Student: Sorry, is there any reason to enforce staying within the box this way? We could have put a halo of minus-one cells around the grid.

No, no particular reason. This code is one of the many codes you could write, and I can assure you it's suboptimal in all possible ways; this is just the way I chose.

The reward, again: it's zero everywhere, but if you arrive at a terminal state, a goal, you get its reward.
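A sketch of these two functions in code (mine, following the conventions just described; `world` is the matrix from before, and this simple encoding assumes goal rewards different from -1):

```python
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # up, down, right, left

def transition(world, s, a):
    """Deterministic move: lists of next states and their probabilities."""
    new = (s[0] + a[0], s[1] + a[1])
    nx, ny = world.shape
    if not (0 <= new[0] < nx and 0 <= new[1] < ny):   # outside: stay still
        new = s
    elif world[new] == -1:                            # block: stay still
        new = s
    return [new], [1.0]

def reward(world, s_new):
    """Reward collected on entering s_new: the goal value, 0 elsewhere."""
    v = world[s_new]
    return v if v != -1 else 0.0    # -1 marks a block, not a reward
```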
So this is essentially all the meat of the algorithm: the value iteration function, and it's actually rather simple, as you will see. This first part just reads off the environment: you are given a world, a set of values, and the discount factor gamma, and the function returns the next iteration of values. This flag you can ignore for now; it's for when we make the transitions random later. We initialize the new values and the new policy, we find where the goal positions are in the world, and then the transition is either deterministic or random.

What we do, slightly differently from the traveling salesman, is cycle over all the states: we take the old values of all the states and we sweep. I cycle over all positions x and y; my state is defined like this (you see the x/y shift here). If the position is a block, I skip it, because no agent can be inside a block, so it makes no sense to have a value or a policy there. Otherwise, I try all actions: for each action a in the list of actions, I see where I end up; remember that right now this is deterministic, so p is one and I finish in a single state with probability one. Then I record the value of this action: for all the probability-state pairs I could end up in, I check the reward of ending up there, and I compute the right-hand side of the equation from before, multiplying the probability by the reward plus gamma times the value at the new position. Since this inner loop runs over all possible new states and their probabilities, it is exactly the sum written here; and I do this for every action.

If the new sum is better than the old one I kept in memory, this signals: you are the best value, you are the best action. This is a bit different from what you might do in general, because what happens if two actions are exactly tied? Since this is a strict greater-than, only the first one is selected, so I am introducing a bias, which doesn't matter too much, toward the first action checked: in case of a tie, the first one wins. A better and more generally correct way is to collect all the actions that tie, store them, and then choose one at random; I do it this way for simplicity. As we will see, there are a lot of ties everywhere, and I'm just taking the first action; just something to notice.

So: I check all the actions, for each action I check all the possible outcomes, and I see which is best; my new value is that best value, and my new policy is the best action that led to it. Then I remind myself that the goal state is a terminal state with value zero, and there is no policy in the terminal state. Good. Any questions about that?
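The sweep, condensed (my sketch, reusing `transition` and `reward` from above; the notebook's version also handles the random-transition flag):

```python
import numpy as np

def value_iteration_step(world, V, gamma):
    """One application of the Bellman optimality operator over all states."""
    nx, ny = world.shape
    V_new = np.zeros_like(V)
    policy = {}
    for x in range(nx):
        for y in range(ny):
            s = (x, y)
            if world[s] == -1:      # block: no agent, no value, no policy
                continue
            if world[s] != 0:       # goal: terminal, value stays zero
                continue
            best_q, best_a = -np.inf, None
            for a in ACTIONS:
                states, probs = transition(world, s, a)
                q = sum(p * (reward(world, sp) + gamma * V[sp])
                        for sp, p in zip(states, probs))
                if q > best_q:      # strict '>' biases ties to the first action
                    best_q, best_a = q, a
            V_new[s], policy[s] = best_q, best_a
    return V_new, policy
```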
Okay, an example: I take one grid world with one goal. This is the grid world, with random blocks, and at the beginning the values are zero everywhere, because I defined them to be zero. Let's see what happens if I do a single update. I begin by setting the values to zero, the world is exactly as seen here, and I also want to estimate the change. You'll notice that I'm using an L2 distance here, not L-infinity; in this particular case it changes nothing. The theory proves the contraction property in the L-infinity norm; here, out of laziness, I use the L2, but taking np.max of np.abs of the difference works just the same.

So the single update is done here: I give the description of the world, the values initialized to zero, and the gamma, and I ask for the new values. And what happens... this is already the answer; I'm sorry, one day I will learn, but not today. Okay, good. You see that one update changes only four values, the four values around the goal. Why? It's actually very simple to see: if you are at any other point, each of your four actions leads, with zero reward, to a state with zero value; you had zero before, and whatever action you take, you always end with zero reward plus zero value. But if you are in one of these four states which have now become purple (I hope on your screen too), among your four actions there is one that goes into the goal, and its value is one, the reward, plus gamma times the value at the goal, which is zero: so exactly one. And notice, about breaking ties, that all the other policies point down, which doesn't mean anything: it just means down is the first action checked, and all the others are, of course, perfectly equivalent.

Student: Even though all the other actions are basically arbitrary, since the reward is the same regardless of where we go, shouldn't we nevertheless respect the fact that we can't step into one of the black squares? For instance, there are a lot of tiles where the down arrow leads right into a block.

Okay, this is the definition of the Markov process: you are allowed to take that action, and that action leaves you on the spot, which is different from having a smaller action set. If you are here (I hope you see my pointer), you do not have one action, you have four, three of which are completely useless; but they exist, you can take them, and with probability one they leave you in the same place. It seems a silly distinction, but it's not at all the same thing. We will see later that you can have terminal states with negative rewards: say that right above here I have a terminal state with a reward of minus two thousand. You don't want to go there, and you can avoid going there precisely by pushing into this block over and over: you stay put, and that is the optimal policy. You see what I mean?

Student: So basically the constraint that we cannot step into the black squares is not enforced when we compute the set of actions we can take, but rather when we compute the probability of the next state?

Yes. And this is our convention; we'll stick with it, but you could create whatever grid world with your own rules. This is what we have now.

Student: Okay, thank you.

So now let's iterate. Again we initialize the values to zero, we set a tolerance in this L2 norm, and we say: let's try 300 iterations. Each time I apply the Bellman operator once to the values, given the world, and I get the new values and the policy; I'm using gamma equal to 0.95 (this is just a side note for me). You can see the distances: going from iteration 0 to 1 the distance between values is large, and then the distance between iteration k and k+1 gets smaller and smaller; at the end it even reaches exactly zero, and we can discuss that. Now we have selected a policy which is not trivial, the value is no longer zero everywhere except one place, there is a gradient of color, and the policy follows that gradient towards the terminal state.
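The driving loop is then just repeated sweeps with a stopping tolerance (my sketch; here I use the max-norm, whereas the notebook used the L2 norm, which makes no practical difference):

```python
V = np.zeros_like(world)
for k in range(300):
    V_new, policy = value_iteration_step(world, V, gamma=0.95)
    err = np.abs(V_new - V).max()
    V = V_new
    if err < 1e-8:      # the contraction property guarantees err shrinks
        print(f"converged after {k + 1} sweeps")
        break
```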
You can also understand the actual numbers, how you can follow the value around: the value of a state is just the reward, which is zero, plus gamma times the value of the next state; and that next value is again zero plus gamma times the following one, and so on, always, except at the very last step, where you enter a terminal state and collect reward one. So this cell is one; the one before is gamma times that; the one before is gamma squared, because you pick up one gamma here and one gamma there; and so on and so forth.

We started from values of zero, and what we see is a creeping up of values, which all grow until convergence to the exact value; indeed, the final error is exactly zero. In a sense this happens for the same reason as before: we could think of it as a temporal sequence, starting from the last part and moving up one step per sweep; for this particular deterministic transition, starting from values of zero, you actually reach the perfectly exact value.

What I want to show now, because you proved it in the theory but it's still fun to see: what happens if I start with values of 10? Now my starting estimate of the value is much larger than the truth, but I still end up in the same situation. Why? Because the operator is a contraction. If I start anywhere and keep applying this maximization: if I am below the true value I will gain value, because I keep discovering better ways to collect future reward; but even if I start above, I will collapse down, because eventually all my values collapse onto the right ones. It's an obvious thing from the theory, but still.

Antonio: Just a reminder: if you look carefully, for instance in the upper right corner, all the values there are the same, which is not true of the actual V*; the true V* is not like this, and as a result the policy there is suboptimal: if you follow the bottom arrows, for instance, you end up stuck in a dead end. This is happening because you haven't converged yet; you have to run much longer to break all these ties in the value function and approach the optimal solution. Of course, when you start with a value function that is way off the optimal one, it takes several more iterations to wipe out the initial bias; on the contrary, if you have a good guess for the initial value distribution, you converge very quickly. An exercise you can try: since you know the target is in the lower right corner, start with an initialization which is gamma to the power of the distance from that corner, ignoring the blocks, that is, the Manhattan distance over the grid. You can check that with that particular initialization of the value function you converge quite rapidly to the optimal solution. The choice of initialization is where you can put all sorts of intuition or side information about the structure of your problem. That was all I had to add.
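Antonio's suggested exercise as code (my sketch; `goal` is the target cell):

```python
def smart_init(world, goal, gamma):
    """V0(s) = gamma ** ManhattanDistance(s, goal), ignoring blocks."""
    nx, ny = world.shape
    V0 = np.empty((nx, ny))
    for x in range(nx):
        for y in range(ny):
            V0[x, y] = gamma ** (abs(x - goal[0]) + abs(y - goal[1]))
    V0[goal] = 0.0     # the terminal state keeps value zero
    return V0
```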
So from now on the basic algorithm is fixed; we are just experimenting with a few things you have seen in the theory part, and I think it is rather fun to see the effect visually. First of all, we will try multiple goals. So far we had one goal with value one; what happens with multiple goals? And before that, a question for you: what happens if you have a goal with a negative reward, while the actions are deterministic?

Sorry, you mean deterministic in the sense that the transition probability is always one?

Yes, deterministic in the sense that when you take an action there is only one outcome: you try to go somewhere, and either you can or you can't.

But in that case how do you treat the blocks and the border?

As before: if you try to move into a block you stay still, and if you try to go out of the grid you stay still. But this is deterministic, each outcome has probability one; nothing is random.

Okay. Then I would make a guess, I'm not sure, but I would say that positions with a negative reward will act as repulsors instead of attractors, because the negative reward will propagate to the nearby cells.

Good, but now there is a subtlety. In a sense yes, but we can ask: what is the size of this repulsor? If the actions are deterministic, the repulsor is actually just the single cell, because you can go as close as you want to a negative reward and never land on it. So the agent will learn to avoid it, but the cell will act as a block, not as a true repulsor: you can always move around a negative reward without ever stepping into it. So with deterministic actions a goal with negative reward does not have much impact; we will see that in other cases it is different.

Let me add one comment, because it is actually a little more subtle than that. Suppose the negative reward sits in the middle of a corridor that you have to go through to reach the goal. Then you have two options: either you step on the negative reward, which takes you along the shortest path, or you take the long detour around it, which costs you in terms of gamma, unless gamma equals one. If gamma is different from one you want to reach the target as fast as possible, so it might be worth stepping on the negative reward because it gets you there sooner.

Here, though, the negative rewards are absorbing states, just like the goal.

Ah, if the negative rewards are also absorbing, then of course what I just said does not apply. So there are all sorts of different situations that can arise depending on the specifics of the setting you choose.

And in this case there is no stochasticity, right? I think I asked you this already, but I forgot the answer.

Right, thank you, because as we discussed this is just one kind of grid world; if you have a particular system in mind you can create your own rules. For example, one could define a world where you collect a reward when entering a goal but can then move on, and where you may choose to pass over a negative cell, pay the penalty, and continue. But that is not the case here: now goals and negative rewards are terminal states.
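To put rough numbers on the corridor scenario raised above, which assumes, unlike our grid world, that the penalty cell is not absorbing, here is a toy comparison; all the rewards, path lengths and the gamma are invented for illustration:

```python
gamma, goal_reward, penalty = 0.9, 10.0, -1.0
short_len, detour_len = 5, 12    # invented path lengths: through vs around the penalty

# Reward-on-entry convention: the k-th reward collected is discounted by gamma**(k-1).
through = penalty + goal_reward * gamma ** (short_len - 1)  # penalty paid on the first step
around = goal_reward * gamma ** (detour_len - 1)
print(f"through the penalty: {through:.3f}, long detour: {around:.3f}")
# With these numbers the shortcut wins; as gamma approaches 1 the detour wins instead.
```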
One more thing about this. So basically, in this case, negative rewards will not act as repulsors, because there is no way we can accidentally step on them?

Exactly.

And they will act as repulsors as soon as we have actions that are not deterministic, in the sense that we may want to go one way but end up going another, not just "either you do the step or you stay where you are"? In the probabilistic setting, you want to go one way but you may end up another way, and then you can happen to step on the negative reward, and since it is terminal you keep your distance from it?

Exactly, and this is something we will see shortly, or you can try afterwards: we will create exactly this kind of transition, with randomness in it, and then you get exactly the kind of repulsor you are describing.

So let's take a simple case for now: we have multiple goals, but both are positive, and we are still using deterministic actions, so either you make the move you want or you stay still. You can see we now have one goal with reward one, which is absorbing, so if you end up there you are done for the day, and another goal with reward ten: if you end up there, you are also done for the day. What happens now? Let's take gamma 0.9, for example; actually, let's take a larger one, 0.95, and you will see why. Now you can see that even with two goals, the positions that are completely walled in keep value zero, because whatever you do from there you can never reach a reward, so their value is zero and their policy stays arbitrary. But all other positions actually avoid the goal with reward one, because it is absorbing: if you go there you get one, but that is everything you will get for the rest of your life; so they try to reach the goal with ten instead.

Now a very simple question. Look at this cell here with value 3.24, where the policy says go up, which eventually leads to the goal with ten. The question is: what is the gamma at which this policy switches from going up to going right? In this particular, very simple case it is a calculation we can do by hand. The answer is that we take the length of the path from here to the big goal and ask when gamma to that power, multiplied by ten, drops below one. So let's check. The goal is worth ten, and counting the steps on the map, one, two, three... I first made it 23, but recounting, it is 22. So the condition is that these arrows switch from up to right as soon as one is larger than gamma to the 22 multiplied by ten. If somebody does the proper math, please tell me the number and we will check; scroll down a little so we can see the map better. I apologize for the miscounting: I had the number written down, but I lost the count on the way.
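Taking the 22-step count from the map at face value, the check the class was asked for is one line:

```python
# Going right pays 1 immediately; going up pays 10 discounted over the 22 steps
# counted on the map. The two action values are equal when 1 = 10 * gamma**22.
gamma_star = 0.1 ** (1 / 22)
print(gamma_star)              # ~0.9006
print(10 * gamma_star ** 22)   # 1.0: at this gamma the two actions tie exactly
```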
So it is actually 22 steps, and you see that if you change gamma exactly to this value, one tenth to the power of one over 22, you get exactly the gamma at which it is no longer convenient to go up, walk 22 steps, and end up at value ten, and it becomes more valuable to just step into the goal with value one. It was a very silly exercise, I just wanted to show it. And of course, if you go well below that gamma, the pull of the small goal becomes much stronger, because you are effectively shrinking the time horizon the agent cares about: it decides it is better to have an egg today than a chicken tomorrow. You can verify this numerically.

Okay, let's add stochastic moves, and then we are done. It is very simple. I am using one possible way of creating stochastic moves, one you will see many times: I fix a probability, and with that probability you actually perform the action you want, while with one minus that probability you perform any action, uniformly at random. This has the same structure as something called epsilon-greedy; it is the standard construction: with some probability you do the thing you chose, with the remaining probability you do anything at all. The step function now does basically the same as before, but it returns two lists: a list of possible next states and a list of probabilities of ending up in each. And you can see that this matches exactly the full summation in the Bellman equation: for any action you can end up in several different positions with different rewards. If you take an action, you get a list of possible outcomes, each with its probability, and what you have to do is sum over all the possible outcomes the probability of ending up in that outcome, times the reward plus gamma times the value in the new position. Very simple stuff.

Does it change anything? We now build a new kind of world: a weak attractor here, with reward plus one; a strong attractor here, with reward plus ten; and a negative terminal state here, with minus ten. And I can change the probability of randomness, meaning the probability that I actually do what I want to do. If this is one, it behaves exactly as before: the agent just moves around the negative state as if it were a block and does not care at all that it is there. But now I change the probability to something significant, only 80 percent probability of doing what I intended, and you will see, once this finishes calculating, that from this cell, for example, it is now better to move right instead of down, because that takes you farther from the repulsive goal; and from this cell you will not take the shortest path down, but you will prefer to go up. So it already changes a lot. And you can see in the values that the color is actually lower close to the repulsor than farther away.
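A minimal sketch of the stochastic transition model just described, and of the corresponding expected Bellman backup, reusing the hypothetical `step`, `ACTIONS` and `TERMINAL` from the earlier sketch; none of this is the notebook's actual code:

```python
def transitions(s, a, p):
    """List of (next_state, probability) pairs: with probability p you perform the
    intended action, with 1 - p you perform one of the four actions uniformly."""
    out = {}
    for b in ACTIONS:
        pr = (1 - p) / len(ACTIONS) + (p if b == a else 0.0)
        ns = step(s, b)
        out[ns] = out.get(ns, 0.0) + pr    # merge outcomes landing on the same cell
    return list(out.items())

def q_value(V, s, a, gamma, p):
    """Expected backup: sum over outcomes of probability * (reward + gamma * value)."""
    total = 0.0
    for ns, pr in transitions(s, a, p):
        r = TERMINAL.get(ns, 0.0)
        v = 0.0 if ns in TERMINAL else V[ns]
        total += pr * (r + gamma * v)
    return total
```

The value iteration sweep stays the same; it just maximizes `q_value` over actions instead of the deterministic one-outcome backup.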
What is even nicer, if you want to do something really extreme: make it basically a random walk with only a small weight on the decision, so the agent can still decide, but there is only a small probability of actually doing what it decided. You then see that over most of the space the optimal policy is to head for the weaker attractor, with only reward one, because at least that way it does not risk ending up in the minus-ten state.

So, in a sense, we have now seen visually that even if the basic algorithm is very simple, the implementation details of the world can change a lot: gamma can switch the best policy from one goal to another, and if you add randomness, some policies get discarded because they could end up somewhere dangerous. Sorry for taking so long; I hope there is still time for questions, because I would be happy to answer more. Do you have a question?

Yes, I have one. I did not understand why, if we raise the discount to the power of the length of the path to the bigger attractor, we end up with a policy that goes to the weaker attractor.

Perfect, thank you. So let's go back here. The value is given by the reward plus gamma times the value in the next state. We can trace out the value moving away from one attractor, say this one. The cell next to it has value one, because the reward on entering is one and the goal's value is zero: one plus zero. At convergence, the cell one step further away has zero reward plus gamma, which was 0.95, times that value; and one step further again, zero reward plus gamma times that; and so on. So every time you move one step further away, the value of deciding to go there picks up another factor of gamma: gamma times gamma times gamma, et cetera, ending in the value of the goal state. You see what I mean?

Yes. So basically we forget that there is an attractor, because we are so far away that gamma is too small to remember that, that many steps away, there is a bigger reward?

I would not call it forgetting. The point is that in this cell you take the action that maximizes this term here; in a sense, and this is not one hundred percent accurate, you are evaluating two different paths. One action is: I take a single step, collect the reward of one, and the discounted tail does not exist. The other action is: I go on and eventually collect ten, which when you sum everything up is gamma times gamma times gamma, and so on, times ten. So you have two actions and you take the maximum: one is always worth one, the other depends on gamma. At a certain point, if gamma is small enough, this accumulation of gamma factors, even with a final reward of ten, gets smaller than one. And I chose this gamma, one tenth to the power of one over 22, which is exactly the value at which the two terms are equal: if you go right you get one, and if you go up you get gamma to the 22 multiplied by ten, which is essentially the same number.
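A quick numeric check of that maximum, with the same hypothetical numbers as before:

```python
# The two candidate returns from the discussion: 1 now, or 10 after 22 discounted
# steps. The greedy policy simply takes whichever term is larger.
for g in (0.88, 0.90, 0.1 ** (1 / 22), 0.91, 0.95):
    right, up = 1.0, 10.0 * g ** 22
    print(f"gamma={g:.4f}: right={right:.3f}, up={up:.3f} ->",
          "right" if right > up else "up")
```

Note that 10 * 0.95**22 comes out to about 3.24, which matches the value shown on the map for that cell earlier.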
If gamma is chosen exactly at that threshold, the two actions tie; if I take a smaller gamma, then in the maximum over the two actions the immediate one is larger, and for any gamma above that value the other one is. So it is not forgetting: you are calculating both terms exactly, but at a certain point one switches to being larger than the other.

Okay, I've got it, thank you very much.

If anybody else has a question... If there are no further questions, I just have one final remark. We have not run any algorithm in policy space, like the policy search algorithms I described yesterday, policy iteration or policy gradients. This could be a good suggestion for exercises and final projects: take the same grid-world environments described here and apply policy iteration, for instance; that could be a possible exercise, or whatever else comes to your mind. Okay, so with that I think we are done for the week. Next Wednesday we will start discussing problems with partial observability, so function approximation and partially observable Markov decision processes, and next Friday there will again be a tutorial on those new subjects. Okay, thank you very much. Thank you, goodbye.