We tried to cover different aspects of decision making under different conditions. We started with the Markov decision process, where the whole decision-making problem turns out to be a computational one: the agent has an accurate model of the environment, so it can forecast the outcome of its decisions and compute the optimal policy. This is done with the help of the value function, defined for a given state of the environment or for a given state-action pair, and the equations that govern the optimal value function are the Bellman equations. We then moved on to the situation where the model is known but the environment is only partially observable. In that case one has to introduce some notion of memory over which the observations are collected and organized; when the memory is perfect, the notion of belief emerges, which is connected to Bayesian inference, and, again provided the memory is perfect, it is possible to write down a Bellman equation in belief space. This is very helpful theoretically, but only of partial use from the practical point of view. Then we moved to the opposite situation, in which the environment states are perfectly observable and are given, together with rewards, as feedback from the environment to the agent, but the laws that govern the environment — the model, and how rewards are distributed — are unknown. There the agent has to rely entirely on its previous experience in order to make good decisions. For that case we saw that the knowledge developed for Markov decision processes can be exploited, which leads to stochastic approximation methods such as Q-learning, of which I gave you an example. In the last two lectures you should also have had a glimpse of how these algorithms work when coupled with a neural network, which plays the role of a universal function approximator for the Q function. These are the key ingredients: at this stage, if you open a research paper by DeepMind, you should have the conceptual tools to follow the line of thought. Of course some points may be difficult and there are lots of technicalities — it is the subject of current research — but you should be in a position to grasp the major things that are happening at the technical level. For today, this last lecture, I would like to go back to a very simple problem, the two-armed bandit — actually an even more simplified version of it.
The aim is to see in more detail the issues and difficulties of good decision making when knowledge about the environment is limited. The workhorse for today is the following decision process. As in all stochastic bandit problems of this kind, there is just one state of the environment, and here we consider two possible actions, action 0 and action 1. The outcome of action 0 is that you return to the state with probability one and reward zero. The outcome of action 1 is that you return to the original state, with probability p you get reward +1, and with probability 1 − p you get a cost, that is a negative reward, −1. This is just another version of the coin-tossing problem, except there is a single coin and the decision is whether to toss it: if it comes up heads you win one euro, if tails you lose one euro, but you have the option of passing, and if you pass nothing happens — you know this in advance, passing costs you nothing. So it is a mixed situation: there is one action about which we know everything (we know exactly how the environment reacts to it) and one about which we do not. It is a single-armed bandit with the option of passing.

First let us see why this problem can be highly non-trivial even though it is very simple, just from inspection. What is the optimal policy, assuming you knew p, i.e. how much the coin is biased? If p is larger than one half, the optimal policy is to always pick action 1; otherwise the opposite; at p = 1/2 you can go either way and get the same. Very good. Now, what is the optimal value of the unique state? Remember the definition: it is the maximum, over policies, of the long-run gain, that is, the accumulated rewards discounted by the powers of γ. For p > 1/2 you get 1/(1 − γ), which is the sum of all discount factors, times (2p − 1), which is the average you get at every toss. If instead you are in the other case, p < 1/2, the optimal value is zero.

We can also ask about the other object we introduced, the quality function of an action: how much can you get if you start by playing action 1? It means I play action 1 first and then follow the optimal policy — that is the definition of the quality function. For p > 1/2 the optimal policy is to always play action 1, so in the first round you get 2p − 1, the next time γ(2p − 1), and so on: Q*(1) is exactly (2p − 1)/(1 − γ). That was also clear from the definition, because the value is the maximum of the Q function over all possible actions. And what is the quality of picking action 0 first? In the first round you get zero, then you switch to the optimal policy, so in the second round you get γ(2p − 1), and so on: there is just one step in which you take nothing, time elapses, you pay a discount factor γ, and then everything proceeds as before. So Q*(0) = γ(2p − 1)/(1 − γ); the factor γ is the price I pay for not starting with the optimal action. Now the other case, p < 1/2. If my first action is 0, the quality — how much I can get conditioned on the first action being 0 — is zero. And if the first action is 1, I get 2p − 1 on average the first time (no γ, it is the first step) and then zero afterwards, so Q*(1) = 2p − 1, which is smaller than Q*(0) in this case, just as Q*(0) was smaller than Q*(1) in the previous one. And, as I said before, V* is always the maximum over actions of Q*, which is verified in both cases. These are just consistency checks that we are doing things right, and it is a very simple calculation. In a second we will redo the calculation allowing for some regularization — that is, for a non-deterministic policy, obtained by adding an extra reward term with weight ε; remember the regularizing term in the Lagrange functional — and we will see how we recover these results in the limit of ε tending to zero. This will be useful for what follows.
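Collecting the values just computed in one place (writing them out explicitly; the notation is the one used throughout the course):

```latex
% Optimal value and quality functions for the single-coin bandit,
% as derived above (V* is the maximum of Q* over the two actions).
\[
\text{For } p > \tfrac12:\qquad
V^* = \frac{2p-1}{1-\gamma},\qquad
Q^*(1) = \frac{2p-1}{1-\gamma},\qquad
Q^*(0) = \gamma\,\frac{2p-1}{1-\gamma}.
\]
\[
\text{For } p < \tfrac12:\qquad
V^* = 0,\qquad
Q^*(0) = 0,\qquad
Q^*(1) = (2p-1) + \gamma\,V^* = 2p-1.
\]
```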
But for the time being I want to focus on one particular situation: the one where you do not know what p is. That is the case where you are partly model-free, in the sense that you know one part of the model but not the other. Let us think about possible ways of making decisions here. Suppose you start by deciding to flip the coin — after all, if you never flip it you will never know; perhaps it always wins, so you have to try. You give it a try and you get a −1; fair enough. You keep flipping and collect some sequence of rewards r1, r2, and so on. Suppose the true p is above one half, so on average there will be more +1s than −1s in the long run, but for some reason you are a bit unlucky and you get a string with more minuses than pluses — this happens, especially when the sequence is short. What is the best estimate of p you can extract from this sequence? It is a Bernoulli process, so the best estimate is the number of times you got +1 divided by the total number of trials. In the example on the board the count gives, at this time t, an estimate of 3/10 — three pluses out of ten tosses (not seven: we count the pluses, not the minuses). Is everybody on the same page?

So, based on my current estimate, I think at this time that the coin is unfavorably biased, because the estimate is below one half, and according to this experience the best policy is to choose action 0. If I choose 0, from then on I just get the sequence 0, 0, 0, ... Is my estimate of p changing? No, because I am not flipping the coin any longer, so I stay frozen in this state of knowledge, based on my previous experience, and I cannot escape from it. Eventually my behavior will be heavily suboptimal, even if the true p was well above one half. And this can happen for any finite sequence: the estimate can be below one half even though the true average is above it. What is the trouble here? We have been too greedy: we tried to exploit the chunk of information obtained up to that time in order to fix a decision that will then go on forever. Strategies that are too greedy pose severe risks, and this is a general lesson — it does not hold only for this particular system; I will state a very general result about it in a moment.

What is the way out? Sometimes allowing for suboptimal actions. Even though at this stage my knowledge says I should play action 0, I allow some room for uncertainty: perhaps I should not be too confident in my estimate, and I should leave room for exploration — deliberately taking choices that, based on my current knowledge, look suboptimal but might turn out to be optimal in the long run. If I do that, I am refraining from taking hard decisions: I allow my policy to be random rather than deterministic at any fixed time. This, in a nutshell, is the famous exploration-exploitation dilemma. It emerges in basically every task in which resources have to be allocated sequentially. If, based on incomplete knowledge, you go fully on the exploitative side, then, as I just showed you, you will be suboptimal; but if you keep exploring forever, without ever quitting, you will also be suboptimal, because the optimal decision does not explore indefinitely. So there must be a way of balancing the two tendencies so that you eventually approach optimal behavior with the least possible loss — because exploring has a cost: while I explore the benefits of an action that, in the short run, looks negative, I take a loss; I am losing time that I could have spent on the optimal action.
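As an illustration of how a purely greedy rule can freeze on the wrong arm, here is a minimal simulation sketch (not from the lecture; the initial optimistic estimate and the specific parameter values are my own choices):

```python
import random

def greedy_run(p=0.7, steps=2000, seed=0):
    """Greedy play on the pass-or-toss bandit: toss only if the
    empirical estimate of p is at least 1/2."""
    rng = random.Random(seed)
    wins, tosses = 0, 0
    total_reward = 0
    for _ in range(steps):
        p_hat = wins / tosses if tosses > 0 else 1.0  # optimistic start: try the coin
        if p_hat >= 0.5:                              # greedy: toss the coin
            r = 1 if rng.random() < p else -1
            tosses += 1
            wins += r == 1
            total_reward += r
        # else: pass, reward 0, and p_hat never changes again
    return tosses, total_reward

# An unlucky early streak can push p_hat below 1/2, after which the greedy
# agent passes forever even though the coin is favorable (p = 0.7).
for seed in range(5):
    tosses, reward = greedy_run(seed=seed)
    print(f"seed={seed}: coin tossed {tosses} times, total reward {reward}")
```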
This is a key concept and it is very important. Notice, though, that this is a very peculiar case: as I said, it is very asymmetric, because one action is perfectly under the agent's control and the other is not. Indeed, ask the same question the other way around. Assume now that the true probability of the coin is less than one half, and that by chance your sequence is favoring you — say you get a string with nine +1s out of ten tosses, so your estimated p is 9/10, and you are under the impression that the best thing to do is to pull arm 1, even though it is not. If you act greedily in this situation — you switch full throttle to action 1 — you nevertheless keep accumulating information: minus ones will keep popping up, and as a function of the number of trials your estimate, which started above one half, must eventually approach the true value; by the law of large numbers it will cross below the one-half line with probability one. When it does, your estimate p̂ has become smaller than one half, you switch to action 0, and you never leave it — but now there is no problem, because action 0 is in fact the optimal action. In this case there is no dilemma and you can be greedy, whereas in the other case you cannot. But, as I said, this is a peculiarity of this example: in an ordinary situation the waiting action would also have random outcomes — sometimes you pay nothing, sometimes you pay a price you do not know in advance — and when you do not know that, the situation is completely symmetric, and the exploration-exploitation dilemma is present all over the phase space of possible configurations. So this is just a word of caution: this is a super-simplified model that exposes some aspects but does not have the full complexity of the decision-making process; here we would like to know more.

A question from the audience: what if the time horizon is fixed? You can indeed ask the question in different contexts; you can fix the time horizon and ask how much you should explore to optimize over that horizon. The same kind of issues emerge, provided the time horizon is long enough.
If the time horizon is very short — think of it as setting γ to zero — then you make just one decision and that is it; but with a sufficiently long horizon the same dilemmas emerge. So, what was I saying? The idea that emerges is that the degree of exploration should somehow go down to zero: at the beginning you want to explore a lot, because you want to sample the outcomes of the different opportunities, but as you gain information you would like to slow it down. Can you slow it down arbitrarily fast? No, because if you stop suddenly there will always be, even with small probability, instances in which you stick to suboptimal decisions forever, and that is bad. One suggestion from the audience is to allocate some characteristic time to learning and then stop — but that is exactly an abrupt stop, and it is one of the worst choices. Another suggestion is to maintain a probability distribution over the unknown parameter — very good; this is something that in the Bayesian literature is called Thompson sampling. If I can translate what you are saying: you maintain a belief about the parameter, some sort of prior probability distribution for what p is; as you get information you update it, so it shrinks and concentrates around the true value; and then you decide what to do according to this distribution — you pick one possible value and decide accordingly. That exact algorithm is called Thompson sampling.

So assume you are doing Bayesian inference; you are in the setting of a partially observable Markov decision process, because the parameter is what you do not know, and you want to infer it by Bayesian inference. You start with a distribution for the value of p: say you know nothing, so it is flat, every value between zero and one equally probable at the beginning; or you start concentrated around one half, because your opinion is that the coin is roughly fair. (You cannot set probabilities to zero, otherwise they can never resurge in Bayesian inference, but you can set them close to zero.) You choose the prior for your parameter depending on what you think you know in advance about the system, about this coin. Then you start flipping it, and as you flip, the posterior distribution concentrates around the true value, the real mean. Since this is a Bernoulli process, if you start for instance with a flat prior — which is a conjugate prior with respect to the Bernoulli distribution — the posteriors are Beta distributions; the details do not matter, they have a bump-like shape, and after a long time the posterior peaks and becomes narrower and narrower around the true value. The suggestion of the algorithm is then: I have my threshold at one half, and with the corresponding posterior probability I pick the action that is optimal if p is below one half, otherwise the other one. At every time you have a probability distribution for p, and therefore a certain estimate of how likely it is that the system is really below the threshold of one half.
According to that probability you pick your decision: if at this moment the posterior probability that the parameter is below one half is, say, 10 percent, then you pull action 0 with 10 percent probability and action 1 with 90 percent. As the process goes on, the distribution narrows and you explore less and less, but you never stop entirely, because there is always a little tail left. This is one possibility, and it is essentially Bayesian in spirit. Can you get stuck? No: even when you are currently choosing action 0, there is always a probability assigned to the possibility that it is the wrong action, so with some probability you switch to the other action and keep updating; you never get stuck, because there is always some exploration — you keep going back to the coin because you still think there is a small probability that the parameter is not where you currently believe it is. And the point is exactly this: how fast do you approach the optimal choice, and how many times do you fail to make it? If you knew p from the outset with infinite precision, you would play the same string of actions forever; but you do not know it, and as you accumulate information there will still be times when you pick the wrong action. That is exactly the thing you have to accept: you never stop exploration, you only make it smaller and smaller and smaller, but always finite at any finite time.

This is a good algorithm for bandits. It was proven, a couple of years ago I think, that Thompson sampling is asymptotically optimal, in a sense that is a bit technical. It is not the best you can do in absolute terms — for this kind of problem you can even write down the optimal policy, as we did, and you can do so for more complicated problems — so it is not the exactly optimal solution at all times, but it is a very good one, and a good idea in this context. One limitation is that it is Bayesian. Suppose you do not want to use priors and you just want to rely on the string of observations. Then there is a very similar class of algorithms that works as well, called UCB, from upper confidence bound; there is a whole wealth of them, in various declinations. What is the idea? If you are not Bayesian you do not have a distribution; the only thing you have is the empirical average, which in the long run is the maximum-likelihood estimate of your parameter. Then you ask a statistical question: how confident am I in this estimate? Say I want to be confident at, I don't know, 95 percent; then you account for that confidence by saying that 5 percent of the time you will just pick the other action — which is again another way of allowing for exploration. This is just to say that all algorithms that are provably good — meaning that they perform well, or optimally, in certain limits — have to account for the possibility that your current, limited experience does not reflect in full the real properties of the system.
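To make the Thompson-sampling recipe concrete for this pass-or-toss bandit, here is a minimal sketch (my own illustration, using a flat Beta(1, 1) prior; the lecture only describes the idea):

```python
import random

def thompson_run(p=0.4, steps=5000, seed=0):
    """Thompson sampling on the pass-or-toss bandit.

    Keep a Beta(a, b) posterior over the unknown p; at each step draw a
    sample from it and play the action that would be optimal if that
    sample were the truth (toss if sample > 1/2, else pass)."""
    rng = random.Random(seed)
    a, b = 1.0, 1.0            # Beta(1, 1) = flat prior over p
    total_reward = 0
    for _ in range(steps):
        p_sample = rng.betavariate(a, b)
        if p_sample > 0.5:     # optimal action for the sampled p: toss
            win = rng.random() < p
            total_reward += 1 if win else -1
            a += win           # Bayesian update of the posterior
            b += not win
        # else: pass, reward 0, posterior unchanged
    return total_reward, a, b

reward, a, b = thompson_run()
print(f"total reward {reward}, posterior Beta({a:.0f}, {b:.0f})")
```

Drawing one posterior sample per step and acting optimally for it is exactly what produces "choose action 0 with the posterior probability that p is below one half" in this two-action case.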
So, can we get from this very simple example an idea of how we should reduce exploration as we increase the number of trials? Actually there is a very simple estimate one can do in this case; let me report it here. It seems to me that the key ingredient is to estimate the probability that my estimate of the unknown parameter after t trials is less than one half, when we know the actual p is larger than one half. What is the quantity that governs this probability? Large deviations. (Is anybody familiar with large deviations — or, the other way around, is anyone totally unfamiliar with them? Some of you had a course on it, some of you have no idea what I am talking about; that is fine.) Given the true average, how does the probability of this event decrease in time? The key result is that you can show that it goes down exponentially, with a rate that in this case takes the form of a Kullback-Leibler divergence — familiar to those of you who have done large deviations, and for those who have not it does not really matter. The important thing to notice is that the probability falls off exponentially in n_t, the number of times action 1 has been taken, and we want to kill this probability in the long run. (Yes, the one half here is just the threshold: the same expression can be written for any value below the true p; and yes, this is written for Bernoulli distributions.) I do not want to spend too much time on it; the point of the explanation is only that this probability is exponentially small in the number of trials. Therefore, if you want to kill it, the least you can do is to have n_t grow logarithmically with the overall number of trials: since the rate is positive, if n_t goes like some positive constant times log t, then for large t (much larger than one) this quantity goes down like a power law, t to some negative power b — and yes, of course b is a number that depends on p, because the rate D is a function of p. What I want to say is that this is telling us there is a strong bound on the number of times you have to keep visiting the action you currently believe to be suboptimal: you cannot reduce that number faster, because otherwise you will not be able to contract this probability sufficiently fast. That is the basic message from large deviations theory here: you have to keep exploring, keep sampling, otherwise your distribution will not shrink around the actual value. (And yes, I am assuming this is Bernoulli — that is what I am writing; it applies to other distributions as long as we do not have fat tails. I am being very conventional: it is hard enough when you have coins, we do not need more complications.)
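Spelling out the estimate referred to on the board (the notation is mine; the rate is the standard Bernoulli form of the Kullback-Leibler divergence):

```latex
% Probability that the empirical estimate after n_t tosses falls below 1/2
% although the true bias is p > 1/2 (large-deviation / Chernoff estimate):
\[
\Pr\big(\hat p_t \le \tfrac12\big) \;\approx\; e^{-n_t\, D(\tfrac12 \,\|\, p)},
\qquad
D(q \,\|\, p) \;=\; q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}.
\]
% If the number of tosses grows logarithmically, n_t \simeq c \log t, then
\[
\Pr\big(\hat p_t \le \tfrac12\big) \;\approx\; t^{-b},
\qquad b = c\, D(\tfrac12 \,\|\, p) > 0 .
\]
```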
So, the logarithmic growth means that you take the bad action an exponentially smaller number of times compared with the other one, but it does not mean that you ever stop — that is exactly the point I am making. In order for this error probability to go to zero with t you cannot just cut exploration short: if you are greedy, n_t stops at some finite value, and then there will always remain a finite probability that the system was actually in the other situation, in which case you are stuck being suboptimal. So you want to allow n_t to grow, as slowly as possible, and "as slowly as possible" turns out to be the logarithm. Is that just a choice of the algorithm? No: we want the misleading-estimate probability to go to zero, and a power-law decay is the slowest we can accept — which is exactly what a logarithmically growing n_t gives. Doesn't this mean we choose the bad option many times? Yes, and that is the point: you would like to compress this number, and compress it fast, but you do not know in advance whether the action you are sampling is the good one or the bad one, so this is the compromise you have to strike; otherwise you end up on one of the two bad sides. And to clear up a confusion I introduced: sorry, n_t here counts the tosses of the coin, which in the unfavorable case (true p below one half) are exactly the wrong choices — the ones you would like to make as rarely as possible — but you still need them to grow fast enough to kill the error term. That is the compromise: you want to keep learning, but not too much, because otherwise it impacts your performance.

There was a question — this becomes technical, and I can only state the result, which requires computations we will not do; the estimate above was just to give you some intuition. It is now time to state the generic result, which is valid for any number of actions. Consider the situation where you have arms 1, 2, 3, ..., each giving +1 with some probability p1, p2, p3, ... and −1 otherwise: this is an example of a K-armed Bernoulli bandit, and you can generalize to any distribution of outcomes for the rewards, so the result is even more general. The basic result, due to Lai and Robbins, says that the number of times you pick a suboptimal decision scales like log t. Actually one can prove the following sharper statement: the limit, as t goes to infinity, of the expected number of suboptimal choices divided by the logarithm of the number of trials is bounded from below by a positive constant — which I am not writing here, but which is related to the Kullback-Leibler divergences in the system. So, regarding the earlier suggestion of something like log log t: that would violate this bound; log log t is not exploring enough. This is essentially a consequence of large deviations theory: there are statistical limits to how well you can do at these games.
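In symbols, the bound just stated reads as follows (this is the standard form of the Lai-Robbins result for Bernoulli arms; the lecture states it in words, and the constant was not written on the board):

```latex
% Lai-Robbins lower bound: N_a(t) = number of pulls of a suboptimal arm a
% up to time t, p_a its success probability, p^* that of the best arm.
\[
\liminf_{t \to \infty} \; \frac{\mathbb{E}\,[N_a(t)]}{\log t} \;\ge\; C(p_a, p^*) \;>\; 0,
\qquad
C(p_a, p^*) = \frac{1}{D(p_a \,\|\, p^*)},
\]
% with D the Bernoulli Kullback-Leibler divergence defined above; any
% schedule growing more slowly than \log t (e.g. \log\log t) violates it.
```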
(Yes — I see the point you are making; I will need some time to think it through clearly, so let us discuss it after the lecture.) As for the constant: is it just any positive constant? No, it is a specific constant that can be expressed in terms of the distributions of the rewards. There is a bound, it is positive, and you cannot trespass it, which means the number of times you take the bad actions cannot grow too slowly. The result holds for all reward distributions; it is only the constant that changes — that quantity is distribution-dependent.

All right. This was a long detour just to show you that there are several interesting problems to be faced when information is incomplete. In fact — and this is basically the last thing I am going to tell you in this course — I would like to show you a way of interpreting these results in terms of one specific algorithm, which is Q-learning. So let us go back to this situation and start from the Bellman equation for this problem. In this particular case it is pretty simple: there are two equations, for Q(0) and Q(1), and they depend on ε — these are the quality functions of the ε-regularized problem. The optimality equation for any action, in bandit problems, says that the regularized quality of action a is the average reward you get from that arm, plus γ times ε times the logarithm of the sum, over actions, of the exponentials of the qualities divided by ε; and the policy, the choice of the action, is the corresponding softmax. These are the optimality equations with regularization for bandit problems, and the first term is the average reward you collect when you play action a.

We then showed that these equations can be turned into a stochastic approximation algorithm. What it does is replace the true values with estimates: at time t + 1, my estimate of how much I can get out of an action equals the old estimate plus a learning rate α times an error term — the instantaneous return r_{t+1} that I just got from that action, plus γ ε times the log-sum of exponentials of the previous estimates, minus the previous estimate itself. We interpret that bracket as the error the agent is making; the idea is that the algorithm corrects for overestimates and underestimates, and eventually, given an appropriate scheduling of α and a suitable way of decreasing the parameter ε — which you can now clearly interpret in terms of exploration — it converges to the optimal behavior in the long run.
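Written out, the two relations just described are (the notation is mine; r̄(a) denotes the mean reward of arm a):

```latex
% Regularized (soft) optimality equation and softmax policy for a bandit:
\[
Q_\varepsilon(a) \;=\; \bar r(a) \;+\; \gamma\,\varepsilon
\log \sum_{a'} e^{\,Q_\varepsilon(a')/\varepsilon},
\qquad
\pi_\varepsilon(a) \;=\; \frac{e^{\,Q_\varepsilon(a)/\varepsilon}}
                               {\sum_{a'} e^{\,Q_\varepsilon(a')/\varepsilon}} .
\]
% Stochastic-approximation (soft Q-learning) version, with learning rate \alpha_t,
% applied to the action a_t actually taken at step t:
\[
\hat Q_{t+1}(a_t) \;=\; \hat Q_t(a_t) \;+\; \alpha_t
\Big( r_{t+1} \;+\; \gamma\,\varepsilon
\log \sum_{a'} e^{\,\hat Q_t(a')/\varepsilon} \;-\; \hat Q_t(a_t) \Big).
\]
```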
Now let us write these equations for our specific decision problem. How do they look? Start with the arm we know. If I take action 0, I know in advance that I am not going to get anything by waiting, so I can set the estimate Q(0) equal to zero at all times: this estimate is perfect, and no observation will ever change it — it is the part of the model that I know. I am then left with an equation for a single quantity, which I will call q_t, namely q_t := Q̂_t(1): my running estimate of how much I will get if I play action 1, written with a simplified notation. What is the update equation for q_t? Of course, the update takes place only at the steps where the action actually picked is action 1. (And yes, q̂ is just your guess — you expect it to converge to the optimum upon iteration; you have to start somewhere, and in this particular case you can start the known arm exactly at the correct value, zero, because you know nothing will come from waiting.)

So q_{t+1} equals q_t plus the learning rate times one of several possible terms. If we pick action 0, nothing changes: the estimate for action 0 is zero and stays zero, and we are not updating our estimate for action 1, so the increment is zero. How often does this happen? With the probability of action 0, which under our softmax definition is π_t(0) = 1/(1 + exp(q_t/ε)); correspondingly, π_t(1) = exp(q_t/ε)/(1 + exp(q_t/ε)). If we pick action 1, two things can happen. Either we get a +1, in which case the increment is α times [1 + γ ε log(1 + exp(q_t/ε)) − q_t] — the sum inside the logarithm has two terms, exp(0) and exp(q_t/ε) — and this happens with probability π_t(1) times p, the probability of getting +1; or we get a −1, in which case the increment is the same with −1 in place of +1, and this happens with probability π_t(1) times (1 − p), which is what is left over so that all possible events add up to probability one. This is the algorithm at work for a given ε, and the idea is that it should converge to the best value of q — the one we computed earlier — as ε goes to zero. It is just the soft Q-learning algorithm written out in full for this specific case. From now on, that single line is the only thing I need; it is relatively simple, because it is just a jump process on the line. (What about the sigma, the softmax? It is still there — the sum is just over the two actions, one contributing exp(0) = 1 and the other exp(q_t/ε); I simply expanded it.) So the question we are asking is: does this algorithm reach the optimal value of q? It is a jump process: sometimes you stay in place; sometimes a plus term drifts you to the right and you increase your estimate of the value of what you are doing; and sometimes you get a minus, which tells you that this was not such a good thing to do, and you revise your estimate down to smaller values of q.
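As a concrete sketch of this update loop (my own code; the initial value of q and the parameter values are arbitrary choices, not from the lecture):

```python
import math
import random

def soft_q_learning(p=0.7, eps=0.5, gamma=0.9, alpha=0.05,
                    steps=20000, q0=0.0, seed=0):
    """Soft Q-learning on the pass-or-toss bandit.

    Q(0) is pinned to 0 (the known part of the model); q estimates Q(1).
    Actions are drawn from the softmax policy with temperature eps."""
    rng = random.Random(seed)
    q = q0
    for _ in range(steps):
        pi1 = math.exp(q / eps) / (1.0 + math.exp(q / eps))  # prob. of tossing
        if rng.random() < pi1:                               # action 1: toss the coin
            r = 1.0 if rng.random() < p else -1.0
            soft_max = eps * math.log(1.0 + math.exp(q / eps))
            q += alpha * (r + gamma * soft_max - q)          # soft TD update
        # action 0: pass, nothing to update
    return q

print(soft_q_learning())   # for p > 1/2 the estimate should drift up toward ~(2p-1)/(1-gamma)
```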
In what sense does this converge? It is a stochastic process making jumps on the line; that is the basis of the stochastic approximation algorithm. One way of understanding qualitatively what happens with this kind of algorithm is to take the learning rate small. If α is small, the process proceeds in small steps and we can make a continuum approximation in time, turning the process into a drift process with noise — we are just taking the continuous-time limit. If we run the process long enough and allow only small steps α, we can study its evolution as a continuous, essentially deterministic process, and this is what people actually do in order to study the convergence properties of these algorithms in the long-time limit. In this limit the increment per step becomes a time derivative, and to get the mean drift I have to combine the three possible events: with probability π_t(0) the process stays put, so that event does not contribute to the drift, and if I pick action 1 it moves by the bracketed amount. To avoid complications, let me set γ = 0 here. You know from the earlier calculation that it does not really matter: it changes how much you get, but not the policy, because you always come back to the same state and repeat, so with just one state γ only changes the overall numerical factor and not the substance of the policy. (If you have more than one state, γ is important and must be kept.) For this particular case we can then forget the look-ahead term, and what remains is very simple: on average the increment behaves like (2p − 1 − q) multiplied by the factor 1/(1 + exp(−q/ε)), plus a noise term which I am not writing — it is a somewhat more complicated expression, but you can derive it by computing the variance of the jumps, and in the limit you get a drift-diffusion process with a relatively complicated noise. Let us forget the noise and focus on the drift: where is this deterministic system going? It is a very simple one-dimensional dynamical system in the q variable, and we expect our algorithm to behave like it when the learning rates are scheduled appropriately. So it is worth having a look: let us plot q̇ as a function of q for different values of ε. I am plotting just one simple case here, but the behavior is easy to work out analytically.
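Writing out the drift term just assembled (γ = 0; the noise term is left out, as in the lecture):

```latex
% Mean-drift (small-alpha, continuous-time) limit of the soft Q-learning update:
\[
\dot q \;=\; \big(2p - 1 - q\big)\,
\frac{1}{1 + e^{-q/\varepsilon}} \;+\; \text{noise}.
\]
% The unique zero of the drift is q^* = 2p - 1, the correct value for
% \gamma = 0; the prefactor is \approx 1 for q \gg \varepsilon and
% \approx e^{q/\varepsilon} (exponentially small) for q \ll -\varepsilon.
```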
If q is positive and larger than ε, the exponential in the denominator can be neglected, the prefactor is essentially one, and the drift is just the linear, decreasing term. This is a very nice region for our approximation algorithm to work in, because this is the function whose zero we want to find, and here it is monotonic and decreasing: if I start with my guess in this region everything is fine, because the algorithm self-corrects — when the estimate is too high the drift is negative and pushes q down, when it is too low the drift is positive and pushes it up — and with smaller and smaller steps you converge to the zero, which for γ = 0 is exactly the best value you can achieve. So this part is fine. But now look at what happens for negative q, that is, when the current guess is on the wrong side because, out of bad luck, a sequence of negative events — the noise — brought the estimate far below the real optimal value. The linear factor is the same, but now the exponential in the denominator becomes large, because q is negative, so the prefactor is exponentially small: the drift falls off extremely rapidly on that side, and as you decrease ε, in the limit of very small ε, the curve becomes a sawtooth-like profile — linear on the right, then dropping down and essentially flat at zero on the negative side.

What is the meaning of ε? That is exactly what I am coming to. As I said, if you look at the policy — the probability of picking each action — ε is the rate of exploration. If I draw the probability of picking action 1 as a function of q, it is a sigmoid; if I decrease ε it becomes sharper. So ε controls the width of this transition, pretty much like the weight in a neuron: when ε goes to zero the response is very sharp, and you take one policy or the other depending only on whether q is negative or positive, which is what actually optimizes the process; whereas if ε is large you explore, meaning that sometimes you choose the action which, according to your current knowledge, is not the best one. So ε is the parameter that controls exploration, weighing the entropy against the maximization of the gain. And notice that the fall-off of the drift curve on the negative side is exactly due to the frequency with which you choose to explore — it is precisely that probability, the probability of picking action 1.

So what is the problem? Everything is perfectly fine if you happen to lie on the positive side of q: if you are to the left of the zero the drift is positive and pushes you toward it, if you are to the right it is negative and pushes you back, so it is a stable fixed point of the algorithm — in fact a global one for any positive ε, because even on the negative side the drift is still positive for any finite ε. But there it is exponentially small, so if you happen to lie there it takes an extremely long time to get across.
It is pretty much as if you had a parabolic well into which your particle falls, but at some point the potential flattens out: on one side of the well the potential is essentially flat. When the particle is in that flat region it moves very, very slowly, and if you set the exploration to zero it does not move at all: there is a whole set of metastable states that go nowhere if you do not explore, if you do not add some noise, if you do not shake the system. So what is the best thing to do? Again, you would like to start off with a large ε, which pushes this bump much further to the left, so that initially the potential is very smooth; then, as you slowly let the particle fall, you can close it up — basically, once you are sufficiently sure the particle is within the right range, you can really turn the exploration down. But the particle can still occasionally climb back up because of noise, so there is a limit on how fast you can pull ε down.

And the last thing I want to tell you is that, as we said, all of this is controlled by the parameter ε. Now forget the decision-making process entirely for a second and ask: if this were a problem of finding a minimum, and you had this problem of a large set of metastable states, how would you tackle it in a numerical simulation? One possible solution is what is called simulated annealing, which can tackle even harder problems in which there genuinely are other minima. The idea is that you keep the temperature high at the beginning and then lower it — and ε is the temperature in our physical analogy. So what we are doing here is very much the kind of thing you would do in simulated annealing, in a different language, with some notable distinctions. Nonetheless, if you study simulated annealing you will discover that there, as well, there is a hard bound on the rate at which you can lower the temperature: you cannot just switch it off rapidly, otherwise the system gets stuck somewhere. And what is the law by which you lower the temperature in simulated annealing? It is logarithmic in time — the temperature can go down at most like a constant over the logarithm of time. That is exactly the same reason for which we had the logarithmic bound before, now interpreted in a physical language. All these connections are qualitatively clear, but not all of them have been studied in full extent, so there is still a lot to do and to understand about these systems, and there is an emerging belief that many techniques from statistical physics could be the relevant language in order to understand and describe all of this, alongside the usual, complementary languages of mathematical statistics and decision-making theory. So there is probably a very large research avenue in that direction as well.
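Connecting this back to the earlier soft Q-learning sketch, one could anneal ε on a slow, roughly logarithmic schedule rather than keeping it fixed. A sketch of that idea (the schedule and all parameter values are my own choices, not ones prescribed in the lecture):

```python
import math
import random

def annealed_soft_q(p=0.7, gamma=0.0, alpha=0.05, steps=50000,
                    q0=-0.5, eps0=2.0, seed=0):
    """Soft Q-learning with an annealed temperature eps_t ~ eps0 / log(t).

    Starting from a pessimistic estimate q0 < 0, a large initial eps keeps
    exploration alive; the slow (logarithmic-type) cooling mimics the
    simulated-annealing schedule discussed above."""
    rng = random.Random(seed)
    q = q0
    for t in range(1, steps + 1):
        eps = eps0 / math.log(t + 2)                 # slow cooling of the temperature
        pi1 = 1.0 / (1.0 + math.exp(-q / eps))       # prob. of tossing
        if rng.random() < pi1:
            r = 1.0 if rng.random() < p else -1.0
            soft = eps * math.log(1.0 + math.exp(q / eps))
            q += alpha * (r + gamma * soft - q)
        # else: pass, nothing to update
    return q

print(annealed_soft_q())   # despite the pessimistic start, q should drift toward 2p-1 = 0.4
```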
So that was basically the last technical thing I wanted to say; I do not have time to approach any other subject in these remaining minutes, so let me just sum up by telling you about things that were not covered by this course. All the discussion and the algorithms that I presented are deeply rooted in the notion of the value of a state, or of a state-action pair, which is an extremely fruitful language. All algorithms of this kind go under the name of learning with a critic, in the sense that the value function acts as a critic of what the agent decides to do. The agent does not have a teacher that says "this is the good action, this is the bad action", correcting and guiding it when it is wrong, as in supervised learning; in reinforcement learning the best you can have is a critic, a system that provides feedback in the form of a general encouragement or criticism depending on what you do. That is the basis of these algorithms. But it is important to know that this is not the only way to approach the decision-making process. If you think about it, from the beginning the goal is just to optimize over a policy — the gain is a function of the policy — so can you bypass all these notions of value functions and Q functions and still find effective algorithms? Yes, you can: you can devise algorithms that work directly in the space of policies and do a search in policy space. You give up completely on the notion of a critic, and this class of algorithms is called actor-only algorithms, because everything is in the mind of the actor, which has a policy and tries to improve it, without any recourse to a value function in which information about how good the environment is in different states is collected and stored. These are very valuable, because sometimes it is very difficult to construct value functions — for instance with neural-network approximations, as we said — and you can dispense with that altogether. If the number of actions you can take is small, it might be more convenient, even when the configuration space is large and the environment is very high-dimensional, to use these policy gradient techniques, which search directly in the space of strategies. There is a very well developed mathematics behind them, which we did not touch at all, but you can find them discussed at length in the reinforcement learning book. Which approach is better depends on the situation; I cannot really say there is a generic advantage. In some situations policy gradient is better, in others value-based methods are better; it turns out that many practical algorithms are value-based — the AlphaGo example we discussed is value-based — but, for instance, many algorithms that run on robots just use policy gradient techniques. So I cannot make a firm statement that one is better than the other: there are pros and cons, depending on the structure of the actions you have to take, on how you can parameterize the policies, and so on. There is not a unique answer to this problem.
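As an illustration of the actor-only idea, here is a minimal REINFORCE-style sketch of my own for the pass-or-toss bandit (the lecture does not give a specific policy-gradient algorithm):

```python
import math
import random

def policy_gradient_run(p=0.7, lr=0.1, steps=20000, seed=0):
    """Actor-only learning on the pass-or-toss bandit.

    The policy is parameterized directly: prob(toss) = sigmoid(theta).
    No value function is kept; theta is moved along a score-function
    (REINFORCE-style) estimate of the gradient of the expected reward."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        pi_toss = 1.0 / (1.0 + math.exp(-theta))
        toss = rng.random() < pi_toss
        r = (1.0 if rng.random() < p else -1.0) if toss else 0.0
        # d log pi(a) / d theta: (1 - pi) if we tossed, -pi if we passed
        grad_log = (1.0 - pi_toss) if toss else -pi_toss
        theta += lr * r * grad_log            # stochastic gradient ascent step
    return 1.0 / (1.0 + math.exp(-theta))     # final probability of tossing

print(policy_gradient_run())  # should approach 1 for p > 1/2
```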
And then there is a third big architecture of decision-making algorithms, which basically combines the two: these are called actor-critic algorithms, in which the agent does both things — it searches over policies and it improves its knowledge of values — using both of them, rather than focusing only on the value and then deriving the policy, or only on the policy. These are also very interesting. This is something that is very much debated and a subject of current study in neuroscience: there is evidence that some of these algorithms are actually implemented in our brain. There is clear experimental evidence that in our brain there is a reward system mediated by dopamine signals, and there are neurons that actually compute temporal-difference errors — we know this for sure in monkeys, and we assume the same holds in our brain — neurons computing objects that are algorithmically equivalent to the kind of quantities we have been discussing in reinforcement learning. And there are brain structures that seem to act like an actor-critic algorithm. So there is a lot of work in neuroscience that tries to understand how our brain works on specific tasks and compares it directly with the purely theoretical knowledge coming from decision processes.

There are also many other interesting issues that I have not been able to touch upon; let me mention very briefly at least a couple of them, which are again subjects of current research and very interesting. One is the so-called inverse reinforcement learning problem. It is a way of trying to answer this question: suppose I observe an agent performing a certain series of actions — I see somebody acting in some way — can I reconstruct why he is acting like that? What is the goal of the agent? Can I do the inverse inference, from the actions to the goal, to the rewards he is after? Is this possible? In general it is an ill-defined problem, because there may be many different kinds of rewards that produce the same behavior; but being so ill-defined also makes it very interesting, because you may find ways of defining it better and then carry out this process of inverse inference, from behavior to motivations, which is of course very interesting in itself.

A related problem goes under the name of reward shaping. What is the idea? It originates in a behavioral experiment by Skinner, one of the fathers of behaviorist psychology. Skinner was working with pigeons, and he wanted to run experiments in which, if the pigeon puts its beak on a particular button, it gets a grain as a reward. But in practice, when you put a pigeon in a cage — unless the cage is really very small — it may take a very long time. Remember the video with the mouse: that was a very confined environment, and it still took several tries before it discovered the trick. The pigeon has to discover the button before getting any possible reward, and in the meantime it gets nothing: the space of actions it has to explore is very flat, with no particular reason to go to the button rather than anywhere else, so the discovery must happen at random, and if the arena — the search space — is large enough it can take forever before the pigeon randomly hits the right spot; only at that point can it start to learn. So there is a long transient before learning begins. What Skinner did — very simple, in hindsight — was to put grains close to the button, visible to the pigeon, so the first thing the pigeon learned was to move closer to the button; he basically laid a little trail of grains, the pigeon very quickly learned to go to the button, and from there the process took off. This is what shaping rewards means: you want to obtain a certain behavior, and you can induce it by changing the rewards.
In mathematical terms, this means that a given behavior — a given optimal policy — can originate from very different arrangements of rewards. Think about the following problem: you have to locate a target, as in grid world. There is an arena, there are obstacles and boundaries, you start from here and you want to reach the target there, finding the shortest path. One possible way of assigning rewards is to give nothing anywhere: only when the agent reaches the target does it get its prize. That is the kind of situation where you have very little information: unless you stumble onto the target you will never discover that there is a target at all; once you have discovered it, the information slowly propagates backwards and you will eventually find the right behavior, but there is a very narrow bottleneck, in terms of entropy, in the number of trials you have to make before getting there by chance. But remember, the task was to find the shortest path. Now suppose you assign rewards in another way: imagine the target produces some kind of signal that diffuses, and the agent is able to sense it; if the agent interprets the signal as a reward, it will climb towards the target much faster, because it can infer from the signal where the target is. The final result of the process is the same — eventually the agent goes from S to T along the shortest path — but the speed of learning in the second case is very different. So if you care about one specific behavior, there are ways of changing the rewards that give the optimal result in a much shorter learning time. This is very interesting and important in practice: if you have to train a robot, or train an algorithm, and you choose a reward structure with a long transient during which you are not really encouraging your agent towards the good actions, it will take a long time; whereas if you shape your rewards carefully, you obtain exactly the same optimal behavior in a much shorter time. I am only describing this in words, but there are ways of making the concept quantitative, and there are theorems about which class of reward shapings is allowed — that is, which shapings preserve the same optimal policy.

With this, I think I am done. Tomorrow there will be the exam; it is going to be a series of questions and small exercises with multiple choices, and you should be able to handle all of it. Today at 2:30 we have the interactive session on work, law and ethics in the age of artificial intelligence. Enjoy your lunch. Oh — yes, a question?