All right, so a quick recap of yesterday's lecture on partially observable Markov decision processes. We've seen that there is a way to merge planning with inference by introducing this notion of a partially observable Markov decision process. The key conceptual object we introduced is the notion of a belief, which is a probability distribution over states. The agent assumes that it has no direct knowledge of the states; it has only indirect information through observations, Y for us. The agent has an inference device, which is Bayesian inference, adapted to the situation where there is an underlying hidden Markov process, and it uses the combination of these two notions, a belief and an inference rule, to construct a decision process that picks actions based on beliefs. I didn't show you all the math, but I introduced the main idea for solving POMDPs: realize that a POMDP is nothing but a Markov decision process in the space of beliefs, in which the transitions between different beliefs are given by Bayesian inference. It's a rather abstract view, which we will try to make concrete in one specific example today.

It is nevertheless quite a powerful formal approach, because it allows us to write down a Bellman equation for partially observable Markov decision processes which has roughly the same structure as the ordinary Markov decision process Bellman equation, as it should, because it has to reduce to it in the case where states are actually detectable, measurable. In the specific case where the observation function, the probability distribution of observations, collapses to delta functions on the states, that is, observations are the states themselves, the whole expression simplifies: the beliefs go to the corners of the belief space, meaning you get perfect knowledge of where you are at every time, and then this reduces exactly, formally, to the customary Bellman equation for Markov decision processes.

As we highlighted, the big difficulty is that this equation is now a functional equation, a nonlinear functional equation, because the value function is not a vector any longer but truly a function on a continuous space. Like I told you yesterday, but it's good to say it again today: how do you go about trying to solve this equation directly? There are some techniques, relatively recent ones, that try to solve it. To fix ideas, let's consider for a second our usual example, a system with just two states and two actions, the most general one, with arbitrary transition probabilities and so on; I'm not detailing all the transitions here, but you've seen this already several times. What is important here is that, since there are just two states, your belief space is one dimensional: it's just the interval from zero to one. Why is that? Because you can think of this as the probability of being in state one, and of course the probability of being in state two is just one minus that. So there is a single real variable between zero and one that describes the belief space. One general idea, which I will describe only qualitatively because it's a generic approach to solving the Bellman equation for POMDPs, is to construct a finite set of points. This approach is called point-based value iteration.
The idea is very simple: rather than working with the full segment, we choose a certain set of points, and we try to solve our Bellman equation basically only on this set of points. That's the underlying basic idea, in order to have a manageable set of beliefs. Of course, these beliefs that you choose must be reachable. This means that if you start from your initial belief, which might be, for instance, one half, one half, if you have a 50% prior on where you are in space, you must be able to reach each of those beliefs in a finite number of steps, given the actions and observations. One way to construct such a set is to run the model forward and pick some policy, for instance at random: you pick an action at random, you make an observation, you end up with a new belief; you pick another action, another observation. If you collect all the new beliefs obtained by iterating, clearly you get a lot of points, because at every step you can take two actions and make two observations. So from one belief there will be four beliefs after the next step, sixteen after two steps, and after ten iterations you end up with about a million beliefs. These of course do not exhaust all possible beliefs, but they provide you with a set of points which is a starting point for running value iteration on those points.

Then the next important idea: if you try to plot the optimal value function over this belief space, it has a characteristic shape. You can actually prove that this value function, the optimal value as a function of the belief, is always a convex function. So one idea is to approximate your value function with a piecewise linear function based on these points. This is the other important ingredient, qualitatively: replace a continuous function with its piecewise linear approximation. Then you do one iteration of your value update, and after that iteration you have to do other technical operations, like pruning and enriching your set of beliefs: beliefs at which your approximation is not helping, you remove, and you look for other beliefs where you want to improve the approximation. So it's a quite complex algorithm, but the basic ingredients are these: work with a finite set of points and use piecewise linear approximations. Then there are lots of technical results; for those of you who are interested, I can point you to a reference. This is just to convey the message that there are algorithmic ways of approaching the solution of a POMDP, also in rather large spaces. It's important to know that, but we will not go into the details.
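Coming back to the reachable-belief construction for a moment, here is a minimal sketch of how such a set could be generated by forward simulation for a generic two-state, two-action, two-observation POMDP. The transition and observation matrices are made-up numbers, and the rounding-based deduplication is an illustrative choice; this is not the lecture's actual code.

```python
import numpy as np

# Hypothetical two-state, two-action, two-observation POMDP (made-up numbers).
# T[a][s, s2] = P(s2 | s, a);  O[a][s2, o] = P(o | s2, a).
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.6, 0.4], [0.3, 0.7]])}
O = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.7, 0.3], [0.1, 0.9]])}

def belief_update(b, a, o):
    """Bayes filter: b'(s') is proportional to O(o|s',a) * sum_s T(s'|s,a) b(s)."""
    b_new = (b @ T[a]) * O[a][:, o]
    return b_new / b_new.sum()

def reachable_beliefs(b0, depth):
    """Collect all beliefs reachable from b0 within `depth` action/observation steps."""
    frontier, seen = [b0], {tuple(np.round(b0, 6))}
    for _ in range(depth):
        nxt = []
        for b in frontier:
            for a in (0, 1):
                for o in (0, 1):
                    nb = belief_update(b, a, o)
                    key = tuple(np.round(nb, 6))
                    if key not in seen:          # merge numerically identical beliefs
                        seen.add(key)
                        nxt.append(nb)
        frontier = nxt
    return seen

print(len(reachable_beliefs(np.array([0.5, 0.5]), depth=6)), "reachable belief points")
```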
What we will do today, like I told you yesterday, is look at a subset of problems which lend themselves to a much easier treatment. These are the problems in which your transition probabilities and your observation models are, in a sense, simple: they belong to a class of likelihoods and transitions described by simple probability distributions, and there exist conjugate classes of prior distributions such that the Bayesian update leaves the beliefs in that class. That was a highly formal statement; in words it says just this: if the beliefs are Gaussian, the transitions are Gaussian and the likelihoods are Gaussian, then the new beliefs will be Gaussian, and we can solve the Bellman equation not in the space of probability distributions but in the space of the parameters of the Gaussians, which is much smaller, of course. We will not do that today. What we will do today is discuss the problem of Bernoulli bandits, that is, coin flipping, deciding which coin to flip. In that case the transition probability is simple because it's just the identity: states don't change. And the likelihood of observing an outcome is Bernoulli. As we will see, if we use beta distributions for the beliefs, we can write down the Bellman equation in a much, much simpler form and solve it. So that's the plan for today. If there are any questions on this sort of broad framework, please do not hesitate to ask. Otherwise, I'll leave the floor to Emanuele, and I have to make you host, which I will do immediately. Okay, so I will share my screen. You're the host now. Okay, perfect.

So thanks for being here and for the introduction. Today, as we just saw, we will deal with two-armed bandits in the case of a partially observable system. I've split the lecture in two, but the timing is completely flexible, in the sense that we will have a break in the middle, but I'm not sure when this middle will be. Please ask all the questions, and if at a certain point you are exhausted, that's a good place to make a break. In the first part we will see, in this very specific case, how these ideas of beliefs and Bayesian updates work in a concrete way. First we will look at this system of coin flipping: what we know and what we don't know. We will lose our complete information about the state, so we will introduce beliefs; we will see what these beliefs actually are and how to deal with them, and we will exploit the fact that the Bernoulli and beta distributions are conjugate to get a complete picture of how to handle beliefs. Then, most probably, we'll have a break. In the second part we will use all the information we've gathered to transform this seemingly intractable partially observable two-armed Bernoulli bandit into a rather simple Markov decision process on the belief states, and we will use technology we already have to solve it; not exactly, but in a rather controlled way, up to the end.

So let's start with the description of the problem. We have the two-armed Bernoulli bandit; you will have seen it several times by now, it's standard. Let me increase the size a bit. We have this very simple system with only one state. We are dealt two coins. One coin, corresponding to action one, has probability q1 of giving me a success with a reward of one, and probability 1 minus q1 of giving a reward of zero. The second coin is the same but with a different probability. So both coins are Bernoulli: zero or one, with some probability. And of course my aim in life is to maximize the reward in the long term. Now, since we have two coins with two different probabilities, and we assume from now on that the two coins are perfectly independent of each other, there is a whole space of possible realizations, of possible states, which are the pairs (q1, q2) of the two probabilities.
In particular, we have this square here: zero to one for the probability of one coin, zero to one for the probability of the other. Each point in this space corresponds to one precise realization of the two-armed bandit. For example, this point here is the one in which I've been dealt two coins that are both completely rigged: they only give fails, whatever I do, so 100% probability of a fail for both of them. This point here is one with q2 equal to one, so the second coin always gives me a success, while the first coin has probability 0.5, which means 50% fails and 50% successes. Whenever I am at a specific point, my options are very clear: I will just maximize and use only the coin with the best probability. Of course, we are now in the setting where we don't know everything. We don't know everything because, as we will see, we don't know exactly where we are; but we do know a lot of things. So this is a case which is partially observable but model-based: we know perfectly how the system works.

So, what is known? The first thing that is known is that, once q1 and q2 are fixed, I know how the rewards are drawn: if I know the state, the reward distribution is perfectly known, and it's Bernoulli, as we said. For example, if I have q1, q2 and I choose to act with the first coin, the probability of outcome r is q1 to the r times (1 minus q1) to the (1 minus r). So if it's a success, r equals one, this gives q1; and if it's a fail, r equals zero, it gives 1 minus q1, as we said. And it's exactly the same for the second coin. So this is known.

Another thing that is known, which I also mentioned briefly before, is that there are no transitions: whatever I do, whatever coin I decide to flip, the state will not change. This turns out to be this condition here: the probability of transitioning from one state to another state is zero if the two states are different. And if you go back to the theory described yesterday, whenever you have an equation with a sum over s prime and a transition probability, that sum is simplified away: only transitions from s to s, transitions which remain in place, are allowed, and they happen with probability one. So this is known, and it's a very simplifying fact of life. The second thing that is known is the outcomes of the actions. As in yesterday's setting, what is missing as a percept is the state itself; what we observe is essentially the rewards, the outcomes of the actions. And we also know the model for the observations: given a state, the model for the rewards is, as before, the Bernoulli distribution of the reward.

So we know a lot of things. What we do not know is the state: we are not given the information of what the probabilities q1 and q2 are. That is absolutely out of our reach; we have no direct information about it, and we need to extract this information from other things. So the idea, as we pointed out in this series of ideas, is that we have to shift from the idea of a single state to the idea of beliefs over the states.
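As a minimal sketch of the model just described (the hidden q values are arbitrary placeholders, not anything the agent is allowed to see):

```python
import numpy as np

rng = np.random.default_rng(0)
q_hidden = (0.45, 0.55)        # the true state: unknown to the agent

def pull(arm):
    """Flip coin `arm` (1 or 2); reward r is 1 with probability q_arm, else 0."""
    return int(rng.random() < q_hidden[arm - 1])

def likelihood(r, q_arm):
    """P(r | state, action) = q^r * (1 - q)^(1 - r): the Bernoulli likelihood."""
    return q_arm**r * (1 - q_arm)**(1 - r)

print([pull(1) for _ in range(10)])              # a sample history from arm 1
print(likelihood(1, 0.45), likelihood(0, 0.45))  # 0.45 and 0.55
```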
A belief is just a probability distribution over the states, which means that I am not given the information that I am exactly at one point; instead, I have a distribution over the states, where not all states are equally probable. Basically, instead of having a state, I have a function over the states. And this is a typical example of a belief: you can see that purple means my belief is not concentrated at one perfect state, and I have a maximum probability of being at a certain point. So instead of being in a state, I will deal with a probability distribution over states.

Notice that this is something you did yesterday, but I think it's useful to repeat: everything that used to be expressed in terms of the state now has to be integrated over the belief. For example, if you had a probability of getting a reward given a state, now you have a probability of a reward given a belief, which just means that you integrate, or sum, over all the states, weighted by the probability of being in each state, which is the belief. So it's a complete shift from being at a single point to averaging over the probability of being at all points.

Okay, so once we are working with beliefs, the general situation is this: I have some gut feeling, which is not really a gut feeling, it's a mathematical result, of what I have seen; I have a belief of where I am. Now, as always, I can do two things. I can explore, which means: I have some idea of where I am, of what my belief is, but I want to make my belief more accurate, to have a better understanding of where to go afterwards. Or I can exploit: my current belief says that this action here is the best one, so that is the action I take. And as always, we will have to find a compromise between the two, so that I explore as much as I can to make my belief sharper, and I also exploit my belief, doing what I currently think is the best thing.

May I add just a comment? So, if you go back to your picture of the typical belief. Yeah. For instance, a strategy that exploits your current information would be to say: let me identify the maximum of this belief, the point where the probability is maximal; then I fully trust that my state is the one where the maximum is reached, and I act accordingly. In this case the maximum is above the diagonal, which means that at that point q2 is larger than q1, so according to that I would play action two. Now, this is provably suboptimal, and the reason is that sequences of observations might have carried you to that point just through incomplete information about the system: just out of bad luck, perhaps the number of times you've won on one arm is larger than what the actual mean would suggest. On the contrary, what does an exploratory strategy do? An exploratory strategy looks at this figure and says: oh, it's very broad along the q2 axis, so maybe I have to sample q2 many times, because I want to make it sharper. I have more uncertainty on q2 than I have on q1, so a very exploratory strategy decides to play arm 2 in order to reduce my uncertainty on q2.
In this case, the two things coincide, but in general that's not the case. It's also provable that purely exploratory strategies, strategies that look only at information rather than at rewards, are suboptimal as well. So what is optimal, and what the model will compute, is the sweet spot between exploration and exploitation. Thank you very much. Okay, good.

So now we want to arrive at this sweet spot, and the idea is that we need something more. As Antonio said, we have our beliefs, and we want to modify our beliefs. The point is that we need a proper way to modify them, and this is given by the Bayes update. Clearly, when I have new information I should change my beliefs, but exactly how to do that is not trivial, because it has to be done in a proper mathematical sense. Since we said the two arms are completely independent, let's now deal with just one arm, and consider only the case of the Bernoulli distribution. We know the probability of a success given a state, which is called the likelihood, and which is the Bernoulli distribution: if we are in state q and we observe result r, it is given by the formula q to the r times (1 minus q) to the (1 minus r), which we have already written many times.

Then let's say that I have a belief, a prior: I already have a probability distribution over the states. Bayes' rule gives us the way to modify it. The new belief, the new distribution over the states given that I observed a result, is obtained by multiplying the likelihood, the probability of seeing that result given a state, by the prior, the probability of being in that state. Then there is a normalizing factor, which just makes everything a proper probability, so we don't worry about it. As you can see, the new belief is just the old belief multiplied by the likelihood.

Now you can already see where conjugate distributions enter. The new belief is the old belief multiplied by something of the form q to the r times (1 minus q) to the (1 minus r). So if you have a belief whose functional form is preserved when multiplied by this likelihood, you remain in the same functional family of beliefs under any Bayes update. In the general case, beliefs are arbitrary functions that must be integrated and normalized to one, so in general it's a complete mess. But for beliefs that have the special structure of being a beta distribution, as we will see, things are much, much easier.

So what is a beta distribution? The beta distribution is a distribution given by two parameters: for each one-dimensional degree of freedom you have two parameters, alpha and beta. And the belief over the state parameterized by alpha and beta is just q to the (alpha minus 1) times (1 minus q) to the (beta minus 1), times, again, a normalizing factor. You can see that this beta distribution has a functional form very similar to the likelihood. In particular, what happens? Let's say my belief, for some reason, has a beta form with parameters alpha and beta, and then I make one observation and I get a success. I could have gotten a success or a failure; I got a success. What is my update?
My update is that I have q, this first part is the likelihood, so q to the one times (1 minus q) to the zero, and then I have my beta distribution, and then the normalizing factor. This just means that the exponent of q, which was alpha minus 1, is now (alpha plus 1) minus 1, and the part with 1 minus q stays the same. So you can clearly see: the prior was Beta(alpha, beta), sorry for that, and the posterior is Beta prime. My new belief, updated by the fact that I got a success, is again a beta distribution, just at a different point, alpha plus 1 and beta. So if you start from a beta distribution, if your belief is a beta distribution, then when you do a Bayes update, you end up in a beta distribution. We will see at a certain point that if you do not start from a beta distribution, things get ugly again.

Luckily for us, the state of complete ignorance is a beta distribution: the flat probability, one everywhere, which says I have absolutely no prior bias about where my state is, is a beta distribution with alpha equal 1 and beta equal 1. So from now on we assume, and this is an important fact, that at the beginning, before having any outcome, I could be anywhere in the space of states. This is mathematically encoded in the fact that my initial belief, the starting probability over the states, is 1 everywhere: a flat probability, equal to the product, because the two arms are independent, of a beta distribution for q1 with alpha1 = 1, beta1 = 1, multiplied by a beta distribution for q2 with alpha2 = 1, beta2 = 1.

Okay, so the first bit of code after half an hour, although this is not exactly a coding exercise; it's more a complete, concrete example of something theoretical and abstract. What I've done is create this very simple thing here which plots a belief state. This is barely coding: fortunately the beta distribution is useful enough that Python already has it built in, with its probability density function. So I just defined the binning and the plot, something that allows me to show the belief given the four parameters alpha1, beta1, alpha2, beta2.

So let's start with what I said was total ignorance: alpha1 = 1, beta1 = 1, alpha2 = 1, beta2 = 1, and indeed I have a perfectly flat distribution. Now we can play around a bit with this. Let's say I pull arm 1 and I get a success. What is the update? If anyone wants to intervene, now is a good time; this is a real question. What changes if I pull arm 1 and I get a success? Alpha1 becomes 2. Alpha1 becomes 2, perfect. So you can see what happens: my belief has changed. You can see that my belief in q2 is still homogeneous; every vertical line is perfectly uniform, so there is still no information on q2. But now my belief state says: you know what, I don't think q1 is very, very low. The belief state has shifted. And this is the incredibly helpful thing about having a conjugate description: over here you would have to store a 2D function; here you have four numbers, thanks to the fact that Bernoulli and beta distributions are conjugate and that our prior was complete flat ignorance.
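A sketch of a plotting helper along these lines, using scipy's built-in beta density (function and variable names are mine, not necessarily those of the lecture notebook):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

def plot_belief(a1, b1, a2, b2, n=200):
    """Belief over (q1, q2): product of two independent beta densities."""
    q = np.linspace(0, 1, n)
    # rows indexed by q2, columns by q1
    density = np.outer(beta.pdf(q, a2, b2), beta.pdf(q, a1, b1))
    plt.imshow(density, origin="lower", extent=[0, 1, 0, 1])
    plt.xlabel("q1"); plt.ylabel("q2")
    plt.title(f"Beta({a1},{b1}) x Beta({a2},{b2})")
    plt.show()

plot_belief(1, 1, 1, 1)   # total ignorance: perfectly flat
plot_belief(2, 1, 1, 1)   # after one success on arm 1
```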
So we have a proper way to translate this belief space, which was a function over a 2D domain, into four numbers. This is, as I said, much better. Now, the belief, which contains our current knowledge, is what we have to exploit to make the best policy we can, and we have a very simple tool, which is the belief update. Exactly as before, where I did just one case, we now have four rules. If I start from (alpha1, beta1, alpha2, beta2) and I do action 1, I can only change alpha1 or beta1: alpha1 goes to alpha1 plus 1 if I have a success, and beta1 goes to beta1 plus 1 if I have a fail. And the same for arm 2: if I do action 2, alpha2 plus 1 if I get a success, beta2 plus 1 if I get a fail. This is just the way to move from one belief state to another belief state. So if I was in a belief state in which my information so far was some number of successes and some number of fails, a single unit of new information brings me to one of these four neighboring beliefs.

Okay, good. And going back to what you discussed yesterday about histories: the sufficient statistics of your history, the part that gets written into the belief, is just the number of wins and the number of losses so far, provided you started from total ignorance.

Now, one more thing, which could backfire if misunderstood: so far we did everything in belief space. What does this mean from the point of view of real machines, single points, a real machine which can give me one outcome or the other? When we deal with beliefs, we always integrate over all possible states inside the belief. Now I want to show you that it's exactly the same thing as running proper trajectories with given states. If you have questions, please ask; I'm now going to do it the other way around.

Let's take histories with six steps, and let's take 200,000 randomly chosen states: 200,000 times I take one particular point in my state space, uniformly, so I have a q1 and a q2, and then I extract six times either a fail or a success. So instead of dealing with beliefs, now I'm doing the frequentist part: 200,000 times I take one single concrete two-armed bandit and flip it six times, then I count how many successes and how many fails it had. These are all different machines, with different probabilities of winning, but I keep only those that all arrive at the single point described by alpha equal 5, beta equal 3, that is, four successes and two fails. Since I started from a completely uniform sampling of the world, each machine is completely different, but I chose only those which end up at alpha = 5, beta = 3; the others have clearly gone elsewhere. And then I ask: for the machines that arrived at that point, what is their distribution in q space? I take all the points which arrived there, record their q, and make a histogram of it. Do you have any questions about that? I assume no. Okay, this is still computing. What I'm plotting here, above, is a histogram of all the states I chose randomly and followed.
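A compact version of this frequentist check might look like this (again a sketch, not the notebook's exact code):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

rng = np.random.default_rng(1)
N, steps = 200_000, 6

q = rng.random(N)                         # one hidden coin per machine, sampled uniformly
wins = (rng.random((N, steps)) < q[:, None]).sum(axis=1)

survivors = q[wins == 4]                  # keep only histories with 4 wins, 2 losses
x = np.linspace(0, 1, 200)
plt.hist(survivors, bins=50, density=True, alpha=0.5, label="empirical q of survivors")
plt.plot(x, beta.pdf(x, 5, 3), label="Beta(5, 3) belief")
plt.legend(); plt.show()
```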
Okay, so these are not beliefs: these are real machines. I chose them randomly, then I let them produce outcomes following exactly their own exact Bernoulli probability, and then I selected only the ones that reached alpha = 5, beta = 3 and looked at where they came from; that's this description here. And then I can do the opposite side: what is the belief, the completely abstract belief, encoded by alpha = 5 and beta = 3, starting from ignorance? It is this one. So you can see that the belief is a rather powerful thing: it's a complete piece of information about where you could have been, over all the states.

So, to summarize: we have this very simple problem in which we know everything except the state; we move to beliefs, which are distributions over the states; and these distributions can be used in a simple way because they update following Bayes. It's simple for two reasons. The first reason is that I put myself in a beta distribution at the beginning, and the second reason is that if you start from a beta distribution and you do an update, you still end up in a beta distribution, for the very simple reason that the beta distribution is structurally the same as the Bernoulli likelihood: this factor here and this factor here play well together, because multiplied they again give a beta distribution. And this gives us the transitions: whenever I am in a beta and I know the next outcome, I know where I'm moving to.

I wanted to show one more thing along the same lines. We said many times that if you start from a beta function you end up in a beta function, and that was the case here. So: I chose q randomly from 0 to 1. Now let's say I don't want to do that; instead of taking q uniformly over the whole interval from 0 to 1, I take it from a different range, covering only part of the interval. So it's the same as before: I sample machines and collect only those with a history of four successes and two fails. And my question is: can I describe this with a beta function with alpha equal 5 and beta equal 3? No. Why? Because alpha = 5, beta = 3 is a good description of my belief after four successes and two losses only if I started from Beta(1, 1), which was my total ignorance. So from now on you can think of alpha and beta as a description of your history, but you have to remember that this works only if you started from the total ignorance of alpha = 1 and beta = 1.

Okay, just a small side remark: in fact, in the Bayesian literature these parameters m and n are called pseudo-counts, also in order to distinguish them from real sample counts. They are called pseudo-counts because, depending on where you start with the prior, they can take different values. So always keep in mind this distinction between what is Bayesian and what is frequentist; the two things are consistent at the level of beliefs, but you have to be careful.

Exactly. And if you want to get a bit confused, and we all always want to get a bit confused, you can say: I don't want to start from total ignorance; I want to start from a system which is very skewed to one side, from a beta prior like this one, never mind why. And then I do the same thing; now the pseudo-counts will lead somewhere else.
Because if you start from a beta with large, skewed parameters like these, clearly the numbers of successes and fails need to be related to those starting numbers; the pseudo-counts no longer coincide with the actual counts. I think this was Antonio's point. Okay, good. Now I think we will make a break. If you have questions, please ask, and afterwards we will exploit all these new ways of dealing with beliefs and actually translate this problem into an MDP, which we know how to solve. If there are no questions, I think we can make a break. Maybe we can resume at five past ten? Sounds good; see you in 15 minutes. Sorry, could you please pause the recording?

Okay, good, let's resume. I resume the recording and share the screen again. Okay. So, we have arrived at a compact description of a belief. As we saw, a belief is a distribution over the states, and we now have a compact way of writing it, assuming a prior of total ignorance: the space is defined only by the four pseudo-counts n1, m1, n2, m2. Imagine we have seen a certain number of successes and fails; then my belief is perfectly described by a beta function using these four parameters. So we can use these four numbers to rewrite everything as a Markov decision process. From now on I will use the terms belief and state in a hopefully not too confusing way: I will sometimes say state, because these are the states of the new Markov decision process, but I will write b, to remind you that these new states are actually beliefs. We went from a POMDP to an MDP on this new space: the states of this MDP are the beliefs, described by these four numbers. This means we've created a mapping from what was a complicated function on a 2D space to a single point in a new four-dimensional lattice of beliefs. Every single point in this lattice corresponds to a belief; you can see axes 1, 2, 3, and there is also a fourth axis going into the fourth dimension, as clearly visible from the wiggling of the grey arrow.

So, if we want to construct an MDP, we need the usual ingredients: we need states, and we need the transitions between the states. A transition now is exactly as Antonio was saying... sorry, we are looking at a PDF file. No, sorry, this is because I wanted to remind you: this lecture was given previously by two colleagues of mine, and I will also put in the folder this PDF here, which is essentially the more technical part of the class. It is not what I wanted to show you, but there you can see the same material, with everything properly defined from the mathematical point of view. Thanks for reminding me. But now I have to change my sharing, I apologize, and I will share this screen here. Okay, sorry about that.

What I wanted to show you is that we went from a function in 2D, on the left, to a single point in this four-dimensional lattice; now you can see what I meant by the fourth dimension axis wiggling in grey into the unknown. What was before a distribution function is now just a single point. And this is also what I meant when I said I will write b but call it a state: the beliefs of the old POMDP are now the states of our new MDP. We need everything that defines an MDP: we have defined the states, now we have to define the transitions. So let us imagine that we are in a belief state given by the four pseudo-counts n1, m1, n2, m2, and let us assume we take an action. If I take action one, I can have two outcomes: success or fail.
What is the probability of having a success if I am in the belief state (n1, m1, n2, m2) and I take action one? Well, it's given by the average probability, which is essentially this formula here: (n1 + 1), the past pseudo-count of successes, divided by (n1 + m1 + 2), the total pseudo-count. You can, and you could, do it mathematically: you integrate over all the states, weighted by the belief in each state, and in each of those states you ask what the probability is that action one gives a success. In the end the result is very simple, and it's just this: the probability of being in this belief state, taking this action, and getting a success is given by how many successes you already have. This number here is the result of integrating over all possible states you could be visiting under that belief, with the probability given by that belief.

This means that if I am in a belief state and I take an action, I can calculate the probability of each result, and therefore the probability of ending up in each new state: a success from the belief state (n1, m1, n2, m2) under action one brings me to (n1 + 1, m1, n2, m2). So I can compute the transition probability of being in that belief state, taking that action, and getting that result. There are only four possible outcomes. If I take action 1, I can only change my belief about arm 1, which gives the first two lines; if I take action 2, I can only change my belief about q2. And the change is very simple: with the probability given by my previous belief, the part on the right, I will end up either in a state with n2 + 1, with the probability of a success, or in a state with m2 + 1, with the probability of a fail. This means I am always moving from each point in only four possible directions: if I act on arm 1, I move only in the (n1, m1) plane, and if I act on arm 2, I can only move to the two points connected along the other two axes.

This also means, which is something nice, that you always move from a state with a certain sum of pseudo-counts to a state whose sum of pseudo-counts is larger by one. Again, these are nothing but the transitions of my new MDP, the MDP of beliefs. Do you have questions about this? As I was pointing out, in the PDF you can find a more detailed and technical discussion of how this integration works, but I hope at least the general idea is not too counterintuitive.

Now the rewards: I have to compute the reward function given a belief, and actually it is the same as the transitions. If I am in a belief state described by (n1, m1, n2, m2), my probability of a success is given by this number here and my probability of a fail by this number there. The reward function is connected to that: the reward in going from one belief to the next belief is one only if I'm moving to the belief where n1 has increased by one, that is, where arm 1 gave a success, or if I'm moving from n2 to n2 + 1, which corresponds to a success on arm 2. So we have defined everything we need for the new MDP.
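In code, these transition quantities could look like this (helper names are illustrative):

```python
def p_success(belief, action):
    """P(r = 1 | belief, action): posterior mean of the chosen arm's q.

    belief = (n1, m1, n2, m2): success/failure pseudo-counts on top of the
    flat Beta(1, 1) prior, so each arm's posterior is Beta(n + 1, m + 1)."""
    n1, m1, n2, m2 = belief
    if action == 1:
        return (n1 + 1) / (n1 + m1 + 2)
    return (n2 + 1) / (n2 + m2 + 2)

def next_belief(belief, action, r):
    """Deterministic belief transition: bump exactly one of the four counters."""
    n1, m1, n2, m2 = belief
    if action == 1:
        return (n1 + 1, m1, n2, m2) if r == 1 else (n1, m1 + 1, n2, m2)
    return (n1, m1, n2 + 1, m2) if r == 1 else (n1, m1, n2, m2 + 1)

print(p_success((4, 2, 0, 0), 1))   # 5/8, after 4 wins and 2 losses on arm 1
```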
To recap this new MDP: the states are these four integers, which correspond perfectly to the belief I have about my state given the counting so far; the actions I can take are arm 1 or arm 2; and I have transition probabilities: depending on the belief, I have different probabilities of success and fail, and each outcome takes me, as we've seen, from four integers to four integers of which exactly one has increased by one. It's like a grid world where I can only move along one line at a time, increasing the total sum of the integers by one. So now we have a well-defined problem, and if I gave you this problem without saying where it comes from, you could already solve it, because this is just an MDP. And we want to solve it.

So what do we want to do? We want to optimize the value function of being in a belief, which is none other than this: if we take a policy, that is, if we assign to each belief an action to take, we start from b0, my prior beta of total ignorance, we go on following the policy, and we accumulate the expected value of all possible outcomes, discounted by gamma; this is the standard definition of the value. The optimality equation now says that the optimal value is the one in which, for each belief, you take the best action, the action that maximizes the immediate reward you get plus gamma times the value of where you end up. Why is this so similar to the standard MDP case? It is a product of the fact that we chose this special Bernoulli problem: the Bernoulli-beta conjugacy helped us reduce the state space, and the transition probability from s to s prime was trivial, in the sense that you don't move. Essentially, those properties allowed us to rewrite everything as an MDP with no problem at all. So if I am in a belief state, I can take action one or action two; this action will give a success or a fail with some probability, hence some reward, and it will move us to a belief which has either an n increased by one, if it was a success, or, with the complementary probability, an m increased by one, if it was a fail. Then I just check: I chose arm one and I have this expected result; I can choose arm two and I have that expected reward; which arm is better? I will take that one. And this is just the usual way the optimality is written: the best policy is the one that maximizes this quantity; you compute the value of being in a belief and doing an action, and you take the best one.
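Writing out the optimality equation just described, with b_{a,r} denoting the successor belief in which the counter corresponding to action a and outcome r has been incremented:

```latex
V^*(b) \;=\; \max_{a \in \{1,2\}} \;\sum_{r \in \{0,1\}} P(r \mid b, a)\,
\bigl[\, r + \gamma\, V^*(b_{a,r}) \,\bigr],
\qquad
P(r{=}1 \mid b, a{=}1) \;=\; \frac{n_1 + 1}{n_1 + m_1 + 2}.
```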
So we are left with computing this value function, and here we encounter a small problem we have not met before: even though the system is exactly solvable, the state space is now infinite. What does that mean, and how do we solve it? The beliefs are composed of four integer numbers, and unfortunately these four integers can go to infinity, so there is no way to make a proper tabular description of the problem: I would need infinite matrices. But notice that from here on this is not a problem of POMDPs specifically; it's a problem which can arise in solving any MDP: what happens if I have an infinite set of states? And now enters the good thing we have already seen, which is the structure of the transitions.

As we saw, a transition from time t to time t + 1 always adds one to one of the integers. We start from the prior, described by Beta(1,1) times Beta(1,1), that is, by n1 = 0, m1 = 0, n2 = 0, m2 = 0, and then we move away from the origin: the total count always increases. If we have a discount factor gamma, then, seen from the origin, all subsequent rewards keep getting discounted by gamma, and we are always travelling in the same direction. So we can say: let's create a boundary far enough from the center, so that I can be sure that by the time I reach it, the rewards from there on will be so insignificant for my value that I can discard them.

In particular, this is what we do. The weight of a reward at a time beyond some boundary time is gamma to the t, because at every step you multiply by gamma, so after t steps it's gamma to the t. Then I say: I don't care about anything below a tolerance epsilon. If I decide on some epsilon, my tolerance, below which I don't care, I can choose a boundary horizon T by setting epsilon equal to gamma to the T, that is, T = log(epsilon) / log(gamma). This tells me I only need to work with this subset of states here, and that discarding the actual results beyond this T will give me at most an error of epsilon. So we define the subset of beliefs given by integers whose sum goes up to T. Why the sum? Because, if you remember, at each step exactly one of the four integers changes, and only by one, so every time there is a new extraction the sum of the integers goes up by one.
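A one-line version of this cutoff computation (epsilon is our arbitrary tolerance):

```python
import math

gamma, eps = 0.9, 1e-4   # discount factor; eps is our chosen tolerance

# gamma**T <= eps  =>  T >= log(eps) / log(gamma)
T = math.ceil(math.log(eps) / math.log(gamma))
print(T)   # 88 with these numbers; the lecture quotes ~87 (it depends on eps and rounding)
```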
In a way, this is similar to what we did the first time, when we went to the end and moved back step by step. If you remember the travelling salesman, you had all the states with zero cities left, all the states with one city left, all the states with two, and so on; now we have all the states with t extractions, t imaginary extractions, then t + 1, and so on. These are slices of states, all stacked together. The only difference is that with the travelling salesman the end point was unique and perfectly defined, so we knew the true value at the last city, because it was just the trip back. Now, at that boundary, we instead say: you know what, I don't care; that value will be taken with an approximation, because even if I'm wrong there, the error will be smaller than epsilon, and I don't care below epsilon. Any problems with this? It's a general way to deal with systems in which you can clearly define an arrow of time and you have an infinite space of states, but you decide to put a cutoff.

So we use the Bellman equation as we know it, restricted to our subset: for the optimal value of a belief you take the best of the two actions, and the slices of the subset are perfectly stacked one after the other in time. If I want to calculate the optimal value for a belief at time t, that is, in the slice where the sum of the integers is t, then I only need to worry about states at time t + 1. This is because, I apologize, my probability of moving and getting a reward is nonzero only if I go from a state whose integers sum to t to a state whose integers sum to t + 1. Which basically says that to compute the values of the states in B_t, I only need to know the values of the states in B_{t+1}. As we said, we take the boundary B_T, the slice in which the total sum of the integers is the value capital T fixed by our tolerance, and then we do exactly what we did last time: we assign a value to the boundary beliefs and we go backwards and backwards and backwards, like the travelling salesman going backwards, looking for the best action between one and two. At the first update we get the best action only for the boundary; then for the states one slice below the boundary, two slices below, three slices below, and at the end we have the optimal action all the way down to the center.

Now, something worth pointing out: the optimal policy we obtain is defined for every state of belief, and it is unique; for each belief state we have a single action to take, and it's deterministic. So we will not have solved just one possible case: we will have solved all possible cases, for all possible beliefs, all the situations that can come up, assuming I start from complete ignorance as my first belief and then have some history of pseudo-counts.

Okay, now how to do it properly; this is the code. First of all, as we said, we have infinite states, so we have to limit ourselves with a cutoff. We decide that our tolerance is some value, and we know the discount factor, by which rewards in the future are discounted every time I move; it's 0.9, which gives us a T, our cutoff boundary: essentially this gives us 87 steps. So we will have to compute every state of belief in which the sum of the integers is up to 87, and from there on we say we don't care any more. This is a lot, and of course we have to create a way to enumerate all the states at a given time. Again, since the integers can only go up by one at each step, we want to enumerate, for each time, every possible state with a given sum, and this is what this does. You see, I want to enumerate all the states in which the sum is 5, and essentially it keeps track of the states, looping each integer from zero up to whatever is left of the total, because the sum has to be 5. It's just a very naive way of enumerating all the possible combinations of integers which sum to a given number. You can see all the states that sum up to 5; if you ask for a sum of 10, you already get a sizeable number, so I will not show you 87, but you can imagine that this is quite far from cheap in memory.
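A sketch of such an enumeration (the naive nested loop just described):

```python
def states_with_sum(t):
    """All belief states (n1, m1, n2, m2) of non-negative integers summing to t."""
    return [(n1, m1, n2, t - n1 - m1 - n2)
            for n1 in range(t + 1)
            for m1 in range(t + 1 - n1)
            for n2 in range(t + 1 - n1 - m1)]

print(len(states_with_sum(5)))    # 56 states, i.e. binomial(5 + 3, 3)
print(len(states_with_sum(10)))   # already 286
```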
So, what do we do at the boundary? We said that beyond the boundary we don't care, but actually we do use some information at the boundary itself. The boundary consists of the belief states in which I have made T observations, T percepts in a sense. There, instead of doing the optimal thing, we do a suboptimal thing: I say, you know what, I will act on the arm which my current belief says is better. Since I have a belief on the two arms, I exploit that information and I no longer care about the variance of the belief: my current belief is that, on average, one arm is better than the other, and that's it. This is done here: I enumerate all the states at the boundary, those where the sum of the integers is T, and for each of them I evaluate the probability of success given arm 1 and the probability of success given arm 2, and I define the value, which is wrong, but it's a good approximation, and the error will be diluted going backwards, as just the best of the two actions. I exploit my current belief there and I don't care what happens afterwards; it's the last state I'm keeping. And the best action there is, of course, to pull that arm.

Then, simply as in the travelling salesman, except that in the travelling salesman you really had the optimal value at the boundary, while here it's an approximately optimal value, but we don't care, the next step is to iteratively propagate the values from this fake optimal value at the boundary towards the states with sum T minus 1, and then again and again. So, given a state, you can now compute its value. For the two independent arms, given the belief state, you have the probability of winning with one arm and the probability of winning with the other. Take arm 1: with the probability of winning, I get the reward of winning, one, plus gamma times the value of the new belief state I move to, which is (n1 + 1, m1, n2, m2). With the probability of failing, I don't get the one, because I failed, so no direct reward, just gamma times the value of the belief state which is the same as before except that my pseudo-count of fails goes up by one, (n1, m1 + 1, n2, m2). So these two lines for arm 1 are none other than: I take action 1 and I sum over the new states. I am literally applying the Bellman optimality equation, because I am assuming, and this time it's only approximate, not 100% exact, that the values I'm using are already optimal: they all lie one level higher, closer to the boundary. I started at the boundary pretending those values were optimal, and going backwards I keep pretending that the values upstream in the flow are optimal. So I have the quality, the Q-value: the discounted value of playing arm 1 and the discounted value of playing arm 2. Then I can compare the two, and I have what I need: it returns the best value in that belief state and the best arm to pull. This does it for just one belief state; now I have to do it for all the belief states, starting from the boundary and going downwards.
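A sketch of this single-state backup, reusing the hypothetical p_success and next_belief helpers from the earlier snippet:

```python
def backup(belief, V_next, gamma=0.9):
    """One Bellman backup for a single belief state, assuming the values of all
    successor beliefs (one extraction later) are already stored in V_next."""
    q_values = []
    for action in (1, 2):
        p = p_success(belief, action)
        win = next_belief(belief, action, 1)
        lose = next_belief(belief, action, 0)
        q_values.append(p * (1 + gamma * V_next[win])
                        + (1 - p) * (0 + gamma * V_next[lose]))
    best_arm = 1 if q_values[0] >= q_values[1] else 2
    return max(q_values), best_arm
```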
And this is what I do here: I loop over t in range(T), so excluding T itself and going up to the boundary, and I go backwards; this is the Python way of saying I'm going backwards. I get all the states at that level, and for each state in the slice I compute the best value, the optimal value, and the best action, and I store them. And with that I have done everything; this essentially solves my problem. I should... yes, sorry, I forgot that I had to restart everything. Okay, good. So once that has run, I have the full solution, at least within the tolerance I set. Do we have questions about this? Everybody is desperate? Okay. So we can now check the numbers; for example, let's plot a solution.
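Putting the pieces together, here is a compact sketch of the whole backward sweep plus the policy-slice plot; the boundary value p / (1 - gamma), meaning "act greedily forever", is one simple choice of approximate tail value, not necessarily the notebook's exact one:

```python
import numpy as np
import matplotlib.pyplot as plt

def solve(T=87, gamma=0.9):
    """Backward induction over the belief lattice, from the cutoff slice inwards."""
    V, policy = {}, {}
    for b in states_with_sum(T):                 # boundary: exploit greedily
        p = max(p_success(b, 1), p_success(b, 2))
        V[b] = p / (1 - gamma)                   # tail value: "act greedily forever"
    for t in range(T - 1, -1, -1):               # going backwards in time
        for b in states_with_sum(t):
            V[b], policy[b] = backup(b, V, gamma)
    return V, policy

V, policy = solve()   # a few million states: slow, but it runs

# Slice of the policy at fixed arm-2 pseudo-counts (n2, m2) = (5, 5):
grid = np.array([[policy.get((n1, m1, 5, 5), np.nan) for n1 in range(40)]
                 for m1 in range(40)])
plt.imshow(grid, origin="lower")
plt.xlabel("n1 (wins on arm 1)"); plt.ylabel("m1 (losses on arm 1)")
plt.show()
```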
This is just what happens: I took this capital T as before, the values and best actions were already computed, and this is just a script to plot them properly. What I have now is a solution for n2 = 5, m2 = 5. Since the space of beliefs is four numbers, it's very, very hard to make any reasonable plot of the whole thing, so instead we fix n2 = 5, m2 = 5, which means my current belief about arm 2 is centered on fairness: it's equivalent to a history with equal numbers of successes and fails, so not a strong belief, but a belief of a fair coin. And now I look at all the possible belief states for the other arm. You see what I'm doing: the value is what I expect to accumulate in the future if I start from a belief characterized by (n1, m1), and the other panel is the best action to take today.

So let's look at the state of this plot, and if you have specific questions about it, please ask. I solved the system with the cutoff rather far away, I put myself in a situation where I fixed my belief about arm 2 to a not-too-strong belief of fairness, and I am scanning the space of beliefs for arm 1, checking the value and the best action. The value already gives us some nice information; you see my pointer, look here. Since gamma equals 0.9, the maximum discounted value is about 1 / (1 - 0.9) = 10, and indeed here, where n1 is extremely superior to m1, I have values in the 9-10 range: I am very confident that this is a very biased arm, I will choose arm 1 because I see that it's much better, and the value that accumulates is very high. Then, clearly, throughout this red region here the best action is to take arm 1, and I take it, and indeed the value decreases as you move away from this axis: the axis where it's almost certain that I had a super unfair arm 1 that always gave successes, because it's essentially the belief of having seen a lot of successes and very few losses. And then, when I arrive down here, everything is flat, for a very simple reason: I no longer expect to play arm 1; I expect to play arm 2, and about arm 2 I don't have that information, but I know it should be fair. So here, where it's orange, I expect to play arm 2, and since it's fair, I expect something like half of the maximum value. But this is the less important information. What I wanted you to focus on is this figure here, which I find very interesting: the best action to take. I plotted the green-yellow line, the diagonal.

Every belief above the diagonal line says that my current belief favors arm 1 over arm 2: since I have equal pseudo-counts n2 and m2, my belief about arm 2 has its maximum probability at q2 = 0.5, and that value corresponds to the yellow line. So if I am above it, the maximum-probability estimate of the belief about arm 1 is better than the maximum-probability estimate of the belief about arm 2. But you see that there is a non-vanishing region where the best action is nevertheless arm 2. Can you guess why? I don't hear anybody. Okay, so this goes back to what we discussed at the beginning, to why reinforcement learning is never trivial: we have to carefully balance two drives. One drive is exploitation, which says: my current belief is that this arm is better, I should do that and only that, and never think about anything else in my life. The other is exploration: my current belief is that this arm is better, but my current belief could be garbage, so I should look around. And you can see it now: here, close to the yellow line, my current belief about arm 2 peaks at 0.5, and my current belief about arm 1 peaks slightly above 0.5. But my belief about arm 1 comes from large pseudo-counts, something like 40-40, maybe 35-30, which gives me a very strong belief that q1 is centered at a value just above 0.5, and it's also very narrowly centered. My belief about arm 2, on the other hand, is centered at 0.5 but is actually quite weak. We can plot it now; let's take something like (35, 35) for arm 1 and (5, 5) for arm 2. So this is our current belief, roughly the one at the point we were looking at, and you can see clearly that the maximum probability for arm 1 is above 0.5, while the other one is at 0.5, so exploitation says arm 1 is better. But you can also see that there is a very considerable amount of probability mass on systems in which q2 is actually above the diagonal. This means that in this specific region the best possible action is actually to explore q2, because there is a very reasonable probability of ending up in a state in which q2 is actually much better. This interplay of exploitation and exploration, this small tongue of blue there, is why we need reinforcement learning: if you did not have it, your value would be smaller. And this is essentially the point of reinforcement learning: it finds a compromise between exploration and exploitation which is not at all easy to find in other ways.

Now, with the few minutes I have left, I want to show one more thing. So far we worked with beliefs: we have a proper system, and we have solved the problem with total ignorance as the prior. We start completely undecided, we could have any state, and we solved it in a Bayesian way: at every point I ask what my current belief is, I imagine that outcomes are extracted following my belief, and I plan for what happens next; and we have the solution for this kind of thing. This is the optimal solution if my world is completely Bayesian, if I assume that at each time the outcomes will be randomly distributed following the belief. But we started from a very different point: the original problem was frequentist. I've been given two coins; they are what they are, and somebody wants to know the best thing to do in this particular case. The Bayesian solution can still be applied.
perfectly optimized action for every possible belief, and it turns out that it applies even if I do something that is, in general, not the same thing: I take one single case and I truly flip the coins, and depending on the outcomes of the flips I decide what to do next. These are not the same thing. The Bayesian computation takes a belief, considers all possible states compatible with that belief, and follows all of them, which is a much larger object. But in this particular case I can ask: does it also work for the plain question, "I have two coins and I have no idea what to do"? So this is what I want to do now. I start with the total-ignorance belief, but with a very specific real state: q1 is 0.45 and q2 is 0.55. What I actually do is pull these two arms, collect the wins and failures, and run for up to 87 steps, recording the history of this one specific instance. This is not a belief any more; it is a single history of a single instance. At each step I take the best action according to the Bayesian solution, I draw the outcome of a flip of the chosen coin with its true bias, I add the result to the history, which lands me in a new belief, and I again apply the optimal action for that new belief to my concrete coins, and I see where it goes. (A code sketch of this loop appears at the end of this passage.)

And here is where it goes: I begin at the ignorance belief, I pull the first arm, the result is a fail, so I get shifted there; then I pull the second arm, the result is a win, so I get shifted to that corner; I pull again, the result is a fail, so I get shifted there; then a win, then another win; and in the end it settles into the optimal thing, which is pulling the second arm. It is rather striking, but true: it does not explore much, yet that seems to be sufficient for it to end up here. While this runs, do you have questions? Yes?

"Is there a way to make the model more explorative or more exploitative?" Well, Antonio was proposing two strategies, let's call them strategies rather than models, which are extremely exploitative or extremely explorative. A purely exploitative strategy would be to always choose the arm with the maximum probability of success. A properly explorative strategy would be, for example, to always choose the arm with the least number of trials, because that is where you have the least information. But those are worse strategies than this one, which is by definition the optimal strategy combining exploitation and exploration. Was this the question? Because I am talking about strategies and you were talking about models.

"I was wondering whether there is the possibility of setting some kind of threshold for choosing between strategies." I am not sure I understand the question, but maybe I can try to clarify some points. You can construct very different kinds of heuristic strategies, like the ones I mentioned before, which are definitely suboptimal, too exploitative or too explorative. There are other strategies; a very famous one is called Thompson sampling: it looks at the current belief, draws one state at random according to the belief, and picks the optimal action for that state.
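Here is the loop just described, as a minimal sketch under the same assumptions as before; it reuses the hypothetical `best_action` from the earlier sketch, and the true biases are the ones quoted in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

Q = {1: 0.45, 2: 0.55}           # the true, hidden coin biases
counts = {1: [1, 1], 2: [1, 1]}  # [successes, failures] incl. Beta(1,1) prior

for t in range(87):              # the lecture runs the experiment for 87 steps
    # Optimal Bayesian action for the current belief (sketch from above).
    arm = best_action(counts[1][0], counts[1][1], counts[2][0], counts[2][1])
    win = rng.random() < Q[arm]  # a real flip of the chosen coin
    counts[arm][0 if win else 1] += 1  # Bayesian update of the belief
    print(t, arm, "win" if win else "fail", counts[1], counts[2])
```

Thompson sampling, just mentioned, would replace the `best_action` call with two draws, `rng.beta(*counts[1])` and `rng.beta(*counts[2])`, and pull whichever draw is larger: a cheap heuristic, with the caveats discussed next.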
Thompson sampling is a heuristic strategy, and a good heuristic strategy, but it is still suboptimal in the Bayesian sense. That means: if the states really are drawn according to my prior, then the strategy found here by solving the Bellman equation is provably the best one. What is not obvious in general is what happens if you use a Bayesian strategy with a certain prior, uniform for instance, while the world is handing you coins according to a different distribution; then the two approaches need not have the same performance. So it is important to distinguish between heuristic and optimal strategies. First: you can construct heuristic strategies, but they are lower in performance in the Bayesian sense, that is, given that prior, they perform worse than the optimal Bellman solution. Second: any strategy you come up with, Bellman or heuristic, is more or less well suited to that prior, and if you compare strategies on data coming from a different initial distribution, they may rank differently. I hope this clarifies some of the things; otherwise you can try to reformulate the question.

"So in the end we choose the best policies, the ones that combine the best exploitation with the best exploration, and we cannot choose in any way to force the policy towards a more exploitative behavior?" Exactly: for that given prior and that given model, this is the best thing to do. "And we cannot force the policy towards our own preference?" If we do that, we lose performance: if we force it to be more exploitative or more explorative, we will have a smaller expected gain, where the expectation, remember, is always taken over beliefs. Okay? "Thanks." Sure.

I think we can wrap up, because we are running late, unless there are burning questions. As usual, you will find the notebook online as soon as Emmanuel has some time to upload it, and otherwise we will meet again next Wednesday for the theory lesson. You're muted. Still muted. I think you can stop the recording anyway. Sure.