Okay, let's continue. We're very happy to welcome today Emilie Kaufmann. She is a CNRS researcher based at the University of Lille, in the CRIStAL lab. She is known for her work on multi-armed bandits, and that is what she is going to talk about today.

Thank you. Can everyone hear me in the back? Is the mic okay? Good. Thank you. So I'm happy to be back at RLSS, which we organized in Lille four years ago. It's nice to see a big amphitheater of people interested in RL. You are already experts after three days of RL and deep reinforcement learning, and today we're going to take a step back to one-state MDPs. Given that the model is much simpler, the expectations will not be the same, because in the bandit world we will actually be able to prove things, and we will try to provide non-asymptotic guarantees for the algorithms I will describe, based on statistics. Thanks to the previous lecture you already know the main tools that will be needed, concentration inequalities, and now let's discover together how to use them in adaptive sampling strategies.

Just a quick recap of why we use this terminology, "bandits". It comes from an old name for a slot machine in a casino, the "one-armed bandit". Unlike this octopus, we are not able to sample all the arms of the bandit and collect the corresponding amounts of coins; instead we have to wisely choose one arm after the other, still with the idea of maximizing revenue. Of course, this kind of model was not developed to make researchers rich in casinos, because in this example even the best slot machine in hindsight has a negative expectation. However, there are many other examples in which bandit algorithms can actually be useful. It is interesting to know that bandits were first studied in the statistics community from the 1950s, and the archetypical example given in the introductions of those papers was that of a sequential clinical trial, in which the different arms are different treatments: sampling an arm means giving the treatment to a patient and then observing a random outcome, whether or not the treatment was successful on that patient. In this context, the idea is to wisely choose the next treatment to give to a patient based on all the prior observations. Sadly, the applications of bandits are more on the other side, where it is much easier to have data and billions of users: online advertisement, where this time the arms, the options, are different advertisements, and sampling an arm means showing an ad to someone and observing whether or not they click, because clicks are usually related to revenue. So there you should also choose wisely which ad to display based on the past.

This "choosing wisely" is what we will try to do. I will first formally introduce the multi-armed bandit problem, and after that I will present the most popular solutions to it. We will start with simple variants of what could be called a greedy strategy, and then we will present two important families of algorithms: optimistic algorithms first, and then randomized variants, including some Bayesian approaches. So what is a bandit model?
So think about K slot machines, that is K arms, or K actions in the RL terminology. We assume that each of these actions is associated with a reward stream X_{a,t}: X_{a,t} represents the reward you get if at time t you choose to sample arm a. The crux of the bandit model is that you cannot observe what you would have gotten with the other arms: the bandit information consists precisely in choosing only one of the arms at each time step, and you make this choice sequentially. I will denote by A_t the arm chosen at time t, and the received reward is the corresponding value of the reward stream. The constraint on the strategy of the agent is that the arm chosen at time t+1 can only depend on the previously chosen arms and the previously observed rewards, and the goal is to maximize the sum of rewards the agent gets.

Here I am not very explicit about what these reward streams are, because there is also a very interesting literature considering a totally adversarial setting, where those rewards could just be chosen by nature. But to fit with the reinforcement learning paradigm, what we will consider is the stochastic bandit, where these reward streams are generated under the simplest possible stochastic process, which is i.i.d. In this case the K arms are represented by K probability distributions, which are of course unknown to the agent: I denote by nu_a the distribution that generates the rewards of arm a, and it has some expectation mu_a. In this context the reward R_t observed upon choosing arm A_t is an independent draw from the distribution associated with that arm. Because of the stochastic nature of the model, the sum of rewards is a random variable that can have quite complex tails, and the common criterion looked at in the literature is to maximize the expectation of the sum of rewards.

This is exactly solving a Markov decision process with a single state, because unlike in RL, where you have a navigation problem, you are facing the same actions in every round: if you did not choose an action today, you can still choose it tomorrow, that's not a problem. So there is a single state, unknown mean rewards for all your actions, and your criterion is the expected sum of rewards up to some horizon. It is a bit like solving a finite-horizon or episodic MDP, except that in bandits the horizon is usually not assumed to be known, and the best algorithms do not need to know it to make the most of it.
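As a concrete picture of this interaction protocol, here is a minimal sketch in Python; the Bernoulli arms and the placeholder choose_arm policy are illustrative assumptions, not something from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.45]          # unknown to the agent: one mean mu_a per arm
K, T = len(means), 1000

history = []                      # (arm, reward) pairs observed so far

def choose_arm(history):
    # placeholder policy: the whole lecture is about what to put here
    return int(rng.integers(K))

total_reward = 0.0
for t in range(T):
    a_t = choose_arm(history)                  # decision based only on the past
    r_t = float(rng.random() < means[a_t])     # i.i.d. Bernoulli draw from nu_{a_t}
    history.append((a_t, r_t))
    total_reward += r_t
```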
Just to go back to the previous examples, and also to talk about the possible assumptions we can make on the reward distributions: in the clinical trial example, the simplest model for the efficacy of a treatment is just a binary variable, did the treatment cure the patient or not? In that case the corresponding bandit consists of a set of Bernoulli distributions, and the response to a treatment is one with the unknown probability that the treatment cures the patient. When we seek to maximize the expected sum of rewards in this problem, it means that we want to maximize the expected number of patients cured during the trial. Actually, in clinical trials other performance metrics are sometimes considered, such as rapidly finding the best arm, because afterwards this arm will be given to a larger population; but today we will stick to the already very interesting reinforcement learning problem associated with bandits, which is that of maximizing rewards. In the other case I described, online advertisement, and typically in recommender systems, any recommendation problem could be modeled as a bandit. For example, you want to present movies to users, and when you recommend a movie you ask for a rating, say a number between 1 and 5; in that case the distribution associated with an arm could be a distribution with finite support. But you could also show an ad to a user and monitor the amount of money they spend on your website; in that case you have a continuous distribution, modeled for example as a Gaussian with unknown mean and possibly unknown variance. So there are plenty of assumptions you can make on the distributions depending on your application; those are just a few examples.

So how do we define the performance measure? Of course there is the value we already talked about, but in bandits we try to go beyond convergence, which is typically what we can establish, sometimes, for RL; we look at a second-order notion called the regret. First let me introduce the relevant notation. If you knew in advance the distributions and their means, obviously the best strategy to maximize revenue would be to always play the action with the largest mean. I denote this action by a* and by mu* the largest expected mean in my bandit. Maximizing reward is then equivalent to minimizing a very important quantity in all the online learning literature, called the regret. The regret measures the difference between the sum of rewards that the oracle strategy, always playing the best arm in hindsight, would get, which is T times mu*, and the expected sum of rewards you get with your adaptive strategy, which is agnostic to the means. What do we expect for the regret? We expect the regret to be sublinear, because then the average reward collected by the strategy converges to mu*, the optimal value. But in stochastic bandits we can be more precise about the rate of growth of the regret as a function of T: we will see examples of algorithms with square-root-T regret and even with logarithmic regret.

To control the regret of an algorithm, it is very important to relate this performance measure to a quite intuitive quantity, which is how many times the algorithm selected each of the arms. Ideally a good algorithm would only select arm a*, but of course that is not possible, because at the beginning we don't know which one it is, so we have to explore a little bit. The regret decomposition writes the regret as a sum over the arms of Delta_a, the suboptimality gap of arm a, which measures how far the mean of arm a is from the best mean mu*, multiplied by the expected number of times arm a is selected up to time T.
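Written out with this notation, the decomposition reads (a standard identity; the middle step is the conditioning argument mentioned just below):

```latex
R_T \;=\; T\mu^\star - \mathbb{E}\Big[\sum_{t=1}^{T} X_{A_t,t}\Big]
    \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \big(\mu^\star - \mu_{A_t}\big)\Big]
    \;=\; \sum_{a=1}^{K} \Delta_a\, \mathbb{E}\big[N_a(T)\big],
\qquad
\Delta_a = \mu^\star - \mu_a,
\quad
N_a(T) = \sum_{t=1}^{T} \mathbf{1}\{A_t = a\}.
```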
To prove this you can check the slides: there is one conditioning step saying that the sum of the observations has the same expectation as the sum of the means of the chosen distributions, and once it is written like that you can put mu* everywhere and get the sum of the gaps; then, just by reordering, grouping the terms by the identity of the arm, this sum can be rewritten as a sum over arms of the gap of arm a times the number of times it was selected. The details are not very important, but just remember that to control the regret, what we will actually prove are bounds on how many times the algorithm selects the suboptimal arms. Looking at this formula also gives us an idea of the behavior of a good algorithm: if you want a small regret, you should not select too often arms for which Delta_a is large. But of course, at the beginning you don't know the gaps, so you have to estimate them, and estimating them requires extra exploration. So you will need to achieve this well-known exploration-exploitation trade-off.

At one end of the spectrum, what is exploitation? We are facing these slot machines, we want to maximize our revenue, and it is quite natural to summarize the performance of each slot machine by the average revenue we got from it before, that is, to estimate their quality with the empirical means, the average of the rewards obtained in the past when sampling each machine. Of course this greedy strategy has to first select each arm once, otherwise you cannot estimate anything, but after doing that it picks the arm with the largest empirical mean in every round. So this is really greedy, and it is quite easy, on a simple Bernoulli bandit model with two arms, to understand why this strategy can fail, just because of randomness. Assume arm 1 is better than arm 2. The first time you play arm 1 it gives you a zero, which happens with probability 1 - mu_1, and you are also lucky with the suboptimal arm, which gives you a one, which happens with probability mu_2. When this happens, the empirical mean of arm 1 is equal to zero and will always remain strictly smaller than the empirical mean of arm 2, meaning that the greedy strategy is always going to choose the suboptimal arm. So the number of selections of arm 2 is lower bounded by (T - 1) times the probability of being unlucky on the good arm and lucky on the bad arm, and therefore the regret is linear, and we do not even have convergence of the algorithm to a good average reward. This means that pure exploitation is not enough, and the point of bandit algorithms is to add the right amount of exploration. To do that we will start with simple strategies that just try to fix a little bit what is happening with greedy.
Before I start this, I forgot to ask, but feel free to interrupt me at any time if you have a question or need a clarification.

So what was wrong with greedy? What was wrong is that we initialized our estimates with just one sample, and then we could have this lucky/unlucky phenomenon that explains the linear regret. But this might not happen if we do a little bit more exploration at the beginning. The idea of Explore-Then-Commit, sometimes called explore-then-exploit, is to do a well-chosen length of exploration: you pick a parameter m and you draw each arm m times, after which you compute the empirically best arm based on m samples per arm, and then you exploit, you stick to playing this arm until the end of your budget.

Let's try to analyze this algorithm, again in a simple situation with two arms where mu_1 is larger than mu_2. The regret of the algorithm, according to the formula I just showed you, is equal to the gap between the means multiplied by the expected number of times we select the bad arm. Under this strategy this is quite explicit: we select arm 2 m times during the exploration phase, and then, if this arm was elected as the empirical best, we play it an additional T - 2m times. So the regret is upper bounded by Delta times m, plus Delta times T (I'm ignoring the 2m) multiplied by the probability of picking the wrong arm. And what is the probability of picking the wrong arm? It is precisely the probability that the empirical mean of arm 2 based on m samples is larger than the empirical mean of arm 1 based on m samples, and this is unlikely to happen, due to concentration inequalities. Indeed, we want to do, as I emphasized at the beginning, a non-asymptotic analysis, and we can actually bound this. When we want to use a concentration inequality, as the previous talk highlighted, we have to make some assumption on our distributions, and if for example we are willing to assume that the distributions are bounded in [0,1], then we can use Hoeffding's inequality to control this probability. Why can we use Hoeffding here? We want to control the probability that one empirical mean is larger than another, so you can rewrite everything — sorry, I wanted to write on my slides but I found out it was not possible — you put everything on one side, and this difference of empirical means is also the empirical mean of the differences, and then you can apply a concentration inequality, because the expectation here is equal to minus Delta, so it is not very likely that you get a positive value: precisely, you would need to exceed the expectation by the gap. It was maybe not very clear, but you can easily do the computation yourself on a piece of paper to see that with Hoeffding's inequality you get an exponentially decaying probability that this happens, scaling with Delta squared, the squared gap between the two means.

Now that we have bounded the regret, we see the dependence on the parameter m, and the natural question is: what is the best value of m? Again, you can do the math yourself or trust me: if we wisely pick m to be roughly log(T Delta squared) divided by Delta squared, we end up with this form of regret.
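Before discussing that bound, here is a minimal sketch of Explore-Then-Commit as just described; the Bernoulli arms and the way m is computed from T and a known Delta are illustrative assumptions.

```python
import numpy as np

def explore_then_commit(means, T, m, rng):
    """Draw each arm m times, then commit to the empirical best arm."""
    K = len(means)
    draw = lambda a: float(rng.random() < means[a])   # Bernoulli reward
    sums = np.zeros(K)
    counts = np.zeros(K)
    total = 0.0
    # exploration phase: round-robin, m pulls per arm
    for _ in range(m):
        for a in range(K):
            r = draw(a)
            sums[a] += r
            counts[a] += 1
            total += r
    # commit phase: play the empirically best arm for the remaining budget
    best = int(np.argmax(sums / counts))
    for _ in range(T - m * K):
        total += draw(best)
    return total

rng = np.random.default_rng(1)
means, T, delta = [0.6, 0.4], 10_000, 0.2
m = int(np.log(T * delta**2) / delta**2)              # the tuning discussed above
print(explore_then_commit(means, T, m, rng))
```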
So, coming back to the bound: we really got a sublinear regret, which is logarithmic, scaling as one over Delta times log T. It makes sense: when Delta is small, the two arms are harder to discriminate, so it could make sense to have a slightly larger regret. This is nice. However, the big limitation here is that the choice of m requires knowing the horizon, which is sometimes viewed as a problem and sometimes not: maybe in a clinical trial you know how many patients you have, whereas in online advertisement you don't know how many people will visit your website. But the bigger problem is that in both cases you never know the true difference between your arms, or even a lower bound on it. This is a limitation that we can somehow overcome by making the length of the exploration phase adaptive. Instead of choosing in advance how many pulls to dedicate to exploration, you stop exploring adaptively: you sample the two arms in a round-robin fashion until the difference of their empirical means becomes significant, where "significant" means exceeding a threshold that does not depend on Delta, and when exploration ends you commit to the arm whose empirical mean is the largest of the two. Here you have an illustration of the process of the difference of empirical means: when it hits this boundary you stop, and in this case the difference is positive, so we would commit to arm 1. This strategy was analyzed in the particular case of Gaussian bandits and was proved to attain logarithmic regret without knowledge of Delta. But there is also an interesting result in those papers saying that even if the regret is logarithmic, it is always worse than the regret of the best strategies, for example UCB, as I will show you. So we know, for two arms and Gaussian rewards, that even smartly stopping the exploration leads to suboptimal regret.

Another possible fix of greedy, on the randomized side, is something you have already seen a lot in the class during the previous week, and which actually comes from RL: the idea of epsilon-greedy. At round t, with some probability epsilon, you pick an arm uniformly at random, to make sure you explore enough, and otherwise you are greedy; you will never be stuck because of a single bad sample, because the exploration will fix things. Actually, it is easy to convince yourself that with a constant epsilon you have a constant probability of picking a suboptimal arm, and therefore linear regret, where Delta_min denotes the smallest gap in the bandit problem. A possible fix — is it better like this? sorry for the weird sound the mic is making — is to consider a decaying epsilon, specifically some parameter divided by t; then you can actually get logarithmic regret for epsilon-greedy. However, note that you need to select the parameter of the decay using prior knowledge, again, of the minimal gap between the best arm and the second best arm, and again this is not something you know in practice.
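For completeness, here is a minimal sketch of epsilon-greedy with a decaying schedule epsilon_t = min(1, c/t); the constant c used here is illustrative, whereas the analysis mentioned above ties it to the minimal gap.

```python
import numpy as np

def epsilon_greedy(means, T, c, rng):
    """Epsilon-greedy with epsilon_t = min(1, c / t)."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    draw = lambda a: float(rng.random() < means[a])   # Bernoulli reward
    for t in range(1, T + 1):
        eps = min(1.0, c / t)
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(K))                  # explore uniformly at random
        else:
            a = int(np.argmax(sums / counts))         # exploit the empirical best
        r = draw(a)
        counts[a] += 1
        sums[a] += r
    return counts

rng = np.random.default_rng(2)
print(epsilon_greedy([0.5, 0.4, 0.3], T=10_000, c=50.0, rng=rng))
```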
So when people use epsilon-greedy in industry, they do some heavy hyperparameter tuning to find what is a good epsilon for their type of data, and then they use it, and it can work, but it is not a satisfying solution that adapts to any problem. Yes? [Question] Here I am considering a fixed set of arms, but yes, there are bandit settings where you allow arms to enter or leave the system; this has been studied in the literature. But in what I presented, K is assumed to be known; it is only this gap parameter that is not. Are there other questions before we move on? Well, I guess if the parameter is too large you can get linear regret; maybe they proved it in the paper by Auer et al., I'm not sure, but for sure it will not work on any problem.

Okay, so, ready for some optimism? I really like this optimistic principle in general. For statisticians, what is optimism? The main idea in bandits is not to rely only on the information given by point estimates, by empirical means; a way to do that is to account for the uncertainty, which is captured for example by a confidence interval that we can build for each of the unknown means, which are our parameters of interest. This confidence interval takes the form of a lower confidence bound and an upper confidence bound. Here they are depicted on a four-arm bandit problem where the true means are the red diamonds, and we see that some confidence intervals are larger than others. A large confidence interval means large uncertainty and is typically associated with arms that have not been sampled much, whereas small confidence intervals correspond to arms for which we gathered more samples; so they also convey how certain we are about the estimates. The idea of the optimistic principle is to say: these confidence intervals represent the possible bandit models, so let's pretend we are in the best possible bandit, the best possible casino, in which the mean payoff of each machine is the largest plausible payoff. This is the bandit model that you see, probably not very well, in yellow here, and if this were the truth you would just select the arm with the largest mean, which in that case is the arm with the largest upper confidence bound. This actually defines a family of algorithms, because as long as you can build these upper and lower confidence bounds — well, in that case you just need the upper confidence bound — you can propose an algorithm, and the previous lecture showed you many concentration inequalities that will in turn give you confidence intervals. Depending on what you know about your distributions, you might want to use the sophisticated empirical Bernstein inequality, if you know that your distributions are bounded from above and you can control variances; but here I will present just the simplest possible UCB, based on Hoeffding's inequality for bounded or sub-Gaussian random variables, which is the one that popularized the UCB algorithm.

So what do we need from the confidence interval? Here you will have to trust me for the analysis to work, and you will see shortly why: we need a guarantee of the kind "the probability that the UCB is above the true mean" — so that it is indeed an upper confidence bound — "is of order one minus one over t". You notice that as t increases we increase the coverage we demand from our confidence interval, which is what will create the exploration mechanism. So to design this confidence interval... Yes? [Question] Roughly, you basically need the sum of these failure probabilities up to T to remain of order log T or smaller, so something like 1/t, or 1/(t log t), will do the job.
Usually people use one over t squared, but in spirit it is one over t.

Okay, so let me use a standard assumption that I like a lot, the sigma-sub-Gaussian assumption that the previous lecture talked about. It is nice because it simultaneously covers two use cases: Gaussian rewards with variance sigma squared, which can be useful in practice for continuous rewards, and an assumption also very common in RL, rewards bounded in [0,1], because bounded distributions are a particular example of sub-Gaussian distributions with sigma squared equal to one fourth. So this is a way to make UCB work for two different families of rewards. This slide is a recap of Hoeffding's inequality in the sub-Gaussian case. Yes? [Question] I wanted to know what the motivation is behind using optimism, the upper confidence bound: is it just because you can use concentration inequalities to bound this value, or is there another motivation or intuition for why it is good to be optimistic? — Actually, I will show you a quick analysis of UCB: basically, when the means are inside their confidence intervals, you can easily relate the number of selections to the size of the confidence interval and to the gaps, and it really works because of optimism, as you will see in the proof. But how people got the idea that being optimistic is good, I actually don't know; maybe you will develop your own philosophy about that very soon. Thank you.

Okay, so we have Hoeffding's inequality. As you were already told, the difficulty is that you cannot apply this concentration inequality when the number of observations is random, because it creates some weird dependencies, and there are all these counterexamples that were presented. So we have to be very careful in the bandit algorithm — actually I just reviewed a paper that was again applying Hoeffding's inequality with a random number of observations, so don't do that if you ever write a paper on bandits, or at least don't ask me to be the reviewer. Let's just see, on this example, the union bound. What we want to prove is that this quantity, which will be our candidate UCB, is larger than mu_a with probability one minus — I chose a simple — one over t squared. The difficulty is that N_a(t) is random, and what we know how to control is the empirical mean of a fixed number s of observations. So we say that the empirical mean used at time t by the algorithm is the empirical mean based on the first N_a(t) observations, but N_a(t) is a random variable, and how do we cope with it? To prove the statement we need to upper bound the probability of the complement of this event, and to handle the random N_a(t) we just say: well, we know that N_a(t) cannot be larger than t, so if the event fails, there must exist a value of s not larger than t such that the corresponding inequality fails for the empirical mean of s observations. Then we upper bound the probability of the union by the sum of the probabilities — this is the union bound — then we apply Hoeffding to each term, and we get the one over t squared. So just think: union bound.
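To make that argument concrete, here is one way to write it out under the sigma-sub-Gaussian Hoeffding bound, for a bonus of the form sqrt(alpha log t / s); the exact constants on the slides may differ, this is only a sketch:

```latex
\mathbb{P}\!\left(\hat\mu_a(t) + \sqrt{\tfrac{\alpha\log t}{N_a(t)}} \le \mu_a\right)
\;\le\;
\mathbb{P}\!\left(\exists\, s\le t:\ \hat\mu_{a,s} + \sqrt{\tfrac{\alpha\log t}{s}} \le \mu_a\right)
\;\le\;
\sum_{s=1}^{t}\exp\!\left(-\frac{s}{2\sigma^2}\cdot\frac{\alpha\log t}{s}\right)
\;=\; t^{\,1-\alpha/(2\sigma^2)},
```

which is at most 1/t squared as soon as alpha is at least 6 sigma squared; tighter arguments that avoid the plain union bound allow smaller values of alpha.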
There are more sophisticated tools than the union bound, but we will not enter into these nice martingale details today. As I showed you, a possible UCB based on Hoeffding takes the following form: the empirical mean plus what we sometimes call the exploration bonus, of the form the square root of some parameter alpha times log t, divided by the number of draws. We see that the log t comes from the one over t squared we wanted for the coverage probability, and when you don't draw an arm for a while, its mu-hat and its N_a remain frozen while log t keeps growing, meaning that this arm will eventually be selected again. This form of UCB can be traced back to older work, but it was really popularized by the UCB1 algorithm of Auer et al. in 2002, for bounded rewards, which is the special case with sigma squared equal to one fourth; the analysis was later refined in subsequent work.

Before digging into the analysis and understanding at the theoretical level why UCB works, let's see how it works in practice. Here you have a little video showing how the UCBs evolve with time in a five-arm bandit; on the x-axis you see how many times each arm was drawn. At the beginning you have very wide confidence intervals for every arm, and then you see that one arm starts to be drawn more and starts to have a very concentrated confidence interval, whereas the others remain wider. This algorithm actually tends to align the upper confidence bounds of all arms, meaning that at the end the good arms have very small confidence intervals and the bad ones larger ones. Here you really see the expected behavior: the best arm is the third one, and you see that it is drawn an overwhelming fraction of the time.

Indeed, UCB has the same kind of logarithmic regret that we managed to prove for Explore-Then-Commit with a wisely chosen exploration length. The consequence of a bound on the number of selections of each arm, through the regret decomposition, is that the regret is also logarithmic, and here we have what we call a problem-dependent quantity, a sum over arms of the inverse gaps. So again, the more arms are close to the best one, the larger the constant in front of the log: this constant somehow reflects the difficulty of your bandit problem.
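Here is a minimal sketch of this index strategy, pulling the arm that maximizes mu_hat_a + sqrt(alpha log t / N_a(t)); the Bernoulli environment and the particular value of alpha used below are illustrative assumptions.

```python
import numpy as np

def ucb(means, T, alpha, rng):
    """UCB index policy: pull the arm maximizing mu_hat_a + sqrt(alpha * log(t) / N_a)."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    draw = lambda a: float(rng.random() < means[a])   # Bernoulli rewards, bounded in [0, 1]
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                  # initialization: pull each arm once
        else:
            indices = sums / counts + np.sqrt(alpha * np.log(t) / counts)
            a = int(np.argmax(indices))
        r = draw(a)
        counts[a] += 1
        sums[a] += r
    return counts

rng = np.random.default_rng(3)
sigma2 = 1 / 4                                         # rewards in [0,1] are (1/2)-sub-Gaussian
print(ucb([0.5, 0.45, 0.3], T=20_000, alpha=2 * sigma2, rng=rng))
```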
So how do we analyze UCB? We will leverage the upper and lower confidence bounds in the proof; this proof is basically the proof of Auer et al., simplified a little bit. The good event, what happens with high probability, is that the best arm is not underestimated, so it is indeed below its UCB, and the bad arm is not overestimated, so it is indeed above its lower confidence bound. Just with Hoeffding's inequality we can upper bound the probability of the bad event, the complement of this, by two sums, which we bounded on the previous slide, giving two over t squared. To understand why UCB works, you have to understand what happens on the good event. On the good event you know that mu_1 is below its UCB (in red); mu_a, the suboptimal arm whose number of pulls we want to control, is in this proof above its lower confidence bound; and you look at what happens when the algorithm selects arm a. So we know that mu_a is larger than its LCB; we know that mu_1 is larger than mu_a, because 1 is my notation for the optimal arm, maybe that was not obvious, sorry; you know that mu_1 is smaller than its UCB; and, because the algorithm chose arm a, you know that the UCB of arm a is larger than the UCB of arm 1. All of this together tells you that the confidence interval of arm a has to contain both mu_1 and mu_a, and in particular the gap between the two has to be smaller than the width of the confidence interval, which is two times the square root of alpha log t over N_a(t). Inverting this inequality tells you that arm a cannot be drawn too much: N_a(t) is upper bounded by a constant times log t divided by the gap squared, and this is why there is a gap squared in the bound. Indeed, putting things together, to control the expected number of selections of arm a we control the sum of the probabilities of selecting arm a, intersected with the good event or its complement: the complement part sums to a constant, and on the good event we sum the probabilities of selecting arm a while its total number of selections is still below the threshold, and it is not hard to convince yourself that in total this cannot exceed the threshold itself, because of the constraint we just derived. This gives us the nice upper bound I showed you before, which leads to what is called a problem-dependent bound on the regret.
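The key step of that argument, written in one line for the bonus sqrt(alpha log t / N_a(t)) (the constants again depend on the exact analysis):

```latex
\text{if } A_t = a \text{ on the good event: }\;
\mu_1 \;\le\; \mathrm{UCB}_1(t) \;\le\; \mathrm{UCB}_a(t) \;\le\; \mu_a + 2\sqrt{\frac{\alpha\log t}{N_a(t)}}
\;\Longrightarrow\;
\Delta_a \le 2\sqrt{\frac{\alpha\log t}{N_a(t)}}
\;\Longrightarrow\;
N_a(t) \le \frac{4\alpha\log t}{\Delta_a^2}.
```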
A limitation is that the constant in front of the log can really blow up, as I said several times, when the means get close, whereas actually, when you have for example two arms that are both very good, it should be fine to select arm 2 a lot. This phenomenon is captured by what is called a worst-case bound, which you will see tomorrow in the class on contextual bandits and RL. A worst-case bound tells you that the regret on any problem is never worse than something of the form square root of KT; it is a bound that does not depend on the specific means of the problem, which is why it is sometimes called problem-independent, or worst-case. I will skip the proof in the interest of time, but the idea is that whenever you have a problem-dependent guarantee, you can write the regret by distinguishing between arms that are very close to the optimal arm, for which it is more interesting to bound their contribution by Delta times T, and arms with a larger gap, for which it is more interesting to use the problem-dependent bound; by carefully choosing the threshold between the two kinds of arms and doing some maths, you end up with this minimax bound. To wrap up on bounds: the proof I showed you is an easy proof, but it has been refined with better concentration inequalities and some nice tricks in the literature, so just for completeness let me state the best available off-the-shelf bounds. This is interesting because it allows a tighter UCB, which has practical consequences when you apply the algorithm. For sigma-sub-Gaussian rewards, you can take the UCB to be mu-hat plus the square root of two sigma squared log t divided by N_a(t), and this is the best constant you can get in front of the log for this kind of algorithm. So for the algorithm I called UCB(alpha), for sigma-sub-Gaussian rewards, what is proved is that if we set the parameter alpha — which was governing the size of the exploration bonus — to two sigma squared, we get a worst-case bound of square root of KT log T and a problem-dependent bound scaling with this quantity. And now it is time to ask whether these are good regret bounds.

[Question] Yes? — It could be, yes, I probably skipped it. It still depends on the variance, but not on the means; so it depends on the distributions, but much more lightly, you're right. Okay, so here, is it under the square root maybe? Yes, sigma squared under the square root.

Okay, so are these bounds good? Let's first talk about the minimax bound. There is a result in the literature saying that for every bandit algorithm you can find a crazily hard instance on which the regret has to be larger than square root of KT, meaning that a square root of KT log T bound is pretty good. There are better algorithms that even shave off the log T, but they are a bit more complex to present. It is interesting to look at what the hard bandit instance in this bound is: it is basically a Bernoulli bandit where all the means are one half, except one which is larger by a tiny margin of order one over square root of T. Those are not the typical bandit problems you encounter in practice; this is why, when the gaps are large, you also want to know what the best problem-dependent regret is, and this is given by a lower bound that has existed in the statistics literature since the 1980s, due to Lai and Robbins. The original bound applies to simple parametric models, for example all arms Bernoulli, or all arms Gaussian, which are the two cases we covered under the sub-Gaussian framework. And the bound does not feature gaps, it features a Kullback-Leibler divergence between the distributions: for Gaussian bandits this KL divergence equals the squared gap between the means divided by twice the variance, whereas for Bernoulli bandits it is the more complex binary KL divergence that you saw before. The result is an asymptotic lower bound saying that when T is large, the expected number of times a suboptimal arm must be selected is at least log T divided by the KL between the distribution with mean mu_a and the distribution with mean mu*.

So how are we doing with respect to this lower bound? For Gaussian bandits the lower bound becomes two sigma squared divided by the gap squared, times log T, and this is precisely the upper bound we got when applying UCB with the right parameter alpha. So UCB based on Hoeffding is already asymptotically optimal for Gaussian rewards. However, for Bernoulli there is a slight mismatch between the lower bound and the upper bound, due to the fact that the upper bound features gaps whereas the lower bound features a KL. To close this gap you have to use more sophisticated confidence intervals, based on the KL inequality that was shown to you earlier: it is an inequality that holds for any simple parametric family, like a one-parameter exponential family, where you take the KL to be the divergence in that family — for Bernoulli the binary KL divergence, for Gaussian the Gaussian one. The shape of the confidence interval is then a bit less explicit: here in blue you have kl(mu-hat, q) as a function of q, you threshold it at some level log t divided by N_a(t), and that gives your UCB. So this requires solving a simple one-dimensional optimization problem; it is not a big overhead, but it is still less explicit than mu-hat plus the square root of something over N.
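To make the index computation concrete, here is a sketch for Bernoulli arms: the index is the largest q such that N_a(t) times kl(mu_hat_a, q) stays below the exploration level, found by bisection. I use log(t) as the level for simplicity; the actual papers use slightly inflated versions.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def klucb_index(mu_hat, n, level, tol=1e-6):
    """Largest q in [mu_hat, 1] such that n * kl(mu_hat, q) <= level, by bisection."""
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n * kl_bernoulli(mu_hat, mid) <= level:
            lo = mid          # mid is still feasible, move the lower end up
        else:
            hi = mid          # mid violates the constraint, move the upper end down
    return lo

# example: an arm with empirical mean 0.2 after 50 pulls, at time t = 1000
print(klucb_index(0.2, n=50, level=np.log(1000)))
```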
And what was proved is that when applying this algorithm, called KL-UCB, which maximizes this more complex UCB, we can exactly match the Lai and Robbins lower bound; this was proved in 2013 by Cappé et al. So this gives us an asymptotically optimal algorithm in the problem-dependent sense.

The last algorithm close to this family that I wanted to mention, before talking a bit about randomized algorithms, is a nice algorithm called IMED, which is slightly different from UCB and in particular has the nice property of not featuring any confidence interval tuned with concentration inequalities. In UCB or KL-UCB there was always this log t coming from the fact that we needed an upper confidence bound valid with high probability, and those constructions, based on maths, are often not very tight: if you measure through empirical simulations what the actual coverage is, you usually get something much bigger than what you asked for with the concentration inequality. This is why it is nice to have algorithms that do not depend on any tuning of confidence regions, and IMED has this property. It was originally proposed for bounded distributions, but it can be used under the same assumptions as KL-UCB, for example Bernoulli, Gaussian or one-parameter exponential families; it also requires knowing the KL function that appears in the lower bound for the corresponding family. What the algorithm does is compute, over all the arms, the largest empirical mean found so far, and then select the arm that minimizes the KL divergence between mu-hat_a and this proxy for mu* (which we don't know), multiplied by the number of selections of arm a. For this to work we need a regularization, an additional log N_a(t), to make sure that all arms are eventually drawn sufficiently often: if an arm is not drawn, its KL term does not change while the others evolve, so this term is the mechanism that makes us select all arms sufficiently many times. IMED is not, strictly speaking, what we call an index policy in bandits, because an index policy is something like UCB that computes for each arm a quantity depending only on the rewards obtained from that arm; here, the index of arm a, if you will, also depends on mu*-hat, that is, on what we observed on the other arms. So it is a slightly different family of algorithms, with a slightly different kind of analysis that I will not go into, but I recently came across the fact that in practice, as you will see in a few slides, it is usually a bit better than its KL-UCB counterpart. So this was some advertisement for IMED.
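Here is a minimal sketch of that index, for Bernoulli arms; the Bernoulli setting is again an illustrative choice.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q)."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_choose(sums, counts):
    """IMED: pick argmin over arms of N_a * kl(mu_hat_a, max_b mu_hat_b) + log(N_a)."""
    mu_hat = sums / counts
    mu_star_hat = mu_hat.max()                  # best empirical mean, proxy for mu*
    indices = counts * np.array([kl_bernoulli(m, mu_star_hat) for m in mu_hat]) \
              + np.log(counts)
    return int(np.argmin(indices))

sums = np.array([3.0, 2.0, 1.0])    # rewards collected per arm so far
counts = np.array([5.0, 5.0, 5.0])  # pulls per arm so far
print(imed_choose(sums, counts))
```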
Okay, so this was all still about uncertainty contained in confidence intervals, which in statistics is usually associated with the frequentist world that the previous lecture talked about. Now I want to make a little detour into the Bayesian world, because it is an interesting story in the bandit literature. I told you that the bandit papers in statistics started in the 1950s, but the very first bandit algorithm can actually be traced to a paper from 1933 by Thompson, and the algorithm suggested by this paper is now called Thompson sampling and is used quite a lot in practice. The way it does exploration is quite different from UCB, which is why I put it in the category of algorithms that explore through randomization. So how does Thompson sampling actually randomize?

It comes from the Bayesian paradigm, meaning that the uncertainty is not summarized by confidence intervals but rather by distributions. In a Bayesian framework you assume some prior on the unknown parameters, here in red: you say, for example, that each mean is drawn from a uniform distribution over [0,1], because I know I am facing Bernoulli arms but I don't know where the means lie, and I encode this by saying each mean is sampled from a uniform distribution. Then, treating the mu's as random variables, what a Bayesian computes is the posterior distribution: the posterior distribution of arm a is the conditional distribution of the random variable mu_a given the observations made so far. In the particular example of Bernoulli bandits, the standard, easy-to-manipulate posteriors are Beta distributions; we will come back to this. Here you have to tilt your head a bit, but what you see is the density of the posterior distribution for each arm, and you see a bit the same phenomenon as in the pictures with confidence intervals: some arms have very peaky posterior distributions — those you have sampled a lot, so the posterior has become very concentrated — whereas for some arms, like this one for which I only got four samples, the posterior is still very wide.

So what does Thompson sampling do with these posteriors? The way it was written in the 1933 paper is: let us randomize over the arms, and set the probability of selecting a given arm to be the posterior probability that this arm is the best. Concretely, you have these Beta distributions and you want to compute the probability that one of them gives a sample larger than samples from all the others; even nowadays this is a hard computation to perform, and in 1933 it was even worse, even for two arms. But to draw a sample from the randomization Thompson suggested, you can convince yourself that it is equivalent to just drawing one sample from each posterior distribution and selecting the arm that gave the largest sample, because the probability of selecting a given arm is then precisely the probability that theta_a is larger than all the others, which is precisely the posterior probability that this arm is optimal. So nowadays Thompson sampling is presented like this: your posterior encodes, or can generate, possible worlds, possible MDPs, possible bandits; you sample a possible model of your problem, you pretend this is the truth, and you choose the arm that has the largest sampled mean.

Here is an illustration, again for Bernoulli distributions, for which I told you a natural prior is the uniform prior, so that the posteriors belong to the family of Beta distributions. They look a bit like twisted Gaussians — well, in some border cases they can really look more like exponentials — but a Beta distribution is a well-known distribution, tabulated in all your software, with two parameters: here the first parameter is driven by S_a(t), the number of successes, the sum of rewards obtained from arm a, and the second by N_a(t) - S_a(t), the number of zeros. Maintaining this posterior is really computationally efficient: we sample an arm, we observe whether we got a zero or a one; if we got a one we update the first parameter, if we got a zero we update the second parameter.
So it is something really easy to store. What Thompson sampling does is, for each arm, draw a sample from the corresponding posterior distribution, and then select the arm that gave the largest sample. If you want to implement this with Gaussian distributions — Bernoulli and Gaussian were my two examples — then you change the prior, and you can choose it in such a way (it is actually not a probability distribution, but what is called an improper prior in Bayesian statistics) that your posterior is a Gaussian distribution with mean mu-hat_a(t), the empirical mean, and variance sigma squared over N_a(t). This means that Thompson sampling can also be viewed as another fix of the greedy strategy: instead of using the empirical mean as the index for each arm, it adds a bit of noise to it, sometimes lowering it, sometimes pushing it higher. So it is another way to randomize, and randomization is what achieves exploration here: in this two-arm scenario I drew a sample theta_1 for the arm in red and a sample theta_2 for the arm in blue, and since theta_1 is larger than theta_2 I pick arm 1; but sometimes, because of randomness, we will get a sample of blue in its right tail and a sample of red in its left tail, and so we will keep exploring the other arm just the right amount of time.

On the theoretical level, what is known for Thompson sampling? Despite the fact that the algorithm is Bayesian, the guarantees we have are frequentist. What I mean by frequentist guarantees is that they are the same kind of guarantees as for UCB: we use the prior and posterior in the algorithm, but we can prove that for any choice of means in a given parametric family, for example Bernoulli, we can control the number of selections of all the suboptimal arms when this is the true bandit, and we get exactly the bound with the KL. This was first proved for the Bernoulli bandit with the uniform prior I used in the introduction, and it is also known for one-dimensional exponential families, with the corresponding KL function, so Thompson sampling really matches the lower bound we discussed. But beyond one-parameter models — Claire will talk tomorrow about Thompson sampling in the more complex contextual bandits — it is usually harder, and you have to choose your prior carefully. Even so, in a paper by Honda and Takemura they considered the case of Gaussian distributions where you also don't know the variance; you can still do Thompson sampling, putting a prior on both the mean and the variance and sampling them, but then it does not work for every choice of prior. So this is an interesting phenomenon.
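Here is a minimal sketch of Bernoulli Thompson sampling with uniform (Beta(1,1)) priors, as described above; the environment is again an illustrative Bernoulli bandit.

```python
import numpy as np

def thompson_sampling(means, T, rng):
    """Bernoulli Thompson sampling with Beta(1,1) priors on each mean."""
    K = len(means)
    successes = np.zeros(K)   # S_a: number of ones observed on arm a
    failures = np.zeros(K)    # N_a - S_a: number of zeros observed on arm a
    counts = np.zeros(K)
    for t in range(T):
        # one sample per arm from its Beta(S_a + 1, N_a - S_a + 1) posterior
        theta = rng.beta(successes + 1, failures + 1)
        a = int(np.argmax(theta))
        r = float(rng.random() < means[a])
        successes[a] += r
        failures[a] += 1 - r
        counts[a] += 1
    return counts

rng = np.random.default_rng(4)
print(thompson_sampling([0.5, 0.45, 0.3], T=20_000, rng=rng))
```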
Okay, so I told you I would show you a bit of the practical performance, because this has been a lot about theory so far. Let's recap the different algorithms I showed you, starting with the optimistic family. Here I am running an experiment on Bernoulli arms: there are ten arms that I chose, and this scenario comes from the KL-UCB paper, I think. It is supposed to represent what happens in online advertisement, where the click probabilities are typically quite small, and when you have ten arms with quite small means, this is where you get the largest difference between an algorithm that adapts to the geometry of Bernoulli random variables and an algorithm that just uses a simple sub-Gaussian assumption. UCB is the simple UCB with the explicit form based on Hoeffding; KL-UCB is the more complex confidence interval based on the KL inequality; IMED is this hybrid algorithm that uses the best empirical mean in the bandit and computes divergences; and what is IMED-SG? It is the IMED algorithm where, instead of the Bernoulli KL, I use the sub-Gaussian KL, so IMED-SG is supposed to be close to UCB and IMED is supposed to be close to KL-UCB. And then we have Thompson sampling. I ran this little experiment — you may run something very similar tomorrow in the practical session — and this is the regret as a function of time, estimated over, I don't remember exactly how many, at least a thousand independent runs, because the regret is an expectation, so to estimate it you have to use Monte Carlo estimation. Smaller is better, since it is regret. All the curves have a kind of logarithmic or square-root profile, it is hard to distinguish, but we see that IMED and Thompson sampling are the best among these algorithms, and note in particular the gap between UCB on one side and KL-UCB and IMED on the other. That is really the point: UCB is not supposed to be optimal for this particular family of distributions, and this shows that it is also reflected in practical experiments. About the two most performant algorithms, I just want you to observe that they are the only ones that do not depend on any tuning made with confidence intervals: IMED, and Thompson sampling, which is just sampling from a posterior, not choosing some quantile of it or anything like that. So that is what I wanted you to remember for the practical story.

I don't know how much time I have left — fifteen minutes maybe? Okay, well, then people will be tired. The last part of the talk is something a bit more recent that I worked on with some students: trying to have something that could work across, for example, different parametric families. Because if you think about those algorithms — I pointed out the gap between UCB and KL-UCB — KL-UCB is optimal for Bernoulli, but if you use it for Gaussian you will not even be computing the right KL, which is a first problem. What I mean is that you need some prior knowledge about the distributions you are facing to get such good performance. For Bernoulli this is usually considered okay, because you usually know when your data is binary, but when you go to more complex assumptions, like assuming sub-Gaussian with a given sigma, does it make sense? Or when the rewards are bounded, do we really know the bound? So ideally we would want a single algorithm that would be optimal on any class of rewards you instantiate it on, even without knowing that class. This reflection made me look, in recent years, into these non-parametric approaches, and I will try to present one or two algorithms that I liked from this family.
As I said, Thompson sampling usually relies on a parametric assumption: Thompson sampling for Gaussian rewards is not the same algorithm as Thompson sampling for Bernoulli, since one draws samples from a Gaussian posterior and the other from a Beta posterior, and for other exponential families you get yet other families of posteriors. So the idea, to extend Thompson sampling to more complex, possibly non-parametric families of distributions, is to replace the sampling from a posterior by another randomization mechanism that only uses the information you actually have, which is: these are the rewards I collected; maybe they come from Gaussians, maybe from exponentials, maybe from heavy-tailed distributions, I don't know, I just know I got these rewards.

For this, a natural idea is to use what statisticians call the non-parametric bootstrap. You have your history, and you do not want to summarize it just by its empirical mean, because by now you have understood that plain estimators are not reliable and that we should incorporate some exploration. So the idea is to perturb the sample a bit by drawing, at random with replacement, n samples from this batch, where n is N_a(t), the number of rewards observed for arm a: you pick a first one from the history, put it back, pick another one, and so on; some rewards will appear twice, some will never appear, but this produces a noisy sample, and on this modified history, called the bootstrap sample, you compute the empirical mean. This empirical mean of the bootstrap sample defines your index, and you choose the arm with the largest index. For a statistician this is somehow the most natural non-parametric variant of Thompson sampling, and it could be applied to any distribution. But what was proved, again in this nasty case of two Bernoulli arms, is that if you do this you can get linear regret, and the reason is again that if you are unlucky with the best arm at the beginning, you can be stuck for a really long time with a history containing a single reward equal to zero, because you would need the bootstrap samples of the other arm to contain only zeros. I think in that proof they show linear regret for a tie-breaking rule that always breaks ties in favor of the best arm; I don't know what happens if you add randomness, but I think it would still take a very long time to recover from a bad initial history for the best arm.

A possible fix is what is called Perturbed-History Exploration: you introduce some extra exploration by adding fake rewards to the history, which skews it towards being more uniform. If you know that your rewards are in [0,1] and you have N_a(t) samples in the history of arm a, you add a times N_a(t) fake rewards generated completely at random, and then you do the bootstrapping based on this augmented history. Yes? [Question] — Yes, this is a very bad notation on my part, let's call this parameter a; it is a hyperparameter of the algorithm, thank you for the catch. For a strictly larger than two, they managed to prove logarithmic regret, expressed with the gaps, and decent practical performance.
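Here is a minimal sketch of one round of this perturbed-history index for rewards in [0,1], following the description above; the fake rewards drawn uniformly in {0,1} and the tie-breaking are my own illustrative choices.

```python
import numpy as np

def phe_index(history, a, rng):
    """Bootstrap mean of the history augmented with a*len(history) random fake rewards."""
    n = len(history)
    fake = rng.integers(0, 2, size=int(np.ceil(a * n))).astype(float)   # fake rewards in {0, 1}
    augmented = np.concatenate([np.asarray(history, dtype=float), fake])
    boot = rng.choice(augmented, size=augmented.size, replace=True)      # resample with replacement
    return boot.mean()

rng = np.random.default_rng(5)
histories = [[1, 0, 1, 1], [0, 0, 1]]                                    # observed rewards per arm
indices = [phe_index(h, a=2.1, rng=rng) for h in histories]
print(int(np.argmax(indices)))                                            # arm chosen this round
```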
But again, I have this little twist of always wondering whether a method is optimal, and indeed you can ask: is this the best we can do if we know that the distributions are bounded? So far I did not tell you what the right lower bound is for complex non-parametric classes of bandit rewards — you will see it, not on this slide, but on the next one — but statisticians have also proposed optimal algorithms for bounded rewards. The algorithm I am talking about here was called Non-Parametric Thompson Sampling by its inventors, Riou and Honda, in 2020. It is an algorithm for rewards bounded in a known range [0, B], and the idea is that instead of adding samples and doing resampling, we just re-weight all the samples completely at random. Interestingly, for this to work — in order to avoid the problem of being stuck with a history that was bad at the beginning — it is very important to add to your history the possibility of the largest reward you could get, so you re-weight at random what I call an augmented history: if at time t, for arm a, you have collected N_a(t) rewards, you pretend you also collected one more reward equal to B, and then you re-weight them at random. Concretely, you draw a random probability distribution supported on these N_a(t) + 1 points — this is a Dirichlet(1, 1, ..., 1) distribution with N_a(t) + 1 ones, which actually gets a bit nasty to sample from when n gets large, so this is a computational bottleneck of these methods — and then you compute the empirical mean, but not with weight one over n for each sample: with the weights given by the probability vector you drew.

This is a pretty nice idea in my view, and you can give it an even more general interpretation in terms of empirical distributions. The empirical mean, in this non-parametric setting, can be viewed as the expectation of the empirical distribution of the rewards you got; what this procedure does is produce a noisy variant of the empirical distribution, changing the usual one-over-n weights to these random weights, and then compute the mean of this modified, perturbed CDF. So instead of perturbing parameters, as we did when sampling from a Beta posterior, here we perturb the empirical CDF itself. Doing so allows you to generalize the idea to something that has been looked at recently in bandits, and also in RL, which is to consider a performance measure that is not the mean: for example, if you are interested in a risk-averse criterion, like evaluating an arm by a trade-off between mean and variance, or by a conditional value at risk, you can just perturb your CDF and compute whatever risk measure you are interested in for the perturbed CDF.
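Here is a minimal sketch of that re-weighting index for rewards known to lie in [0, B], following the description above; only the index computation is shown, and at each round the algorithm would pick the arm with the largest index.

```python
import numpy as np

def npts_index(rewards, B, rng):
    """Random re-weighting of the history augmented with one extra reward equal to B."""
    augmented = np.append(np.asarray(rewards, dtype=float), B)   # add the best possible reward
    weights = rng.dirichlet(np.ones(augmented.size))             # random probability vector
    return float(weights @ augmented)                            # re-weighted "empirical" mean

rng = np.random.default_rng(6)
histories = [[0.7, 0.4, 0.9, 0.6], [0.2, 0.3]]                    # observed rewards per arm, B = 1
indices = [npts_index(h, B=1.0, rng=rng) for h in histories]
print(int(np.argmax(indices)))                                     # arm chosen this round
```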
This was a parenthesis about some work we did on CVaR bandits with a student, Dorian Baudry, which is not on the slides, but I felt like I had some time. Okay, let's go back to the standard non-parametric Thompson sampling for maximizing the sum of expected rewards in a bandit where we know that the rewards are bounded. What Riou and Honda proved for NPTS is that its regret is logarithmic, scaling again as a sum over arms of log T divided by some quantity, and this quantity generalizes the notion of KL that we had in the simple parametric models: this K-inf is the smallest value of KL(nu_a, nu') over distributions nu' with bounded support whose mean is larger than mu*. For exponential families — for Bernoulli, for Gaussian — it boils down exactly to the KL we had before, but for bounded distributions it is trickier to compute, and there is also a lower bound showing that this is the minimal number of samples needed, in a problem-dependent sense, in this non-parametric class of bounded bandits.

Okay, so those are the ideas behind the non-parametric algorithms that my student Dorian worked on a bit. A limitation of NPTS — well, it is not really a limitation, it is designed for bounded distributions — but you can always ask: can I go further and have a universal algorithm for bounded distributions with an unknown value of the bound? There are papers saying that it is not possible to get logarithmic regret for bounded distributions with unknown range, but you could maybe get log-squared regret or something like that, and Dorian investigated this by changing NPTS a bit, replacing the unknown B by some other data-dependent quantity; this is the idea of the Dirichlet-sampling type of algorithms. Then there is another family, pioneered by an algorithm named BESA in 2014 by Baransi et al., where the idea is also to use a sub-sampling mechanism, but in pairwise comparisons of the arms. These are special algorithms that, in each round, do not compute an index per arm, but compare the arm that looks best so far, called the leader, to all the other arms. To do a comparison you could again think of the usual suspect, just comparing the empirical means of the two, but this is very unfair, because the leader typically has many more samples than the challenger, so the idea is to make a fair comparison by removing, on purpose, some samples from the history of the leader, so as to use the same number of samples as the challenger. There are many algorithms that use this kind of pairwise comparison, with interesting mechanisms to sub-sample the history of the leader; here are just a few references if you are curious about this line of work.

So, enough advertisement for recent work; let's go back to the general conclusion for this reinforcement learning class. What I showed you is actually several principles to balance exploration and exploitation. I guess so far you had seen only one, epsilon-greedy, which everyone applies on top of a greedy strategy, and I convinced you in one of the first slides that, even for bandits, which are of course much simpler than MDPs,
this is not the most desirable algorithm. The most interesting ones are those using the very general optimistic principle: whenever you are able to compute confidence regions on your model, and not just point estimates, you can try to use a similar philosophy, even if computing an optimistic MDP might be harder than just computing the best plausible bandit problem. And then there is the family of randomized algorithms extending this very interesting posterior sampling idea: whenever you have access to a posterior distribution on your model, or to something that generates plausible models, you can apply the idea of drawing a possible model at random and then acting optimally, solving your MDP, or your bandit, with the parameters of the sampled model. As you will see in the next classes of these two days, these two principles can be extended to the more complex bandits that Claire will talk about tomorrow, and they can also inspire exploration strategies that I hope Alessandro will talk about tomorrow in his class on exploration in RL. Also, this afternoon you might recognize UCB in the lecture on Monte Carlo tree search, because bandit strategies have indeed inspired these very powerful Monte Carlo tree search methods that adaptively explore a search tree in a large space. So this is what you will explore, in connection with this class, over the next days. Thank you.