Thank you. So first I'd like to thank the organizers for inviting me to this great workshop. This talk will be about a statistical model called the multi-armed bandit model, in which an agent interacts with a set of probability distributions called arms by sequentially sampling from them. This model is often used within a reinforcement learning framework, in the sense that the samples collected are viewed as rewards that the agent wants to maximize; equivalently, we say that we want to minimize a quantity called the regret. While this regret minimization problem can be considered solved, thanks to lower bounds and algorithms matching these lower bounds, I will talk today about a much less understood problem, that of best arm identification, in which the goal is to identify as quickly and accurately as possible the arm with highest mean in the model. In this joint work with Aurélien Garivier, I will present a new lower bound for this problem, together with an algorithm that asymptotically matches our lower bound and is also very efficient in practice.

So first I will start by reminding you, or explaining to you, what the multi-armed bandit model is. It is simply a collection of K probability distributions that we call arms. An agent sequentially interacts with these arms by choosing at time t an arm A_t to draw, and he then observes a sample from the associated probability distribution. Of course his sampling strategy is sequential, in the sense that the arm chosen at time t+1 may depend, in some arbitrary way, only on the previously chosen arms A_1, ..., A_t and the previously observed samples X_1, ..., X_t. The way he samples the arms will be directed toward a goal related to learning which arm is the best in the model, and our criterion for "best" will be the arm with highest mean: we want to identify the arm a* that maximizes the mean, and we will denote by μ* the mean of this arm.

This learning process can be subject to several constraints, and the first constraint considered in the literature is the one called regret minimization, in which the samples collected are viewed as rewards and the goal is to adjust the sampling strategy so as to maximize the expected sum of the rewards accumulated during the interaction. This is equivalent to minimizing the quantity defined here as the regret, which is the expected difference between the rewards we could accumulate by always playing the arm with mean μ* (which of course we don't know during the real interaction) and the sum of rewards obtained with our actual strategy. Minimizing the regret thus forces a trade-off between exploring the environment, trying all the arms a little bit to estimate their mean payoffs, and playing the arms that have looked best so far, because we have this constraint of maximizing the rewards while we learn.
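To make the protocol concrete, here is a minimal Python sketch of this interaction, assuming Bernoulli arms; all names (BernoulliBandit, the uniform strategy) are illustrative choices of mine, not objects from the talk.

```python
import numpy as np

class BernoulliBandit:
    """A bandit model: K probability distributions ("arms"), here Bernoulli."""
    def __init__(self, mu, rng=None):
        self.mu = np.asarray(mu, dtype=float)            # vector of arm means
        self.rng = rng if rng is not None else np.random.default_rng()

    def draw(self, a):
        """Drawing arm a returns one sample from its distribution."""
        return float(self.rng.random() < self.mu[a])

# Sequential interaction: A_{t+1} may depend only on A_1..A_t and X_1..X_t.
bandit = BernoulliBandit([0.5, 0.45, 0.3])
rng = np.random.default_rng(0)
history = []
for t in range(10):
    a = int(rng.integers(len(bandit.mu)))                # naive uniform strategy
    history.append((a, bandit.draw(a)))
```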
So originally this model arose from a simple modeling of clinical trials; this dates back to the 1930s with the work of Thompson. For example, we have a bunch of medical treatments, each associated with a Bernoulli random variable that models the variability of the treatment across patients and that gives one if the treatment was successful and zero if the patient dies. In a clinical trial the goal would of course be to maximize the number of patients still alive at the end of the trial, which would be a rephrasing of maximizing the expected sum of rewards. But if you discuss with people running real clinical trials these days, in the early phases they are actually not so concerned with curing patients while they learn the efficacy of the treatments; they would be more interested in an alternative objective: learning quickly which treatment is the best among the pool of candidate treatments, because in the later phases this treatment will be given to a much larger population of patients. So they would be more interested in this best arm identification problem, which I now introduce a little more mathematically.

The goal is to identify quickly the arm a*, the arm with highest mean, but this time without any incentive to draw arms that have high means: we are only looking for a strategy that optimally explores the environment. An important part of the strategy will still be the way we choose the arm to draw at the current stage based on the previous history (the sampling rule), but we also need a random stopping time telling us when we can stop the experiment, that is, when we are convinced that we can identify the best arm, after which we make a guess â for the arm a* (the recommendation rule). Several goals have been considered in the literature; I give two here. Either we fix a budget, so we know we can only sample the arms T times, and we want to make a recommendation that is as accurate as possible, minimizing the probability of a mistake; or we fix some risk parameter δ, we want to guarantee that the recommendation is wrong with probability at most δ, and we want to need as few samples from the arms as possible to reach this recommendation, that is, to minimize what we call the sample complexity. This framework can model a clinical trial, as I said, but it is maybe more relevant for some market research applications in which, for example, a company has to decide which product to commercialize: it wants to identify the best product with high probability, and the company accepts losing a little money during the learning phase in order to make much more money afterwards.

In this talk we focus on the fixed-confidence setting. More precisely, given a class of bandit models, a set S of possible bandit models (for example, all arms are Bernoulli distributions), we want to build strategies that are δ-PAC on this class, that is, strategies guaranteed, for any bandit model in the class, to output the best arm with probability larger than 1 - δ. Among δ-PAC strategies we want to minimize the sample complexity, and in this talk I want to tell you what the minimal expected number of samples needed by a δ-PAC algorithm is. The answer will consist, first, in a lower bound on this sample complexity, and then in a δ-PAC strategy whose expected sample complexity matches the lower bound.
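A fixed-confidence strategy is thus a triple (sampling rule, stopping rule, recommendation rule). Here is a generic, illustrative loop in that shape, reusing numpy and the BernoulliBandit above; run_fixed_confidence and its argument conventions are my own sketch, not the paper's interface.

```python
def run_fixed_confidence(bandit, sampling_rule, stopping_rule, delta,
                         horizon_cap=10**6):
    """Sample until the stopping rule fires, then recommend the empirical best.
    Returns (recommendation, number of samples used)."""
    K = len(bandit.mu)
    counts = np.zeros(K, dtype=int)   # N_a(t): number of draws of each arm
    sums = np.zeros(K)                # running sum of samples per arm
    means = np.zeros(K)               # empirical means mu_hat(t)
    for t in range(1, horizon_cap + 1):
        a = sampling_rule(t, counts, sums)          # which arm to draw at time t
        x = bandit.draw(a)
        counts[a] += 1
        sums[a] += x
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        if stopping_rule(t, counts, means, delta):  # stopping time tau_delta
            return int(np.argmax(means)), t         # recommendation rule
    return int(np.argmax(means)), horizon_cap
```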
The lower bound will be distribution-dependent, that is, it depends on the class of bandit models we consider, and we solve the problem for a particular type of one-parameter bandit models that we call exponential family bandit models. So we study a class of bandit models in which the distributions of all arms belong to a set of probability distributions parameterized by a real parameter θ, called the natural parameter, whose densities have the form f_θ(x) = A(x) exp(θx - b(θ)). By particularizing the function b, this class recovers many well-known distributions: Bernoulli distributions, Poisson distributions, Gaussians with known variance, and so forth. A good feature of these one-parameter classes is that the distribution ν_θ can also be reparameterized by its mean, because there is a one-to-one mapping between the natural parameter θ and the mean; and as our parameter of interest in the bandit problem is the mean, we will denote by ν_μ the unique distribution in the class that has mean μ. I also introduce an important notion to characterize the complexity of the problem in information-theoretic terms: the Kullback-Leibler divergence, parameterized by the means of the distributions. We define d(μ, μ') as the Kullback-Leibler divergence between the unique distribution of mean μ and the unique distribution of mean μ'. We can give a closed form for the particular examples I mentioned; in the Bernoulli case, d(μ, μ') = μ log(μ/μ') + (1 - μ) log((1 - μ)/(1 - μ')). The class of bandit models we consider will then be the following class S: to ease notation, as each arm depends on a single parameter, I identify a bandit model with the vector of the means of its arms, and I consider all bandit problems in which there exists one arm whose mean is strictly larger than the others', that is, the set of all bandit models with a unique optimal arm.
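Since everything that follows is expressed through this divergence function d, here is the Bernoulli closed form above transcribed directly into Python (the clipping constant is my own guard, not part of the definition):

```python
import numpy as np

def kl_bernoulli(mu, mu_prime, eps=1e-12):
    """d(mu, mu'): KL divergence between Bernoulli(mu) and Bernoulli(mu')."""
    mu = np.clip(mu, eps, 1 - eps)            # avoid log(0) at the boundary
    mu_prime = np.clip(mu_prime, eps, 1 - eps)
    return (mu * np.log(mu / mu_prime)
            + (1 - mu) * np.log((1 - mu) / (1 - mu_prime)))
```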
OK, so before presenting the lower bound and the matching algorithm, I am going to review quickly why I said in the introduction that the regret minimization objective is solved; there we will find some inspiration for what we want to prove for the best arm identification problem. For regret minimization, an important result that dates back to the 1980s is a lower bound on the regret given by Lai and Robbins, which follows from a rewriting of the regret as a function of the number of times each arm has been drawn up to time T. More precisely, the regret is the sum over the arms of the suboptimality gap, the gap between the mean of the best arm and the mean of arm a, multiplied by the expected number of times arm a has been drawn up to time T: R_T = Σ_a (μ* - μ_a) E[N_a(T)]. So to have an algorithm with low regret we need an algorithm for which the expected number of draws of any suboptimal arm is small, and the lower bound of Lai and Robbins gives a limit on how small this can be: it tells us that suboptimal arms must be drawn infinitely often, and more specifically that the expected number of draws of a suboptimal arm a up to time T is asymptotically lower bounded by log(T)/d(μ_a, μ*). Here the Kullback-Leibler divergence between the distribution of mean μ_a and the distribution of mean μ* appears, and of course we understand that the smaller this quantity, the more the arm has to be drawn, because we have trouble discriminating between this arm and the optimal arm. This lower bound permits us to define a notion of asymptotic optimality: an algorithm is asymptotically optimal if it matches the lower bound, that is, if the expected number of draws of any suboptimal arm a is asymptotically upper bounded by log(T) divided by this same information-theoretic quantity.

Quickly, I show you that there exists such an asymptotically optimal algorithm, based on the so-called UCB principle. It is a very simple algorithm that computes one index per arm and chooses the arm with the highest index, the index being an upper confidence bound on the mean of the arm. Several UCB-type algorithms have been proposed, but to get the asymptotic optimality property one has to be careful in how the confidence intervals are built. Here we have a non-explicit upper confidence bound computed using the function d itself, the Kullback-Leibler divergence of the exponential family we consider. These confidence intervals follow by applying the Chernoff method and rely on specific properties of exponential families, but practically, to compute the index we just need the function x ↦ d(μ̂_a(t), x), thresholded at the level log(t) divided by the number of draws: the index is the largest q such that N_a(t) d(μ̂_a(t), q) ≤ log(t). This gives the so-called KL-UCB index.
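As a sketch of that computation, the KL-UCB index can be obtained by bisection, since q ↦ d(μ̂, q) is increasing for q above the empirical mean; this reuses kl_bernoulli from above, and the tolerance is an arbitrary choice of mine.

```python
def klucb_index(mu_hat, n_draws, t, tol=1e-6):
    """Largest q with n_draws * d(mu_hat, q) <= log(t), found by bisection."""
    level = np.log(t) / n_draws
    lo, hi = mu_hat, 1.0                      # the index lies above mu_hat
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid                          # mid still inside the confidence region
        else:
            hi = mid
    return lo
```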
This KL-UCB algorithm has been shown to be asymptotically optimal through a finite-time analysis: the paper upper bounds the expected number of draws of any suboptimal arm a by log(T)/d(μ_a, μ*) plus a second-order term that is smaller than log(T). This proves the following: the infimum, over all consistent algorithms, of the limit of the regret divided by log(T) is exactly equal to the sum over the arms of the gap between the mean of the best arm and the mean of arm a, divided by the Kullback-Leibler divergence between μ_a and μ*. This result dates back to 1985, because in addition to the lower bound Lai and Robbins actually proposed an asymptotically optimal algorithm; however, that algorithm was not really explicit or practical, and more efficient algorithms were proposed afterwards, up to algorithms like KL-UCB that are simple to implement and also asymptotically optimal. For the best arm identification problem that we are looking at today, I am going to try to provide the same first two steps, and we will see that the algorithm will also be efficient in practice. Let me quickly recall the best arm identification problem: whenever I work with a bandit model parameterized by μ, I assume for simplicity that the arms are ordered in decreasing order of their means, with a gap between μ_1 and μ_2; as I told you, an algorithm is made of three things, the sampling rule, the stopping rule and the recommendation rule; we must guarantee that for any bandit model μ the probability that the recommendation rule outputs the optimal arm is larger than 1 - δ; and we want a small sample complexity, that is, a small expected number of draws of the arms.

In the literature we can find a lot of δ-PAC algorithms for which bounds on the sample complexity are given, either in expectation or with high probability, but the existing bounds scale like O(C(μ) log(1/δ)), where δ is our risk parameter and C(μ) is a quantity that depends on the bandit model and takes the form of a sum, over all suboptimal arms, of the inverse squared gap between the best arm and that arm, plus an inverse squared gap between the optimal arm and the second best arm: C(μ) = Σ_{a≠1} 1/(μ_1 - μ_a)² + 1/(μ_1 - μ_2)². These squared gaps between the means look, at least in the Bernoulli case, like a sub-Gaussian approximation of the divergence function d, so in this quantity the right information-theoretic terms are not yet identified; moreover, inside the big O there are lots of, sometimes even non-explicit, constants hiding. My goal is really to have a lower bound and an upper bound that match up to constants and involve information-theoretic terms; in this sense we can say that the optimal sample complexity was not yet identified.

So let us try to obtain new lower bounds for this problem. I will first introduce some useful tools to derive lower bounds, and we will derive together a lower bound for our problem. Lower bounds, for either regret minimization or best arm identification, rely on a so-called change of measure argument: the idea is that if we want to lower bound the number of samples needed under some specific bandit model μ, we need to find another bandit instance λ under which the behavior of the algorithm must be quite different, and under which all the arms need to be drawn a little bit. Changes of distribution can be quite technical, but here I give a useful lemma, derived with Olivier Cappé and Aurélien Garivier, that relates in a quite explicit way the risk parameter δ to the expected numbers of draws of the arms, through the Kullback-Leibler divergence between the distribution of arm a under the model μ and under the model λ. The inequality holds for every bandit model λ that has a different optimal arm than the original bandit model μ: Σ_a E_μ[N_a(τ)] d(μ_a, λ_a) ≥ kl(δ, 1 - δ), where the right-hand side is the binary relative entropy between δ and 1 - δ, which is roughly of order log(1/δ); this is the log(1/δ) we saw in the state of the art.

Let me explain how to use this result to derive lower bounds. As we need a lower bound on the number of samples, a first idea is to separately lower bound the number of times each arm must be drawn. Fix a suboptimal arm a ∈ {2, ..., K}; we want to lower bound the expected number of times this arm is drawn. The idea is that if we choose a bandit model λ in which only few of the information terms are non-zero, we directly get a lower bound on the expected number of draws of a. So we choose a bandit model λ in which λ_i = μ_i for all i ≠ a, so that the corresponding information terms are zero, and for arm a we move the mean slightly above the mean of the optimal arm: we set λ_a = μ_1 + ε. Under this bandit model λ the optimal arm is now arm a, whereas in the original bandit model the optimal arm was arm 1, so the condition of the lemma is satisfied.
Writing the inequality for this particular choice of λ, we get that the expected number of draws of arm a multiplied by d(μ_a, μ_1 + ε) is lower bounded by kl(δ, 1 - δ), which gives the following lower bound when we let ε go to zero: E[N_a(τ)] ≥ kl(δ, 1 - δ)/d(μ_a, μ_1). What we show with this argument is the following lower bound (we have to repeat the argument for every suboptimal arm, and then do something similar for the optimal arm): for any δ-PAC algorithm, the sample complexity is lower bounded by, roughly, log(1/δ), as seen from the inequality above, multiplied by the complexity term Σ_{a≠1} 1/d(μ_a, μ_1) + 1/d(μ_1, μ_2). This looks like an equivalent of the Lai and Robbins lower bound, because for each suboptimal arm we have the divergence between μ_a and the optimal arm, and for the optimal arm the divergence between it and arm 2. For some time I believed this was the right lower bound and that we could find an algorithm matching it, but it turned out that this lower bound is not tight enough; actually we can derive the optimal lower bound, and it is a three-line proof.

The idea is to start from the very same lemma, the very same change of distribution. In the previous proof we chose specific values of λ and wrote the statement for those values, but in principle, as the inequality holds for every bandit model with a different optimal arm, we can define Alt(μ) as the set of all bandit models λ that have a different optimal arm than μ, and it then holds that the infimum over λ in Alt(μ) of the sum Σ_a E_μ[N_a(τ)] d(μ_a, λ_a) is still larger than kl(δ, 1 - δ). As we want to lower bound the sample complexity E[τ], we artificially introduce it, multiplying and dividing by the expectation of τ. At this stage we are not completely happy, because the ratios E_μ[N_a(τ)]/E_μ[τ] depend on the algorithm; the idea is simply to upper bound them by something that does not depend on the algorithm. Noting that these quantities sum to one, so they form a probability vector, we can upper bound the expression by the supremum over all w in the simplex of dimension K, the vectors with non-negative coordinates summing to one. And so we have proved, in three lines, a very simple yet non-explicit lower bound on the sample complexity: the sample complexity is lower bounded by T*(μ), some characteristic number of samples for the problem, multiplied by kl(δ, 1 - δ) ≈ log(1/δ), where T* has the non-explicit form T*(μ)^{-1} = sup_w inf_{λ∈Alt(μ)} Σ_a w_a d(μ_a, λ_a). This actually reminds us of some non-explicit lower bounds that do exist in the bandit literature. The first, given by Graves and Lai in 1997, was a lower bound on the regret, but for very complex models with possible correlations between the arms, so it was somehow natural to get something less explicit than the Lai and Robbins bound. The second is a recent paper from last year that studies the best arm identification problem, but for a different class of bandit models: whereas we study all bandit models with a single optimal arm, they were concerned with bandit models in which only one arm differs from the others, so one optimal arm and a set of arms that all share the same value. For this specific class, different from ours, they also derived a non-explicit lower bound.
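Written out in LaTeX, matching the notation above, the bound just proved reads:

```latex
% Lower bound: for any \delta-PAC algorithm and any bandit model \mu in S,
\[
  \mathbb{E}_{\mu}[\tau_{\delta}] \;\ge\; T^{*}(\mu)\,\mathrm{kl}(\delta,\,1-\delta),
  \qquad
  T^{*}(\mu)^{-1} \;=\; \sup_{w \in \Sigma_{K}}\;
  \inf_{\lambda \in \mathrm{Alt}(\mu)}\; \sum_{a=1}^{K} w_{a}\, d(\mu_{a}, \lambda_{a}),
\]
% where \Sigma_K is the simplex of probability vectors over the K arms and
% \mathrm{Alt}(\mu) is the set of models whose optimal arm differs from \mu's.
```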
Going back to our setting, what is actually very interesting with this bound is that we understood from the proof that the w_a are a substitute for the proportions of draws: each w_a stood for the ratio of the expected number of draws of arm a to the expected total number of samples, so it can be viewed as the proportion of draws of arm a under some optimal strategy. So if we take a w*(μ) realizing the argmax, we expect it to contain the optimal proportions of draws of the arms. At this stage you could ask me a question: can I really define this? I never told you that this argmax is unique. What I need, to justify this definition, is to show you that the maximizer is unique, hence well defined; and as a byproduct we will come up with an efficient algorithm to compute these non-explicit values. We can start by being a little less ambitious and just fix some w* in the argmax. Because we are working with exponential family bandit models, we can give a more explicit formulation of the inner optimization problem over λ: an explicit calculation shows that it is equal to a minimum over the suboptimal arms a ≠ 1 of a weighted sum of Kullback-Leibler divergences, namely w_1 d(μ_1, m_a) + w_a d(μ_a, m_a), where m_a is the weighted mean (w_1 μ_1 + w_a μ_a)/(w_1 + w_a). If we factor w_1 out of this expression, we can see that it is simply a function of the ratio x_a = w_a/w_1, through a function g_a that we define accordingly; a little bit of calculus shows that this g_a is a one-to-one mapping between R_+ and the interval [0, d(μ_1, μ_a)), which will be useful in the sequel. To rephrase this more explicitly we introduce the following notation: we work with x_a* = w_a*/w_1*, and with this notation, using that the w_a* sum to one, we have w_1* = 1/(Σ_a x_a*). So to find w* we first look for x_2*, ..., x_K* within this argmax, and the next step is to realize that, as we have a minimum of K - 1 functions, it is easy to check that at the optimum all these K - 1 functions must take the same value: there exists some real value, which I denote y*, such that g_a(x_a*) = y* for every suboptimal arm a. Finally, the optimization problem we are solving can be completely reduced to a one-dimensional optimization problem: defining x_a(y) as the inverse of the function g_a introduced previously, we have to find the y* maximizing the resulting objective over the interval [0, d(μ_1, μ_2)). It is possible to compute the derivative of this function, and by solving the equation "derivative equals zero" we can prove that there is a unique point realizing the argmax, which shows that there is a unique vector of optimal weights w*. More precisely, we have the following theorem in the paper, which characterizes the value of w*(μ) in terms of these functions x_a and shows that the computation of y* reduces to solving a real equation involving an increasing function, so we can solve it by, I don't know, dichotomic search; and each evaluation of that function reduces to solving K - 1 smooth scalar equations. So in the end we have an efficient way to compute this important vector w*(μ).
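The paper's one-dimensional characterization is the efficient route; as a cruder, self-contained sketch one can also maximize the (concave) inner value directly over the simplex. This assumes Bernoulli arms and reuses kl_bernoulli; the SLSQP call is my choice and can struggle with the non-smooth minimum, so a fine grid or the paper's method is more reliable.

```python
from scipy.optimize import minimize

def inner_value(w, mu):
    """inf over Alt(mu) of sum_a w_a d(mu_a, lambda_a), via the explicit form:
    a minimum over suboptimal arms a of w_1 d(mu_1, m_a) + w_a d(mu_a, m_a)."""
    best = int(np.argmax(mu))
    vals = []
    for a in range(len(mu)):
        if a == best:
            continue
        m = (w[best] * mu[best] + w[a] * mu[a]) / (w[best] + w[a])  # weighted mean
        vals.append(w[best] * kl_bernoulli(mu[best], m)
                    + w[a] * kl_bernoulli(mu[a], m))
    return min(vals)

def w_star(mu):
    """Optimal proportions of draws w*(mu), by direct maximization on the simplex."""
    K = len(mu)
    res = minimize(lambda w: -inner_value(w, mu), np.full(K, 1.0 / K),
                   bounds=[(1e-6, 1.0)] * K,
                   constraints=({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},),
                   method='SLSQP')
    return res.x

# The characteristic time is then T*(mu) = 1 / inner_value(w_star(mu), mu).
```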
So the idea of the algorithm that we propose to attain the lower bound is to try to match these proportions, and that is indeed what our Tracking sampling rule does. At a given stage of the algorithm we have formed the vector μ̂(t) of the empirical means of all the arms, based on the draws collected so far. The Tracking rule first checks whether there is an arm that has been drawn fewer than √t times at time t; if this is the case, we draw such an arm (this is a forced exploration phase). If all arms have been drawn more than √t times, we choose the arm that maximizes t·w_a*(μ̂(t)) - N_a(t). This sampling rule is built so that, with probability one, the fraction of draws of each arm a converges to the target optimal value w_a*(μ). You can see that the algorithm requires computing the vector w* at the current empirical means: at each step we solve the previously described optimization problem to compute the weights.

Now, a key feature of a best arm identification algorithm is the stopping rule: we should stop as soon as possible in order to have a low sample complexity. The stopping rule that we propose can simply be motivated by a statistical testing problem. We introduce Z_{a,b}(t) as a generalized log-likelihood ratio: in the numerator we have the maximum likelihood under the constraint that the mean of arm a is larger than the mean of arm b, and in the denominator the maximum likelihood under the opposite constraint. It is easy to understand that high values of this statistic tend to reject the hypothesis that μ_a is smaller than μ_b. So we stop when one arm can be shown to be significantly larger than all the others in the sense of these generalized likelihood ratio tests. Our stopping time τ_δ, which I index by δ because the moment we stop of course depends on the risk parameter we are given, can be rephrased as follows: we stop when there exists an arm a such that, for all other arms b, the statistic Z_{a,b}(t) is larger than some threshold β(t, δ). This stopping rule can actually be traced back to an old work by Chernoff in 1959 on sequential adaptive hypothesis testing; he was concerned with finitely many hypotheses, whereas here our hypotheses are continuous, because we want to check whether μ_a is larger than all the other means, but that paper gave us a lot of intuition on how to devise good strategies.

OK, so of course, under our exponential family assumption we can again give an explicit formulation of this generalized likelihood ratio statistic, showing in particular that if μ̂_a(t), the empirical mean of arm a, is larger than the empirical mean of arm b, then Z_{a,b}(t) = N_a(t) d(μ̂_a(t), μ̂_{a,b}(t)) + N_b(t) d(μ̂_b(t), μ̂_{a,b}(t)), where μ̂_{a,b}(t) = (N_a(t) μ̂_a(t) + N_b(t) μ̂_b(t))/(N_a(t) + N_b(t)) is the weighted average of the two empirical means; so, a bit like in the lower bound, we have here a weighted sum of information-theoretic terms. I defined this stopping rule through the generalized likelihood ratio test idea, but there are actually several possible interpretations, one of which is related to the lower bound that we proved: the stopping statistic max_a min_{b≠a} Z_{a,b}(t) can be shown to be equal to t multiplied by the infimum, over λ in Alt(μ̂(t)), all the bandit models in which the optimal arm differs from our current guess, of Σ_a (N_a(t)/t) d(μ̂_a(t), λ_a).
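A sketch of this statistic and of the resulting stopping test for Bernoulli arms, reusing kl_bernoulli; the simple threshold log(2t/δ) anticipates the guarantee discussed below, and the early-exit guard is mine.

```python
def glr_statistic(a, b, counts, means):
    """Z_{a,b}(t): generalized likelihood ratio statistic for 'mu_a > mu_b'
    (valid form when means[a] >= means[b])."""
    na, nb = counts[a], counts[b]
    m_ab = (na * means[a] + nb * means[b]) / (na + nb)   # weighted empirical mean
    return na * kl_bernoulli(means[a], m_ab) + nb * kl_bernoulli(means[b], m_ab)

def chernoff_stopping(t, counts, means, delta):
    """Stop when the empirical best arm a beats every b: min_b Z_{a,b}(t) > beta."""
    if counts.min() == 0:                # wait until every arm has been drawn once
        return False
    beta = np.log(2 * t / delta)         # threshold discussed in the guarantee below
    a = int(np.argmax(means))            # only the empirical best can pass the test
    return all(glr_statistic(a, b, counts, means) > beta
               for b in range(len(means)) if b != a)
```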
So we understand that if we use a sampling strategy under which these fractions of draws N_a(t)/t converge to w*(μ), the solution of our optimization problem, then we recover our complexity term T*(μ): with a good sampling rule, and we believe this is true, we should stop roughly when this quantity exceeds log(1/δ), because we would then exactly recover the sample complexity T*(μ) log(1/δ). But we will see that we need a slightly larger threshold for stopping, and I will give you some pointers on how to choose it.

Before that, another interesting interpretation of the stopping rule is in terms of information theory, in terms of the minimal number of bits needed to code the sequences of zeros and ones produced by arms a and b. Our generalized likelihood ratio statistic can be rewritten using the Shannon entropy H of the distributions, in the form Z_{a,b}(t) = (N_a(t) + N_b(t)) H(μ̂_{a,b}(t)) - N_a(t) H(μ̂_a(t)) - N_b(t) H(μ̂_b(t)). The first term represents the average number of bits we would need to code all the samples produced by both arms pooled together, and the two subtracted terms are the numbers of bits needed if we code separately the samples obtained from a and the samples obtained from b. So another interpretation is that we stop when the first quantity is significantly larger than the second, meaning that it is really more costly to encode a and b together, and therefore a and b should be far apart.

I think I won't have time to go into the details of the proof, but it contains some useful information-theoretic arguments. What we prove is that if we choose, as I told you, a threshold slightly larger than log(1/δ), more precisely the logarithm of some quantity proportional to t, divided by δ, then we have a δ-PAC algorithm. This choice comes from the following result that we prove: if the mean of arm a is smaller than the mean of arm b, then for any sampling rule, the probability that there exists a time t such that Z_{a,b}(t) exceeds log(2t/δ) is smaller than δ.

I am not sure I have time for this, but let me sketch the proof. Usually one proves this kind of result with concentration inequalities; here I also use a change of measure idea. We introduce T_{a,b} as the first time at which the statistic exceeds the threshold log(2t/δ); then all we have to prove is that the probability that T_{a,b} is finite is smaller than δ. To do so we just use the definition of Z_{a,b}(t) as a likelihood ratio statistic: the event T_{a,b} = t means that this likelihood ratio exceeds 2t/δ. The probability that T_{a,b} is finite is the sum over all t of the expectation of the indicator of that event, and the idea is simply to upper bound the indicator function, which is at most one, by the likelihood ratio multiplied by δ/(2t), a bit like the trick used to prove Markov's inequality. We are then left with this quantity, and, using that μ_a is smaller than μ_b, we can lower bound the denominator by its value at the particular instance (μ_a, μ_b) under consideration, so we get the following expression.
Then, what will be useful is that when we express the integral, and I do it here in the Bernoulli case, there is a sum over all the possible sequences of rewards observed up to time t, which cancels out with the likelihood term; and we would be very happy if we had a probability density here, which is not the case because of the maximum. So the information-theoretic idea that we use in the proof is an existing uniform bound on the likelihood of a Bernoulli sample in terms of a quantity called KT(x): it is a partially integrated likelihood, where p_u(x) is the likelihood of the observed sequence under a Bernoulli of mean u, and we integrate over u under some probability distribution (this is the Krichevsky-Trofimov distribution). Upper bounding the maximized likelihood of the samples gathered from arm a by this quantity gives us something that is a probability distribution over the observed sequences of zeros and ones, and the change of measure idea then comes from the fact that we can interpret the sum as an expectation, under some alternative probability space, of the indicator of the event, which upper bounds it by something that is a probability and is therefore smaller than one. So this is a new idea for deriving this kind of proof, compared to the usual concentration inequalities that are used in other works, and that could also be used here, but would give less precise results.

OK, so to summarize: what algorithm did we exhibit, and what guarantees did we prove? The algorithm, which I call the Track-and-Stop strategy, uses the Tracking sampling rule that I presented, whose idea is to match the vector w* of optimal proportions of draws; the Chernoff stopping rule, based on the generalized likelihood ratio statistic with the threshold above; and the recommendation rule that, when we stop, chooses the arm with the highest empirical mean. We prove that this algorithm is δ-PAC and satisfies that the limit, as δ goes to zero, of the expected number of samples divided by log(1/δ) is equal to T*(μ), the characteristic time that appeared in our lower bound. It is pretty easy to prove this equality almost surely; the technical part is to handle the expectation. I will skip the proof, but it takes one slide, so if one day you want to look at the paper, it is also very short.
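Putting the earlier sketches together, an end-to-end toy version of Track-and-Stop might look as follows; it reuses BernoulliBandit, run_fixed_confidence, w_star and chernoff_stopping from above, all of them illustrative stand-ins rather than the paper's exact specification.

```python
def tracking_rule(t, counts, sums):
    """Tracking: force exploration of under-sampled arms, otherwise draw the
    arm whose empirical count lags most behind t * w_a*(mu_hat(t))."""
    if counts.min() < np.sqrt(t):             # forced exploration phase
        return int(np.argmin(counts))
    means = sums / counts
    w = w_star(means)                         # optimal proportions at mu_hat(t)
    return int(np.argmax(t * w - counts))     # track t * w_a - N_a(t)

bandit = BernoulliBandit([0.5, 0.45, 0.3])
a_hat, tau = run_fixed_confidence(bandit, tracking_rule, chernoff_stopping,
                                  delta=0.1)
print(f"recommended arm {a_hat} after {tau} samples")
```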
So I want to spend a few minutes on the practical implications of this work: how does our Track-and-Stop strategy compare to the state-of-the-art algorithms for this problem? In order to do so I have to quickly introduce a few algorithms that can be used for this problem. Usually there are two kinds: either they use upper and lower confidence bounds, as counterparts of the UCB-type algorithms for best arm identification, or they use an elimination principle: you start with all the arms and then successively eliminate the arms which you are convinced are not the optimal arm.

The first algorithm I present is of the first kind; it is called KL-LUCB. It is a bit reminiscent of the KL-UCB algorithm, except that it uses both an upper confidence bound, based on the d function again, and a lower confidence bound; and whereas in KL-UCB we had log(t) as exploration rate, here we have something that depends on t and δ, the equivalent of the threshold we have seen before. The algorithm actually samples two arms at each round: it samples the empirical best arm, the arm that maximizes μ̂_a(t), and it also samples, among the suboptimal arms, the one with the largest upper confidence bound. It stops whenever the lower confidence bound of the empirical best arm is larger than the upper confidence bounds of all the other arms, somehow when the confidence intervals are separated, and then of course we recommend the empirical best arm.

The other algorithm, which I call KL-Racing, is of the second type. It maintains a set R of remaining arms, all the arms still in the race, and proceeds in rounds: at each round we first draw all the arms in the set of remaining arms and compute the empirical means, so after round r each arm still in the race has been drawn r times and its empirical mean is based on r samples. Then we compute the empirical best arm, the one with highest empirical mean, and the empirical worst arm, and if the best arm is significantly larger than the worst arm we discard the worst arm. The criterion used to perform the elimination is similar to the KL-LUCB algorithm, based on confidence intervals: we eliminate the worst arm if the lower confidence bound of the empirical best is larger than the upper confidence bound of the empirical worst, in which case we remove it from the set R. We stop when there is a single element left in the set of remaining arms, which we output as our guess for the optimal arm a*. This is a kind of generic algorithm, where you could replace the elimination step by some other criterion, and empirically we also tried to improve the procedure by replacing the elimination step with our Chernoff test: we eliminate the empirical worst arm when the likelihood statistic Z_{a,b} of the empirical best a versus the empirical worst b exceeds the threshold. We call this the Chernoff-Racing algorithm, because it is a racing-type algorithm that uses our Chernoff stopping rule.
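For concreteness, here is a toy round-based racing loop with the Chernoff elimination step just described, reusing glr_statistic; the threshold and the tie-breaking details are simplifications of mine.

```python
def chernoff_racing(bandit, delta, max_rounds=10**6):
    """Racing: draw every remaining arm once per round, eliminate the empirical
    worst when the empirical best beats it significantly."""
    K = len(bandit.mu)
    remaining = set(range(K))
    counts = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    for _ in range(max_rounds):
        for a in remaining:                    # one draw per arm still in the race
            sums[a] += bandit.draw(a)
            counts[a] += 1
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        best = max(remaining, key=lambda a: means[a])
        worst = min(remaining, key=lambda a: means[a])
        t = int(counts.sum())
        if glr_statistic(best, worst, counts, means) > np.log(2 * t / delta):
            remaining.discard(worst)           # eliminate the empirical worst arm
        if len(remaining) == 1:
            return remaining.pop()             # our guess for the optimal arm
    return best
```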
So I present here numerical results on two Bernoulli bandit models, one with four arms and one with five arms; I give the explicit values of the means together with the optimal proportions of draws w* in each of the two models. I also mention that in practice the threshold function, and the exploration rate in the confidence intervals, are all set to log(log(t)/δ), whereas what was proved to work was, I think, log(2t/δ); so they are a bit smaller than what is allowed by the theory, but they are still very conservative. With this choice we implemented the Track-and-Stop strategy, the two state-of-the-art algorithms, and also our improvement of the racing-type algorithm based on the Chernoff stopping rule, and what we see is that, compared to the state of the art, the sample complexities are roughly divided by two, so there is a huge practical improvement. This is run for the specific value δ = 0.1, but of course the same trends occur for smaller values of δ. An interesting phenomenon that we can highlight is that on the first bandit model Chernoff-Racing performs quite similarly to Track-and-Stop, whereas on the second it performs worse than KL-LUCB: it is less robust across bandit problems. The reason lies in the way racing-type algorithms are built: when a racing algorithm stops, there are still two arms that have been drawn the same number of times, so if we looked at the proportions of draws we would have two equal values in the empirical proportions; and in the second problem the optimal proportions of draws of arms one and two are more separated than in the first problem, so it is natural that a racing-type algorithm performs less well on that model.

To conclude, we proved the following for best arm identification: we computed the value of the infimum, over δ-PAC algorithms, of the limit of the sample complexity divided by log(1/δ). We also propose a slightly more explicit formulation in the paper and, more importantly, a characterization of the optimal proportions w* that permits us to derive an efficient strategy matching the bound. There is plenty of future work, because the analysis we propose is really asymptotic: we would want a finite-time analysis, just like what exists for regret minimization; and we can imagine other ways to use this knowledge of the optimal weights, combining it with other successful heuristics from the bandit literature, like the use of upper and lower confidence bounds or the use of Thompson sampling.

Thank you.

Q: I am wondering what you would lose if, instead of trying to find the best arm, you go back to your original problem of minimizing regret. One possible approach is to first find the best arm and then systematically play that arm; how much would you lose with that approach?

A: It would be suboptimal, I think. This kind of strategy has actually been proposed, it is called Explore-Then-Commit: you dissociate the exploration phase and the exploitation phase. Even if you were to use an optimal best arm identification algorithm in the first phase and then always play the empirical best arm until the end, I think that in the upper bound you could devise for that algorithm this T*(μ) would appear somewhere, and I don't think you would end up with the complexity of the regret minimization problem, which is much simpler and explicit.

Q: A constant factor?

A: Yes, I think it would be a constant factor in front of the log, but it could be at least twice this number. Actually I tried empirically, not with the optimal algorithm but with good heuristics, and for regret minimization it is still better to balance exploration and exploitation as we go, and not to dissociate them.

Q: Have you considered as your metric for success, instead of the probability of picking the right arm, the error on the mean?

A: Yes, this is called the simple regret, the optimization error somehow: the criterion is to minimize μ* minus the mean of the arm you output. Some works in the literature have considered this measure, which I agree is a bit different from just the probability of error, but for example I do not know a lower bound for the simple regret; that would actually be interesting future work as well.
Q: I have a question related to the last one. If you think of a clinical trial, what if you just want to find the best treatment to within, say, five percent of the best? In particular, this includes the case where you have several equally good optimal arms.

A: Yes, this is a natural way to relax the problem as well: you fix some ε, say 0.05, and your goal is to output an arm whose mean is larger than μ* - ε. This relaxation has also been considered in the literature, and you can adapt the algorithm to handle that case; but again, for the lower bound, I would not for the moment be able to derive a lower bound that incorporates that ε. There are, though, algorithms with upper bounds that involve something like the maximum between ε and the minimal gap between the arms. The idea, for example with a racing-type algorithm, would simply be: if after about (1/ε²) log(1/δ) samples you still have not stopped, then stop and output whatever arm you have; this would work as a heuristic for this problem.