and then some linear and contextual adversarial bandits as well. So really all of this is just an application of the one big theorem we proved before, with different ways of estimating the thing you don't know, and not much new beyond that. But there are going to be a few fun things along the way, including a little bit on experimental design from statistics, which is a useful tool here. Okay, so the first topic is bandits with expert advice. This is the problem you face: you have a bunch of experts, they're all recommending which socks you should wear in the morning, it's a difficult decision. Every morning you have to make a choice, and you have a number of experts, maybe a really large number, recommending what you should do. You want to use their advice as effectively as possible and suffer low regret, not with respect to the best action in hindsight, but with respect to the best expert in hindsight. So if there is some good expert, you will identify them quickly. How do we model this setting? It's exactly the same as the simple finite-armed bandit we studied before: you have K actions to choose from, and the adversary chooses a sequence of losses for those actions. But now we also have M experts, and M is usually going to be much larger than K, maybe exponentially larger. The experts say, I think you should play this action or that action, in each round. So at the beginning of round t you ask all the experts what they think you should do, and expert i tells you: I think you should play $a^i_t$. Then you make your decision and choose one of the K actions. The regret is now measured with respect to the best expert in hindsight: we take the max over experts rather than over actions, and compare against the sum of the losses of the actions that expert recommended. This lets us move away from the rigid framework of competing with the best fixed action in hindsight; now we compete with the best expert in hindsight. And we'll see we can use this to apply bandits where the data is non-stationary, or in big combinatorial settings, just reusing these results. The important thing is what we expect the regret to look like. We don't want it to be too big in terms of M; there are going to be examples where M is really, really huge, and you definitely could not tolerate your regret scaling with M, but maybe with log M. And that's what we're going to see: a logarithmic dependence on the number of experts. The analysis is completely straightforward: we just plug in the stuff we already know, estimate our losses, and it all works out. So is the setting clear? We don't need M to be large, but we're going to see an example where it is really large, and that often happens when there's some combinatorial structure in how the experts predict. The example we'll look at in a second is where each expert recommends a fixed sequence of actions with a bounded number of changes. So we're going to have experts predicting sequences that look like this.
So each expert is recommending a sequence of actions $(a_1, \dots, a_n)$ that you should play, and we take all such sequences, subject to the constraint that the number of changes is not too big: the number of rounds $t$ with $a_t \neq a_{t+1}$ is at most some constant $C$. These are the experts who say: I think you should play this action for some time, then switch to that action, and so on, and we have all such experts. If $C$ is reasonably large, the set of all such experts is really huge, but this is going to allow us to compete with non-stationarity. That's the classical example, but there are others as well. Yeah, and it recovers the usual bandit problem: if you had K experts, each recommending that you always play some fixed action, you would recover the standard setting. And if there were a perfect expert, one whose recommendation has the smallest loss in every round, you would essentially be pushing the max inside the sum, and you would have this regret bound: $\sum_{t=1}^n \ell_t(A_t) - \min_i \ell_t(i)$, your loss minus the per-round minimum. So with a perfect expert you would get this. And you could easily arrange for such an expert by taking all experts, the set predicting every possible action sequence, and there are lots of them; but we'll see in the end that our bound unfortunately becomes vacuous in that case, which is what you would expect, because that's just too hard. We make no assumptions about the data; it's just that the regret we prove at the end of the day is with respect to the best expert in hindsight. So if one expert is good in the first half and another in the second half, we don't track that. There are ways to add tracking for things like this, but we're not going to do that here. Is there some mechanism to learn to prefer an expert? Yeah, so the algorithm is going to do exactly that. At first it's going to be uncertain about which experts are good, and you'll see the form of the algorithm is such that if an expert is predicting well, you're more likely to follow their prediction, and if they're predicting badly, you'll discard them. That's what the algorithm will do; here we're just stating the objective. All right, so we want an algorithm for this, and what we do is exactly what we've seen before: follow the regularized leader with the negative entropy potential, except the space we run it on is the experts rather than the actions. The resulting algorithm is called Exp4. Very imaginative name. And what does it do? Coming back to your question: the algorithm maintains a distribution over the experts. In the previous sessions P was a distribution over actions; here it's a distribution over experts. And it's the exponentially weighted distribution, where inside the exponential we have a loss estimator for the losses of the actions that each expert recommended.
So expert i recommended that you play the actions $a^i_t$, and you're going to estimate the losses you think they would have incurred. It's a bandit setting, so you don't necessarily get to see those losses; you have to estimate them. But then, if expert i did badly, the sum of these estimated losses will be big, and you'll be less likely to follow that expert's advice. Once you have this distribution, you just sample an expert $E_t$ from it and copy them: you play the action they recommend. So each round you have a distribution over experts, you sample from it, and you follow the advice of the sampled expert. That's the first part of what we need when analyzing these things: choosing the space on which to run the algorithm, and here it's the space of experts. Then we need a way of estimating the losses. We take an action, we get an observation, and we can use the same importance-weighted estimator as before, but we should be careful about what we estimate. If we blindly followed what we did for normal bandits, we would estimate the loss of an expert directly: $\hat L_{t,i} = \mathbb{1}\{E_t = i\}\,\ell_t(a^i_t)/P_{t,i}$, where $E_t$ is the expert you sampled, since when you sample expert i you actually observe their loss. That's the normal importance-weighted estimator for the loss of an expert, but it throws away a lot of information, because when you play an action you get information about all the experts who recommended that action, not just the one you happened to copy. If you use this estimator, the bound you end up with is just what the standard analysis gives, something like $\sqrt{nM\log M}$: we have an M where we would like a K, and that's no good. The estimator we should use must exploit all the information we have, and it's this one, which estimates the losses of the actions instead: $\hat\ell_t(a) = \mathbb{1}\{A_t = a\}\,\ell_t(a)/Q_t(a)$, did I play the action, times its loss, divided by the probability that I play the action, where the denominator $Q_t(a) = \sum_i P_{t,i}\mathbb{1}\{a^i_t = a\}$ sums over all the experts who recommended that action. This denominator is exactly the probability that you choose action a, and this estimator captures much better the information you have. Okay, so that's the algorithm. Any questions on the algorithm?
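To make the round structure concrete, here is a minimal sketch of one round of Exp4 in numpy. It is not from the lecture; the function name `exp4_round` and the in-place accumulator `loss_hat_sums` are my own choices, and `true_losses` stands in for the adversary's (hidden) loss vector so the sketch is runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp4_round(advice, loss_hat_sums, eta, true_losses):
    """One round of Exp4 (sketch). advice[i] is the action expert i
    recommends this round; loss_hat_sums[i] accumulates expert i's
    estimated losses so far (updated in place)."""
    M = len(advice)
    # Exponential-weights distribution over experts (shifted for stability).
    w = np.exp(-eta * (loss_hat_sums - loss_hat_sums.min()))
    P = w / w.sum()
    # Sample an expert and copy its recommendation.
    expert = rng.choice(M, p=P)
    action = advice[expert]
    observed_loss = true_losses[action]       # bandit feedback
    # Q_t(action): probability the action is played = mass of agreeing experts.
    q = P[advice == action].sum()
    # Importance-weighted loss estimate, charged to every agreeing expert.
    loss_hat_sums[advice == action] += observed_loss / q
    return action, observed_loss

# Toy usage: K = 3 actions, M = 4 experts.
advice = np.array([0, 2, 2, 1])               # a^i_t for each expert
sums = np.zeros(4)
print(exp4_round(advice, sums, eta=0.1, true_losses=np.array([0.3, 0.9, 0.1])))
```

Note how the update touches all experts who recommended the played action, which is exactly the information-sharing point made above.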
For the analysis, essentially what we do is look at the bound we proved yesterday and plug in our quantities. We're using the negative entropy potential, so the diameter term is log M: we're playing on the space of experts, which has size M. Then we have to bound the variance term, and again we sum over all the experts. We'll find that because we used this slightly fancier estimator, that term ends up being of size K; I'll show you how in a second. What this means is that when we substitute these things in, we get a K here and an nK there, and once we tune eta we get a bound that depends logarithmically on M, with the K appearing inside the square root. So we have a potentially much more powerful comparison class, this huge number M of comparators, and the price we pay is just the log M. The only thing left is to check that this variance calculation is really true, and fortunately that's not too hard. So why is it true? (This chalk is very squeaky.) What is our estimator? I'll write it again: $\hat\ell_t(a) = \mathbb{1}\{A_t = a\}\,\ell_t(a)/Q_t(a)$, the indicator of whether we played the action, times its loss, divided by the probability that we played it, which is the sum over the experts, $Q_t(a) = \sum_{i=1}^M P_{t,i}\mathbb{1}\{a^i_t = a\}$. That's just the definition of the loss estimator. Now we plug this into the expectation and work it out. We have $\mathbb{E}\big[\sum_{i=1}^M P_{t,i}\,\hat\ell_t(a^i_t)^2\big]$: a sum over the experts, each weighted by $P_{t,i}$, of the squared loss estimate of the action that expert recommended. But the inner term depends only on the action expert i recommended, so if a bunch of experts recommend the same action, those terms are identical. So we split the sum by recommended action: this equals $\mathbb{E}\big[\sum_a \sum_{i:\,a^i_t = a} P_{t,i}\,\hat\ell_t(a)^2\big]$; we're just considering the experts recommending each action separately. Now the inner term no longer depends on i, and the sum of the probabilities over those experts is exactly $Q_t(a)$, which I defined as the probability that you choose action a in round t. So we get a nice cancellation with the square in the denominator: this equals $\mathbb{E}\big[\sum_a Q_t(a)\,\hat\ell_t(a)^2\big] = \mathbb{E}\big[\sum_a \ell_t(a)^2\,\mathbb{1}\{A_t = a\}/Q_t(a)\big]$, where the squared indicator is just the indicator and one factor of $Q_t(a)$ has cancelled. Now we're on familiar ground: the indicator is one with probability $Q_t(a)$, so taking the expectation, that ratio becomes one, and we're left with the expectation of the sum of squared losses, which is at most K because we assumed the losses are bounded in [0, 1]. So it's a routine calculation, but it gives us the right K. Questions on the calculation?
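If you want to convince yourself numerically, here is a small Monte Carlo check of that identity, $\mathbb{E}\big[\sum_i P_{t,i}\hat\ell_t(a^i_t)^2\big] = \sum_a \ell_t(a)^2 \le K$. This is my own sketch, not from the lecture; the advice, distribution, and losses are arbitrary random choices.

```python
import numpy as np

rng = np.random.default_rng(5)

K, M = 4, 12
advice = rng.integers(0, K, size=M)        # a^i_t for each expert
P = rng.dirichlet(np.ones(M))              # distribution over experts
ell = rng.uniform(size=K)                  # true losses this round
Q = np.array([P[advice == a].sum() for a in range(K)])  # Q_t(a)

total, trials = 0.0, 200_000
for _ in range(trials):
    a_played = advice[rng.choice(M, p=P)]  # play a sampled expert's action
    ell_hat = np.zeros(K)
    ell_hat[a_played] = ell[a_played] / Q[a_played]
    total += P @ ell_hat[advice] ** 2      # sum_i P_i * ell_hat(a^i_t)^2
print(total / trials, (ell ** 2).sum())    # should roughly agree, both <= K
```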
A question: if all experts are equally likely and the loss is spread over all choices, do we get close to this upper bound? You get the upper bound no matter what; the question is when it's nearly tight. With a lot of experts it's actually hard to construct hard cases. We'll see one shortly: it's the non-stationary bandit case. Essentially, you want a sequence of bandit problems, where in each block you play basically one bandit, and you have a bunch of experts who recommend doing different things in the different blocks. That's going to roughly match this bound, and that's the hard case. But this bound is not very tight in lots of cases; you can prove, for example, that if the experts agree a lot, you get a smaller bound, because the capacity of the space of experts is effectively smaller when they're very often saying the same thing. So there are cases where you can do much better; this is the worst case. Any other questions? I like this because it's so easy: it's a really simple analysis once you've done the Exp3 analysis, but it buys you quite a lot. And the non-stationary bandit is going to be our example. Here we're back to the normal bandit setting, but we want to relax the assumption that the environment is essentially the same in every round. And we have to be careful here. For adversarial bandits, one of the selling points is that we make no assumptions about the data; that's one of the really nice things about the adversarial online learning framework, you don't need statistical assumptions. So you might think that adversarial bandits, as I presented them yesterday, were already well suited to non-stationary environments. And that's not true. The reason is that the notion of regret doesn't measure the behavior you would want in a non-stationary bandit. Let's do an example with k = 2, just two actions, and I get to choose the sequence of losses. Say this row is action one, this row is action two, and time runs along the top. At first, action one is a really bad choice: it always has loss one, so we have one, one, one, and so on, while the other action is always good. This goes on for some time, and then it switches: suddenly action one is good, with lots of zeros, and the other action has loss one from then on. And we'll say the switch occurs at t = n/2, halfway through. So we have this really distinct break where one arm is good at first, and then the other arm becomes the good one.
But what does the normal definition of the regret give us? The regret is $R_n = \max_a \mathbb{E}\big[\sum_{t=1}^n \big(\ell_t(A_t) - \ell_t(a)\big)\big]$: your losses minus those of the best fixed action, with the max over actions in front. That's the definition. So what happens here? Say I choose the policy that picks one arm or the other uniformly at random in every round. What's the regret of that policy? It's zero. Exactly, because in this instance the cumulative loss of each of the two arms over the whole period is n/2, whichever arm you commit to, and if you randomize 50-50, your expected loss is one half in every round. So you can get zero regret with a ridiculous policy, and that doesn't feel right. Actually, the regret here can be extremely negative: imagine an algorithm that explores both arms a little at the beginning, then sees that arm two is really good and starts just playing it, and then notices arm two has stopped being good and switches back. That would give really negative regret, and that's what we would like: we'd like an algorithm to achieve negative regret here.
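A quick numeric sanity check of that zero-regret claim, on exactly the switching instance above (my own sketch; only numpy is assumed):

```python
import numpy as np

n, half = 10_000, 5_000
# Loss table for the switching instance: action 0 is bad then good,
# action 1 is good then bad; the switch is at t = n/2.
loss = np.zeros((n, 2))
loss[:half, 0] = 1.0
loss[half:, 1] = 1.0

# Cumulative loss of each fixed action over the whole horizon.
print(loss.sum(axis=0))                        # [5000. 5000.] -- arms look tied

# The uniform policy loses 1/2 per round in expectation, so its standard
# regret against the best fixed arm is exactly zero.
uniform_loss = loss.mean(axis=1).sum()
print(uniform_loss - loss.sum(axis=0).min())   # 0.0
```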
Rather than trying to prove a negative regret bound, the normal thing to do is to redefine the regret. Essentially, the standard definition fails to capture that you want to be good against non-stationary environments; this is clearly a very non-stationary environment, and competing with the best single action in hindsight is not the right comparison. So we redefine the regret to compare ourselves not to a single action, but to any sequence of actions with at most C changes: we take the max over this big set of sequences, subject to the number of changes being at most C. In the example over here, we would already get a meaningful regret bound with C = 1, because there's one very clear change point: you want to play one arm and then switch to the other. So we want to prove a regret bound that is meaningful for problems of this nature, and that's what we'll do. What's the algorithm? It's the thing I suggested at the beginning: prediction with expert advice, where the experts are exactly the sequences appearing in this max. Each expert is a sequence saying what you should do; in particular, there's one expert who says play arm two until halfway and then arm one, and that's the best expert in hindsight for this particular problem. Now we just plug in the bound we proved for Exp4, which was $\sqrt{2nK\log M}$, so the only thing to work out is how many experts we have, and that's going to be a big set. How many sequences are there with at most C changes? We can be a little lazy about this (okay, not sure why this seems to have died) and say that M can't be much bigger than $\binom{n}{C} K^C$: you choose the positions where the change points happen, which is basically $\binom{n}{C}$, and then you choose which of the K actions the expert recommends in each block, which gives the $K^C$. There's some laziness in this calculation, because for instance you can't choose the same action in two consecutive blocks, so it's a rough estimate, but it's more or less accurate. And we can bound $\binom{n}{C}$ lazily too: it's certainly smaller than $n^C$, which is good enough for us; not quite tight, but fine. So what happens to the regret bound? We get $\log M \le C\log(nK)$, the C comes out of the log, and the bound is of order $\sqrt{C\,nK\log(nK)}$. So now we have a regret bound that works for these completely non-stationary bandits, and the price we pay is just the number of changes, which makes a lot of sense. In fact we can easily get a lower bound that's quite close. How? We know (I haven't proven it for you) that there's a lower bound for normal bandits of order $\sqrt{nK}$: you can't do better than that. So split time into C blocks: block one, block two, and so on up to block C. Each block has length n/C, and within each block the lower bound says you can't do better than $\sqrt{(n/C)\,K}$. There are C blocks, so summing up you get $C\sqrt{nK/C} = \sqrt{C\,nK}$, which matches the upper bound we proved, up to logarithmic factors: the upper bound carries the extra $\log(nK)$. I think you can get a lower bound that includes the log as well, with more work. Okay, any questions?
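Here is a tiny brute-force check of that counting step, for parameters small enough to enumerate (my own sketch, standard library only; the lecture's estimate is lazy, and adding one extra factor of K for the first block makes it a true upper bound):

```python
from itertools import product
from math import comb

def num_switch_sequences(n, K, C):
    """Brute-force count of action sequences of length n over K arms with
    at most C switches; only feasible for tiny n and K."""
    return sum(
        1 for seq in product(range(K), repeat=n)
        if sum(a != b for a, b in zip(seq, seq[1:])) <= C
    )

n, K, C = 8, 2, 2
M = num_switch_sequences(n, K, C)
# Exact count vs. the lazy estimate, and vs. the airtight version with the
# extra K for the first block's action.
print(M, comb(n, C) * K**C, comb(n, C) * K**(C + 1))   # 58 112 224
```

Either version gives $\log M = O(C\log(nK))$, which is all the regret bound uses.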
So now we move on to something different. That was prediction with expert advice; next we look at some linear bandits. The plan is more or less the same: we have a new model, we have to estimate something, and then we apply our algorithm, which is the usual thing. Here we're looking at the adversarial version of the linear bandit problem that I introduced in the first lecture. We have some action set that's a subset of $\mathbb{R}^d$: your actions are now feature vectors. The adversary chooses a sequence of loss vectors, the normal thing, and we make the normalizing assumption that the losses you can suffer are bounded: $\langle a, \ell_t\rangle \in [-1, 1]$. In each round you choose an action, but this time your loss is a linear function, an inner product, and you suffer the loss of the action you chose; the regret is defined exactly as usual. So we just have to use this extra linear structure to build a new estimator; otherwise everything is the same. What does this structure cover? This is a nice setting because it includes things we've already seen. If A is just the standard basis vectors, it's equivalent to the finite-armed bandit: you choose a basis vector and you observe the coordinate of the loss in the direction you played, $\langle \ell_t, e_i\rangle = \ell_{t,i}$. So linear bandits really generalize that. But you could also take other action sets: there are mathematically fundamental ones like the $\ell_p$ balls, which are interesting to study, and there are more practical ones where A is just a big finite set, which is what we talk about today. A is usually a set of feature vectors associated with, say, users and ads, or movie recommendations, something like that. And you can handle changing action sets as well, as we saw in the stochastic setting. So that's the setting; now we look at what the algorithm does. The idea is to play exponential weights with the negative entropy potential, with the same bound we've seen in every example before, but we need an estimator for the $\hat\ell$. Any proposals for how to estimate the losses here? Yes: in the first lecture we did least squares estimation for the linear bandit, but it's a little trickier here. What do you get to observe? The algorithm chooses some distribution $P_t$ over actions, samples an action from it, and observes the loss, the inner product $\langle A_t, \ell_t\rangle$. In the first lecture we played a series of actions, collected the data matrix, and our estimate was $\hat\theta = G^{-1}S$, where $G$ is the data matrix and $S$ is the sum of the covariates times the rewards. But here we only have one action: we must estimate the loss vector from a single sample, because in the adversarial setting the adversary can change the loss in every round, with no correlations at all. It's a one-shot thing, and the data matrix wouldn't even be invertible if we built it from one sample, which is a little annoying. But we're close; how can we fix it? The task is: we have a distribution, an action sampled from it, and the observed loss; how do we estimate the whole vector $\ell_t$ in an unbiased way from that? It's actually really tricky. The idea is to combine the importance-weighted estimation from the adversarial bandits with the least squares estimation. We're sampling $A_t$ from $P_t$; we don't have a data matrix based on one sample, but we can look at the expected data matrix, which I'll call $Q_t$.
So define $Q_t = \sum_a P_t(a)\,a a^\top$, the expected data matrix: the sum over the actions you could play, weighted by the probability of playing them. This is the data matrix you would get on average if you sampled many times, and it's going to be nicely invertible. Now we estimate $\hat\ell_t = Q_t^{-1} A_t \langle A_t, \ell_t\rangle$: the inverse expected data matrix, times the action we played, times the observed loss. If we do this, things work out. $Q_t$ plays the role of the data matrix and $\hat\ell_t$ is now the least squares estimator; what we've changed is replacing the actual one-sample data matrix $A_tA_t^\top$ with its expectation $Q_t$, and everything else is the same. Now we can calculate the expectation of the estimator, conditioning on the choice of distribution at that time: $\mathbb{E}[\hat\ell_t \mid P_t] = \sum_a P_t(a)\,Q_t^{-1}a a^\top \ell_t$. Everything is nice and linear, so we pull the $Q_t^{-1}$ outside; the remaining sum is exactly $Q_t$, and $Q_t^{-1}Q_t$ is the identity. So the expectation of this single-shot estimator is $\ell_t$. It's sort of incredible: from just one sample you have an unbiased least squares estimate of the whole vector. Of course, the variance is going to be reasonably large, which is a little annoying, but we have this very simple way of estimating the loss vector; I think it's really neat. If the action space is continuous, you can still do all of this: $Q_t$ is not a sum anymore but an integral, you replace the sum with the expectation operator, and everything is fine; there are algorithms that do this. For us today A is finite, but the continuous case works too. Any other questions? Is it clear? You look a bit perplexed. It's just beautiful, right?
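Here is a short numerical illustration of that unbiasedness, averaging many independent one-sample estimates (my own sketch; the action set, distribution, and loss vector are arbitrary, and `one_sample_estimate` is my name for the estimator above):

```python
import numpy as np

rng = np.random.default_rng(1)

d, K = 3, 5
A = rng.normal(size=(K, d))              # rows are the actions
P = np.full(K, 1.0 / K)                  # the algorithm's distribution P_t
ell = rng.uniform(-0.5, 0.5, size=d)     # adversary's loss vector (hidden)

Q = np.einsum('k,ki,kj->ij', P, A, A)    # expected data matrix Q_t
Q_inv = np.linalg.inv(Q)

def one_sample_estimate():
    a = A[rng.choice(K, p=P)]            # sample A_t ~ P_t
    return Q_inv @ a * (a @ ell)         # ell_hat = Q_t^{-1} A_t <A_t, ell_t>

# Unbiasedness: the average of many one-sample estimates approaches ell.
est = np.mean([one_sample_estimate() for _ in range(200_000)], axis=0)
print(np.round(est, 3))
print(np.round(ell, 3))
```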
So that's our estimator, and it's unbiased, but we still have to look at the variance: that's the important thing, the term that appears in the regret, so we have to work out what it is. It's a slightly annoying calculation, but fortunately it's pretty much all equalities. The first step is just to substitute in the definition: the estimated loss of an action is its inner product with the estimator, $\langle a, \hat\ell_t\rangle = \langle a, Q_t^{-1}A_t\rangle\langle A_t, \ell_t\rangle$; nothing special has happened yet. Next we use our assumption that the losses we actually suffer are bounded, $|\langle A_t, \ell_t\rangle| \le 1$: the $\langle A_t, \ell_t\rangle^2$ term appears, and we just bound it by one and pull it out. That's the one inequality. Then we use the trace rotation formula, a claim you can check yourself, but very useful: if $A$ is a $d \times d$ matrix and $x$ is a vector, then $x^\top A x = \operatorname{tr}(A\,x x^\top)$. Now we get cancellations. The trace is a nice linear operator, so we can pull it outside the expectation and the sum. Summing $P_t(a)\,a a^\top$ over the actions gives a $Q_t$, and $Q_t^{-1}Q_t$ is the identity: that's one cancellation, and we're left with the trace of the remaining term. Taking the expectation over $A_t$ works exactly the same way: the random part is $A_tA_t^\top$, summing $P_t(a)\,a a^\top$ gives another $Q_t$, and we're left with $\operatorname{tr}(I_d) = d$. So the variance term is just d. Really, really simple again, and plugging it into the bound gives a really nice result: $R_n \le \frac{\log K}{\eta} + \frac{\eta\,d\,n}{2}$. The nice thing is that we were shooting for a bound without a bad dependence on K, and here K only appears inside a logarithm rather than the way it normally does in bandit problems: we've converted the K into a d. This is what having features buys you, and the bound is of order $\sqrt{dn}$. So again, we've just used the normal machinery: we constructed an estimator, did the variance calculation, and plugged it into the same bound as everything else. This gives the right rate, and it's the bound we want. It's really nice.
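For completeness, here is the tuning of $\eta$ that the "order $\sqrt{dn}$" step compresses (a standard balancing calculation, not spelled out on the board):

$$
R_n \le \frac{\log K}{\eta} + \frac{\eta\, d\, n}{2},
\qquad
\eta = \sqrt{\frac{2\log K}{dn}}
\ \Longrightarrow\
R_n \le \sqrt{2\, d\, n \log K}.
$$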
In fact, this is slightly better than the bound I showed you for the stochastic case. There I showed a bound with the d outside the square root, basically $d\sqrt{n}$ plus some logs, and I told you it was tight. Here I've just proven a bound for the adversarial case, under more or less weaker assumptions, that looks better. What's going on? How do we explain this apparent contradiction? What appears in this bound that doesn't appear in that one? K, right. And how big would K have to be for the two bounds to be about the same? Something like $2^d$, exponential in d. So why would such large K appear naturally? The answer is simple. Imagine the action set is a sphere; suppose now we really do have a continuous action set, the set of x with $\|x\| \le 1$, but in really high dimension. Remember, in the stochastic bandit analysis we made no assumptions on the size of the action set. Now we want to run today's algorithm, which needs a finite set, so we choose a finite subset $A' \subseteq A$ of size K, with enough actions that one of them is approximately optimal: the optimum might lie anywhere on the sphere, but we want an approximator in $A'$. That is, for any action a, and in particular the optimal one, there should exist some $a' \in A'$ with $\langle a - a', \ell\rangle$ small, say of order $1/\sqrt{n}$, small enough that we don't care about it relative to the regret. If you do this calculation for how many points you need, the size of $A'$ comes out as roughly (one over the accuracy) to the power d, and with accuracy $1/\sqrt{n}$ that's about $n^{d/2}$. The important thing is that it's exponential in d, and if you plug that into the bound (which has just disappeared from the screen), the log brings down a factor of d. So this explains the contradiction: if you have a genuinely continuous action set and want to approximate it by a finite one, you need exponentially many actions. Nevertheless, when K is relatively small we have a good bound, and in the stochastic setting you can also prove such a bound for small action sets; it's just harder to do, much harder actually. Somehow the adversarial analysis seems almost easier than the stochastic one, even though we made fewer assumptions. Okay, but there's one little problem, and the main reason I'm highlighting it is that it gives me an opportunity to tell you about the optimal experimental design problem, which is really nice. The problem is the little approximation we made: yesterday we did a Taylor series approximation of the Bregman divergence for the relative entropy, this KL term, and I told you that was okay. And it is okay, but only if the losses are not really negative; if the losses get really negative, things can go wrong. In all the examples so far, the losses were positive: the importance-weighted estimator is generally positive, because it's a positive loss divided by a probability. But here, with the least squares estimator, the $\hat\ell$ involves an inverted matrix, and suddenly there's no particular reason the estimate should be positive; in fact, it can be negative. And if that condition is violated, the approximation can be quite bad. So what we have so far isn't yet a proof; we have to make a little modification that guarantees the estimated losses are never too negative. Here's another way to see why this matters. Think about what the algorithm is doing: exponential weights, $P_t(a) \propto \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_s(a)\big)$, normalized over the actions.
So this is what the algorithm is doing, and it seems sensible: if an action's estimated losses are big, the algorithm plays it with low probability, and otherwise with higher probability. But if you plot the exponential function, it gets big really, really quickly. If you feed in positive losses, you're in the regime where the exponential is relatively flat: the gradients are small, everything is well behaved. But if you start getting losses that are really quite negative, the exponent becomes quite positive, and suddenly you're in the regime where the exponential is totally wild: the gradients are huge, the thing blows up, and the algorithm becomes unstable. The way to control this is to ensure you never stray too far from the well-behaved region, and that's what the condition says: your estimated losses can be a little bit negative, but not too negative, otherwise things get unstable. So we modify the algorithm a little so this doesn't happen, and there's a nice little trick for it, which is a pretty common thing to add and helps in other ways too, because it encourages a more stable algorithm. We had a question yesterday about the variance of these algorithms, which can be very large; if you add a little bit of exploration, it can be much smaller. That's what we do here. FTRL recommends that you play some distribution $P_t$, but we're going to mess with that a little: instead we play the mixture $\tilde P_t = (1-\gamma)P_t + \gamma\pi$, with $\gamma$ relatively small, so with high probability we play what the algorithm suggests, and with a little probability we explore. The $\pi$ here is some distribution over the actions that you choose in advance, and you always play from it with a little bit of probability. The idea is that this extra exploration gives you enough coverage to guarantee the estimated losses are never too big in magnitude; it's not even about too positive or too negative, they're just going to be relatively small. So we have to choose this exploration distribution and then see what happens.
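The mixing step is a one-liner, but worth seeing explicitly (my own sketch; `sample_with_exploration` is my name, and the particular numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_with_exploration(P, pi, gamma):
    """Sample an action index from the mixture (1 - gamma) * P + gamma * pi:
    mostly follow FTRL's recommendation P, explore a little via pi."""
    P_tilde = (1.0 - gamma) * P + gamma * pi
    return rng.choice(len(P_tilde), p=P_tilde)

P = np.array([0.7, 0.2, 0.1])      # what FTRL recommends
pi = np.full(3, 1.0 / 3)           # e.g. a uniform (or G-optimal) design
print(sample_with_exploration(P, pi, gamma=0.05))
```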
I'll show you the calculation for what these things are. First, the $Q_t$ is now the expected data matrix under the distribution we actually play, which is $\tilde P_t$, and it is lower bounded by the exploration part: $Q_t \succeq \gamma Q_\pi$. By $A \succeq B$ for matrices I just mean $A - B$ is positive semidefinite (I'm not very used to writing the symbol). $Q_t$ is the sum of two positive semidefinite matrices and we're just dropping one of them; what's left is the contribution of the exploration, $\gamma$ times the exploration data matrix $Q_\pi$. Now we want to bound the magnitude of the estimated loss, and we just start plugging in the definition. The first step is to pull out the $\langle A_t, \ell_t\rangle$ factor; we've assumed it's at most one in absolute value. Then we split what remains along the inner product: we have $|a^\top Q_t^{-1} A_t|$. These matrices are positive definite, so they're invertible and have all the nice structure you need for a square root to exist, so this equals $|a^\top Q_t^{-1/2} Q_t^{-1/2} A_t|$. The square roots are self-adjoint, equal to their transposes, symmetric, so we can move one over to the other side, and we get the inner product of $Q_t^{-1/2}a$ and $Q_t^{-1/2}A_t$. Now we apply our beloved Cauchy-Schwarz, and what we end up with is the product of the two norms weighted by these Q matrices: $\|a\|_{Q_t^{-1}}\|A_t\|_{Q_t^{-1}}$. And now I claim this super nice result: there exists a distribution $\pi$ such that $\|a\|^2_{Q_\pi^{-1}} \le d$ for every action a, so both of these weighted norms are under control. This is incredibly non-obvious; it's a really nice theorem of Kiefer and Wolfowitz from the sixties, which makes very nice connections between two optimization problems. The first is the thing we want to optimize: $g(\pi) = \max_a \|a\|^2_{Q_\pi^{-1}}$, and we want a distribution $\pi$ minimizing it, so the weighted norm is small for all actions. That looks like a really hopeless optimization problem: the action set might be big, there's this weird max, and you want to minimize it; really nasty, not a good problem at all, apparently. But it turns out to be equivalent to maximizing the log-determinant, $\log\det Q_\pi$, and that's a volume thing: for a positive definite matrix $Q_\pi$, the determinant is the volume form of the associated ellipse, essentially. Kiefer and Wolfowitz proved these problems are equivalent, and furthermore that if A spans $\mathbb{R}^d$, if you have enough actions to span the space, then the optimal value is exactly d. Exactly. We know precisely the optimal value of this optimization problem; it's really incredible. And the proof is not hard; I won't show it, but it's the usual stuff you do to prove something is a minimum: take differentials, set them to zero, and it all falls out. Incredibly nice, and also pretty surprising. A last point: the support of this $\pi$ is not too big; that's not very useful for us here, but it's interesting. And there's a very nice geometric interpretation, related to the problem of finding a minimum volume ellipsoid. Plot the set of points that are your actions; the data matrix of a design $\pi$ determines an ellipsoid $E = \{x \in \mathbb{R}^d : \|x\|^2_{Q_\pi^{-1}} \le d\}$, which is the equation of an ellipsoid for a positive definite matrix $Q_\pi$.
And this ellipsoid, plotted here at the optimum, is the smallest centered ellipsoid containing your points: it's centered at zero and it contains them all. That's what this $\pi$ is: the distribution over actions such that the resulting data matrix gives exactly this ellipsoid, and the distribution is supported only on the points lying on the boundary. So in this picture $\pi$ is zero for all but those three boundary points, with weights on them making this true. I think this is a very nice result; if you know some convex analysis, it's basically another version of John's theorem. So that's Kiefer-Wolfowitz, and we just use the existence of this distribution to add exploration to our algorithm. What you should think of this design as doing is choosing the points you play so as to minimize the variance of a least squares estimate; this is why it's called experimental design. Suppose you're just doing a least squares estimation problem: you have some action set A, a finite subset of $\mathbb{R}^d$, and you get to choose a sequence of covariates $A_1, \dots, A_n$. There's no reward here; the objective is different. Then you see a sequence of signals $X_1, \dots, X_n$ with $X_t = \langle A_t, \theta\rangle + \eta_t$, where $\theta$ is an unknown parameter and $\eta_t$ is noise. So this is like the linear bandit setting from the first day, but we don't care about regret; we care about the variance of the estimator. As usual, $G = \sum_t A_tA_t^\top$ is the data matrix, and the least squares estimator is $\hat\theta = G^{-1}\sum_{t=1}^n A_t X_t$. Now ask: what is the variance in a particular direction? If you work out the variance of $\langle a, \hat\theta\rangle$, it's proportional to $\|a\|^2_{G^{-1}}$, the norm of a weighted by the inverse data matrix; this is basically the calculation we did in the first lecture. The experimental design problem says, roughly: how do I choose the $A_t$ to make this as small as possible? Choosing exact counts is in itself quite a hard integer programming problem, so instead we find the optimal design: a distribution $\pi$ from the action set to $[0,1]$, and then we play each action $a \in A$ exactly $\lceil \pi(a)\,n\rceil$ times. We're saying, essentially: I want to play according to this distribution $\pi$, with a little rounding to make things match up. And what happens if we do this? Ignoring the rounding, $G \approx n\sum_a \pi(a)\,a a^\top$.
That is, $G \approx n Q_\pi$, so $G^{-1} \approx \frac{1}{n}Q_\pi^{-1}$, and substituting into the variance bound, the variance in direction a is basically $\|a\|^2_{Q_\pi^{-1}}/n$, up to the rounding business, which is at most $d/n$ by the Kiefer-Wolfowitz theorem. So this theorem tells you how to choose your covariates: if you have a choice and you want to learn by least squares, it tells you how to do it in an essentially optimal way. This comes from experimental design, where you want to design your experiment to maximize the information you get. And so it's not very surprising that when we want to add exploration to our algorithm, we want exploration that explores in a kind of optimal way, and that's exactly what this G-optimal design is doing. I think it's really nice. Even better, you can compute it relatively efficiently: you can compute an approximate G-optimal design using Frank-Wolfe, and nowadays it works up to hundreds of dimensions and, I think, millions of points. Really nice. Any questions on that? Yeah. A question: you compute $\pi$ to minimize the variance, which is something you already wanted to minimize, so why do you still call it exploration? So, first of all, this $\pi$ is a distribution over the actions, such that if you sample actions from it and then do least squares estimation, you minimize the variance of the estimate. In that sense, what is exploration? You want to understand what the real thing is. We have an estimator in mind, this $\hat\theta$, and if we commit to it, its quality is roughly measured by the variance: if the variance is really low, we have a good idea of what's going on, and if it's high, we don't. So how should we choose the actions when we explore? We should choose them to make the variance small, and this design does exactly that. It's just terminology. Yeah. Another question: is this related to the baselines that have been proposed for variance reduction in policy gradient methods? I'm not sure it's exactly the same. Here we are doing a variance reduction, so in that sense it's the same idea: we're sacrificing, not exactly bias in this case, but we're not doing exactly what the algorithm tells us to do, because that would give a high variance estimate. I'm not sure it's easy to make the parallel exact, but it's maybe something to think about.
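Since the Frank-Wolfe computation was only mentioned in passing, here is a sketch of what it looks like (my own code, not from the lecture; `g_optimal_design` is my name, and the exact line-search step for the log-det objective is the classical Fedorov-Wynn choice, stated here as an assumption rather than derived):

```python
import numpy as np

def g_optimal_design(A, iters=1000):
    """Frank-Wolfe (Fedorov-Wynn) sketch for an approximate G-optimal design.
    A: (K, d) array whose rows span R^d. Returns design weights pi."""
    K, d = A.shape
    pi = np.full(K, 1.0 / K)                      # start from the uniform design
    for _ in range(iters):
        Q = np.einsum('k,ki,kj->ij', pi, A, A)    # Q(pi) = sum_a pi(a) a a^T
        norms = np.einsum('ki,ij,kj->k', A, np.linalg.inv(Q), A)
        j = np.argmax(norms)                      # worst-covered action
        g = norms[j]
        if g <= d + 1e-6:                         # Kiefer-Wolfowitz: optimum is d
            break
        step = (g - d) / (d * (g - 1.0))          # exact line search for log det
        pi = (1.0 - step) * pi
        pi[j] += step
    return pi

# Toy check: the max weighted norm should come down to roughly d.
rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))
pi = g_optimal_design(A)
Q = np.einsum('k,ki,kj->ij', pi, A, A)
print(np.max(np.einsum('ki,ij,kj->k', A, np.linalg.inv(Q), A)))  # ~ 4
```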
And so if we just plug this into our bandit algorithm and fix all the constants, we face the difficulty that we're no longer playing exactly what the algorithm says: with some probability, all the time, we do this random exploration, and we pay a price for it. When you explore, your regret could be big, because you're choosing an action that looked good for exploring, not necessarily good for regret, so you pay an additional $n\gamma$ in the bound; $\gamma$ is the probability that you explore. Fortunately, we're allowed to choose $\gamma$ relatively small. The calculation we did to check that the losses weren't getting too negative required the condition from before, and it's satisfied if we choose $\gamma$ of order $\eta d$, which contributes another term of the same $\eta d n$ type. If you fix all the constants and do the approximations properly, you end up with the constants on the slide, but the main point is that we get basically the bound we wanted, and now it's a real bound. So this added exploration is really crucial for the proof, and at the same time it makes the algorithm more robust; you can really see the improvement. And that's linear bandits. Again, it's the same old algorithm we've gotten used to, a new estimator, a slightly new analysis; here we had to add a little forced exploration, but otherwise it's all the same stuff. Any questions on this one? The next problem I want to show you is a really nice one: optimal path routing. It's an online problem; we have a map. This is the picture from the book: I'm from Australia, my co-author is from Hungary, so you have to get from Budapest to Sydney. That's your objective, and it's a really long way, I can tell you: twelve hours, then another eight hours, it's really terrible. And the game we play is that you have to do this a lot; I go back to Australia, well, not as often as I should, and rarely from Budapest, but anyway. The problem is that I have to choose a path: I buy my plane ticket, take my flights, and I see how long it took. The model we assume is that when I take a flight, I learn how long that flight was, but I don't observe how long other flights would have been. So if I go Budapest, Frankfurt, Singapore, Sydney, I observe that the first leg was one hour, the next twelve, the next eight, but I don't get to observe that going via Beijing and Abu Dhabi would have been seven hours. This kind of feedback is not what we call bandit feedback; we call it semi-bandit feedback, because you get a little more information: you don't just observe the total time, you observe the time along each edge of the path you took. So again, this is a sequential game: each round I choose which path to take, the adversary chooses the lengths of all the edges, and I want to do as well as the best single path in hindsight. That's the semi-bandit problem, and it has a really nice and straightforward solution. Here's the formalization: we have d edges in the graph, and a path is a set of those edges, not just any set, but one that actually forms a path through the graph. Our action space is the set of paths, represented as 0-1 vectors, so the action set is some subset of $\{0,1\}^d$, and a coordinate equal to one just means we travelled along that edge. So take a really simple graph with a start and a goal.
Then there are maybe two intermediate nodes, and the edges are directed: this is our start, this is our goal, and there are just two paths, with d = 4 edges. Label the edges one to four; then the action set has two elements, $a = (1, 1, 0, 0)$ for one path and $a = (0, 0, 1, 1)$ for the other. These are just m-hot vectors where the ones indicate which edges you traverse. In a big graph, this is obviously a really big action set. Here we have the unusual case where the action set is smaller than d, but in a bigger graph the set of paths is usually much larger than the set of edges. And we assume the loss is the length of the whole path: I suffer in a plane for 24 hours, my loss is 24. We can write it as an inner product, $\langle a, \ell_t\rangle$; this is the power of the linear structure, it actually appears a lot. Because the actions are these 0-1 vectors, the inner product selects out exactly the edges you travel, and that's the loss you suffer: a linear loss, as we're used to. To normalize things nicely, we assume the losses are in [0, 1], though of course you can scale that up to [0, 15] if you're doing these silly flights. So that's the formalization of the problem, and the set A is given to you. There are two cases people have studied, as I mentioned: bandit feedback and semi-bandit feedback. With bandit feedback, all you observe is the length of the whole path. With semi-bandit feedback, which is what we talk about today, you observe the length of each edge you actually traversed: if you didn't traverse edge i, then $A_{t,i} = 0$ and you just get zero, which provides no information, but if you did, you actually observe its loss. Okay, so this is the problem.
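Here is the toy graph written out as vectors, to make the linear-loss point concrete (my own sketch; the per-edge losses are made-up numbers):

```python
import numpy as np

# The two routes of the toy graph, over d = 4 edges (indexed 0..3 here):
paths = np.array([
    [1, 1, 0, 0],   # route via the top node: edges 0 and 1
    [0, 0, 1, 1],   # route via the bottom node: edges 2 and 3
])

ell = np.array([0.2, 0.3, 0.1, 0.5])   # per-edge losses chosen by the adversary

# The loss of a path is the inner product <a, ell>: it picks out the
# traversed edges and sums their losses.
print(paths @ ell)                      # [0.5 0.6]
```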
Back to ranking: we want to maximize these interactions, so we're gonna say the loss is zero if the user did click on the item, or buy it or do whatever they're supposed to do, and one otherwise. And now the set of actions is again these m-hot vectors: there's a one in a coordinate if you actually showed that item to the user. So our action set is the set of zero-one vectors whose L1 norm is equal to m, the vectors with m things turned on. This is a little model of a ranking problem, and what do you get to observe? You observe the clicks: did they interact with the things you showed? But it really is like a bandit; you have to do some exploration, because you don't get to observe whether they would have clicked on something you didn't show. So both of these problems can be formalized in this online linear semi-bandit setting. And this is the general problem: you're given some set of actions, which we assume are zero-one vectors with L1 norm bounded by m, so they're at most m-hot vectors. A could be the set of paths, it could be rankings, or rankings with extra constraints, say you can't show two particular movies together, all kinds of things. The loss you suffer is just the inner product, like we've seen, and we particularly care about this semi-bandit feedback case. The regret is exactly the same as in every other setting we've looked at: the difference between the loss you suffer and the loss of the best fixed action in hindsight. Okay, so this is the combinatorial semi-bandit, and we're gonna look at one algorithm for this problem. Any questions on the setup? Exactly, m is like the length of the longest path, that's right, and we'll see that m is gonna appear in the bound we prove. Any other comments? All right. So it's gonna come as probably no surprise that we're gonna do more or less the same thing we've done before, but there's a little bit of a trick here. The trick is that we're not gonna run follow the regularized leader on the set of actions; we're gonna run it on the convex hull of the action set. That's gonna end up leading to a better bound, and we'll sort of see why shortly. So what is the algorithm doing? We're gonna play FTRL with negative entropy, but on the convex hull of A. So what is the convex hull? It's just the smallest convex set containing your set. If you have some set of points in R2, it's not convex, because an average of two points can land outside it; the convex hull is what you get when you fill that in. So our algorithm plays follow the regularized leader on the convex hull of the action set. The action set A is this high-dimensional set of points; taking the convex hull gives us a thing called a polytope, and that's where the FTRL plays. And then, well, you can't actually play a point strictly inside a convex hull, right? You have to make an actual recommendation to your user. So the learning algorithm says: I want you to play this point x_t. That's the recommendation.
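The step that comes next, expressing the recommended point as a mixture of corners, can be set up as a linear feasibility problem. Here's a minimal sketch, assuming the action set is small enough to enumerate explicitly (which you'd only do for tiny examples; the helper name is mine):

```python
import numpy as np
from scipy.optimize import linprog

def corner_distribution(actions, x):
    """Find p >= 0 with sum(p) = 1 and actions.T @ p = x.

    actions: (k, D) array whose rows are the 0/1 action vectors.
    x: a point assumed to lie in the convex hull of those rows.
    """
    k = actions.shape[0]
    # Equality constraints: the mixture must average to x and sum to 1.
    A_eq = np.vstack([actions.T, np.ones((1, k))])
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * k)
    return res.x  # probabilities over the corners

# e.g. with the two paths from before, x = 0.3 * a1 + 0.7 * a2
# recovers p = (0.3, 0.7).
```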
And then you find a distribution over the corners that averages to that point, and the algorithm plays that distribution; that's this P_t. So the algorithm is recommending that you play x_t. It doesn't know that you're living in this discrete action set; it thinks you're playing in the convex hull, but you can't. This is the standard convexification idea: we play probability distributions whose average equals the point that follow the regularized leader recommends. Okay, so that's the idea. Then we need a loss estimator, and here we have this extra information, right? We have the loss in each coordinate we played. So we're gonna estimate the loss with the same importance-weighted thing. This A_ti is basically the indicator of whether you took edge i; that's just the definition of the setup. So A_ti equals one if and only if you played edge i, you flew along it, and otherwise it's equal to zero. And what we're saying is: the estimate L̂_ti is the loss along that edge if we actually observed it, so A_ti times L_ti, divided by X_ti, the probability that you got to observe it. So you can think of X_ti, this coordinate of the point in the hull, as the probability that you take edge i. So this really is just our importance-weighted estimator again, but because we have the extra information we observe the loss in every coordinate we played, we don't need the least squares estimator anymore; we can estimate the coordinates independently, and this is gonna buy us something, okay? So that's the loss estimator. Let me also write down properly what the algorithm is doing to produce this X_t. I said the algorithm plays follow the regularized leader with negative entropy on the convex hull of A, so let K be the convex hull of A. The algorithm chooses X_{t+1} to be the argmin over x in K of the learning rate times the sum of the estimated losses, the inner products of x with the L̂_s, plus the regularizer, okay? Before, I was able to write this down as an exponential weights distribution, and the only reason I could do that was that we were playing on the simplex. Now we're not playing on the simplex, and this optimization problem can't be solved in closed form to give you the exponential weights thing, so we just have to leave it in this form. You could play on the simplex: you lift to the big set of actions and play exponential weights over A. It turns out that gives you a slightly worse bound, and to give you a little bit of intuition about that, think about what the regret bound looks like when we do this negative entropy stuff. What you end up getting is a diameter term for the thing you're playing on, divided by the learning rate, plus the sum of these dual-norm-squared terms in expectation. So this is what we have as our bound. Now, the second term is basically the same no matter how you play; if you lift to the space of actions, it stays the same.
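Written out, the objects on the board look roughly like this (a paraphrase, with constants suppressed):

```latex
% Loss estimator, one per coordinate (semi-bandit feedback),
% unbiased because E[A_{t,i} | X_t] = X_{t,i}:
\hat L_{t,i} = \frac{A_{t,i}\, L_{t,i}}{X_{t,i}}.

% FTRL with negative entropy F on K = conv(A):
X_{t+1} = \operatorname*{arg\,min}_{x \in K}
  \ \eta \sum_{s=1}^{t} \langle x, \hat L_s \rangle + F(x),
\qquad F(x) = \sum_{i=1}^{D} \big( x_i \log x_i - x_i \big).

% Generic shape of the resulting regret bound:
R_n \lesssim \frac{F(a^\ast) - F(x_1)}{\eta}
  + \eta \sum_{t=1}^{n}
    \mathbb{E}\Big[ \sum_{i=1}^{D} X_{t,i}\, \hat L_{t,i}^2 \Big].
```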
But the first term changes. On the convex hull, the relevant quantity is the dimension D, and we'll compute the diameter term in a second. If instead you lift to the space of all actions, you get log of the size of A there, and A is usually really, really big in these games. So that's why lifting causes the problem. It's a really good question. Okay, so we're not gonna do that, although it would make the analysis truly easy. Now, nevertheless, all we have to do is plug this into our bound. So this is the original bound I had, except we can't have the log K part anymore, because we're not playing on the simplex; we still have the difference of the potentials, plus the sum of the dual norm terms, and we have to bound each of these two pieces. Let me write the estimator out again: L̂_ti = A_ti L_ti / X_ti. Okay, so let's do the second term first. It's actually exactly the calculation that we're really used to. We care about the expectation of the sum over i from one to D of X_ti times L̂_ti squared; that's the thing we're trying to bound. Well, we do our usual business and substitute in the definition. Now A_ti is either zero or one, so when we square it, it's still just A_ti; it's an indicator function. So we have the expectation of the sum of A_ti L_ti squared divided by X_ti. But the conditional expectation of A_ti is exactly X_ti, so this is equal to the expectation of the sum over i of L_ti squared, and that is at most D, because we have assumed our losses are bounded in [0, 1]. So this is again the familiar calculation that we've done a lot, and it gives us the second term. Now we just need the first term. When we were working on the simplex, the first term gave you this log K really easily, and now it doesn't; now we have to do a little bit more. What we care about is bounding F(a) minus F(x_1), where F is the negative entropy potential: F(a) is the sum over i of a_i log a_i minus a_i. Each coordinate of the x's and a's is something between zero and one, so F(a) is at most zero no matter what a is; you just do a little calculation. So the only term we really have to care about is minus F(x_1). Writing it out, it equals the sum over i of x_i, plus the sum over i of x_i log(1/x_i). The first sum is the L1 norm of the point we played, and by our assumption on the action set it's at most m. So we're left with m plus the sum over i of x_i log(1/x_i). Okay, and this thing looks very much like an entropy; the only thing going wrong is that the x_i's sum to m, not to one. So we bring an m out the front, dividing by m, and do the same thing inside the logarithm: we get m times the sum over i of (x_i/m) log(1/(x_i/m)), and the other m I pull out of the log, which gives an extra m log(1/m).
Okay, so this thing is now a probability distribution: we're computing the entropy of a distribution over D items, and if you remember your information theory, that entropy is at most log D. And the other term is just the constant log(1/m) times the sum of the x_i/m, which is one; combining it with the m log D, we get m log(D/m). So the bound on our potential difference is the m from the first term plus this, so m(1 + log(D/m)) altogether. That's the second part done. And then we optimize the eta. So substituting both of these calculations in and optimizing the eta, this is what you end up getting, and we actually know this bound is optimal, or more or less optimal. Whereas if we did the suggestion of lifting to the big action set, I think what happens is an extra factor of m creeps out, so it's just a little bit worse. Right, so we have just five minutes left. The only problem with this algorithm is that it has some computational challenges, right? To implement it, what optimization problems do you have to solve? Well, there are two. You have to solve the follow the regularized leader calculation; that's a convex problem, so a priori hopeful. And you have to find the P_t whose average is x_t; that's a linear feasibility problem, so also a priori hopeful. But both of these problems are over a really nasty set. This set A is huge; its convex hull is a giant polytope, and in general this convex optimization stuff is not gonna work well when even describing the polytope might take exponentially many parameters. If that's the situation you're in, even your fancy convex optimization algorithms are not gonna help you out. And there's sort of a reason for this: combinatorial optimization is just hard, right? We could encode NP-hard problems like the traveling salesman problem in this framework, and you just can't solve those efficiently, so we should not expect an efficient algorithm for this problem in full generality. But there is a case where we can hope, and it comes from looking at the regret. So imagine we have an efficient algorithm with small regret; that's what we're trying to derive anyway. If we had one, I could just run it and look at the average action it produces, and that average action should be close to optimal. What that means is that, just by running my algorithm, I could approximately solve linear optimization problems over the action set A. So in our search for an efficient algorithm, it seems reasonable to restrict to cases where linear optimization over A is itself tractable. We're gonna assume we can solve that problem, and then ask what to do algorithmically.
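Collecting the two calculations and the tuning of eta in one place (constants suppressed; the exact constants come from doing the approximations properly):

```latex
% Second term (variance), using E[A_{t,i} | X_t] = X_{t,i}:
\mathbb{E}\Big[\sum_{i=1}^{D} X_{t,i}\, \hat L_{t,i}^2\Big]
  = \mathbb{E}\Big[\sum_{i=1}^{D} L_{t,i}^2\Big] \le D.

% First term (diameter of the potential), using \lVert x_1 \rVert_1 \le m:
F(a^\ast) - F(x_1) \le m\Big(1 + \log\tfrac{D}{m}\Big).

% Put together and tune \eta:
R_n \lesssim \frac{m\big(1 + \log(D/m)\big)}{\eta} + \eta\, n D,
\qquad
\eta = \sqrt{\tfrac{m(1 + \log(D/m))}{n D}}
\ \Rightarrow\
R_n \lesssim \sqrt{n D\, m\,\big(1 + \log(D/m)\big)}.
```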
And there's a very nice idea which says: don't do follow the regularized leader, do a thing called follow the perturbed leader. Follow the perturbed leader replaces the regularization with randomization. What it does is it says, okay, we're gonna play follow the leader, which is just doing the thing that's best in hindsight, but first we add a little bit of noise to the losses. We pick a good distribution for this perturbation Z_t, perturb our losses by that amount, and then do follow the leader on the perturbed thing. This is called follow the perturbed leader, and it turns out the perturbations give you enough exploration that this is a good thing, okay? Now, this algorithm is not quite as good as follow the regularized leader for this particular problem, or at least we don't know how to make it that good; we don't know how to choose the Z_t so that the regret is quite as good. But we can prove a bound where the m just comes outside the square root. And we've replaced this potentially really hard convex optimization problem with a linear optimization problem per round, which is exactly the thing we assumed we can do. And the nice thing is, okay, the analysis of this algorithm is too technical to show here, and I don't have anywhere near enough time, but the proof is actually very nice. The main idea is to ask: what is this algorithm? It's actually follow the regularized leader in expectation, with some potential. All you have to do is work out which potential the perturbation is equivalent to in expectation, and then you do the normal follow the regularized leader bound, and that gives you the answer. So this is a really nice framework, and there are lots of algorithms that fit into it. You've looked at Thompson sampling and things like that in some experiments, and these are all essentially versions of this follow the perturbed leader idea. So it's pretty neat. How does this connect to UCB, where you also add something to guide exploration? So there's a few differences; I would say it's more similar to Thompson sampling. This addition here, this Z_t, is a random thing, sometimes positive, sometimes negative, so it's not always telling you to be optimistic about stuff; sometimes an action actually looks worse. Whereas Thompson sampling is sampling from a posterior, so I think this is more connected to Thompson sampling than anything else. But it is generally this idea that if you just do follow the leader you don't explore enough, and by adding this randomization you encourage exploration. That's the thing that works out. Yeah, any other questions? Then I'll just recap. All right, so I just wanna say, I mean, we wrote this whole book, and it turns out that in six hours you can't present a 600-page book, so there's some more stuff for you to be excited about. We talked a little bit today about handling non-stationary environments, but we didn't talk at all about delays, right? In practical problems you show your user something, then they go away, and a week later they come back and buy the thing, or not. That's a really hard problem to deal with, because you run into the kinds of trouble that appear in reinforcement learning, like credit assignment, right?
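Here is a minimal sketch of the follow the perturbed leader idea. The brute-force argmin stands in for the linear optimization oracle (for shortest paths, that would be an actual shortest-path solver), and the exponential perturbation is just one illustrative choice; as said above, the right choice of distribution is exactly the delicate part.

```python
import numpy as np

def ftpl_choose(actions, cum_loss_est, eta, rng):
    """One round of follow the perturbed leader.

    actions: (k, D) array of 0/1 action vectors.
    cum_loss_est: length-D vector of cumulative estimated losses.
    eta: controls the perturbation scale (larger eta = less noise).
    """
    # Perturb the cumulative losses with fresh i.i.d. noise each round;
    # subtracting noise makes every coordinate occasionally look good.
    z = rng.exponential(scale=1.0 / eta, size=cum_loss_est.shape)
    perturbed = cum_loss_est - z
    # Follow the leader on the perturbed losses. In practice this argmin
    # is delegated to the linear optimization oracle over A.
    scores = actions @ perturbed
    return actions[np.argmin(scores)]

rng = np.random.default_rng(0)
```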
Back to the delays: if the user comes back a week later, how do you decide which intervention caused them to buy the product? So delays are a big issue, and there's a bunch of work on that. And then there's lots of other structure we didn't talk about. We did a lot of stuff on linear bandits, where the losses are linear, but what if they're convex? That turns out to be a really rich, interesting problem that's still not fully understood. You can talk about infinite action spaces, bandits on graphs, you can kernelize everything. And then there are lots of other settings. The only setting I've really talked about is this regret minimization setting, where you care about the cumulative reward you collect over a bunch of rounds. But there's also the pure exploration setting, where what you care about is just identifying the best action, not the price you pay along the way. That's relevant, for example, if you're doing clinical trials on mice: you don't care about the mice, but you do care about the treatment you recommend at the end. There you have the pure exploration problem. Then there are other settings altogether, like partial monitoring, where you don't even observe your loss, just some surrogate information about it. And okay, I didn't talk at all about Bayesian methods, which are also pretty exciting. So I think that's the end. I'm really happy that so many people came. I hope this has been fun; it's been really fun for me. And I hope to see you at conferences, I guess. Or come to DeepMind. That would be great. Thanks.