All right. So I guess it's time for the second exciting bandit session. As I said yesterday, it's going to start out a little bit away from bandits. Today we're going to talk mostly about online convex optimization, and even more specifically online linear optimization. You've seen quite a lot of optimization techniques in this workshop, and this is a little bit different in the sense that it doesn't rest on any kind of statistical assumptions. We're trying to move away from that, towards a fully adversarial setting: we don't want to make assumptions on how the data is generated. And then at the end of today I hope to show you how we can use these tools to solve bandit problems. If we don't quite get there, then tomorrow will just be a rapid-fire series of applications of these online linear optimization tools to bandit problems. Today is the hard work, and tomorrow you get to see the exciting results. So it's worth remembering what a convex function looks like. Convex functions are good because they have minima that we can easily find, with gradient descent for example. They have two really nice properties for us. First, the sum of two convex functions is again a convex function, which is a nice exercise to prove. Second, if the function is strictly convex, so it always has some curvature, a strictly positive second derivative or a positive definite Hessian, then it also has a unique minimizer. That's why we like convex functions. So online convex optimization is not a bandit framework, but it is a game, in the same sense that you choose a sequence of actions, you receive losses, and you want to make your regret small. The game works as follows. You have n rounds as before, and probably I'm going to tell you n in advance, but maybe not. We have some convex set K, a subset of R^d, or maybe the whole of R^d; the only thing that's important is that it's convex. In each round, you get to choose some point x_t in that convex set, any choice you like. Then the adversary, some really mean person, as mean as you can imagine, chooses a whole function f_t, which has to be convex but is otherwise an arbitrary convex function from K to the reals. And they can do that even after looking at the point that you chose. The loss you suffer is f_t(x_t), the value of their function at the point you selected. So this feels like a really hard game. Imagine we're just playing on the interval [0, 1], so you can choose any point between 0 and 1. Maybe you choose this point as your first point. Then the adversary gets to choose some convex function, say a linear function, which is convex, and you suffer the loss at the point you chose. That's your loss. And now you get to choose another point. Maybe you think, oh, this wasn't such a good choice.
It would have been better if I'd been over here, so you choose over here as your second point. And now the adversary looks at that and says, aha, I'm going to choose this function. And again, you suffer a big loss. And this doesn't seem so great. So in every round of this game you can suffer a big loss; the adversary can force that to happen to you. But we're saved a little bit, because the thing we care about is the regret, and we care about the regret relative to a single fixed choice in hindsight. So the thing we care about, usually with a max in there as well, is R_n = max over x in K of R_n(x), where x is called the competitor, and R_n(x) = sum from t = 1 to n of f_t(x_t) minus f_t(x) is the regret relative to the point x. We would like to make that small, maybe for all x. The thing we don't care about, the thing that's just too hard, is what you might write with a tilde or a bar: R-tilde_n = sum from t = 1 to n of f_t(x_t) minus the minimum over x in K of f_t(x). Basically I'm moving the min inside the sum. It would seem even nicer if you could make this small: it would mean that in every round you're playing some point that's close to the thing you would really like to play. And that's hopeless; we can't do that. So this is not the objective we go for. We go for the first objective instead, and then this game becomes possible. It even becomes possible to prove that this regret is very often sublinear in the horizon, which is the same thing we saw for bandits. And for today we're going to talk almost entirely about the special case where the adversary has to choose not just any convex function, but a linear one. So the game works like this: in each round you choose your point x_t in the convex set, as usual, and then the adversary chooses some loss vector l_t, which determines a linear function, and the loss you suffer is the inner product of l_t and x_t. This is the game we're trying to play. Any questions on this setup? So the question is why the set has to be convex. If you're trying to minimize a convex function, the domain doesn't strictly have to be convex, but the problem gets really hard if it isn't. Minimizing a convex function over a convex domain is easy, and it's nice because it has this nice minimum. But that only works out because the domain of the function is convex. Now imagine we cut a piece out of the domain. Try minimizing this function now: you have to look at all the little remaining pieces, and in high dimensions you can easily get exponentially many pieces. So if the domain is convex, gradient descent is going to be nice; if it's not convex, maybe not. Good question. Any other questions?
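To make the protocol concrete, here is a minimal sketch in Python of the online linear optimization game and the regret computation. The `learner` and `adversary` callables are illustrative stand-ins of my own, not anything from the slides; note the adversary gets to look at the learner's point before choosing the loss.

```python
import numpy as np

def play_olo(learner, adversary, n):
    """One run of the online linear optimization game.

    learner(past_losses) returns a point x_t in the convex set K;
    adversary(x_t, past_losses) returns a loss vector l_t, and may
    inspect x_t before choosing it. Both are illustrative callables.
    """
    points, losses = [], []
    for t in range(n):
        x_t = learner(losses)
        l_t = adversary(x_t, losses)
        points.append(x_t)
        losses.append(l_t)
    return points, losses

def regret(points, losses, x):
    """Regret relative to a fixed competitor x: sum_t <l_t, x_t - x>."""
    return sum(np.dot(l_t, x_t - x) for x_t, l_t in zip(points, losses))
```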
All right, so the purpose of all this is that I'm going to present you an algorithm that solves problems of this kind, and one version of that algorithm is just the thing we know and love, gradient descent. But first I should tell you a little bit about why we make this restriction to the linear case, why that's a sensible thing to do. And there's actually a reason for it. In the setup we had before, the adversary is choosing a whole convex function. That's a really, really complicated object to describe: I have to tell you its value at every point, and that's difficult. But I can give you just the gradient of the function at the point that you played. That's a simple thing, easy to represent, and it turns out that in a certain sense it's enough. So let f_t be some convex function. Almost the definition of convexity is that if we draw a tangent at some point, this red line, then that tangent has to lie below the convex function. So we have this simple inequality: f_t(x) is at least f_t(x_t) plus the inner product of the gradient of f_t at x_t with x minus x_t. That's coming from convexity. And if we rearrange this formula, then f_t(x_t) minus f_t(x), your loss minus the loss you would have suffered at x, is at most the inner product of the gradient of f_t at x_t with x_t minus x. And that's linear in x; the gradient of f_t at x_t is just some fixed vector. So we can plug this into our regret and say that the regret relative to x is bounded in terms of the gradients at the points we chose. Provided the gradients are reasonably well behaved, we can take the function we're given, throw away all the details except the gradient at the point we chose, and if we can show the linearized regret is small, we have a bound on the real regret. These are called first-order methods, where you just use the gradient. You are losing information here, so there are cases where it's better not to do this reduction, but in the situations we care about, for bandits in particular, that turns out not to be the case. And it's a really nice simplification: we can study the simple, clean case of linear losses and then do a reduction for the general case, which works out well as long as the gradients are not too big, basically. So this is our motivation for studying linear losses. I'm going to use the notation l_t, but if you want to translate it into notation you're more familiar with from gradient descent, you should think of l_t as the gradient of f_t at x_t. This is called linearization, and it's a really useful trick that makes the analysis a little bit easier. The inequality holds no matter what, but the gradients, of course, might get very big, and we'll see at the end of the day that the bounds we prove depend on how big these gradients can get. So that's one case where things go wrong: if the gradients get really big, this is maybe not a good idea. The other case is that this inequality is pretty weak. Essentially we're saying the red tangent line sits below the blue function, and there can be a big gap between them, so the inequality is potentially quite loose. For example, if you want to do online learning where the functions are strongly convex, so you have some really powerful structure, then this gap is indeed really big, and if you do the linearization you get a worse rate. But for our purposes, it's not going to be a huge problem.
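In symbols, reconstructing the inequality from the slide:

```latex
% Convexity: the tangent at x_t lies below f_t, so for all x in K
f_t(x) \ge f_t(x_t) + \langle \nabla f_t(x_t),\, x - x_t \rangle
\quad\Longrightarrow\quad
f_t(x_t) - f_t(x) \le \langle \nabla f_t(x_t),\, x_t - x \rangle .
% Summing over rounds, with \ell_t := \nabla f_t(x_t):
R_n(x) = \sum_{t=1}^{n} \bigl( f_t(x_t) - f_t(x) \bigr)
\le \sum_{t=1}^{n} \langle \ell_t,\, x_t - x \rangle .
```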
OK, so from now on it's just going to be online linear optimization. We still have the situation where the adversary can choose things to be really mean, so it seems like quite a hard problem. But as it happens, we can solve it. So what are the techniques that people know for optimization, that we've already seen quite a lot of? Of course, we've seen gradient descent, where again this l_t you should really just think of as the gradient of the loss function, so it's literally gradient descent; if you're on a constrained set there's a projection, but unconstrained it's just gradient descent. Or if you're coming from a game theory background, you might be thinking about fictitious play. What is fictitious play? You look at the sequence of loss functions so far, and the action you take is the one that would have been best against them: you minimize the sum of the losses so far. That's fictitious play. Or maybe you know about exponential weights, which is specific to the case where the convex set K you're playing on is the simplex; there you can play this strange-looking exponential weighting strategy. And the purpose of this talk, or most of it, is to show that these are all instantiations of one algorithm, just with different parameters. That algorithm is called follow the regularized leader, and it has a very, very clean analysis, a very clean interpretation of the result we'll eventually get, and it covers all of these cases. All right, so what is this mystical algorithm? It comes from the idea of simply playing the minimizer of the losses so far, which actually doesn't work; it works in certain cases, but not in full generality in this online learning setting, and I'll show you why in a moment. The algorithm we use is a regularized version of that. We've seen lots of regularization so far, and this algorithm does exactly that. It plays its next point, x_{t+1}, as the minimizer over K of the sum of the losses so far plus a regularization term: x_{t+1} minimizes the sum over s up to t of the inner product of l_s with x, plus F(x) divided by eta. The regularization is saying: minimize that thing, but don't go too crazy. This F is called a regularizer, sometimes a potential, and eta is a learning rate. So this is follow the regularized leader. All you have to do now is choose your learning rate, which is easier said than done, and choose your potential, which is a good time to become Harry Potter and put on a wizard's hat, because that's really hard. But the point is that different potentials lead to different algorithms, including all of the things we've just seen.
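Here's a minimal sketch of follow the regularized leader in Python. The `argmin_oracle` is an assumed, user-supplied solver, not part of the lecture: how you compute the argmin depends entirely on K and F, and closed forms exist for the two potentials that come up below (quadratic gives gradient descent, negentropy on the simplex gives exponential weights).

```python
import numpy as np

def ftrl(losses, argmin_oracle, eta):
    """Follow the regularized leader, as a sketch.

    argmin_oracle(theta) returns argmin_{x in K} <theta, x> + F(x)
    for your chosen regularizer F. `losses` is the list of loss
    vectors l_1, ..., l_n (an oblivious adversary, fixed in advance,
    to keep the sketch simple).
    """
    theta = np.zeros_like(losses[0])  # eta * (cumulative loss vector)
    points = []
    for l_t in losses:
        points.append(argmin_oracle(theta))  # x_t minimizes the regularized sum so far
        theta = theta + eta * l_t
    return points
```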
OK, so the simplest case is what happens if we don't do any regularization, F(x) = 0. Then we have exactly fictitious play, and I told you that this doesn't work. Now I'll show you why. Essentially, the problem is that this algorithm is really unstable. Here's the example. The set we're playing on is K = [-1, 1], just the interval, so each round you have to choose some number between -1 and 1. I don't really care what the algorithm does in the first round; it has to break ties somehow, so let's just say x_1 = 0. It doesn't really matter; the counterexample works anyway. So the algorithm chooses that, and now the adversary gets to choose a loss. The loss I'm going to choose is l_1 = 1/2. What does this loss function look like? We can draw our space from -1 to 1, and the function f_1(x) = x/2 is a line sloping upward. The algorithm played somewhere; it doesn't actually matter where. The point is, where will the algorithm play next? We have one function observed, and we want to choose the point that minimizes the loss so far. So what point is the algorithm going to play? Minus 1, right: these are losses, we want to minimize them, so the algorithm plays x_2 = -1. Exactly. Great. And now what the adversary is going to do is choose l_2 = -1, a line sloping the other way. And the loss you suffer: you chose -1, so f_2(x_2) = 1. Now the algorithm looks again and has to minimize the sum of these two functions; oh, I've messed up this slope, no, the slope is OK. So what does the algorithm play next? 1. Good. It switches completely over to the other side: x_3 = 1. And now you can probably guess what I'm going to do as the adversary: I choose the next slope to be a really big positive one, and then we just alternate from side to side. Your algorithm is totally miscoordinated with what the adversary is doing. Every round the adversary switches to the other side; you suffer a loss of 1, you jump to follow it, but it's already switched back. This is no good. So again you suffer f_3(x_3) = 1, and so on. But note that if you had just played at the center in every single round, you would have suffered a much smaller loss. So our regret is actually really, really large in this game, and that's bad, and that's why fictitious play doesn't work. The regularization adds some stability to your algorithm, which means this isn't going to end up happening, and that's what we would like to see. So fictitious play is no good. The second example is gradient descent, and it turns out that gradient descent is nothing more than follow the regularized leader with a quadratic regularizer. Here we're minimizing the sum of the losses weighted by the learning rate, plus an L2 penalty that says don't go too far away. And why does this work out? The thing we're minimizing is a convex function, and we're minimizing over the whole of R^d, so we can solve it just by differentiating and finding the point where the gradient is 0. If we do that, we get eta times the sum of the losses plus x equal to zero, because the derivative of half the squared norm of x is just x. Moving things to the other side, x_{t+1} is minus eta times the sum of the losses so far, and the step from x_t to x_{t+1} is exactly a gradient step. So this is nothing more than gradient descent; or you can say gradient descent is just follow the regularized leader with a quadratic potential.
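In symbols, reconstructing the calculation from the board:

```latex
% FTRL with the quadratic potential F(x) = \tfrac{1}{2}\lVert x \rVert_2^2
% on K = R^d: set the gradient of the objective to zero.
x_{t+1}
= \operatorname*{arg\,min}_{x \in \mathbb{R}^d}
  \Bigl( \eta \sum_{s=1}^{t} \langle \ell_s, x \rangle
        + \tfrac{1}{2}\lVert x \rVert_2^2 \Bigr)
= -\eta \sum_{s=1}^{t} \ell_s
= x_t - \eta\, \ell_t .
```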
And this is an algorithm that we use a lot. You can see what it's going to do in the bad example from before: it's not going to jump right out to the boundary immediately. It stays kind of close, because you choose the learning rate to be reasonably small, and so the algorithm ends up just oscillating around 0, which is actually the right thing to do in this problem. The fact that you've slowed the algorithm down a lot is what buys you a good regret bound. And this is gradient descent. And I think maybe I had another example. Oh, no, I don't have another example; we'll see a few more later. This was just a teaser. But to actually analyze these things, we need a bunch of tools from convex analysis, and that's the plan for the next little while. So, who knows about Bregman divergences? No one? I'm pretty sure that's a lie. Who really knows about Bregman divergences? Still no one. OK, so we'll learn about Bregman divergences, even if you already know; it's your punishment. We saw this inequality that measured the difference between the tangent and the convex function. The Bregman divergence is just measuring how big the loss is when you use that inequality. It's a function of two points, and it depends on a choice of regularizer: D_F(x, y) = F(x) - F(y) - the inner product of the gradient of F at y with x - y. It's defined in this sort of hard-to-interpret-looking way, if you don't think about it. But what's really happening is that you look at the point y, you create the tangent at y, and then you ask how far that tangent is from the function at the point x. So you look at the point x, and you measure the difference between the tangent and the actual function there. That is the Bregman divergence. It's going to be nonnegative, because the convex function always lies above its tangent, by the definition of convexity. And how positive it is depends on how curved your function is. If the function is really curved, the Bregman divergence will be big; whereas if your function were actually linear, the Bregman divergence would be zero. So it's somehow measuring curvature. And if x were equal to y, you'd just be comparing the function and tangent at the same point, and it would be zero. So you have two simple properties, which I should have put down there: D_F(x, y) is greater or equal to zero, and D_F(x, x) equals zero. These are two important properties of the Bregman divergence. And you can think of it, if you like, as a distance, although it's not actually one: it's not symmetric, for one thing, and it doesn't satisfy the triangle inequality, so it's not a metric in the real sense. But it is measuring somehow a distance. We call it a divergence, in the sense that it's zero if the points are the same, and if they're not the same, then it's typically positive; if you have strict convexity, it's strictly positive. So it's somehow a measure of distance, if you like. And the final thing that's maybe useful, to emphasize the distance nature of this thing, is a second-order Taylor approximation of the Bregman divergence. You fix y, treat x as the variable, and expand about the point y. If you do this, then the Bregman divergence is approximately half the squared norm of x - y weighted by the Hessian of the potential at y. And F is convex.
The Hessian may not exist, but we'll assume it does, and then it's positive definite by the definition of (strict) convexity, so this weighted expression really is a norm, for fixed y. So locally the Bregman divergence looks like this quadratic norm; if you go far away, then, OK, maybe that's not true anymore. OK, so this is the Bregman divergence, and it turns out to have some really nice properties that we're going to use. Any questions about it? There'll be a few examples in a second to get you familiar with these things. Right, so first the quadratic regularizer, the one that gave us gradient descent when we used it earlier: F(x) = half the squared two-norm of x. Do I go with the half? Yeah, OK. So what is the Bregman divergence for this? We just write out the definition: D_F(x, y) = half the norm of x squared, minus half the norm of y squared, minus the inner product of the gradient of F at y, which is just y, with x - y. So this looks like something we can handle, and indeed it simplifies to half the squared norm of x - y. So if you have a quadratic potential, the Bregman divergence you get is just the squared Euclidean distance. That's nice, and this is one of the fundamental potentials. But the other potential that we see coming up all the time, and which is particularly useful in bandits for reasons that will become apparent later, is the negentropy potential, F(x) = sum over i of x_i log x_i - x_i, which has this strange form. So let's see what happens if we do the same calculation for the negentropy. I'm being a little bit sloppy about the domains of these functions; you have to interpret this in the right way. The log is obviously not defined if x_i is smaller than 0, so we only define the Bregman divergence where the potential is defined, which in this case is the positive orthant, where all the entries are positive. OK, so what happens here? Again, we just write down the definition: D_F(x, y) = sum over i from 1 to d of x_i log x_i - x_i, minus the same sum for y, minus the gradient term. So what is the gradient of this thing? It's just more convenient to differentiate the sum form coordinate by coordinate: the derivative of x_i log x_i is log x_i + 1, and then we subtract the derivative of x_i, which is 1, so the gradient of F at y in coordinate i is just log y_i. We substitute that in and write out the inner product: minus the sum over i of log y_i times x_i - y_i. This is all just writing down the definition and doing the derivative; nothing fancy is happening. And now we can do, I hope, a bunch of simplification. We have a sum of x_i log x_i, minus a sum of y_i log y_i, and from the gradient term a plus sum of y_i log y_i, so those cancel; and the x_i log x_i and minus x_i log y_i terms combine, so we get the sum over i of x_i log(x_i / y_i).
And then there are still the other terms: plus the sum over i of y_i minus x_i. So, if I did the calculation correctly, D_F(x, y) = sum over i of x_i log(x_i / y_i) plus sum over i of (y_i - x_i). OK, so this should look a little bit familiar. In particular, if the x's and y's live in the simplex, so they're probability distributions whose coordinates sum to 1, then that second term is zero: both sums equal 1 and they cancel. And what's left is exactly the KL divergence, the relative entropy, between these things. So the significance of this potential is that if you use it on the simplex, which essentially we always will, the Bregman divergence induced by the negentropy is the KL divergence. That's something we're already familiar with. OK, the last one I won't derive, but you can do it yourself for fun: the potential whose Bregman divergence is, I think, called the Tsallis entropy divergence. This is also useful for bandits, so maybe we'll come back to it later. OK, so that's the Bregman divergence and some examples. Any questions? Is the last one a distance we'd recognize? In the negentropy case, yes, it's the KL divergence; in the last case, no, I think it's not quite the Hellinger distance, it's something else. Is it twice differentiable? Yes, on the positive orthant; the Hessian is going to be diagonal, with entries that behave like 1 over x to the 3/2 or something like that. And at zero? At zero it's not well behaved, but at any point in the interior it's fine. That's right: as the gradients blow up, things get bad. So in general these two potentials have a special name: they're called Legendre functions. The meaning is, essentially, if you think about what they look like in one dimension and you look at the boundary of the domain, which for both of the last two is zero to infinity: the negentropy potential grows at just a slightly super-linear rate, and at the boundary the gradient blows up to infinity. And when the gradient blows up to infinity, the KL divergence separates points near the boundary very strongly; indeed it's not exactly defined at the boundary point itself. Great question. Good. OK, but locally, anywhere in a local region of the interior, it looks quadratic.
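Before moving on, here's a tiny numeric check of the two worked examples; a sketch, with helper names and test points of my own choosing. Both printed pairs should match.

```python
import numpy as np

def bregman(F, grad_F, x, y):
    """Bregman divergence D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - np.dot(grad_F(y), x - y)

# Quadratic potential: D_F is half the squared Euclidean distance.
quad = lambda x: 0.5 * np.dot(x, x)
grad_quad = lambda x: x

# Negentropy potential: on the simplex, D_F is the KL divergence.
negent = lambda x: np.sum(x * np.log(x) - x)
grad_negent = lambda x: np.log(x)

x = np.array([0.2, 0.3, 0.5])   # two points in the simplex
y = np.array([0.4, 0.4, 0.2])
print(bregman(quad, grad_quad, x, y), 0.5 * np.sum((x - y) ** 2))
print(bregman(negent, grad_negent, x, y), np.sum(x * np.log(x / y)))
```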
All right, so the second little piece of convex analysis we need is called the first-order optimality conditions. And this is also not super mysterious: it's the condition that holds exactly when you've found the optimal point of a function. The statement: if K is a convex set and f is convex and differentiable, then x* is a minimizer of f over K if and only if the inner product of the gradient of f at x* with x - x* is nonnegative for every x in K. And what does this mean? Something pretty intuitive. You have some convex set K, and some function f on it; think of a dish sitting on top of the convex set, with some minimizer, say this point here. The claim is that if I head from x* in the direction of any other x in K, then the function is going to increase. That seems pretty natural: we found the minimizer, so if we go in any little direction within the set, it has to increase. The only slight subtlety: if x* is in the interior and the function is convex, the gradient should actually be 0. These conditions are useful in the constrained case, because x* may appear on the boundary, and then the function could decrease as you go outside the convex set; there's no reason it should be increasing in that direction. But it is increasing no matter which way you go inside the convex set. That's all this theorem is saying: it's the condition for an optimizer of a convex function. And this is another really nice thing about convex functions: you can tell locally when you're at an optimizer; you don't need any global information like you would for a non-convex function. So now these are the tools we need to analyze this FTRL algorithm, follow the regularized leader. There are kind of two steps. The first one is really simple: it's just a rewriting of the expression, and we'll see why it's useful later. We subtract and add x_{t+1}, so the inner product of l_t with x_t - x becomes the inner product of l_t with x_t - x_{t+1} plus the inner product of l_t with x_{t+1} - x. That's a very straightforward step. The next thing we do is spend a lot of time analyzing the second term; that's the troublesome one. The first term, with x_t and x_{t+1}, you can kind of believe is going to be small if we have a relatively small learning rate, because then the iterates aren't changing too much; and actually it gets even smaller when something else cancels against it later. But for the second term we really need to do quite a lot of work. All right, so let's do this. I'm going to define Phi_t as the thing the algorithm minimizes when choosing x_{t+1}: Phi_t(x) = sum over s up to t of the inner product of l_s with x, plus F(x) over eta. The algorithm looks at that thing and minimizes it. And now this is the term I care about bounding, the second term on the previous slide: the sum over t of the inner product of l_t with x_{t+1} - x. The first step is to use the linearity of the inner product to pull the x part out, so we get two sums. The sum involving x we recognize: it's Phi_n(x) minus F(x) over eta, so we plug in the definition of Phi_n at x, and since Phi_n adds an extra F(x)/eta, we have to add it back in. That's just a simple equality. Then we rewrite the other sum so that some kind of telescoping can happen. What I'm doing here is asking: what is the difference between Phi_t and Phi_{t-1} at a fixed point? Well, lots of terms cancel, and you're only left with the last term in the sum: Phi_t(x) - Phi_{t-1}(x) is just the inner product of l_t with x. So any individual term in our sum equals a difference of these potentials: the inner product of l_t with x_{t+1} is Phi_t(x_{t+1}) - Phi_{t-1}(x_{t+1}). That's also a simple step. And now I'm going to claim that a certain bound follows, which is completely non-obvious, so I'll show you why. This is the tricky part. So let's do this. The thing I actually care about is the sum term; we'll leave the last two terms off for now and put them back together in a moment. So what we care about is bounding the sum from t = 1 to n of Phi_t(x_{t+1}) - Phi_{t-1}(x_{t+1}).
And what we're going to do is write out this sum so that instead of comparing Phi's with different t's, we compare Phi's with the same t at different points. So we just write out the sum and see what happens. Expanded, it equals [-Phi_0(x_2) + Phi_1(x_2)] + [-Phi_1(x_3) + Phi_2(x_3)] + ... + [-Phi_{n-1}(x_{n+1}) + Phi_n(x_{n+1})]. So I really just expanded out that sum; again, nothing too fancy has happened. But now we want to pair the terms so that we have the same Phi evaluated at different points. That gives: -Phi_0(x_1) plus the sum from t = 0 to n - 1 of [Phi_t(x_{t+1}) - Phi_t(x_{t+2})], plus Phi_n(x_{n+1}) at the very end. Again, I'm really just reordering the terms. And now there are a couple of little tricks we need. The first: what is the algorithm doing? The algorithm chooses x_{t+1} to minimize Phi_t. So in particular, Phi_n(x_{n+1}) has to be smaller than Phi_n(x) for any x, and in particular for the competitor we compare against. That's going to be handy, because it will cancel with the Phi_n(x) that conveniently appeared earlier. All right, so now we have to deal with the middle term, and the nice thing is that it's going to become a Bregman divergence. This comes out of a nice property: if I take the Bregman divergence with respect to some convex function f, and I add a linear component to that function, the Bregman divergence doesn't change. So let f be convex and let g(x) = f(x) plus a linear function, which we can always write as the inner product of x with some vector w. If we compute the Bregman divergence with respect to g, it turns out to be the same as with respect to f, which is not hard to check: D_g(x, y) = f(x) plus the inner product of x and w, minus f(y) minus the inner product of y and w, minus the inner product with x - y of the gradient, which is linear and so splits into the gradient of f at y plus w. The linear pieces cancel: the inner product of x - y with w minus the inner product of w with x - y is zero. That's just saying the Bregman divergence of a linear function is 0. So D_g is equal to D_f. Now, what is this Phi_t? It's F over eta plus a linear function. So the Bregman divergence with respect to Phi_t is the same as the Bregman divergence with respect to F divided by eta; it's all nicely linear. So now we're going to use that to turn the middle term into a Bregman divergence.
So we have Phi_t(x_{t+1}) - Phi_t(x_{t+2}). The first step is going to seem really magical: it's equal to minus [Phi_t(x_{t+2}) - Phi_t(x_{t+1})]; I literally just negated it. And now we add in the terms we need to form the Bregman divergence: this equals minus [Phi_t(x_{t+2}) - Phi_t(x_{t+1}) - the inner product of the gradient of Phi_t at x_{t+1} with x_{t+2} - x_{t+1}], and then, since we just subtracted that gradient term inside, we subtract it once more outside the bracket. OK, you can absorb that for a second. The next step: the thing in the bracket is exactly the definition of the Bregman divergence with respect to Phi_t between x_{t+2} and x_{t+1}. And as I pointed out in the previous calculation, that's just the Bregman divergence with respect to F divided by eta. You can take it on faith, or check the obvious calculation, that if you multiply the potential by a constant, the Bregman divergence multiplies by that constant. So this equals minus D_F(x_{t+2}, x_{t+1}) divided by eta, minus the gradient term. So what's the next step? Yeah, exactly: now we use the optimality conditions and the definition of the algorithm. The algorithm chooses x_{t+1} to minimize Phi_t over K, so by the first-order optimality conditions, the inner product of the gradient of Phi_t at x_{t+1} with x_{t+2} - x_{t+1} is nonnegative, and we can just drop it. So the whole thing is smaller or equal to minus D_F(x_{t+2}, x_{t+1}) divided by eta. Great. And that's essentially the whole calculation. Now we substitute this back in, and use Phi_n(x_{n+1}) ≤ Phi_n(x) to cancel off the Phi_n(x). And did I miss something? No, we still have the minus Phi_0(x_1). But what is Phi_0? It's just the potential function divided by the learning rate, F(x_1)/eta. So what we've finally managed to do is bound the second term in the regret by (F(x) - F(x_1))/eta, minus the sum of the Bregman divergences divided by eta.
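Putting the two steps together, the bound we've arrived at reads as follows (reconstructing the slide's notation, with the Bregman sum reindexed so that D_F(x_{t+2}, x_{t+1}) for t = 0, ..., n-1 becomes D_F(x_{t+1}, x_t) for t = 1, ..., n):

```latex
R_n(x) \;=\; \sum_{t=1}^{n} \langle \ell_t,\, x_t - x \rangle
\;\le\; \frac{F(x) - F(x_1)}{\eta}
\;+\; \sum_{t=1}^{n} \Bigl( \langle \ell_t,\, x_t - x_{t+1} \rangle
      - \frac{D_F(x_{t+1}, x_t)}{\eta} \Bigr) .
```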
This is already a little bit hard to interpret, and I'll talk about what it means in a second, but maybe there are questions about the calculation first. What's that? The sign in front there? Yeah. You might think the regret is always nonnegative; that's easy to believe, but it's actually not true at all, because the regret is measured against a fixed point x: your algorithm is moving around while the thing you're comparing to is kept fixed. And it's actually sometimes important that the regret can be negative. In any case, negative regret would be great; I want negative regret. Remember, the regret with respect to x is R_n(x) = sum from t = 1 to n of the inner product of l_t with x_t - x. This can definitely be negative for some x, because I could choose an x where the losses are really big while the algorithm is doing well. But we also have the normal regret, the max over x of R_n(x), and even that can be negative. The reason is: imagine the sequence of functions I showed you before, where the adversary alternates between two slopes. The average of the two functions is zero, so for a fixed point it actually doesn't matter where you play; anywhere is OK in hindsight. But if you're really clever as an algorithm, and you work out this alternating pattern, then you can be on the correct side every round and always get loss minus 1. So in this adversarial setting, the regret is not always positive. Sometimes useful to know. But whatever the case, this Bregman divergence is nonnegative, so having it with a negative sign is definitely good for us: it makes the regret smaller, and we're happy about that. OK, so if we substitute this into the first step of the analysis, we end up with two terms, one of which is sort of easy to interpret. This F(x) - F(x_1) term is kind of like the distance between x and x_1. The way these algorithms work, at the beginning the algorithm chooses the point that minimizes the potential; say with a quadratic potential, the first point is the minimizer of the potential, and then it has to travel to some optimal point x. And the price it pays is this distance divided by the learning rate. So if you have a really small learning rate, the algorithm moves along really, really slowly, and you suffer a big regret because it takes a long time to get there. So it makes sense that this term gets really big as the learning rate gets small: the journey costs more because you're traveling more slowly. The second term probably makes no sense yet, so now we'll spend a little bit of time massaging it into a form that does. And that requires a few new tools, or maybe a reminder, about the dual norm. This is just the ordinary dual norm in Euclidean space. If we have some norm on R^d, not necessarily the L2 norm, it could be an L1 norm, an Lp norm, whatever you like, then the dual norm of l is the supremum over x with norm at most 1 of the inner product of l and x. (If you really want to think of it as living on the dual space you can, but it's also just on R^d, essentially.) So of all the x's with unit length, what makes this linear function biggest? That's the dual norm. And you can do some little calculations to show that the dual of the L2 norm is the L2 norm; the dual of the L1 norm is the L-infinity norm; the dual of the Lp norm is the Lq norm, where 1/p + 1/q = 1. And the one that will be useful for us in a bit: if you have a norm weighted by a positive definite matrix A, so the norm of x is the square root of x transpose A x, then the dual norm is the norm weighted by the inverse of A. That's just a little calculation you can do. So these dual norms are going to be useful for us. One nice thing about writing down the definition of the dual norm is that you get Cauchy-Schwarz for free. The Cauchy-Schwarz inequality is normally the thing that bounds the inner product using the regular two-norms.
But here you can actually get it with any norm and its dual, and it falls out just from the definition. The supremum defining the dual norm means that for any particular x, the inner product of l with x divided by the norm of x is at most the dual norm of l; then you just bring the norm of x over to the other side. So once you've worked out what the dual norm is, the Cauchy-Schwarz inequality, or Hölder's inequality or what have you, comes for free. That's kind of nice. OK, so these are dual norms. Any questions on them? All right. So we're going to see how they appear, with a really nice little trick. We're going to suppose that our Bregman divergence really behaves like a quadratic, essentially. We suppose there exists some norm, where the subscript t indicates it's not just any old fixed norm but one you can choose based on the situation at this time, such that the Bregman divergence D_F(x_{t+1}, x_t) is at least half the squared t-norm of x_{t+1} - x_t. That's an assumption; we have to prove it's true whenever we want to use this result. But we know it holds in examples: if F is the quadratic, it's literally an equality with the two-norm, so that's a good sign. So we suppose this is true, and now we do a cute little trick to bound the per-round terms, and the idea is really just to maximize a quadratic. The first step: plug in the assumption, so the per-round term is at most the inner product of l_t with x_t - x_{t+1}, minus the squared t-norm of x_t - x_{t+1} over 2 eta. Now apply Cauchy-Schwarz to the inner product, using the right norm: the t-norm and its corresponding dual. That gives the dual t-norm of l_t times the t-norm of x_t - x_{t+1}, minus the quadratic term. And now you notice this is a quadratic if you view the norm of x_t - x_{t+1} as the variable: a linear term minus a quadratic term. If the variable gets really big, the whole thing goes negative, so it's actually pretty well behaved, and we can ask: what is the worst possible value? We just maximize the quadratic, and if you solve it, you get the squared linear coefficient times eta over 2. So this is a kind of useful little trick for this problem, and what you end up with is eta over 2 times the squared dual norm of l_t, which is measuring somehow how big the losses are, how big the gradients are. And when you plug this into the main theorem, you get the final form, which has a really nice structure.
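In symbols: assuming D_F(x_{t+1}, x_t) is at least half the squared t-norm of the difference, each per-round term is handled by Cauchy-Schwarz plus the elementary bound that a*u - u^2/(2*eta) is at most eta*a^2/2, giving:

```latex
\langle \ell_t,\, x_t - x_{t+1} \rangle - \frac{D_F(x_{t+1}, x_t)}{\eta}
\;\le\; \lVert \ell_t \rVert_{t,*} \lVert x_t - x_{t+1} \rVert_t
      - \frac{\lVert x_t - x_{t+1} \rVert_t^2}{2\eta}
\;\le\; \frac{\eta}{2} \lVert \ell_t \rVert_{t,*}^2 ,
\qquad\text{so}\qquad
R_n(x) \;\le\; \frac{F(x) - F(x_1)}{\eta}
      + \frac{\eta}{2} \sum_{t=1}^{n} \lVert \ell_t \rVert_{t,*}^2 .
```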
It says the regret is bounded by a term that measures the distance, how long it takes you to get to the optimal point, plus a term measuring how big the losses were. If the losses are really big, you kind of expect to suffer a bit more regret, and this is measuring that. And then eta has to balance the two somehow: if you make eta too small, your algorithm moves really slowly and you suffer high regret; if you make eta too big, your algorithm is unstable and you also get really high regret. You have to balance these two terms, and for that, in the simplest case, you should choose eta in advance, with some idea of how big these terms can get. So we're going to go through a few examples, but maybe there are questions on the derivation first. By the way, we've done the scary stuff now. This derivation, this is the bound that a lot of people use. They write a paper about bandits, they construct a new potential, they have some estimators, they do some fancy stuff, and then they just use this result. So we're essentially just going to use this result and derive how things should be tuned. I want to give you a few examples, for gradient descent and exponential weights and so on, so you can get a feel for how these norms behave. The first one is online gradient descent, our good friend. So what do we have here? The convex set is the L2 ball, so our actions are constrained to the unit Euclidean ball. Just an assumption; that's the setting. And we assume the adversary is also playing in the L2 ball, so the two-norm of each l_t is at most 1. This is coming back to your comment earlier: we want to control somehow the size of the gradients, and it essentially has to be an assumption. Now we use the quadratic potential, so the Bregman divergence is the squared Euclidean distance. What do we need? We need a norm such that the Bregman divergence is at least half the squared norm of the difference; but here the Bregman divergence literally is half the squared L2 norm, so the norm is the L2 norm, and the dual of the L2 norm is just the L2 norm, as we've seen already. So now we can substitute these things into the bound we've proven. Remember the bound: the regret relative to x is smaller or equal to (F(x) - F(x_1))/eta plus eta over 2 times the sum over t of the squared dual norms of the l_t's. Here x* is the optimal point, the minimizer over K of the total loss, the best fixed action in hindsight. The algorithm starts at the minimizer of the quadratic potential, which is the zero point, so F(x_1) = 0. F(x*) is half the squared norm of x*, which is at most a half on the unit ball; that's where that term comes from. Then we substitute in the dual norms, use the assumption that the gradients are bounded by 1 to bound the sum by n, optimize eta at the end, and you get a bound saying the regret is smaller than order square root n. And you're done. So here we've made no assumptions whatsoever on the data. The adversary could play stochastically, choosing the loss vectors at random, but they could also do it any way they like, and nevertheless we get sublinear regret. This even holds when the adversary chooses the losses after seeing our point. This is the power that you buy with convexity. It's really nice. And here we just have the square root n; I think maybe this is even asymptotically the right constant in the worst case.
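Here's a minimal runnable sketch of this example, using the greedy-projection variant of online gradient descent; the random loss stream is mine, purely for illustration.

```python
import numpy as np

def projected_ogd(losses, eta):
    """Online gradient descent on the L2 unit ball: FTRL with the
    quadratic potential, plus a Euclidean projection onto the ball.
    `losses` is a list of loss vectors with ||l_t||_2 <= 1."""
    x = np.zeros(len(losses[0]))   # x_1 = argmin of the quadratic potential
    points = []
    for l_t in losses:
        points.append(x.copy())
        x = x - eta * l_t          # gradient step
        norm = np.linalg.norm(x)
        if norm > 1.0:             # project back onto the L2 ball
            x = x / norm
    return points

# With eta = 1/sqrt(n), the regret against any fixed point in the
# ball is O(sqrt(n)), matching the bound above.
n, d = 1000, 5
rng = np.random.default_rng(0)
losses = [v / max(1.0, np.linalg.norm(v)) for v in rng.normal(size=(n, d))]
points = projected_ogd(losses, eta=1.0 / np.sqrt(n))
```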
OK, so the next example is going to be really practical for bandits. In bandits, it turns out we're essentially going to be playing distributions, and distributions are things that lie in the simplex. So we want to do online learning in the simplex, and for this we use the negentropy regularizer. How am I doing for time? A little bit to go. OK, so we use the negentropy regularizer, which actually turns out to give the exponential weights algorithm; I'm just stating that here, but I'll show you in a second. So K is the simplex, the set of probability distributions over d items. And what is the optimal fixed action on the simplex? The simplex is a polytope, and if you're optimizing a linear function on a polytope, you only have to look at the corners. The simplex has d corners, so the best competitor x is just whichever of the d corners, the standard basis vectors, is best. Oh, I skipped a slide. I'm not sure why I skipped this; we really need it. We really need to understand what the Bregman divergence looks like, because we need to work out what the right norm, and hence the dual norm, is. So this is our potential, and we did the calculation before showing that the Bregman divergence with respect to this negentropy potential is just the relative entropy, the KL divergence. And if you do a Taylor series expansion of the KL divergence, this is what you get: an L2 norm, but weighted by the y's, with the y's appearing in the denominator, so half the sum over i of (x_i - y_i) squared over y_i. So this thing blows up as the y's go to the boundary. We can write it another way: it's the norm weighted by the diagonal matrix with entries 1/y_i, the diagonal matrix with a bunch of zeros off the diagonal and 1/y_1 through 1/y_d on it. OK, so that's what that thing is. And from the calculation we did for matrix-weighted norms, the dual norm is weighted by the inverse of that matrix, which is just the diagonal matrix with entries y_i: the squared dual norm of l at y is the sum over i of y_i times l_i squared. The warning, of course, is that this is a Taylor series expansion: it's just an approximation, not exact. But we're going to go ahead and pretend that the KL divergence is at least this weighted quadratic. That's not quite true, but it's close enough to being true that we're going to be fine; we'll allow ourselves that little bit of sloppiness. OK, so that tells us what our dual norm is like, and now, again, we just substitute into our bound; everything goes into the same bound we've proven before. For the first term, we can bound F(x) - F(x_1) by log d. Why? What is x_1? The algorithm starts at the minimizer of F over the simplex, which turns out to be the uniform distribution, x_1 = (1/d, ..., 1/d); that's the first choice of the algorithm. And F is the negative entropy: on the simplex, F(x) = sum over i of x_i log x_i.
And now we care about bounding F(x) - F(x_1), and we know x is some basis vector, e_1 through e_d, whichever happens to be optimal. Both of these points are in the simplex, so the -x_i parts of the potential sum to the same constant for both, and they just drop out. For the basis vector, the sum of x_i log x_i is 0; you should ask what happens when x_i = 0, and we just define 0 log 0 to be 0, which makes sense if you take limits. So F(x) is essentially just 0. And for the uniform distribution we get the sum of d copies of (1/d) log(1/d), which is minus log d. So F(x) - F(x_1) = log d: the diameter of our simplex under this potential is just log d. That provides us with the first term. For the second term we substitute in our approximation of the dual norm; really, the approximation is happening here. Well, what are these quantities? The x_t's live in the simplex, so each coordinate is at most 1, and we've assumed the l_t's are bounded in [0, 1] as well, so in particular the weighted sum, the sum over i of x_{t,i} times l_{t,i} squared, must be smaller or equal to 1. So the second term is at most n times eta over 2. And now we're again in the situation where we have to optimize eta, and if you do this, you get square root of 2 n log d. So notice: in the previous example there was no dependence on d at all, just square root n. Now we have a different action set, a different algorithm, and a different dependence on the dimension. These differences kind of depend on the geometry of your set and on how the potential interacts with that geometry.
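The resulting algorithm has a closed form, so here's a minimal sketch: FTRL with the negentropy potential on the simplex is exactly exponential weights. The shift by the minimum cumulative loss is a standard numerical-stability trick of mine, not part of the algorithm (the softmax is invariant to it).

```python
import numpy as np

def exponential_weights(losses, eta):
    """FTRL with the negentropy potential on the simplex. The iterate
    has the closed form  x_{t,i} proportional to exp(-eta * L_{t-1,i}),
    where L_{t-1} is the cumulative loss vector so far.
    `losses` is a list of loss vectors with entries in [0, 1]."""
    L = np.zeros(len(losses[0]))          # cumulative losses
    points = []
    for l_t in losses:
        w = np.exp(-eta * (L - L.min()))  # shift by min for stability
        points.append(w / w.sum())        # x_1 is uniform, as claimed
        L = L + l_t
    return points

# eta = sqrt(2 * log(d) / n) optimizes the bound, giving sqrt(2 n log d).
```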
And we can see an example of where things go wrong. Say we tried to do online gradient descent on the simplex: the same setup we just looked at, but with the L2 regularization instead of this KL regularization. What goes wrong? Well, it actually goes pretty badly wrong. We substitute in exactly the same things as before. Everything is straightforward, the dual norm is just the L2 norm, and the diameter term is fine. But now this dual norm of l_t: before, we were assuming the losses were bounded in a ball, so it was at most 1, and that's not true anymore. When l_t has entries in [0, 1], the squared two-norm can be as big as d, if all the coordinates of the loss are 1. And then you get a bound of order square root of n times d. So if you just try to do gradient descent on the simplex, it's not as good as doing this KL regularization. These things are subtle, but it really matters what regularization you choose, and what it depends on is how your losses behave on that set and how they interact with your potential. OK. So this is, I think, the end of the online optimization stuff. Any questions on this? What do the worst-case loss sequences look like, say on the ball? Is it always on the boundary? Not quite, I think: the worst case here is actually to be close to the center-ish. It's a little bit annoying. What you want, essentially, is for it to be hard to work out which direction you should be playing in. The adversary will choose some vector, I think quite a short one, to be the best direction, but they won't show you that vector every round: they add noise, random along the boundary, with that short vector as its mean. Then it's really hard for you to identify that this really is the right direction. And the length of that mean vector is like square root of 1 over n or something like that. So basically it's a problem where you can't statistically identify which direction you should be playing, and if you just play in some random direction, then you get n times square root of 1/n, which is square root n. That's where that's coming from. In the other case, the simplex, it's kind of the same: the adversary plays just about the uniform losses, like 1/d in every coordinate, but hides an epsilon somewhere, some coordinate with 1/d minus epsilon, since we're doing losses. It's not quite in the simplex, but you just make it so. Basically you're trying to find that coordinate, and you choose epsilon to be small, square root of log d divided by n, and that gives you the lower bound. But OK, why is gradient descent really getting the square root d and not the log d? I'm not sure I can answer that question really well; it would be something worth thinking about, but there's definitely something going wrong with that potential. Yes, we get to choose eta; eta is our learning rate. At the end of the day we optimize it, and the choice that optimizes this is, I guess, about 1 over square root n; I choose eta to be 1 over square root n, and then I get the bound I would like. There's no constraint, actually; I can choose any positive eta I like. But the problem is I have these competing terms: if I make eta too small, one is too big, and the other way around. And this is always how it ends up being: you have to balance stability against the speed at which your algorithm moves. The hard thing, by the way, is that the eta I have to choose depends on a bunch of stuff, in particular on n. So if you don't know n in advance, what do you do? It turns out you can choose eta to depend on t, roughly 1 over square root t, and everything works out; the proof is just a little bit harder, but not too much harder. But that's the idea. And things get even more tricky. Here we have the sum of the losses with the dual norms appearing, right? And if I did the optimization of eta with that sum in hand, I would get a different thing. If we call the numerator, F(x) - F(x_1), the diameter, then the bound you would get is R_n(x) at most some constant times the square root of the diameter times the sum of the squared dual norms. That's the sort of thing you would get if you could do this optimization.
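Concretely, the balancing calculation looks like this, writing D := F(x) - F(x_1) for the diameter term and S for the sum of the squared dual norms:

```latex
\min_{\eta > 0} \Bigl( \frac{D}{\eta} + \frac{\eta S}{2} \Bigr)
\quad\text{is attained at}\quad
\eta^{\star} = \sqrt{\frac{2D}{S}},
\qquad\text{giving}\qquad
R_n(x) \;\le\; \sqrt{2 D S} .
```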
You can't tune your eta in advance to depend on the dual norms of the losses before you've even observed them. And so it's sort of a big question in online learning generally: how do we design algorithms that achieve this optimal bound without knowing the right eta in advance? And there are lots of tuning strategies that you can use to adapt, but things get complicated quickly. OK, so how long do I have? Half an hour, right? OK. So this is online linear optimization. We have good bounds for various different types of sets. And now we want to apply this machine to adversarial bandits, right? So finally, we can get back to the world of bandits. And what's going on here? So adversarial bandits are kind of similar to stochastic bandits in the sense that you choose actions, and then you get to observe, in this case, the loss that you suffer when you choose that action. For some reason these adversarial people are really pessimistic — they like losses rather than rewards — but OK. So we make, in particular, no statistical assumptions at all. Before, we had to carefully say our data is maybe Gaussian or Bernoulli or something like that; here there's none of that. No statistics. It's all gone, OK? And the way it works is, at the start of the game, the adversary secretly chooses the whole sequence of loss vectors. It is possible to allow them to do it dynamically, but for simplicity we won't. So the adversary is choosing them all in advance. In each round, you get to choose an action, one of k choices — this is exactly like yesterday — and then you get to see the loss only in the coordinate that you actually played. So this is different from the online linear optimization that we just saw, where you take an action and then get to observe the whole loss vector. Here that doesn't happen: you observe one coordinate of the loss vector. That is the only difference between bandits and online linear optimization. And the regret — well, I guess we'll look at the expected regret — is just the difference between what you expect to suffer, your expected loss, and the loss of the best fixed action in hindsight. So again, this can actually be negative for the same reason as before, but we want to make it small. And the surprising result — or maybe it's not so surprising now that you've seen some bounds — is that you can make this thing less than square root 2nk log k, which is practically the same bound that we had for stochastic bandits. For stochastic bandits, we had square root of nk, up to constants. And here we get exactly the same thing, and we've made no statistical assumptions whatsoever. The only thing that we have assumed is that the losses are bounded in [0, 1]. That is important. Yeah? The adversary doesn't choose the losses depending on the history of actions, or? Yeah — it looks more like worst-case analysis among all possible environments rather than truly adversarial. That's a very fair assessment. It can be adversarial in the sense that, the way it usually works, you'll show me your code for your algorithm, and then I can choose the worst sequence of losses for it. So in that sense, I could be playing adversarially against you. Now actually, this bound and all of the analysis that I show you will hold even if the adversary is allowed to choose the losses at the same time as you in each round.
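As a compact restatement of the objective just described — this is simply the expected-regret definition in symbols, with A_t the action played in round t and the losses l_t(i) in [0, 1] fixed in advance:

```latex
\[
  R_n \;=\; \mathbb{E}\Big[ \sum_{t=1}^{n} l_t(A_t) \Big]
        \;-\; \min_{i \in \{1, \dots, k\}} \sum_{t=1}^{n} l_t(i)
  \;\le\; \sqrt{2 n k \log k}.
\]
```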
It's just that if you allow that, then this definition of the regret becomes almost meaningless. And the reason for that is: OK, it looks great, you're comparing to the best thing in hindsight. But you don't know that if in the first round you choose action one, that sets you up for a really great life — all the losses after that would be zero if you did that one thing — and if you do something else, then the environment is going to be really mean. And this hindsight measure of regret doesn't capture that notion at all. It's just: after what happened, how did I do relative to the fixed optimal action? So it's not a reinforcement learning thing; the regret is not capturing the history dependence. And so I think it would be very misleading to lean on losses that are not chosen in advance, because then the regret becomes less meaningful. But all of the analysis still holds, and very occasionally, when you use these as meta-algorithms, it's actually useful to allow that. OK. All right. So did I start at 2.30? Yeah. And so I have how long? 20 minutes? Oh, good. OK. I'm getting so tired that I can't do this basic clock calculation anymore. All right. So we want to prove this bound. And the idea is we're just going to use the tools that we've developed for online linear optimization. We have almost the same setup; the only problem is you get less information. And somehow we have to overcome that. And the first observation is that you must randomize in an adversarial bandit. So why must you randomize? Let's say you didn't. Let's say you present me with a deterministic algorithm. Then as an adversary, what I'm going to do is look at your code and ask: what do you play in the first round? And whatever you would play in each round, I'm going to give that action loss 1, and every other action is going to have loss 0. And so that means that at the end of the day you suffer a loss of n: in every round, you suffer a loss of 1. But the total of all the losses, the sum from t equals 1 to n of the sum from i equals 1 to k of l t i, is, with this particular loss sequence, just equal to n. And so in particular there exists some action i with a small loss: the sum over t of l t i is smaller than or equal to n divided by k. So your loss is going to be n, but there exists some action that has lost much less than that, and so you suffer linear regret. That's bad. So if you run a deterministic algorithm on a bandit, you suffer linear regret; there's a small simulation of this adversary sketched below. So note this is different from the online optimization stuff we saw. All of those algorithms were completely deterministic — it's just gradient descent; there's no randomness coming from your algorithm there. For bandits, you really need randomization. And the reason you really need randomization is that in this setup, I'm forcing you to choose an action. And what this is essentially saying is that you're choosing something on the corners of the simplex: we have some simplex, and the algorithm is really just choosing corners of it. And what the randomization buys us is that it allows us to convexify the space. So instead, I'm going to be able to play on the interior of the simplex: maybe the online learning algorithm would say you should play at some point in the interior.
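Here is a minimal simulation sketch of that adversary argument. The adversary needs your code, so a simple deterministic follow-the-leader rule stands in for an arbitrary deterministic algorithm; the function name and parameters are just for illustration:

```python
import numpy as np

def adversary_vs_deterministic(n=10_000, k=10):
    """The adversary simulates the (deterministic, hence predictable)
    learner, puts loss 1 on whatever action it is about to play, and
    loss 0 on everything else."""
    observed = np.zeros(k)       # cumulative losses the learner has seen
    true_totals = np.zeros(k)    # true cumulative loss of every action
    learner_loss = 0.0
    for t in range(n):
        a = int(np.argmin(observed))   # deterministic choice, so predictable
        loss = np.zeros(k)
        loss[a] = 1.0                  # adversary punishes exactly that action
        learner_loss += loss[a]        # the learner suffers 1 every round
        observed[a] += loss[a]         # bandit feedback: only the played coordinate
        true_totals += loss
    return learner_loss - true_totals.min()

# Regret comes out around n * (1 - 1/k): linear in the horizon, as argued.
print(adversary_vs_deterministic())
```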
Then I'm going to choose a distribution over the corners of the simplex so that the average is exactly that interior point. So you can view randomization as this kind of convexification, if you like. And that's why we need it, and that's also how we get saved. OK, so we have to randomize, and we're just going to do it like that. And the core idea is really now somewhat similar to what we did before. There are these losses, and we have to estimate them; we don't get to observe them for free anymore. And there's a really clever way of doing it. The surprising thing here is: in the stochastic bandit case, how do you estimate the reward of an arm? You just play that arm. Because we assumed that everything was IID — the distributions were the same in every round, no matter how you played — you just play an arm a bunch of times, you get some samples, and that's your estimate. So that's good. But in the adversarial model, this does not work at all, because the losses are potentially changing in every single round. I can't just say I'm going to play arm 1 for a little bit and then arm 2; that may not give a representative estimate of the loss of arm 1, because maybe it's going to change later. So here, we really have to tackle this. And what we're actually going to do is estimate the whole loss vector, this whole l t, in every single round. And there's this very neat trick for doing it, which is to use importance-weighted estimators. So each round, we choose some action, and we have to estimate the whole loss vector. And the way to do this is: we're randomizing over the actions according to some distribution p t. So in round t, we're going to choose, in some way, this distribution p t to randomize over the actions, and then our action A t is sampled from this p t. And then we define the importance-weighted estimator of the loss of action a to be 0 if we didn't play it — this is the indicator function, so it's 0 if A t is not equal to a — and if we did play the action, then the loss estimate is equal to the real loss divided by the probability that we would play it. So this is called an importance-weighted estimator. And when we take the expectation of that thing, what happens? Well, you sum over the probability that you play each action multiplied by that quantity, the probabilities cancel, and you get exactly the loss. So this is an unbiased estimator of the entire loss vector from only one observation. The only problem is, it has a relatively high variance. If you do the calculation for the predictable variation — it's like the conditional variance — you get something that depends on the inverse of the probability that you play the action. So if you play an action with low probability, then you get a really high variance. But that aside, that's sort of the price that you're going to pay for living in the bandit world: you don't get to observe these things, but you can estimate them, and the variance of your estimators is pretty large. But what this means is we now have a way of estimating the losses in an unbiased way. So all we're going to do is run online linear optimization with the estimated losses, plug in the bound, and everything is just going to work. All right, so any questions on these importance-weighted estimators? Is this the same as importance sampling? Yeah, it's more or less the same thing.
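A small sketch of the importance-weighted estimator, with an empirical check of the unbiasedness argument above (the names and the uniform sampling distribution are illustrative; any distribution with strictly positive probabilities works):

```python
import numpy as np

def importance_weighted_estimate(loss, p, rng):
    """Sample A_t from p, observe only loss[A_t], and return the estimated
    loss vector: zero everywhere except l_hat[A_t] = loss[A_t] / p[A_t]."""
    a = rng.choice(len(loss), p=p)
    l_hat = np.zeros(len(loss))
    l_hat[a] = loss[a] / p[a]
    return l_hat

rng = np.random.default_rng(0)
k = 5
loss = rng.uniform(0, 1, size=k)   # a fixed loss vector, never fully observed
p = np.full(k, 1 / k)              # the learner's sampling distribution
avg = np.mean([importance_weighted_estimate(loss, p, rng)
               for _ in range(100_000)], axis=0)
print(loss)   # true losses
print(avg)    # close coordinate-wise: unbiased, though the variance
              # blows up for actions with small p[a]
```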
In this case, you only have a history of length one, but otherwise you're doing exactly the same thing. And people do this in off-policy reinforcement learning as well: you're following some policy, you want to estimate the loss of some other policy, and you reweight using importance weighting. And the estimator gets bad when your policy is really different from the one you want to estimate. Here, what we want to estimate is the policy that just plays action a. And OK, if the difference is really large, in the sense that this probability is small, then we pay a high price. So it's essentially exactly the same thing. OK. So now, really, we're just going to plug this into our bounds. So the first thing to do is, OK, we have to choose a potential. All right, I didn't actually mention exponential weights, which I should have done. So the point is: if you run follow the regularized leader on the simplex with the negative entropy potential — I showed you the calculation for what the regret should be, but I didn't actually show you the algorithm, which I probably should have done — then the algorithm that you get is this algorithm called exponential weights. Except that instead of the actual losses, we're going to feed it the estimated losses. And what does it do? Well, it plays action a with probability p t of a proportional to the exponential of minus eta times the total estimated loss of a so far. So if the accumulated losses of an action are really big, this exponential term will be small, and the probability of playing that action will be small. That's exponential weights, and this is what the algorithm is going to do. But really, we're just going to run through the theorem that we had before, now with the estimated losses instead of the real losses. OK, so what happens? Let's just see. So this is our regret. This is a little bit of an abuse of notation: this A t here is not really your action; it's the basis vector of your action. So A t is really some e i, where i is the direction that you're playing in. Actions before were just numbers between 1 and k, but here we're looking at them as basis vectors. So this is the regret. But now we can say, well, what is A t being sampled from? It's being sampled from our distribution p t. And these estimated l t's are unbiased estimates of the real losses. So now we can just substitute these things in. Here instead we have p t — this is now the probability distribution over the actions that we're playing — and instead of the actual loss, we have the estimated losses. And we still have a star. So now we're really in a form where we can just plug in the bound that follow the regularized leader gave us. What we have inside is a linear optimization objective, and p t is being chosen exactly by following the regularized leader. And so we can just substitute in the bound that we proved before. Do I have it here? No, so I'll write it down. We're going to substitute in our bound, which was: r n with respect to x is smaller than or equal to f of x minus f of x1, all divided by eta, plus eta over 2 times the sum of these squared dual norms. And remember that here we are in the negative entropy world. And now we just substitute this in. So in exactly the same way as before, the f term is this log k divided by eta.
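Putting the two pieces together — exponential weights run on the importance-weighted estimates — gives the algorithm usually called Exp3. A minimal sketch, using the learning rate that gets optimized a little further on:

```python
import numpy as np

def exp3(losses, eta, rng):
    """Exponential weights on importance-weighted loss estimates.
    `losses` is an (n, k) array in [0, 1], fixed in advance by the
    adversary; only one coordinate per round is ever observed."""
    n, k = losses.shape
    l_hat_sum = np.zeros(k)   # cumulative estimated losses
    total = 0.0
    for t in range(n):
        # p_t(a) proportional to exp(-eta * cumulative estimated loss);
        # shifting by the minimum is only for numerical stability.
        w = np.exp(-eta * (l_hat_sum - l_hat_sum.min()))
        p = w / w.sum()
        a = rng.choice(k, p=p)
        total += losses[t, a]
        l_hat_sum[a] += losses[t, a] / p[a]   # importance-weighted estimate
    return total

rng = np.random.default_rng(1)
n, k = 10_000, 10
losses = rng.uniform(0, 1, size=(n, k))
eta = np.sqrt(2 * np.log(k) / (n * k))          # the tuning derived below
regret = exp3(losses, eta, rng) - losses.sum(axis=0).min()
print(regret, np.sqrt(2 * n * k * np.log(k)))   # regret vs. the bound
```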
So that's exactly the same calculation that we did for the full-information case, where we observe everything. And here again, we're just writing down the dual norm term, so again, it's exactly the same as what we had before. And now we have some magic. This term we've seen before: it's the predictable variation of that loss estimator. And so here we can just substitute in our bound on that, which was like the loss squared divided by p t. And now we're actually done. We have a sum of losses squared, and we've assumed that those are bounded by 1, so this term here is just bounded by k in each round, and by nk in total. And now we have something that looks really familiar. The only thing that's happened is an extra k has appeared: we have an nk here, whereas before we had just n. And that comes because in the earlier game that we looked at, where you see everything, you get k observations per round — in every round you get to see the loss vector in every coordinate. In the bandit game, we only get to observe one of them, and this is the price that we pay for that: you need k times as many rounds to gather the same information, and it just appears there. OK, and now we can optimize the eta. So I guess the eta that optimizes this is going to be square root of 2 log k divided by nk. And if we substitute this in, then we get the bound that I promised you at the beginning: square root 2nk log k. And so this analysis is actually, in a sense, very simple. Once you've done the online linear optimization, once you understand that really well, all you have to do is say: how do I estimate the losses? And then there's the somewhat more delicate question of how do I choose the potential to make things work out here. And what we saw is that the potential we chose had these p t a's coming in, which very nicely canceled with the loss estimators. That's the delicate thing. All right, so any questions on this calculation? How do we choose eta here? Yeah, exactly — we get to choose eta at the end, and we want to make this as small as possible. Let's do this calculation; it's also written out below. I'm going to choose eta at the end of the day to make that small. So what do we have? We have log k divided by eta plus eta nk divided by 2. This I'm going to call f of eta. Then I take the derivative of f with respect to eta: f prime of eta is minus log k over eta squared plus nk divided by 2. And I'm going to set that equal to 0 and solve for eta. This implies that eta is equal to, indeed, square root of 2 log k over nk. Yeah? Is this a better bound than for UCB? Well, sort of — for the exact constants, I don't remember exactly. The one for UCB that I showed you had a log n here, which is actually worse: even though you make a stochastic assumption, you get a worse bound. And actually, that log n is real for that algorithm. It's real, yes. But there is a different algorithm which would get you just square root nk times some constant. And we can get that, too, in the adversarial setting by choosing a different potential function. So it turns out that if you use the potential that is minus the sum of the square roots of the coordinates, then you can actually get rid of the log k here.
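Written out, that learning-rate calculation is:

```latex
\[
  f(\eta) = \frac{\log k}{\eta} + \frac{\eta n k}{2},
  \qquad
  f'(\eta) = -\frac{\log k}{\eta^2} + \frac{n k}{2} = 0
  \;\Longrightarrow\;
  \eta = \sqrt{\frac{2 \log k}{n k}},
\]
\[
  f\!\left( \sqrt{\tfrac{2 \log k}{n k}} \right)
  = \sqrt{\tfrac{n k \log k}{2}} + \sqrt{\tfrac{n k \log k}{2}}
  = \sqrt{2 n k \log k}.
\]
```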
And the calculation for that square-root potential is a little bit more involved, but not much more involved. Any other questions? Ah, yeah. So it's this one here, and this is the thing that you get out. It's, again, not exactly trivial to show, but how is it derived? What is the algorithm doing? It's choosing p t to be the arg min over p in the simplex — k here, the action set, is the simplex — of eta times the sum from s equals 1 to t minus 1 of the inner product of p with l hat s, plus the potential f of p. And if you solve this optimization problem, you end up with exactly this exponential-weights form; there's a sketch of that derivation at the very end. And f of p is the negative entropy, exactly: f of p is the sum over i of p i log p i minus p i. And if you want to get rid of the log in the bound, then you should use f of p equals minus 2 times the sum over i of the square root of p i. Any other questions? So we're more or less out of time, and I think we'll just stop here, because the next thing would be a big thing. But the summary is: this online optimization is really ubiquitous already, just as a field by itself. People are studying not just the linear case but more complicated cases, and they want to design algorithms that adapt well to the data. The bounds that I've shown you here are very worst-case-flavored bounds; they really don't depend on very much. But if you want to adapt — maybe you're actually playing against a stochastic adversary, and then you should get a better rate — there are people working on how to do this. As well as understanding, if you have losses that are really curved, or curved in one way but not in others, how you can exploit that. So it's really a big field, and it's a pretty hot field at the moment, this online learning. And then, of course, we saw the importance-weighted estimators in the application to finite-armed bandits. So my plan for tomorrow is to say: OK, we have these finite-armed bandits; how can we scale this up in the adversarial setting? We're going to look at the linear version of the bandit problem, and then a bunch of more practical settings, like how you use this stuff for ranking, or path routing, and things like that. And really, it's just using exactly the same tools every time: importance-weighted estimators, follow the regularized leader plus some potential, and everything is going to be nice and easy. There are going to be some fun stats problems as well, actually. Yeah, it's going to be fun. So thanks very much, and I can answer questions later. So thanks.
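As a closing note, here is a sketch of the derivation asked about in the questions above: why follow the regularized leader with the negative-entropy potential is exactly exponential weights. Writing L-hat for the cumulative estimated losses and introducing a multiplier lambda for the simplex constraint:

```latex
\[
  p_t = \operatorname*{arg\,min}_{p \in \Delta_k}
        \; \eta \sum_{s=1}^{t-1} \langle p, \hat{l}_s \rangle
        + \sum_{i} \big( p_i \log p_i - p_i \big),
  \qquad
  \hat{L}_{t-1,i} := \sum_{s=1}^{t-1} \hat{l}_{s,i}.
\]
% Stationarity in p_i (the -p_i term cancels the +1 from the log's derivative):
\[
  \eta \hat{L}_{t-1,i} + \log p_i + \lambda = 0
  \;\Longrightarrow\;
  p_{t,i} = \frac{\exp(-\eta \hat{L}_{t-1,i})}
                 {\sum_{j} \exp(-\eta \hat{L}_{t-1,j})},
\]
% which is the exponential-weights distribution; the positivity
% constraints are never active because the logarithm forces p_i > 0.
```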