How's that? Good? All right. So it's going to be bandits. I really like bandits. As we just heard, we have this new book with Csaba, who's off doing some climbing, and who also drew these beautiful pictures. All the artwork is thanks to him. The book is free online, and there's also a blog there which has slightly shorter versions. The book is now way too long, but it's free there, and it's going to be free there forever. So if you look at it, we'd be very pleased about that. We'd also be very pleased if you have any feedback. So with the advertising out of the way, I'm going to talk about bandits, and I have three lectures to do this, and there's a lot of stuff. So the plan is just to, in the first one, talk a little bit about what bandits are and why we should care. And then I'm going to focus on the stochastic model. So there are sort of two big models of bandits. There's a stochastic one where you think the world is basically IID, and you just get a sequence of stuff sampled from some distributions. That's the first view, the stochastic view. And that's the plan for today. And then the second view is at the completely opposite end of the spectrum. It's a totally adversarial view, where you make no statistical assumptions about the world at all, which is maybe a little bit less common in machine learning, particularly in introductory machine learning. And so I think the next two sessions are going to be on this adversarial model, and as a little bit of an aside there, I'm going to go quite a lot into online convex optimization, which is not strictly needed for bandits, but it makes the presentation much clearer. Like, once you understand the online convex optimization view, it becomes really intuitive why the bandit algorithms work the way they do. Whereas otherwise it's just like, oh, this algorithm works, but we don't really understand why. So that's the plan for the next times. So bandit problems. It's like a baby reinforcement learning problem, right? We've heard a lot about reinforcement learning already, where you have lots of problems to deal with. In reinforcement learning, you don't know what the world looks like, and you have to plan, right? You might have some really nasty robot environment where you need to get your arm from here over to here, and you don't know exactly what the world is like, and you have to take this sequence of actions to get there, and that's really hard. And in bandits, we do away with the planning aspect of the problem. So in bandits, all you're doing is taking actions, and you're getting observations. But the important thing is that if I take an action, nothing really changes in the world. Like, I could get a bad reward, but I don't have to plan to get somewhere else, right? So you just get to do the same thing in every round, more or less. And so removing this planning aspect really simplifies the problem, OK? So that's one thing. The other thing is we do still have the uncertainty, so you don't know what the world is like. So in that sense, it's still a reinforcement learning problem. So here is the setup, at least the very basic setup. Basically, you have a game which you play over n rounds, and n is what we call the horizon. And in each round, you just get to choose an action, one of k choices. So we have this set A = {1, ..., k}, and those are the actions you can do. You just get to choose some number between 1 and k.
And then after you've chosen your action, you get to see a reward that's sampled from some distribution that corresponds to the action that you took. So there are k distributions, which you don't know in advance. And each time you take an action, you get a reward sampled from the distribution that corresponds to that arm. And then your goal is just to get as much reward as possible over n rounds. So here, there is no discounting. It's just a finite horizon, and you want to maximize the sum of your rewards. And the difficulty is just that you don't know those distributions. You don't know the p's. And this is the only thing that makes bandit problems hard. And the picture here is the picture you always see in bandit talks. It's like this octopus that can pull arms. That's the setup. Like you walk into a casino. There are a bunch of slot machines. You don't know which one has a good return, which one has a bad return, and you should do some exploration to work out which machine is good. Just two housekeeping things, by the way. Of course, ask questions whenever you like. And I'm planning to use the board a little bit, so if you can't see it, you could move somewhere where you can see it. That will help. OK, so this is the setup. So it's a really simple problem. And I want to say a little bit more about why I care about bandit problems. And there are sort of three reasons. So first of all, I really care about reinforcement learning. I did my PhD on reinforcement learning. That's sort of the end game. And bandits are a simplified view of this that lets you just look at this one little part, the exploration-exploitation problem. How do you solve this little part of reinforcement learning? And you hope that if you can do that well in bandits, then you can generalize the ideas to reinforcement learning. So that's one view. The second is just that there are quite a lot of applications of bandits straight away. And I'll have a look at some of the potential applications in a minute. And the final one is, I won't lie, I like math. And bandits give you a fun opportunity to do some math, which is harder to do in bigger problems. It's harder to prove something really concrete and meaningful in a big problem. Whereas in bandits, you can really dig deep into the problem and prove some nice things. OK, so I mentioned applications. So the first one is clinical trials, which is where the first bandit papers were aimed. They were written, I guess, in about 1930, and they were saying, we're going to use this for clinical trials. It was William Thompson. And he had this great idea that clinical trials were not fair, because if you fix your pool of people at the start of the trial in advance and the drug ends up being really terrible, that's obviously not a good idea. You should have some sort of early stopping. And he wanted to have this adaptive idea for doing clinical trials. And he came up with a bandit algorithm to do that. And the idea is that you just have patients coming to the doctor sequentially. And the doctor can say, I'm going to give this patient drug A or drug B, or maybe drug A or nothing. And the bandit algorithm should decide. Then you get to see the outcome. That's your reward. And then you make the next decision. So that was Thompson's idea. And I think so far, bandits have never been used in clinical trials. So we're always hopeful. But it's just an example that you have a simple model. And then you try and apply it to some practical problem.
And there is a whole bunch of reasons why it's potentially hard. But the second one, where you really have seen some applications, is A/B testing. And now you have these big companies, and they have to make decisions about how they sell you stuff very often. So Amazon has to decide, what is the web page going to look like? Is the buy-it-now button here? Is there one on every item you can look at, or something like this? And they don't know which of these decisions is a good idea. And the normal thing to do is you just test that. You say, I'm going to take 10,000 of my users randomly. And then I'm going to show half of them one option, half of them the other option. And I'm going to do some statistics at the end of the day and make a decision based on that. So this is called A/B testing. But they've also realized, I mean, this comes at a price. If you show 5,000 users an option that is really terrible, that's probably not a good idea. You would like to notice quickly that that was a bad idea and stop the trial, essentially. And what you can actually do in A/B testing is never stop, but just run a bandit algorithm. So from the start of the thing, you just put your bandit algorithm in place. And it decides whether they get to see version A or version B of the website. And over time, it's eventually going to show them always version B, if that's better, you hope. So again, there's lots of trouble. You know, the world is not stationary. The assumptions are clearly violated. But sort of people have decided that there are ways they can use these algorithms where the inaccuracy of the model is still worth it and something good is happening. OK, ad placement is another example. And then recommender systems is obviously a big thing which has a bandit-like flavor. If I'm Netflix, I get to show you a bunch of movies on the browse page, and then you get to decide if you watch them or not, whether you provide feedback or not. And so it has this structure where every time, Netflix can more or less show you the same movies. So they have the same actions each round. And then they get some feedback. And of course, if they don't show you a movie, you probably don't watch it and they don't get feedback. And then a little bit more abstractly, say network routing, or if you want to do routing of traffic through a city, you can see that as potentially a bandit problem. You drive along some path, you get to observe how long that took, but you don't necessarily get to observe the traffic on other paths. And finally, game tree search. So if you look at the AlphaGo idea, part of that was MCTS, Monte Carlo tree search, and there's a tiny piece of that which is a bandit algorithm choosing which direction to expand next. Have to be really careful that my computer doesn't go to sleep. And so that's another application. So there are lots more. Probably there are 10 papers on bandits at NIPS this year and all of them have a different application in mind. It is sort of hard to get the applications working in practice. Like, you really have to work hard to do that. But I think there's clearly a lot of potential and some stuff is already being done. OK. So back to the bandit setting. Just to remind you, we have these k actions. There's a distribution associated to each that you don't know, and you want to maximize the sum of rewards. So normally what we do is we don't actually maximize the sum of the rewards. We just add a normalization, essentially, and we aim to minimize the regret instead.
You could say we're pessimistic with this view. And so the regret is just the difference between what we would have got if we did the optimal thing in every round. We don't actually know that, because we don't know which distribution has the largest mean. So here, what we're going to say is this mu_A is the mean of arm A. It's what you would expect to get if you played arm A. And mu star is the maximum of the means. So if you could find the arm which had the largest mean and you played that, you would expect to get mu star. And if you did that every single round, all n rounds, that would be the best thing to do. And that would give you n mu star as your expected reward. So that's the expected reward of the optimal policy. And then we subtract away your expected reward. So this is some positive quantity. It's just a number that depends on the algorithm and the distributions. And you want to make that small. So the normal goal, well, the very conservative goal, is to try and prove that for your algorithm, this regret grows sublinearly. If it grows sublinearly in n, then as time goes on, you start playing the suboptimal arms less and less frequently. And that seems like a good thing. But of course, sublinearly, I mean, you want it to be literally as small as possible. And ideally, you'll prove sort of upper and lower bounds and they should nearly match. We should be really ambitious in such a simple model. So that's our goal. We just want to make the regret small. And we're going to design some algorithms to do that. This is basically the setup that I'm going to talk about today. So maybe it's a good time for questions if anyone is confused about the setup. Am I right that there was also a hands-on experiment where maybe you even tried some bandit algorithms? Is this true? Yeah. Awesome. Good. But I'm guessing you didn't prove a regret bound. Also good. So that's what we're going to do. So I'm glad you have a little bit of intuition about what this problem is. Before we move on to proving anything, I have to emphasize: we don't know the distributions, but you have to know something about the distributions. It's a thing that happens all the time in machine learning. You have to make a decision about what you want to assume. And if you assume a lot, then you can design an algorithm that really exploits those assumptions very hard. And if you assume a little, your algorithm is maybe going to be more robust, but maybe not as good as if those stronger assumptions had actually held. And so here you can make a bunch of choices about what you're going to assume in bandit problems. And then the algorithm you derive based on the assumptions will be different. So this very classical UCB algorithm that you had to play with is essentially derived from assuming it's Gaussian, or maybe Bernoulli, but some very specific assumption. And then you get that form of the algorithm. If you make different assumptions, you'll end up with different algorithms. So in this talk, I'm going to be focusing on essentially the Gaussian case where you know the variance, but all of the situations in this list have been done. And if you assume something very weak, like you assume, I don't know, just that the kurtosis is less than kappa, that's a relatively weak assumption. You know a bound on the kurtosis, which is like a fourth-moment measure of the distribution.
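As an aside, here is a minimal simulation sketch of this setup in Python. The class and function names are my own, purely for illustration; the model is the one just described: k arms with unknown means, unit-variance Gaussian rewards, a horizon of n rounds, and regret measured against n times the best mean. A policy that ignores the feedback, like the uniformly random one below, has regret growing linearly in n.

```python
import numpy as np

class GaussianBandit:
    """k-armed bandit with unit-variance Gaussian rewards (illustrative sketch)."""

    def __init__(self, means, rng):
        self.means = np.asarray(means, dtype=float)
        self.rng = rng

    def pull(self, a):
        # Reward for arm a is sampled from N(mu_a, 1); the learner never sees the means.
        return self.rng.normal(self.means[a], 1.0)

def uniform_policy_regret(means, n=10_000, seed=0):
    """Realized regret of the policy that picks an arm uniformly at random each round."""
    rng = np.random.default_rng(seed)
    bandit = GaussianBandit(means, rng)
    total_reward = 0.0
    for _ in range(n):
        a = rng.integers(len(means))      # no learning: the feedback is ignored
        total_reward += bandit.pull(a)
    return n * np.max(means) - total_reward

# Grows linearly in n, roughly n times the average gap.
print(uniform_policy_regret([0.5, 0.25, 0.0]))
```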
If you assume you know that, then you have a really robust algorithm, but it's going to be not as good as if you make a stronger assumption. OK, so the assumptions really matter, but here we're just going to be talking about the Gaussian case, essentially. And maybe if there's time, we'll talk about what needs to change. So the algorithm idea is actually really simple. You just have to work out what is the best arm. You want to find the arm whose distribution has the largest mean. And so what are you going to do? You're going to play each of the arms a few times, estimate the mean, and then you have to use some statistics to say, OK, what do I really know? Having got an estimate of the mean, you have to work out how sure you are that that estimate is accurate. And then you're just going to stop playing the arms where you kind of know they're bad. That's the very simple idea. So what we have to do is understand what this "statistically plausible" means. It's a little bit vague. And well, we need our assumptions. We're going to assume that we have Gaussians with unit variance. And sort of the first step when you're analyzing any of these bandit problems is almost always to do what we call a concentration analysis, which we'll get to in a minute. And the principle that you use is often called optimism. So we're going to combine this optimism principle with the concentration analysis. And what is the optimism principle? Well, it's like, you should be optimistic. It's a great principle in life, right? It always works. People actually are naturally optimistic. There's this fun book by Tali Sharot where she talks about how humans exhibit optimistic behavior. Something like 80% of humans are sort of excessively optimistic about what their future will be like. They did some study where they asked a bunch of grad students in Israel what they thought their life would be like in the next month. Like, were they going to have some fun dates? Were they going to have a fight with their partner? And most of them said, yeah, we're going to have really good dates, and they're going to happen early in the month. And probably we won't fight with our partner. But if we do, it's going to happen at the end of the month. And then they asked them, at the end of the month, how things went. And they did not go as well as people thought. So people are naturally optimistic. And you can imagine lots of psychological benefits of that. But there's also another really concrete benefit if you're a learning algorithm, which is that it encourages exploration. Because if you're optimistic about something, you think the things that you don't understand are going to be actually better than maybe they are. And then you're going to try them. And that's what we're going to use it for. Of course, there are downsides too. So if you're optimistic about jumping off a cliff, you only get one try. So optimism is dangerous if you're in a situation that you can't recover from. And it's pretty notable that bandits are not like that. Almost the definition of the bandit problem is that it doesn't matter what action you do today, you can recover tomorrow. And this is not true of reinforcement learning, where you can take an action that leads you into a really bad state. So this is why optimism works really well for bandits and why it actually doesn't necessarily work well for reinforcement learning.
You have to be a little bit careful or make extra assumptions to make things work for reinforcement learning. But OK, so for us, it's going to be great. And OK, here's just an example. Like, if you go to Canada, you have the option to try this poutine, which is chips and cheese, an interesting combination. If you're pessimistic about how good that is, you won't try it. And then you don't learn anything. But if you're optimistic, you will try it, because you're going to think it's better than some known quantity like McDonald's. And then one of two things happens. Either you're right and you love this dish poutine, and then you're actually optimal. That's good. Or you're wrong and you don't like it. But because you've tried it, you've actually learned something. So this optimism principle essentially guarantees progress in one way or another. Either you're moving forward and you're doing well, or you're learning something. And this is the idea. So all we need to do, really, is now just rewrite these vague words in math. And we will have derived the famous algorithm that's so well known. But before this, we need to do this thing called the concentration analysis. This is really what it all boils down to. It's like, you get some data about how good some arm is. And that gives you an estimate. Maybe you just take the mean. So you can think of this x_1 through x_t as being the rewards you got while choosing some particular arm. You just look at the subsequence of times you chose arm one. You look at the rewards that you got. And then you compute the average, this mu hat. This is like your estimate of how good you think that arm is. But the real question is not exactly how good you think it is. It's how good could it be? You want to understand what is the variance of that estimate. But it's a little bit more refined than that. We want to have things called tail bounds. And this is a tail bound. So this is saying that the probability that your estimate mu hat is bigger than mu by this square root quantity, this square root of 2 log(1/delta) divided by t, is smaller than delta. So this says that the probability that your estimate is really large is not big. And then you have the other side as well. The probability that it's much smaller than the real thing is also not big. Yeah? Is this specific to the Gaussian? Yes, this is specific to a Gaussian. Exactly. And so if you made a different assumption, this is essentially where things get different when you make different assumptions. If you made an assumption about, say, you didn't know the variance, right? This is specific to a Gaussian with variance 1. But if you didn't know the variance, really there should be like a sigma squared here. You don't know sigma squared. So maybe it's like sigma squared hat. But maybe you know an estimated variance. But then maybe you have to blow that up a little bit and you get a different quantity here. And then that would give you a different algorithm, and the analysis would be a little bit different. And yeah, so this is specific to a Gaussian. Who's seen a proof of this? Lucky you guys. So the proof is actually pretty simple. And it's very elegant. And well, I like detail. So I like to explain things like this. But also this proof technique is really general. And so if you want to do your concentration analysis with a different assumption, let's say you wanted to assume Bernoulli data, then you could do this proof and it would all work. OK, so what do we want to do?
We want to prove, let's just say, the top one of these statements. And if you can't read my writing, just yell out and I'll try to write big. So we want to prove that the probability that mu hat is greater than mu plus epsilon is not too big. Instead of this big square root thing, I'm just going to put some epsilon to begin with. And so what's the process for this? We're going to use a method called the Cramér-Chernoff method, which is really a combination of two things. It's a combination of using the moment generating function and Markov's inequality. So Markov's inequality, in case you forgot, is that if you have a positive random variable x, then the probability that x is greater than c is smaller or equal to its expectation divided by c: P(x >= c) <= E[x] / c. This is Markov's inequality. And that's one fact that we're going to use. This may be the second most important inequality in probability theory. And the second fact we're going to use is that we know the moment generating function of the Gaussian distribution. So if x is sampled from a Gaussian with mean 0 and variance 1, then the expectation of exp(lambda x), this thing is called the moment generating function M_x(lambda), is actually equal to exp(lambda squared / 2). So you can prove that yourself. It's just an integral. All you do is write down the definition of the expectation, you have the Gaussian density, you complete the square and you're done. So it's a really straightforward calculation. And we're going to use these two facts. And then just roll through the definitions and it's going to fall out pretty simply. So the first thing is, what is this mu hat? Well, this mu hat is the average of the data that you got. So this thing is just equal to the probability that (1/t) times the sum over s from 1 to t of x_s is greater or equal to mu plus epsilon. So all I've done there is substitute the definition of mu hat, pretty simple. These x_s, remember, we've assumed that we have this sequence of Gaussian random variables. We're assuming that x_s really is a Gaussian. So we have that x_s is sampled from N(mu, 1), a Gaussian with mean mu. And we've also assumed that they're independent. These things are all independent. So we're going to use those two assumptions. But then we're just going to subtract that mean off, actually. We're going to bring this mu over to the other side and put it inside the sum. And so this is equal to the probability that (1/t) times the sum of (x_s minus mu) is greater or equal to epsilon. This x_s minus mu is actually a Gaussian random variable with mean 0. We've literally just subtracted off the mean. OK, so it's still a little bit mysterious. Somehow we need to introduce this moment-generating function. This is essentially the next step. But first I'm just going to move this t over to the other side just to make things a little bit simpler. Yeah, epsilon is just something that we've introduced. At the very end of the day, we're going to set it to be that square root thing. But for now, it's just some positive number. Any other questions? All right, so we have this expression now. And now we're going to exponentiate both sides. You can just take the exponential of both sides. It's a monotone increasing function. And so we're going to do that. And we're going to sneak in a little lambda as well.
For some positive lambda, well, for any positive lambda actually, this is just equal to the probability that exp(lambda times the sum over s from 1 to t of (x_s minus mu)) is greater or equal to exp(lambda epsilon t). I've literally just exponentiated both sides of the inequality. Nothing more is done here. And now we're ready to apply Markov's. So this thing looks kind of like a moment-generating function. And this is just a constant. So we're going to immediately apply Markov's inequality. And this is now smaller or equal to, well, what do we have? We have the expectation of this thing. Is it positive? It is positive. The exponential is positive. And then divided by this. So if I do the division first, then I get exp(minus lambda epsilon t). So this lambda is just any positive thing, and we'll tune it later. And then we have the expectation of this exponential here, exp(lambda times the sum of (x_s minus mu)). OK, who wants to guess what the next step is? Exactly. This is exactly the moment-generating function. OK, there's one little annoying thing. It's the moment-generating function of a sum. So if you do the calculation, then you're going to see, ah, the moment-generating function of the sum is going to be the product of the moment-generating functions. Why is that? Well, here we have the exponential of a sum, which is the product of the exponentials. And because they're independent, and only because they're independent, we can say the expectation of a product is equal to the product of the expectations. So this argument would not work if they weren't independent, and the result would not even be true. But that's OK. So what we get here is actually a straight-up equality: exp(minus epsilon lambda t) times the product over s from 1 to t of the moment-generating functions, and each of those is exp(lambda squared / 2). And OK, the factors inside this product don't even depend on s. And so we can bring the t inside the exponential and say this is equal to exp(minus epsilon t lambda plus lambda squared t / 2). But this lambda thing, we just introduced it. It can be any positive thing that we like. And in particular, it can be the thing that makes this term as small as possible. We're going to choose lambda to minimize this. And this thing inside here is just a quadratic, so we can quite easily solve that optimization problem. And I'm just going to choose lambda to be equal to, I guess, epsilon. If I choose lambda equal to epsilon, what happens? This is just equal to exp(minus epsilon squared t / 2). OK, we're actually essentially done now. So really, all we've done is equality, equality, one inequality, which was Markov's, and then everything else is just an exact equality in this proof. And the nice thing is that essentially it works as long as you have a good understanding of what this moment-generating function is like. So for a Gaussian, it has this really nice form. But it also has a nice form for lots of other classes of distributions. So it's going to work well for exponential families, for example. It's going to work less well for some distributions where this thing does not exist. OK, we have to live with that. But nevertheless, it's a pretty general technique. So what have we established? We've established that the probability is smaller than this exp(minus epsilon squared t / 2). And now I'm just going to set this equal to delta.
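To keep track of the steps on the board, here is the whole chain written out compactly (my own summary of the argument just given, for independent x_s drawn from N(mu, 1) and any lambda > 0):

$$
\begin{aligned}
\mathbb{P}\left(\hat\mu \ge \mu + \varepsilon\right)
&= \mathbb{P}\Big(\textstyle\sum_{s=1}^{t}(x_s-\mu) \ge \varepsilon t\Big)
 = \mathbb{P}\Big(\exp\big(\lambda\textstyle\sum_{s=1}^{t}(x_s-\mu)\big) \ge \exp(\lambda\varepsilon t)\Big) \\
&\le \exp(-\lambda\varepsilon t)\,\mathbb{E}\Big[\exp\big(\lambda\textstyle\sum_{s=1}^{t}(x_s-\mu)\big)\Big]
 = \exp(-\lambda\varepsilon t)\prod_{s=1}^{t}\exp(\lambda^2/2)
 = \exp\big(-\lambda\varepsilon t + \lambda^2 t/2\big),
\end{aligned}
$$

and choosing $\lambda = \varepsilon$ makes the exponent $-\varepsilon^2 t/2$. The only inequality is Markov's; everything else is an identity.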
Remember, at the end of the day, we wanted to prove that the probability that it exceeded something was smaller than delta. And so here, what we have to see is, what is epsilon going to be if I do this? And I'm just going to solve that inequality, just rearrange it here, make the epsilon by itself. And what do you end up getting? You end up getting that this epsilon is equal to, I hope, the square root of 2 log(1/delta) divided by t. And then this probability is smaller than delta. And you're done. It's cool, right? And the other case is just the symmetric argument. It's exactly the same. So if we want to prove the second inequality, which will appear in a second, you just repeat this argument. And it's all going to work. OK, so this, by the way, you're probably familiar with some bound like this. If you did a physics lab in school or something, and you had to write error bars on your graphs, like this annoying business, then you would have had things a little bit like this. And this is where they come from, essentially. This is why. And this moment generating stuff, it's also how you prove the central limit theorem. It's really all connected, or at least it can be. OK, so now what we understand really well is, if we have an average of some Gaussian random variables, how close is it to the mean? This is what this theorem tells us. And now we just have to go back to our optimism and say, well, what do we need to do when we plug this into a bandit algorithm? And so what we want to do is we want to act as if we're in the nicest world that's plausible. So in the bandit problem, what does it mean to be nice? It means all the means are really big. If all the means are big, then it's going to be good. You're going to get lots of reward. Life is nice, so that's the optimistic world. But it has to be plausible. So the means should be as big as they plausibly can be. And so we're just going to use our concentration bound and plug that in and just get an algorithm. And if you'd done this a little bit before 2002, you'd have a paper with 4,000 citations, which would be nice. So what the algorithm is doing is it's saying, OK, what is the mean of each arm? The estimated mean is the mu hat. But statistically, it could be as big as mu hat plus this square root term. We have to choose this delta. We're going to do that at the end. So the delta is some small thing that we'll tune later. That's like how confident we feel we need to be. And then the algorithm just says, OK, the nicest world is where all the means are a little bit bigger than we think they are. And then we should act as if we're in that world, which just means choosing the arm which has the biggest one of these, which we call the upper confidence bound. So the UCB algorithm is just choosing the action that has the largest UCB, the largest upper confidence bound. And that's what it does. And here, this mu hat of A at time t is just how good you think action A is at that time. And then this T_A is how many times you've played it. And this sort of makes sense. Another way of looking at this very heuristically is, if you have played arm A a lot, if you've played it a lot already, then this term is going to be pretty small, because you really kind of understand what's going on. Whereas if you haven't played it that much, it's going to be a big term, and it's going to encourage you to explore a little bit more. So this algorithm is exploring more the actions that it hasn't played as much. OK. So now we're going to see, how do we understand how this algorithm behaves?
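Written as code, the whole algorithm is only a few lines. This is a minimal sketch under the lecture's assumptions (unit-variance Gaussian rewards); the names are mine, and the choice delta = 1/n^2 anticipates the tuning discussed at the end of the analysis below.

```python
import math
import numpy as np

def ucb_regret(means, n=10_000, seed=0):
    """UCB for k Gaussian arms with unit variance; returns the realized regret (sketch)."""
    rng = np.random.default_rng(seed)
    k = len(means)
    delta = 1.0 / n**2                    # confidence level, tuned later in the analysis
    counts = np.zeros(k, dtype=int)       # T_A(t): number of plays of each arm so far
    sums = np.zeros(k)                    # running sum of rewards for each arm
    total_reward = 0.0
    for t in range(n):
        if t < k:
            a = t                          # play each arm once to initialise the estimates
        else:
            mu_hat = sums / counts
            bonus = np.sqrt(2.0 * math.log(1.0 / delta) / counts)
            a = int(np.argmax(mu_hat + bonus))   # optimism: pick the largest upper confidence bound
        x = rng.normal(means[a], 1.0)      # reward sampled from N(mu_a, 1)
        counts[a] += 1
        sums[a] += x
        total_reward += x
    return n * max(means) - total_reward

# Sublinear in n, in contrast to the uniformly random policy above.
print(ucb_regret([0.5, 0.25, 0.0]))
```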
How are we going to analyze its regret? The regret analysis follows essentially from three steps, all of which are pretty straightforward. The first is going to be to decompose the regret. So it turns out it's really convenient if we can just talk about the regret for each of the arms separately. And then the second step is to say, OK, let's assume that the probabilities don't behave really weirdly. Let's assume that things work out in a kind of normal way. Then what happens? Show that the regret is small. And then at the end, we're going to show that this good event holds with high probability. So this is a very common path when you're analyzing algorithms that depend on random data. If you can, you want to separate the randomness from the algorithm. It's hard to analyze the randomness and the algorithm together. And so very often what you can do is you can say, OK, with very high probability, the data that's fed into the algorithm is regular in some way. And if that happens, then you analyze the algorithm. And then you show that the probability that it doesn't happen is low. And then you don't really care about the low-probability failure event. And this decoupling of the algorithm from the probability usually makes your life a lot easier. So we're going to do that. So what we really have to do is go through each of these steps. And then we're done. So the regret decomposition is, as I was saying, we want to write out the regret in terms of the arms. So here we have n mu star. That's the total reward you would expect from the optimal policy. And then this is the reward that you expect to actually get. And the first thing we're going to do is just bring this n mu star inside the sum. Here we have a sum over n terms. Here we have n terms as well. And so we can just bring this mu star inside here. Expectation is linear, so this is no problem. And now we can say, well, what is the expectation of the reward in round t? And in particular, what is the expectation of the reward given that you've just played some action? It's the mean of the action that you actually played. And so here, what we can say is the conditional expectation of the reward in round t is just the mean of the arm that you play. And so that appears here. We have the mean of the arm you play. Then you have mu star minus that mean. And that gives you this delta. This delta is how much worse arm A is than the optimal arm. So this is good. So now we have an expectation over the sum of these deltas. These deltas we call the suboptimality gaps. So, how suboptimal is arm A? OK. And so now, well, we have this A_t, which is a little bit annoying to deal with. So now we're going to introduce the actions again just by having another sum. So here we sum over all of the actions. But we take an indicator function to say which one we actually played. So this indicator function returns one if the thing inside is true, and zero if it's not true. So this thing here is going to be one only for the action that you actually played, and zero for all the rest. So that's why this sum is exactly equal to this thing here. And now what we have to do is use the linearity of expectation again. We can exchange these sums and pull one of the sums out the front. This delta is just a constant as well. And we get exactly the thing that we would like. So here what we're doing is we exchange this. We just pull this sum out here. And then, summing over t, the indicator counts how many times you played that arm, and so we're left with the expected number of times that you've played it.
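For reference, the decomposition just derived, written once in symbols (same notation as the slides, with T_A(n) the number of times arm A is played in the n rounds):

$$
R_n \;=\; n\mu^* - \mathbb{E}\Big[\sum_{t=1}^{n} X_t\Big]
\;=\; \mathbb{E}\Big[\sum_{t=1}^{n} \big(\mu^* - \mu_{A_t}\big)\Big]
\;=\; \sum_{A=1}^{k} \Delta_A\, \mathbb{E}\big[T_A(n)\big],
\qquad \Delta_A = \mu^* - \mu_A.
$$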
This decomposition is really intuitive, actually. The proof is much longer than it feels like it should be. Essentially all we're saying here is the regret is equal to, well, each time we play action A, you expect to pay delta A. The price that you pay is delta A. That's how much you expect to lose relative to playing the optimal arm. And so you just sum over the arms, and there's delta A times how many times you expect to play it. That's it. So this is pretty simple. But what it allows us to do is, well, now all we have to do is say this thing should be small if A is a suboptimal action. That's what we're going to try and prove. We're going to try and prove that the expected value of this quantity is not too big. Any questions on this step? Okay, so the next step is going to be to say, on the good event, things work out nicely. So what are we going to assume? What we want to do here is set up the set of things that are likely to happen. And then we want to prove that if they do happen, things are going to be good. And so what we're going to assume is essentially that the empirical estimate of some suboptimal arm is not too large. So we're fixing some suboptimal arm A here. And we have A star as being the optimal action, or an optimal action, just any fixed optimal action actually. And we're going to assume that the empirical mean of the bad action is not too large, right? This is when we expect things to go well. Things could go badly if the algorithm way overestimates how good the bad arm is. And we're going to assume for now that that doesn't happen. So we're going to assume that the empirical estimate of the bad arm is not too big. It's smaller than its actual mean plus some upper confidence bound term. And it's just an assumption. We'll prove it happens later, but for now it's just an assumption. And the second assumption is that the empirical estimate of the good arm is not too small, right? That's the second part. The algorithm could do badly if it really thinks the good arm is terrible, right? And so that's the assumption we're going to make for now. That's correct, exactly. It's the estimate after t minus one rounds. You're choosing what to do in round t, and this mu hat A of t minus one is your estimate at that time, exactly, right? So now we're going to say, if this event happens, then the algorithm should behave well, right? And then we're going to prove that it happens with high probability. Okay, so let's just suppose that we do play action A in round t, right? What does that mean must have happened? Well, we're just going to start writing out our inequalities, essentially. So here we have the actual mean of arm A plus two times this confidence interval, right? And now we're going to use the red inequality here and just say, okay, well, this has to be bigger than the empirical mean plus one confidence interval, okay? But what does this term look like? Should be familiar. It's the UCB, yeah, and this is the thing that we're maximizing. So what is the next step likely to be? Any thoughts? What's that? It's actually not the blue inequality. The blue inequality would be tricky to use, right? Because it has A star here and here we have A, this one here. It is going to be greater than this thing. That is true too. And it's absolutely true, but it would be hard to use. Actually, we haven't assumed that either. It would be true with high probability, but we haven't assumed it. So we need to do something else.
We would actually really like to use this eventually, but we can't use it yet. First we need something else, something about the algorithm. Oh dear. This is too slow. It's funny, they set up our computers to lock after 10 minutes for security reasons. And I went before the lecture and changed it to an hour, but it seems it doesn't work. Yeah. Okay, so here we somehow have to use the definition of the algorithm, right? So the algorithm is maximizing the upper confidence bound. So what is the step? Take the max, the max of what? I didn't see who was speaking, so now I kind of stare at them until they answer. What do we want the max of? Yeah. Yeah. The UCB algorithm itself is choosing its action to maximize the upper confidence bound, right? So given that it chose A_t equals A in this round, that means it chose the arm that had the largest upper confidence bound. So in particular, this must be bigger than the upper confidence bound for all of the other arms, including the optimal arm. And so this is bigger than this thing. This is the upper confidence bound for the optimal arm now, right? So now we're saying, okay, we played action A, and that was only possible because its upper confidence bound was bigger than all of the other upper confidence bounds, including the one for the optimal action, which is this thing here. And now, okay, the game is given away. Now we can use this blue inequality and say, well, this is the upper confidence bound for the optimal arm. It has to be bigger than mu of A star, okay? So what we're saying here is that we're only going to play this suboptimal action A if its real mean plus two times the confidence bound is bigger than mu A star. And just by definition, that is mu A plus delta A, okay? So this is looking pretty good, because now in this equation, well, we have mu A here and we have mu A here. And so we cancel these things off, and essentially all we're saying is that this confidence term here should be bigger than delta A, bigger than the suboptimality gap. But that can't happen too often, right? Each time we play arm A, this T_A increases. And so if that T_A is too big, it simply isn't going to be true, right? So we have that 2 times the square root of 2 log(1/delta) divided by T_A(t minus 1) has to be bigger or equal to delta A, right? And in particular, that just means that this T_A can't be too big, right? So this implies that T_A(t minus 1) should be smaller or equal to, I guess it's going to be, 8 log(1/delta) divided by delta A squared. Okay, so we're only going to play this action if we haven't played it too many times already, right? Otherwise we're just not going to play it. And so what does this imply? Well, it implies that, oh look, the eight was right. It implies that the total number of times we play action A is going to be less than one plus that thing, right? Maybe it's equal to this quantity, and we do play it. And so then it's equal to this quantity plus one. And thereafter, until the end of the game, we don't play it again, right? So this gives us a bound on the number of times we're going to play this arm. Well, it's not the expected number, it's just a bound on how many times we're going to play this arm, assuming that the high-probability concentration things work out well. All right, so this is what we have. So we're actually very close to being done, and I'll just write out the things that we know. We know that the regret, this is just the regret decomposition.
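Written out, the chain of inequalities on the board is the following (a compact restatement of the argument above, valid in any round t in which the suboptimal arm A is played and the good event holds):

$$
\mu_A + 2\sqrt{\frac{2\log(1/\delta)}{T_A(t-1)}}
\;\ge\; \hat\mu_A(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_A(t-1)}}
\;=\; \mathrm{UCB}_A(t)
\;\ge\; \mathrm{UCB}_{A^*}(t)
\;\ge\; \mu^* \;=\; \mu_A + \Delta_A,
$$

so $2\sqrt{2\log(1/\delta)/T_A(t-1)} \ge \Delta_A$, which rearranges to $T_A(t-1) \le 8\log(1/\delta)/\Delta_A^2$, and hence $T_A(n) \le 1 + 8\log(1/\delta)/\Delta_A^2$ on the good event.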
The regret is equal to the sum over the arms of delta A times T_A(n), so this is one of the things that we first proved, and now we've shown that T_A(n) is smaller or equal to one plus this 8 log(1/delta) divided by delta A squared, if the good event happens, if we're good. Okay, that's these two assumptions that I used at the beginning of the proof. Okay, so we just have to show that these assumptions hold. Any questions on this step? So this is the final step, and this one is actually pretty straightforward. We've done our concentration analysis, and we've set up the algorithm to really use that fact. All we have to show is that with high probability the things that we assumed would happen do happen. Okay, well, what do we have? We have to be a little bit careful, actually, about when something is independent and when it is not, and the view that helps for thinking about this is: don't think about a bandit as, you choose an arm and then a reward is sampled from a distribution. That is one way of thinking about it, but when you're actually doing the analysis, it often helps to think of it in a slightly different way, which allows you to de-randomize some stuff, to make stuff independent that's otherwise a little bit complicated. And the way to think about it is: you have a bunch of arms, and for each arm we're going to have a stack of rewards that's sampled independently at the very beginning of the game. You just don't get to see them. So we have, like, arm one, let's just say there are just two arms, that'll do, arm two, and we have a big stack of rewards, enough. We have n rewards for each of these arms, and you don't get to see them, but they're there, and each time you choose an action, you just take the next reward off the pile. So if you choose action one, then you get this reward. Let's say it gives you a reward of two, then you choose action two, and you get this reward, maybe it's three, and so on. But you have this view that the rewards are all sampled at the beginning of the game. They're independent, nothing is complicated or weird. You just don't get to observe them. And then what you do when you're doing your concentration analysis is you just look at these stacks, and you analyze these stacks. And that's what we're going to do here. So what we're going to say is this mu hat A, s is the empirical mean of arm A after you've played it s times. So this is not at time t, it's after you've played it s times. So what that means is I'm looking at the first s things in this stack, and I'm taking the empirical average of that. And this gets away from any complaints you might have, like, oh, what if I never play arm A s times, how is this thing actually defined? It's actually just defined like this, right? And this really is just s IID samples of a Gaussian in this case, because we assumed it really is Gaussian. And that's what this mu hat A, s is here. And now we can really use the theorem that we've proved already to say that the probability that mu hat A, s is too large is not big. And this is exactly the thing that we wanted to assume when we did our analysis. Okay, the only other little trick is that in the assumption we assumed it happened in all rounds. And so we're going to have to prove that it happens with high probability in all rounds. And for that we can just use a union bound. It's a lazy way of doing it, actually. You can really refine this if you like. But here we're saying, what is the probability that it's ever bigger by some big margin, right?
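The union bound step, written out (a sketch; the book is more careful about exactly which event is used):

$$
\mathbb{P}\Big(\exists\, s \le n:\ \hat\mu_{A,s} \ge \mu_A + \sqrt{\tfrac{2\log(1/\delta)}{s}}\Big)
\;\le\; \sum_{s=1}^{n} \mathbb{P}\Big(\hat\mu_{A,s} \ge \mu_A + \sqrt{\tfrac{2\log(1/\delta)}{s}}\Big)
\;\le\; n\delta,
$$

and the same for the lower tail of the optimal arm, which is where the 2nδ failure probability on the next slide comes from.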
And here we have an "exists", which is a union over a bunch of events, and we can apply the union bound to control that term, okay? And that's it. That's the whole thing. So now we just put that in. And we have an amazing regret bound. Almost completely state of the art, right? So all we've done here is we've said, okay, this is the first thing that we did in our regret decomposition. And then, if things worked out well for some arm, the number of times we played it is just this one plus the 8 log(1/delta) over delta squared. But there's like a 2 n delta probability that things don't work out. There's a two because we have to do the concentration analysis for both sides, upper and lower. And if that happens, well, you could end up playing the suboptimal arm n times. So we have 2 n delta times n, which is 2 delta n squared, okay? And well, then we get to choose our delta. And we should just choose delta to make the bound not bad, right? And we're lucky we can do that because here, here we have a delta inside the log. And so we can just make delta pretty small and not pay a really big price for that. And it turns out that a delta of about one over n squared is about the right thing to do, okay? And if you do this, then you get a bound that looks like that. Great. That's the proof for UCB. And this is also the proof for almost all bandit algorithms. Like, you look at these long papers, essentially they're just doing this argument, sometimes in a little bit more complicated, subtle way to improve the constants, but this is more or less the idea. So any questions, I guess, first on this part of the proof? Why the regret, when it's something you can't actually compute? Yeah, that's right. And I think in reinforcement learning, that's sort of life. I mean, the regret is in some sense just a normalization, right? So what we're really saying, if you like, is also something that we don't know, but we're saying that the sum of rewards that we get, the expected sum of rewards, so this is another way of writing our bound, this expected sum of rewards up to n, is greater or equal to n mu star minus this sum of the one-over-delta terms and stuff like that. That's what we're saying. So what we're saying is that we're doing nearly as well as the optimal thing, right? If we played optimally, we would get exactly this, and we lose at most this much. So what the regret is measuring is just how badly we do relative to the optimal. That seems like a thing worth knowing even if you can't actually observe it. But it's true that it's hard to estimate your own regret, for sure. What's that? So delta A, yeah, delta A, so here we're only summing over the ones where it is positive. If it's not, we just drop the term. So delta A is always going to be greater or equal to zero by definition, because delta A is the difference between the best and whatever. So delta A is equal to max over B of mu B, minus mu A. So this thing is definitely greater or equal to zero. The problem, I guess, that could be happening actually is that the delta A could be positive but really, really, really small, right? And then this term will be really big, and that seems like a bad thing. And we'll talk about that in just a second. Did you have a question as well? Yeah. You wouldn't expect there to be much of a difference if the gap is small, right? Yeah, so the effect that is being observed here is, if the gap is really, really small, what does that mean?
That means it's hard to tell the difference between those arms, right? Because they have almost the same mean. So you pull the arms and statistically it's really hard to tell the difference. And so if the gap is small, you can spend a really long time trying to work out which is the optimal one. I mean, you have to play it a lot. And it turns out that the price you pay for the gap being small is actually really big. It's a little bit surprising that if the horizon is very big, then small gaps are actually harder. But if the horizon is not very big, then that's not quite true. And this is the next point. So actually you can improve the situation quite a lot if what you care about is the small-horizon regime, right? And we're going to do that. So there's a very simple argument to sort of fix this problem of what happens if delta is really small. And here we're going to say, well, there are kind of two kinds of arms. There are arms where the delta is so small that you can never really tell the difference between that arm and the optimal one. And then there are arms where you can tell. And we're going to do our analysis separately for each of those. So basically what we do is we split the sum in the regret into two categories, one where the gaps are really, really little and one where the gaps are not really, really little. And then we use our bounds separately for each. So this is where they're small, this is where they're big. And for this term here, well, the number of times we play any arm can't be bigger than n, right? We just have n rounds to play. We choose n actions. The total number of pulls is definitely less than n. And so this term here, this first term, is definitely less than n times Delta. All these arms are at most Delta suboptimal and we can play them at most n times. So we have n times Delta here. And then in the second term, we just do exactly what we did before, the same bound. And here we have a bound that is not behaving so badly when some of the gaps are really small, right? And in particular, now we get to optimize this thing. And if you choose that threshold to be like square root of k divided by n, I guess, so if you choose this Delta, I think it's square root of k log n divided by n actually, then you get a bound that says the regret is never much bigger than square root of n k log n. And so this we call a distribution-free bound. It resolves this issue of, oh, if the deltas are really, really, really small then your regret can be arbitrarily big; no, that's not true. The largest possible value your regret can take is going to grow like square root of n k log n, okay? And this sort of confirms the thing that we were hoping for at the very beginning, that our regret would be sublinear. How sublinear? Well, it turns out that we can get a square root of n k log n here, okay? So we have a square root n rate, essentially. If you are very careful, you can also get rid of this log n, though you actually have to change the algorithm to do that, in a very small way. Okay, when am I due to finish? I have half an hour, right? Okay, so there are lots of refinements. I think what I want to talk about for the rest of this is, how do you scale up these things? This is an algorithm where the regret that we end up getting really depends on k, depends on the number of actions you have. And if you think about some of the applications that I listed at the start, right? If you're placing ads in your big company, you probably have something like 30,000 ads you could show your user, right? You don't want this k to be 30,000.
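Schematically, the gap-splitting argument is the following (constants and the +1 terms dropped; Δ is the threshold we get to choose):

$$
R_n \;=\; \sum_{A\,:\,\Delta_A < \Delta} \Delta_A\,\mathbb{E}[T_A(n)] \;+\; \sum_{A\,:\,\Delta_A \ge \Delta} \Delta_A\,\mathbb{E}[T_A(n)]
\;\lesssim\; n\Delta \;+\; \frac{8k\log n}{\Delta},
$$

and balancing the two terms with $\Delta \approx \sqrt{k\log n / n}$ gives $R_n \lesssim \sqrt{nk\log n}$.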
That's really bad, okay? Or if you're doing network routing, then the set of actions is like paths through some network, which is combinatorially large. It's even worse. So it's going to get to billions really, really quickly. So you don't want that. So somehow we have to scale up these simple ideas to something a little bit more complicated. And the way to do that, of course, is just to assume a little bit more. You have to assume some kind of interesting structure. And this is what we do, and we do it in lots of different ways. But what I want to show you is the one that is, (a), the most useful, and, (b), a good example of how all the same ideas just work. Okay, so we need to introduce some kind of structure. And the simplest structure you can imagine, and the one that we've seen a lot already in this workshop, is a linear structure. So we're not going to assume that we have a really large set of arms that are totally unstructured, right? The problem that we face in that setting is the only way to learn about some arm is to play that arm. And we don't want that. We want to introduce some structure so that we can learn about one arm by playing other arms. And we're going to do that using the linear contextual bandit. So this is more or less the same problem, but there are a few little tweaks. So the first tweak is that the action set is potentially changing, and it's not just an unstructured thing, it's actually a subset of R^d. So in each round what happens now is you get given a subset of R^d. Let's just say it's finite. So it has size k, but you get given this set, and maybe it's even changing in each round. And maybe it's really big, you could think of k as being 100,000 or a million or something. So you get given that in some form, and then you get to choose an action. So this is like choosing a vector in R^d, but not just any vector, it's some vector in this set. And then the reward that you get is the inner product between that vector and some parameter that you don't know. So we're going to say this theta is something that you don't know. And then we add some noise. This eta is like your Gaussian noise, if you like. So for now you can think of the eta as just being a sequence of Gaussians that makes your observations a little bit noisy. And okay, then you get to observe your reward, and you would like to make the regret small again. Yeah, usually it's finite. So, I mean, if it's finite it's definitely compact, so it's good. Yeah, typically it's going to be a polytope. But okay, it could be a sphere or some compact, maybe convex, set. In the lectures tomorrow and the next day we'll look at a different setting where this will be convex, but for now it's finite. That's the easy thing to think about. Okay, so I mentioned that this was sort of practical and people actually cared about this. Why do they care? So one important thing here is this action set is changing, and why is that? So very often what you have is a situation where you want to recommend something to somebody. So they come and you have a lot of information about your user, right? You know what websites they've looked at in the past. So you have a bunch of features about your user, and so what you're going to do is you're going to look at the user that comes and look at the features that are associated with them that you've stored based on their previous behavior, and then you look at the features of each item, and maybe you just concatenate them into one big vector, and that's what defines your changing action set.
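As a concrete sketch of this setup (the names are mine, just for illustration): in each round a fresh finite set of feature vectors in R^d arrives, and the reward for the chosen vector is its inner product with an unknown theta plus unit-variance Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 100                     # feature dimension and number of actions offered per round
theta = rng.normal(size=d)        # the unknown parameter the learner has to estimate
theta /= np.linalg.norm(theta)

def next_action_set():
    """A fresh, changing action set: k feature vectors (e.g. user features x item features)."""
    return rng.normal(size=(k, d))

def reward(a):
    """Observed reward: linear payoff <a, theta> plus unit-variance Gaussian noise."""
    return float(a @ theta) + rng.normal()
```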
So the changing action set is really important when you want to do generalization across users; that's how it's used. And okay, well, the linear model, it's the simplest thing you can imagine. Very often people add some sort of link function here, they make it a generalized linear model, but this is just the simple case. So what you want to do here is exactly the same thing as we did for UCB. We want to be optimistic and then just see what happens, and it all just works. That's the nice thing about this setting. We don't really have to do anything new except update our concentration analysis and then just roll through the normal techniques. So it's exactly the same idea. We're going to estimate the theta. That's the thing that we don't know. Then we're going to build a confidence set around it. That's like the square root blah, blah, blah, blah that we calculated before. And then we're going to play the action that maximizes that confidence bound. We're going to be optimistic. So it's exactly the same idea. We just have to repeat a few of the steps. So I'm very glad that we had some really nice talks by Lorenzo on regularization and least squares and so on, because that's what we're going to do here. So basically what we want to do is estimate this theta with a theta hat, and we're just going to do it with regularized least squares. The regularization here is actually not so, so important, except that it guarantees us a unique solution and it makes things a little bit simpler. But we're usually going to choose lambda to be relatively small, right? So basically what we do is we take the sequence of data that's happened so far, at the beginning that's nothing, and then we estimate our theta hat to be the thing that minimizes the error. And if you do the calculation, then what do you get? You get that theta hat t is equal to this data matrix inverse times the sum, okay? Which I think we've already seen, essentially. Okay, so this is your empirical estimate, and all we have to do now is analyze the concentration properties of this. We have to say how good is this theta hat, right? Is it really close to theta, and how close? And in what norm? And this is a little bit more subtle in this setup. It's a little bit harder to do the concentration analysis. Any questions on what's going on in this setup and the approach? Okay. So really we just want to know how good this theta hat is, because we need to know how to define our confidence intervals. Okay, and what do we have? Well, we're going to be a little bit, oh yeah. How does the action set change? Yes, it's changing and we're not going to actually specify how. Our algorithm is going to work no matter how it changes. Usually it changes in applications because different users are coming to you and they have different features associated with them. But we don't particularly care for the analysis; somehow or other it's given to you each round. I should say, by the way, that the fact that we have a changing action set means that the definition of regret should change. And it seems I forgot to write down what the regret should be, right? Because before we were comparing ourselves to the best action. But if you have a changing action set, then there's sort of no such thing as a best action. The best action is changing in each round. And so the regret that we care about is going to be the expectation, probably, of the sum from one to n. And now we have the reward you expect to get if you play the best thing given the action set in that round.
So I'm going to write that as the inner product of A_t star with theta, minus the reward you expect to get. Okay, and what is A_t star? A_t star is the action that maximizes the reward in that round: the argmax over a in the action set A_t of the inner product of a with theta. Right, so you have a changing sequence of action sets, and in each round you want to play as well as the best action in that round. That's our objective. Okay, and now if you do the calculations for this least squares estimator, well, if lambda is equal to zero and you're a little bit incautious about the details, then basically it's unbiased; with the regularization it's approximately unbiased. This is really irritating, I have to fix this. And the second part is you get a variance calculation, which is a good exercise and maybe we will do it at the end if there's time, but the variance is saying, well, it depends. What we really care about is how good your estimator is at estimating an inner product. The inner products are what we really care about: we care about the inner product of x with theta, and the variance of your estimate depends on the x, which makes a lot of sense. Suppose, for example, you've chosen a bunch of actions that all go in one direction. Say this is our x, and we want to know how good our estimate is in the direction of x. Well, if we played a bunch of actions that are practically orthogonal to it, we get very little information about the real payoff of x. The only way to get information about the payoff of x is to play a few actions that are at least a little bit in its direction; then you get some good information. And this is what this norm weighted by G is doing: it's saying, well, if G is big in the direction of x, then we're going to be confident, and if it's not big, then we're not going to be very confident. So it's an adaptive measure of variance that depends on the directions, okay? We have to be careful about this, but okay. And this gives us our concentration analysis. Essentially, we repeat almost identically the argument that we did for the Gaussians. There's almost no change, and you can actually prove that when these A's are chosen in advance, this term here is the variance, which is what we would expect to see if everything were Gaussian, and then we have the confidence level. So this is really exactly the same as what we had before. And in fact, think about the case where the actions are orthogonal: the very special case where in every single round A_t is just the set of standard basis vectors in R^d. In this case, what is G? G is just a diagonal matrix counting how many times you played the vector in each direction, and if x is one of those basis vectors, then its squared norm in G inverse is just one divided by the number of times you played it. So this reduces to exactly the same bound that we saw before, but now we have the linear version, which is more general than that special scenario. Okay, so this is nice, and you can prove it using exactly the same technique. There is one big caveat, which is really irritating, which is that this result, as I've stated it, is not actually true in the setting we care about. And the reason is: how do these algorithms work? They choose their actions based on the data, right?
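For reference, here is roughly the shape of the concentration statement being described, written in the lecture's notation; treat it as a sketch, since the precise constants and the handling of the regularization bias depend on the exact assumptions.

```latex
% Fixed design, standard Gaussian noise, ignoring the small bias from the regularization:
% for any fixed direction x and confidence level \delta,
\Pr\Big( \langle x,\, \hat\theta_t - \theta \rangle \;\ge\; \|x\|_{G_t^{-1}} \sqrt{2\log(1/\delta)} \Big) \;\le\; \delta,
\qquad
G_t = \lambda I + \sum_{s=1}^{t} A_s A_s^\top,
\quad
\|x\|_{G_t^{-1}}^{2} = x^\top G_t^{-1} x.

% When the actions are chosen adaptively, as a bandit algorithm chooses them, the width has
% to be inflated by roughly a factor of \sqrt{d}, which is where \beta_t \approx \sqrt{d \log t} comes from.
```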
They play for a little bit, they get some rewards, they do some estimates, and then they play an action. So these A_t's, when you're doing a sequential analysis, when you're analyzing a bandit algorithm, are not independent, right? The analysis that gives you this bound easily is the one where you assume the A_t's are fixed in advance, and here they're very much not fixed; they totally depend on the data. And this is really irritating. It introduces lots of problems in the analysis, and if you do the right analysis you end up with an extra factor of d here. And we know that that can't be improved. I think we probably won't have time, but the book at least explains how to do the analysis and also why that d is necessary. It's a real pity, because this version is simple and that one is annoyingly hard. But okay, that's how it is. So this gives us a concentration analysis for how close theta hat is to theta in the direction x, and now we just write down the algorithm; we're essentially done. The algorithm is going to play the action that maximizes the estimated reward, that's the inner product of a with theta hat (and the subscript here should be t minus one, I guess), plus a confidence bonus, beta_t times the norm of a in G_{t-1} inverse. And this beta_t is something like the square root of d log t; there are some constants and things there, but that's essentially what we have. So again, this is doing exactly the same thing as UCB; there's nothing fancy here. Then we observe the reward X_t and update our least squares estimator. And that's the algorithm. Yeah, that's right, you do, because what happens here if we have really large action sets, with lots of vectors going in every direction essentially, is that this norm of a in G_t inverse is not just the square root of one divided by the number of times you played action a anymore. It's more like the square root of one divided by the number of times you played actions that were roughly in the direction of a; it's asking how big this matrix G is in the direction of a. So that grows as you play other arms, and this term can get really small even if you've never played a before. And in fact we're going to see that this type of approach reduces the k dependence in the regret to a d dependence, and that's where you're winning. Exactly, good question. Any other questions? Okay, so I have just a little bit of time left, so maybe I'll tell you roughly how the analysis works. It's really the same argument. What we want to do is say, well, with high probability everything works out. Essentially, we're going to assume that with high probability the concentration bound holds for all x. Then we're going to choose the delta to be something like one over n, or one over n squared, small enough that we really don't care about the failure event. And then we're going to show that if that event does happen, the regret is going to be small. That's exactly the idea. Okay, so how does this work? Well, we have the regret. The regret, if we remember, is the expectation of the sum from one to n of the inner product between A_t star and theta (remember, that's the best action) minus the inner product between the action you actually played and theta. This is the thing that we want to make small, and I'm just going to call the term in round t the little r_t. So now we use essentially the same trick as before.
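Putting the estimator and the bonus together, here is a minimal sketch of the optimistic algorithm just described (often called LinUCB); the function names are my own, and the choice of beta_t and lambda below is a rough placeholder rather than the tuned constants from the analysis.

```python
import numpy as np

def lin_ucb(action_set, reward, d, n, lam=1.0):
    """Optimism for the linear bandit: play argmax_a <a, theta_hat> + beta_t * ||a||_{G^{-1}}.

    action_set(t) returns a (k, d) array of available actions in round t;
    reward(a) returns the noisy reward for playing action a.
    """
    G = lam * np.eye(d)          # G_t = lambda * I + sum_s A_s A_s^T
    b = np.zeros(d)              # sum_s A_s X_s
    for t in range(1, n + 1):
        A_t = action_set(t)
        G_inv = np.linalg.inv(G)
        theta_hat = G_inv @ b                        # regularized least squares estimate
        beta_t = np.sqrt(d * np.log(t + 1))          # placeholder, roughly sqrt(d log t)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', A_t, G_inv, A_t))  # ||a||_{G^{-1}} per action
        ucb = A_t @ theta_hat + beta_t * widths
        a = A_t[np.argmax(ucb)]                      # be optimistic
        x = reward(a)
        G += np.outer(a, a)                          # update the design matrix
        b += x * a                                   # and the running sum
    return theta_hat
```

In practice you would update G inverse incrementally with the Sherman–Morrison formula rather than re-inverting each round, but the version above keeps the correspondence with the formulas on the board.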
Before, what we did is we looked at essentially an upper confidence bound for the arm that you're actually playing and compared it to the upper confidence bound of the optimal arm, and we're going to do exactly that here. So we're going to say, well, we've got the inner product between theta and A_t, plus two times our confidence interval, so two times this beta_t, and then the norm of A_t in G_{t-1} inverse. Well, we know this has to be bigger, if the high probability event all works out, than the same thing with a theta hat and just one confidence width: it should be greater than or equal to the inner product of theta hat with A_t plus beta_t times the norm; now we've just got one of them in our confidence interval. Oops, the subscript is t minus one, like so. Right, so this is just using the fact that with high probability our estimates are pretty good. But this thing is exactly our upper confidence bound, and the algorithm chose A_t to maximize it, so it has to be bigger than the same thing for the optimal action. So we can keep going: this is greater than or equal to the inner product of theta hat with A_t star plus beta_t times the norm of A_t star. And this, well, this is an upper confidence bound, so with high probability it's greater than the inner product of theta with A_t star. Okay, and now what do we have? Here we have the reward you expect with the optimal action, and here you have the reward you expect with the action you actually played, plus a little thing. So if we bring this term over to that side we have the little r_t, and this implies that r_t is smaller than two beta_t times the norm of A_t in G_{t-1} inverse. So now we have a bound: we just assumed that everything works out, that the high probability event happens, and we can say that the little r_t is not too big; it's smaller than this thing here. Okay. So now the argument is essentially: if I play a whole bunch of actions, how often can this norm be big? If it's big, that means the matrix G is somehow not very big in that direction, because here we have a G inverse. And every time we play an action where it's big, that actually shrinks it a little bit. So you should think about the level sets of this as being like an ellipsoid: if you play some action in this direction, it starts to shrink a little bit in that direction, and you end up with an ellipsoid a bit like this, and so on and so forth. So this thing actually shrinks significantly. And now we have to do the sum and see what happens, and unfortunately we probably don't have time for this, but if we sum over t equals one to n — okay, I should say we can also bound the regret in each round by one. If we make some normalizing assumptions about how big the A_t's can be and how big theta can be, we can just say the regret in each round is smaller than one; that's the worst possible thing that could happen. So we'll take the minimum with one here. Okay, so if we sum up all these per-round regrets, this is smaller than the sum, with the beta_t pulled outside to be a little bit lazy, from t equals one to n, of the minimum of one and the norm of A_t in G_{t-1} inverse. So this is the thing we want to bound, and it's not so easy to do, but it's not so hard either. Yeah, the beta_t: basically beta_t is bigger than one, so I can pull it outside the min; the minimum of one and beta_t times the norm is definitely smaller than beta_t times the minimum of one and the norm.
And then I can pull it outside the sum. Ah, good point. Yeah, it should be the sum of the beta_t's, and I'm just going to put beta_n outside, since if we remember, beta_t is approximately equal to the square root of d log t, which is increasing. Good spot. Okay, so here we have beta_n times the sum, and the sum is the thing we want to bound. The first step to bounding it uses a favorite inequality that's not from probability, which is Cauchy-Schwarz. We have the sum from t equals one to n of the minimum of one and this norm, and by Cauchy-Schwarz this is less than or equal to the square root of n times the sum from t equals one to n of the minimum of one and the squared norm. Okay, cool. And now we're going to use a sneaky little inequality, which is that the minimum of one and u is less than or equal to two times the log of one plus u. You'll see where the magic appears in a second. So we substitute that in. In fact, this sum is the term I really care about bounding; I'll just call it, well, a is a bad choice, let's call it b. So b, substituting in this silly little inequality, is smaller than the sum from t equals one to n of two times the log of one plus the squared norm of A_t. Well, the log and the sum behave quite nicely: this is two times the log of the product. And if you do a little bit of linear algebra, this turns out to have a really nice form: it's equal to two times the log of the determinant of G_n divided by the determinant of G_0. And G_0 here is just the thing you initialize G with when you did your regularization, so G_0 is just lambda times the identity. It's a nice little exercise to prove this equality; you do it by induction, starting at n and going backwards. Okay, and the log determinant, well, we have this relationship: the determinant is the product of the eigenvalues, and the trace is the sum of the eigenvalues. And then we have the AM-GM inequality, which says that the geometric mean is smaller than the arithmetic mean. Maybe I'll write the bound on the determinant first, somewhere else. So the determinant of G_n is equal to the product of the eigenvalues lambda_i, and this is smaller than or equal to the average of the eigenvalues to the power of d. And the sum of the eigenvalues is equal to the trace of the matrix, so this is equal to the trace of G_n divided by d, to the power of d. And the trace grows linearly with n, right? As you're playing actions, you're just adding the A_t A_t transpose terms; G is equal to the sum of the A_t A_t transposes, and the trace of each of those is just the norm squared of A_t, so the trace is growing linearly. So I'll just say this is approximately n divided by d, to the power of d. And here we have a log, so the d comes down, and when we substitute that in, this thing here is approximately d times the log of n divided by d. Okay, that was all a little bit quick because we're running out of time, but this is the core of the argument. And so the regret bound you get at the end of the day: we have one square root of d from the beta, and we have another square root of d from here. So the regret you end up getting for this algorithm, up to logs, is d times the square root of n. And actually we have a lower bound that says you can't beat that, so this is essentially optimal. But the real point here is that in some sense this is all mechanistic, right?
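Since that went quickly, it may help to see the whole chain written out in one place. This is a compressed reconstruction of the board work, on the high probability event and with constants handled loosely, assuming the actions are normalized so each per-round regret is bounded.

```latex
\begin{align*}
R_n &\;\lesssim\; \sum_{t=1}^{n} \min\!\big(1,\; 2\beta_t \,\|A_t\|_{G_{t-1}^{-1}}\big)
     \;\le\; 2\beta_n \sum_{t=1}^{n} \min\!\big(1,\; \|A_t\|_{G_{t-1}^{-1}}\big)
     && (\beta_t \le \beta_n,\ \beta_n \ge 1) \\
    &\;\le\; 2\beta_n \sqrt{\,n \sum_{t=1}^{n} \min\!\big(1,\; \|A_t\|_{G_{t-1}^{-1}}^{2}\big)}
     && \text{(Cauchy--Schwarz)} \\
    &\;\le\; 2\beta_n \sqrt{\,2n \log\frac{\det G_n}{\det G_0}}
     && \big(\min(1,u)\le 2\log(1+u)\big) \\
    &\;\le\; 2\beta_n \sqrt{\,2nd \log\!\Big(1+\frac{n}{d\lambda}\Big)}
     && \text{(AM--GM, } \|A_t\|\le 1\text{)} \\
    &\;\approx\; d\sqrt{n} \quad\text{up to logarithmic factors, using } \beta_n \approx \sqrt{d\log n}.
\end{align*}
```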
You say, I want to do UCB, so what do I need? I need to estimate the thing I don't know. Then I need to build a confidence interval, and for that you can just look in some statistics book and it will be there. Then you plug it in and you have your algorithm, and that's also completely mechanistic; you just have this recipe that gives you an algorithm. And then the analysis is sometimes a little bit tricky, but even that is really just what you would expect if you've read those statistics books. So it's simple, and that's good, yeah. It seems we're sort of trying to control the volume spanned by the data matrix, from the other side? Yeah, that's exactly right. What do we care about? We care about how accurate our estimate is. The thing we care about is the inner product between x and theta hat minus theta, and that is roughly controlled by the norm of x in G_n inverse. So you can draw the ellipsoid that G_n defines: look at x transpose G_n inverse x as x ranges over the sphere S^{d-1}, the unit sphere in R^d, and that defines an ellipsoid. If it looks like this, then you have a great deal of uncertainty about the x's in that direction, and a little bit more certainty about the x's in this direction. And these here are actually the eigenvectors, the principal axes of the ellipsoid. So if you play an action that lies exactly along one of the eigenvectors, what is this x transpose G inverse x? Well, if x is unit norm, it's just one divided by the corresponding eigenvalue. And intuitively, if you play something close to an eigenvector, you get some average of that eigenvalue and the others when you take the linear combination. Unfortunately the picture for the update is not very clean: when you play this action, the ellipsoid shrinks a little bit in this direction and a little bit in this direction, but there's also a little rotation, and that calculation is a big mess. But intuitively, that's exactly what happens: you have this ellipsoid, the algorithm tends to play in the directions where it's wide, and that shrinks it in those directions. Okay, so I think I'm done for today. I'm out of time, but of course I can take questions. And then tomorrow there'll be no probability anymore; we'll still be able to come up with an analysis, and it's going to be really fun. So if you got lost here, tomorrow is going to be completely different and you can get lost in a different way. Yeah, thanks.