Hello everyone. The first lecture today is going to be by Claire Vernade. She's a group leader at the University of Tübingen; before that she was at DeepMind, and I think she also held a position at Amazon. Her research interests are theoretical aspects of bandits and reinforcement learning, and she's going to talk today about contextual bandits, so let's welcome her.

Cool, now it works. Okay, good morning everyone. Thanks for making it to the 9.30 a.m. lesson on bandits after a week of reinforcement learning classes and probably long nights, or short nights. Today we're going to talk about contextual bandits, so it's an extension of what Emilie presented yesterday, and maybe also related to what Tor presented yesterday to some extent. The program for today is to cover how we go from bandits to adding context: how do we deal with contextual information in bandits, how do we do bandits in structured action spaces. We'll see how to extend the optimism principle and, in general, what the main complexity trade-offs are. I'll talk a bit about regret bounds, with proofs that are sketched but not too detailed.

Okay, the usual slides to start a bandit talk: a recommendation use case. You run a mini Netflix, you have two TV series in your catalogue, and you want to recommend to your users which of these they might like. As a first approximation you can just run a two-armed bandit algorithm of the kind you saw yesterday, typically UCB: customers come, let's say they're an IID crowd, and you just want to maximize what they click on. A two-armed bandit is fine for this, so that we know how to solve. But now you have a slightly larger catalogue: you have more arms, but these arms have some similarity, and if you ran a K-armed bandit algorithm, a four-armed bandit algorithm here, it feels like you'd be losing information. You'd be relearning things, and you wouldn't know how to reuse the fact that if someone likes Emily in Paris, very likely they're also going to like Emily in Paris season two.
Moreover, you have different kinds of customers. I come to your platform and ask for a recommendation: I'm a woman, I'm French, I like hiking, whatever; I have a vector of cookies, or whatever it is that represents me. Now you want to make me a recommendation, and you still have a structured action space. So the question we're going to address today is: how can you solve this using bandit tools that are a little more elaborate than K-armed bandits? That's the setup of the problem. It's a toy problem; in real applications there is much more coming your way, but let's dumb it down and take the simplest model.

Let's introduce some math to work with this. Your observations are contexts: a context is some vector of cookies or the like, my representation as a customer. You have actions, and you have action sets; here the action sets are finite, we will always have K actions, but each action is going to be represented as a vector in some d-dimensional space. I'll build that up slowly, but for now you can think of the context as a vector, the actions as vectors, and the reward as some scalar function of the context and the action you took. The reward is noisy: whenever you take an action a in a context c, you observe a reward drawn from some distribution P that depends on the context and the action.

In all of this talk we're going to focus on minimizing the regret. Yesterday you heard a bit about best-arm identification and other objectives, but here we minimize the regret, defined as the expected difference between the sum of the maximum rewards you could have obtained by taking the best action in every round t, in the context c_t of that round, and the sum of the rewards you actually obtained. Writing r(c, a) for the expected reward:

R_n = E[ Σ_{t=1}^n max_a r(c_t, a) − Σ_{t=1}^n r(c_t, a_t) ],

where the maximum is over actions for the random context c_t that was observed; that's the baseline, and your performance is the sum of the rewards of the actions you took. The expectation is over the randomness of the rewards and of the action sequence, which itself depends on the rewards and contexts you observed; there's a lot of randomness in there, and that's what this expectation stands for. It's not much different from the regret you saw yesterday, so if you're familiar with regret definitions this should not be surprising.

For now, suppose a very simple setting where you have a finite number of contexts, say a finite customer base, which is usually the case: there isn't really an infinite number of customers. The first thing you can do is say that every customer is a different person, treat them all independently, and run a bandit algorithm for each of them. Every time you see a customer, you recall all the data for that customer and do one step of UCB. If you do this, we know that for each bandit algorithm you run, typically UCB, the worst-case regret is of order √(nK).
So overall, the worst-case regret of this collection of M bandit instances is of order √(nMK). I've spelled out the computation here: you sum these regrets, and the worst thing the environment can do is show you each of the M contexts, each of the users, about n/M times, so each instance contributes √((n/M)·K), and in the end you get M·√((n/M)·K) = √(nMK). This is bad because M can be really large, and also you don't cross-learn: if two customers are very similar, you don't exploit that at all. So we're not satisfied with one bandit per context.

What are we going to do? We're going to make assumptions. First, recall that the scalar product is just ⟨x, y⟩ = Σ_i x_i y_i; I think most of you are familiar with this. Now we assume that my contexts and actions can be embedded into R^d, a Euclidean space with a scalar product: a feature map φ sends a contextualized action (c, a) into R^d, and the expected reward for context c and action a is simply the scalar product of this embedding with an unknown vector θ* in R^d:

r(c, a) = ⟨φ(c, a), θ*⟩.

So I've simplified the problem: I've made it linear by assuming things are embedded. And it turns out, and this is a key observation, that this is kind of all you need. If my contexts and actions come in some arbitrary way and I embed them, it's as if I were saying that at every round my action set is literally a set of K vectors: the contextualized actions. I abstract away the embedding. You can think of it as a neural network that learns representations: it takes the context and the actions and outputs a d-dimensional vector for each contextualized action. The key observation is that it's equivalent to say: I have a representation of my contexts and actions, and at every round I get K d-dimensional action vectors and can work with a linear reward. That's an assumption; in full generality you could allow an arbitrary reward function per action, which is beyond what we'll cover today, though people are looking at that question. Here we'll spell out everything we know about the linear reward setting. Does this make sense so far?

Implicitly, I'm saying that R^d really is spanned by my action sets. If your actions were ill-specified and actually lived in a smaller-dimensional subspace, or if θ* were zero in some coordinates, you would have a sparse problem; that's also beyond today's scope, though there is work on it as well. So the d-dimensional space is nicely spanned (I don't know why my computer is doing this; it locks because it doesn't understand that I'm clicking), which means I can observe scalar products of θ* with all the directions of the space.

So how does the learning work? At every round t, I observe a contextualized action set, which I denote A_t; the learner takes one of these actions and observes a noisy reward.
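To make this protocol concrete, here is a minimal Python sketch of the interaction loop just described; it's mine, not from the slides. The names (`observe_action_set`, `pull`), the dimensions, and the noise level are illustrative assumptions; the action sets are drawn at random to mimic the arbitrarily-changing case described next, and the policy is a uniform placeholder that the algorithms of this lecture will replace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 5, 8, 1000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)   # so that S = ||theta*|| = 1

def observe_action_set():
    """Environment hands us K contextualized action vectors in R^d.

    Drawn at random here (arbitrarily changing sets); in a real system
    they would come out of your embedding phi(c, a)."""
    A = rng.normal(size=(K, d))
    return A / np.linalg.norm(A, axis=1, keepdims=True)  # norms bounded by L = 1

def pull(a, sigma=0.1):
    """Reward is the scalar product with theta_star plus Gaussian noise."""
    return a @ theta_star + sigma * rng.normal()

regret = 0.0
for t in range(n):
    A_t = observe_action_set()
    a_t = A_t[rng.integers(K)]             # placeholder policy: uniform at random
    r_t = pull(a_t)
    regret += np.max(A_t @ theta_star) - a_t @ theta_star
print(f"regret of the uniform policy after {n} rounds: {regret:.1f}")
```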
Here I've made a drawing in dimension 2 with K = 4 actions. This is the unknown, hidden reward vector θ*, and at this round I observe these 4 vectors, whose norms are bounded. If I pull this arm here, this contextualized vector, I will observe its scalar product with θ*, roughly this quantity here, plus some noise: centered Gaussian noise, let's say, or sub-Gaussian noise.

For now I've made no assumptions on the action sets A_t: each is an arbitrary set of vectors. One simple example is when there are only 2 actions, action 1 is the canonical vector e_1 and action 2 is the canonical vector e_2: whenever you pull action 1 you observe θ*_1 plus noise, and when you pull action 2 you observe θ*_2 plus noise. That's a two-armed bandit, exactly as you saw yesterday. So my problem is a generalization of the K-armed bandit, with more actions in a structured action space where actions can be correlated; it's exactly a modeling of correlated actions.

But there are other possibilities. There are basically three big families of action sets. The first, maybe the most general, is when the action set changes arbitrarily at every round: like I said before, it comes out of some neural network you don't control, some black-box embedding, and you have no control over these action sets; we'd call this the finite-armed contextual bandit. The second is when you have a fixed set of arms that never changes: A_t is the exact same set of vectors at every round, so from the beginning you know what your action set will be and you can plan your pulls to optimize the information you get. This is really different from the first setting, where at every round you have to deal with whatever you're given and the environment could be arbitrarily nasty: it could show you actions that don't bring you enough information. We'll see that this makes the first problem in general harder than the second. The third family is when not only is the set of vectors fixed, but the vectors are orthogonal to each other, and then we're back to the K-armed bandit. We'll mostly talk about cases (a) and (b), because you already saw case (c) in the past classes, in Emilie's class.

So we're going to build on what you've seen so far: optimism. What does optimism tell us? At round t, every time you have to take an action, you should choose the best action in the best plausible environment. We'll build plausible environments through confidence bounds; for now I'm talking about the frequentist algorithm based on UCB. The plausible environments will be confidence ellipsoids: convex regions around an estimator, whose probability of not containing the true parameter is very small. Just like yesterday, except that now we're in a d-dimensional space and we have information in all directions, so our confidence bounds have to cover that whole space. I'll show you how to build these confidence regions. And then how do we act? We act optimistically: for any action a, we find the θ̃ that maximizes the scalar product ⟨a, θ⟩ within the confidence region, which gives the maximal plausible value for that action, and then we take the action a with the largest optimistic value. That's it, that's the algorithm. You're still missing how to construct the region and how to compute these things, but that's the overall idea.
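Written out in symbols (a reconstruction of the rule just stated, with C_t the confidence region built next and A_t the round's action set):

$$
\tilde{\theta}_t(a) \in \arg\max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle,
\qquad
a_t \in \arg\max_{a \in \mathcal{A}_t} \langle a, \tilde{\theta}_t(a) \rangle
    = \arg\max_{a \in \mathcal{A}_t} \, \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle .
$$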
Yes, there are questions? Does the θ̃_t depend on a? It does: here I build an optimistic value separately for each action a. And why this ellipsoid geometrical shape, is it just for illustration? No, it actually does have the shape of an ellipsoid, and that's a good question; hold on, we're going to construct it in a minute. Basically it's an ellipsoid whose center is an estimator of the true parameter, which I'm going to build with you, and it has a radius which is directional: it depends on where you look. That's an important point: it should be non-uniform. Maybe I could have drawn this even more skewed, but it is a little wider this way than that way, or at least that's how I had planned to draw it. It should be non-uniform, and that's a very important point, so good question: this is the direction of the best arm, so if your algorithm works, by design it's going to pull many more actions pointing in this direction than in that one, and your uncertainty in this direction should correspondingly shrink much more than in that one. Your ellipsoid should be, let's say, ill-conditioned; I don't know how else to say it.

Okay, so let's build this; we've had very good questions, so that's a nice transition to the next step. To build this we have a very good tool called linear regression; I think most of you know about it. Here I'll do regularized linear regression. You see scalar products of the parameter vector in different directions of the space: at a certain round t you have a collection of feature vectors a_s and labels r_s, the observations, and you want to minimize the squared loss with an L2 regularizer, for some parameter λ:

θ̂_t = argmin_θ Σ_{s<t} (r_s − ⟨a_s, θ⟩)² + λ‖θ‖².

If you've never done it, it's worth computing once: you differentiate with respect to θ, set the gradient to zero, and you obtain a closed-form solution:

θ̂_t = V_t(λ)^{-1} Σ_{s<t} r_s a_s, where V_t(λ) = λI + Σ_{s<t} a_s a_sᵀ.

So θ̂ is an inverse matrix times a weighted sum of the actions you've taken, and V_t(λ) is what we call the design matrix, or covariance matrix: we stack and sum the d-by-d outer products a_s a_sᵀ, plus a regularization term on the diagonal that makes everything nicely invertible. It's a symmetric matrix, so it also has the nice property of defining a scalar product, if you want. This gives us, for any given sequence of actions, an estimator of the true value of θ*, with some uncertainty in the different directions of the space.
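Here is a minimal numpy version of that closed form, as a sketch (variable names are mine, not from the slides):

```python
import numpy as np

def ridge_estimator(actions, rewards, lam=1.0):
    """Regularized least squares.

    actions: (t, d) array stacking the pulled vectors a_s.
    rewards: (t,) array of observed rewards r_s.
    Returns theta_hat = V^{-1} sum_s r_s a_s and the design matrix
    V = lam * I + sum_s a_s a_s^T."""
    d = actions.shape[1]
    V = lam * np.eye(d) + actions.T @ actions
    b = actions.T @ rewards
    theta_hat = np.linalg.solve(V, b)  # solve the linear system rather than forming V^{-1}
    return theta_hat, V
```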
The problem is that we want to control this uncertainty, the uncertainty of θ̂ around θ*, and we have an obstacle: this design matrix is random, because at every round I pull an action that depends on everything I've seen and done before. There's a chain of dependencies between the actions in my design matrix, and that blocks me from using the usual nice IID theorems; I have to work harder to get the concentration inequalities. This is where I'm only going to sketch the ideas behind the confidence bound.

Like I said earlier, we're going to define confidence ellipsoids. For any vector v and symmetric positive definite matrix M, define the generalized norm

‖v‖_M = √(vᵀ M v).

This gives a way of measuring the norm of a vector directionally. Think of M as a diagonal matrix: if it has a large eigenvalue in one direction, it blows up the contribution of v in that direction, and if it has a small eigenvalue in another, it shrinks it. So this is a norm that depends on the shape of the symmetric positive definite matrix. The way I'll use it is to consider all the vectors within a certain distance of the estimator, where the distance is measured not by the usual scalar product but according to the eigenvalues of V_t(λ), the design matrix that accumulates the actions in the different directions. V_t(λ) starts with λ in all directions, so it's isotropic, and then if you pull more actions in one direction, the eigenvalues of V_t in that direction grow. And I just require that the squared distance in this norm is bounded by β²: I consider all the θ within radius β,

C_t = { θ : ‖θ − θ̂_t‖_{V_t(λ)} ≤ β_t }.

Close enough to θ*, yes? θ̂, sorry, you're right: what you know at time t is θ̂, so you pick the center of the ball there and consider all the points close to it. Good point, there is no star here, thank you; forget the star: I consider all the θ at some distance of θ̂. And what the theorem claims is that θ* is, with high probability, within this ellipsoid, within this distance of θ̂. Thank you, yes.

Yes, there's a question over there: how do the features of the context enter this reduction to a linear regression problem? So the question is: θ* does not depend on the context, so where did we lose the dependency of the reward function on the context? We lost it, or rather we modeled it, in the embedding. You can see it here: we said the reward is the scalar product with a θ* that is unknown and fixed, and the point is that I want to estimate it across contexts and contextualized action sets: I want every context to feed into my estimate of θ*.
You can think of it like supervised learning: you have images and a function that gives you the label of an image, and this function is parametrized by θ, a set of weights for your neural network. When you want to learn θ, you take all your images and their labels, and you learn this one parameter that controls the black-box function mapping features to labels. Here it's the same: contexts and actions give us features, I observe rewards, and I assume these rewards are a linear function of the features. Right, so our action space kind of changed; we're not talking about the same action space anymore, we added the context to it? Yes, exactly; that's really a good question. The actions change, we have contextualized action sets, and that's what I tried to say earlier: having contexts and actions, with rewards depending on both, is equivalent, because I don't want to make assumptions about the representer, the embedding. Effectively, for me as a decision maker, it's as if I were seeing contextualized action sets.

Okay. So all I see as a learner is a sequence of action vectors and a sequence of rewards; these action vectors are actions I've taken, but they are the only thing I have to estimate θ, so I do linear regression to obtain θ̂. And then, without the star this time, I have this confidence ellipsoid, and one can prove (that's the chapter of the Bandit Algorithms book on confidence bounds for least-squares estimators) that this ellipsoid contains the true parameter with high probability, uniformly in time and for any sequence of actions.
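For reference, the standard form of this radius, the self-normalized bound of Abbasi-Yadkori, Pál and Szepesvári (2011), quoted here from memory rather than from the slides, with σ the sub-Gaussian noise parameter, S ≥ ‖θ*‖ and λ the regularization:

$$
\mathbb{P}\!\left(\exists t \ge 1 : \|\hat{\theta}_t - \theta^*\|_{V_t(\lambda)} > \beta_t(\delta)\right) \le \delta,
\qquad
\beta_t(\delta) = \sigma \sqrt{2 \log\frac{1}{\delta} + \log\frac{\det V_t(\lambda)}{\lambda^d}} \;+\; \sqrt{\lambda}\, S ,
$$

where the log-determinant term is the one controlled by the elliptical potential lemma discussed below.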
Yes? What is S in the first line? Good question: S is the bound on the true norm of θ*. Like I said, the actions are bounded, and the norm of θ* is at most S. And perhaps what's not visible here is the noise: I said the noise is sub-Gaussian; in general it would be σ-sub-Gaussian and the σ would appear in the radius; here σ is one, so it's hidden. In bandits, when you write UCB, the standard deviation σ of the noise sits in front of the square root and stands for the amount of noise in your observations; it would also appear here.

Would we assume we know S? Yes, good question: for now S is a given parameter; my parameter vector is bounded and I know the upper bound S, and I also know σ, the standard deviation of the reward noise. It's true that this might be a strong assumption. I'm not sure who has played around with estimating S; does anyone know about this? You should know an upper bound on your rewards, yes, but could you estimate it? Could you say, with high probability, since you actually observe all the action sets, K vectors in dimension d at every round, that things are bounded, or something like that? Open question, I don't know. Emilie, do you know about this?

(Emilie:) I don't know about this, but when I implement this algorithm with my students on toy datasets, we actually play a lot with the β: we totally remove S, we replace the square root by just √(log(t/δ)), and often it still works. I guess practitioners play a lot with the β they use anyway. So maybe sometimes they're lucky; that's also true. Just to repeat, because the sound went in and out: it's true that these βs that control the radius of the ellipsoids are nice for the theory, they allow us to prove the theorem, but in practice they're often a bit conservative, and if you really want to implement a LinUCB that works, most people remove some terms, play around, or add a scaling factor in front so that things work better. So yes, the dependence on S is a very interesting open question in theory; in practice it might not make a huge difference.

Okay, so now I need to implement LinUCB with this. First, the β has a log-determinant of V_t(λ)/λ in it, so I'd need to compute a determinant at every iteration. That's nice in the sense that the radius adapts exactly to the action sequence, but what makes the algorithm implementable in the first place is to upper-bound this quantity, using a clever result called the elliptical potential lemma, known in online learning for quite a long time (we've even studied the tightness of this lemma in a little arXiv note). The way it works is simply to upper-bound the log-determinant via small algebraic manipulations of the actions inside V_t, and you obtain something of order d·log(t). That gives you a β⁺ you can actually compute at every round, and you can even precompute everything; it's no longer adaptive to the action sequence, but it's close enough and relatively tight.
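The lemma, in its usual form (stated from memory, for actions with norms bounded by L):

$$
\log\frac{\det V_n(\lambda)}{\lambda^d}
  = \sum_{t=1}^{n} \log\!\left(1 + \|a_t\|^2_{V_{t-1}(\lambda)^{-1}}\right)
  \;\le\; d \log\!\left(1 + \frac{n L^2}{d\,\lambda}\right),
$$

where the equality is a telescoping application of the matrix determinant lemma and the inequality follows from bounding the determinant by the trace.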
And now I can compute LinUCB. Again (I made the same mistake here, it's a copy-paste): I have a new, slightly larger ellipsoid controlling the distance between my estimator θ̂ and any plausible vector, and for each action in my action set I compute an optimistic scalar product by maximizing a linear function over all θ within some distance of my estimator. That's maximizing a linear function over a convex set, so we know how to do it. I see you're dubious; okay, I agree that in general we don't always know, but it so happens that for this problem you can write down the Lagrangian, do your job with Boyd's convex optimization book on the side, follow the steps, and it spits out the exact solution in closed form. Usually when I give this course I have a slide showing how to solve it, but I removed it because it's early in the morning; it's a nice little convex optimization exercise.

And what do we see? LinUCB maximizes the greedy scalar product plus β times the radius of the ellipsoid in the direction of the action, that is, the V_t(λ)^{-1}-norm of the action:

a_t = argmax_{a ∈ A_t} ⟨a, θ̂_t⟩ + β_t ‖a‖_{V_t(λ)^{-1}}.

We compute this index for each action in the action set. And here I need to invert the V_t(λ) matrix at every round, which is a little painful; in practice you can use the nice Sherman–Morrison formula, which updates the inverse directly when you do a rank-one update. That's the practical trick, but it's still a lot of matrix multiplications to compute this norm at every round, so computationally, especially in high dimension, this poses some problems, even if I can still call it computationally efficient.

Yes, very good question: do I assume the action set is finite and small? Yes: K is roughly of the order of d, not much bigger, and that's what makes the problem interesting, because the actions are correlated; in my little example in dimension 2, if you pull this action you get information about all of them. But K is small, and if you make K very large it becomes painful to compute this index for every action all the time. If K is infinite, I actually don't know an algorithm that optimizes this, and unless I'm mistaken it's an open problem: we don't know a polynomial-time algorithm that solves it when the action set is infinite or continuous, say some region of R^d. So yes, importantly, the action set here is finite. And the capital L here? L is the bound on the norms of the actions; S is the bound on the norm of θ. Usually S and L play symmetric roles; I took the notation from the bandit book, where they use two different letters, but in most bandit papers both are just one. More questions?

Okay, so we have a nice algorithm, and we can prove regret bounds. The proof works very similarly to UCB: there's the UCB trick, and then you upper-bound. The UCB trick: the regret is the sum of instantaneous regrets, the difference between the true value of the best action and the true value of the action you took. But the true value of the best action is bounded, with high probability, by its UCB, which in turn is bounded, by definition of the algorithm, by the UCB of the action you pulled, because you pulled the action with the largest UCB. And that UCB equals a scalar product with some vector θ̃, which is close to the true θ. So I upper-bound the instantaneous regret by a new scalar product obtained by manipulating these relations, then apply a form of Cauchy–Schwarz, inserting the V_t matrices inside the scalar product, and I obtain essentially the radius: 2β_t times the ‖a_t‖_{V_t^{-1}}-norm of the action. Then you sum: the βs on one side, and the sum of the action norms on the other, which is where the elliptical potential lemma comes in again. I wanted to spare you the calculations here.
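Putting the pieces together, here is a compact sketch of the whole loop (again mine, reusing the hypothetical `observe_action_set` and `pull` helpers from the first snippet). It maintains V^{-1} with the Sherman–Morrison rank-one update and log det V with the matrix determinant lemma, so nothing is re-inverted from scratch; the β is the theoretical radius which, as discussed, practitioners often shrink.

```python
import numpy as np

def linucb(observe_action_set, pull, d, n,
           lam=1.0, delta=0.01, sigma=0.1, S=1.0):
    """Sketch of LinUCB: pick argmax_a <a, theta_hat> + beta * ||a||_{V^{-1}}."""
    V_inv = np.eye(d) / lam       # V^{-1}, kept up to date by Sherman-Morrison
    b = np.zeros(d)               # running sum of r_s * a_s
    logdet = d * np.log(lam)      # log det V, kept up to date
    for t in range(n):
        theta_hat = V_inv @ b
        # theoretical radius: sigma * sqrt(2 log(1/delta) + log det(V/lam)) + sqrt(lam)*S
        beta = sigma * np.sqrt(2 * np.log(1 / delta)
                               + logdet - d * np.log(lam)) + np.sqrt(lam) * S
        A_t = observe_action_set()
        # directional widths ||a||_{V^{-1}} for all candidate actions at once
        widths = np.sqrt(np.einsum("kd,de,ke->k", A_t, V_inv, A_t))
        a = A_t[np.argmax(A_t @ theta_hat + beta * widths)]
        r = pull(a)
        # rank-one updates of V^{-1} (Sherman-Morrison) and of log det V
        Va = V_inv @ a
        s = 1.0 + a @ Va
        V_inv -= np.outer(Va, Va) / s
        logdet += np.log(s)
        b += r * a
    return V_inv @ b              # final estimate of theta*
```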
The key take-home message of this regret bound is that the regret is bounded by d√n as the leading term, plus lower-order terms in √(nd). This bound holds with probability 1 − δ: I set some δ a priori in my algorithm and the bound depends on δ, but you can set δ to something like 1/t or 1/n and obtain a regret bound in expectation.

A few comments. First, this proof can be kernelized quite easily, and you obtain Gaussian-process bandits, the GP-UCB algorithm. Second, this d√n dependency holds for any action sets; it's a minimax regret bound: if you don't want to make any assumption on the action sets coming at you, this is what you get in general. But in practice there are action sets that we know have smaller regret. Can you think of one, a problem where the minimax regret is smaller than this? Yes: the standard bandit problem, K arms in dimension K.

Okay, let's restart slowly. First of all, I realize I went relatively fast, so if there are questions, don't hesitate; we have time to go back over things. I wanted to demonstrate how the linear bandit algorithm, this UCB, works. Here we have a problem in dimension 2; the true mean θ* is the black dot on the diagonal, and 19 time steps have already passed. The actions are the green dots; there are eight of them around, and their norms are bounded as well. (I'm just going to give everyone a minute to sit down. You haven't missed anything; the movie hasn't started yet, these are still the ads.)

So this is the more elaborate version of the little drawings I had in my slides earlier, except now things are going to move. My estimator after 19 time steps is this red dot; and full credit to Tor Lattimore, who made this video, I haven't done anything. The ellipsoid is centered around the red dot and fairly round at the beginning, because UCB tends to take near-uniform actions early on, and there is the regularization, so we start with this. The chosen arm is going to be circled in red. I'm giving you a lot of heads-up because things go quite fast, so be ready. I can also replay it, or try to play it slower; maybe 0.75 speed will be enough. (Seriously, why am I connected though? Let me just redo this. Yeah, let's go. Really?)
And now it's here; how can I send it to the other screen? Ah yes, I need to do this first and then drag it over. Intense. Let's just make it bigger; I feel like a grandma. Alright, now it should work, and if I can make it slower, buckle up, it's going to go fast.

So the time steps are running here, and you see the chosen arm circled in red. Look at the pattern: it's almost always choosing the optimal action, the one aligned with θ*, with the largest reward, and once in a while it goes and picks an action that's roughly orthogonal to it. Let's replay it; there are a few other things to look at. Look at the ellipsoid's shape: as we anticipated earlier, it really gets ill-conditioned, stretched along the direction of largest uncertainty, which is also the direction of lowest reward. By design, this algorithm tends to produce an ellipsoid with this shape. Maybe one last time. And the regret grows very slowly: we had 11 after the first 19 time steps, and now we accumulate regret extremely slowly. This is also because the actions here are fixed (no changing action sets) and they nicely and regularly span the space with large norms, so this problem is not the worst case; the regret doesn't grow that fast.

Yes? Right, so you know how in UCB (did you show the UCB video? okay) the confidence interval of the best arm becomes very, very small, while for the worse arms the confidence intervals stay large and their upper bounds tend to align with the value of the best arm. The intuition is the same here, except that now we don't have just two orthogonal actions; we have a bunch of actions, and our best action is the one pointing in this direction, so the radius of the ellipsoid in this direction is expected to become small. The exploration bonus is just the norm of the action, computed by weighting the coordinates of the vector according to how much information you have in each direction. I'm trying not to rely too much on the formula, but eventually that's how things work.

So, what matters in the UCB index is the norm of the action when you weight all its coordinates according to the eigenvalues of this matrix, an inverse covariance matrix. Let's walk through it slowly; assume the matrix is diagonal. Every time you pull an action in direction i, the canonical vector e_i, the outer product a_s a_sᵀ is just a big matrix with a 1 at the (i, i) entry, so it increases the corresponding eigenvalue of V_t by one. So the eigenvalues of this sum of a_s a_sᵀ count the amount of information, the number of samples, that I have along the d directions of the eigenbasis of the space. The eigenvalues of V_t can be understood as the pull counts N_a(t), the number of times you pulled action a.
Except that here there are many actions and they span R^d, so I don't count how many times I pulled a particular action, but how much information my pulls gave me in the various directions of the space. That's the rough intuition: the eigenvalues of this matrix count my actions, and when I invert the matrix I get 1/N terms; and since the bonus is a norm, it's really 1/√N. This is a physicist's analysis of what's happening, but roughly, with β_t of order √(d log t), the bonus of an action a is homogeneous to √(d log t)/√(N_a): N_a being the number of times you pulled this action, or more exactly the amount of information you have about it. And what happens in the video is that the ellipsoid shrinks in the directions where the covariance matrix has large eigenvalues and stays wider where it has small ones. Does that make sense? It's a hand-wavy explanation, more of a visual thing, but it also writes down quite well: if you write down the optimization problem, you can see how things play out.

Okay, so that was LinUCB, and we're left with this question: when you have K arms orthogonal to each other, plain UCB gets a minimax regret of √(Kn), while here I have d√n, which seems larger. Long story short: the K-armed bandit is actually a bit simpler than having arbitrary action sets popping up under the sole constraint that the actions are bounded. With arbitrary action sets you have to be more cautious, but with a finite fixed action set you can exploit the structure better and gain, getting a √(dn)-type regret. We just saw the video; before going into fixed action sets, let me first go through Thompson sampling, to give you an alternative to LinUCB.

You saw Thompson sampling yesterday, for the usual K-armed bandits, and again it works very similarly for linear bandits, except that now we have to sample from a d-dimensional posterior, which in some cases might not be easy. But you can make nice assumptions: for example, take a Gaussian prior and exploit Bayesian linear regression to obtain a Gaussian posterior. Same setting as before (I have an action set; it doesn't even have to be fixed), but now I assume the noise is Gaussian rather than just sub-Gaussian, so I can use Gaussian prior-posterior updates. The algorithm is very simple: you give it a Gaussian prior, and at every round you run one update of Bayesian linear regression, the usual thing, extremely similar to what we've done so far; you sample a θ_t from the resulting Gaussian posterior, and you choose the action with the largest scalar product with the sampled parameter. It's very similar to what we've seen here: you can view the level sets of the posterior distribution as the ellipsoid from before; imagine a bell-shaped curve, and a fixed-probability level line of that bell gives you an ellipsoid.
Instead of maximizing a scalar product over that region, you sample a θ and take the action with the largest scalar product with the sample. You should imagine a high probability of sampling θ near the posterior mean, and smaller probabilities in the further-away regions of the ellipsoid, but it's really the same idea: you pull the action maximizing the scalar product.

So, a very simple algorithm. The main advantage is that you don't have to compute the ‖a‖_{V^{-1}}-norms of the actions, which is really nice; but you do have to sample from a posterior, which works beautifully in the Gaussian case but might not be easy in many other models, so in more complex models people rely heavily on machinery like conjugate priors, Laplace approximations, and so on.

There are different ways of analyzing this. You can do a frequentist analysis, but you can also study the Bayesian regret, where the Bayesian regret adds an expectation over the prior you gave your algorithm. This says: I know my parameter θ was sampled from some prior Q, I feed my algorithm this prior Q, it updates its beliefs using the observations to estimate θ, and now I ask, on average over all the potential environments drawn from this prior, what is the regret? It's another way of studying bandit algorithms, and for Thompson sampling it feels like the natural one; you get something like d√n. For the frequentist analysis, which you can also do for linear Thompson sampling, you get d^{3/2}·√n (under boundedness assumptions on the noise): this d^{3/2} adds a √d compared to the minimax d√n regret of LinUCB. And it's not known whether this is an artifact of the proof or something that conceptually belongs to the nature of Thompson sampling, that its frequentist minimax regret is genuinely larger than LinUCB's. I don't think anyone has exhibited a problem forcing it: there's no construction of a problem on which Thompson sampling provably suffers d^{3/2}·√n regret, so I think it's still an open question whether this d^{3/2} is real.

So that's Thompson sampling: a very natural alternative to LinUCB to implement. I wouldn't call it optimism, but it puts the uncertainty you have around the estimator of the environment parameter directly into the algorithm.
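For completeness, a sketch of the linear Thompson sampling loop under the Gaussian model just described (same hypothetical helpers as before; assuming a prior N(0, (σ²/λ)I) and Gaussian noise, the posterior after the updates below is exactly N(V^{-1}b, σ²V^{-1})):

```python
import numpy as np

def linear_ts(observe_action_set, pull, d, n, lam=1.0, sigma=0.1, seed=0):
    """Sketch of linear Thompson sampling with Gaussian prior and noise."""
    rng = np.random.default_rng(seed)
    V_inv = np.eye(d) / lam       # posterior covariance is sigma^2 * V^{-1}
    b = np.zeros(d)               # running sum of r_s * a_s
    for t in range(n):
        theta_hat = V_inv @ b     # posterior mean
        cov = sigma**2 * V_inv
        cov = (cov + cov.T) / 2.0 # guard against tiny numerical asymmetries
        theta_tilde = rng.multivariate_normal(theta_hat, cov)  # one posterior sample
        A_t = observe_action_set()
        a = A_t[np.argmax(A_t @ theta_tilde)]   # greedy w.r.t. the sample
        r = pull(a)
        Va = V_inv @ a            # Sherman-Morrison rank-one update of V^{-1}
        V_inv -= np.outer(Va, Va) / (1.0 + a @ Va)
        b += r * a
    return V_inv @ b
```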
Yes? Are there assumptions on the prior? There are: there are assumptions on the coverage, on the kind of coverage of the true parameter by the prior. What is it again... that's a good question, actually. They are not very strong, but there are assumptions on this prior, and I'm blanking, sorry. Does anyone know the typical assumption on the prior here? It's a technical thing: essentially, the way the proof works is that you manipulate the different expectations, conditioning carefully, so as to make a UCB appear; you bring back the proof of optimism, saying that with high probability the sample is smaller than the UCB, then plug in the UCB analysis, average over everything, and get the same rate. And for swapping the expectations, the integrals and the conditioning, you need some assumptions on the prior; that's the basic idea, but I'm not exactly sure what the key condition is. The prior is not going to decay, but the posterior for sure concentrates; there's concentration of measure, and I'm not sure exactly what you need for that concentration argument to go through, but you're right, it's something like this: you want to be able to apply that kind of concentration argument eventually, which will work because the noise is Gaussian. I'm blanking on the exact condition.

Okay, yes? So the question is whether this relates to typical techniques in active learning, where you get a regret of √(dn) instead of d√n. I'm not sure about the first part; I mean, 100%, this type of bandit algorithm is very strongly connected to what people do in active learning and Bayesian optimization, GP-UCB and so on; all these algorithms share a lot of design similarities. On the regret: to obtain √(dn), I would assume they make some additional assumption on the action space, something fixed, or with a fixed distribution. If you say your action set is sampled from some distribution with a well-conditioned covariance matrix, with nice eigenvalues, that also gives you √(dn)-type bounds. I can't answer precisely because I'm not sure which algorithm you're referring to, but it's very likely the two things are linked.

And that connects quite well to what's next: what happens when you have a fixed action set? You have K actions, fixed from the beginning; you actually know what they are, and they span the space nicely, meaning you know the smallest eigenvalue of their covariance; it's part of the difficulty of your problem, but it's not arbitrarily small, not worst-case small. For these finite action sets, there is a paper by Tor and Csaba from 2017 called "The End of Optimism?" (always hard to google: when you search for "the end of optimism" you find a lot of other things). What does this paper say? It says: if I have a finite action set, I can design a hard problem on which LinUCB has an arbitrarily large regret, when you look at the regret in a problem-dependent way.
I haven't really introduced this problem-dependent version of the regret, but you saw it yesterday: the regret in log(T) over the gaps, those kinds of expressions. You can give the exact same definition of problem-dependent regret when you have a fixed action set, because now the gaps mean something: there's a fixed gap between the best action and the second best, it exists, whereas with arbitrary action sets the gaps change all the time. So with a fixed problem I have gaps, I can define a problem-dependent regret, and they show that there exists a problem with a fixed action set on which you can make LinUCB's regret explode by varying a small parameter.

So the question they ask is: what is a good algorithm to explore the space and minimize the regret, especially this problem-dependent regret, when I have a fixed action set, given that LinUCB fails? The idea is to use a tool from statistics called optimal experimental design. Optimal design means: given this fixed action set, find the proportions in which to pull each action so that the resulting ellipsoid has the best determinant, the smallest volume. You optimize the information in all directions so that you can sequentially eliminate actions. It's the same idea as phased elimination in UCB, except now the actions point in several directions of the space. You can think of it as pulling actions in a uniform way, but it's not literally uniform: I may have three actions pointing this way and one action pointing that way, so to get a uniform ellipsoid I need to pull them in the right proportions, so that the overall uncertainty is nice and uniform in all directions, and then I can eliminate actions in phases. That's the type of algorithm they propose (in that exact paper they propose something a bit more complicated, with tracking of lower bounds and more advanced ideas, but the principle of this family of methods is to optimize the design, the action choices, so that the ellipsoid does not become ill-conditioned by design but instead controls the uncertainty in all directions, allowing sequential elimination). If you do this, optimal design plus phased elimination, then with a fixed action set you can bound the problem-dependent regret in any problem (which I didn't write, but it's what they prove in the paper), and in the worst case you also get a √(n·d·log K) regret, so that's good.
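To give a flavour of the optimal-design step, here is a small sketch (my own illustration, not their exact algorithm, which additionally tracks lower bounds): a Frank-Wolfe iteration for the D-optimal design, which by the Kiefer-Wolfowitz theorem also minimizes the worst directional uncertainty max_a ‖a‖²_{V(π)^{-1}}. A phased-elimination algorithm then pulls arms in these proportions for a phase, eliminates the arms whose estimated gap exceeds the phase's confidence width, recomputes the design on the survivors, and repeats with a halved width.

```python
import numpy as np

def g_optimal_design(A, iters=500):
    """Frank-Wolfe for the D-optimal design on a fixed arm set A (K x d).

    Returns proportions pi such that pulling arm k a fraction pi[k] of the
    time makes the uncertainty ellipsoid as uniform as possible, i.e. it
    minimizes max_k ||a_k||^2_{V(pi)^{-1}} (the optimum equals d).
    Assumes the arms span R^d, with d >= 2."""
    K, d = A.shape
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        V = A.T @ (pi[:, None] * A)                           # V(pi) = sum_k pi_k a_k a_k^T
        w = np.einsum("kd,de,ke->k", A, np.linalg.inv(V), A)  # ||a_k||^2_{V^{-1}}
        k = int(np.argmax(w))                                 # most uncertain arm
        gamma = (w[k] / d - 1.0) / (w[k] - 1.0)               # exact line-search step
        pi *= 1.0 - gamma
        pi[k] += gamma
    return pi
```

With orthonormal arms this recovers the uniform allocation, which matches the K-armed intuition of pulling every arm equally before eliminating.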
The problem is that you cannot apply these techniques to an arbitrarily changing action set, and for a long time there was no single algorithm that, instantiated on a fixed finite action set, gets the √(nd)-type regret, and instantiated on arbitrary action sets, gets the d√n regret. It's not exactly a best-of-both-worlds question, but there was no algorithm that navigates these two settings seamlessly. That was an open question for a while, and the answer is yes, such an algorithm exists: there was a first work by Tirinzoni et al. in 2020, and just after, Johannes Kirschner, Tor Lattimore, Csaba and I worked on an information-directed sampling version. Both algorithms solve this problem: they propose an approach that, instantiated in either case, gets the best minimax regret. Empirically, the primal-dual algorithm is not so good and needs a lot of hand-tuning; our algorithm is slightly easier to tune; eventually the results are similar. So this problem is no longer really open; some corner cases could still be cut, some second-order terms, but okay.

I think we're reaching the end, actually quite well on time. We had a problem, a structured action space with different customers and different movies; now we have a way to encode these contexts and actions into contextualized action sets, and, based on a linear assumption on the reward, we can learn the reward function across actions and act to minimize regret. We have regret bounds and, up to second-order terms, minimax optimality for almost all the algorithms, except the open problem that remains. And there are still open problems in this field. First, Thompson sampling and the d^{3/2}. Then lots of questions beyond linear bandits, with harder models: even logistic bandits have posed a lot of problems in practice and in theory (there are recent papers by Faury et al. on generalized linear models); neural bandits, where we have some results but not that many; sparse models, where, as I said, the d-dimensional space is not fully covered, θ* has a lot of zeros; Botao Hao and Tor have papers on this. I also recommend the recent tutorial by Dylan Foster on contextual bandits, which goes beyond the linear model and studies, in general, the sample complexity of decision making; a really good line of work. So there are active research topics, from more theoretical questions to more concrete applied ones: what happens in non-stationary environments, sparse cases, and, in Dylan's line of work, the sample complexity of interactive decision making. Even if there's a lot we understand, there are still very exciting areas of research for you to dig into. I thank you for your attention and wish you a good coffee break.

I think we have time for a couple of questions; sure, if there are burning questions I'm very happy to take them. Right: so far we've seen a fixed action space, but in some cases, say with movies, new movies appear, or some content is only shown for a given period; how can we use this approach when the action space also changes over time? Actually quite naturally: as long as your movies, or whatever new items in the catalogue, can be embedded in the same space as the previous ones (this technical thing, let me see, this φ here), as long as you can apply it to your new customer and your new action, they look the same to your machine. As long as you can represent and embed the new item, it just gives you a new d-dimensional vector that will appear in your action set at the next round. I'm assuming the number of actions remains bounded; I've always said there are K actions, but these actions can change.
You can think of it this way; that's actually how most real-world systems work. There are customers, there's the entire catalogue, and a first classifier cuts off most of the catalogue and decides on the K most likely items: a set of K vectors from your catalogue that are a bit diverse but things you usually like. Then there's a bandit algorithm on top that selects among these K actions the one to present; most of the time it's actually a ranking bandit algorithm that selects the first, the second, the third, and that makes your YouTube watch-next recommendations. I'm not disclosing any secret there: how would you build such a system? You'd do this. You have an essentially infinite number of videos; you first have classifiers, black-box things your teams of engineers have built, that remove things; and then you still have to make a choice, and that's where it happens. So adding and removing things is fine, to the extent that you need to filter out some of the actions at some point; how you filter is kind of beyond the scope (and it's actually updated all the time). There's no a priori hierarchy, but you could say that some profiles of customers never watch some types of videos, so those are never given to the bandit algorithm to make decisions with; they're not interesting for them. In a sense you're saying: there is a very large action space, and I'm going to cut everything except, say, ten candidates; I cut all the actions that are not interesting for my customer right now. That kind of idea.

More questions? If not, I'll be around until Wednesday, so you can come and ask your questions and say hi. Any last-minute thoughts? No? Then I think it can be coffee time. Cheers.