So, welcome everyone to today's talk by Sébastien Bubeck from Microsoft. Before we begin, I should, as usual, thank all the organizers, especially since today marks the end of the fifth year of TCS+, so time flies. It's nice to be here after five years of TCS+, especially with this great talk. So, I should thank all my fellow co-organizers: Anindya De, Clément Canonne, Gautam Kamath, Thomas Vidick, and also Thomas Hollensang, who was helping us in the first few years. And, I mean, Ilya Razenshteyn, who's here with us today. So, I thank all the co-organizers for their help in making TCS+ possible. Maybe before we continue, let's quickly see who's here with us today. So, we have Benjamin Miller with the group from UW. Hello. We have Budima from EPFL. Hello everyone there. Hi. We have Kupjin with the group from the University of Michigan. Hello. We have David Weitz. Hello from CMU. We have Guy Evin from Tel Aviv. Hello. Good night. Thanks for joining. Ilya Razenshteyn and a group here from Columbia, not far from me. We have Mark Sellke from Cambridge University. No, Mark, I can't see you, but I believe that you are there. Ray Lee from Stanford. The group from Stanford, hi everyone. And Shavasrao here, a few floors above me. Hi everyone. Good to see you here. And finally, Thomas Vidick from Caltech. Okay, that's it for now. Well, yeah. So, somebody joined later. Okay. So, I'm happy to finally present today's speaker, Sébastien Bubeck. He's from Microsoft Research, Redmond. Sébastien did his PhD in France, in Lille, at INRIA. Then he was an assistant professor at Princeton for a few years and then moved to Microsoft Research. His research is mainly in machine learning and convex optimization, and he's interested in bandits. He has won numerous awards, including Best Paper at COLT 2016, Best Student Paper at COLT 2009, as well as a Sloan Research Fellowship in 2015. So, thank you very much for agreeing to give this talk, Sébastien.
And please go ahead. Welcome. Thanks for that. So, I'm really happy to be giving this talk on our progress on the k-server problem. This is joint work with really fantastic collaborators: Michael Cohen, James Lee, Yin Tat Lee, and Aleksander Mądry. And of course, this talk, and the paper more generally, is dedicated to the memory of Michael Cohen. So, I will just start by telling you what the k-server problem is. And then, even though it doesn't really need any motivation, since it's such a classical problem, I will still try to put it in a broader context. And this will be useful, because the solution that we came up with comes from this broader context. Okay. So, what is k-server? It's a sequential game which is being played on a finite metric space. So, you have a finite metric space with n points, and you have a fleet of k servers. You're a player and you're controlling those k servers. So, they are at some locations in the metric space. Now, it's going to be a sequential game where at every time step there is a location in the metric space which is being requested. And what you have to do is choose which server you're going to use to service the request, meaning you're going to send that server to this location. And the price you pay is the distance that the server has moved. Your objective is to minimize the total amount of movement over, let's say, capital T requests. And what you compare yourself to is the best offline algorithm that knows all the requests in advance. And there is a purely information-theoretic question, which is whether, as an online algorithm, you have enough information to be competitive with respect to the offline algorithm. So, that's the question we're going to try to address. Now, as I said, I want to put this in a broader context, and this broader context is the one of online decision-making. In particular, I will briefly discuss two rather recent probability-free models.
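The protocol just described can be sketched in a few lines of Python. This is only an illustrative toy of mine; the line metric, the request sequence, and the greedy server choice are all my own assumptions, not the algorithm from the talk:

```python
def k_server_game(points, dist, requests, choose_server, k):
    """Play the k-server game: for each request not already covered,
    `choose_server` picks which server to move, and we pay the distance
    that server travels."""
    servers = list(points[:k])          # initial configuration: first k points
    total_cost = 0.0
    for r in requests:
        if r in servers:                # a server already sits on the request
            continue
        i = choose_server(servers, r)   # the online decision
        total_cost += dist(servers[i], r)
        servers[i] = r                  # move the chosen server there
    return total_cost

# A toy instance: points on a line, greedy = always move the closest server.
line_dist = lambda a, b: abs(a - b)
greedy = lambda servers, r: min(range(len(servers)),
                                key=lambda i: line_dist(servers[i], r))

cost = k_server_game(points=[0, 10], dist=line_dist,
                     requests=[1, 9, 1, 9], choose_server=greedy, k=2)
print(cost)  # 2.0: move 0 -> 1 and 10 -> 9, after which all requests are covered
```

Greedy is in fact known to be badly non-competitive in general; the point here is only the cost accounting of the game.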
So, online decision-making, you know, goes back at least to the '50s and even much before. Control theory is one big field that is interested in this. But recently there are two kinds of new models which do not make any statistical assumption about the data. And these are online learning and online algorithms. In this slide, I will try to describe both of them and draw some links between them. Okay, so the general context is like this. You have a set of actions or states, capital X. For us it will be finite, but it could be continuous. And you have a metric on it, d. Now, you're going to play a sequential game for capital T rounds. And there is a set of possible cost functions, script C, which are functions that map an action or a state to a real value. Okay, and this is the cost associated with a certain action or state. Now, in those two models, the way we're going to model the unknown environment that we're competing against, that we're playing in, is as a sequence of cost functions. So, we have c_1 up to c_T in this set script C. These are unknown cost functions and you're going to play against them. So, what do I mean by that? The protocol is like this. At every time step t, I, as the player, get to choose an action, or I get to choose which state I want to be in. And this choice is based on all the past information I have and possibly some information I have about the current time step. Once I have made this decision, I pay a cost, which is the cost of the possibly unknown function c_t in that state, c_t of x_t. And I may also pay a movement penalty for, let's say, changing state or changing action: the distance from x_{t-1} to x_t. So, in online learning, what we assume is that the cost is unknown at the time of decision. So, we have to choose an action without knowing what our cost will be for that action.
So, the way to think about it is, let's assume that capital X is, let's say, a set of weights for a neural network. And what comes at us is some image that we need to classify. So, given a certain set of parameters for the neural network and a given image, I'm going to make a certain classification. But when I see the image, I don't know whether it's going to be a correct classification or not. That is told to me only later. So, you're playing this game online. You have this stream of images that come at you, and you want to adapt the weights of your neural network online. And in learning, we assume that there is some good action, some good state that you want to be in, and the goal is to find it and just stay there. So, in that context, it makes sense to look at what is called the regret, which is how much you have paid, this sum of c_t of x_t, the total cost the player suffered, compared to the best you could have done in hindsight with a fixed action. So, again, in learning we assume in some sense that there is some good action to be found. This is completely different from the online algorithms setting, where the cost is revealed to you at decision time. And the way to think about it is that this cost represents a request. So, you have a request and you have to service this request. Now, of course, if you knew the cost in the online learning setting, you would just move to the minimum of this cost function. But in online algorithms, you also have a penalty for moving. Again, in online algorithms, it's more natural to think of a state space. And for changing states, you have to pay. And now, you don't want to say that there is maybe a good fixed action; this doesn't really make sense. What you want to compete against is really the best dynamic policy. So, you see here, you're competing against the minimum over x star in X to the T. So, the optimum can change at every time step.
And, of course, it's paying the cost at every time step plus a penalty for movement. And what you look at is the competitive ratio. You cannot hope to get a regret guarantee here; this would be too strong. But you can hope to say you're not paying more than, say, 10 times what OPT would be paying. Now, a lot is known about these settings, but in some sense, even more is unknown. So, a very basic question is: what is the optimal guarantee that you can get for a finite action or state space? This is very well understood in the online learning setting, but it's not understood in the online algorithms setting. By the way, the name for the online algorithms problem that I described here is metrical task systems. You think of the functions as tasks that are coming at you. X is a state space. And when you see a task, you get to move to a different state to complete this task, but you have to pay for changing your state. So, in MTS, we do not know what the optimal guarantee on a finite set is. It's conjectured that you can get log n, but the best that we know is log squared n times log log n. And I will briefly discuss this at the end. On the other hand, for online learning, we know exactly that the optimal guarantee is the square root of (T/2) times log n. Whether the log n in online learning is the same as the log n in online algorithms is not clear. I mean, it's not even clear what this sentence means. But in online learning, we not only have algorithmic proofs, we also have information-theoretic analyses where we don't even have to give an algorithm. We can just reason about how much information there is in the system, and that when you make a decision, you get some information about what is there. And you see exactly that the log n comes from some entropy-decreasing argument. And as far as I know, we have no idea how to do something like that for MTS in the online algorithms setting. Good.
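To make the two benchmarks concrete, here is a tiny numerical contrast. The numbers are my own toy example, with a uniform metric; it shows why competing with the best dynamic policy is much stronger than competing with the best fixed action:

```python
from itertools import product

# Toy numbers of my own: 3 actions/states, 3 rounds; costs[t][x] is the cost
# of action x at time t.  The cheap action rotates, so no fixed action is good.
costs = [[0.0, 5.0, 5.0],
         [5.0, 0.0, 5.0],
         [5.0, 5.0, 0.0]]
T, n = len(costs), 3

# Online-learning benchmark: the best *fixed* action in hindsight.
best_fixed = min(sum(costs[t][x] for t in range(T)) for x in range(n))

# Online-algorithms (MTS) benchmark: the best *dynamic* path, paying service
# costs plus movement (uniform metric: every state change costs 1).
def path_cost(path):
    service = sum(costs[t][path[t]] for t in range(T))
    movement = sum(1 for t in range(1, T) if path[t] != path[t - 1])
    return service + movement

opt = min(path_cost(p) for p in product(range(n), repeat=T))
print(best_fixed, opt)  # 10.0 2.0: the dynamic benchmark is far stronger here
```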
So, I won't say more about this until the very end. Now, one thing which is really interesting is to go beyond the unstructured case. What you want to say is: maybe you have a huge set of actions in online learning, or a huge state space in online algorithms, but there is some combinatorial structure in the cost functions. The cost functions are not arbitrary mappings from the states or actions to a real number. And again, in online learning, we understand a lot about this. In particular, we understand the importance of mirror descent in this setting. And as you will see, k-server is a combinatorially structured MTS. So, it's natural to try to understand whether mirror descent could also be helpful in this setting. And that's exactly what we're going to do. Sébastien? Yes. For those of us not too familiar with the context, can you say explicitly how k-server fits into that framework? Yes. Okay. So, that's this. So, here is the k-server problem, introduced by Manasse, McGeoch, and Sleator in 1990, which is a generalization of the paging, or caching, problem. In k-server, the state space is X to the k. You have k servers, and each one is described by a point in your metric space. Now, you have a distance on X, and the distance on X to the k is the earthmover distance: the minimal matching between two configurations. And what are the cost functions? The set of cost functions, script C, is defined like this. Any cost function is parametrized by a request r in X. So, given the request r in X, how do you assign a cost to a configuration x in capital X to the k? Well, it's simply plus infinity if the request location is not in your set of k servers. If it is in your set of k servers, then it's zero. Okay. So, the structure is very simple. It's infinite on a very large part of the state space and zero on the rest.
So, in this geometric view of k-server, what you have to do is: each time, you're given a subset of your state space and you have to move into that subset. That's the geometry. Now, the conjecture, which I will call the weak randomized k-server conjecture, is that there exists a polylog(k)-competitive randomized strategy for k-server. This is significant for two reasons. The first one is that it's very easy to see that for this problem, if you have a deterministic strategy, then k is a lower bound on the competitive ratio. You cannot hope to do better than k. You can just imagine that, let's say, the metric is trivial, so every point is at distance one from every other. Okay. This is the paging problem. So, then what happens is that you have your k servers, which you think of as a cache of size k. Now, the request sequence can be such that every time there is a request, it's actually not in your cache. So, every time you have to move. At every time step you pay one. You pay all the time. That's if you have a deterministic strategy. If you have a randomized strategy, the adversary cannot do that. But if you have a deterministic strategy, the adversary knows exactly what your cache state is, and he can always give you something which is not in the cache. So, you move all the time. But it's easy to see that there is a way to design OPT so that it never pays more than a 1/k fraction of the time. You know, you can just say: I'm going to evict the element which is going to be requested the latest. So, for the next k requests, all the others are in the cache. So, that gives you a lower bound of k. So, this conjecture says that you can get an exponential gain by randomization. That's a very, very strong statement. The other nice thing is that it's independent of n, the size of the state space. So, I told you just before that in the general MTS setting, log n is a lower bound. But here we have this combinatorial structure.
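The eviction rule just mentioned (evict what is requested furthest in the future, i.e., Belady's rule) and the adversarial round-robin sequence can be sketched like this; the FIFO cache standing in for "any deterministic strategy" is my own choice of example:

```python
def online_fifo_misses(requests, k):
    """A deterministic online cache (FIFO); the round-robin adversary below
    makes it miss on every single request."""
    cache, misses = [], 0
    for r in requests:
        if r in cache:
            continue
        misses += 1
        if len(cache) >= k:
            cache.pop(0)                 # evict the oldest page
        cache.append(r)
    return misses

def belady_misses(requests, k):
    """Offline OPT for paging: on a miss, evict the page whose next request
    lies furthest in the future."""
    cache, misses = [], 0
    for t, r in enumerate(requests):
        if r in cache:
            continue
        misses += 1
        if len(cache) >= k:
            future = requests[t + 1:]
            next_use = lambda p: future.index(p) if p in future else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.append(r)
    return misses

# Round-robin over k+1 pages is adversarial for a deterministic cache:
k = 3
requests = [0, 1, 2, 3] * 3
print(online_fifo_misses(requests, k), belady_misses(requests, k))  # 12 6
```

On this sequence FIFO misses every time, while after the warm-up OPT misses on a 1/k fraction of the requests, matching the lower bound of k described above.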
So, we can hope to go beyond this log n lower bound and get something independent of n. That's significant. So, what do we know about this setting? There are two famous results. I mean, there are many more results, but these are the two most important ones to me. One of them is Koutsoupias and Papadimitriou, who showed that the work function algorithm, which I won't define (it's a very natural dynamic programming strategy), is 2k minus 1 competitive. And this is a deterministic algorithm. The k-server conjecture is asking whether you can improve this 2 to a 1. Okay. I find this less exciting than the randomized k-server conjecture. On the other hand, on the randomized front, for a long time not much was known except for the paging case. And then came this very nice paper by Bansal, Buchbinder, Naor, and Mądry in 2011, where they show the following. So, I'm going to say the word HST several times in this talk, so let me define it. It's actually not that relevant for the talk, but I will still say what it is. An HST is a hierarchically separated tree. Your state space X is finite, so you can think of it as a graph. Now, for HSTs, what we have is actually a tree, and the requests are on the leaves of the tree. The distance between two leaves is the shortest-path distance: you have a weight on every edge. And in HSTs, the additional assumption is that the weights decrease exponentially with the depth. So, let's say, you know, at the root the edges have weight 1, then at the next level they have weight one-half, at the next level one-fourth, et cetera. So, they decrease exponentially. What I told you is for separation one-half, but you could have another separation factor. Now, what Bansal et al showed is a log n times log k competitive algorithm for HSTs where the separation, this exponential decrease, is actually really big.
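Here is a small sketch of an HST leaf metric, under my own conventions: a complete binary tree, edge weight sep**l at depth l, and separation one-half:

```python
def hst_leaf_distance(u, v, depth, sep=0.5):
    """Distance between two leaves of a complete binary HST.

    Leaves are bit-tuples of length `depth`; the edge from a node at depth l
    to its children has weight sep**l (so weights decay geometrically with
    depth), and the leaf metric is the tree shortest-path distance."""
    a = 0                                   # depth of the lowest common ancestor
    while a < depth and u[a] == v[a]:
        a += 1
    # go up from u to the LCA and back down to v: each level is paid twice
    return 2 * sum(sep ** l for l in range(a, depth))

# Two leaves splitting at the root vs. splitting at the last level:
far  = hst_leaf_distance((0, 0, 0), (1, 0, 0), depth=3)   # LCA = root
near = hst_leaf_distance((0, 0, 1), (0, 0, 0), depth=3)   # LCA at depth 2
print(far, near)  # 3.5 0.5
```

The geometric decay is what makes leaves that split high up in the tree much farther apart than leaves that split near the bottom.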
It decreases like log n times log k. And the reason why they need this separation is that they don't really work on the trees. They define a more complicated problem on stars, and then they combine those more complicated algorithms. And for the combination to work, they need this big separation. Now, there is a general theorem which tells you that any metric space can be embedded into an HST up to a distortion of log n. I will come back to this towards the end of the talk. So, what you get is: log n log k is the competitive ratio on HSTs; you get another log n because of the reduction; and you also need to multiply by the separation, because to make sure that you embed into a tree with that separation, you need to multiply by the separation. So, in total, you get log cubed n times log squared k for general metric spaces. That's very nice, but it doesn't solve the weak randomized k-server conjecture at all, because of this log cubed n. The theorem that I'm going to tell you about today, which is joint work with Michael Cohen, Yin Tat Lee, James Lee, and Aleksander Mądry, is that we resolve the weak randomized k-server conjecture on HSTs. So, for HSTs, we show that you can be log squared k competitive. Independent of n and polylog in k. In particular, this implies log squared k times log n for general metric spaces. So, the non-exciting way to say it is that in the Bansal et al approach, we replace this log n by a log k, and we get rid of the separation constraint. The more exciting way to say it is that we identify mirror descent as a general tool for online decision-making, and we show that it can be applied directly on the tree, without doing something on every star and combining them. This direct view of the tree allows us to get rid of the separation; and then replacing log n by log k is something technical that I will get into. We also have an improvement for the embedding: we get a dynamic embedding, and I will briefly mention this.
This is what we are going to do. So, let me maybe just pause for a minute to see if there are any questions. Yeah, if there are any questions, just feel free to unmute and ask. One quick question from me, just to connect to what you said in the previous slide. You had online learning and online algorithms. So, this is about online algorithms, right? Yes. And the reason you mentioned the other one? The reason I mentioned the other one is that, in the other one, we understand very well how mirror descent fits into the picture and how it's a central algorithm, whereas in online algorithms, this was not identified. But once you see that those two settings are, in fact, very, very close to each other, as you saw in the previous slide, the formalisms are almost identical, then you say: okay, maybe I want to try to apply mirror descent here too and see what happens. Okay, and I'm guessing this is what will happen soon, right? This is what's gonna happen. Okay, thank you. Any other questions? Okay, so we can go on, thanks. Okay, so before I define mirror descent and tell you the general analysis, including the online learning analysis, let me keep the discussion at the level of k-server for a moment and tell you about the general framework, how we are going to view the problem. We're going to introduce fractional configurations, anti-configurations, and a continuous-time framework. So, let's do that. For HST metrics, it's actually sufficient to specify a fractional configuration. So, what is an integral configuration? It would be z, where for every location in 1 to n, I have a number which is either zero or one, which tells me whether there is a server there or not. And my total number of servers is k. Okay, so that would be an integral configuration. Now, a fractional configuration is where I relax the set {0, 1} to the interval [0, 1]. That's a fractional configuration. It turns out that from an online fractional algorithm, you can get a randomized integral algorithm.
I won't discuss this rounding; I will briefly come back to it at the very end, in the open problems. Okay, now, what we're going to look at mostly is the anti-configuration: x is 1 minus z. So, this is telling me where the servers are not. Okay, so x represents the missing mass. If I have a missing mass of zero, it means I have a server there. Why do we want to look at the anti-configuration? Well, because very soon we're going to want our cost function to be a linear cost function. This is going to be very important. Okay, so this combinatorial structure will in particular imply that we have a linear structure. So, let's see how this goes. We're also going to be in continuous time. So, let's say we have a request r at some location in 1 to n. We're going to view this as a continuous-time cost c(t), which is just a unit mass at location r, meaning the unit vector in R^n with a one in the r-th coordinate. Why do we want to do that? Well, this is a linear cost acting on the anti-configuration. What do I mean by that? Let's look at x(t). Once x_r(t) is zero, meaning I have zero missing mass at r, meaning that I have a server there, my inner product with c(t) is zero. I have no loss anymore. And until then, until I reach zero, I have some loss. So, just to really clarify, this continuous-time cost c(t) is artificial. We're adding it, but it's going to be very useful to think of it that way. So, we have this linear cost on the anti-configuration. The next thing is that we're actually going to expand the state space. So, I'm going to try to explain the k-paging algorithm before moving to trees. And there, I think everybody should be able to understand everything. Then I will move to the tree case. In the tree case, if you're not an expert, you will not understand everything, but I think that's fine; you will still get some ideas. So, in the tree case, we will need to extend the state space.
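In code, the anti-configuration and the linear cost look like this (toy numbers of my own):

```python
# Fractional anti-configuration on n = 4 locations with k = 2 servers:
# z[i] is the fractional server mass at location i, x = 1 - z is the
# missing mass (x[i] = 0 means a full server sits at i).
n, k = 4, 2
z = [1.0, 0.5, 0.5, 0.0]
x = [1.0 - zi for zi in z]
assert abs(sum(z) - k) < 1e-9           # total server mass is k

# A request at location r corresponds to the linear cost c = e_r, so the
# instantaneous cost is just the inner product <c, x> = x[r].
r = 1
c = [1.0 if i == r else 0.0 for i in range(n)]
inst_cost = sum(ci * xi for ci, xi in zip(c, x))
print(inst_cost)  # 0.5: we keep paying until the missing mass x[r] hits zero
```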
So, instead of just looking at this set of fractional configurations, we need to look at a bigger state space. This is not a crazy idea. So, this bigger state space is going to be some convex body K in R to the capital N. Capital N is potentially bigger than little n. And what we want is that the projection on the first n coordinates is exactly the set of anti-configurations. So, really, you have the set of anti-configurations in R to the little n, and then you expand it in some way to R to the capital N, such that the projection on the first n coordinates is the set of anti-configurations. Okay. This will come back only in 20 minutes or so. By the way, a question? Yes. So, you say continuous. Do you assume that time is continuous as well, or do you just assume that everything is fractional? Sorry, time is continuous. I'm sorry if I didn't make this precise. So, now t, you're right, in the previous slide t was discrete; now t is going to be continuous. So the cost will be like an integral of a dot product, or something like that? Yes, exactly. An integral of a dot product, and this will come very soon. Yes, I will make this precise. Thank you. Okay. So, the general setting that we're going to work in, and in which we're going to design an algorithm, is via what is called a differential inclusion, okay, which is just a generalization of ODEs. Okay. So, let's say we're in some state right now, x naught in K, and we get a request. Now, we have to respond to this request, right? We have to go from x naught to some new x, where x at location r is zero. And what we're going to do is flow, evolve our state x via a differential inclusion like this. So, the time derivative of x is going to be included in some set F, which depends on x and c. Given my current cost and my current state, I have some time evolution. And we're going to run this until x_r(t) is zero. Okay, at this level this is just very general.
And it's just so that you're not surprised when we get to the actual algorithm; that's what we want to do. Okay. The particular form of this set-valued map F(x, c) that we're going to take is the following: minus the Hessian inverse of some function phi, applied to c plus lambda, where lambda is a Lagrange multiplier for K at x, and phi is a regularizer. Again, I'm not expecting you to understand this right now. The next two or three slides are dedicated to explaining this. Okay, why do we consider this? Just so you know what's going to come: you see that when the time derivative is equal to this, minus the Hessian inverse applied to c plus lambda, the time derivative is giving you the movement, right? And we see two sources of movement. One source of movement is the cost c. And the other one is the Lagrange multiplier lambda. Okay, so there will be a simple movement due to just replying to the request; this will be the term which is due to c. And this we're going to be able to analyze very easily using general mirror descent techniques. And then there is a part due to the Lagrangian movement, which is there because you want to keep a fractional solution. And this will be easy for k-paging and more complicated on trees. Okay, so that's where we're going. Good. So now I want to tell you about mirror descent. Okay, so this was invented by Nemirovski and Yudin in the '80s as a way to optimize convex functions. The general spirit goes like this. You all know gradient descent: you are at x and you move in the opposite direction of the gradient. So you do x minus some step size eta times grad f at x. If you think about this very abstractly, this update equation doesn't make sense: x is a primal point and grad f of x is a dual point. Of course, if you're in finite dimension, you know, by Riesz's theorem, everything is equivalent.
But let's say you were in a Banach space, some normed space, possibly infinite-dimensional; then this just doesn't make sense. So what Nemirovski and Yudin said is: okay, what we want to do is first map x to a dual point, do the gradient update in the dual, and then come back to the primal. And the way to move to the dual is using the gradient of some function. So what you do is you use some regularizer phi, and then you take x, you move to grad phi of x, you do the gradient update there, and you come back. Okay, so let's see how this works. We have our set K, which is a convex set representing probability distributions over the state space. As we said, it could be an expanded state space, et cetera. Now we're going to have what we call a mirror map phi, which is defined on a superset D of K. So this is a real-valued function. We look at its associated Bregman divergence. This is a very important object. D_phi between x and y is the error that you make when you do a first-order Taylor approximation of phi at x using the point y. So it's phi of x minus phi of y minus grad phi of y dot (x minus y). Okay, so this is like the Hessian of phi at y, you know, applied to (x minus y, x minus y), something like that. What's going to be important on the next slide is what happens when you differentiate with respect to x: you get the difference of gradients. Okay, so if you differentiate with respect to x, then this becomes grad phi of x, this becomes zero, and here you just get minus grad phi of y. So again, when you differentiate the Bregman divergence with respect to x, you just get the difference of the gradients of phi at x and y. Now, here is the picture for mirror descent. So now we are back in this very general online decision-making setting; whether it's online learning or MTS doesn't matter. We are at some current point x_t in our state space K, and we are now seeing this cost function c_t.
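For concreteness, here is the Bregman divergence for one standard mirror map, negative entropy (my choice of example, not specifically the one from the talk), together with a numerical check of the differentiation property just stated:

```python
import math

def phi(x):                      # negative entropy as the mirror map
    return sum(xi * math.log(xi) for xi in x)

def grad_phi(x):
    return [1 + math.log(xi) for xi in x]

def bregman(x, y):               # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

# For negative entropy the Bregman divergence is the KL divergence
# (here sum(x) = sum(y) = 1, so the extra linear terms cancel):
kl = sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))
print(abs(bregman(x, y) - kl) < 1e-12)

# And d/dx D_phi(x, y) = grad phi(x) - grad phi(y), checked by a central
# finite difference in the first coordinate:
eps = 1e-6
num = (bregman([x[0] + eps] + x[1:], y)
       - bregman([x[0] - eps] + x[1:], y)) / (2 * eps)
print(abs(num - (grad_phi(x)[0] - grad_phi(y)[0])) < 1e-6)
```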
So I said before that c_t is going to be linear. This is very important: c_t is a linear function over the state space. In particular, the gradient of c_t is just c_t. Okay, so c_t is a linear function. So I'm at x_t. I use the gradient of phi to map to the dual. Okay, so I get grad phi at x_t. Now I do a gradient step. I have some step size eta, and my cost function is c_t. And again, c_t is a linear function on my state space; that's important. So the gradient of c_t at any point is just c_t. So my gradient step is grad phi of x_t minus eta c_t. Now I move back to the primal. To move back to the primal, I want to use the inverse map of grad phi. If phi is a convex function, then the inverse map is nothing but the gradient of the Fenchel dual of phi. This is not going to be very important, but this is the formula: grad phi star. Okay, phi star is the Fenchel dual. Now you get to a new point, but this point might be outside of your constraint set. K is your constraint set where you want to end up, but you might end up in D, which is the domain of definition of phi. Now you need to project. Of course, you're not going to project in the Euclidean norm. I mean, the whole point of this is to generalize, to abstract to possibly infinite-dimensional spaces. So you want something which is independent of the Euclidean structure. And the natural thing to do is to project in the Bregman divergence D_phi. Okay, so that's mirror descent. So let's see how this works in continuous time, how it relates to differential inclusions, how it relates to the Hessian, all of this. This will be the next slide. First, let me remind you of the basics of Lagrangian duality, really just elementary facts. So if I have a convex body K, then the normal cone of K at x is the set of directions theta which are negatively correlated with all the directions going inside the body. This is the picture, just to clarify. This is K, and I have some point x.
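Put together, one discrete mirror-descent step with the negative-entropy mirror map on the probability simplex looks like this; the mirror map is my choice of example, and it is the special case where the Bregman projection back onto the simplex happens to be plain normalization:

```python
import math

def mirror_descent_step(x, c, eta):
    """One mirror-descent step on the probability simplex with the negative
    entropy mirror map: map to the dual with grad phi (= 1 + log, the constant
    shift cancels), take a gradient step on the linear cost c, map back with
    grad phi*, and Bregman-project onto the simplex, which for negative
    entropy is simply L1 normalization."""
    dual = [math.log(xi) - eta * ci for xi, ci in zip(x, c)]   # dual gradient step
    y = [math.exp(d) for d in dual]                            # back to the primal
    s = sum(y)
    return [yi / s for yi in y]                                # Bregman projection

x = [1/3, 1/3, 1/3]
for _ in range(50):
    x = mirror_descent_step(x, c=[1.0, 0.0, 0.0], eta=0.5)     # coordinate 0 is costly
print(x[0] < 1e-5, abs(sum(x) - 1) < 1e-9)                     # mass drains off coord 0
```

This is exactly the multiplicative-weights / exponentiated-gradient update, which is one reason mirror descent is so central in online learning.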
So I can take all the normals at the currently tight constraints. All these directions are negatively correlated with everything going inside. And in fact, any positive combination of them is negatively correlated with anything going inside. So the cone spanned by the normals to the currently tight constraints, that's the normal cone at x. That's the picture, but really, the definition I think is clear: it's all the directions negatively correlated with anything going inside. Why is this useful? Again, just reminding you of this. Let's say we have a convex function little f, and we want to find the minimizer of little f on K. If there were no constraints, right, the first-order optimality condition tells you that the gradient of f at x star is zero. But now we have this constraint. What does it tell us? It tells us that minus grad f at x star is in the normal cone at x star. Why is that? Let's assume that it was not. If it's not in the normal cone, then it means that there is a direction going inside the body which is positively correlated with minus the gradient. So if you're positively correlated with minus the gradient and you take a little step in that direction, then you're going down: you're getting a smaller function value, which contradicts the fact that x star is a minimizer. The only way for x star to be a minimizer is that minus grad f of x star is in the normal cone. Okay, good. So let's see how we're going to use this. This is the one-line formula of mirror descent, and now we'll write it in this funny way. So we are at time t, and now we do a little step of size eta, and I'm looking at what x at t plus eta is, and we're going to let eta go to zero, and then we'll see the time derivative. That's what we're going to do. So I told you: you map to the dual using grad phi of x_t, then you do a step in the negative gradient direction, which is minus eta c_t.
You map back to the primal using the gradient of phi star, and now you project using the Bregman divergence. This is the formula. Just for intuition, Yin Tat told me I should say this: this is like doing a small step and then projecting using a non-Euclidean norm. So here I use this notation, the norm at x_t. This means the norm induced by the Hessian of phi at x_t. One way to think about this is: you have this convex function phi on some domain D, and what it's doing is equipping this domain D with a Riemannian structure. So at any point x, I now have an inner product, which is given by the Hessian of phi at this point. So I have a tangent space, which is all of R^N, and I have an inner product on it, given by the Hessian at this point, which also gives me a norm. Now, what you're doing here: when eta is small, this is exactly the same thing as trying to minimize. You want to move in the negative direction of eta c_t, you want to minimize this, but at the same time you want to stay close to where you were, where closeness is measured in this norm, in this local norm. So when you let eta go to zero, this is a projected dynamical system, but a non-Euclidean one, where the norms are given by this Riemannian geometry rather than the Euclidean geometry. This will not be used anywhere, but maybe it adds some intuition about what's going on. So now let's go back to the formal equation. This is the equation. I just want to use the first-order optimality condition. Remember, I told you that when you differentiate the Bregman divergence between x and y with respect to x, you get the difference of gradients. So what is the difference of gradients? At the optimum, you get grad phi at x_{t plus eta}, minus grad phi of this thing; but grad phi and grad phi star composed give you the identity. Grad phi star by definition is the inverse of grad phi. So you get minus grad phi of x_t plus eta c_t.
And by definition, minus this is in the normal cone, so this is in minus the normal cone. Okay, everything I told you gives you this. Now, what happens? Let's let eta go to zero and normalize by one over eta. This term gives me c t. And these two parts, when I divide by eta and let eta go to zero, what I get is the Hessian of phi at x t applied to the time derivative of x. So this is exactly telling me that the Hessian of phi at x t times the time derivative is in minus c t minus the normal cone. And this is exactly the differential inclusion I was telling you about two slides ago. This we will take as the definition of our algorithm. So again, now we are in continuous time, we have those cost functions that arrive continuously, and we want to say how we evolve our state x: we're going to evolve it according to this differential inclusion. And the theorem, which is not too hard using general facts about differential inclusions, is that there is a solution to this differential inclusion. In fact, it's even unique, provided that K is compact, phi is strongly convex, and the Hessian of phi and c are Lipschitz. So now is a good time to stop for a few seconds again and see if there are questions on this. So again, this line is the algorithm. We're going to spell it out for specific K, but this is a general algorithm. So I guess I have two questions, probably a naive one. The right-hand side in this equation, it's some convex set, right? So when you take the inverse of the Hessian, it's still convex, right? Yeah, you just apply a linear map. Right. And you're saying that there is a whole theory of differential inclusions like this. Does it usually work when the right-hand side is convex? So it's more general than that. There is this general theory of differential inclusions, which works for something more general than convexity on the right-hand side.
But here we actually have what's called a viability problem, because in addition to this time derivative being in some set, you also want to guarantee that the solution stays in some K (sorry, this x should be K). So viability tells you that not only do you have this differential inclusion, but you also want to stay in some convex domain. And for the viability, you need convexity. Does that make sense? Yeah. Another question is, I guess, more terminological. So at some point you said that basically you were adding up this c and a multiplier. So it's exactly this element of this cone, right? Yes. Okay. Yeah, exactly. That's exactly that. Okay. Any other questions? No, I have one: so you're going to solve this, right? Yes. So on the next slide, what I'm going to do is show you how you can analyze this very general equation at this general level and say something about the cost and the movement. And then for k-server, we're going to solve it indeed; I mean, solve it in the sense that we're going to write explicitly what it means for some specific K, which is going to be different for k-paging or for k-server on a tree. And in the end, you don't actually care about the whole path. In the end, you're just going to move one server to the point where the server had to go. Yes. Yes. So as you will see, there is a question about whether, in a sense, this is an algorithm. You're right: we only care about getting to the point where, in the end, at the requested location, my missing mass is zero. I want to get to this point, but it's described by this differential inclusion. Of course, it's true that if you discretize finely enough, then you will get to something which is close enough. But this doesn't give you a fast algorithm in any sense.
So in a sense, this result, as I would say at the end, I view more as an information-theoretic result, telling us that you can do k-server with polylog k competitive ratio, but it doesn't really give you a poly-time algorithm. And there are several layers for why, and one of the layers is because you have this differential inclusion. Okay. So let's see the general analysis of this. Here is the basic calculation. So that's our algorithm: the Hessian of phi at x t applied to the time derivative is minus the cost, minus lambda t, where lambda t is a Lagrange multiplier, an element of the normal cone. So the whole point of mirror descent, the reason why it was designed in the first place, is that it comes with a Lyapunov function. It comes with a potential: it decreases the Bregman divergence. So you remember the Bregman divergence? Now I'm going to write d hat for the Bregman divergence where I have just dropped the phi of y term. This will be useful for a reason that I will explain. So d hat phi between y and x is minus phi of x, minus grad phi of x dot y minus x. This is the same thing as before, except that I have dropped the phi of y. Now let's see how this Bregman divergence evolves over time. So I will view y as opt; opt is somewhere. And what we're going to say is that as we move, we're getting closer to opt in this Bregman distance. So let's see. The calculation is easy, and this calculation is the reason why mirror descent was set up in the first place. So let y stay put for the moment, so y is fixed, and we have this path x of t. Let's look at the time derivative of this Bregman divergence. So when we take the time derivative, we have a first term, which is the inner product of minus grad phi of x with the time derivative of x, but this will cancel. By the chain rule, I'm going to take the derivative with respect to this term plus the derivative with respect to that term.
So let's see what the derivative with respect to this second term is. Well, I just get minus x prime, so I get plus grad phi dot x prime, which cancels with the derivative of this. So the only term that's left is the derivative of this grad phi, dot this. What is the derivative of grad phi? It's the Hessian applied to x prime. So what I get is this equality: the derivative of the Bregman divergence is minus the Hessian of phi at x t applied to x prime of t, in inner product with y minus x t. But now the point of mirror descent is that I know what this quantity is. I know that its value is exactly c t plus lambda t, dot y minus x t. But lambda t is in the normal cone, so by definition it's negatively correlated with any direction going inside. Okay, so this term is negative. So I see that the time derivative of the Bregman divergence is smaller than c t dot y minus x t. Now let's interpret this. c t dot y is the instantaneous cost of opt: that's how much opt is paying on this current cost vector c t. c t dot x t is how much I am paying. So if I'm paying more, if c t dot x t is bigger than c t dot y, then this quantity is decreasing. So this equation is exactly telling you that when the mirror descent path is paying more cost than what opt is paying, it's actually getting closer to opt. This is really fundamental. This is what's behind all of the regret bounds for online learning, and you can really derive results for MTS using this also. In online learning there is another layer, which is that discretization is important, but we won't get into that. Now, in online algorithms, opt is not fixed; opt is also moving. So we have to look at what happens when opt moves. By the chain rule, I can decompose the time derivative of the Bregman divergence: first I move the algorithm, and then I move opt. So let's see what happens when I take the time derivative of this quantity with respect to y.
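Written out in symbols, the computation just described (my own transcription, using the notation above) is:

```latex
% Along the mirror descent path, for a fixed comparator y \in K:
\frac{d}{dt}\,\hat{D}_\phi(y, x_t)
  \;=\; -\,\nabla^2\phi(x_t)\,x'(t)\cdot(y - x_t)
  \;=\; (c_t + \lambda_t)\cdot(y - x_t)
  \;\le\; c_t\cdot(y - x_t),
% where the last inequality uses \lambda_t \in N_K(x_t), so that
% \lambda_t \cdot (y - x_t) \le 0 for every y \in K.
```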
So now y is moving and x is fixed. Well, the only thing that depends on y is this, so I get minus grad phi of x t, dot y prime. So let's just apply the definition of dual norms. Fix some norm. This is bounded by the dual norm of grad phi at x t, times the norm of y prime at t. Now, by definition of the Lipschitz constant in some norm, the supremum over all points x of the dual norm of grad phi is exactly the Lipschitz constant of phi. Now, where is this norm coming from? I'm going to assume that, in the end, I care about movement, and that infinitesimally the movement can be viewed as some norm: there is a norm that measures the movement. So the movement of, let's say, our algorithm is just the integral of the norm of x prime, or y prime for the comparator. And we will see that concretely in k-paging and k-server. So those inequalities are very simple, but they are important. Don't you have a term corresponding to the norm of x prime, if x moves fast? Yeah, okay. So, yes, this I will come to. Here, exactly: we do not control the norm of x prime. We want to control the norm of x prime, but this doesn't come from general principles. How to control the norm of x prime is going to come from the specific applications. This is a very good point. But it's not part of this inequality; this inequality is true as it's written. Okay, it seems like if x would move fast, the left-hand side becomes bigger, but it's a derivative. Okay, so it's not a derivative in terms of t here; this is a derivative with respect to y. Yes, yes. Okay, maybe that was the confusion. Okay, good. Very good. So just those two inequalities tell us exactly the following, when you integrate over time. Here is a lemma.
The mirror descent path x of t satisfies, for any comparator path y t, so any other path, the following. Let's say they start in the same configuration; in k-server, opt and alg are going to start in the same configuration. Then, as Ilya was saying, my cost is the integral of the dot product: this is my service cost, the cost associated to c. In k-server this is a virtual cost, not a real cost. But this virtual cost, the integral of c t dot x t, is upper bounded by the Lipschitz constant times the integral of the norm of y prime, which is the movement of opt, plus, because of this other term, the integral of c t dot y t. So we see that in terms of this service cost for c, we are actually one-competitive with respect to the service cost of opt, and then there is a Lipschitz constant with respect to the movement of opt. And this factor of one here is exactly the reason why you can get regret bounds in online learning: because you have a one. Now, as was pointed out, this is not really of interest for k-server, because for k-server what we care about is the movement of the algorithm, which is the integral of the norm of x prime. And this is not controlled by this. So the whole point now, the whole game, once you have seen this slide, is: how can you relate the norm of x prime to this virtual service cost, and make sure that the Lipschitz constant is small? If you have those two things, then you're done. That's what we will want to show. Okay, so let's see this concretely on k-paging. Here is a beautiful result by Bansal, Buchbinder, and Naor from 2012, for weighted k-paging, meaning k-server on a weighted star. So now you have a star with n leaves, the requests are coming to the leaves, and there are weights on the edges. This is a special case of k-server. They show that in this case, you can get a log k competitive randomized algorithm.
And we're going to really re-derive this result in just one slide, I mean, two slides. So let's introduce some notation. w i is the weight from leaf i to the root. Now, I need to tell you a norm; that's one thing which is important in this construction. What is the norm measuring the movement? So let's say we have a fractional move from z to z plus psi. So z is the current fractional allocation of mass, and in z plus psi I took some mass somewhere and I put it somewhere else. What is the movement associated with this? It's exactly the weighted L1 norm. For each location, I look at the differential: maybe I gained 0.1 or I lost 0.1. How much movement does that induce? Well, that 0.1 needs to go through the corresponding edge. So if I'm at location i and I lost 0.1, then I need to pay 0.1 times w i. So the norm is just the weighted L1 norm. The anti-paging polytope (sorry about this, this x should be k) is just what we said before: it's the set of points x in zero-one to the n whose sum is n minus k. This is anti-paging, and z equals one minus x is a fractional configuration. So let's apply our very general framework: let's say we have some mirror map phi and we run our mirror descent algorithm. What happens is that we are in some state x and now we get a request, say at location r of t. We're going to run the mirror descent update equation until we get zero missing mass at this location. Maybe we never get there, but let's say we get to a point where the missing mass at the request is zero. Remember that if I don't get there, then I have some cost, and remember, by the equation for the Bregman divergence, as long as I have a cost, I decrease my Bregman distance. So hopefully I should get there at some point. Okay, the previous lemma tells us exactly this. So the cost in this case, remember, is a unit cost at the requested location.
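A minimal sketch of this movement norm (my own illustration, with made-up weights): moving fractional mass between the leaves of the star costs the weighted L1 distance between the two configurations.

```python
def movement_cost(z_old, z_new, w):
    """Weighted L1 movement on the star: each unit of mass that enters or
    leaves leaf i must travel through the edge of weight w[i]."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, z_old, z_new))

w = [1.0, 2.0, 5.0]        # made-up edge weights, leaf to root
z_old = [1.0, 0.5, 0.5]    # current fractional server mass at the leaves
z_new = [0.9, 0.6, 0.5]    # 0.1 mass moved from leaf 0 to leaf 1
print(movement_cost(z_old, z_new, w))  # 0.1 * 1.0 + 0.1 * 2.0, about 0.3
```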
So the integral of the missing mass at the requested location is upper bounded by the Lipschitz constant times the value of opt. There is no term corresponding to the service cost of opt, because as soon as there is a request somewhere, opt just moves there: it puts zero missing mass at that location, so there is no missing mass. Okay, so there is no such term here; there is just the movement term, which is the value of opt. Okay, we have this inequality. So now, as I already said, we have to do two things. One, we have to get a phi which has a small Lipschitz norm. And most importantly, we have to relate the movement cost. So the norm of x prime: recall what x prime is. x prime is nothing but the inverse Hessian applied to the cost, which is just a unit vector at the requested location, plus a Lagrange multiplier. So we have to relate this quantity to this virtual service cost, which is the missing mass at the requested location. So let's say the Lipschitz norm is A, and let's say we can show that this quantity is upper bounded by B times the missing mass. Then we get an AB-competitive algorithm. Okay, so again, this might be a good moment to stop. Next I'm going to run this program and show you a concrete phi and the concrete dynamics, and show that you can get log k. Wait, one quick question. Yes. Why exactly do we need the second condition? Where does it come up? And, I guess, a bigger question: what is this virtual service cost exactly? Yeah, okay. So what is the objective at the end of the day? The objective at the end of the day is to say that the movement of the algorithm is upper bounded by C times the movement of opt, right? So what you would like is to have this inequality where the left-hand side is not this thing, which has no real meaning. What you want on the left-hand side is the integral of the norm of x prime. That's the movement of the algorithm.
And you want the integral of the norm of x prime to be bounded by some constant times the value of opt. That's the goal of the whole thing. Okay, now what you get for free from the mirror descent analysis is this inequality. So since you got that for free, one step would be to say: okay, I got that for free, so let's work with that. And instead of relating the movement of the algorithm directly to the movement of opt, which might be complicated, it could be simpler to relate it to this virtual service cost. That's the point. Okay. I want to try to finish k-paging within the hour, and then maybe we can take a short break or whatever you want, and then we do the tree case. Okay. So that's what we want to do. Let's look at point two. Point two is the most important: how do you relate the movement to what I call the virtual service cost? It's not a real service cost; it just comes from this general perspective. So, for simplicity, let's ignore the Lagrange multiplier for a moment and only look at the movement induced by the request. And it's pretty natural in this case to look at a separable regularizer: you look at a phi of x which is the sum over i of phi of x i. So now, what do we want? This inverse Hessian applied to e r is nothing but one over phi double prime at x r. And now this norm, remember, is a weighted L1 norm, and the weight on location r is w r. So this term, if I ignore lambda, is just w r over phi double prime of x r. And I want to relate this to x r: x r, the missing mass, is the virtual service cost. So I need to upper bound this by that. But maybe let's not just upper bound; let's make them equal. And if you make them equal, you just get the entropy: if you make w r over phi double prime of x equal to x, then phi is nothing but w r x log x. Okay, so you just got the weighted entropy out of this analysis.
If you just make this movement equal to the virtual service cost, then this is satisfied for phi being the weighted entropy. Now, the corresponding dynamics are as follows. In the dynamics, you cannot ignore the Lagrange multipliers, but in this case they are very simple. There is no Lagrange multiplier at zero, because you cannot get to zero: the entropy blows up at zero, so this is a natural barrier at zero. Now, there is a Lagrange multiplier for one, which is just telling you: never go above one. So if you get to one, then your time derivative needs to be zero. And then there is a Lagrange multiplier for the equality constraint. This is the only important one; this is this mu. So you see, x prime is the inverse Hessian applied to this, with a minus sign. When I apply the inverse Hessian, importantly, it is diagonal. I see that at the requested location, I get x r of t over w r, times minus one, and then I get plus the Lagrange multiplier. At any other location, I don't get this minus one; I just get mu t. And what's important is that this mu t, this Lagrange multiplier for the equality constraint, is the same everywhere, because this equality constraint is just that the sum of the x i is equal to something. And I get this indicator that x i is less than one; this is the Lagrange multiplier coming from the constraint at one. So the dynamics are just multiplicative weights: the rate at which I'm decreasing the amount of mass at the requested location r is proportional to the current missing mass, reweighted by the weight. And now the key is that mu t is positive. We need to understand what the movement introduced by the Lagrange multiplier is. So, mu t is positive. If I look at the movement, I can decompose it into the positive and negative parts, and up to a factor of two it's fine to just look at the movement induced by the negative part, because whatever comes in has to come out. So I can just look at the norm of the negative part.
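Here is a rough Euler discretization of these dynamics (my own sketch, not the paper's algorithm: the cap at one is ignored, so it is only valid while every coordinate stays below one, and the step size and stopping threshold delta are arbitrary choices):

```python
def serve_request(x, w, r, delta=0.01, dt=1e-3):
    """Euler discretization of the weighted-entropy dynamics (sketch only:
    the cap at one is ignored, so this is valid only while every coordinate
    stays below one). x'_i = (x_i / w_i) * (mu - 1{i = r}), with mu chosen
    so that the total anti-mass is conserved. Runs until the missing mass
    at the requested location r drops to delta."""
    x = list(x)
    movement = 0.0
    while x[r] > delta:
        rates = [xi / wi for xi, wi in zip(x, w)]  # inverse Hessian diagonal
        mu = rates[r] / sum(rates)                 # keeps sum_i x'_i = 0
        dx = [ri * (mu - (1.0 if i == r else 0.0)) * dt
              for i, ri in enumerate(rates)]
        movement += sum(wi * abs(d) for wi, d in zip(w, dx))
        x = [xi + d for xi, d in zip(x, dx)]
    return x, movement

x0 = [0.5, 0.5, 0.5, 0.5]   # anti-configuration: n - k = 2 units of mass
w = [1.0, 1.0, 2.0, 4.0]    # made-up edge weights
x1, moved = serve_request(x0, w, r=0)
print(x1[0])                # at most delta: the request is (almost) served
print(sum(x1))              # still 2.0: total anti-mass is conserved
```

Only the requested coordinate moves down; the positive mu pushes mass out to the other leaves, which is the multiplicative-weights behavior described above.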
But why is mu t positive? Because the sum of the x prime should be zero, so mu t has to be positive. And because mu t is positive, the only negative part is at the requested location, which makes a lot of sense. So the dynamics are: I have this missing mass, I have a request at location r, and I'm just going to decrease the mass at location r and disperse it to the other locations. So the movement, the weighted movement, meaning I multiply by w r, is just upper bounded by the missing mass. So my movement is upper bounded by this virtual service cost, and I get the one-to-one relation between the movement and the virtual service cost. So the only constant that I get in the competitive ratio is due to the Lipschitz norm. And for the Lipschitz norm, actually, people in online learning already did this back at the end of the 90s, because they wanted to get tracking regret bounds. And it was also realized already at the time, by Blum and Burch, that there is this connection between online learning and online algorithms. But at the time, they didn't know about mirror descent; they were doing it very specifically for multiplicative weights. The idea is simple: it's just to cap. So why is the Lipschitz norm of the entropy not bounded? Because x can be close to zero, and the log of zero is infinite. So what you want to do is make sure that you never get to zero. One way to do this is to look at the polytope K where, instead of being between zero and one, you're between delta and one. And now, instead of running until you have zero missing mass, you run until you have delta missing mass. But then the mapping from this x to an actual server configuration is the following: z is now one minus x, over one minus delta. So that when x is delta, you get one minus delta over one minus delta, which is one. So when the missing mass is delta, you get that z is one.
So you get that you actually have a server there. But the problem is, if you have an anti-configuration with mass n minus k, then the total mass of this z could be k over one minus delta. So you're using more fractional servers. But it's pretty easy to show that, in fact, you're allowed to use more fractional servers, as long as you don't use more than one extra server. So we show that you can round this, meaning you can go from a fractional configuration with k over one minus delta servers to a fractional solution with only k servers, provided that delta is smaller than one over two k. Or to put it another way: if you have k plus epsilon servers, with epsilon strictly less than one, then you can map it back to only k servers. And correspondingly, the Lipschitz constant is now log one over delta. So with delta smaller than one over two k, you get log k; that's how you get the log k competitive ratio. Okay, I'm going to stop here now for questions. Does this make sense? Yes, absolutely. Okay, so these slides are the tree part of the paper. I'm sad that we didn't get to it, but okay. Here is the dynamic embedding, so let me just say one word about this. Again, Bartal's embedding algorithm is a static thing: you have your metric space, you embed it into a tree right away, and then you work only on the tree. What we do is a dynamic embedding. It's actually really, really simple; it's just one line, and I will just say this one line for the experts. So, why do you get a log n loss in Bartal's algorithm? You get a log n loss because you have to union bound over n points. What you would like to do is union bound only over poly k points. And you can do that, because when you build your tree in Bartal's algorithm, at some point there might be a scale at which you have more than poly k centers, let's say two k centers.
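A minimal sketch of this capped-to-real mapping (my own illustration with made-up numbers: n = 4 locations, k = 2 servers, and delta = 0.1, which is below 1/(2k) = 0.25):

```python
def to_server_config(x, delta):
    """Map the capped anti-configuration x (coordinates in [delta, 1]) to
    fractional server mass: z_i = (1 - x_i) / (1 - delta), so that missing
    mass delta corresponds to a full server, z_i = 1."""
    return [(1.0 - xi) / (1.0 - delta) for xi in x]

n, k = 4, 2
delta = 0.1                 # needs delta < 1/(2k) = 0.25 for the rounding
x = [0.1, 1.0, 0.4, 0.5]    # sum is n - k = 2; the request was at leaf 0
z = to_server_config(x, delta)

print(z[0])     # 1.0: a full server at the requested leaf
print(sum(z))   # k / (1 - delta), about 2.22: fewer than k + 1 servers
```

The total fractional server mass comes out to k/(1 - delta), which stays strictly below k + 1 exactly when delta is small enough, and that is what the rounding argument needs.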
But if you have more than two k centers at a certain scale s, then it means that opt has to move at this scale, because it has only k servers and you have two k points which are at distance at least s from each other. So now you can just delete the whole subtree below those scales. And you can do that because you can afford to reallocate the servers: you have only k servers, and you only have to reallocate them at scales below s, so you can pay for it. And I want to say that there is a new paper by James Lee which is doing something much, much more elaborate, really much more complicated than this simple restart idea. He found a dynamic embedding algorithm which only loses a polylog k in the embedding, whereas here in our embedding, because we use Bartal's original algorithm, we have to pay log of the diameter, log of the time. That's all I will say about this. So, the summary is this. We use continuous mirror descent, and the general theory of continuous mirror descent shows us that the missing mass at the request is related to the value of opt, using this Bregman divergence potential. And this relation is up to a multiplicative factor, which is the Lipschitz constant of phi in the norm measuring the movement. Next, we saw that for k-paging, we can directly relate the missing mass to the movement of alg; that's what I showed you, because the Lagrange multiplier is non-negative. On a tree, we didn't do it; it's more complicated, and you really have to deal with the movement induced by the Lagrange multiplier. But there the key new insight is that the movement induced by the Lagrange multiplier has some effect, which is to increase the weighted depth. And in fact, you can see the movement as increasing some notion of multiscale entropy.
And once you do that, then you can say that the movement induced by the Lagrange multiplier, up to this new potential, which is either the weighted depth or the multiscale entropy, is again related to the missing mass at the current location. So that's where the other log k term for a tree comes from. Now, open questions. Okay: a polynomial-time method. This is a bit ridiculous, but even for weighted k-paging, we don't know a polynomial-time algorithm. For weighted k-paging, the reason is: how do you round a fractional solution online? We don't know how to do that in polynomial time. For the tree solution that we have here, I didn't show it, but the expanded state space actually has exponentially many constraints. So that's one source of exponentiality. There is another exponential in the rounding, and then there is also the discretization of the continuous-time dynamics, which is something else again. Okay. Our algorithm actually gets depth times log k for any tree; it doesn't need to be an HST, but we don't know how to round on general trees. Then of course, the strong randomized conjecture, which is: can you get log k competitive ratio in any metric space, not polylog? The first step would be to improve this log squared k to log k on HSTs, but this could be difficult. Next, we have an upcoming paper for MTS where we remove the log log n from Fiat and Mendel 2003. And to do this, we have to go beyond mirror descent; we cannot use straight-up mirror descent, and in fact, we believe that if you use straight-up mirror descent, then you will get log squared n. So you have to go beyond mirror descent. Something I find very interesting is: can you do polylog, some polylog, using a purely mirror descent strategy for a general metric space? So not going through the embedding into HSTs. And the last one is just: how general is mirror descent for online computation? A concrete open problem is whether it can be used in virtual-type problems. Okay, I will stop here.
Thank you, Sebastian. And this is a good time for some questions. So I actually have a question. For this weighted paging, you get a log k competitive ratio. Is it because, when you restrict to this hypercube where all coordinates are at least delta, log k is the upper bound on the Lipschitz norm of this entropy? Yes, that's exactly why. That's exactly why. And the reason why delta has to be one over k is because, when you cap at delta, you stop once your missing mass is delta. So that means that to go from the anti view to the real view, you lose a factor of one over one minus delta. And this factor you can afford to lose only when delta is smaller than one over k. So you're exactly right: you lose log one over delta because of the Lipschitz constant of the entropy on this cut-off hypercube, and delta has to be small because of the rounding. Yeah, sorry, I guess a related question. So basically, if I want to run mirror descent on some convex body, is there any sense of, I don't know, an optimal Bregman divergence for a given convex body? I guess it depends on the application, on the objective function. But is there any natural sense in which you can formalize it somehow? This is not an easy question to answer. So, certainly I can say that for online algorithms, no, or at least we don't know how to answer this question. Because for online algorithms, there is this mystery of how you control the movement of the algorithm. This doesn't come from some general mirror descent theory; it comes from the specific application, from looking at the dynamics, at what's actually happening. So I can say that for online algorithms, we don't have an answer, and maybe there is no answer. For online learning, it's, yeah, it's another talk to answer this. Any other questions? Yes, I wanted to ask a question about the algorithm for paging. Yes.
Is this a different algorithm than the algorithm presented by Bansal et al.? Or is it a new interpretation of that algorithm? It's a new interpretation. It's the same algorithm. I mean, okay, let me rephrase: it's not exactly the same algorithm. What we do is apply this cost in continuous time at the requested location, and we move continuously to respond to it. What they do is respond to the request immediately: they put the missing mass to zero, and then they do the continuous evolution to fix the mass on the other locations. So in that sense, it's not exactly the same. Which brings me to another thing, which is that maybe you don't need to do this continuous-time evolution. Maybe you can do what we do in online learning: just do a step, fix the requested location, and then project. But we don't know how to analyze this. Any more questions? So, if there are no more questions, thanks, Sebastian, again. And if anyone wants, we can stay here and talk about more details, maybe briefly about the case of trees. But for now, again, thank you, everyone. I hope to see you in the spring, when we start the sixth year of TCS Plus. Okay, so everyone, you're welcome to stay here if you want to chat a bit longer. We'll take it offline for now.