All right, let's continue. Welcome back. Our next speaker is Stefanie Jegelka. Stefanie is the X-Window Consortium Career Development Professor at MIT. She got her PhD jointly from ETH Zurich and the Max Planck Institute for Intelligent Systems. Her research interests span the theory and practice of algorithmic machine learning, which have led to many, many awards, including a Sloan Research Fellowship, an NSF CAREER Award, a DARPA Young Faculty Award, the German Pattern Recognition Award, and a Best Paper Award at ICML, I don't know which year, 2013. She's going to talk about robust learning with robust optimization. Please welcome her. Thank you very much for the introduction, and thanks for the invitation to speak here. This talk is probably going to be the most theoretical of all the talks in this workshop, I'm guessing, but I'll go slowly, and if you have any questions, just feel free to ask. So what is this talk about? This talk is about robustness in machine learning. The motivation, which I don't think I need to belabor in this workshop, is that we have data from all over the place. If we actually deploy machine learning in the real world, we really want to make sure that we understand what the model will do if the data changes a little bit, under perturbations and so on. So let me make this a bit more formal, semi-formal. What we typically do is that we have our training data set, here it's just some people, and we train a model to fit this data set. A common principle for doing this is so-called empirical risk minimization. That means we try to fit the model as well as possible to the data: we have some kind of loss function, which I call L here, and we minimize the average loss on my training data. I actually write the expectation under P hat n, the empirical distribution. I write it in this seemingly complicated way for reasons you will see later.
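To make the empirical risk minimization principle concrete, here is a minimal sketch, not from the talk: a linear model and squared loss are my own illustrative choices, and the data is synthetic. The objective is exactly the average loss over the training sample, i.e., the expectation under the empirical distribution P hat n.

```python
import numpy as np

# Empirical risk minimization (ERM) sketch: minimize the average loss
# over the empirical distribution (the training sample).
# Model, loss, and data are illustrative choices, not from the talk.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 samples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def empirical_risk(w, X, y):
    """Average squared loss over the training sample: E_{P_hat_n}[L]."""
    return np.mean((X @ w - y) ** 2)

# Plain gradient descent on the empirical risk.
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad

print(empirical_risk(w, X, y))  # small: the model fits the sample well
```

This is the baseline the rest of the talk perturbs: the whole objective only ever sees the points in the sample.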
What this really means is the expectation under the empirical distribution, i.e., just over those data points that I have, so it's just taking the average. So here I'm minimizing the average loss on my training data. What that ensures is that I'll do very well on the data that I've seen, as we all know. Now, what has become an increasingly important question is: what happens when I'm making predictions and the inputs are not exactly the training data? That can take various forms. The most classical form is that of generalization, which basically means that I'm looking at data, at people, that come from essentially the same underlying distribution as the people that I've seen before. They are not exactly the same; they're just a different sample from the same distribution, but I want to do well for those also, and I want to make sure that this happens. So this is generalization. But there could also be other things: maybe there's a shift in my data set, because things change over time, or because I'm in a slightly different domain. There is the extreme case of adversarial examples, which you've probably seen and heard about, where someone explicitly constructs data points that are bad for that particular classifier. And this could also cover other types of invariances that we want. Maybe we have a vision data set and we want to be invariant to certain lighting conditions, or rotations, or something like that. So the data doesn't look exactly the same, but we still want to perform well on that new data. How can we ensure that? That is of course a big open question, and today I want to talk about one approach towards better understanding, or better adjusting, our learning algorithms so that they work for slightly different data like this. The idea I want to present today, or introduce to you if you haven't heard of it, is that of robust optimization.
So robust optimization really follows this idea: if I expect slight perturbations of my data, why don't I just take them into account when I do the learning, meaning the optimization? Learning is nothing other than optimization: I'm minimizing my empirical loss, finding the best function f that fits the data. So why not take those perturbations into account during the learning process? That's the rough idea. So how do I do this? Essentially, I specify a set of perturbations of my data and I change my criterion from... [Audience: Can you make sure you're using the mic?] Okay, okay, I'll just stay here. So, I'm changing my criterion from just performing well on the training data that I've seen to performing well on all those perturbations that are possible. And the way I'll do this, there are many ways you possibly could, is that I go for the worst case: among all the possible perturbations that I have, I want to work well for all of them, so I look at the worst case and optimize for that. Formally, what this looks like is that I go from my empirical risk minimization, where I'm just minimizing over my training data, to a two-player game: I essentially imagine an adversary in there. I am minimizing the loss, and there's a little devil in there that tries to perturb my data to maximize the loss and make it as bad as possible for me. So if I now minimize over all possible perturbations, I make sure that I am optimizing for every possible perturbation that this little adversary could make, because I take them all into account. These types of perturbations could, for example, relate to generalization, shifts, et cetera, everything I mentioned before, and the way I get those is by defining different types of perturbation sets. So that's the general idea.
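The two-player game above can be sketched as a training loop: an inner adversary nudges each data point inside a small ball to increase the loss, and the outer loop minimizes the resulting worst-case loss. This is my own generic sketch (a single linearized ascent step for the inner maximization, in the style of FGSM/PGD adversarial training), not the talk's specific algorithm.

```python
import numpy as np

# Min-max robust training sketch:
#   min_w  max_{||delta|| <= eps}  avg loss(w, x + delta, y).
# Linear regression with squared loss; the adversary perturbs each input
# within an L2 ball of radius eps. Illustrative only.

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
w_true = np.array([1.0, 1.0])
y = X @ w_true + 0.1 * rng.normal(size=80)

def loss_grad_x(w, X, y):
    """Gradient of the squared loss w.r.t. the inputs x: 2*(x.w - y)*w."""
    return 2 * (X @ w - y)[:, None] * w[None, :]

eps, w = 0.1, np.zeros(2)
for _ in range(300):
    # Inner maximization: one normalized ascent step on the inputs,
    # i.e., the worst perturbation for the linearized loss in the eps-ball.
    delta = loss_grad_x(w, X, y)
    norms = np.linalg.norm(delta, axis=1, keepdims=True) + 1e-12
    X_adv = X + eps * delta / norms
    # Outer minimization: gradient step on w at the perturbed data.
    w -= 0.05 * 2 * X_adv.T @ (X_adv @ w - y) / len(y)

print(w)  # roughly recovers w_true, but trained against worst-case inputs
```

The only change from plain ERM is that each gradient step is taken at the adversarially perturbed points rather than at the sample itself.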
I can't talk about all of this today, so what I'll talk about today specifically is the first point: how this actually relates to more classical ideas of generalization in machine learning, and how it leads to a very, very simple proof that you actually will generalize. The framework I will follow today is not exactly robust optimization but distributionally robust optimization. It's called distributional because what I allow to be perturbed is not only single data points: I view my data as a distribution, and I can perturb that entire data distribution in certain ways. So that's the main idea. Let's look at what this actually looks like. I again start with my average loss, and now I want to perturb my data distribution and do well for a whole set of different distributions that my training data could have come from. Formally, my adversary is now allowed to change my data distribution, the samples that I have, in certain ways, within what I call an uncertainty set: that's the set of perturbations that are allowed, and it's called U. So I'm no longer taking the expectation under my empirical distribution, which is just the average; I'm optimizing the expectation under this perturbed distribution. That's my new objective; it's now a minimax objective. Yes, it became a bit harder than just the average, but it has many good properties, as we will see. Now let's try to understand a bit better what this uncertainty set U typically is. One thing is that it should probably contain the data that we saw; that's very reasonable. And maybe it should allow only small perturbations of that data that are likely, not arbitrary perturbations. If I allow arbitrary perturbations, the problem is that I'm optimizing for everything, and then I cannot fit anything really well, because I'm just trying to fit everything, and that's not going to work.
So the way this uncertainty set typically looks is that it's essentially a ball around my empirical distribution, my data. In the space of all distributions, I put my empirical distribution, my samples, at the center, and I draw a ball of radius epsilon around it. And now I optimize for all the distributions that lie in this ball, that is, all those perturbations of my data distribution. The key question is of course what exactly that distance is; it's some kind of divergence between probability measures, and we'll get to that. But before we get there, let's think quickly about why this pretty directly leads to optimizing for generalization instead of just optimizing for the samples you have. It's actually a super, super simple proof, so let me just show you in a few lines and pictures. We assume our data is drawn from some underlying distribution that we don't know, and we do this distributionally robust optimization: we don't optimize directly for our sample, but for a ball of distributions around that sample. Now, if with high probability the distance between my empirical distribution and the true distribution is not too large, then the true distribution actually falls within that ball, right? If those two are pretty close, maybe because I have observed enough examples, or because the radius of the ball is large enough, then with high probability I will be taking that true distribution into account when I'm optimizing. And hence I guarantee that I'm not too bad on that distribution as well. Basically, I'm optimizing for everything in the ball; my true underlying distribution, the one I actually care about, is also in this ball; hence I'm optimizing for it as well.
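The ball argument can be written out in two lines. Assuming the chosen divergence D concentrates, i.e., D(P, P̂_n) ≤ ε holds with probability at least 1 − δ over the sample, the true distribution P lies in the uncertainty set, so the DRO objective dominates the population risk:

```latex
% With probability at least 1 - \delta over the sample,
% D(P, \hat{P}_n) \le \epsilon, hence
% P \in U = \{\, Q : D(Q, \hat{P}_n) \le \epsilon \,\}, and therefore
\mathbb{E}_{P}\!\left[ L(f(x), y) \right]
\;\le\; \sup_{Q \in U} \mathbb{E}_{Q}\!\left[ L(f(x), y) \right]
\qquad \text{whenever } D(P, \hat{P}_n) \le \epsilon .
```

Minimizing the right-hand side over f therefore minimizes an upper bound on the population risk, which is the whole generalization argument.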
So my objective is actually an upper bound on the true loss that I actually want to minimize. And that's typically the major problem: how do I get some kind of handle on what I actually want to minimize if I don't even observe it? So that's the simple proof of generalization: you are directly optimizing an upper bound on the loss that you're interested in. That is, I would say, a very simple route to generalization guarantees. Now let's look a little bit deeper into what exactly this uncertainty set, this ball, is. As we will see, there are several ways of specifying it, and the different choices have different implications and correspond to different things. First, we have to choose that epsilon somehow, and epsilon trades off how many distributions you include in your ball: how much you aim for robustness versus how well you fit your data. If you make it small, you fit your data better; if you make it larger, you're robust to larger perturbations. That's a trade-off. Then we have to specify the divergence I just mentioned, some way to measure distance between distributions, or divergences more generally. What is that? There are basically two popular choices that people have tried. One of them is called the chi-square divergence. If you haven't heard of it, it's essentially a relative of the Kullback-Leibler divergence, which is maybe more widely known; it belongs to the same family of divergences. This actually has some nice implications, but it also has some problems. For example, that nice way I showed you of getting a generalization bound doesn't work anymore. The reason is that if you use this particular divergence, the distributions that fall in the ball only have support on the points that you've already observed.
So basically what that means is that those distributions would never actually allow you to observe a different point, so there's not much about shifts in data points that you can include here. You can still get generalization bounds, but via different routes, and it's also computationally somewhat challenging. The second option that people have considered is the so-called Wasserstein distance. What is the Wasserstein distance? It's a very different way of measuring divergence between distributions. The easiest way to understand it is to say that each distribution is a little bit like a pile, if you think of a Gaussian or so. Say you have two distributions, two piles of sand, and you're trying to transform one of those piles into the other, and you measure how much work it takes to transform one into the other. If they are the same, it's zero. If they are very far apart, then it takes you a lot of effort. If they're well aligned and only a little bit differs, then you don't have to transport your sand very far; you just make some small shifts and it's okay. So it takes into account how far you have to transport your mass and how well aligned the shapes are. That is one of the divergences that directly measures how far apart your distributions actually are in space; many others don't explicitly do that, they essentially just say whether the supports are disjoint or not. That's why Wasserstein has been very popular in machine learning recently. But Wasserstein also has some complications: if you actually want to optimize with it, you have to work with upper bounds, and these are typically only asymptotic and need further assumptions. So this is the somewhat more difficult side of those two. Both of them, I should also mention, have some nice interpretations. And I said that this robust optimization leads to generalization.
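The pile-of-sand picture has a particularly simple form in one dimension: for two empirical distributions with the same number of points, optimal transport just matches sorted samples, so the Wasserstein-1 distance is the average distance between matched points. A minimal sketch of my own (the general multi-dimensional case needs an optimal-transport solver, which this does not show):

```python
import numpy as np

def wasserstein_1d(x, y):
    """W1 distance between two 1-D empirical distributions of equal size.
    In 1-D, the optimal transport plan simply matches sorted samples."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert len(x) == len(y)
    return np.mean(np.abs(x - y))

# Identical piles: no sand needs to move.
print(wasserstein_1d([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # 0.0
# Shifting every point by 1 costs exactly 1 unit of transport on average.
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0
```

Note how the shifted case gives a finite, meaningful distance even though the two supports are disjoint, which is exactly the property the talk highlights.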
So you could wonder: what are the other ways we typically achieve generalization? It's via regularization: we put in some kind of L2 or L1 penalty, you've all seen that. And in fact, there's a pretty direct correspondence between regularization and these types of robust optimization. The idea is that different types of divergences lead to different types of regularization, which may help you better understand what these kinds of robust optimization are actually looking for. The first one, the chi-square divergence, corresponds to penalizing the variance of your loss on your data. You look at your data, you measure the loss, and you measure its empirical variance: how much does the loss vary across your data? And you penalize that. That is also an upper bound on your actual population loss. The second one, the Wasserstein distance, penalizes the norm of the gradient of the loss. Let's try to understand this. It also looks at how much the loss varies at the data: I don't want the loss to be super sensitive to any particular data point. If you think about adversarial examples, that's exactly something you want: the loss should not be super sensitive to any specific data point. So that's what you measure, and then you take some kind of norm of it. But these are just two choices. The question is: are there others? Are there others that connect to methods we know, or that have other good properties that complement what is already there? What we were interested in was a specific type of divergence that has also been widely used in machine learning and that is super easy to compute. It doesn't look like it at first glance, but it is. It is called the maximum mean discrepancy. So what is that name about? Here's that weird-looking formula; let me explain it. If you just look at the right-hand side, you see a function H.
And what we are doing is taking the expectation of that function under one distribution and under the other distribution, and then looking at the difference. That difference is a measure of how different these distributions are. Now you can plug in different functions H and see what comes out. If H is the identity, what you get is the mean of the distribution, so you're just looking at how far apart the means of the two distributions are. That's one measure of how far apart they are, though maybe not the best one, because they could have the same mean and still look very different. What this H can do is take into account many different higher-order moments that capture shape, et cetera. This H is also called a witness function: it basically tells you where the differences between your distributions are, if you actually look at it. And we are taking the maximum over all possible H in a certain class of smooth-enough functions, let's say; that's the technical criterion, but for the moment, think about it that way. So why is this easy to compute? Because there's a kernel trick, and it's very easy to compute with kernels. There's another way to understand this: what you're actually doing is taking your two distributions, here just some pictorial distributions, embedding them in a function space, the kernel space, a Hilbert space, and taking the distance, essentially the Euclidean distance, in that function space. After the embedding, it's all Euclidean distances, and we are happy; that's typically very easy. That's essentially what it is doing, and that's why I can also write it as just a distance between two embeddings of my distributions. So that is why this is actually computationally fairly nice.
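The kernel trick makes the MMD straightforward to estimate from samples: with a kernel k, the (biased) squared-MMD estimate is mean k(x,x') + mean k(y,y') − 2·mean k(x,y). A small sketch of my own with a Gaussian kernel (the bandwidth is an arbitrary choice, not from the talk):

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased estimate of MMD^2 via the kernel trick:
    E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y_same = rng.normal(0.0, 1.0, size=(200, 1))
Y_shifted = rng.normal(2.0, 1.0, size=(200, 1))

print(mmd_squared(X, Y_same))     # near zero: same distribution
print(mmd_squared(X, Y_shifted))  # clearly positive: distributions differ
```

No optimization over witness functions ever appears in the code: the supremum over the RKHS ball is absorbed into the three kernel averages, which is exactly the "all Euclidean after embedding" point.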
And it has been used widely for generative modeling, for testing whether two distributions are the same, for causality, and in many other applications. So what are the advantages of this criterion for measuring divergence compared to some others? Let's go back to our distributionally robust optimization. One advantage is that it's actually very good for estimation: if you look at the distance between my population and my sample, as my number of samples grows this should decrease, and it decreases fairly rapidly in this case. That means my ball doesn't actually need to be super large to make sure that you include the population. There are other divergences, like Wasserstein, whose estimators converge much, much more slowly. The second thing is computational: if you plug this into our robust optimization problem and look at what it actually becomes, you can relax it. Finding the worst-case distribution can be relaxed to a linear optimization over a ball in that function space, and linear optimization over balls is relatively easy. So we essentially get a closed-form solution that gives us a specific upper bound on our optimization problem. Here's what it looks like. On the left-hand side is just the inner part of our robust optimization problem, what the adversary essentially has to solve: finding the worst-case distribution. And what that corresponds to is essentially a norm penalty, that blue term at the end, on the loss of my function: I apply the loss to the model that I'm learning and look at the norm of that. That's the regularization it corresponds to, if you want to think about it that way. Compare the other ones: Wasserstein corresponded essentially to the gradient of that, which is the closest, I guess, of those that I've shown.
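The correspondence described here, a divergence ball turning into a norm penalty on the loss, can be illustrated numerically with the Wasserstein variant, where the penalty is (roughly) the norm of the gradient of the loss with respect to the inputs. This is my own hypothetical sketch for a linear model with squared loss, where that gradient has a closed form; it shows the shape of a DRO-as-regularization objective, not the talk's exact bound.

```python
import numpy as np

# Penalized objective illustrating DRO-as-regularization:
#   avg loss  +  eps * avg ||grad_x loss||,
# a rough Wasserstein-DRO-style penalty. For squared loss with a linear
# model, grad_x L = 2 * (w.x - y) * w, so ||grad_x L|| = 2|r| * ||w||.

def penalized_risk(w, X, y, eps):
    residuals = X @ w - y
    avg_loss = np.mean(residuals ** 2)
    grad_norms = 2 * np.abs(residuals) * np.linalg.norm(w)
    return avg_loss + eps * np.mean(grad_norms)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=50)

w = np.array([1.0, -1.0])
# Larger eps -> heavier penalty on sensitivity to input perturbations.
print(penalized_risk(w, X, y, eps=0.0))  # plain empirical risk
print(penalized_risk(w, X, y, eps=0.5))  # risk plus robustness penalty
```

Epsilon, the radius of the ball, reappears here as the regularization coefficient, which is the trade-off discussed again in the Q&A at the end.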
So what does that remind us of? Penalizing some kind of norm of a function as a regularizer, that's like ridge regression; we do something similar there, right? If you look at kernel ridge regression, what we penalize is the norm of the function itself, basically the smoothness of the function that we're estimating, and here we are penalizing the smoothness of the loss of the function. So it's slightly different, but with enough math they can be related. In the specific case where I use a Gaussian kernel and the squared loss, which is maybe the most popular setting for ridge regression, I can show that the kernel ridge regression penalty gives an upper bound on my MMD DRO penalty. What that essentially means is that kernel ridge regression is implicitly also optimizing an MMD robust optimization criterion. That gives a different understanding of kernel ridge regression and a fairly easy proof, via this ball argument, that kernel ridge regression generalizes. We do actually recover standard generalization bounds; they are a bit tighter if you use the right regularizer here, but you recover essentially the same bounds via a very different route. So it gives us a new understanding of this very classical method. We also get a suggestion for a new regularizer: if you look at this norm of the loss of f a bit more closely in this case, you can relate it to the norm of the square of f. So you could also regularize by the norm of the square of f, and that works better in some cases. Here's a toy example comparing the norm of f, which is what ridge regression uses, with the norm of the square. I don't claim it always works better, only in some cases; that's just one example, but it is a reasonable thing to do. So now let me just briefly sketch some other relations of this new MMD distributionally robust optimization.
So the first one is the relation to the chi-square distance that I showed you in the beginning, which, as I said, penalizes the variance of the loss of the function. We can actually also recover that from MMD DRO. If we look not at the entire ball of distributions but just at some of the distributions in it, namely those that have the same support as our sample, so distributions that only ever produce points we have already observed, then we directly obtain a variance regularization criterion. What this means is that MMD DRO implicitly does variance regularization. So we get relations to kernel ridge regression and relations to variance regularization: kernel ridge regression essentially implies MMD distributional robustness, and MMD DRO essentially implies variance regularization, which has also been used for fairness and other criteria. So that was a brief overview of this new way of doing robust optimization using a different popular criterion for measuring divergence between distributions. In the interest of time, I will just briefly mention that there's also other work connecting robust optimization and machine learning that we have been looking at. For example, you can use distributional robustness to do robust optimization in the discrete case: robust subset selection, subset selection under uncertainty, robust network optimization, and also extending this idea to black-box optimization and Bayesian optimization, where the task is to sequentially select which observations you want to make in order to find the optimum of an unknown function as quickly as possible. How can you do that if you know there are perturbations in your data that you need to take care of? So that was all. Thank you very much for your attention. So maybe I have a question.
When we talk about robust optimization, there's also the question of computational efficiency. How can we use this kind of theoretical measure to help us build a practical robust system? Do you have any comment on that? Yeah, so you asked about computational efficiency, and I didn't talk about that. Essentially, what you'll have to do is also look at upper bounds. One way you get a computational method out of this is to go via the route that connects to variance regularization; that's one way, and you essentially get a relaxation of this criterion. Actually Matt, who's a co-author on the paper, is sitting right here. Do you want to say anything? Yeah, I guess for the kernel ridge regression case, even for the norm of f squared, you can get a closed form for it. So you can still do it similarly to the classic kernel ridge regression machinery. Yes. You mentioned that you rely on robust optimization for a variety of robustness properties, so generalization, shifts, and so on. Have you looked at whether it's possible to solve for all those properties at once, or are you still studying them one at a time? Oh, so basically, yes, you can use the same framework, and other people have also looked at connections. I wouldn't say you can solve it all with this MMD DRO; it's case by case. But for example, adversarial training can also be viewed as robust optimization where your uncertainty set is just a little bit different: you allow single data points to be perturbed in certain ways, each one within an epsilon ball, maybe. So that's one example that has been studied. There is a bit of work on this, I would say, but I think it's not complete. The question is how you best express these invariances that you want in the uncertainty set that you have, and that's open for further research, I would say.
On epsilon: does it have to be fixed, or, for points that are around decision boundaries, would you want a smaller epsilon or a larger epsilon? Is it fixed, or does it change with the set? So, if you think about what epsilon actually does, it's a little bit case-by-case dependent. Basically, it defines the radius of the ball. If you look at the transformation to regularization, it's essentially the coefficient in front of your regularizer. So a large epsilon means you're doing more regularization: you're taking larger perturbations into account and you fit your data less well. A smaller epsilon means you're fitting the data more closely. So it depends a little on how close your data is to the true distribution, how much variation you expect: the usual things you have to consider when you think about regularization. It's not one single number that fits all; it's a bit case-by-case dependent.