OK, thanks. First, I want to thank the organizers for giving me the opportunity to present this work. I also want to apologize to the two or three people in the audience who already saw this talk in exactly the same room two months ago. I changed a few things, because when I presented this work two months ago there were only statisticians in the audience, so I made a lot of terrible jokes about people in optimization theory. Today, I removed these jokes.

I have to be honest: I know nothing about optimization theory, and still I wanted to talk about it. The reason why, and it will also explain the reason for this work, is that I come from a different community, the community of people working on aggregation of estimators. In this community there are many theoretical results, and there are also people using Monte Carlo methods. When I talk about our theory or about Monte Carlo methods to people like you, I mean people who know about optimization theory, they usually expect that I'm not able to implement any of the methods I'm talking about. And in some way, it was true. But the purpose of this talk is to prove that some prediction methods in aggregation theory can be approximated using variational approximations, that is, using optimization theory. And even if I don't know a lot about it, you can imagine by yourselves that you can then use very powerful algorithms to implement these methods.

So I'm going to start with a short introduction to aggregation theory, and then give a theoretical analysis of variational approximations, so in some way of optimization theory for aggregation theory. So, first, aggregation theory. It will be a very, very low-level introduction, so you might already know a lot of it, or not.
But I'm sorry, I have to keep the slides at a level that I can understand myself, and that's a kind of challenge. I should not say that, actually, because this talk is being recorded, so some of my students can have a look at it in a few days. Maybe we'll remove the first three minutes before it's published online.

OK, so just a motivation for aggregation theory and, actually, for learning theory. You have a sample, and you want to learn from it, but you don't want to write a likelihood, so you don't want to do traditional statistics. Examples: you have been seeing these examples on the web for 25 years. So you know that you can learn something from these data sets, but you don't want to write the likelihood. To deal with this problem, you have to choose a few ingredients that recur in all the versions of supervised learning.

First, you have observations. I will deal with object-label problems: x is the object, y is the label. I will present results in the batch learning setting and in the online setting as well, but in any case I will stick to the same notation. Then you have to define a set of predictors: linear predictors, kernels, anything you want. I will use this notation: f_theta is indexed by a parameter theta in a set Theta, which can be finite or infinite dimensional. f_theta, of course, is meant to predict y. And then you have a criterion of success. I will define precise notation later, depending on whether we work in the batch setting or in the online setting. In statistics, people tend to use things like that. But I will focus more on prediction-related criteria, like, for example, the out-of-sample prediction accuracy. This one would be for a classification problem. I will deal with more general problems, classification and regression, later.
But you can keep this in mind. So we want a criterion: what is a good prediction, what is a good predictor theta? And finally we will use, in many cases, an empirical approximation of R, which I will denote by small r. For example, you can think of the empirical risk that you want to minimize. So basically, these are all the ingredients that we need to be able to talk about something in learning theory.

In aggregation bounds, the so-called PAC-Bayesian bounds that I'm going to present right now, you need one more ingredient. You know that, in some way, you need some assumption on the set of predictors: you need to control its complexity. Not even in order to do some optimization, but just in order to relate this theoretical criterion of success to this thing that you can observe. To relate one to the other, you usually need an assumption on the set of parameters. In aggregation bounds, you usually replace this by a prior probability distribution on the parameter space. It is used, in some way, to replace a complexity measure on the parameter space like the VC dimension.

All the bounds I'm going to present today look like that. You have a bound on the average prediction risk, and it is upper bounded by a quantity which is a balance. You're probably used to the bias-variance trade-off. In this case, you have a slightly different trade-off. You have an infimum over all possible aggregation distributions, and a balance between two terms. The first term would be the best possible prediction, and for that term alone you wouldn't really want to take a probability distribution: you would just like to take a Dirac mass at the best possible parameter. But on the other hand, you have a variance term, and here it is the Kullback-Leibler divergence between the aggregation distribution and the prior.
So in some sense, it replaces the complexity measure that you have, for example, in VC bounds. As I said before, in order to keep this term as small as possible, you want to choose rho as a distribution that is very spiked and concentrated around the best parameter. But on the other hand, when you concentrate rho around a single parameter, the Kullback-Leibler divergence with respect, for example, to a uniform prior will explode. And how fast will it explode? Usually, that is related to the dimension of the parameter set Theta. So here you have the usual balance between good prediction and complexity.

Obviously, I didn't say anything about what this small o of 1 is. Does it depend on the dimension, on the sample size? I will give precise bounds later, so just accept that. There will be additional terms, but what you need to understand the bound is these two terms.

The other good news is that, depending on the bound, it will hold for a large class of aggregation measures. But very often, the one that we want to choose is this one, because some bounds are valid only for this aggregation distribution. And if you're familiar with Bayesian statistics, you can see that it really looks like a posterior. You have the prior multiplied by something that gives more weight to parameters that make good predictions. So if you see this as a kind of pseudo-likelihood, then you can interpret this as a posterior. If you prefer, you can see it as a kind of smooth version of empirical risk minimization, because it gives more weight to parameters with a small empirical risk. OK? Is it clear?

So you call this rho the aggregation measure?

Yes. You don't like the term aggregation measure?

I'm not familiar with the term, that's why.
I mean, I would call an aggregation measure any measure on the parameter space. Yeah, OK, sorry. I will call aggregation measure any possible rho, because I will replace this one by a more convenient one later. But depending on the paper, you have many different names for this one. In Bayesian statistics, people don't really like it because it's not a likelihood, but still they use it and call it a pseudo-posterior. My PhD advisor, Olivier Catoni, called it a Gibbs measure. And finally, you have exponential weight averaging, because of the exponential weights, which is used a lot as well. So it depends on the paper.

So how do you relate this aggregation measure, if you go to the previous slide, to the standard setup of prediction in machine learning?

It depends on the result that you use. Sometimes, when the risk is convex... Oh, yeah, I'm talking to people doing optimization, so I had to use the buzzword convex at least once. I did, thanks. When you use a convex risk, what you do is simply use Jensen's inequality here and lower bound this by the risk of the aggregated estimator. So in this case, what you want to do is to compute a posterior expectation, okay? On the other hand, when it's not convex, this is still a bound on a procedure, but this procedure is a randomized procedure: each time you are given a new object x, it draws a parameter theta according to this probability distribution, and then you predict y as f_theta(x), okay? And then this is an upper bound on this randomized procedure. So it depends. If the risk is convex, you can relate this to an aggregated parameter; otherwise you have to use a kind of randomized procedure. But in both cases, you have something practical.
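To make the two prediction modes concrete, here is a minimal sketch on a finite set of predictors, where the Gibbs measure reduces to a vector of weights. The function names, the toy risks, and the choice of a finite set are assumptions made for illustration; they do not come from the talk.

```python
import math
import random

def gibbs_weights(emp_risks, prior, lam):
    """Weights proportional to prior[i] * exp(-lam * emp_risks[i])."""
    log_w = [math.log(p) - lam * r for p, r in zip(prior, emp_risks)]
    shift = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def aggregated_prediction(weights, predictions):
    """Convex-loss case: average the predictions under the Gibbs weights."""
    return sum(w * p for w, p in zip(weights, predictions))

def randomized_prediction(weights, predictions, rng=random):
    """General case: draw one predictor from the Gibbs weights and use it."""
    return rng.choices(predictions, weights=weights, k=1)[0]

# Hypothetical toy numbers: three predictors, uniform prior.
emp_risks = [0.10, 0.30, 0.50]
prior = [1 / 3, 1 / 3, 1 / 3]
weights = gibbs_weights(emp_risks, prior, lam=10.0)
predictions_at_new_x = [1.0, 0.0, -1.0]  # f_theta(x) for each predictor
y_aggregated = aggregated_prediction(weights, predictions_at_new_x)
y_randomized = randomized_prediction(weights, predictions_at_new_x)
```

With lam equal to zero the weights are just the prior, and as lam grows they concentrate on the predictor with the smallest empirical risk, which is the "smooth version of empirical risk minimization" reading.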
I mean, practical if you can deal with this, and this is what I'm going to talk about. Okay, so I will present two more precise versions of this bound, depending on the context. The first one is a bound for batch learning. So you have a given sample, IID. Well, the independence part can be removed if you have some assumption on the dependence between the observations, but just for simplicity, yeah.

Can you go back one slide?

Yeah.

There's no small r? The bound is not with small r, it is with big R?

No. I mean, I can provide you, and actually I will later, a bound with the small r here. That would be a kind of empirical bound, one that you can compute on the data, so that you know your probability of error is smaller than zero point something, okay. Here, I'm more interested in this kind of inequality, where even if you cannot compute the bound, you want to be sure that in some way you will be very close to the minimizer of this quantity, okay. But actually both are related.

Your posterior is a minimizer of that one?

Exactly, exactly. I will come back to it later when I present the proof, but it seems that you already guessed how I'm going to prove the result, so yeah.

So: an IID sample from a probability distribution P. About the set of parameters I have nothing to say, it can be anything for the moment. And a risk that can be written as the expectation of a loss function. This kind of bound can be generalized to unbounded loss functions, but for the sake of simplicity I will present one of the weakest versions of these results, for a bounded loss function, okay. Finally, the empirical risk is then obviously defined in this way, and you still need a prior pi, but for the moment I don't provide an explicit form.
And in this case, you have the following result. It's a PAC bound, okay, it's valid with large probability, and it says that the risk of this exponentially weighted aggregation procedure, whether it's randomized or whether you have a convex loss and then use the posterior expectation, is upper bounded by this. This is the term I promised you: the balance between a small value of the theoretical risk and this complexity term. And the remainder that you have is, obviously, a log of one over epsilon, since it holds with large probability, and then this lambda times B squared over n, where I remind you that B is the upper bound on the loss function, okay. This bound is due to Olivier Catoni, but it's based on previous work by John Shawe-Taylor and David McAllester.

It looks like what I promised on the previous slide, but on the other hand maybe it's not very explicit. I'm going to provide many examples later in the talk, but I just want to give one example in a very simple case where the predictor set, so Theta, is a finite set, just to check what this term and this term look like, okay. So assume that you have a finite set of predictors, and I'm going to do something quite simple: I choose the prior as the uniform distribution, okay. Then you have this bound, which is just the same bound as on the previous slide, but if you don't want to calculate something too complicated, you can replace the infimum over all probability measures by the infimum over all Dirac masses, okay.
And then the integral of the risk is, in this case, just the risk of the parameter theta_i, and you can compute the Kullback-Leibler divergence between a Dirac mass and the uniform probability measure on the finite set. I think it's feasible: it's the log of the cardinality of the set. And then remember that this bound depends on the parameter lambda, the one in the posterior, maybe I should come back here, this one here. In case you don't know how to choose it, now you have a way to choose: you can try to optimize the bound, and you obtain that your aggregated estimator performs as well as the best predictor plus this term, square root of log M over n, okay. The optimal choice for the parameter lambda is given here. Without any assumption on the loss function, you cannot improve on this. If you make reasonable assumptions, for example when you use least squares estimation, so the quadratic loss, you can improve on this. There are refined versions of this bound; there are many. In Olivier's books you probably have 156 different versions of this bound, and some of them don't include one parameter lambda but 77 parameters, lambda 1 to lambda 77. But in the end you have bounds, for example for the quadratic loss, without the square root here, which is once again the optimal rate, okay.
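The finite-set computation described above can be written out as follows. This is a sketch only: the slides are not part of the transcript, so the numerical constants of the bound are unknown here and are replaced by placeholder constants $c_1, c_2 > 0$, which are an assumption; $M$ is the cardinality of $\Theta$ and $\pi$ is uniform.

```latex
% KL between a Dirac mass and the uniform prior on M points
\mathrm{KL}(\delta_{\theta_i}, \pi) = \log\frac{1}{\pi(\{\theta_i\})} = \log M .

% Restricting the infimum to Dirac masses (c_1, c_2 are placeholder constants)
\int R \, d\hat\rho_\lambda
  \;\le\; \min_{1 \le i \le M} R(\theta_i)
  + c_1 \frac{\lambda B^2}{n}
  + \frac{c_2}{\lambda}\Big( \log M + \log\frac{1}{\varepsilon} \Big).

% Minimising a\lambda + b/\lambda over \lambda > 0 gives \lambda^* = \sqrt{b/a}, value 2\sqrt{ab}
\lambda^* = \frac{1}{B}\sqrt{\frac{c_2\, n\, \big(\log M + \log(1/\varepsilon)\big)}{c_1}},
\qquad
\int R \, d\hat\rho_{\lambda^*}
  \;\le\; \min_{1 \le i \le M} R(\theta_i)
  + 2B\sqrt{\frac{c_1 c_2 \big(\log M + \log(1/\varepsilon)\big)}{n}} .
```

Whatever the exact constants are, the remainder is of order $B\sqrt{\log(M)/n}$, which is the rate quoted in the talk.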
So it just gives you an idea of what you usually do. When you have a prior and a set of predictors, you try to upper bound this by taking the infimum not over all probability measures, but over a suitable set of probability distributions that you can actually deal with in the computations, and you end up with a prediction bound. If you do the calculations well, you will usually end up with something that is not too far from being optimal.

Okay, I want to present the same bounds but in a different setting, online learning. As promised, I use the same notation, but this time I don't have any assumption, so maybe the uppercase letters are not very well chosen in this case, because the x, y are not meant to be random variables. I deal with the online setting and any possible sequence x_1, y_1, and so on. It can be generated by any algorithm, even by an algorithm that knows which aggregation procedure you are going to use. But anyway, you have the same setting, a set of parameters, and in this case I will focus on the regret, okay. It means that at each step I use x_t and the previous observations to predict y_t by, say, ŷ_t, and then I want to compare my cumulative loss to the cumulative loss of the best possible predictor, okay? I still use the assumption that the loss is bounded. Once again, it's something that you can remove under different assumptions, but I want to present the simplest version of the bound so that you can compare it with the previous one, okay? At each step you still have a kind of proxy for the quality of a predictor, which is the empirical risk up to time t minus one, okay? And the prior: once again I will give a general version, so no assumption on the prior. So here what I propose is to do basically the same, but at each step t, okay?
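The online protocol just set up (predict ŷ_t from x_t and the past, then compare cumulative losses) can be sketched with exponential weights over a finite set of fixed predictors. This is an illustrative sketch under assumptions not stated in the talk: a finite predictor set, a uniform prior, the squared loss on [0, 1], and weights driven by the cumulative (not averaged) past loss. All names and the toy data are hypothetical.

```python
import math

def squared_loss(y_hat, y):
    """Convex loss, bounded by 1 when y_hat and y both lie in [0, 1]."""
    return (y_hat - y) ** 2

def online_ewa(expert_preds, outcomes, lam, loss=squared_loss):
    """Run exponential weights with a uniform prior over M fixed predictors.

    expert_preds[t][i] is the prediction of predictor i at step t.
    Returns (cumulative loss of the aggregate, cumulative losses of predictors).
    """
    n_experts = len(expert_preds[0])
    cum_losses = [0.0] * n_experts  # losses of each predictor up to step t-1
    agg_loss = 0.0
    for preds_t, y_t in zip(expert_preds, outcomes):
        # Weights use only past losses; subtract the min for numerical stability.
        base = min(cum_losses)
        w = [math.exp(-lam * (c - base)) for c in cum_losses]
        total = sum(w)
        y_hat = sum(wi * p for wi, p in zip(w, preds_t)) / total
        agg_loss += loss(y_hat, y_t)
        cum_losses = [c + loss(p, y_t) for c, p in zip(cum_losses, preds_t)]
    return agg_loss, cum_losses

# Hypothetical toy run: three constant predictors, outcomes in [0, 1].
T = 200
outcomes = [0.8 if t % 4 else 0.2 for t in range(T)]
expert_preds = [[0.0, 0.5, 1.0] for _ in range(T)]
lam = math.sqrt(8 * math.log(3) / T)
agg_loss, cum_losses = online_ewa(expert_preds, outcomes, lam)
regret = agg_loss - min(cum_losses)
```

For a convex loss with values in [0, 1], the classical finite-set guarantee of the kind found in Cesa-Bianchi and Lugosi's book bounds this regret by log(M)/lam + lam*T/8, and the value of lam used above is the one that minimizes that expression.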
So at each step t, I define this exponentially weighted aggregate, okay? So the prior multiplied by the exponential of minus lambda times the empirical risk at time t. And then I use as a predictor the aggregated predictor under this pseudo-posterior distribution. Here you can see that I use the convexity of the loss function, so I don't use the randomized procedure, though I could have. And then you have this result. I'm not really sure who was the first person to write it. You have a version in Cesa-Bianchi and Lugosi's book with a discrete parameter space, but I found a very clear explanation of this result and many, many variants in Sébastien Gerchinovitz's PhD thesis, though I don't know if he was the first one to write the result in this form. This time the cumulative loss is smaller than this, and you still have this balance between the best possible aggregation and, once again, this complexity term, the distance with respect to the prior, okay? So you see that in some way things are almost the same. If you specify a prior pi and the parameter set Theta, then you can make this term and this term explicit, and then you can choose an optimal parameter lambda. Basically it's the same, for example, if you use a finite parameter set, so I don't want to do all the calculations again.

Okay, so now the remainder of this talk will be about: are you sure that you can compute this, okay? I think that is quite important. I discovered quite recently that this kind of thing can be important, but still, I think it's quite important. Before that, I have a few remarks to make about these techniques, in order not to upset anyone in the audience.
First, you have many, many other versions of these aggregation bounds. A very famous one is the version by Dalalyan and Tsybakov. The difference is that it's not exactly on prediction, it's on estimation, regression estimation with fixed design, so you cannot really say something about the prediction risk. But on the other hand, they have no boundedness assumption, so that tool is very, very convenient in many settings. And, interestingly, they used in their paper a lemma by Leung and Barron. So you have many related things in aggregation theory and statistics that are not exactly the same kind of bound but are still very much related to this approach, okay?

Something else, for those among you who are statisticians and who know about Bayesian statistics. There is this link I mentioned with Bayesian statistics, if you see this as a pseudo-likelihood, and I call pi the prior because of this link. There have been a few papers recently where people motivate the use of this probability distribution even when you're not doing statistical learning, just when you're doing Bayesian statistics, to use something like prior times pseudo-likelihood. This is just a short parenthesis for statisticians, but for example, if you have Gaussian observations and you try to estimate the mean, then here you will obviously have the Gaussian posterior, so it will be the same thing, and the risk here would be the quadratic loss, okay? But in practice, when you have outliers, you know that the quadratic loss leads to non-robust estimation. So for example, here, they propose to replace it by a robustified loss function, okay?
And in this case, you don't use the likelihood here, but you still have something that looks like this decomposition into prior and pseudo-likelihood. So you might have other reasons to use this kind of pseudo-posterior, even though I believe that the ones I presented before are the best.

Why do you say pseudo-likelihood?

If r is a negative log-likelihood, then it's exactly a likelihood. The reason why I call it pseudo-likelihood is that usually, when you say likelihood, you mean that you describe a parametric distribution on the observations, which is not the case here.

But Bayesians are not allowed to choose lambda.

Bayesians are not allowed to choose lambda. And in all my theoretical results, I choose lambda in a way that would not be acceptable to Bayesians. So I agree, it's not exactly the same thing, even though it's close.

Just one last thing that I really like to mention is that there is a community in Bayesian statistics of people who like to analyze the rate of convergence, the concentration rates of posteriors. It's quite funny, because they use completely different tools, but in the end they usually end up with the same computations. What I mean is that when you compute this bound here, you usually have everything that is needed, just from a technical perspective, even though the proof is different, to use for example this kind of tool to compute posterior concentration rates.

Okay, so sorry for the parenthesis, but now I'm coming back to the main purpose of this talk. The problem is that we want to compute this. And when I say compute, what does it mean to compute a probability distribution? Well, I mean that I want to sample from it or to compute its mean, okay? If you know about computational Bayesian statistics, you know that there are methods to do this. There are Monte Carlo methods, okay?
So for example, in the paper by Arnak and Sacha, this is what they do: they propose the Langevin Monte Carlo algorithm to compute their estimator. In a work I did with Gérard Biau, we used the reversible jump algorithm to compute our estimator. So there were attempts to use Monte Carlo methods for this, but this is where the dark side and the bright side of the force come in. When I present MCMC methods in front of optimization theory people, they usually tell me: ooh, dark side of the force, you shouldn't use this, it's too slow. And it's not only that it's too slow, because in some cases it works well, but that we don't have guarantees on how far we are from this quantity, okay?

I want to say that this is not completely true, and I want to mention these two papers presenting different approaches. This one presents concentration inequalities for Markov chains when you start from a non-stationary probability distribution. So in some way you have a tool to prove concentration of the empirical approximation that you get from an MCMC. On the other hand, it depends on many, many assumptions on the Markov chain, and to me it's not clear how these assumptions are related, for example, to the dimension of the problem. This approach is fine if you have a one-dimensional parameter space, but I don't know how it scales with the dimension of the parameter set. Arnak Dalalyan has a very nice paper where he has the exact scaling with respect to the dimension, for a version of the Langevin Monte Carlo algorithm. I mentioned it as a preprint, but maybe it has been accepted since I wrote this slide. I don't know, but anyway, it's a very nice paper. On the other hand, you have many assumptions on the pseudo-posterior in this paper. So here I want to discuss another possible approach.
So the idea is not to use Monte Carlo methods at all, and to use optimization theory. You convinced me it's the best thing to do. Just for one slide, I will use this notation, which is the usual notation for the posterior in Bayesian statistics. And the idea is just that, if this is not reachable in practice, then let's not try to reach it. Let's try to reach a simpler object. So what we're going to do is not to pay attention to all possible aggregation distributions, but just to a fixed family of aggregation distributions. For example, a parametric family, like all the Gaussian distributions, okay. And then we try to minimize, over this family, the distance between what would be the true posterior or pseudo-posterior, so our objective, and the approximation. This is not our idea. It's very famous in Bayesian statistics, and people have been using it for a long time. I don't know what the seminal paper was for this idea, called variational Bayes approximation, but I learned it in papers by Michael Jordan on applications to graphical models. I think he was one of the first to use this method. I'm not sure, but...

I think mean field, like, some version of it, had been used for a long time.

Okay. I don't know what the name was.

Like in statistical physics.

But when I say who was the first to use this, I mean who was the first in statistics, because if we rediscover something in our own field, we usually don't pay attention to what was done before in other fields. No, no, okay, you're right. Probably you're right. It must have been used in physics much before.

I like the Jordan reference.

No, this is why I used it. No, I mean, it's true, I learned it in this paper myself, but probably this paper provides more references to what was done before. So in the end, the thing is that you use either a non-parametric or a parametric family.
So by this I mean a finite or infinite dimensional family F. The mean field approximation would usually be the case of an infinite dimensional family F. But here we focus on parametric approximations. So you just give a set of probability distributions rho_a indexed by a parameter a in a finite dimensional space. In this case, you completely replace your posterior sampling problem, or your posterior mean problem, by just an optimization problem, okay. So actually it's done. In some sense, I'm now going to present applications. I'm going to try to justify this approach, and to prove, using the previous theory, that in some cases you don't lose a lot in your prediction ability. But don't expect me to provide, for example, optimal optimization algorithms here. Okay, it's not my job. I'm trying to learn that thanks to your talks yesterday and today, but I'm not the one who can give the best possible algorithm in each case. I just want to show you that, even if you use aggregation theory, the problem can boil down to an optimization problem which you can solve, okay.

So our first question was: do we have any theoretical guarantees on the approximation? I will present a result, but first a few explanations. This is what we target, this is what I want to compute: I want to minimize the distance, in terms of Kullback-Leibler divergence, between the approximation and this pseudo-posterior. And the thing is that, and this is exactly what you said before, Francis, when you write it out, it boils down to the aggregated version of the empirical risk, this time, plus the Kullback-Leibler divergence to the prior. You also have a remainder term, but it does not depend on a, so I can forget about it when I'm doing the minimization, okay.
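The decomposition just described can be checked in two lines. The notation is an assumption reconstructed from the talk's description, since the slides are not in the transcript: $\hat\rho_\lambda$ is the pseudo-posterior with density proportional to $e^{-\lambda r}$ with respect to the prior $\pi$, and $\rho_a$ is a member of the approximating family.

```latex
% Pseudo-posterior and its normalising constant
\frac{d\hat\rho_\lambda}{d\pi}(\theta) = \frac{e^{-\lambda r(\theta)}}{Z_\lambda},
\qquad
Z_\lambda = \int e^{-\lambda r(\theta)} \, \pi(d\theta).

% Expanding the divergence to the pseudo-posterior
\mathrm{KL}(\rho_a, \hat\rho_\lambda)
  = \int \log\frac{d\rho_a}{d\pi} \, d\rho_a
    + \int \log\frac{d\pi}{d\hat\rho_\lambda} \, d\rho_a
  = \lambda \int r \, d\rho_a + \mathrm{KL}(\rho_a, \pi) + \log Z_\lambda .
```

Since $\log Z_\lambda$ does not depend on $a$, minimising $\mathrm{KL}(\rho_a, \hat\rho_\lambda)$ over $a$ is the same as minimising $\lambda \int r \, d\rho_a + \mathrm{KL}(\rho_a, \pi)$.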
So what I'm going to do is first to minimize this with respect to a, then define the minimizer as my estimator, and finally my aggregation distribution, if it's necessary to compute it, is rho with that parameter plugged in.

So I'm a bit confused about why you want to approximate this one instead of just the bound that you actually used to derive this one.

Actually, it's the same.

Okay, so it doesn't change?

You will see that the bound that I had, sorry, where is it? And this is actually why we had the idea to do this and to look at variational algorithms. The bound is too far away now, on the first slides, but the proof of the bound that we had on the aggregation procedure is based on the fact that we minimize this bound with respect to all probability distributions, while if you just minimize over a parametric set, it still gives you a theoretical guarantee on what you get in the end.

So you're saying that the KL between distributions was the right measure of distance to minimize the bound, that's what you're saying? The point is, you had a bound which was giving you the actual performance you cared about, which had nothing to do with KL in general, it's just the risk. And so you just want to find a new approximate aggregation measure which has good risk. So why minimize the KL?

Okay, so I have two answers for this. First, some people did it before, and if you try to replace the KL by another metric on probability distributions, things might be harder from a computational point of view. The good point with KL is that in the end you can compute it explicitly, so you know what you want to minimize. If you replace it, for example, by the total variation distance, I'm not sure that you will be able to say anything about how to minimize it.
So the reason why we were interested in this is that the minimizer of this quantity plugs in very easily into the analysis that I presented before, okay? So, these PAC-Bayesian bounds that were already known, where you relate the prediction risk to this balance between aggregation and distance, I mean Kullback-Leibler distance, with respect to the prior: you can plug this into that analysis and get a theoretical guarantee on your approximation. You see what I mean? Or no?

Your bound is essentially that one, with...

Exactly.

...small r replaced by capital R.

Exactly. You just guessed the next slide. So I'm going to show the next slide.

I read your paper also. I'm not guessing.

No, no, no. My explanation was so good that you guessed.

I know.

Okay. So in some sense, I wanted to cite the paper, obviously, but I shouldn't be proud of this result, because it's just weakening a result that existed before. What I'm more proud of is what follows, the fact that it leads to a practical procedure for aggregation, okay? And, okay, I forgot to mention my co-authors at the beginning of the talk, so now it's time: it's a paper that we wrote with Nicolas Chopin and James Ridgway. James was a PhD student of Nicolas at ENSAE, but now he's doing a postdoc in Bristol, okay? And it tells you what happens if you use this variational approximation, so rho tilde. It's not the rho hat that I had before, sorry about the mess: rho tilde is this one, the one where you plug in the minimizer of this criterion, okay? And then, instead of having the best balance between risk and KL over all probability measures, you have it only over the family indexed by A, okay? Only within the parametric approximation.
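In symbols, the guarantee described above has the following shape. This is a sketch only: the constant $c$ and the remainder $\Delta(\lambda, n, \varepsilon)$ are placeholders, because the exact statement is on a slide that is not in the transcript, and $\{\rho_a : a \in \mathcal{A}\}$ denotes the parametric family.

```latex
% Variational approximation: plug in the minimiser of the tractable criterion
\tilde\rho_\lambda = \rho_{\hat a},
\qquad
\hat a \in \arg\min_{a \in \mathcal{A}}
  \Big\{ \lambda \int r \, d\rho_a + \mathrm{KL}(\rho_a, \pi) \Big\}.

% Shape of the guarantee, with probability at least 1 - \varepsilon
\int R \, d\tilde\rho_\lambda
  \;\le\; \inf_{a \in \mathcal{A}}
  \Big\{ \int R \, d\rho_a + \frac{c}{\lambda}\, \mathrm{KL}(\rho_a, \pi) \Big\}
  + \Delta(\lambda, n, \varepsilon).
```

The only change with respect to the bound for the exact pseudo-posterior is that the infimum runs over the family $\mathcal{A}$ instead of over all probability measures, so the price of the approximation is how well that family can balance risk and divergence to the prior.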
Obviously you can do it for non-parametric approximations like mean field, actually we did it in the paper, but as this is only a three-hour talk, I just wanted to present the short version, okay? So the thing is that obviously the work is not done then. I mean, you have a bound, it is in an abstract form. The question is, I mean, do we have, and actually obviously if you take here a very poor family A, it's possible that this bound is very large, okay? So the thing is now, it does not give you a way to prove that VB, a variational Bayes approximation, always works. It provides you a way of checking when it works. I mean, you have this bound, you compute this bound in your model, and if you get something that is good, then it means that if you use VB, you will not lose anything in terms of accuracy when you use it for prediction. On the other hand, if you compute this bound and you find something large, it just tells you nothing, okay? So it's just a tool to try to make sure that the VB approximation makes sense. The same as with variational inference. If it works, you're happy, if it doesn't, you don't know what kind of concept it is. Yeah, yeah, yeah, yeah, but in some way, it kind of gives you a theoretical guarantee that at least in some settings, it should work, okay? So I prepared the proof of these results. So I mean, I'm not sure, maybe I should first present the applications and then go back to the proof if I have time later. So, okay, so I want to apply this to a linear classification problem, okay? So in the batch setting, actually. So I come back to this setting, okay? So you have a sample, IID from P, sorry. Then this time, classifiers are linear classifiers, okay? So you just compute the scalar product between the object and a parameter and then you check on what side of the hyperplane your object is.
The risk is actually then in this case the classification risk, the probability to make a prediction error, and I want to approximate it by the empirical risk. So as, thanks to you, I learned a lot about optimization theory, I can try to minimize this. So I just have to compute the gradient of this quantity and set it to zero, I think. It always works in order to make approximation and to make optimization. But I mean, in this case, I was not able to compute a sensible gradient. So actually I decided to use our approach with aggregation. So I use a Gaussian prior just because it was simple for calculations, actually I will explain what can be changed if you replace this by another prior if I have time. But for the moment, just stick to a Gaussian prior. And then actually, so what is going to be the posterior? I mean, the posterior is the prior multiplied by e to the minus this, okay? So it's not very nice, and the idea was just to approximate it by a Gaussian distribution, okay? Which must be quite a lot easier, and in some sense then you just have to optimize with respect to mu and with respect to sigma, and you hope that in the end you will get something sensible. So first, what is the optimization criterion? In this case, we wanted to write that down explicitly, and in the end what you obtain is this. So actually phi is the CDF of the Gaussian distribution, and you obtain something that obviously, well, actually this looks like a slightly modified version of the empirical risk, a smooth version of the empirical risk. It's due to the integration with respect to the posterior, to the aggregation distribution. Then you have the mu squared term, so it's like a ridge penalty, and then you have also this penalty depending on the, sorry, on the covariance matrix, okay? So the problem is that, well, in some way it looks better than this one in the sense that this is a smooth minimization program while this one was not smooth.
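The criterion described here (smoothed empirical risk, ridge-like term, covariance penalty) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the approximation is restricted to a diagonal covariance, the prior is taken as N(0, `prior_var` * I), the criterion is scaled as `lam` times the averaged risk plus the KL term, and all function and parameter names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """CDF of the standard Gaussian, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kl_diag_gaussian_to_prior(mu, sigma2, prior_var):
    """KL( N(mu, diag(sigma2)) || N(0, prior_var * I) )."""
    d = len(mu)
    trace_term = sum(sigma2) / prior_var            # penalty on the variances
    mean_term = sum(m * m for m in mu) / prior_var  # ridge-like penalty on mu
    log_det_term = d * math.log(prior_var) - sum(math.log(s) for s in sigma2)
    return 0.5 * (trace_term + mean_term - d + log_det_term)


def smoothed_01_risk(mu, sigma2, X, y):
    """Average of P(y_i * <theta, x_i> < 0) for theta ~ N(mu, diag(sigma2)).

    Assumes labels in {-1, +1} and that no x_i is the zero vector.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        mean = y_i * sum(m * x for m, x in zip(mu, x_i))
        std = math.sqrt(sum(s * x * x for s, x in zip(sigma2, x_i)))
        total += std_normal_cdf(-mean / std)
    return total / len(y)


def vb_objective(mu, sigma2, X, y, lam, prior_var):
    """lam * (integrated empirical 0-1 risk) + KL to the Gaussian prior."""
    risk = smoothed_01_risk(mu, sigma2, X, y)
    return lam * risk + kl_diag_gaussian_to_prior(mu, sigma2, prior_var)


if __name__ == "__main__":
    X = [[1.0, 2.0], [-1.0, 0.5]]
    y = [1, -1]
    print(vb_objective([0.5, 0.5], [1.0, 1.0], X, y, lam=10.0, prior_var=1.0))
```

The 0-1 loss becomes smooth after integration because each term is a Gaussian CDF, but nothing in this form makes the criterion convex in `mu` and `sigma2`.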
On the other hand, it's still not very good in the sense that it is not convex, and I heard in the talk yesterday that you have to use the word convex a lot when you talk about optimization. This is another thing about optimization that I know, so it's not so good. Even so, we tried to optimize it, I mean, in small dimension you can still do it using, for example, gradient descent, but at different scales, so using deterministic annealing. And, yeah, sorry, first I should show you the results maybe before the theoretical analysis. But, okay, so what we did here, we took seven data sets from the machine learning repository. We used sequential Monte Carlo, which, I mean, it's a Monte Carlo method that works well, at least when the dimension is not too large, you know that it will give you a kind of benchmark result. We compared it with our variational approach in this case, and then just as, how can I say? We used non-linear SVM to compare to our linear methods just in, I mean, I didn't know why we did that exactly, but the reason was that we wanted to use another method and check for each data set whether, I mean, it's sensible in some way to analyze it using a linear method or not. I mean, for example, in this case, you see that the two linear methods, well, they don't perform so well, and in this case, it's clear that you should use a non-linear method. On the other hand, in this data set, well, it seems that using a non-parametric method like SVM does not bring you a lot when compared to linear methods, so in this case, linear classification is sensible. And you see that in many examples, actually, not in all the examples, not here, for example, but in many examples, we have actually a slightly better performance, which is due to the fact, yes? What is SMC? Sequential Monte Carlo. Yeah, right. What, on which, which, on which model, same model? Oh yeah, on the same model, yeah, sorry. So actually, it's the approximation of the usual pseudo posterior using this, okay.
So actually, we just tried to do what we would have done before we learned about optimization, exactly. It's not MCMC actually, SMC, it's a variant because actually, it's this algorithm where you generate points from the prior and then you just eliminate those that have a poor posterior, but the ones that have a large posterior, you duplicate them and then apply to each of them, exactly. Oh, it's another name for particle filter. Exactly, exactly, it's, yeah. So the two columns are the same? The same estimator computed in two different ways. And the one that I'm trying to sell today is this one, and this is the old-fashioned one. Like, VB is superior to Monte Carlo? I'm not saying that it is superior. You just said that like one minute ago. Okay, I said that I'm trying to sell VB today. So obviously, I mean, I chose the seven data sets that prove that I'm right, and I just added this one because I know that you, I mean, in a simulation study, you don't have to be right 100% of the time. No, I mean. Why do you say aggregation methods are linear if you're aggregating different linear classifiers anyway? That's nonlinear, no? Yes, you're right. That's nonlinear in the end, but it uses a linear set of predictors. Yeah, but in the end, it's a nonlinear predictor. You're right, okay. But even so, okay, my objective in the end, and using this bound that I will present later, I mean, I might be much, much better than the best linear classifier, but actually my theoretical analysis just says that I do as well as the best linear classifier. I will come back to this later. So actually in some way, this was my objective. My objective when defining a set of linear predictors was to do as well as the best of them. It's true that when you aggregate, you can do much better than this, but even so, this is not what we try to do in this analysis, okay, so. And actually in this case, obviously, I mean, this improvement might be due to the clear nonlinearity of this.
But I mean, for example, in this case where clearly it seems that there is a very good linear classifier, you still have an improvement, which was due, from what we observed actually, to the fact that, well, obviously, we stopped SMC after some time and you're not sure that it converged, while actually we stopped the gradient algorithm when the gradient was exactly equal to zero, and so we were sure that it converged. Exactly equal? Yeah. Tulien, up to 0.5 decimals. Yes. But didn't you say it was not convex? Yeah, no, no, you're right. I mean, it's not convex, and I will discuss it. Promise. I mean, I pointed this out, exactly. It's not convex, so I pointed this out. We don't have guarantees on the minimizers, but I will come back to this later. I have another one later, actually, so. But first, okay, before this, so we have these results that seem, well, at least promising, even though we have to say more, yeah? So this is the final performance. Could you plot as well the bound that you obtained? Because for me, PAC-Bayesian is also used to get, like, overall confidence bounds on the predictions. Could you plot them? Did you plot them? You see there were like, many full of. No, I mean, sorry, I don't have the plots here anyway, but I mean, you're right. And once again, I will come back to this later when I will discuss the proof. Just one thing now. Actually, if we use the theorem that I presented before, so what, sorry, where is it? Here, this one, and we replace this set A by what I said, okay, so the Gaussian approximations over the set of linear predictors. Then we obtain here this result. Okay, this result. So what does it tell you? Well, there is an assumption, and we'll come back to this in one minute, but first it tells you something about the risk of your aggregated method. And actually, you see that here, I use the randomized version, okay?
So you're right, I mean, it can be, if you're lucky, it can be better than linear, but anyway, what I know, at least for sure, is that I do as well as the best possible linear predictor plus this rate, square root of d over n, which is actually, once again, in classification, the best you can do. Well, there is an additional log term, which can be removed at the cost of a very, very exhausting analysis, but in some cases, it can be removed, actually, using PAC-Bayesian bounds, but in the simpler version, you have this log n term, okay? And yes, I just wanted to mention this, so you have a kind of regularity assumption in some sense, which tells you that when you slightly change theta, the real risk doesn't jump a lot, which is obviously sensible in the sense that you replace a point estimator by an aggregated estimator that takes a neighborhood, and obviously it cannot work, I mean, if you use a Gaussian aggregation, it cannot work if you don't have some kind of regularity. So you have this assumption, and if you don't have this assumption, we are not sure, actually, about what happens, okay? I want to mention that this assumption is not necessary for the PAC-Bayesian analysis of the exponentially weighted aggregate if you're not doing VB. In this case, we have this assumption, but under this assumption, we know that the method will work. And actually, how do I prove this result? And this is the opportunity, actually, to show you another application of the trade-off between the risk and the KL term, okay? So actually, what we do is that we just apply the theorem, so the risk of the aggregated predictor is actually as good as the infimum over all the possible Gaussian distributions. And actually, for the sake of simplicity, I used just the ones with a diagonal covariance matrix, but actually, I mean, you can change it, it will improve on the constants, but this makes life much, much, much simpler, okay?
And the idea, I mean, I don't derive all the calculations, but first, you know the Kullback divergence between two Gaussians, so in this case, it's this term, and you see that here you have the m that is the dimension, so, sorry, here, it's m, and in the previous slide, it was d, but it's the same. So you have this term here, and then the other term, actually, this is where this kind of smoothness comes into account, so actually, the aggregated risk is almost as good as the risk of the mean of the distribution plus some remainder, okay? And you just optimize with respect to everything, with respect to this s squared here, and then with respect to lambda, and you get, in the end, the optimal bound, okay? So this is how it works, okay? I still have 15 minutes. And how does this differ from the full Gibbs posterior? The bound? Okay, actually, so in most of the papers that I know, when people wanted to compute the bound for the full Gibbs posterior, they have here the infimum, and the first step of the bound was to replace the infimum over all probability distributions by the infimum over a reasonable parametric set anyway. So, I mean, the bound is the same. Okay, so when you apply the bound, you do this anyway? Exactly, so this is where, actually, we had this idea. I mean, it was frustrating in some way to use a VB approximation in the bound, but not for the estimator, and actually, obviously, you can do it. So this is what people do. I mean, in the beginning, they were not actually using Gaussians, but rather something like a uniform distribution on a ball around the best parameter, which actually is a good idea because it allows you to use a PAC-Bayesian bound to prove bounds on the empirical risk minimizer as well. But usually, I mean, as long as it's possible to change the mean, to take whatever mean you want, and to change the scale of the distribution, you can basically get what you want. Okay, yeah, sorry.
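The calculation described here can be sketched as follows. This is an illustration under assumptions, not the slide itself: the Gaussian family is taken to be $N(m, s^2 I_d)$, the prior $N(0, \vartheta I_d)$, the smoothness assumption is written with a hypothetical constant $c$, and the concentration term is written generically as $C\lambda/n$.

```latex
% Kullback-Leibler divergence between N(m, s^2 I_d) and the prior N(0, \vartheta I_d):
\mathcal{K}\big(N(m, s^2 I_d), N(0, \vartheta I_d)\big)
  = \frac{\|m\|^2}{2\vartheta}
  + \frac{d}{2}\left[ \frac{s^2}{\vartheta} - 1 + \log\frac{\vartheta}{s^2} \right]

% Smoothness (assumed form): averaging around m costs little when s is small.
\int R \, dN(m, s^2 I_d) \le R(m) + c \, s \sqrt{d}

% Plugging both into the general bound gives, for every m and s,
R(m) + c \, s \sqrt{d}
  + \frac{2}{\lambda}\left[ \frac{\|m\|^2}{2\vartheta}
  + \frac{d}{2}\left( \frac{s^2}{\vartheta} - 1 + \log\frac{\vartheta}{s^2} \right)
  + \log\frac{2}{\varepsilon} \right]
  + C \, \frac{\lambda}{n}

% With s^2 = 1/n and \lambda = \sqrt{n d}, every term beyond R(m) is at most of order
\sqrt{\frac{d}{n}} \, \log n
```

The $\log(\vartheta / s^2)$ term is what produces the extra logarithm once $s^2$ is of order $1/n$.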
So, as I told you before, it works well, but the problem is that, obviously, it's not convex, and so, as it's not convex, you're not convinced. It's one of the jokes I prepared yesterday. Sorry. Okay, so, obviously, you know this. Now, you're doing classification. You want to replace the zero-one loss by a convex surrogate. So, for example, you can use this paper by Tong Zhang, in which you have many, many theorems about all the possible replacements for the zero-one loss and what you lose in terms of rate of convergence. So, for example, here, we wanted to use the hinge loss. So, the one that is used for support vector machines. So, we define now our risk in this way and we have this empirical risk, and we still use a Gaussian approximation for which this time, and this is important for the analysis, actually, this time, we fix the Gaussian approximation with just a single variance, okay? I mean, all the coordinates have the same variance, and the covariance matrix is a diagonal one. In this case, we did the calculations and I mean, the calculations are not very difficult, but what is maybe not obvious when you see it is that this is a convex criterion, okay? So, you have still the CDF of the Gaussian distribution. Here, you have the density function of this distribution. So, here in the end, you have two terms that still look like a kind of, actually, it's just a convex surrogate. It's a new convex surrogate, and Sylvain pointed this out to me last time, but this is just a new convex surrogate of the zero-one loss function, but this is a new one for which you have another guarantee, okay? So, this will be the empirical risk. You still have this kind of ridge term here, okay? So, the penalty that is the square of the parameter mu, which is due actually to the Gaussian prior, and then the penalty on the variance. And, well, it's not obvious.
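The two terms mentioned here, one with the Gaussian CDF and one with the Gaussian density, come from integrating the hinge loss against an isotropic Gaussian. A minimal sketch follows, assuming theta ~ N(mu, sigma^2 I), labels in {-1, +1}, a prior N(0, `prior_var` * I), and a criterion scaled as `lam` times the averaged loss plus the KL term; the function names are hypothetical.

```python
import math


def std_normal_pdf(z):
    """Density of the standard Gaussian."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """CDF of the standard Gaussian, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_hinge(mu, sigma, x, y):
    """E[max(0, 1 - y * <theta, x>)] for theta ~ N(mu, sigma^2 * I).

    The quantity 1 - y * <theta, x> is Gaussian with mean `gap` and standard
    deviation `scale`, and E[max(0, W)] = gap * Phi(gap / scale)
    + scale * phi(gap / scale). Assumes x is not the zero vector.
    """
    gap = 1.0 - y * sum(m * xi for m, xi in zip(mu, x))
    scale = sigma * math.sqrt(sum(xi * xi for xi in x))
    z = gap / scale
    return gap * std_normal_cdf(z) + scale * std_normal_pdf(z)


def hinge_vb_objective(mu, sigma, X, Y, lam, prior_var):
    """lam * (integrated empirical hinge risk) + KL(N(mu, sigma^2 I) || prior)."""
    d = len(mu)
    risk = sum(expected_hinge(mu, sigma, x, y) for x, y in zip(X, Y)) / len(Y)
    ridge = sum(m * m for m in mu) / (2.0 * prior_var)
    var_penalty = 0.5 * d * (sigma ** 2 / prior_var - 1.0
                             + math.log(prior_var / sigma ** 2))
    return lam * risk + ridge + var_penalty


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, -1.0]]
    Y = [1, -1]
    print(hinge_vb_objective([0.2, -0.3], 0.5, X, Y, lam=20.0, prior_var=1.0))
```

By Jensen's inequality the integrated loss is never below the hinge loss at `mu`, so it remains an upper bound on that surrogate.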
I mean, for example, we tried for fun to replace here the zero-one loss by other convexified loss functions, and then when you integrate with respect to the posterior, you don't necessarily get a convex criterion, okay? And, actually, what can be, I mean, usually it's convex with respect to mu, but what can be painful is the parameter sigma, okay? So, in this case, we are lucky, it's convex. Convex in sigma? Yeah. Can you check that? Just with a second derivative. We did the computation and it works. Okay, you have the, you read the paper. No, no, no, no, not like that point. Okay. I know it's not obvious, and once again, it's not something like, it's not because you take a convex loss and then you integrate that you will obtain a convex function, but in this case, it works, okay? And then in this case, so this is where I told you that I don't, I mean, I took many precautions about that. I'm not a specialist of optimization theory. Well, in this case, there's not a lot that you can do because it's convex, but it's not really more than convex. You don't have many, many good properties.
For example, you can, if you make some assumptions on sigma, like for example, if you prevent sigma from going too close to zero, then you can do better things, but in this case, we just took a ball in the parameter set for mu and sigma and then used the gradient algorithm, but what we were proud of is that even though we don't know a lot about optimization theory, we were able to write this theorem, which tells you basically, under the same assumption as the previous one, that you have, I mean, at each step, okay, at each step of your gradient algorithm, you compute an aggregated distribution, so you have a mu at step k and a sigma at step k, and then the risk according to this procedure, so the risk according to what comes out of the computer and not what comes out of the paper, is actually as good as the best possible risk for a linear classifier, plus the minimax bound for the classification problem, square root of d over n, and plus something that depends, well, probably badly for people who are good in optimization theory, but still something that depends explicitly on the number of steps that you have, okay? So obviously, I mean, what I wanted to present to you, and this is why I went into the details more than into a good bound in the end, is that obviously you can play with this, you can change the hypothesis, you can change the parameter class, sorry, the predictor class, you can change the prior, and then in the end, you can change the algorithm, and I'm sure that you will be able to improve on this bound. But still now, you see that using PAC-Bayesian bounds, it's possible to provide guarantees on the prediction that you have for variational approximations, and it's possible to reduce the problem to a convex optimization problem in some cases, okay?
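The procedure described here, a gradient algorithm restricted to a ball in the (mu, sigma) parameter space, can be sketched generically as projected gradient descent. This is a minimal illustration and not the algorithm analyzed in the paper: the step size is constant, `point` is assumed to stack mu and sigma into one vector, and the function names are hypothetical.

```python
import math


def project_onto_ball(point, radius):
    """Euclidean projection of `point` onto the ball of given radius around 0."""
    norm = math.sqrt(sum(p * p for p in point))
    if norm <= radius:
        return list(point)
    return [p * radius / norm for p in point]


def projected_gradient_descent(grad, start, radius, step, n_steps):
    """Run n_steps of: gradient step, then projection back onto the ball.

    `grad` maps a parameter vector to the gradient of the criterion there.
    Returns every iterate, so each step k has its own (mu_k, sigma_k).
    """
    current = project_onto_ball(start, radius)
    iterates = [current]
    for _ in range(n_steps):
        g = grad(current)
        moved = [c - step * gi for c, gi in zip(current, g)]
        current = project_onto_ball(moved, radius)
        iterates.append(current)
    return iterates


if __name__ == "__main__":
    # Toy convex criterion: squared distance to a target outside the ball.
    target = [3.0, 4.0]

    def grad(v):
        return [2.0 * (vi - ti) for vi, ti in zip(v, target)]

    path = projected_gradient_descent(grad, [0.0, 0.0], radius=1.0,
                                      step=0.1, n_steps=50)
    print(path[-1])
```

Keeping every iterate matches the way the guarantee is phrased in the talk, as a statement about what the algorithm has produced after a given number of steps rather than about the exact minimizer.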
I just wanted to advertise for something that, I mean, I feel free to advertise for it because it's not mine actually, just after we submitted this paper, James Ridgway, so one of my co-authors, decided to write a package which, after a very, very long peer review process, is now available on our website, and well, I mean, I don't want to, I mean, it works very simply so I don't want to enter the details, but it's just that, I mean, you just enter obviously the vector of labels and the matrix of objects, but the thing that I wanted to show is that even though I didn't make the plot, actually you can get the bound, okay? So for example, in this case, you know that with probability at least 99%, the probability of error is smaller than 0.79, which is not so good. This is the bound on the hinge loss. In this case, yes, yes, yes, yes, this is the bound on the hinge loss. Yeah, yeah, yeah, yeah, but yeah, you're right. This is an empirical version of the bound on the hinge loss. Even though, yeah, yeah, okay, you're right. Even though it's not very, very good, I mean, we know that these bounds are pessimistic, and especially when you use it in a problem where you clearly have a linear classifier, then the square root of d over n rate is not optimal. You can replace it by d over n when you have a margin condition or something. So this bound is usually not so good, but I mean, we are happy with the fact that minimizing this bound provides a good estimator, okay? You can play, I mean, in some cases, it's possible in the end to get something smaller than one half. You have to wait for a long time, but it's possible. Do I have time for the proofs maybe? Five minutes, yeah, okay. So I come back to the main result. I mean, the proof is quite obvious, but I want to, I mean, it will be the opportunity to mention this empirical bound, which unfortunately I did not plot, but, okay, so sorry, I want to prove this theorem.
So the fact that your approximation, your VB approximation, performs as well as the best possible approximation in the parametric family A, okay? And then we start with Hoeffding's inequality, okay? So you just have a bound on the exponential moment of capital R minus small r. And then, well, I just rewrote it introducing my probability epsilon, okay? And you integrate it. So this is the change, I mean, with respect to Vapnik-type bounds, that you integrate with respect to the prior, which does not change anything for epsilon, okay? Because it's a constant. And then, so this is very standard, but then you get actually to this point, okay? And then you use this lemma, I mean, the fact that you can compute the convex conjugate of the KL divergence. And what's more, and I will use it later, you know the distribution rho that reaches the minimum here, okay? And the supremum, sorry, here. So actually when you use this lemma, you get this kind of uniformized version of Hoeffding's inequality, but here again, as I integrated with respect to the prior, the difference with what you have in a Vapnik-type analysis is that in Vapnik's analysis you have a sup with respect to the parameters theta, while here you have a sup with respect to all the probability measures on the parameter set theta, okay? And then finally you use Markov's inequality and it gives you this empirical bound. And I wanted to insist on this because actually Francis mentioned it many, many times, but here, you know that it's not only for the minimizer, it's for all possible probability distributions rho, all possible aggregation distributions, and then the risk of the aggregated procedure is smaller than something that is completely empirical. I mean, it depends only on things that you know, the risk, the empirical risk, so it depends on the sample, and then the parameter lambda that you choose, and then the Kullback divergence with respect to the prior that you choose as well.
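The chain of steps described here (Hoeffding, integration against the prior, the duality lemma for the KL divergence, then Markov) can be written out as follows. This is a sketch under assumptions: the loss is taken in $[0,1]$ with an i.i.d. sample, which gives the constant $1/(8n)$; the constants on the slides may differ.

```latex
% Step 1 (Hoeffding): for every fixed \theta and every \lambda > 0,
\mathbb{E} \exp\left\{ \lambda \big( R(\theta) - r(\theta) \big) \right\}
  \le \exp\left( \frac{\lambda^2}{8n} \right)

% Step 2 (introduce \varepsilon, integrate w.r.t. the prior \pi, exchange by Fubini):
\mathbb{E} \int \exp\left\{ \lambda \big( R(\theta) - r(\theta) \big)
  - \frac{\lambda^2}{8n} - \log\frac{1}{\varepsilon} \right\} \pi(d\theta)
  \le \varepsilon

% Step 3 (duality lemma): for any measurable h with \int e^{h} d\pi < \infty,
\log \int e^{h} \, d\pi
  = \sup_{\rho} \left\{ \int h \, d\rho - \mathcal{K}(\rho, \pi) \right\},
% with the supremum reached at \rho(d\theta) \propto e^{h(\theta)} \pi(d\theta). Hence
\mathbb{E} \exp \sup_{\rho} \left\{ \lambda \int (R - r) \, d\rho
  - \mathcal{K}(\rho, \pi) - \frac{\lambda^2}{8n} - \log\frac{1}{\varepsilon} \right\}
  \le \varepsilon

% Step 4 (Markov): with probability at least 1 - \varepsilon,
% simultaneously for all probability distributions \rho,
\int R \, d\rho
  \le \int r \, d\rho + \frac{\lambda}{8n}
  + \frac{\mathcal{K}(\rho, \pi) + \log(1/\varepsilon)}{\lambda}
```

Every term on the right-hand side of the last line is computable from the sample, $\lambda$ and the prior, which is what makes the bound empirical.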
So actually you can compute this bound numerically, and this is the bound that is provided as well by the package, okay? So actually, I mentioned in the beginning of the talk the origins of PAC-Bayesian bounds, like McAllester's work, he focused mainly on this kind of bound because he thinks that's what's important in the end, to be sure that your classifier, with probability 99%, has an error that is smaller than 0.1, okay? On the other hand, even if you want to prove something, a theoretical bound, something that depends on the true prediction risk, you have to derive this as a tool anyway, so this is quite important, yes? In that one, are you allowed to optimize the lambda? No, you're not allowed to optimize on lambda in this bound, okay? On the other hand, I mean you're allowed to, in some way you're allowed to optimize on lambda, but for a lambda that does not depend on the sample, okay? On the other hand, what you can do is to use the union bound obviously for lambda on a grid, and then actually using the fact that this is increasing and this is decreasing with respect to lambda, you can even optimize in an interval, and this is what people do actually, well. I mean usually they provide a more sophisticated version of this bound where you have the infimum with respect to lambda, even though it's not the thing that works in practice, I mean in practice, how to choose lambda is the main problem that we still have to solve, because the minimization, even if you would minimize this bound with respect to lambda, usually you don't get the best possible lambda, and cross-validation for example is much better, but obviously it's much, much more expensive. Okay, so you have this empirical bound, and actually the thing is that the reason why people use this exponentially weighted aggregate that I introduced in the beginning is that actually this is the one that minimizes the right-hand side, okay?
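The union bound over a grid of lambda values can be illustrated on a finite set of predictors. This is a minimal sketch under assumptions, not the package's code: the loss is taken in [0, 1] so that the Hoeffding term is lam / (8 n), the confidence level eps is split evenly over the grid, and the function names are hypothetical.

```python
import math


def gibbs_posterior(emp_risks, prior, lam):
    """Exponentially weighted aggregate: rho(theta) proportional to
    prior(theta) * exp(-lam * r(theta)), on a finite parameter set."""
    r_min = min(emp_risks)  # shift for numerical stability
    weights = [p * math.exp(-lam * (r - r_min)) for p, r in zip(prior, emp_risks)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(rho, prior):
    """KL(rho || prior) for two distributions on the same finite set."""
    return sum(q * math.log(q / p) for q, p in zip(rho, prior) if q > 0.0)


def empirical_bound(emp_risks, prior, rho, lam, n, eps):
    """Right-hand side of the empirical bound for a fixed lam:
    integrated empirical risk + lam/(8n) + (KL + log(1/eps)) / lam."""
    avg_risk = sum(q * r for q, r in zip(rho, emp_risks))
    penalty = (kl_divergence(rho, prior) + math.log(1.0 / eps)) / lam
    return avg_risk + lam / (8.0 * n) + penalty


def best_bound_on_grid(emp_risks, prior, lam_grid, n, eps):
    """Union bound: each lam in the grid gets confidence eps / len(lam_grid),
    so the smallest of the resulting bounds holds with probability >= 1 - eps."""
    eps_each = eps / len(lam_grid)
    best = None
    for lam in lam_grid:
        rho = gibbs_posterior(emp_risks, prior, lam)
        value = empirical_bound(emp_risks, prior, rho, lam, n, eps_each)
        if best is None or value < best[0]:
            best = (value, lam)
    return best


if __name__ == "__main__":
    risks = [0.10, 0.25, 0.40]
    prior = [1.0 / 3.0] * 3
    print(best_bound_on_grid(risks, prior, [10.0, 30.0, 100.0, 300.0],
                             n=1000, eps=0.01))
```

The grid must be fixed before seeing the data; the lambda that wins on the grid may depend on the sample because the union bound already pays for every grid point.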
And as it minimizes the right-hand side, you have here an infimum, and then what you do, well, first you still have this empirical bound, okay, that I mentioned before, but as you minimize the right-hand side, what you can do is then to use the reverse bound, okay, where you start again the whole process but you replace capital R minus small r by the opposite of this, and then actually what it tells you is that you have this time something which seems useless in practice, okay, the integral of the empirical risk is smaller than the integral of the true risk plus something, but you can plug it in the previous analysis, okay? So what we had before, the integral of the, sorry, the risk of the minimizer is smaller here than the empirical bound, but the empirical bound, if you take the minimizer, it's an infimum, and then using this result, you can replace the empirical risk by the true risk in the infimum, okay? And so here you have the theorem. So this is quite a standard analysis actually, but the only point was to remark that actually here, I mean, if you don't minimize with respect to all the probability distributions but just over your approximation family, then it still works. Okay, well, it's time to conclude. So I have just one slide called conclusion, I think, yes. So there are other things to do, and actually some of them are already done. In the paper, we also provide, for example, a complete analysis of ranking models, so it's very similar obviously to classification because we still use linear score functions, but we also have the same thing actually, that we are able to replace MCMC methods by optimization methods, and even in some cases by convex optimization problems. We also have a sketch of the analysis in matrix factorization. I mean, a sketch of the analysis in the sense that there are many problems there. In the sense that, yeah, if I have just one more minute, I will come back to this, sorry, this result, sorry, yeah.
This result tells you that the variational Bayes approximation performs as well, in some sense, as the exponentially weighted aggregate that you wanted in the beginning if this term is not too large. So actually what we did was to prove in this case that this bound, whether you compute it for the exponentially weighted aggregate or for the variational approximation that is used in practice by people, by Bayesian statisticians when they do matrix factorization, the bound is the same, okay? So it means that in some way if the PAC-Bayesian bound would hold for the exponentially weighted aggregate, then it would hold for the variational approximation as well. On the other hand, to my knowledge, until now nobody was able to prove that it holds for the exponentially weighted aggregate. So there is a missing part in this problem, and obviously there is the other missing part, which is that the criteria used anyway for matrix factorization are non-convex, and so we don't know whether we converge actually to a proper minimum or even to a minimum at all. But even so, I mean, it seems that there is something to be done there even though we are not able to complete the analysis. Okay, so the theory is not complete, I wrote it. We have other works in progress actually, so in the package, for the moment, we only have one version of the gradient descent, which is not actually necessarily the most efficient one. So actually James is currently writing other functions to use other optimization methods. And actually what I'm interested in currently, and this will be my last slide, it's a question for you, can you help me on this? You remember that I presented in the beginning two PAC-Bayesian bounds, one for the batch setting and the other one for the online setting. And the one for the batch setting, actually it holds for the exponentially weighted aggregate but it also holds for the variational approximation.
I said nothing about optimization, I mean about the online bound. And the reason why is that we were not able to perform the same analysis. So I can show you what we have. Remember that in this case at each step t you want to compute the mean according to this probability distribution, and what you could do, I mean, is to use a VB approximation. So you perform optimization. It will already be more costly actually than a proper online gradient algorithm because here at each step you would have to perform an optimization. But even so, we're not really sure that this one works. If you use this, so the mean prediction according to the pseudo, the approximated posterior, sorry, rather than the true one. Then actually we have this bound. So we have the cumulated loss that is smaller than the same criterion as the one that you would have for EWA but on the class. But in this case you have an approximation term, but instead of having it only once, then actually you have it at each step of the algorithm. So it means that the cost is much, much higher. So maybe actually, I mean obviously this is what comes out of the standard proof for online EWA. Maybe there's a better way to analyze it, but until now we are not able to improve on the fact that we don't pay the price for the approximation once. We pay it at each step of the algorithm. And so I mean if there is a non-zero distance between, in some way, the true EWA and your approximation family, then you might pay a huge price here. So we're not able to generalize this analysis to online prediction, but we would like to. So if you have some ideas and if you want to write a paper for me, you're more than welcome. Okay so I mean I can start to work on the jokes for the talk that we will give about that later. Okay so thank you for your attention. Is there any question? Yes. I was wondering about the connection there. I mean you didn't give a lot of citations and you didn't cite that work so maybe you were not aware of it.
But so I was wondering about the connection with the work that Tony Jebara did on maximum entropy, what he called maximum entropy discrimination. So he considered something which is very similar to what you're doing with the hinge loss. Essentially a formulation where you compute an expected value of the empirical risk and you replace the regularization term by a Kullback-Leibler divergence between a distribution on the parameters, or however you call them, and a certain prior distribution. So it seems very related to this. Do you know about this work? No, but I'm very interested in the reference, so obviously I will read this paper tonight after the talks. Yeah, no really I mean I'm interested in it but I didn't know it. I think it dates back to 1999, so I think the connection with PAC-Bayes is not as elegant as what you're presenting, but I think it's great. Okay, thank you. Yes. John Shawe-Taylor and Langford already proposed a PAC-Bayesian bound for linear classifiers. I would like to know if you know this work and can you comment on the relation between them? Once again I mean I think that we have different objectives because usually they are more interested in the empirical version of the bound. So something like this, where you relate, I mean you provide an explicit bound, but I mean it's true. I mean in some way with PAC-Bayesian bounds, once again, you have to minimize with respect to an infinite dimensional object, and people already had the idea to minimize just over a smaller set of parameters. Okay, so it's just, I mean I read this. The thing was to make a connection with VB but. Yeah, but I think in the end the algorithm is very close to some algorithms already existing in the PAC-Bayes literature. Especially the first one proposed with the Gaussian posterior, the isotropic Gaussian posterior. Yeah, once again I mean I don't think that any of the algorithms here are new. I mean the thing was just to provide an analysis of existing algorithms, but you're right.
I mean most of the algorithms already existed before anyway. Yeah, we know about this work by John Shawe-Taylor here. So did you run the SGD version on the hinge loss version in the experiments? Like you had three columns where VB was doing better than Monte Carlo. Actually James did it and this is the version that is implemented in the package. So it's the, yeah, yeah, sorry. So actually oh yeah, yeah, yeah, yeah, sorry. I removed this slide actually because yeah, okay. I thought I had it somewhere. Yeah, yeah, you're right actually. Sorry, where is it, I'm lost. Yeah, we had this and we have another one in the paper with the hinge loss version. So actually usually, I mean in most cases, I would say in maybe five or six cases, it improves on VB in the sense that it converges much, much, much faster. But there is one case where we have a very, very surprising, like, accident, I don't know why. We did not understand it. But in most cases it works much, much, much better than this one, so. Well, much better, and like there's a difference between speed and best error. Exactly, but I mean in some way the idea was to, I mean this is a test error, okay. It works much, much better in the sense that if we run the algorithms for the same time, then in the end we have a better test error. Okay, in some sense the idea of this simulation study was that we don't want to split between the optimization and the test error, but that we wanted in the end a good test error, whatever you have, in the sense that you can have a complicated model with a poor optimization procedure or a simple model with a good optimization procedure, in the end we want to compete at the test error level obviously, so. I mean the improvement that we have is in the test error in the end, but obviously it's due to a better optimization. But to be clear, this column of VB there was with the original multivariate Gaussian approximation. Yes.
And then you get a local minimum, I agree, because you get something non-convex. And you're saying, if you compare that with the other one, which is now using not a multivariate, but a univariate, well, I guess identity covariance matrix, but it's the hinge loss it's looking at. Yes. Now it's convex. Yes. Now you're saying that between those two, they gave the same test error, but one much faster. They gave, actually, even a slightly better test error in some cases. I mean, I don't know why, but actually you're right. I mean, the accident that we have, I can't remember for which one, probably it's for this one I think, which seems very easy in some way. And then actually we have an accident in the sense that it's the worst method. And it might be due to the fact that actually we have to restrict our attention to diagonal variance matrices. I don't know, but yeah, in this case we don't have a good prediction error. But yeah, what I wrote there is the prediction error. It's the test error on half of the sample. Yeah. Okay, another question? You mentioned that integrating a convex function is not necessarily convex, but how about integrating a strongly convex function? I don't know. Is there something obvious or? Awesome. The convexity in sigma is not obvious. Exactly. I think that's the problem, because when, instead of the zero-one loss, you integrate a convex loss function with respect to theta, it's okay. But the problem is that the criterion also depends on the Kullback divergence. And the Kullback divergence, I mean, you're already lucky if it depends in a convex way on sigma. You see what I mean? I mean, the criterion, sorry, I should. When you minimize this, okay, if R is convex, you integrate with respect to a probability distribution. I mean, this should be convex as well, but the question is whether this part is convex. And with respect to sigma, it's not always the case.
But even the integration of the loss will actually be convex in the bias parameter. No, you're right.
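To make the criterion discussed in this exchange concrete, here is a minimal sketch in Python of an integrated hinge loss plus a Kullback-Leibler term for a Gaussian with identity-shaped covariance. It is an illustration under stated assumptions, not the implementation from the package mentioned above: the posterior is taken as N(m, s^2 I), the prior as N(0, prior_var I), and the names `expected_hinge`, `kl_isotropic`, `objective`, `lam` and `prior_var` are all hypothetical.

```python
import math

def normal_pdf(t):
    # Standard normal density.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def normal_cdf(t):
    # Standard normal distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def expected_hinge(m, s, x, y):
    # E[max(0, 1 - y <theta, x>)] for theta ~ N(m, s^2 I) (assumed family).
    # The margin y <theta, x> is Gaussian with mean mu and std deviation sd.
    mu = y * sum(mi * xi for mi, xi in zip(m, x))
    sd = s * math.sqrt(sum(xi * xi for xi in x))
    a = 1.0 - mu
    if sd == 0.0:
        return max(0.0, a)  # degenerate case: plain hinge loss at m
    t = a / sd
    return a * normal_cdf(t) + sd * normal_pdf(t)

def kl_isotropic(m, s, prior_var):
    # KL( N(m, s^2 I) || N(0, prior_var I) ) in dimension d = len(m).
    d = len(m)
    sq_norm = sum(mi * mi for mi in m)
    return 0.5 * (d * s * s / prior_var + sq_norm / prior_var
                  - d + d * math.log(prior_var / (s * s)))

def objective(m, s, data, lam, prior_var):
    # Integrated empirical hinge risk plus KL / lam (lam is hypothetical).
    risk = sum(expected_hinge(m, s, x, y) for x, y in data) / len(data)
    return risk + kl_isotropic(m, s, prior_var) / lam
```

For example, `objective([0.5, -0.2], 0.8, [([1.0, 2.0], 1), ([0.5, -1.0], -1)], lam=10.0, prior_var=1.0)` evaluates the criterion on two toy points. In this particular parameterization (mean `m` and standard deviation `s`), each term is convex in `(m, s)`, which matches the "now it's convex" remark for the hinge loss with an identity covariance; other parameterizations of the variance need not behave this way.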