He was a lecturer, I think, or professor in Cambridge, and now he's working with Google in Paris. So please, go ahead. Yes, sorry, I was just looking for the way to unmute my audio. Can you hear me well? Yeah, that's perfect. Okay, thank you. So today we'll talk about recent work on learning with differentiable perturbed optimizers. This is very much a machine learning topic, but it has some relationships with statistical physics, as you will see. Okay, so this is joint work with some of my colleagues at Google and also with Francis Bach, and we have a preprint available on arXiv. So I guess this is a slide that's been shown already and will be shown again: a lot of machine learning these days is supervised learning. There are some inputs and outputs that must be related, so this is like a regression problem, through a model that has some parameters or weights w. This can be, as an example, a neural network. The goal is to optimize the parameters w of this model in a way that minimizes some empirical loss. The workhorse of these methods is first-order methods: doing some sort of gradient descent or a variant thereof. And often, when the big blue box is a neural network, this is done through backpropagation. One of the issues is what happens if these models contain non-differentiable operations. This also includes operations that are piecewise constant: in this case, the derivatives would be equal to zero, and this can be a bad thing. It can mean that the model will not move at all. And there are many, many discrete decisions in machine learning. So here you could absolutely consider a setting where, on some input x, a network or model computes some theta as output. Then this goes through a discrete decision box that solves, say, an optimization problem, and you get a y*(theta) that must be compared to a y. There are many examples of such discrete operations. One of them is the maximum.
So this is my first example here. If we are doing classification, theta contains scores for each class, and for the y's we have a one-hot vector, a vector with a one and a lot of zeros. Then a very reasonable box to have here in the middle is a maximum over all the scores theta. But this can also be adapted to a lot of other settings. If theta is a vector of edge costs over a graph, the box can be the shortest path between two vertices, depending on these edge costs. And if theta is a vector of scores for k products, then y* can be a vector of ranks, so not only the best one but the whole order over all the products. In this case, a lot of these operations are discrete by nature. We cannot backpropagate through them, and the question is how we can address that. We want this whole block to be end-to-end differentiable. Our approach is to see this through the lens of linear optimization problems, linear programs. For a function that is the maximum over a whole polytope of the scalar product between theta and y, there is a solution, an argmax, at one of the extreme points: this is the y*(theta). We can smooth the problem a little bit, make it differentiable, by introducing stochastic perturbation. What this means is that we add a tiny amount of noise, epsilon Z, to the input. This generates a distribution over the extreme points, with more or less mass, represented here by the size of the points in red. And the expectation of this, the soft version of the hard maximizer y*(theta), is y_epsilon(theta): the expected argmax under perturbation. An important thing is that it is the gradient of the soft maximum. These ideas by themselves are not new. There are models of optimal decisions under uncertainty going back to the 50s and 70s in economics, saying essentially: when people act suboptimally, it's just because they don't really know what they're optimizing. So they are indeed making optimal decisions, but under noise.
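To make this concrete, here is a minimal numpy sketch, not from the talk (the function name and the choice of Gaussian noise are mine), of the perturbed maximizer over the simplex. There, the argmax of a linear function is a one-hot vector, so the Monte Carlo average is just a histogram of argmax indices:

```python
import numpy as np

def perturbed_argmax(theta, epsilon=1.0, n_samples=1000, rng=None):
    """Monte Carlo estimate of y_eps(theta), the expectation of
    argmax_{y in simplex} <y, theta + eps * Z> over the noise Z.

    On the simplex the hard argmax is a one-hot vector at the largest
    coordinate, so averaging argmax indicators gives the empirical
    distribution over vertices (the red points of the figure)."""
    rng = np.random.default_rng(rng)
    d = len(theta)
    Z = rng.standard_normal((n_samples, d))        # i.i.d. Gaussian noise
    idx = np.argmax(theta + epsilon * Z, axis=1)   # hard argmax per sample
    return np.bincount(idx, minlength=d) / n_samples

theta = np.array([2.0, 1.0, 0.5])
y_eps = perturbed_argmax(theta, epsilon=0.5, n_samples=20000, rng=0)
# y_eps is a probability vector putting most mass on the largest score
```

As epsilon grows, the mass spreads away from the top vertex, which is exactly the "more diffuse" behavior described for the shortest-path pictures.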
In this case, the decisions follow a perturbed model: a distribution that we call p_theta, a discrete distribution that was represented by the points in red in the figure before. And the expectation of this distribution is the soft maximizer. This is something that's been looked at in the machine learning literature. You have perturb-and-MAP, these ideas of Gumbel tricks and how they can generate new distributions that are useful in machine learning, or follow-the-perturbed-leader as a way to promote exploration in the bandit literature. You can see here one example that I will illustrate further later. You have some features x_i that correspond to an image, right? An image of some terrain on a map. And there are some hidden costs theta that correspond to this texture: going through the rock or going through the water is much slower, it has a very high cost. If you solve the shortest path between the two corners on this map, you find this path in yellow. What we look at is introducing a bit of noise on the costs so that the shortest path solution changes. Here you can see the average of a lot of shortest paths when we add a little bit of noise, epsilon = 0.5. And as we add more noise, the solution becomes more and more diffuse. You can see this as a way to explore a little bit and learn more about the costs, trying other paths with some probability. Or you can see it as a smooth, differentiable object as a function of the costs. And one of the examples best known, at least to this community, is over the simplex. When we take the maximum of the theta_i's, if we add some Gumbel noise epsilon Z to the vector theta, what we get is the Gibbs distribution. The smoothed maximum is then the log-sum-exp, and the soft argmax is the vector of exponential weights, the softmax.
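This simplex special case can be checked numerically: by the Gumbel-max property, the empirical distribution of argmax(theta + G), with G an i.i.d. Gumbel(0, 1) vector, converges to softmax(theta). A small sketch (function names are mine):

```python
import numpy as np

def softmax(theta):
    """Exponential weights: the Gibbs distribution over coordinates."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def gumbel_perturbed_argmax(theta, n_samples=200000, rng=0):
    """Empirical distribution of argmax(theta + G), G ~ Gumbel(0, 1).

    The Gumbel-max property says the argmax index is distributed exactly
    as softmax(theta), so this converges to the Gibbs distribution."""
    rng = np.random.default_rng(rng)
    G = rng.gumbel(size=(n_samples, len(theta)))
    idx = np.argmax(theta + G, axis=1)
    return np.bincount(idx, minlength=len(theta)) / n_samples

theta = np.array([1.0, 0.0, -1.0])
empirical = gumbel_perturbed_argmax(theta)
exact = softmax(theta)
# empirical matches exact up to Monte Carlo error
```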
And the distribution that we get over the vertices is the Gibbs distribution for the scalar products between theta and the e_i's, right? So really, what we suggest is a family of distributions that mimics the Gibbs distribution but can have different properties. And one of the points it shares with the Gibbs distribution is a link with regularization. If we look at epsilon Omega, the dual of the function that plays the role of the log-sum-exp for us, it is a convex function that lives on the domain C that we are interested in. An interesting fact from duality is that our expected perturbed argmax is also the argmax of a regularized objective. So there is an equivalence here between regularization, coming from the dual of an expectation, and perturbation. And you can see here again: if you have no regularization, the solution is a Dirac with probability one at this corner. When you add a tiny bit, you move slightly inside, and as you add more and more, you converge towards the center, where only the noise dominates, or equivalently only the regularization dominates. When you do this on the simplex with Gumbel noise, the regularizer you recover is the entropy, and it's well known that when you regularize with the entropy, you get the exponential weights, which represent the Gibbs distribution. So here we generalize this idea. The difference with the Gibbs distribution is that in the general case we cannot explicitly write down what the probability is, but we can very easily simulate from it. This is the complete flip side of the Gibbs distribution, which is hard to simulate from but easy to write in exponential form. What we observe is that we have nice behavior at extreme temperatures: when epsilon goes towards zero, we recover the unique argmax. And what we care the most about is that we have nice differentiability: there is smoothness in the input theta.
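In the notation of the talk, the duality just described can be summarized as follows (a restatement for reference, with F_epsilon the smoothed maximum, C the polytope, and epsilon Omega the conjugate regularizer; in the Gumbel case the identities hold up to additive constants):

```latex
% Smoothed maximum and its gradient, the perturbed argmax:
F_\varepsilon(\theta) = \mathbb{E}_Z\Big[\max_{y \in C} \langle y, \theta + \varepsilon Z\rangle\Big],
\qquad
y_\varepsilon(\theta) = \nabla_\theta F_\varepsilon(\theta)
  = \mathbb{E}_Z\Big[\operatorname*{argmax}_{y \in C} \langle y, \theta + \varepsilon Z\rangle\Big].

% Duality: with \varepsilon\Omega the Fenchel conjugate of F_\varepsilon,
% the perturbed argmax is also a regularized argmax:
y_\varepsilon(\theta) = \operatorname*{argmax}_{y \in C}
  \big\{ \langle y, \theta\rangle - \varepsilon\,\Omega(y) \big\}.

% Special case: C the simplex, Gumbel noise (up to additive constants):
F_\varepsilon(\theta) = \varepsilon \log \textstyle\sum_i e^{\theta_i/\varepsilon},
\qquad
y_\varepsilon(\theta)_i = \frac{e^{\theta_i/\varepsilon}}{\sum_j e^{\theta_j/\varepsilon}},
\qquad
\Omega(y) = \textstyle\sum_i y_i \log y_i .
```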
This is differentiable with a non-zero Jacobian everywhere, and the Jacobian is written as a simple expectation involving the argmax and the noise. So how can we use this to learn? Well, the idea is that we might be in a situation where we observe some data y_1, ..., y_n that comes from a model p_theta. We observe the paths that people have taken from point A to point B; these paths are different, and we want to infer the true costs, the true expected costs, based on the idea that people have minimized their costs under perturbation. If you do this under the Gibbs model, you minimize a negative log-likelihood. This is the typical Gibbs inference problem: you have a linear term and a log-sum-exp, a log-partition-function term. And the stochastic gradient or full-batch gradient performs a sort of moment matching: it is the difference between the expectation of the variable under the current parameters and the observed variable, or the observed mean. It's called moment matching because, in trying to make the gradient zero, you're trying to make these two the same. However, it's algorithmically challenging to compute this expectation under the Gibbs distribution. So an idea of Papandreou and Yuille, when they introduced these perturbation models, was to say: well, let's replace it by the expectation under the perturbed model and hope that it's close enough. And they provide bounds to show that one functional dominates the other. What we realized is that when you do this, you're actually minimizing another loss. You're not really minimizing a negative log-likelihood anymore; you're taking the stochastic or full-batch gradient of a modified functional in theta, which is related to our problem. So instead of having the log-sum-exp, you have this modified log-sum-exp that corresponds to the kind of perturbation that we introduce. And the very nice thing, which was the reason this was introduced, is that the gradients are very easy to compute.
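A minimal sketch of this moment-matching scheme on the simplex (illustrative function names and data, Gaussian noise assumed): stochastic gradient descent on the perturbed objective drives the model's expected argmax toward the observed mean.

```python
import numpy as np

def perturbed_mean(theta, epsilon=1.0, n_samples=5000, rng=None):
    """Monte Carlo estimate of y_eps(theta), the expected one-hot argmax
    of theta + eps * Z over the simplex vertices."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n_samples, len(theta)))
    idx = np.argmax(theta + epsilon * Z, axis=1)
    return np.bincount(idx, minlength=len(theta)) / n_samples

# The gradient of F_eps(theta) - <theta, y_bar> is y_eps(theta) - y_bar,
# so making the gradient zero matches the model mean to the data mean.
y_bar = np.array([0.7, 0.2, 0.1])   # observed empirical mean (illustrative)
theta = np.zeros(3)
for _ in range(300):
    theta -= 0.5 * (perturbed_mean(theta) - y_bar)

# After fitting, the model's expected argmax is close to y_bar.
fitted = perturbed_mean(theta, n_samples=100000, rng=0)
```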
And if the variables are indeed distributed according to a perturbed model rather than a Gibbs distribution, which might make more sense (you want to say that people cannot, in general, compute a Gibbs distribution, but they can optimize based on their beliefs about certain costs), then you recover something called a Fenchel-Young loss, which has very good properties. It is convex in theta. The randomness plays nicely: if the y's are random here, the randomness only acts linearly on this loss. And it is minimized at the correct value: if the y's indeed follow a perturbed model with some true parameter theta_0, the loss will be minimized, in the large-sample limit, at this theta_0. So what this allows us to do is to replace, in these large neural-network machine-learning pipelines, hard optimization blocks by soft optimization blocks. This is something that was already done with the softmax, with exponential weights, when doing classification, and we're proposing a generalization of this through stochastic perturbation. One thing we can also do, instead of taking any particular loss between y and y_epsilon, is to use these Fenchel-Young losses directly as a loss between y and theta. This is easier to implement: we don't have to compute a Jacobian of the soft argmax, we just have to compute a gradient of this loss, which only involves the perturbed argmax. And one of the reasons this is very practical and very convenient is that everything can be computed through Monte Carlo estimates. The perturbed maximizer and its derivatives are written as simple expectations; we have the full formulas in the paper. What you can do, when you are at a given theta, is generate i.i.d. copies of the noise, solve this linear optimization problem through an oracle many, many times, and get empirical values for the sizes of these buckets in red.
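A sketch of the perturbed Fenchel-Young loss on the simplex (up to the additive term epsilon * Omega(y), which is constant in theta; the names and the Gaussian noise are my choices): the same Monte Carlo samples give both the loss value and its gradient, with no Jacobian required.

```python
import numpy as np

def fy_loss_and_grad(theta, y_true, epsilon=1.0, n_samples=20000, rng=0):
    """Perturbed Fenchel-Young loss (up to a constant in theta) and its
    gradient, estimated from shared Monte Carlo samples.

    loss(theta; y) = F_eps(theta) - <theta, y>   [+ eps * Omega(y), const.]
    gradient       = y_eps(theta) - y            (no Jacobian needed)
    """
    rng = np.random.default_rng(rng)
    perturbed = theta + epsilon * rng.standard_normal((n_samples, len(theta)))
    # F_eps: average of the maxima; y_eps: histogram of argmax indices.
    f_eps = perturbed.max(axis=1).mean()
    y_eps = np.bincount(perturbed.argmax(axis=1),
                        minlength=len(theta)) / n_samples
    return f_eps - theta @ y_true, y_eps - y_true

theta = np.array([1.0, 0.0, -1.0])
y_true = np.array([0.0, 1.0, 0.0])   # one-hot target, class 1
loss, grad = fy_loss_and_grad(theta, y_true)
```

Note that the gradient sums to zero (both y_eps and a one-hot y_true sum to one), so updates stay on the relevant directions of theta.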
And you get an unbiased estimate of the soft maximizer by averaging them, and you can do the same thing for derivatives, Jacobians, et cetera. So in our supervised learning setting, we have features x_i, we have model outputs, and this allows us to have a stochastic gradient in the weights of the model. The only part here that's not explicit is the expectation, which we estimate by sampling, so we get a doubly stochastic scheme. And so, very quickly, we applied this to two experiments. The first one is just a sanity check that, in proposing a generalization of the softmax, we do at least as well as the softmax: we show that we're competitive with the softmax on classification tasks on CIFAR-10. We investigate a little bit the influence of M, the number of samples, and epsilon, the amount of noise added; I'm happy to talk about this offline. And we do a task that's a little bit more challenging, which is to infer shortest paths based on examples of features. We observe a lot of 12-by-12 cost matrices represented through these image features, and within a few epochs, our method learns to very accurately predict what the shortest path will be, and to propose a shortest path that is very, very close to optimal. And I think I'm out of time now. This is good. Thank you very much for your attention. Thank you very much.
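As an illustration of the shortest-path setting (a toy sketch, not the paper's experiment: a small grid with down/right moves only, solved by dynamic programming), perturbing the costs and averaging the hard paths gives the diffuse, smooth solution described earlier. Here the solver is an argmin, and the same perturbation idea applies.

```python
import numpy as np

def shortest_path_grid(costs):
    """Min-cost monotone path (moves down or right) from the top-left to
    the bottom-right of a cost grid, by dynamic programming. Returns the
    0/1 indicator matrix of the cells on the optimal path."""
    n, m = costs.shape
    D = np.full((n, m), np.inf)
    D[0, 0] = costs[0, 0]
    for i in range(n):
        for j in range(m):
            if i > 0:
                D[i, j] = min(D[i, j], D[i - 1, j] + costs[i, j])
            if j > 0:
                D[i, j] = min(D[i, j], D[i, j - 1] + costs[i, j])
    path = np.zeros_like(costs)
    i, j = n - 1, m - 1
    path[i, j] = 1.0
    while (i, j) != (0, 0):          # backtrack along cheapest predecessors
        if i > 0 and (j == 0 or D[i - 1, j] <= D[i, j - 1]):
            i -= 1
        else:
            j -= 1
        path[i, j] = 1.0
    return path

def perturbed_shortest_path(costs, epsilon=0.5, n_samples=300, rng=0):
    """Average of hard shortest paths under perturbed costs: the solver's
    output becomes a smooth function of the costs (Monte Carlo estimate)."""
    rng = np.random.default_rng(rng)
    acc = np.zeros_like(costs)
    for _ in range(n_samples):
        noisy = costs + epsilon * rng.standard_normal(costs.shape)
        acc += shortest_path_grid(noisy)
    return acc / n_samples

costs = np.array([[0.0, 5.0, 5.0],
                  [0.0, 0.0, 5.0],
                  [5.0, 0.0, 0.0]])
hard = shortest_path_grid(costs)        # one crisp 0/1 path
soft = perturbed_shortest_path(costs)   # fractional occupation of cells
```

Every monotone path on an n-by-m grid visits exactly n + m - 1 cells, so the soft solution still sums to that count while spreading mass over alternative routes.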