Let me share my screen. All right, can you see it? Yes. Thank you, Adriana, and again, thank you to the organizers. And I'm sorry for this spacey background, I couldn't deactivate it. So I'm going to discuss the latest installment in a long series of research papers. This last project was done in collaboration with many people at Bocconi; you see them in these pictures. Let me start by stating some basic heuristic facts about the loss landscape of neural networks. Neural networks are highly non-convex, non-trivial functions, and so is their loss landscape. In principle it could contain many minima, some of them high-lying local minima, some of them low-lying or global minima. So in principle optimization could be very, very hard, but in practice it is not. You start from some random initial configuration, you apply the same procedure, and you go down in the loss landscape to its bottom, to low-loss configurations. Also, you can re-initialize your training dynamics, start from another random initial configuration, and you will end up in a different final configuration, again with low loss. And, surprisingly, you also find that you are able to connect, through a low-loss path, each pair of solutions your algorithms are able to reach. So essentially you have a unique connected bottom in this loss landscape, which has a lot of valleys, a lot of canyons; anyway, there is a unique, tentacular minimizer. But not all of the regions in this flat bottom are equally good: some of them generalize better, some of them generalize worse. Other features are that you can make some architectural changes to your models. You can, for instance, add skip connections, or you can use different losses and activations, and obtain drastic changes in the landscape. For instance, in a model with no skip connections or residual connections, which you see in the left picture, you have a very rough landscape, while if you add a lot of skip connections you get the very nice, very smooth landscape you observe on the right. There are also training choices you can make, such as using dropout, or different learning rates, or different batch sizes for SGD, that may drive your training dynamics towards different types of minimizers. So not all minima are created equal; some of them are better than others because some of them are better at generalizing, which is what really matters: generalization. So what we want to achieve is to connect some geometrical properties of the training loss landscape to generalization performance. And our argument is that flatness around the minimizer is what can guarantee good generalization capability. So, in order to define a measure of flatness, and also an objective function which enforces flatness in the final minimizer obtained, we define the local entropy function, which is defined in the equation that you can see, and which essentially says that a configuration of weights W has a low local-entropy loss if and only if it is surrounded by configurations which have low loss, at least within a certain radius which is connected to this parameter gamma that you can tune.
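The equation itself is only on the slide, so here is a minimal numerical sketch of the idea in one dimension, under the assumption that the local entropy loss takes the usual form Phi(w) = -(1/beta) log integral dw' exp(-beta L(w') - (gamma/2)(w - w')^2); the toy landscape, beta, and gamma below are made up for illustration and are not from the talk.

```python
import numpy as np

def loss(w):
    """Toy rough landscape: a wide, shallow valley near w = -2 and a sharp,
    slightly deeper minimum near w = +2 (purely illustrative)."""
    wide   = -1.2 * np.exp(-0.5 * (w + 2.0) ** 2)
    narrow = -1.5 * np.exp(-(w - 2.0) ** 2 / 0.02)
    return wide + narrow + 0.02 * w ** 2

def local_entropy_loss(w, gamma, beta=2.0, grid=None):
    """-(1/beta) * log  int dw' exp(-beta*L(w') - (gamma/2)*(w - w')^2),
    evaluated by brute-force quadrature on a grid (assumed form of the
    definition shown on the slide)."""
    if grid is None:
        grid = np.linspace(-6.0, 6.0, 4001)
    logints = -beta * loss(grid) - 0.5 * gamma * (w - grid) ** 2
    m = logints.max()                      # log-sum-exp for numerical stability
    dw = grid[1] - grid[0]
    return -(m + np.log(np.sum(np.exp(logints - m)) * dw)) / beta

ws = np.linspace(-4.0, 4.0, 401)
plain  = loss(ws)
smooth = np.array([local_entropy_loss(w, gamma=1.0) for w in ws])

print("argmin of the raw loss      :", ws[np.argmin(plain)])   # sharp minimum near +2
print("argmin of the local entropy :", ws[np.argmin(smooth)])  # wide, flat valley near -2
```

The point of the toy is exactly the behaviour described next: the raw loss prefers the sharp minimum, while the local-entropy objective, which averages the loss over a gamma-sized neighborhood, prefers the wide flat valley.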
And in this picture, you see that starting from some rough landscape, represented by this gray line, and then applying this local entropy construction, you achieve a much nicer, much smoother objective function whose minimizers essentially correspond to the positions of the flattest minimizers in the original loss. So this is a good surrogate function for optimization. And in a series of works we showed, on very simple models, using replica theory, that minimizers of the loss function which also have a low local-entropy loss have better generalization performance than other kinds of minimizers. And also that architectural changes, such as using ReLU activations instead of tanh, or using cross-entropy instead of mean-square-error loss, can lead to the emergence of flatter, or equivalently low local-entropy, regions, which is again good for generalization. And this could be shown, not rigorously but analytically, for neural networks with zero and one hidden layers.

But now let's move to the new things we have been doing in this last work. Consider a binary mixture of Gaussians, where you have a collection of data points and each data point belongs to one of two classes. So you have class plus one, where the label sigma is plus one and the data point is generated as a perturbation around a true signal vector, while data points of class minus one are generated as perturbations of minus the signal. This is how you generate the two sets of points, and then the task is a classification task, meaning that you have to classify each data point according to the correct class. And we use a linear model, a perceptron, for this classification task, so the prediction of the perceptron is just given by the dot product of the perceptron weights W with the input; then you apply the sign operation and you obtain the predicted label. The rigorous result that people, more specifically Mignacco, Krzakala, Lu, and Zdeborová, obtained for this model is that when you train the classifier using a mean-square-error or logistic loss, with an L2 regularization term added, then by minimizing this convex objective function you obtain a point-wise estimator which achieves Bayes-optimal error, meaning that you cannot do better than that. And you obtain this Bayes-optimal estimator by taking the limit of infinite L2 regularization, which is kind of weird, because you are also shrinking the norm of the classifier to zero, since you are applying infinite regularization. Well, the problem with this approach is that it is not very general. Here, in the end, the way our classifier classifies the inputs is not norm dependent: it's just the sign of a scalar product, so it doesn't depend on the norm of W. And the same goes for deep learning architectures with ReLU activations: the predictions of these architectures do not depend on the norm of each layer. So we would like to find such an optimal classifier while bypassing the need to constrain the norms or push them to zero. Okay, so what we do is to apply our local entropy framework. I don't want to go into the details, but essentially we can take the same minimizer as the one obtained in that paper and then compute the local entropy of this minimizer, so essentially measure the flatness of the landscape at different distances.
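To make the setup concrete, here is a minimal sketch of the Gaussian-mixture classification problem just described, with a perceptron trained by gradient descent on the L2-regularized logistic loss; the dimension, sample size, noise level, and regularization strength below are illustrative assumptions, not the values studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 200, 400, 1.0

# True signal vector and the two Gaussian clouds centred at +v and -v.
v = rng.normal(size=d) / np.sqrt(d)
sigma = rng.choice([-1.0, 1.0], size=n)                      # class labels
X = sigma[:, None] * v[None, :] + noise * rng.normal(size=(n, d)) / np.sqrt(d)

def grad(w, lam):
    """Gradient of mean logistic loss + (lam/2)*||w||^2 for the linear classifier."""
    margins = sigma * (X @ w)
    return -(X.T @ (sigma / (1.0 + np.exp(margins)))) / n + lam * w

# Plain gradient descent on the convex objective (logistic loss + L2).
w = np.zeros(d)
lam, lr = 1e-2, 0.5
for _ in range(2000):
    w -= lr * grad(w, lam)

# Predictions are sign(w . x): they depend only on the direction of w, not its norm.
train_acc = np.mean(np.sign(X @ w) == sigma)
overlap = (w @ v) / (np.linalg.norm(w) * np.linalg.norm(v))  # alignment with the signal
print(f"train accuracy = {train_acc:.3f}, overlap with signal = {overlap:.3f}")
```

It is on minimizers of exactly this kind of regularized objective that the flatness, i.e. local entropy, analysis mentioned above is then carried out.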
And in this case you find that the classifiers which generalize better are also the ones with a much flatter local entropy curve, which means they belong to flatter minimizers. So essentially we found another criterion which leads to Bayes optimality, which is more general than the norm-regularization criterion and of course depends only on the training data: it is something you can compute and optimize based on the training data. These were analytic results that we again obtained through the replica method.

Now I want to discuss some algorithmic results that we obtained on deep learning architectures. So we have this local entropy framework and we want an efficient algorithm to perform local entropy optimization. One such algorithm is entropy-SGD, which was introduced in this paper and is based on the observation that the gradient of the local entropy is just given by an expectation of the weights you are considering minus this other set of weights W prime, where W prime is averaged according to, essentially, the measure contained in the local entropy definition. So this is a thermal average, and this thermal average can be approximated by stochastic gradient Langevin dynamics. This is what the authors of that paper did, and in our work we scale up the algorithm and apply it to state-of-the-art architectures. Another algorithm we use in this work is replicated SGD. Maybe I don't want to go through the details; the derivation is not hard, but we don't have much time. Essentially this algorithm is a simple cooking recipe which says: take your original loss function, make multiple copies of your system, and then consider the loss function for this replicated system, where you sum the individual loss functions and then also add a coupling term among the replicas. So each replica is essentially coupled to the barycenter, this W bar, of the replica set. Once you define this replicated loss function, you just apply any optimization procedure you like in order to optimize it. And this, again, is an algorithm which enforces the fact that you end up in minimizers with high local entropy, and so in flat minimizers. So we have two very simple algorithms. Sorry, I'm just checking the time because I'm not sure how much time I have left. You have a minute roughly. Okay, so we have two algorithms. We applied them to some state-of-the-art convolutional residual networks, where essentially we implement the baseline from some paper and then, on top of that, we apply our replicated SGD or our entropy-SGD procedure, and we consistently manage to outperform the baseline with our, let's say, entropic algorithms. And then we also check that the minimizers we found, besides generalizing better, are also effectively within a flat region: they belong to flat minima, and we do that by implementing another measure of flatness which is easier to compute than the local entropy. So everything is consistent: essentially it seems that flatness leads to good generalization, and entropic algorithms lead to flat minimizers. And essentially this is all I wanted to say. Thank you. Great, thank you.
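Since there was no time in the talk for the derivation, here is a rough PyTorch sketch of the replicated-loss recipe as described: several copies of the system trained on the sum of their individual losses plus a quadratic coupling of each replica to the barycenter W bar. The tiny model, random data, and the values of gamma and the learning rate are placeholder assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn

def make_model():
    # Placeholder architecture standing in for the real networks used in the work.
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

n_replicas, gamma = 3, 1e-3
replicas = [make_model() for _ in range(n_replicas)]
params = [p for m in replicas for p in m.parameters()]
opt = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

def replicated_loss(x, y):
    # Sum of the individual losses of the replicas ...
    total = sum(criterion(m(x), y) for m in replicas)
    # ... plus the coupling of each replica to the barycenter W bar,
    # computed parameter tensor by parameter tensor.
    for group in zip(*(m.parameters() for m in replicas)):
        w_bar = torch.stack(list(group)).mean(dim=0)
        total = total + 0.5 * gamma * sum(((w - w_bar) ** 2).sum() for w in group)
    return total

# One toy training step on random data, just to show the mechanics.
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
opt.zero_grad()
loss = replicated_loss(x, y)
loss.backward()
opt.step()
print(float(loss))
```

The coupling strength gamma plays the same role as in the local entropy definition: it sets how tightly the replicas are held together, i.e. the size of the neighborhood over which flatness is being enforced.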