... give accessible minima in neural networks. So, good morning. The approach I would like to discuss is more related to optimization: how neural networks manage to be trained at all, why the minima that algorithms actually reach are accessible, and why those minima have good generalization properties. That's it. I'm not talking about the influence of the data or anything; it's a very basic thing.

So the kind of models we will be talking about are neural networks of this type. The output is a composite function of all the layers, with nonlinear activation functions, and we want to learn a set of patterns: x is the input and y is the output. Training is performed by minimizing a loss function. In the pure model this loss function should be the number of errors. After all, what we want to do is take a set of examples, minimize the number of errors we make on those examples, and then check how things behave on unseen data. We have no priors on the data, so this is what we should do. However, we don't actually minimize the number of errors, because you cannot run a gradient on that function: it is piecewise constant, so the derivative carries no information. What we do in practice is minimize something like the mean square error — the difference between the desired output activation and the actual output activation — or something else, such as the cross-entropy, which more or less amounts to taking the variables, which are deterministic, and interpreting them as probabilities. So it's a soft version. It's an arbitrary choice, but this function works, and the key point is that on this kind of function you can run stochastic gradient descent, or gradient descent, or whatever algorithm you want. What is most used in practice is the cross-entropy. Why is it so? "Because it works better" — this is the answer given by the people who use it.

So what is the general situation? The general situation, let me just go to this point here, is that we normally use algorithms that were designed for convex problems, and we use them on highly non-convex tasks. This is the situation, and it works, and it works very well. So we need to understand this point, I think. OK, first of all, these are real deep networks: imagine that each block is some kind of huge neural network. These are the objects we would like to understand, and of course we cannot do anything analytically on this kind of model. So we restrict ourselves to the building blocks and ask whether the building blocks are special in some sense. Because, after all, if you have such a huge object and you want to optimize it, you had better have something really efficient at the basic level, otherwise it would be very, very difficult to optimize anything here. These complicated objects must be composed of devices that are very efficient from the point of view of optimization.
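To make the loss functions mentioned above concrete, here is a minimal sketch — my own illustration in Python/NumPy, not code from the talk — of the three options for ±1 labels: the error count, which is piecewise constant and gives no gradient signal, versus the differentiable mean-square-error and cross-entropy surrogates.

```python
import numpy as np

def error_count(y_true, scores):
    """Number-of-errors loss: piecewise constant, zero gradient a.e."""
    return np.sum(np.sign(scores) != y_true)

def mse(y_true, scores):
    """Mean square error between desired and actual output activations."""
    return np.mean((np.tanh(scores) - y_true) ** 2)

def cross_entropy(y_true, scores):
    """Cross-entropy: interprets the deterministic outputs as probabilities,
    a 'soft', differentiable version of the error count."""
    p = np.clip(1.0 / (1.0 + np.exp(-scores)), 1e-12, 1 - 1e-12)  # P(y = +1)
    t = (y_true + 1) / 2                                          # {-1,+1} -> {0,1}
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
```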
So these are, in fact, the networks that we can look at. I mean, we are really back in the 90s from this perspective. We can study a simple multilayer network with one hidden layer, with continuous or discrete weights. These are the kinds of things we are going to study, and we are going to study random patterns, without structure. I would like to be able to study the kind of patterns that Mark was describing, with some superposition of features and so on, but as you know this is an insurmountable problem from a technical point of view, because of correlations. So let's stick to random, uncorrelated patterns. What I want to argue is that what we are going to study are more properties of the device than properties of the data: random patterns are hard — we know they are hard to learn — but they have no particular correlations to exploit.

OK. So the approach we will take, just as you would expect, is to construct a Boltzmann weight, e to the minus beta times the loss function we want to analyze, where Z is the partition function. In the space of all the weights — the couplings on the edges of our neural network — we want to study this measure, and we will be interested in the limit of large beta, which leads to a distribution concentrated on the minima of the loss function. This measure concentrates on the minima, so it is going to tell us a lot about the space of solutions, the space of minima of this learning problem, and we are going to use techniques from spin glass theory. As I said, in the limit of beta going to infinity we are back to the very old and famous Gardner volume calculation, because here we are only selecting the W such that all the patterns are correctly satisfied, otherwise the weight would be zero. We are cutting the W space into pieces and looking at the accessible volume: what remains after we have stored all the patterns. For the physicists who are not old enough, or for the non-physicists, just ask — I have to skip the details for reasons of time.

What has been found, for instance, for an architecture like this, in which you fix the second-layer weights and learn only the first layer — already a highly non-convex device — is something like this. P is the number of random training patterns and N is the size of the input, in the limit of large N and large P with the ratio alpha = P/N constant. What you find is the following scenario: for small alpha, namely when you have relatively few patterns with respect to the number of degrees of freedom, the space of solutions is somehow connected, whereas above a certain threshold the symmetry breaks somehow, the weight space becomes disconnected, and local minima and critical points appear in this region, so learning becomes difficult. If you run plain stochastic gradient descent on this machine, you easily find solutions here, but there you get into trouble. So is this enough to understand how neural networks behave, given that this would be one of the building blocks of these huge networks? What I think is: no, this is absolutely not enough, and I also think this was a kind of misleading view we had in the 90s.
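For reference, the measure introduced above and its zero-temperature limit can be written as follows — a standard reconstruction of the formulas being gestured at, with Θ the Heaviside step function, f the network output, and (x^μ, y^μ) the P training patterns:

```latex
P(W) \;=\; \frac{e^{-\beta L(W)}}{Z},
\qquad
Z \;=\; \int dW\, e^{-\beta L(W)}
\;\xrightarrow{\;\beta\to\infty\;}\;
V \;=\; \int dW \prod_{\mu=1}^{P} \Theta\!\big(y^{\mu} f(W;\, x^{\mu})\big),
```

where V is the Gardner volume: the volume of weights that classify all P patterns correctly.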
So we somehow thought — at least I thought — that algorithms would have exploited the structure of these states and would have converged somewhere, and so on and so forth. And this is not the case. Not because the theory was incorrect, but because there are other things happening in these networks that are relevant for algorithms — which, as we all know, do not obey detailed balance, and so need not end up in the dominant states of the Gibbs measure. We have been a bit biased towards the study of equilibrium statistical physics for these problems, whereas we should have studied the non-equilibrium aspects, and this is what we have been doing recently.

So let's revisit a few results from the 90s. The model of the neuron that T.G. was also talking about is the McCulloch-Pitts model: a weighted sum of the inputs, followed by a step function to decide the output. This was the old model, but over the years we went from this to a model in which the activation function is a hyperbolic tangent or a sigmoid, and now people are using the ReLU function. Is there any qualitative difference between these models? I know this is a very picky kind of question, but let's look at it. So you study the Gardner problem for a model like this: again one hidden layer, here you have the W's, and here you have fixed random signs in the last layer — some plus and some minus, because the ReLU output is either zero or positive, so to have a balanced output some of these fixed last-layer weights must be minus and some plus. You study this object using statistical physics methods, and what you find, let me be quick, is the following. With threshold units the critical capacity diverges as the square root of the log of K, the number of hidden units: the more you increase the number of hidden units, the better you can perform in terms of capacity, the number of patterns you can actually learn. Whereas with ReLU functions the critical capacity remains finite, independently of K — and this is actually also true for the sigmoid. This is a strange result, because in the 90s I would have been happy about the first result — oh look, we have a great capacity! — but then it turns out that in practice we are using devices with a small capacity. So why have neural networks evolved to choose these devices rather than those? By the way, this is suggesting that it is not particularly effective to go wide, because you don't gain much in doing that. The reason is, for sure, that we can run gradient descent on this kind of function, where we cannot run it on the step function — a very practical observation. But also, once you have ReLU units the notion of a hyperplane is weakened, because the output is a continuous variable that has to be summed, so everything becomes a bit more complicated, and there are other things that we will return to later; this is just the starting point. So there is something strange going on already at this level. I know I've been saying this many, many times, I'm sorry, but I have to go back to the binary perceptron for a moment, because, let me remind you, my plan is to analyze what we overlooked in the weight space of neural networks and then see how the modifications implemented in the last decade exploited this fact.
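A minimal sketch of the two one-hidden-layer architectures being compared — my own illustration (sizes and variable names are assumptions): trainable first-layer weights W, fixed ±1 second-layer weights, and either a sign or a ReLU hidden activation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 9                            # input size, number of hidden units
W = rng.standard_normal((K, N))          # trainable first-layer weights
a = rng.choice([-1.0, 1.0], size=K)      # fixed random +/-1 output weights

def committee_sign(x):
    """Classic committee machine with sign activations
    (critical capacity grows like sqrt(log K))."""
    return np.sign(a @ np.sign(W @ x))

def committee_relu(x):
    """Same architecture with ReLU hidden units
    (critical capacity stays finite in K)."""
    return np.sign(a @ np.maximum(W @ x, 0.0))

x = rng.standard_normal(N)
print(committee_sign(x), committee_relu(x))
```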
And so let me go back to the simplest model of a neural network that displays a non-convex landscape. The simplest one is probably the binary perceptron, which has been studied since the 80s, and it's a beautiful model. Here, instead of talking about the volume of the weight space, we have to talk about the entropy, because the weights are binary, so everything is simplified. What we are interested in is the entropy of solutions — that's the way you characterize the capacity — and of course the space of couplings is just a hypercube in N dimensions. This model was solved by Marc and Werner Krauth in '89, in a beautiful paper, and there has been a lot of work on this topic since. There is a very inspiring paper by Huang and Kabashima in which they analyze the minimum distance between solutions in this problem, and there is also a rigorous result by Ding and Sun — he's not at this workshop, sorry, it was a previous workshop, anyhow — in which they prove a lower bound for this quantity, with a beautiful technique.

So the scenario for this kind of model is the following: fine, the capacity is 0.83, so with N synapses you can store 0.83·N random patterns; but the landscape is a golf course, with the distance between typical solutions always of order N for any value of alpha. You will agree with me that with such a landscape you would expect not to be able to do any learning in this device, because this landscape is very similar to that of an error-correcting code, and we know that finding a codeword from a random initial condition is a hard problem — you could base a cryptographic system on it. So, also from spin glass theory, clearly there is something strange happening here, because we would expect learning to be impossible. The result of the theory is that typical minima are isolated, that there is a glassy landscape, and that there is even a freezing transition, random-energy-model style, for sufficiently large values of alpha. And yet algorithms work — and what they find is something that does not belong to this scenario.

We found some numerical evidence, at the beginning, that the solutions algorithms find are actually not of this type: they are not point-like, but are kind of wide regions. Algorithms find solutions which are surrounded by an exponential number of other solutions, in a region of diameter of order N. Of course this is a subdominant contribution to the Gibbs measure, so you cannot detect it using standard replica techniques; you have to do something else. What we did is to introduce what we call the local entropy measure: instead of considering as the energy function the number of errors, we look for regions in the weight space that are dense in solutions. What we want to maximize is the so-called local entropy, namely the log of the number of solutions contained in the hypersphere of radius d. In this golf course, we are looking for a lake. This is a complicated function to compute in principle, but as we will see, it can actually be done. If, instead of just minimizing the error, you maximize the number of solutions within a certain region, then even though this region is subdominant you can actually describe this phenomenon. This is a kind of large deviation analysis.
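In symbols, the local entropy around a reference configuration W̃ can be written as follows — a standard reconstruction consistent with the definitions above, with 𝟙 the indicator function and d_H the Hamming distance on the hypercube:

```latex
S_{\mathrm{loc}}(\tilde W, d)
\;=\;
\frac{1}{N}\,\log
\sum_{W \in \{-1,+1\}^N}
\mathbb{1}\big[\,L(W) = 0\,\big]\;
\mathbb{1}\big[\,d_H(W, \tilde W) \le d\,\big].
```

One then looks for the W̃ that maximizes S_loc at fixed d — a large-deviation computation, since such dense regions are subdominant in the Gibbs measure.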
What you find for the binary perceptron is — let me use this — something like this. Here is the distance, the radius of the hypersphere, and here is the entropy, the number of solutions. These red dotted lines correspond to the typical solutions: take alpha equal to 0.4, so you are within the region in which you can store patterns; then below a certain distance you find that there are no solutions, and this is Kabashima's work. However, if you look at the local entropy, you find that there do exist regions containing a lot of solutions. For instance, at alpha equal to 0.4 you have this blue curve, which essentially overlaps with the log of the binomial coefficient, which means that at sufficiently short distances there are regions that are very flat, in the sense that essentially everything inside them is a solution. This calculation was a relief for us, because we finally understood why algorithms work: you can check numerically that algorithms actually end up in these very dense regions of solutions. So the message is: in a non-convex device you do have the typical solutions, you have all the beautiful replica symmetry breaking schemes and so on and so forth, but there also exist dense regions, and these are the relevant ones, at least for the problems we have looked at.

We recently did some rigorous calculations, on a model of a binary perceptron that Lenka introduced and also on the standard binary perceptron — here I mention only this result — and we can show that in these models you do have the isolated solutions, but you also have pairs: an exponential number of solutions at small distance from each other. This is a rigorous bound based on the second moment method, so there are rigorous results showing that there is something more than isolated solutions.

Now, that was the binary perceptron. Does the same scenario hold for continuous neural networks? The answer is yes. Of course you have to consider a non-convex device, because the continuous perceptron just has a convex solution space; but in deep networks you have plenty of non-convexity, so this is not a real problem. And you have to replace the notion of high local entropy regions with wide flat minima — that's the continuous counterpart. What we find, by again doing this large deviation analysis on this kind of model — let me go straight to the result — is that in addition to the equilibrium scenario, there exist dense regions of solutions which are overlooked by the replica analysis. I mean, this picture is horrible; and besides being horrible, we have no real idea of the shape of these regions. We know that the solutions can diffuse up to long distances, but the geometry of this thing is not under control; we just know that there are regions of different densities which are connected. That's the kind of information we have. So this is the analytic outcome. Here, instead of distance, I have the overlaps, so small distances are here; and here is the log of the ratio between the volume you would have at that distance without any constraint given by the patterns and the volume at that distance with the constraints — that is, after cutting away all the configurations that do not satisfy the learning constraints. When this curve touches the horizontal line, it means that you are in a flat region, take it like that. You see that for this machine — with three hidden units, the critical capacity is around 3, more or less — below the critical capacity you do end up in regions that are very flat, and these are not the dominant ones.
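A hedged reconstruction of the quantity being plotted — the constrained local log-volume relative to the unconstrained one, at fixed overlap/distance from the reference configuration W̃:

```latex
\phi(q) \;=\; \frac{1}{N}\Big[\log V_{\mathrm{constr}}(q) \;-\; \log V_{\mathrm{free}}(q)\Big],
\qquad
V_{\mathrm{constr}}(q) \;=\; \int_{q(W,\tilde W)\,=\,q} dW \,\prod_{\mu=1}^{P}\Theta\!\big(y^{\mu} f(W;\, x^{\mu})\big),
```

with V_free the same integral without the pattern constraints. φ(q) approaching zero means essentially every configuration at that distance is still a solution — a maximally flat region.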
And if you run algorithms — if you, for instance, do a quiet planting of a solution — then in this plot you have again the distance here, and here the volume around the solution, computed using belief propagation (I skip the analytic details). This curve corresponds to a planted solution, so this would be the volume around a typical solution; this is the volume around the solution obtained by a certain algorithm called LAL, just one example of an algorithm; and this is the volume obtained through an algorithm that explicitly tries to maximize the flatness.

So, are these devices special? Well, in a sense, yes. For instance, if you replaced the unit with a parity machine — instead of taking a sum of the hidden outputs you take a product, and decide based on that — then you can show analytically that this kind of device does not display any wide flat minima: not as a rare event, and of course not as a typical event either.

Now the question is: can we design algorithms out of this? The idea is that if you have a local entropy term, and you can change the radius of the hypersphere inside which you try to maximize the number of solutions, then you can imagine designing an algorithm around that. In fact, the trick is very simple. You can write the Boltzmann weight in this form, where y is the parameter conjugate to the local entropy — a kind of inverse temperature for the problem — and if you take y integer, the partition function can be written in a factorized form. To make a long story short: a way of constructing a measure, or a cost function, or a sampling technique that focuses on high-density regions is just to take y real copies of the original system and couple them together through a distance constraint. This distance constraint imposes that you are sampling within a certain region; you have many copies, and in the limit of infinitely many copies you will find the regions that are maximally dense. So this is a practical way of entering those regions, and in fact it is quite straightforward, with algorithms like belief propagation and cavity techniques, to find these regions. By the way, if I go back here, this blue line is obtained by belief propagation on this replicated graph, so you can actually find these very dense regions efficiently. But you can also do replicated simulated annealing, which works beautifully, or replicated stochastic gradient descent, and this has been used also in deep learning. Now, in deep learning there are so many things being done that it is very difficult to disentangle what is doing what — that's the point — but I guess that many of the techniques that have been introduced, like dropout, initialization schemes, batch normalization and so on, are just techniques that keep the system out of equilibrium and make it converge to these kinds of regions.

Let me mention one thing, because it could be interesting for biological modeling. You can take a very simple algorithm called least action learning: whenever there is an error, it updates the unit whose update is cheapest, because its activation is closest to the threshold; so with the minimum change in the W's you try to push the output towards the right answer. This is a greedy algorithm that follows a path of least action, and it kind of works, but it doesn't generalize very well.
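The replicated construction just described can be turned directly into an algorithm. Below is a minimal sketch — my own illustrative code, not the authors' implementation — of replicated gradient descent: y copies of the weights, each minimizing the loss, plus an elastic distance coupling that pulls every copy towards the center of mass, so the ensemble drifts into regions where many nearby configurations have low error. The same coupling underlies the entropic version of least action learning discussed next.

```python
import numpy as np

def replicated_gd(loss_grad, W0, n_copies=5, gamma=0.1, lr=0.01, steps=1000):
    """Replicated gradient descent: n_copies replicas coupled to their mean.

    loss_grad(W) -> gradient of the training loss at W.
    gamma sets the strength of the distance (elastic) coupling;
    annealing gamma upward shrinks the effective radius of the region.
    """
    rng = np.random.default_rng(0)
    replicas = [W0 + 0.1 * rng.standard_normal(W0.shape) for _ in range(n_copies)]
    for _ in range(steps):
        center = np.mean(replicas, axis=0)        # center of mass of the copies
        for a in range(n_copies):
            g = loss_grad(replicas[a]) + gamma * (replicas[a] - center)
            replicas[a] = replicas[a] - lr * g
        gamma *= 1.001                            # slowly tighten the coupling
    return np.mean(replicas, axis=0)
```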
But now, if you do the entropic version of this — you take many copies and bind them with a distance constraint — then this super simple algorithm converges to the same wide minima that one normally tries to access through stochastic gradient with noise and so on. So this is an example: maybe one can find very elementary learning processes that exploit these properties.

This is the same as before, so let me just mention that one can also check the flatness of the solutions that are found, by looking at the Hessian spectrum. Of course the error cost function has no Hessian, because it is a discrete function, so for a given algorithm you don't really know which function you should compute it on; the idea is that we find the solution, then compute the spectrum using the mean square error or the cross-entropy cost function, normalized so that the spectra are comparable across algorithms. Here we compare the greedy algorithm, two types of stochastic gradient descent, the entropic version of the greedy algorithm, and the belief propagation algorithm, fBP, which focuses on these wide regions. And you find that fBP — the algorithm designed to access these regions — has a spectrum that is most concentrated around zero, so it is the flattest. The other very interesting message is that if you run stochastic gradient descent on the cross-entropy and you do a slow cooling of the temperature-like parameter inside the cross-entropy — which amounts to keeping the norm of the weights under control; the two things are equivalent — then you can reach very flat regions. Normally people do not perform an annealing on this cross-entropy parameter, but in fact it is a very effective way of reaching wide flat minima. And on some simple examples we could also check that this behavior correlates with generalization performance: the flatness correlates with generalization on real data.

There are many other applications of this. For instance, we have shown, with the people mentioned before and also with Federica Gerace and Bert Kappen, that if you use stochastic weights instead of deterministic weights, and you keep the fluctuations of these stochastic weights definitely non-zero (not very far from zero, but definitely not zero), then when you learn the probability distribution of the stochastic weights, they automatically converge to these rare regions of the W space. So stochasticity — not thermal stochasticity, but stochasticity in the parameters — helps in this direction. Another thing which is fashionable: quantum annealing also works for this kind of problem, because quantum fluctuations take advantage of the existence of wide regions. The wave function can delocalize, and the kinetic energy term in the quantum Hamiltonian encodes this information, so by minimizing the energy of the quantum Hamiltonian you automatically go to these wide regions; in the classical limit you are trapped there, you remain there, and you find solutions. It is very effective at finding solutions. This is not quantum supremacy, because we also have classical algorithms that work for these problems, but it works.
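Returning to the annealing of the cross-entropy mentioned above: a minimal sketch, under my own illustrative parametrization (a scale parameter gamma playing the role of the inverse "temperature inside the cross-entropy", equivalent to controlling the weight norm), of the loss one would slowly sharpen during SGD.

```python
import numpy as np

def annealed_cross_entropy(scores, y, gamma):
    """Cross-entropy on gamma-scaled margins Delta = y * scores.

    Small gamma: a smooth, wide surrogate loss.
    Large gamma: tends to max(0, -Delta), whose zeros are exactly the
    zero-error configurations, so annealing gamma upward while keeping
    the weight norm under control sharpens the loss onto solutions.
    """
    delta = y * scores
    return np.mean(np.logaddexp(0.0, -2.0 * gamma * delta)) / (2.0 * gamma)
```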
We also recently ran some experiments for unsupervised learning. We trained an autoencoder, comparing the plain autoencoder with its entropic version, on the kind of data that Mark was talking about — patterns obtained by a superposition of features coming from a certain dictionary — and we can show that these wide minima are very effective in recovering the features from the data. So, also for unsupervised learning, these wide minima seem, numerically, to be very effective in extracting information about the structure of the data. There is a reason for this, but I think I'm out of time.

So, just to conclude, let me answer the question about the evolution of neural networks: can we now understand what is happening in deep networks, in terms of optimization and of all the features that have been introduced since the 90s? The first result, which is very simple, is the following. For the binary perceptron, or any non-convex device for which you can actually do the calculation, you can show that the ground states of the cross-entropy loss function are surrounded by an exponential number of ground states of the error cost function. So when you minimize the cross-entropy, you are moving away from the typical solutions of your plain learning problem, and you end up in ground states that correspond to rare regions of the original error cost function. To me, this is the answer to why people use the cross-entropy: it's nothing particular, it's just that as an optimization problem you end up in accessible wide minima. And it's kind of obvious, because the cross-entropy introduces robustness and so on and so forth, but it is something one can compute, using a Franz-Parisi method.

Then, going back to the ReLU transfer function: we have the result about the capacity that I mentioned before. But you can ask — is this all? No, it's not, because what happens is that if you use the ReLU function, compared to the sign or a sufficiently steep hyperbolic tangent, then for low alpha — below the critical capacity — the ground states that you find by optimizing the network are actually denser in solutions: they belong to bigger lakes. Look at these two curves, here at distance zero: the fact that this red line is above this blue line means that the ReLU function is more entropic than the sign function. And what is kind of interesting is that if you now perform a perturbation of the inputs — take the training set, perturb some of the inputs, and check whether the output changes — what you find is that with the ReLU function the error rate is strongly reduced: it is much more robust to input perturbations. So you have these things that always go together: a dynamical aspect — they are easier to optimize — and at the same time more robustness, for, say, generalization purposes. This is a numerical experiment, but it can be done analytically; we just didn't want to do it, we were tired, but it's very easy to do.

So the final message for me is the following. In the 90s we started with a landscape looking like this, and then, by changing the transfer functions, the learning dynamics and the architecture, we somehow ended up with a landscape that looks like this — and this is more or less what is happening in the building blocks of these deep networks. We are now studying the effect of having multiple layers on the flatness of the minima that you find; we have some results about that, but I'm not going to talk about them today because they are too preliminary.
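The input-perturbation experiment described above could look like the following minimal sketch — entirely illustrative (the perturbation model, sizes and names are my assumptions): perturb each training input several times and measure how often the predicted label flips, comparing sign and ReLU hidden units.

```python
import numpy as np

def output_flip_rate(forward, X, noise=0.1, trials=20, rng=None):
    """Fraction of (pattern, trial) pairs where a small input
    perturbation changes the predicted label."""
    rng = rng or np.random.default_rng(0)
    flips, total = 0, 0
    for x in X:
        y0 = forward(x)
        for _ in range(trials):
            x_pert = x + noise * rng.standard_normal(x.shape)
            flips += (forward(x_pert) != y0)
            total += 1
    return flips / total

# usage with the committee machines sketched earlier:
# rate_sign = output_flip_rate(committee_sign, X_train)
# rate_relu = output_flip_rate(committee_relu, X_train)  # expected lower
```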
These green things are something we have more or less understood, and the red thing is something we want to understand in the near future. But of course there are many other problems we would like to address, of a much more general nature. Somehow, for me, deep networks are not the end of the story: they are just devices that work a bit better than they used to work in the 90s, and now, given the computational power and all that, we can do a lot of things with them. But there are fantastic problems that we should address. One that I find particularly fascinating is the idea of detecting invariances across data sets as a way of extracting causality; this is a very complicated problem, but it can be addressed in algorithmic terms. Also, the idea of merging the search over architectures with the optimization of the weights is very interesting, and, as far as I know from experimental results, this leads to highly glassy landscapes and highly non-convex problems — so again, I think it's a good thing for us to look at. I think I'll stop here.