So, just to summarize what we saw yesterday: in the weight space of non-convex neural networks storing random patterns with random labels (so this property does not depend much on the data, but rather on the device itself), the weight space can be composed of narrow minima of the loss function, of the number of errors. These are the regions in weight space where the error over the training set is equal to zero, and then you might have larger regions and so on. What we have shown, and this is one important thing common to both discrete and continuous networks, is that whatever the structure of the most numerous solutions, those that dominate when you sample them, there exist rare regions that, although rare, are attractive for algorithms and are extremely dense. It is like saying that the landscape in W space can look rugged almost everywhere, but somewhere you have a wide flat minimum. This is attractive for algorithms that do not satisfy detailed balance: if they tried to maximize the Gibbs measure, that is, to minimize just the energy, defined as the number of errors over the training set, then with probability one you essentially get stuck in metastable states. But if you forget about these constraints coming from physics, which allows any sort of stochastic process that does not necessarily want to minimize the energy, then it is kind of easy to be attracted to these regions, and we conjecture that this phenomenon plays an important role in deep learning.

Now, since I know some of you are interested in stochasticity in these systems, I would like to give you a couple of examples of stochastic processes that converge to these states and do not satisfy detailed balance; then I will move to the last and hopefully most interesting part of the talk. To the organizer I apologize, but I have to leave before the end of the session; I have a taxi at 20 past 4, so that's fine.

So, this was a picture in which what we did was to compute analytically the so-called weight enumerator function: you choose a certain configuration W, and then you count how many other weight configurations exist at a given distance from this reference configuration that satisfy the training set. You compute the log of the volume of these other solutions at distance d, divided by the maximum volume at that distance, for normalization purposes. For random systems this can be computed analytically, and from this plot you see what a typical solution sees: if you are sitting in a typical solution, one of the most probable minima, and you move away, the solutions quickly disappear, so you do not see much around you. The green curve is given by a particular algorithm, the least action algorithm, which ends up in minima that are slightly wider than the typical ones: if you take as reference a solution to the learning problem given by this algorithm and ask how many other solutions you see at a given distance, the curve still goes down relatively quickly, and notice the convexity of the curve around zero: it decreases sharply.
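To make concrete what the weight enumerator measures, here is a minimal brute-force sketch (my own toy illustration, not the analytic computation from the talk; the search heuristic and all sizes are my choices): find one solution of a binary perceptron on a small random training set, then estimate by sampling the fraction of configurations at Hamming distance d that are still solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 101, 20                                   # weights (odd, no zero fields), patterns
XI = rng.choice([-1, 1], size=(P, N))
T = rng.choice([-1, 1], size=P)

def solves(w):
    return bool(np.all(np.sign(XI @ w) == T))

# Find one reference solution with a crude walk-SAT-like search: pick a
# misclassified pattern and flip one weight that is misaligned with it.
w = rng.choice([-1, 1], size=N)
for _ in range(200_000):
    if solves(w):
        break
    mu = int(rng.integers(P))
    if np.sign(XI[mu] @ w) != T[mu]:
        bad = np.flatnonzero(w * XI[mu] * T[mu] < 0)
        w[rng.choice(bad)] *= -1
assert solves(w), "no solution found; lower P or retry"

# Sampling estimate of the fraction of configurations at Hamming distance d
# from the reference that still solve the training set; its log, relative to
# the log-binomial, is the quantity plotted in the weight enumerator curves.
for d in (2, 5, 10, 20):
    hits = 0
    for _ in range(2000):
        v = w.copy()
        v[rng.choice(N, size=d, replace=False)] *= -1
        hits += solves(v)
    print(f"d={d:3d}  fraction of nearby configs that solve ~ {hits/2000:.3f}")
```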
Whereas if you use another type of algorithm, one which, as we can show analytically, targets these dense regions, then you find a solution around which the curve stays flat up to a certain distance, which means you really end up in a region that is super flat. So I think this is nice, and let me show you what I think is an amazing simulation that a student did. You take a network with hidden units storing random patterns and so on, you compute a solution with different algorithms, and then you check the Hessian as a measure of flatness; this is used every day, a lot of people do this when analyzing deep networks. Watch out: this is a student's figure, so the scales are not uniform across the panels, and you have to keep that in mind.

Let's take stochastic gradient descent with mean square error. You end up in a minimum whose spectrum has a lot of zero eigenvalues plus a tail that extends between, say, zero and three hundred; that is the scale. If you use plain gradient descent, you get stuck in a minimum which is much narrower, in the sense that many eigenvalues of the Hessian are strictly positive, and the scale runs between zero and two thousand. And if you use the algorithm that targets the wide minima (let's just compare these, to keep things simple), the spectrum of eigenvalues is a delta at zero, and notice the scale: it runs just between zero and ten. I have never seen such a Hessian; I don't know if you have ever seen such a Hessian in deep learning. It is really super flat. So this technique of using the local entropy as a cost function, in which you never compute the Hessian at all because you consider the local entropy over a macroscopic length scale (it is not a local quantity), gives you as a by-product this spectacular flatness even at the local level.

Let me also mention, since today I don't have time for it, that we have checked that the wideness of the minima correlates with the generalization error on several test sets, in teacher-student scenarios as well as on benchmarks, so everything looks consistent in this family of shallow networks. For deeper networks we need to do more experiments; we have some evidence, but nothing to report right now.

So this concludes, more or less, the discussion we had yesterday. One thing I want to remind you of is that this is a particular property of systems that threshold a sum. If you take as a learning machine a so-called parity machine, in which instead of taking the sum of the hidden states and thresholding it you take their product as the output, then this machine does not display any flatness. It is very well known that you can do nothing with this device; it is just a technical example showing that you need a machine with the propensity to have these wide flat minima, and this machine does not have it, so I would not use it for learning.
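To make the flatness probe concrete, here is a minimal sketch (my own toy, not the student's simulation; sizes and hyperparameters are my choices): train a tiny one-hidden-layer network on random patterns by plain gradient descent on the mean square error, then inspect the eigenvalue spectrum of a finite-difference Hessian at the minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, P = 10, 3, 8                         # inputs, hidden units, patterns
X = rng.choice([-1.0, 1.0], size=(P, N))
y = rng.choice([-1.0, 1.0], size=P)

def unpack(w):                             # flat parameter vector -> (W1, w2)
    return w[:N * H].reshape(H, N), w[N * H:]

def loss(w):
    W1, w2 = unpack(w)
    h = np.tanh(X @ W1.T / np.sqrt(N))     # hidden layer
    out = h @ w2 / np.sqrt(H)              # linear readout
    return 0.5 * np.mean((out - y) ** 2)

def grad(w, eps=1e-5):                     # central-difference gradient
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = 0.5 * rng.standard_normal(N * H + H)
for _ in range(2000):                      # plain gradient descent
    w -= 0.2 * grad(w)

D = w.size                                 # finite-difference Hessian
Hess = np.zeros((D, D))
for i in range(D):
    e = np.zeros(D); e[i] = 1e-4
    Hess[:, i] = (grad(w + e) - grad(w - e)) / (2e-4)
eigs = np.linalg.eigvalsh((Hess + Hess.T) / 2)
print("final loss:", loss(w))
print("top Hessian eigenvalues:", np.round(np.sort(eigs)[-5:], 4))
```

The spread of the top eigenvalues is exactly the "scale" being compared across the panels in the talk: the narrower the spectrum around zero, the flatter the minimum.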
Okay, so before discussing the evolution that took place in deep learning, let me show you a couple of examples of processes that end up in these wide flat minima. The first one is really super simple, and it is this one. Suppose that your weights are stochastic, in the sense that what you can specify in your neural network is not the value of a weight but only its probability distribution. Say w is a binary variable: the only thing you can specify is this parameter η of its distribution, so you have a stochastic system. The idea is the following: you present a pattern to this network (again we are back to the simplest possible non-convex model, just for analytic purposes, but this will hold for other systems too), you extract your weights at random, and you compute the output. That is the process. You have stochastic weights, and learning here means that you somehow modify the probability distribution of the weights. It is a truly stochastic device: you present a pattern, extract the weights at random, compute the output and check it, and to train the system what you want to do is maximize the likelihood of the correct output.

[Audience question.] No: the weights have a certain probability of being plus or minus one, but when you actually compute, you look at the sampled values. Yes, you do learn, but what you learn is this η, the probability distribution. This model reduces to the standard one if the probability distribution is fully peaked on one value; if the distribution is broad, the device remains stochastic. When you use the network, you feed something in, your synapses take some values, and you compute the output. On a computer you sample, but you should imagine this as a model in which a synapse is simply a stochastic device that most often gives, say, the value plus one but sometimes gives minus one. It is like saying that you cannot perfectly learn the value of a synapse; it remains somewhat stochastic, which is probably quite realistic, since a real synapse probably never takes the same value twice. Precisely.

So if you do this, you have a stochastic device, and in principle what you would like to do is find a W that maximizes the probability of getting the right answer at the output. You can phrase this in a Bayesian context: you make a certain assumption about the probability distribution of the weights and then maximize the likelihood under this assumption, so the cost function is the log-likelihood of the weights giving the right answers; just the standard thing. As I wrote here, in a slightly different language, we parameterize each weight with one real number that characterizes its probability distribution. What is good about this is that you now have a discrete system on which you can actually use any gradient-based algorithm, because what you are learning are the probability distributions. And the key point is that when you do this analytically, you can keep under control the magnetization of the weights, that is, their bias: let me call m_i the expected value of w_i.
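In formulas (notation mine, chosen to match the description; the tanh link between the unconstrained parameter and the magnetization is my choice, for concreteness):

```latex
% Stochastic binary synapse with a single real parameter per weight:
P(w_i = \pm 1) \;=\; \frac{1 \pm m_i}{2},
\qquad
m_i \;=\; \mathbb{E}[w_i] \;=\; \tanh(\eta_i) \;\in\; (-1, 1).
```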
If this synapse, this weight, always took the value plus one for all patterns, this distribution would be a delta at one; otherwise m_i can sit anywhere between minus one and plus one. The idea is the following: one can show analytically that there exist solutions satisfying the whole training set while the magnetization of the weights is still different from one. So in the stochastic regime you can actually store everything: you see that at a magnetization of 0.85, which means there is still a lot of fluctuation in the system, you can store everything. And if you then run stochastic gradient descent, or any simple stochastic process, on this device, you easily find solutions. Let me remind you that the original problem, the binary perceptron, is extremely hard to train; in this formulation it becomes easy. Why? Because if you keep the magnetization away from one, that is, if you look for solutions in a regime in which the weights can fluctuate, then they can only fluctuate inside wide flat minima; they cannot fluctuate if the minima are narrow. It is just a consistency argument: if you allow the synapses to fluctuate and you train, they naturally end up in these wide flat minima. This can be shown analytically, and I don't want to bother you too much with all of it, but the message is that noise on the weights helps here. And it does not help in the way noise helps you overcome a barrier; it has nothing to do with that. It is a kind of noise that allows you to identify these wide flat minima, not the kind of noise you would think of in terms of simulated annealing or optimization.

[Question.] Yes, this is very similar to DropConnect. Again, you see these connections: there are two variants, dropout and DropConnect; in one case you drop the nodes, in the other you drop the links, and this is a model for dropping links. We could also design a model for the other case; this is a connection we are going to explore, but it seems very strong. The key point is the following: if the underlying optimization problem has wide-flat-minima solutions, then those solutions are the only ones consistent with having fluctuating variables. If, on the other hand, the problem only has narrow minima, this does not help at all: if you try to use this kind of device to train a parity machine, it does not work. You need an underlying structure with this property, and even if the wide minima are rare, it does not matter: with this kind of system you are going to end up there. But clearly in this case we are maximizing a likelihood, not minimizing the energy; we are doing something totally different.
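Here is a minimal sketch of this learning-by-magnetization idea (my own implementation of the description, not the authors' code; the Gaussian approximation of the pre-activation and the tanh parameterization are my choices, and I drop the variance term of the gradient for brevity). We ascend the log-likelihood that a randomly drawn binary weight vector classifies every pattern correctly, updating only the fields h_i with m_i = tanh(h_i).

```python
import numpy as np
from math import erf, sqrt, pi, exp

rng = np.random.default_rng(1)
N, P = 201, 60
XI = rng.choice([-1.0, 1.0], size=(P, N))    # random patterns
T = rng.choice([-1.0, 1.0], size=P)          # random labels

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # Gaussian CDF

h = np.zeros(N)                              # fields, m_i = tanh(h_i)
for epoch in range(500):
    m = np.tanh(h)
    g = np.zeros(N)
    for mu in range(P):
        xi, t = XI[mu], T[mu]
        mean = float(m @ xi)
        # CLT: the pre-activation is ~ Gaussian with this mean and variance,
        # so P(correct) ~ Phi(t * mean / std).
        var = float(np.sum((1.0 - m**2) * xi**2)) + 1e-9
        z = t * mean / sqrt(var)
        dens = exp(-0.5 * z * z) / sqrt(2.0 * pi)
        # d log Phi(z) / dh_i, keeping only the dominant mean term
        g += (dens / max(Phi(z), 1e-12)) * t * xi / sqrt(var) * (1.0 - m**2)
    h += 0.05 * g

m = np.tanh(h)
w = np.where(rng.random(N) < (1.0 + m) / 2.0, 1.0, -1.0)   # sample weights
print("mean |m|:", round(float(np.abs(m).mean()), 3),
      "| errors of one sampled w:", int(np.sum(np.sign(XI @ w) != T)))
```

The point of the check at the end is exactly the consistency argument above: if training succeeds while mean |m| stays visibly below one, a randomly sampled weight vector can only keep solving the training set if it lives inside a wide flat region.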
Now, for the physicists here, let me mention that there is a quantum version of this; it is a bit ridiculous, but okay. To use the language of physics, let me denote the W's by sigmas, σ^z, the z components of Pauli matrices. We define an energy function which is just the number of errors (again for the binary perceptron, though the same holds for other networks), and we know that simulated annealing on this is not going to work. So what about quantum annealing? This is for those of you interested in quantum computing. You can try a different approach: take this energy function as the objective, identify the weights with the z components of Pauli matrices, and add a transverse field in the x direction, a quantum term. The idea of quantum annealing is that you start with a very large Γ, so the system is totally polarized in the x direction, and then you reduce this field; at the end of the process, when Γ reaches zero, you end up in the ground state of the original Hamiltonian, so you have solved your problem. There is a theorem, the adiabatic theorem, which tells you that if you reduce Γ slowly enough (essentially at a rate inversely proportional to the gap between the ground state and the first excited state of the Hamiltonian), you are guaranteed to end up in the ground state. In fact, the usual reason quantum annealing does not work is that this gap is typically exponentially small. But in this case it does work, even though simulated annealing fails. So this is one example.

How do you show it? Let me be quick. To study this you typically run a quantum Monte Carlo simulation: you perform a transformation that adds one extra dimension, the imaginary-time dimension, and you rewrite a new Hamiltonian containing many replicas of the original system, coupled to one another by a ferromagnetic coupling. If you sample from this Hamiltonian with some Markov process, you recover the probabilities you would obtain from the density matrix of the quantum system. So what you get is a certain number of replicas of your initial system; this is the Suzuki-Trotter representation, and if you are not familiar with it, look it up, it is very nice, something a physicist should see once in their life. The coupling between neighboring replicas is ferromagnetic and is controlled by Γ, and the point is that when Γ goes to zero the quantum effects vanish, this coupling becomes infinite, and all the replicas collapse onto a single system, the classical one. That is the mechanism.

To make a long story short, there is a very strong analogy between the Suzuki-Trotter representation of the problem, many systems coupled through nearest-neighbor interactions, and the robust ensemble I described yesterday, replicated systems coupled to a centroid. They are very similar, and in fact we exploited the similarity: since these are almost the same thing, let's check what quantum annealing is doing. One can carry out the full calculation for this problem; I skip it and go to the results. As a function of Γ you can solve the problem analytically and compute the expected value of the classical energy in this quantum Hamiltonian: it is this dotted curve.
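Schematically, the Suzuki-Trotter representation just described is the standard mapping (notation mine, up to conventions; M is the number of imaginary-time slices, with periodic boundary σ^{M+1} = σ^1):

```latex
\hat{H}(\Gamma) \;=\; E(\hat{\sigma}^z) \;-\; \Gamma \sum_{i=1}^{N} \hat{\sigma}^x_i,
\qquad
Z(\beta,\Gamma) \;\simeq\; \sum_{\{\sigma^a\}}
\exp\!\Big(-\frac{\beta}{M}\sum_{a=1}^{M} E(\sigma^{a})
\;+\; J_\perp \sum_{a=1}^{M}\sum_{i=1}^{N} \sigma_i^{a}\,\sigma_i^{a+1}\Big),
\qquad
J_\perp \;=\; \frac{1}{2}\,\ln\coth\!\Big(\frac{\beta\Gamma}{M}\Big).
```

Since J⊥ diverges as Γ → 0, the replicas are forced to collapse onto a single classical configuration, which is exactly the mechanism described above.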
If you then run quantum Monte Carlo simulations of annealing on this system, you see that as you increase the number of slices (the quantum limit is reached with infinitely many slices) you approach the theoretical curve, and in particular, when you send the transverse field to zero you recover zero energy. If instead you run simulated annealing, you get stuck at finite energy, and as you increase the size of the problem you get stuck at higher and higher energies; in the infinite-size limit you stay there. So this is a weak form of quantum advantage: one example in which simulated annealing does not work and quantum annealing does. It has nothing to do with quantum supremacy in an absolute sense, because, as I told you yesterday, there are classical algorithms that work in this case; but it is a nice example in which quantum fluctuations, just like the fluctuations in the synapses, automatically lead you to identify valleys that are very wide and to end up in these rare minima.

From the quantum point of view (again just for the physicists) why is it so? Because in the quantum Hamiltonian, even at zero temperature, you have kinetic energy, and the kinetic energy is roughly proportional to how much the system can delocalize. A classical system at zero temperature either has entropy or is stuck; at the quantum level you can still delocalize, so the information about the entropy is encoded in the kinetic term of the Hamiltonian. This is why the system can reach these states: as you reduce Γ you flow into them, and when Γ is zero you remain there. Somehow the Γ → 0 limit and the classical limit do not commute, and the quantum fluctuations are dominated by these rare regions as Γ → 0. Maybe this is interesting for physics, I don't know.

One can also show that the minima that quantum annealing finds are flat: take a minimum it found, sample at random around it, and check the energy of the surrounding states, and you find it is flat, whereas around a local minimum found by simulated annealing (the gray curve) the profile is very narrow. These are analytic curves, and the fact that they start at the same point is only because I subtracted the reference energy; simulated annealing actually gets stuck at higher energy, this is just to put everything on the same scale.

Then a referee asked us to do the real quantum dynamics. Luckily, with the software available nowadays you can do these things quickly, and with Carlo Baldassi, who is a great scientist and computer scientist, we analyzed the dynamics of this system; essentially everything we found is consistent with the analytic calculations. One thing we could check is the following. Take your original problem, whose energy levels sit on the diagonal of the Hamiltonian, and randomly permute those diagonal elements: you scramble away all the geometry, because previously a certain energy level corresponded to a certain state, and now you keep the energy levels but place them on random states, so the structure with the dense regions is gone.
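The scrambling experiment is easy to reproduce in miniature; here is a sketch (my own toy, not the paper's simulation; sizes, the pattern count, and the Γ sweep are my choices): build a transverse-field Hamiltonian on a few spins, permute the diagonal energies, and compare the minimal spectral gap along the annealing path, which is what controls the adiabatic rate mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 9                                       # spins (odd: no zero fields)
dim = 2 ** n

# Classical energies: number of errors of each weight configuration on a
# toy random-pattern problem (a stand-in for the perceptron energy).
XI = rng.choice([-1, 1], size=(6, n))
T = rng.choice([-1, 1], size=6)
confs = ((np.arange(dim)[:, None] >> np.arange(n)) & 1) * 2 - 1
E = np.sum(np.sign(confs @ XI.T) != T, axis=1).astype(float)

# Transverse-field term: one matrix entry per single-bit flip.
X = np.zeros((dim, dim))
for i in range(n):
    X[np.arange(dim), np.arange(dim) ^ (1 << i)] = 1.0

# Sweep Gamma over a moderate range (very small Gamma is excluded so that
# exact degeneracies of the classical spectrum do not dominate the gap).
def min_gap(diag, gammas=np.linspace(1.5, 0.3, 25)):
    gaps = []
    for g in gammas:
        ev = np.linalg.eigvalsh(np.diag(diag) - g * X)
        gaps.append(ev[1] - ev[0])
    return min(gaps)

print("min gap, original spectrum :", round(min_gap(E), 4))
print("min gap, scrambled spectrum:", round(min_gap(rng.permutation(E)), 4))
```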
If you now run quantum annealing on this reshuffled system, you observe the exponential slowdown: it does not work anymore. Quantum annealing on the reshuffled spectrum gets stuck and cannot reach zero energy, so the effect really is tied to the geometrical structure of this wide minimum into which the system can delocalize. And if you compute the spectrum, you find that the reshuffled system has small gaps at some nonzero value of Γ, whereas the original system has none. So I have given you two examples of fluctuations that are attracted by these dense states; there are plenty of other simple stochastic processes with this behavior, but at least in these two cases we can do the calculations.

Now let's go back to machine learning; this is the last thing I want to tell you, and it is the first step in the plan of understanding all the tricks that have been introduced in deep learning. The question I want to discuss is: why did people start using the cross entropy, in systems which are not stochastic, in deterministic networks? First of all, what is the cross entropy? It comes from interpreting the output of a neuron as the probability of the label, not as the label itself. You normalize the output so that it lies between zero and one, and then the output y is the probability that the label corresponding to the input is, say, one rather than zero, given the input and the weights. The idea is that at the end of your neural network, instead of reading off the output, you read this value, flip a coin, and decide. You are introducing this coin flipping artificially; it is an abuse, in a sense.

If you interpret the output as a probability, and we take the labels to be zero or one, then the probability of label one is y and the probability of label zero is 1 - y, in the two-class case. The probability of a given label, given the weights and the input, can then be written compactly: if t = 0 one factor disappears and you get 1 - y, and if t = 1 the other disappears and you get y, which can be written in entropic form. Taking the product over all patterns gives the likelihood you want to maximize, so you can redefine your learning problem as minimizing a loss function given by this entropic, log-likelihood term, plus a regularization if you want. There are variations (you can consider the softmax, which is more or less the same thing), but the cross entropy is essentially this quantity, in the specific case in which one can do the calculation. In this setting, where you artificially flip a coin, maximizing the likelihood finds the W that is the most probable vector given the data.

Now, what happens when you use this in neural networks? Again I have this type of network, the input is ξ, the weights are W; I apologize that I do the calculation once more for the binary perceptron, but it can be generalized.
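Written out, the two-class construction just described is the standard one:

```latex
P(t \mid \xi, W) \;=\; y^{\,t}\,\big(1-y\big)^{1-t}, \qquad t \in \{0,1\},
```
```latex
\mathcal{L}(W) \;=\; -\sum_{\mu=1}^{P}\Big[\, t^{\mu}\log y^{\mu}
\;+\;\big(1-t^{\mu}\big)\log\big(1-y^{\mu}\big)\Big] \;+\; \lambda\,\mathcal{R}(W),
```

where y^μ is the normalized output on pattern μ and the regularization term R(W) is optional, as in the talk.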
The cost of generalizing is only the length of the calculation, which grows with the number of hidden nodes; and this is already a non-convex case, so it is enough. So you have a device with input ξ^μ, and x^μ is the local field, x^μ = (1/√N) Σ_i w_i ξ_i^μ for this device. Then there is the formula that comes from the cross entropy: each pattern you present costs you an amount given by an expression that, more or less, says the following. If x is aligned with the desired output (the blue line here) you may still pay a price; if x has the opposite sign to the desired output, you certainly pay a price. Even when the classification is correct, you keep paying until the aligned field is sufficiently large, where "sufficiently" depends on this γ, a kind of inverse temperature that you put in front of the probability; but let's keep γ fixed for the moment. So it is a bit like a hinge loss: with the cross entropy you still pay a price, you still reject weights that correctly store the patterns but are not sufficiently robust. We introduced the cross entropy by flipping a coin, but if you forget that and treat it as a deterministic cost, at the end of the day it is like using an error function with this blue shape. By contrast, with the plain error loss, for a given pattern you pay a price of one if x has the wrong sign and zero if it has the correct sign, and you sum this over all patterns. You can consider other types of losses, but the main mechanism is this: with the error loss you pay only when there is a mistake; with the cross-entropy loss you pay also when there is no mistake but you are close to the threshold. Is this clear? Deriving this takes two lines of calculation, and in practice this is how the cross entropy is used.
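Here is a tiny numerical illustration of that shape (my own parameterization; the talk's exact convention for γ may differ). With output probability y = sigmoid(2γx), the per-pattern cross-entropy cost reduces to log(1 + exp(-2γu)) as a function of the aligned field u = t·x: it keeps penalizing correctly classified patterns (u > 0) until u is large on the scale set by γ, whereas the error loss is just a step function.

```python
import numpy as np

def cross_entropy(u, gamma=1.0):
    # log(1 + exp(-2*gamma*u)), computed stably
    return np.logaddexp(0.0, -2.0 * gamma * u)

def error_loss(u):
    return (u < 0).astype(float)

u = np.array([-1.0, 0.0, 0.5, 1.0, 3.0])
print("u           :", u)
print("error loss  :", error_loss(u))
print("CE, gamma=1 :", np.round(cross_entropy(u, 1.0), 3))
print("CE, gamma=4 :", np.round(cross_entropy(u, 4.0), 3))
```

Increasing γ sharpens the blue curve toward the step function, which is why it acts as an effective inverse temperature.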
Good. So what we did is the following calculation. First, take our non-convex device, again the binary perceptron (we can also do it for continuous devices), and compute the properties of the ground state of the cross entropy. I spare you the pain, but the result is this. Remember that 0.83 is the critical capacity of this device storing random patterns; what we find is that over a wide range of α, so you can store a lot of patterns, the ground state of the cross entropy corresponds to configurations with zero errors. This makes a lot of sense because, as we have seen, the cross entropy favors robustly correct configurations. One clarification, though: the total cross-entropy loss has this form, and this f is not a zero-one function; it keeps changing as you vary W, and it is not bounded the way the error loss is bounded below by zero. The cross entropy can change continuously even when you make no errors, so just by looking at the value of the ground state it is not obvious that no mistakes are being made; you have to check, with a slightly more complicated calculation, because there could be terms that compensate. Anyhow, the calculation tells you that if you count the errors in the ground state, you get zero: it works.

Now the question is: you find a solution, but what is its nature? Remember that for this device there is an exponential number of solutions, plus these dense regions somewhere. So what does this loss function find? And this is all analytical; I don't yet have any algorithm doing anything, I am just computing the properties of the ground state of this function. First of all, you can actually implement a kind of simulated annealing that uses the cross entropy as the energy function instead of the number of errors, and you can check that it works, it does find solutions; that is already empirical evidence that things are going in the right direction. But to be precise, you have to ask: I have found a ground state of the cross entropy, and suppose I sit on it; what do I see around me? Do I see other solutions? Am I in an empty region? Where am I? That is the problem. (I realize I have been going fast; I was a bit worried about time.)

To answer this, you do the following. On one side you have the probability distribution characterizing the ground state of the cross entropy, the Gibbs measure whose energy is the cross entropy, and you sample from it. On the other side you have the probability distribution of the standard error loss, the error energy function. What you want to know is: given a typical configuration of the cross entropy, obtained by sampling from its measure (so I am not optimizing, I am not looking for the W that maximizes the local density of solutions, I am just sampling from the ground state of the cross entropy), how many configurations that are ground states of the error function do I see at a given distance? This is called the Franz-Parisi method. It is a bit like the calculation we did before, where we fixed a configuration and counted the solutions around it, except that here everything is analytic: instead of fixing a configuration, you take a typical configuration extracted from the ground-state measure of the cross entropy, assume a fair sampling of this ground state, and compute everything. To compute this entropy you need the corresponding partition function, which is a quite complicated object: one factor is the sampling from the cross entropy (this f is the cross-entropy energy), and inside you have the log of a trace over the configurations that minimize the error function, at an inverse temperature β which we send to infinity, so that we concentrate on zero temperature, on ground states.
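Schematically (my notation, reconstructed from the description): the Franz-Parisi entropy is the average, over typical cross-entropy ground states W̃, of the log-number of zero-error configurations W at fixed overlap p,

```latex
\Phi_{\mathrm{FP}}(p) \;=\; \lim_{\beta' \to \infty} \frac{1}{N}\,
\mathbb{E}_{\xi}\!\left[
\frac{1}{Z_{\mathrm{CE}}}\sum_{\tilde{W}} e^{-\beta f_{\mathrm{CE}}(\tilde{W})}
\,\log \sum_{W :\, q(W,\tilde{W}) = p} e^{-\beta' E_{\mathrm{err}}(W)}
\right].
```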
Here you also have a control parameter, the overlap p, which fixes the distance between ground states of the cross entropy and ground states of the error loss. So you need to be able to compute this normalized quantity: the average of that log under this probability distribution. Doing this is quite complicated. Why? First, because the quantity Z_FP itself (FP stands for Franz-Parisi) fluctuates exponentially and therefore does not concentrate; to get the typical, most probable value, you must compute the expectation of log Z_FP, because the log does concentrate, and that is what you actually observe. So you have an outer log; but since what you are averaging is itself an entropy, there is another log inside. You therefore use the replica trick for the outer log and a second replica trick for the inner one, which is quite heavy: there is an index a running from 1 to n, which lets you write the outer log through the n-th power of its argument, and then another set of replicas with an index r corresponding to the replication of the inner log. Do not confuse these virtual replicas of the calculation with the real replicas used in the algorithms; they are two different things.

So you do this calculation, with a double analytic continuation to perform and so on and so forth. It is complicated, but it can be done; many years of study in this field help, in the sense that we know how to do these kinds of calculations. And this is the result. Let's choose α = 0.4, which is more or less in the regime in which solutions exist and the typical ones are far apart. These curves show the Franz-Parisi entropy at a given overlap p: p = 1 means distance zero, and p going to zero means large distance. What you observe is that, indeed, for α = 0.4, close to the ground state of the cross entropy you see an exponential number of solutions of the pure classifier; that is the point. So the result is that when you minimize the cross entropy, you end up in a region in which the problem is, let's say, easier to optimize, and around the typical point you find an exponential number of solutions of the original pure classifier (not of the cross entropy, which is a different object). This even includes solutions that the cross entropy itself would not consider good configurations, because the cross entropy heavily penalizes configurations that barely satisfy the patterns, the ones close to the threshold, remember.
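The replica identity used twice here is the standard one,

```latex
\mathbb{E}\,\log Z \;=\; \lim_{n \to 0} \frac{\mathbb{E}\,Z^{n} - 1}{n},
```

applied once to the outer average (replicas a = 1, ..., n) and once more to the log inside it (replicas r = 1, ..., s), followed by the double analytic continuation n, s → 0.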
So this is the main result, and it really is a long calculation; and there are several things to observe here. First, this gray line is the maximum possible entropy, the log of the binomial: the log of the total number of configurations at a given overlap p, or at a given distance, so no curve can go above it. And if you look carefully you see the following: in the regime where the cross entropy works, its curve behaves one way, while if α is too big, too close to the critical capacity, the cross entropy stops working and the curve behaves differently. What does this mean? Up to a large value of α, the cross entropy finds solutions that are surrounded by a very large number of solutions of the pure classifier.

Oh, sorry, I made a mistake; this slide. What I am doing here is the following. If you look at the structure of this object, you have two factors. There is an integral over a measure on the weights (this measure can include a constraint on the weights) times a product over patterns μ of e^{-β f(ξ·W)}, and then the log of something, all divided by the normalization, which is the same integral with the same product. What is this object? This weight is nothing but the Boltzmann weight of f, the cross-entropy energy function, so what we are computing is the expectation value of the log of something with respect to the probability measure given by the cross-entropy cost function. In practice: looking at the typical, most probable solutions under the cross-entropy loss, what is the expectation value of this object? Many things could go inside the log; what we chose to put there is the number of solutions of the pure classifier, which of course depends on the W extracted from this probability distribution, because of the constraint on the overlap. That is the idea: we are computing this very complicated expectation value, and this is why I keep saying "sampling": what I am describing is a fair sampling from this probability distribution, followed by looking at what is around. This expectation is itself an exponential quantity, so you take the log and so on, you end up with two logs to handle, and you perform the double analytic continuation to obtain the result.

What we find is the following picture. Here is p = 1, here is p = 0, on this axis the entropy, and this is the upper bound, the total number of configurations at a given overlap p. In the regime where the cross entropy works you find curves of the first kind; if instead the parameters are such that α is too large, so you have too many patterns, you find the other kind of curve.
Let's try to understand what these curves are telling us; two or three things here are nice. First, suppose α is too big, meaning that by minimizing the cross entropy you will not be able to find any solution (and note, by the way, that the ground state of the cross entropy is not necessarily a zero-error configuration; also, as I said, these curves are obtained with β going to infinity, so we are looking at ground states). In that case, starting from the ground state you have to move quite a bit before you observe a nontrivial entropy: you are far away from the solutions of the pure classifier. However, if α is not too big, in the regime in which we expect the network to work, you find the other curve, which tells you two interesting things. First: when I find the ground state, I am immediately surrounded by an exponential number of solutions of the pure classifier; this, by the way, implies that the ground state itself is a solution of the pure classifier, so the region is dense. But it also tells us something else which I think is interesting. Remember the plots I described yesterday for the local entropy solutions, where you really search for the solutions of maximum local density; there we are not sampling from a distribution, we are really looking for the W that maximizes that density, and those curves had a range of distances over which they coincide with the binomial, the maximum possible number of solutions. What the present result is telling you is that the cross-entropy ground state is not like that: it is sub-optimal. It goes into the dense regions, but probably not into the densest part of the dense regions. I have no idea what the geometry of these regions looks like (they are clearly heterogeneous), but this is the picture: find the ground state with the cross entropy and you will see around you an exponential number of solutions, yet for the very same problem there exist configurations that are denser still.

Let me go back to the Hessians I showed you before. This is the continuous version of the model, but it is related. If you use stochastic gradient descent with the cross entropy, you land in a region which is pretty flat: the range over which you find nonzero eigenvalues is zero to 200, not zero to 2000 as for plain gradient descent, which gets stuck somewhere narrow. But it is certainly not the delta function you would observe by really maximizing the local entropy; somehow you are at the boundary of the region, or something like that. So it seems to me that this result is a first step in our plan.
As I said, deep learning went through an evolutionary process in which the loss functions changed: essentially, people went from mean square error to cross entropy. Then there is the transfer function; then the algorithms, with regularizations and so on; then the architectures; then the data processing, like data augmentation; and many other things. As far as the machine itself is concerned, our plan is to analyze these four features and compare the pure classifier with modern networks. For this first step of the plan, what we observe is that the reason people use the cross entropy has nothing to do with stochasticity: it is simply a robust cost function that is attracted by these rare flat minima. That is the evidence we have. Of course, it is evidence for random patterns, not a worst-case analysis, and things might be different on realistic data sets, I don't know; but all the experiments we have seen are consistent with it. So there is nothing mysterious here. And, as I said, the cross entropy is not optimal; you could still do better. If you think about it, there is a parameter in the cross entropy, this γ, the effective temperature, and nobody optimizes over it; the fact that we can compute an optimal temperature is already a signal that the standard choice is not optimal in some sense.

Now, to be chronologically consistent with what we did: this was the first calculation, and the next step was the case of continuous networks. What I showed is for the binary perceptron; the next step is to perform the same type of analysis and show that if you use the cross entropy for a continuous system and train, you end up in regions at the boundary of a wide flat minimum, or inside it. This is shown experimentally here: this is a one-hidden-layer network, this is the spectrum of the Hessian for the cross entropy, and this is for the densest region; with the cross entropy you are not exactly at the center.

What is the problem, then? If you want to generalize this calculation to continuous weights, the integrals I wrote before become intractable: you have to solve equations containing something like nine nested integrals. Somehow people managed to do related computations in the 90s, but now we are not able to reproduce them, precisely because we insist on precision. There is a mystery here, published results that I cannot reproduce, and I suspect that what they did was to settle for very rough approximations of the integrals, which would not be acceptable nowadays. Already solving nonlinear equations that depend on six nested integrals is really difficult, and in this case it is even worse. So how can we show that the same thing happens in continuous networks? The trick we use is the following.
As I showed you yesterday, we can find solutions in different ways. If I use the robust ensemble, running a belief propagation algorithm on this replicated Hamiltonian, I obtain solutions that sit at the center, say, of the wide flat minimum. This is by construction: this function is just another way of writing the local entropy, and by finding its ground states, with y taken big enough, I am maximizing the local entropy; and I know how to do this algorithmically. So this gives me the reference, the best thing I can possibly do. The second option is to use different loss functions, mean square error or cross entropy for instance, and find solutions, configurations of the weights satisfying the training set, by stochastic gradient descent or plain gradient descent.

Once I have these solutions (and this part, unfortunately, is numerical), here is what I can do for a given architecture. Take a network of size N = 10,000, say, a pretty big one, storing random patterns, and find solutions with the, say, three algorithms. Then compute the log of the number of solutions at a given overlap q, or at a given distance, from the solution obtained. You do not want to compute this quantity numerically, because it is exponential; but you can compute it analytically, again with the cavity method, that is, belief propagation. For one given sample, one training set and one architecture, you find a solution and then, keeping that architecture and that number of patterns, you analytically estimate this number. So it is a pseudo-analytic method.

This is very well known in error-correcting codes. There you typically use random parity-check codes, random codes satisfying random linear equations, and you ask: given a codeword, which is a ground state, how many other codewords do I see at a given distance? This matters because if other codewords sit nearby, you will not be able to correct errors; so for error-correcting codes you want the opposite phenomenon, a range of distances around a codeword where nothing appears, and only beyond it do other codewords show up. In our case we are interested in the complementary behavior: we want to see solutions immediately. The two models are indeed complementary: you cannot use a neural network as an error-correcting code, and you cannot use an error-correcting code for storing patterns; we have just seen that the parity machine does not work for learning, and the parity machine is very closely related to an error-correcting code.

So this is what you do; you can set up an algorithm, and this algorithm is not so simple. The idea is the following: you have a network with, let me draw it like this, more or less this architecture.
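As an aside, the robust-ensemble reference used above can be sketched in its simplest gradient-based form (my own toy implementation of the replicated idea; the talk's actual algorithm is belief propagation on the replicated Hamiltonian, and the loss, sizes, and coupling schedule here are my choices): several replicas of a perceptron are trained together, each feeling the training loss plus an elastic coupling toward the replicas' center, which biases the search toward wide, dense regions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, R = 200, 120, 5                     # weights, patterns, replicas
XI = rng.choice([-1.0, 1.0], size=(P, N))
T = rng.choice([-1.0, 1.0], size=P)

def loss_grad(w):
    # Gradient of the smooth margin loss sum_mu log(1 + exp(-t*x)),
    # a stand-in for the cross entropy on this device.
    x = (XI @ w) / np.sqrt(N)
    s = -T / (1.0 + np.exp(T * x))        # d/dx log(1 + exp(-t*x))
    return (XI.T @ s) / np.sqrt(N) / P

W = rng.standard_normal((R, N))           # the replicas
gamma, lr = 0.01, 0.5
for step in range(2000):
    center = W.mean(axis=0)
    for a in range(R):
        # training gradient + elastic pull toward the center of mass
        W[a] -= lr * (loss_grad(W[a]) + gamma * (W[a] - center))
    gamma *= 1.002                        # slowly increase the coupling

center = W.mean(axis=0)
print("training errors of the center:",
      int(np.sum(np.sign(XI @ center) != T)))
```

The slowly increasing coupling plays the same role as the diverging J⊥ in the Suzuki-Trotter picture: at the end, the replicas collapse onto a single configuration, the center, which by construction sits where many solutions coexist.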
For a given configuration of the weights, you want to compute, analytically, how many other configurations of the weights exist at a given overlap that satisfy the whole training set. The problem is that these weights are now real numbers, so to write the belief propagation equations you have to deal with integrals over the probability distributions of these weights, which is difficult to handle efficiently. Luckily, all the sums that appear turn out to be Gaussian sums, so you can apply the central limit theorem and essentially close these functional equations (I will not go into the details; you will find them if you are interested) in terms of just the expectation values and variances of the fields. One can thus write these equations and close them in an efficient manner, even for these complicated continuous networks, and compute these logarithms very efficiently.

So we can compute these quantities, and doing so we obtain (sorry, I am not sure where I put the figure) curves like these, for the different algorithms. And you observe that, indeed, in the case of the cross entropy there is some flatness. The top curve is the best you can do; the cross entropy does something like this, and the mean square error something like that. So the solutions you find are more or less flat depending on the loss function you use: the cross entropy is flatter than the mean square error, and still not optimal. I apologize, I thought I had included the picture but I did not; still, this is the scenario: here the optimum, here the cross entropy, here the mean square error, and then there are greedy algorithms that end up in even less dense regions. So this analytical derivation, which runs to some thirty pages, lets you compute these curves and show that also for continuous systems the cross entropy ends up in dense regions. And here you can also compute the density around a typical solution, because we can build planted models, models in which we know a typical solution, and compute the weight enumerator function, in the language of error-correcting codes: the number of solutions you would observe at a given distance from the most probable, most numerous solutions. You find that that curve is much lower, while maximizing the local entropy gives the top one. So clearly these dense regions exist, and the cross entropy lands somewhere in between: not optimal, but pretty good, with space for improvement left. That is the first point.

The second point is that if we take the replicated graph (I mentioned this yesterday, but now I think it is much clearer), the replicated Hamiltonian in which we couple the different replicas, then we know that the typical solutions of this replicated system are the dense ones.
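The central-limit closure just described is, schematically (notation mine, reconstructed from the description): each cavity marginal of a weight is tracked only through its first two moments, and the local field on each pattern then becomes Gaussian,

```latex
a_{i\to\mu} = \mathbb{E}[\,w_i\,], \qquad
d_{i\to\mu} = \operatorname{Var}[\,w_i\,],
\qquad
x^{\mu} = \frac{1}{\sqrt{N}}\sum_{i} w_i\,\xi_i^{\mu}
\;\approx\; \mathcal{N}\!\left(\frac{1}{\sqrt{N}}\sum_{i} a_{i\to\mu}\,\xi_i^{\mu},\;
\frac{1}{N}\sum_{i} d_{i\to\mu}\,\big(\xi_i^{\mu}\big)^{2}\right),
```

so the functional BP updates reduce to closed equations for the pairs (a, d).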
We can write the BP equations for this replicated system, using all the tricks above, and turn them into an algorithm that finds these solutions very efficiently; at the moment for shallow networks, we have not tried deeper networks of this type. So we have an algorithm that finds these dense solutions not only analytically but concretely, for these networks, and these are the results.

Good. I scrambled things a bit because I was worried about time, so let me wrap up now, and if there are questions we can discuss. I hope the scenario is clear. We have shown that these states exist in discrete and continuous networks and that they are attractive for different types of algorithms; the cross entropy ends up there. We now also have some analytical results showing that the same is true for the transfer function: why do people use ReLU functions? Because the dynamics with ReLUs tends to end up in these regions; they help. I did not want to talk about this today because we still have to write it up, but it is the next step in the process. Another step will be tricks like dropout which, as was noted before, fit naturally in this setup. And then the key question will be the architecture: are we using many layers because it is easier to end up in this kind of state with more layers, rather than with a wide, shallow architecture? That is the other direction we need to investigate, and we have not even started yet; but why not. One thing I can tell you, which everybody knows, is that training wide shallow networks is hard: when you have a lot of hidden units in a shallow network, finding the ground state is practically difficult. So even though these wide flat minima exist there, it might still be complicated to find them, because we are not convexifying the problem; maybe you need more layers precisely to avoid needing an enormous number of hidden units in the first layer, which would make the problem harder to optimize. That is a hint; maybe I am wrong, we will see. What I can say today is only that the cross-entropy loss goes to these regions, and the ReLU helps as well; the architecture will come next, and other tricks after that, maybe.

But I think this already says something, because there is all this discussion going on about the nature of the minima of deep networks. At the beginning I mentioned typical observations you find in the literature. One observation is that all the minima are connected. We know this is not the case: already in a one-hidden-layer continuous network there is ergodicity breaking, so we know it is false. However, if you use the cross entropy as your loss function, you will always end up in a connected region, and then the only thing you will ever observe is connected solutions. If you look at your system with the eyes of the optimizer, having already hidden all the complexity through the choices you made, you only see a part of the solution space, and you get this impression. This is a strange thing about the literature:
results that are 20 or 30 years old are completely ignored, and people claim things that are obviously wrong; you can show analytically that they are wrong. That is one point. The other point is the observation that the dynamics is not glassy, that learning is somehow easy. Again, it is easy because we have already chosen a loss function, and maybe a transfer function, that confine you to a subspace in which things are easier. For instance, if we took the local entropy as a cost function and ran simulated annealing, we would get the impression that the system we are observing is a ferromagnet: boom, you easily find the ground state. But as soon as we go back to the pure classifier as the reference model, we know that this model is very complex and highly glassy. So it is very risky to take your final model and run simulations on it: you risk not understanding anything, because the reason things work is that the complexity has already been washed away. We want instead to go back and build things up step by step; that is our approach.

Sorry, I completely misjudged the timing; I more or less finished what I wanted to say, early for once, but in any case I have to catch the taxi, so we still have 15 minutes or so for discussion, if you want.

[Question.] This is very interesting; let me repeat it for the others: given this scenario, can you give us an algorithm that behaves better? First of all, let's define the context. Consider shallow networks, for simplicity, and set up a teacher-student scenario: you generate data with a certain data generator, you learn, and you check your capability of inferring the rule. In this context the answer is yes: we can show that the wider the minimum, the better the generalization; they correlate. And we are not using any tricks here: we just take the mean square error, the cross entropy, and the robust ensemble, which goes into the high-local-entropy regions, and compare the performance of the three on a data set. But when you go to deep networks, things are more complicated: you take a huge data set and a network with a very complicated architecture, with cross entropy, dropout, batch normalization and so on, and at the end of the day you probably already have something that goes into a wide flat minimum. So we certainly do not get worse performance, and we do get comparable performance. If you compare plain local entropy against plain stochastic gradient with the cross-entropy loss, the former certainly does better; but once all these tricks are added, things become so complicated that I do not even know how to set up the comparison. My view as a physicist is that the question is rather: can you find a simple learning process that is performant? Hopefully we might end up with something of relevance for neuroscience too, something like that. The interesting question, I think, is: once you have understood some basic, fundamental mechanism, can you find algorithms that are more realistic, and of course improve performance if you can? The networks and algorithms out there are extremely sophisticated; and MNIST, forget about it, it is too easy, there is no way to learn anything more from it.
But on toy models, yes: when everything is under control, we do observe this clear correlation, and I think it can maybe also be shown theoretically, even though we have not yet worked much on that. There is by now quite some agreement that wide flat minima are needed to generalize well, and the reason is robustness, the absence of overfitting: such a minimum is not just a random minimum, it is the largest one, which means you are maximizing an entropy, constraining the system to have maximum entropy, and this is what gives you robustness against overfitting. Robustness is very well known to work in this sense, and this is an intrinsic way of achieving it.

[To the questioner.] And what is your view of the landscape of deep networks; do you have an alternative opinion? I don't know if you have ever worked on that, so you need not have one. But other results in the literature claim that all these minima are more or less equivalent, and this is just not true, at least in these shallow networks. Of course, once you have modified your model and changed everything, maybe whatever is left becomes equivalent, because what is left are only these dense regions; at that point, fine, but not in principle.

And no, you do not have that many wide flat minima. There are two entropies here: one is the total entropy, the total number of solutions, say; and then you can count how many dense regions you have, and the number of dense regions is exponentially smaller than the total number of solutions. So you have relatively few dense regions, and those are probably, yes, more or less equivalent (there are also some symmetries in these networks), but there is not a huge number of them. The point is that, since they are flat and wide on a scale of the order of the size of the system, you cannot have many of them. So thank you, and I apologize for the mess.