argument that we briefly saw in the case of the perceptron to architectures like this. The geometry of the space of solutions of a multilayer architecture is actually non-convex and very complicated, but you can still use the same method, and the kind of volume you have to compute is the same as before. You integrate over all the weights and you impose a normalization on a hypersphere for the weights. You see that now you have two indices: one runs over the hidden units, the other over the inputs. And again you have a function that is equal to one only when all patterns are correctly classified; these tau's are the outputs of the hidden units. It is just a generalization of what we discussed before. And again, if you do the same kind of calculation, you have to compute the average of the log in order to estimate the most probable volume you are going to see. The calculations are a bit more complicated than before, but you can do them.

Now you discover that at a certain point you can no longer make the assumption that everything is symmetric. In the matrix Q_ab that we discussed before, you now have to allow for some symmetry breaking. Before, we considered the case in which all the elements Q_ab are equal to a single q; remember, we set Q_ab = q in the search for the saddle point. Now you have to break that symmetry. This is a technique introduced by Parisi at the end of the 70s, and it is a way of breaking the symmetry systematically. I don't want to bother you with it now, because you have already been very patient, but the outcome is very interesting.

So you do all your calculations, and what you find is the following. As alpha increases (let me remind you that the number of patterns is alpha times N, so you have an extensive number of patterns, again random patterns with random outputs for the moment), at a certain point the typical overlap between two solutions changes. This plot describes the typical overlap between solutions, and what happens is the following. Up to a certain point the space of solutions is connected, say like this, with a single typical overlap between two solutions. Beyond a certain value of alpha, learning is still possible, but the weight space breaks into many components. You then have one typical overlap inside a region and another overlap describing the distance between weight vectors that belong to different regions, so intra- and inter-region overlaps. At that point the replica-symmetric solution becomes unstable, you have to break the symmetry to find the solution, and what this is telling you is that the space becomes disconnected.

OK, this is interesting. So this is the scenario; I spare you all the calculations. This is a real calculation for one hidden layer with just three hidden units, for simplicity. You see that at the beginning you have a connected space and learning is easy, and at this point here learning becomes difficult.
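To summarize the quantity being computed in this replica analysis (my notation, a sketch assuming a committee machine with K hidden units, each with its own N-dimensional weight vector normalized on a sphere):

\[
V=\int \prod_{l=1}^{K}\prod_{i=1}^{N} dW_{li}\;\prod_{l=1}^{K}\delta\!\Big(\sum_{i} W_{li}^{2}-N\Big)\;
\prod_{\mu=1}^{\alpha N}\Theta\!\Big(\sigma^{\mu}\,\operatorname{sign}\Big(\sum_{l}\tau_{l}^{\mu}\Big)\Big),
\qquad
\tau_{l}^{\mu}=\operatorname{sign}\Big(\sum_{i} W_{li}\,\xi_{li}^{\mu}\Big),
\]

and the typical volume follows from the quenched average \(\langle \ln V\rangle\) over the random patterns, computed with replicas through overlap matrices such as \(Q^{l}_{ab}=\frac{1}{N}\sum_i W^{a}_{li}W^{b}_{li}\), which is where the replica symmetry breaking enters.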
In fact, you can see this in a simple simulation run by a student. All the calculations here are done with neurons that have a sign activation, but suppose that instead of the sign you take a hyperbolic tangent, so that you can take derivatives. Then you can define your error function as the mean squared error, the squared distance between the desired output and the actual output of your network, and you can run gradient descent or stochastic gradient descent to try to find solutions. What you observe is that up to this point here, when the number of patterns is relatively small, you can easily find solutions, but when you reach this point, stochastic gradient descent starts to fail. So up to this point here, this is the error reached by the algorithm, and at a certain point learning becomes difficult. This is why I was saying that with the tools of the 90s you would not have been able to train deep networks. I am not saying that you cannot train that architecture; I am saying that training that architecture with sigmoids and mean squared error is difficult. That is the underlying space. You can then use all the tricks you want in order to find solutions also in this region here, but this is the bare model, the underlying space you have to deal with.

This, again, is a result by Monasson, and later we also worked on it together, again very old work, in which we were able to show that each region here may contain different internal representations, corresponding to different choices of the weights. So we could count how many different internal representations of the patterns you have in each region, and in this way say something about how the network internally represents the data. Essentially, you can count how many domains you have inside a region corresponding to a given internal representation, look at how many there are, and so on. An internal representation is a configuration of these hidden neurons for a given pattern. And you can show that some types of internal representation dominate, being far more frequent than others.

Anyhow, this sphere you see here is just another way of representing the same picture. The previous one was just a drawing; this one is a bit less of a drawing, in the sense that you assume your weights are normalized on a sphere, and whenever you add a pattern you cut your space somehow. If you have a perceptron, you just cut with a hyperplane; if you have a multilayer network, you cut with many hyperplanes, it is much more complicated, and you obtain separate domains. That is why learning might be difficult.
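As an aside, here is a minimal sketch of the kind of gradient-descent experiment described above (entirely my own illustrative code, with arbitrary sizes and learning rate: a tree committee machine with K hidden units, tanh in place of the sign so that derivatives exist, random ±1 patterns and labels, mean squared error, plain SGD):

```python
import numpy as np

# Tree committee machine: K hidden units, each seeing its own N inputs.
# Sign activations are replaced by tanh so that gradients exist.
rng = np.random.default_rng(0)
N, K = 120, 3
P = 360                                           # number of random patterns (illustrative)
xi = rng.choice([-1.0, 1.0], size=(P, K, N))      # inputs, one block per hidden unit
sigma = rng.choice([-1.0, 1.0], size=P)           # random desired outputs
W = rng.normal(size=(K, N)) / np.sqrt(N)

def forward(W, x):
    tau = np.tanh(np.einsum('kn,kn->k', W, x))    # hidden activations
    return np.tanh(tau.sum())                     # smooth version of the output sign

eta = 0.05
for epoch in range(200):
    for mu in rng.permutation(P):
        x, s = xi[mu], sigma[mu]
        tau = np.tanh(np.einsum('kn,kn->k', W, x))
        out = np.tanh(tau.sum())
        g = (out - s) * (1 - out**2)              # backprop through the output tanh
        W -= eta * g * ((1 - tau**2)[:, None] * x)  # backprop through the hidden tanh

errors = sum(np.sign(forward(W, xi[mu])) != sigma[mu] for mu in range(P))
print("training errors:", int(errors), "out of", P)
```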
Now, before we go to the new material, let me ask one question: which is the simplest device that is non-trivial, where we can see something non-trivial? Because I bothered you before with the perceptron and that very tedious calculation, but at the end of the day we studied a model that is totally trivial, because it is convex: if the patterns are linearly separable, you have an algorithm that converges, and so on, and that is it. So nothing important for deep networks can be found there. Which model is simple from the architectural point of view and still non-convex, still non-trivial? Well, it is a neural network in which you constrain the weights to be discrete. This is already enough to change the rules of the game completely. If you take a network in which the weights are discrete, say plus or minus 1, then even with a single layer, namely a simple perceptron, the problem is already non-convex. You do not have a connected domain of solutions as in the continuous perceptron; it is already a very, very hard problem. So the binary perceptron is the simplest possible non-trivial device. That is where we started from, and I think it is really worth trying to understand what happens there.

In principle you can reproduce the same calculation we discussed before, assuming that the weights, instead of being constrained to a hypersphere, can only take the values plus or minus 1; I am not going to do it here. So how do we formulate the problem? Again, the patterns are random, plus or minus 1, and you store a number of patterns equal to alpha N. This would be the so-called energy function: a sum over mu of a theta function of this quantity here, which is equal to 1 whenever sigma, the desired output, is opposite to the actual output the network gives, meaning that you have an error. So this function is equal to 1 when there is an error and 0 when there is no error, and the energy is just the number of errors the network makes. You can take this function and study its zero-energy states, which correspond to the volume we discussed before. There are different ways of recasting the same problem: the zero-energy configurations of this function are exactly the configurations that satisfy all patterns, because zero energy means all patterns are satisfied. So there are two different ways of seeing the same thing.

For the binary perceptron, this is the famous paper in which the problem was solved, in '89, so quite some time ago. Already for the binary perceptron, when the W's are discrete, the landscape is not like that of the perceptron we discussed before. There, we had a convex body, which means that, essentially, the energy function looks like a paraboloid. Here, no: you actually have an exponential number of local minima, as we will see in a minute. So this is a very complicated device.

Yes? OK, first of all you define a distance, which is the Hamming distance, but you normalize it by N, so it becomes a continuous quantity; it is just like a magnetization or something of that kind. A configuration of weights is a string of minus 1, plus 1, minus 1, and so on. Then the overlap between two configurations W and W' is q = (1/N) sum over i of W_i W'_i, and the normalized distance is d = (1 - q)/2: when q is equal to 1 the distance is 0, and when q is equal to minus 1 the distance is 1. So in this sense the distance is continuous. Now, if your question is whether you are going to take gradients with respect to this, the answer is no: it is a discrete space. But this is the simplest non-convex device we can think of, a perceptron in which the weights are discrete. And, as was just pointed out, the configuration space is a hypercube in N dimensions; here you see hypercubes up to dimension 7, but N is going to infinity. So that is the space you have.
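In formulas (my notation, consistent with what is described above), the energy and the distance are:

\[
E(W)=\sum_{\mu=1}^{\alpha N}\Theta\!\Big(-\,\sigma^{\mu}\operatorname{sign}\Big(\sum_{i=1}^{N}W_i\,\xi^{\mu}_i\Big)\Big),
\qquad W_i\in\{-1,+1\},
\]
\[
q=\frac{1}{N}\sum_{i=1}^{N}W_iW_i',\qquad d=\frac{1-q}{2},
\]

so \(E(W)\) counts the errors and \(d\) is the normalized Hamming distance between two weight configurations.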
And we can try to reproduce the same arguments we discussed before. Instead of a volume, we are now going to count how many configurations there are at a given overlap: we just have to replace volume with entropy. You have a discrete system, you want to count how many configurations satisfy all the patterns, and we can try to compute when this number goes to zero in order to obtain the capacity. Again, as I said, this is the problem of studying this energy function in the zero-temperature limit, in which you pick out the lowest-energy configurations, those with zero energy. Actually, I cannot read anything from here, so let me just describe it.

You can recast this as a standard statistical mechanics problem. As I said, you have an energy function that counts the number of errors. The Boltzmann weight is then e to the minus beta E divided by the partition function, and in the limit where beta goes to infinity you focus on the lowest-energy states. Since the energy is bounded below by 0, the Boltzmann weight just becomes the characteristic function that is equal to 1 if the weights satisfy all the patterns and 0 otherwise. So the Boltzmann weight is given by this expression here, and the partition function just counts the number of solutions. These are the combinatorial way and the statistical physics way of describing the same problem. And I bet that if you study this paper you get completely confused; write to me if that does not happen. It is really a very elegant paper for confusing the reader, it is amazing. Anyhow, it is a nice experience, and it was a very important paper.

If you do the statistical mechanics calculation as I briefly described before, what you find is the following result. This is the entropy, the total volume: previously I did not plot the volume, I only plotted the typical overlap between solutions; here you can compute the total entropy, namely the log of the number of configurations that typically satisfy your problem, versus the number of patterns. First of all, this is a binary network, which means that you have N bits, and you are trying to store alpha N random patterns. Given that you have N bits, you can store at most N patterns: you cannot store more patterns than the bits you have at your disposal. So there is an information-theoretic bound, and the critical capacity cannot be larger than 1, because you do not have enough degrees of freedom to describe more than N patterns. For random patterns, what you can compute (look at the red line) is that up to a certain critical capacity, which was computed by Krauth and Mézard in '89, you can actually find solutions. The entropy is extensive, so you have an exponential number of solutions, and it shrinks to zero at this point; this means that with the binary perceptron you can store an extensive number of patterns, which is fine. The blue lines are for more than one hidden unit, but you see that qualitatively the problem does not change.
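For reference, the statistical mechanics formulation used in that calculation is, schematically (my notation):

\[
P_{\beta}(W)=\frac{e^{-\beta E(W)}}{Z_{\beta}},\qquad
Z_{\beta}=\sum_{W\in\{-1,+1\}^{N}}e^{-\beta E(W)}
\;\xrightarrow[\ \beta\to\infty\ ]{}\;
\sum_{W}\prod_{\mu=1}^{\alpha N}\Theta\!\Big(\sigma^{\mu}\sum_i W_i\,\xi_i^{\mu}\Big)=\#\{\text{solutions}\},
\]

so the entropy per weight is \(s=\frac{1}{N}\langle\ln Z_{\infty}\rangle\), and the critical capacity \(\alpha_c\) is the point where \(s\) vanishes.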
However, this problem is a nightmare, for various reasons. Above the critical capacity, so above 0.83, if you study this model as a statistical physics problem, then as you lower the temperature the model undergoes freezing at a finite temperature, like the random energy model, which means that the local minima are very, very frustrated, let me say. And you can show that, together with the zero-energy configurations, there is an exponential number of local minima. In fact, there are papers by Horner in which it is shown analytically that if you run a stochastic process that satisfies detailed balance, you actually get stuck at very high energy. There is no way you can perform learning by directly trying to minimize the energy function I described before. If you run a simulated annealing algorithm on this problem it is very interesting: say you fix alpha equal to 0.3 and run your favorite simulated annealing protocol, you flip a variable, if the energy decreases you accept the move, otherwise you reject it with a certain probability that depends on the temperature, then you lower the temperature, and so on. And it is very interesting, because you start from a random initial condition and, for N large, the rejection rate is 1: you cannot lower the energy below that of your random initial condition, your moves are always rejected. So it is really impossible to perform any type of learning here. Another result that was shown is that all the solutions are point-like in this problem: you do not have a convex domain of nearby solutions, but many points that are far apart in Hamming distance and isolated.

Yes; OK, in this graph there is no temperature, this is at zero temperature. What I just said is: suppose you add a temperature axis going out here. If you are in this region here and you try to minimize the energy by lowering the temperature, you get frozen already at a finite temperature, say 0.1: nothing changes between that temperature and zero, you reach a completely frozen configuration already at finite temperature. Thermal fluctuations do not change anything; the system is so constrained that it freezes completely at a finite temperature, like the random energy model.

OK, so that was the prediction. Now, I guess the neuroscientists here should be a bit worried, because take a very simple neuron whose synapses can only take two discrete values, which is a pretty reasonable assumption for a synapse, probably closer to reality than infinite precision, and learning appears to be totally impossible. That is scary, right? And in fact the situation is scary: people tried for many years to find algorithms to train this model, and they failed. There is a more recent paper by Huang and Kabashima in which they studied the distances between solutions in the weight space for this very simple model, and they showed that the typical solutions are always far apart: as soon as you start to store patterns in this device, the solutions you typically find are far apart from one another. So this is the landscape. I am a bit tired now, so I am going to skip the technique, the so-called Franz-Parisi potential (I will get back to it later), which is a way to compute the typical distance between the dominating solutions.
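To make the simulated annealing failure just described concrete, here is a toy sketch (entirely my own code, with illustrative sizes and cooling schedule) of plain Metropolis annealing on the binary perceptron's error-counting energy:

```python
import numpy as np

# Plain simulated annealing on the number of errors of a binary perceptron.
# N is odd so the local fields are never exactly zero.
rng = np.random.default_rng(1)
N, alpha = 501, 0.3
P = int(alpha * N)
xi = rng.choice([-1, 1], size=(P, N))
sigma = rng.choice([-1, 1], size=P)
W = rng.choice([-1, 1], size=N)

fields = sigma * (xi @ W)                   # stability of each pattern (>0 means correct)
E = int(np.sum(fields <= 0))                # energy = number of errors
for beta in np.linspace(0.5, 10.0, 40):     # slow cooling schedule
    accepted = 0
    for _ in range(5 * N):
        i = rng.integers(N)
        new_fields = fields - 2 * W[i] * sigma * xi[:, i]   # effect of flipping W[i]
        dE = int(np.sum(new_fields <= 0)) - E
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            W[i] = -W[i]; fields = new_fields; E += dE; accepted += 1
    print(f"beta={beta:.2f}  errors={E}  acceptance rate={accepted / (5 * N):.3f}")
```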
And this is the situation: it is a golf course, a golf course on top of which there is an exponential number of local minima. It is hard to imagine anything worse than this, right? And in fact, as I said before, here you can picture how the geometry changes as you increase the number of patterns. At the beginning, all configurations are allowed, so these points correspond to all the vertices of the hypercube. Then you start to load patterns, and you sparsify this initial set in such a way that what remains, the points that give the dominating contribution to the probability distribution, are points that are far apart. What I mean is that you cannot flip 10 bits and go from one solution to another; you always have to flip a number of bits of order N. And if you run simulated annealing on this, you observe this plateau phenomenon: for finite N, you start from a random initial condition, you decrease the energy at the beginning, and then you enter a plateau. If the system is small enough, at the end you find a solution; but as N increases, this plateau rises and extends to infinity. In fact, in the literature you find people who train this system up to N equal to 300, more or less, and above that it becomes impossible.

OK, so this is the situation, and there is something strange happening. The idea is the following. As I said at the very beginning, between, say, 1995 and 2010 a lot of us worked on the statistical physics of optimization problems, and we developed a whole theory of what random constraint satisfaction problems look like and how their space of solutions is organized: what the geometry is when you take a random constraint satisfaction problem. And this is a random constraint satisfaction problem: each pattern is a constraint on the system. So this scenario seems to be a nightmare, in the sense that if you compare it with what happens in other optimization problems, you would deduce that learning is actually impossible. This was our starting point.

And since this is a school, you have to suffer a bit, so let me tell you very briefly where all this comes from, so that you can better understand the comparison between the two problems and what is strange about this one. Then, based on this, we are going to make the step which is, I think, relevant for deep learning. So: we discussed a bit what happened in the study of neural networks in the 90s; now I want to make a connection with the optimization problems that were studied in the last couple of decades, and then put the two things together and try to advance towards deep learning. That is the idea: no calculations, just some intuition.

Yes? Yes? No? Well, I want to describe better what this means, in the sense that in this problem you have a non-trivial geometry, but there is a region in which the geometry is non-trivial and the problem can still be solved by algorithms, and then there is a region in which it cannot be solved by algorithms, and this appears to start right in this region. I want to give you a better sense of why this is so. So let me make a little detour into the statistical physics of constraint satisfaction problems. For the moment, let us forget about neural networks. Instead of a training set of patterns that you want to associate with given labels, where each pattern can be thought of as a constraint and the weights are the independent variables,
now you have a generic constraint satisfaction problem: you have a list of constraints, and you want to find the variables that satisfy these constraints. I will show you some examples very quickly. The study of constraint satisfaction problems is important in computer science because it is at the root of computer science; the worst-case analysis of these problems led to the notion of NP-completeness and so on. So it is very, very important. But in the last 20 years there has also been interest in the study of so-called typical-case complexity: what happens when these hard optimization problems are generated at random. You can imagine that this theory is relevant for data science, because in data science you are not interested in the worst case, you are interested in problems conditioned on the data. This is why I think the whole study of random constraint satisfaction problems is relevant. What we have learned in the last 20 years is how to answer this kind of question: what is the structure of the space of solutions underlying the exponential behavior of algorithms, or, seen the other way around, up to which point can problems be solved when they are drawn at random, and what happens to the geometry. And then we want to connect this with deep learning.

OK, let me be quick. At what time am I supposed to finish? I don't remember. Six? Ooh. OK, we will manage. So, to discuss this, let me go back to something that I have been studying for a long time, to the point that I am totally nauseated by it. What makes a random constraint satisfaction problem hard to solve? Let us consider one very famous case, the random K-SAT problem. It is a very simple problem: you take a Boolean formula, so you have a set of Boolean variables, which you can think of as the weights in the binary perceptron. Then you generate a random formula made of clauses: you choose the variables to put inside each clause at random, and you put a negation on top of each variable with probability one half. So you generate a random formula like this. It is called random K-SAT because you fix K, the number of variables inside each clause, and then each clause is generated at random. Now, given such a formula, in the limit of a large number of variables, we again have a control parameter alpha, which is the ratio between the number of constraints and the number of variables, just like in neural networks, where it is the ratio between the number of patterns and the number of weights; it is the same thing. And the question is: up to which value of alpha does there exist an assignment of the Boolean variables that satisfies the formula? In other words, up to which value of alpha can you satisfy all constraints simultaneously? You see that this is an AND of clauses, so if you want F to be true, you need all the factors to be satisfied. This is a very general constraint satisfaction problem; anything can be recast in this language, and in fact this was one of the first problems shown to be NP-complete. But here we are interested in the random case, in which we just draw the formula at random. You can imagine that this leads to a graph of this form: each square is a clause, and the circles are the variables belonging to that clause.
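For concreteness, here is a minimal sketch (my own illustrative code and parameter choices) of how such a random K-SAT instance is generated, and how you count the clauses violated by a candidate assignment:

```python
import numpy as np

# Random K-SAT: N Boolean variables, M = alpha*N clauses; each clause picks K
# distinct variables at random and negates each of them with probability 1/2.
rng = np.random.default_rng(2)
N, K, alpha = 1000, 3, 3.8
M = int(alpha * N)
clause_vars = np.array([rng.choice(N, size=K, replace=False) for _ in range(M)])
negated = rng.choice([False, True], size=(M, K))   # True = variable appears negated

def n_violated(x):
    """Number of clauses violated by the 0/1 assignment x (shape (N,))."""
    literals = x[clause_vars].astype(bool) ^ negated   # truth value of each literal
    return int(np.sum(~literals.any(axis=1)))          # violated iff all literals false

x = rng.integers(0, 2, size=N)
print("violated clauses at a random assignment:", n_violated(x), "out of", M)
```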
So, having generated this graph that represents your constraint satisfaction problem, you ask: does there exist a configuration of the variables such that the number of violated clauses is equal to 0, just as you ask that the number of errors on the training set be equal to 0? That is the general scenario we are studying. So don't look at this; this is what you observe in practice. You observe that, up to a certain value of alpha, if you take your favorite algorithm, there is an easy phase: the algorithm quickly finds a satisfying assignment. Then, at a certain point, the running time starts to blow up exponentially, and it becomes practically impossible to solve the problem. This is just for small N, but the height of these peaks increases exponentially with N. Above this alpha_c the problem has no solution, and it takes an exponential time to show that it has no solution. But the interesting part is this blue region here, because in this blue region the problem is satisfiable, and yet algorithms take an exponential time. That was the problem we were concerned with 15 years ago or more. This curve is the probability that the formula has a satisfying assignment; you can show that it drops to 0 at this point here. But already below this point you are in a regime where algorithms take exponential time, so it is interesting to look at what is going on there.

I will not show the details of the calculation, which are more complicated than the previous ones. After many years of pain (this is me, actually), we found the following scenario, not only for this problem but also for other problems like graph coloring and vertex cover, a whole family of very famous random constraint satisfaction problems. At very low values of alpha the space of solutions is connected, which means that you can go from one solution to another by flipping a few bits, or even one bit. Then the space starts to break into many components; one dominates, meaning that the vast majority of solutions belong to this giant cluster, and the others are subdominant. Then there is a so-called dynamical threshold at which the dominating cluster shatters: when you add more constraints you delete solutions, and this cluster breaks into an exponential number of clusters. But the problem is still solvable, because these grey clusters, which dominate the Gibbs measure, are so-called unfrozen clusters. This means that if you take one solution belonging to one of these grey clusters, you can always reach another solution inside the cluster by flipping some of the variables, and this is true for every variable: there is no variable frozen to a fixed value across all solutions belonging to that cluster. So you can always modify a variable, change its truth value, and find another solution nearby, for every variable; you can rearrange solutions. And this corresponds to the fact that algorithms still work. Then there is a point at which the dominating clusters start to freeze, and that is where simulated annealing stops working; but by using some other, smarter algorithms you may still find solutions that belong to clusters that are not yet frozen.
And then, before the problem becomes unsolvable, there is a region in which the space of solutions is made of clusters in which a finite fraction of the variables are completely frozen. That is where it becomes a golf course, because it means that in the subspace of that fraction of variables there is just one point. So this is the scenario that comes out. It is true for random constraint satisfaction problems in which the constraints are generated with a probability distribution that is particularly simple and yet not trivial. The point is that, in contrast with the worst-case theory, it is very difficult to build an average-case theory, because when you try to reduce one problem to another, the probability distribution gets distorted, and things become very complicated. It is a totally open challenge for computer science to build a theory of typical-case complexity. So we take a more modest approach here: we generate problems with natural distributions that are non-trivial.

I hope the message is clear. You have many solutions; at the beginning the problem is easy, there is a giant cluster of solutions and everything is connected. Then it breaks into many components, but each component is unfrozen, so you can still find solutions. Then things start to freeze, and at the end everything is separated into clusters that are frozen, and there no algorithm is capable of finding solutions. And really no algorithm: for these random problems you know that they are satisfiable, and yet you cannot find solutions; if you were able to find solutions there, it would be a major breakthrough. So you see, it is an interesting theory, also because it is a way of generating hard instances, and that would be very important for many applications; you really need qualitatively new algorithms, and at the moment there is no idea of how to solve these problems. That is also the region where simulated annealing starts to fail, along with any algorithm that wants to satisfy detailed balance, because that is where the local minima start to prevail, and so on.

We are both tired enough and I don't want to bother you too much, but from the statistical physics point of view what happens there is very interesting. In the region where the space of solutions decomposes into an exponential number of disconnected components, in order to describe what happens you have to decompose your Gibbs measure into a union of many different Gibbs measures, one for each component, and you have to describe how the Gibbs measure inside one cluster fluctuates and how it is related to the Gibbs measure of another cluster. So it is a statistical mechanics of the statistical mechanics of single clusters; very sophisticated physics is going on here. I mention this to you because the same phenomenology is present in neural networks, but nobody talks about it, because it is too complicated, and this knowledge has been lost in the current community; as far as I can tell, almost nobody there knows about these things. But what has been interesting, at least for me, was to discover that you can take the equations that describe these models, the type of equations we analyzed before, in a different version called the cavity method, which has essentially the same properties, and you can turn them into algorithms. So understanding something about the structure gives you ideas for creating new algorithms.
And this is going to happen in deep learning as well; this is why I am bothering you with all of this, even though I am not entering into any detail. It is interesting because, based on the statistical physics of the decomposition into states, you can, for a given problem, run algorithms that tell you something about the role of single variables inside the problem. So it is a different way of studying a statistical physics problem, in which you can say what single variables are doing, and you can also use it for finding solutions. So there is a connection between the geometry of the space of solutions, the statistical description of what is going on, and the algorithms. You may ask how universal this is. Well: sparse-graph constraint satisfaction problems, like K-SAT, graph coloring, network reconstruction and so on; then unsupervised learning, some network problems, optimization. There are many, many settings where you find this kind of phenomenology.

OK, now, if you have a learning problem, so you have a neural network, you plug in a pattern, you look at the output and you check whether it is equal to the desired label or not: this is just like a constraint in the random constraint satisfaction problems we discussed before. So you can imagine that each pattern corresponds to a square, and the independent variables are the weights inside your network. You can always think of a learning problem as a constraint satisfaction problem over a graph like this, in which the squares are the patterns and the circles are the weights. This is just to say that there is a common language between the two things.

Now, going back to our results on the binary perceptron. In these random constraint satisfaction problems you have the kind of scenario I just described, but in the case of the binary perceptron you are always in the frozen situation, the one for which, as far as we knew, no algorithm works. And so, going back to neuroscience, the question that came to our mind was: if this is the situation, how is learning possible at all? Even with the simplest neural device you have an error landscape with a lot of narrow minima and an exponential number of local minima, and so no way of doing any learning. I hope this serves as a motivation: I have been extremely brief in describing the statistical physics of optimization, but you see that a huge body of ideas came out of it, relating the behavior of algorithms to the geometry, and it is interesting to try to transfer these ideas to neural networks. And the first result was a kind of blow, because it said: OK, we have the simplest non-trivial device, and apparently it is impossible to perform any learning on it.

But with Alfredo Braunstein, who was here in 2006, we applied one of the algorithms derived from this theory of random constraint satisfaction problems to the binary perceptron. And what we found, with some surprise, is that it worked beautifully: it is very easy to perform learning. People had been trying to train these kinds of devices for decades using simulated annealing, genetic algorithms and so on, always failing, and then all of a sudden you run a slightly modified belief propagation and it works. How is that possible? That was the contradiction we faced.
So there was a mismatch between the expectation and what we actually found. This result sat there for about 10 years, until we became more seriously interested in neural networks, say eight years ago. We went back to this problem and made some numerical experiments: we would find a solution using this algorithm and then check how many other solutions there are around it. We did this either with some random walk or with a semi-analytic technique based on belief propagation; this is the kind of technique you would use in error-correcting codes to compute a so-called weight enumerator function: given a codeword, you want to know how many other codewords there are within a certain radius. So it is a technique to compute how many other solutions you have within a given distance. And if you do this, you discover that the solutions the algorithm finds are actually in very dense regions; they are not isolated at all. And yet the theory predicts the opposite. So there was a problem.

There are two possible resolutions. One is that it is a finite-size effect, so that by taking N larger everything would disappear; this does not seem to be the case. The other is that the solutions found by the algorithm are actually very rare solutions of the original problem, not described by the Gibbs measure, some very special solutions. After some thinking, we started to consider the second possibility, and this is where the deep learning part starts.

OK, yes, I was hurrying up a bit; good. Yes: the statistical physics method based on what I described before is perfect if you want to describe the dominant solutions, or the slightly subdominant ones, but very rare events are not captured by the method itself; you have to use a large-deviation analysis to reveal their existence. This does not mean, however, that the algorithms do not care about them. That is the main point: algorithms do not care about the physics. They do not have to satisfy detailed balance, they do not even have to bother about that, and they may very well be attracted by very rare solutions. So statistical physics remains a very useful tool to analyze the geometry of the problem, but when you connect it to algorithms, which states are the relevant ones is not necessarily dictated by the Gibbs measure. That is the idea. You wanted to say something? OK, good.

So now we have this binary perceptron problem, and we want to ask another question. Instead of looking for a solution, suppose we take a generic vector of weights W tilde, to be chosen somehow; let me draw it as a vector like this. The kind of question we are interested in is the following: we want to find vectors W tilde such that, within a hypersphere of radius d around W tilde, you find an exponential number of solutions. In other words, since we observed numerically that such solutions exist, we want an analytic tool to explore this: I want to check whether the problem contains regions that are very dense in solutions. They can be very rare; I just want to check whether they exist. In order to do this, we first define the number of solutions at distance d from a reference vector W tilde.
You see, this is a sum over all W's of the characteristic function that checks whether W is a solution of the training set, together with a constraint on the distance between W and W tilde. I want to analyze this number: do regions exist where this N is large? If the typical solutions described by the classical statistical physics analysis were the only thing going on in the problem, so that you just have isolated solutions, you could always find a radius small enough that this N equals 1, and the answer would be totally trivial. But maybe there exist, somewhere, regions that are extremely dense, containing an exponential number of solutions, and yet this number of solutions is exponentially smaller than the overall number of solutions, so that it is not seen by the standard statistical physics analysis. That is more or less the idea. Once we have defined this number of solutions within the hypersphere, instead of probing the space by looking for individual solutions, we take a big sphere, move it around, and see whether we can find regions containing a lot of stuff inside the sphere.

And since we want this number to be exponential, we define a new energy. The energy is no longer the number of errors; it is minus the log of the number of weight configurations that satisfy all the patterns and lie at a given distance from W tilde. It is the local entropy: without the distance constraint this would be the total entropy, with d it is a local entropy. It is like saying that you have a golf course and you want to know whether there is a lake of solutions somewhere. And now we can start from scratch and say: instead of studying the problem with the usual energy function, I study this new energy, the local entropy. I define a statistical physics problem in which I use this as the energy function, I introduce a fictitious parameter y, which plays the role of an inverse temperature and controls which configurations I am looking at, and I study this statistical mechanics problem in the limit where y goes to infinity. We want to know whether there exist W's with a very small value of this energy (there is a minus sign here), which means a very high local entropy. So this is an analytic tool to probe the existence of dense regions in neural networks.
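In formulas (my notation, schematic), the objects just introduced are

\[
\mathcal{N}(\tilde W,d)=\sum_{W}\mathbb{X}_{\xi,\sigma}(W)\;\delta\!\big(d(W,\tilde W)-d\big),
\qquad
E_{\mathrm{loc}}(\tilde W;d)=-\frac{1}{N}\log\mathcal{N}(\tilde W,d),
\]

where \(\mathbb{X}_{\xi,\sigma}(W)\) is 1 if \(W\) classifies all the patterns correctly and 0 otherwise; the corresponding Gibbs measure is

\[
P_{y}(\tilde W)\;\propto\;e^{-y\,N\,E_{\mathrm{loc}}(\tilde W;d)}\;=\;\mathcal{N}(\tilde W,d)^{\,y},
\]

and sending \(y\to\infty\) singles out the reference vectors of maximal local entropy.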
So what is the role of this local entropy? You can picture it as follows: suppose that your original energy landscape looks like this. If you fix a certain distance, say a distance like this, you see that this region here is going to have a lower energy, hence a higher local entropy, because in this region you have a lot of minima compared to that one. You simply throw away narrow minima that are isolated. And as you reduce the distance (this is a true calculation, not just a picture), the local entropy focuses on narrower and narrower regions, and it ends up in the flattest minimum of your energy function, the minimum that contains the largest number of solutions. Now, the resulting problem is not totally convex, so it might still be difficult, but you can see that it is much more convex than the initial one; it is actually much easier to optimize. The drawback, as we will see, is that it may be hard to compute the local entropy. But assume for the moment that you can compute it; then this procedure automatically concentrates on these flat minima. This is essentially what happens; in fact, in the random-walk picture you explore this region. Well, it could be in multiple steps; it depends on the dynamics you choose.

OK, so now you can do a calculation similar to the one we did before and study the ground state of this new energy function, and try to answer the question: do these high-density regions exist, not for this one-dimensional cartoon, but for the binary perceptron? Let me skip the details, because I guess we are all tired now; this is what you obtain. Look at this graph that we already discussed: this is the classical statistical analysis, and this is what you find. The answer is yes, dense regions do exist. In particular, you can show that up to a certain value of alpha, which is quite close to the critical capacity, there exist clusters that are subdominant: the number of red dots here is much larger than the total number of blue dots in this region, but these blue dots are very dense, and they contain an exponential number of solutions. And these clusters are extremely dense, by which I mean the following.

Suppose we choose, for the binary perceptron, alpha equal to 0.4; remember that the critical capacity is 0.83, so we are about halfway, and there are a lot of solutions. This corresponds to this curve here, the highest one. You can see that just below that curve there is a grey dotted line, which is the log of the binomial coefficient. What this is telling you is that if you take a distance of, say, 0.01 or 0.015, you essentially cannot distinguish the cluster from the full ball: everything around your solution within that radius is a solution, all the points are solutions. And this appears to be true for all the dense clusters, up to the point where they disappear. For instance, you can zoom in on this cluster here, and you see that for any value of alpha, down to some distance, this curve coincides with the maximum possible density, so locally they are very, very dense; and then, at a certain point, the cluster breaks up. These red lines here are given by the calculation of Huang and Kabashima: they give the distance between typical solutions. This is distance versus entropy, and in all cases you find a gap: no solutions exist below a certain distance from a typical solution, where you do not optimize over W tilde but simply draw it from the Gibbs measure. But if you ask whether special regions exist, the answer is yes: they do exist, and they are very dense.

OK. So this was the first result, and it is telling you the following. In non-convex neural networks, at least for the binary perceptron, and also for binary multilayer networks of the form shown here, even though the problem is dominated by point-like solutions, which would make learning impossible, there exist these dense regions that allow learning to take place, and that is where algorithms end up. This is some work done by PhD students which generalizes the result to multi-state variables, so it does not depend on the weights taking just two values; they can take more than two values. OK, if this makes sense (and so far it is just an analytic calculation), then it should be possible to define an algorithm based on this local entropy and find solutions, a constructive approach. Let us check whether all of this makes sense in practice.
So, we said, we take this energy function, and we run a simulated annealing not on the error function that usually characterizes learning in neural networks, but on this local entropy. How would the algorithm look? You start from some random vector, you flip some of your weights, and then you compute the new local entropy. If it has increased, that is, if the new energy has decreased (there is a minus sign), you keep the move; otherwise you reject it with a certain probability that depends on the y we introduced before, this effective inverse temperature. The problem is: how do you compute the local entropy? It is an exponential object. The way we computed it is with belief propagation: there is a technique that allows you to estimate this local entropy efficiently. Later I will tell you about other ways of computing it; it is not hard. For the moment, let us say that we have a way of approximating it very well, so we can run this algorithm. And if we do, this is what we find: while simulated annealing gets trapped in the metastable states, this algorithm goes straight down and finds the dense cluster with no hesitation. So this, I think, was a very reasonable sanity check.

We then generalized this to the teacher-student scenario, in which the data are no longer purely random. Up to now, in these problems, all the data were just random, and there was no generalization involved; the only thing I am telling you is that in order to perform learning in this system you need these dense states, otherwise you get trapped: you cannot beat the non-convexity of the system unless you focus on these accessible states. They may be very rare, but algorithms can still find them. You can generalize this to the teacher-student scenario, in which the data are generated by a teacher network and you want to use a neural network to infer the teacher. There you also find these dense states, and they appear to generalize better than typical solutions. I do not want to bother you with the details; let me just tell you that not only are they accessible, they also generalize well. These are two things that are not in conflict.

This region? No. No, in this specific case, below this point, the teacher is just an isolated solution. These dense solutions here are more barycentric, and so, in a Bayesian sense, they predict better. But let me discuss this later, because if I start this discussion now, with the role of the prior and so on, it gets a bit complicated; it is not that it is not interesting, it would just create some confusion here. So: the teacher is isolated, like a typical solution here.
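Going back to the annealing on the local entropy described at the beginning of this part, here is a toy version (entirely my own code: N is kept tiny so the local entropy can be computed by brute-force enumeration of the Hamming ball, whereas in the actual work it was estimated with belief propagation; all parameters are illustrative):

```python
import numpy as np
from itertools import combinations

# Metropolis dynamics on the reference vector W, where the "energy" is minus the
# log of the number of binary-perceptron solutions within Hamming radius D of W.
rng = np.random.default_rng(3)
N, alpha, D, y = 25, 0.5, 2, 10.0
P = int(alpha * N)
xi = rng.choice([-1, 1], size=(P, N))
sigma = rng.choice([-1, 1], size=P)

masks = []                                    # all sign-flip patterns within radius D
for r in range(D + 1):
    for flips in combinations(range(N), r):
        m = np.ones(N, dtype=int); m[list(flips)] = -1
        masks.append(m)
masks = np.array(masks)

def local_entropy(W):
    ball = masks * W                          # every configuration within distance D of W
    stab = (ball @ xi.T) * sigma              # stability of each pattern for each of them
    n_sol = int(np.sum(np.all(stab > 0, axis=1)))
    return np.log(n_sol + 1)                  # +1 regularizes the empty case

W = rng.choice([-1, 1], size=N)
S = local_entropy(W)
for step in range(2000):
    i = rng.integers(N)
    W[i] = -W[i]                              # propose a flip of the reference vector
    S_new = local_entropy(W)
    if S_new >= S or rng.random() < np.exp(y * (S_new - S)):
        S = S_new                             # accept moves toward higher local entropy
    else:
        W[i] = -W[i]                          # reject
print("final log local count:", S, " reference is itself a solution:",
      bool(np.all(sigma * (xi @ W) > 0)))
```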
Now, of course, everything is nice, but is this relevant for neural networks in general? Well, for binary neural networks, for sure. The question is whether it is also true for continuous networks and for deep networks. These are unpublished results; we are just writing them up, but they are very basic, so I thought I could discuss them with you. In the case of continuous weights we should not talk about the entropy of discrete configurations: these dense regions in entropy correspond to wide flat minima. Either you have a corner of the hypercube which is very dense in solutions, or you have a very wide, flat minimum on the manifold; the two things are just the same. So these high-local-entropy regions of discrete networks I will call wide flat minima for continuous networks. And the answer to the question, are these results specific to discrete weights, is no: they are not, they also exist in multilayer, non-convex continuous networks. Remember in particular when I mentioned that the space of solutions of a continuous network, which is connected at the beginning, breaks up into components; well, on top of that there exist these very dense regions, these wide flat minima, that again are not captured by the standard statistical mechanics analysis and that are actually where algorithms end up. OK, I will show you this in a minute. Is this clear, or are you dead? OK, both.

Which region? No, no, because computing the number of solutions does not mean that you have found one; and it is an approximation anyway. But this is exactly what I am going to answer. The question is: how do you compute this local entropy, and how do you find the solutions? If you can count them, you can also find them, you are saying; that is not quite true, but it does not matter, it is a more abstract question. Now, there is a way of doing this which is very useful also for analytic calculations, and it has already been used in real deep networks, so it is something practical. It is the following.

What we are doing is to define a new energy function which says: I want to count how many zero-energy solutions I have at a given distance from a given vector, this ball in high dimension. That is the kind of object we are looking at. Here we are still in the discrete case, but it can be generalized. So I want to trace over, to sum over, all configurations that are at a given distance from W tilde, and I am going to impose this distance constraint with a Lagrange multiplier. Then I have a characteristic factor, which is e to the minus beta times the energy (I forgot the minus sign on the slide). If I send beta to infinity, the only terms that survive are those with E equal to zero. So this is just another way of writing the function N(W tilde, d) that I was talking about before, a smoother way of writing it. The Gibbs measure corresponding to this quantity is then 1 over Z times e to the y times this free entropy, let me call it, where y is the effective inverse temperature. Is it clear? It is just another way of writing things, in which I impose the distance constraint not through a delta function but through a Lagrange multiplier, and I write this term with a generic beta, recovering the zero-energy configurations by sending beta to infinity. If I do this, I can write the partition function in this form. And here comes the answer to your question: let us take y, this effective temperature, to be an integer instead of a continuous number. If it is an integer, the partition function can be written as a trace over the reference configuration of a product of y systems, because it is this quantity to the power y.
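Schematically (my notation, with the minus signs restored):

\[
e^{N\phi_{\gamma}(\tilde W)}=\sum_{W}e^{-\beta E(W)\,-\,\gamma\,d(W,\tilde W)}
\;\xrightarrow[\ \beta\to\infty\ ]{}\;
\sum_{W:\,E(W)=0}e^{-\gamma\,d(W,\tilde W)},
\]
\[
Z(y,\gamma)=\sum_{\tilde W}e^{\,y\,N\,\phi_{\gamma}(\tilde W)}
=\sum_{\tilde W}\;\sum_{W^{1},\dots,W^{y}}\;\prod_{a=1}^{y}e^{-\beta E(W^{a})\,-\,\gamma\,d(W^{a},\tilde W)}
\qquad (y\in\mathbb{N}),
\]

so for integer y the measure is that of y real replicas, each with the original energy, all coupled to the same reference configuration through the distance term.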
So the partition function can be written as a trace over the reference configuration (sorry, I changed notation: this W star is what I was calling W tilde before, it is the centroid), together with a sum over the configurations of y distinct systems, each with its own energy. And sorry, this should be a minus beta here, and a minus gamma in this term; I made these slides yesterday night. So you see, you have y replicas, real replicas, not the replicas of the replica method: y copies of the original system that evolve in parallel, say, but coupled in such a way that all of them have to stay at a given distance from this centroid configuration. So if you want to find regions of high entropy, what you do, instead of running a single simulated annealing, is to take y systems that are identical, run them in parallel, and add a constraint that forces them to stay at a given mutual distance from the centroid. If you do this and you equilibrate the system, then sampling W star gives you the properties of your centroid. In this way, you can find the high-density regions. The point is that this process is attracted by the locally dense regions and does not have local minima that stop the convergence, whereas if you just try to minimize the energy you have to deal with local minima. That is the difference. Yes, you have N times (y plus 1) variables, but y equal to 3 is often sufficient. And sorry, in that picture it should be minus beta, W star should be W tilde, and this W as well; I was a bit quick. It is a spring force, yes, exactly. How do you tune gamma? You start with a small gamma, a relaxed problem, and then you increase it in order to focus on the dense regions.

Now, in deep learning, I think five or six years ago, people in LeCun's group created a method called elastic averaging stochastic gradient descent in which they were actually doing exactly this, exactly this, but they did not realize what the algorithm was really doing. They speak of workers that communicate, which is a somewhat different language, and they were not tuning this coupling, they were not doing all the things that are needed to make this work properly; but still it worked, and it was used by Google in deep learning. So you see, in deep learning they discovered this by trial and error, and what the algorithm is actually doing is focusing on these dense regions.

Now, before we go to deep learning, let me show that these dense states exist in continuous networks, which is a fundamental result, a building block, the pure model that you have to solve before talking about deep networks. OK, that is the energy we want to study, and now we consider the case in which the W's are continuous. What this function is telling me is that I want to study, say, y replicas that are coupled to a centroid; this is the way of exploring high-density regions, both analytically and numerically.
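A minimal sketch of that replicated scheme in a gradient-based setting (entirely my own illustrative code: a generic quadratic loss stands in for the network's training loss, the replicas share minibatches for simplicity, and all parameter values are arbitrary):

```python
import numpy as np

# y coupled "real replicas" W_1..W_y attached by an elastic term to a centroid
# W_star, in the spirit of the replicated / elastic-averaging SGD idea above.
rng = np.random.default_rng(4)

def loss_grad(W, batch):
    X, t = batch                               # placeholder loss: linear least squares
    return X.T @ (X @ W - t) / len(t)

dim, y_rep, eta, gamma = 50, 3, 0.05, 0.01
X = rng.normal(size=(500, dim))
t = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=500)
W = [rng.normal(size=dim) for _ in range(y_rep)]
W_star = np.mean(W, axis=0)

for step in range(2000):
    idx = rng.choice(500, size=32, replace=False)
    batch = (X[idx], t[idx])
    for a in range(y_rep):
        # each replica feels its own loss gradient plus the elastic pull to the centroid
        W[a] -= eta * (loss_grad(W[a], batch) + gamma * (W[a] - W_star))
    # the centroid is pulled toward the replicas and is what you read out at the end
    W_star += eta * gamma * sum(Wa - W_star for Wa in W)
    gamma *= 1.001                             # slowly increase the coupling ("focusing")

print("spread of replicas around the centroid:",
      np.mean([np.linalg.norm(Wa - W_star) for Wa in W]))
```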
Now, there is an alternative way of doing this. Let me be very quick, because I feel I am exaggerating. One possibility is to say: I want many copies at a given distance from a centroid, and this is going to focus on the high-density regions. Another way of stating essentially the same thing is: I want several copies that have to be at a given mutual distance from one another. In both ways you constrain the copies to lie in a common region. What is the difference? The first is more isotropic: since you only constrain the distance between each replica and the centroid, you could have configurations arranged like this, and they would still satisfy the constraint, whereas they would not satisfy the second constraint, in which all the copies have to be at a given mutual overlap. But for small distances you expect the two approaches to be more or less equivalent. So it is just another way of looking for dense regions: instead of defining the local entropy through a region around a center, you define it through many solutions at a given mutual distance; it is more or less the same thing. The second method, though, let me skip the details, has an advantage, and this is just for the experts: if you use the second approach, you can use the one-step replica symmetry breaking calculation to analyze the existence of these dense regions automatically. In that replica symmetry breaking scheme, you use q1 as the measure of the distance between solutions, the parameter m plays the role of the number of replicas, and you just have to optimize with respect to q0. So it is just a technicality. For those of you who are not experts in replicas, let me just say that there is a way of doing the calculation for this problem which is relatively simple: heavy, but simple.

If you do this, you find expressions that are similar to those of the perceptron, just more complicated, and this is the result you get. What is plotted there is the following quantity. Instead of plotting the distance, I plot the overlap, which is complementary to the distance: q equal to 1 corresponds to d equal to 0. And on the other axis I plot log of V over V0. What is V0? V0 is the volume of the weight space at a given distance d when you have not stored any pattern, just the bare volume. And V is the volume that you compute, once you have stored the patterns, around a configuration with this high-density property. What these curves are telling you is the following: take, say, the case of three hidden units, choose, let me say, alpha equal to 2, and take a very large m, a huge number of copies, so that you are looking for the densest regions; then you find a curve like that. You see that there is a region here, where this is zero and then some slightly negative numbers, in which the volume you find locally around your solution is as dense as the densest possible volume. And this is not described by the standard calculation. So also in continuous networks there exist these very dense regions.

So this is the second step: we have seen this for discrete networks, and now for continuous networks. I do the calculation for one hidden layer because I cannot do it for more, it is too complicated analytically, but then you can check it numerically. So, somehow, for the configurations of weights in neural networks you should think of the landscape as a non-convex manifold with many local minima and so on, but there exist rare, dense, wide flat minima. At least for random patterns, this can be shown analytically.
There is another way. These ideas have also been implemented in deep learning, not only with this elastic averaging SGD but also by, again, LeCun, who is on every paper in machine learning, so essentially he's a constant. This guy, Pratik Chaudhari, is very brilliant. What he did is estimate the local entropy using a Langevin dynamics: essentially you move your weights and then you run a second stochastic process to check how many solutions you have around, and you use this for training. He was able to implement this, which is called the Entropy-SGD algorithm, entropy stochastic gradient descent, and it actually works very well for finding solutions also in really deep networks, okay? So this idea of local entropy has already led to several algorithms. One is replicated simulated annealing; another is that when you replicate your system you can write the belief propagation equations on the replicated graph, which again gives you a message passing algorithm that is very effective for finding solutions; then you have replicated stochastic gradient descent, this elastic averaging SGD, and then this two-loop Langevin SGD, and so on. You see that this idea immediately translates into algorithms. Let me just mention one thing: if you take your replicated system, you can try to solve the problem by message passing. For those of you who are familiar with belief propagation equations: if you do this, and you assume that essentially all the replicas are symmetric, meaning they all send the same messages to the other replicas, at the end of the day you reach message passing equations that correspond to belief propagation with an extra term, a reinforcement term. And if you use these equations for learning, you easily access those regions: you can find a solution, then count how many solutions you have around it, and you see that you very easily end up there. And there are no other algorithms, I mean algorithms that just try to minimize the energy, that reach those regions so easily. I have no idea what you mean by slowing down the dynamics; why should it slow down? Okay. So essentially what you do is bias your dynamics according to the total local field experienced by each degree of freedom in the previous time steps, so in a dense region this term is going to grow. But one thing I should say is the following: these reinforced belief propagation equations do not converge to the Gibbs measure. Their fixed points do not describe the Gibbs measure; they describe the high local entropy regions. Yes, yes, I'm coming to that.
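Going back to the Entropy-SGD idea mentioned above, here is a schematic numpy version of the two-loop structure, written on a toy loss rather than a deep network: an inner Langevin loop explores around the current weights while being pulled back to them and accumulates a running average mu, and the outer step moves the weights toward mu, i.e. toward the nearby high local entropy region. Step sizes, loop lengths and the toy gradient are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def loss_grad(w):
    # gradient of a toy rugged loss; stands in for a minibatch gradient
    return 6.0 * np.sin(3.0 * w) * np.cos(3.0 * w) + 0.2 * w

w = rng.normal(size=50)
eta_in, eta_out, gamma, alpha, temp = 0.05, 0.5, 0.5, 0.1, 1e-3

for outer in range(300):
    w_prime = w.copy()
    mu = w.copy()
    for inner in range(20):
        # inner Langevin step: local gradient + coupling back to w + small noise
        g = loss_grad(w_prime) + gamma * (w_prime - w)
        w_prime += -eta_in * g + np.sqrt(2.0 * eta_in * temp) * rng.normal(size=w.size)
        mu = (1.0 - alpha) * mu + alpha * w_prime   # running average of the exploration
    # outer step: move w toward mu, i.e. descend the negative local entropy
    w -= eta_out * gamma * (w - mu)

print("final toy gradient norm:", np.linalg.norm(loss_grad(w)))
```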
So this is what I want to discuss in the next lecture, and essentially what I want to clarify is the following; sorry, I have been very, very pedantic, I agree with you, I'm sorry. But the result is: non-convex neural networks, with random patterns or with no particular bias from the data, do have these wide flat minima, and this notion allows you to derive algorithms. Now the second step: is it the case that the algorithms we are currently using are actually targeting these regions? That's the second question, because my conjecture is that this is what is happening in deep learning. I cannot prove this in full generality; I can only prove two results. First, the loss functions which are currently used focus on these regions. And second, if you use ReLU transfer functions, you also tend to focus on these regions. The second result I'm not going to tell you about because it's too long, but for the first one I can show a complete analytic characterization and also an algorithmic one. So the answer is: we are trying to understand what is happening, because otherwise you don't know where to start; there are so many algorithms, so many networks, so many definitions. So here is an example, an interesting simulation. In this case, again, it's plotted versus distance. Here you have again a network with the same kind of units as before, and now the data are generated by a teacher, another network, so we know a solution. For this problem you can find a solution with this replicated belief propagation, which is going to end up in flat minima, or you can take the teacher solution itself and see how it is immersed among the other solutions, or you can use some other algorithm that is known to work. And you find that, indeed, by computing the weight enumerator function, you can count how many solutions you have at a given distance from a reference vector. This line here is the teacher: the teacher is really isolated at this value of alpha, alpha equal to one here, so the teacher is essentially just like any other typical solution. The solution found with this focusing BP, the belief propagation that focuses on dense regions, you see here, again plotting the log of V over V zero, ends up in a very flat, dense region. And this is another algorithm, which I don't want to describe now, that finds solutions that are denser than the teacher but much less dense than the ones obtained by focusing BP. So this is a very simple example where you know the teacher and you can see, in a constructive way, what the typical solutions look like: you get solutions from one algorithm, from a second algorithm, from a third algorithm, and you see that there is a multitude of different minima in the problem, okay? And this is a bit contrary to what you would expect from the literature, where they claim that all the minima are equal and so on and so forth. This is just not the case, okay? So let me mention one example which I think is irrelevant for learning in practice but conceptually relevant. Suppose we construct a device made like this: you have many units, I'm not going to draw all the lines, okay? You have an input layer, a hidden layer, and then an output unit which takes the product: here, instead of a sum, you take the product. This is called the parity machine. It's interesting because it's totally useless; I don't think it's used in any field. These units here are normal units, just a thresholded sum, okay? So tau sub l is equal to the sign of the sum over i of the incoming weights W sub li times xi sub i, as usual, but at the output you take the product of the tau's. You can do the calculation for this machine, and you can show that the dense regions do not exist at all. And indeed, by the way, training this device is essentially impossible; it's really hard. It's a nice exercise, by the way: you can solve it analytically to the very end, so it's beautiful for students, and it was nice for me too.
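For concreteness, a few lines implementing the parity machine just described, with binary weights chosen only for illustration: each hidden unit is an ordinary thresholded sum, tau_l = sign(sum_i W_li xi_i), and the output is the product of the hidden signs instead of a further sum.

```python
import numpy as np

def parity_machine(W, xi):
    """K hidden sign units on the full input; the output multiplies their signs.

    W  : (K, n) weight matrix, one row per hidden unit
    xi : (n,) input pattern
    """
    tau = np.sign(W @ xi)        # tau_l = sign(sum_i W_li * xi_i), the usual units
    return int(np.prod(tau))     # output: product of the hidden signs, not a sum

rng = np.random.default_rng(3)
K, n = 3, 101                    # n odd, so a +-1 pre-activation can never be zero
W = rng.choice([-1.0, 1.0], size=(K, n))
xi = rng.choice([-1.0, 1.0], size=n)
print(parity_machine(W, xi))
```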
So this is a problem in which these dense states do not exist at all. So neurons are special: being a thresholded sum is a non-trivial property of a neuron, in a sense, because with a thresholded sum there are many ways to produce a given output, you have a sum of contributions and you can permute them, so this somehow favors the entropy. And it works out nicely, because in neural networks the existence of these dense states means you have two aspects cooperating, accessibility and generalization, okay? So I think this is something very special about neural networks; I cannot imagine many other learning devices that share this property. Okay, so where do we stand? My program is more or less the following, I don't know if you agree, but it has been a very heavy session. We have learned something about the existence of these flat minima in multilayer networks and also in deep networks. And now I have to discuss what David was asking: what is happening with all the other algorithms? In particular, what I'm going to show you, I think tomorrow, concerns the loss which is used everywhere nowadays, the cross-entropy loss. Why the hell would you want to use this loss and not the mean square error or any other loss? A priori there is no reason to use the cross-entropy loss in a deterministic system, no reason. And if you ask practitioners, the answer they give you is: because it works well. Fine, it fits the picture of this evolutionary, trial-and-error process. But already the network that started everything, in 2010 or 2012, I don't remember, the network of Hinton's group on ImageNet, was using cross-entropy and ReLU functions. So what I will show you tomorrow is that the ground states of the cross-entropy loss are actually in the dense regions of the error loss. So when people use the cross-entropy, they are not using it because they have a stochastic device, in which case it would be justified as a maximum likelihood criterion, no. The point is that the cross-entropy, just by chance, focuses on these high local entropy regions. It's not even optimal, but still it lands near the optimum. I can show this analytically, and I think this is a clear indication of what is happening, okay?
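As a toy illustration of the difference in question, and only of the standard margin intuition, not of the analytic argument announced for tomorrow, one can compare the two losses on a single correctly classified example as a function of its margin: the squared error of a saturating output essentially vanishes once the example is classified, while the cross-entropy keeps decreasing and therefore keeps pushing the margin and the weight norm. The numbers below are just that comparison.

```python
import numpy as np

margins = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # signed margin y * (w . x), label y = +1

# mean square error of a tanh output against the target +1
mse = (1.0 - np.tanh(margins)) ** 2

# binary cross-entropy of a sigmoid output against the label +1
cross_entropy = np.log1p(np.exp(-margins))

for m, a, b in zip(margins, mse, cross_entropy):
    print(f"margin {m:4.1f}   mse {a:.2e}   cross-entropy {b:.2e}")

# Both losses vanish for large margins, but the cross-entropy decays much more
# slowly (roughly like exp(-margin)), so gradient descent keeps enlarging margins
# and the weight norm long after the zero-one error loss has already reached zero.
```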
Then I can show you this in several systems. And then I wanted to maybe discuss with you a couple of other topics, I don't know if you have the energy for that. One is which kind of stochastic process would be attracted by these wide flat minima; this has to do with what is going on in deep learning, right? Then all the other points I mentioned at the beginning, the transfer functions: it's possible, but as far as this school is concerned, I only discuss this case here. Then there is another question: which kind of simple learning rules converge to these wide flat minima? I think this is relevant for biology, for hardware, or for other kinds of questions. Here I have two results. One is that if you have stochastic synapses, stochastic weights, they naturally converge to these wide flat minima. And the second, which is something I find disgusting, is that if you take a quantum device in which you define a quantum annealing protocol for finding the weights, again the quantum dynamics is automatically dominated by these wide flat minima. Even in the limit in which the dynamics becomes classical again at the end of the protocol, you are bound to end up in these wide flat minima. So one is a curiosity for physicists, the other one can be of interest for neuroscientists, and this one, I think, is of general interest. Now, it's getting a bit late, and if I start on some of this we are going to run over; I think we have had enough for today. I will start with this, and if I have time I will cover the other two tomorrow, okay? I think we can stop here because, yeah, it's a bit complicated.