It is a pleasure for us to have here Professor Riccardo Zecchina from Bocconi University in Milan. He will give two lectures today, going from linear landscapes this morning to the fully nonlinear case; the theme is landscapes and learning. So the first lecture will not be very exciting, and then hopefully the excitement will go up. The reason is that I would like to review with you some historical calculations related to the structure of the minima in the weight space of neural networks. I think that some of you are not familiar with that, so I think it's worth going through this milestone at least once. Even if you don't get all the details, that's not so important; at least it's better to know that this tool exists, and so I will try to go through this calculation with you. Now, what is the general strategy? Well, what I would like to convey to you today and tomorrow is essentially what is special about neural networks. By this I mean that I personally don't think there is anything magical in deep learning: I think we have made a lot of progress, essentially in optimization, in the learning process, but there is nothing special in terms of artificial intelligence. Still, it is an extremely interesting phenomenon that we are observing, and I think it is the first step towards something more important. There is really a lot of confusion, when you start to study the basic properties of deep networks, about what the optimization of the objective looks like and which problems you should expect to encounter in the learning process. So the kinds of questions that I would like to address are the following. What does the loss, the energy landscape of a feed-forward neural network, look like? You train such a network, you use your favorite cost function and algorithm: are you going to be trapped in local minima? How fast is convergence? How can you end up in solutions that generalize well? These kinds of very, very basic questions. That's why I'm going to start from the most elementary device, and then, by exploiting the properties of this device, we are going to say something about more complex architectures. So today we will deal with very simple devices, but in the second lecture we will already move to more sophisticated machines. And so the question is: what does the landscape look like? Is it glassy, in the statistical physics jargon? Does it have a lot of local minima that trap algorithms, or not? And since there are many learning algorithms for deep learning, what are the unifying aspects of all these algorithms? Is there something common about what they are doing? Okay, these are very basic questions. Now, if you look at the literature, you find results like: all the minima are global minima and they are all identical. Are you familiar with these kinds of papers? Or: the optimal solutions are all connected. Or: in the learning dynamics everything is smooth and no glassiness is observed. You find these kinds of results in the current literature. Now, a natural question arises: is there something interesting happening at all, or is everything really so trivial?
And what I hope to be able to convey to you is that indeed there is something very interesting happening, and the fact that we observe a relatively simple learning dynamics and so on is just due to the fact that we are observing these systems at the end of an evolutionary process of deep learning that has led to models that are actually easy to train. But this doesn't mean that the models per se are trivial. I'll try to be explicit about this. So what I would like to propose to you is to look at deep learning as an evolutionary process. We start at time zero, say, I mean the 80s, 90s, where we have some simple neural networks with very simple transfer functions. This was the era of the perceptron, of support vector machines and so on. And then, across many years, many things have been modified. Which things have been modified? Well, the loss functions have been changed. The transfer functions; then, let me say, the algorithms, and by this I mean SGD plus many things, so stochastic gradient descent plus regularization, dropout, noise and so on, whatever you want. Then the architectures, and then there is preprocessing, data augmentation, and so on and so forth. So there is this evolutionary process that has taken place, and then we reach 2012, when the first deep network was shown to be effective. And this has really been an evolutionary process, because a lot of people have tried things: it is trial and selection. A lot of these things have been tried, many failed, some others have been retained. And so now we are looking at this very complicated system, the DNN, at the end of this evolutionary process, and it is very difficult even to define which model you want to look at: which loss function, which transfer function, which algorithm, you see. So it is very complicated even to agree on a common model to study. What is interesting is that we now have a very complicated system which is just like a biological system or a physical system: we know it works, but we don't really know why it works. So we have been working on this for many years, and it is actually difficult to start directly from here and try to derive basic properties of the models. Why? Because if you take a deep network with all these ingredients and then you run your algorithm, which works because of this selection process, you observe that it works smoothly, and then what do you conclude? Do you conclude that the underlying model is trivial? That it is simple, convex? No, you cannot conclude that. It is just that all these tricks contribute to the fact that learning is actually easy, relatively easy. So we are hiding all the interesting properties of these models under all these modifications. Now, one thing I think it is fair to say is that if we took the tools we had at our disposal in the 90s, essentially backpropagation plus an error function which was the mean squared error, and applied those tools today, with GPUs and with the data that we have today, we still would not be able to train deep networks. So these tricks have changed the game. It is not true that things work now just because we have more power and more data. No, all these apparently minor changes are actually changing the landscape completely. But in order to understand this, we need to understand what the landscape of the pure model is, to understand how all these things affect the problem. Is the plan clear? So this is what I would like to share with you.
So the plan for the lectures is that we are going to discuss what the geometry of the weight space looks like in simple models which are, however, non-convex. We will start with the convex one, but then we will move to non-convex models, and we will see several geometrical properties which may allow us to understand what is actually happening in these systems. Then we will give you some new results, in particular about the evolution of the loss function, that I think clearly show what has been going on. We have other results on the transfer functions and on the algorithms, but I don't have time to deal with all this stuff in two days, so I will limit myself to this topic. So, if you are patient enough, hopefully we can actually discuss. I mean, two weeks ago I had a very interesting experience: it was a journal club done using Zoom. Zoom is an interface that allows you to talk with people; it is like Skype, but more efficient, so you can organize journal clubs across the world. And it was very interesting because I could have a discussion with people working in machine learning about these problems, and opinions were very different. So this is open to discussion and questions. Okay, so this is the outline. Okay, so let me... Yeah, one thing, again, before I start with the heavy stuff, let me comment, still in this historical picture, that statistical physics has a lot to contribute to these kinds of topics, because it has analytical and computational tools, developed for very complex and disordered systems, that allow us to answer questions that are relevant for machine learning. It is not by chance; it is just that physicists have accumulated a lot of experience in studying optimization problems with very complex landscapes. So, the research activity versus time in physics: let me just give you some milestones. There was a lot of activity in the 80s, and during this period physicists were able to study the properties of neural networks storing random patterns, the things that we discussed in the first lecture today. But then there was a decrease of interest, and essentially nothing happened for many years; and then, of course, recently there has been, again, an increase of activity in the field. So this is for research on neural networks. At the same time, in that period there was relatively little activity in the study of optimization problems, but then in the mid-90s the research activity on optimization problems rose. Okay, so statistical physicists learned a lot about how to solve very complex optimization problems, and now these two things are merging in the study of deep networks. So the community of physicists studied neural networks for many years, into the 90s, and then the interest decayed. Why? Because the neural networks didn't work; they were getting stuck, essentially. It just didn't work: we didn't have enough data, we didn't have enough computational power, but also the models themselves were not sufficient. So somehow the interest dropped, and in that period part of the community started to study very complicated optimization problems, and I will devote some time to this in the afternoon. Now we are in the position of merging together what we have learned about optimization and what we have learned about neural networks, and trying to say something new.
Okay, so this is the perspective from statistical physics, and Matteo, if you have comments, feel free; we always disagree anyway. Okay, so let's first of all agree on our reference model. Let's define the pure classifier. These are the set of models that we want to study in order to understand what their landscape in the weight space looks like, in order to understand which are the potential traps or difficulties that we might encounter and how they have been solved. So we need to understand the pure classifier in order to understand the recent networks. The pure classifier is any network composed of neurons, I mean devices, that are just like this: you have N inputs, you have, say, one output, and here you have, say, a sign function, or any strongly nonlinear function of this kind. Why so? Because, you see, when you have such a nonlinearity, as soon as the input to the neuron goes beyond the threshold, you are going to emit something, right? A signal. Which means that there are a lot of configurations of the weights that can contribute to a positive signal. So this is a model that is highly nonlinear and has a lot of solutions of the learning problem; we will discuss this in a minute. Just to be clear: I am not considering, for instance, neurons that have a ReLU transfer function, or any kind of tricks, for the moment. We will discuss those, but not for the moment. So we are going to consider this kind of model, and then we are going to use as a loss function just the number of errors on the training set. That's it; this is our model. You might want to use the mean squared error, which is essentially equivalent; we will define this if you are not familiar with it. So you can take your deep network, you take each neuron and you substitute this kind of transfer function, and you take as a measure of success in the learning phase the number of errors it makes on the training set. That is our reference model, which, by the way, doesn't work: I mean, nobody is able to train this kind of network, but we can study what the properties of this network are. Okay, so let's start. Does it make sense to start with this reference model in order to build on it? In the first lecture of Bert Kappen — I was not here, but I looked at his slides — you discussed the perceptron, and we are going to discuss that again. What you have seen with him is the learning rule, the convergence theorem for the perceptron, and then you discussed things like training by gradient descent, stochastic gradient, error backpropagation, and so on, right? Very basic stuff. If you do that on a deep network, it doesn't work, okay? So I'm going to continue from that first lecture. In fact, he changed topic in his second lecture, right? That is probably related, because you need to introduce all the tricks in order to make it work. So I'm going to connect with that, and what I want to bother you with, just for this first lecture, is a historical calculation in statistical physics, which is the calculation of the capacity of a neural network. Why am I doing this? Because for the perceptron, for instance, with Bert Kappen, you have seen how to use Cover's counting argument to compute the capacity of the perceptron, just by counting in how many ways you can dissect your data into linearly separable parts, right?
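Just to fix ideas, here is a minimal numerical sketch of this reference model (my own illustration, with arbitrary sizes, not something shown in the lecture): a single sign unit, a random training set, and the loss counted simply as the number of errors on that set.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 50                             # number of weights and of patterns (arbitrary)
xi = rng.choice([-1.0, 1.0], size=(P, N))  # random +-1 input patterns
sigma = rng.choice([-1.0, 1.0], size=P)    # random +-1 target outputs

def n_errors(J, xi, sigma):
    """Loss of the 'pure classifier': how many training patterns the
    sign unit  out = sign(J . xi / sqrt(N))  gets wrong."""
    out = np.sign(xi @ J / np.sqrt(len(J)))
    return int(np.sum(out != sigma))

J = rng.standard_normal(N)                 # some weight vector...
J *= np.sqrt(N) / np.linalg.norm(J)        # ...normalized to the sphere |J|^2 = N
print(n_errors(J, xi, sigma), "errors out of", P)
```

With a random weight vector you expect roughly half the patterns to be wrong; the question of this lecture is up to which number of patterns there still exist weight vectors with zero errors.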
You might remember Cover's argument. Now, I'm going to introduce a method which, on this simplest network, will give exactly the same result; however, this method can be generalized to much more complicated structures, whereas Cover's method, like any other combinatorial method, only works for the simplest perceptron. Those methods don't even work for a perceptron with discrete weights, for instance; they cannot be generalized, whereas the method we are going to discuss can. So we are going to study the storage problem, in the perceptron for the moment, and we are going to follow the steps of what is called the computation of the Gardner volume; I'll tell you immediately what it is. Let me just mention that the first papers of this kind on neural networks were due to Elizabeth Gardner. She passed away many years ago, of cancer, but her papers were really very, very important. And by the way, even though nowadays in machine learning there are very few women, which is very bad, some important results in the field have been produced by women: one was Elizabeth Gardner, then support vector machines, and also online learning has seen a lot of contributions by women. Okay, so this is just a political remark. Okay, so in order to set up the storage problem for the perceptron, what do we have to do? First of all, let's define the problem itself. What is the training set? The training set for us is a set of random patterns. So I have my perceptron, my network, and I plug in here a vector xi^mu, which is a string of, let's say, binary numbers, plus or minus one. They could be Gaussian numbers; they don't need to be binary at all, this is just for simplicity; we could do the calculation with real numbers that are Gaussian distributed. In any case, these are all i.i.d. random numbers, so the probability that xi_i^mu is equal to plus one is, say, one half. So everything is simple. And so we have a set of random inputs, and we want to associate to these random inputs some random outputs. So you will agree with me that there is nothing to learn here in the sense of generalization: there is no rule that generates these patterns, so the generalization error does not even make sense in this problem, right? The best thing you can do is just flip a coin, because everything is random. However, and that is exactly why I'm studying this, I want to understand which properties do not strongly depend on the data, which do not depend on the fact that the data are concentrated somewhere or things like that, but are really structural properties of the weight space. Also the output, yes, everything is random; no, there is no rule, it is just an optimization problem here, an association problem. We can then do the teacher-student setting, in which the data are generated by another network; I will mention that. But for the moment, let's do this, okay? Good. Since this is a calculation coming from physics, I will call the weights J's, just because this is what you find in the literature. And what we want to compute is how the volume of the weights that correctly classify all the patterns changes as we increase the number of stored patterns. So we give more and more patterns, and the volume in the weight space that correctly classifies these patterns is going to shrink, okay? Let me write it; it's easier. Let me call this volume Omega, and it will be a function of the whole training set.
Okay, and this will be an integral over some measure over the weights — I will specify it in a minute — and then you have a product, mu going from 1 to P, the number of patterns, of theta of sigma^mu times J dot xi^mu, and then minus kappa. So what is this integral, this volume? It is an integral over all the J's; I will tell you what the measure is in a minute. It is just the computation of a volume where you cut away all those configurations of J's which do not correctly classify the patterns. So let me discuss this function here for just a minute. This theta of x is the step function: it is equal to 1 when its argument is positive and 0 otherwise. And whenever the argument of this function is positive — for the moment set kappa equal to 0, just for simplicity — it means that sigma^mu has the same sign as this scalar product. Since the neuron here is implementing a sign function, if the input to the neuron has the same sign as the desired output, the sign is going to give the right answer, right? Do you agree with me? Is that clear? Because, I forgot to say, this scalar product here is the input that the neuron receives: the sum over all the inputs weighted by the weights, right? So whenever the input has the same sign as the output, we are fine; it means that the output is correct. So we want to keep all those contributions in the weight space that correspond to correct classification; in this integral we are throwing away all the configurations that violate this constraint. Do we all agree on this? Now, this kappa here is just a stability: you might want not only to be on the right side, but also to impose some robustness, okay? So it is just a robustness parameter. Okay. And what is the phenomenology that you would expect? Ah, but first of all, what is the measure? So d mu of J is this kind of measure: the prefactor is irrelevant, but there is a delta function, so essentially we just constrain all the J's to lie on a hypersphere. We don't want the J's to become arbitrarily large. Why so? Because we want to avoid any trivial rescaling. You might have heard that in deep learning people talk about the fact that there is this rescaling of the weights, so that things become complicated; so you just impose this and you get rid of all these trivial symmetries, okay? So we constrain the weights to be confined — no, sorry, that product runs over the patterns. So this is the measure. In the space of J's that satisfy this spherical condition, we are going to compute the volume corresponding to weights that correctly classify the training set. Is that clear? And you see that, in principle, you could generalize this to a multi-layer network, right? You take all the J's, you impose the normalization constraint for each neuron composing your network, and then you put here a characteristic function that outputs 1 whenever the output of the network is correct for the given input and zero otherwise. And this is just checking whether the whole training set is correctly classified. So, the scenario that we have in mind — probably Bert Kappen also used the same notation — let me introduce the parameter alpha, which is P over N, the number of patterns over the number of weights.
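Written out, the object just defined is the following (my transcription of the board formula, using N for the number of weights, P for the number of patterns, and including the 1/sqrt(N) normalization that will be used below):

```latex
\Omega\big(\{\xi^{\mu},\sigma^{\mu}\}\big)
  \;=\; \int d\mu(J)\,\prod_{\mu=1}^{P}
        \theta\!\left(\frac{\sigma^{\mu}\,J\cdot\xi^{\mu}}{\sqrt{N}}-\kappa\right),
\qquad
d\mu(J) \;\propto\; \prod_{i=1}^{N} dJ_{i}\;
        \delta\!\Big(\sum_{i=1}^{N} J_{i}^{2}-N\Big),
\qquad
\alpha=\frac{P}{N}.
```

Here theta is the unit step function and kappa >= 0 is the stability, the robustness parameter.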
When alpha is small, it means that you have very few patterns and a lot of weights, so it must be easy to train the network. When alpha becomes very large, you have an over-constrained problem, with a lot of patterns and relatively few weights, and the problem is probably not going to be solvable. And in fact, what we will find is that at the beginning, when alpha is equal to zero, the whole volume is fine, right? No constraints. Then, whenever you impose one pattern, you impose a condition on this scalar product, which means that you cut your hypersphere roughly in half: you want to be on one side of a hyperplane. So what is going to happen is that, as soon as you add patterns, you cut away some of the space, and as the number of patterns increases, the region of volume that corresponds to a correct classification of the training set is going to shrink, until you reach a critical point at which the volume disappears. This is what we want to compute. Now, why is the perceptron trivial? Because this space is convex: you are always intersecting your hypersphere with half-spaces, so the solution region is always convex. And therefore algorithms converge; that is why you could prove, with Bert Kappen, that the training algorithm works, okay? But we will see that this method can also be applied to cases in which this is not true, in which the space is non-convex; those are by far more interesting. However, the calculations there are a bit more complicated, so let's do it in this case first, okay? Now, look at this quantity. This volume is exponential in the number of weights; it is an exponential quantity, okay? You are integrating over a number of J's that is linear in N, and the volume that I have drawn is actually exponential in N. Now, this is a problem, because this is an exponential quantity in N which depends on some random variables, which means that, in turn, it is a random variable that fluctuates exponentially. This is the first problem we have to deal with. If you compute just the expected volume as the number of patterns increases, the expectation of Omega is going to be dominated by exponentially rare events with exponentially large contributions, which you would never see in practice because their probability is exponentially small, okay? So that is not what you want to do. What you really want to do is to compute the expectation of the log of Omega. And then, once you have done this, the typical volume, which is the volume that you would observe in experiments, will be, to exponential order, e to the expected log, okay? So you see that in order to study this problem, you need to be able to compute the expectation of the log of this volume. This is not because you are not able to compute the expectation of the volume; it is just that it is useless, and actually computing the expectation of the log is much more complicated, okay? So, how did physicists solve this problem? There are a couple of methods that you can use today; let me remind you of one of them, which is called the replica method. How many of you are familiar with the replica method? One? Not many. Okay, so the replica method is based on the following trick.
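(A quick aside before the trick, my own illustration rather than something from the lecture: for a very small N you can estimate this shrinking volume fraction by brute force, sampling points uniformly on the sphere and checking all the constraints. It also shows immediately why a direct numerical approach is hopeless: the fraction decays exponentially with N, so for any interesting size you simply cannot sample it, and you need an analytic method.)

```python
import numpy as np

rng = np.random.default_rng(1)
N = 15                         # number of weights, tiny on purpose
n_samples = 500_000            # random points on the sphere |J|^2 = N
kappa = 0.0

for alpha in [0.2, 0.4, 0.6, 0.8, 1.0]:
    P = int(alpha * N)
    xi = rng.choice([-1.0, 1.0], size=(P, N))     # random +-1 patterns
    sigma = rng.choice([-1.0, 1.0], size=P)       # random +-1 labels
    J = rng.standard_normal((n_samples, N))
    J *= np.sqrt(N) / np.linalg.norm(J, axis=1, keepdims=True)
    stab = sigma * (J @ xi.T) / np.sqrt(N)        # stabilities sigma^mu (J . xi^mu) / sqrt(N)
    frac = np.mean(np.all(stab > kappa, axis=1))  # fraction of the sphere with zero errors
    print(f"alpha = {alpha:.1f}   solution fraction ~ {frac:.1e}")
```

Already at these modest sizes the fraction decays fast, and pushing alpha or N a little higher it drops below what any reasonable number of samples can resolve. Now, the trick.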
You can always write the log of x as a limit: log x is the limit, for n going to 0, of d/dn of x to the n — this is trivial, right? — and expanding this quantity for small n, what you get is the limit, for n going to 0, of x to the n minus 1, divided by n, okay? This is the kind of identity that we are going to exploit. And what the physicists decided to do is to say: well, let's compute this quantity for integer n. The mathematicians were shouting and complaining all the time, until essentially the year 2000, when Talagrand, the famous mathematician, was able to prove that these results are actually correct on some important problems. So I think there is no tension anymore, but at that time there was a lot of tension. Anyhow, suppose that you are able to compute this for any integer n, okay? Then you take the analytic continuation, n going to 0, and maybe for integer n it is easier. So let me give you a silly example, just for those who are not familiar. Suppose that you want to compute the log of 1 plus x using that identity, so as the limit, for n going to 0, of (1 plus x) to the n, minus 1, divided by n. And now we want to use the trick in which we first compute it for any integer n and then take the analytic continuation, okay? So, first of all, we can expand: we can write (1 plus x) to the power n by the binomial expansion, so you have a sum over k from 0 to n of (n choose k) times x to the k, right? This we know, and it holds for any integer n. Then we can expand this, and we get 1 plus n x, plus n(n minus 1) divided by 2 times x squared, plus n(n minus 1)(n minus 2) divided by 6 times x cubed, and so on and so forth; the generic term is of the form n(n minus 1)(n minus 2)...(n minus k plus 1) divided by k factorial, times x to the power k, okay? Now, since we are physicists, we take the analytic continuation: we have a generic expression of this quantity as a function of n, we go to the real numbers, and we take the limit n going to 0. This means that, for instance, this term here, n(n minus 1) divided by 2, in the limit n going to 0 will be just minus n over 2, right? Because n squared can be dropped, as can all the higher powers of n. And so what we get is 1 plus n x, minus n over 2 times x squared, plus n over 3 times x cubed, and so on, plus (minus 1) to the power (k plus 1) times n over k, times x to the k, and so forth. And you immediately recognize that this is just 1 plus n times the log of 1 plus x: this is just the expansion of the log, right? And so we are done, because what we have found is exactly that in the limit of n going to 0 this quantity — 1 plus n log(1 plus x), minus 1, divided by n — gives us the correct answer, log of 1 plus x. So nothing particularly deep, just a little warm-up with the replica method. It works, okay? There are many subtle aspects, but I don't want to touch them. Okay, good. And now we have the problem that we want to compute the expectation value of the log of our volume. This is our objective, you remember? Because if we are able to compute this expectation value — let me call it, divided by N, divided by the number of weights, let me call it s — then we know that the typical volume is going to be e to the N s, and therefore, when s goes to zero — actually, here, to minus infinity — we will be able to identify the capacity of the system by looking at when this volume shrinks.
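In formulas, the identity we will use is the following, with angle brackets denoting the average over the random patterns (this is my compact restatement of what was just said):

```latex
\big\langle \ln \Omega \big\rangle
  \;=\; \lim_{n\to 0}\frac{\big\langle \Omega^{\,n}\big\rangle - 1}{n}
  \;=\; \lim_{n\to 0}\frac{1}{n}\,\ln \big\langle \Omega^{\,n}\big\rangle .
```

One computes the average of Omega to the n for integer n, where Omega to the n is a product of n replicas of the same system, coupled only through the common random patterns, and then continues the result analytically to n going to 0. Note that this is the quenched average we actually want, as opposed to the annealed log of the average, which is dominated by the rare samples mentioned above.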
So I don't have time to discuss large fluctuations of exponential random variables; I guess everybody agrees that if you have an exponential quantity, computing the average doesn't give you any interesting information. The log, instead, is going to concentrate, and so this quantity is going to tell you the most probable value. So what we want to compute now, using the replica method, is the expectation of the log — natural log or not makes no difference, of course — of Omega. Now, in general we are not able to compute this directly; it is just impossible. Computing this average directly means integrating the log of the volume over all the J's and over the randomness. So we want to compute it using the trick that we used before in that simple example: we want to compute the expectation of the generic nth power of the volume. If we are able to do this, we can then take the analytic continuation and obtain the log, and from this we obtain a quantity which concentrates and which is going to tell us about the capacity of this device, okay? Okay, so in order to proceed — this is a little tour de force in statistical physics — there are a certain number of tricks: we are going to use Fourier transforms, saddle points and so on. I'll try to condense them into a few steps so that even if you are not a physicist you can digest it easily. So let me open a little parenthesis and remind you of something simple; everything is based on the following trick. Suppose that you have to compute an integral of this form: a product over i of dx_i times g(x_i), times a function f of the sum over i of the x_i. You can think of this as an average over some randomness: the x's play the role of the randomness over which you want to average, and f is the function that you want to average. How do you do this? Well, this looks like an entangled integral; apparently it is difficult to separate. You would like to be able to compute it as a product of N one-dimensional integrals, otherwise it is impossible, right? Those who are a bit expert in this kind of calculation immediately recognize why this is possible, but let me just do it together with you. We are going to use the Fourier representation of the delta function — I hope there is enough space here, yes. So this is the integral representation of the delta function, and by using it you can essentially replace the sum over the x's with a variable y, provided you also introduce this constraint in the integral, okay? And so this quantity becomes the following: you integrate over dy, and then you have the delta function of y minus the sum over i of the x_i, and then you have an integral of a product over i of dx_i, a product over i of the function g(x_i), and then you have just f(y), okay? This I can do. If there is some mistake, please point it out. Sorry? Yes, sorry, you are right. Okay, now, for this delta function we can use the integral representation, and what we get is the following: dy hat over 2 pi, and then you have e to the i y hat y, then f(y), and then you have an integral over the product, i from 1 to N, of dx_i, and a product over i from 1 to N of g(x_i), times e to the minus i y hat times the sum over i of the x_i, okay? And now we are done, because you see that this is factorized: e to the sum is the product of the exponentials, so everything factorizes, and this integral can be written as follows.
An integral over dy, dy hat over 2 pi, and then you have e to the i y hat y, times f(y), and then — this is the whole trick — you have just a single one-dimensional integral raised to the power N, okay? So you see, we started with this integral, which is entangled, say, and we end up with the Nth power of a one-dimensional integral. This is essentially the trick that is used to compute this quantity. So keep it in mind, because now, in the computation of the volume, we are going to see that we have integrals of this form, and we are going to exploit this trick in order to turn them into powers of one-dimensional integrals. At that point we will be able to compute everything by taking N large and using the saddle point method, okay? This is the strategy. This variable here is integrated over; it is an auxiliary variable. Okay, so we are ready to go now. There is another little thing that I need to remind you of, which is the integral representation of the theta function. If you have a theta function, which is 0 or 1, it is just an integral of the delta function — I hope you are all familiar with this — so it can be written as the integral between 0 and infinity, over dy, of the delta function of x minus y, okay? Now, if I take the integral representation of this delta, it means that I can write the theta function with an integral representation, and this is the result, okay? So it can be written in this form. Why do I need these kinds of tricks? Because I always need to disentangle integrals, so I need integral representations in order to reach something which is computable, okay? Okay, having said this, we want to compute the nth power of the volume. Let me remind you that this volume is this integral. I normalize with the square root of N because this is a sum of random terms with alternating signs, so it grows like the square root of N; with this normalization the quantity is of order one, right? Okay, so first of all, what I need to do is to write this in an integral form. So, what I want to compute is the nth power of this, for generic n; it means Omega_1 times Omega_2 times ... times Omega_n, where these labels are going to be attached to independent copies of the variables over which I am going to integrate, right? So Omega to the n can be written as an integral: you have a product over a from 1 to n of the measure, and then you have a product over mu from 1 to P and over a from 1 to n of theta of sigma^mu J^a dot xi^mu over the square root of N, minus kappa, okay? This is the real thing we want to compute, the nth power of the volume. If we are able to do this, then we take the analytic continuation, divide by n, and we get the expectation of the log of the volume, which is what we are looking for — the only thing that makes sense. Okay, so let's now introduce the integral representation of this; let's massage this term here, okay? The term is the product over mu and a of theta of sigma^mu J^a dot xi^mu divided by the square root of N, minus kappa. For simplicity, let me rename sigma^mu times xi_i^mu: let me just call it xi_i^mu again. What I mean by this is the following: sigma^mu is a random variable equal to plus or minus 1, xi_i^mu is a random variable equal to plus or minus 1, and they are all independent, so their product is, in turn, a random variable equal to plus or minus 1.
Since everything depends only on the product, I can just define a new variable which is the product, and in turn this is a random variable equal to plus or minus 1 with equal probability. So I can directly average with respect to this xi prime; and since I don't like primes, I just call it xi again, okay? So, okay, this will be equal to — you have a product over mu and a, and maybe I will skip some lines, I am just worried that I will never reach the end; what time do I finish this one? Okay. So here I obtain — you see, you have a theta function centered at kappa, so the integral representation, if you remember what I wrote before, is this one. These are the auxiliary variables for the integral representation, and I don't write explicitly the product over mu and a in the measure, because I just consider the overall product here. And then you have exp of i lambda hat — sorry, there is a lambda hat_a^mu which multiplies lambda_a^mu minus the sum over i, i from 1 to N, of J_i^a xi_i^mu, divided by the square root of N, okay? I have just used the tricks that I showed you before, applied to this theta function. Okay, so this is equal to a product over mu, and then you have an integral: here you have a product over a of d lambda_a^mu divided by 2 pi, and then an integral from minus infinity to plus infinity of the product over a of d lambda hat_a^mu. And then I have exp of i times the sum over a from 1 to n of lambda hat_a^mu times lambda_a^mu minus the sum over i of J_i^a xi_i^mu divided by the square root of N, okay? Now, why is this interesting? It is interesting because I then want to compute the average with respect to these xi_i^mu, right? And if you look at what is going on, you see that if I now turn this sum in the exponent into a product of exponentials, I am going to have terms that each depend on a single xi_i^mu, and so I can perform the average. It is a miracle — I mean, it is a very old miracle, from '87, thirty years old, but at that time it was kind of interesting. So let's do it. I am giving this lecture in a very informal way, like this, because it is an oral tradition; you cannot really find these things in books, essentially. So, in order to compute the average, the only thing you really need to do is to compute the average of this term here, right? Once you bring the sum down as a product, you just have to take the product of many averages. So what we need is to be able to compute the average of e to the minus i, sum over a of lambda hat_a^mu, sum over i of J_i^a xi_i^mu, divided by the square root of N. Oops, sorry, I thought I was writing an exp; it is not very well written here. This is the quantity I need to be able to average with respect to the random patterns. And this is the only part of the integral that depends on the xi_i^mu; the rest are just factors. So I can bring this down, and it will be a product over i of averages of this quantity: e to the minus i, sum over a, lambda hat_a^mu J_i^a xi_i^mu, divided by the square root of N, okay? And now you know how to do this, because xi_i^mu is plus or minus 1 with equal probability. So this is done: it is a product over i of cosines, because you take e to the minus i times the argument with xi equal to plus 1 and then with xi equal to minus 1.
So this is the cosine of the sum over a of lambda hat_a^mu J_i^a, divided by the square root of N. Well, now, again, we are in the limit of large N, so we can expand the cosine to second order. What we can do is write this product as the exponential of the sum over i of the log of the cosine, right? This is just a trivial transformation. And I expand this, and what I get, for N large, is that this becomes e to the minus one half, sum over a and b from 1 to little n — the replicas — of lambda hat_a^mu lambda hat_b^mu, times the sum over i of J_i^a J_i^b divided by N, okay? So we are done: we no longer have any average to be taken. If you now take this result and plug it back into the integral, you have just an integral which no longer depends on any xi; we have already taken the average. It is going to depend on the J's only through this sum here, divided by N. And if we are able to compute this integral, we get the volume by taking the analytic continuation. Now, the calculation is not yet done; it still requires some faith. Let me remind you that we need two things: first of all, we need to compute the average over the xi's, and then we need to be able to write this multi-dimensional integral in a factorized form, as a power of a one-dimensional integral. That is the objective. I promise that the other lecture will be fun, okay? So sorry for this — well, you have to suffer a little bit, right? Okay. So, in order to be able to compute the integral, let me perform some other transformations which are just identities; so even if you don't understand why I am doing them, you cannot object. So let me introduce q_ab as 1 over N times this sum over i of J_i^a J_i^b (the key formulas of this step are collected in a summary at the end of this derivation). You remember that in the example we had the sum over the x_i's, and we wanted to extract it from the integral in order to write everything in a factorized form: this sum here is going to play the same role as the sum over the x_i's in that initial example. So we need to define a parameter which is equal to this sum, and use a delta function to impose this condition; you will see, this will allow us to write everything in terms of one-dimensional integrals. In your notes, you can go back and check that this sum plays the same role as the sum over the x_i's in that initial integral. So let me introduce this quantity, which means that in the integral I can always substitute this sum with q_ab, provided that in the measure of the integral I impose, with a delta function, the condition that these two things are equal. In particular, I have to introduce in the integral this delta function, which in turn I will write in integral form, because at the end of the day I want everything in an integral form; so, again, I need some conjugate parameter to write the delta function in its integral representation, okay? And just to speed things up, let me also anticipate that I am going to need another transformation, and so I also need to impose this condition. Well, this condition is nothing but the measure, actually: you remember that in the integral we had the measure imposing that all the J's are normalized, lying on the hypersphere of radius square root of N. So this I will also write in an integral representation: an integral over some k hat_a over 2 pi, and then exp of minus i k hat_a times the sum over i of J_i^a squared, minus N. Is that good? No, I made a mistake here — okay, now it is, and this one is still N. So what I am saying here is that q_ab, these auxiliary parameters that I introduced, are equal to 1 over N times this sum.
So I can also say that N times q_ab is equal to this sum. That's it. Sorry, I write badly, but okay. Okay, now, with all these things, we can go back and plug everything into the integral; you have been very patient. So we go back — you remember, we had this product of many integrals, each integral computing a volume, and we are taking this product of many integrals. By using all these tricks, what you get is the following quantity. Believe me, it is just bookkeeping: you put all these things together. And okay — so, with a lot of patience, we have reached the following result: this Omega to the n, this nth power of the volume, can be written as a multidimensional integral of e to the N times some function f, where f has this expression. You plug in all the transformations that we did before, and you obtain a multidimensional integral. But you see that we have lost the index i over the weights: what you obtain, if you substitute all the transformations, is that the sums over i, over the weights, just factorize, and so you get an overall factor N times this function here. This is the magic. You have two tricks here: first, we have been able to compute the average over the patterns; second, we have been able to reduce everything to, say, lower dimension, to lose the dependence on the index i, okay? I mean, again, thirty years ago this was a breakthrough. Now that I have given you this, I am sure you are not so happy, because you don't know what to do with it, right? It is a very complicated integral. But at least we can outline a strategy at this point. We want to compute this quantity; we have a high-dimensional integral of e to the N f. Since N is going to be large, this integral is going to be dominated by the maxima of f, so we can hope to compute it just by finding the maxima of f. In the limit of large N, this integral will be equal to e to the N f computed at the so-called saddle point, the point that maximizes this function. You agree? So we have reduced everything to this computation: Omega to the n is now just the problem of computing some integral of e to the N times some function f of all the order parameters, and for large N what you need to do is to find the maxima of f; then Omega to the n will be e to the N f evaluated at the values that maximize f. That's it. So you see, even though we are still very far from obtaining a result, we already have an expression, which is very complicated, because this f looks like a sum of terms plus these two functions here. So what we are now going to do is look for the maxima. Now, of course, here there is a key observation, and it is the following: we started with a problem that clearly has a symmetry, right? You can exchange the order of the factors in this product and nothing changes, which means that the problem has a global symmetry with respect to permutations of these replica indices: you can always permute these indices, and things should not change. However — and this is the unfortunate part — if you have a problem which is symmetric by definition, and you are looking for the maxima, the extrema, of a function which is symmetric, like in this case with respect to these indices, the individual maxima can break this symmetry; it is only the set of all maxima that is symmetric as a whole. Is that clear to the non-physicists?
I mean, this is called symmetry breaking. The idea is the following: suppose that you have a square, right? And you want to find a path of minimal length that connects the four vertices. The problem has a symmetry, a rotation symmetry, so you might say, okay, this is the path of minimal length that respects the symmetry — but it is not the shortest one. The shortest one is this one, and of course there is another one. And you see that the two solutions together reproduce the symmetry of the initial problem, but a single solution can break this symmetry, okay? I am telling you this because in random systems this is what is going to happen: when you look for the extrema of this function, depending on some parameters, in particular alpha, you might need to break the initial replica symmetry. And this has enormous consequences on the properties of the neural network, okay? But for the moment, let's look for the maxima under the simplest possible assumption, which is to assume that the solution is symmetric, okay? This is going to help us a lot. So we are going to make what is called the replica symmetric ansatz. What this means is that we are going to say that q_ab is equal to q, independently of a and b; that the q hat_ab are all equal to q hat, independently of a and b; and that k hat_a is equal to k hat, independently of a, okay? So we are going to look for the maxima of this function in a subspace, the subspace in which everything is symmetric, okay? And now, if you do this — if you are still patient, we are almost at the end of the plateau — if you make these assumptions, the various functions take simpler forms. In particular, let's look at the function G_S that I defined before, which is now a function of q hat and k hat. It is going to be equal to the log of the integral of the product over a of dJ_a over the square root of 2 pi, times e to the minus i over 2 k hat times the sum over a of J_a squared, minus i q hat times the sum over a less than b of J_a J_b. Sorry? The delta, you mean the Kronecker delta on the diagonal? Yes, it is true, you are right; if you treat it more generally you find the same result with this treatment. Okay, let's say it is otherwise zero; you are perfectly right. I mean, there are so many details about replica symmetry breaking that this would really be a very minor one, as we will maybe see later. Okay, so now we have this expression. Let's see how this part of the function f changes and how it gets simplified. This is what you get if in G_S you plug in the symmetric hypothesis. Then you use the fact that the sum over a less than b of J_a J_b is equal to one half of the square of the sum over a of J_a, minus the diagonal terms, okay? And if you do this, what you get is that G_S — let me avoid writing all the arguments again — is the log of an integral of the product over a of dJ_a over the square root of 2 pi, times e to the minus i times (k hat plus q hat) over 2 times the sum over a of J_a squared, plus i q hat over 2 times the square of the sum over a of J_a, okay? Now I can rotate the integration contour in the complex plane to take the saddle point, which essentially means that I just drop the i's. Then I can use the Gaussian identity: let me remind you that e to the b squared over 2 can always be written as an integral over Du of e to the b u, where Du is just a Gaussian measure, okay? A lot of tricks.
Now, you use this identity, where this term is going to play the role of your b squared over 2; so again you introduce this Gaussian integral to get rid of the square, and what you get, finally, is the following. Okay — before I write the expression, let me also tell you why we need to linearize this term. Why do we want to write it in a linear form in order to proceed? The reason is that if I can rewrite it, even at the price of introducing an extra integral, in a linear form, then this product over a factorizes, and so you have the same one-dimensional integral to the power n, okay? It is always the same trick; you do it with small n or with big N on various occasions, right? So you do this, you have the same integral to the power n, and at that point you can take the limit n going to zero. Let me remind you that you have something like a to the power n, where a is the integral, okay? And for small n this is equal to 1 plus n log a, plus terms of order n squared; which means that, since we have a log outside, the log of a to the n will be equal to the log of 1 plus n log a plus order n squared, which in the small n limit is equal to n log a plus terms of order n squared, okay? So now, if you make the transformation and expand for small n, what you get eventually is G_S equal to little n times an integral over Du, a Gaussian integral, of the log of an integral over dJ of e to the minus (k hat plus q hat) over 2 times J squared, plus u times the square root of q hat times J. So our very complicated expression now reduces to this integral, relatively simple compared to the one we started with, okay? Okay, now, let me see — yes. I am really torturing you, but as I said, afterwards everything will be easier. Similarly, you can do the same tricks, the same bookkeeping calculation, for the other term in f: you remember we had G_S and G_E. For G_E you perform the same kind of transformation, and you get something like n times an integral over, again, a Gaussian measure, of the log of a function H of (kappa minus the square root of q times t) divided by the square root of 1 minus q, plus terms of order n squared, where this function H is very famous in statistical physics: it is essentially just the Gaussian tail function, defined through the error function. Okay, we are done now, because we can now write the function for which we want to compute the saddle point. Our function f is little n times another function s, which depends only on q, q hat and k hat — let me write it here because it is useful; we are almost done, don't worry. This s is equal to one half k hat, minus one half q q hat, plus G_S of q hat and k hat, plus alpha times G_E of q. Okay, where, for instance, G_E of q is this quantity here and G_S is the one we wrote before. And so this is the result. Now what we have to do is to look for the maximum of this function, so we have to solve the saddle point equations with respect to q, q hat and k hat: ds over dq hat equal to zero, ds over dk hat equal to zero, and eventually also ds over dq equal to zero. If we are able to solve these equations, we find the solution, we plug it into the expression of f, and this is going to give us our final result.
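Since most of this derivation lives on the blackboard, here is a compact summary of the key formulas in the standard notation of the literature on this calculation (the conventions for the intermediate conjugate variables may differ slightly from the ones used on the board, but the final replica-symmetric expression is the standard one):

```latex
% average over one binary pattern component, and the large-N expansion of the cosine:
\big\langle e^{-i c\,\xi}\big\rangle_{\xi=\pm 1}=\cos c,
\qquad
\prod_{i=1}^{N}\cos\!\Big(\tfrac{1}{\sqrt N}\sum_{a}\hat\lambda^{\mu}_{a}J^{a}_{i}\Big)
\;\simeq\;\exp\!\Big(-\tfrac12\sum_{a,b}\hat\lambda^{\mu}_{a}\hat\lambda^{\mu}_{b}\,q_{ab}\Big),
\qquad
q_{ab}=\frac1N\sum_{i}J^{a}_{i}J^{b}_{i}.

% replica-symmetric ansatz and the Gaussian tail function:
q_{ab}=q \ \ (a\neq b),
\qquad
H(x)=\int_{x}^{\infty}\frac{dt}{\sqrt{2\pi}}\,e^{-t^{2}/2}.

% final replica-symmetric result for the typical volume (as a fraction of the sphere):
\frac1N\big\langle\ln\Omega\big\rangle
=\mathop{\mathrm{extr}}_{q}\left\{
\frac12\ln(1-q)+\frac{q}{2(1-q)}
+\alpha\int Du\,\ln H\!\left(\frac{\kappa+\sqrt{q}\,u}{\sqrt{1-q}}\right)
\right\},
\qquad
Du=\frac{du}{\sqrt{2\pi}}\,e^{-u^{2}/2}.
```

At alpha equal to zero the extremum is at q equal to zero, and, as discussed next, the capacity is reached when the extremizing q goes to one.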
Okay, and now I am not going to compute all these derivatives, because it is a bit heavy, but you see it is not so bad: if you take the derivative with respect to k hat, here you get one half, and then you take dG_S over dk hat, you take your expression, you compute it and you solve it. So you get a set of coupled equations. If I solve these first two equations, for q hat and k hat, I get an expression for s which at the end depends only on q and on the parameter alpha. So I first solve these two equations — if you are a physicist, you should do it once; it is just like the random energy model, a milestone that once in your life you have to work through. And if you solve these two equations, you get this expression, and you still have to optimize with respect to q. So the final outcome, f, which is n times s — the final s — is given by the extremum of this expression here with respect to q, okay? Now, if I do this, I obtain the final equation: I set ds over dq equal to zero, and what I get is this equation with q over 1 minus q; if you take the derivative, you get this expression. Now, I don't really have time to enter into too many details, but you clearly see that for alpha equal to zero, q equal to zero is a solution of this equation, right? With zero here, you get q equal to zero. Now, if you look more carefully at what the parameter q means, it is a measure of the typical overlap between solutions that belong to the feasible space, okay? So at the beginning, q equal to zero means that solutions are spread uniformly over the weight space, which means that their typical overlap is zero. So q is the typical overlap between solutions when you sample uniformly from the solutions of your learning problem. Let's say J^1 is one solution and J^2 is another solution: if you sample them at random, their typical overlap, the scalar product J^1 dot J^2 over N, is equal to q. The typical overlap of two solutions is q. And in fact, when alpha is equal to zero, q is equal to zero, which means that the solutions are in random positions, whereas the maximum value q can take is one, which corresponds to the volume having shrunk to just one point, so that any two solutions have to coincide. The maximum is one because the J's are normalized in such a way that this scalar product is between zero and one, okay? Anyhow, if you don't want to go into the details, you just have this equation, you know how to do the integrals, and you say: okay, for alpha equal to zero, q equal to zero is a solution; then you start to increase alpha and you observe what q does. You go back to the definition of q and you realize that it is upper bounded by one, okay? And so what you get is that, if you want to compute the critical capacity of this model, you need to look at the limit where q goes to one, where essentially a single solution is left, so that the solutions are concentrated in a point.
Or, if you don't want to do this, you just increase alpha, solve this equation, look at how q changes, and you see that at a certain point it reaches one, okay, which is the maximum value. Now, if you want to get an analytic expression for the critical capacity directly, you have to use the asymptotic expansions of this function H. For small x, H of x is equal to one half minus one over the square root of 2 pi times (x minus x cubed over 6), plus other terms; and when x is large you need the other asymptotic expansion. Small argument, for kappa equal to zero, would correspond to alpha equal to zero; here instead, when q goes to one, the argument goes to infinity, so you are interested in the large-argument expansion. If you use this and plug it in there, what you get is that in the limit of q going to one, in order to have a solution of this equation, alpha cannot be arbitrary: it has to take a value which is given by the following equation. Voila. So, one hour and a half to start from a super complicated integral and get this equation for the value of alpha at which the volume shrinks to zero. If you want to be complete, you can take all these expressions and plug them back into the replica trick expansion. If you look in the notes, you will see that, exactly as in the example of the expansion of the log we did at the very beginning, all the factors of n cancel with the denominator, the one cancels, and so this s really is the quantity we are interested in, the log of the typical volume. And now what you can do is take this expression, put it on a computer, and plot the value of alpha_c: you choose different values of kappa, you plug them in here, you solve this equation for alpha_c, and you get this curve. This is Gardner's calculation. Now, what Bert Kappen did the other day with you is to compute this single point: he showed you that the perceptron, for linearly separable patterns, has this capacity here, alpha equal to 2. But that argument is not capable of reproducing the entire curve when you try to impose some robustness. So, I don't know — I tried to give you a sketch of this historical calculation in an hour and a half; I am sure there are people that can do it much better than me. But anyhow, you have been patient enough to absorb it, and now you know that it exists; I can give you some references if you want to reproduce the details. And I think you should know that this method exists. Why? Well, for this specific problem it is a lot of work: at the beginning she probably had to work for six months to obtain this result, just to compute a number, two, which was already known. But the point is that you can apply this to multi-layer networks, and you can use it to explore the volume, the geometry of the space of solutions. So now let me close this parenthesis and go back to real networks. This is a very heavy method, which however works for non-convex problems, and we can use it to compute properties of the volume of the weight space that correctly stores random patterns. Now, for this problem we didn't learn much, because what we learned is that at the beginning q is zero, which means that solutions are in random positions, so their overlap is zero, and then the volume shrinks until it reduces to one point. But this is not what is going to happen in multi-layer networks, or in non-convex networks.
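As a small numerical aside (my own, not from the lecture): the capacity curve just mentioned can be evaluated directly. In the limit of q going to one, the replica-symmetric saddle point gives the standard Gardner condition alpha_c(kappa) = 1 / integral from minus kappa to infinity of Dt (t + kappa) squared, which at kappa = 0 reproduces alpha_c = 2, the point obtained with Cover's counting argument. A sketch:

```python
import numpy as np
from scipy.integrate import quad

def alpha_c(kappa: float) -> float:
    """Gardner critical capacity of the spherical perceptron with stability kappa:
    alpha_c = 1 / int_{-kappa}^{inf} Dt (t + kappa)^2, Dt the standard Gaussian measure."""
    integrand = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi) * (t + kappa) ** 2
    val, _ = quad(integrand, -kappa, np.inf)
    return 1.0 / val

for kappa in [0.0, 0.5, 1.0, 2.0]:
    print(f"kappa = {kappa:3.1f}   alpha_c = {alpha_c(kappa):.3f}")
# kappa = 0.0 gives alpha_c = 2.000, the number known from the counting argument.
```

Now, back to what changes when the problem is not convex.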
In non-convex networks, what is going to happen is that the volume breaks into many disconnected components that have different sizes, some of which are relevant for learning and some of which are not. And there is no method that gives you this result except this one, as far as I know — and I am pretty sure about it. That is why I have been bothering you with this technique. I cannot start directly with the multi-layer network, because it is incredibly complicated, and also because you need to study the saddle point equations in a replica-symmetry-broken ansatz: you cannot assume that q_ab is equal to q, so it is much more complicated. However, the method works. One thing I want to say is the following. Suppose that we fix kappa equal to zero here and we move in this direction, we increase alpha; then what we observe is that q goes from zero to one. But then, what happens beyond that point? Let me draw it in another fashion: let's put alpha here, and here the typical number of errors that you expect, the minimum number of errors that you get with probability one. We know that up to alpha equal to 2 a solution exists with probability one, so the number of errors is zero; beyond this point the number of errors starts to increase, okay? And if you look at the saddle point equations, what happens at this point is that the solution becomes unstable: if you compute the stability of your saddle point, the eigenvalues of the action, it becomes unstable, which means that if you want to describe what happens beyond this point, where you have errors, you have to break the symmetry that we assumed at the beginning. So, if you are just interested in describing the regime in which solutions exist, it is perfectly fine to make this simplifying assumption; but when you go into the error regime, it doesn't work anymore, okay? So what we did with this calculation is just to describe what happens in the region where everything is satisfiable, where you can store all the patterns, and the solution becomes unstable right here. This is because one thing you have to do, once you make an ansatz to simplify your calculation, is to check that the ansatz is consistent: if you assume that there is an extremum there, you have to check the eigenvalues of the action, right? And what happens here is that the symmetry breaks. But that is fine, because in this capacity calculation we are just interested in reaching that point; and this tool also allows you to describe what happens above, and in other architectures, and so on and so forth. Okay. So, of course, if you have already seen this, you knew everything; if you have not, you now have an idea of what you need to do and compute. The first step is: watch out, the volume is exponential, so you have to compute the log of the volume, the expectation value of the log, if you want something that concentrates. "Typical volume" means that if you run an experiment with your computer, you are going to observe the typical properties, those that pop up with finite probability; you are not interested in rare events, so you need to compute the average of the log in order to compute the volume. You do this and you get the result that up to alpha equal to 2 the volume of solutions exists and it is extensive. Put this together with the results of Bert Kappen: the space is convex, the patterns are linearly separable, and an algorithm exists which is going to find a solution inside this convex body. So you see, you have a nice interplay between the results on algorithms and the general results about the phase diagram of this neural network; the two things nicely come together. The point is that this tool also allows you to explore the regimes in which, say, the traditional tools don't work, so we are going to be able to say something about what happens in non-convex networks. It might have been a bit heavy, but most likely it will not happen a second time in your life that you hear about it. So now we take a break, and in the next hour we are going to use the slides and see how these things come alive in deep networks.