I mentioned where we are in the plan of the talk. So essentially, yesterday I talked about classical learning, or standard learning theory, which is really based on what we call PAC bounds. And I didn't say it, but PAC stands for probably approximately correct, which is a little bit funny. But essentially, as I said, it gives you a bound on generalization in terms of two numbers: epsilon, which is what we call the accuracy, a bound on the generalization gap if you want, and delta, which is the confidence. So the confidence stands for the "probably," and the accuracy stands for the "approximately correct" — just to make sure you understand where the name comes from. So there are these two numbers: the confidence of the algorithm, of the eventual rule, and the difference between training and generalization error. So I hope I convinced you that entropy is a very interesting measure, and that the generalizations of entropy, which are related to mutual information, really give us an interesting characterization of neural networks. And now I'm going to focus on what I call the information plane theorem. So essentially, the story is that we look at this multi-layer system with each layer made out of two maps. One I call the encoder of the layer, which is in general a stochastic mapping from the input pattern, which can be very complex — we're going to assume that it's a high-entropy pattern, and we're going to confine ourselves to what we call typical patterns, patterns in the set of typical objects. So the encoder is in general a stochastic map from the input to the layer, and the decoder is a map from the layer to the desired output. And I'm going to argue that when the problem becomes very large, there are only two numbers you want to care about: the mutual information of the encoder and the mutual information of the decoder. But I'm going to make a distinction. For any encoder, any map from the input to some internal representation, there is the optimal — or, if you want, the Bayes optimal — decoder, which is not necessarily the actual rest of the network. I mean, there is the real decoder, which is what the network is doing for us. But if I had all the data in the world, I could actually decode this representation with the optimal decoder. So what is this Bayes optimal p(y|t)? Essentially, I'm trying to predict the true label, the desired label, from the representation. And the only thing which intervenes between each one of those layers and the actual output is the input layer. So, using this Markov condition, it is simply the sum over all possible x's of p(y|x) p(x|t). If you think about it, this is just using the fact that there is a Markov chain here: Y - X - T. And this is true for any t. Then, of course, the layers of a deep neural network form a Markov chain of representations, which I denote T1, ..., Ti, ..., TD. And I'm going to focus on the behavior of those representations. This is a Markov chain, which means that information can only go down along the chain. Yes? Yes — first of all, why is y here? OK, so I thought I said it yesterday, but it's all right. So why is the desired output, the desired label, there? This is the data that I get. I mean, essentially, I have this joint distribution of x and y.
I get some points. And what we do is map x through a cascade of such representations. And from the last hidden layer, I generate what I call y hat, which is the actual output of the network — but this is not y. This is the output of the network, which, if we train it well, is going to be close to y. That's why I put it on this side: it is a map from the last hidden layer, or from the last layer. Now, the actual y is here, because essentially, what I'm giving you is this: for this input representation, I'm giving you labels, and then I generate representations of x. The only way I can actually predict the true y is to go through x. So this is essentially just an implementation of this Markov chain: y, x, and then one of those t's. This is the actual true y. Now, if you think about it: you're given data, I'm giving you a label, and then I'm giving you something which depends on this label — and it's here, to the left of the representation. You're not getting the y. And then, of course, using this statistic, a sample of this distribution, I'm training the network and generating representations. But the true y is on this side of the Markov chain. So if you think about it, if this is my Markov chain, then this is the best prediction of the desired y that you can make; this is all you know about y from the representation t. Maybe this is confusing for those of you who haven't seen it before, but I hope that mostly it is simple. So essentially, this is the Markov chain I have: x, y; then from x I generate representations of x; and from the last one I generate my prediction. The prediction should be close to y eventually, if I can train it well, but y is different from y hat. So this is what I call the optimal decoder. And the true decoder, the network decoder, is of course this p(y hat | t), which I'm not going to discuss too much. So actually, what I showed you before in this movie — and I'm going to come back to it in a second — was an implementation of what I call the information plane theorem, which is something a little bit more abstract, and it takes time to grasp. But essentially, I'm saying that when the network becomes very large — you're still confused? That's right. Because given x, I don't need t in order to say everything I know about y. Remember that I have the joint distribution: I have p(y|x) and p(x). This is essentially my rule. So given x, I know how to predict y. This is given; this is not part of the network; this is data. Then I get a sample, or training data, which is essentially a collection of (xi, yi) pairs, samples from this distribution p(x, y). This has nothing to do with the network. That's why they stand outside of the network. Now, the t's are representations of x: for any given x, there is a t. I don't need y for that. What I do need y for is training the network. But once the network is trained, the weights are fixed, and only then is this a Markov chain. Otherwise, it's not a Markov chain, because I'm actually recycling the data, and I create all sorts of dependencies. But for fixed w's — any fixed w's — once the network is fixed, there is this Markov chain of representations. And I'm going to discuss the representations, not the weights.
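To make the Bayes optimal decoder concrete, here is a minimal numerical sketch, assuming a made-up toy joint distribution p(x, y) and a fixed stochastic encoder p(t|x); the sizes and the random distributions are my own illustration, not anything from the talk's actual demo.

```python
# Bayes-optimal decoder of a representation T under the Markov chain Y - X - T:
# p(y|t) = sum_x p(y|x) p(x|t), with p(x|t) obtained from Bayes rule.
import numpy as np

rng = np.random.default_rng(0)

n_x, n_y, n_t = 8, 2, 4
p_x = rng.dirichlet(np.ones(n_x))                # p(x), a toy prior
p_y_given_x = rng.dirichlet(np.ones(n_y), n_x)   # the rule p(y|x), shape (n_x, n_y)
p_t_given_x = rng.dirichlet(np.ones(n_t), n_x)   # encoder p(t|x), shape (n_x, n_t)

p_xt = p_x[:, None] * p_t_given_x                # joint p(x, t)
p_t = p_xt.sum(axis=0)                           # marginal p(t)
p_x_given_t = p_xt / p_t                         # reverse channel p(x|t), by Bayes rule

# The Bayes-optimal decoder: p(y|t) = sum_x p(y|x) p(x|t)
p_y_given_t = p_x_given_t.T @ p_y_given_x        # shape (n_t, n_y)
print(p_y_given_t.sum(axis=1))                   # sanity check: each row sums to 1
```

Note that nothing here depends on the rest of the network: given the joint distribution and any fixed encoder, the optimal decoder is fully determined, which is exactly the point being made.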
You're absolutely right that during the training, I'm using all sorts of feedback, because I'm using backpropagation and so on. And then, of course, I create dependencies between y and the weights and x. But once the weights are fixed — any given weights — this is the Markov chain. No, here you can't, because you are actually presenting x, not y. You're given the pattern. I'm given your image; I don't have your label. I'm given any image, and I'm generating representations of this image for any given w. So this is important. This is what I could do without knowing anything about the network: if you give me the joint distribution, just using this Markov chain, this is the optimal decoder of this layer. Just give me t and generate this reverse decoding. So p(x|t) is the reverse encoder — I'm sorry, the reverse channel. That's why I'm using Bayes rule here. So essentially, in order to see what is the best prediction of y from t, I'm just using this Markov chain. And this is for a fixed encoder, for fixed weights w. And what's confusing you, which is of course correct, is that during the training there is a feedback here, which is going to mix the dependencies completely. But I'm not talking about the training — after the training. Yes. Remove the x? This is my input layer; why do you want to remove it? No, but you don't. In practice, what you do with these networks is you show a pattern — say, an image, pixels of an image. You don't know y. But what the network is generating for you are these representations — again, for given w's. When the weights are given, you don't know y. Of course, if I knew y to begin with, I wouldn't need to do anything. If I knew this joint distribution of x and y, then the problem would be solved. You wouldn't even need the network: for every given x, you just output the y with the maximum likelihood. So this is important. I know this is confusing, but remember: those weights are fixed in this picture. Now, during the training, we looked at these two numbers. This was the movie that I showed you last night. We looked at these two numbers of mutual information. Every point on this plane is for a different set of w's; w is changing here. But this is the initial condition. And again, because of the data processing inequality, I have this drop of information. It doesn't have to be this. This is because of the special architecture of this network — I'll show it to you later on, it's some sort of an Eiffel Tower: the layers get narrower and narrower, and this is why information really drops along the layers. Now, there's a delicate issue of how we really estimate this information, but I don't want to discuss it now; I'll discuss it later. We do it by essentially binning the units of the layers and discretizing everything. And I argue that this is not important — it's a very good proxy to the true information, even if you don't discretize. I'll come back to this later. But that's why you see this very sharp drop, and you see this very nice concentration of all the layers. The only random thing here is the weights: the weights are randomized because I start with randomized initial conditions, different for every network. And then, of course, the examples are a finite sample; in this case, the network is trained on 75% to 80% of the data.
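For readers who want a concrete picture of the binning estimator just alluded to, here is a hedged sketch, assuming you have an array of per-sample hidden activations, an integer id per input pattern, and its label; the helper names, the bin count, and the hashing trick for collapsing a binned pattern to one symbol are all my own illustrative choices, not the talk's actual code.

```python
# Plug-in mutual information by discretizing layer activations into bins.
import numpy as np

def discrete_mi(a, b):
    """Plug-in mutual information (in bits) between two 1-D discrete arrays."""
    n = len(a)
    joint = {}
    for pair in zip(a, b):
        joint[pair] = joint.get(pair, 0) + 1
    pa = {v: np.mean(a == v) for v in set(a)}      # marginal of a
    pb = {v: np.mean(b == v) for v in set(b)}      # marginal of b
    return sum((c / n) * np.log2((c / n) / (pa[x] * pb[y]))
               for (x, y), c in joint.items())

def layer_mi(x_ids, y, activations, n_bins=30):
    """Bin each unit's activation, collapse each binned row to one symbol T,
    and return the plug-in estimates of I(X;T) and I(T;Y)."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges)              # (n_samples, n_units)
    t_ids = np.array([hash(row.tobytes()) for row in binned])
    return discrete_mi(x_ids, t_ids), discrete_mi(t_ids, y)
```

This is exactly the kind of proxy described above: discretize, count, and treat the binned layer as a discrete random variable.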
And the data, in this case, are patterns of 12 binary inputs. So it's a very small problem. The maximum information is the entropy of the pattern, and the entropy is 12 bits here; 12 bits means there are only 2^12 possible patterns. So this is a very small network, with one bit of output. And you see that the initial layers — the blue ones there, actually more than one layer, the ones closest to the input — have very high information, because essentially they don't lose any information about the input. They're essentially a one-to-one transformation of the input, so no information is lost about the label or the input. That's why they're very high up there: one bit of information about the label and 12 bits of information about the input. But once I move through this chain of representations, I'm losing information initially. And then when you train it — I'll just show it again, to amuse you — you get this very nice picture: they all come up very quickly, more or less to the diagonal here, and then slowly move to the left. And what I want to do now is really to understand this picture as much as we can and see what happens there. Eventually, I'm going to argue that there is this theorem that for very large networks, or very large patterns, I can actually talk about typical inputs only. And that's really the large-scale learning hypothesis: I care only about typical inputs, not about every input. And typicality, by the way, is defined by the entropy of the input, which is a distribution-dependent quantity. So then I argue that only these two numbers — the information of the encoder, which is the information between x and the layer, and the information of the optimal decoder, this Bayes optimal decoder — are telling me the interesting story. And the interesting story for learning theory is the trade-off between sample complexity and accuracy: what is the generalization error, and how many samples do you need in order to achieve this generalization error in the best possible way. So again, I'm still under the dogma of statistical learning theory, and I'm talking about typical sample complexity. I'm not talking yet about the computational issues. And in this case, I argue that it's the mutual information of the encoder and the optimal decoder of the last hidden layer which tell me, at least in principle — it's not really a very practical bound, because I can't calculate these two mutual informations for any real problem in practice; I can only estimate them — but if I could have these two values of mutual information, then they would tell me exactly what is the best accuracy that you can achieve with a given sample size of random examples. OK, so what I want to do now is to convince you that this is actually true. So again, there are two issues here. One of them is understanding the role of these two numbers: information about the input, information about the desired labels. The other is understanding the structure of these dynamics — the SGD dynamics, the stochastic gradient descent, which takes me through these paths. And I want to understand both. What happens at the end? At the end of the story, I'm saying that this point E on this curve is where the last hidden layer eventually converges to, and the values of information about the input versus the output really tell me the story I want to know.
But then the really interesting story is what happens to the other layers. I mean, why do all of them, or most of them, move to the left? What dominates, and why do they stop at some places? Like L3, L4, and L2 — you see that the layers settle in very interesting places. They don't go all the way to the end. Yes, yes, absolutely. So the reason it doesn't: the weights can change, but as long as it's a one-to-one transformation of the input, there's no loss of information — no loss of information about either the input or the output. So let's say the first layer is only decoding some sort of encryption of the data — which can be crazy computationally — that doesn't change the information at all. So then it will stay there. That's right. The whole story about the dynamics of the gradient descent I want to delay for now, just for a few minutes. So that's the plan for today. I really want to understand this picture. And I want you all to believe me, in some sense, that this is a very typical picture: this is what's going to happen in most standard neural networks trained with regular SGD. Of course, the architecture is going to affect this picture in various ways. If I take, for example, all layers to be of equal width, or I play games with convolutions, or with resnets, or whatever — all those tricks are going to affect this picture, but not dramatically. They're just going to change the shapes of those lines, nothing more. Of course, as Bert said, there can be layers that don't do anything, which means that all the crazy dynamics of the weights is not going to change the information between the input and the layer, and those are going to get stuck or become degenerate. I mean, I can get things like this in the middle — several layers, one on top of the other, in this plane. This is a very degenerate picture: I'm taking millions of parameters and projecting them onto two dimensions. So obviously, there are many, many different networks that have exactly the same location in this plane. OK, and I actually argue that the degeneracy is really not what we care about; we care about this one. Yes, please — there was a small question: why not do this directly? Because we can't do that. So essentially, think about it this way. What the network is trying to do is bring all the patterns that have label one and all the patterns that have label zero into two separate subspaces, which can be linearly separated. That's it. And then eventually, it remembers only one bit about the input, which is precisely the bit that partitions the input into these two parts. Right. Right, absolutely. OK, yes: take just the input data. The input data has all the information about the label, and still you cannot extract it, because it's highly entangled, in some sense. I mean, there's no simple way of separating these two groups. OK, so bear with me. I promise to answer this question, OK? This is exactly the point. So what's really happening through these cascades of representations? Why is it that the last layer is so easy to separate — essentially, just one hyperplane puts one label on each side — while in the first layer it's almost impossible to do, impossible in most cases? I mean, there's no single pixel or single bit in my image that tells me this is me or not me. It's highly distributed everywhere.
So that's why I need to scramble it, or descramble it — to do this representation change. And the whole miracle of deep neural networks is how this happens. I mean, what in this stupid SGD algorithm is actually causing this separation? And why do the layers of the network actually help you? So I want to come back to the bound I showed you yesterday, the classical PAC bound of learning. It tells me that, in general, the generalization gap, or the generalization error, is bounded by the log of the epsilon-cover of my hypothesis space, plus something which depends on the confidence delta, divided by 2m. I proved it yesterday. And I said that, in general, the nice spaces are the spaces where the cardinality of the cover scales like (1/epsilon)^d, with some dimension d. And if you just plug this in there, you get (d/m) log(1/epsilon) as the dominant factor. So essentially, as long as m is smaller than d, I don't generalize, and once m gets larger than d, this epsilon bound becomes meaningful. And this is really the classical picture. And this d, in most cases, is the VC dimension, which is just a slightly more complicated way of defining this dimension — but it's exactly the same thing. And this is learning theory. As I said, it doesn't work for deep learning. Clearly doesn't work for deep learning. So I'm going to play a slightly different game, which I call the input compression game, or the input compression bound. So instead of focusing on the hypothesis class — the class of all possible functions, where I want uniform convergence of the generalization gap over the whole class — I'm going to look at x itself. And again, the motivation here is to understand neural networks. I want to understand how those maps to the layers really change the structure of the input. So each of those layers is some sort of a partition: if you think about the neurons as hard threshold functions, for simplicity, the layer is just mapping all the x's into some finite compartments, which can be hard or soft — I don't care at this point. So I want to characterize this partition, which I call the input partition. So this is my space of all possible x's, and essentially, I want to understand such a partition of the input. Let's say that all the x's in this small ellipse are mapped to one configuration of the layer, and all the others to other configurations. So this is some sort of partition — or clustering, if you want — of the inputs, characterized by the value of the hidden layer, which is a very large number, because I have a lot of hidden units. But in principle, I have a cover. So now, think about all possible functions — let's say all possible Boolean functions. This is a somewhat confusing notation, but this class, 2 to the cardinality of X — if X is finite, with a finite number of possible objects — is the number of all possible Boolean functions from X to {0, 1}. This is clear. So this is the case of no hypothesis class: I allow every possible function. By the way, if you plug this into this bound, you get the log of 2^|X|, which is the cardinality of X, over m — which tells you that essentially, you need to label all the data. So this is the no-free-lunch theorem, if you want. I can't get anything for free. If I don't assume something about the function, nothing will happen.
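As a back-of-the-envelope check of this no-free-lunch arithmetic — the numbers and helper name below are my own illustration — here is the cover bound eps = sqrt((ln|H| + ln(1/delta)) / 2m) with the class of all Boolean functions, |H| = 2^|X|, plugged in, using the 12-bit toy problem's cardinality:

```python
# No-free-lunch arithmetic: with |H| = 2^|X|, the PAC cover bound only
# becomes meaningful when m is of the order of |X| itself.
import math

def pac_eps(log_H, m, delta=1e-6):
    """Cover bound: eps <= sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return math.sqrt((log_H + math.log(1 / delta)) / (2 * m))

card_X = 2 ** 12                      # 4096 possible patterns, as in the demo
log_H = card_X * math.log(2)          # ln|H| for all Boolean functions on X
for m in (100, 1000, 4096, 40960):
    print(m, round(pac_eps(log_H, m), 3))
# eps stays above 1 (vacuous) until m approaches |X|: you must label
# essentially every pattern.
```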
OK, so just plug 2^|X| in there. You get |X|/m. |X|/m means that the number of examples has to be essentially the number of patterns. Not very interesting. But if my layer is actually compressing the representation, in the sense that it maps it into this cover — so that all the patterns in one cell are going to have the same label, eventually — then I need one label in every one of those cells, not more, because all of them have the same label. One label is enough to label all of them. So in principle, if I knew somehow to partition my input into classes, or into groups that have the same label, or very close labels, then the number of possible functions is not 2^|X|, but 2 to the cardinality of this cover, which I call T_epsilon(X). OK, so I actually argue that the layers are doing precisely that. Because of this Markov chain of representations, they eventually generate a coarser and coarser cover: the partition is always a coarsening of the previous partition. It cannot be otherwise — this is exactly the data processing inequality. I can only lose information. So I'm going to have this cascade of covers, and the cardinality then goes from 2^|X| to 2 to the cardinality of the cover. All right, this is true for any representation, even if there's only one representation. That's right, but then the layers are going to coarsen this partition. The number of parts — yes, the size of the partition gets smaller, absolutely. At this point, I'm talking about any layer, any representation; I don't care about the layers yet. OK, so now there's one piece of mathematics that I need to fill in here. So first of all, we move from deterministic rules to stochastic rules. I already said that; I just want to emphasize it again. So I assume that the rule is actually determined by this probability distribution of the label given x, and not by a function y(x). So instead of y equals some function of x, I'm actually thinking about y equals some function of x plus some noise. So there is some intrinsic pattern noise. And let's say that this noise is Gaussian, or whatever, with zero mean and some standard deviation; then this is going to create a stochastic rule. So I can only say that the probability of the label given x is determined by this noise. OK, so this is just a very standard trick in dynamical systems: I add noise to the input. Another way of saying it is that there's some sort of quantization on x, or simply that the rule itself is stochastic, which means that all I know is the probability of the label given x. Of course, if the noise goes to zero, if it's very small, then it goes back to the deterministic rule, if I want. But it's much more general to think about stochastic rules. Now, for technical reasons, which you'll see shortly, I really want this to be strictly within the simplex, which means p(y|x) is going to be greater than some delta, which is greater than zero, and less than 1 minus delta. So in some sense, I'm thinking about the simplex of y. You all know what the simplex of y is: it's the set of all possible distributions over y. So I'm just denoting it by a triangle, although this is actually the one-dimensional simplex, just the interval [0, 1], just for the graphics. So essentially, I'm cutting a little boundary of width delta out of the simplex.
So in some sense, I exclude the fully deterministic rules. Eventually, I'm going to take the limit of delta going to zero, but at this point, let's keep this delta inside. So I'm avoiding completely the deterministic rules: these margins of size delta, which can be very small, are going to be excluded from the simplex. And then the question is: OK, so what is going to replace my generalization error? The generalization error was the mismatch in the label between two hypotheses. It was a distance between, let's say, h1 and h2 in my class, which was in general the probability of the disagreement between h1 and h2 — this symmetric difference. This is just reminding you from yesterday. So this was the error that h1 is making with respect to h2. If the rule is stochastic, the most natural generalization of this error — considered by most people to be the natural generalization — is the L1 difference, or what we call the variational distance: the L1 difference between p1(y) and p2(y). OK, so this is L1, which means it's the sum over all y's of |p1(y) - p2(y)|. So in a more general framework: you have two distributions, p1 and p2, and this is the measure of the disagreement between them — it's this area. So this is the L1 norm, or what we call the variational distance. The variational distance in general is just the integral of |p1 - p2| dy, if y is a continuous variable. So this is just the L1 norm. And if you think about it, if the rules are deterministic — which means p1 is either 1 or 0 — this is going to be exactly the measure of disagreement. So this is a natural generalization of the error. Now, what is nice about this L1 norm — I denote it with a subscript one, and here I square it, I square this norm or this integral — is something we know, and I could prove it, but we don't have the time. This is a standard statement in measure theory and in large deviations; you can read it in Cover and Thomas. It's known as Pinsker's inequality. So Pinsker's inequality tells me that the squared L1 norm of the difference of these distributions, ||p1(y) - p2(y)||_1^2, is bounded, up to some constant, by the KL divergence of p1(y) and p2(y). Now, if you ask in which order — the L1 is symmetric, the KL is not symmetric — it doesn't matter; take the smaller of the two. OK, so this is known as Pinsker's inequality, and you can read about it in Wikipedia if you want, and see the proof — the proof is actually very simple and elegant, I don't want to discuss it. There is actually also what we call an inverse inequality, where the L1 norm bounds the KL divergence from above — but this one is highly sensitive to the minimal probability. If this minimal probability is zero, it blows up, because a factor of one over this minimal probability appears. But in principle, what I argue is that this bound is going to be tight in some limits. So both the KL divergence bounds the squared L1 norm, and the KL divergence is bounded from above if you assume this delta separation, so there are no zero probabilities in this space — it's going to be bounded by one over delta times this norm.
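Here is a quick numerical sanity check of Pinsker's inequality, ||p - q||_1^2 <= 2 KL(p || q) — my own sketch with random distributions, not anything from the talk; the constant 2 is the standard one (in nats):

```python
# Verify Pinsker's inequality on random strictly-positive distributions.
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """KL divergence in nats; assumes strictly positive q (delta-separated)."""
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(6))      # random distribution, positive a.s.
    q = rng.dirichlet(np.ones(6))
    l1 = np.abs(p - q).sum()           # L1 (variational) distance
    assert l1 ** 2 <= 2 * kl(p, q) + 1e-12
    print(round(l1 ** 2, 4), "<=", round(2 * kl(p, q), 4))
```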
So the one over delta is a bit tricky, but I don't need it here. In principle, if I'm really bounding my distribution to be inside the simplex, then the log of the probabilities is well behaved — there are no zeros, everything is bounded, and that's a nice setting for mathematicians. OK, now, this is nice, because I can bound the KL divergence on average between any x and any representation t of x. So instead of p1 and p2, I now look at two conditional distributions: how y depends on x versus how y depends on t. So this is the optimal prediction of y from the data, and this is the prediction of y from the representation, and it's this KL divergence which I want to minimize. Now, there is a very simple relation using this Markov chain. I'll give you another small exercise — I simply don't have the time to prove it here. If I average this D_KL between p(y|x) and p(y|t) — where t is any representation of x, anything that obeys this Markov chain, and p(y|t) is in this case the Bayes optimal decoder, the best I can do, the same Bayes optimal decoder I wrote there — so I take this KL and average it with respect to both x and t, which means averaging over the encoder — then this precisely equals I(X;Y), the original mutual information between X and Y, minus I(T;Y). So the average D_KL equals I(X;Y) minus I(T;Y). This is a very simple lemma that I want to write here. But what it means is that if I increase the mutual information about Y — which is, remember, the y-axis of my information plane coordinates — then I improve generalization. Because there is a direct bound: this difference is larger than the squared L1 norm, according to Pinsker, and the L1 norm is the generalization error in any reasonable measure. OK, so the first axis is very clear. Any training algorithm which improves generalization, no matter what, has to move the points of the representations up in this plane. OK, so that's obvious — obvious now. So that's why we must see this increasing information. No matter if I use stochastic gradient descent or anything else, coming up like balloons in this plane is something I expect from any learning rule — including our brain, by the way. I mean, if my brain is using whatever learning rule, and it improves generalization — and I do get very good performance — then I expect the encoders of the representations of the data in my brain, whatever they are, to behave this way. They're very complicated: in V1 I have different representations than higher in the cortex, and there are certainly several layers in vision, and similarly several layers in auditory perception, and not everything is so nicely layered — but whatever happens in the brain, the information about Y should increase. Now, it actually gets interesting to ask: what is the structure of this plane in terms of encoders? I mean, where can I be at all? So it's actually very easy to see that there are no points here. There are no encoders that keep all the information on X and no information on Y. Think about it for a second: if an encoder keeps all the information on X, then this Markov chain implies that it keeps all the information X has about Y. So there are no points here; this is an empty region.
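Returning to the lemma quoted above — average D_KL[p(y|x) || p(y|t)] = I(X;Y) - I(T;Y) for the Markov chain Y - X - T with the Bayes optimal decoder — here is a small numerical check; the toy distributions and sizes are my own, and the identity should hold exactly:

```python
# Numerically verify: E_{p(x,t)}[ KL(p(y|x) || p(y|t)) ] = I(X;Y) - I(T;Y).
import numpy as np

rng = np.random.default_rng(2)
n_x, n_y, n_t = 6, 3, 4
p_x = rng.dirichlet(np.ones(n_x))
p_y_given_x = rng.dirichlet(np.ones(n_y), n_x)   # the rule
p_t_given_x = rng.dirichlet(np.ones(n_t), n_x)   # an arbitrary encoder

p_xt = p_x[:, None] * p_t_given_x
p_t = p_xt.sum(0)
p_y = p_x @ p_y_given_x
p_y_given_t = (p_xt / p_t).T @ p_y_given_x       # Bayes optimal decoder

def mi(p_joint, p_a, p_b):
    """Mutual information (nats) from a joint and its two marginals."""
    return np.sum(p_joint * np.log(p_joint / np.outer(p_a, p_b)))

p_xy = p_x[:, None] * p_y_given_x
p_ty = p_t[:, None] * p_y_given_t
# Average KL over the joint p(x, t):
avg_kl = np.sum(p_xt * np.sum(
    p_y_given_x[:, None, :] *
    np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]), axis=2))
print(avg_kl, mi(p_xy, p_x, p_y) - mi(p_ty, p_t, p_y))   # the two sides agree
```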
On the other hand, what happens higher up? So actually, we calculated this — I don't have it here, it's in a new paper. Most of the encoders, if you take random encoders, just random weights, are going to lie on a line here, so there's actually a very big density of encoders somewhere in the middle. It depends on the problem, on p(x, y), and on what type of random encoders you actually consider — but there are no encoders here, and once you go up, the number of possible encoders gets smaller and smaller. Exponentially smaller, actually. Because you put this constraint on the information about the label, and this is going to dilute the number of possible encoders, until you get to very high information about the label — and then something interesting happens there. So the most natural question to ask is: what is the limit? I mean, how far up can I actually go for any given representation? And also, of course, what happens to these layers — why do they spread out so nicely in terms of information about the input? OK, so now I'm going to go back to this bound. So remember, I'm talking about what we call typical patterns. Again, just to remind you, what I showed you yesterday is that for independent inputs, for whatever distribution, this limit — minus one over n times the log probability of the sequence — is exactly the entropy. This is actually true even if there are dependencies. So let's say my x is a first-order Markov chain; then I factorize it conditioning on the previous variable: p(x1, ..., xn) is a product p(x1) times p(x2|x1) times p(x3|x2), and so on. It's a Markov chain. And in the large-n limit, it again looks like a product of conditionally independent terms, the central limit theorem works under very wide conditions, and I get the same concentration, to the same number. And if you think about graphical models, for example, each variable depends on some parents — the variables that influence it, directly connected to it. And again, if this neighborhood conditioning is more or less uniform everywhere — like in a picture, for example, where every pixel is largely determined by its neighborhood: if I give you the colors of the neighbors, in most cases I can predict the color of this pixel very well — then there's this Markov conditioning, and again this holds. Actually, there's a very general theorem called the Shannon-McMillan-Breiman theorem, which tells us that under very mild and very wide conditions — ergodicity in the dynamical systems case, or some sort of bounded-degree graphical model, or something like this — this limit exists, and the same argument about concentration of the limit to the entropy holds in a much more general setting. OK, so I'm going to again consider only typical patterns. And this is true for Markov random fields and hidden Markov models and Hamiltonians with pairwise interactions which are bounded, and so on — most common graphical models. I mean, essentially everything that we really use in machine learning, or in physics, by the way: Hamiltonians with a finite or bounded number of interactions, bounded on average. Even the SK model, which has essentially random interactions, obeys this rule in some sense — that's spin glasses. So it's certainly true for things like images and speech and text and all the things to which we really apply this. So this is a very general statement.
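To make the typicality claim concrete, here is a small demo of the asymptotic equipartition property — my own illustration with i.i.d. bits, the simplest case the talk mentions; the Markov-chain and graphical-model cases behave the same way under the stated conditions:

```python
# AEP demo: -(1/n) log2 p(x_1..x_n) concentrates on the entropy H as n grows,
# so typical sequences all have probability close to 2^{-nH}, and the typical
# set therefore has size about 2^{nH}.
import numpy as np

rng = np.random.default_rng(3)
p1 = 0.2
H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))    # about 0.722 bits

for n in (10, 100, 10000):
    x = rng.random((1000, n)) < p1                      # 1000 random sequences
    k = x.sum(axis=1)                                   # number of ones in each
    logp = k * np.log2(p1) + (n - k) * np.log2(1 - p1)  # log2 p(sequence)
    print(n, round(H, 3), (-logp / n).mean().round(3), (-logp / n).std().round(4))
# the mean stays at H while the spread shrinks like 1/sqrt(n)
```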
So this typicality argument is going to hold in a very general setting. And then notice that all the typical patterns essentially have the same probability, which is just 2^{-nH} — or e^{-nH}, depending on the base of the log; 2^{-nH} if I'm using bits — where H is the entropy. So they're all equally probable, which means that the size of the typical set is exactly 2^{nH}: all of them are equally likely, and we're talking about finite sets, so the size is just 2^{nH}. So this is really a very important part of this typicality argument: all typical patterns have exactly the same probability, which is determined by the entropy. Now, I'm going to assume that my partition — the partition induced by the network — is large enough that the conditioning is also typical: the patterns which are mapped into one of these cells, conditioned on the partition, on the clustering that is induced by the layers, are also typical, which means that I can estimate the probability of a pattern in such a cell by 2 to the minus n times the conditional entropy. Now, this is a slightly tricky argument — you have to be careful there — but for physicists, this should remind you of the argument about the Gibbs distribution in thermodynamics. One of the ways of generating the Gibbs distribution is to take, let's say, my glass of water and think about small parts of it, small drops which are large enough to assume equilibrium in each one of them — and then I get exactly the same argument. So this is just assuming that the partition is sufficiently large to apply the typicality argument even to the patterns inside each cell of the partition. OK, so I'm going to use this in order to refine this bound. And of course, there's also an issue about the concentration of the information: why do I get these points so concentrated? OK, yes — your comment about spin glasses, keep it aside for a second, I don't want to get into it. I'm talking about simple things at this point. Spin glasses are a little more complicated and need more time — not that we can't say anything about them, but at this point leave it aside, OK. So what I'm going to say is that it's actually very easy to see why those two quantities, I(X;T) and I(T;Y), really concentrate. The first one: if I indeed have this factorization property — which means that p(x|t) can be written as a product of p(x_i) given the parents in the graph and t, and p(x) can be written as a similar product — then the log probability is a simple sum of conditionally independent terms, by exactly the same argument, and the central limit theorem tells me that these things are going to concentrate. I(T;Y), by the way, is a little more tricky, because it's a sum of products. It's like partition functions in statistical physics: sums of products also tend to self-average. And as you all know, for example in spin glasses, or in mean field theory, when we calculate the free energy of a spin glass, we average the log of Z, not Z. If we average Z, we get the annealed approximation, which is usually wrong; but the log of Z is what we average. Why log of Z? Because log of Z concentrates.
So for similar reasons, which you can argue much more carefully, I(T;Y) also concentrates: although it looks like a sum of products, there is a most likely product in this sum which is going to dominate everything in the large-n limit. So it's actually quite clear why these two numbers really concentrate. You see that they concentrate pretty nicely even for this very small problem — 12 bits only. And as we increase the network, this concentration gets sharper and sharper, until essentially you see one point there. It's really very, very nice, even for these convolutional networks or whatever you want. The only issue is how to estimate the information. So I'm going to go back to this bound, and now I'm going to refine it. So I want to estimate the size of this partition. How do you do it? What I just told you is that the size of the typical set of X is just 2^{H(X)}, where I suppressed the n here: rather than looking at the entropy density, I look at the entropy itself. So this is 2^H. This is just the typicality argument. Now, I assume that each of those cells is also typical in the same sense, which means that the size of each of these cells, on average, is precisely 2^{H(X|T)}, the conditional entropy. This is what I just said. OK, so if you buy this, what is the cardinality of the partition? It's 2^{H(X)} divided by 2^{H(X|T)} — the total volume divided by the average volume of these cells. And what is that? Precisely 2^{I(X;T)}: H(X) minus H(X|T) is I(X;T). It's the mutual information. OK, so this is precisely, by the way, the argument used by Shannon in his coding theorems. Those of you who find this familiar — it's the same argument he uses in both coding theorems, the channel coding theorem and rate-distortion theory. One is a covering and one is a packing, but it's the same argument. And I'm using it in a slightly different way. I'm saying that this cardinality — which really dominates the number of functions, or the number of labels that I really need, if I eventually get homogeneous cells — gives the number of functions as 2 to the 2 to the I: because the cardinality of T_epsilon is 2^I, and the number of functions is 2 to the cardinality of T_epsilon. So this is somewhat surprising: you get this double exponent, 2^{2^I}. This is where people stop believing me at some point. But it's true. So now you take the log in the cardinality bound, and you get that epsilon squared is actually bounded by 2 to the mutual information between X and T, plus log(1/delta), divided by 2m — but the log(1/delta) is going to be negligible in the large-problem limit. Let's say you take delta to be one over a million, 10^{-6}: then this term is six, in decimal, while the other terms are of order millions. Who cares? So the whole argument about confidence becomes negligible in the typical large problem. But what is really surprising is that it's 2 to the mutual information which acts like the dimensionality of the class. So this gives this algorithm a very good incentive to actually move to the left: if my last layer — which is really where I'm actually doing my prediction — has very small information about the input, then this gives a very tight bound; the tightest possible bound on epsilon will be for the smallest mutual information. But if you look at this bound, it's a very surprising bound.
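Here is the same arithmetic as before, now with the compressed cardinality plugged in — eps^2 <= (2^{I(X;T)} + ln(1/delta)) / (2m). The numbers are my own worked example; the point is that the confidence term is negligible and everything hinges on whether I(X;T) is below log2(m):

```python
# Input compression bound: eps^2 <= (2^I + ln(1/delta)) / (2m).
import math

def compression_eps(I_bits, m, delta=1e-6):
    return math.sqrt((2 ** I_bits + math.log(1 / delta)) / (2 * m))

m = 1_000_000
print("log2(m) =", round(math.log2(m), 2))
for I_bits in (5, 10, 20, 30):
    print(I_bits, round(compression_eps(I_bits, m), 4))
# once I(X;T) exceeds log2(m) (about 20 here), the bound blows up;
# below it, eps becomes small -- the incentive to compress.
```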
I must say that I didn't believe it myself for a while, until we actually found other ways of proving it, because what I did here looks like a little bit of black magic. I mean, I'm using this typicality argument, and then I'm estimating a class of functions which I'm not really using, because it depends on the partition — and the partition itself is changing during the training, which is a big no-no in learning theory. If you actually change your hypothesis class during the training, you're bound to overfit. That's what they all tell you. So I have to be very careful with this type of argument. But at least naively, this is surprising. So look, when is this meaningful? This is meaningful when I is of the order of log m — the log of the number of examples. Only when I is smaller than log m does this bound on epsilon become meaningful. And this is a bit surprising, because log m seems like a very small number. I mean, let's say I have a million examples; then log m, let's talk decimal, is six. So I need a compression: I has to be about six for a million examples — a very small number — to make sense. But now, if you know a little bit more about learning, there's the notion of sample compression in learning, which is really a very old notion. It goes all the way back to the 80s, to Warmuth and Littlestone and people like that, and it asks how many labels you really need — not if you get random labels on random patterns, but what is the minimal number of labels you need if you can ask the smartest possible questions. This is called the query complexity. If I do active learning, I don't have to label everything; I have to label very few. Now, if you know something about support vector machines, let's say, these are exactly the support vectors: the very few points such that, if I know their labels, I know the labels of everything else. And this minimal number is precisely, in general, of the order of the log of the number of random patterns. So only about the log of the number of random patterns really need labels — and mutual information, by its very definition, is the minimal number of labels that I will need in order to actually generalize. So that's why this makes perfect sense: I should be of the order of log m, where log m is essentially the query complexity of my data. OK, so now the question is whether it actually works. Does it give me better bounds? So if you look, even in my movie, the mutual information gets below log m — not at the beginning, but in the very last layers it eventually goes below log m, and then it actually starts to generalize. That's what this bound is telling you. So I'm going to use this bound. OK, so this is a little bit tricky, I know, and we actually have a nice paper proving this bound in a very rigorous way, including this worry about generalization. Because in principle, first of all, there are many, many possible partitions. As I said, you can have the same information with many, many possible partitions, and when I'm low in information about y, there are many, many possible partitions of my data. So obviously, this bound becomes meaningful only when you get high in the information plane. OK, so how high can you get in the information plane? That's really where the information bottleneck comes into the game.
So I want to explain it as much as I can without boring you too much. So in general, I told you already that in order to generalize, you must get high. I'm now telling you that if you really want few samples, you also want to get left, because you want to reduce this I(T;X) — and I(T;X) or I(X;T) is just different notation for the same thing: the information of the representation has to be small. So the question is, how far can you go in this plane? OK, so if you're a physicist, what would you do in order to solve this question? You simply ask: what is the maximum information about Y that I can get at a given compression — I call this the compression of the representation — or, in other words, what is the minimum I(X;T) that you can get at a certain level of I(T;Y)? So this is a very simple variational problem. What you do is minimize I(X;T) over all possible encoders, subject to a constraint on I(T;Y). This will give you the limit. And you put in a Lagrange multiplier, which I call beta here — it has to be a positive Lagrange multiplier, because I actually want to minimize the function subject to a positivity constraint. This beta looks very much like one over a temperature. That's what it looks like, and actually, that's exactly the intuition we had for many years. So beta is in some sense a resolution parameter, which controls the size of those covers. Very large beta is very low temperature, which means a very fine cover; very low beta is high temperature, which means a very coarse cover. So some algebra, which is really not difficult — proving this, by the way, is actually a trivial thing: you just take the KL divergence, take the log ratio of p(y|x) and p(y|t), and write it as the difference between log of p(y|x) over p(y) and log of p(y|t) over p(y), and that's it, end of story. So I'm actually using something similar here; that's why I came back to it. So essentially, the interesting thing about this particular variational problem is that it has an explicit solution — or an implicit solution, if you want — because it tells you that the optimal encoder, the one which sits on the line beyond which there are no encoders at all, is given by this exponential of minus beta times the KL divergence between p(y|x) and p(y|t), the prediction from the representation. (There's a bracket missing here on the slide.) OK, so this is nice in some sense, because remember, this KL was a bound on the error, the generalization error. So essentially, it's telling you that a good compression will have small generalization error. OK, that's nice. And everything else is just technical. I mean, this is a normalization, some sort of a partition function — so there's this Z there. And these equations have to be self-consistent, in the sense that the decoder here has to be this Bayes optimal decoder in red. So essentially, what these equations are telling you is that the decoder and the encoder are self-consistently related through these equations: this is coming from there, and this goes there, and you iterate. So this is what we call an implicit solution: I need to solve it iteratively, and Bert can tell you a lot of things about that. It's not always converging; it's not a convex problem. It cannot be, in general, because there are all sorts of different phases here, all sorts of suboptimal solutions. But in principle, it's very simple.
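Here is a compact sketch of these self-consistent iterations — the classic iterative information bottleneck updates just described. The toy p(x), p(y|x), the sizes, and the value of beta are assumptions for illustration; the next paragraph continues describing the same equations on the slide.

```python
# Iterative IB: alternate p(t) and the Bayes decoder p(y|t) with the
# exponential encoder update p(t|x) ~ p(t) exp(-beta * KL(p(y|x) || p(y|t))).
import numpy as np

rng = np.random.default_rng(4)
n_x, n_y, n_t, beta = 16, 2, 8, 5.0
p_x = np.full(n_x, 1 / n_x)                         # uniform toy input
p_y_given_x = rng.dirichlet(np.ones(n_y), n_x)      # toy stochastic rule
p_t_given_x = rng.dirichlet(np.ones(n_t), n_x)      # random initial encoder

for _ in range(200):
    p_t = p_x @ p_t_given_x                                    # marginal p(t)
    p_x_given_t = (p_x[:, None] * p_t_given_x / p_t).T         # Bayes inverse
    p_y_given_t = p_x_given_t @ p_y_given_x                    # Bayes decoder
    # KL(p(y|x) || p(y|t)) for every pair (x, t):
    kl = np.sum(p_y_given_x[:, None, :] *
                np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]),
                axis=2)
    p_t_given_x = p_t[None, :] * np.exp(-beta * kl)            # encoder update
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)      # Z(x, beta)
```

Sweeping beta from small to large traces out the black curve: coarse covers at low beta, fine covers at high beta.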
This is the exponential of the bound on the error — the KL divergence between the prediction from the data and the prediction from the representation. This was t on my other slide; it's the same thing. And this is the best you can do with this particular encoder. So you just iterate them, and this last equation is just to keep the probabilities normalized — you also have to estimate the marginal correctly. Yes — it was not at the top left. Go back to the movie, and you'll see that there's a slight curvature there. The rule was slightly stochastic, and because it was slightly stochastic, it didn't get to a perfect one bit, but a little lower. I'll come back to it. That's a good question. So here I'm exaggerating, and I'm talking about a very stochastic rule. And what I want to say is that, in general, this black line is the solution of those iterative equations for different values of beta. And actually, for the same reasons that you see in thermodynamics — beta is, let's say, the Lagrange multiplier of the energy in the free energy, and the slope of the energy-entropy trade-off in thermodynamics is one over beta — here, too, one over beta is the slope of this curve. So there is actually a finite slope at the origin, which is an interesting phenomenon. It's called the lower critical beta, below which you don't get any solution. So if beta is too low — which is around order one — there's no nontrivial solution to these equations. The only solution is independence, at the origin there. But once you increase beta, you start climbing up this concave curve, the black line. And eventually, when beta goes to infinity, the slope goes to zero — it's one over beta — and this is the high-resolution limit, where essentially you keep all the information. But what is really interesting is that this is an information theoretic bound. There are no encoders above this line, no matter with what algorithm. Given the rule, this is a wall. Even an alien coming from another galaxy will not do better than this line. OK. So now the interesting question is whether my algorithm is actually pushing me to this line — whether there is anything in stochastic gradient descent which is actually forcing us to be close to this line in some sense. This was the mystery I had five years ago. So at that time we actually put up a neural network like this — h1, h2, and so on — and we knew that it has to move up, but we had absolutely no idea what happens, until we did the simulations. The simulations showed us these very interesting trajectories, which had to be understood. I mean, you can't just leave it like this. But what is really interesting about this line, about this problem of finding compact encoders for a given generalization error, or for a given information about y, is that there are two other details in this picture which I want to come back to. One of them is those suboptimal blue lines. Those are some sort of bifurcations, which have to do with topological changes in my encoder. I'm going to spend most of tomorrow's talk on the nature of these bifurcations — why they're so important, or so interesting, and why they eventually determine what the layers of the neural network are going to encode. But there's also this red line, and this red line is really important. So remember that we never have the joint distribution p(x, y). I mean, if we had it, I wouldn't have to do anything. What we get is a sample.
We only have a finite sample of p(x, y). We have training data. So I want to talk a little bit about the nature of this line. So essentially, I can recast the theorem, the generalization error, that I had before. Let's call p empirical, or p_emp, the empirical distribution which I get from a finite sample. OK, so you can think about it any way you want — as a sum of delta functions on your samples, or as some sort of histogram, or whatever estimate of your distribution. And what we ask, essentially, is how far the error using the empirical distribution can be from the error using the full distribution. So take this Bayes optimal decoder: let's say that I don't have all the data, I have only a sample here. So I can't really compute this complete sum, because I don't have all the data; I'm only summing over the sample. Then I'm going to get some sort of approximation of this, which can be very noisy if the sample is small. So this is, in some sense, the best the network can do using the finite sample. So we wanted a bound on the difference between the true mutual information and the empirical mutual information. And this was proven together with Ohad Shamir and Sivan Sabato in 2008 — we have a long paper on the bottleneck and generalization, which essentially establishes this red curve. And you see, this looks familiar. Essentially, what we know is that the true mutual information is bounded by the empirical mutual information plus something which looks like the square root of 2^I over m, times the cardinality of Y, which is two here — there's actually a factor of two there, but never mind. So it's of the order of the square root of 2^I over m — and this was exactly my previous bound, if you remember: the square root of 2^I over m was this epsilon. So this is essentially the same thing, but here it is actually proven in an entirely different way, using McDiarmid's inequality and a very careful empirical convergence analysis. You can read it in that paper, from 2008. But what this tells you is that the difference between the true information and the empirical information grows like 2^I over m, with the square root — I don't care about the square root. So what we plotted here, the red curve, is precisely that. If this is your empirical information — in this case, think about the black line as the best you can do with the sample — what the bound is telling you is that the red line is what you're actually going to learn eventually. Now, because of this 2^I factor there, I'm going to do worse when 2^I is very large. This gap is exponential in the information, and of course, if I under-compress — if I'm on this side of the story — I'm going to have a huge loss in generalization. Now think about it this way; it's very easy to understand. Imagine that here, at the high end, I have this very fine partition of my X, but I don't have enough labels. So many of those cells are going to be empty: I don't have enough labels to label all of them. And if they're empty, with no labels, I'm going to make a random prediction there. So if I have very few labels and a very fine partition, most of the cells are going to be empty, and I'm going to make a very poor prediction. On the other hand, if I'm compressing too much, then my compression bound is going to kill me.
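As a back-of-the-envelope illustration of this finite-sample correction — my own numbers, not reproduced from the 2008 paper — the gap between true and empirical I(T;Y) scaling like sqrt(2^{I(X;T)} |Y| / m) means under-compressed layers pay an exponential price:

```python
# Finite-sample information gap: roughly sqrt(2^I * |Y| / m).
import math

def mi_gap(I_bits, m, card_y=2):
    return math.sqrt(2 ** I_bits * card_y / m)

m = 3000                        # roughly 80% of the 4096 patterns in the demo
for I_bits in (4, 8, 11, 12):
    print(I_bits, round(mi_gap(I_bits, m), 3))
# at I(X;T) near H(X) = 12 bits the correction swamps the single bit of
# label information; compressing to a few bits makes it negligible.
```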
So somewhere in between, this bound reaches a maximum of information, which is exactly the point where I have enough labels for all my cells. Essentially, this is where the correction becomes of order one: 2^I is the number of cells — the number of clusters, the number of components — and when m is essentially of this order, I have enough labels. I need the square root because I need a little more than one label per cell, a few more. So when m gets to essentially the size of the partition, I have the best generalization. OK, so it's the red line which I really want to worry about. So now think about the neural network again. The layers are going to compress, and eventually they want to put the last layer here. This is the best you can do. OK, if I manage to put the last layer there somehow, then I'm going to do well. That's the essence of the story. And remember that everything here — the fact that it's mutual information that matters — was based on this typicality argument, and that's why I'm calling it large-scale learning. OK, yes. No, no — I'm thinking about the bound on the generalization error. This difference is just the generalization gap in disguise. I mean, I just showed you that information and generalization are related to each other. So the difference between these two informations is the same as the difference between the total, average error and the empirical error. It's the same thing. It's just the gap written in a funny way. So that's why we put this curve here. So it's the red curve, and you see that in general you're going to have two types of losses, which are losses in information and therefore losses in generalization. One of them is the compression loss, which is something you have to pay if you want to move to the left. Unless you have — yes, you're right. This is a proxy for the empirical one. I don't really know the true distribution, but I'm telling you that it's going to be governed by that. So this is why I call it a useless bound: you don't have the true distribution. If you replace this by the empirical one, you can have a much larger error. That's something I'm putting under the rug at this point. So you're right. This is not a very practical thing, because in order to estimate this, you need to know the true distribution, or an estimate of this compression, and this can vary wildly, because it depends on the cardinality of X in a very ugly way. So if you look at this paper with Ohad and Sivan, you'll see that this is tricky. I mean, this is not cheating; it just makes it useless, but it's all right. OK, so you caught me. But on the other hand, I want to emphasize here that in general you're going to have these two losses. The compression loss is the bottleneck loss: even if I had the true distribution and I wanted to compress the representation, I would lose this first part. And then the difference between the red and the black is what I call the finite sample loss. And of course, if I have more data, the red line will climb up and eventually get very close to the black line at some point. Now, everything here is stretched. In the picture that I showed you, in this simulation, you can see that actually there is a slight curvature — you see that there is actually a curved line there.
That curved line is very, very close to a straight line dropping from one bit to zero, which is the deterministic limit, but the deterministic limit is not interesting here, because a relatively small noise was added to the rule; I'll come back to what this rule is exactly tomorrow, when I talk about symmetries, and it's actually very interesting what happens there. So just notice that there is a slight curvature; it's not a perfect line. And I actually argue that this slight curvature is precisely the limit, the bottleneck limit. Think about it this way: the number of encoders is very high when you're low in this plane; there are many, many random encoders with very poor information about Y. But as you get higher, this number of encoders is diluted very quickly, and when you reach the line, it's the solution of the bottleneck problem: essentially a single encoder, up to a permutation of the clusters, which is not informative; a single encoder which encodes the data optimally. So on the line this is a unique solution. Now, the layers of the network don't know anything about bottlenecks, unless you're Stefano Soatto and you actually plug the bottleneck into your cost function; some people do that, using the bottleneck as a regularizer. I don't do it, and I don't like it, because I want to show that even if you don't do it, eventually you're pushed to this limit. And once you are close to the black line in the plane, the bottleneck equations are enforced on you: you must obey them, because that's the only game there, the only encoder that can work near this bound. So in some sense the bottleneck solution is the solution the network reaches for each one of the layers, if indeed they reach very good generalization. Yes. I want this to be perfectly understood, so let me give the more intuitive picture again. The actual encoders are always below the red curve. But if they don't compress enough, as I said, many of those cells are going to be empty and won't predict well; you want the cells sufficiently coarse that you eventually get enough labels to cover all of them. You're jumping ahead; I'm going to talk about linear separability eventually, and it is important. You're right: the fact that I coarsen my representation and force the partition to be homogeneous with small distortion, meaning the same label probability within each cell (that's what this distortion measure means), is what eventually makes me generalize well and be linearly separable. But that's a separate issue, a very important one, and I'll come back to it. So I just want to summarize this part of the talk. What is the time now? How much time do I have? About half an hour. Okay. I wanted to have a break right here, but it's a little too late for a break, so I'll go slowly. So this is just to show you what happens with finite samples. These are the information plane trajectories; color here is the number of epochs: black is zero epochs, yellow is 10,000 epochs, and in between I go through these colors linearly. Now, what you see here is training on 80% of the data: 2^12 patterns, 4,096, and 80% of that is about 3,000.
So here it's well trained, and you see that these trajectories go to this line, eventually compress, and converge very high up on the line there; it's almost straight. When you reduce the number of data points, so this is 45% of the data, and this is 5% of the data, you see this very striking collapse. The first part is the same: they all go all the way to the green line, essentially in the same way. Even with very little data they reach this line, which is a very important line; I'll come back to this transition from learning about the data to compressing the data. But then the compression actually hurts you: you see the information goes down, you don't even get close. So there's another line here, which can be calculated; this red line is essentially some sort of crossover between the red and the black, the finite-sample bottleneck line, which I'm not going to discuss today. But you see what happens: this is, as close as you can see here, what you would call overfitting. Why? Because you don't have enough data, but you try to compress more than you should; I actually call it over-compression. And then you end up with poorer information. You do get good compression, in the sense of a very coarse partition, but the labels are so sparse that you don't have a reliable estimate of the distortion, and therefore you miss. So this is essentially overfitting, this one is well trained, and there is some crossover in between, which essentially amounts to the bottleneck curve bending with the data. That's actually a very nice piece of the theory, which I'm not going to discuss today. Okay, so that's the finite-sample story. Now we want to understand: first of all, is this a general story? So this, by the way, is what happens in an entirely different problem: a committee machine. For those of you who don't know what that means, this is a neural network we loved to play with in the '80s, because it's analytically solvable in some limits; there's a toy sketch of it right after this paragraph. Essentially, each neuron here is a majority. Think of the first layer as a random committee: each unit is some random projection of my space, and then I take a vote, like in committees, and the output is the majority; and I do this several times, a cascade of committees. And you see, in this case (sorry, I didn't talk about the curve on the right, I'll get to it in a second), the same type of trajectories: the layers go up and then left, and they do it very quickly; it's a much easier problem. Yes, the limits are independent of that, but the trajectory, the way you reach the limit, does depend on it. This, by the way, is what happens on MNIST, which is a real problem, the handwritten digit recognition problem we discussed yesterday or two days ago. So this is real life; well, not really real life, but it was the famous benchmark of machine learning. These, by the way, are ReLUs and not sigmoids, and these are convolutional neural networks, whereas before everything was fully connected, completely unconstrained; and you see essentially the same type of picture, even for a single network. And you see this even for CIFAR-10, where it was much harder: it's very hard to estimate the information in such high dimension, but it's enough for me to see a similar type of trajectory. Okay, so it's true in every network we looked at.
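Here is the promised toy committee machine (my own minimal sketch; the widths and the input size are arbitrary, and odd widths just avoid tied votes):

```python
import numpy as np

rng = np.random.default_rng(0)

def committee(t, n_units):
    """One committee: n_units fixed random projections, each unit votes +1 or -1."""
    w = rng.standard_normal((n_units, t.shape[-1]))
    return np.sign(w @ t)

def committee_machine(x, widths=(65, 17, 5)):
    """A cascade of committees; the output is the majority vote of the last one."""
    t = x
    for n in widths:
        t = committee(t, n)
    return int(np.sign(t.sum()))

x = np.sign(rng.standard_normal(129))     # a random binary input pattern
print(committee_machine(x))               # prints 1 or -1
```

Each layer is a random projection followed by a sign, and the readout is a vote: exactly the kind of cascade described above.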
This, by the way, is a very interesting network, which Ryan St. Polanski asked me to try: a network where all the layers have the same width, so there's no reduction in dimension. And you see that only the last layer has this funny behavior. The other layers stay up: they never lose information about Y, they keep all of it, because each is just some one-to-one transformation of the input. But they do compress: you see that they move, this one got here, this one got here. All the compression happens at the top, along the top line. And only when they compress does the last layer make this nice improvement in generalization. You see the generalization improvement, which is the second part of the training: when you move from 0.5 or 0.6 bits to one bit, that is the most significant improvement in generalization, and it happens together with the motion of all the other layers to the left, which means they lose information about the input. This is why I say that an important part of learning, in terms of the number of training epochs, is the forgetting: learning to forget what's not important. So now I want to move to the second part. Yes, question. Yes; no, they're all lying on this curve, and they constrain one another; I'm going to talk about it. Of course, that's the whole point: they affect each other. The compression is not independent, because of the Markov chain. That's exactly where I'm going now. Okay, so I want to understand the dynamics of this picture, and for this I need to look at SGD, at stochastic gradient descent. The first thing to do is to watch this movie again, but this time together with the error. This is actually the generalization error, but the training error looks very similar. And what you see is that the point where the layers start to move left, where they start to compress, is precisely at the knee of the error. The error drops very quickly, from 0.5 to about 0.1, within a few hundred epochs; actually, on the larger MNIST problem it's only something like 10 epochs, very, very fast. And then the error essentially flattens. Think of it as the training error: it's the average error, that's why it's so smooth. It then goes down very slowly, but most of the improvement in generalization is exactly where it goes down from one in a thousand to one in a million. That's where people want to pay the money; this is really the important, the most valuable part. So this is a very misleading curve: it looks flat, but you should view it on a logarithmic scale. The most valuable part of the generalization error is really the last part, when you move from 10 to the minus three to 10 to the minus six to 10 to the minus nine; that's where you really want to be. And most of this happens while the layers compress up there: you see they move very slowly, it slows down, but they very slowly move to the left. So what's going on here? How is stochastic gradient descent doing this? Everything is tied to the knee of the error, and it's easy to see why; I can show it on the board. The derivative of the error: let's do this, this is a school after all, no?
So the derivative of the error, if you just look at its absolute value, is the sum over the layers of the derivative of the error with respect to the weights of the k-th layer (this is shorthand for the gradient of the error with respect to the weights of the k-th layer, which I call W_k), times the derivative of those weights with respect to time. This is just the chain rule, and k runs over all the layers, so it holds for every layer. Now, stochastic gradient descent tells me that dW/dt is minus the gradient of the error, plus noise. It's not really simple noise; this noise is tricky, and I'm going to talk about it: it's different for every layer, it's state-dependent, it's anisotropic, whatever you want. But its average is zero, because the noise is defined as the difference between the mini-batch gradient and the full gradient (remember mini-batches, you've heard about them), so if you sum over all the mini-batches you get zero; it has zero mean. So if I plug this in, I get that the derivative of the error is minus the sum of the gradients times the gradients, plus the noise. And when you average over mini-batches, the noise drops out and you get just minus the sum, over the layers, of the squared norms of the gradients. So if the derivative of the error goes to zero, the gradient norms must go down; the magnitudes must go down, there's no other way. So now let's look at the gradients; that's the natural thing to do. Here's my problem at last: this is the network I've been showing you all along, a six-layer neural network with this funny Eiffel Tower shape, but that's not important. What we plot here, for each layer, in different colors, are the gradients of the weights at each layer: the mean of the norm as the solid line, and the standard deviation of the norm, computed over the mini-batches. There are many mini-batches; the batch size here was something like 200 and the data size was 3,000, so a dozen or so mini-batches, and they are completely independent. And what you see is a very striking phenomenon, which a lot of people have reported, by the way; I'm not the only one to see this. At the beginning of training (this is a log-log plot), up to something like 300 epochs (you remember the 300: this is the point where the error curve starts to bend), the norm of the gradients is much larger than the standard deviation. This means a very clean gradient, essentially no fluctuations; it's actually two orders of magnitude, a really clean gradient. Then, precisely at the knee of the error, where the error saturates and its derivative saturates, you get this funny knee here too: the magnitudes drop by about an order of magnitude, which makes sense. And by the way, there's this nice dispersion across the layers: a lower gradient at the first layer and a higher gradient at the last layer, while in the noise you see precisely the opposite. The fluctuations jump about an order of magnitude above the mean: an order of magnitude more fluctuation than mean. This means noisy gradients.
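In symbols, the board calculation from a moment ago reads (my notation: E is the error, W_k the weights of the k-th layer, ξ_k the mini-batch noise with zero mean):

```latex
\frac{dE}{dt} = \sum_{k=1}^{K} \nabla_{W_k}E \cdot \frac{dW_k}{dt},
\qquad
\frac{dW_k}{dt} = -\,\nabla_{W_k}E + \xi_k,
\qquad
\langle \xi_k \rangle = 0
\;\;\Longrightarrow\;\;
\Big\langle \frac{dE}{dt} \Big\rangle = -\sum_{k=1}^{K} \big\| \nabla_{W_k}E \big\|^{2}.
```

So a flattening error forces the gradient norms down, layer by layer.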
So I argue that this phase, up to about 300 epochs (an epoch is a cycle through the data, so this is a lot of updates, because I update on every mini-batch), is what I call a high signal-to-noise gradient phase; that's why I call it high SNR. This is essentially a drift phase: if you think of a Fokker-Planck equation, the gradient is very clean and the fluctuations are very small, so it's essentially all drift. After this point the gradients are lower but the fluctuations much higher, so it's essentially a diffusion, maybe a slightly drifted diffusion, but mostly diffusion. And you see that this runs from 300 to 9,000. By the way, out here you begin to see what people call the collapse of the gradients; I'll show it to you later. Eventually things get very noisy because the units saturate. But you see that this transition from high to low SNR happens way before the collapse of the gradients; the collapse is not the reason. This, by the way, is a misconception published last year by Andrew Saxe and his colleagues at Harvard and MIT, very good people, but they made a number of mistakes about the theory; never mind, I'm going to come back to that. Another very convincing way to see this is to look at the norms of the weights themselves. Remember, I have no regularization here, no weight decay, no tricks of that sort, so the weights just grow. This is again a log-log plot of the average magnitude of the weights. (The numbers changed because this axis is the number of updates, not the number of epochs, my mistake; there's a factor of three and a half or so between them; it's the same thing.) You see that up to about 300 epochs, about 1,500 updates, there's linear growth of the weight norms. That's what we expect from a drift: we just accumulate the gradient, more or less linearly, so the exponent is about one; take a ruler and estimate it. And from that point on there's a very clear drop in slope. Now, if it were a pure diffusion on a flat plane, how would you expect it to grow? Like the square root of t, so I'd expect the slope to drop to one half. If the landscape is rough rather than flat, there are all sorts of anomalous diffusions with lower or higher diffusion exponents; this one is actually closer to 0.4 for some reason, but close enough to a half. So this knee in the log-log plot of the weight magnitudes is another direct indication that in the second phase you're doing diffusion: the norm grows more or less like the square root of t. By the way, this is what you see here; it's a bit noisy, and sorry, I have to fix the title, the time axis. These are the three plots, one per layer, and what you see is the SNR of the gradients in blue: there's a very sharp drop in the SNR. You also see the magnitude of the weights, and around this sharp drop the slope of the weight growth changes. And you see that afterwards the information between X and T, this orange line, drops. So again, the compression is associated with the diffusion phase; it's very clear.
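To make that measurement concrete, here is a minimal sketch of how one could estimate the gradient SNR across mini-batches (a toy logistic model in NumPy, entirely my own construction, not the six-layer network from the experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 3000 points, 12 inputs, noisy binary labels (made up for illustration).
X = rng.standard_normal((3000, 12))
y = (X @ rng.standard_normal(12) + 0.5 * rng.standard_normal(3000) > 0).astype(float)

w, batch, lr = np.zeros(12), 200, 0.1

def grad(w, Xb, yb):
    """Logistic-loss gradient on one mini-batch."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(yb)

for epoch in range(1001):
    if epoch % 200 == 0:
        # Gradient SNR at the current weights: norm of the mean over mini-batches
        # versus norm of the standard deviation over mini-batches.
        gs = np.array([grad(w, X[i:i + batch], y[i:i + batch])
                       for i in range(0, len(X), batch)])
        snr = np.linalg.norm(gs.mean(axis=0)) / np.linalg.norm(gs.std(axis=0))
        print(f"epoch {epoch:4d}  gradient SNR ~ {snr:.2f}")
    for i in range(0, len(X), batch):            # one epoch of plain SGD
        w -= lr * grad(w, X[i:i + batch], y[i:i + batch])
```

With these toy numbers the SNR starts well above one (clean drift) and falls below one once the mean gradient vanishes while the mini-batch fluctuations remain: the same crossover the plots show.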
So we see it everywhere we look: directly in the information, in the gradients, in the slope of the weight norms. There's something that has to do with losing information about the input in this compression phase, and it's true for all the layers in one way or another, even for this very small network. Again, what you see here is the SNR of the gradients for each layer, each layer a different color, and you see how fast it drops, from one to about 10 to the minus three; very noisy gradients. Out here, by the way, you get the saturated units, where the gradients seem to become very noisy, so forget that part. And this is as a function of the mini-batch size. This was again one of the claims of Saxe, that you see compression even with full batch, which would mean it's not the mini-batch noise doing the trick. But even with full batch (and you see the full batch is here, along the same line), the point of the transition from drift to diffusion moves with the mini-batch size as expected, linearly. And if you look at the point of this inflection, these marks of I(T;X), it is exactly the point where you get the maximum information about the input, exactly the green line I showed before; you see that even with full batch you're along this line. So there is some noise even in the full batch. It's small, but it's enough to cause diffusion eventually. What is that noise? Well, in full batch it's because you have an approximate, discretized gradient calculated on the training data; there's noise in that. I don't fully understand it, I don't have a complete theory, but I see empirically that the full batch sits on this same line, which means it behaves qualitatively the same way. I'm saying that even if you don't use stochastic gradient descent, even if you compute an exact full-batch gradient, you'll still have some level of stochasticity, which eventually behaves like noise in the gradient. And of course, the diffusion phase is then delayed: you see it moves from around 1,000 epochs to 4,000 or 5,000 epochs. Are the pictures essentially the same? Yes, essentially the same; the information plane picture is the same, it just takes much longer to get there. So now I'm going to argue, again, a question, yes. The learning rate is fixed, whatever it is; I'm not playing with it. You're right that there are all sorts of algorithms that do weight decay, do regularization, change the learning rate, and so on. This is really plain vanilla stochastic gradient descent, in the simplest possible form. All of those things matter; I argue they change the dynamics, they may accelerate it. Actually, I argue that if you decay the weights too early, you're going to lose, because you slow down the diffusion. So I really want to argue: why is this diffusion important? Here's a nice picture where we wanted to see the effect of the number of layers; this was the first time we looked at it. You see the same problem with one hidden layer, two hidden layers, up to six hidden layers, and the colors run from dark purple to yellow. You see that with one hidden layer it essentially takes forever to reach good compression.
I mean, the layer is somewhere up here; note, by the way, that the units are not log base two but natural log, that's why it says 0.7, but 0.7 nats is about one bit. So with one hidden layer it essentially never reaches good compression within our time frame; but surprisingly, when you add hidden layers, the number of training epochs goes down. With six hidden layers everything happens already in the purple, and you see that the layers get stuck at some point; nothing much happens from there on. You see it's red here and yellow here, meaning this was the endpoint, but it's right next to it, so it didn't move much at the end. So I get very good compression in the last layer already here. This looks like a very dramatic effect: I reduce the number of epochs by adding layers. And this immediately suggests a very clear intuition: these layers are related in a Markov chain, so anything compressed in the lower layers is also compressed away for the higher layers. It's something like a train; they push each other. And if I add more layers, they do this diffusion in parallel: each layer moves on its own, because the noise is independent for each of the layers. So let's look at this more carefully. (I lost some animation here, but never mind; I copied the slide yesterday from another presentation, and when you move slides from one PowerPoint to another, everything changes and I have no control over it.) Anyway, now I'm going to show you, and this is one of the highlights of today, in five minutes, that this compression is actually happening in the linear part of the neurons and is independent of the nonlinearity. That's not entirely true, because the nonlinearity can enhance it: if the units saturate, you compress better. But I don't need the nonlinearity in order to get compression; it's really a dynamical effect of the diffusion. So here's the analysis we do. Remember the picture you have to keep in mind: during the drift phase, the error drops significantly into something a lot of people call flat minima. Essentially, you fall through a ridge into something like a crater, and then you move essentially at random on a relatively flat surface. For most people this means: okay, I should stop when I reach the flat minimum; why continue? But actually, in everything I've shown you so far, this random walk within the flat crater improves generalization significantly. So stochastic gradient descent is doing something very different from mere curve fitting; I'm not just fitting the data. This dynamical effect of the diffusion is going to improve my generalization, and I want to show you how. I've already convinced you, I hope, that compressing the representation is good for generalization; but how does it happen? So look at the covariance matrix of the weights, say the Hessian matrix of my energy, my error, at this local minimum; now I'm in a linear problem for each layer separately. This is a very elongated matrix. In two dimensions this ellipse is the best I can draw, but the point is that there are very few dimensions that are important, the relevant dimensions. This is the low-dimensional manifold a lot of people talk about. Say I do face recognition: there are maybe 20 features I really need to worry about.
The eyes, the distance between them, the nose, the hair, whatever; 20 numbers is a good estimate. Everything else, the millions of weight dimensions, is essentially irrelevant, meaning that changes of the weights in those directions do not affect the error: in those directions the error is flat. That's how I drew it here, as this very elongated ellipse. It's not exactly flat, the eigenvalues are just very, very low; there's a very big drop in the eigenvalues of the covariance matrix. And most of the dimensions are there, in the irrelevant directions. Now, when you start to diffuse, what happens? You start a Wiener process: you accumulate these random gradients, which nothing suppresses, because in the flat directions there's nothing to suppress them; they just accumulate into this diffusion part. Essentially, the weights of the k-th layer converge to some matrix which at this point I want to call the CCA matrix (CCA stands for canonical correlation analysis), some sort of projection onto the relevant dimensions; this is the really important part. PCA does the same kind of thing but on the variables themselves; if I want the projection of the data that keeps maximum information about the variability of the label, that's CCA; you can read about it elsewhere. So that's the best projection, and on top of it I have what I call delta W, the diffusion matrix in all the irrelevant dimensions, and this part keeps growing. Now look at the map from layer k to layer k plus one; here's just a simple abbreviation. This sigma is the nonlinearity, which can be a sigmoid, a ReLU, whatever you want, even linear, applied to W times the previous layer plus delta W times the previous layer. That's what happens here: a matrix times the previous layer, that's the whole trick of neural networks, and that's all there is. But this delta W, in high dimension, is a random Wiener process, which grows like the square root of t for simple diffusion. Its covariance is very much like the covariance of the gradients, because it just grows in the flat dimensions like the square root of t, while in the relevant dimensions it essentially doesn't grow. So delta W is a collection of random Gaussian numbers, accumulated like a Wiener process in every component separately, and when I multiply it by the previous layer, since these are essentially independent of each other, it looks like random Gaussian noise. A physicist accepts this immediately; for mathematicians I'd have to work harder to convince them, but you can actually prove it: this behaves like a normal distribution with the corresponding covariance, and the magnitude of this delta W grows with t like a diffusion, slowly. Okay, so now let me use this. The map from T_k to T_{k+1} is a stochastic map made of two parts: a linear part, which is a linear function plus noise (this is what we call, in information theory, a Gaussian channel), and then some nonlinearity, which by the data processing inequality can again only reduce information.
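In symbols (my reconstruction; W̄_k is the converged, relevant part of the weights and δW_k(t) the accumulated diffusion in the flat directions):

```latex
T_{k+1} \;=\; \sigma\!\big( \bar W_k\,T_k + \delta W_k(t)\,T_k \big),
\qquad
\delta W_k(t)\,T_k \;\approx\; Z_k \sim \mathcal{N}\!\big(0,\,\Sigma_k(t)\big),
```

so the linear part is an additive Gaussian channel whose capacity bounds the information through the layer,

```latex
I\big(T_k;T_{k+1}\big) \;\le\; \tfrac{1}{2}\,\log\!\big(1+\mathrm{SNR}_k(t)\big),
```

where the SNR shrinks over time because the noise covariance Σ_k(t) keeps growing with the diffusion.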
Okay, so the information is bounded by the maximum information that can pass through this linear channel, which is again the Gaussian channel capacity. How many of you have heard of the Gaussian channel capacity? Not enough, not enough. I should really go through the details, but my time is up. So this is the Gaussian channel: the capacity is one half log of one plus the SNR. Here this term is the signal and this term is the noise, and their ratio is the SNR; one half log of one plus the SNR is the capacity of the Gaussian channel, a very important formula in information theory. So this is a bound on the information through the channel. I just want one more slide, and the rest will be tomorrow. Essentially I can work out this bound: I know that delta W grows with t like the square root of t, or with some diffusion exponent, so call these lambda_i of t, and the a_ii are now the principal values, the eigenvalues of the covariance, of the CCA matrix. I'm doing two tricks here. First, I diagonalize everything in the CCA dimensions, so the a_ii are fixed and the lambdas are the projections of the noise onto the CCA matrix. (By the way, asymptotically these two things commute, because the Hessian matrix and the covariance of the gradients become very similar to each other eventually, so I don't really need to work too hard here.) Just diagonalize over these principal components, these principal channels; and since the noise grows with time, this term becomes a constant (something is missing here on the slide), a constant which depends on the relevant dimensions, plus something which decays like one over t to the alpha. I'll repeat this tomorrow in a much clearer way. The nice thing is that it gives you this very surprising result: the information is bounded by a constant, which depends on the relevant dimensions, plus something which decays like t to the minus alpha, where alpha is the diffusion exponent, because this is the signal-to-noise ratio. All I did was approximate log of one plus x by x, which is fine when x is small, and it's a bound anyway. Then I assume the same diffusion exponent for all the layers, or at least take the worst of them, and pull it out. And what you see is that the information decays to this constant at a rate which is essentially a power law depending on the diffusion exponent. So the real gist of it is this: if I have these essentially independent diffusion subspaces in every layer (I'll come back to this tomorrow; it's very important, it has to do with the diffusion, and alpha is somewhere between zero and a half), then for any alpha I get this interesting relation: the time of convergence with K layers is K to the minus one over alpha times the time of a single layer, where alpha is this diffusion exponent. So the convergence time decays with the number of layers as a power law. I just want to show you how this looks in reality, and then I stop. This is the number of iterations, this is the number of layers; the red line is the expected power law, the blue line is the measurements. It's obeyed by the networks beautifully: the time of convergence scales with the number of layers as a power law that depends on the diffusion exponent. More layers, fewer iterations. Okay, so now the question is: is this always true, or are there special problems? We'll talk about that tomorrow, together with symmetries and other things. All right, thank you.