and every one of the layers is one big random variable. So this is somewhat unusual, because most people think about the neurons as the important elements here. We're going to come back to this later. At this point, you can have all sorts of equivalent networks just by scrambling the weights with any one-to-one transformation. This will not change anything in my picture, but it can be a very different network in terms of what each neuron is representing. So essentially, I really don't believe that a neuron which represents faces, a "grandmother cell" or whatever, is the typical thing you see in deep neural networks, unless you have some constraints on the architecture, like limited receptive fields. Convolutional neural networks, for example, are very restricted, because they're already implementing some sort of symmetry. And with convolutional neural networks, with restricted receptive fields and so on, you can begin to see things like "this is a face cell" and "this is an eye cell" or "this is a car cell". But in general, I think it's more likely to see scrambling of the layers. And in order to be invariant to such things, one of the main properties of mutual information is that it's completely invariant under any such transformation, as long as I look at the mutual information between whole layers, okay? That's one comment. The other thing that I started to discuss yesterday and will continue now: I mentioned the very important fact that if you look at the gradients, which is really the reason why we see all these interesting dynamics, we have these two phases. The high-SNR phase, where this is the mean and this is the standard deviation of the gradients per layer, averaged over all the units. So we have this very dramatic high-SNR phase, a large gap between the mean and the standard deviation. And then at some point, which is very important for us, essentially the layers converge to some sort of a flat minimum in the energy, or the error, the training-error landscape. That's something which has been reported by many, many people. And this flat minimum is flat in some dimensions, in most of the dimensions actually, but it's actually very narrow, very well protected, in what we call the relevant dimensions, the things that actually affect the error dramatically. And this is usually some sort of a low-dimensional manifold which depends on the problem, depends on the features which are really important to the problem, but in each layer it's reflected in different dimensions of this covariance matrix. And at this point, the variance of the gradients is essentially uncontrolled in the irrelevant dimensions and can be very high. And that's why you see this very big jump. And if you actually look at this figure, I think I should use this one because it shows much better on the video: there's actually a maximum of all the gradients just before this collapse, the fall into this flat minimum. This maximum is exactly where you expect the gradient to be very high: exactly when you fall into this canyon. It's really not a flat minimum, it's a canyon, but this low-dimensional picture is very misleading. It's a narrow canyon in the relevant dimensions, a very wide flat minimum in the irrelevant dimensions.
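Written out, the invariance being appealed to here is the standard identity

\[
I(X;T) \;=\; I\big(X;\phi(T)\big)\qquad\text{for any invertible (one-to-one) map }\phi,
\]

since mutual information depends only on the joint distribution, and an invertible map merely relabels the values of T; scrambling the units of a layer in a one-to-one way therefore leaves I(X;T) untouched.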
So this peak that you actually see everywhere, in many, many different networks, is really a very nice indication that you're actually falling into something. So the landscape is strange. In very high dimension, you walk very slowly at the beginning and then you fall down; you have these high gradients and then the variance starts to climb. This is what I call the fall into the flat minimum. Another interesting thing is that if you look at the ratio between the mean and the standard deviation of the gradients, this is essentially the signal-to-noise ratio of the gradients in every layer. So if you look at the difference between red and red, and then blue and blue, you see that essentially all these differences are the same. This is something which is really striking, and why should this happen? And notice that these differences are really differences of logs, so it's the log of the ratio, the log of the signal-to-noise ratio of the gradients. So why should it converge? You see that they start in very different places, the beginning is completely unordered, but when they reach this flat minimum, the signal-to-noise ratio of the gradients looks constant. Now, this is a very nice, completely independent verification, or evidence, of this transition into the saturated phase of the mutual information about the label. Because remember, those gradients are back-propagated through the network. They're moving from the last layer to the lower layers, and that's why the gradients accumulate: you see that the first layer, the red one here, has the highest variance and mean of the gradients. The difference is about the same for all the layers. But this difference is the log SNR. The SNR is just the ratio between the mean, sometimes we might call it the power of the signal, and the standard deviation, or the square of that ratio, it doesn't really matter: the power of the signal to the power of the noise. And this SNR really completely dominates the mutual information between the layer and the output. Because you know that the log of one plus the SNR is, up to a constant factor, exactly the capacity of the Gaussian channel. So I have to talk a little bit about the Gaussian channel; this is something I realized yesterday is not completely clear to everyone. But anyway, you see that this particular function of the SNR is a bound, and actually very close, to the mutual information, because you can think about the gradients as a small stochastic signal that is back-propagating through the network. And because it's back-propagating, going from the top, from the output layer, all the way backward, noise is accumulated from layer to layer; that's why you see this increase in both the signal and the noise. But the signal-to-noise ratio, which in this log-log plot is the difference between the log mean and the log standard deviation, remains constant eventually, towards convergence. Now if you compare this to the information plane picture that we showed, you notice that at this point, say around 3,000 epochs, the layers are already starting to drift to the left, but the mutual information about Y is already very high.
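As a concrete illustration of this kind of measurement, here is a minimal sketch, assuming PyTorch and a small toy network; the architecture, rule, and sizes are placeholders, not those of the talk, and the mean/std over weight components of the full-batch gradient is only a crude proxy for the per-layer statistics he plots:

```python
# Minimal sketch (PyTorch assumed; toy data and hypothetical sizes): track a
# per-layer proxy of the gradient mean and standard deviation during training
# and print their log-ratio -- the log-SNR whose per-layer gap is discussed above.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 12)                      # toy inputs
y = (X.sum(dim=1) > 0).float().unsqueeze(1)   # toy binary rule

lin = [nn.Linear(12, 10), nn.Linear(10, 8), nn.Linear(8, 1)]
model = nn.Sequential(lin[0], nn.Tanh(), lin[1], nn.Tanh(), lin[2])
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(2001):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    if epoch % 400 == 0:
        for k, layer in enumerate(lin):
            g = layer.weight.grad.flatten()
            mean = g.mean().abs().item() + 1e-12   # "signal": mean gradient
            std = g.std().item() + 1e-12           # "noise": its spread
            print(f"epoch {epoch:5d} layer {k}: log-SNR = {math.log10(mean / std):+.2f}")
    opt.step()
```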
And this is directly indicated, you see it very clearly, in the constant signal-to-noise ratio of the gradients. So that's just a nice way of verifying, what I call a sanity check, that we actually see something real. This broadcast of the gradients from the end of the network, back-propagating to the beginning, actually accumulates noise without losing information. And it's a very striking and very clean effect; there's no question about it. You see that the order of the lines is exactly backward, from the end of the network to the beginning, and the noise is accumulated. Now when I say Gaussian channel, this is one of the most important models in information theory, so I have to say something about it. In Shannon theory, you have this channel which has an input X, an output Y, and some conditional distribution p(y|x). It's a noisy channel, so this conditional distribution means that there's some noise added here in general, some noise interfering with my communication. And one of the main beautiful results of Shannon theory is that the channel is completely characterized by the mutual information between Y and X, which is

\[
I(X;Y) \;=\; \sum_{x,y} p(x)\,p(y\mid x)\,\log \frac{p(y\mid x)}{p(y)}.
\]

That's one way of writing it; you can actually write it also as a function of X, but this is a nice way of writing it. So essentially, the way we usually think about it, just to give you a crash course in information theory: X is the space of all possible inputs, and you broadcast into the space of all Y's, which in general can be very large. Unlike our special case where Y is small, in communication both X and Y can be large variables, large spaces. And the idea is that you start from some x and it is mapped to some y, but it is not mapped deterministically; there's some stochastic noise added to it. So we think about it as if there is actually a sphere around y, the sphere that is due to the fact that y is blurred by the noise. So I can take x and get all sorts of noisy versions of x, and those noisy versions are mapped into some sort of sphere around the clean y I really wanted to broadcast. And the basic question of channel coding is: how can I separate those spheres in the presence of noise? I should know which x actually was broadcast, but they're all mixed up together in the spheres. So what I should do is pack as many spheres as I can, as long as they don't overlap. Once they start overlapping, I'm going to confuse things. So it's a very simple question: I need to put here as many spheres as I can without overlapping, and then I know exactly how many x's I can broadcast without mistakes. That's really the main idea of channel coding: separate them. Now, how many spheres can I pack here without overlapping? What is the size of one sphere? The size of this sphere, on average, is precisely \(2^{H(Y\mid X)}\): by the same cardinality argument from typicality, the average size of the sphere is the volume blown up by this conditional entropy. So it's \(2^{H(Y\mid X)}\).
And the size of all of Y is \(2^{H(Y)}\), by exactly the same typicality argument. So you see that the capacity, the maximum number of spheres I can pack here, is precisely the ratio of these two things, \(2^{H(Y)}\) over \(2^{H(Y\mid X)}\). And nicely enough, this is precisely the mutual information, \(2^{I(X;Y)}\). So I(X;Y) gives precisely the number of spheres I can get here. The densest packing, the maximum number of spheres, is precisely the maximum of this ratio, and this is what we call the capacity of the channel. So the channel capacity is the maximum, and the only freedom I have here is the probability of X, because p(y|x) is given. So it's the maximum over all possible input distributions to the channel of this I(X;Y). So that's channel coding in a nutshell. That's why mutual information is so important for communication: the mutual information really gives me the maximum number of objects that I can separate after the noise of the channel. And when the block length gets large, the typicality arguments tell us that these are going to be essentially hard spheres, and they're all going to have the same size, again by typicality. The average size of these spheres is going to be the same asymptotically, and this is just like packing hard spheres. Okay, so this is the story in general. Well, they don't have to be equal for this calculation to work; this is the average size. What I'm saying is that asymptotically the equipartition property, the fact that typical things have the same probability, is going to homogenize these spheres. It's going to be equal-sphere packing. So geometrically the problem is a sphere-packing problem: how many spheres can you pack in a certain volume without overlaps? In discrete channels this can be any alphabet here and any alphabet there. But when we talk about continuous channels, we think about Y being X plus some noise. Just think about the one-dimensional Gaussian case first: this is a real-valued number, and I'm adding to it a noise with zero mean and some variance sigma squared, so \(Y = X + Z\) with \(Z \sim \mathcal{N}(0, \sigma^2)\). So this is the Gaussian channel, which means that the noise is completely induced by a Gaussian noise added to each number. In this case, the sphere-packing problem becomes actually simple in some sense. The Gaussian is going to be mostly concentrated within one, maybe two or three, standard deviations, so with very high probability the noise is going to blow things up only within a certain distance which depends on the standard deviation of the noise. And if X itself is a Gaussian random variable, say in higher dimension, then Y, the sum of two Gaussian variables, is also a Gaussian variable. And now again, the picture is: how many spheres of the small Gaussian noise can you actually pack into this big Gaussian variable without overlapping? And it turns out that, at least in the one-dimensional story, this is very simple.
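In symbols, the counting argument just described is:

\[
\#\{\text{distinguishable inputs}\}\;\approx\;\frac{2^{H(Y)}}{2^{H(Y\mid X)}}\;=\;2^{\,H(Y)-H(Y\mid X)}\;=\;2^{\,I(X;Y)},
\qquad
C \;=\; \max_{p(x)}\, I(X;Y).
\]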
I mean, everything is dominated by the ratio of the volume of the big sphere to the volume of one noise sphere, and this is essentially determined by the SNR. The noise in this case is just sigma squared, and the signal is just the power, if you want, of the broadcast. So it's very clear: if I want to be resilient to small noise, I need a lot of power, so that the noise is not going to mix up my broadcast. So this is the Gaussian channel, again, in a very compressed way; I don't want to teach the whole subject, I just want to give you the basic idea. But then the mutual information, this capacity, comes to this formula. It's again a sphere-packing argument; you can actually do it yourself, almost: just calculate how many spheres of size sigma squared you can pack in a sphere of size P, in a large, high-dimensional space. Because we're talking about repeating the broadcast, the capacity argument always looks at very large blocks, so I have an n-dimensional sphere and small n-dimensional spheres, and it's a sphere-packing argument in n-dimensional Euclidean space. And that's what gives me this formula. This is probably one of the most practical formulas in information theory, and you should really know it. So this is the power of the signal to the power of the noise, the signal-to-noise ratio. That's the first thing about Gaussian channels, and that's really just to understand what's going on here: why the constant signal-to-noise ratio in this diagram actually verifies that the mutual information between the layer and Y saturates. They are just two aspects of the same story. The gradients are small Gaussian noise, so I can linearize my channel. Despite the fact that this is a highly nonlinear system, I mean I go through all these nonlinearities, with small enough gradients I can linearize the network in some sense, and it looks very much like a Gaussian channel. Now once we're talking about Gaussian channels, the really interesting story is where both Y and X are high-dimensional; we'll talk about the multi-dimensional Gaussian channel. And this is related to the theorem I started to prove yesterday, and that's why I want to verify that you know it. So here it's a one-dimensional Gaussian channel, and the SNR is related to the capacity, the maximum mutual information, of the channel through which the gradients are back-propagated. Remember this interesting story in deep learning: the signal is moving forward, from the first layer on, but the noise, the gradients, are actually moving backward, from the last layer back, and that's exactly why you see this accumulation of noise. The most noisy layer is actually the first one, which is a bit surprising at first, but when you think about backpropagation, that's precisely what you expect. Okay, so then I argued, and I hope convincingly, that this transition from high to low SNR corresponds to a transition in the dynamics of the gradient descent.
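The formula referred to here is the Gaussian channel capacity; the n-dimensional sphere-packing count gives it directly:

\[
\#\{\text{spheres}\}\;\approx\;\left(\frac{P+\sigma^2}{\sigma^2}\right)^{n/2}
\quad\Longrightarrow\quad
C \;=\; \tfrac{1}{2}\log_2\!\left(1+\frac{P}{\sigma^2}\right)\ \text{bits per channel use},
\]

with P the signal power and σ² the noise variance, i.e. one half the log of one plus the SNR.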
The weights, when I think about accumulating the gradients, have this linear growth at the beginning, which is the drift phase, and then a sub-linear growth, which looks like a square root or some other power of time smaller than one, which is the diffusion in weight space. And that's very, very clear wherever we look: we see these transitions in the gradients in many, many different networks, not only in this toy problem. You see it here, for example: this is the committee machine again, and you see how nicely the gradients flip here too, in exactly the same manner. The first layer has the highest noise and so on, and if you look at the gaps, the ratios, they also approach a constant. And this is an entirely different network. This is what we see in the MNIST problem. In MNIST we see again the same type of compression, and you see very quickly this crossing of the mean and the variance; it's not as sharp as in the smaller problem, for some reason, but you see it. The mean goes down and the standard deviation goes up, and exactly at this point of transition you get this backward-moving of the layers, and again the ratio goes to a constant, because the mutual information about Y is essentially saturated. We see it in essentially every problem we look at. So this is reassuring: it's not just my hallucination, we actually see something real here. You see completely independent evidence for the same phenomenon: the mutual information is converging, and layer by layer you see it in the gradients directly. Actually, I believe you can do the calculation directly from the gradient figures: just as you see this very nice flat minimum very clearly here, you can take these figures and, using the Gaussian approximation, the Gaussian bound to the mutual information, generate those information contours, at least approximately, just from the gradients. And that's, by the way, one of our answers to some of the criticism people had of my work, that we are very sensitive to the way we estimate mutual information. We are not: we can do it without binning, without discretization, without doing anything which looks fishy in estimation, and we get essentially the same picture, okay? So that was just replying to some questions, and now I want to go again to this quite dramatic effect that I mentioned: surprisingly, when you add layers to the network, you converge faster to a good solution. This seems completely contradictory, because the way we usually think about learning is that when you add parameters the problem becomes harder, and therefore you need more time. And it does become harder in one sense: of course there's more computation on the way, I need to compute many more scalar products and many more nonlinearities and so on. So in terms of computation time per iteration it certainly gets harder when you add more layers. But there is a very dramatic nonlinear effect, where you see that going from one layer to six layers, I converge in a small fraction of the time to a good solution. And I argue that this can be explained, again, by the Gaussian channel; that's why I insist on talking about it, and I want to repeat this argument more slowly.
So again, the only assumption that I'm making is that the network converges to some sort of minimum, a local minimum, of which there may be many. Actually, I believe there's a whole continuous manifold of equally good solutions; this is one of the properties of high-dimensional problems. There's certainly not one minimum, so forget the image you have in mind of a single minimum. There's a whole infinite manifold of optimal solutions. So essentially this is a big canyon in high dimensions, which is flat in many dimensions and very narrow, but well protected, in the relevant dimensions, and it may curve in a very crazy way. I'll show you in a few minutes how complicated the geometry can be, but locally I can think about it as one local minimum with some sort of covariance matrix which is completely anisotropic: very narrow in a few dimensions, very wide in the rest. By the way, you're going to hear, for example, Riccardo Zecchina next week, I suppose, and he talks a lot about the nature of these local minima, and about the fact that there are many, many isolated minima which are very narrow, like in spin glasses, which I actually don't care about, because the probability of falling into one of these very narrow holes is very small. With very high probability, and there are people who analyze this very carefully now, you fall into this flat canyon. You see this nice drop in the gradients and so on. So now, in this flat canyon, I can analyze the problem in two steps, essentially. My interest at this point is to understand the mutual information between two consecutive layers. So remember the setup: I have this big network, the layers here, this is X, and these are two consecutive layers, T_k and T_{k+1}, and there are connections which I call W_k, with the k as an index: just a notation for all the weights that connect this layer to this layer. Remember, this is a high-dimensional vector in general, in my assumption high enough. The whole analysis we are doing all along is under the assumption that X is a high-entropy variable and that the widths of most of the layers, until maybe the very end, are macroscopic, I mean, they scale with the size of the input. If you suddenly have a very narrow layer there, you can lose a lot of information, and the whole story changes; I need the layers to be large for this asymptotic analysis to work. But in essentially all the applications I know of big neural networks, this is true: most of the layers are large. At the very end you may have this convergence to very few neurons, but this is just the very end. So the end is somewhere here; this may be narrow, but I'm looking at the typical transition. So in order to understand how this compression is actually happening, we want to treat the layers as random variables: X is a random variable, and broadcasting X through the layers makes each one of the T's some sort of random variable, and I want to estimate the mutual information between T_k and T_{k+1}. Both of them are essentially representations of X, and I want to see that there is actually loss of information between them. So in order to do this, I think of T_{k+1}, let me do it here.
You can see it on the video. So T_{k+1} is a nonlinear function, which I call σ, which can be sigmoidal, hyperbolic tangent, or piecewise linear like ReLUs, or what they call hard tanh, which is a piecewise linear tanh, and so on. I don't care; again, I'm answering a specific criticism, that this depends on the nonlinearity. But this is a function only of the product of the weights with the previous layer, okay? So \(T_{k+1} = \sigma(W_k T_k)\). Now, I argue that the linear part of this map is actually a Gaussian channel; let me just look at the linear part. So this is a nonlinear function of a Gaussian channel. Why is this a Gaussian channel? I can, and I should, break W_k into two pieces, because it has completely different behavior during the first part, the drift phase, where I actually converge to this flat minimum, and from there on, where the weights start to do a random walk. So I say that W_k, this matrix, looks like something I call W_CCA, for a reason I'll try to explain soon, which is the optimal projection at this minimum: you take the high-dimensional space and project it locally. This is essentially the matrix we get to at the end of the drift phase. But then there's something else, which I call ΔW_k, which actually grows with time. Why is it growing with time? Because I have this diffusion, and the more I diffuse, the larger these weights get, and they grow sub-linearly with time. So this is some sort of a Wiener process, an accumulated random walk. By the way, there are also some papers that argue that stochastic gradients don't exactly look like Gaussian noise. That's true, but I don't really care, because I'm going to accumulate them anyway during the iterations of backpropagation. So even if they're not exactly Gaussian, their sum is going to be Gaussian, don't worry: I'm summing many, many independent variables like this. So eventually this ΔW_k is going to be distributed like a normal distribution with some covariance, let's call it Σ₀, multiplied by some function of time, some power of time. The Σ₀ is the covariance at the start of this walk: this covariance matrix is the covariance matrix of the flat minimum, essentially the Hessian matrix of the minimum, if you want, which can be estimated directly, but how I do that is a different story. The only thing I assume about it is that it has this very non-homogeneous, non-isotropic structure: most directions are flat. So this ΔW(t) is going to be distributed like \(\mathcal{N}(0, t^{\alpha}\,\Sigma_0)\), with t to the alpha here. Σ₀ has many dimensions; most of them are irrelevant, but it's not a sharp drop to irrelevant, they are less and less relevant. Now if you actually look, essentially this Σ₀ is the covariance of the CCA matrix. And this is something you may know: consider the covariance matrix of the gradients, the product \(\partial E/\partial W_i\) times \(\partial E/\partial W_j\), averaged, with indices i and j.
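Collecting the pieces just described, one way to write the model of the layer-to-layer map is:

\[
T_{k+1} \;=\; \sigma\!\big(W_k(t)\,T_k\big),
\qquad
W_k(t) \;=\; W_{\mathrm{CCA}} + \Delta W_k(t),
\qquad
\Delta W_k(t) \;\sim\; \mathcal{N}\!\big(0,\; t^{\alpha}\,\Sigma_0\big),
\]

with \(W_{\mathrm{CCA}}\) the fixed projection reached at the end of the drift phase and \(\Delta W_k(t)\) the slowly growing diffusion part.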
So this matrix, \(\langle (\partial E/\partial W_i)(\partial E/\partial W_j)\rangle\), is essentially proportional to minus the matrix of second derivatives of the error, \(-\partial^2 E/\partial W_i \partial W_j\). This is the Hessian matrix, and that is the covariance of the gradients, averaged, again, over examples. So this is something you may have seen: anybody who saw the Cramér-Rao bound, for example, should remember it. The minus second derivative, the Hessian matrix, is exactly proportional to the covariance of the gradients. So saying that the covariance of the gradients and the Hessian are essentially the same matrix, or at least asymptotically the same, is not such a terrible thing to say. And this Hessian matrix is directly related to the CCA matrix. If you really want to be careful about it, I actually argue that asymptotically the covariance matrix of the noise and the Hessian matrix commute, so I can diagonalize them together. But even if I can't, it's not very different from what I'm going to say. Now, the argument here is, okay, let me clean it up a little bit. If it's true that this behaves like noise, then I can write W as a first term, which is the preserved part that I really want to keep, the projection onto the CCA directions; this is what I call the signal, the relevant part, the thing I project onto. And if you actually look at the spectrum of W_k, at the eigenvalues of the weights, it usually has this very sharp drop (what's plotted here is actually the inverse, W to the minus one). The protected dimensions are high; these are what we call the relevant dimensions. Then there is a drop, and eventually a long tail, which can be exponential or worse than exponential, of irrelevant dimensions. Remember, I'm talking about a huge space, all the weights. Now, I can break it layer by layer, separate the weights in one layer from the weights in another layer and so on, and deal with them independently, but all of them have this type of structure. Of course, as I move deeper through the network, the layers don't see all the data, because some of it is already filtered, in some sense, by the previous layers. But this type of sharp drop in the eigenvalues, nicely separating the relevant and irrelevant parts, holds for most problems. It's a problem-dependent property: if I do face recognition it has one type of spectrum, if I do speech recognition it has a different type, but all the problems that we know how to solve with deep neural networks have this bounded spectrum. Yes. Yeah, so actually you can do this analysis for each layer separately, or you can think about all the weights together, as many people do. I'm one of the very few who deal with the layers separately, and each one of them has a different story. But in most analyses of neural networks, you see the Hessian matrix of all the weights, which is a bit confusing to me. So anyway, what happens here is that I should look at this ΔW_k(t) multiplied by T_k. But that's a Gaussian noise matrix, essentially a random matrix which has different variances in different directions.
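The Cramér–Rao-style identity invoked here, that the covariance of the gradients matches minus the expected second derivative, can be checked numerically in a toy model; a minimal sketch (a one-parameter logistic model, numpy only, all sizes arbitrary):

```python
# Sketch of the information-matrix equality behind the claim above: for a
# log-likelihood, the covariance of the per-sample gradients (scores) equals
# minus the expected Hessian at the true parameter. Toy 1-D logistic model.
import numpy as np

rng = np.random.default_rng(0)
w_true = 1.5
x = rng.normal(size=200_000)
p = 1.0 / (1.0 + np.exp(-w_true * x))
y = rng.random(x.size) < p                 # labels drawn from the true model

# log-likelihood l = y log p + (1-y) log(1-p):
score = (y - p) * x                        # dl/dw, per sample
hess = -p * (1.0 - p) * x**2               # d2l/dw2, per sample

print("covariance of gradients:", score.var())
print("minus mean Hessian     :", -hess.mean())  # ~ equal at w_true
```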
In some directions it's a small noise, a small variance, and in most directions it's a high variance. And that's why I can treat it as noise. Even though this Gaussian noise multiplies the previous layer, the previous layer is essentially an independent vector that changes from one X to the next; I know it's a vector of plus-minus ones or something like this, which is essentially independent of this noise. So the product looks like a Gaussian variable for any input, and that's something we can actually prove: take a Gaussian random matrix, multiply it by an independent non-zero vector, and you get, each time, a very good approximation to independent Gaussian noise. Okay, so that's why I say this looks very much like a multi-dimensional, high-dimensional Gaussian channel: it's a linear function of my input, plus noise. But this noise is growing with time, growing because the diffusion is expanding this matrix, blowing it up. All right, so now to analyze this. Again, this is just the gist of the idea: I'm thinking about this as a Gaussian channel, and I'm bounding the information. If this is a Gaussian channel, the mutual information between T_k and T_{k+1} is bounded by the capacity of the Gaussian channel. Notice that the nonlinearity doesn't matter; it can only decrease the information, by the data processing inequality. It's a Gaussian channel wrapped with a nonlinearity: any information which does not go through the Gaussian channel is not going to reappear after the nonlinearity. So the nonlinearity can only compress more, not add information. That's why this is a bound. And what do I write here? Essentially, the signal here is the fixed part of the weights, which I call W_CCA, times T_k, and the noise part is ΔW times T_k; and since I look at the ratio of the squared norms, the T_k drops out. Okay, so this is the story. And now, what I did yesterday very quickly and want to reiterate: this Gaussian capacity in multi-dimensions I can actually diagonalize. The capacity doesn't depend on which basis I look at it in, so I can diagonalize it in the eigenbasis of the CCA matrix; that's what I call it here. This sum is over the eigendirections of the diagonal part of this covariance matrix. Because I consider these channels as completely independent, I write the eigenvalue of the covariance matrix in direction i as a_ii, and I divide it by something I call λ_i(t), which is how the noise is going to grow in this direction. So this is the projection of ΔW_k T_k onto the eigenvectors of the CCA matrix. So you have this picture in mind: these are like two ellipsoids, and essentially they align with each other. If this one is the eigenbasis of the CCA matrix, I can always project the noise onto one of these dimensions. We don't want to assume that they commute, but I actually argue that, due to this, they asymptotically commute. So essentially, I just factorize this big multivariate capacity into a sum over scalar channels. Now, here I was a little sloppy and have to do it more carefully; this is correct, but here I didn't have enough space on my slide.
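One way to write the resulting bound, in the notation above:

\[
I\big(T_k;\,T_{k+1}\big)\;\le\;\sum_i \tfrac{1}{2}\log\!\left(1+\frac{a_{ii}}{\lambda_i(t)}\right),
\]

where \(a_{ii}\) is the signal variance along the i-th CCA eigendirection and \(\lambda_i(t)\) the noise variance there, growing like \(t^{\alpha}\) in the diffusing directions.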
So essentially, the first group of eigendirections of the CCA are protected: they don't change, there's no diffusion in these directions. These are what I call the relevant directions, which are preserved at this point. But all the rest are at some point going to diffuse, because the gradients are small in these directions, so the diffusion there is high. So I separate the sum. Since this λ_i(t) is going to grow with time with some exponent, I assume here that either all the exponents are the same, say they're all one half or something similar, or I take the worst one, the slowest one; I want a bound. So first of all, log of one plus something is bounded by that something; that's the first inequality. And the second one, it should also be an inequality if you want to be very careful: I take the time dependence with the slowest exponent out of the sum. Still a bound. So I have a bound on the mutual information between two consecutive layers which is made of two parts. The protected part, which I call C_k, which is essentially the eigendirections of the Gaussian process that are going to go through, because there's no diffusion there, or the diffusion is not going to hurt; this is the relevant part. And then another term, which is a sum, but the sum should be taken only from the tail, only over the eigenvalues above a certain index, the irrelevant ones. This second part is decaying with time, decaying because it's one over t to the alpha, so it's t to the minus alpha; it goes down slowly with time. Yes, questions. [Is this the eigendirection of the covariance matrix?] It's actually not just a covariance matrix. The reason I don't call it PCA, as we would in standard covariance analysis, is that I don't care about the variance of a direction per se; I care about the relevant variance, the part that carries information about Y. So again, I guess most of you don't know this term, CCA. It stands for Canonical Correlation Analysis, and it's a very classical method in statistics, very similar to PCA: I want to project my data to low dimension, but I don't care about the variance of my data, I care about the covariance with another variable. It's something that was invented long ago and is a standard method in linear data analysis for finding the components which really matter for something. So for example, if the background of my image has a lot of variance, because the background changes all the time, but carries no information about who I am, then I don't care about this high-variance component. I care about the variability of the things which actually broadcast something about Y, and these are the CCA components. Now, about the Gaussian channel: we actually have a paper from a long time ago about the Gaussian information bottleneck, which you can find on the web if you want. We analyzed what the information bottleneck is doing in the Gaussian case, where X and Y are related in this way, and it is precisely canonical correlation analysis; it rediscovers CCA, it does exactly what it should do.
It finds those projections to low dimensions which keep as much of the relevant information as possible, and as you compress, you lose them one by one; I'm going to come back to this in a second. So again, this is the reason for the bound: C_k is where I expect the network to converge to. This is the wall beyond which information is not lost, and everything else should decay like a power law, t to the minus alpha. Okay, now I'm making a very interesting step, interesting or not, I don't know. So R is also decaying slowly with time, as a power law. How is this going to affect my convergence to a good solution? If you understand this linear story in one layer, you see that the information in one layer eventually has this constant part, the thing beyond which I don't want to compress, and a decaying part, which is affected by this diffusion. So, this board is too small, but essentially I have to assume something about what this R is. This R is essentially the sum, over all directions, of the eigenvalues of the CCA matrix divided by the eigenvalues of the noise at time zero, which is essentially the covariance matrix, this Hessian matrix. So this is a finite number, well, it's not necessarily finite, but assume it's a finite number, which is just summing the ratios of these eigenvalues, and it depends on the problem only. Now, this requires some explanation. It's not a quantity of the network; it's a quantity of the problem I'm trying to learn, a property of the rule, let's call it the rule. So this is the hard part of the argument: why is it a property of the rule? Essentially, think about the function you're trying to learn. It's some function y of x; just for simplicity, ignore the noise for a second, so it's some f(x), some function. I can think about it as a function in some high-dimensional space, so I can expand it in a Hilbert space with any orthogonal basis I want. For example, I can take Fourier transforms, I mean sines and cosines, I can take orthogonal polynomials, or I can take spherical harmonics, whatever your favorite orthogonal set of functions is, okay? Once I expand it, this covariance matrix is nothing more than the covariance of the coefficients of this expansion. Now, the network itself, with its weights, is an over-complete representation of my function space. But after this diagonalization, once I'm already close to the minimum, another possible expansion of my function is in the eigendirections of this minimum. In general this is not going to be a linear expansion, it's only locally a linear expansion, but it is an expansion of my function in another orthogonal basis. Now, this R is nothing more than the trace of the covariance matrix of the coefficients of the expansion; that's what it is. So it's independent of the basis: if I rotate, move from a Fourier expansion to Chebyshev polynomials or to spherical harmonics or whatever, that's not going to change the trace. So this is a property which depends only on how many relevant dimensions there are and how fast these coefficients decay. So I'm saying that once I do it in the network's own functions, I don't change anything.
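Putting the last few steps together, the bound and the basis-independence claim can be written compactly (my notation, summarizing what is described above):

\[
I\big(T_k;T_{k+1}\big)\;\lesssim\; C_k+\frac{R_k}{t^{\alpha}},
\qquad
R \;=\; \sum_i \frac{a_{ii}}{\lambda_i(0)} \;=\; \operatorname{tr}\!\big(\Sigma_{\mathrm{CCA}}\,\Sigma_0^{-1}\big),
\]

and a trace is invariant under any orthogonal change of basis, which is exactly why R is a property of the expanded function, the rule, rather than of the particular network basis.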
It's still a property of my functions, not a property of my network; it's just an expansion in a different basis. If you buy this, and I know it sounds a little hand-wavy, but if you buy this, that this is essentially a property of the problem, not of the network, then the question is: how are these coefficients divided between the layers? That's the interesting question. Do all the layers handle them together, or do they handle them independently? Okay, I know this is a tricky argument. I was actually trying to prepare a slide, but on the iPad it was very hard to get it right, so I'll do it for you now. So assume that this sum, I just draw it as a circle, is a finite sum, because my function is a normalized L² function; that's what I assume here, so the sum of the squared coefficients converges. This is a finite sum. And now, let's say that I divide this sum between the layers, like a pie: this is going to be T1, T2, T3, and so on. Let's say I divide them independently, which means this pie of coefficients is split between the layers with no overlaps. Then what I'm saying here is very clear: this R is split between the layers, and it's just one number. And then I can say, if you want the full compression of the network, the compression in each layer is done independently and simply adds up. This is what I call independent compression, or layer-independent compression. Each layer takes a different, independent part of the coefficients and compresses it; we're going to ask in a second when this can happen. So each k gets its own R_k, and summing over k gives the total. Now remember, I want to see how much the network compresses the representation of X, so I simply sum the information terms between consecutive layers; each pair is talking a different language, so each compresses a completely different part. And then I get this result, which is written here: the compression happens completely in parallel. Each layer compresses independently, and I get this very nice result: the time that it takes to compress with K layers is the time it takes to compress with one layer, to the same compression or the same level of accuracy about Y, times something which scales like K to the minus one over alpha, because even if the R's are different for each layer, the exponent alpha remains the same. So this was the intuition, and of course we actually did this much more carefully. The first thing to look at, then, is the number of iterations it takes to converge to a good solution, as a function of the number of layers. And if I'm right, the time itself should be a power law, so in a log-log plot it should be a line. And if I'm really right, the exponent should be one over the alpha of the diffusion. So that's a very direct prediction which anybody can test: just train the same network with more layers on the same problem and see how the time behaves. You don't need to estimate information, you don't need to do anything, just measure the time of convergence to a good solution; in this case it was 0.98 bits, which is very high in this plane. And I just asked how long it takes to get to this point with one layer, with two layers, and so on.
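Under the no-overlap, "independent compression" assumption, the prediction can be summarized as:

\[
\Delta I_k(t)\;\approx\;\frac{R_k}{t^{\alpha}},
\qquad
R_k \approx \frac{R}{K}
\;\;\Longrightarrow\;\;
T(K)\;\approx\;T(1)\,K^{-1/\alpha},
\]

so on a log-log plot of convergence time against the number of layers, the slope is minus one over the diffusion exponent.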
I just plotted this in a log-log plot, and then I got a very nice fit with alpha equal to 0.55, which is not too far from the 0.5 exponent you expect from diffusion. Yes. So everything can scale with a prefactor; I don't care about prefactors, it's the exponent I worry about. And yes, I expect the exponent for whatever architecture: the way this pie is divided between the layers depends, of course, on the architecture, but since all of them decay more or less with the same exponent, I expect this to hold. And indeed we see it very nicely; I saw it in all the problems we looked at. So I said, okay, this is really working. It's a very surprising, or at least very direct, prediction: just measure the time of convergence to a good solution as a function of the number of layers. Anybody can do that. And here we got an exponent which was slightly larger than what I expected, but it looks as if this assumption of independent compression is not too bad. So then, of course, I asked my students to look at a real problem. This is MNIST; it's not written here, but it's MNIST. In this part you first verify that we indeed have a diffusion phase: this is how the weights change in each one of the layers, and you see again this very clear, if not so sharp, transition from slope one to a slope lower than one. So there is a diffusion phase; no question about it. Then we looked at the convergence time, and what we saw was, again, a very nice power law. The orange is the straight power-law fit and the blue is the data, averaged over several runs, but that doesn't really matter; it looks very clean even for one run. It's a very sharp phenomenon. But the exponent was 1.2, and 1.2 was too far from one half for me to be happy. So we thought a little bit about what's going on here. These are, by the way, really fresh results, from the last month; they're in a paper that was submitted to ICLR. I don't know if it will get in or not, but you can find it on the web. So apparently what it means is that my scaling law with K is not accurate. Instead of ΔT(K) going like K to the minus one over alpha times ΔT(1), I get a different exponent, as if there is an effective number of layers, K-tilde, where K-tilde is actually K to the power 0.5 divided by 1.2. That's what it seems: there is a different power law, still a very nice power law. But the one assumption which can really be wrong here is that the pie is divided independently. It actually means that T1 is compressing part of what T2 compresses, there's an overlap here, and T2 is compressing part of T3, and T3 part of T4. So there are overlaps in these pie pieces. And we can actually estimate those overlaps directly, by looking at the dependence between the weights themselves; there's a correlation. If the network compresses the same information again and again, still in parallel, but it's the same fraction of my space being compressed by different layers, then I'm not efficient in my compression, and I effectively get a lower number of layers, because they are not independent, they depend on each other. Okay, so that's the most dubious part of my talk that I wanted to convey to you at this point. But if I assume this overlap, then I can understand this wrong power law.
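The test itself is almost a one-liner; a sketch of the fit (the epoch counts below are placeholders, not data from the talk):

```python
# Fit the power law T(K) ~ T(1) * K^(-1/alpha) in log-log space from
# (layers, epochs-to-threshold) measurements. Placeholder numbers only.
import numpy as np

K = np.array([1, 2, 3, 4, 5, 6])                     # number of hidden layers
T = np.array([9000, 4200, 2600, 1900, 1500, 1250])   # hypothetical epoch counts

slope, _ = np.polyfit(np.log(K), np.log(T), 1)
alpha = -1.0 / slope            # T ~ K^(-1/alpha)  =>  slope = -1/alpha
print(f"log-log slope {slope:.2f}  =>  fitted alpha ~ {alpha:.2f}")
```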
Or not wrong, just much slower than I expected. By the way, if you think about computational efficiency, not just the number of iterations: when you increase the number of layers, the number of operations usually grows linearly. So this 1.2 may not be as effective as you want. If the exponent were one half, I would get a quadratic acceleration, which is very nice; with 1.2 it's a much weaker acceleration, and it may not compensate for the cost of computation. But anyway, I'm talking about the number of iterations here. Now, you may find other tricks to accelerate learning using this, or you may find a way of reducing the overlap between the layers; that's another question. Okay, yeah, you're completely right, this is exactly the next step that I want to do. So the real question is what was different between our toy problem, which seems to obey this scaling rule very nicely, and MNIST, which obeyed it, but not so nicely, at least not with the predicted exponent. Because the exponents, the slopes of these curves, are around 0.4 or 0.3 or something, and this gives not one over 0.4, which would actually be a very nice acceleration, but something much weaker. And again, I can explain this weaker exponent if the assumption that R is divided between the layers without overlaps is wrong: I don't cut my pie as you usually do with a pie; actually different layers get the same piece of pie more than once. That's not very nice, but that's exactly what happens here: they don't divide the compression independently, and the effective R actually grows slowly with the number of layers. That's my understanding of this. Now, here's my attempt to draw pies, but it's not working very well. So let me summarize what we see so far, which is actually quite striking. First of all, this is again the experiment with the small toy problem. What you see here is just a blow-up of the top of the information plane, from 0.98 to one; before, the plot went from zero to one, so this is just the upper 2%. And you see that the blue curve is the optimal information bottleneck bound, which in this case we can calculate, sometimes analytically, sometimes numerically, but directly from the distribution of the data, because we know the distribution: we designed the rule. I'll tell you something about this rule in a second. And the crosses are where the layers end up. The spread is due to the statistical spread of the 100 repetitions of the training. Remember, there's a finite-size effect, it's a very small problem, so I don't expect the points to lie exactly below the line; I can have scatter around it. Asymptotically, I know this is going to be below or on the line. But I think it very convincingly converges to the line, and not only that: all the layers converge to the line. And essentially what I see here, I'm trying to say it with this figure, but it's not very clear, is about the KL divergence, the distortion, of the last layer. What I measured before was the KL divergence between two consecutive layers, or the mutual information between two consecutive layers; one is about Y and one is about X. And you see that they sum very nicely here: the distortion of the last layer is simply the sum of the distortions.
So I have this approximate sum: the distortion of the last layer, compared to the original, is very close to the sum of the distortions of every pair of consecutive layers. This is a nice property that in information theory we sometimes call successively refinable codes; again, I don't know if you know the term. Essentially, it tells you that not only is the last layer optimal, but each of the intermediate representations is on the optimal line. Now, I don't know yet, it's an open question for me at least, whether it has to be so, or I'm just lucky, because in principle you could have the last layer on the line and the previous ones off the line. But I would argue that if the last layer is on the line, I could push the intermediate layers to the line without damaging my performance. So the only question is whether there is anything that will prevent the layers from reaching the line through this diffusion process, some dynamical obstruction that stops the diffusion. But otherwise, it looks really nice: all the layers converge to certain very specific points on the information curve. Now I want to understand these points, to really understand why they converge where they converge. And I have, how much time? 45 minutes, that's great; more than I ever had when I got to this point in my talk. Okay, so I want to argue a little bit about symmetries. Now it's about time to tell you what this rule is that I'm learning all the time. My intuition was that symmetric problems, in some sense, are easier to learn. This was the beginning of this research, and I'll tell you why. And this is actually a very symmetric problem, and that's exactly where physicists feel very happy, because I'm going to use some group theory, very much like atomic physics. So when we say symmetry, I actually mean the following: we have Q(y|x), this is my rule; for every x there is a probability of the label being one. It's a stochastic rule, but as I said, that's not important. Now I'm assuming a symmetry group, which can be discrete or continuous. For us physicists it's easier to think about continuous symmetries; we're used to Lie groups and things like that. So imagine a group of transformations g in G, where G is some symmetry group. What I mean is that g can act on X: for every pattern in my space, I can, for example, translate it, shift it; if I can do it continuously, this is a continuous symmetry group. Or I can rotate it, or rotate the pattern in three dimensions, or shrink it or expand it; I can do a lot of operations which are continuous operations on my data, on my image, on my pattern. So I'm talking about the action of g on X. This is going to take me to some other X, which is a rotated image, or a scaled image, or an image distorted in some other way. There are simple symmetries like translation, rotation, expansion, and so on, and there are very complicated symmetries which distort my image completely, like changing the illumination or changing the perspective; there are many, many ways of distorting images which are unnatural. Okay, so now, when will I say that my rule is invariant to this group? It is invariant to this group if, when I act with the group on X, I do not change the probability of the label. This is the most natural thing.
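In symbols, the invariance condition is:

\[
Q\big(y \mid g\,x\big) \;=\; Q\big(y \mid x\big)
\qquad \forall\, g \in G,\ \forall\, x,
\]

where only the conditional, the rule, is required to be invariant; the input distribution p(x) need not be.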
So essentially, if these two things are equal, then I say that the rule is invariant under the group G: the label is invariant. By the way, there are people who take this much more seriously, like Max Welling, for example, in Amsterdam, and a few others who take the symmetry group further: they actually allow the label to change, but in precisely the right way such that they still get invariance. But I don't need that at this point. So the question, the hypothesis, was: what happens if my rule is symmetric? The reason we thought about it, of course, is that convolutional neural networks, which are the most popular version of neural networks, are very good at symmetries; convolution is invariant under translation. And as Max Welling and others showed, if you have several layers of convolution, you can learn much more general symmetry groups, for example spherical symmetries, O(3), rotations and so on. You can learn them using several layers of convolutions, which is an interesting fact by itself, because some of these groups are not abelian, not commutative, and it's not obvious that you can learn them by convolution alone; but with many layers of convolutions, you can actually learn very general groups. So it seems that symmetry plays a very important role in the success of convolutional neural networks. Now I want to ask: what happens to my story if I have symmetries? If I have symmetries, you get immediate compression, because all the labels along a trajectory of the group are the same, so all the trajectories of the group collapse to one cluster, to one cell of the partition, immediately. Once you assume symmetry, there's immediate compression along the trajectories of the symmetry. Is this clear? I guess this is clear to everyone. Okay, so under this particular symmetry group, what can we say about the solutions of the information bottleneck problem? Note, I assume that my rule is symmetric: the probability of y given x is invariant to the symmetry, okay? It says something about the rule. I don't really care about the probability of x, by the way; that may not be invariant. Different symmetric patterns may not appear with the same probability, so the probability along a trajectory may not be constant. For example, among my images I have more images face-on than in profile, or from the back; most of my images are more or less face-on. So the probability of my images is not invariant to the symmetry group, but the rule, the label "this is me", is invariant, okay? That's why I look only at the conditional probability. Now, the natural question: if you accumulate everything I've told you so far, I'm saying that neural networks converge to the information bottleneck solutions. I actually argue this is true for any learning algorithm, but for stochastic gradient descent we even begin to understand how it happens, through this diffusion process, okay? Because they compress until they get stuck somewhere. So the question is: what can I say about the solutions of the information bottleneck problem in the symmetric case? So it turns out, of course: let's call X-hat an IB, an information bottleneck, representation, at a given beta, an inverse temperature.
So the encoder is
$$p(\hat{x}\mid x) \;=\; \frac{p(\hat{x})}{Z(x,\beta)}\; e^{-\beta\, D_{KL}\left[\,p(y\mid x)\,\|\,p(y\mid \hat{x})\,\right]},$$
where Z(x, β) is the normalization. I denote it by x̂ because at this point I want to separate it from the representation of the network, which I call T — just to clarify the notation. So x̂ for me is a representation that lies on the information bottleneck solution, that obeys these equations. And remember, we had two equations: this one, the encoder, and the decoder
$$p(y\mid \hat{x}) \;=\; \sum_{x} p(y\mid x)\, p(x\mid \hat{x}),$$
which is what I call the Bayes-optimal decoder. The information bottleneck solution satisfies these two equations simultaneously; every point on the blue curve is made out of this pair, they are self-consistent. (The decoder is written here up to Bayes' rule, but using Bayes' rule it is exactly the same as what I wrote.) You just iterate these two equations, and we can prove — I didn't do it today — that the iterations converge very quickly to a solution at a given β, for every positive β. Of course, below the lowest critical β you get only the trivial solution, but that's a side issue. By the way, I didn't have time to discuss the rationale of the information bottleneck, because most of you are probably not statisticians, but it is a very natural extension of the notion of a minimal sufficient statistic, which some of you may know: the smallest possible compression of my sample, of my data, that keeps all the information about the parameters of a distribution. That is how we originally thought about it, as extending minimal sufficient statistics. Now this particular solution has a lot of interesting properties. One thing you can show easily, just by looking at these equations, is that if my rule is symmetric — if it has a symmetry group underneath — then the information bottleneck encoders are also symmetric. So I have the symmetry along the entire curve. Think about it: this is a maximum-entropy-like solution. Minimizing information is very much like maximizing entropy; that's why it has this Gibbs-like form, this exponential in the distortion, very much like a Gibbs distribution for the physicists here. And you know that maximum entropy solutions always inherit the symmetry of the underlying distribution — otherwise I could mix two non-symmetric solutions and get higher entropy. A maximum entropy distribution always has the maximal symmetry that the underlying distribution has; minimal information, which is the bottleneck solution, has the full symmetry of p(x, y). So all the x̂'s, all the solutions of the bottleneck problem, are symmetric. Okay, so that's nice. What does it mean? It means that if there is a symmetry breaking, or a symmetry reduction, along the curve, it has to go from one symmetric solution to another symmetric solution. So think a little bit about this.
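Here is a minimal numpy sketch of iterating those two equations — my own rendering for a small discrete problem, with all names assumed; it is not the lecture's code:

```python
import numpy as np

def ib_iterate(p_x, p_y_given_x, beta, n_clusters, n_iter=300, seed=0):
    """Iterate the two self-consistent IB equations at inverse temperature beta.

    encoder:  p(xh|x) = p(xh)/Z(x,beta) * exp(-beta * KL[p(y|x) || p(y|xh)])
    decoder:  p(y|xh) = sum_x p(y|x) p(x|xh)       (the Bayes-optimal decoder)
    """
    rng = np.random.default_rng(seed)
    enc = rng.random((len(p_x), n_clusters))      # random soft initialization
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xh = p_x @ enc                                      # marginal p(xh)
        p_x_given_xh = enc * p_x[:, None] / (p_xh + 1e-12)    # Bayes' rule
        dec = p_x_given_xh.T @ p_y_given_x                    # decoder p(y|xh)
        # KL[p(y|x) || p(y|xh)] for every pair (x, xh)
        p, q = p_y_given_x[:, None, :] + 1e-12, dec[None, :, :] + 1e-12
        kl = np.sum(p * np.log(p / q), axis=-1)
        logits = np.log(p_xh + 1e-12) - beta * kl
        enc = np.exp(logits - logits.max(axis=1, keepdims=True))
        enc /= enc.sum(axis=1, keepdims=True)     # the normalization Z(x,beta)
    return enc, dec
```

On a small joint distribution this converges in a few dozen iterations, and below the lowest critical β every cluster collapses onto the same decoder — the trivial solution mentioned above.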
Now, there is a very simple symmetry underneath my rule: the O(3) symmetry. All my 12 points are taken on a sphere in three dimensions, and the rule was essentially the ratio of the quadrupole to the dipole moments. So it's a very nice rule: the threshold, whether the label is one or zero, depends only on the ratio of the quadrupole and dipole moments of these points. This is of course physics intuition, because I know the ratio of quadrupole to dipole moment is invariant under rotation; there is a built-in symmetry to the problem, the O(3) symmetry. And I knew all along that something nice would come out of this symmetry — now we can actually cash in on that promise. So first of all, I know that all the solutions keep the symmetry, but at the same time I know they lose information. How can you have a symmetric solution with less information about the problem? Here I'm going to use a little bit of group theory. If I expand the rule in some basis of functions — in this case the natural thing is to expand in spherical harmonics, because it's a pattern on a sphere with these symmetries — then the objects that stay invariant under the symmetry are precisely the subspaces of this expansion that belong to irreducible representations of the group. You all know this; anybody who learned electromagnetism should remember it: there's the dipole term, then the quadrupole term, and so on. So we have the spherical harmonics $Y_{\ell m}(\theta, \phi)$; for each $\ell$ there are $2\ell+1$ functions, with $m$ running over the integers from $-\ell$ to $\ell$, and these functions span an irreducible representation of the group. If I rotate my pattern, the magnitude of the expansion within each irreducible representation stays invariant: the coefficients rotate within their representation, but not outside of it, which means the expansion has a block-diagonal form. Does this ring a bell for some of you? I hope so. Okay, so that's what I did: we used a rule with this explicit rotation invariance, and my hope was that the layers would converge to rotation-invariant representations. And how can that be? In this case the solutions must stay invariant while losing information, and the only way to lose information and remain invariant is to lose a complete irreducible representation — either you have it or you don't. That's an interesting observation: there is a very nice correspondence between symmetric solutions of the problem and group-theoretic objects, the irreducible representations of the symmetry group. This holds, of course, if the rule is exactly invariant; if it is only approximately invariant, this is an approximate statement, which means I may lose more information than I want. Okay, so let me show you. This is another figure, showing where the layers converge as a function of the number of training examples, in the information plane. What you see is that they converge to very specific lines: as you increase the data, each layer moves along a very specific line here. And these lines represent changes in the representation which are very concrete and very discrete.
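Here is a small numerical illustration of that block structure — the point pattern is random stand-in data, and the per-ℓ "power" construction is my own, chosen because a rotation acts within each ℓ-block by a unitary (Wigner) matrix:

```python
import numpy as np
from scipy.special import sph_harm

def per_l_power(points, l_max=4):
    """Per-l power spectrum of a point pattern on the sphere.

    c_{lm} = sum_i Y_{lm}(theta_i, phi_i). A rotation mixes the c_{lm} only
    inside each l-block (one irreducible representation), unitarily, so
    S_l = sum_m |c_{lm}|^2 is rotation invariant: l=1 is the dipole sector,
    l=2 the quadrupole sector, and so on.
    """
    x, y, z = points.T
    theta = np.arctan2(y, x)                 # azimuth (scipy's first angle)
    phi = np.arccos(np.clip(z, -1, 1))       # polar angle (second angle)
    return np.array([
        sum(abs(sph_harm(m, l, theta, phi).sum()) ** 2
            for m in range(-l, l + 1))
        for l in range(l_max + 1)
    ])

rng = np.random.default_rng(1)
pts = rng.normal(size=(12, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # 12 points on the sphere

g, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # random orthogonal map
print(np.allclose(per_l_power(pts), per_l_power(pts @ g.T)))   # True
```

Losing "a complete irreducible representation" then means zeroing out one whole $S_\ell$ block, never part of one.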
So now I want to go back to the questions people asked here — I don't know if they're still here. In my first talk you asked: how is it that the data gets separated into only two groups as you move from one layer to another? So let's look at this picture, which some of you may have seen; it takes a little explanation. What you see here is a t-SNE projection: a nonlinear embedding of a layer — in this case 10- or 12-dimensional, not very large — projected down to two dimensions in a way that preserves local distances, so it preserves the local topology of the space. What you see are those 4,096 points embedded in two dimensions, colored black if the network labels them one, red if the network labels them minus one or zero, and yellow in between. So this is the representation, the topology, of the first hidden layer, and what I show you now are snapshots along the training iterations. You start training the network and watch what happens: the points move around, but what really matters is that they get darker, red or black. There is more and more separation of the colors — the network is making better and better predictions — and yet they look very ugly in terms of separation. At least in two dimensions there is no simple way of separating the black from the red. They tend to cluster — you see more red here and more black here and so on — but they don't really separate. The topology of this first-layer representation is still very, very mixed; very bad, nothing I could separate linearly. And eventually, after some 10,000 iterations, you get this mixed salt-and-pepper picture, as we call it, in the plane, and it still looks bad. Now let's look at the same figure at the fourth hidden layer instead of the first. It's exactly the same training, all 4,096 points, and you see right from the beginning that it's almost separated: red moves to the right and black to the left very clearly. But you also see the phenomenon of dimensionality reduction. Remember what I told you about the diffusion: each layer keeps only a few dimensions of the problem, and eventually the whole thing collapses onto lines, because this is a one-dimensional problem — really there are two important dimensions, but I look only at their ratio. So it's a one-dimensional problem, but the manifold curves all over the space. You see these two dragons dance for a long time, but eventually red separates from black very clearly, and you can put a line somewhere here that predicts the label very accurately. And at the end, as you see, it collapses onto lines; all the dimensionality has collapsed. This is precisely the compression phenomenon, which reduces the dimensionality of the problem. I really like this figure. We have a lot of material like this, and it's something anybody could do: just take the hidden layers, project them to two dimensions, and watch. Nobody did it for some reason, I don't know why. But you really see that the topology of the problem changes from one layer to the next — and by topology I mean the neighborhoods: what is close to me in this layer is not going to be close to me in the other layers.
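Since the claim is that anybody could do it, here is roughly all it takes — a sketch with stand-in data, where `activations` and `p_label` are placeholders for one layer's responses and the network's current output probabilities:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
activations = rng.normal(size=(4096, 12))   # one hidden layer, one snapshot
p_label = rng.random(4096)                  # network's current p(y=1|x)

# Nonlinear embedding to 2-D that preserves local neighborhoods.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(activations)

# Color by the network's own prediction (stand-in for black/red/yellow).
plt.scatter(emb[:, 0], emb[:, 1], c=p_label, cmap="coolwarm", s=4)
plt.title("t-SNE of one hidden layer at one training snapshot")
plt.show()
```

Repeating this per layer and per training snapshot reproduces the kind of movie described above.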
So it's not only compression; it's a complete change of the topology of the structure. Yes, please. [Question from the audience.] Eventually you predict the label from the last layer: in the last layer the two label groups are separated, you put a line there somewhere and you separate them. That's the trick; otherwise it wouldn't work. But the interesting question is how this happens — what causes these topological transitions? Topological transitions cannot happen continuously: continuous functions, by definition, preserve neighborhoods, nearby points go to nearby points. Here I get very dramatic discontinuities in the representation: things which are close in the first layer end up far away, and vice versa. And on top of that I have this collapse of dimensions, which matters because it is essentially the compression. So what can cause topological transitions? Any physicist will tell you: these are phase transitions. (Sorry, this displays badly — let me fix it. There, somewhat better.) Okay, so what signals a phase transition in this problem? The phase transitions will connect back to the symmetries at the end of my story, but for now I'm taking a different path. If you look at the information about X and the information about Y, not plotted one against the other as I've done so far, but each against β — against the temperature, whose inverse is the slope of the information curve — then you see these very interesting points, which coincide on the two graphs, which is why you don't see them when you plot one quantity against the other. And what are these points? Discontinuities of the derivative — what we call cusps. Remember that my functional has the form of a free energy, just like entropy versus energy, but here it is compression versus accuracy, versus prediction; the analogy is exact. A discontinuity of the first derivative here is a discontinuity of the second derivative of the free energy — precisely a second-order phase transition, exactly what you'd expect. So these cusps, these changes in the slope of the curve, correspond to second-order phase transitions in my representation. And I claim we actually understand these phase transitions very well. Physically, in statistical physics, how do you find phase transitions? You do a stability analysis, or something like it: you look for the point where your free energy has more than one solution. You minimize the free energy and look at the nature of the solutions; if there is more than one, that is the signature of a phase transition. It can be a first-order transition, where you actually jump from one minimum to the other, with the whole story about metastability and so on; but at a second-order transition you simply get two solutions appearing at the same point. Okay. Now I want to identify the phase transitions of this problem in an explicit and analytic way.
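For reference, the objects in play, in standard information bottleneck notation (my transcription of the spoken math):

$$\mathcal{F}_\beta \;=\; I(X;\hat X)\;-\;\beta\, I(\hat X;Y), \qquad \frac{1}{\beta} \;=\; \left.\frac{dI(\hat X;Y)}{dI(X;\hat X)}\right|_{\text{IB curve}} .$$

A cusp — a jump in $dI(X;\hat X)/d\beta$ — is a discontinuity in the second derivative of $\mathcal{F}_\beta$, which is what makes it a second-order transition.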
So here is what we do. This is work done mostly with Noga Zaslavsky, and a little with another student, Shlomi Agmon, though I started it a long time ago — it's good to have students who like it. Essentially, you take these bottleneck equations. The first one is the encoder equation at a given β; take its log, and it looks like the log of some function that is independent of the representation, plus β times the KL term, which does depend on the representation. So you can take the logarithmic derivative with respect to the representation. What do I mean by this? In the context of a neural network, imagine perturbing the values of an intermediate layer slightly and seeing what happens to the solution; here I perturb the representation in a more abstract sense. It's just taking derivatives — physicists do it without thinking too much about whether they can, or what it means; that's what I did. And you see that the logarithmic derivative of the encoder is a linear combination, weighted by the true rule, of the logarithmic derivative of the decoder. Very nice — that's essentially the bottleneck self-consistency. But what is really nice is that you can now take the decoder, this red equation, and do the same trick: take its logarithmic derivative, and you get another linear combination, this one depending on the encoder, of the logarithmic derivative of the encoder. So I have two equations: the logarithmic derivative of the encoder is a linear function of the logarithmic derivative of the decoder, and vice versa. What would you do? Combine them and get a linear eigenvalue equation — that's what anybody would do, I hope. So that's what we did: we combined these two matrices. The matrices are written here. One of them has the dimensionality of the input, which is bad because the input is high-dimensional; the other has the dimensionality of the output. There is a sum over all possible inputs here, but that I can approximate with a sample — whenever you see a sum over inputs, remember you can sample it and still get very accurate results. So you can replace this with a finite sample, though I won't do it here. Combining the equations, you get these two eigenvalue problems: the unit matrix minus β times this matrix, acting on the logarithmic derivative of the encoder, is zero; and the same holds for the logarithmic derivative of the decoder. (Sorry — this one is the decoder, this one the encoder.) Now, what characterizes a phase transition? A phase transition means that at the same β, the same temperature, two different phases coexist, just like water and ice at zero degrees. So I'm looking for values of β at which two solutions coexist, which means I'm looking for a degenerate solution: if there is another solution at this β, there must be a nontrivial eigenvector. That's exactly what we looked for.
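Here is my reconstruction of that calculation as a sketch. The exact matrices are on slides we cannot see, so the combination below is derived from the two logarithmic-derivative relations as stated and should be read as an assumption, not the lecture's formula:

```python
import numpy as np

def critical_betas(p_y_given_x, p_x_given_xh, p_y_given_xh):
    """Candidate critical betas from the linearized self-consistent equations.

    Combining the two logarithmic derivatives gives, for each cluster xh,
    the condition (I - beta * C) v = 0 with
        C[x, x'] = p(x'|xh) * sum_y p(y|x) p(y|x') / p(y|xh).
    Rows of C sum to 1, so lambda = 1 (i.e. beta = 1) is the trivial
    normalization mode; the next eigenvalue gives the first genuine
    degeneracy, beta_c = 1 / lambda_2, where a second solution can coexist.
    """
    betas = []
    for k in range(p_x_given_xh.shape[1]):
        ratio = p_y_given_x / (p_y_given_xh[k][None, :] + 1e-12)
        c = (p_y_given_x @ ratio.T) * p_x_given_xh[:, k][None, :]
        lam = np.sort(np.real(np.linalg.eigvals(c)))[::-1]
        betas.append(1.0 / max(lam[1], 1e-12))
    return np.array(betas)
```

Feeding it the converged encoder and decoder of the `ib_iterate` sketch above (with p(x|x̂) obtained by Bayes' rule) gives, per cluster, the β at which that cluster is ready to split.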
And what we see — let me go back to the previous slide. Essentially it means that as you walk along the information curve, this I_Y versus I_X (just shorthand notation), you get these critical points, which are precisely where the eigenvalue equation acquires a nontrivial eigenvector: at each critical β there is another solution to the problem. In terms of the representation it means there is a pitchfork bifurcation — actually even crazier things can happen, but mostly pitchfork bifurcations. At this critical β, the first one, one of the cells of my partition splits into two, at least one; at another critical β there is another split, and so on. If I carry out this analysis — not too hard for small problems; for large problems it's a real difficulty, but never mind — then for small problems, where I can actually sum over all the x's in my problem, I can explicitly calculate the full bifurcation diagram and all the critical β's of my representation. Okay, so why is this interesting for the neural network problem? Because if the layers of a neural network indeed converge to the information curve — and as I was arguing before, if they have a lot of information about the output, they must be on the curve, there's no other way — then the encoder and decoder of the network, not just of my abstract problem, must satisfy those equations. That's a big statement. But if it is true, then what these phase transitions do is change the topology, because going backward, reducing the information, I'm merging those splits: merging cells which used to be separate — before the transition two cells, after it one. That's nice, because it can explain exactly my topology change: if I merge two things which are far away into one, I change the topology dramatically. So topology changes of the kind I demonstrated can happen through a cascade of such phase transitions. That's one fact. The other fact is that if there is symmetry, then using group theory again — I don't have time for the details — you can show the transitions coincide: the same split occurs in many, many different places at the same β. This is something I don't have slides for; I wish I did. If you remember your atomic physics — those of you who studied quantum mechanics at the level of the energy levels of, say, the hydrogen atom — you know the degeneracy there comes from the full symmetry. If I break the symmetry, say by applying a magnetic field, I split the levels: the Zeeman effect. What happens there? You break the symmetry of the problem by adding a field along one direction, and then solutions which were degenerate under the symmetry — the ℓ = 1, m = +1 and m = −1 components of the dipole, say — split apart. But as long as the symmetry holds, all those different m's stay degenerate: they appear together, at the same β in my language. Exactly the same line of argument applies to my symmetric rules: if there is symmetry in my problem, I expect degenerate splits — all the splits associated with the dipole moment, or with the quadrupole moment, or whatever, occur together at the same β.
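One can at least see this degeneracy numerically by scanning β and counting how many genuinely distinct decoders survive — again my own construction, reusing the `ib_iterate` sketch from before:

```python
import numpy as np

def distinct_decoders(dec, tol=1e-2):
    """Count distinct rows p(y|xh): clusters whose decoders still coincide
    have not split yet at this beta."""
    kept = []
    for row in dec:
        if all(np.abs(row - r).max() > tol for r in kept):
            kept.append(row)
    return len(kept)

def bifurcation_scan(p_x, p_y_given_x, betas, n_clusters=16):
    """Effective number of clusters as a function of beta; jumps are the
    splits of the bifurcation tree. For a symmetric rule the prediction is
    that several splits land on the same beta: a degenerate jump."""
    return np.array([
        distinct_decoders(ib_iterate(p_x, p_y_given_x, b, n_clusters)[1])
        for b in betas
    ])
```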
So what I expect to see is a highly degenerate bifurcation diagram when I look at the bifurcation tree of the bottleneck problem: all the splits should occur exactly at the places where I lose one of the irreducible representations of the group. I know this sounds like high-energy physics, but it's a very simple group-theoretical argument. Essentially, if this is true, the symmetry should characterize the different topologies of the layer representations. Okay. This, by the way, is the bifurcation diagram we calculated from only 5% of the data, using the sampling I mentioned before, for this particular symmetric problem I've been showing you all along. And you see that although it's very approximate, the splits are more or less coinciding: you have the first split at zero, then another split at about one bit, and then another coincidence — very many splits together around 1.5 bits — and so on. It's a noisy picture, because I'm using a very small part of the data and my algorithm knew nothing about the symmetry. But I argue — and I can prove — that with all the data you would see a perfectly symmetric tree, meaning that the splits occur exactly when I lose one of the irreducible representations of the group. Okay, so now I come to the most speculative argument of these lectures, but I'm going to say it out loud — this is still work in progress, but I really like it. So what happens here? First, I argue that the different layers of the network are pushed along these very discrete steps into explicitly symmetric representations: they change the symmetry, which means they change the topology — the number of irreducible representations in which my rule is expanded. So the prediction is that different layers — say this blue line and that blue line — lie in different phases of the problem. But it's not only that. If I'm right, then symmetric problems get this very clean separation of the compression, because each layer compresses away some very specific irreducible representations. For symmetric rules I expect these overlaps to be small; for non-symmetric rules the whole tree is a mess — phase transitions all over the place, maybe approximately symmetric, but the bifurcations split apart and overlaps are unavoidable. First of all, this gives immediate insight into why I see a slope close to one half in the symmetric case and not so much in the non-symmetric case. But you also see that symmetric problems should be easier to learn, because each layer compresses a different representation of the symmetry separately. Now, there is another very interesting aspect of these phase transitions. If I'm right, then those phase transitions produce what we call a free-energy barrier — a change of entropy. Any phase transition is an entropy barrier, or an energy barrier, or both: either I mix things and lose entropy, or I jump over energy. In the free energy it looks like a barrier. Now, barriers like this — what do they do to diffusion processes?
Anybody? This is slightly advanced statistical physics, but there is a phenomenon we call critical slowing down: when you get close to a phase transition, the surface is rough in some sense — there is this barrier — and diffusion slows down. So maybe we are seeing two things together here. First, the diffusion gets stuck near the phase transitions, and each layer sits in a different phase of the symmetry. This is precisely what we see in the symmetric problem — precisely. Of course, for non-symmetric problems I don't know the symmetry, I don't even know how to treat it, but for the symmetric case this gives a very interesting insight into what each layer represents. And there is more. If you look at this eigenvalue problem — I know this last part of the talk is a little steep, but I want to say it anyway — the logarithmic derivatives of the encoder have the form: the log of the probability at x minus the log at x plus Δx, with Δx taken to zero. So they have the nature of a difference of log-probabilities: a log-likelihood ratio, the log of a ratio of two probability distributions. And this log-likelihood ratio is precisely the quantity that tells me — look at one bifurcation here, say this one — whether I came from this branch of the problem or from that branch. Remember, each of these branches is different data: a pattern may come from here or from there, and at this point I merge them, saying the distinction no longer matters. But if I want to keep that information somewhere — which branch did it come from? — what I should do is what you call a likelihood-ratio test: is it coming from here or from there? This is hypothesis testing, a binary classification problem, a very simple one. So if I put these particular weights in my layer, there will be a neuron that tells me whether I came from this side or that side. My argument here — and I know it's very quick — is that the weights of the k-th layer essentially encode all the phase transitions between the previous layer and this one. Every split of information between those two layers should be answered by the layer if I don't want to lose information: there must be a neuron — actually it can be a linear combination of neurons — that encodes this particular bit: did I come from this branch or that branch? Now look at this bifurcation diagram. It means that each pattern ends up lying close to a leaf. In the regime of very high compression — remember, this goes all the way to 12 bits; in my story I showed only up to 4.5, just the very beginning of the tree — eventually the tree ends with a leaf at every pattern. Now I start merging them: that's the procedure I've been describing all along. I compress, I form these groups, and I merge them according to the loss of information — the small splits first, those with tiny loss of information, and only later the large ones.
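In code, the claimed "split neuron" is just the Neyman–Pearson statistic for that binary question — a hypothetical rendering, for a discrete x:

```python
import numpy as np

def split_neuron(p_x_given_a, p_x_given_b, eps=1e-12):
    """Weights answering one merged split: did pattern x come from branch A
    or branch B of this bifurcation? The optimal statistic is the
    log-likelihood ratio w(x) = log p(x|A) - log p(x|B); thresholding w(x)
    at 0 is the equal-prior likelihood-ratio (Neyman-Pearson) test.
    """
    return np.log(p_x_given_a + eps) - np.log(p_x_given_b + eps)

# The conjecture in this notation: a converged layer's weights are, up to
# reshuffling, a direct sum of such w's, one per split merged between this
# layer and the previous one.
```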
So if I compress all the way down to the one-bit region here, there are essentially only two branches: all the patterns labeled one go to one branch, and all the patterns labeled zero or minus one go to the other. That is the perfect separation — and that's why I have to compress: before that point I still carry a lot of information I don't really need, and the whole point is to compress until eventually only two branches remain. This is the separation of black and red, if you like. All the earlier splits appear little by little, and they are encoded by those weights that tell me which side of each split I came from. So my prediction is that the weights at convergence are some sort of direct sum of these logarithmic derivatives of the encoder with respect to changes in the previous layer — a rough sum over the splits — and remember, all of this can be highly reshuffled and so on. So that almost answers the question of what the layers represent: they represent these binary questions about the data, from one level of compression to the next. And if I have symmetry, moving through these splits costs a lot of time, because of the slowing down of the diffusion — and that's why the layers are where they are: they get stuck somewhere and don't move. Why do they stop where they stop? Because something stops them, and that something can be a change in the topology of the representation — a phase transition on this curve. Okay, so with this I really want to sum up, because I'm out of time, and then take questions. First of all, I showed you that we are beginning to understand what happens in deep neural networks, built out of the three components I elaborated here: the rethinking of learning theory, the language of information theory and the information bottleneck bound, and the importance of fluctuations. And just before I summarize, I must say something about the opposition. There are papers out there that attack us — attack me, almost personally — on many aspects of the theory, and I cannot finish without addressing them. First, there is the paper by Saxe et al. that appeared at the last ICLR, which attacks mainly the way we estimate mutual information, saying that with different binning you get different results. Yes, that is true; as you noticed, in my talks here I said nothing about the binning — it is really not important. On the other hand, of course, you can do all the wrong things and get a very poor estimate of the information if you insist on doing wrong things. They also claim that all the compression is due to the saturation of the nonlinearity. I think I showed you that compression happens with other nonlinearities too; in the theorem I reviewed today you see that the compression comes from this Gaussian channel in the diffusion and has nothing to do with the nonlinearity — the nonlinearity can enhance it, but there is compression without saturation, and it agrees very nicely with what we see about the time of convergence.
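To see what the binning dispute is about, here is the naive estimator in a few lines — with stand-in activations; when each sample is a distinct input and the network is deterministic, I(T;X) reduces to H(T):

```python
import numpy as np

def binned_entropy_bits(t, n_bins):
    """Naive binned estimate of H(T) for layer activations t of shape (n, d):
    discretize each unit into n_bins equal bins and count the resulting
    discrete symbols. With one distinct input per sample this is I(T;X),
    the x-axis of the information plane."""
    edges = np.linspace(t.min(), t.max(), n_bins + 1)[1:-1]
    symbols = np.digitize(t, edges)                   # (n, d) integer codes
    _, counts = np.unique(symbols, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(3)
layer = np.tanh(rng.normal(size=(4096, 10)))          # stand-in activations
for n_bins in (8, 30, 100):                           # the disputed choice
    print(n_bins, round(binned_entropy_bits(layer, n_bins), 2))
```

The printed numbers do move with n_bins — that is Saxe et al.'s observation; the position defended here is that the qualitative two-phase trajectory survives any reasonable choice.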
Now, there is a very interesting issue about whether compression is necessary for generalization. This has to do with RevNets and ResNets and all these architectures that essentially don't compress, but keep all the information through skip connections and a reconstruction term added to the error, which lets you reconstruct the previous layer. In the most extreme version, called i-RevNet — which appeared at one of the recent conferences, NIPS I think — they actually use Hamiltonian-like dynamics in some sense, separating the variables in each layer into those that predict and those that allow reconstruction; they look very much like momenta and coordinates in a Hamiltonian system. And I can show you that this dynamics is extremely unstable: you get diverging trajectories in the irrelevant dimensions and converging trajectories in the relevant ones. So if you add a little noise to the problem, you cannot reverse it — it is very sensitive to noise — and what is robust to noise is precisely what I call the compressed representation. That is my answer to this objection; the others are, at this point, redundant. So I have to stop, and I thank you very much.