All right, good morning. It's really a pleasure to be here again; I was here just several months ago, actually it was last year already. I'm going to tell you the story from another, somewhat different view of deep neural networks than the one we heard from Mark in the morning, which was more of a computer-science type of talk. This view is very much based on the principles of statistical mechanics and information theory. Let me just ask: how many of you are familiar with the basic concepts of information theory, like mutual information? How many of you know the formula for the Gaussian channel capacity? OK, so I'm just going to use it, and since you know it, that will save me a lot of time.

I want to start with things which are more or less familiar and get to some new results very quickly. As you know, the machines we heard about in the morning, deep neural networks, layered linear threshold gates, have really revolutionized not only artificial intelligence but technology in general. I think it is no exaggeration to say that, in some sense, it's a revolution that is turning the dreams of AI, things that people like me used only to read about as something that would happen in the far future (continuous speech recognition, object recognition, driving autonomous cars, and who knows what), into realities within our lifetime. To me this is a big surprise, actually.

What is really nice about these machines is that they are not only inspired by biology in some sense, which is good for people who actually study real brains and for the connection with neuroscience, but they are also large scale in a very profound sense. What really makes them work is the fact that they are big: the data is big, the scale of the problem is big, and especially the scale of the input. In some sense this is ideal for the way of thinking of a statistical physicist, something some of us started already in the 80s. As you know, at that time the statistical physics of neural networks was a very marginal field that had absolutely no impact on computer science; maybe some physicists were excited about it, but it went nowhere as far as influencing the field. The nice thing for people like me is that, some 30 years or more later, these ideas are coming back, and coming back in a very core and fundamental sense.

So we are in this deep learning age, or deep learning revolution, and of course the big mystery is why these things, built from very simple linear threshold gates, work so beautifully. It is one of the biggest technological riddles of our time to understand what is so special about these particular architectures. So, unlike Mark, I'm actually going to focus on what we really call deep learning: many layers. What is the secret of the fact that if we train by what we call error back-propagation, say to classify an image of a dog, giving the network only a very simple one-bit label (is there a dog in the image or not), these huge parametric machines eventually manage to capture the task very successfully?
I mean, more or less at human-level performance; essentially they do this very well. The point I want to focus on is: what is the role of many layers, more than two or three or four or five? What do we actually gain by adding more layers? I want to focus on one particular view, something I've been working on for many years but which has turned out to be important again, and this is the question of information: mutual information, Shannon's mutual information. How does it flow through the network?

The idea is actually very simple. In these deep neural networks, let the input be denoted by X. This can be the pixels of an image, usually a high-dimensional, high-entropy variable if you look over all images of even just dogs. And usually there is a very simple label, which can be just one bit (is it a dog or is it me in the image?) or one of some small, finite set of categories. Of course, this is not the only way neural networks are used today; there is a whole bunch of other applications, like autoencoders and many other settings where the output is also high-dimensional. We are not going to talk about those now, although we have the beginnings of new results there as well.

Now, what actually happens in the network is that, once the weights are fixed, there is a Markov chain, which I denote here by h1, h2, and so on; those are the hidden layers. So this is a Markov chain of representations: the first one is calculated directly from the input, the second one directly from the previous one, and so on. Eventually, at the end of this Markov chain, I can calculate an approximation to the desired label Y, which I call Y-hat. This is the linearly separable last layer, where relatively easily I can generate a good approximate label.
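To make the picture concrete, here is a minimal sketch (not the speaker's actual network; the layer sizes, the sign nonlinearity, and the random weights are illustrative assumptions) of a feedforward net viewed as a Markov chain of representations:

```python
import numpy as np

# A toy feedforward net seen as a Markov chain  Y -> X -> h1 -> h2 -> h3 -> Y_hat:
# each representation is computed only from the previous one.
rng = np.random.default_rng(0)
sizes = [12, 10, 8, 6, 1]        # input X, hidden h1..h3, output Y_hat (assumed)
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Return the chain of representations [h1, h2, h3, y_hat]."""
    reps, h = [], x
    for W in Ws:
        h = np.sign(W @ h)       # linear threshold gate, as in the talk
        reps.append(h)
    return reps

x = rng.choice([-1.0, 1.0], size=sizes[0])   # a 12-bit input, as in the toy problem
chain = forward(x)               # h_k depends on x only through h_{k-1}
```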
What is interesting about this Markov chain is that something happens, from layer to layer, to the representation of the data. It takes it from a form which is usually highly non-linearly separable (in very high dimension almost everything is eventually linearly separable, but it is hard to separate with simple linear decision functions) to one where the topology of the representation changes significantly from layer to layer. Things which are close in the Euclidean topology of the original input become very far apart in the Euclidean topology of the last hidden layer, and vice versa: things which are very far apart are brought together. I'm really interested in the mathematics behind these topological transformations of the data.

Once we see a Markov chain and we think about variables in high dimension, it is very natural, at least to some of us, to think about it in terms of mutual information. How much information (and I'm going to be more precise about this) is actually preserved in a layer about the input, and how much is preserved about the output? We can think both about the desired label Y, which is a function of the data only (that's why it sits to the left of my input), and about the actual output of the network, this Y-hat. And I can compute the mutual information between any one of those h's, each of which is a single random variable, and both the label and the input.

My main claim, which is still highly controversial, mainly because we haven't published everything that we know, is that these two numbers, the mutual information about the output and the mutual information about the input, really tell us the gist of the story. In a sense they are like two order parameters that capture all the complexity of the problem. That's my main claim. The way I think about it is that there is a cascade of filters, just like leaky pipes: when I flow the information through the layers, at each layer I lose some information about the input while enhancing the information about the label. So think about it as a cascade of non-linear filters, or just leaky tubes, and the question is really what governs this flow of information through the layers.

Just to make sure that we are all on the same page, I'm going to use some well-known mathematical functions. The KL divergence between two distributions is just the expected log-likelihood ratio between them. Of course, I have to assume that P is absolutely continuous with respect to Q (I have to say this after the previous talk), so the KL divergence is bounded, and we assume that all the distributions, just for simplicity of analysis, are bounded away from zero and one. So we are, in some sense, in the interior of the simplex. This makes the analysis simpler; it's not essential for anything.
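As a quick reference, here is a minimal sketch of the KL divergence for discrete distributions (the example distributions are made up; the positivity assumption mirrors the one in the talk):

```python
import numpy as np

# D_KL(P || Q) = sum_x p(x) * log2( p(x) / q(x) ), in bits.
# Assumes both distributions are bounded away from zero, as in the talk.
def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2(p / q)))

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(kl_divergence(p, q))   # non-negative; zero only when p == q
```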
Now, once I have this non-negative, very important function, the KL divergence, the most natural quantity to build from it when you talk about two variables is the mutual information, which is just the KL divergence between the joint distribution and the product of the marginals. You can think of it as a measure of dependence, but it's much more than that: this function dominates the large-scale behavior of many problems, including source coding and channel coding in information theory, and many other things. I'm going to use the fact that this mutual information is really counting, in some sense, the number of representations. What is really important to remember is that it is the difference between the entropy, the uncertainty, of the variable X and the uncertainty of X given Y. So it is the uncertainty, measured in bits, in the log number of binary questions, that is removed from X when I know Y. It is zero when nothing is removed, when there is no dependence between X and Y, and it is bounded by the entropy of either X or Y.

The important thing for understanding my line of thinking is what's known as the data processing inequality, or DPI, which says essentially that when you move along a Markov chain, information can only go down; you don't gain from it. This is not true for entropy (entropy can increase if I add noise or other things to the layers), but it is true in general: you can't gain information about X when you move from layer to layer, and, in my language, you can't gain information about the desired label Y either. An immediate consequence is that a one-to-one transformation doesn't reduce information. This is actually a big computational issue, because it means I can, say, encrypt my data with a computationally very hard one-to-one transformation and not lose any information. So mutual information will not tell me anything about the computational hardness of things. The fact that two variables have the same mutual information doesn't mean that I know how to translate one into the other; I may not be able to see this mutual information, because I would have to break a very hard code to actually find the one-to-one transformation. So equality of mutual information is, computationally, a very weak statement.

Now, when I apply this DPI to deep networks, you immediately get a formal chain of inequalities: the information across the layers goes down, and the information about the desired label Y, which sits to the left of everything here, also goes down; one is bounded by the entropy of X, and the other is bounded by the mutual information of X and Y. I'm going to look at this chain of inequalities, which I call the information path of the network, and I'm going to measure it during training, during the evolution of the weights. Training is usually done by something like stochastic gradient descent, some sort of local gradient algorithm which is noisy; the noise is essential to some of the things I'm going to say, but not essential in general, and you can use other optimization techniques: as long as we talk about local optimization, small changes of the weights, most of what I'm going to say remains correct.
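Here is a small sketch of the mutual information computed from a discrete joint distribution, together with a toy check of the data processing inequality (the joint table and the coarsening map are made-up examples):

```python
import numpy as np

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) ), in bits.
def mutual_information(pxy):
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])))

pxy = np.array([[0.25, 0.05],
                [0.05, 0.25],
                [0.20, 0.20]])            # toy joint p(x, y)

# T = t(X) merges the last two x-cells: a processing step in the chain Y-X-T,
# so the DPI says I(T;Y) <= I(X;Y).
t_of_x = [0, 1, 1]
pty = np.zeros((2, pxy.shape[1]))
for x, t in enumerate(t_of_x):
    pty[t] += pxy[x]

print(mutual_information(pxy), ">=", mutual_information(pty))  # DPI holds
```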
And then, of course, one thing to keep in mind is that if you think about these linear threshold functions (say a hard threshold, the sign function, just for simplicity), then essentially each layer induces some sort of partition of the input into cells. The first hidden layer is really throwing a lot of essentially random hyperplanes, and this gives a very fine partition of the input; but as I move along the layers, this partition becomes coarser and coarser, in some sense. So it makes perfect sense to ask how much information remains in this partition of X, and remember that what induces this partition is really the Euclidean topology of the input, which is not always the natural one; there is some sort of underlying geometry underneath the input itself, which is usually not really the Euclidean topology.

So I'm actually going to study the evolution of these partitions, if you want, and the way to think about it in information-theoretic terms is to think about an encoder, each layer encoding the input into some different partition, and a decoder, the function from the layer to the label. So each layer is characterized by an encoder and a decoder, and once there is an encoder and a decoder (which may be stochastic or deterministic; at this point it's not important), I can talk about the mutual information, given the distribution of X, of the encoder and of the decoder. And I argue (this is one of my completely informal statements at this point, but I argue that it is basically correct) that in the large-scale limit, where X is very large, only these two numbers matter: the mutual information of the encoder, and the mutual information with the desired label, which I call the optimal decoder. Remember, this is not necessarily the mutual information between Y-hat and T (T is just like h, the hidden layer) but the mutual information between T and Y: the best I could do with this particular representation. Once I give you a partition of the input, what is the best prediction of the label that you can induce from this particular partition?
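A minimal sketch of this "layers as partitions" picture: count the distinct sign patterns that a layer of threshold units assigns to a sample of inputs (the sizes and the Gaussian inputs are assumptions; a wide random layer cuts the space finely, a narrow one coarsely):

```python
import numpy as np

rng = np.random.default_rng(1)

def num_cells(n_units, points):
    """Number of partition cells (distinct sign patterns) occupied by `points`."""
    W = rng.standard_normal((n_units, points.shape[1]))  # random hyperplanes
    patterns = np.sign(points @ W.T)                     # one pattern per input
    return len(np.unique(patterns, axis=0))

X = rng.standard_normal((5000, 12))       # a sample of 12-dimensional inputs
print(num_cells(50, X))   # many hyperplanes: a fine partition, many cells
print(num_cells(3, X))    # few hyperplanes: a coarse partition, at most 8 cells
```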
So my claim is that, in some asymptotic sense which has to be carefully defined (and I'm not going to be nearly as careful as the previous talk), in the usual asymptotic typicality arguments of information theory, I(X;T) dominates the sample complexity of the decoder and I(Y;T) dominates what we call the generalization error. The higher I(Y;T), the more information I have about the desired label and the better the generalization; and the lower I(X;T), the fewer samples I need in order to actually learn this decoder. Notice that the encoder is very simple at the beginning and the decoder is very complex, and as I move from layer to layer the encoder becomes more and more complex and the decoder becomes simpler and simpler; the last decoder is just a linear separation. So there is some sort of trade-off between the complexity of these two functions.

So here is the picture I can't avoid showing, which is a simulation of these two variables, the information about the label and the information about the input, in a very small toy problem, as physicists like to do: 12 bits of input, one bit of output. What you see here is the information plane. I measured information directly, with a technique which is controversial to some of you: binning the layer activities into enough bins to estimate the information correctly, and then taking the coarsest possible binning that does not lose information about the output. So I am careful about this: it is actually the minimal information that is required to flow through the layer in order to achieve the performance of the network. Of course we can estimate these quantities by many other techniques, and when we go to very large problems we can't do it this way; we need to use, for example, the Gaussian-process hypothesis about the layers, or other techniques which are now very popular. I'm doing that too, but I'm not going to talk about it here.

What was really striking when we first looked at this (this is a very specific network, but the picture we get is very general) is the following. What you see here are a hundred different repetitions of this small neural network with different initial conditions, meaning different initial weights and different orderings of the examples, trained on 80% of the 2^12 possible inputs. The striking picture, the one that made us jump when we saw it, is that when you train those networks they concentrate very nicely in the plane, even for this small problem. All those 100 entirely different sets of layers and weights seem to sit essentially in the same place in the information plane, not only at the end but throughout most of the evolution. You see here, this is the last hidden layer, which starts very low in terms of information about the label, whereas the first hidden layer, this blue one, is very high, because it doesn't lose any information to begin with, and it stays there. But the evolution is quite striking when you look at how they move: they climb up to more or less this point, where they seem to concentrate along a line which of course obeys the data processing inequality; the information can only go down. Remember, this is the input layer and this is the output layer. Many of you have seen this movie. This happens after something like 300 epochs of training, which means 300 cycles through the data.
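For concreteness, here is a sketch of the binning estimator behind these plots (the bin count and the fake activations are assumptions; in the talk the binning is chosen as the coarsest one that loses no information about the output):

```python
import numpy as np

def binned_layer_entropy(activities, n_bins=30):
    """activities: (n_samples, n_units) hidden-layer values, one row per input x.
    With distinct, equally likely inputs and T a deterministic function of X,
    I(X;T) = H(T), so counting binned patterns estimates the layer information."""
    edges = np.linspace(activities.min(), activities.max(), n_bins + 1)
    binned = np.digitize(activities, edges)              # discretize each unit
    _, counts = np.unique(binned, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))                # H(T) in bits

rng = np.random.default_rng(2)
T = np.tanh(rng.standard_normal((4096, 8)))              # stand-in layer output
print(binned_layer_entropy(T))                           # at most 12 bits here
```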
But from this point on we see a slightly different type of evolution, much slower, looking as if dominated by diffusion, which eventually pushes the last hidden layer to this very interesting point where it has exactly one bit of information about the label, which is what you hope for, but also essentially just one bit of information about the input. That means that out of those 12 bits only one survived. So this is a perfect, what we call in statistics, minimal sufficient statistic: eventually you don't need to remember anything more than one bit. What is really interesting is that all the other layers also move to the left. And the questions are: why does this happen? Does it happen in general? What is the meaning of these numbers? Why do they concentrate? There are many interesting mathematical questions here, and the most interesting for me at this point is what governs the dynamics of these points, why they move along these trajectories.

Actually, if you take those trajectories and just average them (because they concentrate so nicely), you really see what I call the typical behavior of the information-plane trajectories. The last hidden layer starts very low and then moves very quickly to this point C, which says essentially: OK, I gained about half of the information about the label, but I also moved a little bit to the left, meaning I also remember a lot of irrelevant things about the data. Then most of the cycles, from epoch 300 to 9,000 or 10,000, are spent on the trajectory from C to E, where all the layers (it doesn't have to be so, but this is what happens in this example) also move to the left, and eventually the layers lie on a very interesting line. This optimal line at the end, I argue, is the information-theoretic bound, which I can calculate analytically if I know the joint distribution of X and Y; this is what I call the information bottleneck bound. What is really interesting is that most of the training is spent on compression, at least of the last hidden representation, moving from C to E and eventually arriving at this almost minimal sufficient statistic.

So the question is why this happens, and there are many interesting questions I just want to outline, because we've been talking about this for some time now and many of you have heard me speak about it. People have started to look more carefully at this information-plane description of neural networks. What you see here are the same pictures, just with the color now indicating the number of epochs, and you see that, again, with enough data this compression actually helps you, in the sense that it simplifies the representation: all the layers lie on this line. When you reduce the number of training inputs, the first phase, coming to this green line where you begin to compress the representation, is essentially the same; but with only 5% of the data the compression phase doesn't help you, and you eventually lose information. And I argue that both lines, the line you converge to with a lot of data and the line you converge to with a finite sample, are analytically tractable, at least in some cases, if you have the joint distribution.
So for me as a theoretician, the interesting question is: are there deep learning problems where I can actually solve analytically where those layers are going to lie in this information-plane representation? That's where I want to take you. There is also the issue of sample complexity, which is much more complicated at this point: why the curves move the way they do with the number of samples, and how this is related to the classical generalization bounds we have in learning theory, like PAC bounds or PAC-Bayes bounds and many other things some of you may know. Those are all worst-case in some sense, and this is definitely not worst case; this is some sort of typical behavior, and that's why it deserves the language of information theory, or statistical physics. Allow me just a second, I want to move to presentation mode. OK.

So maybe I just want to give you the gist of the kind of proofs we are using in order to connect this with learning theory. Usually at least half of my audience are computer scientists, and computer scientists are familiar with the classical generalization bounds, which are of this style: the generalization error, or its square, is bounded by the log of the cardinality of my hypothesis class (the class of functions my network can implement) divided by the number of examples, plus a confidence term which is completely negligible in the large-scale limit. This is a very nice bound, because it's enough to know the dimensionality: the Hausdorff dimension, the fractal dimension, or the VC dimension, all these things which essentially allow me to cover the hypothesis class at scale epsilon; that's why I call it H_epsilon, it is an epsilon-cover of my hypothesis class. Then I get these very nice bounds, d over m, or the square root of d over m, for the generalization error.

The problem is that this is way too pessimistic, and in some sense completely useless for deep neural networks, because any reasonable way of estimating the dimensionality of these things will be of the order of the number of independent parameters. Even for convolutional networks and the like you are still way too high. So something else is happening here, and we actually saw it in the simulations: it is not the weights, which can vary dramatically, but this mutual information which really dominates the generalization performance. And the question is why this mutual information follows this very interesting path.

A very simple way of thinking about it is to think about those partitions that are induced on the input, as cells. Each layer has its own partitioning of the input, and if I want to estimate the number of such cells, information theory tells me that the typical cardinality of X is essentially 2^H(X) (H(X) is the log of the typical cardinality), and the cardinality of each cell of the partition is essentially 2^H(X|T). This is the same idea Shannon used in his source and channel coding theorems; it is behind rate-distortion theory, it is behind channel coding. It is the basic idea.
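Here is a hedged restatement of this counting argument in formulas, as reconstructed from the talk (the notation H_epsilon and the asymptotic symbols are mine):

```latex
% Classical epsilon-cover generalization bound:
\epsilon^2 \;\lesssim\; \frac{\log |\mathcal{H}_\epsilon| \;+\; \log(1/\delta)}{m}
% Typical number of cells in the partition induced by a layer T:
N_{\text{cells}} \;\approx\; \frac{2^{H(X)}}{2^{H(X|T)}} \;=\; 2^{I(X;T)}
% Treating each cell as needing its own label, the effective hypothesis
% class has size about 2^{N_cells}, which gives the bound discussed next:
\epsilon^2 \;\lesssim\; \frac{2^{I(X;T)}}{m}.
```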
But here I'm using this idea in order to estimate the number of functions, or the number of labels, that I really need to learn if I have this coarse partitioning of the input, and that's very easy to do. The cardinality of the partition is essentially 2^H(X) divided by 2^H(X|T), assuming a not-necessarily-deterministic partition, so the cardinality is just 2^I, where I = H(X) - H(X|T). And now I just plug this in. This is where I'm cheating a little bit, because I'm using this hypothesis-class measure, which is generated by the data, as if it were given; it is of course not given, and that's where the whole thing is flawed in some sense, but the intuition is right. It is 2^I over m that dominates the decoder complexity, how many examples are really needed to learn the decoder, because that is essentially the number of cells. That's very nice, because it tells you that every bit of compression in I(X;T) is like doubling the data. This is actually very important: it is a much more profound type of generalization bound. Of course, I(X;T) depends strongly on the statistics of the data, so this is not a distribution-independent bound; and it is not really a bound anyway, because estimating I(X;T) during the evolution is very difficult. But it gives some rationale for why the compression, this phase where all the layers move to the left, can be very useful for generalization.

This is the way I usually put it: the generalization error is dominated by the difference between the absolute information about Y and the mutual information that a given representation has about Y; this is easy to show, for example, using Pinsker's inequality or things like that. And the effective dimension, something like the VC dimension, is actually going down, behaving more or less like 2 to the mutual information that the representation keeps about the input. So it is very useful for generalization to minimize this mutual information. Of course, how much can you compress?

We have known something about this problem for many years. Given a representation, which here I denote by X-hat (this is the completely formal, abstract representation, any map from X to X-hat), I can ask for the minimal mutual information with X, under some constraint on the mutual information that this representation keeps about the label Y. This is a very simple constrained minimization, a variational optimization problem: find, over all possible encoders that obey this Markov chain, subject to a Lagrange multiplier or some constraint on the information about Y, the best possible encoder. This can be solved more or less exactly, implicitly, analytically, by a type of iteration between the optimal encoder and the optimal decoder. It defines this black line here, which is the absolute bound. So in this information plane there is a line beyond which nothing can go; this is an information-theoretic bound, and even an alien from outer space cannot be above this line for this particular data, for this particular joint distribution of X and Y.
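For readers who want to see the alternating encoder/decoder iteration concretely, here is a sketch of the self-consistent information bottleneck updates for discrete variables (the toy joint distribution, the cardinality of T, and the value of beta are assumptions):

```python
import numpy as np

def ib_iterate(pxy, n_t=2, beta=5.0, n_iter=200, seed=0):
    """Alternate the optimal-encoder and optimal-decoder updates of the
    information bottleneck for a discrete joint p(x,y)."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                       # p(x)
    py_x = pxy / px[:, None]                   # target decoder p(y|x)
    pt_x = rng.dirichlet(np.ones(n_t), size=len(px))   # random encoder p(t|x)
    for _ in range(n_iter):
        pt = px @ pt_x                                        # p(t)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]    # decoder p(y|t)
        # KL[ p(y|x) || p(y|t) ] for every (x, t) pair
        kl = np.array([[np.sum(py_x[x] * np.log(py_x[x] / py_t[t]))
                        for t in range(n_t)] for x in range(len(px))])
        pt_x = pt[None, :] * np.exp(-beta * kl)               # encoder update
        pt_x /= pt_x.sum(axis=1, keepdims=True)               # normalize over t
    return pt_x, py_t

pxy = np.array([[0.25, 0.05],
                [0.05, 0.25],
                [0.20, 0.20]])
encoder, decoder = ib_iterate(pxy)
print(np.round(encoder, 3))   # soft partition trading I(X;T) against I(T;Y)
```

Sweeping beta traces out the bound: small beta favors compression (low I(X;T)), large beta favors prediction (high I(T;Y)).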
So the question is, among many other details here: what happens if you have a small sample? You don't have the joint distribution of X and Y; you have a sample from it, which is of course what we do in practice. How much information can you hope to achieve about the desired label, what generalization, out of your sample? This was answered essentially in a paper with Ohad Shamir and Sivan Sabato around 2008, and it gives you this interesting red line, which is completely intuitive, and which I can draw here. It tells you that the empirical mutual information and the true mutual information can differ by essentially the same number you saw before, the square root of 2^I over m, plus some logarithmic correction which I don't write here. So this is very nice: the more we compress the representation, the better my approximation. It is really obvious: if I have a very fine partition, I need a lot of labels to actually label each one of those cells. You need to coarsen the partition in order to be able to make good predictions; otherwise most of those cells will have no labels in them. And of course you also need the labels to be more or less uniform, or homogeneous, within a cell; that's the hard part, how to guarantee that the encoder is good.

So essentially we're talking about two types of losses. One is the absolute compression loss: if you want to compress the representation, there is a loss of mutual information which you can't avoid, due to the structure of P(X,Y). The other is the finite-sample loss, the difference between the red and the black lines, which is how much you lose because you have finite samples. We can make this argument much more formal, but essentially it tells you that there is an optimal point in this plane, which is the best generalization error you can achieve with this particular sample size. When you increase the sample, the red line approaches the black line, but absolutely not uniformly: at the beginning they are very close to each other, and at the end they are very far apart, because very fine partitions require a lot of labels. OK, that's the basic intuition behind the theory; I'm not going to get deeper into it because I want to get to some other things.

So now we understand, more or less formally, that compression can be good. Notice, by the way, that I'm talking about a different type of compression than others. People talk about sample compression, which is how many bits you need in order to encode the whole sample in the weights; this is, for example, the work of Stefano Soatto and others. That is a different notion, and there is no question that sample compression is good for generalization; that was shown by Nati Srebro and Ohad Shamir and a few others, good friends of mine, around 2010, following what we had shown earlier for clustering. What I discuss is a very different type of compression: compression of one input into its representation in the layers. I don't look at the weights at all; I look only at the activity of the units.
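As a hedged restatement of the finite-sample statement quoted above (my reconstruction of the result attributed to Shamir, Sabato and Tishby; constants and the logarithmic correction are omitted):

```latex
% Empirical vs. true relevant information for a representation T, sample size m:
\bigl|\, \hat I(T;Y) \;-\; I(T;Y) \,\bigr|
  \;\lesssim\; \sqrt{\frac{2^{\,I(T;X)}}{m}}
% Coarser representations (smaller I(T;X)) admit more accurate empirical
% estimates, at the cost of the unavoidable compression loss set by p(x,y).
```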
Now I want to stress two things, as I see Andrew Saxe here. First of all, in my analysis we don't change the training of the network at all. It is trained by classical, vanilla-flavor back-propagation, stochastic gradients; I didn't touch it. Everything we did was some sort of wrapper: we measure these quantities given the representation. We don't change it, we don't affect it; we don't need the mutual information to do anything to the network. It is only an x-ray that you do to the layers using mutual information. By the way, even if what I measure is not exactly mutual information, or anything close to it, call it whatever you want: these two functions behave very interestingly, they obey some sort of data processing inequality, and they give us a nice picture, a nice x-ray. You don't have to x-ray the network if you don't want to, but this x-ray is completely independent of the training of the network itself. Mutual information is not used in the training and it doesn't affect the training; I can bin my units any way I want once the network is trained, or partially trained. It is a completely consistent way of measuring things in the network. So it has nothing to do with changing the paradigm of learning; I'm using exactly the classical paradigm of training.

Of course, you can do other things, like adding noise to the input (that would change the training paradigm) or other things which force the mutual information to decrease. I'm talking about the minimum mutual information that is required for the label. Of course, you can train a network without losing information; there are all those ResNets and related architectures, many of which are by construction trained not to lose much information, although they do lose some. What I am saying is that in that case you lose some of the advantages; SGD is actually forcing you to lose information, and that's what I want to show you now.

So, essentially, when you train those networks, you can see that this knee in the information plane, this beginning of compression (which may or may not happen in a given layer; it depends on the architecture in a very strong way), is very strongly associated with the knee in the training error, which is this lower curve.
You see that the layers started to move to the left exactly at the knee, the point where the error starts to saturate. There is a very fast decay of the training error and then a very slow, and actually noisy, decay. The error is bounded away from zero, but not completely zero, because I still see gradients there, and these gradients still do work. We are very far from the saturation of the gradients, or the collapse of the gradients, and all those things; we are still way before that. And you see that this compression to the left is completely associated with the fact that the gradients are small but noisy.

Then, of course, we verified this in many ways. I just want to show you this picture again, which I have shown many times: the signal-to-noise ratio of the gradients. Here you see the mean gradient and the standard deviation of the gradient in every one of the layers; this is my Eiffel Tower network, which is very specific, but it demonstrates the effect very nicely. This is in log-log scale, so the difference is the log signal-to-noise ratio, and you see that the signal-to-noise ratio flips essentially at the same point where the layers start to compress. The signal-to-noise ratio moves from plus 20 dB to minus 20 dB, more or less: orders of magnitude of difference in the signal-to-noise ratio of the gradient. And of course the collapse of the gradients, if you insist, happens much later, where the units are actually saturated and we have some sort of vanishing gradients; that is not the effect at all. The effect is much earlier, where the signal-to-noise ratio flips, and when it flips, it flips in all the layers. Of course the noise is higher in the lower layers, because it back-propagates from the top, but the variance is also higher, and the ratio between the variance and the mean remains more or less constant. This tells you: OK, I'm saturating the mutual information. Remember, the capacity of the Gaussian channel is (1/2) log(1 + SNR); the signal-to-noise ratio is fixed, so the information is fixed, and the Gaussian approximation is actually very valid here.

One more thing you see, if you look at the weights themselves, again in a log-log plot: they grow linearly, like a drift if you want, up to this point, and beyond this point they grow sub-linearly, more or less like the square root of t, so they are dominated by diffusion. Now I really have to hurry up. We see this diffusion in many different problems; we see it with convolutional networks, on MNIST, on CIFAR, and so on, so it is by no means specific to my toy problem.
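Here is a sketch of this gradient signal-to-noise diagnostic (the stand-in gradient data and the dB convention are assumptions; in practice you would collect per-minibatch gradients for each layer from your training loop):

```python
import numpy as np

def gradient_snr_db(grad_batches):
    """Per-layer gradient SNR across minibatches: norm of the mean gradient
    (the signal, or drift) over the norm of the per-weight std (the noise)."""
    g = np.stack(grad_batches)                  # (n_batches, n_weights)
    signal = np.linalg.norm(g.mean(axis=0))
    noise = np.linalg.norm(g.std(axis=0))
    return 20.0 * np.log10(signal / noise)

rng = np.random.default_rng(3)
# Drift phase: a large shared component across batches -> positive dB.
drift = [1.0 + 0.05 * rng.standard_normal(100) for _ in range(50)]
# Diffusion phase: batch-to-batch fluctuations dominate -> negative dB.
diffusion = [0.01 + 0.5 * rng.standard_normal(100) for _ in range(50)]
print(gradient_snr_db(drift), gradient_snr_db(diffusion))   # the "flip"
```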
I don't have time to get into everything, so what I want to talk about is the effect of this joint compression of the layers on the computation time, the time it takes to converge. That is something quite striking that we formalized only recently, this year. When you add layers to the network, say you move from one hidden layer to two, three, up to six hidden layers in this case, you see that the time to converge actually decreases, in terms of the number of updates, the number of iterations; it decreases with the number of layers.

Now, this is surprising. First of all, we saw it in this picture: here you spend a lot of time in the yellow, which means it takes forever to converge, and here everything is blue or purple, which means that after a few hundred iterations I'm already up there and all the layers essentially saturate. You also see that the yellow is right next to the red, which means that for most of the updates nothing changes in the information plane, although the weights are still fluctuating and moving. This calls for a theory, and indeed we have some sort of theorem, which I call the compression theorem. We showed that if you look at the mutual information between two consecutive layers, eventually, asymptotically, it is bounded by a constant which depends on the number of relevant dimensions, like the dimension of the manifold that Mark mentioned, or of any other underlying structure. The whole assumption here (otherwise none of this works) is that there is an underlying low dimension to the problem, which can be followed all along; if there is no underlying low-dimensional structure, neural networks are not going to learn the problem, and I completely agree with that. What we see is that there is another term, which decays like a power law with the number of updates, and this is essentially how the irrelevant dimensions of the problem disappear: this is the compression.

Now, again without getting into too many formal things, I want to give you the gist of the idea of this proof, which we only recently managed to nail down completely. What happens to the weights between two consecutive layers is that the first part of the training is actually projecting the layer, by some sort of linear projection, to low dimension, more or less what we call the canonical correlation analysis (CCA) projection; we project to low dimension without losing information about the label. This is a low-dimensional projection because of the assumption of a low-dimensional manifold underneath. But then there is another term, this delta-W_k, which essentially grows like a diffusion. So imagine that in this minimum you have a Hessian matrix which is highly skewed: there are many irrelevant dimensions, meaning that changing the weights in those dimensions doesn't change the error, doesn't affect the label at all. In those dimensions the noise will simply diffuse; it will be an essentially free diffusion. In the preserved dimensions of the relevant manifold,
it cannot diffuse, because that would change the label immediately. So what happens in this compression phase is that I have diffusion in the irrelevant dimensions, and I don't change anything in the relevant low dimensions; and most of the dimensions are irrelevant, in the sense that out of the million possible weights, most changes will not affect the label, or the error, at all.

Now, this is of course where I cheat again a little bit: I assume that this delta-W looks like a random Gaussian process. Of course it is not; it is quenched randomness in general, meaning a fixed, Wigner-type distribution, and I cannot simply assume that it is just noise; that requires a much more careful analysis to actually argue. But if you assume that it behaves like noise, then the next layer is a nonlinear function of a linear function of the previous layer, plus something which looks like noise, where the noise is simply this delta-W multiplied by T_k, by the previous layer. This means I can treat the diffusion process as if it were acting like Gaussian, independent noise with respect to the relevant dimensions. So I can bound the mutual information between the layers by the mutual information of a multivariate Gaussian channel, and this I can estimate very precisely. Eventually everything depends on how many dimensions are preserved and how many are not, where I allow this diffusion; the diffusion degrades the signal-to-noise ratio of the irrelevant dimensions and eventually gives me the compression. So that's the idea of why SGD is actually compressing. I'm not saying that every algorithm does it.

But this gives us a very interesting prediction about the time to converge. If all the layers really compress together, and they don't do what Andrew says (all going to the right and staying there) but actually help each other compress, then the time to converge to good generalization depends on how well I divide the irrelevant information between the layers. Say I divide it equally, just for simplicity; then I get this very simple scaling: the time to converge with K layers scales like K to the (1 - 1/alpha), a negative power, where alpha is the diffusion exponent, which is one half for standard diffusion. This is the prediction I got from a basically hand-waving, back-of-the-envelope calculation, and of course we went and checked it numerically. This is what you see in this toy problem: the number of iterations, in a log-log plot, as a function of the number of layers, goes down with exactly the power law that fits this assumption of a one-half diffusion exponent; a perfectly nice linear log-log fit. Then we looked at MNIST, the practical problem we heard about, and you see exactly the same type of power law, with a different exponent; so there it is not so simple, and again, we know that this exponent is related to the change in slope of the growth of the weights. And remember, I'm not doing any regularization here. So the question is how general this is. Of course, it doesn't go on forever: if you increase the number of layers more and more, you are not going to keep gaining time.
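Here is a hedged restatement of these two claims in formulas, as I reconstructed them from the talk (the constants, the exact conditions, and the power-law term are in the speaker's paper, not here):

```latex
% Compression theorem, schematically: diffusion in the irrelevant directions
% acts like additive Gaussian noise, so the inter-layer information is bounded
% by a Gaussian-channel capacity on the d_rel relevant dimensions, plus a term
% decaying as a power law in the number of updates n:
I(T_k; T_{k+1}) \;\lesssim\; \frac{d_{\mathrm{rel}}}{2}\,
  \log_2\!\bigl(1 + \mathrm{SNR}_k\bigr) \;+\; O\!\bigl(n^{-\gamma}\bigr)
% Convergence-time scaling when K layers share the compression equally:
t_{\mathrm{conv}}(K) \;\propto\; K^{\,1 - 1/\alpha},
\qquad \alpha = \tfrac12 \;\Rightarrow\; t_{\mathrm{conv}} \propto K^{-1}.
```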
At some point it bounces back, and the reason is that this assumption of equal compression between the layers, that they really help each other, is not true anymore: they start to overlap, and they no longer compress independent information. So this power law breaks at some point and starts to bounce back, and if you actually calculate it, you can see that there is an optimal number of layers beyond which this computational advantage fails. That is essentially this point you see: it goes down (with MNIST, by the way, all the way to about 7 or 8 hidden layers), and then you start to pay again, and the number of iterations becomes much larger, because the overlap between the filters hurts; they are not independent filters anymore.

OK. I actually wanted to tell you a lot more about where these points converge in terms of criticality, critical points along the information curve; this is my main hot topic at this point. There is a theory behind the bottleneck which tells us that the layers are not going to be pushed all the way; they are going to get stuck in certain places, and these places can be predicted just from the joint distribution, by analyzing the bottleneck equations. But I'm not going to tell you about that today, so I'll stop here and allow some questions.

This is essentially my main message: only two numbers, the information plane, tell you the whole story, and the advantage of many layers is mostly computational; it speeds up the computation dramatically with the number of layers. There are many other issues, and I want to draw your attention to a special issue of Entropy on the information bottleneck, with a deadline at the end of this year; we already have some very interesting papers. OK, thank you very much.