Thank you, and thank you to the organizers for the opportunity to speak here. What motivates me, and probably many in this room, is the set of mysteries arising in trying to understand deep neural networks. At some level there's a broad idea behind lots of this work that goes something like this: deep networks work very well because they can compactly represent complex functions. That's the good news. The bad news is that depth makes the learning problem more difficult in some sense. For instance, if you train a network on a handwritten digit recognition task, you can see there's a period of time at the beginning where the network is doing apparently nothing at all before it eventually bumps down towards lower levels of error. Coming from psychology and neuroscience, I'm also interested in this because, as a matter of anatomy, the brain has deep structure. In the visual processing hierarchy, inputs come in through the retina and pass through a series of brain areas, so there is some amount of serial propagation of signals. And if you zoomed into any single brain area, you'd also see interlaminar structure. So signals flowing through this hierarchy show some of the serial signal propagation that's seen in deep networks. I think what we really need, to make progress on understanding the impact of depth, is a theory that illustrates some of these very basic trade-offs. What role does depth itself play in learning dynamics, as opposed to all the other features of the learning setup? And how does training speed depend on aspects of the setup? I want to investigate these questions using a very simple surrogate model, and what I'll talk about today is basically three pieces.
First, I'll talk about training dynamics. Then I'll talk about scaling laws: how training speed scales with the number of layers in the simple models I'll introduce. And third, I'll talk about representing more complex structure: what goes on in these models when you have richer data sets. All of this I'm going to do using a deep linear neural network. The idea is to develop the theory in a very simple model that gives us much more analytical tractability, and I think that's particularly important for the brain sciences, because we need clarity as to what exactly is giving us the phenomena we're observing. So what is this deep linear network? Well, here's a nonlinear deep network: you have inputs, you have these weighted connectivity matrices, and you apply some nonlinear function f. To get the deep linear network, you do something quite drastic: you take out all of the neural nonlinearities. So we're back to just linear maps between all of these layers. And you're probably thinking, wait, isn't that trivial? You could rewrite the input-output map as a shallow, single weight matrix by just taking the product of all of these weight matrices. So at the level of what the input-output map computes, it only ever computes a linear function of the input; there's no increase in representational power with depth here. But nevertheless, we'll see that depth radically changes the learning dynamics. And just to give you a flavor of the phenomena this model can show: in fact, the training dynamics I showed you earlier were from a deep linear network. So it has these long plateaus and sudden transitions, and it shows behavior like faster convergence from unsupervised pre-trained initial conditions. For this red curve here, we used an unsupervised pre-training scheme and then trained the deep linear network.
And you can see it converges much faster after that pre-training. So hopefully we can build intuitions for the nonlinear case by understanding exactly what gives rise to these phenomena in the linear case. And just to clarify why this model is interesting: when I say it's a linear network, I'm talking about the mapping from the inputs x to the outputs y hat. But to do learning, you take that mapping and plug it into some error function, and that's where nonlinearity returns to the problem. Here I'm always going to use the squared error between some desired set of patterns and what the network produces, and the learning problem is minimizing that function. It's readily apparent that this is going to be non-convex once you add one or more hidden layers. So this has at least one simple feature of the more complex nonlinear problem: it's a non-convex optimization problem. In the machine learning field these different concepts are often coupled together, but it's useful to separate them out. Models can be nonlinear or linear, and they can be shallow or deep. For instance, linear regression is linear and shallow. Support vector machines are nonlinear but still fundamentally shallow. Nonlinear deep networks, who knows what they're doing, and we're just filling in this remaining point: linear and deep. And what I think is interesting is that questions of representational power seem to separate along the linear versus nonlinear axis: the linear models cannot solve simple problems like XOR, but all of the nonlinear methods can, even the shallow ones. Whereas learning complexity appears to separate along the shallow versus deep distinction: you have convex problems in the shallow cases, even for the nonlinear methods, and non-convex problems for the deep ones.
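To make the non-convexity point concrete, here's a minimal sketch (toy scalars of my own, not from the talk): the end-to-end map of a deep linear network is just the product of its weights, so it's always linear in the input, yet the squared error is non-convex in the weights once there's a hidden layer.

```python
# Minimal sketch (my own toy scalars): a deep linear "network" whose layers are
# scalar weights. The end-to-end map is the product of the weights, so it is
# linear in x; the squared error, however, is not convex in the weights.

def forward(weights, x):
    """Compose the layers: y_hat = w_D * ... * w_1 * x, a linear map."""
    y = x
    for w in weights:
        y = w * y
    return y

def loss(weights, s):
    """Squared error against a target linear map with slope s (unit input)."""
    u = forward(weights, 1.0)  # effective end-to-end weight
    return 0.5 * (s - u) ** 2

s = 2.0
# Two different factorizations of the same end-to-end map both reach zero loss.
assert loss([1.0, 2.0], s) < 1e-12
assert loss([4.0, 0.5], s) < 1e-12
# But their average is not a minimum, so the loss cannot be convex.
midpoint = [(1.0 + 4.0) / 2, (2.0 + 0.5) / 2]
print(loss(midpoint, s) > 0)  # prints True
```

The set of global minima is a curve (all factorizations of the target map), and averaging two of its points leaves the curve, which is exactly the non-convexity the hidden layer introduces.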
All right, so what happens when you look into the dynamics? What we'd like to do is solve the dynamics of the gradient descent equations. Here all I've done is write down these gradient descent equations and rearrange a few things, and they have this form. You can see the complexity: the weights show up in products, so you have these up-to-cubic interactions in the weights, and it's highly nonlinear. Everything's coupled together, and we'd like to understand how these dynamics evolve. Another thing that happens, and this is only true in the linear case, is that the data set enters only through two matrices: the input correlation matrix and the input-output correlation matrix. This is a key simplification, and what we'll be able to do is ask how the structure of the data set, as encoded by these two matrices, influences the learning dynamics. These models have been well studied before: work on their critical points goes back many years, shallow dynamics have been very well studied, and there's some work on three-layer dynamics. There is of course a huge amount of related work on nonlinear multi-layer models in a variety of settings, but what this analysis really adds is that it goes to deeper networks, with many layers. So how do we analyze this model? The idea is very simple: if you do the right change of variables, everything decouples and looks very simple, and I'll walk you through that. Up here I'm showing the input-output correlation matrix. This is how the data set is provided to the model, and I've handcrafted one here; it has a kind of hierarchical structure. You can decompose it using the SVD into the product of three matrices.
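As a sketch of those gradient descent equations (a hypothetical one-hidden-layer toy in pure Python; the sizes, correlation matrices, and learning rate are my own illustrative choices), note how the updates touch the data only through the input correlation matrix and the input-output correlation matrix:

```python
# Sketch of the batch gradient-descent dynamics for a one-hidden-layer linear
# network. The dataset enters ONLY through Sx (input correlations) and Syx
# (input-output correlations); these toy matrices are handcrafted by me.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add_scaled(A, B, c):
    return [[a + c * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

Sx  = [[1.0, 0.0], [0.0, 1.0]]        # whitened inputs
Syx = [[2.0, 0.5], [0.5, 1.0]]        # input-output correlations

# Small (roughly random) initial weights.
W1 = [[0.05, 0.01], [0.02, 0.03]]
W2 = [[0.04, 0.02], [0.01, 0.05]]

lr = 0.05
for _ in range(4000):
    # Error in correlation space: Syx - W2 W1 Sx (the only place data appears).
    E = add_scaled(Syx, matmul(matmul(W2, W1), Sx), -1.0)
    dW1 = matmul(transpose(W2), E)    # dW1 ∝ W2ᵀ (Syx − W2 W1 Sx)
    dW2 = matmul(E, transpose(W1))    # dW2 ∝ (Syx − W2 W1 Sx) W1ᵀ
    W1 = add_scaled(W1, dW1, lr)
    W2 = add_scaled(W2, dW2, lr)

print(matmul(W2, W1))  # converges to Syx here, since inputs are whitened
```

You can see the cubic weight interactions directly in the update: each gradient multiplies the error, which itself contains the product W2 W1, by another weight matrix.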
What we're going to do is analyze the network in that same coordinate system, where the network now has effective singular values on the diagonal that aren't equal to the true singular values in the world, but that change over time to reach the right values. The key assumption here, which is implicit and which I won't talk too much about, is that the off-diagonal elements are small when you do this initial transformation, and that's true for dynamics starting from small random weights. So basically what this change of variables does is take this network that was originally all coupled together and separate it out into these 1D chains, and I really do mean 1D chains: think of one neuron per layer in a linear network, incredibly simple, and the chains don't interact. That's the key simplification. So now I can zoom in on one of these chains and ask what its dynamics will be. It essentially takes the input, projects it onto an associated singular vector of the input-output correlations, and then passes that through a bunch of layers, each with an effective layer weight parameterized by this value b, and that gives you the network's output along the corresponding output singular vector. When you multiply all of these together, you get the overall effective singular value, which is just one singular value in the overall map. To show you how this works at a more intuitive level: the network starts off knowing nothing, and then slowly over time these values fill in along the diagonal and the input-output map refines, until at the end of training the correlations that the network produces exactly match those in the dataset. So we can zoom in and look in a little more detail at what happens in one of these 1D chains.
If I add one additional assumption, that the effective layer strengths in each layer are approximately equal, then I can finally reduce things down to something simple enough that I can just plot exactly what these quantities are. So this is the error as a function of one of these layer strengths, and as a function of the dataset, quantified here by a singular value and an input variance. And this is what these error functions look like for a variety of different depths. There are a few things to note. First, there are only global minima: there's no real way to get stuck, nothing super complicated going on, and you can always find the minimum if you're dropped on the surface. But nevertheless there is a saddle point at zero, and right next to the saddle point you have this long plateau. That plateau length increases with the depth of the network, and you can see that the plateau isn't there at all for the shallow network: with no hidden layers, that's just linear regression, you get this nice shallow curve, and everything's going to be fast. So the specific effect of depth is the creation of the saddle point at zero and these plateaus. (Yes, this is all for a linear chain, where additionally each of the layer weights is equal.) And so if I just optimize some networks starting off at the positions indicated by these little balls here, you can see the shallow network learns very quickly, but the deep networks take a while to escape from these plateaus under the gradient descent dynamics. Eventually they do. And because these models are fairly simple, I can also give you analytic learning trajectories as a function of depth; here I've just pulled out three. When you have a shallow network, of course, you just have exponential approach to the final effective singular value that you should be aiming at in your data set.
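The plateau effect is easy to reproduce in a sketch of the decoupled 1D chain: one scalar weight per layer, trained by gradient descent on the squared error. The depths, initial scale, and step size below are illustrative choices of mine, not values from the talk.

```python
# Sketch of the decoupled "1D chain": one scalar weight per layer, gradient
# descent on 0.5 * (s - prod(a))^2. Depth 1 means no hidden layers (linear
# regression); deeper chains start in the plateau near the saddle at zero.

def steps_to_criterion(depth, s=2.0, a0=0.1, lr=0.02, tol=0.01, max_steps=200000):
    a = [a0] * depth
    for step in range(max_steps):
        u = 1.0
        for ai in a:
            u *= ai                    # effective singular value of the chain
        if 0.5 * (s - u) ** 2 < tol:
            return step
        # dE/da_i = -(s - u) * (product of the other layers) = -(s - u) * u / a_i
        grads = [-(s - u) * u / ai for ai in a]
        a = [ai - lr * g for ai, g in zip(a, grads)]
    return max_steps

for depth in (1, 3, 5):
    print(depth, steps_to_criterion(depth))
```

With these settings, the time to criterion grows sharply with depth from small weights, while the shallow chain is fast, mirroring the plateaus next to the saddle in the plotted error surfaces.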
But when you add depth, even just one hidden layer, all of a sudden this changes to a much more sigmoidal shape, and it changes only subtly when you go to very deep networks. I can plot these for you so that you can see what they look like. The key change that depth adds is that it introduces these sigmoidal transitions, where you have a long period at the start of apparently little learning. That was all for one of these 1D chains, but the chains evolve independently, so for a larger network you just compute what happens for each of the 1D chains, and that reconstructs the full behavior of the network. If you had a shallow network with no hidden layer, you would see exponential approach of each of these effective singular values to their final correct values in the data set, whereas in the deep network you have sigmoidal trajectories. And since you have explicit equations for these dynamics, you can ask any question you like. For instance, you could ask: what's the time scale of learning? There's this very simple result that says each of these effective singular values is learned in a time basically proportional to one over the singular value. That's just the statement that if you have strong statistical structure, you learn it faster than if you don't. All right, so that's a simple picture of the training dynamics in these models. Now let's investigate the scaling laws. Here the question is: how does training speed scale with depth? This is a bit of a subtle question, because you need to know what you're doing with the initialization, and you also need to know what you're doing with the learning rate. So the procedure I'm going to use is essentially: suppose you get the optimal learning rate. How much slower would a very deep network be than a shallow network?
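The sigmoidal trajectories and the one-over-singular-value timescale can be sketched as follows, assuming the balanced one-hidden-layer mode dynamics du/dt = 2u(s − u) with the learning-rate time constant folded into t (the singular values and initial condition are illustrative):

```python
import math

# Sketch of the one-hidden-layer mode dynamics: with balanced layers, each mode
# strength u follows du/dt = 2u(s - u), whose logistic solution
# u(t) = s / (1 + (s/u0 - 1) * exp(-2 s t)) is the sigmoidal trajectory.

def time_to_half(s, u0=0.01, dt=0.001, t_max=100.0):
    """Euler-integrate du/dt = 2u(s-u); return the time u first exceeds s/2."""
    u, t = u0, 0.0
    while u < s / 2 and t < t_max:
        u += dt * 2 * u * (s - u)
        t += dt
    return t

t_strong = time_to_half(s=3.0)
t_weak = time_to_half(s=1.0)
print(t_strong, t_weak)  # the stronger mode is learned first

# Analytic check: the logistic solution gives t_half = ln(s/u0 - 1) / (2s).
print(math.log(3.0 / 0.01 - 1) / 6.0)
```

The half-rise time scales like ln(s/u0)/(2s), so to leading order the learning time for a mode goes as one over its singular value, which is the ordering result above.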
The result we have is that this is strongly sensitive to the initial condition. The time to learn to a particular criterion in a very deep network, compared to a shallow network, is on the order of one over the initial weight strength raised to the power of the depth. What that means is: if the initial weights start off very small, say 0.1, then 0.1 to the D makes your learning time explode exponentially with depth. But if you manage to make this value roughly one, then in fact a very deep network would only be a finite factor slower than the shallow network. And I think this gives some perspective on the different initialization schemes we have. This is sort of what people used to do: they started with small random weights, and they saw that deep learning was very slow. Then the initial breakthrough was pre-training and fine-tuning, where you do lots of unsupervised learning on adjacent layers, and that turns out to scale just linearly in the depth of the network. But this also suggests a faster procedure: you could use an orthogonal initialization that sets this value to one at the beginning of learning, and then the prediction is that you'd get depth-independent training times. We've looked into this numerically on the MNIST dataset. Here we train a bunch of different networks with 1,000 hidden units per layer, using full batch gradient descent, and you can see that as the depth increases, even very deep networks are only a finite factor slower. It looks like it's leveling off, not diverging, as the depth increases. We can also get a prediction of how the optimal learning rate scales with depth. One thing to point out here, because it's come up in a few talks that we've seen:
the optimal learning rate decreases as one over the depth, so if you instead choose a constant learning rate, you can actually see deep networks speed up as you increase the depth of the network. And that's somewhat artificial, because it means there was a stable learning rate, higher than the one you used, at which you could have trained the shallow networks and they would have overtaken the deeper network. So it's very important to make sure you're optimizing out this learning rate quite finely. Okay, so then we can also look at this depth-independent training time scheme. Here's a comparison of what happens if you use these orthogonal initial conditions: now, even as you make the network very deep, the training time is basically completely flat in the linear case. Obviously your mileage will vary in the nonlinear case, and in fact we looked into that a little bit. There's no theory here, just simulations, but if you take tanh networks optimized using standard methods, and you start from these orthogonal initial conditions, scaled to counteract the nonlinearity (we have a scaling for that too), you can get even very deep networks to train very quickly. If you look at this red curve here, this is a depth-100 network that's training much faster than even a depth-10 network with the previous standard initial conditions. Okay, so just as an interim summary of these training-dynamics phenomena: at least in these deep linear networks, there is something that depth itself contributes to the learning dynamics, and it's not about neural nonlinearities, because there are none in this case. What depth does is create these sigmoidal learning trajectories and long plateaus in learning, which is reminiscent of a variety of results in the nonlinear case, although I think the mechanism is different.
Usually in prior work it's been about two hidden units that are pointed in nearly the same direction and have to be separated apart; this is about the scaling symmetry across depth, but it's a similar phenomenon. And I've shown you that learning speed is ordered by singular value strength, and that it depends very strongly on the initial conditions. There are lots of caveats here, and ways it would be great to generalize these results, but this is where we are at the moment. Okay, now let's turn to the question of representing structure. Classification is an interesting problem, but of course it doesn't exhaust our capacities as humans, and it's certainly a limited framework if you think about the set of tasks we can actually perform. That set of tasks is characterized by all kinds of different structures: some tasks have implicit hierarchy, some have implicit spatial or ring-based structure. So how do we acquire this kind of knowledge of items that are embedded in these broader structures? To investigate this, I'm going to use a very simple procedure. Suppose that the world is a structured generative model built on some underlying abstract form; it could be a hierarchy, or it could be some other kind of structure. From that generative model I sample a set of data, and then I feed that data to my deep linear network. The important thing is that, because we know the dynamics of the deep linear network, if you can analytically calculate the singular values and vectors that arise from this generative process, you have an analytic link between the environmental structure and the learning dynamics in the system. You can carry out this program for several different types of structure. Basically we take the limit where you have many features, so all that matters is the infinite-data limit of these generative models. And this is what emerges.
So here in the top row I have the generative models I've drawn from; these are Gaussian Markov random fields. In the next row I've shown the resulting covariance structure across the different input examples. In the clustering case, you can see it has this block-diagonal structure. Then you can get the resulting singular vectors, which is how the neural network will organize its internal representation as it learns about these things. And as a simple visualization, I've embedded the evolving hidden representations in 2D using multidimensional scaling. In the clustering case, all the representations start off at the same point and then separate out into the three clusters that generated the data set. For something more complex like a tree, you have ultrametric structure in the covariance matrix. That will be diagonalized by Haar wavelets, and you see this branching pattern as the network learns the tree. There was a question yesterday about how you could get spatial structure out, like you'd see in hippocampus or grid cells. So suppose you were navigating in a space where you see landmarks that have locations, and the similarity structure is a function of distance. That will be diagonalized by Fourier components, and this has repeating structure which is maybe reminiscent of grid-cell-like behavior. Okay, and one point which I find quite interesting is that the neural network doesn't know about these different types of structures. It's always doing the same thing, which is gradient descent. And yet it learns to represent these different structural forms without ever having to explicitly enumerate them and reason about them directly. All right, so what are some of the phenomena that you can get out of this link? Well, here's one example.
You could ask: how do you learn about a hierarchically structured data set like this? One simple result is that, depending on how you set up the generative model, you can show the model must exhibit progressive differentiation. What I mean by that is it will start by learning the broadest distinctions, plants versus animals, first, and then separate out the next level of distinctions, and so on down the hierarchy, in these waves of quasi-stage-like transitions. To illustrate what drives that phenomenon: on the left here you can see the singular values popping up one after the other, according to their strength in the data set, and each one elaborates one level of this hierarchical tree. So you get this progressive differentiation where you split off successively down the tree structure, and this is the embedding of the hidden representations in the network. This has been observed before in nonlinear networks: earlier work did the exact same thing with nonlinear networks and found, interestingly, that they exhibit this progressive differentiation picture. It's a little harder to see there, but it does separate out according to hierarchy level, and now we have a clearer picture of where this arises from; in fact, it doesn't rely on the nonlinearities. There's a range of other phenomena this model can show. For instance, developmental psychologists are very interested in the phenomenon of illusory correlations, the idea that you can sometimes make transient errors. You might correctly say that a worm does not have bones when you're very young. Then there'll be some age where you can find a child who thinks that a worm does have bones, even though they never could have seen that; no worm has bones. So where did they get that idea?
The idea is that if you look at the activation of one specific property over training, then because most animals have bones, it might be a reasonable inference, once you've learned the higher levels of the tree, to say that a worm probably has bones; but after you learn the specifics of exactly what a worm is, you eventually get the answer correct and say that a worm does not have bones. These kinds of U-shaped phenomena occur all over development, and you can't get them in a shallow model. All right, another thing you can do with this model is interrogate when it will discover categories. Here I've generated a data set of binary features: there's a bunch of objects in the world, and those objects have associated features. If you look at this, it doesn't look like there's very much structure. But if I train a neural network on it and plot its evolving internal representation, you can see that the objects in fact separate out and cluster into three different categories. And if you order the features by that clustering (I've also colored it, just to aid you), you see there's a much denser probability of having a feature inside this block than outside it. So indeed there was hidden structure in the environment. As a simple model of this, you can consider a random feature model where you let the number of objects and features go off to infinity while their ratio stays constant. You imagine there's a category comprising a set of objects with a set of features: if an item is in the category, it has each feature with a high probability P, and if it's not, with a low probability Q. This gives you a simple spiked model of the generative structure, and we can directly plug in well-known results on the BBP phase transition to understand how this network will recover the category structure.
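As a sketch of the kind of planted-category recovery at issue here (my own toy construction and parameter choices, with power iteration standing in for the network's extraction of the top singular vector):

```python
import random

# Sketch (toy construction of my own): a planted block of objects shares
# features with high probability p inside the category and low probability q
# elsewhere. Power iteration on the centered object-object similarity matrix
# recovers the category when the signal is above threshold.

rng = random.Random(1)
N, M, K = 40, 200, 20        # objects, features, category size
p, q = 0.9, 0.1              # in-category vs background feature probability

F = [[1.0 if rng.random() < (p if i < K and j < M // 2 else q) else 0.0
      for j in range(M)] for i in range(N)]

# Center each feature across objects.
means = [sum(F[i][j] for i in range(N)) / N for j in range(M)]
Fc = [[F[i][j] - means[j] for j in range(M)] for i in range(N)]

# Object-object similarity, then power iteration for its top eigenvector.
B = [[sum(Fc[i][k] * Fc[j][k] for k in range(M)) for j in range(N)]
     for i in range(N)]
v = [rng.gauss(0, 1) for _ in range(N)]
for _ in range(100):
    w = [sum(B[i][j] * v[j] for j in range(N)) for i in range(N)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

# Cosine overlap with the centered category indicator.
c = [1.0 - K / N if i < K else -K / N for i in range(N)]
cn = sum(x * x for x in c) ** 0.5
overlap = abs(sum(vi * ci for vi, ci in zip(v, c))) / cn
print(round(overlap, 2))
```

With p well above q, as here, the overlap is large; shrinking the gap between p and q (or the category size) pushes the planted signal below the noise level, which is the phase-transition regime discussed next.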
Essentially there's a quantity you can compute, which I'll call the category coherence. It depends on an SNR-like variable and on the size of the category relative to the world, and that number determines whether you can correctly recover the categories. As an example, you can train lots of different models for different numbers of objects in the cluster, for different levels of SNR (that's the color), and for different numbers of features, and you get this mixture of learning times and overlaps with the ideal vectors. But if you plot everything with respect to the category coherence, it all collapses onto this universal curve: there's a threshold below which you cannot recover the category, and then increasing performance as you go above it. Okay. And finally, I want to talk briefly about generalization. This is not generalization in the sense you usually see in machine learning; it's not about what happens when you draw a finite sample of data. It's a different notion that has also been studied in machine learning, but more commonly in psychology: suppose you have some knowledge already embedded in your network and you're trying to learn one new thing. Maybe you know about dogs and horses and rabbits and the properties they have, and now I tell you a dog has property X. How are you going to assign that property to the other items in the data set? How likely is it that you think a horse has property X? The way this network handles the problem, if you add this new property and just train these weights with gradient descent, is that it embeds the property into its current representational space nearby the item. So property X will be placed near the dog item in the internal similarity space of the network, and you will then generalize to the other items in proportion to their distance to this novel property that you've embedded.
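Here's a toy sketch of that mechanism (a hypothetical four-item, two-level tree of my own; the mode strengths follow the logistic solution of the earlier one-hidden-layer dynamics):

```python
import math

# Sketch (my own toy tree, not the talk's dataset): four leaf items, one
# coarse mode splitting {0,1} from {2,3} and two fine modes splitting each
# pair. Each mode strength follows the logistic trajectory
# u(t) = s / (1 + (s/u0 - 1) * exp(-2 s t)).

modes = [
    (3.0, [0.5, 0.5, -0.5, -0.5]),       # coarse split, strong singular value
    (1.0, [0.7071, -0.7071, 0.0, 0.0]),  # fine split within the left branch
    (1.0, [0.0, 0.0, 0.7071, -0.7071]),  # fine split within the right branch
]

def u(s, t, u0=0.001):
    return s / (1.0 + (s / u0 - 1.0) * math.exp(-2.0 * s * t))

def generalization(t, item, taught=0):
    """Share of a property taught on `taught` assigned to `item`, via the
    network's hidden similarity at training time t."""
    sim = lambda i, j: sum(u(s, t) * v[i] * v[j] for s, v in modes)
    return sim(taught, item) / sim(taught, taught)

for t in (1.0, 10.0):
    print(t, [round(generalization(t, j), 2) for j in range(4)])
```

Early on, only the coarse mode has been learned, so the property taught on item 0 spreads almost fully to its branch-mate item 1; once the fine modes come in, the generalization becomes specific to item 0.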
And so just with this logic, this yields predictions for how these inductive generalizations should change over time. If you came to someone early in their learning about some hierarchical data set and embedded a property, it would embed at a high level of the hierarchy, and they would generalize it to all items in the hierarchy. And this pattern should become more and more specific over the course of development. That's what you see here: I assign a property to item one and test how it's been generalized to the other items, which have a hierarchical structure. You can see that initially the network thinks every item has that property. Then it separates out left branch from right branch, and so on to the finer levels of the tree, until eventually it correctly mirrors the hierarchical structure in the data set. This is a very simple window into how similarity structures in the input space can start to yield structured generalizations that conform to the underlying forms in the data. All right, I haven't talked very much about the psychological connections, but I just want to mention some work that suggests this phenomenon might be occurring in various contexts. First of all, I showed you the results indicating that the same thing happens in nonlinear networks, so it does seem to survive that change; of course, all the results I've been showing you are for the linear case. But you'd also ask what happens with real-world datasets. There's a very interesting paper that looked at the training dynamics on ImageNet. ImageNet has an implicit hierarchy behind it, and indeed the network does pick out the broadest distinctions in that hierarchy first, before going down the tree to the finer distinctions.
And at the level of psychology, this pattern of progressive differentiation is exactly what has been observed in infant and child semantic development by developmental psychologists; in fact, that's the reason we looked into it. There's also work on the underlying neural representations that is consistent with the similarity structures that emerge in these networks. All right, so just to summarize. These deep linear networks are clearly only a limiting case, suitable for a subset of phenomena in deep learning. In particular, I think they're useful for studying the impact of depth on learning dynamics, but not, of course, the increased representational capacity of deep networks. The main result I've shown on training speed is that it depends critically on the initialization: if you start off with small random weights, you're looking at a very long training time, but if you're careful about the initialization, training can be quite fast. Another phenomenon I find quite fascinating is the complex behavior that comes from depth alone, even with completely static data sets. In that hierarchical structure example, the data set is fixed: you draw it, and that's all you ever see. And yet you get this unfolding progression where the network elaborates different levels of the tree structure in these quasi-stage-like transitions. The final point I'll make is that these deep linear networks can represent, and to some extent generalize according to, these diverse structures, even though you never had to explicitly enumerate the candidate set of structures. I think that's a very powerful idea, and it may go some way to explaining the success of deep networks in diverse contexts. If you're curious to read more, most of this work is written up in this paper; the seeds of the work were in this earlier paper, in a machine learning context.
And I didn't talk about it, but there's a generalization dynamics picture, in this paper, which explicitly looks at the case of limited data. All right, I think I'll wrap up there. I just want to acknowledge Jay McClelland and Surya Ganguli, the two people who did most of the work I presented today with me. And if you happen to know any postdocs or students, I'm hiring. So thank you very much.