Our speaker this week is Jascha Sohl-Dickstein. Jascha is a senior staff research scientist at Google Brain, where he leads a team spanning machine learning, physics, and neuroscience. His career path is a long and interesting one. He started off by working on the science instrumentation of NASA's twin Mars rovers, Spirit and Opportunity. He then earned a PhD in biophysics. And finally, his path led him to work at Khan Academy, an online education site. The topic of his talk today, however, comes from pure machine learning. He's going to tell us about what happens as one increases the number of parameters in a neural network without bound. And he will mention some surprising consequences along the way. So let me turn you over to Jascha. Please start once you're ready. Well, thank you. Thank you very much for the generous introduction. And thank you also for inviting me. It's a pleasure to be here. I don't know if you can see the blackness out the window behind me, but it's a lot earlier for me than it is probably for many of you. So apologies if my brain is still starting up a little bit. So first, to motivate this talk: it turns out that as neural networks become wider and more over-parameterized, they both perform better and become more analytically tractable. And in fact, in the limit of infinite width, you can often abstract away the parameters of the neural network entirely and make surprisingly strong closed-form statements about the behavior of the network. And while this talk won't strictly be about connections between physics and machine learning, the analysis techniques and the way in which you are able to understand neural networks in this limit are basically ripped directly from physics. So as you may or may not recognize, there are definitely some fairly fundamental connections. There are a couple of places in the middle of the talk where it gets dense. And it's particularly difficult to judge audience engagement or understanding here remotely. So if you have questions during the talk, please feel encouraged to just ask them as I go along. You don't need to save them for the end. I think it'll be more helpful for everyone involved if there's some kind of feedback. Cool. So I'm going to talk about a large amount of work done by a large number of people. And just to make sure that I properly give them credit, I'd like to start out by calling out the subset of them listed here, each of whom is a primary contributor to more than one of the results that I'll be presenting. So just going from left to right and top to bottom, we have Yasaman, Roman, Jaehoon, Surya, Jeffrey, Sam, myself, Lechao, Greg, and Jiri, or George. The overall structure of the talk is going to be like this. I'm going to try to motivate why over-parameterized neural networks are an interesting object to study. I'm going to talk about the distribution over functions computed by randomly initialized neural networks. I'm going to show how you can use an understanding of this distribution over functions to predict when networks will be trainable. And then I'm going to show that, perhaps surprisingly, these results don't just hold at neural network initialization, but also describe the distribution over functions that results from either Bayesian parameter estimation or SGD training of the network. So to dive in, why are over-parameterized networks an interesting object to study? Basically, they are because they do better. 
There is a long history of building larger and larger models and getting higher and higher accuracy. Even within a given model class, though, test accuracy typically increases with increasing model width. Here we show experiments supporting that: improvement with width for both fully connected and convolutional architectures applied to CIFAR-10. So every point on the fully connected plot on the left corresponds to the same architecture and the same depth, but different settings of training and initialization hyperparameters and different widths. In both of the plots, only models which achieve 100% training accuracy are included. So here, the generalization gap corresponds directly to test accuracy. You can see that if we take the best hyperparameters at each width, then wider models strictly outperform narrower models on test accuracy, out to a width of 20,000, which is as wide as we can go given the compute and memory constraints we have in the experiment. Similarly, in the right plot, each line corresponds to the same convolutional architecture in terms of number of layers and nonlinearity used. And for each point in the plot, we're performing an optimization over training and initialization hyperparameters on a validation set. And so you can see that in the plot on the right, once again, the test accuracy of the CNN with pooling strictly increases with increasing width. Once again, this goes all the way out to the largest widths we could run. Can I just ask, width is the number of layers or the number of nodes in a layer? Awesome. Yeah, so width is the number of nodes in a layer. OK. Or for a convolutional neural network, width is the number of channels in a layer. OK. And so this observation, for fully connected networks and CNNs and other architectures, that we do strictly better with increasing width maybe raises the question: what happens in the limit of infinite width? So to start to answer that question, let's examine the properties of wide neural networks at initialization. As we're going to talk about in the second half of the talk, many parts of this also apply to networks trained by gradient descent or by Bayesian parameter estimation. This next section is going to reach peak mathematical complexity. And so this will be a particularly good time to ask questions if you have them. I'll call out when we reach the high water mark. So first, let's set up the system. For simplicity, we're going to consider just a fully connected feedforward architecture. This architecture has inputs y^0, pre-activations z^l, and activations y^l at layer l. The pre-activations are an affine transformation of the preceding activations, with weights W^l and biases b^l. And both the weights and biases are initialized by draws from a Gaussian. You can see that in the little set of equations on the right-hand side, with variance sigma_w^2 over the width for the weights and variance sigma_b^2 for the biases. The output of the neural network, or the neural network's logits, is z^L, where capital L is the top layer. And just for simplicity, we're going to assume that the output of the neural network is one dimensional, so it just outputs a single scalar. And I'm going to leave this little cheat sheet about the network architecture in the upper right for a while. So before we dive in mathematically, let me illustrate visually what the core result is going to be, using a simple cartoon. So this plot compares the network's output, the logit z^L, for two different inputs, x and x*. 
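(For concreteness, here is a minimal sketch of the setup just described, and of how one could numerically generate the kind of scatter plot the cartoon shows. This is an assumption-laden illustration rather than the speaker's code: the widths, hyperparameter values, inputs, and choice of tanh nonlinearity are all placeholders.)

```python
# Minimal sketch: a fully connected network with Gaussian-initialized weights and
# biases, re-initialized many times to build up the scatter of (z^L(x), z^L(x*)).
import jax
import jax.numpy as jnp

sigma_w2, sigma_b2 = 1.5, 0.05   # illustrative initialization hyperparameters
widths = [3, 512, 512, 1]        # input dim, two hidden layers, scalar output


def init_params(key):
    """Draw W^l ~ N(0, sigma_w^2 / width), b^l ~ N(0, sigma_b^2)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, kw, kb = jax.random.split(key, 3)
        W = jnp.sqrt(sigma_w2 / n_in) * jax.random.normal(kw, (n_out, n_in))
        b = jnp.sqrt(sigma_b2) * jax.random.normal(kb, (n_out,))
        params.append((W, b))
    return params


def forward(params, x):
    """y^0 = x; z^l = W^l y^l + b^l; next activations are phi(z^l); return the logit z^L."""
    y = x
    for i, (W, b) in enumerate(params):
        z = W @ y + b
        y = jnp.tanh(z) if i < len(params) - 1 else z   # no nonlinearity on the readout
    return z[0]


x, x_star = jnp.array([1.0, 0.0, 0.0]), jnp.array([0.0, 1.0, 0.0])
outputs = []
for seed in range(1000):                      # one point per random initialization
    params = init_params(jax.random.PRNGKey(seed))
    outputs.append((forward(params, x), forward(params, x_star)))
outputs = jnp.array(outputs)                  # scatter these two columns against each other
print(jnp.cov(outputs.T))                     # empirical 2x2 covariance of (z^L(x), z^L(x*))
```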
So if we draw a sample of the parameters theta from their prior, and then we compute the output of the neural network for those parameters for these two different inputs, that will correspond to a point on this plot. If we reinitialize the network parameters over and over again and plot the outputs of the neural network for these two inputs for each of those reinitializations, then we will get a distribution over network outputs or, equivalently, a distribution over functions, which is induced by the distribution over parameters. The core result that we're going to get is that as the hidden layers of the neural network become infinitely wide, this distribution over functions converges to a Gaussian process. And so the distribution over neural network predictions for any set of data points becomes jointly Gaussian as the intermediate layers become infinitely wide, with a particular kernel or covariance matrix which it turns out we can compute. So just as a very brief refresher, a Gaussian process is a stochastic process over some continuous domain of random variables, such that any finite collection of those variables is described by a multivariate normal distribution. You are probably most used to seeing Gaussian processes plotted as draws of functions over a 1D domain. So for instance, this plot in the lower right shows several draws from the Gaussian process corresponding to a particular neural network architecture. From this perspective, the plot I showed on the previous slide would correspond to plotting the y values against each other for two different x values. So these are just two different ways of visualizing the same thing. I should pause for a moment and ask if this pictorial, cartoon representation makes sense. I'm sorry, but what does capital L represent? So the superscript l is the layer in the deep network, and capital L is the top layer in the network. I'd just like to say I was a bit confused by the continuity of the curves on the previous plot. So this one here is discrete, but the other one showed continuous variation against either the x or y variables. So I'm not sure if I understand discrete or continuous from this. Yeah, so the continuous is like the limit of the discrete, if you have an infinite number of samples. I think this is the right way to think about it. And so here we have some distribution over data, some distribution over the model parameters. And we initialize all the parameters in the neural network to those values of theta. And then we take two inputs to the neural network, x and x*. And we push those inputs through the network. And we look at the output of the neural network, the z^L here, where capital L is the total number of layers in the network. And we plot them against each other. And if you draw theta from its prior over and over and over again, then each single draw of theta is a sample from the distribution. But if you look at all possible values of theta, then you get the continuous distribution, which is what's indicated here by the red curves. And maybe closely connected to that, and this may have been what you were asking: here we're plotting z^L for one value of x against z^L for another value of x. I think often when people look at Gaussian processes or Gaussian random fields, they don't visualize the field value or the function output value at a small number of discrete inputs. They instead imagine that the Gaussian process gives you a distribution over functions. 
And so in that case, what you see in the lower right would be draws from the distribution over functions given by the Gaussian process. So each line in the lower right would correspond to a specific choice of the neural network parameters. For a specific choice of neural network parameters, you have a fixed input-output mapping for the neural network. So every value of x gets mapped to a specific value of z^L. And if you initialize the neural network over and over and over again, then every time you re-initialize the neural network, the neural network computes a different random function. And that corresponds to a different line on this lower right plot. And the observation is just that if your distribution over functions is given by a Gaussian process, then if you take any two potential inputs and you just do a scatter plot of the function output for input 1 versus the function output for input 2, then that will be jointly Gaussian. OK, thanks. That really helps. And what is the special value of x that gives the minimum width for z^L? Could you ask that again? What's special about the value of x that gives the minimum width for z^L? So z^L is the output of the neural network. And the thing that's special about — so there's nothing special about the x that corresponds to the network width. It's rather more the other way around, which is that as the neural network becomes wider, as it has more and more units, the distribution over functions computed by the neural network becomes a Gaussian process. A Gaussian process is like the analog of a Gaussian, but for functions instead of a discrete number of variables. And so the connection to width is just that as the neural network becomes wider, the distribution over the functions it computes converges to this Gaussian process. Yeah, I wasn't talking about the width of the neural network. I was talking about the width of the distribution in z^L. So that depends — that's a really good question, which maybe segues nicely to the next part of the talk. The width of the distribution of z^L depends on the kernel or the covariance matrix of the Gaussian process. And we're about to derive that. OK. Cool. OK, so these next few slides are going to be a little bit mathematically dense. Hopefully they provide insight and interest as well. So now that I've shown you with a cartoon what the result is going to be, let's work through a sketch of why it's true. The derivation I'm going to give here is closest to that in Roman and Lechao's ICLR 2019 paper, which is the bolded reference in the lower right. So I would recommend looking there if you want more details. So the first thing we're going to do is note that the pre-activations z^l — these are the inputs to the neurons at layer l of the neural network — are a weighted sum of Gaussian random variables, corresponding to the weights and the biases of the network, where the coefficients for each of those Gaussian random variables are the preceding activations y^l. So what this means is that the z^l are jointly Gaussian, or are a Gaussian process, conditioned on the preceding activations y^l. And I want to emphasize here that the z^l are a Gaussian process because the weights and biases are draws from a Gaussian. We're not assuming the activations in the previous layer are Gaussian, and we haven't yet taken anything to be infinitely wide. This is true even for finite width neural networks. 
The covariance or kernel of this Gaussian process describing z^l depends on the covariance, or the Gram matrix, K^l of the preceding activations y^l. And here the weight initialization scale sigma_w^2 just changes the overall magnitude of the covariance matrix: if you take your y's and multiply them by larger weights, then you're going to get the same covariance matrix, just scaled up. The bias, on the other hand, is added identically for all inputs — the bias is shared for all inputs x. And so the bias makes data points more similar and makes the covariance matrix more like a constant. In practice, if we were to apply this analysis to a neural network, we would compute this covariance matrix or Gram matrix K^l for all pairs of data points. And so K^l would be a (number of data points) by (number of data points) second moment matrix. Cool. So what this says is that the pre-activations at layer l are a Gaussian process conditioned on the preceding activations. In other words, if you take the y's and you apply an affine transformation to them with random weights and random biases, then after the affine transformation you have a Gaussian distribution, induced by the Gaussian distribution over the weights and biases. But if we just look at this for a moment, we can notice that z^l only depends on y^l through its second moment matrix K^l. Because of this, we can say that z^l is a Gaussian process conditioned on the second moment matrix K^l, rather than conditioned on y^l. And just for reference, you can see that I've added the equation for the Gram matrix or second moment matrix on the right: K^l(x, x') is the average over units i of y_i^l(x) y_i^l(x'). Cool. OK. So this particular step is where we're going to reach peak mathematical complexity in the whole talk, and then it will be a relaxing and hopefully engaging path from there on. But that also means that this slide, as I finish it, is a particularly great spot to ask questions. So K^0 is the data-data second moment matrix or Gram matrix at the input layer. K^l is the Gram matrix of y^l. y^l, as you might remember from the definition of the neural network, is the activations after applying the nonlinearity. So y^l is equal to phi of z^{l-1}. And so what we can do is take this definition of the Gram matrix and substitute phi of z^{l-1} into the equation for K^l. However, from step 2 above, we also know that the z^{l-1}, the pre-activations in the layer below, are themselves samples from a Gaussian process given the Gram matrix K^{l-1} from the layer below. And so what this means is that K^l here is a sum over units i of phi(z_i^{l-1}(x)) times phi(z_i^{l-1}(x')), for inputs x and x', divided by the number of units. So K^l is an average over samples from a Gaussian process. And as n^l goes to infinity, as the number of samples in the average goes to infinity, we can replace this average over samples with an integral over the distribution generating the samples, where, again, this distribution generating the samples is just a multivariate Gaussian. So in the limit, we replace the entries in the second moment matrix for each pair of inputs x and x' with an integral over a 2D Gaussian of the product of phi(z) and phi(z'). This integral is deterministic. So K^l given K^{l-1} is deterministic. 
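(To make that 2D integral concrete, here is a hedged sketch of computing one entry of K^l from the corresponding entries of K^{l-1}, estimating the Gaussian expectation by Monte Carlo. The function name, hyperparameter values, and the use of sampling rather than quadrature or a closed form are all illustrative choices, not the speaker's implementation.)

```python
# Hedged sketch: one entry of K^l from K^{l-1}, by integrating phi(z) * phi(z')
# against a 2D Gaussian whose covariance is read off K^{l-1}.
import jax
import jax.numpy as jnp

def next_kernel_entry(k_xx, k_xxp, k_xpxp, phi=jnp.tanh,
                      sigma_w2=1.5, sigma_b2=0.05,
                      num_samples=200_000, key=jax.random.PRNGKey(0)):
    """K^l(x, x') = E[phi(z) phi(z')], where (z, z') are jointly Gaussian with
    covariance sigma_w^2 * K^{l-1} + sigma_b^2 for this pair of inputs."""
    cov = sigma_w2 * jnp.array([[k_xx, k_xxp],
                                [k_xxp, k_xpxp]]) + sigma_b2
    zs = jax.random.multivariate_normal(key, jnp.zeros(2), cov, (num_samples,))
    # Monte Carlo stand-in for the exact 2D Gaussian integral.
    return jnp.mean(phi(zs[:, 0]) * phi(zs[:, 1]))

print(next_kernel_entry(1.0, 0.7, 1.0))
```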
And so for shorthand, we will define a functional F, which corresponds to computing this 2D integral for all pairs of inputs and which transforms K^{l-1} into K^l. I would like to emphasize that computing this transformation only involves computing a 2D integral. There are a bunch of situations where you can solve this integral analytically. For instance, if the nonlinearity in your network is an error function or a rectified linear unit, you can solve it analytically. But even when you can't solve it analytically, you can efficiently compute a 2D integral numerically. OK, so this was the mathematical high water mark. This is an excellent time to pause for questions. If you have a question, someone else probably does too. So please, please ask questions. Just in a general sense — why aren't the inputs to the neural network represented by a single one-dimensional x? Why is there an x prime? What's that representing? Ah, OK, so x and x prime are two different inputs to the neural network. So what we're looking at here is — yeah, what we're really interested in is how the neural network output for input x prime will relate to the neural network output for input x. And it turns out that the relationship between the neural network output for input x prime and the neural network output for input x is that they're just going to be jointly Gaussian with each other. And so you can fully characterize their relationship if we can figure out the covariance of that Gaussian. OK, so you're just trying to create an error signal so you can drive the network — Yeah, so actually, this is not for training. So far, what we're trying to do is understand the behavior of the neural network at initialization. So even before you train the neural network, you just draw the parameters in some way; you just randomly initialize the neural network. And at random initialization, the neural network computes some function. And later in the talk, we are going to discuss at a high level what happens during training. But right now, we're just asking: this function that you get by randomly initializing the neural network, what does it look like? What are the properties of that function? Could you go back a few slides, to where there were two different plots showing neural networks? Actually, yeah, it's here. And the next one maybe. The next one. Oh, yeah. So if we know beforehand that our neural network behaves as a GP, then we expect the lower right plot. So my question is, does the plot in the lower right serve as a justification that our neural network actually works like a GP? So if we have a small neural network, which is maybe not very wide, but I make the same plot as you did in the lower right, and I see a similar plot, does it mean that my neural network actually works like a GP? So I should be careful answering this. The answer is that it is suggestive, but it's not enough evidence on its own to guarantee it. And the reason for that is that you could, for instance, have a lot of samples that look to your eye like they're jointly Gaussian, but they could have hidden structure which is not apparent to you. This is always a problem with modeling data. You can never be sure that there's not some other explanation of the data that perfectly explains, for instance, what looks like random noise to you, or that explains fine structure that you just don't even recognize. 
And so it would be very hard to just look at the output of the network and conclude definitively that it is very well described by a simple distribution. But if you were to plot the outputs of your network and you were to get something that looked a lot like a Gaussian, then that would be pretty strong evidence that it's Gaussian-like. OK, thank you. I have one question regarding this definition. According to this definition, doing the forward pass through the network at a point x depends on all of the data points that you have in the data set, like the mean and the covariance matrices? No. So each point goes through the network totally independently of the others. But the outputs end up correlated because they all use the same weights and biases. So it's not that the data points know about each other. It's that if you take two data points and you add the same bias vector to them, then their covariance is going to be a little bit larger after you add the same bias vector than it was before you added it. OK, thank you. So the similarity between data points is induced by the fact that the data points are going through the same neural network with the same parameters. They don't talk to each other. Cool. So the last bit of this derivation sketch is just putting these pieces together. From point one, we know that the top layer logits, the outputs of the neural network, are Gaussian given the second moment matrix K^L of the preceding activations. In turn, we know that each of the second moment matrices K^l is a deterministic function of the second moment matrix from the layer below. And so if you just apply this recursively, what this means is that the top layer logits are a GP conditioned on the second moment matrix of the inputs, and thus conditioned on the inputs. And this Gaussian process has a particular compositional kernel or covariance matrix, which can be found by applying the functional F to the input kernel L times in a row, which we write as F^L, and then including the terms sigma_w^2 and sigma_b^2 to account for the variance of the readout weights and biases. So what this says is that the distribution over the functions computed by a neural network at initialization is described by a Gaussian process with a particular compositional kernel that we know how to compute. I think I'm going to move on. So the initialization of a neural network, of course, corresponds to the very beginning of training. And so one of the very first things we can do with our understanding of the distribution over functions induced by a neural network is to predict trainability — to predict whether the neural network will be trainable or not. So in order to evaluate trainability at initialization, we can look at what happens to this second moment matrix or Gram matrix as the network becomes deep. So as a reminder, K^l you can think of as being the covariance, or inner product, or second moment between different inputs to the neural network at layer l of the network. So you can imagine you put an input into the bottom of the network, and then you pass it through an affine layer and a pointwise nonlinearity, and an affine layer and a pointwise nonlinearity. And after you do that l times, you have the pre-activations z^l at layer l of the network. And they have a second moment matrix, a covariance matrix, which comes from K^l. 
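(As an aside, for a ReLU nonlinearity the 2D Gaussian integral has a well-known closed form, so the whole compositional kernel can be written down in a few lines. The sketch below is illustrative only — the depth, hyperparameter values, and toy inputs are placeholders — and it follows the recursion just described.)

```python
# Hedged sketch: the compositional NNGP kernel for a ReLU network, using the
# closed-form 2D Gaussian integral E[relu(z) relu(z')] (the arc-cosine kernel).
import jax.numpy as jnp

def relu_expectation(cov):
    """E[relu(z) relu(z')] for (z, z') ~ N(0, cov), applied entrywise to a full
    (num points) x (num points) covariance matrix."""
    diag = jnp.diag(cov)
    norm = jnp.sqrt(jnp.outer(diag, diag))
    cos_theta = jnp.clip(cov / norm, -1.0, 1.0)
    theta = jnp.arccos(cos_theta)
    return norm * (jnp.sin(theta) + (jnp.pi - theta) * cos_theta) / (2.0 * jnp.pi)

def nngp_kernel(X, depth=10, sigma_w2=2.0, sigma_b2=0.1):
    """Apply the functional F depth times to the input Gram matrix K^0, then add
    the readout weight and bias variances, as in the recursion described above."""
    K = X @ X.T / X.shape[1]                           # K^0: data-data second moment matrix
    for _ in range(depth):
        K = relu_expectation(sigma_w2 * K + sigma_b2)  # K^l = F(K^{l-1})
    return sigma_w2 * K + sigma_b2                     # covariance of the readout logits z^L

X = jnp.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])    # three toy inputs
print(nngp_kernel(X))
```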
So one common behavior of recursive processes like this is that they converge to a fixed point, and generically they tend to decay towards this fixed point exponentially quickly. That is, in fact, what happens here. So what we can do is write K^L, the Gram matrix at layer L of the network, in terms of a fixed point K* plus a term which decays exponentially with depth, over some depth scale which I'm going to write xi_c, towards that fixed point. And in order to understand trainability, we can examine how this decay towards a fixed point changes as a function of the initialization hyperparameters sigma_w^2, which is the scale of the weight initialization, and sigma_b^2, which is the scale of the bias initialization. So here we illustrate this for deep fully connected networks with tanh nonlinearities. And we see that we end up with two regimes with qualitatively different behavior. So we end up basically with a phase diagram describing the behavior of the neural network at initialization. When sigma_b^2 is large, the shared bias vector causes initially dissimilar inputs to converge. And the fixed point corresponds to the activation vectors at deeper and deeper layers of the network becoming identical for all inputs, which means that the entries in the Gram matrix K* also become constant. On the right side of the plot, on the other hand, the tendency of large weights to push apart initially similar inputs dominates. And no matter how similar inputs to the neural network start out, they are made increasingly dissimilar with increasing depth, as captured by the corresponding entries of the Gram matrix. And so these two behaviors induce ordered and chaotic regimes, where depending on how you initialize the neural network, you either expect all inputs to the neural network to collapse and become identical with depth, in the ordered regime, or you expect initially similar inputs to be pushed apart, and the network to essentially act as a hash function, in the chaotic regime. Does it make sense what this diagram is illustrating? So what's your intuitive explanation of this? Yeah, so the intuitive explanation is that if the bias is very large, then whatever the input to a layer is, after you add the bias, the signal is going to be mostly from the bias vector. And so if you take two inputs, no matter how different they are, because you're adding the same bias vector to them, they're going to look more and more similar to each other at each layer in the neural network, as you add the shared bias vector over and over and over again. And so they will collapse towards becoming identical vectors deep in the network. I see. So, I mean, presumably having a very, very large bias doesn't necessarily give you a good network. So the fact that they converge — they might be converging to something you don't want? Yeah, for sure. So you can kind of imagine that what this plot is showing you is two different pathologies that can happen. And as we'll get to in a minute, there's a dashed line in the middle, at the phase transition, which exactly balances between these two pathologies. And that turns out to be where you want your neural network to live. So in the ordered regime, the neural network is useless because all information about the inputs is lost; they just become identical. 
And in the chaotic regime, the neural network is useless because no matter what structure there was in your input data distribution, it's being pushed through a hash function. You just get essentially random outputs for each of your inputs. And both of those are very hard to train, and also won't generalize well even if you do succeed in training them. So far, we've talked about propagation of signals. We've talked about how the distribution over network logits is a Gaussian process. However, gradients are linear operators, and a linear transformation of a Gaussian is still a Gaussian. So, subject to some weak constraints, the same is true of a Gaussian process: if a function is described by a Gaussian process, then its derivative is also described by a Gaussian process. This means that we can do a nearly identical analysis of the input-output Jacobian of the network as we did for the forward-propagated activations. And here the input-output Jacobian of the network is just the derivative of the neural network outputs with respect to the neural network inputs x. So it's dz^L/dx. When we do this, we find that in the ordered phase, the Jacobian norm goes to 0 and gradients vanish. And this is maybe what you should expect, because in the ordered phase, all inputs to the neural network become identical. So you change the input and the output doesn't change, which corresponds to the gradient going to 0. Whereas in the chaotic phase, because small changes to inputs cause very large changes in outputs, the Jacobian norm goes to infinity and gradients explode. We typically train neural networks by gradient descent on their parameters. And so you might imagine that if the gradients go to 0 or the gradients go to infinity, both of these behaviors make training neural networks quite difficult. There is, however, a line down the center where something different must happen. So along this critical line, it turns out that the two tendencies perfectly balance: K does not exponentially decay towards its fixed point, and the gradient does not exponentially explode or vanish. So it seems plausible that if you want to train an extremely deep neural network, this line is likely a really good place to do it. This should be very reminiscent of phase diagrams from physics, where you have some order parameters describing the behavior of the system — here, those parameters are the scales at which you initialize the weights and the biases — and the system behaves in very different ways depending on where you are in that phase diagram. And a lot of the most interesting behavior happens right at the boundary between phases. It happens when you're not falling into either the ordered, frozen regime or the chaotic, high-temperature kind of regime. So it seems like something magical might happen along this critical line that divides these two pathological regimes. And in fact, we find that to be true. So here's an experiment we did where we train vanilla convolutional neural networks. These are CNNs, but without residual connections or batch norm or any of the other usual techniques or tricks people use to train very deep neural networks. Basically, the only thing we do is use an initialization that's carefully chosen to lie exactly on this critical line. 
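(To give a rough sense of what choosing an initialization on the critical line can involve for a tanh network: the order-to-chaos transition happens where the quantity chi_1 = sigma_w^2 E[phi'(z)^2], evaluated at the fixed point of the single-input variance recursion, equals 1. The sketch below scans sigma_w^2 for a fixed sigma_b^2, estimating the Gaussian expectations by Monte Carlo. The specific numbers and function names are illustrative, not the values used in the experiments described here.)

```python
# Hedged sketch: numerically locating the critical line for a tanh network.
# chi_1 < 1 corresponds to the ordered phase, chi_1 > 1 to the chaotic phase,
# and chi_1 = 1 to the critical line.
import jax
import jax.numpy as jnp

z = jax.random.normal(jax.random.PRNGKey(0), (200_000,))  # reusable standard normal samples

def q_fixed_point(sigma_w2, sigma_b2, num_iters=50):
    """Iterate q <- sigma_w^2 * E[tanh(sqrt(q) z)^2] + sigma_b^2 to (approximate) convergence."""
    q = 1.0
    for _ in range(num_iters):
        q = sigma_w2 * jnp.mean(jnp.tanh(jnp.sqrt(q) * z) ** 2) + sigma_b2
    return q

def chi1(sigma_w2, sigma_b2):
    """Mean squared derivative through one layer, evaluated at the variance fixed point."""
    q = q_fixed_point(sigma_w2, sigma_b2)
    dphi = 1.0 - jnp.tanh(jnp.sqrt(q) * z) ** 2            # tanh'(x) = 1 - tanh(x)^2
    return sigma_w2 * jnp.mean(dphi ** 2)

sigma_b2 = 0.05
for sigma_w2 in jnp.linspace(1.0, 3.0, 9):
    print(float(sigma_w2), float(chi1(float(sigma_w2), sigma_b2)))
```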
We also choose this initialization to satisfy another property called dynamical isometry, which demands that the singular value spectrum of the Jacobian also be well conditioned, and which I'm not going to talk about here. And by making this simple change to the initialization, we are suddenly able to train CNNs up to depths of at least 10,000 layers, which is as far as we could push it with the compute we had, with near-identical train and test accuracy to shallower networks. This is orders of magnitude deeper than had been achieved with vanilla architectures before this point. So, cool. We've just taken a little bit of a path here. We've done some first-principles analysis, and we've understood the distribution over functions that neural networks will have at initialization, depending on their architectures and hyperparameters. Then we've made a prediction about how that distribution over functions will make neural networks either easier or harder to train. And then we've used that prediction to initialize a neural network in a way that we think will make it very easy to train. And we've shown that by doing this, we can train neural networks that are several orders of magnitude deeper than was previously possible. So, cool. We've gone from an experimental observation that large width helps, to a theory for behavior at large width, to a practical recipe for training deeper networks. So given that the accuracy is the same for 1,000 depth as 10,000, what's the advantage of 10,000? Yeah, that's an awesome question. So one part of the answer is that at some point there's not going to be much more advantage. Probably by the time you've gone to depth 1,000, going from depth 1,000 to 10,000 doesn't give you a huge advantage. Not shown here, but increasing depth at smaller numbers, like up to order 100, shows significant advantages in accuracy. If you had infinite compute and infinite data, then the expressivity — the amount of information, the space of functions that the neural network is capable of learning — does get greater as it gets deeper. So in some infinite-training-data, infinite-training-compute limit, being able to train deeper would continue to be better. So you start the network off on the critical line. But as training progresses, are there trends that tend to push it off the critical line, or is it going to stay on there? Yeah, actually, that segues into something else that I would like to talk about in a couple of minutes. But the answer is that when the neural network is relatively narrow, or when it has a relatively small number of channels, then over the course of training it will tend to wander off the critical line. But as the neural network gets wider and wider and larger and larger, it will stay closer and closer to the critical line throughout training. And that critical line was for some specific network. Do you have to recompute it every time you do this? Yeah, so in fact — and I'm not going to talk about these in detail — you can do a similar phase-diagram-style analysis for a whole variety of different architectures. But we are also still building out the universality classes — still building out what the sets of qualitative behaviors are that you can observe as a function of hyperparameters in neural networks. And we're starting to get some work that unifies the analysis for different networks. You should look especially at recent papers by Greg Yang. 
But yeah, you can do a closely analogous type of analysis — build a phase diagram for the neural network and predict its behavior based upon that phase diagram — for many, many architectures or network designs. So what have we done? We've said that over-parameterization practically seems to make neural networks do better. We've derived the distribution over functions which is induced by random initialization of neural networks, and we've shown that the distribution over functions corresponds to a Gaussian process. We've shown that by looking at the distribution over functions you get from random initialization, you can predict whether or not a neural network will be trainable, and you can find the hyperparameters and the architectures that will be most trainable. And I promised earlier that we were going to talk about networks after training. So let's briefly talk about how this relates to trained networks. There are maybe two approaches one might imagine using to train a neural network. The first of these is Bayesian parameter estimation, where we sample the parameters of the network from their posterior, given the prior over parameters and the observed training samples. And this, if you're a Bayesian at least, is perhaps the ideal way to train a neural network, but it is also often impractical. The second and more common way that we train neural networks is just steepest descent — gradient descent on the network parameters. I will not go into mathematical depth here, but it turns out that a closely related analysis to what we discussed in the first part of the talk can be applied to both of these approaches. So let's just look at this for the cartoon that we used in the first part of the talk. Let's imagine now that x is a training point and x* is a test point, and we know that the network output is 1 for x. So that's our training data point. This is equivalent to slicing the joint distribution along this z = 1 line. However, because we have two equivalent ways of writing the joint distribution, we also have two equivalent ways of writing any conditional or marginal or posterior distribution which is derived from the joint. So for instance, one way in which we could make fully Bayesian predictions from a neural network is to draw parameters theta from the posterior distribution of the parameters given the training data. Here that's just p(theta | z^L(x) = 1). And then, after drawing the parameters from their posterior distribution, we could evaluate z^L, the output of the neural network, at the test point for each of those parameter values. So here the x-axis is the logit value for the test point, or the output of the neural network for the test point, and the y-axis is the probability of observing that logit value. The distribution over test set predictions that you would get from this method of drawing the parameters from the posterior corresponds to the black delta functions here. The other way, though, in which we could evaluate the Bayesian posterior is just to compute in closed form the posterior distribution under the corresponding Gaussian process. This here is indicated in red. And in the infinite width limit, because the distribution you get by drawing the parameters from their prior goes to a Gaussian process, this conditional distribution of neural network outputs given the training data converges as well. 
And you will get exactly the same distribution over test predictions by evaluating the GP as you would by drawing the parameters from their posterior. So just to say this a slightly different way: we're interested in the Bayesian posterior over test points. And we could either compute this by sampling parameters from the posterior and evaluating the corresponding test prediction for those parameters, or we can compute it by directly evaluating the distribution over test set predictions as defined by the GP, in closed form. And so what this means is that if we want to evaluate the predictions of a wide Bayesian neural network, we never even have to instantiate the network. We can just evaluate the corresponding Gaussian process — the NNGP. I think I'm already over time. I may have two minutes, depending on how you count the start time. No, you can go ahead. Yeah, OK. So I will, however, go to the end without taking more questions. So the distribution over functions computed by Bayesian parameter estimation, given your training data, is described by a Gaussian process, and you can just compute it in closed form. You never even have to instantiate the neural network. Perhaps even more surprising than that is that closely related techniques also apply to neural networks trained by gradient descent. And it's going to turn out that you can also get the distribution over functions induced by gradient descent training of a wide neural network without ever instantiating the neural network. I'm only going to give a little bit of intuition for what's happening in this result. So the core effect is that in the large width limit, the number of units in intermediate layers goes to infinity, and each of those units changes only by an infinitesimal amount over the entire course of training. And that infinite number of units in the middle of the network, all of which change only infinitesimally, conspire together to produce an order-one change in the neural network output. And similarly with the parameters: the infinite number of parameters in the middle of the network, as the width goes to infinity, all change infinitesimally, and conspire to produce exactly the order-one change that you want to see in the output of the neural network. However, because each of the units and receptive fields in the middle of the network changes only infinitesimally, they can all be replaced by their first-order Taylor approximation. So in the limit of infinite width, it turns out that throughout training, a neural network can be replaced by its linearization. That is to say, the output of the neural network will correspond to its output at initialization, plus its parameter gradient at initialization times the change in parameters, plus a term that goes to zero with increasing width. Just to make this really clear, I would like to emphasize that this does not mean that deep nonlinear neural networks become linear networks as they get wide. Their output is still a super complicated nonlinear function of their inputs. It's just that throughout training, the change in their outputs is linear in the change in their parameters. And I presented this in parameter space, but you should see the NTK paper for the analogous ideas in function space. So we can explore this experimentally, which I'm not going to talk about. 
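(To give a flavor of what replacing a network by its linearization means operationally, here is a hedged sketch using a tiny toy MLP. The first-order Taylor expansion in parameter space can be evaluated with a Jacobian-vector product; the architecture, sizes, and perturbation below are placeholders, not the setup from the experiments just mentioned.)

```python
# Hedged sketch: a network's linearization in its parameters,
# f_lin(x; theta) = f(x; theta_0) + (df/dtheta)|_{theta_0} . (theta - theta_0),
# evaluated with a Jacobian-vector product.
import jax
import jax.numpy as jnp

def init_params(key, widths=(3, 512, 512, 1)):
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, kw = jax.random.split(key)
        params.append((jax.random.normal(kw, (n_out, n_in)) / jnp.sqrt(n_in),
                       jnp.zeros(n_out)))
    return params

def f(params, x):
    y = x
    for i, (W, b) in enumerate(params):
        z = W @ y + b
        y = jnp.tanh(z) if i < len(params) - 1 else z
    return z[0]

def f_lin(params, params_0, x):
    """First-order Taylor expansion of f around params_0, in parameter space."""
    dparams = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params_0)
    f0, jvp = jax.jvp(lambda p: f(p, x), (params_0,), (dparams,))
    return f0 + jvp

params_0 = init_params(jax.random.PRNGKey(0))
# Nudge the parameters slightly, as gradient descent would for a very wide network.
params = jax.tree_util.tree_map(lambda p: p + 1e-3, params_0)
x = jnp.array([1.0, 0.0, 0.0])
print(f(params, x), f_lin(params, params_0, x))   # nearly equal for small parameter changes
```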
We can also do one other cool thing, which I'll mention really briefly, which is that we know the full distribution over functions that you get by randomly initializing the neural network, and we know that if the network is wide enough, then over the course of gradient descent training the neural network evolves like a linear model. And we know how a Gaussian distribution evolves under a linear model. And so it turns out that we can derive not just the distribution over functions at initialization, but the distribution over the outputs of trained neural networks. And we can perform some experiments with this. I don't have time to go into this, but let me just briefly say that if you reduce the neural network to its simplest case — so you train the neural network with a very small learning rate, you don't use weight decay or L2 regularization, and so on — if you force the neural network to be trained without a lot of the tricks that improve test accuracy, then what you find is that for many but not all architectures, this infinite width limit, either infinitely wide gradient descent trained networks or infinitely wide Bayesian networks, outperforms the finite width networks. The CNN with global average pooling is maybe an exception to that. But you also find that if you start throwing in all the tricks that people use in practice when they train finite width neural networks, then the finite width neural networks perform better and better and better, and outperform the kernel methods. I think it will be super interesting over the next couple of years to see how far kernel methods can also be pushed as we develop analogous techniques that improve their performance. OK, and so, last slide. In summary, I have presented what I believe to be a powerful framework for theoretically analyzing neural networks. It can be applied to either Bayesian or gradient descent training of a neural network, in either function space or parameter space. And I've shown that this perspective enables us to predict how trainability depends on architecture and hyperparameters. I've also shown that it allows us to do seemingly semi-magical things, like compute the test set predictions that would come from training an infinitely wide neural network, without ever even instantiating a neural network. It's also provided several insights into network behavior, which I mostly haven't had a chance to discuss here, but things like showing that noise regularization limits trainable depth, or that batch norm induces chaos, or teaching us a little bit more about the role of equivariance and weight sharing in convolutional neural networks. Cool, so that's the summary. I also would like to call out — there's the GitHub link in the middle — that we have released a code library that allows you to build infinite width neural networks using identical code to what you would use to build finite width neural networks. And so if you want to perform your own experiments with infinite width neural networks, then you should follow that GitHub link, and it will be almost identical to the way in which you would perform experiments with finite width networks. You can build whatever architecture you like at infinite width. Cool, thank you very much. I apologize for trying to cram so much information into your brains in this relatively short amount of time. I think it was overambitious, but hopefully you also got a sense of how much possibility there is in this space. 
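(For reference, a rough sketch of the kind of usage that library supports. The API details here are written from memory and may not match the current version exactly, so treat the function names and signatures as approximate and check the repository linked in the talk for the authoritative interface.)

```python
# Rough, assumption-laden sketch of building an infinite width network with the
# library mentioned above: the same layer description yields a finite width
# network (apply_fn) and its infinite width NNGP/NTK kernels (kernel_fn).
import jax.numpy as jnp
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Erf(),
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Erf(),
    stax.Dense(1, W_std=1.5, b_std=0.05),
)

x_train = jnp.ones((5, 3))   # placeholder data, just to show the shapes involved
x_test = jnp.ones((2, 3))
k_train_train = kernel_fn(x_train, x_train, get='nngp')
k_test_train = kernel_fn(x_test, x_train, get='nngp')
print(k_train_train.shape, k_test_train.shape)
```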
Well, that was great. So thanks very much to you for getting up very early and for educating us. I had a lot of fun. We have a couple more minutes for questions, in case people wanted to ask something. There were many questions right during the talk, which is great. I can see two in the chat. So I'm just going to read out the first, which reads: do you think residual neural networks operate along the critical line naturally, given that they operate OK with increasing depth? Yeah, so the answer is, sort of. The phase diagram actually looks different for residual networks. You can actually adjust the scaling of residual networks so they operate a little bit better, in that, as they are typically implemented, the contribution from later residual branches becomes relatively smaller, which is not necessarily desirable. But also, I think a lot of modern architectural tweaks do something that you could also do with careful tuning of initialization hyperparameters. So I think things like layer norm, and to some degree batch norm and ResNets — it's never an exact one-to-one mapping, but they qualitatively move you into the good region of the phase diagram. Nice. Then there is one more question by Greg, who is asking how one should think about your cartoon of z^L of x versus the same thing for x*, or of the random function samples, both on the very early slides that you showed, in the case of a complex trained network, such as one that you would use to perform image segmentation. Yeah. So, I mean, I guess the basic answer is that you still want to compare different inputs to the neural network to each other. The big difference is that your z^L is no longer a scalar. Your z^L is now, for instance, maybe a class label for every pixel in the image. And so it's much harder to plot it in 2D. It turns out, actually, that for neural networks that produce multiple outputs, those outputs are jointly Gaussian with each other, as well as Gaussian across input examples. So you end up with a joint Gaussian with a covariance matrix that's (number of outputs times number of input examples) by (number of outputs times number of input examples). But it's still described by a GP. Can I comment on your answer? Of course. Go ahead, please. Thanks. I guess I was just trying to understand how to reconcile the simplicity of the GP aspect that you're describing with the obvious complexity of what a network like that performs. So what is it really? In the cartoon, you can't really see the complexity and the simplicity at the same time. So I'm sort of trying to understand what it is about the output that is simple, when the output itself is obviously a very complicated function of the input, you know? Yeah, yeah. I don't know if — OK, OK, so I can throw some thoughts out there, but we'll see if you find them satisfying or not. I think maybe one observation is that we think of Gaussians as being very simple, but a Gaussian distribution with a complicated covariance matrix in an extremely high dimensional space can perhaps capture surprisingly rich structure. That's maybe answer part one. Answer part two is that we like to imagine that neural networks are doing these incredibly complex things. It may be that they're not. It may be that they're doing something that's a lot closer to smooth interpolation between data points, which is more consistent with the GP kind of perspective. 
Answer number three is that the infinite width limit, at least the one that we took here, doesn't allow feature learning. And it's still very much an open question, I think, the degree to which nonlinear feature learning matters for different architectures. If you'd asked me a few years ago, I would have said nonlinear feature learning is the core of deep learning. And now I think it kind of depends. I think there are some situations where it matters a lot, and some situations where it seems not to really make much of a difference. But yeah, I don't know if any of that is satisfying. Thank you very much. That was very helpful, thanks. There's one more question that's been shared, which asks: everywhere you assume that your initial parameters are Gaussian, and this is where you get all the Gaussians from in the end. Did you look at any other distributions for the initialization of your parameters at all? And how would they look in this infinite width limit? Yeah, so most of the results hold as long as you have a central limit theorem style result that you can apply to the parameters times the inputs at each layer. It's a lot less clean — you don't get exact Gaussians; you have to say you approach a Gaussian as the network gets wider, and so on. But qualitatively it continues to hold for different initialization distributions. Is there time for one more question? Can I ask? I'm a bit worried about infinitely wide layers and thinking about them. If I was just doing a conventional polynomial fit to 100 data points and I had 95 free parameters, it would oscillate wildly and not be very generalizable. Now, is there a danger that with your very, very wide networks you will learn the specifics of the training set and not have good generalization? Or are you assuming that as you widen the network, you have more and more training data? Awesome. So what's really nice about having way more parameters than data points, which is the regime that we're in here, is that you're kind of free to interpolate between data points however you like. You have two data points and you need to fit these two data points exactly, and you have so many free parameters that you can make the function do anything you want between the two data points. And so what the function actually does between the two data points is basically inherited from the prior. It's basically inherited from the random initialization. So if you randomly initialize your function in the chaotic regime, so that it looks like this, then you're going to fit these two data points, and between the two data points it's going to do this, which is really bad. But if you randomly initialize the function in kind of a smooth regime, then you're going to adjust it to fit the two data points, but its behavior away from the data points — the way in which it interpolates between data points — is going to be smooth, like its initialization was. And it won't do crazy things. There's one more question, which asked about your library actually, I think, and specifically about its memory requirements when you do this NNGP thing. Do you have to store any big matrices when you do that? Or is it reasonably efficient? It depends on the architecture. But by definition, you need to store a matrix which is number of data points by number of data points. And so for this reason, you can probably work with CIFAR-10. 
You can probably work with 50,000 data points. You're probably not going to be able to work with 500,000 data points. And there's a trade-off, which is that the training cost of typical neural networks is roughly linear in the data set size, while exact Gaussian process inference is cubic in the number of data points that you're working with. So if you want to do practical things with this, you either need to use egregious approximations or you need to be working with a quite small data set, where for small data sets this may actually work better than training the neural network. But there are compute costs. Yeah, that makes sense. Thanks. OK, someone I think also had a question at the end. Do you still have it? Yes, sorry, I just wanted to clarify what I think you said. Is it that the more layers you have, the more stable the critical line is, or not necessarily? The more layers you have, the better you're described by this kind of asymptotic-with-depth behavior. OK, so, I overshot a bit. But yeah, what happens is this. We know that if you go infinitely deep, then the neural network is going to have this fixed covariance matrix. And we know that if you go very deep, but not infinitely deep, then the covariance matrix is going to decay exponentially towards this fixed covariance matrix. If you're quite shallow, then you may not be in the regime around this fixed point that you can linearize; you may not be in this nicely behaved exponential decay towards the fixed covariance matrix. So the division of the neural network into ordered and chaotic regimes becomes much cleaner the deeper you go. But even for shallow networks, the critical line seems to be a pretty good place to be. Thank you. Nice, thanks indeed. OK, I don't see any more questions. So let me thank you once again, on my behalf and on behalf of the entire audience, for this great talk. I think all of us really learned a lot. So thanks a lot, Jascha. Thanks for getting up early. Thanks for contributing to this seminar series. Thank you. It was a pleasure. It was a pleasure to be here. Thanks, Jascha, and thanks, Vili. Bye. See you next week then. Bye-bye.