Last Thursday I showed you a recipe, one particular recipe that I quite like, to turn any neural network structurally into a Gaussian process. It works in the following way. You realize that the thing the deep network actually tries to achieve is minimizing an empirical risk, potentially a regularized empirical risk, which we can interpret as a negative log posterior. So that means there is an interpretation for what's happening in this deep network based on the geometry of this loss function. It defines an unnormalized posterior distribution, assuming that the regularizer acts as a negative log prior and that the empirical risk actually is a negative log likelihood, which it is for most commonly used loss functions like the cross-entropy loss and the squared loss and so on. So what we can then do is what people usually do in deep learning: you train the network to minimize this risk. Then, once you have found this minimizer, as step two we locally construct a second-order Taylor approximation. So we construct the Hessian of the loss function, which is a matrix, and that gives us some sense of the geometry of the loss around the point estimate. In particular, directions with very low curvature are directions in which the loss barely changes if we change the weights. Right? So if you think of, let's say, this dimension here, and say the loss is very flat in this direction, then this means if I move the weights from here to there and from there to there, the loss barely changes. So that means there are almost no data terms in this log likelihood that actually get assigned a different label if we move the weights around; the loss is pretty much the same. And therefore that implies that the model is uncertain in this direction. Right? There's low sensitivity of the loss with respect to this direction, so there are many possible values of the weights that all explain the data equally well. And that's exactly what a broad posterior means: there are lots of hypotheses in the space that all explain the data equally well. So low curvature means high uncertainty. A low eigenvalue in the Hessian means a large eigenvalue in the inverse Hessian, and that means high uncertainty. And then — okay, let me rotate the coordinate system for you — there might be another direction where the loss is very sharp, where a small change in the weights already produces a sharp change in the loss. That means that a lot of the explanation of the data hinges on this weight having a particular value. So that's low uncertainty, high confidence in the weights. So we construct the Hessian — we did this in the code — and then we treat this quadratic form of the loss as an approximate Gaussian distribution on the weight space, with a mean given by the trained weights and a covariance given by the inverse of the Hessian. And by the way, last time I confused myself in the code a bit with these minuses. I decided to put some minuses into the code, and that was just wrong; the original form was correct. I've fixed it now on Ilias, so it actually does the right thing.
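To summarize that recipe compactly (my own shorthand, not notation from the slides): read the regularized empirical risk $L(\theta) = \sum_n \ell\big(y_n, f(x_n;\theta)\big) + r(\theta)$ as a negative log posterior; then the Laplace approximation is

$$ p(\theta \mid \mathcal{D}) \;\approx\; \mathcal{N}\big(\theta;\ \theta_*,\ \Psi\big), \qquad \theta_* = \arg\min_\theta L(\theta), \qquad \Psi = \big(\nabla^2_\theta L(\theta_*)\big)^{-1}, $$

so directions of low curvature in the Hessian become directions of large variance in the approximate weight posterior.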
So that gives us a distribution on the weights, but we would like to predict in output space, right? We would like to know what the uncertainty over the prediction of the neural network actually is — that's what we're used to from regression and classification. So we need to somehow push the uncertainty on the weights forward onto an uncertainty on the function values. And we decided to do this by another linearization. So we take the network and linearize it, i.e. do a Taylor expansion around the trained weights. We write the network as a constant term plus a term that is linear in the weights, and this linear term involves the gradient of f with respect to the weights. So f is a function of the inputs x and the weights; x is the bit that comes from the data as an input. Typically f is a multivariate function. In the example that I used it's a univariate function, because we do binary classification, and I've decided to stick with univariate outputs for all of the examples just to simplify the notation and not mess up the array slicing with multivariate outputs. But in general, for a multiclass problem for example, f would be a multivariate function, so the Jacobian would really be a rectangular matrix, and that rectangular matrix is itself a function of x. So that gives us a function-valued object which is linear in the weights. And that is exactly the kind of structure that we had in parametric regression in lecture — what was it? Seven. When we first did it. Actually, I think I still have it up here. So that's exactly the kind of setting we had before as well: we have a function that is some feature of the input x times something linear in the weights. And that's exactly the structure we have here now. There's just a constant in front, which happens to be the trained network, and then there is a term that depends on the distance to the trained weights, with some kind of feature function phi, if you like. And if we marginalize this Gaussian distribution over the weights, then because the object is a linear function of the weights, we get another Gaussian-valued object; but because it depends on the input x, it's a Gaussian process — a function-valued thing, a probability distribution over a space of functions — with a mean function given by the trained net and a covariance function, a kernel, given by this object, which I call the Laplace tangent kernel: an inner product between these Jacobians and the inverse of the Hessian. So here's maybe a moment where I can talk about why this is called the Laplace tangent kernel. I actually had a slide on it in the last lecture, but I don't have that slide here; you can look it up in the old lecture. So, who has heard of the neural tangent kernel before? Ah, not so many people — that's fine, you don't have to have heard of it. There is a relatively widely cited paper from 2018 by Jacot and others who introduced this notion of a neural tangent kernel. And that kernel happens to have, in our notation, a form like this: you can think of it as a covariance between f at bullet and f at circle, given by an outer product of Jacobians. So clearly this has a lot of structure in common with our kernel — in particular, it also is this outer product of the Jacobians at the two inputs. But there are two things that are different.
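In symbols — again my own shorthand — the linearization and the push-forward read

$$ f(x;\theta) \;\approx\; f(x;\theta_*) + J_{\theta_*}(x)\,(\theta - \theta_*), \qquad J_{\theta_*}(x) = \left.\frac{\partial f(x;\theta)}{\partial \theta}\right|_{\theta=\theta_*}, $$

and marginalizing over $\theta \sim \mathcal{N}(\theta_*, \Psi)$ gives

$$ f \;\sim\; \mathcal{GP}\big(f(\,\cdot\,;\theta_*),\ k\big), \qquad k(x,x') = J_{\theta_*}(x)\,\Psi\,J_{\theta_*}(x')^\top \quad \text{(the Laplace tangent kernel)}. $$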
One of them is that this tends to be evaluated at the initial weights, where the network is initialized, rather than the final weights. And then there's no matrix here in the middle — or, implicitly, there is an identity matrix in the middle — rather than the inverse of our Hessian. So this object comes from a theoretical analysis of the training behavior of deep neural networks that has pretty much nothing to do with uncertainty quantification. It was constructed as a theoretical tool, not something to use in practice: a tool for the analysis of the behavior of stochastic gradient descent when applied to a network that is initialized with these weights, after which you run stochastic gradient descent. And we are not going to talk about what that analysis is. There is some big argument that essentially reminds people that for infinitely wide neural networks you get back a Gaussian process — and you get back a Gaussian process even for deep networks if all of the layers are infinitely wide — and therefore one can say something about the behavior of stochastic gradient descent: that the problem is, in some sense, approximately quadratic and there will be good convergence behavior. But this has nothing to do with uncertainty quantification, and now your question might be: what does it have to do with tangents? So where have you heard the word tangent before? Yes, you, raising your hand. Yeah, a tangent is something that points tangentially to some surface. I think in the research community people tend to think of Riemannian geometry when they hear 'tangent'. There's this whole generalization of geometry, due to Riemann, to spaces on which one needs to take care how to measure distances — a generalization of classic Euclidean geometry. And I think the name tangent kernel is supposed to be evocative of big, cool, non-Euclidean geometry. But if you read the original paper that introduces the neural tangent kernel, you can grep for 'tangent' and you will find that they never mention geometry at all. They only mention the word tangent kernel when they define the tangent kernel, and then they never explain why it's called the tangent kernel. So I actually don't even know what their motivation was for this word. I think if you really wanted a motivation, then you can think of this distribution up here. Remember that it's a Gaussian distribution. So when you want to compute the probability of some theta under this probability density function, you're going to compute an object of the form (theta minus theta star) transposed, times Psi inverse, times (theta minus theta star). That's the operative term in this Gaussian distribution. And you could read this as a squared Euclidean distance between theta and theta star, weighted by the inverse of this matrix Psi. It's just a rescaling into a different coordinate system — but a linear rescaling, because Psi is just a matrix. So this is an essentially Euclidean way of measuring distance between weights. And now you can think of some kind of space — weight space, where the weights live. We have our trained point, this is theta star, and we measure the distance to it along some kind of elliptical contours around it.
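Side by side, again in my own notation, the two kernels being contrasted are

$$ k_{\text{NTK}}(x, x') = J_{\theta_0}(x)\, J_{\theta_0}(x')^\top \qquad \text{vs.} \qquad k_{\text{Laplace}}(x, x') = J_{\theta_*}(x)\, \Psi\, J_{\theta_*}(x')^\top, $$

with $\theta_0$ the initial weights, $\theta_*$ the trained weights and $\Psi$ the inverse Hessian; the operative term of the weight-space Gaussian is the quadratic form $(\theta - \theta_*)^\top \Psi^{-1} (\theta - \theta_*)$.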
And because f is a function of theta — a nonlinear function of theta — there is a directly associated random variable, f of x, which is the push-forward, the transformation; remember lecture 3, right? Random variables are constructed by taking a probability measure on some original space and mapping it forward through some function. f is such a random variable, but because it's a nonlinear function of the weights, this kind of distance measure on the Euclidean weight space turns into some non-Euclidean distance measure on this curved space of functions f(x; theta). So if you have a probability distribution that is Gaussian in weight space, it's going to be non-Gaussian in function space. But we can locally measure distances, for small distances around the trained network f(x; theta star), by computing the tangent surfaces to this manifold and then taking steps along the tangents to approximately measure distance. And this actually has an interpretation in Riemannian geometry; it is, in some sense, well defined. If you compute distances between points locally, you need these curvature measures, and they come in the form of these tangent operators. So you can think of this as an approximate Gaussian push-forward, a translation from the weight space into the function space. And we did this in the code I showed you last time — let me actually have a look. Above, there was this whole bit that we've now looked at a few times: designing a neural network and training it. Back then, last Thursday, I used a pretty big network to show you that we can do this with deep networks. And then we realized that we got lots and lots of uncertainty, which is maybe not so surprising because there are 128 training points in this dataset but thousands of weights in this neural network — so of course they're not all fully constrained. So what I've now done in this example is I've changed the code. I'm not going to run it again; I'm just going to tell you that I've changed the architecture to something much smaller. It's now a network that maps from the two-dimensional input to a 32-dimensional latent space and then back to a single output. And if you train this — actually, you can run the whole code again — it still achieves a test-set accuracy of 100%. So it's still a decent model for this tiny little dataset. It produces this kind of prediction. And then I've cleaned up the code a little bit. You can now find it on Ilias, although I still have to upload the version that I corrected right before the lecture. We decided to compute the Hessian — this is the line, the magic line that actually does the heavy lifting. And then there's a little bit of annoying bookkeeping to transform it into a matrix that has the right size: before that, it's a structured data type, and we have to turn it into an actual matrix that we can operate on. And then you can look at what this Hessian looks like. It's now a bit smaller than last time, and you can see a lot of structure in this Hessian. Red entries mean comparably high curvature — these are the directions that are relatively constrained by the data. And low numbers in log space, the white entries, mean very low curvature — these are directions that are totally flat in weight space.
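To make that bookkeeping concrete, here is a minimal sketch of one way to do it in JAX; `params` and `loss_from_pytree` are hypothetical names standing in for the notebook's trained parameter pytree and its regularized loss, and the actual lecture code differs in detail.

```python
import jax
from jax.flatten_util import ravel_pytree

# Flatten the trained parameter pytree into a single vector, and keep the
# inverse map so the loss can still be evaluated on the original structure.
theta_star, unravel = ravel_pytree(params)          # (D,) vector, unravel fn

def loss_flat(theta):
    # Regularized empirical risk as a function of the flat weight vector.
    return loss_from_pytree(unravel(theta))

# The "magic line": a dense (D, D) Hessian at the trained weights.
H = jax.hessian(loss_flat)(theta_star)
```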
And yeah, you might see that there are some interesting structures in here, like for example this three-by-three structure here and so on. It tends to be the case that you see structure like this, and it's not always possible to interpret it directly. Why? Well, because each of these is just one randomly initialized weight for a ReLU feature, so we don't know where they lie in input space. The fact that it's these three doesn't mean anything; it could also be these three or those three or whatever. There's a permutation invariance between those weights. And then we have to invert this matrix to construct our Gaussian process approximation. And this is not just some accident, as if we had defined things the wrong way around. It's actually necessary, because the inverse is what relates the weights to each other, so that we can think of them as error bars rather than as positions. And I decided last time to do this in the brute-force way, by really computing the eigenvalue decomposition and inverting the eigenvalues. The advantage of this is that it's, in some sense, the correct thing to do — it's the most universal way of thinking about the inverse of a matrix. Of course it's by far not the cheapest way to do it, and it's not something you can do tractably for a large neural network. But for this tiny little thing we can actually do it, and then look at it. In a moment, maybe after the break, I'll show you a slide on how to do this in practice if you have a larger network. And I've now added a new cell here. One of the things you might want to do — and that's actually your homework exercise, though not in this naive way but in a more elaborate way — would be to say: if you look at this big matrix, you might realize — you maybe can't really see these blue lines, maybe I have to make them a bit bigger — that you can see the structure of the network in here. So this is the input layer and the input biases, and then this is the output layer. This corner here is the output layer with the output biases, which are now right behind this blue line that you can't see anymore — just a single bias, yeah. So the question is: what can we actually make of the repeating structure within these blocks? There are several possible sources for this. I think the reason we see this structure here is that the input domain is two-dimensional, right? It's a two-dimensional input that gets labeled. And so the first layer maps two inputs onto 32 outputs. That means the incoming weight matrix is of size two by 32, and there's some kind of fan-in structure: each of the 32 latent nodes has two inputs. So if there is some structure in the first input dimension, you tend to see it repeated across all of these units. Or if there is a unit whose switch-on point is far away from the origin, then it's far away both in x1 and x2, and you see that structure repeated. So this is one way these structures tend to show up, in particular in deeper networks, in intermediate layers. And actually, that's a motivation for one of the approximations that are commonly used, which we'll come to.
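A minimal sketch of that brute-force step, assuming `H` is the dense symmetric Hessian from above (again, the notebook's actual code differs):

```python
import jax.numpy as jnp

# Eigendecomposition H = V diag(e) V^T of the symmetric Hessian.
eigvals, eigvecs = jnp.linalg.eigh(H)

# Invert the eigenvalues to get the covariance of the weight-space Gaussian.
# With near-zero or negative eigenvalues this blows up; the lecture later adds
# a prior precision to the diagonal to keep all eigenvalues positive.
Sigma = eigvecs @ jnp.diag(1.0 / eigvals) @ eigvecs.T
```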
And so one thing you could do — and you'll do it in your homework exercises this week — is to say: huh, okay, maybe we'll just look at the last layer of this network, this corner down here. Why? Well, first of all, because it's much smaller, and even for a deep network, even for a really large one, it might be quite tractable to compute that Hessian and invert it. And if you do it — I mean, here this is not the right way to get at this last-layer Hessian, because I've constructed the entire Hessian, which is the expensive bit, and now I'm just extracting the last layer. That's easy to do, but it's not cheap. So in your homework you'll actually have to compute that object directly yourself, and then invert it. Another reason to do this is that it's a bit closer to what we've done in Gaussian process regression so far, right? Think of the example I had up here before, with the parametric models. You could think of a situation where someone has simply decided which features to use: you fix them, you've learned the representation, you claim you're not going to change it anymore, and then you would just like to know how uncertain you are about the weights on those features. That corresponds to taking the architecture of the neural network, keeping it fixed, and treating the last layer as the set of weights that we're actually uncertain about. When you do that, you pay a bit of a price. So I've done this here. Ah, so this is a plot of the inverse of the Hessian — this is the Hessian, this is the inverse of the Hessian. You see that it's a bit denser, and you also see that we have high uncertainty over the lower layers — red now means high uncertainty — and actually low uncertainty over the last layer. But we can also cut out the last layer and make a plot; you can download this afterwards. First, though: these are the eigenvalues of the Hessian again. We looked at this last time already and realized that there were some large eigenvalues, and also some negative eigenvalues. We've actually checked the literature since then — my postdoc Frank Schneider pointed out to me that this is apparently a very common feature. The papers that have studied eigenvalue decompositions, even of non-trivial, large deep networks, point out that there are negative eigenvalues in those Hessians all the time, and that those negative eigenvalues tend to be suppressed compared to the positive ones by a few orders of magnitude, like here. This is the picture I drew last time: you're optimizing a loss function that has a global, elongated structure, and inside of that structure there are little funnels, little spikes that you can drop into, and those spikes create these negative eigenvalues. What I've now done here is a plot of the last-layer Hessian again — this last bit extracted, a cut-out of the plot from above. If you take this bottom-right corner and plot it, it looks like this, and the inverse of this matrix is given by this matrix, literally just the exact inverse computed with numerical linear algebra. And this is not the same as taking the entire matrix, inverting it, and then taking out the sub-matrix that corresponds to that part. Why? Because a block of an inverse is not the same as the inverse of a block of a matrix.
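A tiny numerical check of that statement, with an arbitrary symmetric positive-definite matrix that has nothing to do with the lecture's Hessian:

```python
import jax.numpy as jnp

A = jnp.array([[4., 1., 0., 1.],
               [1., 3., 1., 0.],
               [0., 1., 2., 1.],
               [1., 0., 1., 3.]])

block_of_inverse = jnp.linalg.inv(A)[2:, 2:]   # invert, then cut out the block
inverse_of_block = jnp.linalg.inv(A[2:, 2:])   # cut out the block, then invert

# These differ in general (the first is the inverse of a Schur complement).
print(jnp.allclose(block_of_inverse, inverse_of_block))   # False
```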
But you can do that nevertheless, just ignore this problem and say, you know, there's a lot of structure here anyway and maybe we can use it — and then you'll see what that does in your homework exercise. And now it's certainly time for a break. After the break, we'll talk a little bit about how people do this in practice when they have larger networks. Let's meet again at 16:11.

So, I realize that this presentation is a little bit me waving around a piece of code that runs on one particular toy problem, and maybe it helps to say that, obviously, if you're one of the three people who wrote in the feedback or in the evaluation that you're worried about the exam: you don't have to worry about this code we're talking about today. Maybe take it as an opportunity to think with me and the others about what these objects actually are and whether they are useful for us or not. And I realize I'm talking about multiple things at once, so let me try to focus the presentation on two aspects. Let's first not talk about the computational complexity — I'll shift that to the end of this lecture — and instead focus on what these samples actually mean. So in this code, I'll go down a bit more. Maybe just to remind you what we're doing: we're constructing this Laplace tangent kernel, and this tangent kernel requires these Jacobians. We built those Jacobians in the code — this is the bit that constructs the Jacobian and then rearranges it into a matrix so that we can do linear algebra with it. That's tedious, but it's really just bookkeeping to rearrange the entries of a structured list into a matrix. Now we have this object, and then we can build this kernel. There's the bug again — let me just run this code so that we've got meaningful plots. Here we go. Where was I? Here. This code, even though it might look confusing at first, should be reminiscent of how we defined kernels for Gaussian processes. It's this bit of JAX code with some functools.partial and vectorization decoration on top, because we're going to hand it to our Gaussian process library, which will instantiate matrices, and then we just write what we mean: we want to construct this object, Jacobian from the left, times the eigenvectors of the Hessian, times the inverted eigenvalues, times the eigenvectors transposed, times the Jacobian transposed. And then we can hand this object to our Gaussian process library, finally write a clean Jupyter cell, say, okay, let's construct a Gaussian process, and then we can use this thing as if it were actually one of those Gaussian processes with the analytic kernels that we've used before. We really can — and it's not just "as if", it actually is one, yes. So the question is: can we build a mixture of analytic kernels and these tangent kernels? Can we decompose the tangent kernels so that we can think of interpretations in terms of analytic kernels? Hmm — that's in some sense a question that doesn't quite have an answer. Remember that analytic kernels have infinitely many degrees of freedom, right? They have infinitely many eigenfunctions. Whereas this is clearly a parametric object: it has as many degrees of freedom as there are weights in the network. So they are not going to be the same in any direct way.
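Here is a minimal sketch of what such a kernel function amounts to for a univariate output, assuming a prediction function `f_flat(theta, x)` on the flat weight vector, the trained weights `theta_star`, and the Hessian eigendecomposition `eigvals`, `eigvecs`; these names are mine, and the notebook's code additionally handles the vectorization needed by the GP library.

```python
import jax
import jax.numpy as jnp

def jacobian_at(x):
    # d f(x; theta) / d theta at the trained weights; shape (D,) for a scalar output.
    return jax.grad(f_flat, argnums=0)(theta_star, x)

def laplace_tangent_kernel(x1, x2):
    # k(x1, x2) = J(x1) V E^{-1} V^T J(x2)^T
    j1, j2 = jacobian_at(x1), jacobian_at(x2)
    return (j1 @ eigvecs) @ ((eigvecs.T @ j2) / eigvals)
```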
But this thing, in a way, is a way of parameterizing a finite-dimensional space of kernels with a lot of parameters — in this case a few hundred, or in a big network a few billion. That's a lot more parameters than we have for our analytic kernels, where we tend to have a length scale, an output scale and so on. So in some sense it's also more flexible, and that's why people like deep learning. Your question might be: can we relate the eigenvectors V, multiplied onto the Jacobian, somehow to the eigenfunctions of these analytic kernels? And actually, I'll do this below, to look at what these objects actually look like in output space; we can stare at that for a moment, and maybe that'll help understand. Yeah, let me go down to this part of the code and then we'll see whether we can make this work or not. So what we can now do is produce samples. I've done this here just now, and there's a lot of plotting code, and then we get this output. So what I've plotted — I've added some titles now — is this: this is the mean function of our Gaussian process, also known as your trained deep neural network. So this is what we might call the logits, the output-layer values of f. These are real-valued, as you can see in the color bar, from minus 30 to plus 30. And, sorry for the colorblind people, they are red in the top left and green in the bottom right, which matches the colors of the data. So it looks like this network has learned something meaningful. In the middle, in gray, we have the value of this kernel evaluated at the individual grid points — that's like the diagonal of a kernel Gram matrix. Taking the square root of that gives something like a standard deviation, an error bar, for the plot on the left. And you can see that it has some structure, with these straight lines in it. These straight lines are inherited from the ReLU features of the deep network: it has learned these ReLU nonlinearities, and ReLUs are piecewise linear things that just lie on top of each other. So there they are. And now we can use those two together to construct an approximate expected value of the output of the network, the sigmoid of f. So we are interested in sigma of f, the softmax or logistic of f. But f now is a random variable, so in particular we might be interested in the expected value of this thing under the probability distribution on f. By the laws of probability, that is an integral of sigma of f against p of f, df. And p of f, for us, is a Gaussian at each location, with a mean and a variance. This is not a closed-form integral — there is no analytic answer to it — but there are good approximations. For example, one possible approximation is the sigmoid of the mean divided by the square root of one plus pi over eight times the variance, where the variance is J Psi J transpose. And yes, why is there a pi and an eight in there? We did this at some point as a homework exercise. It has something to do with a local Gaussian approximation, and because there's a Gaussian, there's a pi floating around somewhere — it doesn't really matter. It's just a simple way of approximating this uncertainty. The important thing is: in regions where the uncertainty is high, where this variance term is large, the argument will be approximately zero, and the sigmoid of zero is one half. So that's maximal uncertainty about the label.
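As a minimal sketch, assuming `mean` and `var` hold the GP mean and marginal variance of the logit f at some input (hypothetical names):

```python
import jax
import jax.numpy as jnp

def approx_predictive_probability(mean, var):
    # E[sigmoid(f)] ≈ sigmoid( mean / sqrt(1 + (pi/8) * var) ),
    # where var = J Psi J^T is the marginal variance of the logit.
    return jax.nn.sigmoid(mean / jnp.sqrt(1.0 + (jnp.pi / 8.0) * var))
```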
And in regions where the variance is approximately zero and the mean is very large — or more generally, where the mean is much larger than the square root of the variance — the argument will be something large in absolute terms, plus or minus large, and then this thing will be approximately zero or one. So either 100% confident in one label or 100% confident in the other. So I made a plot of this — that's this thing here — and you can see that it mediates between the uncertainty and the mean prediction, and we get a more structured picture, with low confidence here and high confidence in the corners. But there's also some structure to the confidence. And we can draw samples — they look like this. Now, someone wrote in the feedback last time: these samples look kind of bad; why do they look so bad? And this is actually important, so let's talk about it for a moment. They look bad, okay. First of all, they look bad because I'm using a small plotting grid — they look bumpy because I'm plotting on a 50-by-50 grid. Why? Because making these samples involves building a covariance matrix for the grid, then doing its singular value decomposition, and then multiplying with standard Gaussian random variables. That's why it's sometimes good to dive deep into the stack, right, to understand what you're doing. The cost of making one of these plots is cubic in the size of the grid. The grid here is 50 by 50, that's 2,500 points, and that's just about fast enough that I don't have to wait for it when I make these plots and show them to you in the lecture. So that's why there's a bit of blockiness here. But there's another way in which they look bad: they have these straight lines in them. Why? Well, because these samples emerge from a Gaussian distribution on the weights of ReLU nonlinearities. These ReLU features just look piecewise linear — that's just what they are, right? They are these things, straight lines lying around in space. That's what ReLUs are. So if you don't like that structure, you don't like your neural network. And that's maybe useful to understand: it's not always about the Bayesian part and the uncertainty; it's sometimes also about the architecture of the network. Yes? Ah, good question: how does this structure persist through the linearization and the Laplace approximation? So this is why I think Laplace approximations are really interesting. Yes, they are linearizations, and the Laplace approximation involves a lot of linear algebra, but they are linearizations of function-valued objects. So remember — where's my pointer? I just have to wake it up first; ah, now it's a mouse, okay — we are approximating the distribution on the weights with a Gaussian centered on the trained weights, and we're approximating the output function of the network as a linear function of the weights, centered on the trained network. So the mean prediction already is the trained network, and that keeps all of the structure of the network. And the linearization involves a Jacobian, which also retains all of the structure of the network. So even though it's linearization all the way down, it actually contains the interesting information, in a sense.
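To see where the cubic cost comes from, here is a minimal sketch of drawing such samples, assuming a kernel function `k(x1, x2)`, a `(G, 2)` array `grid` of plot locations and the mean `mean_on_grid` (all hypothetical names); I use a Cholesky factorization with a small jitter instead of the library's singular value decomposition.

```python
import jax
import jax.numpy as jnp

# (G, G) kernel Gram matrix on the plotting grid -- already O(G^2) kernel calls.
K = jax.vmap(lambda a: jax.vmap(lambda b: k(a, b))(grid))(grid)

# Factorizing this matrix is the O(G^3) step; the jitter guards against the
# finite-rank kernel being numerically singular.
L = jnp.linalg.cholesky(K + 1e-6 * jnp.eye(K.shape[0]))

u = jax.random.normal(jax.random.PRNGKey(0), (K.shape[0], 3))   # 3 draws
samples = mean_on_grid[:, None] + L @ u                          # (G, 3)
```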
And that's important, because we're not just doing this because we want, at all costs, die-hard, to construct some form of uncertainty no matter how bad it is. We actually want to learn something about the structure of this deep network, in a way that is just about tractable — and this is just about tractable, and it keeps this structure. So maybe another question, which we'll come back to on Thursday: why are there still these piecewise linear things? I thought deep networks were so cool because they are hierarchical layers of nonlinearity — shouldn't they be much richer? Well, they are, but think about what happens: ReLUs are piecewise linear functions, right? If you build them up in a hierarchical fashion in a multi-layer perceptron, then the higher layers are piecewise linear functions of piecewise linear functions. And you might be able to convince yourself that a piecewise linear function of a piecewise linear function is still a piecewise linear function — it just has more pieces, but it's still piecewise linear. So what we see here are these polyhedra of the input domain: the input domain is split up into polyhedra, and in each of these regions the function is linear, and then it switches its linearity from one region to the next. And it stays continuous, by the way, because ReLUs are also continuous. So that's the second way in which these samples are bad: the first one was that they look blocky, which is the plotting grid; the second one is that they look piecewise linear, which is the architecture of the network — that's just what it is. If you use a ReLU network, that's what you get. And then there's a final way in which they are odd, and that's how the structure of the samples relates to the data. You would sort of expect that outside of the data regime there's a lot of uncertainty flopping around, and inside they are very constrained and never move away from the training data. And this is actually, I think, where the probabilistic functionality becomes really interesting. Because what we see here is the geometry of the loss function: we see that some directions are still only very vaguely constrained, despite the fact that there is data somehow lying in there. And this has something to do with the very intricate, complicated structure of this deep neural network — that changing one of the lower weights has a really complicated nonlinear effect further up the network, nonlinear in the sense that it shifts those pieces of linearity around in a nonlinear fashion. And this is why I have this thing from lecture seven up here again. It's really important to remind you of structure that we saw a long time ago, which seemed quite natural then, and which we now have to translate mentally into the deep learning world. So briefly move your mind back to lecture seven, where we spoke about simple nonlinear regression — general linear regression, i.e. regression with nonlinear features and linear functions of the weights — on this very simple one-dimensional nonlinear dataset, and we made plots like this. You remember, I showed you lots and lots of different features, right? I said, oh, you could use polynomial features, or we could use these ReLU features and then things will look like this, and various others, right?
And back then you looked at this plot and probably rightly said: well, that looks really ugly, I would never use anything like this for this dataset, right? In particular, the uncertainty looks really ugly — the error bars don't fit the data at all. Why is it this way? Back then, that seemed like something to fix. We could look at it and say: I'm not satisfied with this, I want a solution that works much better. For example, I might want to use many more features, and then things might look a bit better, right? Or I choose a different scale for the output so that the features just about fit. If you have that picture in mind and now go back to these plots, you realize that this is the exact same problem. It's just that because we've committed ourselves to this complicated language of deep learning, we've sort of accepted that it's so complicated that we can't possibly do the thing we did in lecture seven. In particular, we could still say: well, maybe I don't like these nonlinearities; maybe I want to use something else — something smooth, something local, something with more structure. And I think people just tend not to do that in deep learning. Why? Because they tend not to look at plots like this. Of course, it's also much harder to do this in multiple dimensions — this here is a two-dimensional problem, and I realize it's harder to do with images, for example. You will see this in your exercise this week with MNIST: even with MNIST, which has a 784-dimensional input space, it's already much harder. And another thing you might want to change — going back to this picture, the two things we could change are: you could use different features, but you can also change the scale of the output. And that corresponds, to some degree, although not perfectly, in this setting, to changing the scale of the prior. The prior, remember, is this regularization term in the empirical risk. Last Thursday I rushed past this; I said, eh, you know, we need to put some regularizer in, and so I chose it here in passing and set it to five. I just said that we're going to use a loss function given by this empirical risk, the sum over i of l(y_i, f(x_i; theta)), plus this regularizer r of theta, which we decided to be some constant times the squared norm of theta. That constant was zero during training, and now, for this plot, I've set it to five — just because the plot looks nice with five. So you could now ask: well, how should I actually choose this? Maybe on Thursday we can talk about how to set it precisely, in a more mathematical sense. But realize that doing this amounts to exactly the step we did in lectures six and seven of moving this scale parameter around — that's exactly what we're doing; this constant is the scale of the prior. So there's a reason why we did those toy examples in lectures six and seven, even though they seemed really constructed, one-dimensional, whatever: they give a clean interpretation of what we're doing in deep learning, where nobody understands what happens inside the deep network. On Thursday I again went too fast with this. To make it clear that this precision is important, I actually wanted to do a sweep through the precision, from, say, ten to the minus two up to ten, roughly.
Another nice aspect of adding this term, by the way, is that we're adding a scalar matrix to the Hessian, and what that does is raise its eigenvalues. In particular, remember that we have this Hessian with negative eigenvalues. If we add five to the eigenvalues, we move all of them up — afterwards they're all around five or larger, so none of them is negative anymore. And that's quite convenient. But clearly we don't need to add five. What's the largest negative eigenvalue in magnitude? It's less than ten to the minus two. So we can add anything larger than 0.01 and they'll all be positive. Again, what does this mean intuitively? It means that we have this loss function in weight space with a big, global, elongated elliptical shape whose Hessian eigenvalues are positive, and we're currently here — maybe that's our theta star. So we have this elliptical global structure, and then inside there's actually some weird internal structure: there is some hole in this loss function, it is non-convex, right? We come in with this bowl shape, and then there's a hole here somewhere that we haven't reached yet. That's why we have negative eigenvalues here: in this direction the curvature is negative. And by adding this bit to the diagonal, we're smoothing those out — we're raising the curvature globally, so all of this tiny local structure just gets moved away, basically ignored. So that's what I wanted to do this week, and now we move a bit deeper in our thinking stack: how do we do this efficiently? Well, we could just change this number in this cell twenty times and run the code twenty times. That would be a little slow, but it would work. But another thing we could do is to look at how we actually draw these samples. We build this big covariance matrix on the grid, and then internally our Gaussian process library computes the singular value decomposition of this big matrix — and that's maybe a bit wasteful, because we already have an eigenvalue decomposition of the thing inside our Laplace tangent kernel, the Hessian. So instead of having our Gaussian process library naively, blindly, without knowing what it's doing, compute all these singular value decompositions, we can notice that the matrix it's trying to decompose has the form: J of the grid, times the eigenvectors, times (eigenvalues plus eta) inverse, times V transpose, times J of the grid transposed. So why not precompute the product J times V — those are the big objects, and they show up twice here, once with the transpose — and then only change eta inside the inverse. So what we're doing when we're sampling is computing J of the grid times V, times (E plus eta) to the minus one — with a square root, hopefully — times some unit Gaussian vectors. We can precompute J times V, keep it fixed, and then only sweep through eta and do matrix-vector multiplications; that's much faster. So I do this here in this code. I precompute J at the grid, the Jacobian on the grid. Then I precompute the Jacobian times the eigenvectors of the Hessian. I also precompute the mean prediction, because I have to add it, and then I precompute some standard-normal random variables — three of them.
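A minimal sketch of that precomputation trick, assuming `J_grid` is the `(G, D)` Jacobian on the plotting grid, `eigvals`/`eigvecs` the Hessian eigendecomposition, and `mean_on_grid` the trained network's output on the grid (hypothetical names):

```python
import jax
import jax.numpy as jnp

JV = J_grid @ eigvecs                                             # precompute once, (G, D)
u = jax.random.normal(jax.random.PRNGKey(0), (JV.shape[1], 3))    # fixed standard-normal draws

for precision in [0.1, 1.0, 2.0, 5.0]:
    # Samples have covariance JV diag(1 / (e + eta)) JV^T on the grid.
    scale = 1.0 / jnp.sqrt(eigvals + precision)                   # (E + eta)^{-1/2}
    samples = mean_on_grid[:, None] + JV @ (scale[:, None] * u)   # (G, 3)
    # ... plot `samples` for this precision ...
```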
And then, for all the values that I want to plot — precisions of ten to the minus one, one, two and five — I compute the square root of one over (E plus precision), times u, multiply it from the right onto J of the grid times V, add the mean, and make a plot. And now there are a lot of plots. For low precision we get these samples, which look pretty ugly. What we see here is pretty much the raw structure of the Hessian of the loss directly: there are clearly a lot of degrees of freedom that are totally unconstrained. And as the precision rises, we get more structured samples that better fit the data, in some sense — we constrain the scale better. And that corresponds to — remember lecture seven — moving the prior scale around. Now, of course, ideally you would like to do this not by hand, by twiddling a knob, but in a way that is mathematical and automatic, so that we don't have to do it ourselves. And we can talk about that on Thursday. Instead, I want to spend a little bit of time thinking about — oh yeah, two more things, of course. One is the question of how this uncertainty on the weights actually relates to uncertainty in x space: what is the structure of this tangent kernel? And the other is the elephant in the room, that computing this thing for real networks is obviously very expensive. There, I'll just briefly show you a slide to point out that it can be done. So a final thing we're going to do is to literally make a plot of these vectors. This object, which gives the predictive samples if you add the mean — what is it? It's random Gaussian numbers, typically between plus and minus three — that's what typical standard Gaussian random numbers look like — times a scale for each eigenvector, times one of the eigenvectors multiplied onto the Jacobian. So we can think of these as features which have been made orthogonal in weight space, which we can then draw independently. And we can plot them. So in this piece of code I literally just plot the five most dominant eigenvectors projected onto the Jacobian. The first five — and why is there a minus there? Because jax.numpy sorts the eigenvalues from smallest to largest, so we have to go from the back to take the largest ones first. That's this. And JXVE is literally supposed to be this: J of X times V times E. I also plot the smallest ones below. So here we go: the second plot is from the other end, the ones that contribute least. What you see in the top row are the dominant ones — the first few, if you like, principal components of this kernel — and the bottom row are the smallest ones. They're all scaled by their eigenvalues, so you can see in the color bar that the top row contributes a lot and the bottom row contributes less in terms of size. And they also seem to have some kind of structure. It's not the typical structure you would expect from an eigendecomposition on a simple Euclidean space — not eigenfunctions like sine waves and so on, which is what they would be for the analytic kernels that we've used so far. Instead they give some kind of structure which, in explainable machine learning, has sometimes been used to visualize flexible degrees of freedom in the network.
So when the network takes a decision at some particular input, one way to think about how it takes this decision is to say: it looks at those top five contributions in order, evaluates them, sums them up and then decides how large they are. Oh, actually — these are just the ones for the uncertainty; if you want a prediction, you have to add the mean, and then it looks a bit different. We could also be worried about this little thing down here that we've smoothed out, the negative curvature. So we can make a plot, similar to the one above, where I take the sum over all the directions that have positive eigenvalues, and over all the ones that have negative eigenvalues. And that plot looks like this. On the left are all the degrees of freedom that have positive curvature, summed together and scaled by their eigenvalues — that's the entire positive-curvature space — and on the right is the entire negative-curvature space. And you can see that the uncertainty is, in both cases, very low near the data — you can see this right where the data is — which means this model is quite constrained within the data, and it has a lot of degrees of freedom outside of the data. And the left plot is stronger than the right plot, so there's mostly positive curvature, and the remaining negative-curvature degrees of freedom are some kind of global structure that shifts everything around in a few major directions. And this is really where the negative curvature is just awkward to think about, because depending on which interpretation of the Hessian you want to use, it leads to different ways of thinking about it. If you talk about this from an optimization perspective — if you think about what we would need to change in this network to make it perform better, to make the loss smaller — then the negative eigenvalues are actually really important, because those are the directions in which we can resolve the saddle-point behavior. And then this plot is actually important: it's this kind of global shift, in a way, that we're not doing yet. If you think of it from an uncertainty perspective, then there are different interpretations of uncertainty, which are probably too complicated for the last three minutes — so I'll talk about this on Thursday. The final thing I want to mention is to address the first elephant in the room: what do we actually do if we have a real network to work with? What I've shown you so far is a mathematical construction to get approximate uncertainty from a real network. What if you are out there working on a model that doesn't have 128 or 500 or 2,000 or 8,000 weights, but 125 billion weights? Then, of course, you cannot construct the entire Hessian and invert it — that's just not going to happen. So the people out in the machine learning industry, as it now emerges, who actually want to use such models — and there are people, even at the large companies whose names you all know, who do this kind of work — tend to use some kind of further approximation for the Hessian. And if you just spend five minutes on it yourself, you can probably come up with two or three simple approximations on your own, without worrying about whether they work well or not — and they are probably on this list. And then there are a few more that people have also thought about.
And I have to show you this slide, because one of them is your homework this week for the theory question, so I want to briefly mention it. The first ones are very natural. First of all, notice that we have to go through this object, this sum over n, to compute the Hessian. So the complexity of building this Hessian is linear in N, the size of the dataset, and quadratic in the size of the weight space to build the matrix. And then, once we have this D-by-D matrix, inverting it would be, in the worst case, cubic in the size of the weight space — but that has nothing to do with the data anymore, because we've already computed the matrix. So, to approximate this computation, the first thing you could do is of course not do the full sum, but only sum over a part of the dataset. And of course people do that — it amounts to some kind of batching. That reduces the complexity in N. The second thing you could do is look at this matrix — in your head, not in code — and only use parts of it. For example, you could decide to use only the diagonal of the Hessian. That throws away all this beautiful structure around it that we looked at today, but it's quite cheap, and it's a local computation, because you only need local second-order derivatives. So it's O(D), just as expensive as computing a gradient. And if you think about how you do this in code, you clearly need to change your code a bit to be able to do this with automatic differentiation. It turns out it's possible; there are even packages that do it for you. One of them is called BackPACK, for PyTorch, and it happens to come from my group — so this is a shameless plug. Another thing you could do is to just look at the last layer, and that's what you're going to do in your coding homework this week. This approach has some nice aspects to it: conceptually, it's a bit like ignoring the fact that you've trained a deep network and treating it like a learned parametric regression model. That's maybe pleasing. It also throws away all the degrees of freedom in the rest of the Hessian, and we saw today that that actually removes a lot of structure. So maybe it's good, maybe it's bad. The practical takeaway is that so far it seems to work quite reasonably for many problems; it produces kind of neat plots, if you like, and it works even on realistically sized problems. A more advanced thing that people do — a lot of people in the deep network training community in particular are quite excited about this — are so-called Kronecker-factored approximations of the curvature, K-FAC. They involve somewhat complicated expressions that I'm not going to write out now, because it would take too long, but they basically amount to computing only partial derivatives of particular parts of the matrix with respect to particular parts of the weight space, such that what emerges is a matrix that is easy to invert in closed form. And those are used in particular during training time, to precondition optimizers so that they behave better, and to tune hyperparameters.
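To make the shape of these options concrete (a simplified summary in my own notation; $A_\ell$ and $G_\ell$ denote per-layer second moments of the layer inputs and of the backpropagated output gradients, respectively):

$$ H \;\approx\; \operatorname{diag}(H); \qquad H \;\approx\; H_{\text{last layer}}\ \text{(keep only the last-layer block)}; \qquad H_{\ell} \;\approx\; A_\ell \otimes G_\ell \ \text{(K-FAC, per layer)}. $$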
And then, finally, there's a really cool trick, which you're going to use in your homework exercise and which was already studied by several people in the early 2000s. If you remember that your loss function at the end is typically one out of a very small collection — it's L2 or it's cross-entropy, and that's pretty much it, plus maybe a few exotic ones — then you can compute the Hessian of the whole network by decomposing it into the loss and the network. So there's the loss and there's f; you first take the partial derivative with respect to f and then with respect to the weights, and you discover that there are two terms in your expression. One is really expensive, and one only involves Jacobians of the network. You'll get to do this in your homework. And then you can make an argument that the expensive part involves the actual gradient of the loss, and if you've optimized your network well, that gradient is probably approximately zero — because otherwise you would still be training. So you can actually drop that part, because it's approximately zero. And then you end up with a decomposition of the form scalar matrix plus low rank, where the rank of this object is the size of your training set. And that's a very interesting structure, because then you can apply the matrix inversion lemma and get something fast. This is called the generalized Gauss-Newton approximation. Sometimes it's also phrased in a Levenberg-Marquardt sense; if you've heard about these more advanced optimization routines in an optimization lecture, it might make some sense to you. Okay, I will stop here and come back to this on Thursday, where I'll do the rest of the slide deck. So the main point today that I was trying to get across is that Laplace approximations create — well, actually that's the point for Thursday. What I wanted to get across — and here's the QR code for the feedback — is that this geometric way of looking at neural networks, to construct probability measures, gives us a way to inspect the geometry of the loss landscape. We have to think about how we do this in code, because it's a bit tedious, and if you do it wrong, it can be very expensive — which is why, annoyingly, we had to look at a lot of code. But what we then get is a lot of structured views onto how these networks actually behave, what degrees of freedom they have left after training, how uncertain they are. And most of those features tend to be things we are worried about rather than happy with. You saw these plots and thought: these samples don't look convincing, this looks really wrong. And the reason for that is not that probability theory is the wrong way to look at it, and it's also not that Laplace approximations are the wrong linearization to look at. It's that the architectures we work with tend to have flaws, and when we look at them through this lens, we get to see those flaws better than if we only look at the trained neural network. So the stuff that was easy to fix for parametric regression in lecture seven now poses a real problem. And this is really where the field is at the moment. Maybe in ten years' time we'll have fixed all of this and we'll have much better architectures that don't behave in this weird way. But the contemporary neural networks people use, including the largest models, actually have all these flaws — and now you've seen them.
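In my own notation, a compact way to write the decomposition just described: with per-example outputs $f_n = f(x_n;\theta)$, Jacobians $J_n = \partial f_n / \partial \theta$, and a prior precision $\eta$ from the regularizer, the chain rule gives

$$ \nabla^2_\theta L(\theta) \;=\; \sum_{n} J_n^\top\, \frac{\partial^2 \ell(y_n, f_n)}{\partial f_n^2}\, J_n \;+\; \sum_{n} \frac{\partial \ell(y_n, f_n)}{\partial f_n}\, \nabla^2_\theta f_n \;+\; \eta\, I . $$

Dropping the second sum — its weights, the per-example loss gradients, are approximately zero at a well-trained minimum, and it contains the expensive second derivatives of the network — leaves the generalized Gauss-Newton matrix: a scalar matrix plus a rank-$N$ term (for univariate outputs), whose inverse is cheap via the matrix inversion lemma,

$$ \big(\eta I + J^\top \Lambda J\big)^{-1} \;=\; \eta^{-1} I \;-\; \eta^{-2}\, J^\top \big(\Lambda^{-1} + \eta^{-1} J J^\top\big)^{-1} J, $$

with $J$ the $N \times D$ stacked Jacobian and $\Lambda$ the diagonal matrix of per-example loss curvatures.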
So we'll talk about what to do with them and how to fix some of them on Thursday. Thank you very much.