Today we're going to continue our dive into the probabilistic interpretation of deep learning. This is sort of part two of the lecture that we did on Monday. As always, quick feedback. You really liked last Monday's lecture; that was, I think, the best feedback so far. Pretty much everyone thought the speed was good, which is a new thing, and most of you really liked the quality as well. And 17 of you actually responded, which is maybe the best sign. We're working on the quality, and on the difficulty of the exercises it seems like we're moving in the right direction. Detailed feedback: quite a few of you mentioned positive things, and there was actually very little negative to say. Two people asked whether I could upload the lecture recordings. That was also a question on the forum and on YouTube, so I was inundated with requests to upload the videos. I did; I uploaded them yesterday. And now, of course, the next thing that happens is that people on YouTube ask me to upload the slides. It will take a bit of time to do that, but I will even make this public at some point. Uploading stuff on YouTube is just an unbelievably slow process because the user interface of YouTube is absolute garbage. So, the code part was very fast. Today I'm basically just going to talk about code; I only have like five slides, which are very easy to go through. Let's see if that works. Someone asked: how do I know whether a problem is good for GPs or not? I will probably talk about this more, but maybe the goal of today is exactly to get the two connected to each other. Your thought process should maybe not be "should I use a GP or a deep network", but rather "how can I build a solution that combines the best of both of these worlds". And the final question, which is actually a good one: why do you prefer JAX over PyTorch or TensorFlow? And the answer is, I don't know. I'll let you in on a little secret: a lot of people my age who teach don't actually write programs anymore. Maybe you've noticed that some of my colleagues don't actually show you code. And that's not necessarily a bad thing, because as you get older it's just hard to keep up, and it will happen to you as well. Over the course of my life as a researcher, I wrote code very early on in MATLAB. Then during my PhD I wrote code in Python, in Java, and in a weird language called F# that is a dialect of ML or OCaml, which I got to work on because I was at Microsoft. And then when I returned to post-doc life, someone annoyingly got me hooked on MATLAB again for a few years, so I got really stuck with that. It seems really weird, of course, from today's perspective that people in the machine learning community liked to work in MATLAB, but you have to keep in mind that during that time there was pretty much no infrastructure in Python: Matplotlib was still a new thing, it was not well documented, the interface was absolute garbage; NumPy was a new thing and wasn't good to use. So it wasn't that exciting to write in Python, and people liked MATLAB because it had all these nice interfaces that made things really easy. Of course, today is very different. At some point I went back to Python and got used to that again. And now we've been through this absolute roller coaster of deep learning toolboxes.
There used to be things like Caffe. Who still remembers Caffe? Has anyone ever written code in Caffe? There's like two people in the room. Then there was TensorFlow, then PyTorch, now there is JAX, and there are new things hanging off of that tree, like MXNet and so on. It's very difficult to be well-versed in all of them, and actually, if you ask people who commonly write machine learning code, if they are honest they probably only know one of these frameworks really well. So it's a complicated question to ask someone why they prefer one over the other, because they typically don't even know the other one. I've tried this with my PhD students as well. So they asked me, why do you do the course in JAX now? The honest answer is that when I was preparing the course, I decided to learn one of these frameworks, and I just got stuck with JAX because it seemed like the newest kid in town. So I can give you an idea of what I think the differences are, but to be honest they are probably not exactly precise, because I don't know so much about the other side. I've spoken with various people about this, and they seem to confirm it, but it's not really clear. I think a very high-level picture is that PyTorch is the more mature platform at the moment. It's maybe a bit more monolithic; there is one common way to do deep learning in PyTorch, which is both good and bad. JAX is more modular, it's still evolving, and there are a lot of different ways to do one thing. JAX is a bit more focused on the idea that you write your code the way you want to and it's supposed to just work. You've of course seen in the exercises so far that that isn't entirely true; you have to be really careful with manipulating arrays. But the emphasis is more on the modular structure of the code: you write individual functions that hopefully don't have side effects, and then you can do autodiff very directly through all of this, while in PyTorch there tend to be these modules. A downside is that in JAX there are, at the time of this lecture, probably at least five different packages that claim to be the JAX deep learning module: there's something called Stax and something called Flax and something called Optax, and various other ones, and they all have very different styles of defining a model. So I'm not using any of them; I'm writing plain JAX. And maybe what's most important for me to say is what the goal of this course in particular is. This is not the deep learning class, this is the probabilistic machine learning class, and the goal is to think about the structure of the models we're building. It's sometimes useful to write these small pieces of toy code, because then we can play with stuff and do computations directly, even though we do them in a way that you would probably not use if you were to build a really large model. I have to say that at the beginning of this lecture, because today is going to be all about that. So yes, we are currently working with absolute toy problems. Today I'm again going to show you a two-dimensional binary classification data set, two moons, with a really tiny deep neural network.
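To make the "pure functions plus autodiff" point concrete, here is a minimal sketch in plain JAX. The function and variable names are just illustrative and not taken from the lecture code:

```python
import jax
import jax.numpy as jnp

# A pure function of its inputs: no hidden state, no side effects.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

# Autodiff composes directly with such functions.
grad_loss = jax.grad(loss)             # gradient with respect to the first argument, w
w, x, y = jnp.ones(3), jnp.ones((5, 3)), jnp.zeros(5)
print(grad_loss(w, x, y))              # a length-3 gradient vector
```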
And the reason I do it with these simple models is that, again, this is the probabilistic machine learning class; it's supposed to be a relatively theoretical course where you understand structure in the models. And we're supposed to do live coding here, with a few minutes of training time, so I couldn't possibly do this with a model that takes three days to train, especially not on a laptop. There's no GPU, or at least not one I can really use at the moment in here, and you also want to see an answer within a few seconds. So this is not me trying to dumb things down; it's trying to make it feasible. Of course, in practice, if you really work on a large-scale problem, you will typically have a bunch of GPUs and you will wait a few days for them to train, but then it's useful to have first worked through the kind of simple problems we are discussing here, so that you have a feeling for what's actually happening. And then there's the deep learning class, where you can learn how to build these bigger models. With that, let's come back to Monday. I introduced my, again, theoretically motivated, abstract formalism for what deep learning is. I abstracted away all the complexities of different types of deep neural networks and said: what we are going to discuss is the setting in which we are trying to learn a function that maps from inputs to some outputs (we need to talk about the outputs in a moment) which are real valued, and which is parameterized by a bunch of parameters that we call weights and biases, for lack of a better name. And then we train these functions, so we find solutions that fit the data well, by minimizing a regularized empirical risk. This involves a data set of inputs and outputs, supervised machine learning, pairs (x_i, y_i), N of them. For each datum we evaluate an individual loss function, which typically involves transforming f to compute some loss against the training output, and potentially also a regularizer that only depends on the weights. On Monday we reminded ourselves, and made it very clear, that this setting is in all but the most exotic cases equivalent to maximizing a posterior probability density function: the typical choices people make for regularizers amount to negative log priors over the weight space, typically a Gaussian prior, where the regularizer r is a quadratic function, and this big term, this chunk in front, corresponds to the negative logarithm of a product over individual likelihood terms p(y | x, theta). The typical choices of loss functions people make, for example cross-entropy for classification, amount to a multinomial or binomial probability distribution for the labels given f, and the sum amounts to the assumption, which is actually quite questionable, that the data are independent of each other when conditioned on the value of the weights. That's just a common choice people make. And then we spoke about how to do this. So now I'm going to flip to code, and just to remind you a little bit how we did this, we ran this piece of code that I'm now going to run again, and we saw on the coding side that this amounts to typically writing down, is this too small by the way, can you all read this in the back? Should I increase the font size? No? Larger? Better? Better now? Okay, good.
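Written out in the notation of the slide, with ℓ the per-datum loss and r the regularizer, the connection just described is:

```latex
L(\theta) \;=\; \sum_{i=1}^{N} \ell\big(f(x_i,\theta),\, y_i\big) + r(\theta)
\;=\; -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) \;-\; \log p(\theta) \;+\; \text{const.},
\qquad\text{so}\qquad
\arg\min_\theta L(\theta) \;=\; \arg\max_\theta\, p(\theta \mid \text{data}).
```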
So this tends to look like this: you write down some kind of model, and then we spoke about this predict function, which is a way to predict stuff in the future, and we realized that it's maybe a bit awkward that we separate predict and empirical risk, because they are actually very closely related to each other. And I need to make sure I actually execute all these cells. Then we initialize these networks using a bunch of random numbers to create initial weights. Doing this right is actually a bit tricky for large-scale problems, but for our simple problem it's not really an issue; it's usually going to work unless we totally mess things up. So we choose a particular scale at which to initialize the weights. One question you could have, if you're following this code closely, is whether this shouldn't actually be tied somehow to the regularizer in the empirical risk: if the regularizer in the empirical risk is our prior for the weights, then maybe we should draw the initialization of the weights from that prior. But people don't tend to do this; it's just not the standard way to initialize neural networks, and it's an interesting question why they don't. I don't claim to know the answer. Now we initialize the network. You've seen this before. I make a data set, just so that we see it again. By the way, if you noticed, I've changed the architecture a little bit: there used to be 128, 64, 1 in here, which would still work for today, but it makes some of the computations we're going to do in a moment a little bit slow, and then we'd have to wait for too long, so I made it a bit smaller. Now we've initialized the net, and we see that it makes some initial prediction which structurally looks like it might work, but it's of course in no way fitted to the data, which it couldn't be, because we haven't trained yet. We do this now by defining the terms that make up the equation that was on the slide just now, the empirical risk and the regularizer; the sum of those two is the loss. And I actually switch off the regularizer in this case, which is also a common thing that people do, at least for small networks. Then we construct the algorithmic setup: we build something that you might call a data loader, a data stream, which in this case is trivial because it's just 128 data points. Then we initialize an optimizer, in this case SGD, stochastic gradient descent. You could use pretty much any of the other optimizers that are in this library, like Adam and AdamW and RMSprop, and they will all perform pretty much equally well on this problem because it's so small. And then we describe the update step that the optimizer will go through. You learn about these optimizers either in your math for machine learning class or in an optimization class; if you haven't yet, this term there will be an optimization course by Professor Hein, and he will go into all the depths you could possibly want to go into. Then we run this thing; it learns, and it again achieves 100% training accuracy. So we can make a plot, see that the loss goes down to zero, the accuracy goes up, and the prediction looks like this, which is both pleasing and a bit worrying. That's where we got to at the end of Monday. It is pleasing in the sense that, where the data is, this is clearly a meaningful output of the network: it predicts the green class where there is green data and the red class where there is red data.
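For orientation, the training setup just described looks roughly like the following sketch. I am assuming Optax as the optimizer library and reusing the loss and params names from the surrounding discussion; the actual lecture notebook may differ:

```python
import jax
import optax  # assumption: optax as the optimizer library

optimizer = optax.sgd(learning_rate=1e-1)   # could equally be optax.adam, optax.adamw, optax.rmsprop
opt_state = optimizer.init(params)          # params: the initialized network parameters

@jax.jit
def update_step(params, opt_state, x_batch, y_batch):
    # one SGD step on the (unregularized) loss described above
    grads = jax.grad(loss)(params, x_batch, y_batch)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```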
But beyond the data set, when you start thinking about it, maybe this is not entirely pleasing, because this network is very, very confident about these classes outside of the data region. And if you've ever heard about robustness and adversarial robustness, then you might be worried: if you now have a test point that lies up here, maybe this network shouldn't be so confident in the class label, because who knows what the right answer should be up there. Is this a problem for all discriminative classifiers or only for this one? It's not generally a problem for all discriminative classifiers; it's a problem with this model class. We're going to talk more about this next Monday, when I'll list individual pathologies of deep learning, but the short answer for now is: this is a ReLU network. If you go back to the architecture that we've defined (sorry that I hop around in the code; it would be nice if the model were as easy to hop around in) here I'm using ReLU as the activation function. Everyone knows what a ReLU function is? Lots of nodding. So these are functions that are zero and then become linear, and something that is zero and then becomes linear will have to look a little bit like this: we're moving these features around, and in one direction they have to go off towards plus or minus infinity. So they have this non-local structure; you can't locally put something down without adding some stuff in the far distance, and that's what we see here. We'll talk more about this on Monday. The way to quote-unquote solve this problem would be to use non-linearities that localize. In the early days of deep learning, one common non-linearity was the Gaussian one. You remember the squared exponential kernel and how we constructed it with these little egg- or bell-shaped features; people used to use those in deep learning as well. Those have the nice property that they go back to zero, so as you go far away from the data you'll just get zero back, the output sits at one half, and the model is very uncertain outside of the data. The downside of these features is that they are very difficult to use in high-dimensional spaces. Here is a two-dimensional space, and it's easy to tile a two-dimensional space with a bunch of little blobs. But if you want to tile a higher-dimensional space, let's say a 768-dimensional space, if you catch the meme, then you need to put a lot of these individual little ball-shaped Gaussian things into the space to cover it, exponentially many actually in the dimensionality of the problem. That's why people stopped using these radial basis function features, these Gaussian-shaped features, and instead used these, let's say, non-stationary ones like ReLU and tanh. Good, so this is where we were on Monday, and now what I want to do with you today is to talk about a framework to turn any such model (pretty much any; there'll be a tiny little constraint, but maybe you'll catch it as we go through) into a Gaussian process. Before I get to do that, there's a question. Yes. So the first half of your question is a nice lead-in to what we're going to do. The problem with this setup is that we're looking at a point prediction, a single function that this thing predicts. Now the question is: could we maybe hack some uncertainty in by adding a function that measures how far away we are from the data, another feature that somehow uses the training data and says how far we are from the training data? And annoyingly, the answer will have to be: kind of.
We will do a little bit of this on Monday, but it's a very simplistic way to address the problem, because then you have to describe what you mean by being far away from the data, and that is actually, fundamentally, the problem of generative modelling: if you want to generate more data, you have to say in which sense your data points are close to each other and what it means for a new data point to be far away from the training data. You can do this in a very ad hoc way, and we'll do it on Monday, so I'm not going to tell you how now, but it will be very crude in its description of the data set. I'm going to use it to fix some crude problems, like for example this very extreme lack of uncertainty far away from the data, but it won't fix more structured problems like the ones that lead to adversarial examples or robustness issues, for example. So, I said today we're going to turn any deep neural network into a Gaussian process, and the way I'm going to present this to you is thanks to many people over the years, most of all this guy on the right: Pierre-Simon, Marquis de Laplace. Well, actually that's not him; that's how Stable Diffusion imagines Frank Miller, the guy who wrote Sin City and 300, would draw a superhero version of Laplace. It's very energetic. So he came up with this Laplace approximation, which by now of course you know and hopefully love, because we've talked about it a lot. In deep learning this idea emerged pretty much early on, when deep learning wasn't even a thing yet, when it was just called neural networks, through works by, among others, David MacKay; Yann LeCun also had a paper around pretty much the same time with pretty much the same idea. And then, more recently, there are people who have worked in this direction, and I can just cite a few, but I'm absolutely missing some.
There is Emtiyaz Khan in Japan, who published about this in 2019. In my own group there are Agustinus Kristiadi and also Runa Eschenhagen, who have worked on many of these ways of making use of Laplace approximations, and I will tell you a little bit about this, unashamedly, on Monday. There are also people like Alex Immer and Erik Daxberger, PhD students in Switzerland and Cambridge, who did some of this work over the recent few years, and various others, including Hippolyt Ritter and David Barber. So this is an idea that is re-emerging: it was lost for a while, from 1998 probably for about 20 years, and now it's back, and that's why we're talking about it today. But it's not the only way to turn a deep neural network into a probabilistic model; there are various other ones, and maybe I'll have time at some point to highlight a few of them very briefly. This is the one that I like most, because it's the most direct one, the most structured one, in which we can look at what's going on, and which is the easiest to use. It's not the most precise one, but we'll see that maybe it's not so important to be very precise. My main goal is not to get perfect uncertainty, but to take this stupid point estimate, this single estimate of what the true function might be, and enrich it with some uncertainty, with something that tells us more about what's going on in the network. And that is of course going to be a Laplace approximation. So raise your hand if by now you have a rough idea of what a Laplace approximation is. If you haven't, then you haven't been paying attention. OK, so that means we can do it on one slide. Here is how you turn your deep network into a Gaussian process. You start by doing what we did on Monday, which isn't actually a coding step, it's just a mental step: you realize that this regularized empirical risk that you have been training your neural network with is actually a negative log posterior. So when we train the deep neural network, we are minimizing a negative log posterior, which is the same as maximizing a log posterior, or maximizing the posterior, because the logarithm is a monotonic transformation. You don't have to write any code to do this, but it's the most important step, because once you realize it, everything afterwards is just playing around with code. And we're really going to take this interpretation seriously, but that doesn't mean we're going to believe this is a perfect posterior. It's just what people do when they train deep networks: they maximize a posterior probability, and whether they believe it to be a probability or not is actually not so important; it's really what they do. Now we go to step 2, and we train the deep network as usual. You do, for example, what I just did in this toy problem; I just ran the code, and now we have this thing. That's the step we've done so far. So now we are at step 2, and next we're going to do two things that lead us to a Gaussian process. The first will lead us to a Gaussian distribution on the weights and biases of the network, the parameters, and the second will give us a Gaussian process distribution on the output of the function f, the deep network. So there are two things here: the weights and biases of the network, which are basically a collection of numbers (you could think of them as a big vector with structure), and the prediction of the function f at any input x, which is a function of inputs.
So these two things we need to connect with each other, and we're going to do that. We'll first do the bit about the weights, by doing the Laplace approximation. We've done this so often by now that I can just write one line: at the trained point, once we've found theta star, we do a Taylor expansion of the loss around theta star; the loss is in particular also a function of the parameters of the network. A Taylor expansion means there is a constant term, the loss at theta star; then there is a linear term, which would be theta minus theta star times the gradient of the loss, and if we are actually at a mode then the gradient is zero, so that term we can ignore. I'll wave my hands around here for now, and we might come back later today to whether it's actually zero or not. And then the second-order term is one half times a quadratic form in the second derivatives: in the multivariate case, that's the parameters minus the trained parameters, transposed, times the Hessian, times the parameters minus the trained parameters. Up here on the slide the Hessian is defined with a minus that shouldn't be there, I think that's a typo, so let's just say: psi is the Hessian, the matrix of second derivatives of the loss, that is of the negative log posterior, which is equal to the loss function, with respect to the parameters theta. And then we realize that if we approximate the loss in this way, the negative log posterior is a quadratic function, and therefore the posterior is approximately a Gaussian distribution: a Gaussian with a normalization constant that we can compute in closed form, but for the moment we ignore it, and then an exponential of minus one half times this quadratic form; to get the minus we have to put a minus in here, that's why the minus is here. And to interpret this as a Gaussian distribution, we interpret psi inverse as the covariance of this Gaussian. The inverse is going to be our big bane, because we'll need to actually construct it somehow, or some matrix decomposition of it; this is where linear algebra comes in. Now we have a Gaussian distribution on the weights, and let me see if I can actually just do that now. Yes, I can. And then there's a line here which you can ignore for the moment; we'll come back to it afterwards. So here is the operative part: it's one line. I'm going to construct the Hessian of the loss with respect to the weights and biases, which we call the parameters of the network (maybe I've got a big bug in my code, we'll just see), evaluated at the entire training data. What happens here: loss is a function of the parameters and the data, inputs and outputs, and we're evaluating its Hessian with respect to the first argument, the zeroth one, rather than with respect to the data. And that's it; you just saw it happen. Actually, I'll comment out this other line so you see how long this one alone runs; that was the cost of it. Now, this goes fast here, but for a really big deep learning architecture, for a large language model, of course we wouldn't even be able to do this, because what are we talking about? A matrix of size number-of-weights by number-of-weights; it's quadratic in the size of the weight space. So if your model has 125 billion parameters, then we're not going to be able to construct that matrix, not even with the largest computers and memory banks we have available at the moment. Here we can do it, and it allows us to really look into what we would like to do; then we can think separately about what you would do with a large model.
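The operative line just described looks essentially like this. This is a sketch with placeholder names (params_star, x_train, y_train) standing in for the lecture's variables:

```python
import jax

# Hessian of the scalar loss with respect to the parameters (argument 0),
# evaluated at the trained parameters on the full training set.
H = jax.hessian(loss, argnums=0)(params_star, x_train, y_train)
# Note: H has the same nested (layer, weight/bias) structure as params, twice over,
# which is exactly the reshaping headache discussed after the break.
```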
OK, so we'll come back to that object, but now we have it: we have this psi, which is going to be important for our distribution on the weights. And now here's the final part, to get from the distribution on weight space to a prediction on function space. What we do is take this deep neural network f, which is a function of two things, the input x and the parameters theta, and we linearize it in the weights, around the trained weights. This bit is a bit confusing, so I'll go slowly even though it's just one line. f is a function, and we will write it locally as: f of x and theta is roughly f of x and theta star, the trained weights, plus, well, what does the Taylor expansion look like? Keep in mind that f is potentially a multivariate function: if we're doing classification over c classes, then f has c outputs, one for each class. So there isn't really a gradient so much as a Jacobian: a rectangular matrix of derivatives of the i-th output of the function with respect to the j-th parameter. If we do c classes and we have d weights, then it's a matrix of size c by d, for each x. So this thing is itself a function of x, of the input; that's where functional programming comes in, and there are lots and lots of degrees of freedom. So if we do a Taylor expansion, we get the constant term plus this linear term, which is the Jacobian multiplied from the left onto the vector of parameters minus the trained parameters. First, some things to note. This looks like a very strong simplification: we've taken this complicated function f and we've turned it into a linear function in the weights. But it's actually more than what we had so far, because so far we only had f of x and theta star, and what is that?
That's the trained neural network. This little blue thing, which seems like a constant floating around, is your trained neural network, the whole thing. It's a function of x; this f is a function-valued object where you plug in an input and it produces a prediction. We've used that, and only that, so far: all of the plots you've seen so far are just this little blue thing. It might look like an afterthought, but it's actually your trained neural network; it's just a nice way of encapsulating all that complexity into four characters, or three, I don't know, five, six. And now we do something additional: we linearize it, so we add this Jacobian, which is another function of x and theta star. It's like we add a whole new object to our description of the problem. Actually, it's not an entirely new thing, because it's a derivative of f, so it's not a completely separate function, but of course a derivative is also not the same as the function; they are related but they are separate objects. And then you multiply by the weights. This is actually what I do in this second line: this here says I would like to construct the Jacobian, in reverse mode (ignore that for the moment, I'll say something about it next Thursday), of the predict function. The predict function is the thing that we talked about on Monday, which is a little bit like the empirical risk, which takes the... and again I need to make sure this actually works; maybe I've got a bug in my code, we'll see... with respect to the... ah, OK, so this doesn't really matter, because we're not going to use it anyway. OK, fine, doesn't matter; we're going to redefine that line a little bit later down in the code, it's just here for clarity, to have them both next to each other. We take the derivative with respect to, again, the parameters, and that should work. Good. Now we have this object, and now we can do a little bit of math again and think about what we actually have here. This way of writing f, with the blue and the green bit, the trained net and its Jacobian, is now actually a linear function in the weights, in the parameters; there's just a times-theta in there. And we've just constructed a Gaussian distribution on the parameters theta. We know from these laborious eight or so lectures on Gaussian processes that if you have a Gaussian distribution on a random variable, then any linear map of that Gaussian random variable is also a Gaussian random variable. And that's exactly what we have here: something that is a linear function of a Gaussian random variable. So that means that when we marginalize out the weights, when we integrate this probability distribution, or probability measure, or actually density function, against the weights, what we get is a Gaussian-distributed object, but that object is actually a function. Therefore the distribution is a Gaussian process rather than just a Gaussian distribution. And here our notation comes in handy: this object f, in terms of its input x, when we integrate out our belief over the weights, is now a Gaussian process. And Gaussian processes have two describing characteristics, two sets of sufficient statistics. What are they? A mean function and a covariance function. And the mean function now is, well, the expected value of this under this distribution; under this distribution the expected value of theta is theta star, so the linear term
is zero, and we are just left with the blue bit: the mean function of our Gaussian process is the trained network, f of x at theta star. And that's good, because it means we've just found the place to put our trained deep neural network into our construction of a probabilistic belief: it's in the mean. The prediction we're going to make is the one that the deep neural network would make. And then what's the covariance function? Well, this isn't fully on the slide because it would have been a bit full, but: if some random variable is Gaussian distributed (this chalk is too small), then any affine map of it is also Gaussian distributed, with the correspondingly mapped mean and covariance; that's a standard property of Gaussians. And here, where the mouse arrow is, we have exactly such an affine map: the Jacobian acting on theta. That means the covariance function of this Gaussian process, the kernel, is going to be an inner product around the inverse Hessian, with the Jacobian on the left and on the right side. I will call this the Laplace tangent kernel, as the covariance function. This isn't really an accepted name; actually, there is no name for this thing in the literature yet, because it's a pretty new idea still floating around, and the people who write papers on it all use different names for it, depending a bit on which set of authors writes the text.
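To collect the whole construction in one place, in the notation used so far (theta star the trained weights, Psi the Hessian of the loss at theta star, J the Jacobian of the network outputs with respect to the weights):

```latex
p(\theta \mid \mathcal{D}) \;\approx\; \mathcal{N}\!\big(\theta;\; \theta_\ast,\; \Psi^{-1}\big),
\qquad \Psi = \nabla^2_\theta L(\theta)\big|_{\theta_\ast},
\qquad
f(x,\theta) \;\approx\; f(x,\theta_\ast) + J_\ast(x)\,(\theta-\theta_\ast),
\quad J_\ast(x) = \partial_\theta f(x,\theta)\big|_{\theta_\ast}.
```

Marginalizing the Gaussian weights through this affine map gives the Gaussian process:

```latex
f(x) \;\sim\; \mathcal{GP}\big(m(x),\, k(x,x')\big),
\qquad m(x) = f(x,\theta_\ast),
\qquad k(x,x') = J_\ast(x)\, \Psi^{-1} J_\ast(x')^{\top}.
```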
OK, so that's it, that's what we're going to do, and I will do it in code. Let me briefly, to anticipate your questions, review again what we're going to do. We're going to construct this Gaussian process, and in the end I really want to have something I can call our Gaussian process library on and make it a Gaussian process. We'll do that by defining the mean function of this Gaussian process as the trained deep neural network, and defining the kernel to be this Laplace tangent kernel. We already have the mean; that's easy, I can already write down the mean function. So now we just need to write the kernel in code, and that will take a few lines. What do they do? Two things. The first is to construct the Hessian. I've essentially already done that in this line; I just call it H. But of course we're going to need the inverse of this, so we'll need some matrix decomposition, and we need to think about how to do it; your knee-jerk reaction is maybe to use Cholesky, and we'll get back to that. And then at test time, when someone later wants to do something with this Gaussian process, we need to be able to evaluate this Jacobian function as well. That means that if someone gives you a new input, where you used to do just the blue thing, just evaluate the trained network at this test point, you now do two things: a forward pass through the network to evaluate f, and then a backward pass back down to get the Jacobian, and you use that to get uncertainty. And that sounds like a feasible thing to do. If it sounds feasible, that's good, because that's actually the power of this entire approach, which I'm going to pitch to you before we take a break. The nice thing is that this is really something you can apply to pretty much any deep neural network. Of course, if it's a large network, you have to think a little bit about how to do it right in code; it's not going to just work out of the box. But it's not something that completely breaks with the paradigm of deep learning. If you are someone who comes at this from a deep learning perspective, who has for example taken a deep learning class last term, you get to keep the trained neural networks you've just worked on. In fact, you don't even have to retrain them: not only can you keep the architecture, you can also keep the trained weights. So if you've just fiddled with SGD or Adam for weeks to get it to work, or if your company has just invested 120 million dollars to train your large language model, you get to keep that thing; you don't have to retrain it. That's very valuable, because there are other approaches to probabilistic deep learning, Bayesian deep learning, which do not have this property. For example, there are approaches using Markov chain Monte Carlo sampling, and methods which require that you retrain the network multiple times to construct what's called an ensemble. If you think about how expensive it is to train a large language model once, just imagine having to do it 50 times. We're not going to do that; we're just going to keep the network, however it was trained. In fact, we don't even care how, because we construct the approximation that makes the kernel after the training part. If someone is so kind as to give you an open-sourced, trained neural network together with the training data (because we need to be able to evaluate the Hessian of the loss), then we can do this post hoc. We've done this in my group: we've sometimes taken pre-trained ImageNet models and just added the Laplace approximation afterwards, without retraining, which is kind of nice. The only things we need to be able to do are autodiff and linear algebra. Do these impose additional constraints on our deep learning setup? No, because you're already doing autodiff anyway; if you can't differentiate through your network, how are you training it? Only the most exotic deep learning models are not trained with autodiff, some crazy evolutionary algorithms or whatever, and if you're honest those probably don't work anyway. So everyone is training with backprop, so you already have your Jacobian, so you can do that. And then the rest is linear algebra, and of course linear algebra is something we need to think about. It's not just "I've got 50 samples lying around, let's see what we can do with them"; it's this very rich structure that has been studied for hundreds of years. But the result is an actual Gaussian process, and by the end of this lecture we will have it in our Gaussian process language. It has all the functionality that a Gaussian process has: you can sample from it, you can project it onto other variables, you can even train it with new data by calling the condition function that we have in our library, you can evaluate its log PDF, its evidence, and so on. The downside we'll have to live with is that we need to compute a Hessian decomposition, and I'll do that with you just after the break. You will quickly realize that the way we do it today, the pedestrian, direct way, is not going to scale to large problems. When you have a large network you will really have to think about how to do it right, and at this point, in 2023, you still have to do that thinking yourself. I wouldn't be surprised if in 2027 you don't have to anymore, because some smart tool takes care of it for you; that's how linear algebra works, it gets abstracted away.
But right now we still have to do it ourselves. And then there's this other thing that usually comes up at this point: people draw a picture. I've had conversations like this; two weeks ago I was at Oberwolfach and had a conversation about this with big old deep learning people who have been primed to make the following argument. What does Laplace do? It takes the posterior, which is this complicated thing, maybe it looks like this, and then it just finds this mode, and then it just fits a Gaussian approximation to it. That seems really dangerous, right? Because it's just this local thing, and it only evaluates the Hessian here. What if most of the posterior mass is really over here? Maybe the true posterior looks like this; how do you know? This doesn't really add much. Right now, this is true, and it is actually a weakness of Laplace, but let's keep in mind that nobody knows what this full posterior is anyway. What currently happens in the real world is that people compute the point estimate, and now we're adding something to it. And this is really important, so I'm trying to make this point: the goal today is just to enrich the language of the current state of the art in deep learning. It's not to give the perfect mathematical answer to all problems you could possibly have, because these objects, these full posteriors if you like, are fundamentally very hard to track; they come from non-linear, non-convex optimization problems, so there's really no good algorithmic handle on describing them, and we'll just need to find some approximation. There are other algorithmic approaches that try to enrich this language: if you were taking this class in, I don't know, New York City, you might hear about a very different way of constructing these approximations, and you would get good arguments for why they might be better, because they maybe have stronger mathematical guarantees, but they also tend to be harder to use in practice and more expensive. So there's a trade-off between a simple approximation and a very good one that is very expensive, and I'm trying to construct the one that you can actually apply to pretty much any neural network. And with this, let's take a break and then do the actual code: five minutes, back at 9:10. So now I want to use the entire remainder of today's lecture to just show you code. I've gone back to the code I showed you before. Here's this one line that is actually important, and which did have a bug; the version you find on Ilias has the same bug. I evaluated the Hessian not at the trained parameters but at some other parameters that were floating around somewhere. Actually, I recommend that you also run this code with the bug, because it gives an interesting Hessian to look at; it might be interesting to study. There was a question; it's a good point, and I have a slide about this for next Monday, I think, but let me draw a picture and also show you an equation for it, so that we have several layers of this. Keep in mind that the loss function, the empirical risk, is a sum over all these per-datum terms; this function L depends on, let's say, two things: the data, the (x, y) pairs, and the parameters theta. So in your head you can think of an object for the gradient that is of size d times n, where d is the number of parameters and n is the number of training points. Inside your machine this whole thing gets evaluated, and each of its entries contains the derivative of the loss at training point n, the loss of f of x_n and theta, with respect to theta_d; so it has an index d and an index n. And then afterwards, your computer, without telling you about it, sums over the n direction, and now we have a vector. Actually, of course, it doesn't sum over the whole thing, and it doesn't build the whole thing, because we are working on batches: we take individual slices through this object, that's our batch, say of size three, and just sum those three, and then we randomly pick some of those slices out. That's linear algebra. And if you have a Hessian, it's the same thing, it's just now an array of three dimensions: some kind of blocky thing of size d by d by n, and we sum out the n. That's it.
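This per-datum structure can be made explicit in JAX. The sketch below assumes the loss and params names from before and a hypothetical loss_single helper; vmap produces the d-by-n object from the board (one parameter-shaped gradient per datum), and the sum over n is then done by hand:

```python
import jax

def loss_single(params, x, y):
    # assumption: loss sums/averages over a leading batch axis, so we add one
    return loss(params, x[None, :], y[None])

# One gradient per training point: a params-shaped pytree with an extra leading axis of length n.
per_example_grads = jax.vmap(jax.grad(loss_single), in_axes=(None, 0, 0))(params, x_train, y_train)

# What the machine otherwise does implicitly: sum over the n axis to get a single gradient.
grad_total = jax.tree_util.tree_map(lambda g: g.sum(axis=0), per_example_grads)
```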
So let's build this thing. It's this one line here; actually, I'll comment this other line out because we really don't need it, it's just there so that we have both in the same cell. I should say I wrote this code hectically over the last few days and I've changed it this morning already, so there are bugs in there; that's how close we are to the state of the art. Here comes our main problem: we're going to need to do linear algebra on this object h. What is the shape of h? Well, I just told you, it's of size d by d. Let me check that this is actually true. Oops, where did I type it, am I in the wrong cell? It doesn't have a shape. What's that, what's the error? Oh, it's a list. What does h look like? Here's the nasty part. Because of the way we wrote our code, let's think about what the parameters actually are. We initialize the parameters as, where is it, the parameter function: we make this list, and the list is of length number-of-layers of the network, and for each layer there is a pair, weights and biases, and each of these weights and biases is an array. So here's the bit where I need to think about how long we're going to talk about this, because it's both the tricky part and the boring part, so let me use a few more minutes. I'll wipe this out here briefly; this is going to be the part where your head kind of hurts. We need to translate from this array view, which is convenient for writing these functions, the initialization function and the architecture of the network (that's kind of what people like about deep learning, that these metaphorical neural networks with cells and layers are easier to describe in these, I don't know, pytree-like structures), into a matrix on which you can do linear algebra. So theta is actually a list which contains layers, layer going from one to, I don't know, L, and in each layer there is a weight and a bias, so it's either a weight or a bias, and once we're in there, each of these, either a weight or a bias, is an array. If it's a weight, then it's a matrix of size number-of-units in the layer below times number-of-units in the layer above, so it has indices i, j, or actually, this is not good to draw on a wet board; let's be a bit more precise, because we're actually quite general in the structure, the inputs might be multi-dimensional: it has indices of shape "layer below" times "layer above". You see what I'm trying to write; it's maybe not particularly clean, this wouldn't work in Python, but it's what I'm trying to say. That's only true if it's a weight. If it's a bias, then it's of size number-of-units in the layer above, so it's a vector; the shape is just "layer", comma, somehow.
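As an aside, before the pedestrian way that follows: JAX itself can do this flattening of the nested parameter structure. This is only a sketch, using the assumed names params, loss, x_train, y_train from above, and it is not the code run in the lecture:

```python
import jax
from jax.flatten_util import ravel_pytree

# Flatten the nested (layer, weight/bias) pytree into one long vector,
# and get a function that undoes the flattening.
theta_flat, unravel = ravel_pytree(params)

def loss_flat(theta, x, y):
    # the same loss, but taking the flat parameter vector
    return loss(unravel(theta), x, y)

# Now the Hessian is directly a d-by-d matrix, with no hand-written block assembly.
H_flat = jax.hessian(loss_flat)(theta_flat, x_train, y_train)
```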
So what we would like to have is to unravel this whole thing into a matrix, and we're just going to need to do that. When I realized I had to do that, I thought, okay, I'm going to do that tomorrow, and then I put it off for a bit, and then I sat down with Marvin again to make sure I don't mess it up, because it's actually really fiddly to do. So let me show you how we do it, and I will say, of course, this is one way to do it, but it's not the only one; this one works, and if you have a nicer one, tell us about it. First of all, we construct one of these Python enum structures, an IntEnum in this case, because it's a discrete thing, so that we're able to say we have either a weight or a bias: this thing here is either 0 or 1, and if it's 0 we call it weight, and if it's 1 we call it bias; then we can index things a little bit more nicely. And now we can look at what this thing looks like: the Hessian at layer i for the bias indexes into a row of the Hessian, and that's now something of size... let me actually run this, and let's go back up to the definition of the architecture here. I've decided that this network should have input size 2, then 64, 64, 1, so it's a deep neural network that has 1, 2, 3 sets of weights, mapping from the inputs to the first layer, to the second layer, to the output layer; the output layer is one-dimensional because it's binary classification. OK, so the Hessian is a matrix that contains second derivatives of one of the parameters (that's the first two indices) with respect to some other parameter. So on this layer the biases have a Hessian, an inverse covariance, with respect to, on the other layer, let's say the weights, and that looks like this: it's of size number-of-units in layer i, then nothing, because it's a bias, and then the number of elements in the weights of layer j. That's a matrix of size 64 by 64 in this case. You can use this code to convince yourself of this data type; you won't have to do it in the exercises next week anyway, and it's a bit of a pain to go through. Once you build arrays, you just have to do it. Now, I don't want to spend too much more time on this. What we actually do is build a stupid little for loop, which is not good Python style, but it's just the way we're going to do it, to unravel those parameters, and it's one of those for loops that, when you see it on a slide, won't tell you much; you just have to write it down yourself at some point. So I'm providing it to you, and I'll briefly tell you what it does, and then we'll have to leave it there and look at more interesting stuff. We initialize some empty lists, blocks and sizes; blocks is the operative part, sizes is some bookkeeping. Now we go through all the layers; in each layer we go through biases and weights, it's just a binary thing, and we collect the row of the Hessian at layer i and parameter type, bias or weight, and append it. Sorry, that part is for the sizes, that's just bookkeeping, so that we know how large that part of the matrix is. Then we go over the other side of the Hessian: the Hessian is a second derivative, one with respect to the other, so the first two for loops are for one side and the other two for loops are for the other.
Again, we go through all of the layers and all of the types, biases and weights, and append the entries that are in this list of tuples, reshaped into the size they need to have so that they fit into a row. And imagine that the first time you write this down it doesn't work, and you have to stare at it and find all the stupid bugs, and then at some point it works. So I can run this now, and you see it's actually very fast, and in the end we have a list of all these blocks, and we just collate them into what JAX's NumPy calls a block matrix. I'm also collecting the cumulative sum over the sizes; those are then indices into this big new matrix that tell us where one layer ends and another one starts. Now I can make a picture of this object. It looks like this; that's because I've zoomed in, so you can't really see it, let me zoom out a little bit. So here's our Hessian, and so that you can see something, I'm plotting the log base 10, the decimal logarithm, of the absolute values in the Hessian. In the Hessian there are positive and negative numbers, even if it's positive definite of course, so I take the absolute value and then the logarithm, otherwise you can't see anything. And it looks like this, and maybe you can just barely see some blue lines that I've put in here and there: those are the boundaries between different parts of the network. Up here in this tiny little block (remember that our network is of size, I'll write it down again, 2, 64, 64, 1) at the beginning we have a weight matrix that mediates between 2 and 64, so its block is of size 128 by 128; that's this tiny little block up there. Then we have the biases, which are just 64, so they sit next to it and have their own inverse covariance, which is this little block, and the inverse covariance of the lowest layer's weights is this other little block. Then we have a 64-times-64 layer; that's a lot of things, 64 squared weights, which is why they take up the bulk of the Hessian, this huge part in here. And then we have the output layer, which is one-dimensional, so there's one weight matrix of size 64 times 1, that's down here, and then one more little block for its bias, down here. And that's our Hessian. You can stare at this for quite some time and think about what the structure tells you, and in fact these plots look very different when you change the architecture, so I encourage you to play with this code: change the architecture a bit, maybe remove one layer, maybe make one a bit smaller or larger, and you will get very, very different pictures. What you see here looks a little bit like a computer chip, and that's maybe a good metaphor, because what you see is some kind of occupancy map of the memory in your neural network. One way to think about these objects is that if you invert this matrix, you get a covariance matrix between the weights; that's the psi inverse inside our Laplace approximation, and the diagonal of that covariance gives you a marginal error bar on each weight. Where that error bar is small, the data constrains the element strongly: your network knows something about this bit, it's using the memory at that point to store something about the data. And where the number here in the precision plot is very small, it means the network has not used this weight yet; the error bar is very large, the data has not constrained the weight yet. And the off-diagonal terms
tell you something about how things relate to each other. Remember, you had a homework exercise at some point about precision and covariance matrices of Gaussians; this is a precision matrix of a Gaussian. Can someone tell me what a white entry means? So this is roughly zero; what does a zero on the off-diagonal of a precision matrix mean? Yes: when conditioned on all the other variables, those two weights are independent. And a zero in the inverse, in the covariance, would mean these things are marginally independent. So there really is an interpretation to this, and that's what I want to get across: these are not just big numbers floating around, you can actually look at this and think about it. Now what we need to do is invert this matrix. So far, for Gaussian distributions, we've done this with Cholesky, but there are a few problems with applying the Cholesky decomposition to this matrix. The first is that no one guarantees us that this matrix is actually positive definite. Why? It's a non-convex optimization problem, so the Hessian might well be indefinite. And we've trained with an optimizer that has maybe found a minimum, but it's stochastic gradient descent, so it's not like we know for sure we are exactly at the minimum; we just stopped at some point, we didn't run it until convergence, we stopped because we thought it had converged. So there might be some negative curvature in this matrix. What we should probably do instead is either a singular value decomposition or an eigenvalue decomposition. Hessians are symmetric, and for symmetric matrices eigenvalues and singular values are closely related to each other, so the most general description is an eigenvalue decomposition, which is what I'm going to do, so that we can look at these things. This is this line; we're going to run it. That's the one bit that takes some time, because this is roughly a 5,000-by-5,000 matrix, and taking an eigenvalue decomposition of that is not cheap. But now we've done it, and by the way, this is jax.numpy.linalg's method for the eigenvalue decomposition of a Hermitian matrix; our matrix is symmetric, so it's in particular also Hermitian. Yes, the loss functions will be continuous in the parameters; pretty much all of them are. There are loss functions that aren't differentiable, like the hinge loss, but they are continuous, yes. Now we can make a plot. What I'm going to do is plot the eigenvalues: I take the absolute value of the eigenvalues and then plot their log base 10. This function conveniently returns the eigenvalues sorted by size, starting with the smallest and going to the largest; usually we want it the other way around, the large eigenvalues first, so that's why I reverse the indexing here, so that we go from large to small. And then I also plot the corresponding eigenvectors, scaled with the eigenvalues. The eigenvectors are an orthonormal set of vectors, because it's a symmetric matrix, so they're orthogonal to each other and they're all constructed to have norm one by this algorithm, and I scale them with the eigenvalues so that we can see where the large and the small bits are.
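The cell being described does essentially the following; this is a sketch with my own variable names, where H stands for the assembled d-by-d Hessian matrix from the blocks above:

```python
import jax.numpy as jnp

# Symmetric (Hermitian) eigendecomposition of the possibly indefinite Hessian.
# jnp.linalg.eigh returns eigenvalues in ascending order, so reverse to go large-to-small.
eigvals, eigvecs = jnp.linalg.eigh(H)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Plot log10 of absolute values: there are negative and near-zero eigenvalues in there.
log_abs_eigvals = jnp.log10(jnp.abs(eigvals))
```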
see an actual plot. Where's the missing second bracket? OK, now we can see the plot properly. So here is what we see: these are the eigenvalues of the Hessian, log base 10 of the absolute value, and here are the eigenvectors. What do you see? First observation, and keep in mind this is log base 10: we start with a large eigenvalue, roughly one actually, and then there's a very rapid decay over the first, I don't know, maybe the first 50 or so entries, down to about 10⁻³. Then there's another rapid decay down to about 10⁻¹², absolutely tiny, and below machine precision for the majority of the eigenvalues. And then, what's this bit at the end? Remember, I'm plotting the log of the absolute value: those are negative eigenvalues. There are negative eigenvalues in this Hessian; this thing is just not convex, there is negative curvature in there. So are we actually at a minimum with SGD? No, it's more like a saddle point, and that's just what happens during training: there are directions in which we could still improve the model, and it's just sitting there. Those are the ones that come back up, the negative ones. The good news is that they are quite small compared to the large ones. If I switch on the grid here, hopefully without messing anything up, you see that the important eigenvalues completely dominate the negative ones. That's also why SGD doesn't make use of those directions: its dynamics are dominated by the large eigenvalues, the narrow ravine in the loss function that pushes SGD along. So what we're looking at is maybe a loss function shaped like this, and if you zoom in, if I make a cut-out of it, we're currently sitting here, these are the contour lines, and there's a little dip somewhere over here that we never get to, because the loss has a slightly weird, non-parabolic shape. Those are the small eigenvalues that are negative. But SGD never sees them, because we're never exactly on the centerline of the ravine; we're always a little off, so the gradients point across it, and SGD, with its finite step size, just hops back and forth over the walls of this valley and never makes it to that end point. We also see structure in the eigenvectors, so let's look at those. I've put in the same lines again, corresponding to the layers: the first bit is the initial layer and its bias, then the big intermediate 64 × 64 layer, and then the output. What you see is that the eigenvectors actually have mass, the red bits, across all the layers, so there isn't much structure here that aligns with the layers. But in the middle there is a lot of white, which means this large inner piece of your memory hasn't really been occupied yet. There are just too many degrees of freedom: we have 128 training data points and this thing has over 4,000 parameters, so a lot of these parameters can't possibly be constrained yet, and that's what we see here; there's simply no information about them.
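For reference, here is a compact sketch of this step and the two plots just discussed, under the same assumptions as before (`H` is the dense Hessian from the earlier sketch):

```python
# Symmetric (Hermitian) eigendecomposition: the one genuinely expensive call,
# cubic in the number of parameters.
eigvals, eigvecs = jnp.linalg.eigh(H)     # ascending eigenvalues, orthonormal columns

# Reverse so the large eigenvalues come first, keeping values and vectors paired.
vals = eigvals[::-1]
vecs = eigvecs[:, ::-1]

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))
ax0.plot(jnp.log10(jnp.abs(vals) + 1e-20))   # log10 |eigenvalue|, large to small
ax0.set_xlabel("index")
ax0.set_ylabel("log10 |eigenvalue|")
ax0.grid(True)
ax1.imshow(vecs * vals, aspect="auto")       # eigenvectors scaled by their eigenvalues
ax1.set_xlabel("eigenvalue index")
ax1.set_ylabel("parameter index")
plt.show()
```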
OK, so that's our Hessian. Let's go back to the slide with the math, which says we've now done step 3, the eigenvalue decomposition. In particular, we can invert this matrix just by taking one over the eigenvalues, which is cheap, and we're essentially done, except that we have to be careful with the eigenvalues that are effectively zero; we'll deal with that in a moment. Next we need to linearize, to build the two objects that make up our Gaussian process. We already have the mean function: it's called network, it's the deep network. Now we need the Laplace tangent kernel. I'll write this piece of code and show it to you, and here again the annoying part is reshaping everything into the right shape. The Jacobian object, which is nice because it's just one line in JAX, is j, a function of the inputs: it says, compute the Jacobian of the network function (the thing that takes the input, called f on our slides) with respect to its first argument, the parameters, and return a function that you can evaluate at the inputs. The catch is that, because it operates on the network function, it operates on the parameters in their nested shape, and so it returns a function of the inputs whose output has that nested shape. So we again have to make this compatible with a flat array, the same game as for the Hessian, except it's only half as complicated, because the Jacobian is a derivative with respect to the parameters, not a second derivative with respect to a pair of parameters. That's what happens in this bit, and it's the kind of thing you have to stare at afterwards. I'm not compiling it just in time here; for now we're just constructing a function that we'll call with some inputs in a moment. And now I seem to have a rendering problem with this cell, which is not good. What happens if I press reload? Maybe not a good idea. So I'll just tell you what's in here. We're going to build this tangent kernel (I probably should have put a heading cell here to say so): a function that takes two inputs, constructs this J for each of them, and in between multiplies with the inverse of Ψ. Now keep in mind what we just saw: the Hessian has all these essentially zero eigenvalues, and then negative eigenvalues, so if I just naively invert those numbers I'm going to get lots of infs, divide-by-zero outputs. One way to think about why this happens is that I set up the loss function with no regularizer: there is no prior on these weights, and if you have no prior, you don't get meaningful uncertainty. So we have to put in some prior knowledge, and I'll do it like this. I could have used a regularizer during training, I just didn't, because it made SGD misbehave. So now let's say I add a quadratic term, an R(θ) equal to the prior precision (the variable I'm going to use) times the squared L2 norm of θ. Doing that means adding to the Hessian a diagonal matrix, in fact a scalar matrix, scaled by the prior precision: our Hessian becomes Ψ plus prior precision times the identity. We already have the eigenvalue decomposition of Ψ, so we can write it as V E Vᵀ, and of course we can also write the identity as V I Vᵀ, which means the eigenvalues of this combined matrix, of this sum, are
simply shifted: the sum is V (E + prior precision · I) Vᵀ, so I'm just adding a number to the eigenvalues to push them up. From the plot I can convince myself that I should probably put in something like at least 10⁻³, because then I drown out those negative eigenvalues. Is that a good thing to do or not? This is where you can think about the interpretation: what I'm doing with this step is smoothing out that fine structure in the loss function. It won't be visible anymore, it gets blurred away by adding this term, but I keep the overall shape, roughly, depending on how I choose the prior precision. I'm going to start with a pretty large value, 5, which really lifts everything up so that we're essentially left with the dominating eigenvalues. Now I define this tangent kernel. It takes two inputs a and b, the parameters of the network, and the Hessian matrix constructed from E and V; actually, we're not even using that argument, so it would be good to remove it. It then builds, from the parameters, the Jacobian for input a and for input b (I've outlined this clumsily and my browser doesn't render it properly): the Jacobian of a is multiplied with the eigenvectors V from the left, the same happens with the Jacobian of b from the right, and in between we multiply with one over the eigenvalues plus the prior precision. That gives an approximation to the inverse of the regularized Hessian. Let's hope that's right; maybe one has to play with it a little, maybe there needs to be a minus sign somewhere, we'll see. Once we have that, we can define a Gaussian process. It's annoying that this line doesn't render; maybe I can just add something like this... here we go. And this is where the magic happens, if you like: we construct a mean function and a kernel. We already had the mean function, it's just the network evaluated at the trained parameters, and the kernel is this tangent kernel evaluated at the trained network, and with those two we define our Gaussian process. At this point we call the piece of Python code that we've been working with for this entire course; it just constructs this object, and in doing so we endow the network with all of that functionality: the ability to sample, to instantiate itself, to condition on new data sets, to evaluate PDFs, and so on. In particular, this means we can make a plot. We can predict on some plotting grid, plot the uncertainty, and ask our Gaussian process to instantiate itself on the grid and draw posterior samples from this deep neural network. Let me make sure that actually runs. It takes a bit of time; why? Because constructing Jacobians costs a bit more than forward passes through the network, about five times more, so we have to wait a little. Getting this kind of uncertainty doesn't come for free. And now I'm producing a bunch of plots so that we can get to an end here. You see the posterior mean of the function, here the estimate for the uncertainty (where it's dark the network claims to be confident, where it's white it seems uncertain), and here the predictive estimate, the approximate prediction for the expected value of the sigmoid transform of the network, assuming we are Gaussian and uncertain about the weights.
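Since that cell didn't render, here is a minimal sketch of what it computes: the mean function and the Laplace tangent kernel k(a, b) = J(a) V (E + p·I)⁻¹ Vᵀ J(b)ᵀ. The names `network(params, x)`, `params`, `vals` and `vecs` are carried over from the sketches above, and the Gaussian process line at the end stands in for the course's own GP class, so treat it as a placeholder rather than a real API.

```python
flat_params, unravel = ravel_pytree(params)

def jac(x):
    # Jacobian of the network outputs at inputs x w.r.t. the flattened parameters.
    f_flat = lambda p: network(unravel(p), x)
    return jax.jacobian(f_flat)(flat_params).reshape(x.shape[0], -1)   # (n, D)

prior_precision = 5.0

def laplace_tangent_kernel(a, b):
    Ja, Jb = jac(a), jac(b)                           # (na, D) and (nb, D)
    inv_eigs = 1.0 / (vals + prior_precision)         # eigenvalues of (Hessian + p I)^{-1}
    return ((Ja @ vecs) * inv_eigs) @ (Jb @ vecs).T   # J(a) V (E + pI)^{-1} V^T J(b)^T

def mean_function(x):
    return network(params, x)                         # the trained network itself

# gp = GaussianProcess(mean_function, laplace_tangent_kernel)   # course GP class, placeholder
```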
You can see that it's a little more structured than the mean prediction, and down here are three samples from the posterior. They adhere to the structure of the data, red where there is red data, green where there is green data, but they differ from each other: this could be one hypothesis, this another, and this a third. I just hope I didn't get a sign wrong; let me quickly run this with a minus put in to see what happens... oops, not there... ah, OK, that errors because of the negative eigenvalues, so let's hope the sign is right; at this point we're mostly seeing the effect of the large eigenvalues anyway. One thing you might want to do now is notice that the expensive computation is J times V, constructing the Jacobian and multiplying it with V. So if you want to play around with different prior precisions p, you can pre-compute that object (you can look at the code later if you like) and then do a sweep through different values of the precision and just check what the samples look like; there's a small sketch of that trick after this paragraph. You'll see that for large precisions you essentially just get the mean prediction back, because then we're confident about everything, and for very small precisions you get all sorts of unstructured estimates. To the question from the audience: this works because I choose the prior precision to be larger than the negative eigenvalues; in the earlier plot you saw that, in absolute terms, the negative eigenvalues sit at around 10⁻³, and what I'm adding here is always larger than that. There's a lot to see in these plots, so let's spend two minutes thinking about some of the structure. First, you see a lot of structure in the uncertainty plot, and it reflects where the eigenvalues, or rather the corresponding directions, actually lie. You can maybe see the shape of the ReLU features that live in this space; they produce these harsh straight lines, because there's a feature with a relatively large weight that carries some uncertainty, and that just gets reflected in the plot. Another thing you see is structured uncertainty: we're quite certain in this region down here, very uncertain along the decision boundary of course, but there's also a lot of uncertainty, for example, over here, a lot of structure in those parts. We see the same in the samples: in the bottom right the network is relatively confident, but up here and up here it flops around a lot, there's a lot of flexibility. Now we can think about why that is, and why we had to put in this prior precision. This is maybe the last point I want to make, and it's important. You could now get the impression, and I can anticipate that some of you will write this in the feedback in a moment, that this is all wishy-washy: lots of samples that all look different, so why should I trust this Laplace approximation, it seems like a made-up thing to do. The important thing to remember is that the Laplace approximation sits on top of a whole pile of autodiff and, underneath that, a trained model. So if this doesn't look the way you expected, your first thought should maybe not be that the Laplace approximation is wrong, but that you're dealing with a deep neural network, which is genuinely complicated to work with.
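Here is the pre-computation trick as a rough sketch: the product of the Jacobian on the plotting grid with V does not depend on the prior precision, so it can be computed once and reused for the whole sweep. `X_grid` is a hypothetical plotting grid; everything else is carried over from the sketches above.

```python
JV = jac(X_grid) @ vecs                      # the expensive part, computed once

for p in (1e-2, 1e-1, 1.0, 5.0, 100.0):
    K = (JV * (1.0 / (vals + p))) @ JV.T     # predictive covariance on the grid
    # ... draw samples from N(mean_function(X_grid), K) and plot them
```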
For example, keep in mind that we have a bit more than 4,000 parameters being fitted to 128 data points. Of course this model is uncertain in all sorts of crazy ways: there is nowhere near enough information in the data set to pin this thing down. But that's a typical setting in deep learning; people over-parameterize their networks. So if you want to make these plots look better, the first step is not to walk away from Laplace approximations, but to go back up in this Jupyter notebook and drastically reduce the size of the network. There's no time to do it now, but you can try it yourself: you can train this network to 100% accuracy with 10% of the weights, maybe even 1% if you do it right; even a single hidden layer will work well on this problem. When you do that, you'll find that a lot of the complicated structure we just saw, with the negative eigenvalues, simply goes away. Maybe take that as something to think about: do you actually have to build your deep neural network in such a big, complicated way? The other thing you might think is that this all seems really involved. Remember, on Monday I just clicked SGD and it somehow went down; we had to tune the parameters a little, and now we had to do all this elaborate work to construct this object, and I even made some mistakes in the code along the way. So isn't this too expensive, too elaborate? On the coding side, hopefully in a few years this will become much more natural to do, because the software stack adapts and people will write code that makes this automatic for you. But is it actually more expensive? Well, for the purposes of this lecture I constructed the entire Hessian, this big matrix, and computing it is cubically expensive in that number of 4,000-and-something parameters. But if you look at the eigenvalue plot, it's clear that we really only need the part where the eigenvalues are large. Those contain essentially the entire geometric information about the loss function, and everything underneath is tiny detail that we can look at and have nice little arguments about, but that has practically no effect on the network. So if you want to make this code fast, you have to think about ways of computing only that first bit, that small leading part. Prediction then becomes much faster too, because you no longer need a Jacobian with respect to all the weights, only with respect to a linear projection onto, I don't know, maybe 50 directions of the weight space. And you might think about what else that tells you about your network: maybe you don't even need the entire weight space, only a linear projection onto those 50 dimensions, which is also much easier to store. This really is a map of your memory.
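One way to act on that observation, sketched under the same assumptions as before: keep only the top k eigenpairs and treat everything below as pure prior. The rank k = 50 is just an illustrative number, not something the notebook computes.

```python
k = 50
U, e = vecs[:, :k], vals[:k]                 # top-k eigenpairs (largest eigenvalues)

def low_rank_kernel(a, b, p=prior_precision):
    # Uses (Hessian + p I)^{-1} ~= U diag(1/(e + p)) U^T + (1/p) (I - U U^T).
    Ja, Jb = jac(a), jac(b)
    JaU, JbU = Ja @ U, Jb @ U                # only the k-dimensional projection of J
    top = (JaU * (1.0 / (e + p))) @ JbU.T
    # The unconstrained remainder acts as pure prior; dropping it entirely would
    # mean only ever needing the k columns J(.) @ U.
    rest = (Ja @ Jb.T - JaU @ JbU.T) / p
    return top + rest
```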
We'll talk more about this next Monday, where I'll point out a few more aspects of these Laplace approximations. What I've shown you today is a way to turn any deep neural network into a Gaussian process. So hopefully it now becomes clear why we spent eight or nine lectures talking about Gaussian process models: we've just plugged all of that functionality on top of the deep neural networks that you've learned about in other lectures. All of it only involves autodiff and linear algebra. Both are non-trivial, they amount to complicated code, but they're not nasty in the fiddly kind of way; they're hard pieces of numerical code that can be designed well, tuned well, and understood well. Of course you have to be careful to make them work for larger networks, but that's something we can talk about in another lecture. I hope some of you give feedback, and I'm looking forward to seeing you on Monday. Thank you very much.