OK, again, welcome to lecture number 13, in which you get to see me instead of Augustinus, who was supposed to give this lecture. When we prepared the lecture course, we were still thinking he would give it. But then he got a slightly expedited offer to do a postdoc at the Vector Institute in Toronto, and he had to leave us literally last week. He said goodbye last Friday, and now he can't give this lecture, so you have to imagine him doing this; I'm just standing in for him. That also goes by way of an advance apology that I'll occasionally have to think about what I'm actually presenting, because it's ever so slightly PowerPoint karaoke. But I was involved in all of the papers we'll talk about, so I should hopefully be able to do this. So what is this lecture about? The previous lectures were, one way to phrase it, about getting uncertainty into computation, doing numerical computations with uncertainty. Today is a bit special in that we will talk about how to do the computations to get uncertainty into your machine learning if it's not already there. That will also give us an opportunity to reuse some of the insights from previous lectures. In particular, I will talk about uncertainty, or Bayesianness, in deep learning. We already spoke, in lectures nine and ten before Christmas, about the fact that uncertainty usually means solving integrals, and today we'll see a really, really fast way of solving integrals. Because we already learned in lectures nine and ten that the generic way of solving integrals, Monte Carlo sampling, is maybe not the most efficient way to do things. We'll also reuse insights from the previous two lectures: two lectures ago, Frank Schneider told you how to train neural networks, and he told you that it's not as straightforward as it might seem.
And then last week, Lukas Tatzel built up an intuition for how to use second-order quantities, curvature estimates, primarily for optimization in deep learning. Today we'll see that these curvature estimates can actually do a lot of double duty for us, not just for optimization, but for really interesting other functionality as well, in particular, to get uncertainty into deep neural networks. Before I can tell you how to do that, we first have to talk about why we need uncertainty in deep neural networks. There are actually several good reasons, and they are surprisingly concrete, at least in my opinion, and I want to use one of them as motivation. So, hands up, who has seen pictures like this here on the left before? Probably everyone who's taken a deep learning class, yes. So you all know that there's this weird behavior of neural networks, that sometimes you can give them really stupid images and they'll think they're something completely different. You've probably seen a different version of these pictures with adversarial examples: examples that look like actual images, but for which the classifier predicts the wrong thing. These here are slightly different. These are what you might call out-of-distribution examples. They are images that look a bit like white noise; they're not actually quite white noise, but almost. They're very far away from the training data within the input domain, within the X space. They're images, but they really don't look anything like natural images. However, a network trained on something like the ImageNet training data, one that actually works well on the training data, that has high predictive accuracy on the training data, assigns labels to these particular images with very high confidence.
So the classifier gives something like 99.5% confidence that this here is a robin and this here is a cheetah and so on. And I think we all agree that that's not a good thing, right? You don't want a classifier that looks at some totally stupid image and says, I know exactly what's in that image. This is sometimes called asymptotic overconfidence, because these images are very far away from the training data. There was actually a really beautiful paper by my colleague Matthias Hein at CVPR 2019, which in machine learning terms is like ages ago, in which he showed theoretically, typical for Matthias Hein, that this is actually a fundamental problem with classification networks of the standard type, meaning classification networks using ReLU nonlinearities. The argument goes as follows, and I'll pretty much try to reproduce it in just a few lines with some intuition. First of all, here's an image that shows the situation. You see here a training data set with four classes; it's a classification problem. There are four colors in this plot, if you can see them, one for each class. We've trained a very simple classifier on this, using ReLU nonlinearities as the link function between the layers of the network. And what you see as shading in the background in red is the confidence of the classifier in whatever the output class is. So you can see four regions; you can imagine that each of these regions the classifier assigns to one of the classes, and they are not shaded in the color of the class, but by confidence. Deep red means that the class is predicted with very high probability, and white means that even the highest-scoring class has low probability. And the main thing you're supposed to see in this plot is that if you move away from the classes into the corners of the image, you still get very high confidence.
And that's bad, right? So Matthias Hein showed that this is actually a fundamental property, and it works as follows. There are two insights to have. The first one is that we're talking about a ReLU classifier. If you think of the penultimate layer, the logit layer of the network, the one that goes into the softmax output, then that is a linear layer: a linear combination of the previous layer's output, where all the layers below have used ReLU nonlinearities. Everyone, I guess, knows what a ReLU is, right? It's a piecewise linear function that you get to scale and move around. Now, a piecewise linear function of a piecewise linear function of a piecewise linear function is still a piecewise linear function. So the nice thing about ReLU networks is that their output is a piecewise linear function. It's a hierarchical combination of the previous layers, but that doesn't change the fact that it's still piecewise linear; it just has combinatorially many of these piecewise regions. This picture is supposed to show this. Each of these gray lines is one of the boundaries between linear regions, so on each cell delineated by these gray lines the network is a linear function. The first lemma up here says: if you have such a classifier, such a piecewise linear function f on the input space (this space here is a sketch of the input space), then if you take any point in this space and scale it away from 0, there exists a last scaling alpha, this blue point, after which you're going to be within one of these linear cells and never leave it again. Does that make sense? The main thing is that there are finitely many of these linear regions. Therefore, if you move away, you eventually cross the last of these gray lines, and after that, everything is just linear.
So if you move far away from the training data, we will eventually be in a region where the classifier has a linear input to the softmax output. Remember from your deep learning lectures: to do classification with multiple classes, you have one linear output per class, and together these linear outputs go into a softmax. The second statement then says, and I'm going to show you a visualization of this in a moment, that if you are in such a piecewise linear region and you move sufficiently far out, then eventually you're in a region where all the classifier outputs are linear. And with probability 1, each of these linear functions has a different gain: they rise at different rates. If that's the case, then if you move sufficiently far away, one of them will be the highest, and because it has a higher gain than the other ones, it will be arbitrarily far above the second highest one after some distance. And if you put that into the softmax, you'll get an arbitrarily high confidence for that one class. And that's exactly what we see. I also have a visualization of this that I hacked together yesterday. Here you see three linear output features in red; each of them you can think of as one class predictor. Above them you see the softmax output over those three. Let's maybe look at this one, the steepest one. It comes from below, so it has very low probability to the left of 0, and then it becomes the highest one, so it has very high probability to the right. For the other two, you can think about why they look the way they do. So in a situation like this, clearly one of them has the highest gain, and eventually it'll be arbitrarily far above all the other ones.
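This effect is easy to check numerically. Here is a minimal numpy sketch of three linear logits with distinct gains (all slopes and biases are made-up numbers, purely for illustration): as the input scaling grows, the softmax confidence of the steepest logit climbs toward one.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - z.max())
    return e / e.sum()

# Three linear logit functions f_k(x) = a_k * x + b_k with distinct
# slopes ("gains"); the numbers are hypothetical, chosen for illustration.
slopes = np.array([2.0, 0.5, -1.0])
biases = np.array([0.0, 1.0, 0.5])

for x in [1.0, 10.0, 100.0]:
    p = softmax(slopes * x + biases)
    print(f"x = {x:6.1f}  max confidence = {p.max():.6f}")
```

Running this, the maximum class probability approaches 1 as x grows, which is exactly the asymptotic overconfidence the theorem formalizes.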
Move arbitrarily far in that direction and you get arbitrarily high confidence in that class. So this is a fundamental property of ReLU classifiers. It's not some accidental thing; it's not like you can retrain your weights to avoid it. It's just fundamentally a property of these networks. And I'm going to tell you that the way to fix it, fundamentally, is to add uncertainty to the weights of the neural network. So let's see if we can get there. First of all, how do we get uncertainty on a deep neural network? We need a Bayesian interpretation: uncertainty is always connected to the idea of Bayesianism, so we need some Bayesian interpretation of the neural network. It turns out that that's actually surprisingly straightforward, or certainly straightforward, maybe surprisingly so, depending on how much you've thought about this before. In the last two lectures, Frank Schneider, and then Lukas again, introduced this notion of empirical risk minimization, which of course you also know from your statistical machine learning class. Deep neural networks are trained by minimizing some function that looks like this, up to a few normalization constants that I've left out for ease of argument. The risk we're minimizing is a sum over individual terms, where each term depends on one of the training data points and all of the weights of the network; the weights are denoted by theta, and they go into some function whose structure defines the neural network. And then, in general, you may also have some regularizer on the weights, just for full generality. You don't always have that in deep learning, but you could.
Minimizing this function, in particular since these loss terms are usually non-negative, can be thought of as minimizing another name for the same function: the negative log likelihood of the data under the predictive model, plus some function that only depends on the weights and not on the data, which we could call a negative log prior. So minimizing this is the same as maximizing the exponential of its negative. If you take the logarithm away, you can think of this as multiplying a prior with a bunch of individual likelihoods. And prior times likelihood, up to normalization, is a posterior. The normalization doesn't matter because we're just finding the mode, the lowest point of this negative log posterior, and that mode is not shifted by multiplying this thing by a constant: the constant would just subtract a constant in the logarithm, and that doesn't change the location of the minimum. So actually, we can think of deep learning as already doing Bayesian inference; it's just that we only compute the mode of the posterior. Well, actually, we never quite find it, because of all the problems that Frank told you about, but we're trying to get there. In particular, one common setting is regression: supervised problems with continuous-valued outputs, where a widely used loss is the quadratic loss, down here, and a common regularizer is what's called weight decay, a quadratic or L2 weight cost, which is this thing. Can someone guess what the corresponding prior and likelihood are? Yes. If you're doing regression with a quadratic output loss and a quadratic weight cost, you're essentially putting a Gaussian prior on the weights and a Gaussian likelihood on the data.
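To make that correspondence concrete, here is a tiny numpy sketch (with made-up data and an assumed noise variance) checking that the quadratic empirical risk with weight decay equals, up to additive normalization constants, the negative log of a Gaussian likelihood times a Gaussian prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data and weights, purely for illustration.
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
theta = rng.normal(size=2)
lam = 0.1      # weight-decay strength, i.e. prior precision
sigma2 = 1.0   # assumed observation-noise variance

f = X @ theta  # stand-in for the network output f_theta(x)

# Empirical risk: quadratic loss plus L2 regularizer.
risk = 0.5 / sigma2 * np.sum((y - f) ** 2) + 0.5 * lam * np.sum(theta ** 2)

# Negative log Gaussian likelihood and negative log Gaussian prior,
# dropping the theta-independent normalization constants in both.
neg_log_lik = 0.5 / sigma2 * np.sum((y - f) ** 2)
neg_log_prior = 0.5 * lam * np.sum(theta ** 2)

print(risk, neg_log_lik + neg_log_prior)
```

The two numbers agree exactly, which is the whole point: the minimizer of the risk is the mode of the corresponding posterior, regardless of the dropped constants.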
However, that's not a Gaussian model, because remember that we're talking about deep neural networks; here's a picture on the right. Theta enters F in a nonlinear fashion, so it's not just a Gaussian regressor. OK, so we actually have a Bayesian interpretation for our deep neural network. The only thing left to do is to make use of this posterior. To do that, we would consider the full shape of this object, the full posterior up to normalization, and then make predictions about the output, for example a regression output or a class label. The typical setting is that we get a new input x star for which we have to make a prediction, and we want to know the class label y star. For that, we have to integrate out our uncertainty over the weights, to marginalize. Now, unfortunately, and this is of course why most people don't do Bayesian deep learning, this posterior distribution is intractable to begin with. We're doing deep learning, so there are like a billion of these weights; theta is like a billion-dimensional vector. And this function is highly nonlinear; it's not just a quadratic function, otherwise the posterior would be a Gaussian. And this likelihood here is also not a Gaussian, because if you do classification, it might be something like a softmax over the classes. That's where most of the early thoughts about Bayesian deep learning ended. And it's where, in recent years, a lot of contemporary Bayesian thinking about deep learning has entered with a somewhat reflexive answer: well, it's an intractable prior and likelihood, so let's do Monte Carlo. But I've told you in lecture nine that Monte Carlo approaches, while theoretically very well-founded, are potentially dangerous because they might take a long time to converge. And in fact, that's going to be my argument today as well.
So we're not going to do Monte Carlo, because that would basically put us at a disadvantage compared to people who are not doing Bayesian inference in the first place. I'm assuming you've all done a deep learning class, so you kind of know and like the way deep neural networks work, not all of it, but some of it. You'd like to be able to keep doing that; adding uncertainty would be nice, but not at the cost of making everything intractable. The people who train GPT-whatever, 4 or 5 at the moment, can't afford to run a Markov chain Monte Carlo sampler on their huge deep network; it would be so expensive that you could never hope to actually do it. So if we want to add uncertainty to fix pathologies like the ones I just showed you, we have to find a way of doing all of this without significantly adding cost to the process. What we'll talk about today is the cheapest possible way to do integrals, and it's automatic differentiation coupled with linear algebra. Let me show you how this works. First of all, there is a surprising result due to Augustinus; if he were here, he could give you the full story. It says that this problem with overconfidence that I've just showed you can actually be healed by being Bayesian about the weights, period. Not by computing the correct posterior, but by adding any probability distribution to the weights. The paper actually has the title "Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks", because it contains a theorem, of which I'm going to read out just one part, and then we'll talk about what it actually says. The theorem says: any Gaussian approximate measure on even just the last layer's weights of the network already solves the problem, at least partially. So before we talk about how it works, it's valuable to appreciate this.
What we're saying is that it doesn't actually matter whether the probability distribution we put on the weights is a correct posterior or not. This particular problem from the beginning can be healed by adding any kind of probability measure on the weights, even the simplest possible one. How are we going to arrive at such a measure? Let me go one slide forward, because I can show you what it will look like. You can see already that it has healed this confidence problem: the classifier just becomes less and less confident about the classes as you move away. So how does this work? We take the input to the classification layer, I keep swapping back and forth about what to call it, this f theta, the input to the logistic output layer, to the softmax. We write it explicitly as a linear function of the last layer's weights, which we call w, multiplied with the penultimate layer's output. So phi of x encapsulates everything that happens in the lower layers of the network, and w are the weights of the last layer. The parameter set theta combines both w and the parameters of phi, if you like. Now we're going to put a Gaussian distribution on those last-layer weights. We're basically ignoring the fact that there is a deep neural network below; we just take it, we don't think much about it, and we assign a Gaussian distribution to the last layer's weights. And in fact, it doesn't matter what the covariance of that distribution is; we'll just assume it's anything that is not 0. But we assume that the mean of this distribution is given by the trained weights of the network, so by the prediction it would otherwise make, the one we have previously thought about. And now we're going to use this distribution to try and solve the problem from the previous slide.
So this is now a Gaussian distribution; we've just decided that. We're going to make predictions, we're doing classification, so this likelihood here is a softmax, the softmax over f theta at x star. And this is still an intractable integral, right? A Gaussian times a softmax, integrated, doesn't have a closed form. Thankfully, though, smart people came up with good approximations a long time ago. In particular, David MacKay, who basically wrote the seminal work on Bayesian deep learning in his PhD thesis around 1992, already worked out how to do this approximately, because he had the same kind of mindset: how can we be ever so slightly Bayesian about, back then, connectionist models? He realized that you can approximate this integral, if this is a Gaussian and this is the softmax, very roughly as follows. You compute the softmax over a quantity derived from the sufficient statistics of the Gaussian: a derived variable whose numerator is the mean prediction, the output the network would otherwise produce, but mediated, flattened a little, by dividing by the square root of one plus pi over eight times what would be the Gaussian predictive variance of the function f theta. That is, if f theta is w transpose phi and you marginalize over w, you get a Gaussian with this mean and this covariance; that's this quantity. If you take this prediction, then it has the following property: under this approximation of the integral, any Gaussian approximate measure, in particular any choice of sigma, yields a confidence that stays bounded away from one as we move arbitrarily far away from any point. The classifier is not going to be arbitrarily confident.
I'll show you this in the little visualization I used before; here I can actually switch this on. What you now see is that for each of these linear inputs I've added a Gaussian uncertainty on the weights, which translates into this term phi transpose sigma phi and looks like these three regions of uncertainty. In blue you see this MacKay approximation for the class probabilities. And the main thing to see is that the blue lines are bounded away from one; they don't approach one anymore. One intuition for why this happens, without doing the proof: you can maybe see it best in this expression. There is a function up here that is linear in x, because it's a ReLU network, so far from the data it's piecewise linear, hence linear in x. And we're dividing by the square root of something that is quadratic in x: it's x times x, x squared, with a square root over it. So we're dividing something like x by x, and it cancels out asymptotically. That's the entire point. And where does it come from? From the fact that we're uncertain about the weights of the network. We have to take into account that we don't quite know the weights, and then suddenly everything is good. So far this is really a philosophical argument for why you should be uncertain about the weights of your neural network: not for some religious reasons about Bayesianism being the right way to do learning, but because it actually avoids pathologies. It's just wrong to assume you know something when you don't, and it creates issues like this. OK, so the argument has been that we should be uncertain, and it kind of doesn't really matter how uncertain we are.
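The bounded-confidence effect is easy to reproduce numerically. Below is a small sketch of the MacKay-style approximation just described, treating the features phi as scaling linearly with the input, as they do in the outer linear region of a ReLU network; the weights W and the covariance Sigma are invented for illustration. Without weight uncertainty the confidence saturates at 1 as we scale the input; with it, the confidence converges to a limit strictly below 1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical last-layer weights for 3 classes, and a Gaussian
# covariance over those weights (any nonzero Sigma works).
W = np.array([[2.0, 0.1],
              [0.5, -0.3],
              [-1.0, 0.2]])
Sigma = np.eye(2)

def confidence(phi, use_uncertainty):
    mean = W @ phi                      # mean logits  w_k^T phi
    if use_uncertainty:
        # MacKay-style mediation: divide the mean logits by
        # sqrt(1 + pi/8 * phi^T Sigma phi) before the softmax.
        var = phi @ Sigma @ phi
        mean = mean / np.sqrt(1.0 + np.pi / 8.0 * var)
    return softmax(mean).max()

phi0 = np.array([1.0, 0.5])
for alpha in [1.0, 10.0, 1000.0]:
    print(alpha, confidence(alpha * phi0, False), confidence(alpha * phi0, True))
```

The numerator grows linearly in the scaling alpha while the denominator also grows linearly for large alpha, so the mediated logits converge to finite values: exactly the x-over-x cancellation described above.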
As long as we are ever so slightly uncertain, it's going to be fine. But that, of course, is an asymptotic statement. (Audience question.) Yes, you're asking whether it has to be a Gaussian. So maybe more cautiously, what this theorem says is: if you're doing all of these approximations, if you're linearizing, using this particular approximation, putting a Gaussian distribution on the last layer's weights, then it doesn't matter which Gaussian it is. It's certainly possible to get generalizations of this kind of theorem that work for a broader class of approximations, but actually I'm quite happy with just this one, because we're going to stay within this framework for the rest of the lecture. I could also rephrase the statement by saying: it might be a good idea to put a Gaussian approximation on the weights and to use this particular approximation for this integral, which by the way is quite a strong approximation because of properties like this. We're deliberately stepping outside the religious framework of Bayesian inference and saying: we just do this. We use probability distributions on the weights and pretend they are posteriors, when in fact they are not, right? Because the real posterior is this object we've spoken about, this thing here. But then again, how much do you care about the actual shape of this posterior? This is maybe a good point in the lecture to point that out. When you write down a deep neural network, when you define an architecture, does anyone ever think about the fact that it defines a generative model, that it can be interpreted as defining likelihoods and priors, and hence a posterior? No, right? You don't write down a convnet with eight layers and whatever weight norm because of some Bayesian modeling assumption. It's just what you do, because it's deep learning, right?
And you just hope that it learns the right thing. So maybe it's not so important what exactly the shape of this posterior is. What does seem important, though, is to acknowledge that the weights are not actually identified. That just means the posterior is not a point mass: it's not zero everywhere with all the mass concentrated at a single point, but a complicated shape, and there are several different choices of weights that all explain the data somehow, sometimes better, sometimes worse. Now, the theorem above is of course an asymptotic statement: it only says something about what happens if you go arbitrarily far away from the data, and we never go arbitrarily far away from the data, not even in this picture. Also, for most use cases of deep learning, x actually lives on a bounded domain. If you talk about images like the ones I showed you on the very first slide, the domain of images is bounded: every pixel goes from 0 to 255, and that's it. So asymptotic statements are useful for this kind of theoretical analysis, but maybe they are not super crucial for applications. That raises the question of how we should choose this sigma such that it also works in practice. And here my argument is going to be: we choose it with automatic differentiation, with curvature estimates, because they are the fastest and cheapest thing to do. It works as follows. These are called Laplace approximations. Maybe a quick check: who has heard of Laplace approximations before? Almost everyone, that's very good, so I can be quite fast. The idea is to do a quadratic approximation to the log posterior; let's do it within the notation of deep learning. Our full posterior is this object up here, prior times likelihood, that is, the joint, the generative model, divided by its normalization constant.
We've just realized that we can write this joint as the exponential of the negative loss function that we are optimizing in deep learning; that's the connection between deep learning and Bayesian inference. Now let's assume we found the mode: we used some magic optimizer that somehow found the mode of the loss. We call that theta map, for maximum a posteriori. Around theta map we do a second-order Taylor approximation, which gives a constant, plus a linear term, plus a quadratic term. Because we are at the mode, the gradient is zero, so the linear term drops out, and we're left with the constant and a quadratic term containing the Hessian of the loss function. Now we take the negative exponential of this: that gives us the exponential of the negative loss at the mode times the exponential of a quadratic form. And exponentials of quadratic forms are Gaussians, after normalization. So we can do this integral in closed form, if we want to integrate out theta, and we get the normalization constant of a Gaussian: two pi raised to the power D over two, where D is the dimensionality of the space, divided by the square root of the determinant of this Hessian matrix. Notice that it's the Hessian itself, even though, if you think of it as a covariance, we would need the inverse of the Hessian. OK, so if we can get the Hessian, we can construct approximations like this. And this is totally analytic: if someone gives you the Hessian in some parsable form and the trained neural network, the theta map, then those two define a Gaussian, with mean at theta map and covariance given by the inverse of the Hessian of this loss function. And that will look like this; this is the cartoon picture that people always show.
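In one dimension, the whole recipe fits in a few lines. Here is a sketch on a made-up loss L(theta) = theta^2/2 + 0.1*theta^4, whose mode is at 0 with second derivative 1, comparing the Laplace estimate of Z = ∫ exp(-L(theta)) dtheta against brute-force quadrature:

```python
import numpy as np

# Made-up 1-D "loss" with its mode at theta = 0 (for illustration only).
def loss(t):
    return 0.5 * t**2 + 0.1 * t**4

theta_map = 0.0
hessian = 1.0   # second derivative of the loss, 1 + 1.2*t^2, at the mode

# Laplace estimate of Z = integral exp(-loss(t)) dt:
# exp(-loss(theta_map)) * sqrt(2*pi) / sqrt(hessian)
Z_laplace = np.exp(-loss(theta_map)) * np.sqrt(2.0 * np.pi / hessian)

# Brute-force quadrature for comparison.
t = np.linspace(-8.0, 8.0, 20001)
Z_numeric = np.sum(np.exp(-loss(t))) * (t[1] - t[0])

print(Z_laplace, Z_numeric)
```

The Laplace estimate is off by roughly 17 percent here, because the quartic term thins the tails relative to the fitted Gaussian; that is exactly the kind of local error the cartoon on the slide warns about.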
Here's the loss function in blue, and this is the Gaussian you get from it, centered at the mode, with log-curvature equal to the log-curvature of the blue line. The main reason to show this plot is that this is evidently not a perfect approximation at all. It's totally local, so you can be arbitrarily wrong: there can be another mode somewhere else that this thing will completely miss. And that's the argument that has been used for a long time against Laplace approximations. They are essentially the oldest approximation for Bayesian inference; as the name suggests, they were actually invented by Pierre-Simon, Marquis de Laplace, because he couldn't solve a beta integral and had to fall back on the Gaussian integral. And because it's so old and so local, the method has acquired this kind of bad reputation. But over the past few years I have come around to really fall in love with Laplace approximations again, and I'm going to spend the rest of this lecture telling you why. So here's the point where I do PowerPoint karaoke with Augustinus' slides. Here's how it works. You take your deep neural network, wherever you got it from: from your deep learning class, from Papers with Code, whatever. If it's already been trained, then you're basically done, almost. If it hasn't been trained yet, you first train it, in whichever way you like, using any of the 120 algorithms that Frank Schneider told you about two weeks ago; Adam, typically. You wait for it to converge, do something fancy with the learning rate, decay it, cosine decay, whatever. Eventually, you end up with theta map, and that was the hard part, getting it trained. So you've already survived the hard part.
And now the only thing left to do is to find the Hessian at that point. And you do that with autodiff. I'll show you code in a moment; I'll put it on ILIAS as well so you can look at it. Of course, it's not entirely straightforward, but you can rely on autodiff to do it for you. Why? Because automatic differentiation is an exact, mechanical process that actually works, as opposed to those 120 optimizers with their weird parameters, or Monte Carlo, and all of these algorithms that are so shaky. Automatic differentiation is just linear algebra; it just works. So that's not the hard part about deep learning. If we get our Hessian, and let's just assume for a moment that we have it, then we're done: we have a Gaussian distribution over the weights, and everything else is closed form. The cool thing about this is that, first, it's cheap. The hard part is training the deep network, but you've already done that. At the end, you're just evaluating the Hessian once, nothing else. And I'll show this to you, because I have a little animation of it; I fiddled around with a piece of code that Augustinus gave me, and I'll upload it afterwards. Here's a toy data set with these four Gaussian blobs. And here we're training a deep neural network; those of you who have used Torch before have seen pieces of code like this lots of times. This is a really simple deep neural network with two hidden layers, and then, importantly, where is it, a cross-entropy output loss. We train this thing in the normal way. It's a tiny data set, so everything is reasonably fast, even though this isn't a GPU machine. Once it's done, we have a trained neural network. Notice how so far there's nothing Bayesian whatsoever.
Now there's a test data set, which you can just construct, and you get this kind of prediction, right? And what you do afterwards, after you've trained your deep neural network, is construct a Hessian. I'll show you in a moment exactly how to do this computation, but the main thing to understand is that you do it after you've trained the network, so it adds no cost to training. Secondly, you can do it even for a network someone else trained. If you can download a trained ImageNet classifier, you have ImageNet available, and you know the architecture of the network, you can do this uncertainty business afterwards; you don't have to rerun the training algorithm. If the people who trained Stable Diffusion actually gave you access to their training data, you could do this as well. I mean, you don't quite have the training data and the entire architecture, but you know. I already said the nice thing is that it's autodiff, so there are no fancy sampling algorithms to tune. And thirdly, maybe the coolest thing, is that it actually allows you to keep the point estimate. Normally, when Bayesians talk about doing Bayesian inference on deep learning, the argument goes the other way around. They say, well, this MAP estimate shouldn't be the estimate of the neural network, because if you're actually uncertain about the weights, you should maybe think about the average, the mean weights. And that mean will not typically be at the mode, right? It'll be somewhere else. For a long time, historically, that was the argument for doing Bayesian deep learning: your trained network is actually wrong; you want an average over several possible explanations. And that never quite stuck with applied people, because, I mean, by now you know how hard it is to get these networks to work. So once you have a trained network that kind of works.
You don't want to fiddle with it anymore; you want to keep it, even though it's just the mode of a posterior. So here this actually turns into an advantage for this way of doing Bayesian inference: you have a trained network, and you get to use it. If you like the predictions of this network and it achieves whatever 97% accuracy you want to reach, then you're just happy, you keep them, and you just build uncertainty around them. In practice, this has turned out to be a surprisingly powerful insight. So those are the good bits, the reasons to be happy about Laplace approximations. There are, of course, also some downsides. Can someone guess what they are? The Hessian is expensive. Yes, and that's actually the main problem. In fact, the problems are pretty much the same ones that Lukas Tatzel had on his slide last week; I basically copied them down again and rephrased them a bit. First, our optimizer doesn't necessarily end up at an actual local minimum; it might be somewhere else. Secondly, this Hessian doesn't even have to be positive definite, because it's deep learning, so the loss function is not necessarily convex. And thirdly, it's expensive, because it is of size weight space by weight space: quadratic in the size of the weight space. So the answer is going to be the same as in the previous lecture. We just ignore the fact that we're not actually at a mode. We construct an approximation to the Hessian that is guaranteed to be positive semi-definite; spoiler alert, it's the generalized Gauss-Newton matrix that you already heard about last week. And we use various further approximations to get to a tractable form of the Hessian. The generalized Gauss-Newton matrix is actually on the next slide.
Yeah, so on the next slide we'll look at it again, but you can think of it as an approximation to the Hessian, in a way that I'll make a little more precise in a moment. It is guaranteed to be positive semi-definite, because it's the outer product of a Jacobian with itself, with a positive semi-definite matrix in the middle. And, maybe the one thing I have to mention on this slide: if you're computing the GGN for the loss and you have a regularization term, a weight cost, then you can just add that to your GGN, and that's easy. For example, if you have a quadratic weight cost, it's just a scalar times the identity matrix. Then we typically make some further approximations to the structure of this GGN. For example, we might only keep its diagonal, or do block-diagonal approximations like the Kronecker-factored approximation (KFAC) that you heard about in the last lecture. So here is the GGN again. Lukas already talked about this last week, so I don't have to spend much more time on it. The main important bit about the GGN is that it contains this object J, which is the Jacobian of the function f of theta, the input into the classification layer, with respect to the weights of the network. So it's a matrix, right? And in fact, it turns out that the reason to use the GGN is not just that it's always positive semi-definite, and therefore convenient, and also a bit cheaper than the Hessian. It actually has a nice connection to another kind of approximation that we're going to use, namely linearization; this is sometimes called a linearized neural network. So take this function f of theta, not the loss function that we did the Taylor approximation of before, but this function f of theta that is the input into the classification layer, and do a first-order Taylor expansion of it around theta_MAP.
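As a small aside on the structure just described, the positive-definiteness guarantee is easy to see in code. This is a sketch with made-up shapes, and random matrices standing in for the real Jacobian and loss Hessian.

```python
import numpy as np

# GGN assembly sketch: G = Jᵀ H J + τ I, where J is the Jacobian of the
# network output w.r.t. the D weights, H is the (PSD) Hessian of the loss
# w.r.t. the C outputs, and τ I comes from a quadratic weight regularizer
# with prior precision τ. Shapes and values are illustrative only.
rng = np.random.default_rng(0)
D, C = 10, 3
J = rng.standard_normal((C, D))   # df/dθ, shape (C, D)
A = rng.standard_normal((C, C))
H = A @ A.T                       # PSD loss Hessian w.r.t. the outputs
tau = 1.0                         # prior precision

G = J.T @ H @ J + tau * np.eye(D)

# Positive definiteness: xᵀ Jᵀ H J x = (Jx)ᵀ H (Jx) ≥ 0 because H is PSD,
# and the τ I term makes G strictly positive definite.
eigvals = np.linalg.eigvalsh(G)
```

So no matter where the optimizer stopped, this matrix is always a valid covariance inverse, which the true Hessian is not.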
Then this is obviously this term, right? This is where the Jacobian shows up; I hope that's pretty obvious. And if you think of the Hessian of the loss under this approximation, so if our deep neural network were replaced with this function of the weights, then the GGN would actually be the exact Hessian of the loss. Maybe something you have to think about for two or three minutes, but yeah. So this is simultaneously an argument for linearization and for the GGN. Either you say: I like a linear representation of the network, and then the GGN is the natural representation of curvature, because it's the Hessian associated with this approximation. Or you say: I want to use the GGN because it's nice and positive semi-definite, and then you're compelled to use this linearization, because it matches that curvature estimate. One thing that's always important to note at this point is that this linearization does not mean we are replacing the neural network with a linear function of the inputs. This function here is still non-linear in x, right? It still contains this object; it's just linear in theta. So this does not ruin the nice properties of your deep neural network. It still behaves as before: if you plug in some x at theta_MAP, you get the exact prediction of the deep net. You just now have something that's linear in the weights. If we use this linearization and combine it with the Laplace approximation, we actually end up with a completely tractable model, assuming the loss function is tractable. Here this is shown for regression, where it's particularly straightforward. So here's the equation again that we had on a previous slide. If you want to predict the output of the trained network at some test point x_star, then you have to solve this integral that we've now seen several times.
If you assume that this is Gaussian, from the Laplace approximation, and that this is linear, from the linearization, and the observation likelihood is a Gaussian density, because we're doing regression, then this is an integral of a Gaussian against a Gaussian. (There's a d-theta missing here, by the way.) And Gaussians against Gaussians are just Gaussians, so we end up with a Gaussian prediction. The title of the slide says it: if you combine linearization in the weights with a Laplace approximation on the weights, you get a Gaussian model, a Gaussian process, from your deep neural network. A Gaussian process where the mean function is given by the output of the deep neural network, and the covariance function is given by, well, this thing: the Jacobian of the network with respect to the weights on both sides, taken as an inner product with the inverse of the Hessian in the middle. This also works, with a tiny adaptation, for classification, and that's the bit we've seen so far. If we do classification, then the output likelihood is the softmax, as I've mentioned several times. And if we assume the logits are Gaussian distributed, then we have to compute the expectation of a softmax under a Gaussian. That's intractable, but we can use a simple approximation that David MacKay came up with, which is a closed-form expression. And we end up with this prediction; everything is closed form. You can also see this in the code that I sort of skipped over; here it is. You can look at this code afterwards, I'll put it on ILIAS. This is the actual implementation of this kind of process. So we come in with our trained neural network, which now exists, right?
And in this piece of code, Agustinus, together with Felix Dangel, came up with a nice implementation using functorch, which is a functional sub-library of torch. We don't have to understand everything that's going on here, but the important bit is that you define the loss function, and then you compute the Hessian of the loss function and the Jacobian of the trained network with respect to the weights. Then you compute the GGN, which is a little bit of Einstein-summation trickery, but basically it's an inner product between the Jacobian and the Hessian: one or two lines. And the rest is just plotting, so that's basically boring, right? When you make a prediction, you construct the predictive uncertainty from the Laplace approximation and push it into this simple approximation for the softmax, which is this one over the square root of one plus pi over eight times the variance, right? And you get this plot that you've now seen on previous slides already. So again, it's a simple post-hoc procedure. And with that, I'm at a summary slide. These are Laplace approximations for deep neural networks. You approximate the posterior distribution over the weights by finding the mode of the loss function, so training your deep neural network, then constructing a curvature estimate, in particular the generalized Gauss-Newton matrix, and linearizing the network in its weights, so computing the Jacobian of the output function of the network with respect to the weights. These can be combined, and you get a prediction for the output function f of x_star, which is still nonlinear in x but now linear in the weights. And we can marginalize over this distribution, depending on what the loss function is. If the loss function is a quadratic loss function, so the output likelihood is Gaussian, then it's all closed form.
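Both predictive rules fit in a few lines. Here is a sketch with toy numbers, not the lecture's actual notebook; for simplicity, the classification part shows the binary-sigmoid version of MacKay's approximation rather than the full softmax.

```python
import numpy as np

# Regression: linearized-Laplace predictive variance at a test point,
#   v(x*) = J(x*) Σ J(x*)ᵀ + σ²,
# with Σ the posterior covariance (inverse GGN) and σ² the noise variance.
def predictive_variance(J_star, Sigma, sigma2_noise):
    return J_star @ Sigma @ J_star + sigma2_noise

# Classification (binary case): MacKay's probit approximation,
#   E[sigmoid(a)] ≈ sigmoid(mu / sqrt(1 + pi/8 · v))   for a ~ N(mu, v).
def probit_approx_prob(mu, v):
    kappa = 1.0 / np.sqrt(1.0 + np.pi / 8.0 * v)
    return 1.0 / (1.0 + np.exp(-kappa * mu))
```

Note the effect of the kappa factor: with zero logit variance you recover the plain sigmoid, and as the variance grows the prediction is squashed towards one half, which is exactly the bounded-confidence behavior from the plots.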
If the output is a softmax, so the loss function is the cross-entropy, then there is this fun approximation, which is basically as cheap as the Gaussian case, just a different function. And if it's some general predictive function that is neither cross-entropy nor quadratic, then you can still do Monte Carlo, but Monte Carlo in the linearized weight space. You just draw a bunch of weights from the Gaussian distribution that comes out of the Laplace approximation, and sampling those is trivial, right? They're just Gaussian random variables. Then you average a bunch of predictions, rather than running Markov chain Monte Carlo in weight space, which is the expensive bit. So there you have it: uncertainty for your deep neural network. And now the rest of the lecture will be about why. Why would I want something like this, and what would I do with it? Well, I've already given you the first argument. The first one is, as you remember, avoiding overconfidence. If we do this, then at least for the case of classification, with this particular linearization and approximation, we now know that our network will not be arbitrarily overconfident. We'll have what the suggestive picture on the right keeps showing: as you move far away from your training data, the confidence at least stays bounded. But there's actually another reason why you might want to do this, which is that the Bayesian formalism, as you may know from your probabilistic machine learning class, provides functionality to do interesting things. One particularly important thing is to adapt parameters or aspects of the model by checking how well the model marginally predicts the data. And the mechanism for that in Bayesian inference is called evidence maximization. We try to compute the likelihood of the data under some model, and that involves the normalization constant in Bayes' theorem, right?
So the denominator, the P of data. What actually is P of data? Normally it's intractable, right? It's this integral over P of data given weights, times P of weights, d-weights, and we can't do that in closed form for a general posterior. But we now have a strongly reduced, simplified form of the posterior and the loss function: we've linearized the network in its weights, and we've found a quadratic approximation to the loss with the Laplace approximation. So maybe things become tractable again. And indeed they do. I actually had this on a previous slide about the Laplace approximation, kind of here, right? We know what the normalization constant of this thing is: it's just this. And that means that Z of D, our marginal likelihood, is just this: the exp of minus the loss at the trained model, times a bunch of constants which aren't important because we're going to optimize this thing, times the determinant of the Hessian to the power minus one half. So we can use this as a measure of how well our model fits. One more slide, here we go. These are results from a paper by Alex Immer, a PhD student here in Tübingen and in Zurich, and his co-authors, published two years ago during the pandemic. The idea can basically be summarized in one statement: let's just take this approximation seriously. This Gaussian approximation to the network is a combination of linearization and a quadratic approximation of the loss, which turns a deep neural network into a Gaussian process. So let's do with a deep neural network what people do with a Gaussian process: compute the evidence for the data, and pick whichever model has the highest evidence. Here's a picture from their actual paper, just reformatted to fit better into our slides. Here's a regression data set.
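Before we walk through the figure, it's worth writing the evidence out once: under the Laplace approximation, log Z ≈ −L(θ_MAP) + (D/2)·log 2π − ½·log det H. As a quick sanity check of my own (not from the paper): for an exactly quadratic loss the formula is exact.

```python
import numpy as np

def log_evidence_laplace(loss_at_mode, H):
    # log Z ≈ -L(θ*) + (D/2) log(2π) - ½ log det H
    D = H.shape[0]
    sign, logdet = np.linalg.slogdet(H)
    assert sign > 0, "Hessian must be positive definite"
    return -loss_at_mode + 0.5 * D * np.log(2 * np.pi) - 0.5 * logdet

# Check in 1D against the Gaussian integral ∫ exp(-½ h θ²) dθ = sqrt(2π/h),
# i.e. a quadratic loss with curvature h and loss 0 at the mode.
h = 4.0
exact = 0.5 * np.log(2 * np.pi / h)
approx = log_evidence_laplace(0.0, np.array([[h]]))
```

The two terms are exactly the fit term and the Occam term that the figure splits apart: the loss at the mode measures fit, and the log-determinant of the Hessian penalizes complexity.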
So the orange dots are the data, and we've trained a very simple deep neural network with just a few layers. Then, for different choices of the model, in particular varying the number of hidden layers in the network, they compute this quantity, the logarithm of the quantity we had a few slides ago. This term here is the negative loss at the trained model. Because this is a quadratic loss, right, it's regression, this measures, and you can check your intuition here, what does this quantity actually measure? It's a quadratic function, y minus f of theta at x, squared. You said variance; I think you're thinking of the right object, but I wouldn't call it a variance. It's the average quadratic distance between the predictive mean, the black line, and the data, the orange dots. But it's a quadratic form, so it's a bit like a variance, right? You can see that this model down here, the one with three layers, has a much more flexible black line, because it's a bigger neural network, so the black line is closer to the data than in this model, which is a single-layer network, which actually isn't really a neural network at all; it's basically just Gaussian linear regression. So for the three-layer model this loss term will be smaller, because we're closer to the data. But notice that there is a second term in here, the log-determinant of the Hessian. Determinants, first of all, tend to grow as you make the model larger, because there are just more numbers in your matrix. The log-determinant in particular gets larger, because it is the sum over the log-eigenvalues; if the Hessian is positive definite, the eigenvalues are all positive, and you keep adding their logs, so the more you have, the larger it tends to get. But it also, of course, depends on what exactly the model is.
So our Hessian here will of course depend on the shape of the loss function at that point. So it doesn't just keep growing as we add more layers; with more layers you might get a better fit, the loss function might be narrower, and then this term might actually go up. So in this picture you see those two numbers; this is the approximation to the marginal evidence, minus 88 for this model, minus 115 for this model. If you take this seriously, then what it says is: please pick this model. We actually re-ran their code for five different choices. I forgot the x-label here, because I just made this plot an hour ago; it shows, as a function of the number of layers of the neural network, these three quantities. The red curve is the quadratic fit term. The black curve is what's sometimes called the Occam factor, which you might remember from your probabilistic machine learning class, the penalty term for the complexity of the model. And this is the sum of the two. Clearly you can see that around two, three, or four layers is probably good, and maybe two or four might be the best. You see ever so slightly an error bar here; there's a tiny bit of error bar which gets them close to each other, so these three are basically the same. You get the error bar by retraining the network several times, which is annoyingly expensive, but that's how you get your error bar. And maybe one thing you notice is that this Occam factor is not as straightforward as you might think. For Gaussian processes, we're used to the Occam factor going down as you make the model more and more complex, so everything gets more and more penalized. But here, because we are computing the Hessian of the loss function, there's actually a non-trivial interplay between how well this model gets to explain the data and how many degrees of freedom it actually has.
And if you stack layers underneath each other, they might actually constrain each other in surprising ways. So we can't quite expect this function to just come down from up there and keep going down. So how would you use this? Well, if you have discrete choices like this, you just look at this curve and go: I'll pick this one, or maybe this one. If you have a continuous parameter that you can tune, for example the prior precision, then you can do this with stochastic gradient descent. And in fact, I think I have a visualization of this. So here is our deep neural network with the linearized Laplace approximation for this classification problem, and I'm plotting this approximation to the confidence. This is actually a function of one parameter, which I haven't talked much about yet: the prior precision. So maybe let me quickly go back. I kind of dropped it along the way because it seemed so unimportant, but it actually has an effect on the output, this term here, the regularization of the loss. I've chosen this to be a quadratic with a number in front, which corresponds to a prior precision; it's one over the variance of this prior Gaussian, right? And if you move it up and down, you can see that it affects the confidence of the model quite a bit. If I make the precision very small, the model becomes very unconfident, and if I make it large, the model becomes very, very confident everywhere. So there's probably a good choice somewhere in between, somewhere around here. And in fact, that's what marginal likelihood maximization would tell you. There is a way to do this: Agustinus and Alex Immer and some of the other co-authors of these papers wrote a little library, which you can use yourself as well, and which simplifies the whole process even more.
You just take your trained neural network and run this thing, and it tells you what the best choice of prior precision is. There's a line here that says: find the right prior precision. And it does exactly what we just talked about: it computes the evidence of the data under the model, does some gradient descent, and finds this choice. So apparently this plot is the one that looks best from this Bayesian perspective on how to choose these parameters. Now I have to find the slide again that I'm supposed to be at. Right, so you can use Laplace approximations to select models, to select parameters of a neural network. Not algorithmic parameters, not the learning rate of your optimizer, because that happens before you do Laplace. But choices like the number of layers, the width of the layers, the prior precision on the weights. It's not the only way to do this; of course you could do it with cross-validation, or with a validation data set, and so on. But this is one way of doing it that is motivated in a probabilistic fashion. And I think the honest answer is probably that it works quite well for certain things, like choosing the prior precision, and less well for other things, like choosing the number of layers of your network, if only because that requires reshaping the network. So how does it work? We compute this quantity, which we can clearly compute now, because it involves the loss that we're already optimizing and the Hessian that we're already constructing for Laplace. So you might as well make use of it, by picking whichever number M maximizes this quantity. And I've already done the demo; there is a piece of code that does that for you. I think you'll get to play with this, or actually implement these Laplace approximations yourself, in the exercise sheet for this week.
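The underlying idea is easy to show on the one model where the marginal likelihood is exact, Bayesian linear regression. This is my own toy construction with a grid search over the prior precision, not the library's implementation, which optimizes with gradients.

```python
import numpy as np

# Toy data: y = Φ w + noise, with a Gaussian prior w ~ N(0, τ⁻¹ I).
rng = np.random.default_rng(1)
N, D = 50, 5
Phi = rng.standard_normal((N, D))
w_true = rng.standard_normal(D)
sigma2 = 0.25
y = Phi @ w_true + np.sqrt(sigma2) * rng.standard_normal(N)

def log_marglik(tau):
    # Marginal of y under prior precision τ: N(0, σ² I + τ⁻¹ Φ Φᵀ).
    K = sigma2 * np.eye(N) + (1.0 / tau) * Phi @ Phi.T
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * (logdet + y @ np.linalg.solve(K, y) + N * np.log(2 * np.pi))

# Evidence maximization by grid search over the prior precision.
taus = np.logspace(-3, 3, 61)
best_tau = taus[np.argmax([log_marglik(t) for t in taus])]
```

Since the true weights here are drawn from a unit Gaussian, the evidence peaks at an interior precision rather than at either extreme: too small a precision is penalized as overly complex, too large a precision underfits.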
And now I can waste your last 15 minutes with a bunch of other things you can do with Laplace approximations. This is largely going to be a celebration of Agustinus's PhD, I guess, and also of Alex Immer, Erik Daxberger, and Emtiyaz Khan, who have all worked on these linearized, Laplace-approximated Bayesian deep neural networks, which have become one school of thought for how to do uncertainty in deep learning: the one motivated by low cost rather than by faithfully representing the full posterior. It has solidified into a large number of publications over the last few years; the earliest one on this list is from just over two years ago, and since then people have been busy writing lots of papers on this and also building this library. They all boil down to this idea: take the linearization and the quadratic approximation seriously, use them as a black-box tool that turns a deep neural network into a Gaussian model, and then do whatever you do with Gaussian models. A first thing we could do is return to the example from the very beginning of the lecture. I made this argument that deep neural networks without uncertainty on the weights have the problem that, if you go far away from the training data, you get a classifier that is arbitrarily confident. And I said that if you put any Gaussian measure on the last-layer weights, linearize, and do this approximation, then you get something that isn't arbitrarily confident; it will have a finite confidence that is bounded away from one. But you could say: maybe that's not actually what I want, right? This thing that I end up with as I move far away won't actually be fully uncertain. It won't be the maximum-entropy uncertainty, one over the number of classes. It'll just be something.
So if you zoom out of this training regime and move very, very far away, there is still structure in this predictive distribution. It won't go arbitrarily close to one, as the non-Bayesian network would, but it still has this structure. You can also see this in the little visualization I showed before: these blue lines go somewhere, but they don't go to one over the number of classes, which is this constant line here, at one third actually. So how do we fix this? Well, Agustinus and Matthias Hein and I came up with a fix that is so simple it's actually kind of embarrassing, to be honest, but it works surprisingly well. Okay, maybe it's not so embarrassing, because it has an interesting intuition, so let me tell you a story about it. If you think about how to fix this, what you would have to do to your network to make it work, assuming we stay within this particular choice of approximation, then you can convince yourself that there is no way to fix it by just adding a bunch more ReLU features. No matter how many ReLU features we add to this network, as long as there are finitely many of them, we will always have this property: eventually, as you move far away from the data, you're in a linear regime, and then you have a linear function divided by a linear function, which goes to a constant. So either you'd have to fix something about the weights of these features and make them grow as you move away from the data, but that seems really wrong; then the weights of the network would have to depend on the input, which breaks the idea of a deep neural network. Or, the only other way, you add an unbounded number of weights. And actually, this is kind of intuitive.
If you move very far away from the data, then what this says is: imagine I've given you a finite data set, right? Of course you can only train a finite number of ReLU features with that, because you have a finite amount of training data. But the world out there is perhaps unbounded, so in principle you could get more and more training data, and if you did, you could probably train more and more ReLU weights. And if you had a way of producing training data out here, far away from this set of four classes, then maybe the real world would contain many more classes, or just many more other kinds of structure, and that would compel you to put ever more weights in there. I actually find this quite intuitive, and it also fits neatly with the idea of Bayesian non-parametrics, or kernel machines, right? No model should ever have a finite number of degrees of freedom, really. In practice it's okay, because we only ever have finite training data, but there's an unbounded complexity left that we just haven't encountered yet. And the funny thing is, it turns out that if this is the only thing you want to fix, you actually can. You can keep track of this infinite number of things essentially for free, and it works as follows. We say that the actual function we care about, call it f tilde, is the sum of the function we get from our finite amount of data, f theta, which of course lives on a finite domain, because what else could it be, and some other function, f hat, that is added on top. Why added? Well, because f theta is already a linear function in the weights anyway, so we might as well add some more ReLU features. And this f hat keeps track of all the non-parametric stuff that we just haven't seen yet.
And we're going to assume that this f hat looks a bit like the f theta we are already constructing: it consists of a weighted sum of ReLU features. It's just that we don't know what those features are, where they are, or what their weights are. So we put a really simple prior over them: we assume that the number of ReLU features grows linearly as you move away from the training data, and that their weights are draws from a Gaussian distribution. If you do that, there is a construction for such processes that ends up being a Gaussian process. You've had your probabilistic machine learning class, most of you, so you know that there are ways of constructing Gaussian processes as infinite sums over Gaussian-distributed weights. That's exactly what you can do here. If you add more and more features and make them arbitrarily dense, then asymptotically you get a Gaussian process that, for this particular choice of ReLU features, happens to have this shape. It's called an integrated Brownian motion Gaussian process. You've already encountered it in the earlier lectures on ODE filters, when the probabilistic simulation methods were constructed; now it just goes in both directions, away from zero to the left and the right. And if we construct this right, and you have to be a little careful about the math, then you can imagine a situation where all your training data is in here, and this is the asymptotic regime, very far away from the training data. You can imagine that this bit in here has negligible effect on your posterior, because all the training data is in the region where this process contributes approximately zero anyway.
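You can check this Gaussian-process limit numerically. Here is a sketch with my own discretization: ReLU kinks on a dense grid, weight variance equal to the grid spacing, compared against the integrated-Wiener (cubic) covariance k(x, x') = m³/3 + m²·|x − x'|/2 with m = min(x, x').

```python
import numpy as np

# Kinks c_i on a grid over [0, L]; weights are independent Gaussians with
# variance equal to the grid spacing, so the covariance is a Riemann sum
# that converges to ∫₀^L relu(x-c) relu(x'-c) dc as the grid gets dense.
L, n = 10.0, 200_000
c = np.linspace(0.0, L, n)
dc = c[1] - c[0]

def cov_relu_sum(x, xp):
    # Σ_i Var(w_i) · relu(x - c_i) · relu(x' - c_i), with Var(w_i) = dc.
    return dc * np.sum(np.maximum(x - c, 0.0) * np.maximum(xp - c, 0.0))

def k_cubic(x, xp):
    # Integrated-Wiener covariance: the closed-form value of the integral.
    m = min(x, xp)
    return m**3 / 3.0 + m**2 * abs(x - xp) / 2.0

approx = cov_relu_sum(3.0, 7.0)
exact = k_cubic(3.0, 7.0)
```

The variance of this process grows cubically with the distance from the origin, which is exactly the n-cubed scaling that shows up in the backstop below, and the same kernel family as the ODE-filter priors.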
So that means your posterior remains a direct sum of the trained deep neural network and your Gaussian process, which is actually still at the prior. The posterior of this model, asymptotically, or approximately, is the trained deep neural network plus the prior, which is just this thing. If you do this, then first of all you can show a theorem saying that asymptotically my uncertainty becomes the maximum-entropy thing, one over C. And doing this doesn't cost anything, because I just add uncertainty that depends on how far I am from the training data. I also have a picture for this. We can just switch on this added uncertainty here and find the right scaling, and you can see that these blue lines now go towards the asymptotic uncertainty; we can make them get arbitrarily close to it, essentially. So how does this work? You take your training data, you train your deep neural network, you make some approximations, you get uncertainty on your weights. Then you also take the training data and find its centroid: you compute the mean of the training data and the covariance of the training data, which is easy to do, it's linearly expensive. That tells you where your data is and how broad it is. You use the breadth of the data to define a length scale for this approximate Gaussian process. And that just adds a function onto the uncertainty that says: if you are n widths of the training data away from the training data, then add n cubed to the uncertainty, and you get this calibrated behavior. It's really just a backstop, right? It just means that if you're very far away from the training data, you get arbitrarily low confidence, and nothing else. It doesn't change the point prediction, except for pulling it back towards one over C. But it also doesn't cost anything, so why not do it?
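Here is a sketch of that backstop with made-up numbers and scales (the constants are my assumptions, not the paper's): estimate the data's centroid and spread, grow the added variance cubically in the standardized distance, and push it through the probit-scaled softmax.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.standard_normal((500, 2))   # toy training inputs
mu = X_train.mean(axis=0)                 # centroid of the data
Sigma = np.cov(X_train.T)                 # spread of the data
Sigma_inv = np.linalg.inv(Sigma)

def added_variance(x):
    # n widths of the data away → add roughly n³ to the logit variance.
    n = np.sqrt((x - mu) @ Sigma_inv @ (x - mu))
    return n**3

def softmax_probit(logits, v):
    # Probit-style scaling of the logits before the softmax; large v
    # squashes the logits towards zero and the output towards uniform.
    kappa = 1.0 / np.sqrt(1.0 + np.pi / 8.0 * v)
    z = np.exp(kappa * logits - np.max(kappa * logits))
    return z / z.sum()

logits = np.array([4.0, 1.0, -2.0])       # a confident point prediction
near = softmax_probit(logits, added_variance(mu + 0.1))
far = softmax_probit(logits, added_variance(mu + 100.0))
```

Near the data the added variance is negligible and the confident prediction survives; far away the confidence collapses to the maximum-entropy value of one over the number of classes, without ever touching the trained network.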
Another thing you could do: so that was the asymptotic regime, very far away from the training data. But now we could think about what to do close to or inside the training data. What could we do with uncertainty to improve the predictions there? This is sometimes called out-of-distribution training, and you might have heard about it in deep learning classes. Even if you are close to the training data, there may be some inputs, which you might call adversarial inputs, on which the network has high confidence in the incorrect class label. What you would like to do in these settings is tell the network: this is a region of the input space where I just don't know what the right label is, because I don't have training data for it, but I would like you to be uncertain at that point, right? To be able to do that, we need a way of training the uncertainty of the network without changing the point estimate. And that's exactly what the Laplace approximations are good for. Why? Because they separate the point prediction and the uncertainty into the two sufficient statistics of the Gaussian distribution, a mean and a covariance. So if we have a trained network that provides the mean, maybe we can fiddle around with the covariance to make it uncertain in the regions where we want it to be uncertain. And the way this works is by adding units to your deep network that don't do anything to the point prediction, but which add uncertainty just by being there. The plots on the right are from Augustinus, and I always found them a bit difficult to understand, so I'm going to draw my own picture. Say I give you a ReLU feature that looks like this; this is the standard ReLU, so this is just the feature, phi of x. Now if I multiply this with a weight, I have an output to the next layer, right?
If I decide that this weight is zero, then this function just looks like this, obviously, right? Because zero times anything is zero, okay? So this won't add anything to my point prediction. But if you think about the uncertainty associated with this feature, then the variance is phi of x transpose, times whatever the variance of this w is, times phi of x. So it looks like some kind of trumpet shape, right? Well, the variance looks like the square of this, but the standard deviation, the error bar, looks like this. So what this gives me is uncertainty that I get to move around, because I can fix the parameters of this ReLU. I can shift it left and right, in higher-dimensional spaces I can rotate it and move it, and so on. And by changing the numbers in here, I can scale this up and down, but it won't change the point prediction of the network. And of course I don't have to use a ReLU feature; I could use something else as well. I could put a Gaussian there, which would be localized. I could use a triangle-shaped thing, whatever features you like, or tanh features, which go to a constant, and so on. The way this gets translated into code is this picture here. You take a particular layer of the network, let's call it L, and you expand the weight space of that layer, so that you have more units in layer L, all of which get an outgoing weight that is set to zero, so they don't affect the prediction at all. But because we're introducing new parameters, theta hat, into the network, there will be new rows and columns in the Hessian of the Laplace approximation. And those rows and columns contain numbers, even if you just track the diagonal, that you can tune. The only thing left to think about is how to tune them. And that actually requires you to think a little bit about your data, right?
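A tiny numerical version of this picture (my own, with made-up numbers): one extra ReLU unit whose output weight has mean zero leaves the prediction untouched, while its weight variance contributes phi(x)² of uncertainty fanning out away from the kink at c.

```python
import numpy as np

def extra_unit_effect(x, c, w_mean=0.0, w_var=1.0):
    """One added ReLU unit phi(x) = max(0, x - c) feeding the output through
    a weight with mean zero (so the point prediction is untouched) but
    nonzero variance (so it contributes uncertainty w_var * phi(x)^2)."""
    phi = np.maximum(0.0, x - c)
    mean_contrib = w_mean * phi      # identically zero when w_mean == 0
    var_contrib = w_var * phi ** 2   # grows quadratically beyond the kink
    return mean_contrib, var_contrib

x = np.linspace(-2.0, 4.0, 7)
m, v = extra_unit_effect(x, c=0.0)
```

Shifting c moves the trumpet around, and scaling w_var scales it up and down, exactly the knobs described above.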
You basically need to write down a loss function that says where you want to be uncertain. One way to do this is what's called out-of-distribution training. If you've been to a talk by Matthias Hein, you've probably heard about it, because it's one of the things he's a world expert on. One simple way to do this, for the example of image classification: let's say we've trained on pictures of dogs and cows. Then you take another data set, some benchmark, say ImageNet, from which you ideally remove all the dogs and cows and leave everything else in, and you say: those data are just different. They are pictures of something else. They still look like real-world images. You could also generate them in a different way; you could use a generative model, or some other data set. Sometimes people even use noise, but noise is a bit dangerous. And you just say: those are images that are not dogs and cows. I'm not going to tell my network what they are; they could be any of 100,000 other classes. I'm just going to say this is a third class, which I don't know. So the simplest thing I could want from my network is to be uncertain in those regions. And you can use that to define a loss function that says: for the two classes I'm trying to learn, the network should be uncertain in that region. That becomes an extra term in your loss function, right? You have your in-distribution training data that you train on as before, which affects the weights of the trained, point-estimation part of the network. And then there's an extra term in the loss function, the out-of-distribution (OOD) loss, that says: these are the inputs where I want you to be uncertain. This term shifts around the entries of the Hessian, for which you of course compute a gradient, such that you get uncertainty where you want to be uncertain.
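The OOD loss term can be sketched like this (an illustrative version; the exact form used in practice varies): the cross-entropy between the network's prediction on OOD inputs and the uniform distribution over the C classes, added to the usual training loss with some weight lambda.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ood_loss(logits_ood):
    """Penalize confident predictions on OOD inputs: cross-entropy between
    the predictive distribution and the uniform distribution 1/C. It is
    minimized (value log C) when the prediction is exactly uniform."""
    C = logits_ood.shape[-1]
    return -(log_softmax(logits_ood) / C).sum(axis=-1).mean()

# total loss = cross-entropy on in-distribution data + lam * ood_loss(...)
```

In the scheme above, this term's gradient would be taken with respect to the parameters of the added units, so the point estimate on the in-distribution data stays put.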
And that, of course, can be within the training data. You can carve out bits of the input regime where you just want to put a big blob of uncertainty, essentially. And I'll leave it at that. So here's a final slide that I'm going to spend a few minutes on. You can also use this time to give some feedback while you're listening to me. First of all, the first main takeaway was that uncertainty actually matters in deep learning. It's not just some religious dogma that you want to be uncertain about everything because Bayesianism is nice; not being uncertain about the parameters of a machine learning model, in particular a deep neural network, is actually bad, because it causes pathologies. We looked at one particular one, which is that ReLU classification networks have the property that if you move far away from the training data, they predict one of the classes with arbitrarily high confidence. And it turns out there's a very simple fix for this: you linearize the network, you put a Gaussian distribution on the last-layer weights, you approximate the output of the softmax with a simple reweighting, and you heal, at least partly, this problem of overconfidence. If you're already doing that, you might as well think about which Gaussian distribution to put on the weights. And I've argued, admittedly relatively vaguely, that one nice thing to do is to construct a curvature estimate of the loss function at the mode of the loss, a Laplace approximation. Why is this a good idea? Well, because you can do it with autodiff, so it's quite reliable. You can do it post hoc, so you can even apply it to a trained neural network that someone else has trained for you, as long as you have access to the training data, the loss function, and obviously the network. And it won't change the point estimate.
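Putting the pieces of that fix together in a minimal sketch (my own toy example, with an assumed feature map and hand-set weights standing in for a trained network): a Laplace/GGN covariance over the last-layer weights of a binary classifier, and the probit approximation of the Gaussian-averaged sigmoid as the reweighting of the output. Consistent with the "at least partly" above, far-away confidence is dampened but not driven all the way down to the maximum-entropy level.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def last_layer_laplace(Phi, w_map, prior_prec=1.0):
    """Laplace covariance over the last-layer weights of a binary classifier:
    inverse of the GGN Hessian of the log loss plus the prior precision."""
    p = sigmoid(Phi @ w_map)
    H = Phi.T @ (Phi * (p * (1 - p))[:, None]) + prior_prec * np.eye(Phi.shape[1])
    return np.linalg.inv(H)

def probit_predict(phi, w_map, Sigma):
    """Probit approximation of the Gaussian-averaged sigmoid:
    sigma(mu / sqrt(1 + (pi/8) * var)), the reweighting of the output."""
    mu = phi @ w_map
    var = phi @ Sigma @ phi
    return sigmoid(mu / np.sqrt(1.0 + (np.pi / 8.0) * var))

# Hypothetical setup: features phi(x) = [x, 1], weights set by hand.
x_train = np.linspace(-1.0, 1.0, 50)
Phi = np.stack([x_train, np.ones_like(x_train)], axis=1)
w_map = np.array([3.0, 0.0])   # stands in for the trained point estimate
Sigma = last_layer_laplace(Phi, w_map)

p_near = probit_predict(np.array([0.5, 1.0]), w_map, Sigma)   # inside the data
p_far = probit_predict(np.array([50.0, 1.0]), w_map, Sigma)   # far outside
```

Near the data the probit-adjusted prediction barely differs from the plain sigmoid; far away, where the plain sigmoid saturates at 1, the predictive variance pulls the confidence down.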
It's actually adding something, the Hessian, that's completely orthogonal, if you like, to the point estimate. And this is maybe nice because most people who train deep networks don't want someone to come in afterwards and say: oh no, you've done it all wrong, I'm going to retrain everything and mess up your classifier. No, you want to keep the classifier and just add uncertainty to it. And that's what this class of approximations does for you. They can be particularly well combined with a linearization of the network in weight space. That linearization allows us to do closed-form computations that can maybe be best summarized in the sentence: they turn the deep neural network into a Gaussian process. That's essentially the title of the paper by Mohammad Emtiyaz Khan and colleagues that introduced this, on how approximate inference turns deep networks into Gaussian processes. And this can be done to pretty much any deep neural network, if you're willing to compute the Jacobian and the Hessian of the loss function. For example, you can use this linearized Gaussian process approximation to do evidence maximization, that is, to find parameters of the architecture, which you might want to use not just for uncertainty quantification but also to fit the network. You can also use this idea of Laplace approximations to add functionality to the network that you might want. For example, asymptotically calibrated confidence, by measuring the distance to the training data and adding a term that keeps growing as you move far away. There's an admittedly somewhat handcrafted argument for why this is a good thing to do: you can think of it as infinitely many weights that keep getting added as you move away from your data. You can also train uncertainty within the input domain by adding units to your network that are deliberately set to have expected weight zero, so mean weight zero, but curvature around it.
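The evidence maximization mentioned here rests on the Laplace approximation of the log marginal likelihood, log p(D) ≈ -loss(θ*) + (d/2) log 2π - ½ log det H, where loss is the negative log joint and H its Hessian at the mode. A sketch on a toy linear-Gaussian model, where the approximation happens to be exact, with all particulars chosen for illustration:

```python
import numpy as np

def laplace_log_evidence(neg_log_joint_at_map, hessian_at_map):
    """Laplace approximation to the log marginal likelihood:
    log p(D) ~= -loss(theta*) + (d/2) log(2 pi) - 0.5 log det H."""
    H = np.atleast_2d(hessian_at_map)
    d = H.shape[0]
    _, logdet = np.linalg.slogdet(H)
    return -neg_log_joint_at_map + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

# Toy model: y_i ~ N(w * x_i, sigma2), prior w ~ N(0, alpha).
rng = np.random.default_rng(1)
x = rng.normal(size=20)
sigma2, alpha = 0.5, 2.0
y = 1.3 * x + rng.normal(scale=np.sqrt(sigma2), size=20)

H = (x @ x) / sigma2 + 1.0 / alpha          # Hessian of the negative log joint
w_map = (x @ y / sigma2) / H                # MAP (= posterior mean) of w
loss = (0.5 * ((y - w_map * x) ** 2).sum() / sigma2
        + 0.5 * len(y) * np.log(2 * np.pi * sigma2)
        + 0.5 * w_map ** 2 / alpha + 0.5 * np.log(2 * np.pi * alpha))

log_Z = laplace_log_evidence(loss, H)
```

Maximizing log_Z over hyperparameters like alpha (the prior variance) is the evidence maximization referred to above; for the linearized network, H is the GGN of the loss.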
That can then be trained, for example, to give higher uncertainty on data that is patently different from the training data. Maybe a meta takeaway from this is something that has also been a thread running through the entire lecture course so far: we quite deliberately talk about probabilistic numerical algorithms or probabilistic training, and the lecture course next term that I'm going to teach is called probabilistic machine learning, not Bayesian numerics or Bayesian machine learning. Because quite often what we're looking for is just the fact that there is a probability measure to operate on, and the strict, mathematically precise question of the full posterior is maybe a level too high. So if people have told you that being Bayesian is expensive, they may be correct, but being probabilistic doesn't have to be expensive. And being probabilistic can actually solve most of the problems, like overconfidence, without the entire cost of finding the perfect posterior and tracking it, which is maybe not the thing you want to do anyway, because you never wrote down a prior you believed in to begin with. Doing full Bayesian inference, actually tracking the exact shape of the posterior, only makes sense if you actually believe in the model, if you believe in the prior and the likelihood. And in particular in deep learning, nobody believes that the deep network defines a correct prior, right? It's just a convenient parameterization of a function. So there's not really a need to be super cautious about the posterior, but there is a need to put a probability measure on the weights. And that's what the Laplace approximations allow you to do. If you like this kind of view, then there will be much more of it in the probabilistic machine learning class. I realize that almost none of you are going to be in probabilistic machine learning next term, which is maybe good.
So I got to do it today, and you got the fast insight into it in a single lecture. For those of you who still want to take that class, there will be a significant number of lectures in the probabilistic machine learning class where we very carefully take the step from Gaussian processes towards deep learning and this kind of functionality. Okay, thank you very much.