Hi, good morning. So Thursday morning, we're clearly now in the second half of term, and it also feels very summery outside. And I can't start today's lecture with feedback, because I was stupid last Monday and forgot to switch on the forum on ILIAS. So only three people in the end actually managed to give some feedback, after pointing out to me that I had forgotten to switch it on. And they didn't leave much detailed feedback, and from three samples I can't really do much. So instead, I'm just going to jump right in and hope that today's feedback actually works. So on Monday, we finally moved away from regression and started talking about classification: solving problems that look like this, where you have a bunch of training points, input-output pairs, where the output label is not a real number but, in this case, a binary or potentially discrete integer-valued output. And we discussed a little bit that there are various ways of thinking about such problems, both from a generative perspective and from a discriminative perspective. The discriminative one is to assume that there is some underlying function that just maps from inputs to outputs, whose evaluation tells us something about the label. And we realized that we can phrase this problem in the language of probabilistic inference as always: using a prior over the latent quantity we don't know, the function mapping from input to output, and a likelihood that links the thing we don't know to the thing we'd like to know, the class label. And because we've grown accustomed to priors over functions in the form of Gaussian processes, we reused that prior, relatively unmotivated: that's the thing we've done so far, so let's just keep using it.
And we decided to describe a likelihood, which is maybe a somewhat hacky way of turning a function over the reals into a probability over, well, the zero-one range, by taking that function and shoving it through a logistic, or a sigmoid, or a binary logistic, or whichever other name you want to use for these functions that take the real line and monotonically squish it into the zero-one interval. So that's a generative model, if you like, for the labels Y, not for the Xs; the Xs just go in as inputs, but for the labels at any input location. And in principle, it allows us to do Bayesian inference. We can multiply this prior by this likelihood, and we hope that we can somehow normalize, and that will give us a posterior over the function. But we discovered very quickly that there's a problem when we do this, which is that prior and likelihood are no longer conjugate. They are not conjugate exponential families. What that means is that the posterior will be of a functional form that is not easy to write down in a nice, compact, joint class of probability distributions like the Gaussian or something else. So I showed you this picture to bring this message across. We start with a joint Gaussian prior over function values, in this case two different function values, F1 and F2, and then we observe a label at the first location. Observing a label means that there is a likelihood term for F1: in the label, that is a Bernoulli probability to see the label or not; as a function of F, this likelihood looks like this sigmoid. Then when we multiply this Gaussian by this sigmoid, we get, in principle, this kind of distribution, which I can actually make a picture of, because I have a computer that can make this picture. But it doesn't have a closed form. There is no simple analytic expression that I can write down and say, this is what this looks like.
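As a tiny aside, the squashing step itself is a one-liner; here is a minimal sketch in plain Python (the function name is my own, not the lecture code):

```python
import math

def sigmoid(z):
    # logistic function: monotonically squishes the real line into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# large negative inputs go toward 0, large positive toward 1, and 0 maps to 1/2
print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))
```

It is also anti-symmetric in the sense used later in the lecture: sigmoid(-z) equals one minus sigmoid(z).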
Well, beyond the fact that of course I can multiply the prior with the likelihood; I can just write that down. But that doesn't give us much, because we don't want to do this with two function values, we want to do it with maybe a million function values. Yes. So the question is: here I'm showing you a joint distribution with respect to F1 and F2, and what is the posterior actually over? Is it over F1 or over F2? We'll discuss this more today, but in principle, I can just write down a joint distribution over the things that I care about and the things that I get to see. Well, that's Bayesian inference, right? It's always a prior over the thing I want to know, times the probability for the thing I see given the thing I want to know, so that's a likelihood. And the thing I would like to know, the latent stuff, could contain only this function value F1, but it could also contain the function value F2, because I care about it. So it's sort of train and test, if you like, right? F1 is the train point, F2 is the test point. And actually, in a few slides from now, we'll think about how to relate function values where we have observed to function values where we haven't. And then we made a plan. We said, okay, this won't be closed-form tractable, so we'll do something that I've talked about so often now that it might seem natural, but outside of this lecture hall might not be so natural: we'll find the mode of this distribution with optimization, running some fancy optimizer. And then maybe at some point we'll get to a point where we can draw this ellipse around it and use that as a Gaussian approximation. And then I went through some math that I'm going to show you again in a moment, and also showed you this code.
So on Monday I did this: I created this dataset, and then we wrote down the, actually, let me do both things at the same time. Because it's early in the lecture, I'm going to flip back and forth, which is not good, but I'll stop doing it later. We said: what does this optimization actually entail? Well, we write down the prior and the likelihood, and then we think about what it takes to optimize them. So this is just the plan, right? That's what we're going to do. We'll write down the log posterior for the things that we care about, the function values, given any parameters, and we optimize this log posterior to find its mode. And then the plan is, and we haven't done that yet, but we'll do it today, to construct a Taylor approximation at the mode, which involves a quadratic term, no linear term because we're at the mode, and a constant. And then if we go back to probability world, if we take e to the minus this expression, that gives us a Gaussian, and we'll construct plots with that Gaussian. So how do we actually do that? Well, we first think about what this posterior actually looks like; yeah, the order of these slides is not perfect, so first the picture again. So we then ran, after having done some math, an actual piece of code that did this, by defining a prior. Actually, the prior is defined in just these two lines over this data; here's a sample from this prior. And then we set up an optimization problem by writing down the log likelihood, and the prior, which also allows us to evaluate the log prior density function, then adding the two together, likelihood and prior, and putting a minus in front, because the optimizer minimizes and we want the maximum. So we put a minus in front and compute the minimum.
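In miniature, for a single latent function value f with prior variance k and one label y in {-1, +1}, the loss we just assembled looks like this (a sketch with made-up names, not the actual lecture notebook):

```python
import math

def neg_log_posterior(f, y, k=1.0):
    # log prior of a zero-mean Gaussian with variance k, up to a constant
    log_prior = -0.5 * f * f / k
    # log likelihood of a +/-1 label under the sigmoid link:
    # log sigmoid(y * f) = -log(1 + exp(-y * f))
    log_lik = -math.log1p(math.exp(-y * f))
    # minus sign in front: the optimizer minimizes, we want the maximum
    return -(log_lik + log_prior)
```

At f = 0 the likelihood term contributes log 2 regardless of the label, and moving f in the direction of the observed label lowers the loss.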
And then I did this little spiel with you where I showed you how to optimize such functions with gradient descent, no stochastic part in this case. And I did all of this, maybe a little bit of a silly game at the end of the lecture, to point out that this part is a bit tedious. So we first set some learning rate for gradient descent that made it unstable, so it didn't converge; it went to like 10 to the 308, or inf. Then I set it smaller, but then it took forever. Then we realized that I can just-in-time compile some part of that code. Maybe you've tried that for yourself afterwards, and then maybe you've also discovered that this doesn't always work. You have to be really careful to make the code work, and sometimes you just have to restart the Jupyter notebook, God knows why. And then at some point it sort of worked, right? And for this simple problem, doing that took us like five minutes, but it was the five minutes at the end of the lecture, which nobody likes to spend. So that gave us a feeling for how annoying it can be to optimize with gradient descent. But nevertheless, we found a point where the loss is quite low. By the way, this loss goes below zero. Why? So you say the likelihood could be zero, but when would the log likelihood be zero? That's when the probability for the data is one. But keep in mind that this involves a probability density, and probability densities can be higher than one, for example over a compact domain, though it doesn't even have to be compact: Gaussian density functions can have locations where they are higher than one. So it's not like this loss has to be bounded below by zero. It's just some function that goes down.
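The learning-rate trouble is easy to reproduce on even the simplest loss; a sketch (the quadratic below just stands in for our real loss, and the names are mine):

```python
def gradient_descent(grad, x0, lr, steps):
    # plain gradient descent: x <- x - lr * grad(x)
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# toy loss L(x) = 2 x^2 with gradient 4 x; the update is stable only for lr < 0.5
grad = lambda x: 4.0 * x
good = gradient_descent(grad, x0=1.0, lr=0.1, steps=100)  # shrinks toward 0
bad = gradient_descent(grad, x0=1.0, lr=0.6, steps=100)   # oscillates and blows up
```

The same phenomenon, just harder to see, is what made the first learning rate on Monday shoot off to 10 to the 308.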
So we're now at this point, and then I constructed, well, half of a Gaussian process, because we don't have uncertainty yet. But I wanted to make the point that we've now constructed the mean function of the Gaussian process by constructing this object, which we'll talk about on the next slides, the ones that I just showed you too early. And then I made a plot at the end of last week's lecture that looks like this. So, by the way, what am I plotting here? I'm plotting the data; that's the dots in the background that you can barely see, because I've made this a bit too dark. And I'm plotting this function that lives in the latent space, so it's a real-valued function between minus infinity and plus infinity, and then I take the sigmoid of it, which gives us this function between zero and one. Okay, so far, that's the picture we ended up with at the end of the last lecture. Ah, so this is a good question: doesn't this give uncertainty? This kind of gives uncertainty, because it gives a probability at each point, a confidence in each label. So this is a very common question about classification, and it's unfortunately a little bit tricky, because this really is a probability. There are a few ways to answer this. If you think about lecture three again, where we did beta inference about people wearing glasses in the room, this is a similar situation where in the end we learn a probability, and the probability is between zero and one. So you could say that is some kind of uncertainty, because if it's not zero or one, then we're not certain whether the next person is going to be wearing glasses. But we actually learned a probability distribution over that probability; we were uncertain about what the probability is. And this is like the steroid version of this, right? With a function that assigns to every pixel in this image one of these probabilities that we would like to be uncertain about.
So at each of these points, we would like to be able to say: how much do I actually know about the value of the label? Because, for example, what this thing says is: at this particular point, I'm exactly 50% certain that if I now draw more data, I'll get a plus or a minus one. While at this point up there, I'm exactly, whatever that is, I don't know, 33% certain that this is a red point. So in particular, we'd like to be able to say: well, this is where the data is, so I'm probably pretty certain around where the data is; and there is no data over here, so I'm uncertain in a different kind of way. Sometimes people use the words aleatoric (or aleatory) and epistemic uncertainty for these concepts. But these are very loaded, complicated philosophical terms that often get misconstrued. So maybe we'll just build a few more of these objects, and today at the end, we're actually going to have an uncertainty over these, and then I can maybe give you a better intuition for what this is. But in particular, what we can't do is draw joint samples of this function. There's just one such function, this blue and green thing. By the way, I heard that this is apparently the maximally bad color map for colorblind people, and given the number of people in this room, there are probably five of you here who can't tell these colors apart. I'm sorry; I should probably have used a different one. There was one more question, no? No, okay. So how do we actually construct this map? Maybe I went too fast over this bit where I make this function that takes the labeled training points and turns them into a posterior mean function. So here's how this works. Let's assume for a moment that we have already found our full Laplace approximation. And we only need to assume that so that I can talk about probability density functions.
And we could even say: let's say someone magically gives you a probability distribution over the value of this latent function F at the training locations. So F underscore X is a vector whose length is the number of training data points; it's a finite object. If someone magically gives you a probability distribution of Gaussian form over these, we could get this from our Laplace scheme, which is to find the mode, construct the Hessian at that point, and invert it. But you could have some other means of constructing a Gaussian distribution. Then we could ask: well, what does this tell us about the function values at some other locations? And here I'm using lowercase x, which is sort of the old-style notation: capital X, lowercase x. So lowercase x is the test points, and capital X is the training points. Then we can actually ask this joint model for the latent function, this Gaussian process object, this function-valued object, and say: well, what's the probability for these, given that I have a distribution over f at X? And what we do here in this line is just probability theory: this is the product rule. We can write a joint as a product of a conditional and a marginal. And now we plug in, and here is the bit where we just begin to trust the approximation. Actually, there should be the full posterior here, but we pretend that we have a posterior, which we've constructed with Laplace. That's why there's a Q rather than a P, and because there's a Q here, we can write an equals sign. So if this likelihood is correct under our model, and this is our approximation of Gaussian form, then what this is is a marginalization over a joint Gaussian distribution. And for those, we have these linear algebra results that we've had on multiple slides so far.
So if you have a variable that is related in a joint Gaussian form to another variable, and that variable has a Gaussian marginal, then we can compute this integral; we can sort of look up what the conditional looks like. This left expression would be the probability distribution for the test points if we knew what the value of the function at the training points actually is, if we had observed it; that's just the standard Gaussian linear algebra result, the conditional distribution. And then we multiply with this marginal distribution over the value at the training points, do this integral by plugging in this Gaussian here, and then the slide about Gaussian marginals tells us that this is the posterior. And this looks a lot like a Gaussian process posterior from some observations, and actually it kind of is, from some kind of imagined observations. It's as if we had seen the function value at the training locations with noise given by this approximate covariance matrix. And in fact, do I have, yeah, I have two slides for this. There's a typical thing one could compare this to, which is if you wanted to compute the expected value under some true posterior. So if you had some magical way of finding the actual posterior, if you knew what it looks like, then you could ask for something else. We don't know what the full posterior over the test points is, because it is not a Gaussian distribution. But we could ask: given this relationship between the two that we have up here, given this line here, what is the expected value of f of x under this joint relationship, against some unknown probability distribution here? That would be this expression. Why? Because f of x shows up here only once, in a linear term. So if you compute an expected value, we can pretty much take the expected value inside and just compute this.
So if that's the case, then we can see that this expression looks a lot like this mean, where we've just replaced the true expected value, which we don't know, with this mode that we can compute with an optimizer. So this is a motivation for why we do this: we're approximating the expected value of this distribution. By the way, the same thing goes for the variance of the function values at each of the test points. You can do a very similar kind of computation that shows that these two expressions are almost the same, except that we're replacing the actual variance of the latent function at the training points with some approximated variance, which we compute by Laplace. Yeah. So, and again, true here means true in the sense of the first few slides, true in the sense of this model. So this red thing is p of f1 and f2 given the data; that's the object that we can't compute. Why is this? So what we do is we say: this red thing, I cannot compute, so I will approximate it with a Gaussian distribution, this black thing. Here, I plot it: this gray thing underneath, here and there, is going to be the Laplace approximation, this Gaussian approximation. And what the next few slides show is that this mode here, which is also the expected value of the Gaussian, is a replacement for the expected value of this red distribution, which is the true marginal of the distribution over here. So they are not the same, but we're trying to compute analytically the same objects; we're just replacing the true posterior distribution, whose expected value we don't know, the line down here, with this line, which is the expected value under the Gaussian approximation. So that's the motivation for doing this.
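The mean function we are building can be sketched in a few lines. This sketch assumes an RBF kernel on scalar inputs and just two training points, so that the linear solve fits in Cramer's rule; all the names and the pretend mode values are mine, not the lecture code:

```python
import math

def rbf(a, b, ell=1.0):
    # squared-exponential kernel on scalar inputs
    return math.exp(-0.5 * (a - b) ** 2 / ell ** 2)

def solve2(A, b):
    # Cramer's rule for a 2x2 linear system A x = b
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def predictive_mean(x, X, f_hat, ell=1.0):
    # m(x) = k(x, X) K_XX^{-1} f_hat, with f_hat playing the role of the Laplace mode
    K = [[rbf(a, b, ell) for b in X] for a in X]
    alpha = solve2(K, f_hat)
    return sum(rbf(x, xi, ell) * a for xi, a in zip(X, alpha))

X_train = [-1.0, 1.0]
f_mode = [0.8, -0.5]  # pretend these came out of the optimizer
```

At a training input the mean reproduces the mode exactly, because we conditioned on it noiselessly, and far from the data it decays back to the prior mean of zero.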
And by the way, all of this so far has happened in the latent space, the space between minus infinity and plus infinity for function values. Just a moment. Now we're going to need to make predictions, this is the last line here, about the actual probability for labels, because in the end we're going to be asked: what's the class label at this point? Is it red or green? Then we will need to compute something like an expected value of the transformed, the squished, F. And that integral, an integral of a sigmoid against a Gaussian, does not have a closed-form expression. I mean, for a one-dimensional distribution, we can certainly write a piece of code that just does it numerically, and in fact people have written multiple such routines; on your exercise sheet you've seen one, and there are many others. But it's something to point out: the expected value of sigma of F is not the same as sigma of the expected value of F. Why? Because sigma is a nonlinear function, and integrals and nonlinear functions do not commute. And this is actually relevant, because we might wonder whether being uncertain about F actually shifts this value in some way that changes our behavior, like changing what we do with those labels. And there are some interesting results that I won't have time to talk about, but the simple answer is: for this particular setting of binary classification, and only for binary classification, it turns out that the decision boundary doesn't actually shift if you do this, but the uncertainty over the labels changes. So if there's a square here, things become interesting, and if there's no square here, then in some sense the point where the probability is one half is the same in either case.
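That claim is easy to check numerically. Here is a brute-force sketch of the integral by a plain Riemann sum (my own names, not the routine from the exercise sheet):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_sigmoid(mean, var, num=4001, width=8.0):
    # E[sigmoid(f)] for f ~ N(mean, var), via a Riemann sum over +/- width std devs
    s = math.sqrt(var)
    lo, hi = mean - width * s, mean + width * s
    step = (hi - lo) / (num - 1)
    total = 0.0
    for i in range(num):
        f = lo + i * step
        density = math.exp(-0.5 * ((f - mean) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        total += sigmoid(f) * density
    return total * step
```

With mean zero the answer stays at one half no matter how large the variance (that is the symmetry argument), while away from zero the expectation is pulled toward one half compared to sigmoid of the mean.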
And why this is: it has something to do with the fact that this is a Gaussian, so it's symmetric around its mean, and this is a function that is anti-symmetric around the origin. So at 50%, if the mean of the Gaussian is zero, we're integrating a symmetric distribution against an anti-symmetric function, and things work out. Fifteen years ago, this was something people used to argue that uncertainty might not be so important, because the decisions you take don't actually change, because the decision boundary is the same. That's when people were interested in decision boundaries and margins and the classifiers between them. So this is where we were at the end of the last lecture. We've decided to do classification with a combination of optimization and autodiff. We're finding modes, and then we'll construct a second-order approximation, a curvature approximation, and use that to estimate an uncertainty. But we didn't actually do that last bit, so that's what we're going to do today. And that's going to require us to think a bit more about the math. For the next three lectures, we're just doing these Laplace approximations and connecting them cleanly to something that's more and more like deep learning. So today we'll stick with Gaussian processes, and then on Monday I'll do an entire lecture, also for those of you who haven't had a deep learning class, carefully matching these things to each other. So let me go down the stack. I should have had a picture of our stack that I should maybe show again next week. What we've done so far today is some kind of relatively high-level picture of what we're trying to do: we will somehow find a mode and then compute Hessians. And that's also kind of what our code looked like. You might remember that I actually computed this gradient descent update by just calling an autodiff library. So this is a level where you say: oh, that's just this function, and we're going to compute gradients.
And I don't know what the gradient is, but you know what? I just called JAX and let it find the gradient for me. And this can be very convenient when you're writing code for the first time. But you also saw that when we do this, we then have to wait quite some time to get things to work. Actually, I just re-ran this code this morning, and for some reason it took two minutes to finish, even though it was just 20 seconds the last time I ran it on Monday; I still don't know why. So what I'd now like to do on this slide is to think about what kind of computation we actually have to do to get this uncertainty and to find the mode, and then see how the two are connected to each other. So here is, I know it's a busy slide, so I will go through it slowly, a thought process on what the computation actually is that we want the computer to do, even inside of an autodiff framework. So we've decided to use this Gaussian process prior on the latent function, and a likelihood that looks like this. It's a product over conditionally independent terms, which say that at each location x i, the label y i is distributed as a binary random variable, like a Bernoulli experiment, with probability given by sigmoid of f of x i. And then there's this neat notation: if we write the two classes as plus and minus one, then because the sigmoid is anti-symmetric, we can write it like this. So all the labels that are positive will get a sigmoid likelihood. Let me show you why this works. So if this is the sigmoid function in z, and you mirror it around this vertical axis, like this, you get something that looks like this, and this is literally just one minus the original function. So that's the probability of the other label, right? This is one minus sigma of z, which is equal to sigma of minus z.
So therefore, if we encode labels of one type or the other with probability the one thing or one minus the other thing, we can equivalently encode labels as plus or minus one for y. And then this is a very simple way of constructing the likelihood for all of the labels, when we use this function, this sigmoid shape. And now we said we're going to take the logarithm, then take the gradient, find the root of the gradient, which is the mode of the logarithm of the posterior, and take the second derivative to construct uncertainties. So let's do that. We first take the logarithm of this posterior. The posterior is prior times likelihood divided by evidence; the logarithm of that is log likelihood plus log prior minus log evidence. The evidence doesn't depend on F, so we can treat it as a constant for optimization purposes. Then we have to think about the log of the Gaussian: we know that it looks like this, up to a constant, because Gaussians are exponentials of quadratics. So that's this thing. And the logarithm of the likelihood is the sum of the logarithms of these expressions, because the likelihood is a product, and the logarithm of a product is the sum of the logarithms of its factors. And for that, we can look at this expression again and see what its logarithm is: that's the log of one, which is zero, minus the log of one plus e to the minus y i times f of x i. So that's the function we're actually computing the value of, up to constants, which we don't need to compute, because we're going to optimize, and the normalization constant just shifts the whole function up and down. The location of the optimum is not going to change if you shift the whole function up and down. So far, we've used autodiff to find the gradient of this.
I called jax.value_and_grad, which computes this number and the gradient. But let's see if we can compute the gradient ourselves, on the board in front of us. We do that by taking the gradient of this expression. Here it's particularly easy: we just shove the gradient inside, because it's a sum, and gradients, being linear operators, commute with sums. So we'll need to think about what the gradient of this term is in a moment. And here we get, this is a quadratic expression, so maybe you've done this as a homework exercise a few times now: this is the matrix-valued version of the calculus fact that the derivative of one half a f squared with respect to f is a times f. That's what happens here, just with matrices, so we have to be careful to get the order right; we get this. And now we can look at this gradient, and you can maybe convince yourself that this is the gradient of this object. Why? Well, it's probably best if you do it with a piece of paper, but if you think about it: we get minus the log of something, so chain rule, so we get one over one plus e to the minus y i f of x i, times the inner derivative. The inner derivative is of this thing, where we'll get an e to the minus y i f of x i, and then the innermost derivative is minus y i. So the minus cancels with the one in front, and you can already see that there will be some kind of expression that involves the sigmoid again, because there is an e to the something divided by one plus e to the something; that's like one minus the sigmoid, and that's exactly what we get here. You'll probably have to do it yourself on a piece of paper, but maybe you can just believe me. One interesting note: expressions like this one tend to show up in classification. Why? Because this is the function that switches from plus-minus-one labeling to zero-one labeling.
So if y i is plus or minus one: if it's plus one, then this function is one, and if it's minus one, this function is zero. So depending on which notation you use, a zero-one encoding or a minus-one-one encoding, you'll keep seeing these expressions. Yes? The Kronecker delta, this is a very good question; I should have said something about this, thanks. So if we take the derivative of this expression, this gradient, what does this actually mean? The gradient is d by d f j, right? We apply this to this sum, which is a sum over i, and I should have called the gradient index j. So this is the gradient of the i-th entry in the sum with respect to the j-th function value. And of course, because of the sum structure, only the j-th entry in the sum actually depends on the j-th function value. That's why we get this Kronecker delta. So also, for anyone confused: this is a function that is zero if i and j are not the same, and one if i and j are the same. So that's the gradient that we could actually implement, and I'll do that in a moment. So far, we've asked JAX to compute that gradient for us, but now we realize we can write it down ourselves. And this might be a good time to ask: have you ever actually had a class on how autodiff works? As was just said: no, the deep learning lecture does not contain anything about autodiff. Okay, maybe I can shove in something about autodiff at some point. The short thing to have in the back of your mind when you call an autodiff library without understanding how it works is this: automatic differentiation is, theoretically speaking, cheap, in the sense that evaluating the gradient of a function is at most a constant times more expensive than evaluating the function itself. I think that constant can even be shown to be about five, but it's a bit technical, depending on the implementation.
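To see that the hand-derived expression really is the gradient (the same thing jax.grad computes for us), here is a quick finite-difference check on the per-datum log likelihood; the function names are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(f, y):
    # log sigmoid(y * f) = -log(1 + exp(-y * f)), for a label y in {-1, +1}
    return -math.log1p(math.exp(-y * f))

def grad_log_lik(f, y):
    # the hand-derived gradient: y * (1 - sigmoid(y * f))
    return y * (1.0 - sigmoid(y * f))

def fd_grad(f, y, eps=1e-6):
    # central finite difference as a reference
    return (log_lik(f + eps, y) - log_lik(f - eps, y)) / (2.0 * eps)
```

The two agree to numerical precision at any f and either label, which is exactly the check an autodiff library is performing symbolically for us.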
So computing derivatives costs only a constant factor times the cost of evaluating the function you're differentiating. Okay, so in some sense that's good, because it means we can shove the work off to a computer. But that doesn't mean it's arbitrarily fast. And in particular, it doesn't mean that the algorithm that does the automatic differentiation for us, in this case JAX, does it perfectly. Why? Because it does this through local computation: it passes along the compute graph and locally computes partial derivatives, and then combines them again, maybe within the same pass or in a second pass through the compute graph, depending on how exactly it's done. And that local computation doesn't necessarily make use of all the possible ways of speeding up the computation. Here is an example of this, where we can see that this gradient has a nice structure. How should I call it? It's not diagonal, because it's a vector, but the i-th element only depends on the i-th function value. And I don't actually know whether JAX, in the code that I wrote, manages to realize this. So we can clearly speed up this computation; it's trivially parallelizable. Actually, let's think about the Hessian, which is even harder, and see if we can do something with that. So the second derivative: I tend to use this notation for the Hessian, which is just like an outer product of nablas, of differential operators, to show that this is going to be a matrix. The second derivative of this log posterior is another application of this operator to the gradient above. So we're going to get the Hessian of the log likelihood, minus, and now we just take a derivative with respect to f of X again, which clearly shows up linearly here, so we just get minus the inverse of the kernel matrix.
So this is going to be our Hessian, and now we have to think about this object. We already saw that there is a Kronecker delta here. Now we look at this expression and take another derivative with respect to f(x_j); there is no other f(x_j) in here, and you can maybe convince yourself that you'll get something like this. This is the second derivative of this function. If you can't convince yourself directly, do it on a piece of paper. It's like three lines, but it's not particularly enlightening, it's just algebra. The most important thing is that we now have two Kronecker deltas here. What this means is that the (a, b) entry of this Hessian matrix is non-zero if and only if a and b are both equal to i (we're taking a sum over i here from one to n). So we're only going to have entries on the diagonal of this Hessian, and each diagonal entry only depends on one of the function values. Ooh. So this term in the Hessian is going to be a diagonal matrix. The other term is not diagonal, that's the inverse kernel Gram matrix, a dense matrix. But it's nice to know that we have something diagonal plus an inverse kernel Gram matrix. The other thing to observe: this is a sigmoid, so it's a number between zero and one, and this is one minus a sigmoid, so it's also a number between zero and one. If you multiply these two, a number between zero and one times a number between zero and one is a number between zero and one. So not only do we have a diagonal matrix, we have a diagonal matrix whose entries are between zero and one. In particular, they are non-negative; at worst, they are zero. That means this matrix hidden in here, which I'm going to call capital W, the diagonal matrix containing the elements of a vector lowercase w, is clearly a positive definite matrix. Is that clear? Some people are nodding.
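A two-line numerical check of that claim (NumPy sketch, assuming the zero-one encoding): the diagonal of the negative log-likelihood Hessian is w_i = sigmoid(f_i)(1 - sigmoid(f_i)), which stays strictly inside (0, 1) for any finite f.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# diagonal of W, the negative Hessian of the Bernoulli log likelihood
f = np.array([-10.0, -1.0, 0.0, 2.5, 8.0])
w = sigmoid(f) * (1.0 - sigmoid(f))
```

The maximum value 0.25 is attained at f = 0, and the entries decay toward zero for large |f|, but never reach it.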
So if I have a diagonal matrix with entries... that's the wrong way of drawing a diagonal matrix. Actually, no, that's okay. If I have a diagonal matrix which contains only numbers larger than zero, then if I multiply it from the left and the right with an arbitrary vector v, I'm basically scaling each element of v with a positive number and then summing up the squares. So that's clearly a positive number. And the other matrix we're adding to it is by construction symmetric positive definite, because it's a kernel Gram matrix. The whole point about these kernel matrices is that they're symmetric positive definite. The sum of two positive definite matrices is a positive definite matrix, because the sum of two positive numbers is a positive number. And that means this Hessian is symmetric positive definite everywhere. So what does that mean for the problem? Your analysis class from the second term is calling. Or the math for machine learning lecture from last year. It's a convex problem. This optimization problem is convex. It looks like this. If we were currently in, say, 2011, this would have been a big deal, because people used to care a lot about convex optimization problems. Why? Because there are super efficient algorithms for solving them. A large part of operations research (so business, economics research) is about optimizing convex functions, because it can be done very, very efficiently, or at least it used to be. There are these big books by Stephen Boyd and Lieven Vandenberghe and so on about just turning everything into a convex problem. And this is what we've just done here: we've discovered that the problem we're optimizing is convex. And we've been doing it with gradient descent. Hmm, there was a question? So the question is: well, this could be all zeros, right, and then could we even invert it?
Can we always be sure that we can invert this matrix? Actually, yes: unless f is plus or minus infinity, the sigmoid of a finite number is strictly between zero and one, in the open interval, so it will not be exactly zero or one. Then on the diagonal we only have non-zero, all positive entries, and such a matrix you can always invert. You can also always compute its... sorry, not its eigenvalue decomposition, but its matrix square root: for a diagonal matrix, that's just the same matrix with a square root on top of each of the entries, which will be convenient in a moment. So why have we done all that before the break? Last Monday we did this very, very tedious thing where I stood here hopping around for a few minutes trying to get gradient descent to work fast, tuning parameters and so on, which is kind of my placeholder for how contemporary machine learning often works. And this is a throwback to the olden days, where people would spend a lot of time trying to find out what computation they're actually trying to do. We've just discovered that this particular problem is convex, and therefore there is a fast algorithm we can use for optimization, which you will do in the second half. But before we do that, can someone guess what the algorithm is going to be? What can you do if you have a convex problem? What is an algorithm that works well, if you've taken your math for machine learning class? Newton's method. Exactly, that's what we're going to do. I'll briefly tell you what Newton's method is and then we can go into the break. Everyone has done scalar Newton's method in high school. It's a root-finding algorithm, right? You find the zero of a function.
In this case, our function is the gradient, and we take steps like x ← x − f′(x)⁻¹ f(x), right? You all remember, you did this in 11th grade, 10th grade, I don't know. We're going to do the same thing, but with matrices. Here it is: we update our estimate for the latent function by making steps like this, where this is the gradient and this is the inverse of the Hessian matrix. We'll see how fast that is, and you can already ask yourself: isn't that somehow expensive? We'll see how expensive after the break. So let's take five minutes and at 9:06 I'll continue. There was a big question in the break that I can't answer in short, but maybe I can start to answer it by putting it into the context of this lecture. What we're going to do in the remaining 45 minutes or so is think about how to implement this optimization process efficiently, to make it work fast and give us uncertainty in the end. You might wonder why we have to go through all of this. I realize that this class often has this very technical, mathematical, or actually maybe numerical, nature, and maybe some of that is my own fault, because that's the kind of stuff I like to think about. But it's also a question of what you think the job of a machine learning person is actually going to be in the future. And sure, this is wide open. Who knows, maybe we're all going to do no-code at some point. But keep in mind that computer science is still a surprisingly young field, if you think about it. The very first beginnings were in the 1950s; compare that to physics, which is like 400 years old. And we're still seeing an evolving landscape of professional roles that involve computers.
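For concreteness, here is the multivariate analogue of that scalar update, as a NumPy sketch under the lecture's model (labels y in {0, 1}, log posterior ψ(f) = log p(y|f) − ½ fᵀK⁻¹f + const). It uses explicit inverses purely for readability; the stable version is exactly what the second half of the lecture develops.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def newton_step(f, K, y):
    """One naive Newton step on the log posterior of GP classification.
    Explicit matrix inverses for clarity only (unstable and O(n^3))."""
    K_inv = np.linalg.inv(K)
    grad = (y - sigmoid(f)) - K_inv @ f            # gradient of the log posterior
    W = np.diag(sigmoid(f) * (1.0 - sigmoid(f)))   # negative Hessian of log likelihood
    H = -(W + K_inv)                               # Hessian of the log posterior
    return f - np.linalg.solve(H, grad)            # x <- x - H^{-1} grad
```

Because the problem is convex (strictly concave log posterior), iterating this step converges to the unique mode.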
And machine learning is not so different from other fields in computer science, where we see this kind of stratification of professional roles. On the internet there's a very clear separation, right? In network engineering there is a spectrum: on the one end, the network engineers who built the backbones of the internet and make them work. You never see them, but they run the whole thing, they probably make a decent amount of money from it, and they can kind of hide in the cellars and make everything work. On the other end, there are web designers, who are pretty much artists in a way. They use high-level tools, and the two groups live on different parts of the stack as well: one is in userland using, I don't know, Adobe something, and the other is sitting in front of a console typing arcane commands. And in machine learning, which is a much younger field even than computer science (the first International Conference on Machine Learning was, I think, in the 80s, and as a field it really emerged in the early 2000s), we now see this stratification too. There are people who currently get called data scientists, who use, I don't know, scikit-learn or R, or maybe they take an output and paste it into Adobe and make a nice picture out of it. And there are the people who run the low-level bits, who train the large language models. These are very different roles. I'm not saying that one of them is more enticing than the other, but they're quite different, also in terms of how you work, right?
So whether you're very close to the customer, constantly making a tiny little thing, a nice picture, and the next day you do something else, or whether you're really trying to get one really big thing to work. And if you want to get the big thing to work, you really have to understand how it works. To this day, being able to do this kind of stuff is a much more valuable skill than calling scikit-learn, because there are thousands of people out there who can call scikit-learn. And there are millions, actually by now billions, of people out there who can do no-code: you can just ask GPT to produce some silly code for you. So if you want to make a dent in the things that actually run the machine learning stack, you need to be able to do this kind of stuff. I realize it's sometimes painful, but doing it by hand can also be very pleasing. You can suddenly make things work, you understand what's going on, and you feel like: ah, I've got this under control. So let's see if we can do this. I pointed out that there's this algorithm which you've heard about on an abstract level: Newton optimization. I'm guessing if I asked who's heard of Newton optimization before, all the hands would go up, and you might have heard that it has this wonderful property that it can converge very, very fast, in the sense that the next iterate is closer to the true solution than the previous iterate by a factor that itself scales like the previous distance: the new error is bounded by a constant times the square of the old error. So if the old error is a small number, much less than one, then the new error is way smaller still. This is what's called quadratic convergence. And if you think about what that means for the iterate t plus n steps later, you can see that this is some kind of doubly exponential type of convergence.
So it's a very, very fast kind of process, and this has led people to say, much like Bill Gates in the 90s, that quadratic convergence should be pretty much enough for everyone. There's no point in trying to find an algorithm that's faster than this, because if you double the number of significant digits in every step, you very quickly run out of significant digits to represent in your floating point number anyway. There's a downside, though: you have to compute the Hessian. But we just found out that this Hessian has interesting structure. There's the kernel Gram matrix, which never changes, it's in the prior. And there's the Hessian of the log likelihood, which is diagonal and only contains positive numbers. Hmm, maybe we can do something with that. Let's see if we can work with this and make the algorithm run reasonably fast. So let me combine the previous slides and the insight from this last slide into one thing, to talk about what we're going to do, abstractly, in pseudocode. We now know that the gradients and Hessians of the log posterior are sums of gradients and Hessians of log likelihood and log prior. And we now know exactly what they are: they have explicit forms, you've essentially derived them. So now we can run Newton optimization to find the point estimate and the curvature estimate. Newton optimization is going to look like this. You can think of this as a training algorithm, because it happens before test time. You get the training data set, you construct the kernel Gram matrix of the prior and the prior mean function, you load in the data, and then you iterate. We compute the Hessian of the log likelihood; that we have to do in every iteration, because it changes as we change f.
But the Hessian of the prior never changes, because it's a Gaussian prior. It has a constant Hessian; it's a quadratic function. So we don't have to update that. Then we compute the gradient of the log posterior; that changes in both the prior and the likelihood term. We know this part is the gradient of the likelihood and this part is the gradient of the prior, which is a linear function in f. So what does computing that gradient entail? Multiplying with a matrix whose inverse we need to know, but we pre-compute that inverse once, efficiently, before doing the iterations, and computing a sigmoid, but that's an elementwise operation, so it's O(n). And then we need to compute this matrix; this is going to be the complicated bit. We need the inverse of the sum of this diagonal matrix and this inverse kernel matrix, multiplied from the left onto the gradient. That's the Newton step: inverse Hessian times gradient, with a minus in front, because that's just what the Hessian looks like, right? Up here you can see the minuses, which I've taken outside. Then we do the update. So that's roughly what we want to do. We've now seen that, A, one problem is that we need to compute these matrix inverses, and matrix inverses are potentially unstable and expensive. And then there's another thing, which someone actually pointed out during the break: hang on, hang on, isn't Newton optimization quite unstable? Maybe you were in the math for machine learning or the optimization for machine learning class and you have a feeling that Newton optimization can be very unstable. The intuition is that when we do this, we're inverting something; we're multiplying with the inverse of a matrix. So let's call this Hessian matrix H, which is minus (W plus K inverse).
That matrix has an eigenvalue decomposition, right? So think of it as V times D times V transpose. There might be some very small numbers in there and some very large numbers: large eigenvalues, small eigenvalues. They're all positive, because it's a positive definite matrix, convex problem, but some of them might be 10 to the three, others might be 10 to the minus five. And if we invert them, the small ones become very large and the large ones very small. Then we multiply with the gradient, which might also contain large and small numbers, and we might end up with cases of a small number divided by a small number, which is dangerous, as you all know, right? Floating point operations are unstable if you divide two equally sized but small numbers by each other. An intuition for what is happening there: if you have an optimization problem that locally looks like this, very badly conditioned, very slim and narrow, then the gradient is pointing in this direction. So gradient descent takes you down here, whereas you want to get over there. And Newton is supposed to correct for this, to get you exactly there in one step if this is a quadratic function. But if you're at this point, it has to correct this large gradient component with this tiny component in the other direction, and that's dividing small numbers by each other, so it might overshoot by just a bit and become unstable, and then everything's broken, right? So to make this work, we usually have to think a little bit about this matrix and how to represent it correctly. This often involves just rescaling the matrix in the right way so that it's stable. If this were a unit matrix, which we've sometimes seen in Gaussian process regression, then we had matrices of the form sigma squared times identity plus kernel matrix.
Then a common thing people do is multiply by sigma to the minus two from the outside. These two cancel, right? And we're left with identity plus sigma to the minus two times K_XX. And that's nice. Why? Because this is a positive definite matrix, this is a unit matrix, and adding a unit matrix to a positive definite matrix shifts all the eigenvalues up so that they are larger than one. Why is that? Maybe some final notes; I realize my blackboard writing is terrible at this point. If you write a positive definite matrix K as U E U transpose, then we can write K plus alpha times the identity as U E U transpose plus alpha times U U transpose. That's a wonderful way of writing the identity. Then we can take the U's outside of the expansion and write U, (E plus alpha times identity), U transpose. And this tells us what the eigenvalues of the sum of these two matrices are: the eigenvectors are the same, this is still an eigendecomposition, and the eigenvalues are shifted by alpha. So if alpha is larger than zero, it shifts the eigenvalues up. And if we divide out alpha, we can even make sure that all the eigenvalues are larger than one. That's nice, because then we can be sure we're not accidentally dividing by a very, very tiny number. Here we have a diagonal matrix, not a scalar matrix, but you can do a similar trick. And now you're going to ask: oh, do I have to know this for the exam? No, you don't have to know this for the exam. It's the sort of thing where you just breathe in what you might want to do if, at some point, you're sitting in a company building a piece of code that's supposed to actually work in practice; then you might have to dig in and do this kind of stuff.
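The blackboard argument, K + αI = U(E + αI)Uᵀ, can be checked numerically in a few lines (NumPy sketch): shifting a symmetric positive definite matrix by α times the identity shifts every eigenvalue up by exactly α while leaving the eigenvectors unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
K = A @ A.T                          # symmetric positive (semi)definite stand-in
alpha = 2.0
ev_K = np.linalg.eigvalsh(K)         # eigenvalues of K, ascending
ev_shifted = np.linalg.eigvalsh(K + alpha * np.eye(4))
# same eigenvectors; every eigenvalue shifted up by alpha
```

Dividing the shifted matrix by α then bounds all eigenvalues below by one, which is the stabilization trick used in the regression case.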
So the people who call themselves engineers, who build deployable products that run at scale, often have to spend weeks thinking about these kinds of problems to make stuff really work. You have to dig in, you have to look at the matrix, think about what's actually happening here. Why is this unstable? Ah, it's because there's a small eigenvalue floating around. Stupid. So it turns out, and you can find this in the Rasmussen and Williams Gaussian process book, chapter 3.4, that there is a neat reparameterization of the problem in terms of this matrix B. Remember that we need a matrix that looks like this? It turns out we can compute this matrix. Why? Because W is a diagonal matrix with positive entries on its diagonal, so its square root is literally just the square root of the diagonal. And then we can clearly compute this kind of matrix, which is also clearly positive definite by construction, because it's the sum of a positive definite matrix and this expression: a positive definite matrix multiplied from the left and the right by a matrix and its transpose. That sandwiching matrix doesn't even need to be positive definite itself for the result to be positive semidefinite, and here it actually is positive definite, so the whole expression is too. So the sum is positive definite, and its eigenvalues are clearly larger than one. So this is going to be stable. And now you just do some linear algebra. We reuse the matrix inversion lemma in a special form: you can think of the matrix there as W to the one-half, times identity, times W to the one-half, plus K inverse. Then you use the matrix inversion lemma (I'm running out of chalk): (K⁻¹ + U C Vᵀ)⁻¹ = K − K U (C⁻¹ + Vᵀ K U)⁻¹ Vᵀ K. At some point you just know it by heart, so you can write it out, plug it in there, and you'll find that you can compute the inverse in this form.
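A numerical check of that identity, specialized the way it is used here (NumPy sketch with a random positive definite stand-in for the kernel Gram matrix): with B = I + W^{1/2} K W^{1/2}, the lemma gives (K⁻¹ + W)⁻¹ = K − K W^{1/2} B⁻¹ W^{1/2} K, and the right-hand side never inverts K itself.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
K = A @ A.T + 1e-6 * np.eye(5)       # stand-in for the kernel Gram matrix
w = rng.random(5)                    # diagonal of W, entries in (0, 1)
sw = np.sqrt(w)                      # W^{1/2}: elementwise square root

# B = I + W^{1/2} K W^{1/2}; its eigenvalues are all at least one
B = np.eye(5) + sw[:, None] * K * sw[None, :]

direct = np.linalg.inv(np.linalg.inv(K) + np.diag(w))          # naive
lemma = K - K @ (sw[:, None] * np.linalg.solve(B, sw[:, None] * K))
```

The `lemma` form only ever solves against the well-conditioned B, which is why it is the numerically preferred route.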
And this is nice, because this matrix we already have; we don't need to invert it. Here we get something like this. This matrix is going to be stable, because B is this matrix that is stably invertible; it has eigenvalues larger than one. And then we just multiply things from the left that we already have. So we can compute the Newton update, which we had on a previous slide, by plugging in this expression for (W + K⁻¹)⁻¹ from the left onto everything here. And we can actually reuse a similar trick to make predictions. In the predictive bit on a previous slide, I said that at new test points we need to compute this object, which we had seen before. So here is our mean, which is going to be easy, because it only involves the inverse kernel Gram matrix of the prior. And in the posterior we saw that there is this matrix sigma hat. If you plug in our Hessian inverse, then by rearranging we see that we need an expression involving something like (K + W⁻¹)⁻¹. This is the bit where we see that this looks like a posterior arising from pseudo-observations with noise covariance W⁻¹. And this is very similar; it's just the other inverse, the other way around, right? Before, we thought about (W + K⁻¹)⁻¹, and now we need (K + W⁻¹)⁻¹, which is not exactly the same thing, but clearly related. And it turns out we can again use this B matrix to compute the posterior covariance stably. We need to be able to invert this matrix B, but if we can do that, everything else becomes relatively easy. And this is the point where I can show you code. I have another piece of code, also on Ilias, which is a verbatim copy of the stuff we did on Monday, up to the point where we start optimizing. As before, we load a bunch of stuff and make a data set; here it is. We define the prior as we've done before.
So we write down a kernel and a mean function, define the Gaussian process prior with this mean and this kernel, and then draw a sample from it to make a nice plot. This takes a bit of time, because we need to build this big grid, 50 by 50, two and a half thousand points, and then draw a sample, doing the SVD of that to get a sample, as you've seen before. And now we write down the log likelihood, but I've changed it a bit: it computes not just the value of the log likelihood but also its gradient and its Hessian in one go. So this is not autodiff, it's just diff. I'm replacing autodiff by hand with a piece of code that actually computes the things we need. You can maybe convince yourself afterwards, by looking at previous slides like this one up here, that the objects there (there's a sigmoid, and then a sigma times one minus sigma) are the things being computed here in this piece of code. Given an f, it returns the value, the gradient, and the diagonal of the Hessian of this function: a scalar, a vector, and another vector of the same size, because it's just the diagonal of the Hessian matrix. And now we do Newton optimization. This is the bit that's different. On Monday, we had this bit which said: define a bunch of annoying parameters, compute an update step using gradient descent, make sure to jit it, just-in-time compile it so it runs faster, and then do a stupid for loop to optimize. Now, instead, we again define an update function, and that update function does the heavy lifting. That's the bit that does the nasty linear algebra I just waved my hands about. So what does it do?
Every time, you hand it a current estimate and these statistics, the kernel Gram matrix and the training data, which, by the way, from the point of view of exponential families, are the sufficient statistics. We feed the current estimate to the likelihood and have it compute its value, its gradient, and its Hessian. Then we compute this matrix, which I've maybe named badly: I call it W here, but it's actually the square root of the negative Hessian diagonal, because we need square roots everywhere, so I decided not even to write down the square root. And then here's the one heavy-lifting piece. This is where we compute the matrix decomposition necessary for the Newton step, of this stable matrix B, which is identity plus W-square-root times kernel Gram matrix times W-square-root. I've indexed it such that we multiply this diagonal vector once from the left and once from the right; that's like multiplying with two diagonal matrices from the left and the right. Then there are two lines that involve a Cholesky decomposition and a pre-computed vector b that we use twice, here and there (that's why we pre-compute it, so it's faster), and we do the actual update by multiplying with K_XX. And this is the Newton step. This bit, which doesn't look like it (sorry, hopping around), is actually this line, where I've just replaced this matrix with this trick, right? If you replace this matrix with this expression, you get a different-looking update, and I've just dragged that in, so it doesn't look like you'd expect. But if you take these three lines together and write them out on a piece of paper, you'll see they are the same thing. And then this update function returns the new estimate and the value of the log likelihood, because I want to plot it. I've pushed all of this off into a separate function called update because, first of all, that's a standard coding pattern.
The update bit is what in deep learning you also know as the optimizer's update step, but also because I can then pre-compile this function to make it fast, since we're going to call it over and over and over again. How often are we actually going to call it? Well, here's the actual Newton loop. I repeatedly call this update function, starting from an initial estimate. By the way, I've set the initial estimate to a random function; actually, it doesn't matter, you can set it to zeros or to a random function, try out both. This gives me the new estimate and the new log likelihood. I call it new because I can then compute how much the likelihood has changed and use that as a convergence statistic; call the difference delta. When delta gets small enough, I stop the loop and say we're converged. And I also store the log likelihood, which is why I don't do this inside update: it's a side effect, and JAX would complain about it, so I put it out here. And then I run this. Remember how long it took to run gradient descent, and how annoying it was to get it to work. Now we do this and we're done. We'll do it again. Done. All right, done. Where's the learning rate? There's no learning rate. It's Newton optimization; the learning rate is one. How many steps has it taken? Six. Actually, the jitting up here is pretty much useless, because I'm just calling the function six times; I don't even need to optimize it. In fact, I'm probably paying more for the just-in-time compilation than for actually running the code afterwards. And the cost of this cell (I haven't timed it, you can put in a timeit yourself if you want) is mostly making the plot. And this is on a hundred data points.
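Putting the pieces together, here is a minimal NumPy sketch of such a Newton loop (the lecture code uses JAX and a separate jitted update function; names and details here are illustrative). It follows the stable parameterization via B and its Cholesky factor, in the spirit of Rasmussen and Williams' Algorithm 3.1, and never inverts K.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_mode(K, y, tol=1e-9, max_iter=20):
    """Newton iteration for the mode of the GP classification posterior,
    labels y in {0, 1}, via the stable matrix B = I + W^{1/2} K W^{1/2}."""
    n = len(y)
    f = np.zeros(n)                      # zeros or a random start both work
    for _ in range(max_iter):
        s = sigmoid(f)
        w = s * (1.0 - s)                # diag of W, negative log-lik. Hessian
        sw = np.sqrt(w)
        B = np.eye(n) + sw[:, None] * K * sw[None, :]
        L = np.linalg.cholesky(B)        # the one O(n^3) operation per step
        b = w * f + (y - s)              # W f + gradient of the log likelihood
        # Newton update f_new = (K^{-1} + W)^{-1} b, without inverting K:
        v = np.linalg.solve(L, sw * (K @ b))
        a = b - sw * np.linalg.solve(L.T, v)
        f_new = K @ a
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f
```

On a small toy problem this converges in a handful of full Newton steps, with no learning rate to tune, matching the six-iteration behavior of the demo.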
Now, of course, if I have a million data points, then the fact that this Cholesky is cubic in the number of training points will become relevant. But if you have a small data set, why worry? We just do it. And the rest is now easy. Remember that on Monday we took the thing that came out and constructed the posterior mean function, but we didn't have a posterior covariance yet, because we didn't have the Hessian yet. Now we do have the Hessian, because we've just used it in the optimization a bunch of times. So we can do the pre-computation as before. We just have to be a bit careful with this Hessian: I've rewritten it with this W matrix, and I'm just going to use that here. And then we compute the posterior mean function and a posterior covariance function. This posterior covariance function, which you've now probably seen a few times doing your homework, involves (because we're going to call our Gaussian process library) this functools.partial, vectorize bit, where we vectorize the kernel function such that we can write it in terms of scalars, and it takes care of broadcasting things out in the right way, to build the right matrices, whether they are diagonal or full. And then we have a Gaussian process posterior. This is the Laplace approximation: a Gaussian-process-type object which has two sufficient statistics, a posterior mean function and a posterior covariance function. I can instantiate it. Ooh, that was fast, right? Gaussian process classification is really fast! No, of course not; that bit is just functional programming. It's sometimes useful to think about what's expensive and what's not. This bit is really just instantiating functions. It's just saying: there is this function, don't call it yet. But now we can actually call it. I'll tell you in a moment what I do here: in this cell, I'm evaluating this posterior on the grid.
So I'm calling its call function, __call__, and that takes some time, because it builds these big matrices of size two and a half thousand by two and a half thousand. Then this line computes an SVD of this matrix under the hood, uses it to draw three samples, and computes the predictive probability; actually, I'll tell you about that in a moment. And then I make a plot; that's just the annoying plotting part. Now, what do we see in this plot? On the left, I'm plotting the posterior mean function, the analogue of what we did on Monday. I take the posterior mean, and it's actually in the latent space; you can see that the color bar goes from minus four to plus four. It's in what in deep learning is called the logit space, the pre-activations of the output layer: the latent function that lives on the real line, between minus infinity and plus infinity. I'm also plotting the diagonal of the covariance matrix on this plotting grid. That's the local error bar for the function. It's also a vector of the same size as this one, of which I take the square root to get a standard deviation, and then I plot it. And I'm using a different color scale here, a different color map, because this really is a different object, right? This is a point estimate, and this is uncertainty. It's not minus infinity to plus infinity; it's always a positive thing, just how uncertain you are at this point. And you can see that the order of magnitude of the uncertainty is comparable to the order of magnitude of the latent function, which is good, because we're doing classification. We're going to push this through the sigmoid link function, and if we want the output to be uncertain in any way, the uncertainty should be on a scale such that these Gaussian distributions reach into the dynamic range of the sigmoid. So these two numbers are comparable. That's good.
And now I also draw here, on the right, three samples from this posterior. This is something I couldn't do on Monday, because there was no Gaussian process yet; it was just a posterior mean. And this is also the answer to the question: what does the uncertainty actually mean? This is something we couldn't have done with just a point estimate. What this says is that, under this very constructed model, the posterior thinks that these are all possible realizations of the unknown function that creates labels. It could be this one, it could be this one, it could be this one. And in each of these plots, you see that on the training data it's relatively consistent: in the regions where there are red points, there tend to be red values of the function, and in the regions where there are green points, there tend to be green function values. But outside of that region, we're actually pretty uncertain. We just don't know what the right label is. Depending on how we draw, sometimes we get green values, sometimes we get red values. And now, finally, what is this bit? One thing you can do is draw these samples and use them to visualize uncertainty. But this is maybe a bit difficult to look at, right? It's actually interesting to look at, because it tells you something about the joint structure: it says how smooth the underlying function is, on which scale things change. But you need to look at a lot of these to get a feeling for what the prediction actually looks like, what its variability is. And if you look at a lot of these, what you do in the back of your head is average over them. That's what we'd like to do. So, ideally, we would like to compute the distribution over the transformed latent variable. There's this function that lives in the real space, in minus infinity to plus infinity, and we would like to construct the random variable that is the sigmoid of this thing.
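The sampling step itself is only a few lines. The following is a hypothetical sketch, not the lecture's code: it assumes the posterior has already been evaluated on a grid, giving a mean vector and a covariance matrix, and it factors the covariance via an SVD, which is what the lecture says happens under the hood. For a grid of G points this factorization is the O(G³) step that makes the call slow.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(mean, cov, n_samples=3, jitter=1e-9):
    # factor the symmetric PSD covariance via an SVD: cov = U S U^T,
    # so L = U sqrt(S) satisfies L @ L.T = cov
    U, S, _ = np.linalg.svd(cov + jitter * np.eye(len(mean)))
    L = U * np.sqrt(S)
    z = rng.standard_normal((len(mean), n_samples))
    return mean[:, None] + L @ z  # one sample path per column

# toy demo: zero mean with a smooth stationary covariance on a 1-D grid
g = np.linspace(-3.0, 3.0, 50)
cov = np.exp(-0.5 * (g[:, None] - g[None, :]) ** 2)
paths = sample_gaussian(np.zeros(len(g)), cov)  # shape (50, 3)
```

Each column of `paths` is one joint draw, i.e. one plausible latent function evaluated on the whole grid at once, which is exactly what the three sample panels show.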
So in principle we can do that, right? We can go back to lecture three and check how to transform random variables with a probability density function. But remember that our Gaussian is actually multivariate. We can do this locally, in a scalar fashion, at each point, and construct this object; this is what I did in that Streamlit app that I showed you on Monday. But doing it for the whole high-dimensional space is going to be difficult, because it involves this massive Jacobian with the determinant inside. And also, this object, which I think is sometimes called the logit-normal, doesn't have an analytically expressible mean and variance and so on. It's not really a tractable object. So, as you also know from your exercise sheet, over the years people have come up with all sorts of approximations to compute this. And if it's a binary probability, so just one latent function, you can do this in very elegant ways. For example, David MacKay, what is that, some 30 years ago, already came up with this very simple transformation, of which the paper even explicitly says that it's just a simple thing he needed on the side because he wanted a quick output. This was in the days of gnuplot; Python didn't even exist, and so on, so things had to be very simple. But it's also useful from an analytic perspective, because it really tells you what's going on. You can think of the expected value of this sigmoid transform, which is what you would get if you drew not just three but a hundred thousand of these samples and averaged them, as approximately applying the sigmoid function to this expression: the mean divided by the square root of one plus pi over eight times s squared. So where do the pi and the eight come from?
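For the scalar case, the change-of-variables construction can be written down explicitly. The following is an illustrative sketch (function names and parameter values are my own, not from the lecture): pushing f ~ N(μ, s²) through the sigmoid gives the logit-normal density p(z) = N(logit(z); μ, s²) / (z(1 − z)), where 1/(z(1 − z)) is the scalar Jacobian. The snippet checks numerically that it integrates to one, even though its mean and variance have no closed form.

```python
import numpy as np

def logit(z):
    # inverse of the sigmoid
    return np.log(z) - np.log(1.0 - z)

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def logit_normal_pdf(z, mu, s):
    # change of variables z = sigmoid(f) with f ~ N(mu, s^2):
    # |df/dz| = 1 / (z (1 - z)) is the scalar Jacobian
    return normal_pdf(logit(z), mu, s) / (z * (1.0 - z))

# sanity check: the density integrates to (approximately) one
z = np.linspace(1e-6, 1.0 - 1e-6, 200001)
p = logit_normal_pdf(z, mu=0.5, s=1.2)
mass = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(z))  # trapezoidal rule
```

The same construction applied jointly to the whole grid is what becomes intractable, because the Jacobian turns into a large determinant.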
Well, he does some very smart constructions on a piece of paper to match the sigmoid to a probit, that is, the CDF of a Gaussian distribution. And with Gaussians, you always get pis floating around; that's where the pi comes from. And he matches the slopes to be equal at zero. At zero, the derivative of the sigmoid is one over four, because it's a half times a half, and that's where, together with the pi, the eight comes from. So you get this kind of expression. The cool thing about this is that it's an element-wise function we can just implement in Python. It's actually up here, that's this thing. It really just says: give me a mean and a standard deviation, and I'm going to compute the sigmoid of mu over the square root of one plus pi over eight times the variance. And this is what I plot in this plot. Where is it? Here: predictive probability. And the predictive probability is this thing applied to, sorry, not the posterior mean, the full posterior. And it looks like this. So what you see here is this function mediated through the sigmoid. Where we have high variance compared to the mean, we will be uncertain. And where we have low variance compared to the mean, so in the blue regions, this is pretty much just the sigmoid of the mean. And outside, it's just uncertain. You can here just barely see a difference between those two, I know; that's probably because I haven't zoomed out enough in this picture. But this is sort of the answer to the question at the beginning of the lecture: why do I need to be uncertain in the first place? Because outside of the data regime, this is one way to think about it, you might want a behavior where the thing just says: well, you're far away from the data, and I have no idea what the answer is. It's not just that it's 50%; I just don't know what the answer is. I'm uncertain about the uncertainty.
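As a sketch of this approximation (function names and test values are mine, not the lecture's): MacKay's element-wise formula is E[σ(f)] ≈ σ(μ / √(1 + (π/8) s²)) for f ~ N(μ, s²), and you can sanity-check it against a brute-force Monte Carlo average of sigmoid-transformed Gaussian samples.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def mackay_predictive(mu, s2):
    # MacKay's probit-matched approximation to E[sigmoid(f)], f ~ N(mu, s2);
    # element-wise, so it costs O(1) per grid point
    return sigmoid(mu / np.sqrt(1.0 + np.pi / 8.0 * s2))

# compare against a Monte Carlo estimate of the same expectation
rng = np.random.default_rng(0)
mu, s2 = 1.0, 4.0  # made-up test values
mc = sigmoid(mu + np.sqrt(s2) * rng.standard_normal(200_000)).mean()
approx = mackay_predictive(mu, s2)
```

Note how the variance shrinks the argument toward zero: with s² = 0 you recover the plain sigmoid of the mean, and with large s² the prediction is pulled toward 0.5, which is exactly the behavior seen in the predictive-probability plot.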
And here in this two-dimensional picture, admittedly, this may be a bit boring. But if you're doing this not with two moons but with some complicated computer vision data set where you can't look at the images, where there's no two-dimensional space to look at but a 768-dimensional one, that's a useful property to have. Quick question? Ah. No, I'm dividing by the scalar variance. It's literally this function; that's the standard deviation squared, the diagonal of the covariance matrix, taken element-wise together with the mean. So that's a computation that is linear in the number of pixels in this image. It's a local thing: you locally have one Gaussian, one scalar bell-shaped thing, that gets pushed through this sigmoid function, and we're computing the expected value of that pushed-through one-dimensional object, repeatedly at every pixel. So with that, I'm at the end, and this is also the end of the specifically Gaussian process part of this entire lecture course. Let me quickly summarize and then tell you briefly what we're going to do next. We've taken our first step away from the safe space of Gaussian inference. Up until this Monday, everything was closed form, and we confined ourselves to mathematical problems where we can write down the exact answer in terms of linear algebra: Gaussian process regression. And we saw that we can get somewhere with this; we can learn functions that have real-valued outputs. But life is more complicated than real-valued functions. In fact, if you think about real-world data, you very quickly get data types that don't fit into this neatly constructed framework. Classification is the first example of this; generative models and structured outputs are more extreme ones. So we realized that if we want to do this, if we want to learn functions that map from input spaces to simplices, to discrete probabilities, we need to allow ourselves approximate computation.
So we dipped our toes in and said: let's keep as much Gaussian process as we can, the full Gaussian process prior, and just approximate the posterior by finding the mode, then computing the Hessian at the mode and calling that a Gaussian approximation, where the mean of this Gaussian is the mode and the negative inverse Hessian of the log posterior at the mode is the covariance matrix. That's called the Laplace approximation. And one of the first experiences we had this week was that, on the one hand, this is often quite tedious to actually do in code. It's tempting to do it more simply with the available automatic tools, like automatic differentiation and gradient descent. But when we take the time, as we did now for 90 minutes, to think about what is actually being computed, and implement the functions directly rather than calling autodiff, and then implement a supposedly more complicated algorithm like Newton optimization, we can build something that is numerically extremely stable, very fast, thousands of times faster than the stuff that I showed you on Monday, doesn't require any babysitting, and just converges. And you can do classification. Well, okay, very simple classification in this binary type of problem. So on Monday, we're going to meet here again, and I will try to not just dip our toes into the real world but actually jump right into the deep end and start to think about what this deep learning thing is that people out there apparently do a lot in machine learning, and how it relates to all of the stuff we've done so far. We'll do this very carefully, and we will find that there are deep connections to the things we've done so far. And that's also why it was worth spending so many lectures on this: all of the concepts that we have used so far will have implications for what we actually do in deep learning.
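To illustrate what "implementing Newton directly and numerically stably" can look like for this model, here is a sketch in the spirit of the standard formulation in Rasmussen and Williams' book (their Algorithm 3.1); the kernel, data, and iteration count are my own toy choices, not the lecture's code. The trick is to never invert K itself, but to Cholesky-factorize B = I + W^{1/2} K W^{1/2}, which stays well-conditioned because the entries of W lie in (0, 1/4].

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_mode(K, y, n_iter=20):
    # Newton iteration for the mode of the GP-classification log posterior,
    # parametrized so only the well-conditioned B is ever factorized;
    # labels y are assumed to be in {0, 1}
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)
        W_sqrt = np.sqrt(pi * (1 - pi))
        B = np.eye(n) + W_sqrt[:, None] * K * W_sqrt[None, :]
        L = np.linalg.cholesky(B)
        b = W_sqrt**2 * f + (y - pi)  # reparametrized Newton direction
        a = b - W_sqrt * np.linalg.solve(
            L.T, np.linalg.solve(L, W_sqrt * (K @ b)))
        f = K @ a  # the new iterate, written as f = K a
    return f

# toy demo with a made-up symmetric data set
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
f_hat = laplace_mode(K, y)
```

Because the log posterior is concave, this iteration converges without any babysitting, which is the property the lecture is pointing at.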
That's also the bit where the world is still trying to find out what we actually need to do to get deep learning to work. So we are very close to the state of the art in machine learning, starting from next Monday. I'm hoping that you'll all be back by then, and I wish you a nice rest of the week.