Hello and welcome back to Probabilistic Machine Learning, lecture number 13. As we have now gone through twelve lectures, maybe it's time for a bit of a recap. We started by noticing, very fundamentally, that probabilities and probability distributions allow an extension of propositional logic to reasoning under uncertainty, by distributing truth over a space of hypotheses. Doing so can pose a computational challenge, because instead of keeping track of one single true statement, we now have to keep track of an entire space of hypotheses and assign non-binary truth values to all of them. This can be combinatorially hard, and so a large part of what we're doing in this course is coming up with good computational tricks for keeping this process tractable. One such trick is the use of conditional independence structure, as encoded by directed graphical models, to separate parts of the inference, at least conditionally, from each other. Another is the use of random numbers in Monte Carlo methods to compute what would otherwise be potentially intractable high-dimensional integrals. And we've spent quite a bit of time in the course with a third option, which is applicable when all variables in our model are only linearly related to each other, through linear maps. If that's the case, and we specifically choose to use only one particular family of probability distributions, the Gaussian family, then inference, instead of posing intractable high-dimensional integrals, reduces quite drastically in complexity to just linear algebra; and linear algebra is a kind of operation, a kind of domain, at which computers are particularly good.

We saw that this Gaussian framework can be used to infer not just individual variables but even non-parametric objects: functions that map from an input domain to a real-valued output domain. We saw that we can build such models in a parametric fashion, by deciding to use certain feature functions, and that we are quite free in how we choose these feature functions to map from the input domain to the output domain. One interesting aspect that already arose there is that, if we do so, the input domain can be more or less anything; it doesn't even have to be a real-valued vector space. You can learn such feature functions by another, computationally quite drastic, approximation: rather than keeping track of the probability distribution over the feature functions, you just maximize their assigned probability under a marginal model, by type-II maximum likelihood or maximum a posteriori estimation. This turned out to be connected to the idea of representation learning, or even deep learning if you like.

We then saw that there's an alternative approach: instead of deciding on specific feature functions, and thereby parametrizing the space of functions we can address, we can consider using infinitely many features; in the lingo of neural networks, this intuitively corresponds to using an infinitely wide network. This gives rise to non-parametric probabilistic models which use kernels to construct non-parametric probability distributions over functions, called Gaussian processes. We then spent a little bit of time getting to know the theory of these concepts. We saw that they are associated with quite interesting abstract mathematical notions: the hypothesis spaces are quite complex, and the entire framework is connected very concretely and precisely to the statistical viewpoint on kernel machines, through the notions of reproducing kernel Hilbert spaces, kernels, and least-squares regression. We also did an example to see how these algorithms can be used in practice to build expressive, structured models that allow a sort of basic scientific inference on unknown quantities, even ones that are hidden within several different sources in the data. And finally, we noticed that these models are also useful in a setting where the input domain is one-dimensional and ordered, so time-structured; this gives rise to a particular subclass of Gaussian regression algorithms which are of linear complexity in the number of data points, called filters.

So today I would, ostensibly at least at first, move a little bit away from the Gaussian framework (even though we'll notice that we won't quite be able to escape it yet), by posing a question that some of you might already have had on your minds. All of the models we've been talking about in the context of Gaussian inference assume that the output object we observe, let's call it y, is a real number, or a real-valued function, or a vector of real numbers. That's perhaps natural, because the Gaussian distribution is a probability distribution whose domain is the real line, so it's the natural object to use when we're trying to infer a quantity that has a real value. But of course not every data set out in the world consists of real numbers. To show you a plot: I'm now actually doing a 2D plot. So far we've looked at 1D plots, because they are particularly easy to think about, but sometimes there are data sets so simple that you can do a 2D plot and keep them easily interpretable. What we have here is a data set for, again, a supervised setting: inputs, in this case two real-valued inputs x1 and x2, and then outputs which aren't real-valued. They are not numbers like 3.25; instead they are binary: in or out, yes or no, one or zero, plus or minus one. It doesn't really matter how you represent them; they're just two different classes of labels, and therefore, of course, this kind of problem is called a classification problem. This is a particularly easy classification problem, one that I've constructed to make it as easy as possible. It's very easy to think about how you would solve it; maybe you can think about it for yourself for a second before I reveal a simple answer. The simple answer is that you can just draw a straight line through this data set, and then decide that everything above the line corresponds to the red class and everything below the line corresponds to the blue class. Now, obviously, not every data set is going to be like this. For example, it could be that the classes overlap a little bit, so that in this region we're not quite certain what the right class to assign should be. It's clear that over here the class is probably the blue one, and up here it's probably the red one, the solid one, but in between we need some kind of soft transition. And what is that going to be?
Well, it's obviously going to be a probability. Up here, our prediction for the red class, the solid class, is probably going to be one; down here the probability is probably going to be zero; and the probability for the other class is just one minus that probability. And then in between, maybe here, the probability should be something like 50 percent, roughly. So what we need is a smooth transition, and you can imagine that that's possible to realize with probabilities, even though I haven't really said how yet. In many ways this is still a very simple data set, because even though there is now a smooth transition, the transition boundary itself remains a straight line. Maybe a real-world data set looks more like this, where the classes are mixed together, but there are still domains, regions, cells, areas where one class clearly dominates over the other, just from observing this kind of, let's call it, training set. Having seen the previous two images, you can imagine that you can probably describe this kind of problem in the same way, except that there is now a nonlinear discrimination boundary between the classes: some kind of line that you can mentally draw along the separation between white and green in this image, and it's not a linear function anymore; it's a more complicated thing. And we also, again, have to use probabilities, maybe to explain regions like this where data points of different classes get quite close to each other, or maybe a region like this where we have so few training data points that we're not totally sure what the right class should be. And I'm hoping, actually, that having gone through the previous lectures, you aren't particularly surprised or fazed by these kinds of challenges, because this is very similar in nature to the step we took from learning a linear function to learning a nonlinear function. It's just that the output of this function is now not a real number anymore; instead it's a probability between zero and one, so actually a subset of the real line. Actually, in practice, and I have to show you this before we move on (this is a little bit orthogonal to what we're going to talk about in the rest of the lecture, and we'll talk a little more about it in the lecture afterwards), I just don't want to leave you with the incorrect impression that this is necessarily what real-world datasets look like. If you look at an actual data set, it's more likely to look like this. Here are two classes, now shown in green and blue; you see a blue class here and a green class here. One thing you might notice about this data set (normally this would be a point where I ask you a question and we'd have a conversation about it; instead I'm just going to tell you) is that there is something very important here, not necessarily about the input points as such, but about the distribution of the training points. Notice that when we talked about regression so far, about learning functions from input to output, we never really spoke about the distribution of the inputs in the training procedure. That was straightforwardly possible in regression, and it actually remains possible in classification, because we can of course describe this kind of problem in the same fashion as the previous one.
We can just assign a function that maps to the green class over here and over here, and to the blue class here and there, and then maybe something around 50 percent, away from 0 and 1, in this overlapping region here. But you might be thinking, and you're not wrong, that it's a bit tricky to make a statement about the predicted class over here. Because what we should be answering out here isn't really class one or class two; we should be answering "I don't know", because this point doesn't really seem to belong to either of the two classes: not on account of where it is in the input domain as such, but on account of where it is relative to the training data. In the lecture today we are not going to address this issue; we're just going to ignore it for a while and return to it in later lectures. This is one aspect of the problem that is often described as discriminative versus generative classification; that's just one way of looking at it. We are not going to talk about it today; I just want to point out that this is an aspect of real-world data. How do I know that it's an aspect of real-world data? Well, I know because this is actually real-world data, at least in the sense in which we usually use data in research in machine learning: this is a representation of a real-world data set that is available online and that is becoming increasingly popular for research. These two data points here and there don't just have two dimensions; what is plotted are just two of their principal components. They are actually images of size 28 by 28, and if you look at these two training data points, they look like this: they're images of clothing, which come from a variant of the MNIST data set (Fashion-MNIST) that was released a while ago and which you can find online. So real-world data does have this structure, and we'll have to think about it at some later point. But for today, let's instead focus on the fact that we are still looking at a problem we can describe in terms of a function that takes any input point in the domain, on x1 and x2, and returns, instead of a real-valued number that says "the value of this function is 3.25 at this point", a probability between zero and one (including zero and one) for one of the classes; the probability for the other class is then just one minus this probability. So here's a slide, not a grey one, but a nice one to take a quick break at, summarizing what I just did in pictures. Classification is another kind of machine learning problem we certainly have to deal with, because it arises in practice, and it is actually quite similar to the problems we've discussed so far. So far we've looked at so-called regression problems. These are supervised learning problems; supervised means that there are pairs of inputs and outputs, x and y, where the inputs come from some arbitrary domain and the outputs are real-valued or real-vector-valued. In classification, everything is almost exactly the same, but the outputs are not real-valued; instead, they are discrete values.
So they come from a set of classes, which we might label one through D, or we could call them red, blue, green, and so on; it doesn't really matter how you name the index space, but it's isomorphic to the discrete space from 1 to D. And there's a particularly interesting sub-case, where D is two: binary classification, between just two classes. Those were all the examples we've seen so far. To address regression, we constructed learning algorithms that learn functions mapping from X to Y, and then we put probability distributions over those functions. Now, we just saw that the answer to the classification task is actually quite similar: we just have to learn a specific kind of function, not a function that maps to the real line, but a function that maps to probabilities, so numbers between zero and one which sum to one across the classes. Of course, probabilities are just functions, as in the regression problem, but they are specific functions, because they are constrained to be positive and their elements have to sum to one. So they form a subclass of all functions; but they are still functions. And to remind you: in the probabilistic framework of regression, we assigned probability distributions to the values of these functions. So what we might now want to do, given that we have finite data, is to assign probability distributions to these specific functions we're using in classification, which happen to themselves be probabilities. So we are, again, going to put probabilities over probabilities. To make this process a little more precise, let's introduce a bit of notation. For the rest of this lecture we're going to talk specifically about the case of binary classification, because it simplifies things. That means we're going to consider training sets which consist of arbitrary inputs x, and labels which are not real-valued but are members of two different classes; to have a concrete instantiation, I'm going to denote these classes by plus one and minus one. This is actually quite convenient, because the sign can be used as a nice algebraic trick to simplify notation later on. What we're looking for, as the answer about which we are going to be uncertain, is a function that takes the input x and returns a value pi of x that lies between zero and one (including zero and one), and which can itself be interpreted as a conditional probability distribution for the unknown label y, which is plus or minus one, given the input x: defined to be pi of x for the first class, and one minus pi of x for the other class. That's really just a way of describing what we're looking for. And what we'd like to do now is assign a probability distribution over this unknown function pi.
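In symbols, the setup just described looks like this (a compact restatement in assumed notation, with X denoting the input domain; this is not notation taken verbatim from the slides):

$$ \mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, \quad y_i \in \{-1, +1\}, \qquad \pi : X \to [0, 1], $$
$$ p(y = +1 \mid x) = \pi(x), \qquad p(y = -1 \mid x) = 1 - \pi(x). $$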
Now, what I just said is that learning this function pi of x, which returns a probability for y at the location x, is almost like regression, in that we're learning a function. So far, in regression, we've learned real-valued outputs y at x, and we assumed that the observation model, the thing we got to see, the likelihood, is a Gaussian distribution, because y is a real number: an evaluation of the function f at x, corrupted by Gaussian noise of variance sigma squared. This is almost the situation we are now faced with; it's just that the domain is wrong, you could say. The function we are learning here isn't a real-valued function anymore; it lies between zero and one, and the observations are binary-valued: minus one and plus one. But actually it's a bit more than just the domain; the observation model is wrong, too. We're not getting to see the function value corrupted by noise; we're getting to see a class label that is drawn from the unknown probability. So our likelihood is going to have to be something else, and in fact maybe our prior has to be something else as well. So let's first talk about the prior, and I thought I'd draw a picture. In the lectures so far, we learned functions whose output space is the real line, and to do so we assigned prior distributions which were joint, complicated, covariance-structured objects; but at every single individual input location x, the prior marginal predictive distribution was a Gaussian distribution. Gaussian distributions are probability distributions defined over the real line, and they assign a probability, with some kind of bell-shaped form, to this unknown thing which we call the function value at x. So they have support over the entire real line. What we now need instead is an object that puts prior probability density across the simplex, the domain from 0 to 1, on which we define pi of x. And of course it would be wrong to use a Gaussian distribution for that, because a Gaussian distribution fundamentally assigns probability mass to the entire real line. So if we put a Gaussian here, even one that is quite peaked inside the right domain, then it is not going to be a proper probability distribution over this domain, because we're missing the mass that lies outside. So what could we do? Okay, we could reinvent the wheel and do something completely new. But we could also look at this and say: well, this is sort of almost the same; the only thing we need to change is the domain in which the object we want to learn lies. And maybe we could do that by inheriting some nice structure from the Gaussians, because Gaussians are so nice, we've gotten used to them, and we've realized that we can do Gaussian inference using linear algebra. So maybe we can keep some of this around, and just adapt the output domain a little. We could do this in various ways. We could take Gaussian distributions and rescale them, cutting off the bits out here; maybe that's the first thing you might think of, and then we'd have to rescale these truncated distributions, keeping in mind that we've cut off a little bit of probability mass, so that they still integrate to one. But there's another idea, which is even a bit more elegant: to instead use the transformation rule for probability density functions, for probability measures, and construct a random variable called pi from another random variable that is actually real-valued. What I mean by that is shown pictorially in this animation. Imagine we have a Gaussian process down here; this is actually a Gaussian process: a learning machine, a prior probability distribution that assigns, over an input domain X (here the real line), at every point of the input domain, a real-valued function value that lies on the real line along here. And here I've chosen a Gaussian process with a Gaussian kernel, because it produces nice smooth lines.
That's not really relevant; and I've put the prior mean function at plus 1.5 over here, just to make it a bit more interesting: it's not at zero, it's at a positive value, with a standard deviation of one, so the lines here are at two standard deviations. Okay. Now, one thing we could do is take this Gaussian process, take every single one of these green functions, which are possible hypotheses for the underlying function, and push them through some nonlinear transformation that takes the real line and squashes it so that it fits between zero and one. When we do so (and I haven't told you exactly how I do that yet), we get these black lines up here. So what I've done is take each of the three samples that are animated here and put it through some kind of squashing function that squashes it onto the domain from zero to one. What you can see is that we get out functions with the right kind of structure: they map from our input domain to the simplex [0, 1]. And because the transformation I've used is quite a smooth one, and in fact a monotonic one, if this value goes up then this value also goes up. You can see that these are now all lines that lie strictly between zero and one; and because the prior mean of the Gaussian process is not at zero, most of these functions typically aren't at 0.5, at 50 percent, but are pushed a little bit upwards, towards one. Of course, I could also create a prior that puts more mass towards zero, by moving the prior mean down here, or build any other kind of structure by changing the kernel of the Gaussian process or the mean function, scaling and shifting and moving things around, to create other hypotheses. This is great, because it means that if we can do this, then we can basically inherit all the great modeling power that we constructed in the previous lectures for Gaussian process regression, defined through kernels and all the transformations we can do on kernels to get derived Gaussian process models, and transport it to this new machine learning domain of classification. How have I actually created this particular picture? I've chosen a very specific kind of transformation from the real line to the simplex. Such functions are also called link functions, and here I've chosen a transformation of the real-valued function value f onto a probability pi, written sigma of f, which is a very popular function: the so-called sigmoid, or logistic function, one over one plus e to the minus its argument. You can see that if f is very large, then we have one over one plus something very small, so roughly one; and if f is extremely negative, a very large negative number, then we have one over one plus a very large number, so almost zero.
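Written out, the logistic link and its basic properties (including the derivative identity that comes up again below) are:

$$ \sigma(f) = \frac{1}{1 + e^{-f}}, \qquad \lim_{f \to \infty} \sigma(f) = 1, \qquad \lim_{f \to -\infty} \sigma(f) = 0, $$
$$ \sigma(-f) = 1 - \sigma(f), \qquad \frac{\mathrm{d}\sigma(f)}{\mathrm{d}f} = \sigma(f)\,\big(1 - \sigma(f)\big). $$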
Here's a picture of this. The black line is the transformation; you can see why it's called a sigmoid, because it looks a little bit like an S shape, mapping from the real-valued input domain to the simplex, the domain between zero and one. The grey lines are lines of equal distance on the input domain; you can see that they get squished on the output domain. And here are three different curves, each of which is a Gaussian distribution over the real-valued function value, and you can see that they get transformed into distributions on the output domain that are not Gaussian. This red curve actually almost looks like a Gaussian, but it isn't; it's a squashed version of a Gaussian. It has its mean at zero in the input domain and the center of its mass at 50 percent, and this actually happens to also be its mean, at 50 percent. So the transformation maps zero to one half, plus infinity to one, and minus infinity to zero. And Gaussian distributions that don't have a centered mean, or are maybe a little wide, get a very clearly non-Gaussian shape by being pushed around this way. This sigmoid, this particular one, the logistic function, is popular for various reasons in statistics and also in machine learning, and there are various interpretations of it. One that is sometimes cited, but is a little bit useless, is that this link function happens to be the cumulative density function of some complicated probability distribution; that's not particularly important as a reason for why we use it. What is nice about it: well, obviously it's symmetric in the way shown above; that's straightforward. But another nice thing is that its derivative, the derivative of the output with respect to the input, can be written in terms of the probabilities themselves, in the nice form d sigma / d f = sigma(f) (1 - sigma(f)). So, for those of you who like to see code rather than pictures and math, let me end this part of the lecture by showing you a little bit of Python code for how I created every single frame of this picture. We do that not just because this transformation is relatively straightforward, but to drive home again that there are intermediate steps here: we are trying to learn a function that is a probability, but the observations we are getting are binary values, discrete plus or minus one classes. So how would we create labels for a training set from this kind of function? That's what we're going to do to define our likelihood function. Well, let's see. What I have over here is essentially everything; I've blacked out a few things that aren't that important. I'm obviously loading a bunch of Python libraries somewhere hidden up here, and I'm also defining the necessary Gaussian process objects, kernels and mean functions and so on. You've seen me do that in various previous lectures, so I'm not going to do it here again; that's all happening among these little greyed-out dots up here. What I now do, additionally, is define this link function, which we're going to use to define the likelihood, or which essentially already defines the likelihood by doing so. And it is this:
the sigmoid function, the logistic function, which is 1 over 1 plus e to the minus its input. Now we can use that, together with a Gaussian process prior, to construct a prior distribution over probabilities rather than real-valued functions, and then use those probabilities to create a data set by drawing from them. So let's do that. Here I use my Gaussian process prior to define function values on the latent space; those are our Gaussian process function values. To do that, as you know from previous lectures, I have to compute a kernel Gram matrix and a prior mean function (and maybe something to do plots, which we won't actually need to draw our samples). And now I'm drawing random numbers. I'm doing it the professional way of drawing Gaussian random numbers, which is actually what happens inside your multivariate normal sampling routine in scipy or numpy: I take the kernel Gram matrix, make it definitely positive definite, take the Cholesky decomposition of it, multiply the Cholesky factor with some standard random numbers, and add the prior mean function. So now we have samples from our Gaussian process. And every single sample (which I've actually shown you over here; here I've used a slightly more fun kernel, the sum of a Gaussian kernel and a linear kernel, which gives functions that are sums of a linear function and a smooth Gaussian-kernel sample; here are two such samples as dashed lines), every single such line, can now be pushed through the link function. That's what happens here, and it gives us these black curves; notice that these lie between zero and one. And now we can use these defined probabilities to draw actual discrete class labels. How do you do that, in a generative fashion? Well, you just take uniform random numbers, distributed between zero and one, of the right size; these are the red dots that you see in the plots here. Then, for every single red dot, we check whether it is above or below the line. If it's below, then we draw class one; and if it's above, then we draw class minus one. You can maybe convince yourself that this means we draw class one with probability given by pi of x, the black curve, and class minus one with probability one minus pi of x. And then I actually plot these classes as black dots, up here and down here. Okay. So what we've just done is construct a generative model: a joint probability distribution over latent real-valued functions; latent probability-valued functions, which are directly computed by transporting, pushing, the real-valued functions onto the simplex; and discrete observations, generated by drawing from these probabilities. Now, what we need to do in a machine learning setting is the inverse problem: we are given observations that are discrete values, and we need to infer what the latent probability was that drew them. That latent probability is related, through this (by the way, invertible) transformation called the link function, to the value of the latent real-valued function from the Gaussian process. Doing so amounts to Bayesian inference, and that's what we're going to do now.
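Since the notebook itself isn't reproduced here, a minimal sketch of this generative process might look as follows; the kernel, mean, grid, and seed are illustrative assumptions, not the exact values from the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative input grid and GP prior (Gaussian/RBF kernel, constant mean 1.5)
x = np.linspace(-5.0, 5.0, 200)
mean = 1.5 * np.ones_like(x)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # kernel Gram matrix

def sigmoid(f):
    """Logistic link: squashes the real line onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

# draw a GP sample "the professional way": jitter + Cholesky + standard normals
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))
f = mean + L @ rng.standard_normal(len(x))

# push the latent sample through the link to get a probability function pi(x)
pi = sigmoid(f)

# generative sampling of labels: compare uniform random numbers to pi(x)
u = rng.uniform(size=len(x))
y = np.where(u < pi, 1, -1)  # class +1 with probability pi(x), else class -1
```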
So, what do we have to do to do Bayesian inference in this setting? Well, the same thing we have to do in any setting: we multiply the prior with the likelihood and divide by the evidence to get the posterior. Right, so let's do that. Our prior distribution on this latent function, which I call f, not pi (the function that has real values), is a Gaussian process. Okay, that's our prior. Great, so now we have one part of Bayes' theorem. The likelihood is given by the probability to observe the positive class, class one, defined through that latent function f. We just saw on the previous slide that we construct this probability by taking f, putting it through our logistic link function, and defining that to be the probability for the class. So it's sigma of f for the first class, and one minus sigma of f for the second class. And by the way, here is where our decision to denote the class labels by plus and minus one really pays off: the sigmoid function has the nice property that sigma of x equals one minus sigma of minus x, which is exactly the expression we need for the negative class. So we can simplify this annoying two-case expression into one simple line, where we just say that the probability to see class y, which is either plus or minus one, is given by sigma of y times f. Great, so this is our likelihood function; that's it, we have it. All we need to do now is multiply the two by each other and get our posterior up to normalization, right? Well, unfortunately, this is where things get a little bit hairy, because our likelihood function is not of Gaussian form, and therefore the product of prior and likelihood is not, in general, going to be a Gaussian. So things won't be as easy as they were previously. Here I write this out explicitly again: our assumption now is that every single observation is connected to the latent value f through this likelihood, and because this isn't a Gaussian function, we can't use the fact that the product of Gaussians is another Gaussian; our posterior distribution is therefore going to have an annoying, complicated form. Before I show you a picture of this, let me point out at this moment (because I'm going to use the term anyway, so I might as well say it now) that this particular sigmoid is called the logistic function, as I already mentioned, and therefore this process we're doing here, computing estimates for the latent function f that is linked to our observations through the logistic function, is called logistic regression. Whenever I say logistic regression, I mean exactly this process. There is also a statistical interpretation of this process that is also called logistic regression, and the two are of course very closely related. Another way to think about what's happening here is in log space. If you consider the logarithm of this posterior, which is interesting to study for all sorts of reasons we've previously discussed, then the prior is the exponential of a quadratic form, so its logarithm is a quadratic form.
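Written out in assumed notation (m_X and K the prior mean vector and kernel Gram matrix at the training inputs, f_i the latent function value at x_i), the unnormalized log posterior combines this quadratic term with the non-quadratic likelihood terms:

$$ \log p(f_X \mid y) = -\tfrac{1}{2} (f_X - m_X)^\top K^{-1} (f_X - m_X) + \sum_{i=1}^{n} \log \sigma(y_i f_i) + \text{const.}, $$

where each likelihood term is $\log \sigma(y_i f_i) = -\log\big(1 + e^{-y_i f_i}\big)$: concave, but not quadratic, in $f_i$.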
Here it is again, over the training inputs. But the likelihood function is of a form such that its logarithm is not a quadratic function. Remember the form of the logistic function: one over one plus e to the minus x. So its logarithm is something like the logarithm of one, which is zero, minus the logarithm of one plus e to the minus x; and one plus e to the minus x is not something you can take an easy logarithm of. This is also true in pictures. Let's say, to keep things simple, that we want to learn two different function values. Here's function value number one and here's function value number two. Under our Gaussian process prior, they are actually related to each other; they're not independent. So their joint distribution is this bivariate, two-dimensional Gaussian distribution with correlation. Their marginal distributions are of course Gaussian distributions; here is the marginal on f2 and the marginal on f1. Now let's say we observe a training datum at location number one; say we observe class plus one, the positive class. Our likelihood says that the probability for this to happen is given by the logistic function of f1. Here is this logistic function in blue; that's our likelihood function. So to get our posterior, we have to multiply this Gaussian distribution with this blue likelihood function. Here I've shown the likelihood function as a blue shading; that's a bit trivial, because it's just a smooth ramp. And if you multiply these two together (I actually do this here for every single pixel of this plot), then you get this complicated object back. That's our posterior distribution, up to normalization. Here I actually normalize it, and you see in red the marginal posterior distributions. You can see, especially at f1 but also at f2, that these are not Gaussian distributions. Clearly they are not Gaussian, as you can also see by looking at their joint form: this does not look like the exponential of a quadratic function. So, again, and we've discussed this in previous lectures as well, this is in the nature of Bayesian inference. We have decided to use a particular structure that we thought would make things convenient, and in fact it does make certain things convenient.
Otherwise we wouldn't still be talking about it. But because we have committed ourselves to the probabilistic framework, the philosophy of probabilistic inference, the notion of distributing truth according to the rules of set theory, we are forced to do inference, at least in principle, in the form of Bayesian inference. And that just means that if we want to know the truth, all we can say is that the truth looks like this: a distribution of truth values over this domain, in this way. Now, this is fine to do in 2D, because I can make this plot. I made it explicitly, by creating an image with pixels on this two-dimensional space and then multiplying priors and likelihoods, and you get the actual posterior distribution. But of course we won't be able to do this in a higher-dimensional space, in the kind of space that I've shown you on previous slides. If we had even just 10 function values to consider simultaneously, I would need to create a 10-dimensional array, with an exponentially large number of voxels, to fill with these posterior values, and that would be computationally entirely intractable. So we are back at our fundamental problem of probabilistic inference, which is the computational one. If we want to make any kind of meaningful statement, we have to find a computational trick, or an approximation, that allows us to capture interesting aspects of this posterior distribution, and then talk about it. We already found one way to do so in previous lectures: we could try to find just the maximum of this distribution, which is probably going to be somewhere around here. This, in fact, is what in the statistical framework of machine learning is called logistic regression. But maybe, because this is a probabilistic machine learning class, we don't quite want to give up this much and just construct a point estimate. Maybe we want to capture a little more of the probabilistic aspect of this posterior distribution: not just its maximum, but maybe its location, maybe its shape, maybe something about the fact that it isn't just a point, that it's actually a cloud. And that's what we're going to do now. Instead of keeping track of this entire thing in its non-parametric, non-analytic form, maybe there are certain quantities we can use to capture an approximate description of this probability density function that is the posterior. For example, we might be interested in the mean of this distribution, which is somewhere over here (not the same as the mode), or in its variance. But notice that computing those is actually kind of tricky, because to get mean and variance you need to compute integrals.
So here they are again. This is a complicated slide, sorry; you can stop the video and read the text if you want to, but here's the part you need to look at. If you wanted to compute the mean, you would have to compute this object, which is an integral against the posterior; for the variance, you would have to compute this other object, also an integral against the posterior. Now, we don't really know how to do these integrals, because the posterior has this complicated form. So if we don't know how to compute the integral, maybe we can do something else, and that is going to be something we add to our toolbox: we first find the maximum of this distribution, the point where it has its highest value, which gives a point estimate, a maximum a posteriori estimate; and then we describe the geometry of this object around this maximum, at the location of the maximum, by evaluating its curvature, and use that curvature to construct a Gaussian shape. This idea is called a Laplace approximation, and it is so important that we will add it to our overall toolbox. It's called the Laplace approximation because it goes back to Pierre-Simon, Marquis de Laplace, who introduced it in 1814 in his Théorie analytique des probabilités (Analytic Theory of Probabilities), to actually solve a problem that is the basic form of what we are discussing today. It's the problem we already discussed in lecture number three: inferring an individual probability, for example the probability for an individual person to wear glasses, by observing positive or negative samples from the distribution. In lecture number three we saw that today we can do this in closed form, because it is a univariate problem, and there is a corresponding integral, called the beta integral, that we can just compute on a computer. Laplace couldn't do that, so he had to approximate his beta integral, which doesn't have an analytic form, with a Gaussian approximation, to get an analytic form. He did that by finding the mode of the distribution, taking the logarithm, performing a quadratic approximation to it, and then calling that quadratic approximation the logarithm of a Gaussian, because Gaussians are exponentials of quadratic forms. Maybe that was a little bit fast, so let's do it in two more ways: not just by waving my hands around and pointing at complicated old French expositions, but by showing you first a picture and then the corresponding derivation. So here's the picture: our posterior distribution again. From the previous slides, let me remind you that this elongated, non-Gaussian red shape, shown as equipotential lines, is our true posterior. Now what we can do is find the mode of this posterior distribution, the point where it has its highest value. That's easier than computing its moments, because it poses an optimization problem, and optimization problems are easier to solve. Once we have this point, we can compute the second derivative at this mode (we could also compute the first derivative, but the first derivative is going to be zero, because we're at a mode). That's a matrix, a Hessian matrix of curvature, and that's also easier to compute than moments, because it's a derivative, a second derivative.
And as you know from previous lectures, computing derivatives is an easy thing; much easier than integration, than summing up volume. So once we have that curvature, which is represented by these black ellipsoids here, we can treat it as essentially the logarithm of a Gaussian, because Gaussians are e to a quadratic form, a second-order term. And that gives rise to an approximate posterior distribution, which I'm plotting here as these green objects: this green curve and this green curve. You can already see, before I show you the math, by comparing the red curve, the true posterior, to the green curve, the approximation to it, that this is of course not a perfect approximation. First, let's look at the function value at the location we actually directly observe. The mode of this Gaussian coincides with the mode of the posterior, of course, because that's how we've constructed it. However, the mass is not actually distributed in the same way as in the posterior. The posterior has what you might call a heavy tail on the right-hand side and a weak tail on the left-hand side. This is not captured by the Gaussian, because the Gaussian necessarily has a symmetric form; that's just what Gaussians are, they are symmetric. In fact, you can even think of situations in which the shape is so nasty that this local approximation is arbitrarily bad. If you want me to draw a picture, I can very briefly, maybe just to drill this point home: imagine you have a distribution that looks like this. Then the Laplace approximation is going to find this point and fit this kind of Gaussian approximation onto it. That's obviously a very bad approximation, because the true mass of this distribution is all over here. And this is maybe not even the extreme case; maybe the true distribution looks only like this. Then we still potentially get an approximation like this, but now it might be even clearer that everything should actually be happening over here, in a region that this approximation does not capture at all. So one downside of Laplace approximations is that they can be arbitrarily wrong. An upside, though, is that because they are based on a MAP estimate, they are at most as wrong as the point estimate that lies at their center. Now, maybe just to make this clear as well, let me point out one more feature of this plot, by looking at the predictive point. You could think of this here as a one-dimensional training set, and of this as a one-dimensional test set, right?
That's some other point where we want to predict. Here in red is the true posterior, and in green is the approximate posterior we get from this geometric, or Laplace, approximation. Notice that this green curve (first of all, of course, it's Gaussian, and it shouldn't be, because the true curve isn't Gaussian; but okay, we have decided to approximate with a Gaussian) also has its mean and mode not at the mode or the mean of the true posterior. And the reason for that is the non-Gaussian shape of the joint. So while we have constructed the approximation to match the mode of the posterior here, this is not necessarily true for predictive points later on, because they are not part of the way the approximation is constructed. Okay, so that was the pictorial view. Now let me end this section of the lecture with another slide, which you should consider a grey slide, that formally defines what we just did. Let's say here is a function p of theta that we want to approximate. In our case, this p of theta is actually a posterior distribution that arises from some data by multiplying a prior with a complicated likelihood; that's the typical setting, but what we're going to do has more or less nothing to do with this structure. It just requires us to be able to evaluate p of theta and its derivatives, and we assume that the first two derivatives exist. So what we do is first find a local maximum of p of theta, which, equivalently, is also a local maximum of the logarithm of p of theta, because the logarithm is a monotonic transformation. We find it using an optimization algorithm. Optimization algorithms usually require access to the gradient of this log p, which is easy to compute these days with automatic differentiation, and we optimize to find a point where the gradient is zero. Now, at this point, which is a local argmax, let's take a second-order Taylor expansion of log p around the mode. That Taylor expansion will consist of (remind yourself of your multivariate calculus classes) a zeroth-order term, a constant; a first-order term, a linear term; a second-order term, a quadratic term; and then higher-order terms. The zeroth-order term, the constant, is just the function we are approximating, evaluated at the location: that's log p of theta hat, our point estimate, our maximum a posteriori estimate in this case. The first-order term in this expansion in delta is delta transpose times the gradient of log p at theta hat, and that gradient is zero; so there is no linear term here. The second-order term is given by a quadratic form, and then there are higher-order terms, of order delta cubed and beyond. This quadratic form we're now going to interpret as the logarithm of a Gaussian distribution: we drop all terms of higher order and take e to this expression. Then, up to normalization (that's e to a constant; let's call it a normalization constant), we have e to one half times delta transpose times some matrix of second derivatives, the Hessian matrix at location theta hat. Let's call that matrix Psi: so, one half times delta transpose times Psi times delta.
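As a compact summary of this expansion (in assumed notation):

$$ \log p(\hat{\theta} + \delta) = \log p(\hat{\theta}) + \delta^\top \underbrace{\nabla \log p(\hat{\theta})}_{= 0} + \tfrac{1}{2}\, \delta^\top \Psi\, \delta + O(\|\delta\|^3), \qquad \Psi := \left. \nabla \nabla^\top \log p(\theta) \right|_{\theta = \hat{\theta}}. $$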
So this is almost like a Gaussian Let me remind you that Gaussian distributions are have a pdf that is of the form I'll write it down so that it's maybe particularly clear Gaussian of x given mu And sigma is given by Unnormalization constant Which is not important today times the exponential of minus one half times x minus mu transpose times Sigma inverse times x minus mu So notice that this expression Looks a lot like what we have on the slide here We just need to be a little bit careful with the definition of all the quantities So if we call this expression delta Then we need we can identify sigma with minus psi inverse assuming that it can be inverted Because they need a minus in front, right? and the point where Delta is zero Is the point where this expression is maximized It's mu the mean of the Gaussian and that is going to be theta hat So our Gaussian approximation Due to Laplace is given by A distribution that which we might call q so q is the standard notation for approximations and approximate Bayesian inference That is a Gaussian distribution over the parameters theta at theta hat With mean theta hat and with a covariance that is given by minus the inverse of the Hessian of the logarithm at the mode notice by the way That if p is actually Gaussian Then the Laplace approximation is perfect Now in the picture I've just shown you on the previous slide. Here we go again. I've actually used this approximation that I just introduced as a generic approximation in the Maybe more specific setting of this kind of logistic Approximation or logistic regression One aspect of that is that um because the likelihood looks like this We um Can be relatively certain that this that this problem I just drew on the whiteboard behind me of having a posterior distribution that has a little peak and then Just all of its mass somewhere else is not going to be as pronounced as in the picture I drew because the prior is of this Gaussian shape and the likelihood is of this sigmoidal shape So therefore these two together are not going to create a little bump somewhere That then goes down with high curvature and then becomes flat again Instead we're going to inherit the curvature from these two distributions at these initial points and then towards the right They are going to be Maybe as symmetric as in this example, but they will not Decay and then flatten out again A maybe more formal way to phrase this is that we will find the corresponding problem is convex And I'll show you that that's actually the case in a more formal way later on another thing That is explicit in this picture already, which we should maybe discuss a little bit further Is that we're going to build a machine learning algorithm from this that builds at a plus approximation at the training locations So in this one-dimensional domain here and then uses that gaussian approximation. 
That's this green curve here. We already discussed what this does in the picture; let's now talk about what we can say about the quality, or the effect, of this approximation in a slightly more formal way. To do so, I'll point out something you can also find in the wonderful textbook by Carl Rasmussen and Chris Williams, Gaussian Processes for Machine Learning (2006). Actually, the rest of this entire lecture is based on the exposition in that book, and I strongly recommend that you have a look at it if you want to see many more details. It's a beautiful book, freely available online as a PDF, and even though it's now 14 years old, it still holds up really well; it's one of the most amazing textbooks for machine learning. So let me remind you again what we're going to do. We're going to do a Laplace approximation: we'll find the mode of the posterior distribution over the latent function values at the training points, at capital X; then we'll compute the Hessian of this log posterior over the training function values f at capital X, invert it, write a minus sign in front, and that's going to be our training covariance, if you like. That gives us something like an approximate Gaussian likelihood for the training algorithm. Notice, computationally, that doing so is about as expensive as Gaussian process regression. We need to build this matrix, so we need to compute derivatives; but, as you know from previous lectures, computing derivatives is cheap, about as expensive as evaluating the function itself, and we can do it with automatic differentiation. Then we have to invert that matrix, and that matrix is exactly the same size as the Gram matrix we would get if we did regression, so inverting it is as expensive as the corresponding step in regression, at least in principle. Once we've done that, we've essentially constructed a training likelihood for our observations, even though they are really classification observations, at the training input locations capital X. Now we're going to use these to predict at test locations. What are we actually predicting there? Well, eventually we are going to predict test classification outputs: labels, binary plus or minus one. But to do so, we first have to predict the latent function f at the test locations, little x. So we are going to construct an approximate posterior predictive distribution at these test locations, f at little x, and we'll do that, of course, using the rules of probability. We write down the marginal distribution over the test function values: we write the joint distribution over the training and test function values, and use the product rule to separate them. In the exact computation, we would here have p of f at capital X given y, this object, but we're going to replace it with our q, our approximation, and that makes things tractable. And now we just have a Gaussian, because that's what our Gaussian process prior says, times another Gaussian, and things are going to be easy again. So, the predictive distribution for the test function values f at little x, given the training latent function values f at capital X:
that's just our predictive Gaussian distribution. This is one more of these applications of linear algebra, the conditional distribution of one Gaussian random variable given other Gaussian random variables, and you've now seen these expressions often enough; they look like this. This is a Gaussian distribution over the test latent function values, with a mean given by the prior predictive mean, plus the covariance between test and training locations, times the inverse prior covariance matrix of the training locations, times the difference between the true value of the training latent variable f at capital X and its prior mean. Now, we don't know this quantity, but we have a Gaussian (approximate) distribution over it, which is going to make things simple. There's a corresponding covariance, and what we have here is, under the approximation, just the marginal of the product of two Gaussians. We know from the lecture on the properties of Gaussian distributions that we can simply compute this, and we find that we get a Gaussian predictive distribution over the test locations. Its mean is given by this expression, where we've replaced the actual value of the latent quantity with the mean of our Gaussian approximation under the Laplace approximation. And its variance, or covariance matrix, is given by this expression, which is the prior conditional predictive covariance plus a linear operator, applied from the left and the right to our Sigma hat; and that quantity is the one we computed before, the negative inverse Hessian of the log posterior at the mode. This is the kind of expression you also know and expect from the evidence computations in Gaussian process regression. Now, how good is this predictive approximation going to be? It turns out that we can quantify this relatively well, and again, this is due to Section 3.4 in Rasmussen and Williams' book. We can think about what our prediction would be if we actually had access to the true probability distribution over the training and test function values given the data, the true posterior. If we had that, then we can again use the same kind of trick: we can separate the joint distribution over training and test values into a predictive distribution of test given training, times the posterior distribution over the training values. That's just using the product rule; there is no approximation here. And now we think about what this expectation is. It separates into two different expected values, two different integrals: an integral over the latent function at the training locations, and another integral over f at little x, the test locations. Because little x only shows up in this inner expression, we can move everything else outside of that integral, and write this as the expected value, under the predictive distribution, of the latent function at the test locations, given the true value of the latent function at the training locations. And we know what that is, because it's just a Gaussian integral: the predictive distribution for f at the test locations, given f at the training locations, is just a Gaussian conditional distribution, exactly the one up here.
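Written out explicitly, in assumed notation (m the prior mean function, k the kernel, K = k_XX the training Gram matrix, and f hat, Sigma hat the mode and covariance of the Laplace approximation; the symbols are mine, not taken verbatim from the slides), the two steps are:

$$ p(f_x \mid f_X) = \mathcal{N}\big(f_x;\; m_x + k_{xX} K^{-1} (f_X - m_X),\;\; k_{xx} - k_{xX} K^{-1} k_{Xx}\big), $$

and marginalizing over $q(f_X) = \mathcal{N}(f_X; \hat{f}, \hat{\Sigma})$ gives

$$ \mathbb{E}_q[f_x] = m_x + k_{xX} K^{-1} (\hat{f} - m_X), \qquad \mathbb{V}_q[f_x] = k_{xx} - k_{xX} K^{-1} k_{Xx} + k_{xX} K^{-1} \hat{\Sigma} K^{-1} k_{Xx}. $$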
Now, how good is this approximation going to be? It turns out we can quantify this relatively well, and again this is due to Section 3.4 of Rasmussen and Williams' book. We can think about what our prediction would be if we actually had access to the true posterior distribution over the training and test values, given the data. If we had that, we could use the same trick: separate the joint distribution over training and test values into a predictive distribution at test given training, times the posterior over training values — just the product rule, no approximation. The expectation then separates into two integrals: an integral over the latent function values at the training locations, and another integral over f at little x, the test locations. Because little x only shows up in the inner expression, we can move everything else outside of the inner integral, and the inner integral is the expected value of the latent function at the test locations given the true value at the training locations. We know what that is, because it's just a Gaussian integral: the predictive distribution for f at the test locations given f at the training locations is exactly the Gaussian conditional distribution up here. Under that distribution, the expected value of f at little x is just the mean expression above, so we plug that in. We are then left with the outer integral over f at capital X under p of f at capital X given y. Because f at capital X shows up linearly in the expression, we can take everything else outside of the integral, and what remains is just an expected value.

Now compare this expression with the expression up here, our predicted mean under the approximation, and you see that the two statements are almost identical — up to the fact that we've replaced the actual posterior mean over the training values with the mode of the posterior distribution at the training locations. So what we've done is, literally — that's what this picture says as well — we've replaced the mean of this red curve with its mode, and other than that we've changed nothing as far as the mean is concerned. We can make a corresponding statement about the variance, which is a little more tricky: you essentially run the same argument I just made for the mean and keep track of all the squared terms of f that show up. I'm not going to do this here; you can do it yourself or look it up in Rasmussen and Williams' book. The result is an analogous statement: the true predictive variance at any test location is given by an expression that looks a lot like what we compute under our approximation, except that the true posterior variance is replaced by the curvature — actually the negative inverse curvature — of the log posterior at the training locations. So in this setting of Gaussian process classification, or Bayesian logistic classification, what the Laplace approximation amounts to is capturing the true first and second moments of the predictive distribution, except that all quantities relating to the training posterior replace true moments with a polynomial approximation of the log posterior around the mode. That's what the Laplace approximation really is.

By the way, so far we've only made predictions about the latent quantity. If you want to make predictions about class labels, which is the typical case — so if you want to predict plus or minus one — then you have to write down the predictive probability pi as a function of f, which is just the sigmoid of f. This whole thing then has a new, derived distribution, because pi of x is a random variable constructed from f. You might care about properties of that distribution, and here again you might decide to just compute moments. One interesting quantity is the first moment, the expected class probability at a particular location. That requires an integral, and it turns out this integral is tricky again: we have a logistic function and a Gaussian distribution, and the expected value of a logistic under a Gaussian is not available in closed form. But there are good approximations for it, and it's a one-dimensional integral, so it can be computed very efficiently.
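Because it's a one-dimensional Gaussian integral, a handful of quadrature nodes is enough. As a minimal sketch — my own illustration, not code from the lecture — Gauss–Hermite quadrature gives the expected class probability like this:

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

def expected_class_probability(mu, var, n_nodes=32):
    """Approximate E[sigma(f*)] for f* ~ N(mu, var) with Gauss-Hermite
    quadrature (probabilists' convention, weight exp(-t^2 / 2))."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    f = mu + np.sqrt(var) * nodes   # change of variables f = mu + sqrt(var) * t
    return (weights * expit(f)).sum() / np.sqrt(2.0 * np.pi)
```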
You could also decide to do a more simplistic approximation and carry on with this idea of approximating non-Gaussian quantities with polynomial approximations: just compute a maximum a posteriori prediction for the class, by taking the posterior distribution over the latent quantity and pushing its mode through the logistic sigmoid. We'll do both at the end of this lecture. Until then, let me summarize: the Laplace approximation is only a rough approximation, motivated by a geometric argument, and because it is a local approximation it can be arbitrarily wrong. However, it is very efficient to implement, because you only have to compute gradients to find modes, and then Hessian matrices. Hessians are not free, but they are relatively cheap to compute, and they give rise to matrices of the same size as the ones you need in Gaussian process regression anyway, so they are tractable. And for logistic regression, or Bayesian Gaussian process classification, this process tends to work quite well, because — as we'll see in a moment — the log posterior is actually a concave function, so the optimization works well and the approximation is usually a good one; the algebraic structure of this link function leads to the almost-Gaussian posterior you've seen above. This is a good slide at which to take a quick break.

Now, if you actually want to implement this algorithm for Gaussian process classification, you have to jump through a few numerical hoops. Doing this in all detail takes a little too long for the lecture, and it can be tedious if you just watch me do it. So instead I'm going to show you, at a high level, how to implement it, following in the footsteps of Carl Rasmussen and Chris Williams, and I will upload to ILIAS a detailed Jupyter notebook written by Alexandra Gessner that goes through a concrete example on one particular data set. So if you want to see how it's really done in Python, you can do so yourself instead of watching me slowly go through it. Let's do the derivations symbolically instead, because they are actually quite interesting and highlight some structure.

To start, let's see which quantities we need. We're going to multiply our Gaussian process prior with a non-Gaussian likelihood that consists of a product of individual logistic terms, where the logistic function is given by this sigmoid. We're going to do optimization, and of course you're free to choose any optimizer you like, but as a standard setting that works well for small and medium-sized data sets we'll use Newton optimization — partly because I want to drive home the fact that you don't always have to do gradient descent, which is in many ways a fairly blunt algorithm that is popular in deep learning for numerical reasons to do with stability under stochastic noise, something we don't care about here. To implement this optimizer, and the Laplace approximation afterwards, we need the objective function — the logarithm of the posterior — its gradient, and its Hessian, because that's how we get our Laplace approximation; and while we have the Hessian, we might as well use it to do Newton-style optimization.
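Written out, the model we're working with looks like this — a sketch in the book's notation, assuming a zero-mean prior:

$$
p(f_X) = \mathcal{N}(f_X;\, 0,\, K), \qquad p(y \mid f_X) = \prod_{i=1}^{n} \sigma\big(y_i f(x_i)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} .
$$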
So what are those quantities? First of all, the logarithm of the posterior is given by the logarithm of Bayes' theorem: the log likelihood plus the log prior minus the log evidence. The evidence doesn't depend on f, so it's just a constant. Now let's plug in what we know about these quantities. The log likelihood is the logarithm of this product, which turns into a sum of logarithms of the individual terms. Each term has this form — there should be a z here, sorry — so its logarithm is the logarithm of one, which is zero, minus the logarithm of the inner expression: minus log of one plus e to the minus y_i f of x_i. That's a little annoying at this point: if there weren't a one in there we could take the logarithm inside, but we can't, so let's leave it. What's the log prior? That's the logarithm of a Gaussian, and Gaussians are exponentials of negative quadratic forms, so here is the quadratic form we have to care about, and then there's a bunch of normalization constants, especially in the prior; let's push those into the constant, since they don't depend on f at X.

What's the gradient of this expression? We can drag the gradient through the sum, and then we need the gradients of these log-logistic terms. You can stare at this expression for a while to find a fun simplification, but the really interesting bit is that this expression does not contain any of the other function values: the likelihood for an individual datum depends only on the local datum and the local function value. That's the i.i.d. assumption in our observations — the fact that the likelihood separates into a product — and therefore the elements of this gradient depend only on the individual function values. That will be useful when we compute the Hessian, because it means the likelihood's Hessian is diagonal; that's where this Kronecker delta delta_ij in the term afterwards comes from. You can convince yourself that this is true — it's easiest to just do the derivation yourself if you don't believe me. If you do believe me, I can simply tell you that the derivative of this expression with respect to f of x_i is given by that expression, and it works whether y_i is plus or minus one — but it does require y_i to be either plus or minus one. By the way, the gradient of course also has this term here from the prior. That's just the gradient of a quadratic form, something you might know by heart from multivariate calculus; if you don't, it's the multivariate analogue of the statement that the derivative of x squared with respect to x is two x: the two comes down, the one half goes, and we are left with the matrix times f.
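As a sketch, the quantities just described come out as follows, for labels $y_i \in \{-1, +1\}$:

$$
\log p(f \mid y) = -\sum_{i=1}^{n} \log\!\big(1 + e^{-y_i f_i}\big) \;-\; \tfrac{1}{2}\, f^{\top} K^{-1} f \;+\; \text{const.}, \qquad
\frac{\partial \log p(y_i \mid f_i)}{\partial f_i} = \frac{y_i + 1}{2} - \sigma(f_i),
$$

so the full gradient is $\nabla \log p(f \mid y) = \nabla \log p(y \mid f) - K^{-1} f$.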
To get the Hessian we take a second derivative. We move it through the sum, and here we get a very nice structure: from the prior we get the inverse of the kernel Gram matrix itself, and, as I already pointed out above, the second derivative of the likelihood is a diagonal matrix which has this quantity as its diagonal entries. Notice that sigma of f_i is a number between 0 and 1, and therefore one minus sigma of f_i is also a number between 0 and 1, so their product is a number between 0 and 1 — in particular, it's positive — and with the minus sign in front we get a negative number. So if we define a diagonal matrix capital W containing the entries of this vector on its diagonal, then the Hessian is a sum of two matrices, both with a minus sign in front. We can pull the minus outside: the Hessian is minus the sum of a diagonal matrix with positive entries and the inverse of our kernel Gram matrix. Remember that the kernel Gram matrix is symmetric positive definite, because it's a kernel matrix, and W is a diagonal matrix with positive entries, so it is trivially positive definite. Therefore we have a sum of two positive definite matrices, which is itself positive definite, as we've pointed out several times. So the Hessian is minus a positive definite matrix, i.e. a negative definite matrix, which means the function we're maximizing is concave; equivalently, if we minimize minus the log posterior, the minus goes away and we have a convex problem with a positive definite Hessian. That's good, because first of all optimization is going to work well — there is a unique optimum — and it also gives us hope that the Gaussian approximation we get in the end will not suffer from the extreme situation I outlined as a potential pitfall of Laplace approximations on the whiteboard before.

Okay, now let's implement the actual optimizer. As I said, we'll use a standard one, the Newton optimizer. If you don't know how Newton–Raphson optimization works, that's not important — I'll just tell you how it works, and you'll have to believe me that it's a good optimizer. If you do know it, then you know that Newton–Raphson optimization consists of individual steps in which we locally approximate the objective by a quadratic and jump to its stationary point. The iterative update — let's call the iterate f directly, since that's the argument of the quantity we want to optimize — sets the new f to the previous estimate minus the inverse Hessian of our objective times its gradient, both evaluated at f_i, where L is the function we want to minimize — in our case,
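In symbols — again a sketch, with W defined as just described:

$$
\nabla\nabla \log p(f \mid y) = -W - K^{-1}, \qquad W_{ii} = \sigma(f_i)\,\big(1 - \sigma(f_i)\big) > 0,
$$

and the Newton–Raphson update reads

$$
f^{(t+1)} = f^{(t)} + \big(K^{-1} + W\big)^{-1}\Big(\nabla \log p\big(y \mid f^{(t)}\big) - K^{-1} f^{(t)}\Big).
$$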
the negative log posterior. So here is a generic implementation of Newton–Raphson optimization: we evaluate our objective initially, and then we iterate — computing a gradient and an inverse Hessian, multiplying the two together, using that as an update — and repeat until the update becomes so small that we consider ourselves converged. To apply this to our concrete problem of Bayesian logistic regression, or Gaussian process classification, we plug in the concrete quantities derived on the previous slide. The gradient, as we found, is the sum of two terms, one the gradient of the likelihood and the other the gradient of the prior. The Hessian has a similar structure: it is (minus) the sum of two terms, the inverse kernel Gram matrix and the diagonal positive definite matrix built from the second derivatives of the likelihood, and the Newton step needs the inverse of that sum. Those can then simply be plugged in. Again, this is a very high-level view; if you want to see how to implement this efficiently in practice, look at the code I will upload to ILIAS. There you will notice a trick that is often used: this matrix here can be numerically unstable, because K is often badly conditioned. So what people usually do is pull W out of the matrix via its Cholesky factor — trivial, because W is diagonal — and then work with this other matrix instead, which is easier to handle because it is the identity matrix, which is particularly well conditioned, plus a correction term. All eigenvalues of this matrix are larger than one, because the correction term is positive definite, and so things become numerically stable. But that's just a side note; these are the kind of tedious details that actually have to be done.

Now, if we want to predict — that's what we do after training — this algorithm lets us compute all the quantities we need for our approximate posterior at the training points, and to predict at some test points in the future we need the computation we looked at on the earlier slide: a predictive distribution at the test locations with this mean and this covariance. Clearly, to do so we need to compute these quantities, and they depend on the quantities just computed in our training algorithm: the training algorithm gives us this distribution with its two parameters, and now we just need to compute the rest to make predictions. This can be done quite efficiently using some cool tricks — again maybe too tedious if you don't want the exact details, but for those who do: to predict, we want to construct the objects from the previous slide, and there is a really fun observation. We need a quantity that involves this predictive term multiplied from the left with k of little x, capital X. But we just did optimization.
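To make this concrete, here is a minimal Python sketch of the stable Newton iteration in the style of Rasmussen and Williams' Algorithm 3.1 — the variable names are mine, and the notebook on ILIAS is the place to see it done properly:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.special import expit  # logistic sigmoid

def laplace_mode(K, y, max_iter=100, tol=1e-9):
    """Find the mode f_hat of the GP-classification posterior with a logistic
    likelihood, via the numerically stable Newton iteration (Rasmussen &
    Williams, Alg. 3.1). K is the kernel Gram matrix; y has entries in {-1, +1}."""
    n = len(y)
    f = np.zeros(n)                         # initialize at the prior mean
    for _ in range(max_iter):
        pi = expit(f)                       # sigma(f_i)
        grad = (y + 1) / 2 - pi             # gradient of the log likelihood
        w = pi * (1 - pi)                   # diagonal of W, all entries positive
        sw = np.sqrt(w)
        B = np.eye(n) + sw[:, None] * K * sw[None, :]   # B = I + W^1/2 K W^1/2
        L = cholesky(B, lower=True)
        b = w * f + grad
        # Newton step written as f_new = K a, with
        # a = b - W^1/2 B^{-1} W^1/2 K b   (matrix inversion lemma)
        v = solve_triangular(L, sw * (K @ b), lower=True)
        a = b - sw * solve_triangular(L.T, v, lower=False)
        f_new = K @ a
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f
```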
We just found a point where this gradient is zero, so we actually know that this expression equals the vector r, the likelihood gradient at the mode, which we have already computed. So we can compute the predictive mean without any further matrix inversions: just take r and multiply it from the left with the covariance term, the kernel evaluations between test and training points. Similarly, to compute the predictive variance, there is a quantity that involves the Cholesky decomposition of the matrix B from the previous slide — the numerically stable object down here — and we just need to multiply by the square-root factor of W to make the correction for B. Notice again that this is easy, because W is a diagonal matrix. This gives a quantity whose inner product with itself we subtract from the prior covariance at the test locations to get the predictive covariance. Now, as I said on the previous slide — on this slide here — this only gives us q of f at the test points. If you want to predict the class labels, the pi, then you need to compute this integral, which is the final step of the algorithm, and you can do that in the two ways outlined before: either actually compute the integral to get an expected class probability, or simply evaluate the sigmoid at the posterior mean. So that's the implementation you actually have to do — the numerical part — and again, computations matter in Bayesian inference: we know which posterior we're chasing, but it can be intractable, so to get a good approximation we really have to think carefully about the approximations.

I will end this lecture with a little pictorial view of all this. Again, this is one of those plots of mine that many people may find confusing, but that those who really want visualizations may find useful, and can pore over. What I'm showing here, at the top, is — let's call it the latent function space, the space in which f is a real-valued function; and down here the interval between zero and one, with the training observations. What I did here is show you both the ground truth and the posterior, because I created the data set by actually drawing from the prior.
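As a hedged sketch, the prediction step might look like this — in the style of Rasmussen and Williams' Algorithm 3.2, reusing the imports and conventions from the training sketch above:

```python
def laplace_predict(K, y, f_hat, k_star, k_ss):
    """Approximate latent predictive mean and variance at one test point.
    k_star = k(X, x*) as a vector, k_ss = k(x*, x*); y has entries in {-1, +1}."""
    pi = expit(f_hat)
    grad = (y + 1) / 2 - pi           # at the mode this equals K^{-1} f_hat ("r")
    w = pi * (1 - pi)
    sw = np.sqrt(w)
    B = np.eye(len(y)) + sw[:, None] * K * sw[None, :]
    L = cholesky(B, lower=True)
    mean = k_star @ grad              # predictive mean: no matrix inversion needed
    v = solve_triangular(L, sw * k_star, lower=True)
    var = k_ss - v @ v                # predictive variance
    return mean, var
```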
That's this blue curve here. I took that blue curve, evaluated it at a bunch of points — these circles — and transformed the function through the sigmoid to get a class probability at each point; these have the particular values marked by the solid squares here. Then at each of these solid squares I drew a uniform random number between zero and one — those you can see here — and whenever the number fell below the black line, I returned a positive training example (here, here and here, as squares), and whenever it fell above, a negative one. So this is another plot of the form I showed you halfway through the lecture in our Jupyter notebook.

Now what we construct is a posterior distribution over the latent quantity — a Gaussian process posterior — which we can use to predict at the training locations as well, and that is going to be this red object: here it is in the latent function space, and here are the corresponding samples pushed through the sigmoid transformation. In this plot you don't see the posterior yet — in fact you don't even see the prior; you see the initialization of the Newton optimizer, which has a covariance matrix given by the inverse of the sum of these two matrices. I'm plotting that here, initialized with a zero prior mean function; around it I plot the square roots of the diagonal entries of that matrix, and draw samples from it as well — these three dashed lines. Now we can compute the gradient of our optimization problem with respect to the latent values at the data points — here, here, here and here, wherever there is a square — which capture the statistics of our posterior; these are the values in this function here. I plot these gradients as red lines, and the actual Newton steps — the inverse Hessian times the gradient — as blue lines. Now let's do one update, and you can see that even after the first step, because Newton optimization is a smart algorithm, we already get a much better approximation. If we keep going for a few Newton steps the algorithm quickly converges — let's count: one, two, three — after three steps Newton optimization has already converged. That's typical: Newton optimization converges very fast; you don't need many steps. And this is what the posterior distribution looks like.

The reason I show this plot is to highlight a few aspects that might otherwise be lost. There is a true underlying blue line from which we actually generated the data, and in this case that blue line even comes from the model: it uses the same kernel, so it is the exact, correct generative process. But all our machine learning algorithm gets to see for training is not this blue line — just these black squares, these training points, which are discrete binary labels. You can imagine that those carry very limited information, because they are produced through a complicated generative process.
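As a minimal sketch of this generative process — the names and the kernel argument are illustrative, not the lecture's actual demo code:

```python
import numpy as np

def sample_classification_data(X, kernel, seed=0):
    """Draw a latent function from the GP prior, squash it through the
    logistic sigmoid, and sample binary labels from uniform random numbers."""
    rng = np.random.default_rng(seed)
    K = kernel(X, X) + 1e-9 * np.eye(len(X))          # jitter for stability
    f = rng.multivariate_normal(np.zeros(len(X)), K)  # latent draw (the blue curve)
    p = 1.0 / (1.0 + np.exp(-f))                      # class probability (black line)
    u = rng.uniform(size=len(X))                      # one uniform per input
    y = np.where(u < p, 1, -1)                        # +1 where u falls below p
    return f, p, y
```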
We take the latent function, transform it, then draw random numbers, and only ever return positive or negative observations — exactly the process sketched above. Therefore you can't really expect the posterior to recover this true generative probability, the transformed latent function, exactly. What our posterior actually finds — and this is the best we can hope for, at least within this Gaussian approximation — is this Gaussian process posterior, which is relatively uncertain, as you can see, but captures the structure of the blue function: it goes down roughly where the function goes down, it goes up roughly where the function goes up, and the blue curve actually looks a little like these posterior samples you see dashed up here. If you transform those samples, they look like the curves down here, and I've also transformed the posterior mean function, so you can see it here. We can use all of this to predict class labels at other points, and you can see that those predictions will often be very uncertain. This is an important, fundamental insight, and it has nothing to do with the approximation we're using here: classification requires uncertainty. If you just take this solid red line — the classic logistic regression point estimate — then it is obviously wrong compared to the true underlying black function, because we've only seen a small handful of class labels. To capture that uncertainty, we need to be uncertain about this probability; we need to assign a probability to this probability, and that is represented by the samples from the approximate posterior that you see in the background.

With this we're at the end. Today we encountered maybe our first machine learning algorithm that goes beyond the Gaussian framework: a classification algorithm known as logistic regression, or, in its Bayesian form, Gaussian process classification. It applies to data sets that involve, in a discriminative fashion, predicting the labels of discrete classes at various points in an input domain. We saw that we can build such an algorithm by salvaging as much as possible from the Gaussian process framework we constructed to infer real-valued functions, because this kind of classification is essentially regression — but on a function that is a probability rather than a generic real-valued function, and because probabilities are real numbers, we can reuse most of what we use for Gaussian process regression. But there are two caveats. One is that the function space has to be constrained to contain only function values that actually amount to probabilities, and we saw that we can do so by squashing — transforming — a Gaussian-process-distributed random function through a link function such as the logistic function. The second problem is a computational one: once we have the non-Gaussian likelihood that arises from this squashing, the posterior is not Gaussian anymore, because it is the product of a Gaussian prior and a non-Gaussian likelihood. To solve this conundrum,
we arrived at a particularly efficient — though not always particularly accurate — numerical approximation for Bayesian inference called the Laplace approximation. It amounts to finding the mode of the posterior distribution and, in log space, computing the negative inverse Hessian at that mode to construct a Gaussian approximation: the mode becomes the mean of our approximate Gaussian distribution, and the negative inverse Hessian becomes its covariance. Implementing these methods requires a little bit of care, so if you want to see exactly how it's done, have a look at the examples on ILIAS. We also briefly went through the implementation, and in doing so observed some interesting properties: for example, we arrive at an optimization problem that is convex, and because we already know how to compute the Hessian, we can use it to build a very efficient optimizer by implementing Newton–Raphson optimization. With that we're at the end of this lecture. I'm hoping to see you again in the next one. Thank you very much for your time.