Hello and welcome to Probabilistic Machine Learning, lecture number 14. We've already spent quite some time in this course developing a quite powerful, quite extensive probabilistic framework for one very specific type of machine learning problem, namely supervised regression: supervised machine learning in which the quantity we're trying to learn is a function that maps from the input domain to the real line. In the last lecture, number 13, we for the first time encountered a variation of this setting called classification, in which the observations we make are not real-valued but categorical: we observe individual classes, either two classes in the binary classification setting, or multiple classes, multiple categories, at different input locations. Studying this problem a little in the last lecture, we discovered that we can actually address supervised classification with a more or less minor variation of the powerful framework for regression that we constructed in previous lectures. That is of course nice, because it allows us to salvage a lot of the analysis and a lot of the understanding we built up before. In particular, we understood that we can treat classification as what is called a discriminative machine learning formulation, in which we believe that we observe, say, binary labels that are either plus or minus one, with a probability that arises from a nonlinear transformation of a latent real-valued function. That latent function gives us the connection to our regression setting, and the nonlinearity is typically chosen to be a so-called sigmoid in the binary setting: an s-shaped function, the cumulative distribution function of some probability measure. In the previous lecture we chose the logistic function as the sigmoid, which you can see here again. If we do so, then the beautiful trick is that we are still talking about a latent real-valued function, and therefore we can use a Gaussian process prior, with all the powerful machinery encoded in mean and covariance functions (kernels) that we've gotten used to over the previous lectures. The only problem that arose from this was that, in doing so, we give up the nice analytic, algebraic property of Gaussian process regression, which is that the likelihood is also Gaussian, and the product of a Gaussian with a Gaussian is another Gaussian. So in this kind of classification setting, which is called logistic regression or Gaussian process classification, we have to come up with a computational approximation to construct an approximate posterior distribution, because the true posterior is not analytically tractable in any but the most trivial cases. In the last lecture we spoke at length about how to construct this kind of approximation, and in particular we encountered one particularly lightweight,
maybe also not particularly powerful, but easy-to-implement kind of approximation that constructs such an approximate posterior, called the Laplace approximation. It consists of finding the mode of the posterior and then using the geometry of the log posterior, as encoded in the curvature matrix, the Hessian, to construct an approximate Gaussian distribution: we take the mode as the mean of this Gaussian and the negative inverse Hessian as its covariance matrix.

Now, in today's lecture I want to dwell a little more on these kinds of models, and in particular study them a bit further to make connections to other frameworks you might have heard about and be interested in, and also to think about what kind of power this Laplace approximation actually holds, and whether we can extend it to domains and model classes that go beyond classification and beyond Gaussian process regression in the latent space. So in particular, today we're going to do a few things. First, you may have heard, if nowhere else then for sure in Professor von Luxburg's parallel statistical machine learning class, about a very beautiful type of machine learning algorithm called the support vector machine, which has great relevance, not just historical relevance but continuing practical relevance for many applications. You might have wondered, having seen my lecture on logistic regression, how it is connected to Gaussian process classification; we'll do that first. Then we will take a step forward and think about what other kinds of data sets we can apply this machinery to; in particular, we'll think about data that aren't draws from discrete, or even binary, classes. And then finally I would like to think about the model side and extend beyond the latent Gaussian process to latent deep neural networks, and talk a little about the notion of Bayesian deep learning.

But let's start with the support vector machine. You can get a proper formal introduction to support vector machines in other lecture courses, in particular in the parallel lecture course by Professor von Luxburg, so I'm not going to do a derivation of the support vector machine. Instead, I'm going to directly give you the view from the probabilistic side, and then, if you already have an education about support vector machines from the statistical perspective, or are getting one, you can make the connection yourself. So here is how you would arrive at the support vector machine coming from where we are coming from: Gaussian process classification, probabilistic logistic regression. Let's start with our probabilistic regression model again. We assume we have a Gaussian process prior, and we use a likelihood function built from a link function, namely the logistic function, which is a specific form of sigmoid. Now I'm going to have to remind you of a few things we did in previous lectures, so let's go through them one by one. First of all, notice again that the derivative of this link function with respect to its input can be written as the function times one minus that function; that's just an algebraic property of the logistic function. Now, what we did in last week's lecture was to use a Laplace approximation. To do that, we found the mode of the posterior distribution associated with a Gaussian process prior and this likelihood function. To find that mode, we took the logarithm of the posterior,
which is a sum of the log prior and the log likelihood, plus a constant that doesn't matter for the optimization, then computed the gradient of that log posterior and tried to find a point where the gradient is zero. So here we go; here is the gradient of this log posterior. It is given by the gradient of the log likelihood, which is actually a sum over gradients of individual log likelihoods, because we made the assumption that the data are i.i.d. when conditioned on the latent function, plus the gradient of the log prior. The prior is a Gaussian process, so the log prior is minus a quadratic form, and if you take the gradient of that with respect to f_X, you are just left with a linear form. Now, if you set this to zero, then obviously this term is equal to that term, right? You can just rearrange, and therefore we can write this expression, which involves a matrix inverse, as the gradient of the log likelihood. This isn't actually particularly helpful for doing the optimization, because you have to compute both terms anyway to get to the minimum. But it is convenient after you have finished with the optimization, because it means you can encode this vector directly as the other vector, which we might call r: the gradient of the log likelihood. Why is this helpful? It's going to be interesting because this quantity actually shows up in our prediction step. Once we've computed our approximate posterior on the training points, we can use it to predict classes at other input points, at test points, by using the usual mechanics of Gaussian process regression to construct a marginal approximate posterior over the latent function there. That involves computing, in particular, a mean function, which is given by this object we talked about in the last lecture: it is just the posterior mean of the latent Gaussian process, under the assumption that the function values at the training points are given by the constructed approximate mean of the posterior distribution there. Notice that this expression involves exactly the bit we compute during optimization, and which, we just noted, at the minimum is given by r. So we can think of this predictive latent quantity as the kernel multiplied with the gradient of the log likelihood, and that is going to be interesting from an analytic perspective when we think about the structure of this prediction. A minimal sketch of how this predictive mean is assembled is given below.
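To make that structure concrete, here is a small sketch (the names are mine, and it assumes a zero prior mean and the logistic likelihood with labels in {-1, +1}) of how the approximate predictive mean at test inputs is assembled from the kernel and the log-likelihood gradient r:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_predictive_mean(K_star, f_hat, y):
    """Approximate posterior mean of the latent function at test inputs.

    K_star : (n_test, n_train) kernel matrix k(x_*, X)
    f_hat  : (n_train,) mode of the posterior over latent function values
    y      : (n_train,) binary labels in {-1, +1}

    At the mode, K^{-1} f_hat equals the gradient of the log likelihood,
    so the predictive mean is a kernel-weighted sum of that gradient:
    its entries are small deep inside a class and large near the boundary.
    """
    r = (y + 1) / 2 - sigmoid(f_hat)   # d/df_i log p(y_i | f_i) for the logistic likelihood
    return K_star @ r
```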
So, I'm showing you a one-dimensional picture of this here. This is a data set that I've specifically constructed to showcase something. On the left-hand side you have training data points that are all from the negative class, which I've plotted at the bottom end of the plot; they're obviously not at minus four, they are just observations of the negative class, shown as solid points. On the right-hand side of zero, let's say we have only observed positive classes. So you can think of it as: there is basically a decision boundary here at zero, and on the right there is only class one, and on the left there is only class minus one. Let's think about the situation after we have found our Laplace approximation. After we have found the mode of the posterior, it happens that, under a Gaussian kernel with a meaningful length scale, this red line here is, well, not the posterior itself, but what is induced by the mode of the posterior. Of course, we only compute the value of that mode at the training points; those are the circles here. These black circles correspond to the values of f hat, and they induce this approximate posterior mean, which is the red curve. Now, what you might think about is how that red curve can be represented in terms of the data, in terms of these latent quantities, the black circles; that relationship is encoded exactly in this expression. Here m is just zero, so we are left with this expression. Let's think about it a little, because we just saw from the argument above that it is related to the gradient of the log likelihood. The likelihood is this function whose range is from zero to one; I've actually plotted the likelihood into the figure, so in blue you see the likelihood for the positive class, that is, the likelihood for this class over here. Now, we don't need the gradient of this likelihood; we need the gradient of the logarithm of that likelihood. That is a function that comes up from below and approaches zero, since the logarithm at one is zero. So you can imagine that as the likelihood gets closer and closer to one, it has to become flatter and flatter in log space, as it gets ever closer to zero there. That means the gradient of the log likelihood becomes flatter and flatter in the regions where f has a high value, and those are exactly the regions over here, because our kernel smooths this prediction and thereby creates an interpolant between the latent quantities that implicitly assumes the latent function values within the class are high. So up here, in regions where f has a large value, all of these points have an associated log likelihood with small gradient, which means these terms contribute only relatively small entries to this vector r. You can actually see this in the rest of the plot: what I'm showing you as these many wiggly lines are the individual kernels between x and the training data points, multiplied with the values of these gradients. These are each kernels, of course, because this is a weighted sum of kernels, and there are only two entries here that really stand out, and those are the one or two training data points that are closest to the decision boundary. Why? Because the latent function has to have a relatively small absolute value in that region, and there the gradient of the log likelihood is actually high. So what you can imagine, and that is an intuitive observation even if you don't know anything about support vector machines, is that if you wanted to compute this entire latent function, this red curve, then you would do quite well to just approximate it with these two points and almost ignore all the other ones, because they contribute relatively little to this value. Of course, that is not quite true, because these functions do still have a non-trivial gradient over here. So what would we have to do
to make this approximation hold, or to actually make it exact? Well, we would have to make sure that the gradient of the log likelihood within the class, in regions where the function value is large, in particular larger than one, say, is actually zero. That is the idea behind the support vector machine: an algorithm that gives rise to a point estimate, like this red curve here, that algebraically depends exclusively on the training points closest to the decision boundary. These are called support points, and the corresponding values in the weighted sum are called support vectors. To get there, we need a loss function, a log-loss function, that is flat in the region larger than one, and this is called the hinge loss.

So let's talk about this again from the loss perspective. I have made this point several times in previous lectures: you can think of the operation we perform when finding the maximum of the posterior, by taking the minimum of the negative log posterior, as equivalently solving an empirical risk minimization problem, where the log prior acts as a regularizer on the risk functional and the log likelihood is the empirical risk. So far, for Gaussian process classification, we've decided to use the logistic link function; that is our log likelihood, and you can think of it as an individual log-loss term. What we just observed is that the computation would be so much easier if this function had the property that, to the right of one, it is just flat, with no gradient. If that were the case, then this big sum in our optimization problem, when we compute its gradient, would hopefully involve only a very small, or at least significantly smaller, number of points with non-zero terms; these are called the support vectors, or the support points. We could then think about how to optimize this kind of function much more efficiently by keeping track of only these active terms, and that would be particularly helpful in settings with a very large data set that lies densely in the space we're trying to do inference over. To do so, we need a loss function that has this form in log space, well, in the space of what we think of as log likelihoods, and that is precisely the hinge loss. If you go to a statistical machine learning class, you will arrive at this algorithm from the other direction, so to speak: you do empirical risk minimization, you notice there is this beautiful idea of the hinge loss, which gives rise to this good computational structure that allows the optimization problem to be solved very efficiently, and then you call that the support vector machine and think about how to design it and use it. We have just seen that we are tempted to think of this algorithm as arising as a limit case of logistic regression, where we adapt our loss function so that it is flat to the right of the decision boundary. Now, I'm also plotting here in red the actual log of the logistic link function, so you can see that what I just said makes sense: the gradient over here becomes ever smaller, but it doesn't actually become zero. The gradient on this side, by the way, is large, but notice that we don't ever actually need this side to learn, because, at least at the end of
the optimization, this is the region to the left of zero, where we have training values from the other class, and that means we have to flip the sign of this loss function, so we end up in the first region again.

So this is going to be another case of an interesting connection between statistical machine learning and probabilistic machine learning. The support vector machine is quite fundamentally a statistical machine learning algorithm. What we can do from the probabilistic perspective is to think about it as an empirical risk minimization problem that might be associated with a log posterior. If you do that, and if you look really closely, you will discover that this connection unfortunately doesn't quite work, and we'll do that now. Why does it not work? To give you the answer right away: it's because this particular loss function isn't actually a log-likelihood function. What I mean by that is maybe best shown by this picture. What you see here is another way to plot these functions, both the logistic link function and the hinge loss: I'm now plotting the exponential of these functions. If you start from an empirical risk and wonder whether it corresponds to a log posterior, then you have to take the exponential; you have to take the step in the other direction. Instead of starting from a posterior and taking its logarithm to arrive at an optimization problem, you take an optimization problem and take its exponential to see whether you get a posterior. What you see in red here is the exponential of the logarithm of the logistic function, so just the logistic function, of course, and in dashed red you get the probability for the other class, which is sigma of minus f. As you know, the sigmoid has the property that sigma of minus f is just one minus sigma of f. Therefore, the sum of the dashed line and the solid red line is the dotted red line up here, which is just the constant one. That's good, because it means that no matter what the latent quantity f is, the probabilities for these two hypotheses, which we want to interpret as p of y given f and p of minus y given f, sum to one. This makes it wonderful, because it means this is actually a probability distribution over the classes. Remember that likelihoods are not probability distributions over their second argument, the latent quantity, but they do have to be probability distributions, by definition, over the observed quantity, the data. For the red line this is true: at arbitrary input locations, no matter what the latent quantity is, the probabilities of the two classes always sum to one. This is not true for the hinge loss. If you take the hinge loss and take its exponential, you get this black line: you can see that it goes all the way up to one and then becomes flat. And if you take the hinge loss of the negative function and exponentiate that (there is a spurious minus sign on the slide here; take that minus out, the plus is correct), you get this dotted black line, and if you sum those two, you get this dashed black line, which, as you can see, does not sum to one. A small numerical check of this is sketched below.
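Here is a small numerical check of that statement, under the usual conventions that the logistic loss is minus log sigma(y*f) and the hinge loss is max(0, 1 - y*f); the function names below are mine:

```python
import numpy as np

def logistic_loss(f, y):
    # -log sigmoid(y * f), the negative log of the logistic likelihood
    return np.log1p(np.exp(-y * f))

def hinge_loss(f, y):
    return np.maximum(0.0, 1.0 - y * f)

for f in [-2.0, 0.0, 0.5, 1.0, 2.0]:
    # exponentiate minus the loss for both classes and add the two "probabilities"
    logistic_sum = np.exp(-logistic_loss(f, +1)) + np.exp(-logistic_loss(f, -1))
    hinge_sum = np.exp(-hinge_loss(f, +1)) + np.exp(-hinge_loss(f, -1))
    print(f"f = {f:+.1f}   logistic: {logistic_sum:.3f}   hinge: {hinge_sum:.3f}")
```

The logistic column prints 1.000 for every value of f, while the hinge column does not, which is exactly the point made above.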
Now, of course, we could scale this loss function; remember that we're doing optimization here, so scaling everything by a constant doesn't change anything about the optimization problem. In particular, we could scale it so that it is always less than or equal to one; that is not the problem. If we scale things down, which corresponds to just rescaling the prior if you like, we get a function that is always less than or equal to one, but it sums to exactly one at exactly two points, at plus and minus one, and it does not sum to one in the other regions. Formally, this just means that we cannot think of the hinge loss as a likelihood function, at least not in this form. Now, various people have thought about how to repair this issue, and there are ways of introducing interpretations under which you could still end up with a probabilistic model. One that might already spring to your mind is that maybe there is just a third class that we haven't considered yet, which we can use to fill up the remaining probability mass, and that third class is perhaps only implicitly defined. That's true, and it is actually one way to do this. It is, unfortunately, maybe also a little unintuitive, because if you think about where this model is going to assign mass to this third class, then it is right at the decision boundary, which is maybe a bit odd: you have two classes approaching each other, and right in between a third class just sort of emerges. Notice that this is not uncertainty about class one or class two; it is probability mass assigned to a third class. And, maybe a bigger problem, it also shows up inside of a class: if you have a very large and flat class region, where you would otherwise like to be quite certain that inside this region that class is the right one, then you have to live with the fact that the support vector machine, if you want to interpret it in this way, has to associate a certain non-trivial probability with the third class at the center of this class region. So that is all maybe a little ad hoc and a bit weird, and perhaps the more natural, more honest answer is simply to think of the SVM as a model that is fundamentally not probabilistic. It is tempting to want a probabilistic interpretation for it, because it has this nice computational property of creating support vectors, which lead to a sparse optimization problem, but we just can't, and maybe we just shouldn't. If you want to use the support vector machine, you're very much invited to do so; it's a great, powerful algorithm. Just don't think of it as a probabilistic model. That's our first gray slide. We observed in our, let's call it, logistic probabilistic classification algorithm that there is a structure in there
which means that data points far from the decision boundary contribute only relatively small terms to the predictive latent function. That makes it tempting to use a likelihood, or a log loss, with the hinge-loss structure. But then we discover that this particular loss function is not associated with a likelihood when we take its exponential, and therefore we should maybe just not think of support vector machines as probabilistic models. They are fundamentally statistical learning machines, and are best analyzed from the statistical perspective. That's of course fine, and maybe it's an argument for the statistical machine learning framework, and it is. It also means that this is a kind of learning machine to which it is fundamentally hard, maybe in some sense impossible, to assign a meaningful notion of uncertainty. It really is just a point estimate, and you should treat it as such. With this thought, I'll leave you for a second so that you can take a quick break.

Right. So this was our brief discussion of the support vector machine; if you want to know more about it, take a look at the parallel class by Ulrike von Luxburg. What I want to do in the next, roughly, third of the lecture is to think about what we do if we have data that doesn't come from a binary classification problem. It may already have seemed a little arbitrary to you that we decided to look at binary classification. Obviously, there are many binary classification settings in the world: classifying things into good and bad, left and right; "is this of this class or not?" happens a lot. But of course there are many other machine learning problems in which the output data is not real-valued, but also not binary-valued. One thing you might naturally think about first are situations in which you have multiple classes. I've already mentioned this in passing, and I did say that that is also a classification problem, but I haven't really told you how to deal with it if you actually encounter it. The answer is actually comparably simple.
It involves a two-step construction. The first step is that we now allow for multiple latent functions. Instead of saying there is one latent function that goes up and down, and when it has a high value it is quite likely that we see class one, and when it has a low value it is quite likely that we see class minus one, we now say that there are, let's say, capital C classes, from 1 to C, and that for each of these classes there is a latent function. Each latent function is, of course, defined at all the training points, so in total, with n training points and C classes, we now keep track of n times C latent function values at the training locations. The likelihood should somehow encode that if, at a particular location, we observe a certain class, then the latent functions representing all the other classes should have a small value, and the latent function representing the observed class should have a comparably high value. It is possible to encode that in a likelihood using various link functions, and one that is particularly popular is the so-called softmax link function, a generalization of the logistic function, which you see here: the probability of observing class c at location i is defined to be the exponential of the value of latent function number c at that location, divided by the sum over all classes of such exponentials. This is called softmax because the exponential, well, has the shape of the exponential, so this function takes whichever latent function has the largest value and amplifies it exponentially more than all the others, thereby softly picking out the maximum, if you like. Once we have that link function, and I won't even show you the corresponding slides, you can do the corresponding derivation for our Laplace approximation and treat everything as before. The only two major changes are, first, a simple one: wherever you need a gradient of a log likelihood, you now compute the gradient of this object, which is relatively simple to compute. Second, and perhaps more confusing for you, although mathematically it is the easier part, you now have to keep track of a latent Gaussian process prior, the associated likelihood, and therefore a latent Gaussian process posterior (approximated through the Laplace approximation) that tracks multiple latent functions. People who haven't spent much time with Gaussian processes yet often struggle with the idea of such a so-called multi-output Gaussian process, one that keeps track of multiple Gaussian-process-distributed latent functions at the same time. But it's actually really easy: if you find it difficult, the easiest way to start is to just think of C completely separate Gaussian process priors.
In log space, this is then just a sum over the individual Gaussian process priors. That is going to be fine, and once you've gotten used to it, you will realize that you can also keep track of covariance terms between the latent function values, and that is not so complicated to do either. So this framework provides a natural extension of binary classification, of logistic regression, to the multi-class case, in the sense that we observe multiple classes. Fine. But again, not every data set is of a type where we either see a real-valued output or a discrete number of classes. Maybe at this point you can already guess what the trick is going to be to extend to other kinds of data sets: it is just going to be a different link function. We will keep hold of our Gaussian process prior, because we know and love it and know how to design it, and construct other link functions to produce output spaces that are more amenable to our concrete data set. The kinds of transformations we can consider are, in general, more or less unbounded, as long as the transformation is continuous; of course, we have to be a little careful that, if we then use a Laplace approximation, we still get a meaningful posterior. To give you a few examples: here is one latent Gaussian process, this blue thing here, which I'm transforming in three different ways. If we take its logistic transform, we get this set of samples; every single frame here is a direct translation of the samples above, I'm really just pushing them through the transformation, and you can see that these samples now lie between zero and one, so they are good representatives for probabilities, and therefore for binary classification tasks in which we observe data from class one or class two, with this probability for class one and one minus it for the other class. Now, say you observe a quantity that isn't a binary class but some strictly positive number: some rate, some scale, or count data in particular, which we're going to look at in a moment. Then you might be more interested in a model like this: here I'm taking the exponential of the latent function, which of course makes sure that the samples are always positive, and it also gives them a very nonlinear behavior. Where the latent function goes towards negative values, the samples are squeezed close to zero, into the range from zero to one, and in regions where the latent function goes up, they extend upwards very aggressively. Maybe that is not quite aggressive enough for you yet; you could go even crazier and use a transformation that additionally involves a polynomial transform, and you get an even wilder model. A small sketch of such link functions, applied to some stand-in latent values, follows below.
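As a minimal sketch of these link functions, applied here to stand-in latent values rather than to actual Gaussian process samples:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=10)        # stand-in for one latent function at ten inputs

# logistic link: values in (0, 1), usable as class probabilities
p_class_one = 1.0 / (1.0 + np.exp(-f))

# exponential link: strictly positive values, usable for rates, scales, counts
rate = np.exp(f)

# softmax link: C latent functions give a distribution over C classes per input
C = 3
F = rng.normal(size=(10, C))                       # one column of latent values per class
P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
assert np.allclose(P.sum(axis=1), 1.0)             # each row sums to one
```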
Why would you choose these kinds of models? Well, maybe because you fundamentally believe that that is what the physical, real-world process you are trying to model actually looks like. No matter how you choose your transformation, one thing you can try, assuming the transformation is continuous and maybe also monotonic, is to afterwards do approximate inference using a Laplace approximation. Doing so gives rise to, at least as a point estimate in the latent space, what is known as a generalized linear model: generalized, because it generalizes the idea of real-valued regression, which we do with linear models, to observations that are not real-valued. The term "generalized linear model" is a little dangerous, because it is easily confused with other forms of generalization of linear models. When we first encountered regression, we first talked about learning a linear function, which might have seemed like a linear model, and then we realized that you can learn nonlinear functions in this linear fashion; that might be called a general linear model. The generalized linear model is something quite different: we don't just have a nonlinear function with a Gaussian likelihood, we have a nonlinear transformation with a non-Gaussian likelihood. To do inference in such models, because the likelihood is not Gaussian, we have to use some form of approximation for the posterior, since it is not going to be Gaussian, and the obvious one that suggests itself, given what we've done so far, is of course the Laplace approximation. To remind you once again how it works, here is yet another pictorial view. To approximate this non-Gaussian distribution, the black curve here, which happens to be the product of a Gaussian prior and a non-Gaussian likelihood (but that doesn't actually matter, because we only use the shape of this black curve), we take the logarithm of the curve, which looks like this, then take minus its value and find the minimum. This gives us the black dot here. At that point we compute the curvature, the second derivative of that curve, to get a quadratic polynomial approximation, a Taylor expansion, and then we revert the entire process: we take minus the value of this quadratic again and take its exponential, and we get a Gaussian distribution, because that is what Gaussians are: the exponential of minus a quadratic. That gives us this dashed line, and because it is Gaussian, we can now use all the nice algebraic properties of Gaussian distributions. We found this approximation with a lightweight procedure that involves only minimizing a function, which you can do with gradients, and then computing the second derivative of its logarithm, which is also cheap to do. So that gives us an approximation which we can then use to do inference in quite generally structured models. A small numerical sketch of this recipe, on a toy example, follows below.
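Here is a minimal numerical version of that recipe on a toy one-dimensional example; the particular prior and likelihood terms below are only illustrative choices of mine:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_p(f):
    # unnormalized log density: Gaussian log prior plus a non-Gaussian log likelihood
    log_prior = -0.5 * f**2           # N(0, 1), up to constants
    log_lik = 3.0 * f - np.exp(f)     # an illustrative non-Gaussian term
    return log_prior + log_lik

# step 1: find the mode by minimizing the negative log density
f_hat = minimize_scalar(lambda f: -log_p(f)).x

# step 2: curvature of the log density at the mode (finite differences here)
eps = 1e-4
d2 = (log_p(f_hat + eps) - 2.0 * log_p(f_hat) + log_p(f_hat - eps)) / eps**2

# step 3: the Laplace approximation is N(mean = f_hat, var = -1 / d2)
print(f"Laplace approximation: N({f_hat:.3f}, {-1.0 / d2:.3f})")
```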
Maybe let me give you one concrete example of how to do that, just to keep up the motivation a little and not just show you one slide and then rush on; we are also going to use it to make a small adaptation to our Laplace approximation that makes it even more aggressive and more lightweight. So here is a data set I just downloaded, a data set that these days many people look at almost every day: the number of coronavirus infections reported per day to the Robert Koch Institute, Germany's national public health institute, over the course of the pandemic. I might not actually put the months here, but only the days since the start of the outbreak; so over here we are in February 2020, this is late May 2020, and you can see the new cases coming in. Now imagine you want to make a prediction for how this function behaves into the future. Maybe you want to do so because you observe that this function has a clear weekly periodicity, which has something to do with the reporting process, and you want to get rid of this periodicity to estimate the actual current rate of infections. Or maybe you just want to predict where this thing is going to be a week from now, which at this point is already an interesting question, because people seem to be living days ahead into the future rather than months or years. If you wanted to do this with Gaussian process regression and simply put a Gaussian process prior over this model, then hopefully, if you actually try it, an internal alarm bell goes off in your head reminding you that this data is really not well modeled by a Gaussian process. Why? Well, first of all, it is strictly positive: a Gaussian process model fundamentally has to put probability mass onto regions of negative real numbers, and we know for a fact that there are never negative numbers of new cases, so we need a model that only predicts positive numbers. There is also another problem: even if we ignore this for a bit and say, well, that region isn't that interesting anyway, and over here is where I want to predict, this model also has a very extreme dynamic, because in this phase of the outbreak the number of new cases rose; how did it rise? Probably, actually, exponentially, because that is the nature of an epidemic: infections initially rise exponentially. You will remember, if you go back to the example where I modeled my body weight, that that is also a data set in which the observations are fundamentally lower-bounded. In that data set I subtracted the initial value, so the values you observe are sometimes negative, but it is clear that there is a lower physical bound, because a human being cannot weigh less than zero kilograms, and someone of my height can't actually weigh less than a certain minimal weight before it just isn't a feasible process anymore. Back then I largely ignored this problem, because the two problems present in the current data set were not present there: first of all, the dynamic range of that data set was relatively mild; it was a smooth process, up and down, in a more or less symmetric fashion, so the rises were about as fast as the decays, and the dynamic range of the process was moderate:
it was somewhere between minus 10 and plus 10, and the lower bound was quite far away, say, from the actually observed values. In the current data set we cannot make this assumption, and in fact, if you build a Gaussian process model for it, one thing it might predict over here is a decay that, now that we are in this phase of the pandemic, very quickly leads to negative predictions, which is really a bad thing to have. One very simple thing you can do, and this is a standard step when building these generalized linear models, is to just take a nonlinear transformation of the data set, and I want to leave it to you for a second to think about which transformation you might use. Most of you will have chosen, and this is a good idea, to take the logarithm. Here I've done that: I'm showing you the logarithm of the data set. But I'm actually showing you a little more than that already, namely error bars, and I want to talk about these error bars, because they are important. Let's imagine for a moment that you didn't have them. I've taken the logarithm of the data set, and now of course we could do Gaussian process regression on it. Maybe you actually want to build a model here that has a distinct switch around this point in time, because what happened here was a massive public policy intervention that totally changed the causal structure of the pandemic, namely the lockdown. From here to there, this data set is probably an exponential rise, so you might want to learn a linear function in log space; that is easy to do, we know how to do that. But one thing you might worry about, if you forget about the error bars, is that this part of the data set over here is quite flat. If you try to fit a straight line through it, then all of these observations near zero (these are individual cases) have to be included in the linear trend, and this creates a bias towards a flatter line that doesn't actually fit the data all that well. Why is that the case? Well, fundamentally there is a lower limit here: we shouldn't learn a linear function that has to extend all the way down here, because that would correspond to fractional cases, which are not physical. So we want our model to somehow capture the fact that in this region these individual observations are not particularly informative, and that is exactly what the error bars are going to do. So where do these error bars come from? What I've done here is use a Laplace approximation, actually a particularly aggressive one, even more aggressive than what we've done in previous models: a Laplace approximation on the likelihood rather than on the posterior. This is a minor variation, but I think we can do it right away; it gives you an example of how to flexibly play with your toolbox to create fast approximate algorithms. What I'm going to assume for the likelihood, which contains our link function, is that we observe count data y that is created by some underlying latent stochastic process which is Gaussian distributed, with the data being the exponential of that underlying process. For example, that underlying process might be a linear function of time, sort of a straight line. And let's say that we make observations with Gaussian noise.
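Written out, the observation model just assumed reads (with a constant noise variance sigma squared):

```latex
p(y_i \mid f_i) \;=\; \mathcal{N}\!\left(y_i;\; e^{f_i},\, \sigma^2\right), \qquad i = 1,\dots,n .
```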
This isn't quite true, of course; I've already made a first assumption here, and if you want to, you can think about how to criticize this assumption and what other kind of observation model you might want to use. You're very much invited to think about that. Now, in previous parts of this lecture we would at this point have multiplied this likelihood with the Gaussian process prior and then tried to build a joint Gaussian approximation to the posterior. What we're going to do instead is a variant of that which is even more aggressive: we directly approximate the likelihood with a Gaussian. Doing so of course makes the approximation even wilder, if you like, but it has the advantage that we can compute the approximate term directly, and luckily for every single datum, because this likelihood factorizes into individual terms. We thus build, if you want, a black box that takes the count data and directly translates it into a Gaussian likelihood, which we can feed straight into our Gaussian process regressor; that might work well for a data set like this one, which grows quickly to ever larger numbers. So here is what we do. I've actually already written down the answer, but I'm going to construct it for you on the whiteboard: we construct a Laplace approximation. Let me remind you that we have decided to use a likelihood, for every single observation, that is an individual Gaussian over y_i with mean e^{f_i} and, let's say, a constant variance sigma squared. So what is a Laplace approximation? Let me remind you: it is an approximation to second order in log space. We take the logarithm of p(y_i given f_i), which is, up to constants that don't matter because we are going to find the mode of this expression, minus one half times (y_i minus e^{f_i}) squared, divided by sigma squared; careful, our assumption was that the mean of the observation is e^{f_i}, not f_i itself. Okay, let's find the mode of this expression. We could just read off where the mode is (of course e^{f_i} has to be equal to y_i), but we're going to need derivatives anyway, so we might as well compute the gradient. Taking the derivative of log p(y_i given f_i) with respect to f_i, the two comes down, the minus stays inside, and there is another minus coming in from the front, so we are left with (y_i minus e^{f_i}) divided by sigma squared. If we want this to be zero, then of course f_i is the natural logarithm of y_i. So in our Gaussian approximation of the likelihood for y_i, we write an approximately Gaussian function of f_i centered around log of y_i, and now we just need to know what the variance is going to be. To get that, we compute the second derivative, which is minus e^{f_i} divided by sigma squared; plugging in the value of f_i at the mode, that is f_i equal to log of y_i, this is minus y_i divided by sigma squared. Now, if you look up again what the Laplace approximation is, it tells you that we need to take the inverse of this and remove the minus, so multiply by minus one and invert: the variance (not the error) is going to be sigma squared divided by y_i. A small code sketch of this transformation is given below.
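As a minimal sketch of how this per-datum approximation can be used in practice (the counts and the weighted straight-line fit below are purely illustrative; in the lecture the downstream model is a Gaussian process rather than a plain linear fit):

```python
import numpy as np

def laplace_pseudo_observations(y, sigma2=1.0):
    """Turn counts into Gaussian pseudo-observations of the latent function.

    Following the per-datum Laplace approximation above, the approximate
    Gaussian likelihood for f_i has mean log(y_i) and variance sigma2 / y_i,
    so small counts automatically come with large error bars.
    """
    y = np.asarray(y, dtype=float)
    return np.log(y), sigma2 / y

# hypothetical daily counts; the early, small numbers are nearly uninformative
counts = np.array([1, 2, 1, 5, 14, 40, 120, 350, 900])
days = np.arange(len(counts), dtype=float)
mean, var = laplace_pseudo_observations(counts)

# feed (mean, var) to any Gaussian regression model with per-point noise,
# e.g. a precision-weighted straight-line fit in log space:
w = 1.0 / var
X = np.stack([np.ones_like(days), days], axis=1)
intercept, slope = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * mean))
print(f"weighted fit in log space: intercept {intercept:.2f}, slope {slope:.2f}")
```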
So now let's go back to our plot. This is exactly what we have here, and it is a very intuitive kind of result. What it means is that as we get to small values, counts of one, say, the error bars get very large. And this is good, because if you now want to learn a linear model through this, the model will essentially ignore those observations. This probabilistic approximation allows you to learn a linear function that is informed much more by the later values, which have higher measurement precision, than by the earlier ones. So this is one advantage of a probabilistic formulation: even though we are making quite crude approximations, and maybe you should be careful about the resulting posterior uncertainty, even just using them in the likelihood already provides benefits that make these computations more robust. Now, maybe over here you want to use a different kind of Gaussian process regression model, one with maybe a linear trend down here and then a periodic trend on top, to model the reporting behavior of the German authorities; of course you can do that too. And as the data set returns further down to smaller and smaller values (let's hope it continues on this path), the error bars over here will become more important again, which is good, because then your regression model can again put its weight on the earlier, more informative parts of the measurement process.

With that, I would like to conclude this second third of the lecture, the part on generalized linear models. If you have a data set whose observations are not real-valued, and it also isn't a classification data set, so you don't observe binary labels, then you get to decide which link function to use. Link functions might seem like a small toolset to select from, and of course there are a few obvious choices, like the softmax for discrete classification and the exponential function for count data like the one I just showed. But which one is the right one is really up to you, because you know where your data comes from, and if you don't, then maybe go back to whoever collected the data and let them tell you a bit more about where it came from. And if you don't know what the right transformation is: of course, let me remind you, you can assign variables, hyperparameters, to your model, to your transformation, to your link function, and learn these using type-2 maximum likelihood. In the final part of this lecture I'd like to circle back a little to the earlier parts of our discussion of Gaussian models.
Before we arrived at Gaussian process models, models which arguably track infinitely many features at once, we discussed parametric regression, and that idea might seem a little quaint at this point. But remember that it is connected to the extremely powerful framework of deep learning, which still very much is at the forefront of machine learning. Back when we spoke about parametric regression, I showed you a picture like this and said: if we have inputs and outputs, where the outputs are real-valued, then what we do in parametric regression is essentially to define features and then learn the linear weights of these features by putting a Gaussian prior over them; and because the likelihood is also Gaussian, the posterior is Gaussian and everything is tractable. Now, if you don't know what your features are and you'd like to learn them, you can represent the space of features in some parameterized way, for example also in a deep way, and then learn those features by type-2 maximum likelihood, that is, by maximizing the marginal likelihood of the observations y under the model that maps from x to y, with all the w's integrated out. It is maybe a bit silly to do that if you have a very deep network, because then there are a lot of parameters here, and the marginalization of the weights at the top doesn't make all that much of a difference; but let's not get too far ahead of ourselves and think about the deep aspect later. Having just done generalized linear models, you might now wonder: does this view we constructed, this connection to deep learning, still hold? Does it mean we can actually think about learning general neural networks, models that map from x to a general y, in particular also classification networks, in a Bayesian fashion? The answer is yes. So we can start slowly, without getting too far ahead of ourselves, by going back to where we were in regression and seeing whether the framework we derived back then, type-2 maximum likelihood to do hierarchical Bayesian inference, still applies. To do that, we basically have to take the derivations we made for Gaussian process classification and make sure that two changes work out. One change is that we now have parameterized features; let's see whether that works. And also, so far in the classification setting I haven't actually spoken about how to marginalize out the weights, that is, how to compute evidences. Let's do that first. So let's see whether we can compute the evidence for the observations y: the marginal of the joint distribution over the labels and the latent quantities, with the latter marginalized out. To do so, I'll show you a derivation that comes directly from the great book by Carl Rasmussen and Chris Williams, and it works like this.
What we're interested in is this marginal distribution over y given X: the integral of the joint distribution over y and the latent function f, where we integrate out f. In the regression case that was easy, because this joint distribution is prior times likelihood, and the prior and the likelihood were both Gaussian; therefore the integral was just an integral over a Gaussian term, and we got a direct analytic answer. Now the likelihood is not Gaussian anymore, but we have our handy Laplace approximation, which we can use to solve this integral approximately and in closed form. How is that going to work? It's actually just a little finger exercise. The joint distribution over f and y is the product of prior and likelihood, and so it is the exponential of the logarithm of prior times likelihood; that is a trivial transformation. What we've decided to do is to approximate this distribution, the product of prior and likelihood, with a log-quadratic approximation, our Laplace approximation. So this term, the logarithm of p of y given f times p of f, likelihood times prior, is what we approximate with a quadratic: a constant, namely the value of prior times likelihood at the mode f hat, plus a quadratic term in f. There is no linear term in this Taylor expansion, because we are at the mode, where the gradient is zero. So that is our log approximation (strictly, there should be a logarithm here, sorry): our log of q of y and f given X. Now let's see what happens to the integral if we just pretend this is the true distribution. We still want to compute the integral, which is now approximated by some marginal q over y; let's plug in what we have in the line above. There is a constant here in front: we take the exponential of a constant minus a quadratic term, and the exponential of the constant doesn't depend on f, only on f hat, so we can move it outside of the integral, here. Then there is the integral over e to the negative quadratic form. That is a Gaussian integral, because this is a Gaussian-shaped expression, e to the minus one half times a square, but there is no normalization constant here, so it does not just integrate to one: it is not a Gaussian probability density function, just a Gaussian-shaped function. We do know what the integral is, though: it is the square root of two pi to the power of the dimensionality of the problem (there are n data points here, so we get an n) times the square root of the determinant of the covariance, which here is the inverse of (K inverse plus W), where W is the matrix of second derivatives of the log likelihood that I introduced in the previous lecture. That is just going to be a constant, but a constant we have to keep track of, because it involves K and W: if we are going to tune the model, which in particular changes K and W, we have to carry this number along, because it affects the overall objective as we tune things. The term here in front is e to the logarithm of the likelihood times the prior; and what is the prior? Well, that is just a Gaussian, right, our Gaussian process prior, so we can write it out directly.
This here is the logarithm of the product, which is the sum of the two logarithms, and the exponential of a sum is the product of the exponentials. So we have the exponential of the log likelihood, times the prior, which is just the Gaussian process prior, both evaluated at the mode of the posterior distribution. The term in front is easy: it is exp of a log, so it is just the likelihood, and its logarithm is going to be the log likelihood. And actually, this is already our entire approximate marginal; we could stop at this point, we have what we need: we have q of y given X. Of course, what we are going to do now is use this term in exactly the same way we used it in Gaussian process regression, or in parametric regression: we want to optimize the feature functions that enter all of these expressions as a function of their own parameters. So we're going to compute gradients of this object, and in particular it is more convenient, for numerical reasons, to compute gradients of its logarithm. Let's take the logarithm of the entire expression. We get the log likelihood (the exponential disappears), and we are left with the logarithm of a Gaussian, which is a quadratic term minus a log normalizer. This is the quadratic term, just written down, and the log normalizer is already subsumed into the term afterwards, because we also get the logarithm of that other expression; there is also an n over two times the logarithm of two pi, but that is a constant, so we might as well leave it out. We are left with minus one half times the log determinant of this covariance matrix, which comes from the normalization constant of the Gaussian, plus the logarithm of the other expression, and we can take those together, because the product of determinants equals the determinant of the product, and we get this expression here. So that is the function we need to optimize, and it looks a lot like what you might expect: it is a log likelihood, which we know (assuming we are using a logistic link), minus a quadratic term, which is our log prior, and then there is a new kind of Occam factor, the classification equivalent of the term we had in the Gaussian case. It looks like this, which is obviously quite comparable to the corresponding expression in the regression case, where we just had the log determinant of the kernel Gram matrix plus the noise covariance matrix; here, the role of the noise covariance is taken over by the second derivatives of the log likelihood. A minimal sketch of this evidence expression is given below.
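As a small sketch of that expression (the names are mine; Rasmussen and Williams' actual algorithm reuses intermediate quantities from the mode-finding loop and avoids forming K inverse explicitly, but the formula is the same):

```python
import numpy as np

def laplace_log_evidence(K, f_hat, log_lik_at_mode, W):
    """Approximate log marginal likelihood log q(y | X) under the Laplace approximation.

    K               : (n, n) kernel Gram matrix at the training inputs
    f_hat           : (n,) mode of the posterior over latent function values
    log_lik_at_mode : scalar, log p(y | f_hat) summed over the data
    W               : (n,) positive diagonal of the negative Hessian of the log likelihood
    """
    n = K.shape[0]
    sqrt_W = np.sqrt(W)
    B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
    L = np.linalg.cholesky(B)
    # quadratic "log prior" term; at the mode, K^{-1} f_hat equals the log-likelihood
    # gradient, so this explicit solve can be avoided in a full implementation
    quad = 0.5 * f_hat @ np.linalg.solve(K, f_hat)
    occam = np.sum(np.log(np.diag(L)))   # = 1/2 log det(I + W^{1/2} K W^{1/2})
    return log_lik_at_mode - quad - occam
```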
The log likelihood is also something we have already computed, and because we know the individual values it is cheap to evaluate — the likelihood factorizes over the data points — so this is O(n) as well. Then we need the Occam factor, which can be computed efficiently if you already have a Cholesky decomposition of this matrix B — the identity plus this expression with square roots of the diagonal matrix W multiplying K from both sides. The log determinant of a matrix for which you have a Cholesky factor is easy to compute: summing the logarithms of the diagonal entries of the Cholesky factor gives you the logarithm of the square root of the determinant, i.e. half the log determinant, which is exactly what we need.

So this gives us the first part of the answer to our two questions: I said we need to be able to compute evidences, because then we can marginalize and train models. I should probably add that in practice, when people train deep neural networks, they do not do this — whether they are Bayesian or not — because in a deep neural network this final layer is only a small part of the whole. So you might feel that this process, at least during training, is not that important, because you are only marginalizing out a relatively small and perhaps arbitrary part of your network just because you can, while all the layers underneath are approximated with a point estimate. That is true, and we will have to deal with it in a few moments.

The second thing we need to be able to do — and this will be very easy to verify — is to still train and use this Gaussian-approximate logistic regression classification framework when the model is not a Gaussian process but instead has features, and we perhaps want to optimize those features. You can probably guess that this works, but let us confirm it. Suppose we use the Gaussian process classification framework we have just constructed and assume that the function we are trying to learn actually has a parametric form, so that wherever an f shows up it can be written as φ transpose times some weights — call them v. Then everything carries through. What I have here is the derivation of the log posterior, its gradients and its second derivatives from the previous lecture, where f had general values which we learned at all of the training points, f(X). If f instead has this parametric form, we can basically replace every value of f with φᵀv, and wherever we take derivatives with respect to the weights v we just apply the chain rule, which gives additional factors — the derivative of f with respect to v, which is essentially just φ.

Notice that this still gives us a convex minimization, and therefore concave maximization, problem. Why? Back when we did classification for Gaussian processes in the last lecture, the relevant matrix was the kernel Gram matrix, which is positive definite.
The only thing that has changed is that we now get the matrix W multiplied from the left and the right by the features φ, with the diagonal weights sitting inside that inner product — φ on the left- and right-hand side of W. This matrix is clearly still positive (semi-)definite: W is positive definite because it is a diagonal matrix with positive entries, and a positive definite matrix multiplied from the left and the right by the same matrix and its transpose stays positive semidefinite, as you can easily confirm for yourself in a one-line proof.

In the end we would like to apply this framework not just to learn v but also to learn φ, and then we need the derivative of this expression with respect to φ rather than v. That is also fine — we just get v's appearing outside on the left and the right. The only problem is that those v's are not necessarily positive anymore, which may spoil our optimization problem further down the line, especially once we compute recursive derivatives of lower layers with respect to higher layers. That, of course, is one of the fundamental challenges of deep learning: the optimization problems are not necessarily convex, so optimization can be a bit more complicated. This is not something that arises from the probabilistic treatment, though; it arises from the fact that we are using features for which we cannot guarantee a convex optimization problem.

Okay — so with this, even if you have not fully grasped it yet, we actually have all of the machinery to connect deep learning, at least in the classification setting, and in fact more general supervised machine learning problems beyond regression, to the probabilistic framework. The key ingredient is the Laplace approximation: wherever there is a loss function that does not have quadratic form, we approximate it quadratically and treat that as a Gaussian approximation. Of course, doing so introduces an approximation error. But you might wonder whether that is such a bad thing, because the alternative is to use deep neural networks as they are — that is, to use a point estimate, an empirical risk minimizer, rather than an uncertainty estimate. From a Laplace approximation you at least get a notion of uncertainty over the weights of the neural network — so far only over the weights of the output layer, but in a moment we will do the entire network.

Why might you want such a Bayesian approximation to a neural network, and is a Laplace approximation actually enough to get one? Of course, no approximation is ever perfectly enough. But I would like to show you a quick, simple argument for why probabilistic uncertainty on deep neural networks can be very beneficial, even if it is obtained very approximately. Let us look at a concrete setting that is very close to — maybe a prototypical form of — current deep learning for classification. Say we have a feedforward neural network for classification: there is some input x, and we assume that the labels y are drawn from a sigmoid likelihood.
That sigmoid is the logistic or softmax function, applied to some parameterized function f_w(x) given by a deep neural network. Written out — in one not particularly elegant way of writing a deep neural network — it is a cascade, a recursion, of linear maps with weights w applied to nonlinear link functions φ.

Just to make the point again, as I have mentioned several times by now: what people usually do to train such a network is to optimize the weights. That amounts to maximum a posteriori estimation, which means minimizing an empirical risk function. This risk can be thought of as the negative logarithm of the logistic likelihood of this function f — a function of f and therefore also of w — plus, if you want and think it necessary, a regularizer, which is a negative log prior. For example, a quadratic weight cost on your weights corresponds to a Gaussian prior on the weights. Let us call this entire function a loss, J(w). Then deep learning amounts to minimizing this empirical risk function, with all sorts of bells and whistles — automatic differentiation, mini-batching, stochastic optimization (a minimal sketch of this setup is shown below).

Doing so, even though it is totally standard, can have certain pathologies, and I would like to point out one that comes from relatively recent work here in Tübingen by my esteemed colleague Matthias Hein. Imagine we want to apply this kind of framework to a classification problem like this one, with binary classification — it also works for multiclass classification — and the one decision we make is that the feature functions φ, the nonlinearities in the network, are ReLU functions: rectified linear units, which we have already seen several times in previous lectures. They are piecewise linear functions: zero up to some threshold, and then a linear function starting at zero. ReLUs are very popular in deep learning, and they are also relatively easy to analyze.
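Here is a minimal sketch of that standard setup — a small ReLU classifier trained by minimizing cross-entropy plus a quadratic weight cost, i.e. MAP estimation. All names, sizes and hyperparameters are illustrative assumptions, not taken from the lecture's slides or notebook:

```python
import torch
import torch.nn as nn

# Toy data and a small ReLU feed-forward classifier, trained by MAP estimation:
# cross-entropy is the negative log likelihood of the softmax/logistic link,
# and weight decay plays the role of the Gaussian prior (quadratic weight cost).
torch.manual_seed(0)
X = torch.randn(200, 2)                      # illustrative inputs
y = (X[:, 0] + X[:, 1] > 0).long()           # illustrative binary labels

model = nn.Sequential(
    nn.Linear(2, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 2),                        # logits f_w(x)
)

nll = nn.CrossEntropyLoss()                  # -log p(y | f_w(x))
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-3)

for step in range(500):
    opt.zero_grad()
    loss = nll(model(X), y)                  # data term of J(w); the prior term
    loss.backward()                          # is added by weight_decay in the update
    opt.step()
```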
One interesting thing that Matthias Hein recently pointed out is that if you use such a network for classification, the network becomes arbitrarily confident far away from the data. What I mean is: if you try to learn a classifier with this ReLU deep network, the network will learn a decision boundary like this black line, and then become more and more confident about the class as you move away from the data — even if you move almost exactly along the decision boundary, very far to the north here. Why is that? The full theorem is here, and you can of course read the full derivation in the paper, but to give you the intuition: ReLU classifiers are piecewise linear functions. There are two observations to make. First, even though you build a hierarchical deep network out of them, the result is still a piecewise linear function, because a piecewise linear function of a piecewise linear function is still piecewise linear. The second observation is less obvious but correct: if you have finitely many features in your deep neural network — no matter how many, and no matter how deep the network is, as long as you track finitely many weights, which of course everyone does — then there is a region outside of the data where every one of these rectified linear units is either on or off: you are on one fixed side of its kink, and it will not switch anymore. In that region, as you move further and further away from the data, you are simply looking at a linear function, one with no further change points. So the input to the logistic link function is a linear function, and as you move away, that linear function either grows or falls — unless you are extremely lucky and it is exactly zero, which it will not be for a trained network. As you move away, this function therefore becomes arbitrarily large, and if the input to the sigmoid link function is arbitrarily large, positive or negative, then the output goes to zero or one. So the network predicts one of the classes, the red or the green one, with perfect confidence — asymptotically with probability one. (The asymptotic argument is written out in symbols below.)

That is of course bad; it is not a property you want your deep neural network to have. This is a simple picture, but imagine it as a computer vision task where you have trained on some very narrow manifold of labeled images. If you now move away, you are often still in the domain of natural images, just far away in this high-dimensional space from the training images — and the network can then be very confident about the label of an image that it classifies completely incorrectly.

So how do we fix this? As it turns out, we fix it by assigning probabilistic uncertainty to the network, and I would like to show you briefly that this particular flaw can be healed even with a very approximate, very lightweight, very simple Bayesian, probabilistic approximation. Here is how the approximation works: we put a Laplace approximation over the posterior over the weights of the neural network and use it to marginalize over the predictions. Let us see how that works.
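To write the overconfidence argument down in symbols (a sketch; a and b denote the coefficients of the affine function the network computes in the relevant outer linear region — they are not notation from the slides): once x = x₀ + t·d lies far enough out that no ReLU changes its on/off state along the ray,

```latex
f(x_0 + t\,d) \;=\; a^{\top}(x_0 + t\,d) + b
\qquad\Longrightarrow\qquad
\sigma\!\bigl(f(x_0 + t\,d)\bigr) \;\xrightarrow[t\to\infty]{}\;
\begin{cases} 1, & a^{\top} d > 0,\\ 0, & a^{\top} d < 0,\end{cases}
```

so the point estimate is forced to full confidence along almost every direction.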
We would like to make a prediction. If we knew f, the prediction would simply be the sigmoid applied to it. But in reality we do not really know the right weights of our neural network, so we would like to properly compute a marginal distribution over the predicted class — integrating out the uncertainty over the weights. To do so, we would like to use the true posterior over the weights given y. That posterior, at least if we trust this deep learning model, is, up to normalization, the exponential of minus the expression down here — because that loss is the negative logarithm of prior times likelihood.

We cannot do this in practice, of course, because this expression is difficult to work with: it contains the deep structure. The problem is not really the sigmoid — we have already solved that with the Laplace approximation. The problem is that this f is a deep function, a recursive application of nonlinearities. We can deal with that just as we dealt with the other nonlinearities: using a Laplace approximation. So we replace the posterior over the weights with a Gaussian, a Laplace approximation. Its mean is the mode of this function — and notice that we already have that mode, because we have trained our deep neural network, so we know where the minimum is. Its covariance matrix is given by the inverse curvature, the inverse Hessian, of the loss function J(w) from two slides ago; let us call it Ψ, the matrix of covariances. You may wonder how expensive this is; we will talk about that in a moment.

One problem, though, is that this integral up here is over f — or rather over w, but w shows up inside f — and f is not a linear function of w anymore. It is linear in the weights of the final, output layer, the map from the deep nonlinearities, the ReLUs, to the classification; but below that it is not linear in w, because those weights enter into the ReLU nonlinearities. To deal with that, we simply approximate our predictive function by a linear one: we say that f(x) is given by f(x) evaluated at w* — exactly the prediction we would make in the classic deep learning setting, evaluating f at the trained weights — plus a linear term, the Jacobian of the predictive function with respect to the weights, times the difference between the weights and the MAP estimate w*. This is a first-order Taylor approximation, and computing it is easy, because the Jacobian of the predictive function with respect to the weights is something we already compute anyway — it is a standard part of the backprop pass.
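In symbols, the two approximations just described are (a sketch in generic notation):

```latex
q(w) \;=\; \mathcal{N}\!\bigl(w;\; w^{*},\, \Psi\bigr),
\qquad
\Psi \;=\; \Bigl(\nabla\nabla_{w} J(w)\big|_{w^{*}}\Bigr)^{-1},
\qquad
f(x; w) \;\approx\; f(x; w^{*}) \;+\; J_{w^{*}}(x)\,(w - w^{*}),
```

where J_{w*}(x) denotes the Jacobian of f(x; w) with respect to w, evaluated at w*.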
With these two approximations — the linear approximation to f and the log-quadratic, therefore Gaussian, approximation to the distribution over the weights — we can now do this integral in closed form: we have a Gaussian posterior over w, and therefore an implied Gaussian distribution over f, because f is now a linear function of w in this approximation. That Gaussian over f has mean f(x; w*) and a covariance given by the approximate Gaussian covariance over the weights, with the Jacobian of the predictive function applied to it from the left and the right. So we have a Gaussian distribution over the latent prediction that enters the sigmoid to predict the class y.

The final thing to do is the integral itself, which we can now almost do: we have a Gaussian distribution over f and we are only integrating it against the sigmoid. That is a problem we already encountered in Gaussian process classification, and there are simple approximations for it — in particular a classic one that comes from David MacKay, like a lot of the results we are discussing today (I will mention him again in a moment). It says — and forget about this 'a' and 's', that is a typo — that p(y = 1 | x), this thing up here, is approximately the sigmoid evaluated at the predictive mean of this Gaussian divided by the square root of one plus π/8 times the variance. That looks a little complicated, but it is just the result of a simple linearized integral.

So is the problem that Matthias Hein pointed out still there? The problem used to be that m(x) — note that m(x) equals f(x; w*) — is, far from the data, a linear function, and as we move away from the training points it becomes very large, so the sigmoid of it goes to either one or zero, depending on the sign. Under the Gaussian approximation, we now correct this mean prediction by a term involving the variance — the square root of the variance, in fact. Notice that this variance involves the Jacobian of the predictive function, and far from the data f is linear in x, so there is an x inside this Jacobian as well. You can therefore think of the variance, roughly, as an expression that looks like x times Ψ times x transpose: quadratic in x, of which we take the square root. Very roughly speaking, then, we have a linear function divided by another linear function (the square root of a quadratic one), and you can imagine that this need not grow to a large value anymore as x becomes large in absolute terms. That is a hand-wavy argument, but it turns out you can make it formal; this is the result of a paper — currently still a preprint — written by Agustinus Kristiadi, a PhD student here in Tübingen working in my group, together with Matthias Hein, who showed that this is indeed the case. The more formal statement says: if you are in the setting we just discussed on the previous slides and you move far away from the training data, then under this ReLU architecture and the Laplace approximation, the predictive uncertainty — the predictive probability for a particular class label — converges to a value that is bounded away from zero and one.
So the network no longer becomes overconfident under this extremely simple, very strongly approximate Gaussian approximation. Results like this provide arguments for why you might want to construct at least an approximate posterior over your deep neural network. Now, this is an asymptotic statement, and it is just one specific statement about one specific class of deep neural networks — ReLU networks, in this particular case even in a binary classification setting. But instead of trying to give you more arguments for why you might want to be Bayesian, let me address what is probably more on your mind at this point: two more practically minded questions. The first: okay, but isn't that going to be expensive? I have heard Bayesian methods are very expensive. And the second: okay, this was all very complicated and I don't feel like I could actually do this myself — this seemed like a very quick tour into Bayesian deep learning and I don't feel empowered to do it at all. So let me give you two quick answers at the very end of this lecture.

The first: how expensive this approximation really is depends on how precise you want to make it. Of course, if you build the full Hessian of an entire deep neural network, that is an expensive operation: it is a matrix whose size is the square of the number of weights of the network, and then you have to invert the whole thing to get an uncertainty, which is very expensive. But nobody says you have to use the entire Hessian. There are lots of different approximations that make this process much more lightweight, and since we are already making strong linear and quadratic approximations, we might as well weaken even those approximations a little further. For example, you could use a low-rank approximation of the Hessian; these can be constructed from matrix–vector multiplications with the Hessian, which cost roughly the same as a backprop pass. You could use only a block-diagonal approximation of the Hessian — for example, one block for each layer, so that each layer has its own Hessian, or, even more extreme, for the output layer one block for each class prediction in a multiclass setting. You could also decide to use only the very last layer; then we are essentially back in the Gaussian parametric regression and classification domain. The plot I showed you here actually uses this last-layer approximation, and you can see that it is not particularly bad — typically, adding more structure of the Hessian from lower layers just makes the uncertainty a bit more fine-grained, but the effect is relatively minor. And the most extreme thing to do is to use only the diagonal of the Hessian, which amounts to independent uncertainty for every single weight of the network. That is very cheap — essentially as expensive as computing a single gradient, at least on paper. Now, how difficult is it to implement all of these things?
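The predictive correction itself is only a line of code. Here is an illustrative helper (not from the lecture's notebook), where mean and var stand for the linearized-Laplace mean m(x) and variance v(x) of the binary case discussed a moment ago:

```python
import math

def laplace_binary_predictive(mean: float, var: float) -> float:
    """MacKay's approximation to E[sigmoid(f)] for f ~ N(mean, var):
    sigmoid(mean / sqrt(1 + pi/8 * var))."""
    scaled = mean / math.sqrt(1.0 + math.pi * var / 8.0)
    return 1.0 / (1.0 + math.exp(-scaled))
```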
Well, until quite recently it was actually still quite nontrivial to get these things right, and it required quite a lot of deep thought. But things have changed — and they are still changing rapidly. There are lots of new software tools that help with this kind of process using automatic differentiation, and I want to highlight one, shamelessly plugging a piece of software from my own group: it is called BackPACK, for PyTorch. This is the logo, this is the website. It was built by Felix Dangel and Frederik Kunstner in my group, and it was published just a few weeks ago — this recording is from 2020 — at ICLR, the International Conference on Learning Representations. It offers all sorts of syntactic sugar and hooks into PyTorch to compute second-order quantities: curvature quantities such as diagonal or other factorizations of the Hessian, and also additional quantities like variances of gradients, which we are not going to use here.

To give you an idea of how this works, I asked Agustinus Kristiadi, who wrote the paper I just cited, to create a small code example, which I will upload on ILIAS so that you can look at it. It gives you an idea of how easy this code is — so easy that I can show it at the very end of this lecture, in just a few seconds. This is what you will find on ILIAS: a demonstration of this process for multiclass classification. I have collapsed the bit of code that produces this data set; it is a classification problem with four classes. In this piece of code, which I am not going to walk through in detail, there is some setup for torch that creates a neural network with two ReLU layers — so arguably a deep neural network, and of course you can make it deeper if you want, with batch-norm layers in between.

Now here comes the important bit. We first make a prediction to see that the problem still exists: as we move far away from the data, the shading here shows the confidence of the model in the individual classes. You can see that up here the model predicts the magenta class, here the yellow, the green, and the red class with high confidence; as we move away, it becomes overly confident. You can read through the rest later if you like. What Agustinus then does is use the functionality of the BackPACK library, which provides extensions in particular for the computation of factorized versions of the Hessian — and that happens in here. We add the cross-entropy loss function, the predictive likelihood for multiclass classification, and then add an extension — the one line that does all the magic from BackPACK — which computes a Kronecker-factored approximate curvature; KFAC is the corresponding technical term, a relatively recent method for computing exactly these Kronecker factorizations. That happens in this line. We can then invert these quantities to get our Laplace approximation in these lines, and then use them to predict — and the prediction, let me just go back to the slide, uses this idea of computing a quantity like this inner product of the Jacobian with the curvature. That happens in actual code in this line here.
You can sort of see that this is what is happening. We can now use that to predict through the softmax. Predicting through the softmax is a little more complicated: there is no generally accepted, easy approximation for this multiclass predictive — actually there is one, but there is no time to talk about it now, otherwise I would have to plug yet another paper from my group — so instead we just do sampling: we draw a bunch of samples from this Gaussian distribution and push them through the softmax. That gives a prediction like this for these classes (the image is accidentally flipped), and you can now see that this predictive distribution produces meaningful uncertainty: as you move far away from the data, it becomes quite uncertain about the correct class label, which is exactly what we want.

With this, I hope to have convinced you that being Bayesian about deep learning does not have to be a complicated process. It does not have to involve writing your own code, and it does not have to involve additional, highly complex computational steps. It does require a little extra computation, but you can make that extra computation very cheap if you are okay with cheap approximations. Maybe you have read somewhere that Bayesian deep learning does not work, that it is complicated, that it is very expensive, that it is nowhere near done — and maybe that is true: there is a lot of research going on in the community to build more powerful, more reliable, more robust and efficient Bayesian approximation schemes for deep learning than this one. Those schemes also tend to be more expensive and much harder to implement, but that is a question for researchers. If you just want to train your deep neural network and assign meaningful error bars to its predictions, then simple tools like this can be enough.

You might wonder why you do not read about these kinds of results in papers — well, you can read our papers, but why not in other papers? Maybe one reason is that the Laplace approximation is not a new idea at all. It was introduced for this particular setting, at a time when people were not yet talking about deep learning, in 1992, by David MacKay — here he is again, the wonderful David MacKay — in a paper that has all of the ingredients we discussed today. It is called "The evidence framework applied to classification networks", and you can probably guess, just by skimming the abstract, that it essentially introduces all of the quantities, notions and algorithms I just discussed. Maybe because this idea is now almost thirty years old, people have stopped writing papers about it. That is perhaps a good thing — nobody needs an infinite number of papers about an old idea — but it does not mean the old idea has stopped working just because the world has advanced. As you have just seen in these examples, it still works quite well, and if you are facing a practical deep learning problem, I would actually recommend that you try, as far as possible, to use these simple ways of quantifying uncertainty in your deep learning architecture.
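For readers of this transcript who want something runnable without the slides or the ILIAS notebook, here is a deliberately simplified sketch of the same pipeline — a last-layer Laplace approximation with an exact (not Kronecker-factored) Hessian and Monte Carlo prediction through the softmax. It does not use BackPACK; all names, sizes and the prior precision are illustrative assumptions, not the notebook's actual code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(300, 2)
y = (X[:, 0] * X[:, 1] > 0).long()             # toy 2-class problem

body = nn.Sequential(nn.Linear(2, 20), nn.ReLU(), nn.Linear(20, 20), nn.ReLU())
head = nn.Linear(20, 2)                         # last layer: the weights treated as random
model = nn.Sequential(body, head)

# 1) Standard MAP training (point estimate w*).
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-3)
for _ in range(500):
    opt.zero_grad()
    nn.functional.cross_entropy(model(X), y).backward()
    opt.step()

# 2) Laplace approximation over the last-layer weight matrix only: Hessian of the
#    negative log posterior at w*, then invert. Bias kept fixed for simplicity;
#    prior_prec is an illustrative choice.
prior_prec = 1e-1
phi = body(X).detach()                          # features entering the last layer

def neg_log_posterior(w_flat):
    W = w_flat.view(2, 20)
    logits = phi @ W.T + head.bias.detach()
    nll = nn.functional.cross_entropy(logits, y, reduction="sum")
    return nll + 0.5 * prior_prec * w_flat.pow(2).sum()

w_star = head.weight.detach().reshape(-1)
H = torch.autograd.functional.hessian(neg_log_posterior, w_star).reshape(40, 40)
Sigma = torch.linalg.inv(H)                     # posterior covariance of the last layer

# 3) Predict by linearization + sampling: logits are Gaussian with mean f(x; w*) and
#    covariance J Sigma J^T, where J is the Jacobian w.r.t. the last-layer weights.
def predict(x, n_samples=200):
    phi_x = body(x).detach()                    # (1, 20)
    mean = phi_x @ head.weight.detach().T + head.bias.detach()
    J = torch.kron(torch.eye(2), phi_x)         # Jacobian of the 2 logits w.r.t. w_flat
    cov = J @ Sigma @ J.T + 1e-6 * torch.eye(2)
    dist = torch.distributions.MultivariateNormal(mean.squeeze(0), cov)
    samples = dist.sample((n_samples,))
    return samples.softmax(dim=-1).mean(dim=0)  # Monte Carlo average of class probabilities

# Far from the data, this should be noticeably less confident than the point estimate.
print(predict(torch.tensor([[8.0, 8.0]])))
```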
With that, we are at the end of today's lecture. We have tied up quite a few loose ends before we move on to a totally new topic in the next lecture, and made connections from the Gaussian process classification framework I introduced in the previous lecture to various other domains. We saw that support vector machines, an important class of supervised learning machines, are in some sense a corner case of the logistic regression framework: they have very beneficial computational properties, but unfortunately are fundamentally not a probabilistic model, because they involve a loss function — a risk function — that is not a log likelihood. That is an interesting case to study, and it highlights the points where the statistical and the probabilistic frameworks are not so close to each other and require different kinds of analytic techniques; in turn, this also means that uncertainty is difficult to construct for SVM models. We then moved on to generalized linear models, to point out that if you have data that is fundamentally not real-valued but has other structure, beyond even binary or multiclass classification, then you can use other likelihoods and other nonlinearities to build what are called generalized linear models, and it is possible to transport over much of the functionality of the Gaussian process regression framework, using the Laplace approximation for approximate inference. In fact, I could have called this entire lecture "the Laplace approximation", because I kept using it in the final part as well, where I pointed out that a Laplace approximation can also be used to build approximate — potentially strongly approximate — posterior distributions for deep neural networks. Even though they are approximate, I showed you some simple asymptotic properties, quite recent results, that support and motivate this idea of using such Gaussian approximations in deep learning, in particular because such approximations are now, with new software tools, also relatively easy to implement. I hope you enjoyed today's lecture, and I am looking forward to seeing you again in the next one.