Welcome back to probabilistic machine learning, lecture number eight. Here we are in the course. We saw that probabilistic reasoning is the extension of propositional logic to reasoning under uncertainty. Doing so can be expensive computationally because it requires keeping track of a potentially exponentially large or combinatorially large search space. We can nevertheless sometimes complete this process by separating parts of the reasoning process using conditional independence. We can also extend this notion to continuous variables, and the resulting computations can then be addressed by various generic computational tools, one of them being sampling and Monte Carlo methods. In the last two lectures we saw another way to deal with the complexity of high-dimensional, continuous-valued inference, and that is to constrain ourselves to only use linear relationships between variables and only to use Gaussian distributions on all latent and observable quantities. This works because Gaussian distributions then map the process of inference, the abstract notion of solving or computing posteriors by integrating out latent variables, onto linear algebra. And linear algebra is something that computers are good at; it's not for free, but computers are efficient at it, so that's a way to get tractable inference. In the last lecture we saw that we can use this framework not just to learn individual variables, but to learn entire functions. Here's a one-dimensional picture again that I used in previous lectures. So here is a hypothesis space over functions that is created by assuming that the true function, which is something here in green, can be written as a weighted sum, so as a linear map, of a bunch of features. And here I've chosen a particular set of features; these are sigmoidal features, so they are functions that start at zero and then smoothly go to one. There's a bunch of them. If we now make observations and assume that the observations are evaluations of that function at particular points, corrupted by Gaussian noise, then we are exactly in the situation where all the quantities we care about are linearly related, because the function, and therefore the observations, are a linear map of the underlying weights, and all the probability distributions involved are Gaussians. The prior over the weights is Gaussian, therefore the prior over the function is Gaussian, and the observations are linear maps of the weights corrupted by Gaussian noise, and therefore the posterior distribution is a Gaussian distribution which is described by two parameters. This is what it looks like, and these are a bunch of somewhat complicated linear algebra expressions down here, but the point is that it is linear algebra and that these computations can be performed efficiently on a computer even though they look a little bit tedious. And we saw in the last lecture that we can implement these kinds of computations efficiently using low-level operations that are available without even needing access to complicated machine learning toolboxes. That doesn't mean we shouldn't be using toolboxes, it's just interesting to know because it gives us direct access to the underlying computations. We ended the lecture last time though on a little bit of a downer, which is that if you look at this posterior distribution, then it is not particularly satisfying.
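Just to make the linear algebra mentioned above concrete in code, here is a minimal sketch of that posterior computation, assuming a Gaussian prior w ~ N(mu0, Sigma0), independent Gaussian observation noise, and sigmoidal features. The function names and the toy numbers are mine, not the lecture's actual implementation.

```python
import numpy as np

def sigmoid_features(x, locations, scale=1.0):
    """Sigmoidal features: one smooth 0-to-1 step per entry of `locations`."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - locations[None, :]) / scale))  # (n, F)

def weight_posterior(X, y, locations, mu0, Sigma0, sigma_noise):
    """Gaussian posterior over weights w for the model y = Phi @ w + Gaussian noise."""
    Phi = sigmoid_features(X, locations)                      # (n, F) feature matrix
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_post = np.linalg.inv(Sigma0_inv + Phi.T @ Phi / sigma_noise**2)
    mu_post = Sigma_post @ (Sigma0_inv @ mu0 + Phi.T @ y / sigma_noise**2)
    return mu_post, Sigma_post

# toy usage with arbitrary numbers
X = np.linspace(-8, 8, 18)
y = np.sin(X) + 0.1 * np.random.randn(18)
locations = np.linspace(-8, 8, 16)          # 16 regularly spaced sigmoid features
mu_post, Sigma_post = weight_posterior(X, y, locations,
                                       mu0=np.zeros(16), Sigma0=np.eye(16),
                                       sigma_noise=0.1)
```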
So one issue for example is that it doesn't fit the data very well around here, it doesn't adapt to this shape well, and also the extrapolation abilities are maybe a bit questionable; the uncertainty here is not perfectly calibrated. And we saw that all of this is basically caused by this choice of feature functions. So it's a little bit annoying that we have to use these feature functions. What can we do to fix this? Well, one way to do this, a very pedestrian way, is to just stare at these feature functions and say, hmm, maybe if we don't put them so regularly, maybe if we don't distribute them at constant distances from each other, we can make this work a bit better. So let's look at the data, and notice that this requires us looking at the data, which is problematic for two reasons. One, because in practice you can't always do it: this is a one-dimensional picture, and many applied problems aren't one-dimensional, they're much, much more high-dimensional. And the second problem is a more philosophical one, that it's not particularly Bayesian to construct your prior after having looked at the data. So that kind of breaks our nice philosophical, conceptual separation between prior assumptions and the likelihood of the data. But let's just do it anyway; I mean, we're practically minded people, so maybe we just ignore the philosophical issues arising from this. We could just say, maybe I put my features more like this. So what I've done here is actually I've still used a regular distribution of the features, they're just not uniformly distributed anymore from minus 8 to 8; instead they're more densely spaced in the middle and spaced further apart on the outside. The way I actually did this is I took a Gaussian cumulative distribution function and spaced the locations of these features at regular intervals on the CDF. So that gives a kind of centering, with more density here in the middle. That's totally arbitrary, it still keeps a little bit of structure, but it's just something you could do, because it does give rise to this kind of posterior. So using the same number of features and the same kind of features, and just choosing where they are differently, we get in many ways a nicer distribution. One nice thing about this is that here in the middle we now capture the structure of the data better, but there are also other nice properties, like for example that the function returns to 0 here and that it linearly extrapolates here, or constantly extrapolates to the right. Of course, depending on what you know about this data set, this might be a good or a bad thing, but for many reasons this might be argued to be a nice thing. It's a very natural thing to extend in a continuous fashion over here, for example. I mean, what else are you going to do? Now, since we've already broken with our philosophical framework and we've sort of fiddled around with the model having looked at the data, maybe we can keep doing that and say, well, now that we've found that, maybe we can do this even better. So notice that here I'm using about 16 features, I think. That seems a little bit overkill, because now we have all these features in regions that maybe don't matter so much actually.
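As a small aside, the CDF-based placement described above is easy to write down; here is a sketch of how one might do it, assuming SciPy is available. The particular mean and scale are illustrative choices, not values from the lecture.

```python
import numpy as np
from scipy.stats import norm

def cdf_spaced_locations(num_features, mean=0.0, scale=3.0):
    """Place feature locations at regular intervals of a Gaussian CDF:
    denser near the mean, sparser towards the outside."""
    quantiles = np.linspace(0.0, 1.0, num_features + 2)[1:-1]  # drop the endpoints 0 and 1
    return norm.ppf(quantiles, loc=mean, scale=scale)

locations = cdf_spaced_locations(16)   # 16 centers, most of them near zero
```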
So how about I change these features a little bit more? First of all, one thing I can do about these sigmoid features is that I can also change their gains: I can make them a bit steeper, more like step functions, or a little bit wider, and then I can also really think about where to put them, and maybe I end up realizing that I only need maybe something like four features or so. That's a reduction by a factor of four, basically. Just one over here where the data starts, and it's relatively constant there anyway. Then there's this step in the middle that I just need to model with one particular feature, and then maybe just a bit more flexibility for the model down here to capture the other structure in the data. If we do this, then this is what the posterior looks like. It's maybe not perfect, but it still actually constantly extrapolates, just a little bit further up here where we don't see it anymore. It still goes back to zero. It does have this nice little kink here in the middle, and, you know, that's just four degrees of freedom. I have a data set here that contains I think 18 data points, and I've reduced it to four degrees of freedom very efficiently. And I'm still doing Bayesian inference. I've just sort of thrown away a little bit the issue that I should really not look at the data when designing my prior. But other than that, I still get posterior distributions that have some width. That width is actually informed by the data. We could argue about whether it's particularly well chosen or not. Why don't we do this more often? This is maybe something we should try. It's just a little bit annoying that we have to do it by hand, and it's tricky to do this in a practical setting if you have data that is more than just one-dimensional. So what's just happened here? Let me just repeat this process that I just did in this, you know, hand-wavy pedestrian way. So we are doing Bayesian inference on some unknown weights of our function given the data, and to do so we use a model. That model is given by the feature functions phi. So what we've done in the previous lecture is we've computed a posterior distribution over those weights from the data, given that we've chosen a particular model. How to do this is described uniquely and correctly by Bayes' theorem: multiply the prior with the likelihood and divide by the evidence. Now, the annoying thing is that we have to choose these feature sets phi. Not only is it annoying that we have to choose them, because that means that we might be, you know, restricted by the choice we make; it's also problematic that even if we think about trying to choose phi, there is an infinitely large space of features, as I pointed out in the previous lecture. There are very, very few constraints on what kind of features you could consider to build these kinds of models. So the first thing we just did, and I didn't actually explicitly say so, but I did it anyway, is I decided to use a particular family of features. So I said I'm going to use these sigmoid features. That was a totally arbitrary decision. I just said, why not use those sigmoid features? And these sigmoid features, by the way, they look like this. So this is a logistic function. It's the feature function number i at the input location x. Given two parameters theta one and theta two, it is one over one plus the exponential of minus x minus theta one over theta two, that is, phi_i(x) = 1 / (1 + exp(-(x - theta_1) / theta_2)). In the usual way you see these functions, this would just be e to the minus x in the denominator, and that's this typical sigmoid function.
If x is very, very small, so if it's a large negative number, then there's a large positive number in here; e to a large positive number is a large number, so the whole thing, one over a large number, goes to zero. So this function starts at zero. If x is a very large positive number, then here we have e to the minus a very large number, that's zero, and we have one over one, so the function goes to one. So what I've now done here is I've introduced two parameters that shift x and scale it, theta one and theta two. And by doing so, I can move these functions around to the left and the right, and I can make them steeper or flatter by changing theta two. So, shifting by changing theta one, scaling by changing theta two. That was a specific decision we made. I also played around with the number of these features, i from one to F. And what I then did is I hand-tuned those choices, theta one and theta two, such that I somehow liked the model. So maybe we can do this a little bit more formally, and in the process of that, one thing you might notice is that, hang on, this is just two more numbers, or actually F times two more numbers, theta one and theta two for each feature i, that we don't know. So if there's something we don't know in our model, then there is a correct way to treat it: just make it part of the inference. So really we should just add theta to the set of unknown parameters or variables or whatever you want to call them, and extend our hypothesis space from putting a probability measure over w to also putting one over theta. And then shouldn't we be able to just compute the posterior distribution over w and theta? Well, actually, maybe we can. This process is called hierarchical Bayesian inference. It's hierarchical because there are these two different layers in our model. We have the weights w, which previously we considered as our core unknown object, and now there's this other set of variables called theta, which actually are also part of the stuff that we don't know. So when we're doing Bayesian inference over the weights or the function, which are essentially the same thing because they're connected directly by a linear map, then we're computing a posterior, at least we did so far in the previous lecture, over the unknown function f, or over the unknown weights w, given the data X and y and the model parameters theta, by taking the prior, which depends on the model parameters theta, multiplying it with the likelihood, and then normalizing by the evidence. Why does the prior depend on theta? Because f is phi times w and phi depends on theta. And now what you might notice is that this object here in the denominator, the evidence term, is a marginal over f. So we're integrating out the unknown function f, given theta: we compute the integral of the prior over f given theta, times the likelihood, which is a probability of y given f and theta, df. So what we're left with is a marginal distribution for y given X and theta, where X is just the input data. So we actually know X, so that's fine, that's not a problem. But what we have down here is essentially another likelihood term. It's just not a likelihood for f — sorry, it's not a likelihood for y given f and theta, it's a likelihood for y given just theta. So this is exactly of the same form as above here, it's just that we've gotten rid of f. So we could continue with this process and now say, well, I just need a posterior over theta now, right?
So what I should do is I compute this object, which is a posterior over theta given y, which is given by a prior over theta — okay, we don't know yet what that is, but we can set it to whatever we like, essentially, and think about whether we like that prior assumption or not — multiplied with this likelihood for y given theta, which we have up here essentially. I've dropped X now because it doesn't really matter; we know what X is anyway. And, you know, to do Bayesian inference we have to divide by the evidence, which is a normalization constant, so that's the integral where we integrate out theta. Interesting. So this is a process we can actually often do, and that we typically do in any kind of data modeling. We have to deal with different types of quantities. There is the data, which is stuff we get to see. So the data is in many ways actually the easiest bit, at least conceptually. I mean, in practice it's often hard to get and to maintain and to work with, but conceptually it's the easiest thing, because it's a bunch of numbers that's just stored on your hard drive, so you know everything about the data. Then there are the quantities we actually care about, the ones we want to make statements over. These are often called variables. Then there are quantities in the model that might affect the data just as much as the variables do, but in the end we won't care about them. We want to get rid of them because they are not part of the question we're trying to answer. You might call these parameters of the model, and these are things you want to integrate out to get rid of. You could also call them variables; sometimes the distinction is a bit vague. Then there are these things like theta, which we really want to infer as well, ideally, and sometimes we can even do so. But even if you are able to integrate out f here and do full probabilistic inference on theta, then the next question you might have is — so maybe let's go back — let's say I managed to do full Bayesian inference on those values of theta 1 and theta 2. The next question someone might of course ask is, why did you choose sigmoidal features? Why did you choose these particular logistic functions as your features? As we know from the last lecture, you could do inference with cosine functions or step functions or anything else. Why did you use these in particular? That's another parameter of our model, which family of features we are using. Eventually, you can imagine, this turns into a kind of reductio ad absurdum. At some point it's impossible to keep arguing in ever deeper detail, because it just becomes completely intractable. These final layers of parameters, where at some point we say, stop, okay, that's the last layer we're going to deal with, and we're just going to set these parameters in some way and we're not going to compute posteriors over them anymore — these are often called hyperparameters. And of course, what exactly is a hyperparameter and a parameter and a variable is debatable in any particular practical setting, and it's also not particularly important. It's just a way to think about what you're going to do computationally with these different quantities. Are you going to construct a posterior over it and then use that posterior to answer your question? Are you going to get rid of these parameters somehow with a more or less elegant integration method? Or are you just going to set it to some value?
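To collect the two levels of inference described here in one place, here is a short summary in equations. This is just a restatement of what was said above, with w the weights, theta the feature parameters, and (X, y) the data.

```latex
% Level 1: posterior over the weights w (equivalently the function f = Phi_theta^T w),
% for a fixed choice of the feature parameters theta
p(w \mid y, X, \theta) = \frac{p(y \mid w, X, \theta)\, p(w \mid \theta)}{p(y \mid X, \theta)},
\qquad
p(y \mid X, \theta) = \int p(y \mid w, X, \theta)\, p(w \mid \theta)\, \mathrm{d}w .

% Level 2: that evidence acts as the (marginal) likelihood for theta
p(\theta \mid y, X) = \frac{p(y \mid X, \theta)\, p(\theta)}{\int p(y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta} .
```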
So how would we do this in our Gaussian regression model? Well, let's talk about that in a moment, and let's consider this gray slide if you like. If there are more parameters in your model that you currently don't know how to set, then at least the philosophically right way to treat them is to just consider them as part of your model, include them in your prior distribution, and ideally try to compute a posterior over them. If we want to apply this framework to our Gaussian regression setup that we've been using so far, then we need to check what this evidence term actually is in our Gaussian model, and conveniently it turns out that this is a term we have essentially already computed. I've just more or less ignored it in previous derivations because it's just a normalization constant of a Gaussian, and as we know from, what was it, lecture 6, that normalization constant also takes the form of a Gaussian distribution. So the product of two Gaussian distributions like this, up to symmetry in exchanging the parameters here on w, is another Gaussian distribution over w times another expression, a constant which has the form of a Gaussian PDF. We've usually interpreted this here as: prior times likelihood gives the posterior if we normalize by this term. So that's our evidence, actually. What is it? It's the expression you would get if you evaluate a Gaussian PDF at this mean and this covariance. So notice that there's no w in here, of course, because it's the normalization constant of the posterior over w, but there are thetas in here, and this is exactly our likelihood for theta. Interesting. So what does this look like? Well, it's another Gaussian distribution, and that might lead you to immediately think: that means we can just continue with this game. We know how to do Gaussian inference. We just put a Gaussian prior on theta and then we multiply this Gaussian prior with this Gaussian likelihood that we just got here. Hang on. The parameter theta here doesn't actually enter in a linear fashion. And remember that it's not enough that everything is Gaussian; the relationships between the variables also have to be linear. Here the relationship between the observation y and the parameter theta that we care about is non-linear. It depends — well, because the features depend on theta in whichever way we've defined our features. So in the example I've used here, the features depend on theta in this very complicated way. I'm not even sure what to call that; it's a rational-exponential kind of relationship to the features. So therefore we are not going to be able to just put a Gaussian prior here, multiply with it, and get a Gaussian posterior. And this is a typical problem. I mean, if it were so easy, we would have just included theta into our model in the first place and just gotten another Gaussian posterior. It's almost by definition that the parameters theta that we can't do Gaussian inference over are the ones we end up calling the parameters, because those are the ones that make the model complicated. So what can we do then? Well, over the course of this lecture we'll find various different ways of dealing with this situation, because it is essentially just another instantiation of the fundamental problem of probabilistic inference: that it's just computationally hard. It requires us to deal with integrals over complicated hypothesis spaces and complicated densities defined on them.
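For reference, the evidence term discussed above can be written out explicitly. Assuming the notation of this Gaussian model — prior w ~ N(mu, Sigma), features Phi_theta = phi_theta(X), Gaussian observation noise with covariance Lambda (these symbol choices are mine) — it is just the standard marginal of a linear-Gaussian model:

```latex
p(y \mid X, \theta)
  = \int \mathcal{N}\!\left(y;\; \Phi_\theta^{\top} w,\; \Lambda\right)\,
         \mathcal{N}\!\left(w;\; \mu,\; \Sigma\right)\, \mathrm{d}w
  = \mathcal{N}\!\left(y;\; \Phi_\theta^{\top}\mu,\;\; \Phi_\theta^{\top}\Sigma\,\Phi_\theta + \Lambda\right).
```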
So we will add more tools to our toolbox to deal with this situation over the course of the lecture. But maybe we should start by adding the most straightforward, most simplistic tool, one that arguably is almost not probabilistic anymore, which is that we could say: well, okay, let's just not do Bayesian inference on this theta. Let's just use this expression, which is a likelihood — it's a probabilistic quantity, it's a probability distribution for y given theta, all fine and nice, so that's how far we got with our probabilistic reasoning — and now let's just choose whichever theta is the best value for this likelihood. That means whichever theta maximizes this likelihood. We call that computing a maximum likelihood estimate. But it's not quite the same as maximum likelihood in the classic sense, because classically, maximum likelihood would just mean that we choose whichever function f maximizes the likelihood. But we've already integrated out f when computing the evidence, and now we're maximizing this marginal likelihood. This marginal likelihood maximization is sometimes called type-II maximum likelihood, or maximizing the marginal likelihood, the model evidence. We're going to add this to our toolbox. So the conceptual, the modeling idea is that we build hierarchical models that take parameters and then put weights on functions derived from these parameters, and then the parameters can be treated in a different way than the variables. Ideally, the variables are the thing you care about, and the parameters are the bit that you can get away with just setting somehow. And then we set them by doing what we will call maximum likelihood. And you might also do a minor variation on it, which we will call maximum a posteriori estimation, which I will mention, but it's going to be so simple that I'll just already write it in here, because it's a really trivial variation and we'll just sort of do it on the side later on. So how do we do this hierarchical inference in our Gaussian model explicitly, in particular? Well, we've decided to find what we might call a point estimate, theta hat, which maximizes this marginal, this type-II likelihood. Here it is again. This is the other way of writing this marginal likelihood. It's literally a marginal: it's the distribution you get by writing down the joint over the data, the latent function f, and the parameters theta we care about, and then getting rid of the unknown function by integrating over it. Because this is a Gaussian model, as we saw on the previous slides, this expression actually has the form of a Gaussian probability density function. It's just not a Gaussian distribution — well, it's a Gaussian distribution over y, but the mean of it is not a linear map of theta, it's a nonlinear map of theta. And theta actually shows up not just in the mean but also in the covariance, because, well, that's just how this works. So if we want to maximize this, we're going to go through a simple motion that is going to happen several times over this course. This is maybe the first time we do it, so we'll do it once, and then we're going to do it several times again, because it's actually a really important process that makes a connection to statistical machine learning as well. It's worth pointing that out several times. Here we're just going to encounter it for the first time, and then in later lectures we'll talk more about what this connection actually means.
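In symbols, the two point estimates just mentioned are (again in my notation, with p(theta) whatever prior one might choose to place on the parameters):

```latex
\hat{\theta}_{\text{ML-II}} = \arg\max_{\theta}\; p(y \mid X, \theta)
  = \arg\max_{\theta} \int p(y \mid f, \theta)\, p(f \mid \theta)\, \mathrm{d}f,
\qquad
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\; p(y \mid X, \theta)\, p(\theta).
```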
If we decide to construct a point estimate that is the maximum of a likelihood function, even though it's a marginal likelihood, so the maximum of this marginal probability distribution, then because we are maximizing this function, we might as well maximize the logarithm of that function. So why is that allowed? Well, first of all, the logarithm is a monotonic transformation. That means it doesn't shift the location of the maximum; it just changes the value at the maximum, but we don't care about what that value is, we just care about where it is reached. We're looking for the theta that maximizes this expression. We can take logarithms because, well, I could say that probability densities tend to be small numbers, less than one, so it might be numerically a good idea to take the logarithm, but actually there's a deeper underlying reason, which we'll talk about later, which is that many probability distributions are actually the exponential of something. The Gaussian, for example, is. So actually, let me write down the Gaussian distribution again, because we are going to need this. By definition, just to remind you, this probability density function is given by a normalization constant, which is one over two pi to the number of dimensions, which I call n here, divided by two, times the determinant of sigma to the one-half, times the exponential of minus one-half x minus mu transpose sigma inverse x minus mu; that is, N(x; mu, Sigma) = (2 pi)^(-n/2) |Sigma|^(-1/2) exp(-1/2 (x - mu)^T Sigma^(-1) (x - mu)). So notice that there's already an exponential in here, so it might be a good idea to take the logarithm of this expression, because then we'll just get rid of this exponential and it will be easier to think about. And it's fine to do that because the logarithm is a monotonic transformation. Now, it doesn't actually matter whether we maximize a function or we minimize minus that function. That's exactly the same thing, right? You can just walk down the hill instead of up the flipped hill. So we might as well take the minus here as well. That's good, because there's a minus in here, and that minus then goes away. If we do that, then we're left with an expression that's much easier to think about. So if we maximize this marginal likelihood, then what we're trying to do is minimize a function that is given by this quadratic expression, which is the bit that is in the exponential. So what is this? Well, it's a squared distance between the data and the prediction that is made by the prior under the model if we choose theta, scaled by the covariance that the model produces for that choice of theta. That's the squared expression. And then there is another term here, which we'll have to talk about, which arises from this log determinant. So the log determinant — well, sorry, the covariance matrix — also involves theta; that's just how it is. So that's also something we have to take into account if we want to maximize this expression with respect to theta. And then there is this constant, which is n over two times log of two pi, which isn't that important, because it's just a constant shift, and if we are maximizing this expression, or minimizing minus the logarithm of that expression, then it doesn't shift the location of the minimum. So what's happened here is that we now have an expression that you can think of as an empirical risk, in the language of statistical learning theory.
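Since this loss will come up again later in the lecture, here is a minimal NumPy sketch of it, using the notation that appears there — K = Phi^T Sigma Phi for the prior covariance pushed through the features, Lambda for the noise covariance, and delta = y - Phi^T mu for the residual. The function and argument names are my own, not the lecture's code.

```python
import numpy as np

def negative_log_evidence(theta, X, y, mu, Sigma, Lambda, features):
    """-log p(y | X, theta) for the Gaussian feature model:
       0.5 * ( delta^T (K + Lambda)^{-1} delta + log det(K + Lambda) + n log(2 pi) )."""
    Phi = features(X, theta)                 # (F, n) feature matrix, depends on theta
    K = Phi.T @ Sigma @ Phi                  # covariance contributed by the prior over w
    cov_y = K + Lambda                       # marginal covariance of y
    delta = y - Phi.T @ mu                   # residual w.r.t. the prior mean prediction
    fit = delta @ np.linalg.solve(cov_y, delta)     # squared-error (data-fit) term
    occam = np.linalg.slogdet(cov_y)[1]             # log-determinant: the Occam factor
    return 0.5 * (fit + occam + len(y) * np.log(2 * np.pi))
```

One could then hand this function, for example, to a generic optimizer such as scipy.optimize.minimize to get the type-II maximum likelihood fit described above.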
So here is the data that we care about, that we want to model, and here are the predictions of our model. And what we're trying to do is minimize the squared distance — log Gaussians are quadratic functions — the squared distance between the data and the prediction that our model makes. And this is very much an empirical risk, but it's maybe an unusual one, because we have this additional factor here. And this factor arises directly from the probabilistic treatment. If we had just said, well, let's just find an expression that somehow gives us a model that fits the data well, then maybe an intuitive thing to do would have been to just choose the squared error. Maybe just take this particular expression, maybe even without this complicated structure in sigma — maybe we just would have said, I don't know, just call that a unit matrix — and just minimize y minus phi theta times a bunch of weights, squared. And actually we'll talk a little bit later on in this lecture, in a few minutes, about what that actually means, because it's connected to another field that we need to talk about. But we haven't done that. Instead we've constructed a full probabilistic model that takes into account uncertainty over f. It assigns a probability measure to f that actually has a density, a Gaussian density, and we can reason about it. We could make these little plots and draw animations and samples and think about what that model actually means. And having done so, we realize that if we want to be at least a little bit true to this probabilistic motivation, even though we are constructing a point estimate now over theta, then there is this term here in this empirical risk that we can't really argue away. It's just part of our reasoning process. And that means maybe we have to think about what this term actually means. This term often shows up in probabilistic hierarchical models of this form. So if you construct a model over an unknown function, or some probabilistic model that has variables that you want to infer and then parameters, and you do parameter inference in this maximum-likelihood-type way by marginalizing out the probability distribution over the unknown function or the unknown variables, then these terms often show up, and they have a name. They're called Occam factors. They're named after a pan-European but British-born monk, maybe the oldest person I ever get to cite in this lecture: William of Occam. He was born in a small hamlet in Surrey in the UK, in Occam. That place actually still exists. It's really tiny, it's just a bunch of houses next to each other with a little church. He became a monk then. That church actually still stands. I think the picture in the background you see on this slide actually comes from this particular church. It's a stained glass window made by Lawrence Lee, and of course nobody really knows what he looked like, so this is just an invented drawing. It's all too long ago for there to be any kind of meaningful picture of what he might have looked like. He traveled across Europe for all sorts of complicated historical reasons; this is so long ago that you can imagine that it was a complicated time. He died eventually in Munich, in what was already Bavaria back then. There's still a street in Munich named after him. He's often cited as the source for the philosophical idea that a simple explanation should be preferred over a complicated explanation.
This is such an abstract notion that of course it has been studied in the philosophy of science and epistemology for a very long time, and it comes under all sorts of complicated names, sometimes also associated with him, like Occam's razor. It's actually a very complicated notion, and it's not particularly well captured by just this one expression that I've now written down here. He also never seems to have written, in his whole life, the quote that is usually attributed to him, which is that entities should not be multiplied without necessity. He did write something kind of similar. He wrote this sentence, which I'm just quoting directly from one of his texts, which might be interpreted as: plurality — so multiple possible explanations for an observation — should never be posited without necessity. We don't have to have complicated multivariate expressions for things that can also be explained in simple ways. By the way, another rule that Occam had for his reasoning process was that, above all else, you should never criticize the scripture. So maybe he's not the best poster boy for rational reasoning, but nevertheless it's a nice reference point to have, and it's clearly a very old idea. So we see a form of this in our model here, showing up in this expression, which is often interpreted as enforcing the model to be simple, to prefer simple explanations. However, it's a little bit complicated what simple actually means, because this mathematical expression actually describes a function, and if you want to understand what exactly the function does and how it behaves, you sometimes have to be a bit careful. So here is a picture where I'm trying to give you an intuition for how complicated this expression can be. What I've done here is I've created such a Gaussian model. Here I've now used again a different kind of feature set. This is actually by design: I keep using different features because I want you to understand that there is no uniquely correct choice of feature families. You can use these sigmoids, you can use these little bell-shaped functions, you can use rectified linear units, whatever you like. There are many, many different features. There are numerical advantages or disadvantages, but they are a modeling choice, so you have to think about what they mean for your model. Let's say you've decided to use these little Gaussian blobs, these bell-shaped curves. Then, if you put a Gaussian prior over these weights and assume that this Gaussian prior is independent, it produces this prior hypothesis space over function values. So let's say the red lines are individual function values, and this animation arises in the same way that we've talked about before. We actually have a finite data set given by these circles, and they move up and down. These are hypotheses over what these observations at these points might be. I have three different hypotheses which I animate, so every single frame of this video is equally likely, and there are many, many possible explanations for this data set. Now, this model evidently has a bunch of parameters. Let's focus on one of them: even if we keep all these features fixed at their locations, we can make them wider or more narrow. If you make them wider, then this model becomes smoother and the functions that you get to see are going to be much smoother functions, and if you make these features smaller, or narrower, then we'll get more and more spiky functions.
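As a concrete reference for the plot described next, here is a small sketch of how one could compute that complexity term as a function of the feature width, assuming Gaussian bell features, an independent Gaussian prior on the weights, and some observation noise. All the specific numbers are illustrative, not the ones on the slide.

```python
import numpy as np

def bell_features(x, centers, width):
    """Bell-shaped (Gaussian) features of a given width, one per center."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)   # (n, F)

def occam_factor(x, centers, width, prior_var=1.0, noise_var=0.1):
    """log det of the marginal covariance of y: the complexity / Occam term."""
    Phi = bell_features(x, centers, width)
    cov_y = prior_var * (Phi @ Phi.T) + noise_var * np.eye(len(x))
    return np.linalg.slogdet(cov_y)[1]

x = np.linspace(-8, 8, 18)                       # input locations of the data
centers = np.linspace(-8, 8, 16)                 # fixed feature locations
widths = np.logspace(-2, 2, 50)                  # from very narrow to very wide
curve = [occam_factor(x, centers, w) for w in widths]   # a curve like the one described next
```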
So what I'm plotting over here is the value of this Occam factor, of this complexity penalty term, as a function of the width of these features, and you can see that it actually has a non-trivial shape. It's not just something you might have guessed immediately. So what does this actually mean? We're currently at this value here, where I have this vertical bar; that's this model you see here. That's close to the maximum penalty, so that's one of the most complicated models you can produce with this set of features by varying these length scales however you like. If we reduce the length scales a lot, then the penalty actually goes down. And why is that? In a sense, if you look at this picture, you might think that this model now actually has become simpler because it's... sorry, it has actually become more complex, because this picture looks much more busy. There's more complicated, spiky stuff going up and down. But if you look at the data set, then you see that many of these data points are now actually assigned very low flexibility. They're essentially pinned to zero, and if you make these features ever thinner, then we do end up with the zero function, which has a penalty term of minus infinity. So that's a much, much preferred model, because it can only explain one hypothesis, which means our hypothesis space is very, very compact. It can only explain one thing, the zero function, and therefore it should maybe be preferred if it can explain data that is actually zero everywhere. Of course, if the data isn't zero, then there will be another penalty from the data-modeling term over here, and then we will not consider that hypothesis anyway. So what happens in the other direction? If we move towards larger length scales — here's the most extreme case — we get ever more continuous, ever smoother functions, and asymptotically, if the length scale goes to infinity, we actually get a constant function. This animation would just move up and down, so it's just a single constant function with an unknown height that is in this hypothesis space, and this again is of course a very simple hypothesis. So if the data allows itself to be described in this simple way, then we should really prefer that, right? Because of course that's an easier way; it's a model with fewer degrees of freedom. Even though it still has the same number of parameters, effectively the hypothesis space spanned by these parameters is much, much simpler, and therefore it should be preferred. So when we fit this function — here it is again — these two terms, the squared error and the Occam factor, trade off fitting performance against complexity. So if you are able to describe a data set with small errors, if the predictions are close to what you actually get to see while using a simple model, then that's better than having an even lower predictive error but needing a very complicated model to do so. Now, it's tempting to think of this Occam factor as a form of prior, as a form of regularization, but actually — I mean, maybe it is, because we have marginalized out a hypothesis space over functions — if you think of it as an empirical risk, then it's really just a different kind of empirical risk, and it's not the same as having a prior on theta. If we wanted to put a prior on theta, then that wouldn't show up in this likelihood, right?
It would be an extra term: if we added it here on the side, then we would have to multiply this Gaussian with some other function, whatever prior we put over theta, to get a maximum a posteriori estimate. And if you take the logarithm of that, then we would have to maximize the sum of the log of this expression and the log prior. If we take the minus of that, we would have to minimize the negative log likelihood minus the log prior, and that would be a new term here that you can basically choose in whichever way you want, because we are free to choose our prior over theta, and this would then be called maximum a posteriori estimation. So the Occam factor is not a prior over theta; it's the effect of a prior, namely the prior over the function, but not a prior over the parameters theta, and therefore it's really part of the likelihood, of the empirical risk, not of the regularization, essentially. So we can use this framework to do actual inference on our features, and I've been telling you so much about it now that it's about time that we actually do that as well, so let me show you how this would work. Here is our bunch of features again. We have on the right-hand side our model that we've seen before. Let's say we've initialized with five features. These features are just at five different locations; I've put them regularly everywhere, and they have a standard smoothness, which is one. And you see that the initial model, which you see in red, is not so great; it's maybe quite far away from the data. And now what we can do is just maximize the likelihood for this data under the model, which we do by shifting around the features and marginalizing out the latent function, the red cloud, the red distribution over functions. Doing so involves these two terms, the squared error and the Occam factor, the complexity factor. One thing you might want to know is how much of an effect this complexity penalty of Occam actually has, and one advance warning: at least in this particular simple example, we'll see that the Occam factor has a surprisingly small effect, and we will come back to that when we think about what we want to do with these kinds of models. So this is actually a little animation, let me use this. Let's say we want to maximize this expression — we want to minimize this empirical risk, so we want to maximize the marginal likelihood. Then what we do is use some kind of optimization method. We'll talk in a second about how that actually works, but let's just say there is a black-box optimization method that does that for us. It takes several steps; in every step it tries to adapt the features so that the marginal log likelihood becomes larger, or the negative log likelihood drops — you can see it dropping here — and over time, I can keep it running, we'll get a fitted model that has shifted these five parameters around such that they fit particularly well to the data. Okay, that's maybe a pleasing picture, but it's also a bit of an annoying picture, because this plot maybe doesn't look so nice on the right-hand side. There are takeaways that we should have at this point. So what we've done here is we've fitted a bunch of features in a somewhat probabilistic fashion, by marginalizing out the latent quantity called the function, the red cloud of hypotheses, but only computing a point estimate, a fitted estimate, for the parameters of the features, for these five black functions. Doing so has resulted in a better model than we had initially. It also has maybe some pathologies,
like for example that it goes to zero on the left-hand side, or that it's very focused on certain parts of the data. It doesn't know that there might be more data somewhere else. You might call this overfitting, and it's not surprising, because that's exactly what we told this algorithm to do. We just wanted to minimize this error, and that of course can mean that, because we're not including information — like, for example, that we might know that there is more data coming on the left-hand side, to the left of minus five — the model never tries to leave any room for explanations in this region. This is a problem with this kind of fitting, even though we are keeping track of uncertainty about the function itself under the data. So here is another gray slide. Parameters — we saw this in the very first third or so of the lecture — parameters that affect the model should actually, ideally, in a perfect world, be part of the inference process, and for that we actually have a likelihood in our model already, implicitly: it's given by the evidence in Bayes' theorem when we do inference on the unknown variables, which in this case is a function. This likelihood is sometimes also called a marginal likelihood or a type-II likelihood. Now, typically this kind of inference is not tractable, because if it were tractable, we would just make theta part of our model and then reason about it jointly with the other parts, with the function. So if it's not tractable, then we will have to use some approximate way of dealing with this likelihood, and the most approximate of them all, the most radical approach, maybe also the most dangerous one, is to just maximize this likelihood, this expression. To do so we can also minimize the negative logarithm of that expression, which gives us a loss function that you can identify with an empirical risk. I'm not saying that every empirical risk is a log likelihood, but every log likelihood gives an empirical risk that you can minimize, and we will talk more about this connection to statistical learning theory in later lectures. Doing so is maybe a non-Bayesian thing to do, because we are constructing a point estimate, but we maintain some semblance of Bayesian inference because we are maximizing a marginal likelihood rather than just a direct likelihood, and we can see that this has an effect in practice, in the sense that it contributes this Occam factor, this complexity penalty, that we would not normally include in an empirical risk if we had just invented it from scratch. Now, in our example we also saw what this Occam factor does, so make of that what you want. If you prefer the philosophical interpretation, then you are totally right to use this Occam factor, and it's very pleasing in its structure, but you should also be aware that it's not always clear that the Occam factor actually helps you regularize your model, or that it matters all that much. If I go back to this slide and let this run again, you can see that the Occam factor, that curve at the bottom, is actually almost flat, and it's very small compared to the square loss, so it does not have a very big effect, and it doesn't prevent the overfitting we see on the right. For that we would need a prior on theta, which is a different thing, and that would be a more probabilistic treatment — which we can still do with this kind of optimization framework, because we can just maximize the product of prior and likelihood, and doing so gives rise to a regularization term in this empirical risk. Now, in the second part of this lecture, I would like to talk a little bit about how we would implement this
optimization process that we just described in detail. And I don't want to do that because I want to give a lecture on optimization — that can be left to some other lecture course you want to take — but because I want to give you an intuition for how closely related this process, even though we've derived it in this probabilistic way, is to other kinds of machine learning that you already know and quite probably have used before. And that is of course captured in this kind of intuition: many of you will already have noticed, when we did this derivation, that what we are constructing here, this hierarchical Bayesian model in which we are taking an input and then constructing features from that input by using a bunch of parameters that define the features, and then mapping these features through a weight to construct an output — that this, at least if you write it as a graph like this, is reminiscent of the process called deep learning. I mean, here we just have two layers. You can think of this as a neural network, if you want to use this nomenclature. Here's the input, which may be a vector-valued input; here's the output, which is maybe a real number, maybe it's a multivariate output as well. And what we've assumed is that the output is linearly related to the weights through the features, but non-linearly related to the input, because the input enters the features, and the features are non-linear functions of the input. That also means that this entire resulting function from x to y is a non-linear function of the parameters theta. So of course we could make this deeper, and then we would have a deep neural network; we could add more layers of features and their parameters. Clearly we've constructed here something that is quite close to a neural network. In particular, it's close to a neural network for regression purposes, where the output loss from here, on this final layer, is a quadratic function, because we're thinking of minimizing the negative log Gaussian likelihood, and the negative log Gaussian likelihood is a quadratic function. Now, we haven't actually done exactly that, because we have marginalized out this final layer w. At the very end of this lecture we'll see what happens if we don't do that, because then we get really close to deep learning. But of course there is a very clear, a deep connection to deep learning here, and in particular there is an algorithmic connection to it. What we're doing here is minimizing an empirical risk. Essentially, it's an empirical risk we constructed by getting rid of the final layer, by marginalizing over it — because we can do it, so we might as well do it, because we want to be Bayesian or probabilistic — but what we're doing with respect to theta is really just empirical risk minimization of a non-trivial empirical risk which we've derived in a probabilistic fashion. So what I want to do is to show that the algorithmic notions that make deep learning powerful also apply to this framework, because people — I think especially in the younger generations — have gotten this impression that there are different parts of machine learning, called deep learning and probabilistic and statistical learning, and that they're somehow separate from each other and don't overlap at all. I want to show that these notions are actually quite close to each other and that they often overlap in what they're leading to. There are also people already in the community who have started to talk about something that you might not call deep
learning but instead differentiable programming. Yann LeCun, for example, does that now. And I want to show you that what we're doing here can very much be a differentiable program, and so if you want to think in terms of differentiability and automatic differentiation, then that applies here as well. If you don't know about automatic differentiation — I think some of you might not — then you can use the next few minutes to learn about this fundamental, beautiful idea of automatic differentiation. So, if you want to optimize this loss function, which we just constructed on the previous slides, which is the negative log marginal likelihood of our Gaussian model, and we want to optimize it with respect to theta, then we need to compute this function L of theta and we need to find its minimum. So let's first think about what we need to do to compute the value of this function L at a particular point theta. If you choose a particular theta, what is the loss function? Well, to do so you could think of a computational graph that looks like this. I've given names to the intermediate quantities, and this is one particular way you could use to define this function in code. Of course you're free to choose what your intermediate steps are, what your functions are; let's say I've decided to implement this code in the following way. I first take theta and then I evaluate the features — that's actually what we did in the previous lecture in Python, we defined a function that computes the features, and that of course depends on theta — so that gives us phi of theta. Then I'm going to compute the quantities I'm going to need to compute L. First of all, I need the inner product between phi and sigma and phi; let's call that K — there's a reason why we call it K, you'll find out in the next lecture — okay, here it is. Then I need to add Lambda. Lambda is something I'm not going to optimize; it's just something we have in memory somewhere, let's say a variable that's available. Adding these together gives me a matrix, and that matrix I have to invert; let's call the inverse of that matrix — so this plus this, inverted — let's call that G. I need to compute that. I'm also going to need the log determinant of this, and I could do this by just computing K, adding Lambda, and then computing the log determinant directly; let's call that a function C that takes as its input K and knows Lambda. Then we also need to compute these vectors, these residuals — let's call them delta, because they are the residual between the data and the prediction under the mean. For that we need phi, obviously, and also y and mu, but y and mu don't depend on theta, so let's just say that they are around, they are stored somewhere. That gives us delta. And now we just fit everything together. We can compute the quadratic error term, which is one part of our optimization problem; let's call that E, because it's an error, and that's the inner product between delta and G and delta, or delta transpose G delta. For that we need delta and G. And we need this Occam complexity factor, so let's call that C, which is just the log determinant of K plus Lambda. And we can sum these two together, and that gives us L. So when we compute this, we're essentially following through this directed acyclic graph. Notice that this really is a directed acyclic graph; it's not a graphical model — that's why I'm using these rotated square-type nodes rather than circles, to not confuse you and make you think about probabilistic
graphical models. But it's really just a very similar notion: it's a graph that is directed and acyclic, and what we're doing here is essentially passing messages, if you like this metaphor, between variables in this graph to compute individual values. Okay, so that's how you compute your loss function. Now, if you want to optimize this loss function, you want to know where its minimum is. So if you're starting at some point theta, you need to know in which direction to step to decrease that function, and an important quantity to be able to do that is the gradient, the derivative of this function L with respect to theta. Now it turns out — and again, many of you will have seen this before; if you haven't, then follow along, and we'll actually look at what this looks like in code in the flipped classroom; if you know about automatic differentiation already, then maybe this is a particular interpretation that you haven't seen yet, or you can also just skip forward if you're bored by this process, that's the beauty of video recording, you can just skip forward if you think you know something — it turns out that computing a gradient of such a function that is implemented in this kind of form can be done surprisingly efficiently, at least if you think about it for the first time it might be surprising, and it can be done in a more or less mechanical way that only requires you to implement local operations when writing your code, rather than writing a separate function for the gradient. About 15 years or so ago, what people used to do to compute gradients is they would implement this function L, and then they would sit down with a piece of paper and write down what the gradient is. So they would go through and compute gradients of all the individual terms using the chain rule — we'll do that in a moment — and then sit down again and write a second function, called gradient of L, that computes the gradient. And those two together would then be used for the optimizer. Optimization algorithms typically ask separately for the function value and the gradient, so you feed both of these functions to them and then they run. Because this requires additional mental load, work on a piece of paper, and also implementation work, these gradients were often faulty, because they have bugs, and these bugs are then difficult to track down because there's this complicated function. All of this changed over the past few years through the development of automatic differentiation software packages, and actually you might even argue that the deep learning revolution, this extreme renaissance of deep learning techniques, is to a large degree due to this software structure, because it allows writing reusable and efficient code that can also be parallelized efficiently. So how does this beautiful process of automatic differentiation work? We'll do this by just looking at what we have to compute here and then thinking about what we're doing; we're not going to do abstract theory of automatic differentiation, because this is not the course for it. So how does this work? Let's say we want to compute the gradient of this loss function L with respect to the quantity we want to optimize, theta. There are two different ways of doing this — actually there are multiple different ways of doing it, but there are two particularly prominent ones — which are called forward and backward mode automatic differentiation. We'll do forward mode first. Actually, it turns out that this particular
setting is not good for forward mode; it actually prefers backward mode for machine learning, but it's easier to think about forward mode first, so just follow along. What do we need to do to compute this gradient? Well, we could look at this graph, and then we could mentally go backward through the graph to think about what quantities we need to compute the gradient, and expand them using the chain rule. And then we'll find — and I'll show you in a second — that once we've done this mental backward pass through the graph, we have all the quantities, and we can then implement an algorithm that, as it moves forward through the graph and computes these individual quantities from theta to L, can just drag along the necessary quantities to compute all of the gradients. This is called forward mode because of this final step. So how does this work? Okay, let's look at L. L is a function of E and C, so we can write dL/dtheta, using the chain rule, as dL/dE times dE/dtheta plus dL/dC times dC/dtheta. At the point in time when we are computing L, we know what E and C are, so we know what dL/dE and dL/dC are, because we can compute the gradient of L with respect to E — and actually I've written it down somewhere already on the next slide; I'll only flip back and forth once, and then you can do that later. So what's dL/dE? Well, look at this expression up here: it's just one half, right? Because L is one half times E plus one half times C. And dL/dC is also just one half. So we don't even need to know what E and C are; it's just a constant, it's just one half. Okay, fine. Things get a little bit more complicated, but notice that we now have dL/dC and dL/dE, which we know; we just need dE/dtheta and dC/dtheta, which are objects further down here. So we can write dL/dtheta in terms of some objects — let's call them M9 dot, a message number nine, where the dot marks that it's a message for forward-mode differentiation, and M8 dot — times dC/dtheta and dE/dtheta. Now we recursively repeat this process. What's dE/dtheta? Well, E is a function of delta and G, and C is a function of K. So we can expand: we can write dE/dtheta as dE/ddelta times ddelta/dtheta plus dE/dG times dG/dtheta. And what about dC/dtheta? Well, C only depends on K, so it's dC/dK times dK/dtheta. At the point in time when we are computing these quantities, we know what delta is, so therefore we can compute dE/ddelta. Because what is dE/ddelta? Well, let's look at this expression: this is E, and this is delta here, each of these terms. So dE/ddelta — this is a quadratic function, okay, so now we need to do some multivariate calculus. You just need to know how to compute these gradients; there are actually rule books, cheat sheets, that you can use online. If you've taken an undergraduate multivariate calculus class, then you should know how to compute these quantities, and if you've taken the Math for ML class here in Tübingen with Matthias Hein, of course you know how to compute these quantities. So dE/ddelta, because E is a quadratic function, is just 2 times G times delta, and this is actually a vector. For that we need to know G and delta, but they are available at the point in time when we're computing E, because E depends on delta and G; so of course we have them available. Good. So now you've probably gotten the hang of how this works. Now we just keep expanding all these remaining black terms; whenever they show up, we just expand them
We keep expanding, going down through the graph, and eventually we're left with an expression that only contains quantities we can compute, plus a final term dφ/dθ times dθ/dθ — and dθ/dθ is just 1, while dφ/dθ is whatever the derivative of your feature is with respect to its parameters. That depends on how you chose your features: for our sigmoidal features it has one particular form, and for Gaussian features it has a different one. So this is just a structure, essentially a piece of code, that you can write when you implement E or G or δ or any of the other variables in the graph. When you implement E, you implement that E is a function of δ and G and equals δᵀGδ, and you can also implement what the derivative of E is with respect to δ — which, as we just saw, is 2·G·δ — and pass it along in the forward pass; likewise you can implement ∂E/∂G, which is just δδᵀ, and you have δ available. All of this can be written into your program code for E.

Now, at the point in time when we actually evaluate L, we can do a forward pass: to compute L we compute all the Ms — the Ms are just the values that form the inputs to the subsequent functions — but we can also compute the messages with a dot on top, using the corresponding code in the implementation, along the way. So this is a pass forward through the graph, just like the one we do to compute L, that now also gives us the gradient. Notice that, intuitively — I haven't shown this rigorously — doing so is in general not going to be substantially more complex than computing L itself, because it's the same pass through the graph. Some of the individual terms may of course be a little more complicated, but it turns out there is actually a bound on how much more complex they can be.

There is a problem, though, which is that some of the quantities that show up here are potentially very large. For example, down here we need to compute dφ/dθ, and this object is actually a complicated multivariate tensor — not even just a matrix, but a three-dimensional object. The feature functions map from the inputs to the outputs, so there are F features and there may be several inputs, which gives two indices a and b to this feature function; and then the features depend on multiple parameters, so this object also has to keep track of those. With indices a, b and ℓ we have a three-dimensional array to carry along, which can of course be computationally burdensome. And, to finish the thought: in the end we only want dL/dθ, where L is a scalar and θ has whatever size it has, so we might hope that there is some clever trick to abbreviate this computation so that we don't actually have to keep track of these complicated multidimensional arrays. It turns out that there is such a trick, and it's called backward-mode automatic differentiation; it is a good choice whenever the output is a low-dimensional quantity, like the scalar loss here. Before we get to it, here is a minimal code sketch of the forward-mode bookkeeping we just described.
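This sketch assumes a single scalar parameter θ, so the dot-quantities stay scalars; for a vector-valued θ you would have to carry one such derivative per parameter (or the full Jacobian arrays just mentioned), which is exactly the cost that motivates backward mode. The class and function names are illustrative, not the lecture's actual code.

```python
import math

class Dual:
    """A value together with its derivative with respect to theta (the 'dot' message)."""
    def __init__(self, value, dot=0.0):
        self.value, self.dot = value, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (u*v)' = u'*v + u*v'
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

def exp(x):
    # local rule: d/dtheta exp(u) = exp(u) * u_dot
    return Dual(math.exp(x.value), math.exp(x.value) * x.dot)

# Seed the recursion: d theta / d theta = 1.
theta = Dual(0.7, 1.0)

# Some made-up composite function, built only from local operations;
# its derivative is dragged along during the same forward pass.
L = exp(theta * theta) + theta * 3.0
print(L.value, L.dot)   # L.dot equals 2*theta*exp(theta**2) + 3 at theta = 0.7
```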
So how does backward mode work? Backward mode is ever so slightly more complicated to think about, but it is really a variant of what we just did. Instead of starting at the top and expanding the immediate terms, we expand the other way round. dL/dθ is — let's look at the bottom of the graph — dL/dφ times dφ/dθ. We've had the dφ/dθ term before, it's the one we just spoke about, but now we introduce new variables the other way round: we'll call this entire expression M̄1. These barred expressions end up sitting at this point in the graph, but they are a little more tricky to think about, because each is defined as the whole thing, the product of the two factors — it's not an individual term here that gets a name, it's the whole thing — and these are called adjoints.

Now we do the same thing as before, recursively; it's just that we pass through the graph the other way round, so mentally we now pass from the bottom to the top. That is a forward thought process, which will then give us an algorithm that actually runs backward. So what is dL/dφ? φ feeds into δ and K, so let's expand: dL/dφ = (dL/dδ)·(∂δ/∂φ) + (dL/dK)·(∂K/∂φ), and the bit behind we leave as it is — we've already given the whole thing a name, M̄1 — and now we give names to the individual summands in this chain-rule expansion, call them M2 and M3. This is a good idea because they are now localized at a particular point in the graph: we know there is a δ involved here and nothing else, because that is exactly how we expanded the term. Let's keep going. What's up with δ? δ feeds into E, so we look at E next: here we get a dL/dE, then a dE/dδ, and then dδ/dφ — that's going to be our M6 — where dδ/dφ is the part we can compute locally, just like dK/dφ or dφ/dθ in the previous examples, and the other bits are the ones we have to keep expanding. If we expand all the way to the top, then eventually we just need dL/dE times dE/dθ and dL/dC times dC/dθ, and for those we actually know the corresponding objects. As we expand further upwards, for example here we get M7, and moving back further up we get dL/dC times dC/dK; this dL/dC we'll call M̄8, and we know what it is — it's just one half, because ∂L/∂C is one half.

Okay. Now say we have computed L — we've done our forward pass to compute the function itself — and we now want the gradient of L with respect to θ. Then we can use these barred quantities we just derived to compute the gradient. How does that work? Well, we start at the top — that's why it's a backward pass — and see that there are two incoming messages, M̄8 and M̄9, and we just need to see where they show up in this tableau over here. M̄8 and M̄9, which are both one half, show up — ah, M̄8 only shows up in M̄7 — so for that we need M̄8 times dC/dK, and dC/dK is something we know at this point.
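The numbered M̄ messages are specific to this slide; in generic notation, the recursion we just derived is the standard adjoint formulation (written here under the dependency structure stated above, e.g. that φ feeds into δ and K):

$$
\bar{v} \;:=\; \frac{\partial L}{\partial v}, \qquad \bar{L} = 1, \qquad
\bar{u} \;=\; \sum_{v:\; u \to v} \bar{v}\,\frac{\partial v}{\partial u},
$$

where the sum runs over all nodes $v$ that $u$ feeds into directly. For example,

$$
\bar{\phi} \;=\; \bar{\delta}\,\frac{\partial \delta}{\partial \phi} \;+\; \bar{K}\,\frac{\partial K}{\partial \phi},
\qquad\text{and finally}\qquad
\frac{dL}{d\theta} \;=\; \bar{\phi}\,\frac{\partial \phi}{\partial \theta}.
$$

Each step only ever multiplies an already-known adjoint — a "dL/d-something", so never larger than L on that side — with a local partial derivative; that is the efficiency argument made next.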
We can have that implemented in our code for C. Then we need to check where this M̄8 goes: M̄8 is part of M̄7, and for M̄7 we need to check in turn where M̄7 shows up. You'll notice the process I'm going through here: I'm looking at the tableau and checking, as we go back down the graph, which variables show up at which point. For this we need an additional data structure that we did not need for the forward pass, which is called a Wengert list. So there is an additional kind of cost, to store this object, but that cost is low, because it amounts to storing the graph together with a lookup table of which variables feed into which other variables.

What do we get in exchange? An advantage is that, because we kept expanding the terms at the front of the chain rule — the ones of the form dL/d-something — these are always, in some sense, small quantities: L is a scalar, so we don't necessarily get those really big arrays. Therefore the process we just derived, backward-mode differentiation, can be computationally more efficient in settings where the output variable is low-dimensional and the intermediate quantities can be high-dimensional. That is the typical case in machine learning, because we tend to predict simple objects, like class labels or response variables, as functions of complicated input data like images or speech recordings. In machine learning this algorithm was historically known — and is in fact still known — as backpropagation, but it is a special case of this automatic differentiation framework, which by various accounts goes back to the Finnish scientist Seppo Linnainmaa, who worked it out, I think actually in his master's thesis — but don't quote me on this — in 1970.

All right, so why did we do all of this? Well, first of all, for those of you who hadn't seen automatic differentiation yet: now you know, and the more you know. But secondly, it's important to understand that this is a process often associated with deep learning, and we're not doing deep learning here — we're doing hierarchical Bayesian inference. The fact that gradients can be computed automatically is not a unique identifier of deep learning. In fact, if you like deep learning because you can do autodiff, then maybe you don't really like deep learning — maybe you just like autodiff, and maybe you want to do probabilistic inference, or hierarchical Bayesian inference, as well, and use these kinds of toolboxes and frameworks, more and more powerful ones, to build fast, reproducible, reliable machine learning algorithms. To make the backward pass and the Wengert list concrete, here is a minimal sketch before we move on.
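This is a deliberately minimal, illustrative implementation (scalar values only, path-by-path accumulation rather than a proper reverse-topological traversal), not the code of any particular autodiff library; the names are made up. Each `Var` records in its parent list exactly the Wengert-list information described above: which variables it was computed from, and the local partial derivatives with respect to them.

```python
import math

class Var:
    """A node in the computation graph."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples (parent Var, local partial derivative)
        self.adjoint = 0.0       # will hold dL/d(this node) after the backward pass

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

def exp(x):
    v = math.exp(x.value)
    return Var(v, ((x, v),))

def backward(loss):
    """Backward pass: push adjoint contributions from the output down to the leaves.
    (A real implementation would visit each node once, in reverse topological order.)"""
    stack = [(loss, 1.0)]                      # dL/dL = 1
    while stack:
        node, upstream = stack.pop()
        node.adjoint += upstream               # accumulate dL/d(node)
        for parent, local_grad in node.parents:
            stack.append((parent, upstream * local_grad))   # chain rule

# Forward pass (builds the graph), then backward pass (fills in the adjoints):
theta = Var(0.7)
L = exp(theta * theta) + theta * Var(3.0)
backward(L)
print(L.value, theta.adjoint)   # theta.adjoint equals 2*theta*exp(theta**2) + 3 at theta = 0.7
```

The forward pass is the same as before; the only new ingredients are the stored parent pointers (the tape) and the final backward sweep.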
Now of course I can't end without making the connection to the core idea of deep learning. You are used to seeing pictures like this, where you don't just have a single layer of features and an input, but a hierarchy of features: a deep neural network that maps from an input through several layers of linear and non-linear transformations. At each layer we take whatever comes in, multiply it with a bunch of linear weights to compute features, and then take a non-linear function of that linear transformation; and then we keep doing that — compute a linear transformation of this non-linear function, feed it into a new non-linearity, and so on. You can then just minimize an empirical risk function, whatever it is, to set the weights w0 through w3 such that some form of error, some empirical risk between y and f(x), is minimized; typically, for regression problems — problems where the output is real-valued — that empirical risk is something like a quadratic error.

If you do that for this slide, connecting to what we've done so far — here is our data set again — then you're used to being able to do something like what I'm going to show you in a moment: take this function, which depends on f(x), which in turn depends on w0, w1, w2, w3 and some fixed choice of non-linearities like the features we've used today, and then just minimize this empirical risk with an algorithm that uses the gradient we can now compute in this backprop kind of way — in particular, compute gradients and then simply follow them using gradient descent. I actually have an animation for this here, and a minimal code sketch of such a training loop follows at the end of this passage. If we run this, then — if you're lucky; there's no guarantee it will work, because this is not necessarily a convex optimization problem — the algorithm might find an explanation like this one. Here I've used three layers, and in each layer I've used bell-shaped, Gaussian features, again because I don't want you to get the impression that there's only one family of features anyone could ever use. If you have the slides, you can run the animation again and again yourself. You can see the features here: in the very first layer you still see the Gaussian shape, but in subsequent layers, because they are Gaussians of Gaussians, the features become more complicated. I'm actually only using six features in total, two per layer — that's enough for this very simple model — and we're learning this kind of function. So this is, maybe, deep learning. I mean, it's a one-dimensional example with just six features and three layers, and maybe that's not deep and wide enough for you, but it is essentially deep learning.

So maybe the question you've had while we've been going through this lecture is: what, precisely, is the connection between what we've been doing here and deep learning? I want to end this lecture by pointing that out. So far I've made the argument that we should integrate out all the quantities we know how to integrate out — integrate out W up here, because we know how to do it. But notice that the main reason I did so was that I wanted to be Bayesian and probabilistic, and I simply knew how to do it. Maybe I'm unlucky and the likelihood function just isn't Gaussian; then in general I won't know how to integrate out W, and I might just do maximum likelihood inference on the whole thing. There might also be another reason to do that: once we have a complicated hierarchical model like this one, it may seem a bit silly to say, oh, I'm going to treat this final output layer W3 in a special way, I'm going to integrate out these W3s because they are so important — when really they are just two numbers that map from here to y — while all the other ones, W0, W1, W2, I've treated differently and used point estimates for, just because I didn't know how to integrate them out. If that's your thought process, then you might decide to treat the final layer just like all the previous ones as well. Before following that thought, here is the promised sketch of the training loop behind the animation.
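This is a minimal sketch, not the lecture's actual code: the data set is made up, the layer sizes (two Gaussian features per layer, as in the animation) and the learning rate are arbitrary choices, and — purely to keep the sketch short — the gradient is computed by finite differences rather than with the backward-mode machinery described above, which is what you would use in practice. As in the animation, there is no guarantee that plain gradient descent on this non-convex problem finds a good fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up one-dimensional toy data.
X = np.linspace(-8.0, 8.0, 40)[:, None]                  # inputs, shape (N, 1)
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)      # noisy targets, shape (N,)

def phi(A):
    """Bell-shaped (Gaussian) feature, applied elementwise."""
    return np.exp(-A ** 2)

def f(X, params):
    """Three layers of two Gaussian features each, then a linear read-out."""
    W0, W1, W2, W3 = params
    H = phi(X @ W0)       # (N, 2)
    H = phi(H @ W1)       # (N, 2)
    H = phi(H @ W2)       # (N, 2)
    return H @ W3         # (N,)

def risk(params):
    """Empirical risk: squared error between data and model prediction."""
    return 0.5 * np.mean((Y - f(X, params)) ** 2)

params = [rng.standard_normal(s) for s in [(1, 2), (2, 2), (2, 2), (2,)]]

def numerical_grad(params, eps=1e-5):
    """Central finite differences -- only for this tiny illustration."""
    grads = []
    for W in params:
        g = np.zeros_like(W)
        for idx in np.ndindex(W.shape):
            W[idx] += eps; hi = risk(params)
            W[idx] -= 2 * eps; lo = risk(params)
            W[idx] += eps
            g[idx] = (hi - lo) / (2 * eps)
        grads.append(g)
    return grads

learning_rate = 0.2
for step in range(3000):                                  # plain gradient descent
    params = [W - learning_rate * g for W, g in zip(params, numerical_grad(params))]
print("final empirical risk:", risk(params))
```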
And if you do treat the final layer like all the others, then you actually end up with deep learning. So let's see how this works. If we're doing Bayesian inference on the weights of this neural network — let's now treat W and θ as basically the same thing — then we're trying to compute the posterior, and this posterior is proportional, up to normalization, to a prior times a likelihood. One assumption we might make is that the individual observations are independent of each other conditioned on the model, which means that our observation noise is somehow local and independent; and if we're doing a regression task, so y is a real number, we can maybe also assume that the noise is Gaussian — that's an intuitive thing to do. Then the likelihood factorizes into individual terms, one per conditionally independent random variable yᵢ given the model parameters, and our posterior is proportional to this prior times this factorized likelihood.

Now, in general we would like to compute this posterior, so we would need to divide by the normalization constant and find some expression for the posterior that doesn't just require us to evaluate all of this — otherwise we would have to keep the data set around the whole time — and that might be hard in general because of the complicated hierarchical structure of the features. What we might decide to do instead, as we have in earlier parts of this lecture, is not to compute the entire posterior but only a best guess that is inspired by this probabilistic model: maximize this posterior distribution. If we want to maximize that expression, we might as well minimize its negative logarithm. To go through that thought process once more: we can take the logarithm because logarithms are monotonic transformations and don't shift the location of the maximum, and we can take the minus and look for the minimum, because the location of the minimum of minus an expression is exactly the location of the maximum of that expression.

So let's look at what these expressions actually are. Because we take the log, the product turns into a sum. We first get the negative logarithm of the prior, whatever that might be. Then, because these are independent Gaussians, the product turns into a sum of negative logarithms of Gaussians, and the logarithm of a Gaussian — in particular a scalar Gaussian for an individual label yᵢ — is just a square, because Gaussians are e to the minus a square. So what we have here is an empirical risk minimization problem: we're minimizing a function that happens to be the squared loss between the data and the prediction the model would make when we pass forward from the input to the output through W, and the squared loss shows up because we made a Gaussian assumption. This is something we will talk about again in subsequent lectures: if you are coming from a statistical learning background and you see a square being minimized somewhere, then in your head you should think of a Gaussian assumption in the corresponding likelihood or prior. What about the prior? This negative log prior is often instead called R and referred to as a regularizer. Again, fair warning: not every empirical risk is a negative log likelihood, and not every regularizer is a negative log prior — but every log prior and every log likelihood gives you a regularizer and an empirical risk, if you want to call them that, because they are just functions whose negative logarithm you can minimize.
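Written out under the assumptions just stated — conditionally independent observations with Gaussian noise of some variance σ² (the symbol σ² is introduced here for illustration; the slide may use a different one) — the negative log posterior we are minimizing is

$$
-\log p(w \mid \mathcal{D})
\;=\; \underbrace{-\log p(w)}_{\text{regularizer } R(w)}
\;-\; \sum_{i=1}^{n} \log \mathcal{N}\!\bigl(y_i;\, f(x_i; w),\, \sigma^2\bigr) \;+\; \text{const}
\;=\; R(w) \;+\; \frac{1}{2\sigma^2}\sum_{i=1}^{n} \bigl(y_i - f(x_i; w)\bigr)^2 \;+\; \text{const},
$$

i.e. an empirical risk (the squared loss) plus a regularizer (the negative log prior), with additive constants that do not affect the location of the minimum.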
So let's say, just for the sake of argument, because it makes a fun connection, that we've decided to use a prior that happens to be Gaussian as well, for W and θ. What would that amount to? It would amount to setting this negative log prior to a quadratic function, and if we take a completely centred Gaussian — zero mean, unit covariance matrix — then we're just summing up the squares (there's a square missing here on the slide) of W and θ. So what we have is an empirical risk minimization problem with an empirical risk that is quadratic and a regularizer that is also quadratic. We're essentially solving — well, we shouldn't call it a least-squares problem; you might call it a general, non-linear least-squares problem, because the dependence on θ here isn't linear, so we can't solve it in closed form — but we are still minimizing a sum of squares. And that is exactly what you do when you train a deep neural network, with the individual non-linearities given by the features φ, on a regression task: you use the quadratic loss, and you use these regularizers, which are known as weight costs, quadratic weight costs. So that is, if you like, a concrete connection between what we've been doing in our Gaussian inference framework and deep learning. By the way, what people then usually do, of course, is sub-sample the data set into mini-batches; it's not like you couldn't do that here as well — it's exactly the same setup.

Okay, so with that we're at the end. What we've done today is to see that when there are parameters in our model that we don't know how to set, the philosophically clean way to deal with them would ideally be to do Bayesian inference over them. That requires us to assign a prior to them — which is fine, that's not a hard problem, we can come up with priors — but the more complicated problem is that we need to compute a posterior, by multiplying this prior with the likelihood and normalizing, and that computation usually leads to intractable problems; if it were tractable, we would simply have made these parameters into variables of our model in the first place. When we do this, then at least in the regression case we have actually created an instance of something that is quite comparable to, maybe even identical to, a neural network. And when we train this neural network, we are essentially fitting the hyperparameters, or parameters, of our probabilistic model — in the standard deep learning setting by just maximizing the likelihood of our probabilistic model directly, or, if we're trying to be a little smarter, by sometimes integrating out the final layer to get a maximum marginal likelihood, or type-II maximum likelihood, estimate for our model. This yields a connection between the Bayesian framework we've been talking about and deep learning. In the next lecture we're going to address a second idea, which is in some sense orthogonal to what we've done today: instead of keeping a fixed number of features and tuning them, we're going to think about how we can create expressive models by increasing the number of features towards an infinite limit, in such a way that we actually end up with a tractable model again. Until then, thank you very much for your attention.