Yeah, time flies when you're having fun, I guess. So we are at the end of our lecture course. Before I start, has everyone who needs to take the exam got an email from me about the timings and the modality of the exam? If not, do get in touch with either myself, Matteo, or the course secretary. I've sent an email and set up a spreadsheet to sign up for particular times, so just let me know if there are any problems. The exam will be on the 19th of April, so just over a month after the end of the lectures. What I haven't done so far, but will try to do as soon as possible, is to write a few exercises for you. In the meantime you can look at the references; in particular the book by David MacKay contains many exercises and solutions, so it's a very good book to work from. Anyway, the exam will only be a short oral, five minutes or so, because I guess you're doing one of these for each of the mini courses, and put together they make a proper exam. I would just ask one or two short questions, nothing to be concerned about. Excuse me, Professor. Yes, Carlo. Regarding the short questions, if I may ask, if I'm not pushing my luck too much: will it be mostly derivations of theory seen in the course, or will it be exercises? It would be mostly on the theory. I could ask things about, say, PCA or linear regression, or what it means to overfit, these kinds of things. So all things we've seen in the course, and pretty much all theoretical material. For many topics in machine learning and basic inference it's quite hard to give easy exercises that can be done with pen and paper, so the alternative would have been some sort of assignment, but I think that might have been a bit too much. Anyway, back to where we were yesterday, a brief recap. Are there any questions about yesterday's material? If so, do ask. The bottom line is that yesterday we saw, once again, the major flaw of basis function regression, which is the vanishing predictive variance far from the data, and we introduced, as an abstract concept, a class of stochastic processes called Gaussian processes. A stochastic process is a collection of random variables indexed by a certain index set. I guess most people are familiar with thinking of stochastic processes as temporal processes, but they don't have to be; they are just a collection of random variables indexed by a suitable index set, and in this case the index set tends to be the input space, a subset of R^d. Gaussian processes have the property that every finite-dimensional marginal is Gaussian, and not just any Gaussian: its parameters are given by the evaluation of a mean function at the points that define the finite-dimensional marginal, and the evaluation of a covariance function at the pairs of points in that finite set. And we've seen that the really crucial ingredient is the covariance function. The mean function is just a deterministic mean, so you could subtract it from your process and get a zero-mean process; the key ingredient is the covariance function.
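In symbols, the definition recapped above reads as follows (a compact restatement in standard notation, not the lecturer's board work):

```latex
% A Gaussian process is specified by a mean function m and a covariance function k:
% every finite-dimensional marginal is a multivariate Gaussian built from them.
\[
  f \sim \mathcal{GP}(m, k)
  \quad\Longleftrightarrow\quad
  \bigl(f(x_1), \dots, f(x_n)\bigr)^{\top} \sim \mathcal{N}(\mathbf{m}, K),
  \qquad
  \mathbf{m}_i = m(x_i), \quad K_{ij} = k(x_i, x_j),
\]
for every finite set of points $x_1, \dots, x_n$ in the index set.
```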
The reference for this, and also for today's material, is the book by Rasmussen and Williams, chapter two, which contains everything that I will tell you and more. So we have defined these objects, but how do we do computations with Gaussian processes? In particular, how do we do predictions, at least in the regression setting? It's the same old trick of doing computations with Gaussians. We are interested in a regression setup, so the scenario is that we have pairs of input-output points (x_i, y_i), where y_i is a scalar and x_i is potentially a vector, for i = 1 to n, and we postulate a regression model: y_i is a function of x_i plus a noise term. We've seen this model many, many times, but this time... Excuse me. Yes, Matthew. Gaussian processes are supervised learning, right? Absolutely, and that's why we have input-output pairs. The difference between supervised and unsupervised learning is that when your data is defined in terms of pairs of inputs and outputs, that's supervised; if there isn't such a distinction, it's unsupervised, basically. Having said that, Gaussian process regression is supervised learning, but it's not the case that Gaussian processes are always supervised learning. Gaussian processes are a probability distribution over functions. If we use them in a regression setting, as here, then we are using them for supervised learning, but there are situations where you use Gaussian processes in an unsupervised context; the Gaussian process latent variable model is a prominent example, which I will not discuss, though. So we now assume that the function f is a draw from a Gaussian process, and as I've explained, you don't need to worry too much about the mean function, so I'll take a zero mean function with a certain covariance function k, and the epsilon_i are i.i.d. noise terms. What is the point of doing regression? Generally, to predict the value of the output at a new test input. So we have our training set of inputs and outputs, we assume we have a test input x_star, and the task is to predict y_star. It could be that you have a batch of test inputs; that doesn't really make much of a difference, and you will see naturally how to handle a batch as well. So how do we do it? The characteristic property of Gaussian processes is that they have Gaussian finite-dimensional marginals. So we define a vector f made up of the function values at the training inputs plus the function value at the test input. For the training inputs we have corresponding observations, and at the test input we want to predict the observation. Now, what is the joint distribution of these n+1 random variables? Easy: they are jointly Gaussian with mean zero and covariance K, where K is a matrix that we write in block form: a block K, a column vector k_star together with its transpose, and a scalar k_star_star. The (i,j) entry of the block K is the covariance function evaluated at the pair of training inputs, k(x_i, x_j), so it is an n-by-n matrix; the i-th entry of the vector k_star is the covariance function evaluated at x_i and x_star, so k_star is a vector in R^n, one entry per training point.
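As a small illustration of how these blocks are assembled in practice (a sketch of my own, not code from the course; the squared-exponential kernel and the helper names are arbitrary choices):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b) = variance * exp(-|a - b|^2 / (2 l^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

# Training inputs (n points in R^d) and one test input x_star.
X = np.random.randn(10, 1)                 # n = 10, d = 1
x_star = np.array([[0.5]])

K = rbf_kernel(X, X)                       # n x n block: k(x_i, x_j)
k_star = rbf_kernel(X, x_star)             # n x 1 block: k(x_i, x_star)
k_star_star = rbf_kernel(x_star, x_star)   # 1 x 1 block: k(x_star, x_star)

# Joint prior covariance over (f(x_1), ..., f(x_n), f(x_star)).
joint_cov = np.block([[K, k_star],
                      [k_star.T, k_star_star]])
```

Any other valid (positive semi-definite) covariance function could be used in place of the squared-exponential one here.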
And k_star_star is the final piece of the puzzle: the covariance function evaluated at x_star in both arguments. So this is the joint prior distribution over the function values at the training points and the function value at the test point, which is the thing we're interested in predicting, having observed all of these data. How are we going to do it? Well, if we condition on the function values at the training points, we get a distribution for f(x_star), which is what we're interested in: for each value of the function vector at the training points, we get a different distribution of the output at the test point. One question then is: what specific values of f(x_1), ..., f(x_n) should I plug in? One might simply think, let's plug in the actual observations, and that wouldn't be too bad a choice, to be honest; it would be a sensible thing to do. But of course it would completely ignore the fact that the observations are noisy. What we need to do instead, if we follow the philosophy of this course, is Bayesian model averaging. In the end what we're interested in is a prediction, so we want to compute the distribution of f(x_star) given the data (x_i, y_i). How are we going to do that? At the moment we have a joint distribution over f(x_star) and f(x_1), ..., f(x_n), but we can rewrite what we want as a conditional distribution times a posterior. We want f(x_star) conditioned on the data, that is, the training input-output pairs. We can express it through f(x_star) conditioned on the function values at the training points; we don't observe those function values, we observe y, which is a noise-corrupted version of the function, but we can use the data to learn something about the unobserved f(x_1), ..., f(x_n), so we can construct a posterior over them. And of course there should be no trace of these latent variables in the final prediction, so we have to integrate them out. This is Bayesian model averaging in practice. If we want a prediction of the output at the new test input conditioned on what we have seen, and we've not seen the function values f(x_i) but the y's, which are the f's plus noise, then we need to compute a posterior over those f's and use that posterior to average out the latent f's. In this way we obtain a distribution over the output given the data. So we need the usual Bayesian calculations: we need to compute this posterior, which is obtained as a product of a likelihood and a prior, suitably normalized. It is proportional to the product over i of the probability of each observation y_i given f(x_i), because my observations have i.i.d. noise, times the prior probability of f(x_1), ..., f(x_n). This is still the same calculation we've done a few times now, each time in a slightly different context. We have i.i.d. observations, which means that the likelihood factorizes: the probability of y_1 given f(x_1) is independent of the probability of y_2 given f(x_2), and so on, and that's why we get this product. And then we have a prior which is correlated: the prior over the function values is correlated and is given by the Gaussian process prior.
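Written out, the averaging argument above takes the following form (my restatement in standard notation):

```latex
% Bayesian model averaging over the latent function values f = (f(x_1), ..., f(x_n)):
\[
  p\bigl(f(x_\ast) \mid \mathbf{y}, X\bigr)
  = \int p\bigl(f(x_\ast) \mid \mathbf{f}\bigr)\,
         p\bigl(\mathbf{f} \mid \mathbf{y}, X\bigr)\, d\mathbf{f},
  \qquad
  p\bigl(\mathbf{f} \mid \mathbf{y}, X\bigr)
  \propto \prod_{i=1}^{n} p\bigl(y_i \mid f(x_i)\bigr)\; p(\mathbf{f}),
\]
where the likelihood factorizes because the noise is i.i.d., and the prior
$p(\mathbf{f}) = \mathcal{N}(\mathbf{0}, K)$ is the correlated GP prior.
```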
So the likelihood terms are all of the form exp(-(1/(2 sigma^2)) (y_i - f(x_i))^2), while the prior term is one over the square root of (2 pi)^n times the determinant of K, times exp(-(1/2) f^T K^{-1} f). This follows from the fact that we have a zero-mean process with covariance function k, and the matrix K has entry (i,j) equal to k(x_i, x_j). So we are given this prior, we have to compute this posterior, and the way we do it is exactly as in all the regression problems we've seen so far: we have a likelihood term which factorizes into a product of terms, each involving a scalar, and a prior which, unlike in linear regression, where the prior was an identity-covariance Gaussian on the weights, is here a correlated prior. Once again we need to compute the statistics of this posterior distribution; it is again going to be Gaussian, with its own mean vector and covariance matrix. So p(f given the data)... Sorry guys, I seem to be having some issue. Can you still hear me? Yes. Yes. Okay, what can you see? I can see a black screen in front of me. We have a white screen with some green rectangles; probably you should stop sharing and then start again. Yes, yes. So, what were we talking about? We had shown that the way you do prediction with Gaussian processes is through Bayesian model averaging, which means you first need to compute a posterior distribution over the function values at the training points, because you don't observe the function values, you observe the function values plus noise. So this vector of function values, which we call f, conditioned on the observations, which are pairs of inputs and outputs, is proportional to a Gaussian likelihood times a multivariate Gaussian prior, so it is a Gaussian itself. What are its statistics, its mean vector and its covariance matrix? Well, we have to match the quadratic term in f. In the standard Gaussian form it is f^T C^{-1} f, and in the product of likelihood times prior the quadratic terms are (1/sigma^2) times the sum over i of f(x_i)^2, coming from the n independent likelihood terms, plus f^T K^{-1} f, the contribution from the prior. Now the first term is just f^T (1/sigma^2) I f, so this implies that C^{-1} = (1/sigma^2) I + K^{-1}. I can take the inverse of both sides and then use the Woodbury formula; there are a few identities that are particularly useful in this type of analysis, and one of them is the Woodbury formula, which lets you express C in terms of the matrix K + sigma^2 I. It is basically telling us that in the posterior the noise variance gets added to the prior covariance. What about the mean? To compute the posterior mean I have to match the linear terms: in the standard Gaussian form the linear term is f^T C^{-1} mu, where mu is the posterior mean we're after.
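For reference, here is the identity being invoked, and what it gives when applied to the posterior precision above (a reconstruction consistent with the standard result, not copied from the board):

```latex
% Woodbury identity (general form), then applied to the posterior precision.
\[
  (A + UCV)^{-1} \;=\; A^{-1} - A^{-1}U\,\bigl(C^{-1} + VA^{-1}U\bigr)^{-1}VA^{-1},
\]
\[
  C^{-1} = \sigma^{-2} I + K^{-1}
  \quad\Longrightarrow\quad
  C = \bigl(\sigma^{-2} I + K^{-1}\bigr)^{-1}
    = K - K\,(K + \sigma^{2} I)^{-1}\,K ,
\]
% so the matrix that actually needs to be inverted is K + sigma^2 I,
% the prior covariance with the noise variance added on the diagonal.
```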
In the product of likelihood and prior, I don't have any linear terms in the prior, because it's a zero-mean prior, but I have linear terms in the likelihood, and those are of the form sum over i of f(x_i) y_i / sigma^2, which I can view as (1/sigma^2) f^T y, with y the vector of observations. I see there are some issues in the chat; let me finish the calculation and then I'll get back to you. Matching the linear terms, the posterior mean is equal to the matrix C times (1/sigma^2) times the vector of observations, and using what I derived before for C, this works out to K (K + sigma^2 I)^{-1} y. Okay, so we had some questions in the chat: is the Woodbury formula also known as Sherman-Morrison? Yes, it is known as the Woodbury identity or the Sherman-Morrison formula; it goes under several names, and I'll give you the link to the Wikipedia page. In our case it's a particularly simple situation, where A is K, C is (1/sigma^2) times the identity, and U and V are identity matrices. Okay, let's get back. What we need to do now is compute the predictive distribution: p of f(x_star) conditioned on the data, obtained as an integral over the f variables, the unobserved function values at the training inputs, of the conditional of f(x_star) given those function values, times the posterior over those function values given the data. Now, this is a somewhat unpleasant calculation, because you need to compute this conditional, and computing conditionals, as I think we discussed in one of the lessons, is trivial when your system is defined through equations: if I have the equation y = f(x) + epsilon, then it's trivial to write the probability of y conditional on f. But when your system is defined in matrix form, you need to take the joint distribution, with the joint covariance we have written down, and compute the conditional, which means fixing the f's and working out the resulting Gaussian. I'm not going to do this calculation; I'll write out the result, and you can find the derivation in the book by Rasmussen and Williams, on page 16. Then we will comment on the result together. This is again a Gaussian, a one-dimensional Gaussian, so it has a mean and a variance rather than a covariance matrix. The mean, the expectation of f(x_star) under the predictive distribution, is obtained as a linear combination of the training outputs and is given by a relatively nice formula: you take the k_star vector and compute k_star transpose times the inverse of (K + sigma^2 I) times the y vector. Remember, k_star is the vector whose i-th entry is the covariance function evaluated at x_i and x_star (it appeared in the previous, failed Zoom share): the covariance function evaluated at all the training points against the test point. That gives you a vector, you multiply it by this inverse matrix and then by the vector of observations. Dimensionally, k_star is n-by-1, so its transpose is 1-by-n, the matrix is n-by-n, and y is n-by-1. So this is the mean value.
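Putting the matched terms together, the resulting formulas are as follows (restated in my notation; they coincide with the standard results quoted from Rasmussen and Williams):

```latex
% Posterior mean over the latent training values, and predictive mean at x_star:
\[
  \mathbb{E}\bigl[\mathbf{f} \mid \mathbf{y}\bigr]
    = K\,(K + \sigma^{2} I)^{-1}\,\mathbf{y},
  \qquad
  \mathbb{E}\bigl[f(x_\ast) \mid \mathbf{y}\bigr]
    = \mathbf{k}_\ast^{\top}\,(K + \sigma^{2} I)^{-1}\,\mathbf{y},
  \qquad
  (\mathbf{k}_\ast)_i = k(x_i, x_\ast).
\]
% The predictive mean is linear in the observed outputs y, and equivalently a
% linear combination of the n functions k(., x_i) viewed as basis functions.
```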
It's interesting to observe two things. First, the predictive mean is always a linear combination of the training outputs (outputs, not inputs, sorry). Second, it's also a linear combination of terms of the form k(x_star, x_i). If you think about it, each of these is like a basis function: x_i is fixed, and the matrix (K + sigma^2 I)^{-1} is just a matrix of fixed numbers, while x_star is the variable, since we're trying to predict the value of f at x_star, which is our independent variable. So it's as if we were doing a linear combination of basis functions, but the number of basis functions we're using is the same as the number of training points. And that is completely different from the setup of basis function regression, where you fix the number of basis functions and then do your regression. Here the number of basis functions, if you wish to call them that, is equal to the number of data points. So sometimes people speak of the adaptive complexity of the model: the more data points you have, the richer the class of functions you can use for your predictions; the more training points, the more basis functions you use, and the more complicated your predictive function can be. Okay, is this clear? The calculations are sometimes boring and sometimes not totally trivial, but the line of reasoning should be clear. We have a process whose finite-dimensional marginals are multivariate Gaussians with correlations; given some noisy observations of the function at the training points, we can compute the posterior over the function values; and the correlations induced by the Gaussian process through the covariance function then give us a predictive distribution over the function value at the new input, which is Gaussian and has a very specific form, obtained through Bayesian model averaging. And the form is quite interesting: the predictive mean is a linear combination of the training outputs you observe, but it's also a linear combination of a number of basis functions, and the number of basis functions, which determines the complexity of the functions you can express, grows with the number of training data points. Some people even call this a kind of automatic Occam's razor, because it controls the complexity based on the amount of data you have. Is this clear? Do you want to ask questions on the predictive mean? If there are no questions, let's move on to the predictive variance. Obviously the predictive distribution is a Gaussian, so it has its own mean and its own variance, and since we're predicting at a single point it's a one-dimensional random variable. We now want the variance, call it c_star, which is the variance of f(x_star) under the predictive distribution. How do we obtain it? Again by Bayesian model averaging, so a whole lot of Gaussian computations, which is a bit tedious, but it's in the book if you want to look at it. The final result is this. Remember what the terms are: remember that I had written the joint prior covariance, call it big Sigma, in block form.
So the joint prior distribution of the training function values and f(x_star) has this block structure: it contains K, the covariance matrix evaluated at pairs of training inputs; k_star, the vector obtained by evaluating the covariance function at each training input and the test input; and k_star_star, the covariance function evaluated at x_star with itself, which is the prior variance at x_star. Now, a number of important observations. First, the observed outputs y appear nowhere in the variance formula, so the predictive variance does not depend in any way on the observed values. This might seem a little strange. Suppose I had observed function values lying close to a smooth curve; then I would be reasonably confident the function passes somewhere around here. But if instead one of those points were a massive outlier, for whatever reason, and I tried to predict at the same location, then rather than saying the function could be anywhere in a wide range, the model would give me exactly the same predictive variance: the predictive mean would move, but the predictive variance would be exactly the same as in the well-behaved case. So this is a little bit weird. The other thing that is somewhat weird: the term k_star_star is the prior variance. If I take the Gaussian process and take its one-dimensional marginal, the value of the function f at x_star, what is its variance? It's the covariance function evaluated at that point with itself, so that's the prior variance. And the predictive variance, the posterior predictive variance, is always smaller than the prior variance, so observing more data always makes you more confident about the value of the function. Once again this makes sense in the business-as-usual scenario, but in a scenario with an outlier it doesn't really. If I had observed all of these well-behaved points, I'd be relatively confident the function looks like this, and at this position I would predict somewhere here with a certain variance. Now, if just before that position I observe another point all the way up there, my Gaussian process would have to work extremely hard to fit it; the posterior mean would be shifted up, but the uncertainty would actually be shrunk, which is really strange. It shouldn't happen that way; the short numerical sketch below illustrates this. So there are some very nice things about Gaussian processes, and, to wrap up this very rapid introduction, it's no surprise that they are among the most widely used techniques in machine learning. So, GP pros and cons. Actually, I forgot to say one thing before we move on. The other important thing to notice about the predictive variance is what happens if we try to predict at a point very far from the data. If you have a stationary Gaussian process, the prior variance term is the same everywhere, whether you're predicting near the data or very far from it. But the reduction term depends on the k_star basis functions, and these are of the form k(x_i, x_star), where x_i is a training input.
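To make these observations concrete, here is a small numerical sketch (my own illustration, not course code; the squared-exponential kernel, the noise level, and the toy data are made-up choices):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b) = variance * exp(-|a - b|^2 / (2 l^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(X, y, x_star, noise_var):
    """Predictive mean and variance of a zero-mean GP at a single test input x_star."""
    Ky = rbf_kernel(X, X) + noise_var * np.eye(len(X))   # K + sigma^2 I
    k_star = rbf_kernel(X, x_star)                        # n x 1 vector k(x_i, x_star)
    k_ss = rbf_kernel(x_star, x_star)                     # prior variance k(x_star, x_star)
    chol = cho_factor(Ky)                                 # Cholesky factorisation, O(n^3)
    mean = k_star.T @ cho_solve(chol, y)                  # k*^T (K + s^2 I)^{-1} y
    var = k_ss - k_star.T @ cho_solve(chol, k_star)       # k** - k*^T (K + s^2 I)^{-1} k*
    return mean.item(), var.item()

X = np.linspace(-1.0, 1.0, 10).reshape(-1, 1)             # toy training inputs
y = np.sin(3.0 * X)                                       # toy training outputs

# Near the data: variance is pulled well below the prior variance k(x*, x*) = 1.
print(gp_predict(X, y, np.array([[0.0]]), noise_var=0.1))
# Far from the data: k_star is ~0, so the variance returns to the prior variance.
print(gp_predict(X, y, np.array([[10.0]]), noise_var=0.1))
# An outlier shifts the mean but leaves the variance unchanged (y never enters it).
y_out = y.copy(); y_out[5] += 10.0
print(gp_predict(X, y_out, np.array([[0.0]]), noise_var=0.1))
```

A Cholesky solve is used here instead of an explicit matrix inverse; it has the same cubic cost discussed just below, but is numerically better behaved.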
So if the training data is very far from the point where I'm trying to predict, say I'm trying to predict over here and all my training data is over there, then these k(x_i, x_star) terms will be very close to zero, while for a stationary process the prior term does not depend on where I'm testing. So the third observation is that far from the training data the variance is constant and greater than zero: it returns to the prior variance, which is a good thing. The first two things are a bit bad, but this one is very good: we have a model that, far from the data, gives the honest answer, namely that the data is too far away to tell you much about the function there, so it just reports the prior, which is great. Basis function regression, by contrast, was set up to be infinitely confident far from the training data. So, GP pros and cons. They're flexible and they have this adaptive complexity, which is a great feature: the more data you have, the more complicated the functions you're allowed to express. The other big pro is that they're honest far from the data: far from the training points they return to the prior. These are the pros. The big con is what I described before: they are somewhat vulnerable to outliers, because the predictive variance never increases as you add data, whatever the observed values are. And the final problem, which is something that has exercised machine learning researchers for a long time, is that, as you see here, you need to invert an n-by-n matrix to obtain the predictive variance, or rather solve an n-by-n linear system (we don't necessarily need to invert the matrix explicitly) to get the predictive mean. So a major drawback is that the method scales as O(n^3), where n is the number of training points, and therefore you can't really handle very large data sets, although there exist partial solutions. Okay, so I hope this has given you some idea of the potentially interesting things that can be done with Bayesian inference. We've really only looked at the linear Gaussian case, which is the only setup where you can do the calculations explicitly: if I specify what k is, for example the radial basis function covariance, then I can write analytically what the predictive mean and variance are everywhere in the input space, which is great. On the other hand, of course, this is only a very small subset of what machine learning is about, particularly these days. But it's still quite a useful subset, because these models are tractable and they give you good insight into how the methods work. I hope this will be of use to you in the rest of your careers. I appreciate this is maybe slightly tangential to some of the other complex systems material you'll be seeing, but being able to model some data will presumably come in handy at some stage in your lives. So thank you very much for staying with me all this time, and thank you, Matteo, for inviting me. I guess I'll see at least some of you at the exam. Yes, so if there are no final questions, I think we can stop here. Thank you very much, Guido, for this very good, very interesting course. See you all tomorrow. Yes, and do get in touch with me if you have questions or anything; my email is easily obtainable. Thank you.