And welcome back, everyone. So let me start sharing my screen, sharing a whiteboard. Any questions on the topic of yesterday's lecture or previous lectures? Yesterday we had an introduction to linear regression, which introduced a number of important concepts, so it's probably a good time to summarize what we've discussed so far.

We've introduced the basic setup of probability, and then we spent quite a bit of time on Gaussians. I hope you now understand why we spent all that time on Gaussian distributions, on how to manipulate them and calculate with them. And so far we've seen two models. One is an unsupervised model, probabilistic PCA (PPCA), which models high-dimensional data as coming from the embedding of lower-dimensional data into the high-dimensional space, plus noise; there y is in R^D and x is in R^q. And then we've seen something that looks extremely similar in terms of equations, where now we have a scalar output, obtained by multiplying an input vector by a weight vector. So a linear combination of the input provides the output, plus noise. The noise here, of course, is a vector, while there it's a scalar. And we said w belongs to R^D.

So there are two main differences between these two types of model. One of them, of course, stems from the fact that PPCA is a dimensionality reduction model, so we always have a high-dimensional observed variable, while here we have a scalar output. But the other, more important difference is that PPCA is essentially describing, in terms of an equation, the density of y. Our observable variables y are described by taking a low-dimensional density, which for tractability is a spherical Gaussian, and embedding it through a fixed matrix into the high-dimensional space. So we're saying something about the density of the points. Linear regression is, instead, an input-output model. The x are not latent, as in the case of PPCA; the x are observed. And all the reasoning is conditional: we're always looking at understanding the distribution of y conditional on its input, and predicting the distribution of y conditional on its input. So PPCA is unsupervised and linear regression is supervised. At the mathematical level, unsupervised learning really means density modeling, while supervised learning means function learning, in this case learning a linear function.

Now, two more important things that we saw on linear regression, which I want to recall from yesterday's discussion, are the following two concepts. There are some potential questions, something in the chat, let's see, before we continue. OK, so Nasterland asks quite a complicated question, which is related to what I was about to say, so I'll just move it aside on my screen.

The first important thing that we saw yesterday was the concept of overfitting. We saw mathematically what happens if we have more dimensions than observations. If D is greater than N, what happens is that we have potentially multiple maximum likelihood solutions, and, more damaging for the MLE estimator, all of these solutions will go exactly through the points: you can fit fewer than D points exactly with D parameters, where D is the number of weights. Because they go exactly through the points, the MLE of the residual variance becomes 0, and so the likelihood blows up. So, does MLE give us the exact answer? Well, this is more or less Nasterland's question: does MLE give us the exact answer?
No, MLE gives us an asymptotically consistent answer. That means that when we have lots of observations it becomes the correct answer, provided that we don't have model mismatch. But if we don't have enough observations, it can become a completely nonsensical thing. For example, if you look at chapter 22 in the MacKay book, where he talks about clustering, he points out exactly the same problem with MLE in the context of clustering: what can happen is that the solution becomes concentrated, so a single point becomes a cluster, the variance of that cluster becomes 0, and the likelihood becomes infinite.

Now, the Bayesian framework avoids this, because what it performs is so-called Bayesian model averaging. That is the concept that, to predict a new output at a new input, given what we've learned from the training data, what we need to do is to average over the weights. So this would be one linear regression model with a particular set of weights, and what I want to do is average it out using the posterior distribution over the weights. So in the Bayesian setup we learn a posterior distribution over the weights, which we can calculate analytically because the model is Gaussian and linear. And once we have that posterior distribution, in order to predict at a new input point, which means computing this predictive distribution, we take the model and average out the weights.

So, compared to MLE, what would happen in the Bayesian setup, which I believe is Nasterland's question, if we have fewer data points than parameters? Well, we would get an infinite number of w's with exactly the same posterior density, so we would average over all the possible hyperplanes that are compatible with the data. It would be a degenerate distribution, of course, because it would have to be constant over a set of infinite measure. So even in that case we might get some problems, but it would still provide a consistent answer. Bayes also becomes asymptotically consistent: under certain assumptions, by the so-called Bernstein-von Mises theorem, when you have an infinite amount of data the Bayesian posterior converges to a Gaussian around the maximum likelihood estimate, because the likelihood becomes dominant. Good, we got one question here. All right, thanks. Good.

So these are the two important concepts that we introduced yesterday: the concept of overfitting, so why we need more observations than parameters, and the concept of Bayesian model averaging.
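To make the Bayesian model averaging idea concrete, here is a minimal sketch in Python (NumPy only). It assumes the standard conjugate setup discussed above, a spherical Gaussian prior w ~ N(0, alpha^{-1} I) and Gaussian noise with known variance sigma^2; the variable names (alpha, sigma2, X, y) are illustrative choices, not anything fixed by the lecture.

```python
import numpy as np

def bayesian_linear_regression(X, y, alpha=1.0, sigma2=0.1):
    """Posterior over weights for y = X w + eps, eps ~ N(0, sigma2 I),
    with prior w ~ N(0, alpha^{-1} I). Returns posterior mean and covariance."""
    n, d = X.shape
    A = alpha * np.eye(d) + X.T @ X / sigma2   # posterior precision matrix
    S = np.linalg.inv(A)                       # posterior covariance
    m = S @ X.T @ y / sigma2                   # posterior mean
    return m, S

def predict(x_star, m, S, sigma2=0.1):
    """Predictive distribution at a new input: the linear model averaged
    over the posterior on w (Bayesian model averaging)."""
    mean = x_star @ m
    var = x_star @ S @ x_star + sigma2         # weight uncertainty plus noise
    return mean, var

# Toy example with more dimensions than observations (D > N):
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))                   # N = 3 points in D = 10 dimensions
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=3)
m, S = bayesian_linear_regression(X, y)
print(predict(X[0], m, S))                     # finite, sensible predictive variance
```

The point of the toy D > N example is that the posterior covariance S stays well defined thanks to the prior, so the predictive variance never collapses to zero the way the MLE residual variance does in the overfitting regime described above.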
Now, today we're going to introduce another important concept: we will move on to the concept of random functions. In the simple case, this was our linear regression model, and the way we were thinking about it was as a set of parameters w, which parameterize a line, or a hyperplane in the case of a higher-dimensional x. However, the flip side is that you can also think of it not just as a set of parameters. In the Bayesian setup you define a prior distribution over the w, and then you get a distribution over these parameters. So you typically view it as a distribution over the parameters that parameterize the hyperplane. Alternatively, you could think of it as a random hyperplane: you have a distribution over hyperplanes, not just over parameters, simply because w is the normal vector to the hyperplane, and by defining a distribution over normal vectors you also define a distribution over hyperplanes. So Bayesian linear regression defines a distribution over hyperplanes; you could think of these as random linear functions.

Now, what if we want nonlinear functions of the input? So suppose we're interested in the nonlinear case, where we might have y equal to a nonlinear function of the input plus some noise. If f is, let's say, sufficiently regular, then we could take an expansion in basis functions: we could define f as a linear combination, potentially infinite, say with indices from 0 to infinity, of a set of basis functions. For example, if f were in L2 of R, or of R^d, we could take Fourier basis functions, or wavelets, or whatever you want. Provided that f is in a separable Hilbert space, we can find an infinite set of functions whose linear combinations approximate f arbitrarily well; in fact, they can equal f almost everywhere. Yeah?

Now, the problem is that dealing with an infinite set of basis functions is not computationally convenient, so the standard idea is to truncate this set. This is called basis function regression. You define a set φ_i, with i going from 1 to, let's say, capital Q, and then expand your function over these: approximate your arbitrary function as a finite sum of terms w_i φ_i(x). Provided you take Q large enough, you might be able to get a very good approximation of f.

Now, this would be the point of view of real analysis, a mathematical analysis point of view. From the point of view of data science, all you're doing is postulating that my model for y is obtained as follows: first I extract some features (features are just the same thing as basis functions, functions of your data), then I linearly combine them, add some noise, and optimize the w_i.

There was a question, or at least something in the chat; I'm not sure whether it is a question. What is the difference between nonlinear regression and symbolic regression? To be honest (this was a private question from Abol Fasel), I think symbolic regression is not so much about linear combinations of fixed functions of the inputs, but about trying to find an explicit symbolic expression for the relationship. But I'm not an expert on symbolic regression at all.

So the data science point of view is: your basis functions are the features you extract from your input. You transform your input through feature transformations, then you linearly combine your features and you get your output. How do we learn the weights? Well, you have to make an assumption on ε, and the standard assumption that makes everything tractable is the Gaussian assumption. So, MLE for basis function regression. We now have a set of observations y_j; each of them is a linear combination of basis functions, indexed by i here, evaluated on an input point x_j, plus noise ε_j. The standard assumption is that ε_j is Gaussian. And as you see, we have the same weights for every data point.
The weights are used to combine the features, which are nonlinear functions of the data point. Given this, the likelihood function is just the Gaussian likelihood, so it is a function of the vector w and of σ^2. If the observations are iid, then the log likelihood is -(n/2) log(2π σ^2) - (1/(2σ^2)) Σ_j (y_j - Σ_i w_i φ_i(x_j))^2, where j runs from 1 to n and i from 1 to Q. And as you see, there is really no difference between this and what we saw yesterday in the linear regression case, because all we've done is take the inputs and transform them nonlinearly. So the maximum likelihood solution will once again be of the familiar form: if we define a matrix Φ with entries Φ_ij = φ_i(x_j), the optimal w, written as a vector, will be something of the form Φ transpose Φ times Φ y. Very simple: we just replace x by φ(x). And of course, we will have the exact same problems with overfitting. But if you're doing maximum likelihood estimation, you're considering that there is a single deterministic function; it's just that the weights are unknown, and you find them via optimization. Things become more interesting when you do Bayesian basis function regression, and that's what we're going to see in a second, after answering this question.

Yeah, missing inverses in the matrices. Absolutely, thank you, Paolo. In the course that I normally teach at CSI I warn the students that any constant can arbitrarily become 1, including minus 1 and π and so on and so forth, because I forget about them. But that keeps you attentive.
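Before moving on to the Bayesian treatment, here is a small sketch of the MLE fit just described, assuming the usual convention in which the design matrix has one row per data point and one column per basis function, so that the solution reads w = (Φ^T Φ)^{-1} Φ^T y, with the inverse that Paolo pointed out. The Gaussian bump features and the toy data are illustrative choices, not part of the lecture's derivation.

```python
import numpy as np

def gaussian_features(x, centers, lam=1.0):
    """Feature map: one Gaussian bump per center (the 'features' view of basis functions)."""
    return np.exp(-((x[:, None] - centers[None, :]) / lam) ** 2)

def mle_weights(Phi, y):
    """Maximum likelihood weights w = (Phi^T Phi)^{-1} Phi^T y.
    lstsq is used instead of an explicit inverse for numerical stability."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy data: noisy samples of a nonlinear function over a limited input range
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

centers = np.arange(1, 6)                 # Q = 5 basis functions centered at 1, ..., 5
Phi = gaussian_features(x, centers)       # n x Q design matrix
w = mle_weights(Phi, y)
sigma2_mle = np.mean((y - Phi @ w) ** 2)  # MLE of the residual variance
print(w, sigma2_mle)
```

With Q comparable to or larger than n, the same code reproduces yesterday's overfitting pathology: the residuals, and hence sigma2_mle, go to zero and the likelihood blows up.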
So let's look instead at the case of Bayesian basis function regression, in the Bayesian setup. We still have exactly the same model, where y_j is equal to a sum of weighted basis functions plus zero-mean noise. And we can now think of this as a random function, because in the Bayesian setup we don't just say there are some values of the w_i that we don't know; we say the w's come from a probability distribution. The standard probability distribution that we employ to make calculations easy is a multivariate, well, a spherical Gaussian with covariance the identity in Q dimensions. Remember that w is a Q-dimensional vector.

So we have a random function, and we can ask what its statistics are. We can see this as a random function of x: you plug in a value of x and it returns a random variable. So what is the expectation of f(x)? Notice there are two sources of randomness: one in the w's and one in the epsilons. Well, we just write it down. The expectation is linear, so the expectation of a sum is the sum of the expectations, and the noise has got 0 expectation. Then reapply the fact that the expectation of a sum is the sum of the expectations: the φ's are constant (once you put an x in, these are a fixed set of numbers), and the expectation of each w_i is 0 because they are zero-mean. So this random function is a zero-mean random function.

Things become more interesting when we look at the second moment, which coincides with the variance because the mean is 0. So what is the expectation of f(x) squared? More generally, we take a two-point expectation, of f(x_i) times f(x_j): what I want to know is how the function values at two points covary. To do that, let's plug the expression in, and let me rewrite f(x_i) as w^T φ(x_i), the weight vector times the vector of basis functions at x_i. So we need the expectation of (w^T φ(x_i) + ε_i) times (w^T φ(x_j) + ε_j). Now I need to go onto another slide, I guess. Expanding the product and using the linearity of the expectation, there is a term that is the product w^T φ(x_i) times w^T φ(x_j); that was the first term. Then there are cross products of w's and epsilons, and then the noise term: the expectation of w^T φ(x_i) ε_j, plus the expectation of w^T φ(x_j) ε_i, plus the expectation of ε_i ε_j.

Now, the w's and the epsilons are independent, and they are both zero-mean, so these two cross terms are 0. This last term is the second moment of epsilon... which, recall, epsilon is generally taken to be an error term with variance σ^2, so zero-mean with second moment σ^2. But, no, what am I saying: ε_i and ε_j are independent, so this is actually 0 as well. The expectation of ε_i squared would be σ^2, but the expectation of ε_i ε_j is 0 because ε_i is independent of ε_j.

So the only term that survives is the first one, and to make it survive better, I'm going to rewrite it. Obviously w^T φ is the same as φ^T w, so it becomes φ(x_i)^T times w times w^T times φ(x_j). Now recall that φ evaluated at one input point is just a vector of numbers, so it's a constant that comes out of the expectation: we get φ(x_i)^T times the expectation of w w^T times φ(x_j). And the expectation of w w^T is the second moment of w, which, because w is zero-mean with spherical unit covariance, is the identity. So the end result is that the two-point covariance between the function value at one point and the function value at another point is simply φ(x_i)^T φ(x_j), a function of the two inputs, and it's called the covariance function.

So Bayesian basis function regression defines a random function f(x) as a linear combination of basis functions, and the implication is that, if the distribution over w is Gaussian, then for every set of input points x_1, ..., x_d the vector (f(x_1), ..., f(x_d)) is distributed according to a Gaussian distribution with mean 0 and with variance-covariance matrix Σ such that Σ_ij = φ(x_i)^T φ(x_j).

OK, so there is a question about the previous slide: you consider only i different from j, no, because ε_i and ε_j are independent; but what if i is equal to j, should there be a Kronecker delta? Yes, there should be, you're absolutely right. I'm looking at the covariance between two different points, and that is uniquely determined by the basis functions. If you're looking at the variance of the function value at a single point, then you have to add the σ^2, which is the error variance. Yeah. And in this slide, at the end, should there be a delta term? Yeah, we can put it as σ^2 δ_ij. That is on the next slide; here I just omitted epsilon and made things easier for myself.

OK, I see more questions arriving. There is another question from Paolo: what happened to the sums in the computation of the second moment? So, still on the previous slide, I guess, Paolo.
Yeah, so the sums: I got rid of them by writing things in vector form. Because I didn't want to keep writing sum, sum, sum, I wrote w^T φ, where w is now a Q-dimensional vector and φ is the vector of all the basis functions evaluated at the same point. That is what happened to the sums. OK, thanks. And more questions, that's good; this is a good point to have lots of questions, so be prepared. If you understand this, then Gaussian processes next week will be a walk in the park and we'll all be going for a beer. Well, we can't, because everything's closed, but we will be enjoying the lecture.

The fundamental concept that I'm trying to explain today is the idea that if you're a Bayesian with basis functions, then you're not just computing distributions over the weights, you're computing distributions over functions. And these functions have a very special property. You have a function, something defined over the whole input space; but whenever you evaluate this function on a finite set of points, thus computing a vector, the vector of the function values at those points, you get a Gaussian vector. And not just any Gaussian vector: you get a Gaussian vector with a very precise covariance structure, which is determined by the choice of the basis functions. So the choice of the basis functions determines how two function values at distinct points will covary. This encodes things like the characteristic correlation length, so when things become decorrelated, and so on.

Let's have an example of how we could think about that. Take a one-dimensional input, and as our basis functions take φ_i(x) = a^2 exp(-(x - i)^2 / λ^2), for i equal to 1, 2, ..., 5. If I plot these functions, I'm drawing Gaussian bumps centered at the integers: this would be φ_1(x), this would be φ_2, φ_3, φ_4, φ_5. Now, obviously, how wide they are depends on this parameter λ: if λ is a large number, these are very wide bell-shaped curves; if it is a small number, they are very narrow. Let me do a little bit more and then I'll come back to the chat; I've seen that something has come up, but I want to say something first.

So let's now take our function f(x) = w^T φ(x). We've computed that the covariance is the inner product of the feature vectors. So if we want to compute how the function values at two points are correlated, say at x_1 and x_2, where I pick two numbers, what I need to do is take the inner product φ(x_1)^T φ(x_2). Suppose now my x_1 is here and my x_2 is here. Then φ(x_1) will have a certain number for φ_1(x_1), a certain number for φ_2(x_1), and then φ_3 is practically 0, so the rest is 0, 0, 0. While φ(x_2) on the first and second components will be 0, so we have 0, 0, then something for φ_3(x_2) and φ_4(x_2), and then φ_5 is likely to be 0. So this choice means that these two points will be essentially uncorrelated, because I've taken these Gaussians to be relatively narrow.
If I had taken λ such that these bumps were much wider, and so on, then this inner product would not be 0, because we would have a contribution here as well from this Gaussian. So these choices, as you can see, are basically telling me that if I build this random function f by linear combinations of these particular basis functions, then once two inputs are at a distance of roughly 3 or more, the function values will be decorrelated. So the choice of the basis functions is what determines the correlation length of the random function.

So, I promised I would get back to the questions, and I see there are a couple. Damiano: if we choose to let Q tend to infinity, then there are no errors in your results? Yeah, OK, so what happens if we let Q tend to infinity we will discuss when we talk about Gaussian processes. So, Damiano, sorry, be patient; next week is when we will see it. Second question, Madien: are these functions the model that MLE predicts on the data? No. These functions are draws from the distribution. I'm not really doing fitting in this lecture; what I'm trying to do so far is to introduce the concept of a random function. So this is under the prior. If you wanted to compute a posterior over w, that's what you would do, and you would average the weights under that posterior; you would get a posterior distribution over functions. But they have nothing to do with MLE. MLE is deterministic. MLE is telling me: look, there are some parameters that I do not know, but I find them, and once I've found them, they are fixed numbers, and there is no stochasticity in the function. The only stochasticity is the observation noise, if you wish. So in MLE for basis function regression there is no concept of covariation between function values at different points.

Fabio: do the basis functions need to be normalized? No. They are not distributions. I've just chosen Gaussian-shaped, bell-shaped things because they are very often chosen, because they are localized; they are the simplest kind of wavelet-like function, where you basically don't have the oscillation, just a localized bump. Correct, in principle they are totally arbitrary. In some cases you might have a reasonably good idea of how to choose them, for example whether you want spatially localized things or not. If I took a Fourier basis, for example, instead of these localized ones, I would get function values that in principle stay correlated at any distance, but with a correlation that is periodic. So how to choose the functions making up the basis is a matter that depends on the problem at hand: if you have some prior knowledge, you can employ it; otherwise, it's a modeling choice. Having said that, we will see next week with Gaussian processes that there is somewhat of a way to get around this question, but only partially. Good.
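Here is a small sketch of the example just discussed: random functions built from Gaussian bump basis functions with weights drawn from N(0, I), together with a check that the empirical two-point covariance matches φ(x_1)^T φ(x_2). The specific numbers (amplitude 1, λ = 0.5, centers at 1 through 5, and the chosen test points) are illustrative assumptions, not values fixed in the lecture.

```python
import numpy as np

centers = np.arange(1, 6)        # basis functions centered at 1, ..., 5
a2, lam = 1.0, 0.5               # amplitude a^2 and length scale lambda

def phi(x):
    """Vector of basis function values phi_i(x) = a^2 exp(-(x - i)^2 / lambda^2)."""
    return a2 * np.exp(-((x - centers) / lam) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(100_000, centers.size))       # many draws of w ~ N(0, I_Q)

x1, x2, x3 = 1.5, 1.7, 4.0
f1, f2, f3 = [W @ phi(x) for x in (x1, x2, x3)]    # f(x) = w^T phi(x) for each draw

# Empirical two-point covariances versus the analytical result phi(x)^T phi(x')
print(np.mean(f1 * f2), phi(x1) @ phi(x2))         # nearby points: clearly correlated
print(np.mean(f1 * f3), phi(x1) @ phi(x3))         # distant points: essentially zero
```

Increasing lam makes the distant pair correlated as well, which is exactly the dependence of the correlation length on the choice of basis functions described above.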
Sorry. Yes? A question? Yes, please. In which sense are these functions a basis? Is there a scalar product with respect to which they are linearly independent? So, they are basis functions because we are obtaining the space of functions that we're placing a measure on as linear combinations of them. But they're not necessarily orthonormal. And linearly independent? Well, yes, you should take them to be linearly independent; otherwise it doesn't make any sense, because you're generating your space as linear combinations of these basis functions, and if one of them can be obtained as a combination of the others, you might as well get rid of it. Thank you. You're welcome. Have we got any more questions? If not, I'll move on to show you one problem.

What we've seen, and I always find this quite interesting, is that this idea of random functions is a very flexible way of parametrizing distributions over functions. But it does have a major flaw, and that is what motivates us to move to Gaussian processes next week. So: the fatal flaw of basis functions.

Suppose I have my observations, and typically you will have observations over a limited range of values of x. So this would be my x, this would be my y, maybe something like that. I see the data, and I may want to choose a set of basis functions to do nonlinear basis function regression. Let me see, I thought I could do colors here. Yeah, so maybe I'm going to take a basis function here, then one here, then here, here, here, and so on, so that I cover the range of my data. And I'd probably find weights like, let's say, 0.2 here, 0.3 here, then 1, maybe 2, then close to 0, minus 1, minus 2, and so on. So this might be a reasonable set of weights that gives me a pretty good interpolation of my points.

But now the question is: what happens if I try to compute a prediction out of sample? After all, if you ask me what y would be at a value of x between two observed points, I'd say it would probably be somewhere in between those two values; no big thing. But what would be my prediction for f(x*) when, let's say, x* is much bigger than any x_i? Well, I have my f, which is a linear combination: my f(x*) would be the weight vector times the φ vector evaluated at x*. If I want the average of this random function, I need to take an expectation, and this would be 0 for two reasons: because the expectation of w is 0, but also, even if the w's were fixed, or if we had a posterior over the w's, because the φ's are 0 there; they don't have support there.

And what would be the variance of f(x*)? Using the formula that we had before, this would be φ(x*)^T φ(x*), and that also would be 0, because the φ's don't have support there. Now, I can live with the fact that the expected value of any random function from this function space is 0 when I go very far from the data. But I don't expect its variance to be 0: when I'm far from the data, I would not expect to be perfectly certain. Unfortunately, this is an inevitable consequence of taking a finite set of basis functions with localized support, and in many cases we do want functions with localized support, because we want the support of the function to be broadly where the data are. So this is clearly a major problem.
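A quick numerical illustration of this flaw, reusing the same Gaussian bump basis as above (again with illustrative values for the amplitude, length scale, and centers): the prior variance of f at a point, φ(x)^T φ(x), is of order one inside the range covered by the basis functions and collapses to zero outside it.

```python
import numpy as np

centers = np.arange(1, 6)
a2, lam = 1.0, 0.5

def phi(x):
    """Localized Gaussian bump features phi_i(x) = a^2 exp(-(x - i)^2 / lambda^2)."""
    return a2 * np.exp(-((x - centers) / lam) ** 2)

def prior_variance(x):
    """Var[f(x)] = phi(x)^T phi(x) for f(x) = w^T phi(x) with w ~ N(0, I)."""
    v = phi(x)
    return v @ v

for x_star in [3.0, 5.5, 8.0, 20.0]:
    print(x_star, prior_variance(x_star))
# Inside the range of the centers the variance is of order 1; far outside
# (x = 8 or x = 20) it is numerically zero, i.e. the model is certain that
# f = 0 far from the data.
```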
So this tells us that this model is somewhat fundamentally wrong: if you're interested in predicting far from where the data are, not only will it fail to give an informative answer (it will tell us 0 just because of the choice we've made), it will also be completely sure of it. So, to recap: the function space spanned by this set of basis functions does contain, with reasonable probability under the N(0, 1) prior on the weights, functions that fit this data reasonably well; there is reasonable probability density on something that interpolates this data well. But all the functions in this family converge to 0 outside of the data, with infinite certainty, and that is not something that is desirable in a model. What we will do next week, in the eighth lecture and maybe also in the ninth, is introduce a different class of random functions, the so-called Gaussian processes, which are a generalization of basis functions with Gaussian weights, but which do not suffer from this pathology, and they achieve that by essentially letting the number of basis functions go to infinity.

So, have we got any questions on this material? We've got a question. Yes, please. We have that the expected value of f is always 0, also in this situation? Yeah, under the prior, that's always 0. So the expected value of f is always 0; I really can't see the point, because we always have that f is expected to be 0, and isn't f what we should guess? Well, but you see, OK, here I took a shortcut; maybe I should revisit it next week to make it clearer. What we would do in the presence of data is compute a posterior distribution over w, and we would average with the posterior distribution, not with the prior. What I'm saying by using the prior is that, a priori, f is 0 everywhere, but in this region here it's going to have a high variance, so you can fit data that is non-zero. In this region far away, you can't, because it's got 0 variance. So this class of functions only allows you to predict exactly 0 outside of the data, which is obviously a problem. It's fine to revert to your prior, so to 0, when you're far from the data, but it's not fine to be certain of it. OK, thank you. You're welcome.

Have we got any more questions? We'll pick this up again anyway, because Matteo's question made me realize that I should have explained this better; I just tried to rush to the end. OK, if there are no more questions, I'll stop sharing, and I'll pass the ball back to Matteo Marcille this time.