Hello, and welcome to the third lecture of our Introduction to Machine Learning course. We continue talking about linear regression today, but we switch the perspective a little bit and talk about probabilistic models and sampling properties. So far in this course, whenever I said "model", this word referred to a parametric family of functions that we wanted to fit to some given data set. For example, we were talking about linear functions: if there's only one variable, it's a family of straight lines parameterized by two coefficients, and we want to find the coefficients that fit the data best. Today we will talk about probabilistic, or generative, models, as they're also called, which for linear regression look almost the same. There's just one little twist: we add an epsilon at the end, and this epsilon is a random variable, a Gaussian random variable with mean zero and some variance. Here's how you should think about that. If the true underlying beta is given, and some x value is also given (you chose it, for example), then you can sample epsilon values from this Gaussian distribution, and this will generate values of the response variable y. You can do this for a bunch of different x points, and this will generate an entire data set, with some noise in it. So a generative model is like a box that can generate data sets for you. And then one can ask, for example: if you generate data sets, and each time estimate the coefficients by the least squares procedure we talked about in the first lectures, how close will your estimated beta be to the beta that you put into the model? This is something we want to explore today, but we will start with a brief recap of probability theory to bring everybody up to speed with the notions of distributions and so on. Okay.
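To make this concrete, here is a small sketch (in NumPy, with made-up values for the true coefficients and the noise level) of what it means to generate a data set from the model y = X beta + epsilon and then re-estimate beta by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# True coefficients of the generative model (hypothetical values).
beta_true = np.array([1.0, 2.0])   # intercept and slope
sigma = 0.5                        # noise standard deviation

# Fixed x points; the model generates y = beta0 + beta1 * x + eps.
x = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x), x])    # design matrix
eps = rng.normal(0.0, sigma, size=len(x))    # Gaussian noise, mean zero
y = X @ beta_true + eps

# Least-squares estimate of beta from this one generated data set.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to beta_true, but not exactly equal
```

Each run of the sampling step produces a different data set and hence a slightly different beta_hat; that variability is exactly what we study later in the lecture.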
So, probability distributions can be discrete or continuous. They can also be mixed, but we will not talk about that today. Let's start with a discrete probability distribution; I assume everybody is familiar with this concept. A discrete random variable x can be described by a probability mass function, which tells you the probability of obtaining each possible outcome of your random variable. Let's take dice as the standard example: you throw a die, and it comes up as a number from one to six. These are the possible values of your random variable. If the die is fair, then all outcomes are equally probable, so each has probability one over six. This is a discrete uniform distribution. And if you think of this plot as a function, it is called a probability mass function because it is concentrated on several discrete values. A slightly more complicated example: if you throw two dice and add the values, you get a number between two and 12. Two is not very probable, 12 is also unlikely, and the most likely sum is seven, because there are many combinations (six, in fact) that yield it. So the probability of seven is six over 36, which is one over six, while for two and for 12 you get one over 36. These probabilities have to satisfy two requirements: they all have to be non-negative (they can be zero, but not negative), and they should sum to one. Okay. A continuous probability distribution is something that can take values, for example, on the whole real line; the variable is not restricted to a discrete set of values anymore. Such distributions are described by a probability density function. To give a concrete example, we can talk about the distribution of heights of the students of this course.
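As a quick illustrative sketch (not from the slides), we can enumerate all 36 outcomes of two fair dice and check the probabilities just stated, using exact fractions:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Exact PMF of the sum of two fair dice, by enumerating all 36 outcomes.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(pmf[2])             # 1/36
print(pmf[7])             # 1/6
print(sum(pmf.values()))  # 1 -- the probabilities sum to one
```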
If you think about measuring the height of each one of you, the values will lie between, say, one meter and three meters, but any value in that range is possible. So we have a distribution that describes how likely each of these values is. What people are sometimes confused about with probability density functions is that the probability of any specific value, for example the probability of a height exactly equal to 1.80 meters, is zero. Zero, because exactly 1.8 meters means 1.8 with zero millimeters and zero microns on top of that, and this has probability zero. What you can meaningfully talk about is the probability of getting 1.8 meters plus or minus one millimeter. You look at some range of possible x values and ask: what is the probability of landing in this range? This is what the PDF can answer: the integral of the density over some interval, narrow or not, gives you the probability of landing in that interval when you make a random draw. Which means, of course, that the entire function has to integrate to one, because you will obtain some value with probability one. And again, it cannot be negative anywhere. Two important quantities that we'll talk a lot about today are the mean, or expected value as it's alternatively called, and the variance of a random variable. So let's define those. For a discrete random variable, the expected value is defined as shown there: you sum all the values weighted by their probabilities. If a value has very low probability, it doesn't influence the average much; if it has high probability, it influences it strongly; and if several values have equally high probabilities, you just average them. So it's the mean, a very intuitive concept. The variance is slightly more complicated.
It's the expected value of the squared deviation from the expected value. So: here's your mean, and every time you draw a value, you get something around the mean. You look at the deviations and square them, so you get something that is always non-negative, and you compute the expected value of that. This is your variance. Variance measures how far from the mean the values of this random variable typically are, on average. If we plug in the formula for the expected value, we get an expression for a discrete random variable where you sum over all possible outcomes to obtain the variance. For a continuous random variable it's very similar, but you replace the sums by integrals. To compute the expected value, you take the integral of x times p(x), where p(x) is the probability density; you integrate x weighted by the density, just like in the sum above. For the variance, the conceptual formula, the expected value of the squared deviation from the expected value, stays the same; it's the general definition that applies to any random variable. For a continuous one, it again becomes an integral, as written here. Good. Then we will need something called covariance. We defined the variance on the previous slide like this, but if we have two random variables (continuous, discrete, anything), we can very similarly define the covariance between them, as written here. This measures how they deviate from their respective means together. If they both tend to be on the right side of their means at the same time, and likewise both tend to be on the left side together, then they have positive covariance. If they vary completely unrelated to each other, then the covariance is zero, and this is in fact the definition of uncorrelated random variables.
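For instance, here is a minimal sketch computing the mean and variance of a single fair die directly from these definitions, using exact arithmetic:

```python
from fractions import Fraction

# Expected value and variance of one fair die, straight from the definitions.
values = range(1, 7)
p = Fraction(1, 6)                               # uniform PMF

mean = sum(p * x for x in values)                # E[X] = sum_x x * p(x)
var = sum(p * (x - mean) ** 2 for x in values)   # Var[X] = E[(X - E[X])^2]

print(mean)  # 7/2
print(var)   # 35/12
```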
Note that if you plug the same variable in twice, the covariance of x with x is just the variance. And, as I already said, if x and y are uncorrelated, then the covariance is zero; that's basically the definition. As a reminder, the correlation is a scaled covariance: you scale the covariance by the square roots of the respective variances. One can show, and this is not exactly trivial, that the correlation always lies between minus one and one; its absolute value cannot exceed one if it's defined like that, whereas the covariance can take any value. But zero covariance means zero correlation, and vice versa. Okay, good. Some useful properties of the mean and the variance. Let's start with the expected value. If you multiply a random variable by some number, the expected value gets multiplied by the same number; that follows trivially from the definition. If you add two random variables, the expected values also add. This always holds true; expectation is basically a linear operation. If you multiply two random variables, it's more complicated: in general, the expected value of a product is not equal to the product of the expectations. However, if the variables are statistically independent, then this does hold, and in fact one can take this as the definition of independent random variables. You can think about it like this: imagine again throwing two fair dice independently, so the outcome of one is not related to the outcome of the other. Then you can ask: what is the expected value of the product of the two values?
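As a small simulated sketch (an arbitrary toy construction, not from the slides) of scaling the covariance into a correlation and seeing that it lands in [-1, 1]:

```python
import numpy as np

rng = np.random.default_rng(6)

# Build y to be correlated with x, then rescale their covariance
# by the two standard deviations to get the correlation.
x = rng.normal(size=10_000)
y = 0.8 * x + rng.normal(size=10_000)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / np.sqrt(x.var() * y.var())
print(corr)   # positive, and necessarily between -1 and 1
```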
If you write out the sum, you will see that it decomposes into a product of two sums, because the probability of obtaining, say, a two here and a three there is just the product of the individual probabilities; that's how independent events work. Now let's talk about the variance. If you multiply x by some number a, the variance gets multiplied by a squared. This again follows immediately from the definition, because there's a square in the definition. The variance of a sum is not, in general, equal to the sum of the variances; if you write it out, there is in fact a covariance term in addition. But if the two variables x and y are uncorrelated, then the covariance term vanishes and the statement holds. And finally, a very useful formula: you can express the variance of a random variable in terms of the expected value and the expectation of the square of the random variable. I'll leave this as an exercise; it can be proven simply and is very useful in computations. Okay. The last thing we need is a multivariate probability distribution. A random variable x can in fact be a random vector: you generate a vector, and each component of the vector is a random variable itself. We call this a random vector. Here's an example; it's a bit tricky to draw, but I tried to draw a random variable with two components, so a random vector in two dimensions. The probability distribution then becomes a surface that is non-negative everywhere, some kind of surface whose integral over the entire plane is one. If you want to know the probability of landing in a specific little rectangle in the x1, x2 plane, you need to compute a 2D integral, the volume of the column over that rectangle, and this gives you the probability. And this, of course, works the same way.
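These identities can be checked exactly for the two-dice example; a small sketch using exact fractions over the joint distribution:

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two independent fair dice: 36 equally likely pairs.
p = Fraction(1, 36)
pairs = list(product(range(1, 7), repeat=2))

def E(f):
    # Expectation of f(x, y) over the joint distribution.
    return sum(p * f(x, y) for x, y in pairs)

def var(f, m):
    # Var = E[(f - m)^2], with m the mean of f.
    return E(lambda x, y: (f(x, y) - m) ** 2)

ex, ey = E(lambda x, y: x), E(lambda x, y: y)

# Independence: E[XY] = E[X] E[Y].
assert E(lambda x, y: x * y) == ex * ey
# Uncorrelated: Var(X + Y) = Var(X) + Var(Y).
assert var(lambda x, y: x + y, ex + ey) == \
       var(lambda x, y: x, ex) + var(lambda x, y: y, ey)
# Shortcut: Var(X) = E[X^2] - E[X]^2.
assert var(lambda x, y: x, ex) == E(lambda x, y: x * x) - ex ** 2
print("all identities hold")
```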
In any number of dimensions, you just need to think about high-dimensional integrals. In two dimensions, it can be very convenient to draw the density as a contour map on the x1, x2 plane, similarly to how we drew the loss functions in the previous lecture. All right. One more thing: if a continuous multivariate random variable is described by a probability density, how do our notions of expected value and variance work here? For the expected value, it works exactly as before; you just put vectors into the formula, and what you get out is, of course, also a vector. So the expected value will be a vector, because everything now happens in two dimensions, as on the previous slide. The variance is a bit more tricky, because if you write out the definition as we had it before, you have the squared deviation from the average, but now the deviation from the average is a vector. What does it mean to square a vector? In fact, one can write it as shown, and the result is a matrix: for a two-dimensional random variable, a two-by-two matrix. The diagonal elements of this matrix are the variances of each component; in the previous example, there will be a value on the diagonal describing the variance in the x1 direction and another describing the variance in the x2 direction. But in that example, where the ellipse was stretched like that, x1 and x2 are positively correlated, so the covariance, which sits off the diagonal, is going to be positive, in particular non-zero. And of course the ij element of this matrix is the same as the ji element, because the expression is symmetric: it's a symmetric matrix.
We're going to talk a lot in this course about high-dimensional random variables, mostly Gaussian, and then we will be talking a lot about the covariance matrix, so make sure you understand this definition and what it entails. Okay, so Gaussian distributions; that's the last thing we will need a lot of today. A Gaussian, also called normal, distribution in one dimension looks like this: a bell-shaped curve. The important thing about this probability density is that it is essentially the exponential of minus x squared. You shift it by mu, because mu will be the mean: if you want to shift the curve to the left or to the right, you just subtract mu from x. And in the denominator of the exponent you have sigma squared, which is the variance; it stretches or shrinks the entire shape. One would actually need to prove that with this formula the expected value is mu and the variance is sigma squared. I'm not doing this here; just believe me that it works out. If the mean is zero and the standard deviation is one, this is called the standard normal distribution, and the equation simplifies. You might wonder why there is this one over square root of two pi factor here, and also above. It is there just so that the entire thing integrates to one: if you remove it and compute the integral of the exponential of minus one half x squared, it turns out to be the square root of two pi, so you need this factor to make the whole thing integrate to one. It's actually not trivial to compute this integral; it even has a name, the Euler-Poisson integral. It's not very complicated, but you need a particular trick to compute it; it's a famous result in calculus. Okay. And now the multivariate Gaussian distribution, that is, the Gaussian distribution in several dimensions. Very similar to what we had before.
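Before moving on, here is a small sketch (with an arbitrary made-up covariance matrix) of sampling from a correlated two-dimensional Gaussian and recovering its covariance matrix empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 2D Gaussian with positively correlated components (illustrative values).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])   # symmetric, positive definite

samples = rng.multivariate_normal(mu, Sigma, size=100_000)

# The sample covariance matrix should be close to Sigma:
# variances on the diagonal, the (positive) covariance off the diagonal.
Sigma_hat = np.cov(samples, rowvar=False)
print(Sigma_hat)
```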
It just needs to be generalized: the mean becomes a vector (that's why I'm writing it in bold), and the variance becomes a covariance matrix. One needs to write this down somehow; you can't just divide by sigma squared, because now sigma is a covariance matrix. It turns out what you can do is write it like this, with the inverse of the covariance matrix up here. And again, this is the important part: if you compute the integral, it turns out that the determinant of this matrix enters the result, which is why we need to normalize by it. Everything at the beginning is just normalization factors; conceptually it's not very important. The important part happens here. Okay. We can put the zero vector as the mean and the identity matrix (remember, the identity matrix has ones on the diagonal and zeros everywhere else) as the covariance; this is called the standard multivariate normal distribution. It's symmetric in all directions, it has variance one, it's like a ball centered around zero, and it has a simple formula for its density. Okay, I think this is all we will need, at least today, to talk about the probabilistic model for linear regression. So let's get back to it. Here I'm writing it again with the epsilon; now you can appreciate what it really means. The epsilon is a random variable from a Gaussian distribution with mean zero (this is important) and some fixed value of the variance sigma squared. Here's how I suggest you think about it. First, take the simple case of one variable x. There is some true beta, and this true beta defines the regression line that relates y to x. But now, if you generate values from this generative model, then for any x you get the value on the line plus noise, and the noise comes from a Gaussian distribution. So the point may end up, for example, up here, as I drew.
But it could end up down here, or somewhere up here; the probability distribution describes this variability. The same thing happens for an x value over here or over here. So if you put in a bunch of x values, you will get out a bunch of y values with noise on top. And it's very important here that the noise values epsilon that you get for the different values of x in your data are uncorrelated; in fact, they are statistically independent. You independently generate the noise (or the error, as it is sometimes called) here and here and here, everywhere. That's the first important thing. The second important thing is that they have equal variance: for all values of x, this model has the same noise variance. This has a fancy term in statistics: homoscedasticity. Now, one more thing. If you think about a higher-dimensional situation where you have several predictors, for example two, it's exactly the same. Now you have the x1, x2 plane (we introduced it in the previous lecture), the y axis goes vertically, and you have not a regression line but a regression plane that gives the average value of y for any x. On top of that you have random variation, and the random variation is still one-dimensional, in the direction of y: you have this plane and random variation in the y direction everywhere, with the same variance. Okay. Now, a very important concept in statistics and in machine learning is the likelihood, so let me introduce that. Think about having some fixed value of beta, the true value, and a particular value of x that is also fixed for now. Then, as I just said, the y's follow a normal (Gaussian) distribution with variance sigma squared around the average prediction beta transpose x.
So we can write down the probability density that describes the density of getting different y's at these beta and x values. Now imagine we have an entire dataset: a bunch of different x values. We can ask: what is the probability density of observing a particular value y0 for my predictor vector x0, a particular value y1 for the next predictor value x1, y2 for x2, and so on? The probabilities just multiply, because the noise is independent, as I said before. So I write a product of the densities from above, going over all training samples i. At this moment, this is a function of the y's: it tells you the probability density of obtaining particular y values. And now comes the trick. The trick is to think about this expression as a function of beta and sigma squared, and not anymore as a function of y. The expression stays exactly the same, but we imagine that we are given some x values and some y values, which of course is always the case if you have training data. So there's a given training set of pairs (xi, yi); we can plug them all into this formula and treat it as a function of beta and sigma, the two parameters of our generative model. This is now called the likelihood. The likelihood is just the probability, or probability density, of generating the data, reinterpreted as a function of the parameters. And this is useful because now we can ask: what values of beta and sigma squared would make the data set that we have the most probable? For which values of beta and sigma squared will this thing be maximal? This can be a little confusing the first time you hear it, so make sure you understand it.
We want to find the beta and sigma that maximize the likelihood, so we can call them the maximally likely beta and sigma, but this just means we are asking for which beta and sigma the probability of observing this data is the highest. Okay, so this is called the principle of maximum likelihood: it tells us that it makes sense to look for the beta and sigma that maximize the likelihood. They can be a good estimate of the underlying beta and sigma, which we don't know but, given a data set, want to estimate. There are different ways to estimate them, and if you have a generative probabilistic model, then for any model you can ask for the maximum likelihood estimates. So let's do it for linear regression. What we have up here is a big product of exponentials, which is typically annoying and cumbersome to work with. What is very often convenient, especially when dealing with Gaussian distributions, but in other cases too, is to take the logarithm of the entire thing. If you want to maximize the likelihood, you can equivalently maximize the log-likelihood: the logarithm is a monotonic function, so maximizing the logarithm of something is equivalent to maximizing that something. So we take the logarithm, and notice that the product turns into a sum. That's great. The coefficient at the beginning is the same for each term, so summing it over the n training samples just multiplies it by n, and it ends up here. The interesting part is the second term: you get the sum of the squared deviations of y minus beta transpose x, which ends up over here. And note the minus signs: you have a minus here and a minus over there. Because we want to maximize the likelihood, we are maximizing something negative. Again, this can be a little confusing.
What people often do is remove the minus sign and call the result the negative log-likelihood. If you do that, then instead of maximizing the likelihood, you minimize the negative log-likelihood. Let me go to the next slide and rewrite this: here is the negative log-likelihood, the same thing as before but without the minuses. I can simplify it a bit more using the matrix notation we discussed a lot in the previous lectures: this is just the squared norm of the error vector, where the errors are the differences between our responses and our predictions, given by x times beta. Notice also that the first term in the sum does not depend on beta at all, so only the second term depends on beta. Therefore, if you want to maximize the likelihood, you want to minimize the negative log-likelihood, which means you want to minimize the squared error. So we proved that maximizing the likelihood of this probabilistic model is equivalent to minimizing the squared error. And this gives one possible motivation for choosing the squared error in the first place. Remember, we discussed two weeks ago why we use the squared error as the loss for linear regression. One answer is that it simply turns out to be mathematically convenient. Another answer is that it follows from the maximum likelihood principle if you assume the Gaussian noise model, which is again mathematically convenient and also often a reasonable assumption. One can ask here: what about sigma squared? I haven't said anything about it yet, but it is in fact also a parameter of the generative model, so there is some particular value of sigma squared that maximizes this likelihood. I'll leave this as an exercise.
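The equivalence can be sanity-checked numerically; here is a sketch with made-up parameters, showing that the negative log-likelihood is minimized in beta exactly at the OLS coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small data set from the Gaussian noise model (hypothetical parameters).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma = np.array([0.5, -1.0]), 0.7
y = X @ beta_true + rng.normal(0, sigma, size=n)

def neg_log_likelihood(beta, sigma2):
    # NLL = (n/2) log(2 pi sigma^2) + sum of squared residuals / (2 sigma^2)
    r = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + (r @ r) / (2 * sigma2)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# For any fixed sigma^2, perturbing beta away from the OLS solution
# can only increase the negative log-likelihood.
for _ in range(100):
    beta_other = beta_ols + rng.normal(0, 0.1, size=2)
    assert (neg_log_likelihood(beta_ols, sigma ** 2)
            <= neg_log_likelihood(beta_other, sigma ** 2))

# The maximum-likelihood sigma^2 turns out to be the mean squared residual.
sigma2_mle = np.mean((y - X @ beta_ols) ** 2)
print(beta_ols, sigma2_mle)
```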
One can derive pretty easily from here the maximum likelihood solution for sigma squared. Okay, so now we have beta hat, which is of course the same beta hat as in the previous lectures, because maximum likelihood and least squares turn out to be mathematically equivalent problems. We are now thinking about the training data as random, and then beta hat becomes a random variable itself, since it depends on the training data. So one can ask: what are the statistical properties of beta hat? What does that mean? What is its variance or covariance matrix, and so on? To spell out what I mean more specifically: I imagine that beta is fixed, a given vector that will not change, and that my whole design matrix x is also fixed. But the epsilon, the noise, is random, coming from a Gaussian distribution, so we can generate different training y's. We can then plug x and the y's into the least squares estimate, obtain the value of beta hat, and ask: as we vary the epsilon, how does beta hat vary as a random variable depending on epsilon? That is the topic of the next segment. The first very important claim is that beta hat turns out to be an unbiased estimator of beta. By definition, this means that the expected value of beta hat is equal to beta, which sounds great: you have this procedure to estimate beta, the formula we discussed in the previous lectures and which I will now write down again, and in fact its expected value is the true value. Let's prove that, because it's very easy. This is the formula for beta hat that we derived last time, and we are asking for its expected value. So let's plug in x beta plus epsilon instead of y, open the brackets, and we have two terms.
Now you can already see that the first term has x transpose x inverse times x transpose x, which by definition cancels out, leaving the expected value of beta; and the expected value of beta, with beta being a fixed non-random value, is just beta itself. So the first term is just beta. That's great. Let's look at the second term. Here we have some matrix times epsilon. Epsilon is a vector of Gaussian random values with mean zero. Some matrix times this vector gives a vector where each component is a linear combination of the original values in the epsilon vector. So you have linear combinations of random variables that all have mean zero, and the expected value of any such linear combination is zero. That is why this entire term on the right is a vector of zeros, which means the whole expression is just beta. So we proved it: beta hat, the ordinary least squares estimator, is an unbiased estimator. Note as a remark that this proof only works if I can actually say that x transpose x inverse times x transpose x cancels out; in fact, even the first step only makes sense if x transpose x is an invertible matrix. This means that x has to have at least as many rows as columns, so the sample size has to be at least the number of predictors, and there should not be any singular values equal to zero. If this holds, x is called a full-rank matrix. This is an assumption behind this entire argument. Okay, so we proved that beta hat is an unbiased estimator. Now we can ask: on average we get back the true beta, but that's only an average; every single time you of course get something different from beta. So what does the covariance matrix of this beta hat vector look like?
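The unbiasedness claim is easy to check by simulation; here is a sketch with an arbitrary fixed design and true beta, redrawing only the noise each time:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design matrix and true beta; only the noise is redrawn each time.
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), np.linspace(-1, 1, n)])
beta = np.array([2.0, -3.0])
XtX_inv_Xt = np.linalg.inv(X.T @ X) @ X.T   # the OLS map applied to y

# beta_hat over many simulated data sets y = X beta + eps.
estimates = np.array([
    XtX_inv_Xt @ (X @ beta + rng.normal(0, sigma, size=n))
    for _ in range(20_000)
])
print(estimates.mean(axis=0))   # close to (2, -3): unbiasedness
```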
The next theorem is that the covariance matrix of beta hat is given by x transpose x inverse times sigma squared. This can be proven very similarly to the previous slide; I'm not going to do it here, it's left as an exercise for you. I want to spell out the conceptual interpretation a bit more. The covariance of beta hat describes the uncertainty around your estimate beta hat. If you use some software, put in some data, and the software gives you the beta hat estimate together with confidence intervals or some uncertainties around each coefficient, then those uncertainties come from this matrix. If this covariance is large, say the individual variances in the covariance matrix are large, this means that your estimate is here but there is a lot of uncertainty around it. If all the values are small, the uncertainty is small. You can also have correlations in this x transpose x inverse matrix, which means the estimates of the individual parameters in your linear regression will be correlated; this can happen. And again, as we discussed last time, if x transpose x is a diagonal matrix, that is, all your predictors are uncorrelated and have the same variance, then this simplifies a lot and the entire covariance matrix will also be diagonal. And if x transpose x is the identity matrix, then the covariance will just be proportional to the identity matrix. This brings me back to what I mentioned last time: strong correlations between your predictors lead to problems, and here we see one of them directly: large uncertainties in the covariance matrix of beta hat. A very important result in statistics is the Gauss-Markov theorem. Now we can state it.
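This formula too can be verified by simulation; a sketch (arbitrary design, sigma equal to one) comparing the empirical covariance of beta hat across many simulated data sets against sigma squared times x transpose x inverse:

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed design, fixed beta; noise redrawn for each simulated data set.
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 1.0])
solve = np.linalg.inv(X.T @ X) @ X.T

# 50,000 simulated data sets at once: each row of eps is one noise vector.
eps = rng.normal(0, sigma, size=(50_000, n))
estimates = (X @ beta + eps) @ solve.T       # one beta_hat per row

cov_empirical = np.cov(estimates, rowvar=False)
cov_theory = sigma ** 2 * np.linalg.inv(X.T @ X)
print(cov_empirical)
print(cov_theory)    # the two matrices should nearly agree
```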
It says that beta hat has the smallest variance among all unbiased linear estimators of beta. This even has a special name in statistics: beta hat is the BLUE, the best linear unbiased estimator. "Best" here means that it is the linear unbiased estimator with the smallest variance. Let me define this a bit more precisely, because you might ask: beta hat is a vector, so what does "the variance" mean here? This is a pretty strong result: it means that the variance in every direction is smaller. If you think about the ellipse in my two-dimensional picture from before, the ellipse around beta hat will lie entirely within the ellipse of any other estimator one can come up with. Writing down any other linear formula apart from the formula for beta hat, if it yields an unbiased estimate, will lead to larger variance. Here is the technical definition of what larger variance means: the variance in any direction is larger. Another way to say it is that the difference between the covariance matrix of any other estimator and that of our least squares estimator beta hat has only non-negative eigenvalues; such a matrix is called positive semi-definite. I'm not going to prove this theorem, not even as an exercise, but it is a very important result. It provides another justification for why this particular procedure, beta hat, is a good procedure: a statistical guarantee that you are getting the best estimate among all linear unbiased estimates. What I do want to stress, though, is this: the Gauss-Markov theorem tells us that beta hat is the best linear unbiased estimator. Does this mean that it is simply the best estimator, something we should always use because it's the best? The answer is no, it does not mean that.
It has the smallest variance among linear unbiased estimators, but if you allow yourself to consider some biased estimators, then you might get smaller variance than that. And you might wonder: unbiased sounds really good, so why would you want to consider biased estimates? If you say that something is biased, that sounds, even just from the language itself, not very good, right? And this brings us to the entire topic of the bias-variance tradeoff, underfitting and overfitting. So let me introduce these concepts to you, and the way I prefer to illustrate them is using polynomial regression. So, a brief aside: let's talk about polynomial regression. Let's go back to the example I used in the very beginning: we are predicting height from age of students. And we can think: okay, the linear function that relates height to age is good, but maybe after some age the height doesn't grow anymore. So maybe we can add some nonlinear terms. We can add x squared, where x is the age, and x to the third power, to the fourth, fifth power or whatever, to our model. Then we have something like that and we want to estimate it. That sounds reasonable. This is called polynomial regression. And let me ask you: how would you estimate these beta coefficients in a model like that? Your first gut reaction might be that you don't know, because for two and a half lectures we have talked about how to estimate coefficients for a linear function, and this is obviously a nonlinear function of x. So how would you go about estimating its coefficients? Well, it turns out that I'm just obscuring things here, because this is still a linear regression. So why is it still a linear regression? 
Because this is a linear function of the coefficients. All betas enter this equation just as linear terms, right? And we can treat x squared and x to the third power, which of course are nonlinear functions of x, as just new features. So for this particular model written down here, you can think about the design matrix: it will have the column of ones, then the column of x's, the different ages in your sample. Then you square all the ages and that's your next column. Then you raise the ages to the third power and that's your next column. And then you have your design matrix, and this thing can still be written as X beta, right? So it's being a linear function of the parameters that makes the model linear. We don't really care whether it's a linear function of the predictors or not; that's just not important. This is still linear regression as long as it's a linear function of the parameters. And this applies everywhere: if somebody tells you they have a nonlinear model, like a neural network or anything, that their model is nonlinear, this means it's a nonlinear function of the parameters. Otherwise it's a linear model. So this is still a linear regression. Well, great. So we can keep applying all our intuitions about linear regression to this polynomial regression situation. And here's the example that I want to use. We're still talking about predicting y from x; there's just one variable x here. And let's imagine that the true relationship is, as I said, not exactly linear. It's something that looks like that: it levels off. And let's also imagine that we don't have a lot of training data; we just have a few points, as in my sketch over here. So if we just fit a linear function of x, with just two parameters, it will not fit very well. Maybe you don't see this on this picture. 
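The design matrix construction described above can be sketched in a few lines. The data here are made up for illustration (a quadratic truth with some noise), but the point is that the fitting step is the same ordinary least squares we have been using all along:

```python
import numpy as np

# Polynomial regression is still linear regression: the model is linear in
# the coefficients, so we build a design matrix with columns 1, x, x^2, x^3
# and fit it by ordinary least squares. Data below are an assumption.
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.normal(size=50)

degree = 3
X = np.column_stack([x**k for k in range(degree + 1)])  # columns: 1, x, x^2, x^3
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]         # the usual least squares fit
print(beta_hat)                                          # roughly [1, 2, -0.5, 0]
```

Nothing about the least squares machinery changed; the only work was manufacturing the extra feature columns from x.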
It will become clearer when I open the next part of this slide, but it's not perfect. Here it underestimates the data and here it will overestimate the data, because the data levels off and my function does not. So this is called underfitting. It happens whenever the model that you're fitting is too simple. It's not expressive enough, not flexible enough; it cannot fit all the properties of the data that we have here. The statistical term here would be that this has high bias, because our predictions are biased: if you fit that, your prediction here for this value of x is always below the true value. So you're biased. Okay. Now the data are the same. Imagine please that the data are exactly the same in these subplots, but now I'm adding x squared and fitting this quadratic polynomial of x, and it fits much better. Well, if the true relationship is also quadratic, that's as good as you can do here, right? In what sense does it do better? For example, you can measure the squared error. Here the squared error is small, whereas with the linear fit you get a larger error here and also a larger error around here. Now comes the crucial part. We can keep increasing the degree of the polynomial that we are fitting. And since the amount of data that we have here is small, maybe 10 points, if you increase the degree enough, if you get to a polynomial of degree 10 or so, then something like that will happen. The mean squared error, your loss, will become smaller and smaller as you go to the right. And here it might be close to zero, or in fact it might be exactly zero, because you can find a polynomial that goes exactly through each of the points in your data set, right? So you're fitting your training data perfectly at this point; the loss is exactly zero. You're fitting well, but this is overfitting. And why is it overfitting? 
Because the model that you are getting out has of course nothing to do with the true underlying generative model of this data. Which means that if a new person comes along whose age is somewhere here, then your model will predict a very large height, which of course will be completely wrong. So if you test this model on new data, your error will be very large. And this happens because this model is now too flexible. It's so flexible that you can bend it; you can choose coefficients that approximate your training data very well, overfit the training data, and yield a terrible model. When I say too flexible, I mean too flexible for the data that you have. It doesn't mean that it's too flexible in general. It's not so much a property of the model; it's a property of the model as related to the given data set. Our training data in this example are very, very small. So if your model is so rich that it can fit the entire training data exactly, you can have a situation like that where you overfit. And this is also called a situation with high variance. Let me make it clear what high variance means here. High variance is, as I said before, variance with respect to the different training data that you might get. So if you imagine that you generate different training data sets with the same sample size, then here on the left you generate different data and you will get a straight line that is very similar to this one. It will change a little bit, but it will not change much from one training set to the next. Here in the middle it will also stay pretty much the same. But here on the very right, this is not the case. You generate different data, and you get a completely different wiggly polynomial that fits that training data. 
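The whole story, training loss shrinking with degree while test loss typically deteriorates, can be reproduced in a small simulation. Everything here is an assumption for illustration: a leveling-off true function (tanh), 10 noisy training points, and polynomial fits of degree 1, 3 and 9:

```python
import numpy as np

# Toy demonstration of under- and overfitting (all settings are assumptions):
# 10 training points from a leveling-off function, polynomials of degree
# 1, 3 and 9. Training error always decreases with degree; the degree-9
# interpolant typically does much worse on fresh test data.
rng = np.random.default_rng(3)
f = lambda x: np.tanh(x)                        # assumed "leveling off" truth
sigma = 0.1
x_train = np.linspace(0.1, 3, 10)
y_train = f(x_train) + sigma * rng.normal(size=10)
x_test = np.linspace(0.1, 3, 200)
y_test = f(x_test) + sigma * rng.normal(size=200)

results = {}
for degree in (1, 3, 9):
    coef = np.polyfit(x_train, y_train, degree)  # least squares polynomial fit
    results[degree] = (
        np.mean((np.polyval(coef, x_train) - y_train)**2),  # train MSE
        np.mean((np.polyval(coef, x_test) - y_test)**2),    # test MSE
    )
for degree, (tr, te) in results.items():
    print(f"degree {degree}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The degree-9 polynomial passes through all 10 training points (train MSE essentially zero), but between the points it wiggles, so its test MSE is the largest of the three.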
So you will have high variance of your beta coefficients when you imagine doing this on different data sets, and your predictions will also have very high variance. As I said, here you're generating a very high prediction for this x value, but maybe with a different data set you will generate a very low prediction for the same value of x. So that's what high variance refers to here. So you see that we have something like a spectrum of models: here you have high bias and here you have high variance. And this in fact is a very common situation, where you have a bias-variance tradeoff. This occurs not only in polynomial fitting but actually everywhere, and I will have other examples. Very often you can change something about how you fit your model, or how you express your model, and this will move you between having large bias or having large variance. I want to give a bit more mathematically exact definition of the bias-variance tradeoff. So let's consider the mean squared error of our predictions. Here y will be the actual response values from our generative model and f hat of x is our estimate, for example through polynomial regression, but that doesn't even matter on this particular slide. So our loss is mean squared error and we are asking: what's the expected value of the squared error that you will get on the test data? This is about test data. So in the future I give you a new y; what will be the expected squared error? The y itself is just f of x plus epsilon, and epsilon is the additive noise that's uncorrelated with anything else that we're considering here. So you can open the square brackets, and whenever you have epsilon squared, the expected value of epsilon squared will be sigma squared, the variance of the noise, and all terms where epsilon is multiplied by something will cancel out, as we discussed before in a different derivation. 
So I can just take epsilon out of there and say that it's plus sigma squared. And what you are left with here is the expected value of the squared difference between the true f and f hat. And now it turns out that this can be decomposed into something that we will call the bias and something that we will call the variance. In order to see that it can be decomposed like that, I'm doing a little trick here. I'm adding and subtracting the same quantity, the expected value of f hat. And then I split this into two terms: I want a square of this and a square of this, and then of course there's a two times this times this cross term as well. So this is the square of the first term (I will interpret it on the next line) and this is the square of the second term. Wait, where did the expected value disappear here? Well, this is not a random variable anymore. f of x is just something that is out there, right? It's not random. The expected value of f hat is not random anymore either, because all the randomness is averaged out once I apply the expected value. It's just a number, or a function in this case, but it's not random. So it had an expected value in front, but I just removed it: the expected value of a number is just that number. Here, the expected value remains. And then there is this cross term, two times the first thing times the second thing. But it turns out that this is all equal to zero, very conveniently, because here you have f hat minus the expected value of f hat, and you're computing the expected value of that. And what's the expected value of the non-squared deviations from the mean? It's just zero: we can open the brackets, and this will be the expected value of f hat minus the expected value of the expected value of f hat, which is just the expected value of f hat, and they cancel out. It's zero, great. So we are just left with this. 
Now let's look closely at the terms that survived. The first is the squared difference between the true function and the average of our predictions. This is the squared bias. And the second is the expected value of the squared deviations of our predictions from the average prediction, and this is the variance, also by definition of the variance. Plus there is the sigma squared, which in this context is sometimes called the irreducible noise. Even if you know the true f, if you don't need to fit anything because an oracle gives you the true f, you will still have a mean squared error of sigma squared, because there's always noise on the new data that is impossible to predict, right? So this sigma squared is irreducible noise, but the part that can be reduced by fitting the model can be decomposed into the squared bias and the variance. And notice there is nothing about linear regression here; this actually applies to any estimator. Whenever you have any statistical estimator of anything, you can say that the mean squared error of this estimator can be decomposed into squared bias and variance, so that's very useful and you will often see it in different contexts. This was a technical derivation. Now let me give you a very non-technical intuition. What does it mean to have high or low bias, or high or low variance? An example that is often used for this is darts. So imagine that we're playing darts and this is where I want to hit, right? I want to be close to the center, and if I have low bias and low variance then I'm just good and I always hit the bullseye. Now what happens if I still have low variance but now I have high bias? That just means that I consistently miss the center. Somehow the noise in my motor system is low, I always hit the same place, but there's some systematic error and I always end up on one side of where I need to be. 
So this will be high bias. This distance is the bias: the difference between my average prediction, in this case my average hit position, and where I want to be over here, the true one, right? This squared is the squared bias. Okay. A different situation will be high variance but low bias. Now on average I am in the center, I am unbiased, but my variance is very large; my hits end up all over the place. And of course you can have high variance and high bias, where the variance is large and everything is also shifted. Of course you want to be here on the lower left, but this may not be possible, and the high-bias, high-variance corner is obviously just bad. The bias-variance tradeoff moves you along this axis. So you're moving from having large variance and low bias, and this is actually what can happen if you use the least squares estimator (it has zero bias, we proved that, but the variance can be high in some situations), or you can say: well, maybe I can trade a bit of variance for some bias. Now I have some bias over here, but the variance is small, and if what you want is a small squared distance to the center, then it can be that the squared distance here is smaller than the squared distance there. Or in fact maybe it's not, but maybe somewhere along this tradeoff spectrum you have the minimum squared error. So you're trading some variance for some bias, or vice versa, and if you want to optimize the squared error, there may be some optimum for you there. That is the intuition, and now let's get back to the polynomial fitting example, but now with not my sketches anymore: I'm taking the actual fits from Bishop's machine learning book, where he does the same thing. 
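The decomposition itself can be verified numerically at a single test point. The setup below is entirely an assumption for illustration (a sine truth, a deliberately too-simple degree-1 fit), but the identity MSE = bias^2 + variance + sigma^2 holds regardless of those choices:

```python
import numpy as np

# Numerical check of E[(y - f_hat(x0))^2] = bias^2 + variance + sigma^2
# at one test point x0. True function and all settings are assumptions.
rng = np.random.default_rng(4)
sigma = 0.3
f = lambda x: np.sin(2 * x)
x0 = 1.0
x_train = np.linspace(0, 2, 15)

preds = []
for _ in range(5000):
    y_train = f(x_train) + sigma * rng.normal(size=x_train.size)  # fresh training set
    coef = np.polyfit(x_train, y_train, 1)     # deliberately too-simple degree-1 model
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias_sq = (f(x0) - preds.mean())**2            # squared bias of the prediction at x0
variance = preds.var()                         # variance of the prediction at x0
y_new = f(x0) + sigma * rng.normal(size=preds.size)  # fresh test responses at x0
mse = np.mean((y_new - preds)**2)              # expected squared test error

print(bias_sq + variance + sigma**2, mse)      # the two numbers nearly agree
```

Here the bias term dominates, as expected for an underfitting model; refitting with a higher degree would shrink the bias and grow the variance, which is exactly the tradeoff.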
So here I think there are nine points, the true line is shown in green, and these are fits of polynomials of different degrees: zero, one, three and nine. By the time you're up to nine, you're perfectly fitting your data with zero error. And here's a helpful table that shows the values of the coefficients that you get for this particular data set when you fit these polynomials of different degree, and something interesting happens here. Look at that: as you increase the degree of the polynomial, by the time you get to a really high degree polynomial, where you fit your data very well with very little error or possibly even with zero error, you get huge coefficients. These coefficients are very, very large. And this makes sense intuitively: if you need a function that's very wiggly, that exactly passes through the given points, well, you need a lot of wiggle room to achieve that. So you may need some very large positive coefficients and some very large negative coefficients, and they will somehow cancel and make the function pass through all these points. This also makes sense if you think about the formula for the coefficients themselves. This is the formula for beta hat, which has this X transpose X to the power of minus one. And if you think about the X matrix that you get here: as you increase the dimensionality, you have the same nine points living in a higher and higher dimensional space. By the time you get to the nine-dimensional space, you have nine points in nine-dimensional space, which means there is some direction in this space with a very small singular value. It's like having two points on a plane: one singular value will just be zero, because there's always a line passing through any two points on the plane. The same thing happens here in high dimensions. When your dimensionality approaches your sample size, this cloud of points in the high-dimensional space has some directions with very, very small variance, and eventually it hits zero, and that's the point where you can fit the data perfectly. But that also means that when you invert this X transpose X matrix, you are inverting very small singular values, so you get some very large coefficients. So there's an algebraic explanation for why the coefficients get very large. This will be very important in the next lecture.

Here I just want to say that all of this holds for a fixed sample size. On the first row the data are fixed, just nine or ten points. Now we can ask: let's always fit a ninth degree polynomial, but let's increase the sample size. If your sample size is huge, then you can still use a ninth degree polynomial and it fits very well; the prediction is very close to the green line, so you're not overfitting. That's what I meant before: you're only overfitting if your model is too flexible relative to the amount of data you have. If you have a lot of data, then the same model stops being too flexible; now it's all right. Maybe you don't need a ninth degree polynomial to fit that, you can use a third degree polynomial, but nothing bad happens if you do use ninth degree. To overfit this thing over here you would need maybe, I don't know, a hundredth degree polynomial or higher.

Okay, we're almost done. I want to finish by introducing a kind of plot that is also very, very important for us and will be important in the future. Now we want to summarize what we saw on the previous slide. This is a plot where the horizontal axis is the number of predictors. We're starting with zero predictors, which means you just fit a horizontal line; you only fit the intercept, the simplest model you can have in this setup. Then as you increase the number of predictors, which here just means adding the second power, the third power, the fourth power as predictors, you go to the right. And let's imagine that the sample size is fixed, or in fact the whole data set is fixed. At some point your number of predictors will reach the sample size, and this is the point where your training loss goes to zero: you can fit your data perfectly. Whereas in the beginning, when your number of predictors is zero, meaning only the intercept is in there, you have some high loss, and the loss monotonically decreases. You cannot have a higher value of your mean squared error after you add another predictor; this is not possible. Adding a predictor, whatever the predictor is, can only improve the fit, so it can only decrease the mean squared error. It's not possible to make the fit worse by adding another predictor: it makes the model more flexible, so it can fit the data better. That's why the training curve here, the mean squared error on the training data, just goes down. The very important thing that we are after here is the test loss. We will talk about that next week too, but just think about making a prediction for a value that you haven't seen before; you are interpolating somewhere on all these plots with polynomials. And as we saw before: here on the test curve, maybe your model is not flexible enough and you're underfitting; then it becomes better and better; here is maybe the perfect situation; and then if you add even more predictors, you start overfitting. The overfitting manifests itself in this growing gap between the training loss that goes down and the test loss that goes up, and by the time you hit zero training loss here, the test loss may even diverge or be very high. So that's the tradeoff: underfitting there, overfitting here, high bias here, high variance there, and there's some optimal position in between, because what you care about, of course, is the test performance of your model and not the training performance. And as we discussed, if we look at the test error here, it can be decomposed into squared bias and variance. The variance just increases all the time as you increase the number of predictors, while the bias decreases. And in fact, let's say the underlying true model is, I don't know, a polynomial of degree five. Then by the time you reach degree five you will have zero bias, because now you have all the predictors that you need, and we proved earlier that the linear regression estimates are unbiased. You can only have bias here if you didn't put some predictors that you needed into your model in the first place; then of course you will be biased. If all the predictors are there, then you will have zero bias. So the bias can actually hit zero already earlier here, but the variance will increase in any case. Bias goes down, variance goes up, and if you add them up you get something that has a minimum, and that's the test mean squared error. I used the number of predictors in polynomial regression to illustrate this concept; I think that's the clearest situation that I know of where this occurs. But the same phenomenon, as I said, can happen in different situations where you have something else on the x-axis; I will show you an example next time. And you can also think of a situation where the predictors are not polynomials: somebody just gives you the data, nothing is the square or the third power of another feature, but the data has a lot of predictors. Maybe you have a sample size of a thousand and there are 500 predictors; that's just what the data are, and you need to construct some model out of that. And this may be in the regime where you are overfitting, because this is too many predictors for this amount of data, but you don't know what to kick out. It's not that you can just kick out some predictors, or maybe you don't want to do that, maybe they are useful for something. So what to do in this case? The answer is that you can regularize your model, and this is something that we're going to talk about next week. Thank you.