Yes, we have a slightly shorter lecture today, because I will have to leave at a quarter past the hour due to another meeting that was pre-scheduled. I hope you all had an enjoyable colloquium; it sounded really interesting, though unfortunately I couldn't fit it into my schedule. Welcome back to Probabilistic Modeling and Bayesian Inference, or rather Introduction to Probabilistic Modeling and Bayesian Inference. Today we are going to move on to what is really the cornerstone of all supervised learning, a technique called linear regression. Before we start in earnest: does anyone have questions on probabilistic PCA, or the earlier computations with Gaussians and so on? You will see that they are quite relevant to today's session.

So, regression is the main supervised learning task of learning a function. We have data, and unlike what we were seeing when modeling distributions directly, where the data points are all of the same kind, here the data is made up of pairs, which traditionally we call x and y. The task is: from multiple instances (x_i, y_i), with i = 1, ..., n, learn a function that predicts y. (Matteo, is there a problem? Matteo, I think you are unmuted.) So: learn a function. The term regression comes from the Latin for stepping back, walking back; it refers to trying to trace the variability in an output variable that we are interested in back to its sources, which come from the input variables. Typically we will focus on the situation where y is a scalar, while x is in general a vector in d dimensions.

That is the general regression formulation. Linear regression is the special case where f(x) is an affine function:

y = w^T x + b,

where w is called the weights vector, which enters through a scalar product with the input variables, and b is a bias term. Usually this gets rewritten in the form

y = ŵ^T x̂,

where ŵ is the concatenation of the vector w and the scalar b, and x̂ is the concatenation of x and 1. So basically, by augmenting our input space, and by convention making the last component always equal to one, we can get rid of the explicit bias term.

Now, the probabilistic formulation. Plain linear regression would just be of the form y = w^T x. The probabilistic formulation assumes a noise model: the output is not just a linear transformation of the input, but a linear transformation plus noise. So probabilistic linear regression has the formulation

y = w^T x + ε, with ε ~ N(0, σ²),

where I'll drop the hats from now on (just assume you are all familiar with the hat notation), y is a scalar so it doesn't need a squiggle underneath, and ε is a noise term with noise variance σ².
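To make this concrete, here is a minimal numpy sketch of the generative model, including the augmentation trick; the variable names, sizes, and seed are illustrative choices, not something fixed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 3                               # illustrative sizes
X = rng.normal(size=(n, d))                # observed inputs x_1, ..., x_n (rows)
X_hat = np.hstack([X, np.ones((n, 1))])    # augment: last component of each x is 1

w = rng.normal(size=d)                     # true weights
b = 0.5                                    # true bias
w_hat = np.append(w, b)                    # concatenated weights [w; b]

sigma2 = 0.09                              # noise variance sigma^2
# y_i = w^T x_i + b + eps_i, with eps_i ~ N(0, sigma^2) i.i.d.
y = X_hat @ w_hat + np.sqrt(sigma2) * rng.normal(size=n)
```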
So now compare this with probabilistic PCA. If you recall, PPCA was saying that a high-dimensional vector y equals a tall, thin matrix W times a latent variable x, plus ε. You immediately see the strong similarity between the two: in particular, if y in PPCA were one-dimensional, it would formally look exactly the same. But let's note two important differences. The first is that PPCA is a dimensionality reduction technique, which means that in PPCA the dimension of y is typically greater than the dimension of x, while in linear regression it is generally the opposite: we try to explain a one-dimensional scalar output in terms of a number of input variables. But the main difference is that in PPCA x is a latent variable, while in linear regression x is observed. And that means we are looking at a very, very different setup.

So here is PPCA versus linear regression. In both cases we can plot the joint space. In PPCA we would be looking at, say, a two-dimensional y which we try to explain with a one-dimensional x: that would be Wx, where W is a 2-by-1 matrix and x is one-dimensional, and there would be a bias as well. That is the PPCA picture. And notice that PPCA can also be recovered as a least squares problem; in fact the original PCA was a least squares problem: you would find the hyperplane, or the low-dimensional subspace, such that the sum of the orthogonal projection distances is minimal. In linear regression, instead, we have an x variable and a y variable, and we might still have a cloud of points that we are trying to fit with a line, but this time the error term is only in the y direction, in the ε. So the least squares problem, if you wish, would be to minimize the sum of the vertical distances of the points from the regression line. It is a different setup; the two look alike, but they are totally different.

One last comment on this: the reason why the errors go only in the vertical direction in linear regression is that the x values are precisely observed. There is no noise whatsoever on the x values; we are conditioning on them. Many times when I talk with biologists, and I talk a lot with biologists, or with people who have a little bit of statistical background but not a very deep one, they try to do linear regression because that's what people know, and they say: but I also have uncertainties on the input variables. Well, if you do have uncertainty on the input variables, you simply can't handle it within linear regression; what you need to do is consider the joint variability in the output and the input, and do a probabilistic PCA.

Okay, now back to linear regression proper, if I can wake up my pen. For the rest of today we'll do some calculations. Are there any questions on the differences between linear regression and PPCA before I start? They do look very similar, I agree. If you have any questions, just put them in the chat... there is nothing in the chat, so I guess it is relatively safe to continue.

Let's start with the maximum likelihood solution. Before doing anything, the basics: my outputs are y_i = w^T x_i + ε_i, where the ε_i are drawn from a Gaussian, independently and identically distributed. For the likelihood function, let's denote by capital Y the vector (y_1, ..., y_n) and by capital X the matrix whose rows are x_1, ..., x_n; so X is a matrix and Y is a vector. Since the observations are i.i.d., the likelihood factorizes as a product:

p(Y | X, w, σ²) = ∏_{i=1}^{n} p(y_i | x_i, w, σ²),

and these are all Gaussian terms, each a normalization constant depending on σ² times an exponential, so

p(Y | X, w, σ²) = (2πσ²)^{-n/2} exp( -(1/(2σ²)) ∑_{i=1}^{n} (y_i - w^T x_i)² ),

where the product of exponentials has become the exponential of the sum.
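In code, the log of this factorized likelihood is a one-liner; a minimal sketch, assuming X already carries the augmented column of ones:

```python
import numpy as np

def log_likelihood(w, sigma2, X, y):
    """log p(y | X, w, sigma^2) for y_i = w^T x_i + eps_i, eps_i ~ N(0, sigma^2)."""
    n = len(y)
    resid = y - X @ w   # residuals y_i - w^T x_i
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2
```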
Now we proceed, as usual, by taking the log. The log-likelihood, as a function of w and σ², is

L(w, σ²) = -(n/2) log σ² - (1/(2σ²)) ∑_{i=1}^{n} (y_i - w^T x_i)² + const.

The first term comes from the normalization: taking the log gives a minus because it is a denominator, an n, and the log of the square root of σ², which is one half log σ². Into the constant I am shoving all the fixed factors, the (2π)^{n/2} and so on.

To find the maximum likelihood solution I need to differentiate with respect to w and σ². For w: differentiating the square gives a factor of two, which cancels the one half, and differentiating the inner term gives a -x_i, whose minus cancels the minus in front of the sum. So the gradient is

∇_w L = (1/σ²) ∑_{i=1}^{n} (y_i - w^T x_i) x_i

(the overall 1/σ² factor, which I initially forgot, doesn't matter here, because it is a positive constant in this equation; so the w solution does not depend on σ²). Setting the gradient to zero and using my matrix formulation with big Y and big X, transposing and reordering, I get the normal equations

X^T X w = X^T Y, and hence w_ML = (X^T X)^{-1} X^T Y.

Now notice this inverse. X^T X is a matrix made up by taking the product of two matrices that are built by stacking n vectors, so in general its rank is going to be the smaller of d and n. But it is a d-by-d matrix, and so if n is smaller than d it will not be invertible. It is invertible only if the number of points is at least the number of dimensions. This is the standard thing: if you don't have enough observations, you will not be able to constrain a hyperplane; you need at least as many observations as dimensions. In machine learning this failure is tied to what is called overfitting: it tells us that the optimal solution will go through the data points exactly, and that will return a degenerate solution. So this is the overfitting case; you need at least as many observations as dimensions.

Okay, any questions on the calculation of the maximum likelihood solution for the weights of linear regression? I can't see anything flashing up in the chat, which is a bit surprising, because you used to be incredibly communicative until last week; I don't know what happened during the weekend, hopefully nothing too dramatic to silence you all.
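A sketch of the maximum likelihood weights in numpy; using lstsq rather than an explicit inverse is my implementation choice, and it also exposes the degenerate n < d case, since lstsq then falls back to the pseudo-inverse:

```python
import numpy as np

def w_ml(X, y):
    """Maximum likelihood weights: solves the normal equations X^T X w = X^T y.
    When n >= d and X has full column rank, this equals inv(X.T @ X) @ X.T @ y.
    When n < d, X^T X is singular; lstsq then uses the pseudo-inverse and
    returns just one of the infinitely many interpolating solutions."""
    return np.linalg.lstsq(X, y, rcond=None)[0]
```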
Anyway, assuming this is all clear: equating that gradient to zero gives us the maximum likelihood estimate for w. What about the maximum likelihood estimate for σ²? We need to compute the other derivative, with respect to σ². Differentiating the logarithm gives -(n/2)(1/σ²), with the minus sign coming from the derivative of the log term, and since the derivative of 1/σ² with respect to σ² is -1/σ⁴, the other term gives +(1/(2σ⁴)) ∑_i (y_i - w^T x_i)². I set this to zero, multiply everything by 2σ⁴, and I get

n σ² = ∑_{i=1}^{n} (y_i - w^T x_i)²,

which naturally implies that σ²_ML is the whole thing divided by n. There is some beauty to this formula. What does it tell us? You have to look at the residuals, at how wrong you were. If you have found your optimal weights w, which you can do first because the optimal weights do not depend on σ², then you are not going to go exactly through the data; you will have a little error, a residual. You take that, square it, and take the average, and that is your maximum likelihood estimate of the regression variance: the average squared deviation from the regression line. Does this all make sense? Does anyone want to ask...

Excuse me, yes: in the previous section, sorry to ask it late, I couldn't work out the relation between the invertibility of the matrix and the number of points and dimensions.

Yes, good point. The matrix X^T X, which we need to invert, is obtained by multiplying X^T and X, and the matrix X is obtained by stacking n input vectors. So the rank of X^T X is going to be the smaller of d and n, because you are taking n vectors and constructing the matrix from their outer products; if you ask what the number of non-zero eigenvalues is, it is going to be min(n, d).

But you said this matrix is d by d?

Yes: Y is an n-vector, X is an n-by-d matrix, X^T is d-by-n, so X^T Y is a d-by-1 vector and X^T X is a d-by-d matrix. And if n is smaller than d, the rank of X^T X will be n, that is, min(n, d). If this object has rank n, which is smaller than d, then you can't invert it; you can only take a pseudo-inverse. But this lack of invertibility is what signals a degenerate model, which simply means that there are an infinite number of ways in which you can go through the data exactly. You can maximize this likelihood and pass exactly through the points, and, what is worse, look at what happens afterwards: if you go exactly through the points, all the residuals are zero, and you will find a maximum likelihood estimate of σ² which is zero, which obviously doesn't make sense, because it is a variance. What it signals is that if you put a zero variance into the likelihood (in the estimate above it would be zero divided by zero, but put a zero into the likelihood), you are going to get an infinite likelihood. That is the hallmark of overfitting; this is the mathematical signature of overfitting: overfitting is when you can go through the data perfectly. Now, the neural networks community, and I don't know if you will hear anything on neural networks in this spring college, tends to think of overfitting as a lack of generalization capability: you predict very well on the training data but terribly on points you have not seen before. But in the linear case this is, in my opinion, the neater definition: you actually get an infinite likelihood at maximum likelihood, so you have an unbounded function to optimize.
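A sketch of the σ² estimate, together with a tiny numerical illustration of the degenerate n < d case just discussed (sizes and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma2_ml(X, y, w):
    """sigma^2_ML: the average squared residual around the fitted hyperplane."""
    resid = y - X @ w
    return np.mean(resid ** 2)

# Degenerate case: fewer observations than dimensions (n = 2 < d = 5).
X = rng.normal(size=(2, 5))
y = rng.normal(size=2)
w = np.linalg.lstsq(X, y, rcond=None)[0]   # pseudo-inverse solution
print(sigma2_ml(X, y, w))                   # ~0: the fit passes through both points exactly
```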
So, I saw a new question. Michele asks: does this calculation differ from PPCA's? Well, this calculation differs from PPCA's precisely because we condition on x. If someone could tell you what the x values are in PPCA, then this would be the same calculation, except that of course the output variables in PPCA are in general higher dimensional. So again, these are the two differences: in PPCA the target variables, the ambient variables, the y's as we now call them, are high-dimensional and the x's are low-dimensional; and in PPCA the x's are unobserved latent variables, one associated with each observed high-dimensional point, while in linear regression they are observed, so you always condition on them. You see, our objective function, the likelihood, is a conditional distribution, while the likelihood for finding the W matrix in PPCA was a marginal.

One more question: is the higher dimension just about the absence of the invertibility requirement on X? No, no: you can have situations where your y target is not one-dimensional; that would be called multi-output regression. The invertibility issue simply stems from the structure of X. For example, if you had a higher-dimensional y, and correspondingly a higher-dimensional ε whose covariance was diagonal, then the problem would decouple into a bunch of one-dimensional linear regressions, and you would have exactly the same issue with the invertibility of X^T X.

Any more questions on the maximum likelihood treatment of linear regression? Ah, there is another one: can we say that likelihood and probability are not the same thing, because the likelihood refers to finding the best distribution of the data given a particular value of some parameter? So, the likelihood, you see, is a probability distribution on the data: it is a conditional distribution of the output given the inputs and the parameters. But when you optimize it to find parameters, you view it as a function of the parameters, and it is not a probability of the parameters; it is not normalized in w or anything else. By the way, I should also have written σ² in the conditioning here; I forgot. So you see, the likelihood is a function of σ², not a probability over σ², and it is a function of w, not a probability over w, even though, because of its very special form, it will look very much like a probability over w.

Now, to make it a probability over w, we need to move to the Bayesian setup, which we will do now, very briefly. If you want to be Bayesian about linear regression, the standard way is to place a prior on the weights. Bayesian linear regression still has the same conditional distribution: p(y | x, w, σ²) is a Gaussian with mean w^T x and variance σ². But then you place a prior distribution over the weights, and the canonical choice is another Gaussian, with zero mean and a spherical covariance:

p(w) = N(w | 0, α² I).

This is a setup we have encountered a few times now, so I am going to leave it as an exercise: compute the posterior distribution over the weights, given the observations, the variance σ², and the parameter α². You can also place a prior over σ² in the form of an inverse gamma, and it will work out. The solution is just another application of the calculations we have been doing many times with Gaussians.
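For reference, the exercise has the standard conjugate-Gaussian closed form; a minimal sketch (the function name is mine):

```python
import numpy as np

def weight_posterior(X, y, sigma2, alpha2):
    """Posterior p(w | Y, X, sigma^2, alpha^2) = N(mu, Sigma) for the prior
    w ~ N(0, alpha^2 I) and likelihood y_i ~ N(w^T x_i, sigma^2)."""
    d = X.shape[1]
    # The prior precision I / alpha^2 regularizes X^T X, so this inverse exists
    # even when n < d, unlike the maximum likelihood normal equations.
    Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / alpha2)
    mu = Sigma @ X.T @ y / sigma2
    return mu, Sigma
```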
Now, the reason why it is very important to do this is that it allows us to introduce the concept of the Bayesian model average, which is a central concept in Bayesian inference. The idea is: I have learned my posterior distribution over the weights, given a training dataset and some hyperparameters; so what should I predict for a y* associated with a new input x*? How would you do predictions? Well, in the ML world, the maximum likelihood estimation world, you would take the maximum likelihood estimate of the weights and return y* = w_ML^T x*. You would just do a plug-in: you would say, I found my w once and for all, and I will just use that value.

Excuse me, what about the noise, in the maximum likelihood case?

Sure: if you were to generate draws, then you would add a little bit of noise, but if you ask what the expected value of y is, the expectation of the noise is zero, so you would just plug the new input into the linear form.

In the Bayesian world, though, you would consider not just the noise in the observation: you would work out the full probability of y* given x*, the training data, and of course the hyperparameters. How would we do that? Let me stop writing the conditioning on the hyperparameters. By the marginalization rule,

p(y* | x*, Y, X) = ∫ p(y* | x*, w) p(w | Y, X) dw.

You see, if I know w, if I am conditioning on w, then there is no dependence of the new y on the previous y's; it is all summarized through the posterior distribution of w. This is what is called Bayesian model averaging, because p(y* | x*, w) is your model, your linear regression model, and it depends on w; but w is not just one value, w has its own probability law, which is this posterior, trained on the training data. So you don't have only one value, you have a whole distribution of possible values, and Bayesian model averaging averages out the predictions of all the models in this ensemble using the posterior.

Going back to our little pictures: the MLE, given the training data, would fit a single line, and then for a new x it would predict a single point. The Bayesian, given roughly the same data, says: well, there are many, many possible lines that go through here, each with its own probability; and when it comes to x*, there is a whole distribution of predictions. Its mean will be essentially the maximum likelihood prediction, but the Bayesian answer also quantifies its uncertainty. So this is the Bayesian picture.
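In this conjugate Gaussian case the model-averaging integral is analytic; a sketch of the resulting predictive mean and variance, using the (mu, Sigma) returned by the posterior sketch above:

```python
import numpy as np

def predict(x_star, mu, Sigma, sigma2):
    """Bayesian model average: p(y* | x*, data) is Gaussian with these moments."""
    mean = mu @ x_star                          # posterior-mean line evaluated at x*
    var = x_star @ Sigma @ x_star + sigma2      # weight uncertainty + observation noise
    return mean, var
```

The x*^T Σ x* term is what grows as x* moves away from the training inputs, which is exactly the behavior discussed in the closing question below.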
Now, I saw there were a couple of questions flashing up, which we'll address right now. Can you go to the slide before? Yes, of course I can go to the slide before. All I am saying there is that with maximum likelihood you content yourself with a single value, and that's it; you use it for all your predictions.

Professor, I have a question: the probability of w given the data points, does it depend on σ²?

Yes, it does. I said that I would omit it, but it does depend on σ², and also on the prior. So this posterior depends on σ², and the predictive also depends on σ²; I just didn't want to keep conditioning on huge sets. Okay, any more questions in the last, uh, minus one minutes? Yes, please.

Using the Bayesian approach gives us a whole distribution of w's, right? Then how does it affect the test data? Which one of the w's does it use for the prediction?

It uses all of them: it averages them out, in this way, so the prediction is done by computing an integral. If you prefer a sampling view, a Monte Carlo approximation of this integral, which you would have to use in many complicated cases but which here you can do analytically, then you would draw w values according to this posterior, draw a bunch of lines, and you would get a histogram of your possible test outputs.

Doesn't it make it hard for us to know which one is better, if you get the average?

You see, it is almost a philosophical shift: there is no such thing as better, there isn't a single answer. I am actually quite convinced that in most scientific setups there isn't a single answer; there is always some uncertainty, and what Bayes does is quantify your uncertainty and say: look, my belief about the outcome is not a single point, it is distributed, which gives you a sense of where the answer may lie. The Bayesian philosophy moves away from a single best or better answer. Thank you very much. You're welcome.

Yes, sorry: would the two regressions coincide for a large number of samples?

Well, that is an interesting question, because we are looking at the linear setup, and this is something we will see later. With more samples, provided that the model is correct, the uncertainty over the w's will reduce, you are absolutely right, so you will get lines that are more and more similar. But still, if you move far away from the data, those lines will diverge, and you will still get a sizable uncertainty. And that is, to some extent, the right thing: if you are far from the data you observed in training, you shouldn't give an answer you are very confident of. So yes, the posterior variance will shrink, but you will see that the predictive uncertainty also depends on how far you are from the data. Okay, thank you. You're welcome.

And with this I am going to stop for today, and I shall see you again tomorrow. Take care, bye, Matteo, ciao.