This is, again, something that I'm sure most of you have heard at least a little bit about, but we are here to come back to the basics in order to build a strong foundation. What we will do is first discuss correlation, and different measures of correlation. Then we'll discuss linear regression; we'll mostly focus on simple linear regression models, but we'll use that to put in place a couple of concepts about goodness of fit, maximum likelihood estimation, and model testing, so that if you ever want to go a bit further and play around, for example, with generalized linear models or something like that, you will have the foundation to do it in an easier way than if you just arrived at the subject without any background. So, without further ado: my imports are classical, and the data that we'll be playing with is this little table. It's a bunch of measurements that were taken on biology students, and we have a lot of columns there. In the context of correlation and linear regression, we typically want to relate two quantitative variables together. Up to now we have played with one quantitative and one categorical variable: that was the t-test and ANOVA. We have played with several categorical variables: Fisher's exact test and the chi-square test. Now we play with two quantitative variables, for example the weight and the height. Here you can see we have a lot of columns. We will not necessarily use all of them, but we could. We have the gender, male or female; the height in centimeters; the weight in kilograms; the shoe size; whether they are right- or left-handed; whether they are smoker or non-smoker; their hair color, where there's light brown, blonde and many others. The eye color is number-coded, but that would be blue, brown and green. Then the right wrist girth, the left wrist girth, the number of siblings, the birthplace, the height of their mother, the number of siblings of their father, the number of siblings of their mother, and the diet, along three or four levels that were again coded, so we don't know exactly what's behind that. Overall, that's a fairly nice dataset to play with all sorts of analyses and tests. So here's the list of columns, and here are the data types. Again, whenever you encounter a dataset, it's interesting to go and have a look at what's what. And let's start with correlation. Correlation is a way of measuring relatedness between two numerical variables. If you have two categorical variables, you would go with Fisher's exact test or a chi-square test; if you have two continuous variables, you go with correlation. Correlation is always on a scale that goes from minus one to one. Minus one is anti-correlation: that's a relationship, but when one variable increases, the other decreases. Plus one is perfect correlation: when one variable increases, the other also increases. And zero is the absence of correlation: when one increases, there's no particular pattern in whether the other increases, decreases or stays the same. Correlation really describes a tendency, and the closer to minus one or one, the more important this tendency becomes, the less noise there is around it. Okay, so let's jump right in; there are a few plots to help us understand what happens.
So the most well-known is Pearson's (linear) correlation. The "linear" part is between parentheses there because oftentimes people forget about it, but it's actually quite important. The way it's computed is given here: r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ(xᵢ − x̄)²) · √(Σᵢ(yᵢ − ȳ)²) ). We don't have to spend too much time on that, but you see here the most important part. We are looking at two variables, x and y; x̄ is the mean of x and ȳ is the mean of y. For each point i, we relate together the difference between xᵢ and its mean and the difference between yᵢ and its mean. The idea is that if, when x is large, so above its mean, y is also above its mean, you have a large positive number multiplied by a large positive number. And if x is small, so lower than its mean, and at the same time y is also lower than its mean, you have a negative number times a negative number, so again a positive number. And then it's all about the tendency: if the points tend to be high on x and high on y at the same time, or small and small at the same time, you sum up a lot of positive numbers and get a high correlation. Conversely, if when one is above its mean the other tends to be below its mean, you sum up a lot of negative numbers and get a negative correlation. And if there is no overall trend, you get sometimes something positive, sometimes something negative, and these tend to cancel each other out to give a final correlation close to zero. The term below is just about normalizing this number by the spread of both variables around their means; the most important part is the part above. So far so good? This makes sense? Yes. Okay, so we are just looking at: when x is big, is y big also? Then, and I love this, with this formula we can look at some clouds of points and see what their correlation is. That helps us understand what correlation is, what it measures, and what it does not measure. The first row there shows different levels of correlation. Here we have a Pearson correlation of zero: you can see that there is no particular trend in the relationship between the two variables; when one moves, the other moves whichever way. And here you have perfect correlation: each time one moves, the other always moves in exactly the same way. There is absolutely no noise; with a correlation of one, the relationship between one and the other is perfect. And then we have lower and lower amounts of correlation. That doesn't mean the correlation doesn't exist; it just means that it is small, that there is more noise around the trend. And the trend can be positive or negative, it doesn't matter. You sometimes hear it said, or see it written, that a correlation of, I don't know, 0.4 is not meaningful, not worth stopping on. And I would contend that's not really true. It just means that there is a lot of noise around it, but the relationship is still present; 0.4 is not nothing. That still means that you explain a meaningful amount of the variance with this relationship. Not all of the variance, of course, but a meaningful amount. All right.
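Before moving on, here is a minimal sketch computing Pearson's r from the definition above and checking it against scipy (the toy data is my own, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)          # toy data with a positive trend

# Pearson r from its definition: covariance of the centered variables,
# normalized by the spread of each variable around its mean.
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r_manual = num / den

r_scipy, p_value = stats.pearsonr(x, y)   # same coefficient, plus a p-value
print(r_manual, r_scipy)                  # the two values should match
```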
So then the second row is just there to hammer home the point that the correlation does not measure the slope of the relationship, only the amount of noise around the relationship. So all of these are a correlation of one, and all of these are a correlation of minus one: positive correlation here, negative correlation there, where one goes down when the other goes up. And in that particular case in the middle, the correlation is 0, because when one increases, the other doesn't move at all. And then, last but very definitely not least, it's very important to remember that the Pearson correlation is linear. That means it cannot accurately detect cases where the relationship is not linear. You have here a bunch of datasets where I think everyone would say that the relationship between the two variables is not random. Here there is a weird pattern; here there's a square; here there is a lozenge; here there is maybe, I don't know, a quadratic relationship; here it's two moon crescents; here it creates a ring; and here four little clusters. The relationships between these variables are clearly not random, but the Pearson correlation is not able to detect that: it gives a correlation of 0 for all of these. So that's again a big, big principle: whenever you play with correlation and regression, you have to visualize your data. If you only rely on the pure numbers, you might miss very important relationships and patterns in your data. Another example, which is quite fun, is the Datasaurus Dozen, where someone went and created a dozen different datasets, which you see here, that all have about the same means, the same standard deviations and the same correlation. But you can see that, despite having the same metrics, they all look very, very different from one another. In each there is quite a bit of structure, some interesting relationship in the data, and you even have a datasaurus appearing there. So that's again to hammer home the fact that you should always, always plot your data, because if it looks like a big dinosaur, that's something you want to report in your paper. Okay. So then, what happens when we have a relationship that is not linear? We need another metric. It doesn't work perfectly in every case, but we have an alternative, which is Spearman's correlation coefficient, and it's one which is based on ranks. This metric is able to detect correlation in nonlinear cases, but the relationship has to be monotonic. Monotonic means that you are not in this sort of case: you don't want a portion where you go down and then a portion where you go up. There, Spearman's correlation coefficient cannot do anything for you. But if the relationship is not perfectly linear, yet you still always have this trend of an always-decreasing or always-increasing relationship, then this can still work. Okay. The way it works is that it ranks the values in your two variables, X and Y. Then, for each point, it compares the rank according to the X variable and the rank according to the Y variable: the coefficient is 1 − 6·Σdᵢ² / (n(n² − 1)), where dᵢ is the difference between the two ranks of point i. So you can understand that if, in general, when you have a high rank in X, so you are among the highest values in X, you are also among the highest values in Y, then these rank differences are very small, and so you have one minus something which is very small.
So you will have a correlation coefficient close to one. Conversely, if when you are among the first in X you tend to be among the last in Y, then these rank differences are very large, the whole second term tends toward two, so one minus two equals minus one, and you get a very low correlation. And if there is no particular pattern, this whole thing tends to zero. All right. So then let's look at it in practice. I'm showing here three datasets: one with a linear correlation, one with a nonlinear but monotonic correlation, and one with a correlation that is neither linear nor monotonic. To compute the correlation coefficient, I call stats.pearsonr, give it x and y, and it returns the coefficient of correlation as the first element; and stats.spearmanr(x, y) also returns the correlation coefficient as the first element. So here you can see that both agree: the Pearson and the Spearman agree on a 0.99 correlation coefficient. You can see that there is very little noise around the trend, and because the relationship is linear, the Pearson is trustworthy. But when you have here a nonlinear relationship, despite the relationship being quite tight, the Pearson correlation coefficient is moderate, let's say. Because, if you try to understand what it does, it tries to fit a straight line through that, and that doesn't work that well. Whereas the Spearman coefficient relies on the ranks, and so it's able to adapt, if you will: it only cares about the ordering of these variables, not about the scale. And so it actually gives a correlation which is quite high in the negative; it sees the anti-correlation. And there, finally, the relationship between the two variables is quite clear when you look at it visually, but it's non-monotonic, and so the Pearson and Spearman both kind of fail at detecting the correlation that exists in your data, giving fairly low or fairly moderate values, even though the points are actually quite tight around the trend. All right, so far so good? So then we can get back to our dataset, and let's take the height and the shoe size. We want to know the correlation between the two, so we compute the Pearson and the Spearman and plot them; always do both. This is what is shown here, and we can see the trend. I hope that you also see a fairly positive correlation between the two, which is quite linear. And we can see that the Pearson and Spearman are in fairly good agreement: Pearson with a correlation of 0.815 and Spearman with 0.821. So that's a good sign, saying that we have a positive linear correlation between these two variables. Okay. So once you have that, you can ask yourself: this is a single value; can we test around that? Can we do a test of significance of the correlation coefficient? This is actually already done for you. That's why earlier I only kept the first returned element: I wanted to show the value itself a little before we start talking about its statistical evaluation. A sketch of these two functions on simulated datasets is below.
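Here is a minimal sketch of the comparison just described, on three simulated relationships (the shapes and noise levels are my own choices, not the exact ones from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 3, 200)
datasets = {
    "linear":          2 * x + rng.normal(0, 0.2, x.size),
    "monotonic (exp)": np.exp(-2 * x) + rng.normal(0, 0.02, x.size),
    "non-monotonic":   (x - 1.5) ** 2 + rng.normal(0, 0.1, x.size),
}
for name, y in datasets.items():
    r_pearson, _ = stats.pearsonr(x, y)    # (coefficient, p-value)
    r_spearman, _ = stats.spearmanr(x, y)  # rank-based (coefficient, p-value)
    # Pearson only handles the linear case well; Spearman also handles the
    # monotonic one; both miss the non-monotonic parabola.
    print(f"{name:16s} Pearson={r_pearson:+.2f}  Spearman={r_spearman:+.2f}")
```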
Coming back to the significance test: I will not delve into the details, but you can take the correlation coefficient and apply this formula. You get your correlation coefficient times something that depends on the number of observations, divided by the square root of one minus the correlation coefficient squared: t = r·√(n − 2) / √(1 − r²). According to the literature, this follows a t distribution under the null hypothesis that the true correlation coefficient is zero. So that gives us a way of getting a p-value for the significance of this correlation coefficient, and that's what is also returned, as the second element, by the spearmanr and pearsonr functions. In some publications, at some point, you might also see another correlation coefficient, which is Kendall's tau. I will not go into the details of it; it's just another correlation coefficient, which works a little bit like Spearman's in the sense that it measures rank correlation: it's not limited to linear relationships, but it is also limited to monotonic cases. What it does is look at pairs of data points and ask: when point i is above point j in x, is i also above j in y? That is basically the idea. If that happens often, you get a high score; if it almost never happens, you get a very low score. So if we take our simulation from before (sorry, I had a little trouble with the screen zoom there for a moment), you see it's kind of the same story as before. Remember, here we had a high positive correlation; there, the spearmanr was able to give us something because the relationship is present but not linear; and there, everything failed because the relationship is non-monotonic. And you see that it's kind of the same thing with Kendall. There are two ways of computing it, which give essentially the same result here. Basically, it sees the positive trend when the relationship is linear; when it's nonlinear it also has no problem detecting this anti-correlation; but there, when the relationship is non-monotonic, it's also unable to detect the relationship. All right, so that's what I wanted to say here. I think that now you can try and test it for yourself. So use the students dataset; I called the data frame df. Remember that you have a ton of variables there. Imagine that you are interested in the height: we could say, okay, we want to learn what the main determinants of height are. But there are many, many columns there, so maybe we cannot just start blindly with everything; we have to focus on the most promising candidates, at least among the continuous variables. So test here all the continuous variables, to see which ones have the highest correlation with height, which ones seem to be the most promising predictors or determinants for height. And also look at which ones might be a bit redundant because they have a very, very high correlation with each other: that would mean they bring the same information, so maybe that's something we don't want twice. Okay, so as usual, I'll stop talking and let you try. And there we go.
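If you want to convince yourself of that t transformation during the exercise, here is a small sketch (on toy data, not the students dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

r, p_scipy = stats.pearsonr(x, y)
n = len(x)

# Under H0 (true correlation = 0), t = r*sqrt(n-2)/sqrt(1-r^2)
# follows a t distribution with n-2 degrees of freedom.
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
print(p_manual, p_scipy)                      # the two should agree

tau, p_tau = stats.kendalltau(x, y)           # Kendall's tau, for comparison
print(tau, p_tau)
```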
So as usual, do not hesitate if you have any questions. First off, my first approach would be to single out the columns which are of a numerical type. For that, I look at df.columns, so I have all of my columns there, and remember I also have df.dtypes, which gives me the type of each column. Some are object, which are the categorical columns, which I don't want to include there, and some are numbers, either int64 or float64. So what I do is ask for the data types which are not object, and I get a bunch of True and False over these columns; the ones which are True are the ones I'm interested in. So I can feed that to df.columns, with the square-bracket operator, to get the list of columns which are numerical, as this little series here. And then I can just do "for column in columns" and compute stats.pearsonr for each against the height. What's around it here is just a little bit of presentation: I print first the name of the column, then the first element returned by pearsonr, then the second. The name gets right-justified, the coefficient is shown with up to two digits after the decimal point, and then the p-value, which corresponds to the p-value of the Pearson r using the method I showed above, the small transformation of the coefficient so that it follows a t distribution under the null. So we have here our height with a Pearson correlation of 1 and a p-value of 0. That's maybe just a sanity check: the correlation of the column with itself is one, that's normal. And then we have here maybe our best contenders, where you can see that the p-value is quite small and the correlation is high: the weight and the shoe size (shoe size is actually maybe the best), then the right wrist girth and the left wrist girth. So well played to all the people who wrote that in the chat. And then we can see that the number of siblings has a slight, maybe negative, correlation, but it is not significant, so we should maybe not trust it. Here, the height of the mother: there is a small correlation there, 0.32, which is not negligible, and it seems to be somewhat significant. And then the number of siblings of the father and the number of siblings of the mother: again, not significant. Sometimes, you know, you have some nonsense variables in there. It would be a bit surprising (although we might have a surprise, who knows) that the number of siblings of either the father or the mother has an influence on, or is related to, the height of someone. That would be a bit unexpected, at least. And so seeing an absence of correlation there also gives us a little test of our actual ability to detect no correlation where we do not really expect one. What we could also do, for example, if we wanted to check how a purely random variable fares on something we know is not correlated, is create some new columns containing only noise, which we know are not related to the height, and see what p-value and what Pearson correlation coefficient they are able to attain. If you want to try and convince yourself of the validity of these p-values, that would be the way to go. All right, so far so good? Anything not making sense?
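Put together, the loop just described looks roughly like this (a sketch; the column name "height" and the data frame df are assumptions about the dataset):

```python
from scipy import stats

# Keep only the numerical columns (everything that is not of object dtype).
num_columns = df.columns[df.dtypes != "object"]

for col in num_columns:
    r, p = stats.pearsonr(df["height"], df[col])
    # right-justified name, coefficient with 2 decimals, then the p-value
    print(f"{col:>25} r={r:+.2f} p={p:.2e}")
```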
Okay. So I hope that you were able to at least test some of these, maybe not automatically, like what I've done there to detect the columns of interest, but at least by listing the numerical columns by hand. To me, we have to also differentiate there between, let's say, code literacy and statistical literacy. If your code is not very beautiful and not very elegant, but what you test is tested properly, with the proper checks of assumptions and so on and so forth, it doesn't matter. The most important thing is the statistical validity of your analysis; the beauty of the code is something that comes after, because it is less important. Okay. So then I went on and tried a few more advanced pieces of code, just to show you how we can sometimes approach this sort of problem when we actually encounter a dataset and want to start understanding what to use first, and so on. I first start with something not too, too complex: I compute automatically the correlation between all variables by calling df.corr(). So I take my df, call corr, and that will automatically compute the correlation between all numerical variables. As you can see here, the variables are all perfectly correlated with themselves, and then I have here the different levels of correlation: height and weight at 0.77, then 0.81, and so on. So we have the whole thing. Then I plot this as a heat map using seaborn, and I create violin plots of height against the different categorical variables: gender, smoker, and so on. And I compute my Mann-Whitney U rank test to check the difference in height between categories: the variables with two categories with the Mann-Whitney U, and the ones with more than two categories using the Kruskal-Wallis test, so the non-parametric ANOVA. So here you see your correlation matrix. Just at a glance, you can see that height indeed correlates fairly well with weight, shoe size and the wrist girths. But you can also see that the two wrist girths are very correlated with one another, so including one might be enough; maybe including both would be a bit redundant. At least that's a question we could ask. And then, for the rest, you can see that the correlations hover almost all around zero. Then you can look at the height between male and female; the height between smokers and non-smokers, where maybe there is no relationship; and between different birthplaces. Here we have the problem that some birthplaces are represented only once, so no chance of getting significance there, and some other categories have only two, three or four points, so we will not be able to extract too much information from them. And then between different hair colors, between different eye colors, and between different diets. And of course we have the results of our tests: the Kruskal-Wallis for hair color, eye color and diet. They are not significant for hair color and eye color, and significant for the diet. Okay, so maybe including the diet in the mix might be useful. A compact sketch of this step is below.
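Here is a minimal sketch of that heat map plus group tests step (column names like "gender" and "diet" are assumptions about the data frame):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Correlation matrix of all numerical columns, drawn as a heat map.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

# Two categories: Mann-Whitney U rank test on height.
groups = [g["height"].values for _, g in df.groupby("gender")]
print(stats.mannwhitneyu(*groups))      # assumes exactly two groups

# More than two categories: Kruskal-Wallis, the non-parametric ANOVA.
by_diet = [g["height"].values for _, g in df.groupby("diet")]
print(stats.kruskal(*by_diet))
```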
Now we'll go toward something even more complex; is that okay with everyone? We have here a correlation heat map, and we were able to compute p-values for the correlations. But I computed all these p-values independently, which means I'm doing multiple tests, so I would actually want to apply a multiple-test correction. So for this, I'm going to compute the pearsonr for each pair independently, keep the p-values, apply a correction to them, and then make a heat map visualization of these corrected p-values and correlations. I will not go too much into the detail of the code, because it's a bit ugly, but the idea is that I compute the correlations there. I made a small function to grab, from the pearsonr function, just the p-value and not the actual correlation, so I grab the p-value for all correlations. And then I feed them to the multipletests function from statsmodels. And here there is a lot of ugly code, because in this table of correlations (if you remember, I think I showed it here) every piece of information is basically repeated twice: you test the correlation of height versus weight and of weight versus height, and you also have the correlation of height versus height. All of this is kind of redundant, so I don't want to include it in my multiple-test correction procedure; I only want to include each pair once. Hence the not-so-beautiful code: I remove the duplicates, grab my corrected p-values, and then add them back to recover this matrix format. Once I have that, I flag which correlations have a significant corrected p-value, and I use those to highlight cells in a heat map, where I call the very nice clustermap function from seaborn. It basically makes a heat map and then clusters the columns and rows together, so that the ones which are close to one another become more visible. There are a few options: you give the correlation matrix, you say that you want to cluster the rows and the columns, with which clustering method and distance, which colormap, and so on and so forth. The rest is quite close to what we have seen before, and I highlight the cells where the correlation is significant. And this is what you get out of this more complex code, where we kind of try to put everything together: we now have here the height, shoe size, weight, right and left wrist girths, and we see that all of these are quite significantly correlated, even after the correction. And then, in this other group there, nothing is significantly correlated anymore, either positively or negatively. All right, so there you go. I know this is much more complex, let's say, than what we have seen, but this is also to show you that we have to spend a bit of time to automate this. It's not mandatory, but if you take the time, sometimes it can be helpful and it helps put together all of what we have seen up till now. And the rest is the sort of plot that we have already seen. Okay, how are we doing so far? Everything good? So then there is one important element which we cannot repeat enough, I think. We have talked about correlation; it's very important. "Can you share the code again, the one where you list all the Pearson correlations?" Is it this one? Yes, for sure, I will post it in the chat. There you go. "Thank you." For this one, you will also need the num_columns list, which you acquire with this part there. A sketch of the correction step follows below.
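Here is a sketch of that correction step, using itertools to take each pair of columns only once, so the redundant half of the matrix never enters the correction (the choice of the "fdr_bh" method is mine; any method accepted by multipletests, e.g. "bonferroni", works the same way):

```python
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

num_columns = df.columns[df.dtypes != "object"]

# One test per unordered pair: r(a, b) == r(b, a), and the diagonal is 1.
pairs = list(combinations(num_columns, 2))
pvals = [stats.pearsonr(df[a], df[b])[1] for a, b in pairs]

# Benjamini-Hochberg FDR correction across all pairwise tests.
reject, pvals_corrected, _, _ = multipletests(pvals, method="fdr_bh")

for (a, b), p_corr, sig in zip(pairs, pvals_corrected, reject):
    if sig:
        print(f"{a} ~ {b}: corrected p = {p_corr:.2e}")
```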
All right. So, something that we can never repeat enough: correlation is not causation. It's not because you have a correlation that you have a causation. For one, you don't know whether A causes B or B causes A. But it might also be that there is absolutely no causation whatsoever, or that there is a variable C which mediates the relationship and creates the causation, and that's why you would see a correlation. Or it may just be spurious: if you look at all the billions of variables that could exist out there, just by chance, just because of the multiple testing, you will be able to find some which look like they correlate. For example, there is this very, very fun website, which I love, that finds lots of nonsensical correlations. For instance: people who drowned after falling out of a fishing boat, and the marriage rate in Kentucky. You can see it here; it looks quite correlated, although there is absolutely no reason why. If you want to laugh a bit, you can browse this little website. It's quite interesting, helps put things in perspective, and reminds us that we should not put too much stock in correlations, and always take them with a little grain of salt whenever we conduct our analyses. All right. So, with that being said, we have now seen how we can use correlation to quantify the amount of linear relationship between two variables. And we have seen in particular (if I scroll up quite a bit, sorry for the frantic scrolling) that between our height and shoe size, with a Pearson correlation of 0.81 or 0.82, we suspect that there is a good linear relationship. The Pearson correlation just tells us how good this relationship is, how tight it is, how much variability there is around this trend. But it does not really describe the trend itself. So, to actually describe this trend, we use linear regression. In linear regression, we want a more mathematical description of the relationship between our two variables. A bit more precisely, we say that we have one variable, which we call the response variable, and another variable, or several variables, which we call the covariables. And the idea is that we want to use the covariables to predict the values of the response variable. Of course, you presume that your model will never really be perfect, and that there is some noise on top of that, which explains the part of the variability of the response variable that you are not able to describe. So what you want to write is: y, the response variable, is equal to some function of the covariables, some combination of the covariables, plus some noise on top. And if the linear regression is good, the noise will be small; if it is less good, if it explains less of the variability of y, then this noise term will be large. In our particular case, we will look at the case where this function has a particular form, which we call a linear combination. That means that for each covariable, so for each additional column used to predict the height, for example, you will just have one beta coefficient, plus one little extra coefficient, which we call the intercept. So we write: y equals beta times the value of your covariable, plus the intercept. And visually speaking, what you are doing is: you have your x, you have your y, and you draw a line there whose slope corresponds to this beta coefficient and whose intercept corresponds to c. The intercept is the value of y when x equals zero. So far so good? Of course, this is just the simple case where you have a single covariable.
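Written out, here is a sketch of the notation, including the multi-covariable generalization that comes next:

```latex
% Simple case: one covariable, slope beta, intercept c, noise epsilon
y_i = c + \beta x_i + \varepsilon_i

% General case, sum notation: one coefficient per covariable
y_i = \beta_0 + \sum_{p=1}^{P} \beta_p x_{ip} + \varepsilon_i

% General case, matrix notation: X holds the samples in rows
% and the covariables in columns
\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}
```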
When there is more than one covariable, we cannot make such a simple plot, because we would have to visualize things in three, four, five dimensions, and our brains are not equipped for that. But from a mathematical standpoint, it just means that we have one beta coefficient per column. There are always two ways of writing it that you might see in the literature, as in the block above. One is just as a sum: you sum, across the different columns p, the coefficient beta times the value of column p. The other is matrix notation, where we have a vector of coefficients beta, and we apply it to the matrix of covariables, which contains the covariables in columns and the samples in rows; you multiply these together and that gives you the estimate for y. It doesn't matter too much, but that's just to help you understand some of the notation later on. So once you have that, it's kind of neat: we have said what we would like. But then the big question is, how do we get what we would like? There are several methods to actually find the best beta parameters, because you can imagine that I could try and draw my line like this, but I could also draw it here, or there, and so on and so forth. And you can easily imagine that some placements of the line are better and some are worse: some will have a higher noise term and some a lower one. Of course, ideally, what you want is the lowest possible noise. And this is the idea behind the least squares fit method; we'll see in a moment why it's called least squares. But there are other methods, and it's worth knowing about them because they will be useful for statistical testing of your linear model. In particular, there is maximum likelihood, where you try to maximize the probability of your data under your model. And when I say this sort of thing, the probability of the data under the model, you should, in your mind, relate it to the idea of a p-value, which is the probability of an observation under a null hypothesis. The model is a hypothesis, and the probability of your data under it is related to a p-value. Right, that's just to start putting a few things in place here and there. Okay, what we are going to see today is linear models, where the relationship is fairly simple: it's just a line. But know that this is, let's say, one step toward a more advanced tool, what we call generalized linear models: generalizations of this, where we allow the relationship to be maybe not a line, but an exponential, or a log relationship, or a logistic slope, and so on and so forth. So when you hear about Poisson regression, negative binomial regression, logistic regression or multinomial logistic regression, these are what we call generalized linear models, and they go one step further than the linear model. Having a good understanding of the linear model helps for all the rest as well. Okay, so within this simple slope idea, we can already do quite a bit. And the criterion that we use to find the best beta parameter, so the best slope for our model, is called the least squares method. The idea is quite, quite simple: ideally, you want the error of the model to be as small as possible, and the error of the model can be described as the difference between what you observe and what your model predicts.
So yᵢ here is the reality, your observation, and the prediction is what your model, for a given value of beta, for a given slope, says y should be. The difference between the reality that you observed and the prediction of your model is the error of the model. And so we say that the best slope is the one that minimizes the error of the model. I think this is fairly sensible and quite consensual, right? Now, if we try to visualize this: we have our x, shoe size, and our y, height, and I can plot the two together. And I want to draw a line here that minimizes the distance between the red line and each of the points. This distance between each point and the red line is what I call the error. There are ways to compute this automatically, thankfully: you can just use stats.linregress. But we'll see that there are better methods, which give you more statistical detail as to whether the fit is good or not, so I'm not going to spend too much time on stats.linregress; I mostly use it for visualization purposes. So, if we come back to that: how do we find this best line? We can basically test it ourselves, to see what this least squares is about. We can simulate data with a true slope and some noise: I take x, some values, and y will be equal to the slope times x plus some noise, where the noise is randomly distributed according to a normal distribution. Then I can say: okay, I have generated the data, I know the true slope, and now I can try different slopes. Let's say, for instance, slope number 1 and slope number 2.5, and see if I can find back the actual slope that I used to simulate the data. For each slope that I try, I look at what my model predicts: the prediction is x, my covariable, times my tentative slope. And then the difference between what I predict and the reality gives me the prediction error, and I square it so that everything sums up regardless of whether I am overestimating or underestimating the values. And then I plot them. Right, so for instance, you can see that here you have your data points, and when I say my slope is 1, this is the prediction, and the black bars there are the errors of the model for a slope of one. When you sum the squares of all of these, you get the squared error, which is 1000-something. Then, if I try a slope of 2.5, this is what my model predicts, and now you see the errors visualized; you can, I think, visually see that their sum is smaller, and indeed the squared error is 188. So it stands to reason that this slope is better than the other. Okay, so far so good? Yes? I think we are still on fairly simple ground here. And of course, you could then say: let's try many, many slopes. So here I try 101 slopes between 0 and 6, and for each I repeat the same procedure and keep the squared error each time. And so you can see the squared error go down and then up, as we go from slopes which are much too small to slopes which are much too big, with a little optimum in between, which is actually around the value that we used to simulate. And here it says that the slope estimate which gives me the smallest error is actually 3.06.
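A minimal sketch of this brute-force scan, with the analytical least-squares solution (which we get to in a moment) for comparison:

```python
import numpy as np

rng = np.random.default_rng(3)
true_slope = 3.0
x = rng.uniform(0, 10, 100)
y = true_slope * x + rng.normal(0, 3, 100)   # simulated data with noise

# Try 101 candidate slopes between 0 and 6; keep each squared error.
slopes = np.linspace(0, 6, 101)
errors = [np.sum((y - s * x) ** 2) for s in slopes]
best = slopes[np.argmin(errors)]
print(best)   # close to, but usually not exactly, the true slope of 3

# The analytical solution, beta = (X'X)^-1 X'y, gives it directly;
# np.linalg.lstsq is the numerically stable way to compute it.
X = x.reshape(-1, 1)                  # one covariable, no intercept here
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[0])
```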
You can see that it should be around here, and you can also see that it's not exactly the slope which I used to simulate. That's because, of course, there is a little noise on top of that, which means that the best point, the point which gives the minimum error, is not necessarily the actual true one. We always have to accept a little bit of randomness, a little bit of uncertainty, around that, and we'll see later on how we can build confidence intervals around these slope parameters. Of course, we could try out a lot of values like this, but you can guess that if I have dozens of parameters and hundreds of data points, this can become very, very long to compute after a while. With complex models, and I'm thinking in particular about generalized linear models, that's unfortunately kind of how we have to do it; the methods are a bit smarter than that, but it's very crunchy and computation-intensive. But for simpler models, like the simple linear model where we do a least squares fit, we actually have a non-numerical, analytical solution. We don't have to test all slopes: we can directly get the best possible estimate with a little matrix computation, essentially beta = (XᵀX)⁻¹Xᵀy as in the sketch above, which we won't go into the details of, but which does give us the best possible estimate for the model, the one that minimizes the squared error, provided that you have more points than coefficients. So you want to have more points than you have columns in your dataset. If you have too many columns, then maybe you want to trim, to remove some columns before you go and create your model, using methods a little bit like the ones from before, where we could say: okay, let's maybe remove all of these from the model, because maybe they are not worth it. Okay. So now we have the idea of the model: we want to find some slope, to fit a line. And we have also a method, least squares, which will give us a bunch of coefficients. But we have to ask ourselves: are these coefficients worth something? And that is when we come to the underlying hypotheses of the linear model. These are not so easy to test, and furthermore, they are a bit particular, in the sense that before, we checked the assumptions before we did the test, but the linear model is built in such a way that you have to create the linear model first, and then and only then can you actually check the assumptions. So it's a, let's say, common mistake to create your linear model, start interpreting it, and only then check the assumptions. So try, if you can, to always create your linear model, then refrain from interpreting it and looking at the results, check the assumptions, and only then interpret it. It's very hard to do, as we will demonstrate. So these hypotheses, what are they? First is correct specification: have a good justification for the function you use. That's not always easy to assess, but if possible, you should have a good reason to want to fit a line through this cloud of points, rather than, I don't know, a curve, a parabola, whatever. You should stop and think a little bit about what it means to just draw a line through them. Then strict exogeneity: the errors should be centered around the true value of y. So the idea is that, among the errors that we see here, ideally you should have as many which are negative as which are positive, right?
The errors should not be biased toward being positive or negative. Moreover, the errors should be equally spread all along, and you should not see a bias where the model is overestimating for low values and underestimating for high values. That is what we call spherical errors. Spherical errors can be split in two parts. First, homoscedasticity (sorry, that's the word I was looking for): the spread of the error should be the same all along the curve. And second, there should be no autocorrelation: the errors should not be correlated with one another along the curve. Typically, this excludes most time series from linear models. If you want to play with time series, where there is autocorrelation, you have to use an adapted method; these exist, but that goes beyond the scope of this course. And last but not least, one property which is a bit odd: no linear dependence. Among the covariables, among the X's, there should not be a variable which can be reconstructed as a linear combination of a subset of the other variables. The best example that I have is: if among your covariables you have the length of the lower body and the length of the upper body, you cannot also have the length of the entire body in your covariables as well, because it would be possible to write that the length of the entire body is equal to the sum of the lower and upper body. Why is this important? Because if this is not respected, the formula that gives us the best coefficients automatically, analytically, suddenly doesn't work anymore. It tries to optimize and find one best optimal point, but when there is a linear dependency, this single best point becomes an infinite continuum of possible points, because you can always shift one coefficient of the linearly dependent variables a little higher if you also shift another a little lower. So you get a bunch of possible answers instead of a single one. We'll see together how to test for that. And the solution to this linear dependency problem is to detect which variables are codependent and remove one of them; that usually suffices to get rid of the problem. Okay. So the correct specification is usually for you to judge: try to have a good idea, plot your variables a little, and see if it makes sense to fit a line through this cloud of points. Then, for strict exogeneity and spherical errors, we usually look at the distribution of the errors around our line, around our model, and we decide from there. So if we look at homoscedasticity, for example, here is a visualization of both homoscedasticity and an example of heteroscedasticity. Here you have your errors, so how wrong your model is: sometimes it's overestimating, sometimes it's underestimating. And you see that here the spread is about the same all along the range of fitted values, so all along the curve of the model. That is homoscedasticity; that is what you want. In contrast, sometimes, in some models, you will see what we call heteroscedasticity. Heteroscedasticity can take many forms, but I would say, in my experience, the most common one is this sort of shape, where you have a smaller spread for lower fitted values and a large spread at the higher end of the fitted values, so that the variance of the error is dependent on the fitted values.
But it could also take other shapes, where you have a bias: for instance, constantly overestimating here and then constantly underestimating there. And of course all of these would be bad; any deviation from the ideal is not great. Now, with that being said, when you look at this sort of plot, it's easy when you have a lot of points; when you don't have a lot of points, it's a little bit like the QQ plot: it's not so easy to make a clear judgment. And last but not least, for the linear dependency, it's interesting to note that including a variable squared or cubed does not create a linear dependency. There is no linear dependency between X and X²; there is a quadratic dependency, but that's different. And that means you can include squared terms, and you can include interaction terms between your covariables, without breaching this assumption. That's also quite a nice property to have. Okay, any questions so far? I know there's a lot of theory before we actually get to the code, but that's because these models are a bit complex. "No, no questions for now." All right. So now we are going to build a model. Before we do that, we have our assumptions that we will need to check, and we also have measures that let us know whether we have a good model. I'm sure you know these already; I just want to come back quickly to what they mean. The first is that you can directly use the mean squared error. We do a least squares fit, meaning we minimize the squared error, so it makes sense that the squared error can be used as a quality metric. The mean squared error is basically the difference between the reality, the observed values, and the predictions of your model, squared, summed, and normalized by the number of points: MSE = Σ(yᵢ − ŷᵢ)² / (n − 2). The little minus two is there to account for the fact that you estimated several parameters. And then the one I'm sure you know: R², the coefficient of determination. This is one minus the ratio of the sum of squared errors, that's this term here, to the total sum of squares, that's there, which is the variance, if you will, the unnormalized variance, of the whole data: R² = 1 − SSE/SST. This is something that you should, or could, relate to the ANOVA. Remember, in the ANOVA we were taking a sort of ratio of the sum of squares between groups versus the sum of squares within groups, and there was this idea of the total sum of squares being a decomposition into a between-group and a within-group part. Here we have this exact same idea when we say that the total variance of the data, the total sum of squares, can be decomposed into the sum of squared errors of our model and the variance explained by our model. Then R² is quite literally the fraction of the variance which is explained by your model. That makes R² really a metric of choice; it's very, very interpretable. If I have an R² of 0.8, that means my model explains 80% of the observed variance in the response variable. If the R² is 0.5, my model explains 50% of the variance. It's as, let's say, simple as that.
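As a small sketch, here are these two metrics computed by hand for a simple one-covariable model:

```python
import numpy as np

def mse_and_r2(y, y_pred, n_params=2):
    """Mean squared error and R^2 for observed y and model predictions."""
    resid = y - y_pred
    sse = np.sum(resid ** 2)               # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    # n - n_params in the denominator accounts for the estimated
    # parameters (intercept + slope in the simple case, hence 2).
    mse = sse / (len(y) - n_params)
    return mse, 1 - sse / sst              # R^2: fraction of variance explained
```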
There is one very small, let's say, complication once you have said that: when you have more than one covariable, say you describe height using both weight and shoe size, you have to do a little adjustment to account for the additional degrees of freedom, for the fact that you are estimating more parameters. So you have this little adjustment formula, giving the adjusted R², but the interpretation stays the same. And then, last but not least, the confidence intervals and test statistics that we can build around our parameters, around the slopes, the coefficients of our columns. These rely on an additional assumption which we have not talked about up to now, and that is the normality of the errors. Ideally, they should be normally distributed. It is not one hundred percent necessary: the linear model can still be fit even if they are not, or only nearly, normally distributed. But if they are, that is better, because then and only then the coefficient estimates are themselves normally distributed, so you can actually use a t-test to check the significance of your coefficients, and even better, you can then also compute a 95% confidence interval around them. And that's quite, quite useful: it lets you learn much more about the importance and the reliability of the relationships that you describe with your model. Right, so that's not necessary, but quite nice to have. Okay, a lot of information. Now we are going to look at some examples, to try to see what a linear model looks like, what happens when you breach the assumptions, and what it looks like on concrete data. So here I import a bunch of things just to make the simulation a bit easier, the most important import being the statsmodels library, and in particular its formula API, which is quite useful for building linear models. We'll see the exact specification of how we create the model later on. Just focus here on the fact that we create some data where y is equal to 1 plus 3 times x, and then I will play with several levels of noise and show you what the model looks like. So first I display x versus y, the relationship, when there is zero noise: a perfect relationship, correlation of one. And what is also typical to show as a diagnostic plot is the true y, so the reality, what you observed, versus what your model predicts. Okay, so here, with no noise at all, it's a perfectly tight line. Now I add a little bit of noise, noise level 1, and you can see that the relationship between x and y still exists, but the points are kind of spread around the trend there. There, my R² is 0.77, which means that the model, this line here, explains 77% of the observed variance of y. And if I now look at my true y versus predicted y, you can also see this sort of spread; we see the sort of error that my model is making. And if I massively increase the noise, the relationship kind of still exists, but it is lost in the noise. There is still a small dependency, but there is so much noise that it's actually very, very hard to detect: you cannot really tease it out from the noise.
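Here is a sketch of that simulation (the exact seed and noise levels are my own choices):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 100)

for noise in [0, 1, 10]:
    y = 1 + 3 * x + rng.normal(0, noise, x.size)   # y = 1 + 3x + noise
    data = pd.DataFrame({"x": x, "y": y})
    results = smf.ols("y ~ x", data=data).fit()
    # R^2 drops as the noise grows and drowns out the relationship.
    print(f"noise={noise:2d}  R^2={results.rsquared:.3f}")
```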
And so you would be in the realm of the non-significant relationships, something you might still detect if you have thousands and thousands of data points, but otherwise not so much. So here, with noise level 10, the R² is essentially zero: 0.000-something. And you can see that indeed the slope is almost zero, there is no clear tendency there, and again the relationship between the true y and what my model predicts is not clear at all. All right, so far so good? Yes? Okay. Then we can also... a question, by Rosario? "Yeah, just a question about something I never understood so well. I often see, in biochemistry papers, Pearson coefficients and R-squareds reported left and right to show that there is linearity between two datasets. And now I understand that the R-squared just says how good our model's fit is, while the Pearson coefficient gives us the linearity between the two datasets. So what do you think we should report if we just want to say that dataset X and dataset Y have a certain linear relationship?" So, there is a mathematical relationship between the Pearson coefficient and R-squared: for a simple linear fit, R-squared is the Pearson coefficient squared. So they are essentially the same thing. Well, not exactly, but they give you the same amount of information, and you can deduce one from the other. The Pearson r is nice because people know about it, and it gives you the direction of the relationship: whether it's minus 0.8 or plus 0.8 informs you. To me, the R-squared is nice because it gives you directly the fraction of the variance that is explained by the simple linear relationship, so it's also very interpretable. So yeah, all that to say that they are about the same thing, and both are useful. "Yeah, sometimes I think they are used because the R-squared looks low, so people pump it up by taking the root and just show the Pearson. It's like a way of convincing the reviewers." Yeah, no, for sure. We also have to contend with the fact that, as I said before, a relationship might be a bit modest. Maybe you have an R-squared of, I don't know, 0.5, or even 0.4, and some people kind of scoff at such things, but that still means that you explain a third, or 40%, of the variance with your model. You're not telling the whole story, of course, but you are telling a part, and I would say a significant part, of the story. And to me, it's all about being honest about what your model can and cannot say in the end. Okay, so let's see quickly what happens if, for some reason, I had decided to fit not a straight line through that but a degree-three polynomial, so an S-shaped curve. This you can do by just creating, alongside your x column, columns for x squared and x cubed, and then fitting a line through those; we can look at what happens here. So this model here doesn't change too, too much, but when you start to have more noise, your model starts hallucinating some relationship there, just because you have given it more liberty, more freedom, to add weird wiggles, and it will start doing what we call overfitting. That's when it starts hallucinating coefficients and so on and so forth, to fit the data as closely as possible, up to the point where here the R² is still very, very low, but a bit higher than what we had seen before; though not to the point where I would really trust anything from that model. A sketch of this polynomial fit is below.
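A small sketch of that overfitting demonstration: fitting a cubic polynomial to pure noise via the formula API, where I() marks transformed covariables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
data = pd.DataFrame({"x": np.linspace(0, 10, 100)})
data["y"] = rng.normal(0, 10, 100)   # pure noise: no real relationship

linear = smf.ols("y ~ x", data=data).fit()
cubic = smf.ols("y ~ x + I(x**2) + I(x**3)", data=data).fit()

# The extra freedom lets the cubic squeeze a bit more R^2 out of
# pure noise than the straight line: mild overfitting.
print(linear.rsquared, cubic.rsquared)
```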
That's just a small cautionary tale, to tell you that it's always possible to complexify a model, to add new columns, to add new coefficients, to add new interaction parameters, to add new squared or cubic relationships between anything and everything, but at some point you have to say stop, because doing so will needlessly complexify your model, and then your model will start to hallucinate relationships that don't really exist. The algorithm is very dumb: it does what you ask it to do, it optimizes a metric; it doesn't ask whether it's actually worth doing. So yeah, a little cautionary tale, and we'll see later on, at the end of the course, what methods we have to combat that, and to decide pragmatically whether or not we should add things to our model. So, on our dataset, this is what it could look like. We've seen here our first actual linear model, so let's spend a bit of time on it. From statsmodels.formula.api, which I import as smf, I use the formula interface of statsmodels. This lets us create models using a nice formula syntax, exactly as they do in R, and I find it lets you build your models fairly fast, add interactions quite easily, and so on. So I first create a model with smf.ols, OLS for ordinary least squares, which is what we are going to play with. And the formula says height as a function of shoe size: it's y, tilde, and then the covariable. If you have more than one covariable, you would write it, for example, like this, with a plus: here I would add the weight, here I could add the height of the father, and so on and so forth. So it's fairly easy to complexify a model; we'll see that just later on. And then, of course, you specify that the data should be found in this data frame. This forces you to have all the data neatly packed into a single data frame, but that's not too demanding and I think fits most cases. At this point the model has just been created; so far it's a blank canvas. We need to fit it first: in this step, the model will meet the data, if you will, and we will find the best values for the parameters, which you then get back in this results object. It contains a lot of things, among others a bunch of quality metrics, all the coefficients of your model, and so on and so forth: all of what you need for the analysis afterwards. Then I get predictions from my model: I give it the data frame that we used to fit the model, and I get what my model predicts for the y's, so I know what my model thinks the heights are, and I can then relate that to the reality. Here I compute my R² and my mean squared error externally, but you should know that I don't really have to, because they are actually packed somewhere inside this results object; if you dissect it a little, you will find them. Then I do a bunch of little plots; we'll look at them in just a moment. And the very, very nice function which I love on this results object is the summary function, which gives you almost all of the information that you need to evaluate your model. Let me make sure that I execute this thing so that I know it's live. Okay, so it looks like this. The basic pattern, in short, is the sketch below.
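The pattern just described, as a minimal sketch (the column names are assumptions about the data frame; in the real dataset they may differ):

```python
import statsmodels.formula.api as smf

# response ~ covariable; add covariables with "+",
# e.g. "height ~ shoe_size + weight"
model = smf.ols("height ~ shoe_size", data=df)
results = model.fit()            # the model "meets the data" here

predicted = results.predict(df)  # what the model thinks the heights are
print(results.summary())         # coefficients, R^2, diagnostics, ...
```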
Okay, so it looks like this. You have first a frame with a bunch of information, starting with your model itself: it is about height, it is an ordinary least squares. This is when it was created, just in case I have several versions of the model; that can help me organize them a bit better. This is the number of observations in the model: I have 47 data points. This is the degrees of freedom of the residuals, which is the number of observations minus the number of coefficients in the model. Here I have a single co-variable, so an intercept and a slope, so two parameters: 47 minus two, 45. Then the degrees of freedom of the model is the number of parameters minus one: two parameters minus one, that is one. Covariance type we are not going to care about today. Then you have your R-squared, and your R-squared adjusted for the fact that you have two parameters, again intercept and slope. So you see here that our model explains about two-thirds of the variance in height, which is actually not so bad, right? We have explained two-thirds of the height; one-third remains that we cannot explain with this, but we have only used shoe size, which is just one variable among many. Then we have an F-statistic, which I have not talked about much; it is related to the adjusted R-squared and is used to compute a p-value that tells you whether or not the R-squared is significantly different from zero. Here, with a p-value of 10 to the minus 12, you would reject the null hypothesis that it is zero.

Then there are these three values, which we will come back to slightly later. The log-likelihood is a metric that we get from a model whose errors would follow a normal distribution. If that were the case, then the probability of obtaining the data, according to this model, would be 10 to the power of that value (it is the log of a probability). By itself it is not easy to interpret; it is not made for that, because it is not scaled by the number of points and so on. But it is a metric that we can use to compare different models to one another, by taking the difference between the likelihood of one model and the likelihood of another. The AIC and BIC I will not cover today; suffice it to say that they are related to this likelihood and to the number of parameters in your model, and that they are other methods used to compare models. So much for the first frame.

The second frame gives you the different coefficients. So you know that your intercept is 72 and the shoe size coefficient is 2.4. You have the standard error around each, and then, under the assumption that the residuals are normally distributed, we can use each coefficient and its standard error to run a t-test checking whether it is significantly different from zero, and to compute a 95% confidence interval around it. So for our slope, which we see represented just here, it says that the intercept is 72, with a confidence interval between 51 and 94. It could shift around in there, but it is centered on 72, for a shoe size of zero. And the shoe size coefficient is 2.4, between 1.9 and 2.9. Basically, that means that for one shoe size of difference, say going from shoe size 36 to 37, you would expect a difference in height of 2.43 centimeters. That is how one can interpret it. And you see that these are quite significantly different from zero: the full p-value is not reported, but it is lower than 0.001.
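All the numbers in those first two frames can also be pulled out programmatically; a quick sketch of the corresponding statsmodels attributes on the fitted results object from above:

```python
import statsmodels.formula.api as smf

results = smf.ols("height ~ shoe_size", data=df).fit()  # same fit as before

print(results.nobs, results.df_resid, results.df_model)  # 47, 45 and 1 in our example
print(results.rsquared, results.rsquared_adj)            # raw and adjusted R^2
print(results.fvalue, results.f_pvalue)                  # F test of the model against zero
print(results.llf, results.aic, results.bic)             # log-likelihood and friends
print(results.params)                                    # intercept and slope
print(results.bse)                                       # standard errors of the coefficients
print(results.conf_int(alpha=0.05))                      # the 95% confidence intervals
```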
Last, but definitely not least, we have a third frame. Each of its values means something slightly different, and I will describe them in a moment, but they all relate to the evaluation of the assumptions of the linear model. On top of that, what I like to do is also print a summary of the errors of the model. The error is the difference between the red line and each of the points, so we can check, for instance, that the mean error is very close to zero: here 10 to the minus 13, so very close to zero. And you can check whether the errors are roughly equally spread around zero; here the first quartile is minus two and the third quartile is 3.4, so it looks fairly balanced. Again, we do not have that many points, so some variation is expected. I can also plot the true height versus the predicted height, and you can see that it seems to follow a one-to-one relationship, not too bad, with some noise around it. And then there is the plot we mentioned before, the predicted values along the x-axis and the errors along the y-axis, which we use to check whether there is a pattern of heteroscedasticity. And here the errors are displayed spread out, so we can try to detect, visually first, whether we have something close to a normal distribution or something really unlike one.

So with this we can already visually assess some of the assumptions of the model, in particular the spread of the errors and the heteroscedasticity, and also normality, which is not strictly necessary but nice to have. On top of this we have this part here. The first two elements, omnibus and probability of omnibus, are the test statistic and the associated p-value of a test of normality. It is not the Shapiro-Wilk, it is another one, but it is also perfectly valid. Then you have skew and kurtosis. The skew tells you whether your errors are biased toward negative values or toward positive values; ideally you would like something close to zero. The kurtosis tells you whether you have something like a normal law with the same amount of tail, or heavier or lighter tails, that is, whether extreme events are more likely or less likely than under a normal distribution. The ideal kurtosis, the kurtosis of a normal distribution, is three. So here you could say that we might have a little bit of skew, which is what we see with the small excess of positive values, and a kurtosis close to what we would expect from a normal distribution. That being said, we only have 47 points, so a certain amount of variation is expected. And you can see that the test of normality fails to reject the null hypothesis of normality for the residuals.

Then comes the Durbin-Watson statistic, which tests for autocorrelation. Remember, that is another one of the assumptions of the linear model. A value of two means no autocorrelation, and less than one means that there is some autocorrelation, which would mean that the model is mis-specified and you should not use a linear model. Here, at 1.8, we are close to two, so that is good. Then the Jarque-Bera test checks whether the skewness and kurtosis are good enough for a normal distribution, so that is another test of normality.
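Those diagnostics can be reproduced by hand from the residuals; a minimal sketch (durbin_watson and jarque_bera are statsmodels helpers, the plots are plain matplotlib, and `results` is the fitted model from above):

```python
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = results.resid
print(resid.mean(), resid.quantile([0.25, 0.75]))  # mean should hover around zero

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
axes[0].scatter(results.fittedvalues, resid)       # look for heteroscedasticity patterns
axes[0].axhline(0, color="red")
axes[1].hist(resid, bins=15)                       # eyeball normality
plt.show()

print(durbin_watson(resid))                        # ~2 means no autocorrelation
print(jarque_bera(resid))                          # (statistic, p-value, skew, kurtosis)
```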
And this second test of normality also has a p-value which is quite high. Then the condition number tries to evaluate the dependency between the co-variables. Remember the idea that there should not be a linear relationship among your co-variables themselves, because that would break down the entire method we use to estimate the slopes. Do not worry too much about this one: if it is extremely high, close to infinity, that is bad, and if it gets above a certain threshold, you will get a warning when you call the summary, telling you that you should maybe check that there is no linear relationship between the co-variables. So you will be warned if you have to care about it.

Last but not least, one thing which I think is consistently missing from this summary, and should maybe be added, is a test for homoscedasticity. Of course we checked it visually here, but a test would be good as well. To me it is a bit like testing for normality: you need both a visualization and a test. And for homoscedasticity, or heteroscedasticity, you also need both, because the test has homoscedasticity as its H0, so you cannot rely on the test alone. Statsmodels does provide such a test; it is not in the summary, but it does exist: het_white, White's test for heteroscedasticity. You give it elements found in the results and the model: the residuals of the model, and the exog, which is the co-variable matrix. It returns a bunch of values, and the ones you are particularly interested in are the LM statistic and the LM p-value; the other two relate to an approximation, so it is the LM p-value that you want. And here we get a p-value of 0.93 for the test whose null hypothesis is homoscedasticity.
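The call itself is short; a sketch using the statsmodels function just mentioned, with `results` being the fitted model from above:

```python
from statsmodels.stats.diagnostic import het_white

# residuals from the fit, plus the exog (co-variable) matrix of the model
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(results.resid, results.model.exog)
print(lm_stat, lm_pvalue)   # H0 is homoscedasticity; on our data p was about 0.93
```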
Okay, and you have seen that here we made a very classical error; that is also why I wanted to show it to you. I read this summary from top to bottom. That means that I started with the quality of fit and the interpretation of my results, and only then did I arrive at the part about checking the assumptions of the model. So I interpreted the model before checking whether it actually made sense to build it. And that increases the risk of biasing myself in advance, of falling in love with my model, so that when I later see that maybe it is not so good, I am hesitant to throw it away. It is kind of a mistake to have the assumption checks at the very bottom; I think it does not promote the best sort of behavior, but it is what it is. Anyhow, now here is my question to you. Scroll back up and down, evaluate these different numbers, everything I said and the plots I created, and then decide: what is your conclusion on the model? Do you think that the model is correctly specified and that the assumptions are respected? Or do you think that one of the assumptions is not respected at all and that we should throw the model away? Put a little green tick if you think that the model is correctly specified and no assumption is breached, and a little red cross if you think that, no, the assumptions are breached and we should throw it away. And if you have any questions, of course, do not hesitate to ask.

Okay, I would say most of you have voted, and the rest are maybe either a bit lost or, I know, it is getting late on the second day of the course, people are getting tired, so no worries. And you voted only yes, and indeed. We have here something that says there is no significant departure from homoscedasticity, which we can also check visually: there are a few points that are a bit out there, but not many, and nothing jumps out at me as completely heteroscedastic. The errors are quite centered around zero; the mean error is very close to zero, so that is good as well. Then we can also check the omnibus, the skew and so on: is it really not normal? Normality is not an assumption, but it is nice to have in order to interpret the confidence intervals around the coefficients. Here it is not seen as significantly different from normality, and it is kind of okay for 47 points. So I would say, okay, yes, why not. My autocorrelation metric says that there is no particular autocorrelation, so that is good also. And the condition number is a bit high, but not crazy high, which tells me that there is no particular linear relationship among my co-variables. And that would be weird here anyway: since I have a single co-variable, just shoe size, I cannot have a linear relationship among my co-variables. So overall the model is correctly specified, I can keep it, I can be happy with it, and then interpret it: yes, I explain 66% of the variance, and here are the coefficients, and so on.

All right. From there, once you have done that, you can of course complexify your model. You can ask: what would happen if I used shoe size squared and shoe size cubed? For instance, here is what we get if we create a model where y depends not just linearly on x but on x to the power of three: it works exactly as before, in that with no noise you fit the curve perfectly, and with more and more noise you fit less and less well, until you are actually unable to find your relationship back because there is too much noise. So we try the same on shoe size; whether or not that makes sense is something we will decide a bit later on. The idea is that you make a copy of your data frame, add columns for the shoe size squared and the shoe size cubed, computed from the original column, and then create a model: height depends on shoe size plus shoe size squared plus shoe size cubed. You fit that and report the results as before. You can see here what the model looks like, with the prediction of the model in red, and you can see that it becomes maybe a bit weird, this model. If I were to extend this curve, it might actually start to go down, okay? And that is one important thing to recognize with these linear models: they are not so bad at what we call interpolation, predicting between the boundaries of the data they were given, but they say nothing about extrapolation, about what happens outside of those boundaries, and there is absolutely no reason why they should, or why you should trust anything outside of those boundaries. So be quite careful about that when playing with these models and making assertions with them, all right?
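The cubic-in-shoe-size model just shown would be built like this; a sketch, again assuming the column names height and shoe_size:

```python
import statsmodels.formula.api as smf

df2 = df.copy()
df2["shoe_size_sq"] = df2["shoe_size"] ** 2
df2["shoe_size_cube"] = df2["shoe_size"] ** 3

cubic = smf.ols("height ~ shoe_size + shoe_size_sq + shoe_size_cube", data=df2).fit()
print(cubic.summary())
```

Equivalently, you can skip the extra columns and write I(shoe_size**2) and I(shoe_size**3) directly in the formula.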
And then, of course, what we do most of the time is not to add squared and cubed relationships but to add more co-variables. So for instance, here I add the height of the mother on top of the shoe size: I say height depends on shoe size plus height of the mother, and I fit it exactly as before. There you get your visual output. Now, we are not going to make the same mistake as before: we look at the diagnostics first, okay? Here we see that the errors look spread around zero; the skew is minus 0.7, the kurtosis is close to three, not too bad. The Durbin-Watson is at two, so no autocorrelation. The two tests of normality have fairly low p-values, so maybe we are not so close to normality here, and that might be a small issue for the interpretation, but it is not necessary for the model itself. The condition number is a bit high: there might be a linear relationship between shoe size and height of the mother. If we went back to our correlation heat map up there, we would indeed say these two are a bit correlated, but they are not perfectly correlated, so that does not pose a problem by itself. It is only an actual collinearity, a perfect collinearity, that is really a problem. So for the moment that is fine, and that is how the warning is phrased: it says it 'might indicate'. It is not necessarily problematic; they just say, okay, maybe think about it and check. We could also run our heteroscedasticity test if we wanted; here it is fine. So we can say the model is about correctly specified, and we can actually interpret it.

And we see that our adjusted R-squared has gone from 0.66, if you remember, to 0.69. So by adding the height of the mother, which, if you remember, was not hugely correlated with the height, we have gained something like 3% of explained variance. Our model is now slightly better, and our coefficients are there. If we feel we can trust the residuals as normal enough, since there was no clear rejection of normality, we can see that the intercept is now not significantly different from zero, which might make sense: a shoe size of zero would mean, so to speak, no feet, so the height extrapolated there does not have to be meaningful. The shoe size coefficient is fairly close to what we had before and goes from 1.8 to 2.8. And the height of the mother contributes not necessarily a lot: each centimeter of height of your mother contributes 0.27 centimeters to your height. And you see that this coefficient is significantly different from zero, but not by much: this is a 95% confidence interval, and if it were a 99% confidence interval it would actually cover zero, because the p-value here is 1.6%.
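A sketch of this two-co-variable model, assuming a hypothetical column name height_mother for the height of the mother:

```python
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

results2 = smf.ols("height ~ shoe_size + height_mother", data=df).fit()
print(results2.summary())   # read the diagnostics frame first this time, then interpret

# the true-versus-predicted plot works whatever the number of co-variables
plt.scatter(df["height"], results2.predict(df))
plt.axline((0, 0), slope=1, color="red")  # points should hug the diagonal
plt.show()
```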
Okay. Now, you can see this in the plots as well, but it becomes a bit harder to actually plot the model results, because you now have the height of the mother, the shoe size, and the actual height, so we would need a 3D plot, and those are not so easy to look at. And if we had one more variable we would need a 4D plot, and I am sure that is a headache; I have never seen one and I am not sure I ever want to. That is why we use our plot of true height versus predicted height: irrespective of the number of co-variables you have, it should follow the diagonal line. Here we can say it follows something close to the diagonal line, so it is quite good. All right, I know this is a lot of information; how are we doing so far? Everything good? Okay, perfect. It is 3:20, so I think it is high time we have our afternoon break. I am going to pause the recording.

Okay, so we have tried to fit a new model with both the shoe size and the height of the mother to predict the height, and we have seen that we went from an adjusted R-squared of 0.66 to an adjusted R-squared of 0.69. So adding a new variable has bought us about 3% of adjusted R-squared. And you could legitimately ask yourself: is it really worth it? We saw that this new parameter is not so significantly different from zero; do we actually gain enough from it? Because, as I said, if you keep adding more and more co-variables, even nonsensical ones, the R-squared will keep climbing, but your model will start to hallucinate relationships and will simply overfit the data. So we need a mechanism for deciding whether a gain in adjusted R-squared is worth it or not.

There are several frameworks for this sort of decision. Let me introduce just one, which is also quite useful if you later want to build on it and understand GLMs, and that is maximum likelihood, which I briefly mentioned before. The idea of maximum likelihood is that you want to build your model such that it maximizes the probability of observing the data given the model, okay? As I said, this is related to the p-value, which is the probability of observing the data given a null hypothesis; it is just that here we move our hypothesis around so that this probability becomes maximal, if you will. And typically we measure that not with the raw probability but with the logarithm of this probability, because otherwise it would be a very, very small fraction. The logarithm, in particular the logarithm base 10, is a bit more interpretable, because you get something like minus 139 rather than 10 to the power of minus 139, which would be so tiny that we would have difficulty interpreting it. So in maximum likelihood, you shift your model parameters until you maximize this probability of observing the data. To get a visual representation of that, let's say that we have simulated some data from a normal distribution with two parameters, mu and sigma.
And the real values are mu equals two and sigma equals 0.5. So we have some simulated data, and then we do the same as before: I try different sets of parameters and I score them using the log-likelihood, the probability of having observed this data under this set of parameters. When I use the correct set of parameters, mu equal to two and sigma 0.5, the likelihood of each point is its density according to the normal distribution with those parameters; the likelihood of each point is the red bar, and the log-likelihood is the sum of the logs of each of these bars. Here you see that we have large bars, so our log-likelihood is fairly high. But if we use a set of parameters farther away from the real ones, for instance here, where I just shifted mu to 0.5, then the bars are much, much smaller. That means their log-likelihood is lower, and we get a sum of log-likelihoods of minus 44, so it is worse: the first set is better than this one. So far, so good.

Then of course we want to do exactly as before: we will test many, many different values. But remember that sometimes the optimum will, just by luck, be slightly away from the actual real density; you just hope it is not too far away. For instance, here you see the actual data, with 800 points generated, and the theoretical density of the distribution I used to generate them, and you see that they coincide almost, but not 100%. So we test many values for mu and many values for sigma, and for each pair of values we compute the log-likelihood. If we keep all of these and plot them, we get what we call a log-likelihood landscape: for each value of mu and of sigma, a log-likelihood. Here this is a static image; there is a library called Plotly that can create a dynamic visualization of this, but it can be difficult to install, which is why I did not make it a requirement here. If you are curious, you can try it after the course.

So let us look at the likelihood landscape. Here this axis is sigma, and this one is mu. You can see that a sigma which is too low has a very, very low log-likelihood, so it is not very likely, but when it is higher, it is a bit better, with a higher likelihood. Let me zoom in a bit. And you can see this sort of profile, this curve here, such that the likelihood becomes maximal when mu is at two, and as we move away from two, it degrades. What you want is the best possible likelihood, which is around here, for a mu of about 2.1 and a sigma of 0.5. And, let me turn it a bit, you can see that you do not get just a single best point: there is also an idea of curvature around this maximum. The landscape favors some values over others, and the log-likelihood degrades slowly in one direction and very fast in the other. Okay, so far so good.
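Here is a minimal sketch of how such a log-likelihood landscape can be computed, with simulated data and the true mu of 2 and sigma of 0.5 from the slide:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=800)     # the "true" parameters

def loglik(mu, sigma):
    # sum of the log densities of every point under N(mu, sigma)
    return norm.logpdf(data, loc=mu, scale=sigma).sum()

mus = np.linspace(1.0, 3.0, 81)
sigmas = np.linspace(0.2, 1.0, 81)
landscape = np.array([[loglik(m, s) for s in sigmas] for m in mus])

i, j = np.unravel_index(landscape.argmax(), landscape.shape)
print(mus[i], sigmas[j])   # lands close to 2 and 0.5, but rarely exactly, by sampling luck
```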
So that is a slight shift in perspective about how we score, if you will, and how we would find the best slope for our model. In the first case we were just using the squared error, and in the second case we are optimizing the probability that our data was generated by a model that presumes a normal distribution. So this is a case where we presume a normal distribution, and that becomes an important assumption. And in the specific case of OLS with normally distributed errors, it so happens that maximum likelihood and least squares coincide. I give here a small demonstration of that. To be very quick about it: this is the likelihood of a model with normally distributed errors, the sort of likelihood we would try to maximize. Summed over the points, it looks like this. You can see that in the end some terms are constant: they depend on sigma, but not on where you put your slope. And then you have this sum that depends on the difference between the observations and the predictions of the model, with a little minus sign in front. So maximizing this thing coincides with minimizing the difference between observations and predictions, which is exactly least squares: least squares minimizes that quantity, and maximum likelihood maximizes the same quantity with a minus in front. Hence their best solutions coincide in this very particular case.

This little property is why we are also happy when the residuals of the model are close enough to normal: not only can we use the t-test to play with our parameters and test their significance, we can also use this log-likelihood metric to choose between different models. So, as I asked: why is it worth it to add the height of the mother? Well, now we can test it. We introduce the likelihood ratio test. The idea is this: imagine that you have two models, one with just the shoe size and the other with shoe size and height of the mother. We say that these two models are nested, because one is a generalization of the other: you can write the model with shoe size and height of the mother, and if you set the coefficient of height of the mother to zero, it is equivalent to the model with just shoe size. So far, so good? I know it is getting maybe a bit late; we have a couple more things to see, and then you will get plenty of time to apply all of these concepts.

So here we have two models, one with just one co-variable and the other with two, and we want to compare them. The likelihood ratio test says that you can take the log-likelihoods of these two models, take the difference between them, multiply this difference by two, and that quantity should follow a chi-square distribution with a number of degrees of freedom equal to the difference in the number of parameters between the two models.
So basically, between, for example, our model with shoe size, which has two parameters, the intercept and the coefficient for shoe size, and a model with shoe size and height of the mother, which has three parameters, intercept, shoe size coefficient and height of the mother coefficient, the difference in degrees of freedom would be one: three parameters versus two. All right, so let's try this out. We have on one hand our shoe size model; we check all the assumptions, and they say they are okay. We want to contrast it with the same thing but with the height of the mother in the mix. We check the assumptions of that model too and say, okay, we are happy, but is it worth it? We then take these values: here the log-likelihood is minus 142, and here the second log-likelihood is minus 139, so slightly better. You can grab this from the results object with .llf, for log-likelihood function. So, minus 142 and minus 139: you take the difference, you multiply by two, and you check that against a chi-square distribution. To get the p-value, it is one minus the CDF of the chi-square distribution, with a number of degrees of freedom equal to one, which is the difference in the number of parameters included, basically this minus that. If we do that, we see that the test statistic is six, the difference between those two log-likelihoods multiplied by two, and this corresponds to a p-value of 0.012. So, depending on your threshold for the p-value, for instance if your threshold is 5%, you would say: this second model, with the height of the mother, explains a significantly higher portion of the variance than the one with just shoe size; the log-likelihood is significantly higher, so I would tend to keep the new model. But the p-value is also not that small: if your threshold were 1%, you would not make that bet.

Yes, Rosario? Before, you mentioned that if we add variables or columns to our dataset and run the fit, the R-squared should climb, right? Yes. And to understand whether this climb is meaningful or not, can we use this likelihood ratio test that you just showed? Yes, this is used to determine whether the climb is significant, if you will. Significant as in the amount, or significant as in it makes sense? Significant as in it is more than what you would get if you had just added a nonsense column. Ah, okay, perfect. I understand, thank you.
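The whole likelihood ratio test fits in a few lines; a sketch assuming the same df and the hypothetical column name height_mother:

```python
import statsmodels.formula.api as smf
from scipy.stats import chi2

results_1 = smf.ols("height ~ shoe_size", data=df).fit()                  # 2 parameters
results_2 = smf.ols("height ~ shoe_size + height_mother", data=df).fit()  # 3 parameters

lrt_stat = 2 * (results_2.llf - results_1.llf)  # .llf is the log-likelihood of a fit
p_value = 1 - chi2.cdf(lrt_stat, df=1)          # df = difference in parameter counts
print(lrt_stat, p_value)                        # about 6 and 0.012 on the course data
```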
All right. Before we go and apply all of this in our last exercise, let's briefly discuss what to do when some of the assumptions of ordinary least squares are not true. Maybe you have heteroscedasticity, maybe there is autocorrelation, and so on, and you have to handle that. One common trick, if you have heteroscedasticity, is to apply a transform to the data. Some data does not make much sense to model raw, but if you log-transform it, a linear model makes much more sense. That is something to try, and when you know your data, you may already have an idea of whether it makes sense. Remember that in the original space the quantities relate by summation, but if you log-transform them and then fit a linear model, they relate by multiplication, so you will have to think of the effects as sorts of ratios from then on, if you will.

Then there are methods other than the pure OLS, different kinds of alternatives. For example, if you have a good idea of what governs the heteroscedasticity underlying your dataset, there are methods such as weighted least squares, which give different weights to different parts of the curve, based on your idea of how the spread changes along the curve. This you would only use if you have a clear idea of what governs the heteroscedasticity in your data. If you have autocorrelation, you have to use methods specific to time series and autocorrelated datasets. And last but not least, if you know that you are modeling a quantity that will not work well with a linear model, for instance a binary variable, something with just two categories, you might switch to another kind of model, for example a logistic model, and you would enter the world of the generalized linear models. These work, in some sense, quite similarly to the simple linear model; it is just that their likelihood functions are a bit different and we have to interpret them using slightly different metrics, but the underlying concepts are more or less the same in the end. Most of these are implemented in statsmodels, which offers, I think, a fairly good interface to create, evaluate and play around with these models. I find that their documentation is not the best, but the library works very, very well.
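To make these alternatives a bit more concrete, here is a sketch of the two easiest ones: a log transform before an ordinary fit, and a logistic GLM for a binary outcome. The columns weight and smoker are assumptions about the table, and smoker would need to be coded 0/1 for this to run:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# log transform: effects then combine multiplicatively, think ratios
df["log_weight"] = np.log(df["weight"])
res_log = smf.ols("log_weight ~ height", data=df).fit()

# binary outcome: switch to a generalized linear model, here logistic
res_logit = smf.glm("smoker ~ height", data=df,
                    family=sm.families.Binomial()).fit()
print(res_logit.summary())
```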
Okay, so that is most of what I wanted to say here, and now we arrive at the last exercise. Using everything we have seen before, and I know this was a big chunk of information, but we have plenty of time to play around with it: first recreate a model that describes height as a function of shoe size, and then try to use the following subset of variables, shoe size, height of the mother and number of siblings of the father, to find the best model to predict height. I will even extend that a little: if you want, also add the weight on top of that, or any other variable you feel is worth it. Have fun, try to explore, and remember to check the assumptions of the model each time; use the likelihood ratio test to differentiate between the models and choose the one that fits best. And yeah, that is about it. All right, so as usual, I will pause the recording.

Okay, ah, great, thank you for sharing. So let's approach this problem. First, let's say I restrict myself to the shoe size, height of the mother and number of siblings of the father columns; the process with more columns is just more of the same in the end, it is not so different. I will apply a very simple approach: I first look at the correlation between all these variables, just to see if there are any huge correlation issues. Here I can check and see that the correlation between all of these is not extremely high, except maybe between height and shoe size, but that is actually good for me, because I want to predict height. I also check that there is no very obvious huge correlation among my co-variables, which is the case here.

Then I say to myself, okay, I am going to apply this very simple procedure, which is not always the best, but which works as a first approach: adding the variables one by one. I first add the variable that is most correlated with my target variable and then the ones that are slightly less correlated. So I would first take, I think, the shoe size, then the height of the mother, then the number of siblings, by decreasing absolute correlation. Alternatively, if I did not have the correlations, I could fit a model with each single variable; that is what I do here. For each variable taken in turn, I fit a model of height with this single variable and record the log-likelihood, then order them by log-likelihood in decreasing order. And I say, okay, I will add my co-variables in this order: shoe size, height of the mother, number of siblings of the father. So I want to build three models: first with just shoe size, then with shoe size and height of the mother, then with shoe size, height of the mother and number of siblings of the father. That is what you see here: for each co-variable I add, I build a different formula. Now I do not just want to build a formula, I want to actually use it. So for each of these variables, I build a formula, I feed the formula I just built to an OLS object, I fit that thing, I grab the results, I print the results, I run a little homoscedasticity test to complete it, I keep the log-likelihood somewhere, and I draw some diagnostic plots.

All right, so when I launch that, I get three models: one with just shoe size, one with shoe size and height of the mother, and one with all three. I first look at the model with just shoe size. We have done that already; we have checked that all the assumptions were respected. I also have my plots; unfortunately, they are all pushed to the end, so we have to scroll up and down, but we can check that we might be happy with the assumption of normality of our residuals, which is important, because we intend to use a likelihood ratio test to compare the models, and that makes the assumption of normality. So we actually need this assumption; here it is good. Then, when we add the shoe size and height of the mother, we see that we are still not breaching the assumption of normality, but we have a high condition number there; the homoscedasticity test is okay, and the autocorrelation is okay, which we can also check visually here. And last but not least, when we also add the number of siblings of the father on top, we again have something that is quite okay. We can check all the numbers again, and we have this little warning about the condition number, which means there might be some collinearity between our co-variables. Here, because we have more than two co-variables, that is actually something worth looking for: with only two co-variables, as long as they are not 100% the same, you are fine, but with more than two, collinearity might exist.
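The ranking-and-nesting procedure just described can be written as a small loop; a sketch with hypothetical column names for the three co-variables:

```python
import statsmodels.formula.api as smf

candidates = ["shoe_size", "height_mother", "n_siblings_father"]  # hypothetical names

# rank single-variable models by their log-likelihood, best first
order = sorted(candidates,
               key=lambda c: smf.ols(f"height ~ {c}", data=df).fit().llf,
               reverse=True)

llfs = []
for k in range(1, len(order) + 1):
    formula = "height ~ " + " + ".join(order[:k])   # grow the nested formulas
    res = smf.ols(formula, data=df).fit()
    llfs.append(res.llf)
    print(formula, res.llf)   # remember to check the diagnostics of each fit too
```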
So let's check for that quickly. With three co-variables, the idea is simple: we just try to model shoe size from the height of the mother and the number of siblings of the father. If we were able to get a near-perfect model out of that, it would mean we have collinearity. I fit the model, and without even looking at the rest, I can see that my R-squared is not one; it is very, very small. So there is no collinearity here, I am fine.

So now I have fit my models, I have checked that I can call them good linear models, and that the assumption of normality was not too crazy for all three. That means I can apply a likelihood ratio test to them. For each model I have kept the log-likelihood, which are these. And you can see that when I add the last variable, there is actually no visible change in the log-likelihood; if I increase the number of digits after the comma, I might see a difference. Yeah, here you can see it increased from minus 139.86 to minus 139.85, so a very small increase. I can then run a likelihood ratio test for each new model: starting with the second model, I compute the difference in log-likelihood of my second model versus the first, and of the third versus the second, multiply by two to get the LRT statistic, compare it with the chi-square, and print the statistic and the p-value. What we see is that adding the height of the mother to the mix significantly increases my log-likelihood, if our threshold is 5%, because this p-value is 1.3%; but adding the number of siblings of the father to the mix has a p-value of 0.88. If I put that in non-scientific notation, hop, hop, hop, you see my p-value is quite high. So here I fail to reject the null hypothesis that there is no difference in log-likelihood between these two models, at least no significant difference. And so I would say that adding the number of siblings of the father needlessly complexifies the model: I added one parameter, but it is not worth it in terms of the increase in log-likelihood, in terms of our power to explain the data. Is that good? Does that make sense for everyone? Yes? Okay, perfect. So with that, I have covered most of the content that I wanted to show during this course. I know that it has been fairly dense.
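Put together, the collinearity check and the chain of likelihood ratio tests from this correction look roughly like this; a sketch reusing the hypothetical column names and the llfs list from the loop above:

```python
import statsmodels.formula.api as smf
from scipy.stats import chi2

# collinearity check: try to predict one co-variable from the others;
# an R^2 close to one would mean (multi)collinearity
aux = smf.ols("shoe_size ~ height_mother + n_siblings_father", data=df).fit()
print(aux.rsquared)   # very small here, so no collinearity worry

# likelihood ratio test of each model against the previous, smaller one
for k in range(1, len(llfs)):
    stat = 2 * (llfs[k] - llfs[k - 1])
    print(k, stat, 1 - chi2.cdf(stat, df=1))  # ~1.3% for height_mother, ~0.88 for siblings
```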
I have maybe one little thing to add on top of this correction, because, as it stands, I must say that I used a relatively crude, if simple enough, method of just taking the variable with the highest correlation and then trying to add things on top. But sometimes you can have surprises. Let me demonstrate that briefly. Say we want to model height and we try with the height of the father. If we make that model, we see that our adjusted R-squared is actually very, very bad. So, with the method we saw before, we would not try to add the height of the father. Now imagine we did the same thing with the height of the mother: again, the adjusted R-squared is bad, so we would not want that either. But now see what happens if we take the height of the mother and the height of the father, and on top of that we add the gender, where this star here adds what we call an interaction term. That means that the contributions of the height of the mother and the height of the father are not forced to be the same depending on your gender: maybe if you are female, the height of the mother factors in more than if you are male. That is the possibility we add there. And then, suddenly, our adjusted R-squared jumps to 0.63, even though we used variables that, taken in isolation, have very, very low adjusted R-squareds.

And we can read the coefficients here: as usual we have our intercept, then the effect of being male, then the effect of the height of the mother, and then the effect of the height of the mother when male, that is, the difference in slope when male. So when you are male, apparently the height of the mother influences more, at least that is what it seems to say, although here it is not significantly different from zero, so take it with a grain of salt. And then we have the difference in slope for the height of the father when male. That is how to interpret these interaction terms between the gender and one of the co-variables. All that to show you that it can be a bit more complex than just taking the top correlators and always adding the next one on top. Sometimes you have to put more advanced methods in place, and it is certainly not always easy: when you have a lot of co-variables, trying all combinations, especially if you also account for interaction terms, can be quite time consuming. You have two to the power of the number of co-variables to test without the interactions, and with the interactions you add even more on top of that.

So there are heuristics to go about it. One, which I showed you, is to take the top variable and then add the others one after the other. Another is to take everything and see what you can remove: you start from the bigger model and you decrease, decrease, decrease, until you reach a point where removing a variable would significantly change your log-likelihood. So there are different strategies. And last but not least, if you want to go further and make sure you do not overfit your data, there are more advanced methods, in particular one called regularization, where we do not just optimize the log-likelihood or the squared error, but also add a small penalization that depends on the parameter values. There are a few of these you may already have heard a little about: the L1, or lasso, regression; the ridge regression, for L2; and the elastic net, which is a mix of the first two. Basically, these penalize the coefficients. The L1 regression will try to push some of them to zero, which is very good if you want to eliminate noisy, not very informative co-variables, if you will. The L2 will make sure that no coefficient dominates too much, unless it really is that explanatory; it is a bit better when you have a lot of co-variables that are somewhat correlated, or that contribute similar things to the target variable, right? I will not go into much more detail there.
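For reference, the interaction model and a penalized fit would be written like this; a sketch with the same hypothetical column names (height_mother, height_father, n_siblings_father), where gender is the categorical column from the table:

```python
import statsmodels.formula.api as smf

# `*` expands into both main effects and the interaction term
res_int = smf.ols("height ~ gender * height_mother + gender * height_father",
                  data=df).fit()
print(res_int.rsquared_adj)   # jumps, even though each parental height alone was weak

# penalized fits are available on the same model object
model = smf.ols("height ~ shoe_size + height_mother + n_siblings_father", data=df)
lasso = model.fit_regularized(alpha=0.1, L1_wt=1.0)  # L1_wt=1 is lasso, 0 is ridge
print(lasso.params)           # uninformative coefficients get pushed toward zero
```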
You can use statsmodels to actually try them out; if you want, you have to look up the documentation, which explains a little how to use them, okay? And if you want to go further still and make sure you do not overfit, you then have to move from statistics to what we would call machine learning, where we put in place a whole set of tools to control for overfitting and give more robustness to our models. Okay, so do we have any questions at this point? All right, if not, then I think this is about the end of the course for me. I will stay around to answer any questions you may have, either on the course topic or on other things. I want to thank you all for your attention. I know this has been a fairly dense course. I think one big take-home message, if you want to remember only a few things from this course, is that our tests and our p-values, on which I would say the majority of today's science is built, they are pervasive, do not come from nowhere. They come with clear assumptions and clear properties, and they try to help you quantify the bet you are making every day when you make decisions about your data and about what your data may mean. So remember that. Remember to always take these with a grain of salt and to be as honest as possible when reporting and thinking about your data and your results. If you do, I think you, and all of us, will have gained a lot from this course. I hope this was pleasant for you and that you learned a lot of things. The content will stay online, so even if you lose it for some reason, do not hesitate to come back to it. Have a great day, have a great end of day.