Let's have a look in this lecture at the very exciting topic of linear regression. We import our fancy style sheet, and off we go. What are we going to require here? We import numpy under the abbreviation np, pandas as pd, and then from scipy.stats we import this new thing, linregress. Because it is a function inside that submodule, linregress stands on its own, and I can use it as is once I've imported it this way. Then our old friends matplotlib, seaborn and warnings, the matplotlib inline magic command, and filtering out the warnings.

Off we go with linear regression. Remember, when we did the t-test we had two categorical groups, but the variable inside those groups that we wanted to compare was ratio-type numerical data. Then we did the chi-squared test, where everything was categorical data. But what if all the data is numerical? That's when we use linear regression: we compare one set of numerical values against another set of numerical values.

Just for this little example I've created regression.csv, so we're not using our usual mooc_mock data set; we'll come back to that at the end. We read regression.csv in with read_csv, into a computer variable called regdata, which becomes a pandas DataFrame. Let's run it and have a look. There we go: it has two columns, PCT and CRP, procalcitonin and C-reactive protein.

Now, what does linear regression do? It simply plots these two values as x and y.
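The imports and the read step above can be sketched as follows. The lecture's regression.csv isn't available here, so this sketch reads a tiny inline stand-in with hypothetical values instead; the column names PCT and CRP are from the lecture, everything else is made up.

```python
import io
import warnings

import numpy as np
import pandas as pd
from scipy.stats import linregress   # note the name: linregress, one word
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')    # in a notebook you'd also run %matplotlib inline

# a tiny inline stand-in for regression.csv (hypothetical values)
csv_text = """PCT,CRP
1.2,14
2.5,
,30
3.1,28
4.0,35
"""
regdata = pd.read_csv(io.StringIO(csv_text))
print(regdata)
```

With a real file on disk you would simply call `pd.read_csv('regression.csv')`; note how the blank cells come through as NaN, which is exactly the problem the next step deals with.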
You might choose either column as x and the other as y, but you decide: this is my x value, this is my y value, and I make a plot, putting a dot at the intersection of each x value and y value. You set up x and y depending on what you want to say with your data, but more about that later; in essence, that is what is going to happen.

There is something really odd about this data set, though: we have NaN, not-a-number, values. Either a cell was left blank or something strange was typed into the spreadsheet. Here I have a CRP value but no PCT, and there I have a PCT but no CRP. Where am I going to put that dot? So it is very important with linear regression to get rid of any rows that contain non-numerical or nonsense values.

Let's go ahead and do that. I create a new DataFrame called regdata_2, and I take my original one and call regdata.dropna(). The argument it takes is subset=, followed by the two column names in square brackets. Now watch carefully: this is a slightly different way to use dropna. If any row has a value in either of those columns that is not numerical, that row is deleted.

Let's look at the data set, and pay very careful attention to that syntax. There we go, and going down both columns you'll see, for instance, that rows 11, 12 and 13 were thrown out. Now for each x value there is a y value, whichever one you choose x and y to be.

Now let's describe this; we just want to see what our data looks like. We've never looked at this data before, and it does not come from any particular patients.
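The cleaning step just described can be sketched like this; the values are hypothetical stand-ins for regression.csv, but the dropna call with subset= is exactly the pattern the lecture uses.

```python
import numpy as np
import pandas as pd

# hypothetical data with missing entries, standing in for regression.csv
regdata = pd.DataFrame({
    'PCT': [1.2, 2.5, np.nan, 3.1, 4.0],
    'CRP': [14.0, np.nan, 30.0, 28.0, 35.0],
})

# subset= names the columns dropna should inspect; any row with a NaN
# in either of those columns is thrown out entirely
regdata_2 = regdata.dropna(subset=['PCT', 'CRP'])
print(regdata_2)
```

Note that dropna returns a new DataFrame rather than modifying regdata in place, which is why the result is assigned to regdata_2.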
It's just random values. We see there are now 26 PCT values, and of course there should be 26 CRP values as well. The PCT mean is 6 with a standard deviation of 2, and for CRP the mean is 61 with a standard deviation of 25.

We can plot that with our old friend the violin plot from sns: the data I want to plot is this pandas DataFrame, grouped into its PCT and CRP columns, and I give them the names PCT and CRP. There we go. It's a bit of an odd plot, because the values are obviously not very comparable: this one goes to 150 and this one only goes to, what was the max, only 9.8. Not a very good plot to submit for publication, but it gives us an idea.

Now here comes the linear regression function, linregress. What you tell linregress is the x values and the y values: I want the PCT values from regdata_2 and the CRP values from regdata_2. Remember, there are 26 of each, and for each one there is a corresponding partner. linregress gives me back five values, and I have to unpack them in order. The first one is the slope.

Remember from school the equation for a straight line, y = mx + c, where m is the slope. The slope is how steep the line is: a line going up has a positive slope, a line going down has a negative slope, and the steeper the line, the larger m becomes. c is the y-intercept: when you make x zero, y is just the value where the line cuts through the y-axis.

That is what we're going to get, because linear regression, as the name says, draws a straight line. That line has a slope, and in order to draw it, it must have an intercept, which is why I call the second variable intercept. You can call these whatever you want, but giving them meaningful names helps. After the intercept come the r value, which we'll discuss, the p value, which we always want, and the standard error.
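The five-value unpacking described above can be sketched like this. Since the lecture's file isn't available, this generates 26 synthetic pairs that roughly follow y = 10x − 4, so the fitted numbers only approximate the lecture's results.

```python
import numpy as np
from scipy.stats import linregress

# synthetic stand-in for the 26 cleaned PCT/CRP pairs (not the real file)
rng = np.random.default_rng(0)
pct = rng.uniform(1, 10, 26)
crp = 10 * pct - 4 + rng.normal(0, 5, 26)   # roughly y = 10x - 4, plus noise

# linregress returns five values, unpacked in this fixed order
slope, intercept, r_value, p_value, std_err = linregress(pct, crp)
```

The order matters: slope first, then intercept, r value, p value and standard error; swapping the names would silently mislabel the results.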
It gives you back those five values, so let's have a look at one way to print them. I'm using the print() function here; remember, this form is exclusive to Python 3.x, so 3.1, 3.2, 3.3, 3.4, where you have to put in these parentheses. print works quite differently in the 2.6 and 2.7 versions. So: print, "The slope of the line is", comma, slope. When there's a comma, print automatically puts a space there for you, so don't worry about adding one. Then comes that funny thing, backslash n: that's a newline, so it's like hitting Return or Enter twice, leaving an empty line. The y-intercept is intercept, the correlation coefficient is that r value, r_value, the p value is p_value, and the standard error is std_err.

Let's run that. Oh, it's not going to work, because I never executed the previous cell, so Python would not know what slope is. Let's run that cell, and then this one. There we go. The slope is 10; that's quite a steep line. The y-intercept is about negative 4, the correlation coefficient is 0.96, and the p value is 1.2 times 10 to the power negative 15. That is an extremely low p value, way less than 0.05. And then the standard error, 0.56.

So let's plot this. The beautiful plot for linear regression is called regplot, sns.regplot from seaborn. It takes the important arguments below; later, with our appendicitis data set, I'll show you a lot more of its arguments, but most of those are default values. It is complicated in one way, but that makes it very powerful, because you can change so many things. These are the quintessential ones: whatever you want the x values to be, that's the PCT column of regdata_2, and the CRP column of regdata_2 will be my y values. I also want to construct a 95% confidence interval in my graph, around my straight line. That's a beautiful thing to do, and in order to do that, note the following.
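The print calls and the regplot call from this section, sketched with the same synthetic stand-in data (the fitted numbers will therefore differ slightly from the lecture's):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')             # headless backend so this runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress

# synthetic stand-in for the cleaned regdata_2 (not the lecture's file)
rng = np.random.default_rng(0)
regdata_2 = pd.DataFrame({'PCT': rng.uniform(1, 10, 26)})
regdata_2['CRP'] = 10 * regdata_2['PCT'] - 4 + rng.normal(0, 5, 26)

slope, intercept, r_value, p_value, std_err = linregress(
    regdata_2['PCT'], regdata_2['CRP'])

# commas make print() insert spaces; the trailing '\n' adds a blank line
print('The slope of the line is', slope, '\n')
print('The y-intercept is', intercept, '\n')
print('The correlation coefficient is', r_value, '\n')
print('The p-value is', p_value, '\n')
print('The standard error is', std_err)

# scatter, fitted straight line, and a bootstrapped 95% confidence band
ax = sns.regplot(x=regdata_2['PCT'], y=regdata_2['CRP'], ci=95, n_boot=1000)
```

regplot picks up the axis labels from the Series names, so the x-axis is labelled PCT and the y-axis CRP without any extra work.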
This is not the scikits.bootstrap package we used before; the bootstrapping here is built into regplot, so the two are not the same. I can just say ci=95 and n_boot=1000: please do 1000 bootstrap samples for me, and you remember what bootstrap values are.

And look how beautiful that graph is: a beautiful straight line, and you can see that most of the data points lie very close to it, hence my very low p value. The slope is 10. What does that say? Remember which way around I chose the axes: on the x-axis we have PCT, and on the y-axis we have CRP. So for every increase of one in PCT, I get an increase of 10 in CRP. The two are very well correlated, and it is a positive correlation because the slope is positive.

Now, the intercept of almost negative 4: remember, that doesn't really help us; it's just a mathematical artefact. We can't say that when the PCT is zero we have a CRP of negative 4, so you have to use a bit of common sense there.

But if we use y = mx + c for this line, we have a beautiful predictive model. We can predict the CRP: if you give me a PCT from my data set, I can tell you what the CRP is going to be, with the 95% confidence interval nicely drawn in that light blue for us. I take any PCT you give me, multiply it by 10, subtract 4, and that gives me a CRP value. A beautiful model. And note that it depends on which variable you made x and which you made y: I can deconstruct this equation algebraically to get PCT on its own, or I could simply have swapped the two variables around, which would have given me another model to work with.

Now let's look at this r_value computer variable that I got back.
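The little predictive model described above can be written out directly; the slope of 10 and intercept of −4 are the rounded values quoted in the lecture, and the function names here are just illustrative.

```python
# rounded slope and intercept quoted in the lecture (assumed values)
slope, intercept = 10.0, -4.0

def predict_crp(pct):
    """Apply y = m*x + c: multiply the PCT by 10 and subtract 4."""
    return slope * pct + intercept

def predict_pct(crp):
    """The same line rearranged algebraically: x = (y - c) / m."""
    return (crp - intercept) / slope
```

So, for example, a PCT of 5 predicts a CRP of 46, and feeding 46 back through the rearranged equation recovers the PCT of 5, which is exactly the algebraic deconstruction mentioned above.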
That's called the Pearson correlation coefficient, and it's a number between negative one and one that tells us how well correlated these two variables are. Negative one means a perfect negative correlation: the slope is negative, but all the dots fall exactly on the line. Positive one means that, irrespective of how steep the slope is, all the dots fall exactly on the line and there is a perfect positive correlation. And then there is everything in between. In this example we had a correlation coefficient... if only I could spell; you just double-click in there, coefficient, there we go, see how easy it is to correct... a Pearson correlation coefficient of about 0.97. This means the variables are very well correlated indeed, and there is a positive correlation: as one increases, the other one increases. We saw our p value there.

One very important thing to remember is the dictum: correlation does not mean causation. Just because two variables are so well correlated, you cannot say that one causes the other; it is not proof of causation. The rise in PCT does not physiologically cause the rise in CRP. You have to think about the mechanics of what really happens inside your test tube, or inside your test subject.

Now let's use our appendicitis data. Let's import it again; we know how to do that, and we just check that everything is imported.
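The two extremes of the Pearson coefficient are easy to demonstrate with scipy.stats.pearsonr: points lying exactly on a rising line give +1 regardless of the slope, and points exactly on a falling line give −1. The specific lines here are arbitrary examples.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# every point exactly on a rising straight line: perfect positive correlation
r_pos, _ = pearsonr(x, 2 * x + 1)

# every point exactly on a falling straight line: perfect negative correlation
r_neg, _ = pearsonr(x, -3 * x + 10)
```

Note that the steepness (2 versus −3) plays no role in the coefficient itself; only how tightly the points hug the line does.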
Okay, no problems there. Let's make a second data set by dropping the NaNs: take the data and drop the NaNs from the temperature and white cell count columns, because I want to correlate temperature and white cell count. Then, in one cell, I'm going to do the regplot, and I want to show you how complicated regplot can be: look how many arguments it takes, and there are even more that I didn't put in. But remember, as in the one up top, what we really want are the x values, the y values, and the confidence interval with bootstrap values, so that we have a nice graph we can send for publication.

Let's look at that. Whoa, what does that look like? On the x-axis we have white cell count, because I put white cell count first, and then temperature on the y-axis. I could have done it the other way around, of course; it probably would have made a bit more sense.

Let's run our slope, intercept, r value, p value and standard error, and see what they are. My slope was 0.04, so a rise in white cell count gives me a very slight rise in temperature. The r value is very low, 0.16, but the p value is still significant: there is still a correlation between temperature and white cell count. That's very good. And then, of course, the standard error.

So I can write the little y = mx + c model again: you give me a white cell count, and I can predict what the temperature is going to be. As I say, you could also have done it the other way around. There you have it: linear regression.
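The whole appendicitis workflow, dropna then linregress, can be sketched end to end. The real data set isn't available here, so this generates a hypothetical stand-in where temperature rises only slightly with white cell count, mimicking the weak-but-real correlation the lecture reports; the column names WCC and Temp are made up for the sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# hypothetical stand-in for the appendicitis data (not the real file)
rng = np.random.default_rng(42)
n = 500
wcc = rng.uniform(4, 20, n)                        # white cell count
temp = 36.5 + 0.04 * wcc + rng.normal(0, 0.8, n)   # temperature, very noisy

data = pd.DataFrame({'WCC': wcc, 'Temp': temp})

# same cleaning step as before: drop rows missing either variable
data_2 = data.dropna(subset=['WCC', 'Temp'])

slope, intercept, r_value, p_value, std_err = linregress(
    data_2['WCC'], data_2['Temp'])
```

With a weak signal buried in noise like this, the r value comes out small even though the slope is genuinely positive and the p value is significant, which is the pattern the lecture describes: a real but loose correlation.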