All right, welcome back to the video series on linear regression. In the previous set of videos, we talked about correlation and how we can use the hypothesis test for correlation to determine the significance of an association. Today we're going to start getting into actually conducting linear regression. The correlation can give us an idea about an association, but linear regression is what really provides that best fit line and allows us to make predictions based on that linear model. In this particular video, we're going to talk about the manual, linear algebra style of doing linear regression. This is an important base for the next couple of videos, which will focus on linear regression in Python, because whatever your computer is doing to conduct linear regression, it follows the exact same framework that we're about to show in this video. And so while it may seem a little mathematical and a little abstract, just know that this is the basis for all linear regression conducted by any computer program. So with that, let's go ahead and dive right in. We are here in Google Colab, and we've got a few additional libraries that we will get to at a later time. For most of this lecture series, I'm actually going to focus on a data frame that we talked about when we discussed spurious correlation, and that is the drowning and nuclear data set that has been published right here. Essentially, this data set correlates the number of drowning deaths in the US with the amount of power generated by nuclear energy. As we've discussed before, these actually have a very high positive correlation value. To demonstrate that correlation, I'm going to start with ggplot. I've called the data frame df, and I'm also going to add the aes statement here in the ggplot command, like I did in the correlation video.
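Before plotting, it can help to see that "very high positive correlation" as a number. This is a minimal sketch with made-up stand-in values (the real drowning/nuclear figures aren't reproduced here, and the column names `drowning` and `nuclear` are assumptions): two series that both trend upward over time will show a strong Pearson correlation even with no causal link.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the drowning/nuclear data set: two series that
# both drift upward over the years, which alone is enough for a strong
# (spurious) correlation.
years = np.arange(2000, 2010)
df = pd.DataFrame({
    "drowning": 1200 + 15 * (years - 2000) + np.array([3, -2, 5, 0, -4, 2, 1, -3, 4, -1]),
    "nuclear":  750 + 8 * (years - 2000) + np.array([-1, 2, 0, 3, -2, 1, -3, 2, 0, 1]),
})

# Pearson correlation between the two columns
r = np.corrcoef(df["drowning"], df["nuclear"])[0, 1]
print(round(r, 3))
```

Because both columns are dominated by their linear time trends, r comes out close to 1 here, which is exactly the spurious-correlation pattern the video describes.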
And this really helps to streamline the plotting process, because I no longer need to add an aes statement to each individual plotting command; I can focus on adding the command-specific values: method, color, and, for now, se=False, which says don't draw the confidence interval bands that we often see around the line. So we can see here, we've got nuclear and drowning, and it's a very strongly positive association. But we know from logic alone that there is no causal relationship between drowning and nuclear power; they just seem to have increased at the same rate over time. Nonetheless, we're going to use this data set to conduct manual linear regression, following the same steps that we talked about in the lesson itself, but here I'm going to demonstrate how you can do that linear algebra in Python. The first step is to create a column of ones, which I'm just going to call ones, by taking a scalar of one and multiplying it by the number of rows in df. If we print that, we can see that we now have this column of ones alongside our drowning and nuclear power data. The next step is to extract the X data and the ones to form a separate matrix. When we're doing matrix multiplication and matrix math, we need that Y matrix and that X matrix, and then we'll eventually be solving for that beta matrix. To get this X data, I'm going to create a list containing my column names, drowning and ones, and then say X = pd.DataFrame(df[cols]).to_numpy(). So first I'm converting it into a data frame, and then I'm converting it into a NumPy array. Then we can print X, and we can see that this is the same data that we had up in drowning, but now it's in matrix array form. Then we need to separate out the Y data, and that's going to look very similar.
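The ones-column and X-matrix steps above can be sketched in code. This is a minimal example using made-up numbers in place of the real data set; the column names and variable names (`df`, `cols`, `X`) follow what's spoken in the video, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data frame (column names assumed from the video)
df = pd.DataFrame({
    "drowning": [1200.0, 1215.0, 1230.0, 1245.0],
    "nuclear":  [750.0, 758.0, 766.0, 774.0],
})

# Step 1: add a column of ones -- a scalar one repeated once per row.
# This column is what lets the model fit an intercept term.
df["ones"] = [1] * len(df)

# Step 2: extract the predictor and the ones column as an n x 2 matrix
cols = ["drowning", "ones"]
X = pd.DataFrame(df[cols]).to_numpy()
print(X.shape)  # (4, 2)
```

Each row of X is now one observation, with the predictor value in the first column and a 1 in the second.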
This time we don't need to create a column of ones, because it's just the one value. We can say Y = pd.DataFrame(df['nuclear']).to_numpy(), and then we can print Y. Now we've got a matrix of our nuclear data. And with our Y matrix and our X matrix in hand, we can start to solve the equation. The first thing we need to do is transpose the X matrix, and for that I'm using the transpose command from NumPy: I'll create a new variable, XT = np.transpose(X). This is just where we flip our rows into columns and our columns into rows, so instead of being a long, two-column array, it's now a wide array with the first row being drowning and the second row being ones. Continuing on with this linear regression, the fifth step is to multiply XT by X. If you have taken a linear algebra class, then you know that you would do this manually by multiplying rows by columns and adding. But in Python, we have this nice command called matmul, for matrix multiplication, and we just give it the two values. I called the result XTX, for XT times X, and we can see we end up with a two by two matrix, which is X transpose multiplied by X. And the next step, step six, is to do that same process on the Y. You can picture the equation in your head: Y equals X times B. If we multiply the X side by the transpose, we need to do the same on the Y side to make sure that we're not unbalancing our equation. So here I'm going to call it XTY, and it is np.matmul(XT, Y). We can run that, and we've got a two by one matrix on the Y side. So now, if you're following the equation, we've got XTY equals XTX times B, and we need to get that XTX to the other side of the equation. The way we do that is through the inverse: there is no such thing as dividing by a matrix, so instead we multiply by the matrix's inverse.
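The transpose and multiplication steps above look like this in NumPy. This is a sketch on a small made-up design matrix (predictor plus a ones column) and response, standing in for the video's drowning/nuclear data:

```python
import numpy as np

# Made-up design matrix: first column is the predictor, second is ones
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
Y = np.array([[3.1], [5.0], [6.9], [9.2]])  # made-up response values

XT = np.transpose(X)          # flip rows and columns: now 2 x n
XTX = np.matmul(XT, X)        # X'X, a 2 x 2 matrix
XTY = np.matmul(XT, Y)        # X'Y, a 2 x 1 matrix
XTX_inv = np.linalg.inv(XTX)  # "dividing" by XTX means multiplying by its inverse
print(XTX.shape, XTY.shape)
```

Note the shapes: XᵀX collapses from n×2 down to 2×2, and XᵀY down to 2×1, which is what makes the system small enough to invert directly.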
And in order to do that, we're still using NumPy, but now with linalg, for linear algebra, and inv, for inverse. So it's XTX_inv = np.linalg.inv(XTX), where we just give it the variable XTX, and then we can print that. We can see we've still got a two by two matrix, but this is now the inverse. Now that we have the inverse, we're ready to move XTX from the right side of the equation to the left, and this is how we get our beta values. We multiply both sides by the inverse of XTX, which on our side of the equation just means multiplying XTX_inv with XTY. The result is our beta matrix, beta naught and beta one. We can see here that we've got a two by one matrix that contains our slope in the first row and our intercept in the second row. So through this linear algebra, we arrive at our beta values. If we take our regression line Y equals MX plus B, we can fill it in as Y equals 0.313X plus 614, and this is now the equation of the line that we could use to predict new Y values given X values. Like I said, this linear algebra solution might be a little abstract, but it's a really important base, because for the rest of this lesson, this is exactly what will be happening behind the scenes. This is what Python is doing every single time we ask it to do linear regression. So in the next several videos, we'll get into some more Python commands that make this process a little easier on us.
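Putting the whole procedure together, the normal-equation solve looks like this end to end. The data here is invented (y roughly equals 2x + 1 with a little noise, not the drowning/nuclear values), and as a sanity check it compares the manual beta against NumPy's built-in least-squares solver, which should agree:

```python
import numpy as np

# Made-up data: y is approximately 2x + 1 plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.05, -0.05, 0.0])

# Design matrix: predictor column plus a column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# The full normal equation: beta = (X'X)^-1 X'y
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)
slope, intercept = beta

# Sanity check: NumPy's least-squares routine solves the same problem
beta_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(slope, intercept)
```

Because the ones column sits second in X here, beta comes out as (slope, intercept), matching the ordering described in the video; swap the columns and the ordering swaps too.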