Here's the x, here's the y, plot them together into a line. This is the least squares line, deterministic and true. This is the second line, and this is the third one too.

So let's talk about how to draw a line. I mean, how to fit a line to points in a scatter plot. It starts with the standard slope formula, the equation for a line. You will see this has a similar form to the equation for least squares. Next, replace the point y with a vector holding many y's, and the point x with a matrix holding many vectors of x's. It is also common to concatenate a vector of 1's onto the x matrix as an intercept term. Next, the slope m can be replaced with beta 1, and the intercept b can be replaced with beta naught. The beta terms can be combined into one vector holding beta naught and beta 1. If this notation is new, I promise there's nothing to be afraid of. Vectors are just columns of data. Matrices are one or more columns of data.

At this point, everything in this equation has real data underneath the hood. For example, I can take the data from the previous plot and put it into this equation. The only variable missing is the vector of betas. This makes sense, because that is exactly the thing we don't know at this point. The slope and intercept define the line. So the mission, should you choose to accept it, is to find the betas that result in the smallest overall error.

So how do we go about finding the betas that result in the smallest overall error? Here are a few options:

1. Eyeball it and call it good.
2. Gradient descent. This means we have a computer make an initial guess, and then make incremental improvements until the improvements become small.
3. Galton's decomposition of variability. Interesting fact: Galton, Darwin's cousin, was working on fitting lines to sweet pea data in much the same way we are. But despite least squares being known for at least 50 years, he went about the solution in a completely different way. That'll be a topic for another time.
4. Calculus. We are going with the calculus option in this video.

Ideally, we want to fit a line that makes the smallest overall error. It would not make much sense to have a flat line at zero, because the error for each point would be high, resulting in a large overall error. Also, this equation, as it stands, is not really what we want. One issue is that the errors below the line will be negative, and the errors above the line will be positive. To make all the errors positive, square the errors. In vector notation, if the goal is to sum squared terms, we simply take the dot product of the transpose of the error vector with itself. This term, the errors squared and summed, is called the residual sum of squares, also known as least squares. Here, residual is just another word for error. So residual sum of squares, errors squared and summed, and least squares are all the same thing.

At this point, the rest is algebra and calculus. There are three steps. First, expand the right-hand side. Second, take the partial derivative with respect to beta. Third, set that derivative of the residual sum of squares equal to zero and solve for beta. The final result gives the betas. There is a special name given to the OLS equation with the solution for beta plugged in: it is called the hat matrix. That is because this equation puts a hat on the y. In this notation, putting a hat on the y upgrades it from a theoretical equation to one that has a solution. This is the equation that minimizes the errors.
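As a sketch of those three steps in standard matrix notation (the symbols here may differ slightly from the on-screen notation in the video):

```latex
% Model: y = X*beta + e, where X holds a column of 1's and a column of x's
\mathrm{RSS}(\beta) = e^{\top} e = (y - X\beta)^{\top}(y - X\beta)

% Step 1: expand the right-hand side
\mathrm{RSS}(\beta) = y^{\top} y - 2\beta^{\top} X^{\top} y + \beta^{\top} X^{\top} X \beta

% Step 2: take the partial derivative with respect to beta
\frac{\partial \, \mathrm{RSS}}{\partial \beta} = -2 X^{\top} y + 2 X^{\top} X \beta

% Step 3: set the derivative to zero and solve for beta
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y

% Plugging the solution back in puts a hat on y; H is the hat matrix
\hat{y} = X \hat{\beta} = \underbrace{X (X^{\top} X)^{-1} X^{\top}}_{H} \, y
```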
I linked to some code so it is easy to see what this looks like when data pumps through it (a minimal sketch of the same computation also appears below). Taking the data from before gives us output. These values can be plugged into the equation for a line and overlaid on the points. But this is just one line. As I said at the beginning of the video, it is possible to have at least two other valid least squares lines. Least squares calculates the vertical error from point to line. But it's also possible to minimize the horizontal distance from point to line. It is valid, but it gives a pretty different slope. Finally, it's also possible to minimize the orthogonal, or perpendicular, distance from point to line. Sometimes this is called total least squares.

Part of simple regression is about how to draw a line. But drawing a line has nothing to do with uncertainty. So least squares alone does not generate statistical estimates; rather, it generates a deterministic answer. This is a mathematical solution, not a statistical estimate. There are no probability models needed to generate a least squares solution. Why would we want a statistical model anyway? Without a probability model under the hood, you lose the ability to do hypothesis tests, generate confidence intervals, and produce probability densities. These things are possible with a simple linear regression model.
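The linked code isn't reproduced here, but a minimal sketch of the same computation in Python with NumPy might look like the following. The data values are made up for illustration; they are not the numbers from the video.

```python
import numpy as np

# Toy data standing in for the scatter plot (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Design matrix: a column of 1's (intercept term) next to the x values
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X'X) beta = X'y for beta
# (np.linalg.solve is more numerically stable than forming the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("beta_0 (intercept):", beta_hat[0])
print("beta_1 (slope):    ", beta_hat[1])

# Fitted values and the residual sum of squares
y_hat = X @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print("RSS:", rss)
```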
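And for the third line mentioned above, one common way to get the orthogonal (total least squares) fit is through the first principal component of the centered data. Continuing with the same toy x and y, a sketch:

```python
# Orthogonal (total least squares) line via the first principal component.
# Center the data, then take the leading right singular vector as the
# direction of the line; the line passes through the centroid.
Z = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(Z)
direction = Vt[0]  # unit vector along the fitted line

slope_tls = direction[1] / direction[0]  # assumes the line is not vertical
intercept_tls = y.mean() - slope_tls * x.mean()
print("TLS slope:    ", slope_tls)
print("TLS intercept:", intercept_tls)
```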