One of the most remarkable uses of linear algebra involves solving an important problem. Suppose you have a set of data relating observed outputs to measured inputs. We want to find a function that most accurately predicts the observed outputs from the inputs. For example, we might want to predict a student's grade based on the number of hours they study, or the price of a home based on the number of bedrooms and bathrooms, or the cost of a flood based on the amount of rain and the temperature.

So let's try to set up this problem. In the simplest case, we might want to find a linear function that matches inputs xi to outputs yi. Suppose our observed data values are a set of ordered pairs, (x1, y1), (x2, y2), and so on, and we want to predict the y's from the x's using a linear function. The function will have the form y equals a1x plus a2, where a1 and a2 are unknown parameters. So if our function is y equals a1x plus a2, our data values mean that a1x1 plus a2 should give us y1, a1x2 plus a2 should give us y2, and so on. And we can express this in matrix form.

Now here, remember the unknowns are a1 and a2. And so in matrix form, we want to find a1, a2, where Ax equals b, where A is our coefficient matrix, x is our column of variables, and b is our column of constants. So x, our column vector of the variables, is going to be a1, a2. Note that this x is the column of unknowns, not the input values. We'll throw in an implied coefficient of 1 in front of each a2 and peel off our coefficient matrix, A, whose rows are x1, 1, then x2, 1, and so on. And the matrix of constants, b, holds y1, y2, and so on.

And so this seems to be a fairly simple matrix equation. We want to find a1, a2, where Ax equals b. Unfortunately, this is generally impossible. We have too many equations and not enough unknowns. So instead, we'll try to minimize the difference between Ax, which we can think about as the predicted values, and b, which are the observed values. Now the obvious way to minimize the difference is to make the distance between Ax and b as small as possible.
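Written out, the matrix form described above might look like the following sketch. The label n, for the number of data pairs, is introduced here for illustration and is not named in the lecture.

```latex
\[
\underbrace{\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} a_1 \\ a_2 \end{pmatrix}}_{x}
=
\underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{b}
\]
```

Each row is one equation, a1 times xi plus a2 equals yi, so with more than two data pairs there are more equations than unknowns. With that picture in mind, the goal is still to make the distance between the predicted values and the observed values as small as possible.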
And this means we want to minimize the norm of Ax minus b. But remember when we calculate a norm this way, this will involve a square root, which is a little bit messy. So instead, we'll minimize the square of the norm of Ax minus b. And since the norm of Ax minus b squared will be the sum of the squares of the components of Ax minus b, and we want to minimize the sum, we say that this is the least squares problem.

And we can solve this using calculus. Wait, wait, wait, wait. The problem is we have multiple unknowns, which means we have to use multivariable calculus. Alternatively, we can use linear algebra and a little geometry.

So to minimize this norm of Ax minus b, we might note the following. Ax is some vector, b is another vector, and Ax minus b is the difference between two vectors. And if we look at this geometrically, this vector Ax minus b is the vector that joins b to Ax. And we want to make the length of this vector as short as possible. So as we change x, the vectors Ax and Ax minus b will change. And so to minimize the length of this vector, we can make Ax minus b perpendicular to Ax. And so from our geometric perspective, we can require Ax dot Ax minus b to be 0.

Let's rewrite this equation. So I could distribute the dot product and rearrange my equation slightly, which gives Ax dot Ax equals Ax dot b. And so let's think about this. If u and v are two column vectors, I can express the dot product as the matrix product u transpose v. So I can rewrite my two dot products: the transpose of Ax times Ax equals the transpose of Ax times b. And we have a mess of transposes here, but remember that for any matrices A and B whose product is defined, the transpose of AB is the same as B transpose A transpose. So I can rewrite my transposes: x transpose A transpose Ax equals x transpose A transpose b. And at this point, we'll invoke something called left cancellation. And the thing to notice here is that x transpose applied to one vector, A transpose Ax, is the same as x transpose applied to another vector, A transpose b. Well, a solution will occur when the two vectors are the same. So we want A transpose Ax to be A transpose b. So let's try it out.
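The chain of rewrites just described can be written out symbolically. This is a sketch of the spoken steps, not a slide from the lecture; the last line is a condition that is sufficient for the line above it, which is the sense in which left cancellation is being used.

```latex
\begin{align*}
(Ax) \cdot (Ax - b) &= 0 \\
(Ax) \cdot (Ax) &= (Ax) \cdot b \\
(Ax)^{T} (Ax) &= (Ax)^{T} b \\
x^{T} A^{T} A x &= x^{T} A^{T} b \\
A^{T} A x &= A^{T} b
\end{align*}
```

When A has only two columns, as in the straight-line case, A transpose A is a 2 by 2 matrix, so the final equation is a system of two equations in two unknowns no matter how many data pairs there are.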
Suppose we collect some data pairs and we want to find a linear function y equals mx plus b that best approximates this data. So our input values are going to be 1, 2, 3, 4, and 5. And our corresponding observed output values are going to be 3, 6, 6, 10, and 9. And in a perfect world, we want the linear function to give us exactly these output values: m times 1 plus b equals 3, m times 2 plus b equals 6, and so on. And so we'll peel off our coefficient matrix A, our matrix of variables x, and our matrix of constants b. Here the column of variables holds m and the intercept b, and the column of constants, which we are also calling b, holds the observed outputs.

And our goal is to minimize the norm of Ax minus b. And this requires solving A transpose Ax equals A transpose b. So setting this up and cleaning up all those matrix calculations, we get 55m plus 15b equals 118, and 15m plus 5b equals 34. And now this is a beautiful system of two equations in two unknowns. And so reducing the corresponding augmented coefficient matrix gives us our solution: m equals 1.6, b equals 2, and the best fit line is y equals 1.6x plus 2.

And just as a follow up to this problem, we might compare what our predicted values are using this best fit equation and the observed values. So our inputs 1, 2, 3, 4, 5 give us predicted values 3.6, 5.2, 6.8, 8.4, and 10. And if we compare those to our observed values, 3, 6, 6, 10, and 9, we see there is a reasonably good fit.

What's useful about this is that we don't have to limit ourselves to linear functions, y equals mx plus b, but we can expand our ability to find functions to many different types of functions. Let's take a look at that next.
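As a check on the worked example, here is a minimal Python sketch that builds A transpose A and A transpose b from the five data pairs and solves the resulting 2 by 2 system. The variable names (`xs`, `ys`, `AtA`, `Atb`, `predicted`) are assumptions made for this sketch. The intercept is named `c` to avoid clashing with the column of constants `b`, and Cramer's rule stands in for the row reduction used in the lecture.

```python
# Data pairs from the worked example.
xs = [1, 2, 3, 4, 5]
ys = [3, 6, 6, 10, 9]

# Coefficient matrix A: one row [x_i, 1] per data pair.
# The column of constants b holds the observed outputs.
A = [[x, 1] for x in xs]
b = ys

# Normal equations: (A^T A) [m, c]^T = A^T b.
# Entry (i, j) of A^T A is the dot product of columns i and j of A.
AtA = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]
# Entry i of A^T b is the dot product of column i of A with b.
Atb = [sum(row[i] * y for row, y in zip(A, b)) for i in range(2)]

# Solve the 2x2 system with Cramer's rule
# (a stand-in for reducing the augmented matrix).
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
m = (Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det
c = (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det

# Predicted outputs from the fitted line y = m*x + c.
predicted = [m * x + c for x in xs]

print("A^T A =", AtA)  # [[55, 15], [15, 5]]
print("A^T b =", Atb)  # [118, 34]
print("slope m =", m, "intercept c =", c)
print("predicted =", predicted)
```

The slope and intercept come out to 1.6 and 2, and the predicted values are approximately 3.6, 5.2, 6.8, 8.4, and 10, up to floating-point rounding. For larger problems, a library least squares routine would usually be a better choice than forming the normal equations by hand.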