As we continue our discussion of statistics and data science, we need to talk about some of the choices that you have to make, some of the trade-offs, and some of the effects that these things have. We'll begin by talking about estimators, that is, different methods for estimating parameters. I like to think of this as: what kind of measuring stick or standard are you going to be using?

We'll begin with the most common. This is called OLS, which is short for ordinary least squares. It's a very common approach, used in a lot of statistics, and it's based on what's called the sum of squared errors. It's characterized by the acronym BLUE, which stands for best linear unbiased estimator. Let me show you how that works. Let's take a scatter plot of an association between two variables; this is actually the speed of a car and the distance to stop, from data collected back in the 1920s, I think. We have a scatter plot, and we can draw a straight regression line through it. Now, the line that I've used is in fact the best linear unbiased estimate, and the way that we can tell is by getting what are called the residuals. If you take each data point and draw a perfectly vertical line up or down to the regression line (because the regression line predicts what the value would be for that value on the x-axis), those are the residuals. Each of those individual vertical lines is a residual; you square those, and you add them up. And this regression line, the gray angled line here, will have the smallest sum of squared residuals of any possible straight line that you can run through the data.

Another approach is ML, which stands for maximum likelihood. This is when you choose the parameters that make the observed data most likely. It sounds kind of weird, but I can demonstrate it. It's based on a kind of local search, so it doesn't always find the best answer. I like to think of it like a person with binoculars, looking around them, trying hard to find something, but who could theoretically miss something. Let me give a very simple example of how this works. Let's assume that we're trying to find parameters that maximize the likelihood of this dotted vertical line here at 55, and I've got three possibilities: my red distribution, which is off to the left; the blue, which is a little more central; and the green, which is far to the right. These are all identical, except they have different means. And by changing the means, you see that the one that is highest where the dotted line is, is the blue one. So if the only thing we're doing is changing the mean, and we're looking at these three distributions, then the blue one is the one that gives the maximum likelihood for this particular observation. On the other hand, we could give them all the same mean, right around 50, and vary their standard deviations instead, so they spread out different amounts. In this case, the red distribution is highest at the dotted vertical line, and so it has the maximum likelihood. Or, if you want to, you can vary both the mean and the standard deviation simultaneously, and here the green gets a slight advantage. Now, this is really a caricature of the process, because obviously you would just want to center a distribution right there on the 55 and be done with it. The thing is, when you have many variables in your data set, choosing values that maximize the likelihood of the data across all of them is a very complex process. But this gives you a feel for how it works.
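To make the least-squares idea concrete, here's a minimal sketch in Python. The (speed, distance) pairs are made-up stand-ins in the spirit of the classic cars data, not the actual values from the plot, and `np.polyfit` stands in for the fitting the lesson describes.

```python
import numpy as np

# Illustrative (speed, stopping distance) pairs, in the spirit of the
# classic 1920s cars data mentioned above (these values are invented).
speed = np.array([4, 7, 10, 13, 16, 19, 22, 25], dtype=float)
dist = np.array([2, 13, 26, 34, 40, 60, 66, 85], dtype=float)

# Fit the OLS line: np.polyfit minimizes the sum of squared residuals.
slope, intercept = np.polyfit(speed, dist, deg=1)

# Residuals: vertical distances from each point to the fitted line.
predicted = slope * speed + intercept
residuals = dist - predicted
sse = np.sum(residuals ** 2)

print(f"OLS line: dist = {slope:.2f} * speed + {intercept:.2f}")
print(f"Sum of squared residuals: {sse:.2f}")

# Any other straight line has a larger SSE -- perturb the slope to check.
for delta in (-0.5, 0.5):
    other_sse = np.sum((dist - ((slope + delta) * speed + intercept)) ** 2)
    print(f"slope {slope + delta:.2f}: SSE = {other_sse:.2f} (larger)")
```

Nudging the slope in either direction always increases the sum of squared residuals, which is exactly what "least squares" means.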
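And here's the maximum-likelihood caricature in code: three normal distributions that differ only in their means, scored by how likely they make the observed value 55. The specific means and standard deviations are illustrative guesses, not the exact curves from the figure.

```python
from scipy.stats import norm

# Which of three candidate distributions makes the observed value 55 most
# likely? All share the same standard deviation; only the means differ.
x = 55
candidates = {
    "red": norm(loc=40, scale=8),    # off to the left
    "blue": norm(loc=52, scale=8),   # more central
    "green": norm(loc=70, scale=8),  # far to the right
}
likelihoods = {name: dist.pdf(x) for name, dist in candidates.items()}
for name, lik in likelihoods.items():
    print(f"{name:5s} likelihood at x={x}: {lik:.4f}")

# The blue curve wins, because its peak sits closest to the dotted line.
print("maximum likelihood:", max(likelihoods, key=likelihoods.get))
```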
The third approach that's pretty common is something called MAP, which stands for maximum a posteriori. This is a Bayesian approach to parameter estimation. What it does is add a prior distribution, and then it goes through a sort of anchoring-and-adjusting process. What happens, by the way, is that stronger priors, based for instance on a larger sample or more extreme values, exert more influence, pulling the posterior estimate of the parameters toward them. (You can see this anchoring and adjusting in the first sketch below.)

Now, what's interesting is that these three methods all connect with each other. Let me show you exactly how they connect. Ordinary least squares, OLS, is equivalent to maximum likelihood when the error terms are normally distributed. And maximum likelihood, ML, is equivalent to maximum a posteriori, MAP, with a uniform prior distribution. To put it another way, ordinary least squares is a special case of maximum likelihood, and maximum likelihood is a special case of maximum a posteriori. And just in case you like it, we can put it in set notation: OLS is a subset of ML, which is a subset of MAP. So there are connections between these three methods of estimating population parameters. (The second sketch below checks the OLS and ML equivalence numerically.)

Let me just sum it up briefly this way. The standards that you use, OLS, ML, or MAP, affect your choices and the ways that you determine which parameters best describe what's happening in your data. Several methods exist, and there are obviously more than what I showed you right here, but many are closely related, and under certain circumstances they're identical. So it comes down to: exactly what are your purposes, and what do you think is going to work best with the data that you have, to give you the insight that you need in your own project?
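Here's a minimal sketch of that MAP anchoring-and-adjusting, for the conjugate case of a normal mean with known error scale and a normal prior; the numbers are illustrative assumptions, not values from the lesson. In this case the posterior is also normal, so its mean is the MAP estimate: a precision-weighted blend of the prior mean and the sample mean, which means a tighter prior pulls harder.

```python
import numpy as np

# MAP estimate of a normal mean with known sigma and a normal prior.
# In this conjugate case the posterior is normal, so its mean is the MAP.
def map_estimate(data, sigma, prior_mean, prior_sd):
    n = len(data)
    prior_precision = 1.0 / prior_sd ** 2  # precision = 1 / variance
    data_precision = n / sigma ** 2
    # Precision-weighted average: the "anchor" (prior) adjusted by the data.
    return ((prior_precision * prior_mean + data_precision * np.mean(data))
            / (prior_precision + data_precision))

data = [52.0, 56.0, 54.0, 58.0]  # illustrative sample, mean 55

# Strong prior (sd=2) pulls the estimate hard toward 40: ~45.9
print(map_estimate(data, sigma=5.0, prior_mean=40.0, prior_sd=2.0))

# Near-flat prior (sd=50) barely matters; estimate ~55.0, the sample mean
print(map_estimate(data, sigma=5.0, prior_mean=40.0, prior_sd=50.0))
```

Notice that as the prior flattens out, the MAP estimate converges to the sample mean, which is the maximum likelihood estimate. That's the ML-as-a-special-case-of-MAP connection in action.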
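And here's one way to check the OLS and ML connection numerically, under the assumption of normal errors with a fixed scale; the data are simulated, and the optimizer setup is just one reasonable choice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulate a straight line with normally distributed errors.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

# OLS fit: minimizes the sum of squared residuals directly.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# ML fit: maximize the normal log-likelihood of the residuals. With a
# fixed error scale, this is the same as minimizing squared residuals.
def neg_log_likelihood(params):
    slope, intercept = params
    return -np.sum(norm.logpdf(y - (slope * x + intercept), scale=2.0))

ml_slope, ml_intercept = minimize(neg_log_likelihood, x0=[1.0, 0.0]).x

print(f"OLS: slope={ols_slope:.4f}, intercept={ols_intercept:.4f}")
print(f"ML : slope={ml_slope:.4f}, intercept={ml_intercept:.4f}")
# The two agree up to optimizer tolerance, illustrating OLS as a
# special case of ML with normally distributed errors.
```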