Let's finish our very short introduction to modeling data in R with a brief discussion of regression, probably one of the most common and powerful methods for analyzing data. I like to think of it as the analytical version of e pluribus unum, that is, out of many, one; or in the data science sense, out of many variables, one variable; or, to put it one more way, out of many scores, one score. The idea with regression is that you use many different variables simultaneously to predict scores on one particular outcome variable. And there's so much going on here, I'd like to think that there's something for everyone. There are many versions and many adaptations of regression that make it flexible and powerful for almost anything you're trying to do. We'll take a look at some of these in R, so let's open up the script and see how you can adapt regression to a number of different tasks. When we come to our script, we're going to scroll down a little bit and install some packages; we'll be using several in this one. I'll load those, as well as the datasets package, because we're going to use a data set from it called USJudgeRatings. Let's get some information on it: it contains lawyers' ratings of state judges in the US Superior Court. Let's take a look at the first few cases with head; I'll zoom in on that. What we have here are six judges listed by name, with scores on a number of different variables like diligence and demeanor, and it finishes with whether they're worthy of retention; that's RTEN, for retention. Let's scroll back out. What we might want to do is use all these different judgments to predict whether lawyers think that these judges should be retained on the bench. Now, we're going to use a couple of shortcuts that can make working with regression kind of nice.
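The exploration described above can be sketched in a few lines of R; this is a minimal sketch, relying only on the fact that USJudgeRatings ships with base R's datasets package:

```r
# USJudgeRatings comes with base R's datasets package
library(datasets)

?USJudgeRatings       # help page: lawyers' ratings of state Superior Court judges
head(USJudgeRatings)  # first six judges, with scores such as DILG (diligence),
                      # DMNR (demeanor), and RTEN (worthy of retention)
```

Running head shows one row per judge and one column per rating variable, with RTEN as the last column.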
First, we're going to take our data set and feed it into an object called data, so that shows up now in our environment on the top right. Then we're going to define variable groups. You don't have to do this, but it makes the code really easy to use. Plus, you'll find that if you do this, you can reuse the same code without having to redo it every time you do an analysis. So we're going to create an object called x. It's going to be a matrix, and it will consist of all of our predictor variables simultaneously. The way I'm going to do this is with as.matrix, and then I'm going to say read data, which is what we defined right here, and read all of the columns except number 12. That's the one called RTEN; that's our outcome. The minus sign means don't include that one, but do include all the others. So I do that, and now I have an object called x. Then for the second one, I say go to data, and the blank before the comma means use all of the rows, but only read the 12th column; that's the one that has retention, our outcome. So, following standard notation, x holds all of our predictor variables and y is our single outcome variable. Now, the easiest version of regression is called simultaneous entry: you use all of the x variables at once, throw them into one big equation, and try to predict your single outcome. In R we use lm, which is short for linear model. What we have here is y, our outcome variable; then the tilde, which means "is predicted by" or "as a function of"; and then x, all of our variables together being used as predictors. This is the simplest possible version, and we'll save it into an object called reg1, for regression one. Now, if you want to be a little more explicit, you can give the individual variables: you can say that RTEN, retention, is a function of, or is predicted by, all of these other variables.
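Here's roughly what that setup looks like in code; a sketch based on the description above, where the explicit form simply spells out the built-in data set's column names:

```r
library(datasets)

data <- USJudgeRatings

# Predictors: every column except number 12 (RTEN), as a matrix
x <- as.matrix(data[-12])

# Outcome: all rows, column 12 only (RTEN)
y <- data[, 12]

# Simultaneous entry: all predictors at once, saved as reg1
reg1 <- lm(y ~ x)

# Equivalent, more explicit form naming each predictor
reg1_explicit <- lm(RTEN ~ CONT + INTG + DMNR + DILG + CFMG + DECI +
                      PREP + FAMI + ORAL + WRIT + PHYS,
                    data = USJudgeRatings)
```

Both calls fit the same model; the explicit version just avoids the x-prefixed variable names in the output.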
And then I say that they come from the data set USJudgeRatings; that way I don't have to write data and a dollar sign before each of them. That gives me the exact same thing, so I don't need to run that one explicitly. If you want to see the results, we just call the object that we created from the linear model, and I'm going to zoom in on that. What we have are the coefficients. This is the intercept, starting at about minus two, and then for each step up on the first predictor the outcome changes by about 0.1, on the next by 0.36, and so on. You'll see, by the way, that it has changed the name of each of the variables to add an x, because they're in the matrix x now; that's fine. We can do inferential tests on these individual coefficients by asking for a summary. We click on that and zoom in. Now you can see the values that we had previously, but each now has a standard error, then a t-test, and over here a probability value. The asterisks indicate values that are below the standard probability cutoff of 0.05. We expect the intercept to be below that. We see, for instance, that integrity has a lot to do with people's judgments of whether a person should be retained, and so does physical ability; really, you know, are they sick? And we have some others that are on their way. And here's a nice one overall: if you come down here, you can see the multiple R-squared, and it's super high. What it means is that these variables collectively predict very, very well whether the lawyers felt that the judge should be retained. Let's go back now to our script. You can get some more summary data here if you want. We can get the analysis of variance table, the ANOVA table; if we click on that and zoom in, you can see that we have our residuals and the y. Come back out. We do the coefficients: here are the regression coefficients. We saw those previously; this is just a different way of getting at the same information. We can also get confidence intervals.
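The follow-up commands described here might look like this; the model is refit at the top so the snippet stands alone:

```r
library(datasets)

x <- as.matrix(USJudgeRatings[-12])  # predictors: everything but RTEN
y <- USJudgeRatings[, 12]            # outcome: RTEN
reg1 <- lm(y ~ x)

reg1            # coefficients only
summary(reg1)   # adds standard errors, t-tests, p-values, and R-squared
anova(reg1)     # analysis of variance table
coef(reg1)      # regression coefficients alone
confint(reg1)   # 95% confidence intervals for each coefficient
```

Each of these is a generic function that works on any lm model object, so the same calls carry over to other regressions you fit.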
We'll zoom in on that. Now we have a 95% confidence interval, so the 2.5% on the low end and the 97.5% on the top end, in terms of what each of the coefficients could be. We can get the residuals on a case-by-case basis. Let's do this one. When we zoom in on that, it's a little hard to read in and of itself, because they're just numbers. An easier way to deal with that is to get a histogram of the residuals from the model. To do that, we just run this command, and then I'll zoom in. You can see that it's a little bit skewed, mostly around zero; we've got one case way up on the high end, but mostly these are pretty good predictions. We'll come back out. Now I want to show you something a little more complicated: we're going to do different kinds of regression. I'm going to use two additional packages for this. One is called lars, which stands for least angle regression, and the other is caret, which stands for classification and regression training. We'll load those two. Then we're going to do a conventional stepwise regression, which a lot of people say has problems, but I'm just going to show it really fast. There's our stepwise regression. Then we're going to do something from lars called stagewise, which is similar to stepwise but has better generalizability. We run that through. We can also do least angle regression. And then one of my favorites is the lasso; that's the least absolute shrinkage and selection operator. Now, I'm running through just the absolute bare-minimum versions of these; there's a lot more that we would want to do to explore them. But what I am going to do is compare the predictive ability of each of them, feeding the results into an object called R2comp, for a comparison of the R-squared values. Here I specify where the R-squared is in each of them (I have to give a little index number), and then we round off the values.
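A bare-bones version of those steps might look like the following. Note one substitution, flagged here as an assumption: instead of the hard-coded index numbers mentioned in the narration, this sketch uses tail() to grab the final R-squared along each model's path, which assumes the full-path fit is what we want to compare:

```r
library(datasets)
library(lars)     # least angle regression; install.packages("lars") if needed
# library(caret)  # loaded in the video, but not strictly needed for this sketch

x <- as.matrix(USJudgeRatings[-12])
y <- USJudgeRatings[, 12]
reg1 <- lm(y ~ x)

resid(reg1)            # residuals, case by case
hist(residuals(reg1))  # histogram of residuals: roughly centered on zero

stepwise <- lars(x, y, type = "stepwise")           # conventional stepwise
forward  <- lars(x, y, type = "forward.stagewise")  # stagewise
lar      <- lars(x, y, type = "lar")                # least angle regression
lasso    <- lars(x, y, type = "lasso")              # LASSO

# Compare predictive ability: final R-squared of each model's path,
# rounded to two decimal places (tail() substitutes for index numbers)
R2comp <- round(c(tail(stepwise$R2, 1), tail(forward$R2, 1),
                  tail(lar$R2, 1), tail(lasso$R2, 1)), 2)
names(R2comp) <- c("stepwise", "forward", "lar", "lasso")
R2comp
```

With this data set all four values come out very close to one another, which matches the comparison described in the narration.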
Then I give them names: the first one is stepwise, then forward, then lar, then lasso. And we can see the values. What this shows us at the bottom is that all of them were able to predict the outcome super well. But we knew that, because when we did just the standard simultaneous entry, there was amazingly high predictive ability within this data set. You will find situations in which each of these can vary a little bit, and maybe sometimes they vary a lot. But the point here is that there are many different ways of doing regression, and R makes them available for whatever you want to do. So explore your possibilities and see what seems to fit. In other courses, we'll talk much more about what each of these means, how they can be applied, and how they can be interpreted. But for right now, I simply want you to know that these exist and that they can be done, at least in theory, in a very simple way in R.