As we go through our very brief introduction to modeling data in R, another common procedure we might want to look at briefly is called principal components. The idea here is that in certain situations, less is more. That is, less noise and fewer unhelpful variables in your data can translate to more meaning, and that's what we're after in any case. Now, this approach is also known as dimensionality reduction, and I like to think of it by an analogy. You look at this photo and what you see are these big black outlines of people; you can tell basically how tall they are, what they're wearing, where they're going. And it takes a moment to realize you're actually looking at a photograph taken straight down: you can see the people there at the bottom, and you're looking at their shadows. And we're trying to do a similar thing. Even though these are shadows, you can still tell a lot about the people. People are three-dimensional, shadows are two-dimensional, but we've retained almost all the important information. If you want to do this with data, the most common method is called principal component analysis, or PCA.

Let me give you the steps of PCA metaphorically. You begin with two variables, and so here's a scatterplot: we've got X across the bottom, Y up the side, and this is just artificial data. You can see that there's a strong linear association between these two. What we're going to do is draw a regression line through the data set, and there it is at about 45 degrees. Then we're going to measure the perpendicular distance of each data point to the regression line. Not the vertical distance, which is what we would do if we were looking for regression residuals, but the perpendicular distance. That's what those red lines are. Then we're going to collapse the data by sliding each point down its red line to the regression line, and that's what we have there. And finally, we have the option of rotating it, so it's not a diagonal anymore but flat. That there is the PC, the principal component.

Now let's recap what we've accomplished here. We went from a two-dimensional data set to a one-dimensional data set, but maintained some of the information in the data. I like to think that we've maintained most of the information, and hopefully we've maintained the most important information in our data set. The reason we're doing this is that we've made the analysis and interpretation easier and more reliable by going from something more complex, two dimensions or higher, down to something with fewer dimensions that's simpler to deal with and easier to make sense of in general.

Let me show you how this works in R. Open up this script, and we'll go through an example in RStudio. To do this, we'll first need to load our packages, because I'm going to use a few of these. I'll load those, and we'll load the data sets. Now I'm going to use the mtcars data set; we've seen it a lot. And I'm going to create a little subset of variables. Let's look at the entire list of variables. I don't want all of those in my particular data set, so the same way I did with hierarchical clustering, I'm going to create a subset by dropping a few of those variables. We'll take a look at that subset and zoom in on it, and there are the first six cases in my slightly reduced data set.
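For reference, a minimal sketch of that setup might look like the following. The transcript doesn't name the packages used or the variables being dropped, so the use of dplyr and the choice to drop drat and vs are my assumptions.

    # Load packages (assumed: dplyr for select() and the pipe) and the data
    library(datasets)                  # built-in data sets, including mtcars
    library(dplyr)

    head(mtcars)                       # all 11 original variables

    # Keep nine variables by dropping two (drat and vs here, as an example)
    cars <- mtcars %>%
      select(-drat, -vs)

    head(cars)                         # first six cases of the reduced data set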
And we're going to use that to see what dimensions we can get down to, so that we have fewer than the nine variables we have here. Let's try to get to something a little smaller and see if we still maintain the important information in this data set.

Now, we're going to start by computing the PCA, the principal component analysis. We'll use the entire data frame here, and I'm going to feed it into an object called pc, for principal components. There's more than one way to do this in R, but I'm going to use prcomp. This specifies the data set that I'm going to use, and I'm going to add two optional arguments. One is centering the data, which means moving them so the means of all the variables are zero. The second one is scaling the data, which compresses or expands the range of the data so each variable has unit variance, a variance of one. That puts all of them on the same scale and keeps any one variable from overwhelming the analysis. So let me run that, and now we have a new object that showed up on the right. If you want to, you can also specify variables by explicitly including them: the tilde here means that I'm making my prediction based on all the rest of these, and I can give the variable names all the way through. Then I say what data set it's coming from, data equals mtcars, and I can do the centering and the scaling there also. It produces exactly the same thing; it's just two different ways of writing the same command.

To examine the results, we can come down and get a summary of the object pc that I created. I'll click on that and then we'll zoom in on this. Here's the summary: it shows nine components, PC1 for principal component one through PC9 for principal component nine. You get the same number of components as you had original variables; the question is how the variation gets divided up among them. Take a look at principal component one: it has a standard deviation of 2.3391. What that means is that if each variable began with a standard deviation of one, this one carries as much spread as about 2.3 of the original variables. The second one has about 1.59, and the others have less than one unit of standard deviation, which means they're probably not very important in the analysis.

We can get a scree plot of the components to get an idea of how much each one explains of the original variance. And we see right here, I'll zoom in on that, that our first component seems to be really big and important. Our second one is smaller, but it still seems to be, you know, above zero, and then we kind of grind down from there. Now, there are several different criteria for choosing how many components are important and what you want to do with them. Right now we're just eyeballing it, and we see that number one is really big and number two is sort of a minor axis in our data.

If you want to, you can get the standard deviations and something called the rotation. Here I'm going to just call pc, and then we'll zoom in on that in the console and scroll back up a little bit. It's a lot of numbers. The standard deviations here are the same as what we got in that first row earlier, so that just repeats that the first one's really big and the second one's smaller. And then what the rotation shows is the association between each of the individual variables and the nine different components. So you can read these like correlations.
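As a rough sketch of those commands, assuming the nine-variable subset from above is sitting in an object called cars, it might look like this; the exact variable list in the formula version is my guess at which nine variables were kept.

    # Principal component analysis on the whole subset, centered and scaled
    pc <- prcomp(cars, center = TRUE, scale. = TRUE)

    # Equivalent formula version: name the variables and the source data frame
    pc <- prcomp(~ mpg + cyl + disp + hp + wt + qsec + am + gear + carb,
                 data = mtcars, center = TRUE, scale. = TRUE)

    summary(pc)   # standard deviation and proportion of variance per component
    plot(pc)      # scree plot: variance explained by each component
    pc            # standard deviations plus the rotation (loadings) matrix

One small wrinkle: prcomp's scaling argument is spelled scale. with a trailing dot, although scale = TRUE also works through R's partial argument matching.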
I'm going to come back, and let's see how individual cases load on the PCs. The way I do that is I use predict, run it on pc, then feed those results through the pipe and round them off so they're a little more readable. I'll zoom in on that. Here we've got nine components listed, and we've got all of our cars, but the first two are probably the ones that are most important. So we have PC1 and PC2, and you see we've got a giant value there, 2.49273354, and so on.

But probably the easiest way to deal with all of this is to make a plot. What we're going to do is something with a funny name, a biplot, which means a two-dimensional plot. Really, all it does is chart the first two components, but that's good because, based on our analysis, it's really only the first two that seem to matter anyhow. So let's do the biplot, which is a very busy chart, but if we zoom in on it, we might be able to see a little better what's going on. What we have is the first principal component across the bottom and the second one up the side. The red lines indicate approximately the direction of each individual variable's contribution to these, and each case is shown by its name about where it falls. Now, if you remember from the hierarchical clustering, the Maserati Bora was really unusual, and you can see it's up there all by itself. And then what we seem to have here is displacement and weight and cylinders and horsepower; this appears to be big, heavy cars going in this direction. Then we have the Honda Civic, the Porsche 914-2, the Lotus Europa; these are small cars with smaller engines, more efficient. These are fast cars up here and these are slow cars down here. So it's pretty easy to see what's going on with each of these in terms of clustering the variables. With the hierarchical clustering, we clustered cases; now we're looking at clusters of variables. And we see that it might work to talk about big versus small and slow versus fast as the important dimensions in our data, as a way of getting insight into what's happening and directing us in our subsequent analyses.
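A short sketch of those last two steps, under the same assumptions as before (the pipe comes from dplyr or magrittr):

    # Scores of each case on the nine components, rounded for readability
    predict(pc) %>%
      round(2)

    # Biplot of the first two components: cases labeled by row name,
    # variables drawn as red arrows showing their contributions
    biplot(pc)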