Now, the part of our introduction that maybe most of you were waiting for: modeling data. Because this is a very short introductory course, I'm really just giving a tiny overview of a handful of common procedures. In another course here at datalab.cc, we'll have much more thorough investigations of common statistical modeling and machine learning algorithms. But right now, I just want to give you a flavor of what can be done in R. We'll start by looking at a common procedure, hierarchical clustering, which is a way of finding which cases or observations in your data belong with each other. More specifically, you can think of it as the idea of "like with like": which cases are like other ones?
Now, of course, this depends on your criteria: how you measure similarity, how you measure distance. And there are a few decisions you have to make. You can do, for instance, what's called a hierarchical approach, which is what we're going to do. Or you can try to get a set number of groups; that number is called k. You also have many choices for measures of distance. And you have a choice between what's called divisive clustering, where you start with everything in one group and then split it apart, and agglomerative clustering, where the cases all start separately and you selectively put them together.
But we're going to try to make our life simple here, so we're going to do the single most common kind of clustering. We're going to use Euclidean distance as our measure. We're going to use hierarchical clustering, so we don't have to set the number of groups in advance. And we're going to use an agglomerative method: every case starts on its own, and the closest ones are gradually joined together. Let me show you how this works in R. What you'll find is that even though this may sound like a very sophisticated technique, and a lot of the mathematics is sophisticated, it's really not hard to do in practice.
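To make the idea of Euclidean distance concrete, here is a minimal sketch. The two observations `a` and `b` are made-up values for illustration, not part of the course data. It computes the distance by hand and then with base R's `dist()`, whose default method is Euclidean.

```r
# Two hypothetical observations measured on three variables (illustration only)
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# Euclidean distance by hand: square root of the summed squared differences
by_hand <- sqrt(sum((a - b)^2))

# The same distance from dist(), which defaults to method = "euclidean"
from_dist <- as.numeric(dist(rbind(a, b)))

by_hand    # 5
from_dist  # 5
```

With many observations, `dist()` returns every pairwise distance at once, and that is what the clustering routine works from.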
So what we're going to do here is use a data set that we use frequently. I'm going to load my default packages to get some of this ready, and then I'll bring in the data sets. We're going to use mtcars, which, if you recall, is the Motor Trend car road tests data from 1974. There are 32 cars in there, and we're going to see how they group: which cars are similar to which other ones. Now let's take a look at the first few rows of data to see what variables we have in here. You see we have miles per gallon, cylinders, displacement, and so on and so forth.
Not all of these are going to be really influential or useful variables, so I'm going to drop a few of them and create a new data set that includes just the ones I want. To do that, I'm going to come back here and create a new object, a new data frame called cars. This says it gets the data from mtcars. Leaving the space before the comma blank means use all of the rows. After the comma I'm selecting the columns: c, for concatenate, says I want columns one through four, skip five, take six and seven, skip eight, and then take nine through 11. That's how I'm selecting my variables. So I'm going to run that, and you see that cars is now showing up in my environment there at the top right. Let's take a look at the head of that data set. We'll zoom in on that one, and you can see it's a little bit smaller. We have miles per gallon, cylinders, displacement, horsepower, weight, quarter-mile seconds, and so on.
Now we're going to do the cluster analysis, and what we're going to find is that if we're using the defaults, it's super, super easy. In fact, I'm going to be using something called pipes, which come from the package dplyr, which is why I loaded it. The pipe is this thing right here, %>%. What it allows you to do is take the results of one step and feed them directly in as the input data to the next step. Otherwise this would be several different steps, but this way I can run it really quickly.
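The column selection described above can be sketched as follows. This is a reconstruction from the spoken description (columns one through four, six and seven, and nine through 11), so treat the exact call as an assumption rather than the on-screen code.

```r
library(datasets)  # ships with R and provides mtcars

# Blank before the comma keeps all rows; c(...) picks the columns to keep.
# Column 5 (drat) and column 8 (vs) are dropped.
cars <- mtcars[, c(1:4, 6:7, 9:11)]

head(cars)   # first six rows of the reduced data frame
names(cars)  # mpg, cyl, disp, hp, wt, qsec, am, gear, carb
```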
I'm going to create an object called hc, for hierarchical clusters. We're going to read the cars data that I just created. We're going to get the distance, or dissimilarity, matrix, which says how far each observation is in Euclidean space from each of the others. And then we feed that through the hierarchical clustering routine, hclust. That saves it into an object, and now what we need to do is plot the results. We're going to do plot(hc), my hierarchical cluster object, and then we get this very busy chart over here. But if I zoom in on it, and wait a second, you can see this nice little chart. It's called a dendrogram because it branches like a tree, although it looks more like roots here. Reading from the top, you can see they all start together and then they split, and split, and split.
Now, if you know your cars from 1974, you can see that some of these things make sense. So for instance, here we have the Honda Civic and the Toyota Corolla, which are both still in production, right next to each other. The Fiat 128 and the Fiat X1-9, well, they were both small Italian sports cars. They were different in many ways, but you can see that they're right next to each other. The Ferrari Dino and the Lotus Europa make sense to put next to each other. If we come over here, the Lincoln Continental, the Cadillac Fleetwood, and the Chrysler Imperial: it's no surprise they're next to each other. What is interesting is this one here, the Maserati Bora. It's totally separate from everything else because it was a very unusual, different kind of car at the time.
Now, one really important thing to remember is that the clustering is only valid for these data points, based on the data that I gave it. I only gave it a handful of variables, so it has to use those to make the clusters. If I gave it different variables or different observations, we could end up with a very different kind of clustering.
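The pipeline just described can be unpacked into separate base-R steps, which is roughly what it would look like without the pipe. This is a sketch under the assumption that `cars` is the nine-column subset of mtcars built earlier; the intermediate name `d` is introduced here for illustration.

```r
library(datasets)
cars <- mtcars[, c(1:4, 6:7, 9:11)]  # assumed subset, as described earlier

# Step 1: pairwise Euclidean distances between all 32 cars
d <- dist(cars)

# Step 2: hierarchical clustering on that distance matrix
# (hclust defaults to method = "complete" linkage)
hc <- hclust(d)

# With dplyr loaded, the same two steps read as one pipeline:
#   hc <- cars %>% dist() %>% hclust()

# Step 3: draw the dendrogram
plot(hc)
```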
But I want to show you one more thing we can do here with this cluster to make it even easier to read. Let me zoom back out. What we're going to do is draw some boxes around the clusters. We're going to start by drawing two boxes that have gray borders. Now I'm going to run that one, and you can see that it showed up. Then we're going to make three blue ones, four green ones, and five dark red ones. And then let me come and zoom in on this again. Now it's easier to see what the groups are in this particular data set. So we have here, for instance, the Hornet 4 Drive, the Valiant, the Mercedes-Benz 450SLC, the Dodge Challenger, and the Javelin, all clumping together in one general group. And then we have these other really big V8 American cars. What's interesting, again, is that the Maserati Bora is off by itself almost immediately. It's kind of surprising, because the Ford Pantera has a lot in common with it.
But this is a way of seeing, based on the information that I gave it, how things are clustered. If you're doing market analysis, if you're trying to find out who's in your audience, if you're trying to find out what groups of people think in similar ways, this is an approach that you're probably going to use. And you can see that it's really simple to set up, at least using the defaults in R, as a way of seeing the regularities, consistencies, and groupings in your data.
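The boxes described above can be drawn with `rect.hclust()`, which outlines the clusters you get when the tree is cut into k groups. The sketch below assumes the same `cars` subset and `hc` object as before, and the color names are guesses from the spoken description. The final `cutree()` step is an addition not shown in the walkthrough; it returns each car's group number so you can use the groups outside the plot.

```r
library(datasets)
cars <- mtcars[, c(1:4, 6:7, 9:11)]  # assumed subset, as described earlier
hc <- hclust(dist(cars))

plot(hc)                                    # the dendrogram must be drawn first
rect.hclust(hc, k = 2, border = "gray")     # two boxes
rect.hclust(hc, k = 3, border = "blue")     # three boxes
rect.hclust(hc, k = 4, border = "green")    # four boxes
rect.hclust(hc, k = 5, border = "darkred")  # five boxes

# Addition for illustration: group membership for the five-cluster cut
groups <- cutree(hc, k = 5)
table(groups)  # how many cars fall in each group
```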