Cluster analysis is a form of exploratory data analysis. It's sometimes called data mining when you're working with large data sets, and the idea is to identify relationships in those data sets. Learning from data can be either supervised or unsupervised. What we're going to do in this course is unsupervised: we just let the computer try to figure things out on its own. In supervised learning, by contrast, we give the computer labeled examples and have it compare new data with those examples. The form of cluster analysis we're going to use is called K-means clustering.
Cluster analysis is really just gathering a bunch of objects and separating them into groups of similar objects. By exploring these groups and determining how they're similar and how they're different, you can learn a lot about that big pile of data, and hopefully that leads to insights that will help you make better decisions.
The K in K-means is just the number of groups, the number of clusters, that we're interested in. We're going to put the points in the data set, the points in space, into these K groups, and we'll define each group by a center point. Remember centroids from geometry? A group's centroid is its mean location, and the distance of each individual in the group from that centroid tells us something we can use. If we can minimize the distance from the individuals to the center of their group, then we know that people are in the best group, and that gives us the optimum result.
I'm going to use an example from a great book, Data Smart, by John Foreman. It's about a middle school dance. I think we can all remember those dances where the kids filed into the gymnasium and then separated into groups, usually boys on one side, girls on the other, and the counselors, monitors, and teachers off to another side. That's a simplistic model, but we can use it to help us learn about K-means cluster analysis.
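To make the centroid idea concrete, here's a minimal Python sketch. It isn't from the video or the book: the positions are made-up numbers, and the helper names `centroid` and `total_squared_distance` are just illustrative. One assumption to flag is that it measures squared distance, which is what standard K-means minimizes and what the mean location is guaranteed to minimize.

```python
from statistics import fmean

def centroid(points):
    """Mean location of a group of (x, y) points."""
    xs, ys = zip(*points)
    return (fmean(xs), fmean(ys))

def total_squared_distance(points, center):
    """Sum of squared straight-line distances from each point to the center."""
    cx, cy = center
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)

# Hypothetical positions of four students on the gym floor (made-up numbers).
group = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]

center = centroid(group)
print(center)                                      # (2.0, 2.0)
print(total_squared_distance(group, center))       # 8.0
print(total_squared_distance(group, (0.0, 0.0)))   # 40.0, a worse center
```

Any center other than the mean location gives a larger total, which is why the centroid is the natural point to define a group by.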
Remember, we're going to let the computer do this analysis for us, and the computer has to start somewhere. Because we want a K-means analysis with K equal to 3, it starts out by putting three dots into the gym. It places the dots randomly, and each dot represents the centroid of a group. Now, this doesn't look very good. If we get the computer to add boundary lines, where every point on a line is the same distance from the two nearest centroids, you'll see that we've got three odd-shaped areas: a red area up here that holds a bunch of people but leaves others out, an uncolored area in the middle that has most of the people in it, and a green area down here in the lower left that has just the foot of one person in it. So obviously this is not the optimum location for those centroids.
So the computer tries again. It randomly moves the centroids around and then solves some basic geometric equations, trying to minimize the distance from the individuals we have here, the students, to these centroids. This is a little better, but it's still not necessarily the optimum location for those centroids. We humans would know right away that we should put our centroids in the center of the groups, but it takes the computer a while to get there using optimization. Eventually, if we give it the right information, it will figure out where to put the centroids so that people are optimally grouped.
And of course, once we have them grouped, we can start comparing the characteristics of the people in those groups to learn things. For example, the people in the green group turn out to be all female students. The people in the red group turn out to be all male students. And the people down here along the bottom wall of the gym are the teachers, which is the basic solution we probably would have guessed to begin with. But we can use the K-means clustering technique to analyze large volumes of data and help us make good business decisions.
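Here's a rough Python sketch of that try-again loop on the dance example. Note the swap: instead of randomly moving the centroids as described above, it uses the standard K-means update, where each person joins the nearest centroid and each centroid then moves to the mean location of its group. The coordinates, the starting centroids, and the function names are all made up for illustration.

```python
import math
from statistics import fmean

def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean location of the points assigned to it."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            xs, ys = zip(*members)
            moved.append((fmean(xs), fmean(ys)))
        else:
            moved.append(old)  # a group with no members keeps its old centroid
    return moved

def kmeans(points, centroids, max_iter=100):
    """Alternate assign and update until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
    return centroids, assign(points, centroids)

# Hypothetical gym-floor positions (made-up numbers, not data from the book).
girls = [(1, 8), (2, 9), (2, 7)]
boys = [(8, 8), (9, 9), (9, 7)]
teachers = [(4, 1), (5, 2), (6, 1)]
people = girls + boys + teachers

# Rough starting guesses for K = 3; none of them sits at a group's center.
start = [(5, 0), (5, 5), (10, 10)]

centroids, labels = kmeans(people, start)
print(labels)  # [1, 1, 1, 2, 2, 2, 0, 0, 0]
```

The starting positions matter. A poor start can settle on a worse grouping, so in practice it's common to run the loop from several random starts and keep the result with the smallest total distance.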
We'll talk about that more in the next video.