Clustering, i.e. the detection of groups of objects, is one of the basic procedures we use to analyze data. We can, for instance, discover groups of people according to their user profiles through service usage, shopping baskets, behavior patterns, social network contacts, use of medicines, or hospital visits. We can likewise cluster products based on their weekly purchases, genes based on their expression, or doctors based on their prescriptions. Complicated, right? Well, not really. Luckily, the most popular clustering methods rely on simple algorithms that are easy to grasp. In this and the following videos, I will introduce hierarchical clustering. But let me start with the data and the measurement of distances between data points.

I will use a data set on student grades, available through Orange's Datasets widget. Let me find it by typing "grades" in the filter box. Here it is. Let me examine it in the Data Table. Sixteen students were graded in 1, 2, 3... seven different subjects, including English, French and History. We would like to find students whose grades are similar. If I were a teacher, it would be interesting to know whether I have, for instance, students talented in different areas, so that I can adjust their training load appropriately. While for this task I would need to consider all the grades, I will simplify this introduction and consider only the grades from English and Algebra, constructing in this way a two-dimensional data set.

To select only specific variables from the data, I can use the Select Columns widget. I will ignore all features except English and Algebra. I can do this by selecting all the features, moving them to the Ignored column, and then dragging English and Algebra back to the Features column. Let me check the result with the Data Table. Right, English and Algebra are here. Maya, for instance, excels in English but struggles in Algebra, while it's the opposite for Olga. Since the data now has only two features, it is best viewed in the Scatter Plot.
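The Select Columns step can also be mimicked in a few lines of plain Python. This is just an illustrative sketch with made-up grades for two of the students, not a reproduction of Orange's internals:

```python
# Made-up grades for illustration; the real values live in Orange's grades data set.
students = [
    {"name": "Maya", "English": 91, "Algebra": 40, "History": 70},
    {"name": "Olga", "English": 35, "Algebra": 95, "History": 60},
]

# Keep only the two features we care about (plus the name label),
# just as Select Columns does when the rest are moved to Ignored.
selected = [
    {"name": s["name"], "English": s["English"], "Algebra": s["Algebra"]}
    for s in students
]

print(selected[0])  # {'name': 'Maya', 'English': 91, 'Algebra': 40}
```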
I will label the dots representing students with their names. Alright, I see Olga with a high Algebra grade on the top left and Maya in the opposite corner of the plot. They are really far apart and definitely would not be in the same cluster. On the other hand, George, Phil and Lea have similar grades in both subjects, and so do Jenna, Cynthia and Fred. The distances between these three students are small, and they appear close to each other on the scatter plot.

Oh, did I mention distances? There should be a way to measure distances formally. In real life, we would use a ruler and, for instance, measure the distance between Catherine and Jenna as the length of the line that connects them. But since Orange is a computer program, we need to tell it how to compute distances. Well, Catherine's grade in English is 20 and Jenna's is 39, so their English grade difference is 19. Catherine scored 71 in Algebra and Jenna 99; the Algebra grade difference is 28. According to Pythagoras, the distance between Catherine and Jenna is the square root of 19 squared plus 28 squared, which amounts to about 33.8.

We could compute the distance between every pair of students in this way, but Orange can do it for us. We will use the Distances widget and, for now, turn off normalization; the grades in English and Algebra are expressed in the same units, so there is no need for normalization here. We will keep the distance measure set to Euclidean, as this is exactly the one we have just defined for Catherine and Jenna. We can look at the distances in the Distance Matrix widget, label the columns and rows with student names, and find that the distance between Catherine and Jenna is indeed 33.8, as we computed before. We can also see that the distance between Nash and Demi is only 5. Going back to the Scatter Plot, we see that Demi and Nash are indeed positioned close to each other. Fine. Now we know how to compute distances in a two-dimensional space.
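The Pythagorean computation above is easy to check in plain Python. Catherine's and Jenna's grades are taken from the transcript; the helper function name is my own:

```python
import math

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two grade vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

catherine = (20, 71)  # (English, Algebra), as stated in the video
jenna = (39, 99)

d = euclidean(catherine, jenna)  # sqrt(19**2 + 28**2) = sqrt(1145)
print(round(d, 1))  # 33.8
```

This is the same measure the Distances widget applies to every pair of students at once, producing the matrix we inspect in the Distance Matrix widget.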
The idea of clustering is to discover groups of data points, that is, students, whose mutual distances are low. For example, Jenna, Cynthia and Fred look like a good candidate group, and so do Phil, Lea and George. And perhaps Henry and Anna. Well, we would like to find a clustering for our entire set of students, but how many clusters are there? And which clustering algorithm should we use? Well, all of this and a bit more in our next video on hierarchical clustering.
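The "low mutual distance" idea can be sketched by finding the closest pair of students, which is also the first pair a hierarchical clustering algorithm would merge. Only Jenna's grades below come from the transcript; the others are hypothetical values chosen for illustration:

```python
import math
from itertools import combinations

# (English, Algebra) grades; Jenna's are from the video, the rest are made up.
grades = {
    "Jenna": (39, 99),
    "Cynthia": (42, 95),
    "Fred": (36, 92),
    "Maya": (91, 40),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The pair with the smallest distance is the most natural first merge.
pair = min(
    combinations(grades, 2),
    key=lambda p: euclidean(grades[p[0]], grades[p[1]]),
)
print(pair)  # ('Jenna', 'Cynthia')
```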