In this video, we continue with hierarchical clustering. So far, we have reduced our example data to only two dimensions. We said we could use Euclidean distance to measure the closeness of two items. Then, we noted that hierarchical clustering starts by considering each data item as its own cluster and iteratively merges the closest ones. For that, we need to define how to measure distances between clusters. Intuitively, I used average linkage, where the distance between two clusters is the average of the distances between all pairs of their elements. Well, that was a long introduction. In my previous video, I simulated the clustering on a scatter plot, and the resulting picture was messy. This time, I promised something better.

Let me first explain it in the scatter plot. Again, I start with the data set, select the variables, and plot the data. Since Orange remembers all the widget settings from my previous video, I can develop the workflow just by stitching the widgets together. Let's check the contents of the widgets. In Datasets, I load course grades. I choose English and Algebra in Select Columns, and then visualize the data in Scatter Plot.

When I join two clusters, I can record the distance at which they merged and plot it in a graph. Say I join George and Leah. Their distance is about five. Here is my graph. Then, I could add Phil to the George-Leah cluster at a distance of, say, six. Then, Bill and Ian, at a distance of seven. And then, a little later on, I add Maya to the Phil-Leah-George cluster at a distance of 15. The cluster merging graph that I'm showing here is called a dendrogram. It visualizes the structure of hierarchical clustering. Note that dendrogram lines never cross: I start by merging the clusters closest to each other, and as I iteratively merge clusters, the merge distances grow larger and larger.

In my last video, I promised that I would now use Orange to perform hierarchical clustering and plot the results. So, here we go. I first need to measure the distances. We have done this already, two videos ago, when I used the Distances widget. Let me check its contents. I will use Euclidean distance, and, only for this data set (I repeat, only for this one), I will not normalize the data. Again, I can check the computed distances in the Distance Matrix widget.

I can now construct the hierarchical clustering. The widget receives the information about the distances, and here it is, the dendrogram. I will annotate the dendrogram branches with student names. Great. I remember that Leah, George, and Phil should be close, and that Maya and Eve joined this cluster later. Also, the distance between Bill and Ian is fairly small, but they are far from the cluster with George, Leah, and the others.

I can cut the dendrogram to expose the groups. Here, I cut it so that we get three clusters. So, where are they in the scatter plot? The Hierarchical Clustering widget emits the Selected Data signal. Here, I selected everything, so it should also include the information on the clusters. In the Scatter Plot, I will zoom out a bit to see the student names and color the students according to their assigned cluster. Great. Here they are. I will now minimize the Hierarchical Clustering and Scatter Plot widgets and put them side by side. I can now experiment with the number of clusters by placing the cutoff line at different positions. Here's an example with four clusters, where Bill and Ian are on their own. No wonder, as they are the only two performing well in both English and algebra.
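To make the mechanics concrete, here is a minimal sketch of what the Distances and Hierarchical Clustering widgets compute, written with SciPy rather than Orange's own API. The grade values below are invented stand-ins, chosen only to echo the structure described in the video; the real course grades data set comes from Orange's Datasets widget.

```python
# A minimal SciPy sketch of the workflow: Euclidean distances on
# non-normalized data, average linkage, and a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical grades (columns: English, Algebra) -- not the real data set.
names = ["George", "Leah", "Phil", "Maya", "Eve", "Bill", "Ian"]
grades = np.array([
    [25, 30], [28, 27], [30, 35], [40, 20], [45, 25], [85, 90], [80, 88],
])

# Pairwise Euclidean distances, without normalization (as in the video).
d = pdist(grades, metric="euclidean")

# Average linkage: the distance between two clusters is the mean of all
# pairwise distances between their members.
Z = linkage(d, method="average")

# Merge heights only grow, so the dendrogram branches never cross.
dendrogram(Z, labels=names)
plt.show()
```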
If I want five clusters, I see that Jenna, Cynthia, and Fred are on their own. How many clusters are there in our data? Well, that's hard to say. A dendrogram visualizes the cluster structure, and it is usually up to domain experts, in this case the teachers, to decide what they want.

We will discuss this a bit more in my next videos. I will also show you how to construct clusters on multi-dimensional data. Remember, everything I've done so far was on a two-dimensional data set. If all the data lived in two dimensions, all we would ever need would be scatter plots, and data mining could be done by hand. The extension of hierarchical clustering to multi-dimensional data is simple, so that's what's coming next.
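Returning to the hypothetical SciPy sketch from above: placing the cutoff line at different positions corresponds to SciPy's fcluster, which assigns each student a cluster label for a chosen number of clusters.

```python
# Continuing the sketch above: cut the dendrogram into k clusters,
# as the cutoff line does in the Hierarchical Clustering widget.
from scipy.cluster.hierarchy import fcluster

for k in (3, 4, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, dict(zip(names, labels)))
```

Note that nothing in these calls depends on the data having two columns; the same code runs unchanged on data with any number of dimensions, which is exactly why the extension to multi-dimensional data is so simple.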