Hierarchical clustering is not suitable for large data sets. For example, if I were a molecular biologist, I could easily be dealing with upwards of 30,000 genes at a time. To cluster these genes, I'd have to generate a distance matrix of 30,000 × 30,000 entries, almost a billion numbers. Storing all these numbers at 8 bytes apiece would take roughly 7 gigabytes of memory, close to half the RAM of my 16-gigabyte laptop before the clustering even starts, and hierarchical clustering needs still more working memory and a lot of computation on top of that. So clearly, I'm going to need another kind of clustering algorithm, one that doesn't require computing all the pairwise distances. One such algorithm is called k-means: it only computes the distances between points and the centroids of the emerging clusters.

So let's take a look at how this works in Orange. First, I'm going to need the Educational add-on: go to the Options menu, select Add-ons, and install it, then wait for Orange to restart. Now I'll open the Paint Data widget, draw some data, and feed this data set to the Interactive k-Means widget. For a real analysis I'd just use the regular k-Means widget, but the interactive one is better for demonstrating how the algorithm works.

Taking a look at our plot, we see some big squares, which represent our centroids. One thing about k-means is that it requires us to specify the number of clusters beforehand; here we start off with three randomly placed centroids. We could also move them around a little, like this, if we wanted to, but it shouldn't make a big difference. Each point is assigned to its nearest centroid. Take the red centroid, for example: we can see all its closest points in red, as well as the distances between them. The next step is to move the red centroid over to the center of the red points, and the same logic applies to all the other centroids. So I click Recompute Centroids and take a look at what happens.

Now I see that some points are closer to the green centroid, and some green points at the top are closer to the blue one. These points should be reassigned to their closest centroids, so I'll click Reassign Membership. Now I'm back to step one and want to move the centroids again. I'll keep clicking Recompute Centroids and then Reassign Membership until the centroids sit in the middle of their clusters and no more points need to switch clusters.

K-means alternates between these two steps: in the first step, it assigns each point to its closest centroid; in the second step, it recomputes the centers of these clusters. Repeating these two steps typically converges quickly, even for big data sets with millions of points. It usually takes just a few tens or hundreds of iterations.

Now let's add a bit more data, like this. In Interactive k-Means, I'll randomize the positions of my centroids, or maybe arrange them like this, and then alternate between recomputing centroids and reassigning membership until convergence. This looks okay, but one of the clusters seems a bit odd: there should be two groups here instead of just one. To break up this cluster, I have to start with four centroids, so I'll add an additional centroid by clicking anywhere on the plot. Running k-means again, I get four clusters, which look much better. But be aware that more clusters is not always better, and k-means can get stuck in suboptimal solutions because of bad initial centroid placement. So what we should be asking ourselves is: how many centroids should I start off with?
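Before we get to that question, it may help to see the two alternating steps spelled out in code. Here's a minimal sketch in plain NumPy, not Orange's own implementation; the `kmeans` function, its parameters, and the toy blob data are all just illustrative:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means sketch: alternate between assigning points to their
    nearest centroid and moving each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1 ("reassign membership"): label each point with the
        # index of its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 ("recompute centroids"): move each centroid to the
        # mean of the points assigned to it (keep it put if its cluster
        # happens to be empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: three 2-D blobs, like the points painted in Paint Data.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                   for c in [(0, 0), (4, 4), (0, 5)]])
labels, centroids = kmeans(blobs, k=3)
print(centroids)
```

The loop mirrors the two buttons in the Interactive k-Means widget: the `argmin` line is Reassign Membership, the per-cluster mean is Recompute Centroids, and the loop ends when the centroids stop moving.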
As for how many centroids to start with, I'll leave the answer to that question for our next video.