In our previous video, we introduced k-means. With this clustering technique, we begin by randomly placing k centroids. Then, we iteratively assign each point to its closest centroid and move each centroid to the mean of its cluster.

Okay, I'll start off by painting three clusters. I want two of them nearby and one of them a bit further away, like this. Now, again, I'm going to use the interactive k-means widget from the Educational add-on so that I can show exactly what is going on in the background. I'll move the centroids so that one starts on the left and two of them start over on the right.

Okay, now I run the k-means algorithm and wait until it converges. Clearly, something is not right: the right cluster is split in two, and where I expected two clusters on the left, there's only one. So I move the centroids around a bit, still keeping one on the left and two on the right, and run the algorithm again, but it still doesn't look quite right. Just to make sure, I'll double-check the results in the scatter plot, but the resulting clustering really does seem to be wrong.

Clearly, k-means clustering depends on the initial placement of the centroids. This means that I probably shouldn't be placing them entirely at random; as we've seen, that can lead us to suboptimal clusters.

So, let's take a look at the painted data again. Maybe, in order to help k-means out, I can select the point that is farthest away from the rest of the points and make that my first centroid. Then, the second centroid could be the point most distant from the first, and in the same way I can choose a third centroid so that it's farthest away from both the first and the second.

Okay, now I'll open interactive k-means and initialize the centroids in this way. I run k-means and see that it converges very quickly, and this time I get the clusters that I expected. A quick look at the scatter plot confirms the correct clustering as well.
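For those following along outside the widget, the farthest-point initialization just described, together with the assign-and-update loop, can be sketched in a few lines of NumPy. This is only a sketch of the idea, not the widget's actual code; the helper names `farthest_point_init` and `kmeans` are mine.

```python
import numpy as np

def farthest_point_init(points, k):
    # Seeding as described above: start with the point farthest from all
    # the others, then repeatedly add the point that is farthest from
    # every centroid chosen so far.
    total = np.linalg.norm(points[:, None] - points[None], axis=2).sum(axis=1)
    centroids = [points[np.argmax(total)]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(points - c, axis=1)
                          for c in centroids], axis=0)
        centroids.append(points[np.argmax(nearest)])
    return np.array(centroids)

def kmeans(points, centroids, max_iter=100):
    # The loop from the video: assign each point to its closest centroid,
    # move each centroid to the mean of its cluster, stop on convergence.
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new = np.array([points[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

On three painted blobs, seeding this way puts one centroid in each blob, so the loop converges in a handful of iterations, just as in the widget.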
Now, initializing centroids to points far away from each other helped us obtain the right clusters in this situation. However, it's not a surefire method: sometimes there are tricky outliers in our data that can mess this all up. To get around this, common centroid placement algorithms like k-means++ randomly choose centroids from a few candidates. Then they repeat the clustering multiple times and report only the best result. Orange's k-means widget, the one we should actually be using for clustering, uses this type of initialization and reruns the algorithm a number of times. Now remember, we only used interactive k-means so we could visualize how clustering works in two dimensions. Okay, running the proper widget on our painted data, we again find three well-separated clusters. So, if k-means runs multiple times, how does it know which result to keep? Also, what happens if we start to consider other possible values of k? Is there an algorithm that can guess the correct number of clusters? Well, you'll just have to wait and find out in our next video.
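The rerun-and-keep-the-best idea can be sketched like this. Again, this is only a sketch and not Orange's actual implementation: the names `plusplus_init`, `kmeans_once`, and `best_of_runs` are mine, the seeding shown is a simplified k-means++ (real implementations add refinements such as sampling several candidates at each step), and scoring runs by within-cluster sum of squared distances is one common way to pick the "best" result.

```python
import numpy as np

def plusplus_init(points, k, rng):
    # Simplified k-means++-style seeding: first centroid uniformly at
    # random, each following one drawn with probability proportional to
    # its squared distance from the nearest centroid chosen so far.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids],
                    axis=0)
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans_once(points, k, rng, max_iter=100):
    # A single k-means run from a fresh seeding; an empty cluster simply
    # keeps its old centroid.
    centroids = plusplus_init(points, k, rng)
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = np.argmin(dists, axis=1)
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Inertia: within-cluster sum of squared distances, lower is better.
    inertia = float(np.sum((points - centroids[labels]) ** 2))
    return labels, inertia

def best_of_runs(points, k, n_runs=10, seed=0):
    # Rerun the algorithm several times and keep the run with the lowest
    # inertia -- the "best result" mentioned above.
    rng = np.random.default_rng(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_runs)),
               key=lambda result: result[1])
```

Because each run starts from a different seeding, a single unlucky initialization no longer decides the final clustering; with a handful of restarts, at least one run almost always lands in the good solution.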