K-means clustering depends on the parameter k, so we have to find a way to evaluate which k works best in each situation. To do this, we have to be able to compare different clusterings with each other. So, how do we score a clustering? Well, we can assume that the best clustering is the one where all the points are well clustered. And we already know how to score how well a point fits its cluster: with silhouettes. We talked about these a couple of videos ago, so you can go back and check that out as well.

Moving on, let's see how we can use silhouette scores together with k-means. As always, we're going to need some data, so I'll go ahead and draw a couple of clusters. Next, I'll send this data to the k-Means widget. I see that by default it looks for three clusters, and on the output we find two extra columns: one indicating each point's cluster, and the other showing its silhouette score. Now, we want to boil this down to one number, and to do this we can use the Box Plot widget, for example, to compute the average silhouette score. In this example, we get a score of about 0.72.

Now, I can put k-Means and the Box Plot side by side to see how choosing different settings affects the average silhouette score. Doing this, I can fine-tune the parameter k to find the best clustering of my data. So, for a k of three, we got a silhouette of 0.72. Increasing the number of clusters to four, the score decreases to 0.68. Further increasing the number of clusters to five gives a silhouette of 0.64, and at six clusters we stay at more or less the same score, 0.65. Now I can be fairly confident that k really should be three, so I'll change it back and check how the clusters look in my Scatter Plot. That seems about right.

I do want to double-check, though, whether this type of scoring with silhouettes really works. So, I'll add some more points to my data; other than that, my workflow stays the same.
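The video does all of this with Orange widgets, but the same score can be computed in plain Python. As a rough sketch (scikit-learn is an assumption here, not part of the video's workflow): cluster the data with k-means, then average the per-point silhouettes, which is exactly what `silhouette_score` returns.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the painted data: three blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Cluster with k = 3, then score the result with the average
# silhouette over all points (higher means better-separated clusters)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(round(silhouette_score(X, labels), 2))
```

The exact number depends on the painted points, of course; the point is only that one clustering boils down to one score we can compare across settings.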
Now, I can repeat my procedure, starting with three clusters. Three clusters results in an average silhouette of 0.66, four clusters yields 0.68, and five gives 0.7. Then, at six, the score decreases back down to 0.68, and if I try even more clusters, it drops lower still. So, I'll set k to five, as that resulted in the highest silhouette, and see what I get in my Scatter Plot. Okay, that seems reasonable; I'm fairly happy with the results.

But now that I've done this process manually once, I'd really prefer it were done automatically for me in the future. Luckily, k-Means can do this for us. Instead of setting a fixed number of clusters, I can choose a range, say between two and eight. This way, the k-Means widget gives me the average silhouette score for each k in my range, and again we find that the best value of k is five. We can also see that the widget has selected the optimal value for us, so we don't need to worry about picking it by hand each time.

Now, we can play around a bit with the Paint Data widget and k-means clustering to see where this process works well, and what I have to do to break it. Maybe I'll start off by painting a little circle. Now, I'll add another one to the side, and maybe one more underneath them. But what if I want to finish this painting and make it a happy little face? Well, that didn't really turn out great. As it turns out, k-means likes evenly sized, spherical clusters. It doesn't work well for elongated shapes, as it tends to split them up into multiple clusters. I can show you another example of this: I'll just draw one big lump of points, and you'll see that k-means wants to split it up.

The Paint Data and k-Means widget combination is great for experimenting with clustering and getting a gut feeling for when it works and what its shortcomings are. But now it's time to stop playing around in two dimensions and take a closer look at some multi-dimensional data.
You'll see that in our next video.