In this video, we will describe what K-means clustering is. K-means belongs to the centroid-based category of clustering techniques; recall that we saw two major categories, centroid-based and connectivity-based, and K-means is a centroid-based algorithm. Let us look at this data. It is the response of, say, 200 students to a survey with 10 questions, and we are plotting something like the total or average score each student gave in a scatter plot. Ignore the actual numbers on the x-axis and y-axis; if a number like 0, 1, 2 appears, consider it a student ID. We want to see whether there are clusters among these students: is there a pattern, and can we group them? Looking only at the scatter of points, how many clusters can you form from this data, if you want to group the students into two, three, or four groups? So, let us say we place two random points on this figure. Two random points means I am trying to create two clusters from this plot. What happens is the algorithm identifies, for every data point, the nearer of the two centers: all the points nearest to one center are considered one cluster, and all the points nearest to the other center are considered the second cluster. After creating the two clusters, it computes the mean of the values in each cluster, and that mean becomes the new centroid of that cluster.
So, it then iterates further until it reaches a convergence point, and that gives us the two clusters. Let us see how these two random points move: the two points have moved here, so this is one cluster and this is the second cluster. Now let us see what this distance is and how it is calculated. Consider this data, which we have seen in previous classes: students' attendance and students' midterm marks, plotted in a scatter plot. We do not know the end-semester marks or anything else. We would like to cluster the students, because this is unsupervised learning and we are not trying to predict a label. We want to understand what is happening with the students who are getting low scores: does a pattern exist between attendance and midterm marks? We do not know, but we have a hypothesis that there might be one, so let us check. These are the student points, and I randomly selected 3 points because I said 3 is my number of clusters: I want to create 3 clusters, so I randomly selected 3 cluster points, point 1, point 2, point 3. Remember that I have only 2 variables here; if we had more than 2 variables, say 3, 4, or 5, we could not plot the data and eyeball how many clusters to expect. So, let us work with an example of 2 variables in this plot and see how the clustering works. Each of these random points is a centroid; since k is 3, there are 3 clusters, 1, 2, 3, and each cluster's center is called the cluster centroid. Each cluster centroid measures the Euclidean distance to all the data points in the figure.
So, for a given data point, the algorithm measures the Euclidean distance to each of the centroids. If the distance to one centroid is large and the distance to another centroid is small, that point is assigned to the nearer centroid's cluster, not the farther one. Similarly, this data point is much nearer to this centroid than to that one, so it joins this cluster; and in the same way every point goes to whichever centroid it is nearest to, forming the clusters. So, what is Euclidean distance, and how do we compute it? The Euclidean distance between a centroid and a data point is computed by this formula, where c is the centroid: consider these as c1, c2, and c3, and the data points as x1 to xn, where n is the number of data points you have. You have to compute the Euclidean distance from each of x1 to xn to each of c1, c2, c3. In a plot, the Euclidean distance is simply the length of the vector between the two points; if you know how to find a vector's length, it is simple. Why the squaring? Because a difference can be negative, and squaring removes the sign; that gives the squared Euclidean distance. Let us see how to compute the Euclidean distance between two points. There is a point here and another point here. The Euclidean distance is the straight-line, minimum distance between the two, and to compute it we use a right angle: we know the separation along the x-axis and along the y-axis, because we know the values x1, x2, y1, and y2, and these form a right-angled triangle.
Now, if you know the lengths of the two perpendicular sides, x2 minus x1 and y2 minus y1, you can compute the hypotenuse: the distance is the square root of (x2 - x1) squared plus (y2 - y1) squared. That is how you compute the Euclidean distance; it is simple to do. If you have not come across the term Euclidean distance before, check it out: it is very simple to compute and an intuitive way to measure distance. There are other distance measures used in clustering, not just Euclidean distance, such as Manhattan distance, but let us use Euclidean distance for this clustering technique. So, each centroid computes the Euclidean distance to all the points, and each point is assigned to whichever centroid it is nearest to. So, let us look back at our three centroids. These three points are near this centroid, these points are near that centroid, and these points are near the third centroid; based on those distances, the three clusters form. After that, the new centroid of each cluster is computed as the average of its points. Using the average of these four points, the new centroid is computed and C2 moves here; similarly, the average of these three points moves C1 here; and the average of these six points moves C3 here. So, the new centroids of the three clusters are C1 here, C2 here, and C3 here. That is how the three clusters are formed.
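The right-triangle computation above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original lecture; the two example points (attendance, midterm marks) are made up:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two students plotted as (attendance, midterm marks)
a = (80, 65)
b = (60, 50)
# The perpendicular sides are 20 and 15, so the hypotenuse is sqrt(400 + 225) = 25
print(euclidean_distance(a, b))  # 25.0
```

The same function works for any number of variables, which is exactly why the algorithm still runs when you have more dimensions than you can plot.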
So, the first step is to compute the Euclidean distance from each centroid to all the data points. After finding the data points nearest to each centroid, you compute the new centroid from that cluster's values, that is, their mean. Now the centroid has moved. After moving it, you again compute the Euclidean distances from the centroids to all the data points. Obviously, these three points are still closest to this centroid, so this is one cluster; these points are closest to that centroid; and these to the third. So, three clusters have formed so far. Now, if you try to compute the mean of these three points again and move the centroid, it does not move: the mean has already been computed and the centroid is already at the central place. If the centroids are not moving, you stop iterating. That is the stopping condition: there is no change in the centroid positions, or no data point moves from one cluster to another. If that never quite happens, you can set a small threshold: if a centroid moves less than this threshold, it is okay to stop, because in some cases we do not get perfectly stable clusters from the K-means algorithm, so you allow one or two points to keep moving and stop anyway. Or you can say, I will iterate at most 100 times and not go beyond that. So, you can stop based on a number of iterations, a movement threshold, or no change in the centroids at all. And with that, clusters one, two, and three are formed. I hope you now see what K-means clustering is: it is trying to find the means of the data points around the centroids, however many k we choose. Why don't you go ahead and list the steps we saw in the clustering operation in the previous slides?
List the steps: what is the first step, in your own words? It does not matter exactly how you write it. For example: first, randomly assign k cluster centroids; then compute the Euclidean distances; then move the centroids. Write down how you move the centroids, and after writing it down, resume the video to continue. I hope you have written down the K-means algorithm; here it is. The first step is to select the value of k. The k can be 2, 3, or anything; which number is good, and why, we will talk about shortly. After selecting the value of k, randomly assign the k points: the initial centroids are randomly assigned, and that is key, so remember it. Say I select k equal to 3; then I randomly place these 3 centroids, maybe in one corner, or spread randomly across the chart. After that, compute the Euclidean distance from each of the k centroids to all the data points, and form the clusters: all the points nearest to a given centroid make up that centroid's cluster. After you create the clusters, compute each cluster's centroid, that is, the mean or center value of all the data points in the cluster; the centroid moves to that average, and that is the new k point. Note that the number of centroids stays the same; k is not reducing, we are just moving the centroids around. Then repeat this process again until the change in the error value is very small, there is no change at all, or you hit your limit of, say, 50 or 100 iterations.
So, repeat the two steps, computing the Euclidean distances and recomputing the centroids, again and again until the centroids stop changing or the number of iterations is reached. This is the K-means algorithm; these are the basics. If you wrote this down, you have essentially already understood the K-means algorithm. If not, please watch the video again, or check resources on the internet; there are some good simulators available that show how the k centroids move and how the distances are computed. Now we have to see how to select the correct value of k. I said k can be 2, 3, or 4; which number should you pick? To select the correct k, you first have to define the error function, the objective. Let the objective, the error function, be J. It is based on the distances between each centroid and its data points after you complete the iterations, once you have settled on, say, 3 clusters and nothing is changing anymore. Suppose, for example, this is a centroid, this is a centroid, and this is a centroid, so there are 3 centroids, and each centroid has some data points in its cluster; this cluster might have just 3 data points. For each cluster, you compute the distances between the centroid and its data points and sum them: the sum of distances from C1 to its points, plus the sum from C2 to its points, plus the sum from C3 to its points. So, the number of data points in the first cluster here is 3, while this centroid Ci has 4 points.
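The steps listed above can be sketched as a small from-scratch Python implementation, under the lecture's assumptions: random initial centroids, Euclidean distance, mean update, and stopping when the centroids stop moving or a maximum iteration count is reached. The function and variable names here are my own, not from the lecture:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means on a list of numeric tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # steps 1-2: k random data points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 3: assign each point to its nearest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # step 4: recompute each centroid as the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(v) / len(cluster) for v in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid in place
        if new_centroids == centroids:              # stop: no change in centroid positions
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of student-like points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```

For this toy data the algorithm settles on one centroid near each group regardless of which points the random start picks, which is the convergence behavior described above.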
So, here the cluster sizes are 3, 4, and 5 points. You iterate over the clusters: for cluster 1, that is i equal to 1, compute the sum of squared distances between the centroid and its data points; then add the same quantity for cluster 2 and cluster 3. Summing all of these gives the objective function J. Our objective is to make this total within-cluster distance as small as possible. Notice that if you create more clusters, the distance might reduce even further. So, a quick question: consider I have 3 plus 4 plus 5, that is, 12 data points. What happens if I choose k equal to 12? Check carefully: there will be 12 clusters, which means each cluster center is itself a data point, and the distance between each data point and its cluster center is 0. If you add up the errors of all these values, the total will be 0. So, if you choose k equal to the number of data samples, J will be 0. If instead I choose only one cluster, one centroid for everything, J will be at its maximum; we cannot say the exact value, since it changes with the data points, but that is the maximum of the error function. So, what you have to do is compute this objective J for different values of k: k equal to 1, k equal to 2, k equal to 3, and so on. If I plot the sum of squared errors, that objective function J, against the number of clusters, then for k equal to 1 you get the maximum value, and as k grows toward the 12 points, the value falls toward 0.
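In formula form, the objective described above is J = sum over clusters i of the sum, over each point x in cluster i, of the squared distance from x to that cluster's centroid ci. A minimal sketch (the example clusters and points are made up, not from the lecture's data):

```python
import math

def sse(clusters, centroids):
    """Objective J: sum of squared distances from each point to its own centroid."""
    return sum(
        math.dist(p, c) ** 2
        for cluster, c in zip(clusters, centroids)
        for p in cluster
    )

# One cluster of two points with the centroid at their mean: J = 1^2 + 1^2 = 2
print(sse([[(0, 0), (2, 0)]], [(1, 0)]))  # 2.0

# k equal to the number of points: every centroid sits on its point, so J = 0
print(sse([[(0, 0)], [(2, 0)]], [(0, 0), (2, 0)]))  # 0.0
```

The second call is exactly the k-equals-12 situation from the text: one point per cluster drives J to zero, which is why minimizing J alone cannot tell you the right k.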
So, if I compute J for each k, I can plot it as a curve. How many clusters should we consider; which k should I pick? That is decided from this curve, which is called the elbow curve: there is a bend here, shaped just like your elbow, and the elbow point is considered to be the optimal k. For this particular data, k equal to 3 is the best choice, and here is the reason. Going from k equal to 1 to k equal to 2, the sum of squared errors drops from, say, 80 to 55, a difference of 25; from 2 to 3 the difference is about 30; but from 3 to 4 the difference is just 5. The reduction in the sum of squared errors becomes really small from 3 onward and stays small going forward, so we do not pick point 4 or beyond. And do not pick k equal to 12, where every data point is its own cluster; that makes no sense. Pick the k at the elbow point, so k equal to 3 here. Does the elbow point always work? Check it out: create your own data, try clustering it, and see whether the elbow point works or not. So, now I hope you know what K-means clustering is and how to pick the right k value. If you have understood, can you list 2 drawbacks of K-means clustering based on your understanding? You might have a lot of questions, such as why k is chosen this way; from those, what are the drawbacks? List 2 of them, and after listing, resume the video to continue. The first is the initial centroids: I said you pick k equal to some value, say 2 or 3, and the initial centroids are assigned randomly.
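The diminishing-returns argument above can be checked numerically. The J values below are hypothetical, chosen to match the lecture's example numbers (80 to 55 to 25, then only small drops); the elbow rule here is a simple sketch, not a standard formula:

```python
# Hypothetical J values per k, in the spirit of the lecture's numbers
sse_by_k = {1: 80, 2: 55, 3: 25, 4: 20, 5: 17, 6: 15}

# How much SSE each additional cluster saves
drops = {k: sse_by_k[k - 1] - sse_by_k[k] for k in range(2, 7)}
print(drops)  # {2: 25, 3: 30, 4: 5, 5: 3, 6: 2}

# A crude elbow pick: the last k before the savings become negligible
elbow = min(k for k, d in drops.items() if d < 10) - 1
print(elbow)  # 3
```

Adding the fourth cluster only saves 5, while the third saved 30, so k equal to 3 sits at the bend of the curve.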
That is the tricky part, because where we seed the centroids can produce different clusters. With 2 variables it is easy to see, but with 3 or 4 variables it is really tough. K-means also does not work well for categorical data or for non-linear data sets, though there are techniques to handle that. So, what about the initial centroids? Suppose the points look like this. If I choose 3 clusters and my initial centroids are here, compared to over here, compared to here, then based on where the initial centroids start, the algorithm tries to converge to a particular solution. With 2 variables it is easy to see where to place them, but consider 4 dimensions or more: where you choose the initial centroids matters a lot. To avoid this problem, you can randomly select the initial points and repeat the run multiple times. What I mean is: for k equal to 2, the 2 random starting points might land somewhere here when you assign them; so run k equal to 2 multiple times. Run the same clustering on the same data, say 3, 4, or 5 times, and check whether the clusters, or the J value, come out the same; then you can stop. Similarly, for k equal to 3 run 10 times, and for k equal to 4 run 10 times. For each k you have to run K-means multiple times to find the right clusters and centroids. That is, within K-means clustering, take k equal to 2 and run the algorithm, say, 10 times; similarly k equal to 3, 10 times; then take the J value that appears most consistently, or the minimum value, or the mean value, something like that. Then you plot the elbow curve with those values and pick the right k. That is the best option we have to avoid the initial-centroid issue in K-means clustering.
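The multiple-restart idea described above can be sketched as: run K-means from several random seeds for the same k and keep the run with the smallest J. This is a self-contained toy version with names of my own choosing; libraries such as scikit-learn build this in via a parameter like `n_init`:

```python
import math
import random

def kmeans_once(points, k, seed, max_iters=100):
    """One K-means run from one random initialization; returns (J, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centroids[i]))].append(p)
        # recompute each centroid as its cluster's mean
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:        # converged: centroids no longer move
            break
        centroids = new
    j = sum(math.dist(p, c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)
    return j, centroids

def best_of_restarts(points, k, n_restarts=10):
    """Run K-means n_restarts times and keep the run with the minimum J."""
    return min(kmeans_once(points, k, seed) for seed in range(n_restarts))

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
j, centroids = best_of_restarts(points, 2, n_restarts=5)
print(j, centroids)
```

Picking the restart with the minimum J is the "pick the minimum value" option from the text; comparing how often the same J recurs across seeds corresponds to the stability check.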
So, in this video we saw what K-means clustering is, and we discussed how to select the right number of clusters k using the elbow curve. I hope you understood K-means clustering; if you did not get it from this video, I recommend you check videos on the internet, where there are a lot of simulations that explain K-means clustering. I am not going into the mathematics behind the centroid computation and everything, because that is not needed for this course. The idea for this course is that if you have data, you can apply the K-means algorithm, you understand which k to pick, and you know what the K-means algorithm and this kind of clustering mean. If you understand what the K-means algorithm is and how to pick the number of clusters, that is enough for this course. But if you want to know more about the K-means algorithm, please refer to the material on the course website. Thank you.