Dear participants, welcome to the course on supply chain digitization. It is jointly taught by Professor Priyanka Burma, Professor Sushmita Narayana and Professor Devapratha Das from IM Mumbai. In this lecture, we continue with module 3, the analytics in supply chain management module. In the previous lecture we discussed how clustering is done; in this specific lecture, we will focus on the K-means clustering technique. We will learn how this technique is developed, what its underlying mechanism is, and how we can apply it to our particular case. As we were observing, consider what happens as we increase the number of clusters. Say I have only one cluster: then all 811 customers will be served by one particular DC, located, say, here. If we make two clusters, all the orange customers will be served by one DC and all the blue customers by another. With three clusters, the orange, green and blue customers each get their own DC; with four clusters, the orange, green, blue and red customers each get their own DC. So, as you see, if we keep increasing the number of clusters, what happens? Customer responsiveness becomes very high. Compare one cluster versus four clusters. With one cluster, all 811 customers are served by only one DC, so it takes time to serve them. I have to build up a huge inventory so that all customers can be served, so my inventory holding cost goes up, and my delivery time is longer because the customers are spread across the region but are served from a single distribution center.
Whereas if I have four clusters, the 811 customers are segmented into four regions. Region 1, the orange region, has DC 1, and similarly there are DC 2, DC 3 and DC 4. So instead of one DC, I have four distribution centers, each customer is served from its respective distribution center, and since each DC serves only a few customers, responsiveness will be high. That is the positive side; on the negative side, cost will also go up. So how do I decide how many clusters are optimum? And once I decide that, where should I locate my distribution centers, and so on? All of this can be answered using the K-means clustering technique. To summarize: if you keep increasing the number of DCs, the customers will definitely be served more quickly, but the cost of opening and operating these DCs will go up. So how do you decide how many DCs to have, where to locate them, and which DC serves which customers? We can use the K-means clustering technique, as we discussed at the end of the last lecture. Before we explain how, we should know what K-means clustering is. In the last class we gave you some idea; now we will explain technically what the K-means clustering algorithm is and how to build it step by step. We have summarized five steps here. Step one is to randomly select K initial centroids. I have 811 customers and these are their locations. Suppose I decide on K equal to 3, that is, I want 3 distribution centers to be built. Then I need to find the locations of these 3 distribution centers.
So the final aim is to get the locations of the 3 distribution centers, and then to map which customers go to which distribution center. First step: since K equals 3, I randomly select 3 initial centroids. You could take any 3 random points as centroids, but for explanation purposes I take this point as centroid 1, at (mu 1, eta 1); this point as centroid 2, at (mu 2, eta 2); and this point as centroid 3, at (mu 3, eta 3). These are my initial prospective distribution centers. As the clustering process proceeds, these locations may shift here and there, but we will eventually get the final DC locations. To start, though, we need the initial centroids: this is centroid 1, this is centroid 2, this is centroid 3. Second step: once we have the centroids, we group the data points into clusters based on their distances to the centroids. What do I have to do? From each point, I find the distance to each of these prospective distribution centers. Say this is point A: from A, I find the distance to centroid 1, to centroid 2 and to centroid 3. Similarly for another point B, and for another point C: from each of them, I find the distance to centroid 1, to centroid 2 and to centroid 3.
So from each of the 811 data points, I find the distance to centroid 1, centroid 2 and centroid 3; that gives 3 distances for each point. Whichever distance is minimum, I allocate that point to that centroid. In this case, A seems to be closest to centroid 1: I compute the distances from A to centroids 1, 2 and 3, and since A to centroid 1 is the least, A is allocated to centroid 1. For B, it is not visually clear whether B is closer to centroid 1 or centroid 3, but let us assume B is closest to centroid 3, so B is allocated to cluster 3. Similarly, C is allocated to cluster 2. So for each point, I find the distances to the 3 centroids, and the customer is allocated to the centroid at the minimum distance. How do you find the distance? We use the Euclidean distance, and this is its formula. The centroids are (mu k, eta k): in our case, (mu 1, eta 1), (mu 2, eta 2) and (mu 3, eta 3). The points are (x i, y i) for i = 1, 2, ..., 811. From each point (x 1, y 1), (x 2, y 2), ..., (x 811, y 811) I find the distance to each centroid. Take (x 1, y 1): the distance to centroid 1 is sqrt((x 1 - mu 1)^2 + (y 1 - eta 1)^2). Similarly, what is the distance to centroid 2?
It is sqrt((x 1 - mu 2)^2 + (y 1 - eta 2)^2), and the distance to centroid 3 is sqrt((x 1 - mu 3)^2 + (y 1 - eta 3)^2). Using this formula, for each of the 811 customers I find the distance to centroid 1, centroid 2 and centroid 3. So for each point I have 3 distances, and I allocate the point to the centroid at the minimum distance. Say for (x 1, y 1) the minimum is to centroid 1; then I allocate (x 1, y 1) to centroid 1. That is how the allocation, the grouping, happens. After doing this distance calculation, we find that out of the 811 customers, the orange customers are closest to centroid 1, the green customers to centroid 2, and the blue customers to centroid 3. This mapping is done based on Euclidean distance, and we get three clusters: cluster 1, cluster 2, cluster 3. The next step is to compute a new centroid for each cluster using only the data points within that cluster. For the orange data points, I need to find the centroid: is it still (mu 1, eta 1), or has it changed? So, having found the customer locations attached to cluster 1, how do I find the centroid? This is the formula. Suppose cluster 1, the orange region, has n observations: (x 1, y 1), (x 2, y 2), ..., (x n, y n). Then the centroid is ((x 1 + x 2 + ... + x n)/n, (y 1 + y 2 + ... + y n)/n).
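The distance-and-allocation step just described can be sketched in a few lines of code. This is a minimal illustration with made-up customer coordinates and three arbitrary initial centroids (the actual 811 locations are not given in the lecture):

```python
import numpy as np

# Hypothetical customer locations (x_i, y_i); the real case has 811 points.
points = np.array([[1.0, 2.0], [1.5, 1.8],
                   [8.0, 8.0], [8.5, 7.5],
                   [0.5, 9.0], [1.0, 9.5]])

# Three arbitrary initial centroids (mu_k, eta_k), as in step 1.
centroids = np.array([[1.0, 2.0], [8.0, 8.0], [1.0, 9.0]])

def assign_to_nearest(points, centroids):
    """Step 2: compute the Euclidean distance
    sqrt((x - mu_k)^2 + (y - eta_k)^2) from each point to every
    centroid and return the index of the nearest one."""
    # diffs has shape (n_points, k, 2): each point minus each centroid
    diffs = points[:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.argmin(axis=1)   # nearest centroid per point

labels = assign_to_nearest(points, centroids)
print(labels)  # → [0 0 1 1 2 2]: each customer mapped to its nearest centroid
```

Each row of `dists` holds the 3 distances for one customer, and `argmin` picks the minimum, exactly the allocation rule above.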
So the x coordinate of the centroid is the average of all the x i's, and the y coordinate is the average of all the y i's. Based on that, I get a new centroid, which may or may not be the same as (mu 1, eta 1). In the same way, I find the centroid of cluster 2, the green cluster: say it has p observations (x 1, y 1), (x 2, y 2), ..., (x p, y p); then the centroid is ((x 1 + x 2 + ... + x p)/p, (y 1 + y 2 + ... + y p)/p). Similarly, for the blue region, say I have q observations, and I find the new centroid using the same formula; it may or may not be the same as (mu 3, eta 3). Once step 3 is complete, I have 3 new centroids, one for each cluster. What do we do next? We find the distances from all 811 points to these 3 new centroids again, and based on the minimum distance to a centroid, we reassign all data points to clusters and update the clusters. Right now these appear to be the clusters, but once I recompute the distances, the clusters may be different. Then we keep repeating step 3 and step 4 until the centroids stabilize.
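Step 3, computing a new centroid as the coordinate-wise average of the points in a cluster, is a one-line calculation. A small sketch with made-up cluster points:

```python
import numpy as np

# Made-up points currently assigned to one cluster.
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]])

def new_centroid(cluster_points):
    """New centroid = ((x1 + ... + xn)/n, (y1 + ... + yn)/n),
    using only the points inside this cluster."""
    return cluster_points.mean(axis=0)

mu, eta = new_centroid(cluster_points)
print(mu, eta)  # → 2.0 3.0
```

The same function is applied separately to the orange, green and blue clusters, each with its own n, p or q points.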
That means the centroids are no longer moving: all three centroids stay at the same locations. And no data point moves from one cluster to another: if an orange point is in cluster 1, it remains in cluster 1; if a green point is in cluster 2, it remains in cluster 2; if a blue point is in cluster 3, it remains in cluster 3. When this happens, I stop the process. So those are my 5 steps. After repeating these steps again and again, until the centroids are stabilized and no data points are moving between clusters, I get the final set of clusters and the final assignment of data points to clusters. Those are the 5 steps of clustering. Now, as you have seen, if we increase the number of clusters from 1 to 2 to 3 to 4, fewer customers fall in each group, which means customer responsiveness will be better. But at the same time, the cost of opening and operating the DCs will go up. So how do we decide the optimum value of K? Is it 2? 3? 4? 5? 6? How many clusters are optimum, so that responsiveness is maintained and, at the same time, cost is minimized? For that, there is a systematic technique that is part of the K-means clustering algorithm. To understand it, we need to understand the within-cluster sum of squared errors. Say I have only one cluster with centroid mu, and points x 1, x 2, ..., x n. Then x 1 - mu, x 2 - mu, ..., x n - mu are the deviations from the centroid. That is what we have written: x - mu, the deviations. I take the square of each deviation and then sum them up.
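The five steps can be put together in one short routine. This is a minimal sketch of plain K-means (Lloyd's algorithm) on hypothetical points, not a production implementation: for instance, it does not handle the corner case of a cluster becoming empty.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Steps 1-5: pick k random data points as initial centroids, then
    alternate assignment and centroid updates until nothing changes."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        # Step 3: recompute each centroid from its own cluster's points only.
        new_centroids = np.array([points[new_labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 5: stop when no point changes cluster and centroids are stable.
        if labels is not None and np.array_equal(new_labels, labels) \
                and np.allclose(new_centroids, centroids):
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels

# Two well-separated pairs of hypothetical customers:
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
cents, labs = kmeans(pts, 2)
print(labs)  # the two nearby pairs end up in the same cluster
```

In practice one would use a tested library routine (for example scikit-learn's `KMeans`), but the loop above is the whole idea.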
So for each point, I find the deviation from the centroid, x - mu, square it, and sum over all points: that is called the within-cluster sum of squared errors. Now, if I have two clusters, I have two centroids, say mu 1 and mu 2. For the orange data points, say x 1 1, x 1 2, ..., x 1 p, I find the deviations x 1 1 - mu 1, x 1 2 - mu 1, ..., x 1 p - mu 1, square them and sum them. The same has to be done for the second cluster: say its points are x 2 1, x 2 2, x 2 3, ..., x 2 q; I find the deviation from each of these points to centroid 2, square them and sum them up. So if we do this, what do you expect: the within-cluster sum of squares for one cluster versus for two clusters? Obviously, if I increase the number of clusters, the within-cluster deviation will be less. If all points are in one cluster, the within-cluster deviation is large; with more clusters it is smaller, and with still more clusters it is smaller yet. So the within-cluster sum of squared errors keeps reducing. Instead of 4 clusters, if I have 5, 6, 7 and keep increasing, the within-cluster deviation keeps reducing. And finally, how many clusters can I have at most? Can you guess? 811, one per customer, because each point could itself be a cluster, and then there is no deviation within any cluster: the sum of squares becomes 0. So at one extreme I have an extremely large sum of squares but only 1 DC; at the other extreme I have a sum of squares of 0 but 811 DCs, each customer point being its own DC.
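The within-cluster sum of squared errors can be computed directly from the definition above. A small sketch on made-up points, comparing one cluster against two, to show how the value drops as clusters are added:

```python
import numpy as np

def wcss(points, centroids, labels):
    """Within-cluster sum of squares: squared Euclidean distance
    from each point to its own cluster's centroid, summed up."""
    diffs = points - centroids[labels]   # (x_i - mu, y_i - eta) per point
    return float((diffs ** 2).sum())

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])

# One cluster: the single centroid is the overall mean of all points.
c1 = points.mean(axis=0, keepdims=True)
print(wcss(points, c1, np.array([0, 0, 0, 0])))  # → 204.0

# Two clusters: each pair gets its own centroid, so the WCSS drops sharply.
c2 = np.array([[0.0, 1.0], [10.0, 11.0]])
print(wcss(points, c2, np.array([0, 0, 1, 1])))  # → 4.0
```

With 4 clusters here (one per point), the value would be exactly 0, which is the 811-DCs extreme described above.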
Of course, neither extreme is an optimal solution. This data can be represented in a diagram. On the x axis, I have the number of clusters: 1, 2, 3 and so on; I have plotted up to 9, but it can go up to 811 observations, at which point the value is 0. On the y axis, we have the within-cluster sum of squared errors. Now, if I have only 1 cluster, the sum of squares is very high, around 70. If I increase to 2 clusters, it reduces significantly, to around 35. From 2 to 3, it reduces again, to around 20; from 3 to 4, it reduces further. From 4 to 5 the reduction still happens, but not by much; 5 to 6 is very small, 6 to 7 minimal, and so on. So the major reductions happen from 1 to 2, then 2 to 3, and 3 to 4 also gives a good amount of reduction; 4 to 5 gives some reduction, but not that much, and 5 to 6 is very small. So if you observe, the initial increases in the number of clusters reduce the sum of squares significantly, but at some point, it could be 4 or 5, the marginal gain drops, giving the curve an angle similar to an elbow. If you look at the curve, this is where the elbow breaks: it seems to break at 4 or 5, and you can take either of these points. As per my observation, it is breaking at 4, but sometimes you could go up to 5, depending on the money you have at hand and your capacity to open DCs. Up to 4 DCs, I get a significant reduction in the sum of squared errors. The number of clusters indicated at this angle can be chosen as the most appropriate number of clusters, so the K value could be 4.
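The elbow curve itself can be reproduced by running K-means for several values of k and recording the within-cluster sum of squares (scikit-learn's `KMeans` exposes this as `inertia_`). Below is a self-contained sketch on synthetic points with three clear groups; the group locations and counts are made up purely for illustration:

```python
import numpy as np

def kmeans_wcss(points, k, seed=0, iters=50):
    """Run a basic K-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old centroid if a cluster happens to end up empty.
        centroids = np.array([points[labels == j].mean(axis=0)
                              if (labels == j).any() else centroids[j]
                              for j in range(k)])
    return float(((points - centroids[labels]) ** 2).sum())

def best_wcss(points, k, n_init=5):
    """Best of several random starts, analogous to sklearn's n_init."""
    return min(kmeans_wcss(points, k, seed=s) for s in range(n_init))

# Synthetic demand points: three tight groups of 30 customers each.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                   for c in ([0, 0], [8, 8], [0, 9])])

for k in range(1, 7):
    print(k, round(best_wcss(blobs, k), 1))
# The WCSS falls steeply while k is below the true number of groups
# (3 here), then flattens out: that bend is the elbow.
```

Plotting k against these values gives exactly the elbow diagram discussed above; the bend marks the candidate value of K.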
So that is how it plays out. Going from 1 cluster to 2 gives a significant reduction in the sum of squared errors, which means my responsiveness increases significantly with 2 DCs versus 1. From 2 to 3 DCs, the sum of squares again reduces significantly, so the within-cluster deviation is reduced and responsiveness again improves significantly. From 3 to 4, the reduction is smaller but still meaningful: with 4 clusters, the total within-cluster sum of squares has further reduced, which means I can serve my customers even more quickly. After that, from 4 to 5, the reduction in the sum of squares is not that great. If I increase my clusters from 4 to 5, responsiveness will definitely increase a little, but the marginal gain is small, while having one more DC will increase your fixed and variable costs of operation significantly. So you have to take a call: do you have 4 DCs or 5? We have to do the trade-off and find out whether 4 DCs are good enough. As per the elbow diagram, we are seeing that 4 DCs are optimum, but we could go up to 5 depending on the budget you have: with more budget, 5 DCs would improve responsiveness and service level a little compared to 4, but the cost will also go up. Now, if I want to find the exact optimum number of DCs, I need to take these inputs: this cluster output will serve as my input, because it tells me the prospective locations of the DCs.
So this cluster output serves as an input to an optimization model. I can then run the optimization model with 4 DCs as well as with 5 DCs, bring in the capacity constraints, demand constraints, the cost of transportation, the cost of hiring vehicles and so on, and find out where my distribution centers should be, which distribution center will serve which customers, and how to route my vehicles so that total cost is minimized. These clustering outputs serve as inputs to that optimization model if you go further and try to determine whether 4 DCs or 5 are optimum. With that, we will stop here. In the next class, we will see the coding and how these outputs can be derived. Thank you so much; I look forward to seeing you in the next class. Thank you.