Dear participants, welcome to the course on supply chain digitization. This course is jointly taught by Professor Priyanka Parma, Professor Suspita Narayana and Professor Devapratadas from IIM, Mumbai. In the last two lectures, we discussed a case study. So, what was the case study? It was about where to locate distribution centers (DCs). A pharma company was planning to expand into a new region, where it has a total of 811 customers. The questions were which customers will be served from which DC, and how many DCs there should be, so that responsiveness is maximized and the cost of managing the DCs, as well as the transportation cost, is minimized. That was the brief of the case study. We found that, using the K-Means clustering algorithm, we can get some idea of where the DCs should be located, how many DCs are optimum, which customers will be mapped to which DC, and so on.
So, in this lecture, we will see how that model can be developed using Python. We have not only discussed the technique, we will also see how you can redo it on your own. We will share the data set with you, so you can practice and see how to get the same output as us. As a first step, we have to import the data. How do we import it? As we saw in the previous Python hands-on class, we need to import a library called pandas, which is a data manipulation and analysis library. We import it as pd, and then I create a data frame, df, by reading the file customerlocation.csv.
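The import step described above can be sketched roughly as follows. This is only an illustration: the three sample rows and the column names (Sl_No, Latitude, Longitude) are assumptions standing in for the real 811-row file, which is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical stand-in for customerlocation.csv; the real file has 811 rows.
# The column names are assumptions, not taken from the course file.
sample_csv = io.StringIO(
    "Sl_No,Latitude,Longitude\n"
    "1,27.68,80.90\n"
    "2,27.42,81.15\n"
    "3,27.31,80.83\n"
)

# With the real file this would be pd.read_csv("customerlocation.csv").
df = pd.read_csv(sample_csv)

print(df.shape)  # (number of rows, number of columns)
print(df.head())
```

With the actual data set, the shape printed would be 811 rows by 3 columns, as described in the lecture.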
So, if you look at the right-hand side, I have 811 rows and 3 columns: the serial number, from 1 to 811, in the first column, latitude in column 2 and longitude in column 3. So for all 811 customers I have the latitude and longitude, and I also have the serial number: serial number 1 means customer 1, serial number 2 represents customer 2, serial number 3 represents customer 3 and so on. If I write this command and run the Python code, the CSV file is imported and stored as the data frame df.
Now, once I have this data frame, since I have only two variables, latitude and longitude, I can easily get a graphical view, and that is what we have done here: the latitude and longitude of all 811 customers are plotted. Plotting the data is always good for visualization: I can see how the customers are dispersed, how close they are, and which customers are close to which customers. With two variables I can easily plot; with three I can make a 3D plot; beyond three, a direct plot is not possible. But whenever you get an opportunity to plot, you ideally should, so that you can visualize the whole data set.
So, how do we plot the data? First I import a library called NumPy as np. What is NumPy? It is Numerical Python, a library for mathematical and logical operations on arrays. Then, from the Matplotlib library, I import pyplot as plt, and I import another library called seaborn as sn. So what is the Matplotlib library?
Matplotlib is a plotting library for the Python programming language. Seaborn is another data visualization library, based on Matplotlib; it provides a high-level interface for drawing attractive and informative statistical graphics. In this case I need to plot clusters in various colors and then put the centroids on top, so I thought seaborn was the better library.
So next comes sn.lmplot: within sn I am calling lmplot. The x axis is latitude, the y axis is longitude, and data is df, the data frame I created earlier. fit_reg is False, because I am not interested in fitting a regression line; that is not the purpose. height equal to 4 sets the size of the graph: if you increase it to 5, 6 or 7 the figure is enlarged, so increase or decrease this value depending on how big a figure you want.
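A minimal sketch of this plotting call is given below, assuming seaborn and Matplotlib are installed. The data frame here is filled with random, made-up coordinates rather than the course data, and the Agg backend line is only there so the sketch also runs without a display; in a notebook such as Colab it can be removed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn

# Made-up coordinates standing in for the customer data (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Latitude": rng.uniform(27.2, 27.8, 50),
    "Longitude": rng.uniform(80.5, 81.2, 50),
})

# Scatter plot only: fit_reg=False switches off the regression line,
# and height controls the figure size.
grid = sn.lmplot(x="Latitude", y="Longitude", data=df, fit_reg=False, height=4)
plt.title("Customers location")
plt.show()
```

Changing height to 5, 6 or 7 enlarges the figure, as mentioned above.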
Then plt.title sets the plot's title, Customers location, and plt.show means the plot will be shown. Towards the end we will also show you this using Google Colab. If you run this, you will get this output: the data in df plotted like this. Now I definitely know that this customer and this customer cannot be clubbed together, because they are far apart, and I know that these two should be clubbed together. So a visualization always gives you some good insights.
Now I am slowly moving into the clustering technique. I need to select the features. I have three columns: serial number, latitude and longitude. But I do not need the serial number; it is only a label. Serial number 1 could just as well be written as 811, and serial number 2 as 3; it does not matter. So the serial number column is of no use when I create the clusters. Therefore I remove that column completely and create a new data frame: this is my old data frame, df, and now I am creating new_df, which has only latitude and longitude. I will work with this new data frame. This is a very important step, because when you work with real-life data you will see many columns that are of no use for clustering, so you keep only the columns you want for clustering purposes.
The next step is very important, because in K-Means I need to find the value of K. What is the optimum value of K? As we saw in the previous class, there is a diagram called the elbow diagram. How do you get it? It is quite easy to plot: we write a few lines of Python code. So let us understand them. First we import pyplot from the Matplotlib library as plt. Then comes sklearn (scikit-learn), a library we introduced earlier when we talked about decision trees and random forests. sklearn also has clustering techniques, and from sklearn.cluster we import KMeans. There are other clustering methods, such as hierarchical clustering, but we are interested in K-Means.
I want to see how the sum of squared error changes as I increase the number of clusters. So what is the formula? x is an observation and mu is the centroid of its cluster. For each observation I find the deviation from the centroid, square it, and sum over all observations i = 1 to n: $\mathrm{SSE} = \sum_{i=1}^{n} (x_i - \mu)^2$. This value goes on the y axis. I want to see how it changes if I vary the number of clusters from 1 to 10; in this case we have plotted from 1 to 9. So I set the cluster range as range(1, 10). It starts from 1, but in Python it goes only up to 9, because range excludes its end value. So I get 1 to 9 clusters. If I want up to 10 clusters on the x axis, I change it to range(1, 11); for 11 clusters I put 12, and so on.
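To make the two ideas in this passage concrete, here is a small sketch that computes the sum of squared error by hand with NumPy and shows what range(1, 10) produces. The four points and the helper name sum_of_squared_error are made up for illustration; they are not part of the lecture's code.

```python
import numpy as np

# Four made-up points; not taken from the course data.
points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])


def sum_of_squared_error(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))


# k = 1: a single centroid at the overall mean, (2, 1).
one_centroid = points.mean(axis=0, keepdims=True)
sse_k1 = sum_of_squared_error(points, np.zeros(4, dtype=int), one_centroid)

# k = 2: the left pair and the right pair each get their own centroid.
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [4.0, 1.0]])
sse_k2 = sum_of_squared_error(points, labels_k2, centroids_k2)

print(sse_k1, sse_k2)  # 20.0 4.0

# range(1, 10) stops before 10, so it yields 1 through 9.
cluster_range = range(1, 10)
print(list(cluster_range))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Going from one centroid to two cuts the error from 20 to 4 on this toy data, which is the kind of drop the elbow diagram is meant to expose.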
Next I create cluster_errors as an empty list. cluster_errors is nothing but the sum of squared error. Right now it is empty; it will be filled as I change the number of clusters. Now a for loop starts, and this is very important: for num_clusters in cluster_range. First it runs for 1. If the number of clusters is 1, then clusters = KMeans(num_clusters), so the value here is 1, one cluster. Then I fit the K-Means clusters to the data, and then cluster_errors.append stores the clusters' inertia, which is what sklearn calls this sum of squared error. So if the number of clusters is 1, I get this value: a cluster error of about 70.
Now the for loop moves on, because it runs over the whole range. The next time the number of clusters is 2, and my error will be around 33, let us say. Then it increases to 3 (remember the range is written up to 10, which in Python means it runs till 9), and the error is around 25. So for 1 the error is 70, for 2 it is 33, for 3 it is 25; with 4 clusters it will be around 17, let us say, and with 5 clusters it reduces to about 13, and so on. So as I keep increasing the number of clusters from 1 to 9, for each number I get a sum of squared error, and that error is stored in the list. Similarly for 6, 7, 8 and 9 I will have some error value. So first I store these values in the list, and then I plot them.
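The loop described above might look roughly like this. It is a sketch under stated assumptions: the data are 200 synthetic points around four made-up locations, the column names are hypothetical, and n_init and random_state are set only to make the run repeatable; the lecture's own notebook may differ in these details.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic customers scattered around four made-up locations (assumption).
rng = np.random.default_rng(42)
blob_centres = np.array([[27.3, 80.8], [27.4, 81.2], [27.6, 80.6], [27.7, 80.9]])
coords = np.vstack([c + rng.normal(scale=0.03, size=(50, 2)) for c in blob_centres])
df = pd.DataFrame({
    "Sl_No": np.arange(1, len(coords) + 1),
    "Latitude": coords[:, 0],
    "Longitude": coords[:, 1],
})

# Feature selection: keep only the columns used for clustering.
new_df = df[["Latitude", "Longitude"]]

cluster_range = range(1, 10)  # 1 through 9 clusters
cluster_errors = []           # one sum of squared error per value of k

for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    clusters.fit(new_df)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    cluster_errors.append(clusters.inertia_)

for k, error in zip(cluster_range, cluster_errors):
    print(k, round(error, 3))
```

Because the synthetic data have four well-separated groups, the printed error should fall sharply up to k = 4 and only slowly afterwards.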
So plt.figure with figsize 6 comma 4: you can increase or decrease these values as per your choice, and the figure will grow or shrink. Then I plot cluster_range, which is 1 to 9, against cluster_errors, and the marker is "o", so you can see circles; the circle is one type of marker. There are various markers available in Python: if you want each point printed as a star, or as an upward or downward arrow, you can change the marker accordingly. There are a lot of markers available; you can look them up and choose the one you like. So for cluster range 1, 2, 3, 4, 5, 6, 7, 8, 9, each value has a cluster error, and that is what is being plotted. The title of this plot is Elbow diagram, the x axis is Number of clusters, and the y axis is Sum of squared error. In this case I have plotted only from 1 to 9 clusters; by 9 I can see that the decrease is very slow, so there is no point in plotting further, but you can if you wish.
So this is the elbow diagram and, as we discussed in the last class, it is a very important diagram for finding the optimum number of clusters. One cluster is definitely not optimum, because from 1 to 2 I can see a huge reduction in the sum of squared error. In this case study, a reduction in the sum of squared error means my responsiveness will increase significantly. From 2 to 3 the sum of squared error again reduces significantly. From 3 to 4 the reduction is not as large, but it is still a good amount. Then from 4 to 5 there is this much reduction, and from 5 to 6 the reduction is very low. Since the reduction is very low after 4, we can treat this point as the optimum number of clusters.
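As an illustration of the plotting step, the sketch below uses only the five approximate error values quoted earlier in the lecture (70, 33, 25, 17, 13 for 1 to 5 clusters); the values for 6 to 9 clusters were not stated, so they are left out. The Agg backend line is an assumption to let the sketch run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt

# Approximate values quoted in the lecture for 1 to 5 clusters.
ks = [1, 2, 3, 4, 5]
errors = [70, 33, 25, 17, 13]

# Reduction in error gained by each additional cluster.
drops = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
print(drops)  # [37, 8, 8, 4]

plt.figure(figsize=(6, 4))
plt.plot(ks, errors, marker="o")  # other markers include "*" and "^"
plt.title("Elbow diagram")
plt.xlabel("Number of clusters")
plt.ylabel("Sum of squared error")
plt.show()
```

The list of drops makes the same point as the picture: the gain from each extra cluster shrinks, and the elbow is read off where it becomes small.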
So K equal to 4, and here the elbow is also bending. That is why this is called the elbow diagram: at K equal to 4 the elbow bends, and we can decide the optimum number of clusters. So this diagram needs to be plotted first, and it will give you a hint as to how many clusters are optimum. From this diagram I found that K equal to 4 is optimum.
Now what will we do? We will form the clusters. From the sklearn library's cluster module I again import the KMeans clustering technique, and then clusters_new = KMeans(4). This is important: since the elbow diagram suggested that 4 is optimum, I am putting 4. If it had suggested 3, you would put 3; 5, 6, whatever number of clusters you want, you put that value here. So I will get 4 clusters. Then clusters_new.fit: which data set am I fitting? new_df, which, if you remember, has only the latitude and longitude columns. So I am fitting that data and asking for 4 clusters.
After you run this algorithm you get a cluster label for every customer, but I have to store it. Where should I store it? I am storing it at column location 2. This is column location 0, this is column location 1, and this is column location 2; in Python counting starts from 0. The name of the column is cluster ID and its values are clusters_new.labels_, the cluster labels. So customer 0 is part of cluster ID 0, customer 1 is part of cluster ID 2, another customer is part of cluster 3, another is part of cluster 1, and so on.
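A rough sketch of this step follows. The data are synthetic, the column name cluster_id is an assumption, and DataFrame.insert is one plausible way to place the labels at column location 2; the lecture's notebook may do this differently.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic customers scattered around four made-up locations (assumption).
rng = np.random.default_rng(1)
blob_centres = np.array([[27.3, 80.8], [27.4, 81.2], [27.6, 80.6], [27.7, 80.9]])
coords = np.vstack([c + rng.normal(scale=0.03, size=(50, 2)) for c in blob_centres])
new_df = pd.DataFrame(coords, columns=["Latitude", "Longitude"])

# Four clusters, as suggested by the elbow diagram.
clusters_new = KMeans(n_clusters=4, n_init=10, random_state=0)
clusters_new.fit(new_df)

# Store each customer's label as a new column at position 2 (the third column).
new_df.insert(2, "cluster_id", clusters_new.labels_)

print(new_df.head())
print(new_df["cluster_id"].value_counts().sort_index())
```

The labels run from 0 to 3, so the four clusters appear as cluster IDs 0, 1, 2 and 3.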
So each customer is allocated to one cluster. Since I have 4 clusters, all 811 observations are mapped to one of these 4 clusters, and that mapping happens here. Once the mapping is done, I want to plot it as well. This is how the plotted output will look, but how do I get it? I am plotting the clusters, so I import seaborn again, because I want an interesting graph with a different color for each cluster: cluster 0 is blue, cluster 1 is orange, cluster 2 is green and cluster 3 is red. I want each cluster plotted in a different color so that I can easily see which customer is part of which cluster, and that is where the seaborn library is useful. Then sn.lmplot with x as latitude, y as longitude, data as new_df and hue equal to cluster ID, which means the color changes based on the cluster ID. I do not want to fit a regression line, and height is 4; if I want an enlarged picture I can put 5, 6, 7 or whatever. So this is my cluster output in the form of an image.
Now, once you get this output, I know which customers are part of which cluster, but I do not know where the centroid is, that is, the possible location of the DC. For that I have to write another line of code, which tells me that for cluster 0 the centroid is 27.68, 80.90; for cluster 1 it is 27.42, 81.15; for cluster 2 it is 27.31, 80.83; and similarly for cluster 3. So how do I get it? The code is centers = np.array(clusters_new.cluster_centers_), where np is NumPy and clusters_new is the model I already fitted above.
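The centroid lookup can be sketched as below. The eight points are contrived: each pair is placed symmetrically around one of the centroid values quoted in the lecture, so the fitted centroids land on those values. This is not the course data, and the cluster IDs assigned here need not match the colors or numbering in the lecture's plot.

```python
import numpy as np
from sklearn.cluster import KMeans

# Contrived points: one tight pair around each centroid quoted in the lecture.
quoted = np.array([[27.68, 80.90], [27.42, 81.15], [27.31, 80.83], [27.56, 80.57]])
offset = np.array([0.01, 0.01])
points = np.vstack([quoted - offset, quoted + offset])

clusters_new = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

# One row per cluster: column 0 is latitude, column 1 is longitude.
centers = np.array(clusters_new.cluster_centers_)

for cluster_id, (lat, lon) in enumerate(centers):
    print(f"cluster {cluster_id}: centroid at ({lat:.2f}, {lon:.2f})")
```

Each row of centers is a candidate DC location for the customers carrying that cluster ID.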
So if I write this code I get the center of each cluster. Since Python starts counting from 0, I get clusters 0, 1, 2 and 3, but I do have 4 clusters: the first cluster is represented as 0, the second as 1, the third as 2 and the fourth as 3, and these are my centroids. So now I have the centroids of the clusters, but it is difficult to see where they are. Is there any way to plot them? Can I plot these centroids on top of the cluster output? Yes, you can.
For that we have to write another set of code. I import seaborn and call lmplot with x equal to latitude, y equal to longitude, data as new_df and hue as cluster ID, the same as earlier. What is new here is a scatter plot of centers column 0 against centers column 1 with marker equal to "x", which means the centroids are marked with an x. And how do I find each location? In the centers array, index 0 means the first column and index 1 means the second column.
So if I focus on cluster 0, the blue cluster, the centroid is this particular point, 27.68 and 80.90; I am writing only two digits after the decimal. Similarly for cluster 1, the orange cluster, the point is 27.42, 81.15. So can you tell which centroid this is? This is the centroid of cluster 3 and this point is the centroid of cluster 2; you can find them out accordingly. If you want, I can write them down as well: for the green cluster, cluster 2, the point is 27.31, 80.83, and this point, for cluster 3, is 27.56, 80.57. So that is how I can mark them on the plot itself.
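Superimposing the centroids on the colored cluster plot might be sketched as follows, assuming seaborn, Matplotlib and scikit-learn are available. The data are synthetic, the column name cluster_id is hypothetical, and the black color and marker size are choices made here for visibility rather than details from the lecture.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
from sklearn.cluster import KMeans

# Synthetic customers scattered around four made-up locations (assumption).
rng = np.random.default_rng(7)
blob_centres = np.array([[27.3, 80.8], [27.4, 81.2], [27.6, 80.6], [27.7, 80.9]])
coords = np.vstack([c + rng.normal(scale=0.03, size=(50, 2)) for c in blob_centres])
new_df = pd.DataFrame(coords, columns=["Latitude", "Longitude"])

clusters_new = KMeans(n_clusters=4, n_init=10, random_state=0)
clusters_new.fit(new_df[["Latitude", "Longitude"]])
new_df["cluster_id"] = clusters_new.labels_
centers = np.array(clusters_new.cluster_centers_)

# Customers colored by cluster, with no regression line.
grid = sn.lmplot(x="Latitude", y="Longitude", data=new_df,
                 hue="cluster_id", fit_reg=False, height=4)

# Centroids drawn on the same axes: column 0 is latitude, column 1 is longitude.
plt.scatter(centers[:, 0], centers[:, 1], marker="x", color="black", s=100)
plt.title("Clusters with centroids")
plt.show()
```

Each x marks a possible DC location, surrounded by the customers it would serve.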
So it is now much easier for me to see that these points are the possible locations of the DCs: this DC will serve these customers, this DC will serve the green customers, this DC will serve the blue customers, and so on. So that is how we can plot the clusters and superimpose the centroids on top of them.
Now, in the last two or three minutes, we will go to Google Colab. Log in using a Google account (all of you must have one), upload the file customerlocation.csv, which will be shared with you, then run the Python code and get the same output. So we will do the hands-on now; first let us go to Google Colab. I have already stored the data, customerlocation.csv, in Google Colab. We have already explained this code in the PPT, so this part is basically storing and reading the data. If you run it, you will get the data for all 811 customers: serial number, latitude, longitude.
Then I am plotting the data. Using the Matplotlib and seaborn libraries, if you run this code you will get this plot of latitude against longitude, titled Customers location. Then we have to select the features. If you see here, I have 3 columns, serial number, latitude and longitude, but the serial number is of no use to me; I only need latitude and longitude for clustering, so I keep only those. So new_df, the new data frame, is created with latitude and longitude.
Next we determine the number of clusters. To find it, we need to plot the elbow diagram, as explained in the PPT. If you run that part of the code, you will see the elbow diagram, and since 4 seems to be optimum, because that is where the elbow bends, we take the value 4 here: KMeans with 4. So 4 clusters will be formed.
For each customer, you can see that a cluster ID is given. Once we run this code, each customer is mapped to one cluster ID. Then we plot the data set with the cluster ID, so the color changes based on each customer's cluster ID. Once we get these clusters, we want to plot the centroids on top of them as well. We have already explained this code in the PPT, so please refer to the PPT for the explanation; if you run this part of the code, you will get an output like this. So this is how the K-Means clustering is done. Thank you. Looking forward to interacting with you in the next class. Have a nice day.