Thanks everyone for joining. This is the third and final session in the machine learning workshop series. My name is Louise Kaepner and I'm here with Nadia Kanar, who will be taking the R demonstration; if you're after Python, I'll be taking that in the first hour, and then stick around if you want to see how to do the code in R. If you want to follow the code and execute it at the same time as me, navigate to our GitHub machine learning workshop repo and you'll see a little button in the readme file that says "launch binder". If you click on that, it brings up this environment here; go into the Python code folder, click on the machine learning code demo notebook, and you can then follow the code along with me.

So let's get started. The first thing you're going to want to do when you're working in Python and doing a bit of machine learning is import all of the necessary packages. You can see that I have a number of packages: pandas, which helps me manipulate my data set; NumPy, which has lots of important mathematical functions; matplotlib and seaborn, which are great for creating visualizations; and other well-known packages which are going to help me carry out machine learning, so scikit-learn, and also SciPy for our hierarchical clustering. One second, I'm just getting things started.

Okay, now we can go ahead and load in the data set. Today we're going to be working with the iris data set, which you'll no doubt remember from last week's session on clustering. If not, don't worry, there's a brief description of it here. The iris data set contains 50 samples from each of three species of iris flowers: Iris setosa, Iris virginica and Iris versicolor. Each data point has four features that were measured from each sample: the length and the width of the sepals and the petals, each measured in centimeters.

We've run that first cell, so now let's load up our iris data set. I've got a little description here of what a parameter is and what a function is, so you can give that a brief read over if you need to. Parameters in Python are variables: they're placeholders for the actual values that the function needs, and when the function is called those values are passed in as arguments. In this case we've got this function from pandas called read_csv, so that's our function, the file path is the parameter, and this specific path here, my iris CSV, is the argument, because that's the value that I've passed in. So you can go and read over that if you want a bit more clarity.
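For anyone following along, the setup cells look roughly like the sketch below. This is a sketch rather than the exact notebook contents, and the CSV file name is an assumption; use whatever path is in the workshop repo.

```python
# Packages used throughout the Python demo
import pandas as pd                # data manipulation
import numpy as np                 # mathematical functions
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical visualizations

# Load the iris data set.
# read_csv is the function, the file path is the parameter,
# and the string we pass in ("iris.csv" here, an assumed file name) is the argument.
iris = pd.read_csv("iris.csv")
```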
Okay, so now we're going to get stuck into our first centroid-based clustering algorithm, and that's the k-means algorithm. First though, as I mentioned in the presentation, it's always good to explore our data and check that everything's up to scratch. Also, just a little reminder here in the code: clustering algorithms work with unlabeled data, so for the purposes of this demonstration we're going to be largely ignoring the variety (species) column, which tells us the species of each of our 150 data points.

First things first, when you import data into any coding environment it's always good to check that it's all looking as it should. Sometimes when you read in the data you'll realize that your column headings are on the wrong row or something, so a good pandas function you can use is head, which gives you the first five rows of your data set; it's really good for making sure all the columns are okay and we've got all the information we need. Then we have the following two functions. We have info, a really helpful function that can be used with a pandas data frame to give a more detailed overview: when we run it we can see the number of columns, whether there's any missing data, and the data type of each column. Fortunately, because this is quite a beginner-friendly data set, we have no missing values or difficult data types to contend with, so we can go on to the next step. We can do a value count here as well, and that reveals that there are 50 samples for each species. Another really good function is describe, which tells us a lot about the statistical variation in our data, and it's really helpful when it comes to performing unsupervised learning: if certain columns or features of our data have a higher variance, that's going to affect how the k-means distance measure works. So we can use this function to check whether we need to standardize our feature variables. You can see here we've got the mean and the standard deviation for each of our four features, and from that the variance.

I've also got a little description here of why you might want to bother standardizing the feature data. It goes into a bit more detail about the fact that k-means uses the Euclidean distance to calculate the distance between data points and the centroids, so you want to ensure that the distance measure accords equal weight to each variable; we don't want to end up putting more weight on variables that have higher variance. That's why it's important to have a look at the statistical variation in your data and then see whether it needs some sort of scaling. In this case we see that the petal length varies quite significantly more than the sepal length, sepal width and petal width, so we want to standardize our data. To do this we're going to use scikit-learn's preprocessing module, which comes with a StandardScaler class, and that's a really quick way to perform feature scaling. But before we can do that, to be able to perform these mathematical computations we're going to need an appropriate way of storing our feature variables without the column headings and the variety column getting in the way. So we separate our features from the target attribute, because we don't want the species column in our feature data.
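Roughly, the exploration cells look like this sketch; the species column name ("variety") is taken from the workshop's CSV, so adjust it if your copy of the data uses "species" instead.

```python
# Quick sanity checks on the freshly loaded data
print(iris.head())        # first five rows: are the column headings where we expect?
iris.info()               # column count, data types, and any missing values

# Confirm we have 50 samples per species
print(iris["variety"].value_counts())

# Statistical summary: compare the standard deviations to decide whether to scale
print(iris.describe())
```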
Here we use the iloc function, which helps us select specific rows or columns from the data set, and we place the values of our features in a 2D array. We can use slicing to print the first ten values in the array, so let's give that a look. There we go, we've now got a 2D array of the values for our features. Then we create a variable, ss, for StandardScaler, and that contains our feature scaling function. We fit the StandardScaler to the data, so we fit it to this 2D array we've created, and what that does is compute the mean and the standard deviation; the transform part then scales the data, which means we'll end up with a mean very close to zero and a standard deviation very close to one for each feature. Let's give that a look. You can see it's transformed our data here. What I've done next is put that 2D array back into a data frame and used head to print out the first five rows, so this is the data once it's been scaled. Now we can use the describe function on our data frame of scaled variables to check that the StandardScaler has done its job, and as you can see it has, because the mean for each feature variable is now very close to zero and the standard deviation for each variable is very close to one. So that's your pre-processing done, and that's a really important part of making sure your data is ready to be used in a clustering algorithm.
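Put together, the preprocessing cells look something like this sketch, assuming the imports and the iris data frame from the earlier sketch; the column positions and names are assumptions based on the usual layout of the iris CSV.

```python
from sklearn.preprocessing import StandardScaler

# Select the four numeric feature columns (leave out the variety column)
# and pull the values out into a plain 2D NumPy array.
X = iris.iloc[:, 0:4].values
print(X[:10])                      # slice: the first ten rows of the array

# fit computes each feature's mean and standard deviation; transform rescales the data
ss = StandardScaler()
X_scaled = ss.fit_transform(X)

# Put the scaled array back into a data frame so head()/describe() are easy to read
feature_names = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
scaled_features = pd.DataFrame(X_scaled, columns=feature_names)
print(scaled_features.head())
print(scaled_features.describe())  # means should be ~0, standard deviations ~1
```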
So now we're at the stage where we can give clustering a go. Of course, with this data set we're in the unique position of already knowing the optimum number of clusters: we have three species, so we're expecting to find three clusters. For now, though, I'm going to pretend that we don't know that, and we're going to randomly set our k value to five. To cluster our data we can use the KMeans class that comes with the scikit-learn package, and it has the following parameters. Remember, parameters are the things you supply arguments for; going back to the earlier example, the parameter was the file path and the argument was the specific file path we supplied. The KMeans class has different parameters, so there are different pieces of information you can feed to the algorithm when you set it up. Let's go through some of them.

First we have the parameter init, and this is the method for initialization. The standard version of the k-means algorithm is implemented by setting init to "random". Some of you might remember that this is Forgy's method: when I talked about initialization, about how we pick the points we're going to use for the centroids, with Forgy's method we just pick some random ones from the data set. There are other ways you can do it as well; we talked about the random partition method, and there's another algorithm called k-means++ that you can use, so you can actually change init to "k-means++". I do recommend that after this you have a play around with these different parameters and change the initialization method. Here we've also got the number of clusters, so that's where you pick the k value, the number of clusters that you want the algorithm to form, as well as the number of centroids to generate. Then we have the number of initializations. Remember, with k-means we have to initialize it more than once because of these random starting points, and two runs can converge on different cluster assignments, so it's important to set this to a fair number of runs. In the example below I've set it to 10, which is the default, so it performs 10 k-means runs and then returns the results of the one with the lowest sum of squared error, and that's our performance metric for the k-means algorithm. Then we have another parameter, max iterations, which refers to the maximum number of iterations the algorithm will perform for a single run: for each of the 10 initializations, how many times are we going to let the code iterate through the algorithm. We also have random state, and this determines the random number generation for centroid initialization. We can use an integer to make the randomness deterministic, and all that means is you'll get the exact same points as me. Normally, if this is set to None, it will just pick some random centroids, but we can make it deterministic if we all use the same number; feel free to change it so that you don't get the exact same points as me, or set it to None, but here I've set it to 32, so we'll all have the same coordinates for our centroids.

You can see that I've supplied my arguments here: I'm going to initialize by selecting the initialization points at random, I've chosen five clusters, I want to initialize this algorithm 10 times with my random points, and I'm going to have 300 max iterations. So we've got our model set up here, and what we need to do is fit this k-means with the scaled features we had before; remember, we performed scaling on our 2D array and then put it back into a data frame. Once we fit it, the code below will perform 10 runs of the k-means algorithm on the data, as I've said, with a maximum of 300 iterations per run. Let's go ahead and run that, and you can see it just reiterates some important parameters back to us. Once we've run this, we can access the lowest sum of squared error value from the 10 initializations, which is also referred to as the inertia; let's give that a run, and we can see we've got the value 91.6. We can also access our cluster centers, which tell us the final locations of the centroids, and that comes in handy later when we want to plot this information and visualize our results. These are the coordinates of the cluster centers for the k-means run with the lowest sum of squared error. There are also some other attributes that the KMeans class has, so once you've built your model you can access the following information: you can get the labels of each data point, which tell you which cluster number each data point was assigned to; you can find out how many iterations it took before the algorithm converged; and there are a couple of other attributes as well that I'm not going to go through in detail.
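As a sketch, the model cells above look roughly like this; the variable names are mine, and scaled_features is the standardized data frame from the preprocessing sketch.

```python
from sklearn.cluster import KMeans

# k-means deliberately set up with k = 5 for now
kmeans = KMeans(
    init="random",    # Forgy-style random initialization (try "k-means++" as well)
    n_clusters=5,     # the k value: number of clusters / centroids
    n_init=10,        # independent initializations; the run with the lowest SSE is kept
    max_iter=300,     # maximum iterations per run
    random_state=32,  # fixes the random centroid choice so results are reproducible
)
kmeans.fit(scaled_features)

print(kmeans.inertia_)          # lowest sum of squared error (SSE) across the 10 runs
print(kmeans.cluster_centers_)  # final centroid coordinates
print(kmeans.labels_)           # cluster number assigned to each data point
print(kmeans.n_iter_)           # iterations the best run needed to converge
```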
Let's run this here and see how many iterations it took; I've actually given the answer away in the comments. It took 13 iterations before the algorithm converged, and as I've said, we can also see which cluster label each data point has. Now let's go on to visualizing our results. You can see here we've got our five clusters. You might be wondering why the labels start from zero and not from one, and that's just because in the weird world of computer science we count from zero, so you can think of labels zero to four as clusters one to five. You'll see we have the centroids plotted as well, and we can also do a 3D visualization, where again you can see our clusters.

So how can we go about finding the optimal number of clusters? We've used a random value of five for k, but to actually determine the optimal number, a crucial step in the k-means algorithm is using some sort of evaluation method. I've put a little recap here, and I'm not going to go through it entirely, apart from to say that the common method used to evaluate the appropriate number of clusters is the elbow method. That involves running k-means clustering on the data set for a range of values of k, for instance from one to ten, and then computing the SSE value for each k. The elbow method then reveals a sort of sweet spot where the SSE curve starts to bend, so that's its elbow point, and that's the point at which we get diminishing returns that are no longer worth the additional cost. We choose the number of clusters at which adding another cluster doesn't produce a significantly better modelling of the data. To do this I create a dictionary of keyword arguments: I want my initialization method to be random, I want my number of initializations again to be 10, my max iterations is going to be 300, and I've got the same random state. Then I create a variable, sse, which is an empty list, and I iterate through each k value ranging from one to ten. Let's run this code. You can see we get the error values from k equals one to k equals ten, and just from looking at these numbers we can see that after around here, 114.412, the inertia starts decreasing in a more linear fashion, so that does indicate that k equals three is going to be the optimum number. But plotting this is going to make it way more obvious, as we'll then be able to observe the bend, so let's go ahead and do that. You can see here that we've got this elbow around three. Sometimes, though, the graph isn't going to be that clear, and you might want a more straightforward means of acquiring the elbow point. To do this you can use a package called kneed, which comes with the KneeLocator class, and that determines the elbow point programmatically. Let's see if it matches our observation that k should equal three; and yes, when we access the elbow attribute it returns three. So now what we can do is create a model where we pick three as our number of clusters, with the same number of initializations, the same number of maximum iterations, and the random state again set to 32. Let's give this a run, and remember it's important that we fit our data; we can then plot this later.
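A sketch of the elbow-method cells, assuming the variable names from the earlier sketches; the kneed package is the one mentioned above and needs to be installed separately (pip install kneed).

```python
from sklearn.cluster import KMeans
from kneed import KneeLocator

# Run k-means for k = 1..10 and record the SSE (inertia) of the best run for each k
kmeans_kwargs = {"init": "random", "n_init": 10, "max_iter": 300, "random_state": 32}
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, **kmeans_kwargs)
    km.fit(scaled_features)
    sse.append(km.inertia_)

# Plot SSE against k and look for the bend
plt.plot(range(1, 11), sse, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("SSE (inertia)")
plt.show()

# Locate the elbow programmatically; this should come out as 3 for the iris data
kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
print(kl.elbow)

# Refit k-means with the chosen k = 3
kmeans3 = KMeans(init="random", n_clusters=3, n_init=10, max_iter=300, random_state=32)
kmeans3.fit(scaled_features)
```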
You'll remember, or if you don't remember, in the clustering presentation I also talked about something called the silhouette coefficient, and that's another evaluation method you can use to determine how good your clusters are. It's used to evaluate the density and separation between clusters. The score is calculated by averaging the silhouette coefficient for each sample, which is computed as the difference between the average intra-cluster distance and the mean nearest-cluster distance for each sample, normalized by the maximum of the two. It produces a score between minus one and plus one, where scores near plus one indicate high separation and scores near minus one indicate that samples might have been assigned to the wrong cluster. I've got a TLDR here as well: if the score is nearer to one, the clusters are far apart from each other and clearly distinguished. So let's look at the silhouette score for k-means with k set to three: we've got a score of around 0.46. We can then use the silhouette visualizer to show the silhouette plot, and you can see that the graph contains these sort of long, homogeneous silhouettes. The vertical red dotted line indicates the average silhouette score for all observations, so it's equal to the score we just computed. This is not bad, and we can then compare it to the silhouette score for k-means when we set k to five: you can see it's quite a bit lower, and the silhouettes are not as homogeneous as what we get when k equals three.

I'm also going to show you that you can visualize this data as well, so we're going to use matplotlib to do a scatter plot. You can see that we've got this cluster here that's quite well separated, but these two clusters are not as well separated, so it's been harder for the algorithm to cluster this data; we've got our centroids here as well. We can also do a 3D visualization, so I'll just show you this. If you think of the clusters as one, two and three, remembering that in computer science we start counting from zero, the first cluster here is quite well separated, but it's been harder for the algorithm to separate these two clusters, where you can see the points are much more overlapping. We'll talk about that a bit more when we move on to hierarchical clustering now.
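For the silhouette evaluation, a sketch of the cells might look like this. The average score uses scikit-learn's silhouette_score; the silhouette plot below assumes Yellowbrick's SilhouetteVisualizer, which may differ from the exact visualizer used in the workshop notebook.

```python
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer  # assumption: Yellowbrick's visualizer

# Average silhouette coefficient for the k = 3 model (expected to be around 0.46 here)
print(silhouette_score(scaled_features, kmeans3.labels_))

# Silhouette plot: one silhouette per cluster, dashed red line = average score
visualizer = SilhouetteVisualizer(kmeans3, colors="yellowbrick")
visualizer.fit(scaled_features)
visualizer.show()
```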
Now, for those that don't remember, hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other. Let's see if the clusters we get from hierarchical clustering align with the irises' taxonomical classification; we will be using the labels for some of this, just to compare how well hierarchical clustering does.

I'm going to show you some different methods for exploring the data, and I'm going to start with a correlation matrix. A correlation matrix is a table showing the correlation coefficients between variables, so each cell in the table shows the correlation between two variables. We compute it here by using iloc again to access our columns, and then we can just use the corr function to compute the correlation between our variables. But, as I've said here, it's pretty boring to look at, so we can create a visualization which makes these correlations more explicit, and we do this by creating a NumPy array of boolean values from the correlation matrix data frame, the data frame that we created here. From this output we can see that there's a positive correlation between our petal length and petal width attributes, which is a pretty good indication for clustering. There's also quite a strong correlation between our sepal length and petal length, and we also have a correlation between sepal width and petal length, albeit a negative one, whereas sepal width is only weakly correlated with the other attributes. So this tells us the following: longer petals also tend to be wider, flowers with longer petals also tend to have longer sepals, and flowers with longer sepals tend to have wider petals as well.

Another data exploration technique we can use is to plot some pair grids. Pair grids, or pair plots, are great for multivariate analysis as they plot pairwise relationships in the data set, and we can use this function from seaborn to plot a different function on the diagonal as well, to show the univariate distribution of the variable in each column. Let's run this code, and from this we can see that there are two easily distinguishable groups: one with long petals and somewhat longer and thinner sepals, and one with short petals and relatively short and thick sepals. Of course, we already know that there are samples from three species in the data set, but it seems that it'll be easy to separate one of these clusters, whereas classifying the other two won't be so easy, and we can see that if we go back to our k-means: it was really easy to separate the first cluster, but much harder for the other two.

So let's first plot our single linkage dendrogram, and we do this using the SciPy dendrogram function. What we do is create our linkage (remember we talked about the linkage criterion), and we use loc to get the values for our different feature variables, so sepal length, sepal width, petal length and petal width, and then you specify the method, so you supply it with the argument for the specific linkage criterion you're going to use; we're using single linkage here. Then we've got our figure size, which sets how big the graph image will be. Once we've created this distance matrix, which I've called dist_sin for single linkage, we can also specify the leaf rotation, which I've set to 90 degrees. I'll just run it so you can see this a bit more clearly. Okay, you can see I've also given it a title, "dendrogram single method", I've set my font size, and I've got my y label as distance and my x label as index. This suggests the existence of two clusters, but it's not so clear that we have a third cluster here, and if I didn't know that the data set contains data from three species I would probably stop at two. But of course we have the advantage of knowing that the labelled data set outlines three species, so we can use some different linkage criteria to further investigate whether we can find a better clustering solution for this data set.
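A sketch of these exploration and dendrogram cells, again assuming the iris data frame, imports and column names from the earlier sketches; the exact figure styling in the notebook may differ.

```python
import scipy.cluster.hierarchy as sch

# Correlation matrix, plus a masked heat map so only one triangle is shown
corr = iris.iloc[:, 0:4].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))   # boolean array hiding the upper triangle
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm")
plt.show()

# Pair grid of pairwise relationships, coloured by species
sns.pairplot(iris, hue="variety")
plt.show()

# Single-linkage dendrogram
features = iris.loc[:, ["sepal.length", "sepal.width", "petal.length", "petal.width"]].values
dist_sin = sch.linkage(features, method="single")

plt.figure(figsize=(15, 8))
sch.dendrogram(dist_sin, leaf_rotation=90, leaf_font_size=8)
plt.title("Dendrogram (single method)")
plt.xlabel("index")
plt.ylabel("distance")
plt.show()
```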
Obviously, when it comes to hierarchical clustering, as I talked about in the presentation, you can get some hard-to-read visualizations, and it can be a bit messy; this is quite a lot to look at. SciPy's dendrogram function does have a number of parameters you can play around with to make a messy dendrogram a bit easier to read. It has the truncate mode parameter, which is used to condense the dendrogram, and its first option, which is the important one, shows only the last p merged clusters, so you can set it to show only the last 20 merged clusters, or only the last 50. We also have a color threshold, so we can make sure that all clusters below a specific value are given different colors. Let's go through this here, and I'll explain a bit about how I did it. Again, remember we have our distance matrix, so I've created this variable here, the linkage has been set to single, so single linkage, and we've accessed our feature variables. We make sure the plot is a certain size, then we supply the dendrogram function with our distance matrix and make sure the leaves are rotated 90 degrees. Here you can see I've supplied the truncate mode "lastp" and set it so that it only shows me the last 50 merged clusters, and I've set my color threshold to 0.81, and I do that to show that we have actually got three clusters here. But as you can see, it hasn't done very well, and at first glance we've only got two clearly distinguished clusters. If that's a bit easier to see: at 0.8 I made sure the clusters below that value had different colors, because we do have two points that have been put into their own unique cluster, so there are three clusters distinguishable here, but not very well.

To investigate how much these clusters differ from the taxonomical classification, we can use a SciPy function called fcluster, which basically flattens the dendrogram and allows us to obtain the cluster values for the original data points, so it tells us which cluster each data point was assigned to; and obviously we have this advantage because the iris data set is traditionally a labelled data set. It has the following parameters: Z, which is the hierarchical clustering matrix returned by the linkage function, so in our case that's our variable dist_sin, our distance matrix; t, which here denotes the maximum number of clusters that we want; and criterion, the criterion we use to flatten the clusters. In our case we're using "maxclust", which means we want to impose a threshold on the number of flat clusters that the function returns. What this allows us to do is see how much the clusters differ from the actual species. You can see what I've done here is just append the cluster assignments onto my iris data set: when we first read in the data we called it iris, so I've copied that into a second data frame, iris2, so we can leave our original iris data set intact. Let's run this so we can see the assignments; again, head just prints out the first five rows of a data frame. Now let's see some visualizations using fcluster; I'm just going to expand these so you can see them as I talk about them, and slide this out a bit. Okay, so you can see from our three plots that our single linkage method has not been able to find three groups in the data. As I mentioned before, we have only two data points in a third cluster; if you see here, these are those two data points, so it's not really done very well at distinguishing our three clusters. Whereas we can see here how the data set actually is, with each data point and its proper label: setosa is our first cluster, which is really easily separated from the other two, and then we have our messier clusters of versicolor and virginica.
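A sketch of the truncated dendrogram and fcluster cells, continuing from the previous sketch; the iris2 copy and the cluster column names are my naming assumptions.

```python
from scipy.cluster.hierarchy import fcluster

# Truncated single-linkage dendrogram: show only the last 50 merges,
# and colour clusters sitting below the 0.81 distance threshold differently.
plt.figure(figsize=(15, 8))
sch.dendrogram(dist_sin, leaf_rotation=90, truncate_mode="lastp", p=50,
               color_threshold=0.81)
plt.show()

# Flatten the dendrogram into at most 2 and then at most 3 clusters
iris2 = iris.copy()                                   # keep the original data frame intact
iris2["cluster_k2"] = fcluster(dist_sin, t=2, criterion="maxclust")
iris2["cluster_k3"] = fcluster(dist_sin, t=3, criterion="maxclust")
print(iris2.head())                                   # first five rows with cluster labels
```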
Another thing we can do is use a swarm plot to take a further look at how the points have been assigned. A swarm plot is another way of plotting the distribution of an attribute; it's basically a scatter plot where one variable is categorical, and in this case our categorical variable is the species, or variety, so setosa, virginica and versicolor. We use the seaborn swarmplot function to do this, setting x as the variety and y as the two-cluster assignment, and then we do the same for the three-cluster assignment. Let's have a look at this. It shows us that we have those two points that have been assigned to their own cluster, which is not great; what we really want is three clearly separated clusters. What we can also do is create a heat map of our feature means, and that can tell us a bit more about the details of the two clusters we have managed to find, because we clearly have two big clusters here. Let's see if we can find out a bit more about the means of these clusters. I use a seaborn function again, heatmap, and I use loc again to access all my feature values, then I use the mean function and set the annotation to true so that we can see the exact mean values. I've also just put the means in a data frame so you can read them quite easily with the full values. What we can find out from this is what our two main clusters look like, in other words what the features are like for the two main clusters we identified in our dendrogram. In the first cluster we have quite small petals, so the mean width is quite small and so is the length, but we have relatively thick sepals. Whereas for cluster two the petal length is much longer, so we have quite long petals, and we have quite long sepals as well, so the means here differ quite a bit.

We can then compare this to a different hierarchical clustering method; we also spoke about the complete linkage method, so let's see how well this replicates the taxonomical species of the iris flowers. Here I've labelled my distance matrix dist_comp for the complete linkage method, I select my attributes and set my linkage method to complete, and we have basically the same settings as before, such as the figure size, so how big the graph is going to be. Then I supply my dendrogram function with the distance matrix, which should be dist_comp, sorry about that, and now we can see the results; just let me make sure that's all right, yes, sorry about that. So when we use this complete method, it seems to suggest a number of two or three clusters, and we can see how well these clusters replicate the classification of the iris flowers if we use the flattening, which we'll do in a second. For now I'm just going to set some parameters so that we can have this dendrogram in a more visually pleasing format. To do that I use the truncate mode again, setting last p to 25, because I want to see the last 25 merged clusters, and you'll see why I set the color threshold where I do: I put it at four here so that we can see the groups clearly, and you can see we have three clusters. What we can do then is the same thing we did before: we can flatten this hierarchical clustering and then do some visualizations for our different k values, so you can see what it looks like if we split the dendrogram at different points.
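A sketch of the complete-linkage cells, reusing the variables from the previous sketches; names like dist_comp follow what I've called things above rather than the notebook's exact identifiers.

```python
# Complete-linkage clustering on the same four features
dist_comp = sch.linkage(features, method="complete")

# Full dendrogram
plt.figure(figsize=(15, 8))
sch.dendrogram(dist_comp, leaf_rotation=90)
plt.title("Dendrogram (complete method)")
plt.xlabel("index")
plt.ylabel("distance")
plt.show()

# Condensed version: last 25 merges, clusters below a distance of 4 coloured separately
plt.figure(figsize=(15, 8))
sch.dendrogram(dist_comp, leaf_rotation=90, truncate_mode="lastp", p=25,
               color_threshold=4)
plt.show()

# Flatten the complete-linkage dendrogram at two and at three clusters
iris2["cluster_comp_k2"] = fcluster(dist_comp, t=2, criterion="maxclust")
iris2["cluster_comp_k3"] = fcluster(dist_comp, t=3, criterion="maxclust")
```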
Basically, what this is doing is splitting the dendrogram at two clusters and at three clusters, so if we split at two we get two big clusters, and when we split at three, which is what I'm showing here with the color threshold, you can see we have three clusters. Let's have a look at these visualizations, and you can see that our complete linkage method has been much better at finding three groups in the data. When our k is equal to three, so when we perform that split and have three clusters, you can see it's done a much better job if we compare it to the actual data point labels. You can see again that it has found these two clusters quite difficult to separate, but not as badly as the single linkage method did, so that's a significant improvement. Let's also refer to our swarm plot: we use the swarmplot function from the seaborn package, and we set our x again to the variety and our y to the cluster assignment, so we've got a swarm plot for k equals two and a swarm plot for k equals three. Again you can see that setosa has been really easy to separate, whereas it's been a bit more difficult here, but it's clearly easier to see that we have these two clusters, whereas if we refer back to the previous swarm plot for the single linkage method, you can see it was only able to put those two data points into a third cluster. Again, what we can do is create a heat map of our feature means: we use loc to get the values for our features and then we use the mean function, so let's see what we get. This tells us some information about the three clusters that we have. It shows that cluster one has the largest flowers: the petal length is clearly the largest, and it's also got the longest sepal length as well. Then we have our second cluster, which is more medium-sized flowers: the sepal length is a bit smaller, so the mean differs quite a bit, the sepal width is a bit smaller than we had in the first cluster, and the petal length is also smaller, as is the petal width. And then we have our tiny flowers: the sepal length is smaller than in the second cluster, not by much but by a bit, the sepal width is the only feature where the mean is actually slightly higher than in the first and second clusters, but the petal length is by far the smallest, and so is the petal width. So you can see we've distinguished some features of our three clusters quite well with the complete linkage method.

I've also put here some information about principal component analysis: these are really good Medium articles, and also a tutorial on how to implement a principal component analysis in Python, because we don't quite have time to do it here, as we are currently at ten minutes to two. But I really suggest that you follow these links and have a look at how you can use principal component analysis to handle data sets that have higher dimensionality, where you can't visualize the results; here we were able to visualize them in a 3D graph, but obviously we can't go beyond that. So you can try using some principal component analysis on the iris data set.
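For reference, the swarm plot and feature-means heat map cells in this last part look roughly like the sketch below, again with my own variable and column names; the crosstab at the end is an extra line I've added as one quick way to compare cluster assignments against the species labels.

```python
# Swarm plots: species on the x axis, flat cluster assignment on the y axis
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.swarmplot(data=iris2, x="variety", y="cluster_comp_k2", ax=axes[0])
sns.swarmplot(data=iris2, x="variety", y="cluster_comp_k3", ax=axes[1])
plt.show()

# Heat map of the mean feature values per complete-linkage cluster (annot shows exact means)
cluster_means = iris2.groupby("cluster_comp_k3")[feature_names].mean()
sns.heatmap(cluster_means, annot=True, cmap="viridis")
plt.show()

# Quick numeric comparison of the three clusters versus the actual species labels
print(pd.crosstab(iris2["cluster_comp_k3"], iris2["variety"]))
```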