Alright, hi guys, welcome. Welcome back to the second part of this live code demonstration, where we'll be explaining two types of clustering methods: K-means and hierarchical clustering. In this half we're going to be running everything in RStudio, so hopefully, if you've chosen this session, you've read the prerequisites and can keep up with the demo. Just to recap, the code can be found via the GitHub link that's currently in the chat, and all that's required is for you to clone that repo onto your local computer after you've installed and downloaded the appropriate software. If you haven't been able to do this so far, please feel free to follow the code along via the web link ending in .html, which again is available in our GitHub repo. But yeah, let's get started. I'll just zoom in a bit to make sure we can see the code clearly. If you've cloned the repo correctly, you should be seeing what I'm seeing now, which includes an R Markdown file for the K-means tutorial and an R Markdown file for the hierarchical clustering tutorial.

So, as discussed, clustering is a machine learning technique that attempts to find groups of similar observations within a data set. We're going to be looking at two data sets for K-means: one called USArrests and one called iris, which was previously used by Louise in our Python tutorial. Before beginning, you need to make sure that you have installed and loaded the correct packages. The packages for K-means include ClusterR, cluster and factoextra; the remaining packages are more related to data manipulation and pre-processing.

Let's begin by looking at our USArrests data set. It contains statistics on arrests per 100,000 residents for assault, murder and rape in each of the 50 US states in 1973, and it also includes the percentage of the population living in urban areas. The aim with this data set is to see whether there's any dependency between the state being examined and its arrest history. The first step is just to read in the data set. I'm assigning it to a new data frame called df using the assignment operator, and I've also included na.omit, which just makes sure we remove any missing values. We can use the head function to explore the data set briefly, and as you can see we have our three crime types and our urban population, with the states as row names. The next step when it comes to running K-means is to scale and standardize your data set. In R this can be done using the scale function from the base package, so nothing needs to be installed to run it. Scaling is basically a technique for comparing data that isn't measured on the same scale: each column is normalized using its mean value and standard deviation. Scaling is also known as standardizing, so these terms are synonymous here, but let's go ahead and scale our data set. Once we do this we can use the head function again to see how the results differ. As you can see, the scale function in short subtracts each column's mean (the center value) from its values and divides by the standard deviation. Once we have scaled our data set we can move on to some of the analysis, and the first part of the analysis involves a distance matrix.
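To make that concrete, here's a minimal sketch of the set-up steps just described, assuming the packages named in the tutorial (ClusterR, cluster, factoextra) are installed and that we use R's built-in USArrests data; the object name df matches what I'm using on screen.

```r
# Packages used for the K-means part of the tutorial
library(ClusterR)    # clustering algorithms
library(cluster)     # clustering algorithms and clusplot()
library(factoextra)  # clustering visualization (get_dist, fviz_*)

# Read in the built-in USArrests data and drop any rows with missing values
df <- USArrests
df <- na.omit(df)
head(df)

# Scale (standardize) every column: subtract the mean, divide by the sd
df <- scale(df)
head(df)
```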
Now, a distance matrix is one way to summarize a data set, and there are a couple of reasons why you might want to compute one before building your K-means model. The first is simply to summarize the data set: by visualizing it we can start to understand the relationships between the observations. Another reason might be as a diagnostic for more advanced analysis, such as a regression model. In our instance we're just going to use it for visualization purposes, and we can do this by running the get_dist function on our data frame; I'm assigning the result to a new object called distance. It's typically advisable to create new objects when running multiple stages of analysis, and I think it's just clearer in a tutorial to see how the data changes as we go along. So, go ahead and run that line, and we can then use the fviz_dist function to plot this distance matrix. I've also set some attributes for the gradient and the colors, and if we run this we'll get an image of the distance matrix. The higher the value, the greater the dissimilarity between two states, and the lower the value, the more similar they are, and this is just one way to start exploring your data set. But as you can see, this isn't all that easy to read and it can be quite difficult to interpret, so we can move on to run our K-means analysis to start to understand the relationships between these states and to group them into a more confined set of clusters.

So let's go ahead and run our K-means analysis. The basic idea behind K-means clustering consists of defining clusters so that the total intra-cluster variation, also known as the within-cluster variation, is minimized, and we can compute this in R with the kmeans function. Here we'll group the data set into two clusters. The kmeans function also has an argument called nstart, and basically this attempts multiple initial configurations and reports the best one for us to use. So I'm going to start with nstart set to 25 and centers set to two, and again I'm assigning this to a new object called k2. We can use the str function to view the clustering in more detail; if we print those results we'll see quite a messy list of information. So let's just explore k2 on its own, and as you can see we have our cluster means for each of our variables, that is the three crime types and the urban population. We also have our clustering vector assigning each state to a cluster, and we have the within-cluster sum of squares for each cluster. If we print these results we'll see that the grouping resulted in two clusters of sizes 30 and 20; as you can see, apologies, as you can see in this output, two groupings of 30 and 20. We also get the cluster assignment for each observation, and this is just one way to explore that. If there are more than two dimensions, it's important to know that fviz_cluster will perform a principal component analysis and plot the data points according to the first two principal components, which explain the majority of the variance. But for now we can just go ahead and visualize our data set using the fviz_cluster function: we call on our clustering result, which is k2, and we refer to our original data set as df. If we run this visualization you can see how our states have been clustered into two groups.
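As a rough sketch of the commands just described, assuming the scaled data frame df from the previous step; the specific gradient colors are only an example, not necessarily the ones on screen.

```r
# Distance matrix between states, visualized as a heatmap
distance <- get_dist(df)  # Euclidean distance by default
fviz_dist(distance,
          gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# K-means with 2 clusters; nstart = 25 tries 25 random starts and keeps the best
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
k2  # prints cluster sizes, cluster means, clustering vector, within-cluster SS

# Visualize the clusters (a PCA is used behind the scenes if > 2 dimensions)
fviz_cluster(k2, data = df)
```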
We can see that there's a clear distinction, there's no overlap, and the variation within the two clusters is similar. You could also use a standard pairwise scatterplot to illustrate the clusters against the original variables, and we can do this using functions from the dplyr package, this is the as_tibble step, sorry. Calling on our original data set, we pipe it into as_tibble, we then mutate it so that the state names and cluster assignments are added as proper columns, and then we can use ggplot to run a simple scatterplot. Let's have a look at how this looks. As you can see, it's very similar to what we produced with the K-means plot, but we haven't got those nice cluster regions that we saw before, so this is why someone might choose the K-means plot over a standard pairwise scatterplot.

But nevertheless, that brings us to our second data set. We've briefly explored the US crime data set, but we can now move on to look at iris, which is a labeled data set. The algorithm will cluster the data and we will be able to compare the predicted clusters with the original labels, giving us the accuracy of the model. Let's just have a look at what happens if we create a simple ggplot of the relevant variables. As we can see, setosa is going to be easier to cluster, meanwhile there is noise between versicolor and virginica even though they might look neatly clustered; you can see that mixing between the two colors, blue and green.

So let's move on to running the model. Again, kmeans ships with base R, so we don't need to install any packages, and in the kmeans function it is necessary to set centers, the number of groups, as we did before with the crime data set. In this case we know that this value should be three, but we're going to try to build a model as if we didn't know how many clusters we had, because we're dealing with unsupervised data. The first step is to set the seed, which just fixes the random initialization so the clustering is reproducible. We create a new object called irisCluster using the kmeans function, and this little bit of code here selects columns one to four and excludes column five, which is the labeled species variable that we don't want. We then set centers to three and set nstart to 20. If you run this code you'll see in your global environment that a new object has been created called irisCluster, so let's have a quick look at that. Again, we have our three clusters of sizes 38, 62 and 50, and we also have the within-cluster sums of squares. We can use the table function to compare the predicted clusters to the true species labels, and this brings up a nice little table that gives you a very neat summary. The next step is to plot this data so that we can see those clusters a bit more clearly, and here we have a really nice display of our three plant types, using some plotting attributes such as color, shade and labels, which I've set to make sure the clusters are drawn in different colors. As you can see, there's still that overlap, still that variation, between the two plant types we saw before. So the next step would be to evaluate your model.
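Before we move on to evaluation, here's a hedged sketch of both of those steps, the pairwise scatterplot of the USArrests clusters and the iris K-means fit. The column choices in the scatterplot, the seed value, and the clusplot call are assumptions based on what's described above, not a verbatim copy of the screen.

```r
library(dplyr)
library(ggplot2)
library(cluster)  # for clusplot()

# Pairwise scatterplot of two of the original variables, colored by cluster
df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state   = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()

# K-means on iris, ignoring the species label (column 5)
set.seed(20)  # any fixed seed makes the clustering reproducible
irisCluster <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
irisCluster

# Compare the predicted clusters with the true species labels
table(irisCluster$cluster, iris$Species)

# Plot the clusters; color/shade/labels are just presentation options
clusplot(iris[, 1:4], irisCluster$cluster, color = TRUE, shade = TRUE, labels = 2)
```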
We're going to be using the elbow plot method to do so, because we might not always know the exact number of centers, especially when we have an unlabeled data set; the elbow plot lets us examine the number of centers we defined. I think it's important to recall that the basic idea behind clustering methods such as K-means is to define clusters so that the total within-cluster variation is minimized, and this is exactly what the elbow plot examines. So let's run this function here. What it does is compute that within-cluster variation over a range of cluster counts and plot it, and wherever there is a bend is the likely optimal number of clusters. In our case the result suggests that four could be the optimal number of clusters, as that appears to be the bend, or elbow, in the plot.

I'd also like to draw attention to another way you could choose the number of clusters. You can use the fviz_nbclust function, from the factoextra package if I remember correctly, and I'll quickly write this out because we have a bit of time. I'm going to create a new data frame called df2 and again read in the iris data set. The second step is to scale this data set, using that scale function from the base package. We call it on df2 minus the categorical variable, because we want that removed, so you could either select columns one to four as before or simply drop the column that isn't relevant to you. So, first we take df2, then we scale that object, and then we can use the fviz_nbclust function to create a nice plot. We pass in kmeans, and the method I'm going to use is "wss", which stands for the total within-cluster sum of squares, and then we get that same kind of elbow plot. You could also add a geom_vline, which is part of the ggplot2 package, with an x-intercept; I'll set that to 3 and a linetype of, let's say, 2 so it's not too heavy, and this adds a vertical line where we can see that the optimal number of clusters is indeed three.

But yeah, that's our K-means analysis, so we're going to move on to the hierarchical clustering, let me just open that up. So, let's get started. As mentioned in our previous sessions, there are two approaches to hierarchical clustering, the agglomerative, if I've said that right, and the divisive, also known as the bottom-up and the top-down approach. I think it's important to know that the first is good at identifying small clusters, whereas the second is better at identifying larger clusters. We use the same packages for K-means and hierarchical clustering, so that's ClusterR, cluster and factoextra; however, there are some additional packages to produce some of those visualizations, including dendextend, circlize and colorspace, but these are all just for dressing up the plots. Hierarchical clustering can be performed on a data matrix using the hclust function, which ships with base R in the stats package, and we're going to continue looking at the iris data set. So, the first step would be to load and prep the data. We've done this before, but I'm just going to rerun it again. We can get a quick summary of our data set, but we've already seen these variables, the petal length and petal width, as well as the species, but that species variable is not appropriate when we are dealing with unsupervised methods, so we're going to treat this as an unlabeled data set.
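Here's a rough sketch of both elbow approaches for the iris data. The manual loop is just one common way to write the kind of function the tutorial runs, not necessarily the exact code on screen, and the range of k values is an assumption.

```r
library(factoextra)
library(ggplot2)

# Manual elbow plot: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# The same idea via factoextra, with a reference line at k = 3
df2 <- scale(iris[, -5])
fviz_nbclust(df2, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
```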
It's always good to check for missing values, and you can do that with is.na, for example by wrapping it in sum or anyNA. If the iris data set did have missing values, you could simply omit them with na.omit, as we did for the USArrests data set. Now, I'm going to briefly discuss two different approaches that result in the same output and the same values, but because R has so many packages available for clustering, I think it's interesting to introduce two workflows so that you can explore them yourself.

But yeah, let's start with some basic clustering. Again, we need to perform this on just the numeric values, so we use the square brackets to select columns one to four. We're going to be using the dist function: basically, this function computes and returns the distance matrix, using the specified distance measure to compute the distances between the rows of a data matrix. Slightly wordy, apologies, but if we run this we have a distance matrix that allows us to run hclust, so you need to run this step first. Now we can use hclust to run the hierarchical clustering, and I've assigned this to a new object called hcluster, just for a bit of clarity. If we explore this we'll see that we've used a Euclidean distance and that there are 150 objects, which we already knew. It's also worth knowing that the distance can be of any type, Euclidean, Manhattan and so on, but the dist function uses the Euclidean distance by default. Once we have established our hierarchical clustering we can go ahead and visualize it, and yes, that's all it takes to run a hierarchical clustering in R, so maybe a little less complicated than the Python code, and I'm going to talk about some of the advantages and disadvantages of using either towards the end of this talk.

So let's go ahead and visualize our clusters using a dendrogram. Just to recap, a dendrogram is simply a diagram that shows the hierarchical relationship between objects. Your first step would be to convert your hierarchical clustering result into a dendrogram, and we can do this using as.dendrogram. I've assigned this to a new object called hcd, which just stands for hierarchical clustering dendrogram, calling the as.dendrogram function on our hierarchical clustering. Let's go ahead and run that. We then need to set some attributes just to make the plot a little better, which is what I've done here. I'm using nodePar, which is basically a list of plotting parameters to use for the nodes, and this list itself contains further attributes: lab.cex is a numeric value for the label size, pch stands for the plot character and is a standard argument for setting the symbol that gets plotted in a number of R functions, cex is again a numeric value indicating the point size, and then we have our colors. We can then go ahead and plot this using the plot function; I've added a few labels, changed a few colors, and made the image horizontal. So let's see what happens if we run this chunk of code. Now we have a clustered dendrogram, which is similar to what we saw in the Python tutorial. Obviously in this instance the dendrogram is horizontal rather than vertical, but you can simply change this if you have a preference by setting horiz to FALSE instead of TRUE.
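Here's a rough sketch of that first dendrogram workflow, a sketch only: the exact nodePar values, colors and titles are illustrative assumptions rather than the settings on screen.

```r
# Dissimilarity matrix on the numeric columns, then hierarchical clustering
d <- dist(iris[, 1:4])   # Euclidean distance by default
hcluster <- hclust(d)    # "complete" linkage by default
hcluster

# Convert to a dendrogram and tweak how the nodes are drawn
hcd <- as.dendrogram(hcluster)
nodePar <- list(lab.cex = 0.6,   # label size
                pch = c(NA, 19), # plot character for internal nodes / leaves
                cex = 0.7,       # point size
                col = "blue")    # point color

# Horizontal dendrogram; set horiz = FALSE for a vertical one
plot(hcd, nodePar = nodePar, horiz = TRUE,
     xlab = "Height", main = "Cluster dendrogram")
```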
So now we've got this the right way up, we can see our three defined clusters, and that has produced our first dendrogram for the iris data set. So now we're going to move on to method two, which is another way to produce a hierarchical clustering and a chance to explore some of the packages and functions available in R. Again, the first step always involves removing those categorical variables. I'm creating a new data set, this time called iris2, so we keep some separation from what we've already worked on and don't lose anything we've done. We're then going to store the species column as a separate object called species, and if we have a quick look at it we see the categorical variable has just been moved into its own object, which means we don't lose any data and it can be saved separately. The last step would be to convert the species values to numeric and add a color palette, so that's exactly what I'm going to do, and I'm also assigning this to a, I've probably not loaded a package, I think it might be from colorspace, but let's have a look. Yeah, there we go, typical, our errors are always about loading packages. But yeah, I've assigned this to a new object, converted the values to numeric, and added a color palette, so that when we come to plotting our data set we have different clusters with different colors.

For our next step, you could try to plot a scatterplot matrix: before even running any type of analysis, it's recommended to evaluate the variation in your data. A scatterplot matrix is basically a useful tool to help visualize the relationships between multiple quantitative variables. You could also use something like parallel coordinates to explore high-dimensional multivariate data, but for the sake of simplicity I've stuck with a scatterplot matrix because I do feel they are sometimes easier to interpret. To plot a scatterplot matrix you can simply use the pairs function: we call on our data set, adjust some of the attributes, color the points by that species object we created above, and I'm simply going to add a legend to make this image a little more readable. If we run this we can see the differences between the sepal length and sepal width, the petal length and the petal width, and the relationships between them. We can see that the setosa species is distinctively different from versicolor and virginica, and this is because it has a lower petal length and a lower petal width, but versicolor and virginica cannot easily be separated based on measurements of their sepals and petals. So a scatterplot matrix provides really good grounds for deciding whether, or how, your data set should be clustered, and it allows you to explore some of that variation before even starting to create a model.

So yeah, the default linkage method in the hclust function is "complete", and as we learned in the K-means tutorial we measure the dissimilarity of observations using distance measures, that's the Euclidean distance or the Manhattan distance, and in R the Euclidean distance is used by default to measure the dissimilarity between each pair of observations. As we know, it's very easy to compute this dissimilarity using that get_dist function, but how do we measure the dissimilarity between two clusters of observations?
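Before we answer that question, here's a hedged sketch of the method-two prep and the scatterplot matrix from a moment ago. The rainbow_hcl palette from colorspace and the legend position are assumptions; they may need adjusting for your own plotting device.

```r
library(colorspace)  # for rainbow_hcl()

# Keep the numeric measurements and store the species labels separately
iris2   <- iris[, -5]
species <- iris[, 5]

# Convert the labels to numbers and build one color per species (assumed palette)
species_num <- as.numeric(species)
cols <- rainbow_hcl(3)

# Scatterplot matrix of the four measurements, colored by species;
# the empty lower panels leave room for a legend
pairs(iris2, col = cols[species_num], lower.panel = NULL, pch = 19)
par(xpd = TRUE)
legend(x = 0.05, y = 0.4, legend = levels(species), fill = cols, bty = "n")
par(xpd = NA)
```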
There are a number of different cluster agglomeration methods, known as linkage methods, that have been developed to answer this question. Off the top of my head there are five common methods for measuring the distance between clusters, and we're going to be looking at three of them: the complete, the single and the average. The first is the complete linkage, and, excuse me, the complete method basically computes the maximum distance between the clusters before merging them: it computes all pairwise dissimilarities between the elements in cluster one and the elements in cluster two and considers the largest of those dissimilarities. So let's have a go at running a complete-linkage hierarchical clustering on our iris data set. The first step is to call that dist function, which gives us the dissimilarity matrix. We can then run our hierarchical clustering using the complete method, and we then need to re-level, I'm just simply re-levelling some of those factor levels here. We can then build our dendrogram. We also want to re-order the observations, which we can do using the rotate function, and again I've probably forgotten to load a package, so I'm just going to rerun all of these lines because I got an error saying the function didn't exist. There we go. I think that was from the factoextra package that I forgot to load at the beginning, but never mind. We can then move on to coloring the branches based on the clusters, using color_branches and setting k to three, so we can go ahead and run that line. The next step would be to manually match the labels to the real classification of the flowers, using sort_levels_values, so we can go ahead and run this. The step after that is to add the flower types to the labels, and we can then hang the dendrogram, which basically just spaces the leaves out a bit better so we can see the differences between the nodes in our dendrogram. This next line simply reduces the size of the labels, and our final step would be to plot these.

So if we plot this, we see that we have our clustered iris data set. We have a slightly nicer diagram than we did running it with method one, but as you can see we still have our three clusters. Again, if it becomes confusing to read a dendrogram that's horizontal, we could simply change horiz to FALSE so that we have a vertical dendrogram, and if we run this again it will simply rotate onto its side. I think it's important just to explain how exactly to read a dendrogram, because I understand it can be confusing: basically, each leaf corresponds to one observation, and as we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher level. The height of each fusion, shown on the vertical axis, indicates how dissimilar two observations or clusters are: the higher the height of the fusion, the less similar they are. So that's just how you interpret a dendrogram. So once you have your dendrogram, your next step would be to run some sort of evaluation process on it. Again, we're going to be using the elbow plot method to do so. Similar to how we determined the optimal number of clusters with K-means, we can take a similar approach for hierarchical clustering.
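Before moving to the elbow method, here's a hedged sketch of the complete-linkage pipeline we just walked through, based on the dendextend and colorspace functions mentioned earlier; the hang height, label format and label size are assumptions rather than the exact values on screen.

```r
library(dendextend)
library(colorspace)

# Dissimilarity matrix and complete-linkage clustering
d_iris  <- dist(iris[, -5])
hc_iris <- hclust(d_iris, method = "complete")

# Build the dendrogram, re-order the leaves, color the branches into 3 clusters
dend <- as.dendrogram(hc_iris)
dend <- rotate(dend, 1:150)
dend <- color_branches(dend, k = 3)

# Match the leaf label colors to the real species classification
labels_colors(dend) <-
  rainbow_hcl(3)[sort_levels_values(as.numeric(iris[, 5])[order.dendrogram(dend)])]

# Add the species name to each leaf label, hang the leaves, shrink the labels
labels(dend) <- paste(as.character(iris[, 5])[order.dendrogram(dend)],
                      "(", labels(dend), ")", sep = "")
dend <- hang.dendrogram(dend, hang_height = 0.1)
dend <- set(dend, "labels_cex", 0.5)

plot(dend, horiz = TRUE, main = "Clustered iris data set (complete linkage)")
```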
To perform the elbow method, you just need to change the second argument in the fviz_nbclust function to FUN = hcut, which indicates that this is a hierarchical method, and if we run this we'll see a similar elbow plot with that bend at three. So let's have a look at how we could explore different linkage methods. We talked about the complete method, which is the maximum distance, but other methods include the single linkage, so let's have a look at how the single method would run. The single method is kind of the opposite of the complete in that it computes the minimum distance between clusters before merging them. We can run this whole chunk of code by just pressing the arrow that faces to the right, and this will plot our dendrogram for us, and as you can see we have a really messy picture, which would indicate that the single method probably isn't the best way to compute the pairwise dissimilarities between our clusters. This is a typical flaw of the single method: it tends to produce long, loose clusters that can be difficult to interpret. But what about the average method? We can go ahead and change this to average just to see how it looks. With the average method, what we are doing is basically computing the average distance between clusters before merging them, so we have the maximum, the minimum and the average as three different linkage methods. The average method computes all pairwise dissimilarities between the elements in cluster one and the elements in cluster two and then considers the average of these dissimilarities as the distance between the two clusters. So let's have a go at running this chunk of code with the method changed to average, and as you can see we have a somewhat different dendrogram, a bit more similar to our complete method, but you can still see that there are three clusters, and that would be our ideal value of k. But how would you go about analysing which method is better, other than just looking at a dendrogram?

Well, we can examine which of these has the strongest clustering structure, or the strongest linkage method. For the purpose of this I'm going to create an object that groups the three different methods we tried into a new object called hclust_methods, and then basically chain them together into a single dendlist object, which, as the name implies, can hold a bunch of dendrograms together for the purpose of further analysis. I'm then going to use a for loop over that sequence, I won't explain in too much detail what it does, but basically it allows us to compare those three different methods, and now we have an iris_dendlist that includes all three. What we do next is obtain the correlation coefficients: cor.dendlist computes the correlations between the dendrograms.
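Here's a hedged sketch of those last two steps, the hierarchical elbow plot and the linkage comparison. The corrplot call at the end is one way to draw the correlation plot discussed in a moment, assuming the corrplot package is available; the scaling of the iris data in fviz_nbclust is also an assumption.

```r
library(factoextra)
library(dendextend)

# Elbow plot for hierarchical clustering: swap kmeans for hcut
fviz_nbclust(scale(iris[, -5]), FUN = hcut, method = "wss")

# Build one dendrogram per linkage method and chain them into a dendlist
hclust_methods <- c("complete", "single", "average")
iris_dendlist <- dendlist()
for (i in seq_along(hclust_methods)) {
  hc <- hclust(dist(iris[, -5]), method = hclust_methods[i])
  iris_dendlist <- dendlist(iris_dendlist, as.dendrogram(hc))
}
names(iris_dendlist) <- hclust_methods

# Correlation between the dendrograms (cophenetic correlation by default)
cors <- cor.dendlist(iris_dendlist)
cors
corrplot::corrplot(cors, "pie", "lower")  # visualize as a correlation plot
```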
In this case we're using a coefficient known as the cophenetic correlation, I believe it's called, but you could also use something like Baker's correlation, and this is where you as a researcher have to start questioning which measure would be better for your data set. I'm going to go ahead and just use the default setting, and here are our correlation statistics for each of our methods, the single, the complete and the average, and we can simply plot these using a correlation plot. From this figure we can easily see that most of the clustering methods yield very similar results, except for the complete method, which yields a correlation measure of around 0.6, I'd say.

And yeah, that brings us to a conclusion, well, that's the end of our tutorial for K-means and hierarchical clustering. I would like to just draw attention to an additional R Markdown script that I wrote. I won't be going through it now, but if you're interested in understanding how a principal component analysis works in a bit more detail, then you can explore this wholesale customer data set and work through some of the code by yourself. It starts by running some pre-processing and some descriptive analysis, we focus on standardizing the data set, we then fit a K-means model, so this is exactly what we've done in our previous two scripts, we then look at evaluation, and then we have the PCA at the bottom here. I'm not sure we have time to run through this now, no, not quite, but take your time to explore it yourself and have a go at running a PCA analysis on your own.

So yeah, thanks for listening. I think I'll just summarise with a few points when it comes to hierarchical clustering: yes, clustering can be a very useful tool for data analysis in an unsupervised setting, however there are a number of issues that arise before you begin clustering, and I think there are three things that you should probably be concerned about when running this type of model. The first is what dissimilarity measure should be used. The second is what type of linkage you should use; we had a go at exploring three different linkage methods and then looked at the correlation coefficients between the three to understand how strongly they agree. And the last is where you should cut the dendrogram in order to obtain your clusters. A lot of these questions can be answered, I guess, depending on your questions as a researcher, your interests, your aims, your purposes, so each of these decisions can have a strong impact on the results obtained, but there is no single right answer as to which linkage method is right for you. I think as long as you choose a method that comes with an explanation and fits your research methods and aims, then that should be it. And I understand that some of these K-means and hierarchical clusterings can be performed really, really quickly in R, as you saw, but there are many important things to consider, not just the plotting attributes, but these deeper theoretical questions about which dissimilarity measure should be used. But yeah, that's the end of this tutorial. There are some references listed in the R Markdown if you'd like more information on some of these functions and packages, but thank you all for listening.