Hi everyone, my name is Louise Kaepernick and I'm a Research Associate at the UK Data Service. I just want to start by saying thank you for attending this webinar, which is part two in our machine learning series.

We're going to start with a short recap of last week's session, covering the differences between supervised and unsupervised learning. Then I'll introduce clustering: what it is, why it's useful and why you should care about it. After that we'll discuss some different types of clustering algorithms, and finally we'll cover the K-means algorithm and hierarchical clustering.

For those of you who attended our first machine learning talk, you might remember that we spoke about the different types of machine learning algorithms. These generally fall into two main categories: supervised learning and unsupervised learning. There is also something called semi-supervised learning, and something called reinforcement learning, but we're not going to focus on those today. The table just gives you a brief recap of the differences: in supervised learning the input data is labelled, while in unsupervised learning it's usually unlabelled. I'll share these slides at the end, so you can go back and have a look at this table if you want to refresh your memory.

To illustrate the differences between supervised and unsupervised learning, I'm going to draw on the famous iris flower data set, which is widely used as a beginner data set for machine learning as it's freely available online. The data set shown in the top table contains a collection of labelled iris flowers. You can see we have the attributes, sepal length, petal length and petal width, and we have the species of each flower. Our data points, A to F, each represent a flower. Let's say we want to predict each flower's species based on those attributes, so sepal length, petal length and petal width. To do this I could create a supervised learning model which uses a training data set that includes my independent variables, the attributes I've just mentioned, and my outcome variable, the species I wish to predict. Once I've trained my model with this data, I'd then be able to make predictions about the species of iris flower for unseen data points. With this data set I have a new data point F, shown in the green row, and I can now use the information supplied about the independent variables to assign this data point to a class.

The clustering algorithms that we'll be focusing on today, by contrast, are unsupervised, which means I only have input variables. In the second table I still have sepal length, petal length and petal width, but I have no corresponding output variable: my data points A to E are unlabelled, so we don't know which species they belong to. Our purpose in implementing a clustering algorithm is therefore different. Instead of using it to predict a specific species, we use it to model the underlying pattern and distribution of the data, so we can estimate the optimum number of clusters and then represent them visually. We might end up with a result like the 3D graph on the right, which groups the data points into three distinct clusters, one for each species of iris flower.

So what is clustering then? Clustering is the task of partitioning a data set into groups called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different.
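To make that contrast concrete, here's a minimal sketch of the same idea in Python with scikit-learn. The talk doesn't prescribe a language or library, so treat this as one possible way of doing it; the new flower's measurements are just a made-up stand-in for data point F.

```python
# A minimal sketch (not from the slides) of the supervised vs unsupervised contrast,
# assuming Python with scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target          # measurements and species labels

# Supervised: the model sees both the attributes and the species labels,
# so it can predict the species of a new, unseen flower.
clf = DecisionTreeClassifier().fit(X, y)
new_flower = [[5.0, 3.4, 1.5, 0.2]]    # hypothetical measurements for a data point "F"
print(iris.target_names[clf.predict(new_flower)])

# Unsupervised: the clustering model sees only the attributes, no labels.
# It groups the flowers into three clusters without knowing any species names.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])             # cluster assignments for the first few flowers
```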
So the output of a clustering algorithm is going to be this extra column here, which we've labelled "cluster", where you can see that each data point, A to E, has been assigned to a cluster.

Why bother with it? Well, clustering algorithms appeal to data scientists for a number of reasons. First and most importantly, as I've said, clustering can tell us about the underlying structure of the data, and that can be really useful in highlighting patterns and identifying groups of similar objects. By revealing the underlying structure of the data, it also allows us to identify possible outliers in our clusters. This can be done by calculating the distance of each data point from its cluster centre and then defining our most distant points as outliers, which is what's shown in the image on the right: the green points sit a little further away and have been identified as outliers. We can then remove those data points so that they don't distort our results or impede our statistical analysis.

Clustering is also useful for something called image compression, a type of data compression that's applied to digital images. Compressing images is useful because they then take up less storage space on a device, and clustering helps with this by grouping similar colours together, which reduces the number of colours in the image. There's something called lossless compression, a method for reducing the size of a file while maintaining the same quality as before it was compressed, and there's also lossy compression, which compresses a photo to an even smaller size but in doing so discards some parts of the photo. On the right you can see an image showing the result of lossy compression on a picture of a parrot. As you can see, we haven't lost a great amount of detail, which is why this method is really useful for data scientists.

Let's look at some other use cases. One of the most popular applications is in marketing and sales, where clustering is often used to group customers according to what they've bought. Purchase information and customer traits are used to develop recommendation systems, so you'll often see suggestions on sites like Amazon along the lines of "people who bought X also bought Y". The example in the picture shows how we can collect customer purchase data to identify a cluster of books that are popular with a certain audience. Then, when an individual buys two particular books, let's say Harry Potter and The Hunger Games, we can recommend them a third book, perhaps The Lord of the Rings, based on what other customers like that individual have bought.

There are also less commercial applications. Scientists often use clustering methods to group genetically similar viruses together, which can help improve our knowledge of viruses like COVID-19. Successful attempts have also been made to counter the spread of fake news online, with articles that contain a high proportion of certain terms being deemed to have a higher probability of being fake or misleading. So we could have one cluster for words or terms that are associated with fake news and another cluster for words or terms that are associated with genuine articles.
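As a very rough illustration of that idea of grouping text by the terms it uses, here's a small sketch, again assuming scikit-learn. The documents are made-up placeholders, not real article data, and this is only meant to show the mechanics of turning text into vectors and clustering them.

```python
# A rough sketch of grouping short texts by the terms they use, assuming scikit-learn.
# The documents below are invented placeholders, not real article data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "shocking miracle cure doctors hate this trick",
    "you won't believe this one weird secret",
    "central bank raises interest rates by a quarter point",
    "researchers publish peer reviewed study on vaccines",
]

# Turn each document into a vector of term weights, then cluster the vectors.
vectors = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # documents with similar wording tend to end up in the same cluster
```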
And of course there are more frivolous and fun uses as well. The second image shows the result of clustering on a Pokémon data set, with Pokémon grouped according to their attributes. Here we have three clusters: the coolest Pokémon, which includes Squirtle wearing his shades; the cutest Pokémon, which includes Togepi and Piplup; and the rather unfairly named grossest Pokémon, which includes Pokémon with less inspired designs such as Klefki, which is pretty much just a bunch of keys.

So you can see there are a lot of use cases for clustering, but what is a cluster? What are these groups that we're trying to find in our data sets? I have a really good quote here, which states that there's no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters. It's important to stress that these clusters or groups don't actually exist out there in the world; they exist based on how we interact with the data. Think about a box of Lego, for instance. There are many ways we could divide that box of Lego: maybe into two clusters based on the size of the pieces, large pieces and small pieces; or into five or six clusters depending on the colour of each piece; or even fifty clusters if we wanted to group pieces based on how they work for a certain kind of build. It all depends on the context and the criteria that we specify. There will be times, though, when you know a bit more about the clusters you want to produce. Perhaps you're clustering individuals by their gender, in which case you're going to expect at least two clusters.

Right, so let's talk a bit about the types of clustering algorithms. How useful the different algorithms are again depends on the context and on the nature of the problem that you're trying to solve.

First up we have something called centroid based clustering, which works by assigning each data point to one of a number of centroids to form groups or clusters. These algorithms, like the K-means algorithm, are efficient, effective and pretty simple to implement, but they do have the downside of being quite sensitive to initial conditions and to outliers.

We also have something called density based clustering, which works by separating high density regions of data points from low density areas. Unlike centroid based clustering it isn't sensitive to initial conditions, so we don't have to specify the number of clusters we want beforehand, and these algorithms don't assign outliers to clusters, which is really helpful. They do, however, struggle to perform well with data of varying densities and with high dimensional data.

We also have something called distribution based clustering, which contrasts with our two previous examples: centroid based clustering is based on proximity, some measure of distance, and density based clustering is based on how densely packed the points are, whereas distribution based clustering takes probability into consideration. Data points are grouped together based on their likelihood of belonging to the same probability distribution, such as a Gaussian or a binomial distribution.
The advantage that these algorithms have over the centroid based algorithms is their ability to model differently sized clusters; in the picture you can see three clusters of different sizes, and they cope quite well with that. The downside is that these algorithms tend to work well only with synthetic data, or with data points that belong to a predefined distribution.

Finally, we also have hierarchical clustering, which like centroid based clustering is based on proximity, the idea being that each object or data point is connected to its neighbours depending on the distance between them. It works by creating a tree of clusters, represented as a dendrogram. These models are better than centroid based algorithms when it comes to dealing with non-convex clusters, and we don't have to set the number of clusters beforehand. However, these algorithms can be slow and they often don't perform well on larger data sets.

Understandably, that's a lot of information to take in. I don't expect you to absorb it all now, and we will be going through centroid based and hierarchical clustering in a bit more depth. You might be wondering how you decide on the type of algorithm, and my main advice, which I really can't stress enough, is to explore your data first. This will help you weed out the less suitable algorithms. You might want to determine, for instance, whether your data falls into a predefined distribution: if it doesn't, that rules out your distribution based algorithms. Or maybe you're working with a pretty large data set, in which case you might want to avoid hierarchical clustering. I do also recommend just trying out a few different types: there are loads of great machine learning packages, and they're all designed to be pretty easy to use, so you can try out these different types of algorithms and compare the results, as in the little sketch below.
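Here's one way that "try a few types and compare" could look in practice, assuming scikit-learn. The data is synthetic, the parameter values are illustrative guesses rather than recommendations, and each estimator here stands in for one of the four families just described.

```python
# A sketch of trying one algorithm from each family on the same data, assuming scikit-learn.
# make_blobs just generates some synthetic points to experiment with.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

models = {
    "centroid based (k-means)":      KMeans(n_clusters=3, n_init=10, random_state=0),
    "density based (DBSCAN)":        DBSCAN(eps=0.8, min_samples=5),
    "distribution based (Gaussian)": GaussianMixture(n_components=3, random_state=0),
    "hierarchical (agglomerative)":  AgglomerativeClustering(n_clusters=3),
}

for name, model in models.items():
    labels = model.fit_predict(X)     # every estimator here supports fit_predict
    n_found = len(set(labels) - {-1})  # DBSCAN labels noise points as -1
    print(name, "found", n_found, "clusters")
```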
For now, though, we're going to take a closer look at a centroid based algorithm called the K-means algorithm, and then I'll go on to talk a bit about hierarchical clustering.

Right, so how does the K-means algorithm work? The first thing we do is start with our collection of data points, the input data shown in the first image, and our goal is to separate them into K clusters. The letter K denotes the number of clusters, and in this case we have chosen K to equal 3. We start the process of finding clusters by selecting three random data points. These points are now going to act as centroids, the centres of the clusters that we're going to make, so in the second picture you can see that we have randomly selected three initialisation points, represented by the different coloured triangles.

Then we assign each data point to its nearest centroid, which is what we've done in the third picture. To do that we use a distance measure; in the case of the K-means algorithm we use something called the Euclidean distance to calculate the distance from each point to each centroid. You can see that we've assigned our points here, and now each point has a colour.

Once we've assigned all of our points and we have three clusters, we find the actual centroids formed by each of them. To do this we move each initialisation point to the mean of the data points that were assigned to it. If you look at the fourth picture you can see that the triangles have moved a bit: the red triangle has moved quite a lot, the green one not so much, and the blue one has also moved quite a bit.

Then the process is repeated. Based on our new centroids, shown in picture four, we reassign each data point to its nearest cluster, and then we recompute the centroids. When we recompute the centroids again in picture six, we can see that our triangles are now pretty well placed in the centre of each cluster, quite near the middle of each group of data points. The process continues until the assignment of data points to centroids remains unchanged: we keep going until the centroids are no longer moving, and at this point the algorithm stops, as we've reached convergence. You can see that we end up with three clearly defined clusters of red, blue and green data points.

In terms of how we evaluate the quality of the cluster assignments, we do this by computing something called the sum of squared error after the centroids converge, and the aim of the algorithm is to minimise that error so that we have good quality clusters. The key to finding good clustering solutions with the K-means algorithm is that you have to run it multiple times with different random initialisation points. In this example our algorithm managed to find a good solution pretty quickly: we only recomputed our centroids three times. That isn't always the case, though. Because our initialisation points are chosen at random, we can get different results on successive runs, and that's why it's really important to run the algorithm multiple times with different random initialisation points and then compare the sum of squared error; you might find the best solution on any one of those runs.
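To make those steps concrete, here is a bare-bones sketch of the algorithm in Python with NumPy. This is my own minimal version, not the code from the webinar: it just mirrors the loop we've walked through (random starting points, assign to the nearest centroid by Euclidean distance, move each centroid to the mean of its points, stop when the assignments no longer change), and the example data at the bottom is invented.

```python
import numpy as np

def kmeans(points, k, seed=0):
    """A bare-bones K-means, following the steps described above."""
    rng = np.random.default_rng(seed)
    # Pick k random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    assignments = None
    while True:
        # Assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Stop once the assignments no longer change: convergence.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(assignments == j):          # skip empty clusters in this sketch
                centroids[j] = points[assignments == j].mean(axis=0)
    sse = ((points - centroids[assignments]) ** 2).sum()   # sum of squared error
    return assignments, centroids, sse

# Run it several times with different random initialisations and keep the lowest SSE.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
labels, centres, sse = min((kmeans(X, k=3, seed=s) for s in range(10)), key=lambda r: r[2])
```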
Another thing I want to briefly introduce is something called pseudo code. Some of you might already be familiar with this concept, but I expect we've probably got a lot of beginners wondering what it is. As we know, we use a programming language to implement machine learning algorithms, and you can think of pseudo code as a sort of recipe that you'll then follow and execute in a given programming language. When we write pseudo code we use the structural conventions of a normal programming language, but in a way that makes the algorithm concise and digestible. What I'm going to do is briefly show you how you can go from writing a basic to-do list, to translating that into pseudo code, to writing actual code. I'll move through these slides fairly quickly, as I don't want to get too caught up on this.

Here you have step one, which I call "pseudo English". All of you know how to write a to-do list, so you can start by writing a very simple numbered list of what you want the algorithm to do. Once you've got this, you can start to think about how to translate it into pseudo code. A good first step in that translation is defining the input and output: you can see that we have our input P, which is our set of data points, and our expected output, which will be our K clusters. You'll notice as well that we have some mathematical notation and some programming syntax here: we've sketched out where our while loops and for loops will be, and we've indented these to mark where the loops sit, which is going to make it easier for us to take the next step and translate this into code. Don't worry if that looks strange to you; I really don't expect everyone to be familiar with what a for loop or a while loop is. This is just to highlight how we can go from pseudo English to pseudo code. Finally, in step three, we translate our pseudo code into actual code, and the pseudo code acts as a guide that we can follow throughout, which makes the whole process of writing code much easier.

That was just a little detour, but now let's get back to the K-means algorithm and answer some of the pressing questions around it. I mentioned previously that the second step in the algorithm is the initialisation process, where we select our K initialisation points and therefore our presumed centroids. There are many different ways to approach this initialisation, and it's worth being a bit strategic about how we choose these initial points, as the results of our K-means are ultimately going to depend on them. If we think about how the algorithm works, we have this two-step iterative process where we repeatedly recompute our centroids and reassign our data points, and how many iterations it takes for the algorithm to converge largely depends on the placement of the initial centroids. If we can position these presumed centroids well, we end up with a more efficient algorithm.

I've outlined a few different approaches here. First we have the traditional approach, known as Forgy's method, which involves selecting K random data points from the data set. This is notable for being a pretty fast initialisation method, and it has the advantage of making it more likely that we'll get centroids that lie close to the modes of our data set.

There is also something called the random partition method, where we randomly assign data points to a cluster and then calculate the mean of each cluster to get the initial centroids. This tends to produce centroids that are close to the mean of the data, but it doesn't work particularly well with the K-means algorithm: if initialised using this method, K-means is more likely to get stuck in something called a local minimum, which is just a fancy way of saying that it doesn't find the best solution.

Moving away from those simpler methods, there's also something called the k-means++ algorithm, which uses a more strategic approach towards centroid initialisation. We randomly assign the first centroid to a data point and then carefully choose the remaining centroids based on the maximum squared distance, which means the centroids end up as far as possible from one another. That generally means fewer iterations, and this tends to work better than Forgy's method and certainly better than the random partition method.
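If you're using scikit-learn, the initialisation strategy is exposed directly on the KMeans estimator, so here's a small sketch of comparing the options. Treat the parameter values as illustrative; in scikit-learn, `init='random'` picks k observations at random (essentially Forgy's method), `init='k-means++'` uses the k-means++ scheme, and `inertia_` is that library's name for the sum of squared error.

```python
# A quick sketch comparing initialisation strategies with scikit-learn's KMeans.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=0).fit(X)
    print(f"{init:10s} single run, SSE = {km.inertia_:.1f}")

# In practice you let it restart several times and keep the best run:
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print("best of 10 runs, SSE =", round(km.inertia_, 1))
```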
So we've figured out how we're going to initialise our centroids, but how do we know how many centroids we need, in other words how many clusters we want? This is where we determine what K should equal. Of course, having seen the labelled iris data set, we know there are three species of iris that we want to uncover, but you'll usually be using clustering on unlabelled data, so you won't know the number of groups hidden in the data.

To find this optimal number we can use something called the elbow method, which involves running the K-means algorithm with a range of different K values, say from one to ten, and then plotting the performance metric, the sum of squared error I mentioned before, for each K. You can see that elbow plot on the bottom left. The goal of the algorithm, as I've said, is to minimise the error, but what you'll notice is that each time we increase the number of clusters, the sum of squared error decreases. That makes sense if you think about it: as more centroids are added, the distance from each point to its closest centroid is going to decrease, because there are centroids everywhere, and the sum of squared error will be zero when K is equal to the number of data points in the data set, because then each data point is essentially its own cluster and there's no error between it and the centre of its cluster.

So the goal of the elbow method is not to find the lowest sum of squared error; it's to find the point where adding more clusters no longer provides a significant decrease in the sum of squared error. As you can see, at first we have a very rapid decrease in the error, and that levels out after K equals three, where there's a bend, also known as the elbow point. After this bend we can see a very slow and gradual decrease in the error; it starts decreasing in a more linear fashion. So K equals three is our elbow, and that indicates that three is the best number of clusters we can find for this data set.

It should be noted that the elbow method is not the only method available for choosing the best number of clusters, nor is it the best one: it can have a pretty hard time determining the optimal K value when there are clusters that are relatively close to one another. A solution to this, and a more precise approach, is to use something called the silhouette score, which measures how close each point in one cluster is to the points in the neighbouring clusters. There's an example of a silhouette plot shown on the right, which I'm not going to go into in too much detail now, just because this is only an introduction and I don't want to overwhelm everyone. The main thing is that we have these different methods that can help us pick our K value.
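For anyone who wants to try this, here's a short sketch of both ideas, assuming scikit-learn and matplotlib (neither is specified in the talk). It sweeps K from 1 to 10, records the sum of squared error for the elbow plot, and prints the silhouette score for each K of 2 or more.

```python
# A sketch of the elbow method and the silhouette score, assuming scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
ks = list(range(1, 11))
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                       # sum of squared error for this k
    if k >= 2:                                    # silhouette needs at least two clusters
        print(k, "silhouette:", round(silhouette_score(X, km.labels_), 3))

plt.plot(ks, sse, marker="o")                     # look for the bend (the "elbow") in this curve
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared error")
plt.show()
```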
Let's cover a few of the strengths of the K-means algorithm. Its main strength is that it's very simple to implement, which is the reason I chose it as one of my first clustering examples. It's also fast and scalable, which means it works well with large data sets, and that's due to its simple iterative nature: remember those two repetitive steps where we continuously recompute our centroids and then reassign our points. It's very easy to apply that to a larger data set.

But there are of course some limitations worth discussing. As we covered a couple of slides back, this algorithm requires a bit of manual work: you're going to have to select the K value, and you've also got the different centroid initialisation methods to consider, such as Forgy's method or the random partition method. The results of K-means also depend largely on the initial centroid values, as I've said, and we must run the algorithm several times to avoid suboptimal solutions. The image here shows what happens when K-means converges to a local minimum and therefore produces counterintuitive results: because the initial configuration places the pink and brown centroids quite close together, after five iterations they converge to a suboptimal solution, and you can see at the end that we don't have a logical clustering of this data. That's why the centroid initialisation method matters; if we initialised this algorithm with a more sophisticated initialisation method, we'd expect to see a better result, one that respects the obvious cluster structure of this data set.

Finally, K-means can fail in cases where the clusters are of varying sizes or varying densities, or where the data is non-spherical. The first image at the bottom right shows some data which is densely packed in the blue and red clusters, but we can see that there are outliers that have been wrongly assigned to each of those clusters, and that's because the K-means algorithm defines clusters by diameter only, treating them as circular. There's also no way for K-means to account for direction, which is why it performed poorly in clustering the data in the second image, and it's also why it performs badly in the final image: it doesn't let data points that are far away from each other share the same cluster, even though, as we can see, they obviously do belong together. So there are limitations to this method as well.

We're going to move on now to hierarchical clustering. What is hierarchical clustering, and how does it differ from our K-means algorithm? Let's start with a quote from Reddy and Vinzamuri, who state that hierarchical clustering algorithms approach the problem of clustering by developing a binary tree-based data structure called the dendrogram. Once the dendrogram is constructed, one can automatically choose the right number of clusters by splitting the tree at different levels, and when you do that you obtain different clustering solutions for the same data set without having to rerun the clustering algorithm. As you can see, I've highlighted the two parts of the quote that are most important for understanding the difference between K-means and hierarchical clustering: K-means requires us to specify how many clusters we want in advance, whereas hierarchical clustering builds the clusters iteratively, linking already existing clusters that are similar into larger clusters. Once the dendrogram is constructed, we can slice it horizontally according to how many clusters we want, so in a sense we have a bit more freedom here: we're not having to pick what K equals before we run the algorithm.

On the right we have the output of hierarchical clustering describing the relationship between different primates. We can see that humans are closely related to the great apes like the chimpanzee and the gorilla, so that's one cluster there.
We can also see another cluster which includes the old world monkeys, which differ from the apes in some ways, the main difference being that they have a tail. If we continued with this example we could zoom out further and talk about primates as a cluster, then mammals, and so on. And if you look at the second picture, you can see an even more zoomed-out dendrogram which describes the relationships among different groups of animals.

We're going to talk about these dendrograms in a bit more depth now, and about how you actually read one, starting with a very simple example. On the left we have a scatter plot of five data points, A, B, C, D and E, and on the right is our dendrogram, which is the output of hierarchical clustering. The main thing to focus on when reading a dendrogram is the y-axis, or the height, which denotes the measure of distance or similarity between either individual data points or clusters. In this case the similarity between objects is judged purely by their x and y positions, so objects are most similar when they are geographically close to one another. Here we can see that A and B are most similar, because the height of the link that joins them is the smallest, and that's reflected in the scatter plot; the next two most similar objects are D and E, and we can see they're pretty close together as well.

In the second image you can see that we slice the dendrogram horizontally with the blue dotted line, so that all the resulting child branches formed below the cut each represent an individual cluster. In this case we have two main clusters, and in terms of the similarity between clusters we can see that the main difference is between the cluster of C, A and B and that of D and E, as the height of the link that joins them is the highest. We could also perform another slice, which we've done with the orange guideline, so that we have three clusters: A and B as one cluster, C as the next, and D and E as the final cluster. You can see how we get that, because the horizontal cut passes through three branches.

Let's move on to the two main approaches to hierarchical clustering, that is, the two main strategies for building your hierarchy of clusters: agglomerative and divisive. Agglomerative clustering is a bottom-up approach where we consider each observation to be its own cluster, and then we repeatedly merge the two most similar clusters, and so on, until we have one big cluster, which you can see at the bottom. Divisive clustering is a top-down approach, essentially the opposite of the agglomerative approach: instead we start with all our observations in one big cluster, and at each step we split a cluster until each cluster contains only one observation. I'm not going to linger on that slide for too long, and the slides will be available so you can always go back and refresh yourself; this is just about how we build the hierarchy. But how do we know which clusters should be combined, in the case of agglomerative clustering, or which should be split, in the case of divisive clustering?
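Before answering that, here's a minimal sketch of the "build the tree once, slice it at different levels" idea, assuming SciPy (and scikit-learn just to load the iris data); neither library is named in the talk. The linkage matrix is computed once, and cutting it at different levels gives two or three clusters without rerunning anything.

```python
# A sketch of hierarchical clustering with SciPy: build the dendrogram once,
# then cut it at different levels to get different numbers of clusters.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import load_iris

X = load_iris().data
Z = linkage(X, method="complete")     # agglomerative clustering with complete linkage

dendrogram(Z)                         # the tree of merges
plt.ylabel("distance")
plt.show()

# Slice the same tree at different levels, no rerun needed:
two_clusters   = fcluster(Z, t=2, criterion="maxclust")
three_clusters = fcluster(Z, t=3, criterion="maxclust")
print(two_clusters[:10], three_clusters[:10])
```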
The first ingredient necessary for either type of hierarchical clustering algorithm is a measure of distance. Like the K-means algorithm, hierarchical clustering is proximity based, so we're going to need some measure of similarity between pairs of observations, and the choice of distance measure is an important step because it defines how the similarity of two elements or data points is calculated, and therefore affects the shape of the clusters. The default distance measure is the Euclidean distance, which we came across in the K-means algorithm. You'll remember that in the previous example similarity was based purely on the x and y positions of our data points, and for a basic geographic clustering like that the Euclidean distance is a simple and suitable distance measure to use. I've put the formula for the Euclidean distance on the right, along with an example, so you can revisit this slide at a future date to get to grips with what it is and how it works mathematically. Once we compute the distance between every pair of observations, we end up with a distance matrix.

Depending on the type of data and the research questions, though, other similarity measures may be preferred, so there are other ways we can go about building this distance matrix. When it comes to studying gene expression, for instance, correlation-based distance is often used as the distance measure, because it considers two objects to be similar if their features are highly correlated, even if they are far apart in terms of Euclidean distance. Another way we can judge similarity is through something called the Levenshtein distance, which measures the similarity between words and is really good for grouping text or other non-numeric data; we could use it, for instance, for clustering linguistic synonyms.

The next thing we'll need to define is our linkage criterion, which determines the distance between sets of observations as a function of the pairwise distances between observations. That sounds really complicated, but in simple terms the linkage criterion is just a means of determining whether certain clusters should be merged, and different linkage methods will produce different results, so your dendrogram will look slightly different depending on the linkage method you use. The default is often complete linkage clustering, where we consider the distance between the most distant elements in each cluster; other commonly used linkage criteria include single linkage and average linkage. We use our linkage criterion to update the distance matrix and merge clusters.

Some of you might still be a bit confused about how the measure of distance and the linkage criterion work together, so I'm going to illustrate this with a more in-depth example, using the agglomerative approach with the complete linkage method. Again we have some pseudo English here, a very simple to-do list for this algorithm, and on the right we have our pseudo code: you can see that we have our input, the data set denoted by D, and our expected output, a dendrogram describing the relationships between our clusters. Don't worry about reading all of this now, as I'm going to go through the algorithm step by step, so it'll start to make much more sense.

The first thing we want to do is load in our data set. You can see our iris data set shown here with our five data points, A to E again.
Our two features, x and y, are sepal length and petal length, and on the right we have the scatter plot of these points. After we load in the iris data set we can move on to step two, which involves using some measure of distance to build a distance matrix. In this case I've used the default measure, the Euclidean distance, and as you can see I've calculated the distance between each pair of observations. In a distance matrix all the entries on the main diagonal are zero, which is understandable because the distance from A to A is zero, as is the distance from B to B, and so on. You'll also notice that the distance matrix is symmetric.

Once we have our distance matrix, we look for the pair of points with the smallest distance. I've highlighted the smallest distance here in yellow: it's between points A and B, and that makes sense, as you can see they're clearly the closest pair. So that will be our first merge. We merge A and B and then update our distance matrix; remember that the distance matrix is symmetric, which is why I've only filled in one side, just to make it easier to read.

Now that we've merged A and B into their own cluster and updated the matrix, this is where our complete linkage method comes in, because we need some means of measuring the distance between two clusters in order to perform our next merge. To do this we find the maximum distance between the elements of each cluster, which is just a case of referring back to our original distance matrix to find the right numbers. Let's look at the first pair of clusters, A, B and C. We simply locate the distance between A and C, which is 1.4, and we do the same with B: the distance between B and C is 2.2. Because we're using the complete linkage method we pick the maximum of these two distances when we update the distance matrix, so we enter the distance between B and C, 2.2, in the updated matrix. We then do the same with cluster A, B and D, and find the maximum distance between these two clusters, which is 4.1: A compared to D is 3.2, and B compared to D is 4.1, so B is furthest away and we enter 4.1. We do the same for A, B and E. For the rest of the entries we're only comparing one element to another, so for C and D, C and E, and D and E we simply enter the Euclidean distances we computed in the step before, which is why those numbers are unchanged.

After doing this we repeat the process and look for the smallest distance in the matrix, which as we can see is 1.4, the distance between points D and E. So we perform our second merge and go on to update the distance matrix, which we'll see on the next slide. We keep repeating these steps, finding the maximum distance between our clusters and then merging those with the smallest distance, and because this is agglomerative clustering we end up with all of the points merged into one big cluster.
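If you'd like to see those mechanics reproduced in code, here's a sketch using SciPy. The coordinates below are just illustrative stand-ins, not the values from the slide, so the distances and merge heights will differ from the worked example; the point is only to show the distance matrix going in and the sequence of complete-linkage merges coming out.

```python
# A sketch mirroring the worked example with SciPy. The coordinates are illustrative
# placeholders, not the data from the slide.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

points = np.array([[1.0, 1.0],    # A
                   [1.5, 2.0],    # B
                   [3.0, 4.0],    # C
                   [5.0, 7.0],    # D
                   [3.5, 5.0]])   # E

dists = pdist(points)                      # pairwise Euclidean distances
print(np.round(squareform(dists), 2))      # the distance matrix, zeros on the diagonal

Z = linkage(dists, method="complete")      # complete linkage: clusters merge by max distance
print(np.round(Z, 2))
# Each row of Z is one merge: [cluster i, cluster j, merge height, size of the new cluster],
# so reading it top to bottom retraces the kind of steps we just walked through.
```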
So let's have a look at our results. This is the output of our agglomerative hierarchical clustering on this data set. We have our measure of cluster distance on the y-axis, and we can see that our first cluster, A and B, was merged at a height of 1, and the height of that link is equal to the Euclidean distance between A and B. The second cluster, D and E, was merged at 1.4, and again that's equal to the Euclidean distance between D and E. Then we have A, B and C, which merge together at a height of 2.2, and finally A, B and C merge with D and E at a height of 5.4, which was the result of selecting the maximum distance between the elements of those two clusters; that's where the complete linkage method comes in.

The y-axis also shows us how far apart the merged clusters are: the longer the branch, the further apart the merged clusters. We can see that the longest branches in this diagram are the two lines marked by the blue dashed line labelled "two clusters". Those lines are very long, and because they are the longest branches, that indicates that going from two clusters down to one big cluster meant merging some points that are pretty far apart. If we go back we can see that that's the case: for A, B and C to be merged with D and E requires merging points that are quite distant from each other.

Let's quickly go over the strengths of hierarchical clustering; I'll move through these fairly quickly because I'm a bit conscious of time. Its main strength is that it's easy to understand and easy to implement: out of the four types of clustering algorithms we discussed, centroid based, distribution based, density based and hierarchical, the maths of hierarchical clustering is by far the easiest to understand and to program. Its main output, the dendrogram, is also the most appealing to look at: it gives you a big picture overview and highlights groups in your data, and we can view results at multiple levels of granularity, so you can really drill down into the lower levels of clusters, which is quite interesting. Compared to the K-means algorithm, hierarchical clustering is also better at generating sensible results when dealing with non-convex clusters, and we have the benefit that there's no need to specify the number of clusters you want before you run the algorithm.

Of course, there are limitations to this type of clustering as well. Although it's mathematically very simple, these algorithms are quite computationally expensive: you have to keep calculating the distances between your data points or sub-clusters, and that increases the number of computations required, especially if you're working with a large data set. Another problem with large data sets is that the results can be hard to visualise. Although the dendrogram is pretty appealing, it's less so when you've got a ton of data; this is a real dendrogram output that I produced as part of a work project, and as you can see it's pretty gross to look at. So hierarchical clustering is not very good with large data sets, and that's something to remember. Plus, when you begin analysing and making decisions with dendrograms, you'll realise that hierarchical clustering is heavily driven by heuristics, which leads to a lot of manual intervention in the process, and consequently you're going to need some domain-specific knowledge to judge whether the results actually make sense. Your objects might be incorrectly grouped at an early stage.
So you're going to have to examine the results closely to make sure they make sense. The algorithm also has the disadvantage of not being able to undo any previous steps: if it clusters two points together and we later see that the connection wasn't a good one, the program isn't able to undo that step. Finally, this is just a summary slide, which I'm not going to go through in full because it would eat into our time too much, so do go back and have a look at it; it just summarises the differences between K-means and hierarchical clustering.