We left off looking at this heat map, and one way we can interpret the heat map, of course, is to realize that some of the genes are similar and some of the conditions are similar. So for example, I can define a set of FNIP1, MED13L, NRIP1, i.e. the genes here at the top: they are low at T4 and low at T6, but high at T0 and high at T2 here. And there are genes for which it's vice versa, so FBXL20, where is that, CCNE1, yeah, these here: they are low at T0 and T2 and high at the others. So we can define these two sets simply by inspecting this, manually looking at it and typing out the gene identifiers, and then compare and contrast the actual expression levels with a so-called parallel coordinates plot, a matrix plot: matplot(). This is a useful tool if you want to look at high-dimensional, multi-dimensional data: we put it on a parallel coordinates plot. Essentially this is like plotting gene expression profiles, except instead of expression time points it could be anything: you could have values for PCA columns 1, 2, 3 and 4, or whatever these dimensions are. matplot() expects the values ordered column-wise, but normally we have values ordered row-wise: samples in rows and attributes in columns. matplot() expects exactly the opposite, but we can fix that easily with the t() function. t() is an operator on two-dimensional objects and it transposes them: rows become columns and columns become rows. So: matplot() of t() of our data set, of the set 1 that we've defined here; type is lines, line width (lwd) is 2, colour is sky blue, line type (lty) is 1, i.e. solid lines, not dotted. So these are the expression values here. And then we can use lines() to superimpose the genes for gene set 2. So rather than another matplot() here, we use the lines() command, the command we can use to put lines on existing plots. Now lines() expects elements in rows and attributes in columns, so no transpose for lines(). We can superimpose these lines. Those lines are expression values at the different time points for two sets of genes which I've identified by staring at my heat map. Staring at the heat map tells me there are some genes which are similar: they're located together, and I can pull them out in this way. And lo and behold, that actually does happen. I don't know why I'm scaling this in this odd way here; I should update that to something like 8 and 12. I guess I'm asking, what's the difference between the red lines and the blue lines? One is set 1 and one is set 2. Set 1 is FNIP1, MED13L, NRIP1 and so on, the genes that I've observed here. You pulled them out because they were clustered? Exactly. They're close together here. I can kind of read that; on my monitor at home I can read it even better. I could reduce the character size and then perhaps even read it here. That's a bit of a nuisance, isn't it? There are ways around that. And the other one is the gene set that I find here: they're low on this part and low on this part. So that's just something I discovered by visual inspection. I then manually pull out the gene names, define them as set 1 and set 2, and put them in this parallel coordinates plot. So indeed, I didn't know that these were genes that behave similarly; I just picked them out because they appeared at similar spots on the heat map. But indeed, they do have similar properties. Now, hierarchical clustering is probably the most basic technique of clustering.
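Before we move on to hierarchical clustering: as a sketch, the matplot()/lines() calls just described might look roughly like this. Treat all object names as placeholders; expr stands for the expression matrix (genes in rows, time points in columns), and the gene sets are just the identifiers read off the heat map, not the course code.

```r
# Hypothetical gene sets picked by eye from the heat map
set1 <- c("FNIP1", "MED13L", "NRIP1")
set2 <- c("FBXL20", "CCNE1")

# matplot() draws one line per COLUMN, so transpose: genes become columns
matplot(t(expr[set1, ]), type = "l", lwd = 2, lty = 1, col = "skyblue",
        ylim = c(8, 12), xlab = "time point", ylab = "expression")

# lines() takes a plain vector, so we loop over rows - no transpose here
for (gene in set2) {
  lines(expr[gene, ], lwd = 2, col = "firebrick")
}
```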
And actually, hierarchical clustering was applied by the algorithm that produced this heat map. Hierarchical clustering essentially builds dendrograms, and it does that based on a distance matrix. So what hierarchical clustering does: it takes the whole high-dimensional data set and calculates a Euclidean distance, i.e. a scalar value, between every pair of points in that high-dimensional data set. It's just Pythagoras' theorem: calculating the hypotenuse when we know the values along the individual dimensions, the individual axes. So everything is projected into this distance matrix. The first step of hierarchical clustering is to use the dist() function and calculate a distance matrix, and then we cluster the distance matrix to produce a dendrogram. So how do we do that? Well, in principle, all the algorithms that do this are some variant of the following: first we pick the two most similar points in our data set, the ones with the smallest distance, and we merge them. After we merge these two points, we update our distances, because the distance to the merged pair is, say, the average of the distances to the original two. Then we go back and ask: what are the two closest points now, other than the ones we've already merged, or relative to that merged average? And that gives us another connection in our dendrogram. So the two closest points might be these two, and then the third closest point connects to those two, and then perhaps we also merge this one and this one, and so on, until all the points are connected in some way, and that defines our dendrogram. There are variations on that theme. One concerns exactly how we compute a distance: do we compute it through space, assuming this is a Cartesian space, or do we use a so-called Manhattan metric, where we just add up along the dimensions and don't go through space, or some other kind of measure? We'll look at some alternatives. That's one thing, and the other thing is how we merge the points: what's the distance to a merged cluster? Is it the distance to the average, or the minimum distance between cluster members, or the maximum distance between cluster members? All of these can be used and have slightly different interpretations. For practical purposes, that means there are several different methods with which we can build hierarchical dendrograms, and they will give us different results. And once again, there's no principled way to choose which one is preferable: we just have to try a few and see what makes sense against our biological knowledge of the data. By default, the distance matrix is built from Euclidean distances, and the default for building the dendrograms is complete linkage, which uses the maximum distance between cluster members. So: one command, our object hc is the result of the function hclust() applied to the distance matrix. You don't apply it to the original data, you apply it to the distance matrix. And then we plot that. So that's what it looks like: of course very similar to the dendrogram which our heat map has produced, except the font is larger so we can't read anything here, and it's turned sideways. You'll just need to explore that.
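A minimal sketch of those two steps, again with expr standing in for the data:

```r
d  <- dist(expr, method = "euclidean")  # step 1: the distance matrix
hc <- hclust(d, method = "complete")    # step 2: the dendrogram (complete linkage is the default)
plot(hc, cex = 0.5)                     # cex shrinks the labels so the names are legible
```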
So what you're looking for is basically a distance metric that gives a clear block structure in the heat map. If we calculate this from Euclidean distances, that's what the heat map looks like; I think this is essentially what we had before. What these distances actually are is defined in the help page. Just for illustration here, they come out differently: the Canberra distance metric gives slightly less block structure, I would say; maximum looks very similar; Minkowski maybe also slightly less. But you're not confined to the default distance functions. If you have some idea about how to intelligently define a metric between your points based on some biological observation, say distance along a family pedigree, which would not be covered by any of the metrics we've looked at here, then you can just plug in your own function. So here I'm plugging in the function one minus the absolute value of the correlation. We're calculating correlation coefficients along the vectors: if two points are perfectly correlated, the distance is zero, and if they're uncorrelated, the distance becomes one. So highly correlated points have small distances under this metric. I take the transpose of x, calculate the correlations, take the absolute value, subtract that from one, cast that as a distance object, and thus return a distance metric. So this correlation distance is applied, and that actually looks very well structured. This correlation metric is often one that is quite good. Now I'm intrigued by something that I see here. Without going into the details, it looks to me as if all of these genes here have a similar biological name. If that's true, that would actually be a very interesting external validation that we're doing something reasonable. Those would all be the histone genes. Looks like it. Could be, because they're a big gene family. Could you use the cex parameter, or par(), or whatever, to make the text smaller? I don't know what our x limits are here. Are you sure they're histones? I want to zoom in, and in order to zoom in I would need to know what the values of our coordinates are here. There's a function for that. If you print it to a PDF, you can just about make out that it starts with HIST. I just printed it to a PDF. That's true: problem solving with brute force. I like low-tech solutions. I'm actually looking this up because it's useful. In fact, they're almost all histones. Where are we? Here: usr, "a vector of the form...". Okay, here we are. So querying the graphics parameter par("usr") gives you the coordinates of the current plot, and that's actually quite useful: sometimes you need to know what these coordinates are to put some particular shape into the plot, or put something into the margin, or do something with it. So these are the current coordinates, and they range from minus 7.9 to 231, and from minus 14 to... I just need to know which is which: x1, x2, y1, y2. So we're somewhere between minus 14 and 21. So ylim: I think I need to go somewhere into the middle here, which would be around 14, so let's say something like from 12 to 16. It doesn't like ylim. I don't know why this doesn't work; I need to get back to that. Anyway, I'm surprised it doesn't work. I'll play around with it and figure it out.
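To collect the pieces from this segment in code: a sketch of a custom correlation distance, handed to heatmap() via its distfun argument, plus the par("usr") query. The object name expr is again a placeholder.

```r
# Correlation distance: highly correlated rows get distances near 0.
# cor() correlates columns, so we transpose to correlate genes (rows).
dCor <- function(x) as.dist(1 - abs(cor(t(x))))

# heatmap() accepts any function that returns a "dist" object
heatmap(as.matrix(expr), distfun = dCor)

par("usr")  # c(x1, x2, y1, y2) of the current plot region, useful for zooming/annotating
```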
However, if these are all histones and the one-minus-correlation really puts all the histones together here, that shows us that the one-minus-correlation is actually a pretty good way to define distances for this map. Okay. Since this looked the best, I'll just do a clustering with this one: calculate the clustering, and now this gives us a dendrogram. This is the dendrogram we got from the clustering. But that doesn't give us clusters yet, right? We were doing clustering, so how do we get clusters from a dendrogram? The answer is: you conceptually cut the dendrogram, and the branches that fall off at that point give you the clusters. So if I cut up here, cut this and this and this and this, one cluster will be these genes, another cluster will be these genes, a third cluster will be those, and so on. The level at which I cut determines how many clusters I'm going to get: if I want two, I cut at the top; if I want five, I cut a little lower; if I want 20, I cut lower still. So the number of clusters depends on how I look at my dendrogram, essentially on the resolution at which I want to cluster my data. This is very different from so-called partition-based methods, where you have to input how many clusters you want. Here you can get everything between one cluster and as many clusters as there are elements in your data set. And that actually tells you something important: it tells you that what constitutes a cluster is in the eye of the beholder. It depends on how you want to look at it. There are many ways to cluster your data, and the best way is the one that best supports the kind of analysis you want to continue doing with it. There are a few methods, which we'll discuss later, for comparing different clusterings. Now, the next thing is we want to retrieve the actual indices and use them to generate parallel coordinate plots, to see whether what we got there is correct. For that we use the cutree() function. Note the spelling: it's not cuttree. If you spell it with a double t, which would be the correct English spelling, it doesn't actually work. So I run this, assign it to the variable class, and get this output here. I think it's pretty obvious what this means: these are the gene names, i.e. the row names of the data we used to generate the dendrogram, and the numbers are cluster indices. So HIST1H2BM is in cluster 1, CDCA8 is in cluster 2, as is CCNE2, as is FAM83D, and so on. Those are the different clusters. Now, one question we might be interested in is how many clusters there are and what the size of each individual cluster is, and for that we can use our table() function. If we table them, we get this, and if we sort that, you see there are a few clusters with only two members; the major clusters have something like 62, 32, 28, 15, 11, 9 members. So we can use this class vector, which gives us for every row the cluster it lies in, to generate parallel coordinate plots. I need to adapt my code a little bit here. The most prominent clusters with the distance metric that I used are 3, 5, 7 and 2, so: class == 3, class == 5, class == 7 and class == 2. These were the clusters with the largest membership. And then I can just do a parallel coordinate plot of our data by subsetting the rows for which class == 3, or 5, and so on, and plotting that. So these are the results.
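A sketch of the cutree() workflow just described; the choice of k = 10 clusters is my illustrative assumption, not the number used in the lecture.

```r
hc    <- hclust(dCor(expr))   # dendrogram from the correlation distance
class <- cutree(hc, k = 10)   # note the spelling: one "t", not "cuttree"
sort(table(class))            # cluster sizes, smallest to largest

# parallel-coordinates view of one cluster, e.g. cluster 3
matplot(t(expr[class == 3, , drop = FALSE]), type = "l", lty = 1,
        xlab = "time point", ylab = "expression")
```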
Our cluster 3 indeed looks like something that is cyclically expressed, and so on. Now, this result is actually nicer than what I've seen with the other metrics, because the other metrics, Euclidean or Minkowski or Manhattan or whatever, were quite sensitive to the absolute values. What I got with the other metrics was that something like this would be split into two clusters: the higher expression levels and the lower expression levels. And obviously, if you calculate a Euclidean distance, the absolute height of the expression value makes a big difference. In our case here it doesn't, because we're calculating correlations, and correlations are not sensitive to the absolute values but essentially just to the shapes of the curves. So what we're implicitly doing by using a distance metric based on correlations is normalizing the data to the same level. I have a question about the parallel coordinate plot. I'm just trying to think about how this would apply to the kind of data I work with. I don't usually have time-scale data; I'm comparing different units. So in that case, are these only useful because we're looking at connected time-scale points? No, they're always useful, because they put similar values at a similar point on the x-axis. But if my x-axis was discretely different things... Didn't I do parallel coordinate plots of the LPS data? Anyway, you could imagine that you have two values that are high, which are for the macrophages, then two that are low for the monocytes, then one that is low first and then high. So it depends; it doesn't need to be a time course. We have a model here of a sine wave, and we recognize that intuitively, but it could be anything, even just categorical data. The parallel coordinate plot is one of the few ways we can actually visualize multi-dimensional data. Yes, that's perhaps what was confusing me: this is a line plot. Yeah, very cool. But in this case, since the points actually are connected, a line is correct. Now, you're making an excellent point. I often see line plots of categorical data, and that is just completely misleading. It's nonsensical to draw a line between blood pressure and age: these are discrete, distinct, incommensurate dimensions, and there shouldn't be a line between them. You use lines if you have values between commensurate dimensions, things that change in the same kind of dimension. So when you see your students doing that, scold them; they're abusing the tools. Okay. The colors: we often don't like the default colors, so we go and get some nice colors instead, with the RColorBrewer package; I've briefly mentioned that and showed you the website. These are the color spectra which are defined there, so we can pick something. Nice colors from the brewer palette "Paired", this one down here. Yeah, so Paired; it's one of the qualitative scales. We could also use one of the Pastel palettes. I actually also want to change this: I want a matplot of the classes, not the earlier ones, but the ones that I've computed with my correlation distance. Type is "l", col is niceColors. Right. And I'm just plotting the first nine here. Okay, so indeed this seems to be correct. Mostly. Maybe it's worthwhile to have a little discussion about cluster four. Is that what you would expect from the clustering? What's going on here? That's horrible. They're very different, right? Why are they so different?
Why did our clustering algorithm think they're similar, when in fact they're very different? The absolute values are similar. Different signs. Wait: are the patterns similar? Yeah, the pattern is exactly the same, it's just flipped. One set is correlated, one set is anti-correlated, and we used the absolute value of the correlation. That makes the correlated and the anti-correlated profiles come out at the same distance. For our biological interpretation, though, I would now say: that doesn't make sense; I need to change my distance function here. Euclidean would have treated these as very different, but Euclidean wasn't so nice to begin with. So how do we change our correlation distance function? Let's see if everybody has been paying attention. Where are we? I've been taking the absolute value of the correlation here. So do I just remove the absolute value? What's the possible range of correlation values? Minus 1 to plus 1, right? So I have to be careful not to end up with negative distances, which is a concept that's not recognized in mathematics as far as I'm aware: distances, by definition, are always positive. How do I make sure they remain positive if I don't want to use absolute values? One minus the correlation goes from 0 to 2, so that works: the distance stays positive. Or, if I'm interested in correlation but not in its sign, 1 minus the absolute value of the correlation, as before. Does it still look good? Yeah, it still looks good. My histones, if they are histones, fall over here. Now going back: clusters from the dendrogram. Oh, I didn't define my classes yet, right? No, apparently nothing changed, and that's wrong. So what's going on here? Oh, I redefined the function, but I didn't actually recalculate. So, here: the hierarchical clustering uses dCor, which is now the function with the signed, not the absolute, correlation values; classes, cutree of that; and I'm plotting the first nine, basically randomly picking nine to plot. Looks like that. Okay, so now we have none of these mixed things: this one is separate and distinct from that one. And overall this looks quite nice, quite meaningful. The clustering is able to segregate our data into types of profiles that would actually be supported by our biological interpretation. All right. So can we do better? There's a problem with calculating clusters like that, namely that we cut the tree all at the same level, and that may not be appropriate: some of our clusters turn out to be very small and some turn out to be very large. Wouldn't it be better to adjust our cut levels dynamically, based on how the clusters actually turn out? There's a function that does exactly that. You can load it from the package dynamicTreeCut, and again I'm using the dendrogram that I've already calculated: the dynamic tree cut takes a pre-calculated dendrogram as input, and it needs a distance matrix. Of course it needs the correct distance matrix. Oh my God, I shouldn't be changing these things on the fly; let's make sure this is correct. So here in my... that's actually a nice example. Here in my code I'm trying to figure out where in the world I defined dist. I can double-click it, and when I double-click it I see that I'm using it here, using it here, and using it here.
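Putting the pieces from this stretch together: a sketch of the RColorBrewer palette, the sign-aware correlation distance, and the recomputed clusters. Object names (niceColors, expr, k = 10) are my assumptions for illustration.

```r
library(RColorBrewer)
niceColors <- brewer.pal(9, "Paired")        # a qualitative ColorBrewer palette

# Signed correlation distance: anti-correlated profiles now get distances
# near 2 instead of sharing a distance near 0 with the correlated ones
dCor  <- function(x) as.dist(1 - cor(t(x)))  # range [0, 2], never negative

hc    <- hclust(dCor(expr))   # redefining dCor alone changes nothing: recompute
class <- cutree(hc, k = 10)

oPar <- par(mfrow = c(3, 3))  # plot the first nine clusters, one panel each
for (i in 1:9) {
  matplot(t(expr[class == i, , drop = FALSE]), type = "l", lty = 1,
          col = niceColors[i], main = paste("cluster", i))
}
par(oPar)
```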
So this is now our distance matrix for the Euclidean distance, but we said we actually prefer the one-minus-correlation distance, which looked nicer, so I'd need to change that and calculate the distance matrix with the one-minus-correlation metric. Okay, I'm going to bail out on the one-minus-correlation here, and the reason is the following: I need a distance matrix object, and by default this doesn't accept a function; it requires a named method, and the named methods are Euclidean, maximum, Manhattan, Canberra, and so on. Now, in some older CBW code from two or three years ago I've actually done this, and what you need to do is calculate the distance matrix by hand, make sure the rows and the columns are correct (I think there needs to be a transposition somewhere in there), and then cast it with as.dist(). I need to play around with that to make sure it's right; I'll put it into the code, but not right now, otherwise we're not going to get to the other things I'd like to get to. So I'll just bail on that for the moment and use the Euclidean distance for the dynamic tree cut: setting the cut levels, using this dendrogram and distance matrix, and classifying with that... but it doesn't work like that. Okay, I need to run more code. I'm sorry, I really shouldn't be doing this on the fly. I broke it. I'm sorry. Anyway, the bottom line: hierarchical clustering depends on the distance matrix and on how you apply the distances. There are many different distance metrics, and some of them are more useful than others; one minus the correlation is actually a pretty good distance metric. And the good thing about hierarchical clustering is that you can look at your data at different resolutions. If you already know how many clusters you have in your data, that may not be so desirable. So there's an alternative called partitioning clustering: partitioning clustering clusters data based on you defining how many clusters you want in the first place, defining the number of clusters. This is k-means clustering, or k-medoids clustering. If we look at our transformed data here and color it according to k-means clustering, we get these things here. So how does k-means work? Well, initially you take the point cloud in space and randomly throw in four sentinel points, because we want four clusters. Then, for every data point, you find the closest sentinel point: if a point is closest to my first sentinel, I attach it to the first one; if another point is actually closer to the second sentinel point, I attach it to the second one. So I randomly break up my data set into regions; if you've ever seen a Voronoi tessellation, that's essentially the same idea. I have four points in there, and I associate each data point with the one of the four it's closest to. Then I take the data points for the first sentinel, calculate their mean, and move the sentinel point to that mean, and I do the same for the second and the third and the fourth. So after I've associated the data with the sentinel points, I move each sentinel point into the center of gravity of all the data points associated with it. And then I update the calculation: I recalculate which points are closest to each sentinel point, I update the centers of gravity again, and so on and so on, and I iterate until it converges. That's a reasonably fast algorithm, and it usually converges. It depends a little bit on where you start from.
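For reference, a sketch of the intended calls, following the dynamicTreeCut documentation (cutreeDynamic() wants the dendrogram plus the distance matrix as a plain matrix), and the k-means call for the partitioning approach just described; I'm using the Euclidean fallback, as in the lecture:

```r
library(dynamicTreeCut)

d     <- dist(expr)                          # Euclidean fallback, as discussed
hc    <- hclust(d)
class <- cutreeDynamic(dendro = hc,          # pre-calculated dendrogram ...
                       distM = as.matrix(d)) # ... plus the matching distance matrix

# Partitioning instead: k-means with k = 4
km <- kmeans(expr, centers = 4, nstart = 25) # nstart > 1 guards against local minima
km$cluster                                   # cluster index per gene
```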
It can get stuck in local minima if the data is not appropriately structured, so it's worthwhile to try it with a number of different random starts and see whether you always get the same results. But other than that it's a well regarded and highly robust algorithm. You answered my question about the local minima. Yes, it can get stuck in local minima; it's an iterative process. Related to that is the idea of k-medoids. K-medoids does not place sentinel points somewhere in space but uses actual data points, just as the mean and the median are related: one is a mathematical abstraction, the other an actual value. The k-medoids algorithm places its sentinel points only on representative data points, which means that if your space has, say, low-energy states and high-energy states, your sentinel points cannot end up in one of the high-energy states, because you have no data observations there. Other than that it's very similar. So we don't get a guaranteed globally optimal solution, it's a locally converged one, but the algorithm is actually quite fast. And then there are a number of empirical algorithms that work well in special situations. One that was published relatively recently, by a colleague here in the computer science department, Brendan Frey, is affinity propagation. That embodies the idea that things that are close together in space should arrive in the same clusters, and how do we find out that things are close together? Well, we tell them to talk to each other. We pass messages among data points, and when we find some that pass messages very frequently, we move them closer together; then they pass even more messages, and they start forming loose associations, then tight-knit circles, then families, and finally we know they should all belong in the same cluster. This is apcluster, affinity propagation clustering. How did I call my data set? So that clustering, again, is able to group our genes with each other, based on pairwise relationships, in a way where we see a very clear block structure, so in some sense meaningful clusters. These also produce dendrograms, and we can use our usual cutree() on that. And I will go home and write 100 times: I shall not change code on the fly in a lecture. Okay, so these are clusters then. All right. So I've demonstrated a number of clustering algorithms. Basically, they're all kind of similar, and they're all really easy to use; hierarchical clustering is one of the standard approaches. Now the exercise: I'd like you to cluster the PCA results, without dimension one, of the crabs data. You can either use this apRes, the apcluster algorithm, or just use hierarchical clustering. Do that on the PCA results without dimension one; you can do it with the scaled results or the original results, and cluster the crabs data into four clusters: orange and blue, males and females. Then we're interested in whether it worked, because remember, when we plotted along these dimensions we saw a clustered structure in our data, and we said it would be nice to know what these points are. Can we actually find an algorithm that pulls them apart and tells us: this is group one, this is group two, this is group three? That's exactly what the clustering algorithm does. It basically looks at our two-dimensional plot and tells us which cluster should be assigned to which point, based on that data.
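Sketches of the two partitioning variants mentioned above: pam() from the cluster package for k-medoids, and apcluster usage as in the package vignette; the object names are mine.

```r
# K-medoids: the "sentinel" points are actual data points (medoids)
library(cluster)
pm <- pam(expr, k = 4)
pm$clustering                  # cluster index per gene

# Affinity propagation: negative squared distances as the similarity
library(apcluster)
apRes <- apcluster(negDistMat(r = 2), expr)
length(apRes@clusters)         # number of clusters found by message passing
```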
So if your exploratory data analysis shows that there's structure in the data, you can then use cluster analysis to pull out that structure, subset the data, and apply different kinds of treatments or interpretations to the different subsets. That's what I'd like you to do. In order to show that this works, we'd like to see that it works on the crabs data, i.e. that the clusters we pull out correspond to the labels that we know our data should correspond to. Because neither the clustering nor the PCA knows anything about the labels; that is our secret knowledge, which we apply to figure out whether the clustering was correct. Okay, how do we do that? What do we need to do? Step one? Subset. No. Table. No. Unique. Take a nap. I said that before. I like your thinking: today at one minute past five, everybody's allowed to just sleep for half an hour at the table, just like kindergarten. Yeah, I miss that too. Kindergarten is where it's at. Step one: a numeric matrix? Why would you make a numeric matrix? What are you trying to do? So what, again, is the task? The PCA. So we take the PCA data, and do you want to subset out PC1? Why? Because of what we discussed before: it doesn't separate the groups we care about. You do have to think a little bit about it; we can try it either way, just change one number. So, paraphrasing: create the input data for clustering. As you said, in principle we'd be pulling out dimension one from the results of our PCA analysis. Where are these? The input data on which you've already done the PCA and assigned to pcaS, right? Yeah. And where are the actual values that we cluster on? Well, pcaS$x. Remember, the rotations are the coordinates of the eigenvectors; pcaS$x holds the transformed data points, and pcaS$x is what we've been plotting. And what are we going to be clustering with? All the rows. At which columns? Two to five? Maybe. Two to four? Maybe. Two to three? Maybe. We don't know whether the noise we have in the higher-order principal components improves the clustering or makes it worse. We don't know that, but we can experiment: something like two to three, or two to four, or two to five; we can easily swap that out and experiment. So: creating the input data. This is the input data. The next step with our input data is the calculation for hierarchical clustering, right? To do hierarchical clustering, we calculate the distance matrix, cluster, and basically just print the results. How do we check whether this was successful? Well, initially we can just eyeball it by printing the results, because, if you remember, the first 50 rows are one of the categories, the second 50 are another category, and so on. So if you print out the results of cutree(), you should see whether one of the four classes has elements predominantly from 1 to 50, another class has elements predominantly from 51 to 100, and so on, or whether they're scattered all over the place. That's relatively straightforward. If we use apcluster instead, all we need to do is pass the default similarity, negDistMat(r = 2), and put in the data we want to cluster, which in this case is pcaS$x with the appropriate subset. So that, I hope, should be relatively easy.
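A sketch of one possible solution for the exercise, assuming the crabs data from MASS and a prcomp() result named pcaS, as in the discussion above:

```r
library(MASS)   # provides the crabs data set

pcaS   <- prcomp(crabs[, 4:8], scale. = TRUE)  # PCA on the five morphological measurements
inData <- pcaS$x[, 2:3]        # drop PC1; try 2:4 or 2:5 as well and compare

hc    <- hclust(dist(inData))
class <- cutree(hc, k = 4)

# rows: known species/sex labels; columns: assigned cluster
table(paste(crabs$sp, crabs$sex), class)

# or with affinity propagation instead:
# apRes <- apcluster(negDistMat(r = 2), inData)
```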