OK, clustering. What I'm going to be talking about is principles of clustering, especially hierarchical clustering, with a little bit about k-means and model-based clustering as alternatives. I'm also going to talk about density estimation as an alternative to clustering, depending on the question you want to answer. What I'd like you to take away from this lecture is that you understand some of the foundations of clustering, the concepts of similarity and the metricity of the data, and that there is no such thing as the best method for clustering; it is a problem of trade-offs that you have to solve. You'll also learn a little about the functions that are available to cluster data in R. And I'd really like you to become at least somewhat competent in constructing your own data sets. I'm not a fan of downloading data sets and operating on them, because then the data really is just a black box. What I'm a huge fan of is simulating your data under known parameters and seeing if you can retrieve those known parameters with the model that you're applying. Because if you can't, how do you expect to retrieve meaningful parameters from data that you've downloaded elsewhere? I think this is really important: becoming competent in using R to play, to construct synthetic data, to try things out in various ways, and to easily change the variance of your models, or the number of data points, or other parameters, and see what that does. OK, here's the fundamental type of problem that we're going to be looking at. What's this? A scatterplot of two variables: one is plotted along the x-axis and one along the y-axis. Every one of these data points represents two numbers, the value on the x-axis and the value on the y-axis. And the typical question of cluster analysis is to ask: are there data points here that are different from other data points in a systematic way? For instance, are there data points that were drawn from a very different distribution than the others, or that have their values because of some different underlying property? Are we actually looking at a homogeneous data set, or are the samples drawn from two significantly different populations? Can we identify the populations? Can we learn something about the populations once we identify them? This problem comes up in a large number of interesting questions. The kind of work we do with clustering is to ask: can we find groups in data like that, can we partition all the points into separate classes, and can we perhaps also assign new data to one of these classes? And thus, can we make inferences about whether a data point is a member or a non-member of a particular cluster? This comes up in hugely disparate domains. One example is finding complexes in interaction data. So what is interaction data? Protein-protein interactions — physical interactions between proteins — are one type of interaction data. OK, physical interactions would be one type. But how would interaction data actually look?
You're quite likely to come across interaction data, especially when you try to do functional annotation. Correlation? Co-regulation? Well, that's a different type of interaction data. Networks, graphs — yes. Usually there's an Excel spreadsheet with two columns, called A and B. Exactly: there are two columns, but the significance is in the rows as well as the columns. If you have a gene or protein in column one and another gene or protein in column two of the same row, that's how you encode that they actually do interact in some way your experiment has suggested. So you can draw a graph where you take all of the interacting genes and connect them by lines. And then you can ask: what does this tell me? Now, if we listen to the funding agencies and to the people who write grant proposals to funding agencies, this will tell you how life works. After all, we have a comprehensive list of genes, and now we're trying to figure out what they do. What they do is interact with something and do something. Once we've established all the interactions, we've pieced together the puzzle and can see how everything works together; we can reconstruct life from first principles from our Excel spreadsheet. The problem is there's a lot of noise, and a lot of interactions are also missed. If you do interaction analysis with different methods, such as two-hybrid methods, mass spectrometry methods, or affinity tag methods, you get different genes in the rows of your data sets, and they don't always coincide. But one of the typical questions we're interested in for physical interactions is: are there complexes? Do proteins that interact stay together, and do their elementary functions add up to something that is more than simply the sum of their parts? There's a hypothesis that says: if a number of genes are in a complex, I should be able to observe interactions between all of them. So if A, B, C, and D are in a complex, A should interact with B, C, and D; B should interact with C and D; and C should interact with D. All of them should be connected. And the way we can pull this out of the raw interaction network is to cluster graphs: are there sets of nodes that somehow belong together because they are densely connected, much more connected among themselves than they are to things outside? Which, by the way, is one of the crucial notions of cluster analysis: things within a cluster make more interactions with each other than with things outside the cluster, whatever those interactions are. Different question: does a protein structure have domains? What is a protein domain? If we look at a protein structure, essentially it's a cloud of three-dimensional vectors. In terms of statistics, it's data: you have rows and three columns, where the rows correspond to the atoms and the columns correspond to the x, y, and z coordinates. And you can analyze that statistically. For instance, you can look at whether some of these points lie very close together in a systematic fashion: there might be one cloud of points that is very close together and another cloud that is more distant.
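As a minimal sketch of the two-column idea just described — not part of the lecture code — here is one way such interaction data could be turned into a graph and clustered into densely connected groups, assuming the igraph package; the gene names and edges are invented for illustration.

```r
# Two-column interaction data -> graph -> densely connected groups
# (a hedged sketch; names and edges are made up).
library(igraph)

interactions <- data.frame(
  A = c("geneA", "geneA", "geneB", "geneC", "geneD", "geneE"),
  B = c("geneB", "geneC", "geneC", "geneD", "geneE", "geneF")
)

g <- graph_from_data_frame(interactions, directed = FALSE)

# Community detection: groups of nodes more connected among themselves
# than to the rest of the network -- candidate complexes.
communities <- cluster_fast_greedy(g)
membership(communities)
plot(communities, g)
```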
So by looking at interactions between these, with methods that usually bring in a bit more background knowledge about what a protein is and which types of amino acids interact with each other, you can start defining whether the protein can be usefully subdivided into domains. Or you can start clustering proteins of similar function based on measured similar properties, for instance co-regulation. What is the hypothesis behind co-regulation? Why are we looking at it? Think about a metabolic pathway: you need two or three genes at the same time to conduct a specific process, so they need to be co-regulated. Exactly. If gene activity has to be modulated in response to some challenge, or to the cell cycle, or whatever, and gene activity is regulated by switching expression on and off, then if a number of genes have to cooperate at the same time to perform a particular task, their expression should be co-regulated, in the sense that they should all be expressed at the same time and then decay again afterwards with some decay constant. That's a hypothesis. It's sometimes true and sometimes not true. It's a hypothesis to guide us along in our discovery science, our fishing expeditions: if you find genes that seem to be co-regulated, the idea is that they perform a similar function, that they're all part of a functional system within the cell. And if among that group of co-regulated genes there is one whose function is unknown, the function of the other genes might be a good working hypothesis for what that unknown gene is actually doing. Now, in order to apply any kind of method to any kind of experiment reasonably, and that of course includes applying statistical methods to data, you have to be comfortable thinking about your data, understanding what your data is and what it does. And this is my little teamwork task for this morning. I'd like you to turn to your neighbor, or if there are three people on a bench, stick your heads together in a small group of three for two or three minutes. Among you, come up with one example where you could expect to usefully apply cluster analysis. That would be an example where data points of some type are all measured with a similar method and end up in the same data set, but the underlying measurements are taken from different distributions, or there is something within the data that pulls them apart. Co-regulation is one example that I've mentioned. And when you do that, think about, in a typical setting: how many samples would you be looking at in your cluster analysis? Would it be 100 or 100,000? How many dimensions does your measurement provide? Is it just something that you cluster along an x- and y-axis, or does every data point actually correspond to 200 or 300 discrete measurements, as in an expression profile? Would they all be commensurate, or would there be categorical variables involved, like male/female, diseased/free of disease, chemical concentration, age of patient, and so on? What are the elements? What are the properties that you are trying to compare? What could the variation in the dimensions look like? If we're looking at age, the variation could be from 0 to 100. If we're looking at gender, the variation is typically only among three values: male, female, and undisclosed. And then, what is the metric of similarity that you would apply?
And that's really interesting, because clustering ultimately operates on measurements between data points: the definition of whether two points are similar or dissimilar. And you need to apply a metric. In the example I showed you before, basically the only metric we have, since these are x- and y-values, is to measure the distances in the plot with the Euclidean, or geometric, metric. But of course you can think creatively about metrics to apply. Time along the cell cycle would be a metric between data points. And ultimately, what's the question? That is actually the most important point here. What is the question that you would like to ask? What is the information that you would like to get out? In most cases the question looks something like: my data points are drawn from two different classes; can I identify properties of those classes from my sample? Now, I'd like you to think about that as an exercise here, and we'll go through maybe two or three results, because that's what you really have to do at home. If you want to analyze your own data, come to grips with it, and truly understand what the data is that you're looking at, you have to go to your data sets and answer these kinds of questions before you even fire up your computer. The computer is a tool that you can apply best when you know what you're applying it to. So stick your heads together, two or three people, two or three minutes. If you're completely stuck and don't know what you're supposed to do, raise your hand and ask. It's actually useful to make notes and record what you've thought of and come up with. OK, our quick thinking exercise. Did anything interesting come up? Any volunteers? Go ahead. Do you want my stick? So: the elements are the people in this class. How many dimensions? A lot; we'd have to count. What are the dimensions? For example, whether we know each other: one if we do, zero if we don't. Whether we're from the same city or the same university. Age. Whether we're postdocs, PhD students, PIs, and so on. Institutional hierarchies, yes — good for grant applications. And what are their properties? Some are binary, one or zero; others, like age, are numeric with some range of variation. But are there properties you haven't encoded in your data that you're interested in? Yes: for example, after we have clustered, are the people with blue eyes inside the cluster and the people with dark eyes outside of it? And the metric of similarity? For age, people in the same age group could count as closer to each other; for city, it could be some distance for that dimension. I think in that description you can see the problem. You have data that is categorical, data that is numerical; it can be discrete, it can be continuous, and so on. And ultimately you're asking about one distance between two points, so it all has to come together. Here's the first problem: somehow this has to be weighted, and it's not obvious that the raw data, as we get it, will allow such weighting.
And it's not obvious because, if you're comparing a variable with a very large range to a variable with a very small range, the small variable is not going to contribute to the distance if you treat them all the same (one way around this is sketched below). So: the elements are the people in this class, and the dimensions are the different properties of each person along which you want to establish their similarity or dissimilarity. And the question you would like to ask might be whether blue-eyed people from the same city are more likely to form a tight group that actually talks to each other than dark-eyed people. That's one question we could ask. However, very often the questions we want to ask are not actually encoded anywhere in the data. Whether somebody is blue-eyed or dark-eyed, what their properties are, and whether these properties share a certain amount of covariance can often be addressed nicely with other methods; whether being blue-eyed is a good predictor for other variables is more like a regression question. Often, with clustering, you would like to ask something that is not in the data, that you can't measure directly, but that you would like to infer from the properties of the data. Anything else? Any volunteers? No? A prize, or a couple of beers in the morning, would probably get the juices flowing. OK, so as I've alluded to, clustering is the classification of similar objects into groups. It's the problem of partitioning a data set into subsets, called clusters, so that the data in each subset are close to one another, where closeness is measured through some metric. I think that intuitively makes a lot of sense. You have to be aware of what partitioning means, though. What does partitioning mean? You divide the data set, and there's an important point here: all the data points get assigned. You completely divide the data set. If you have two clusters, every data point is either in one cluster or the other. That's what a partitioning is. And I'm mentioning that specifically because it's not always the most reasonable way to look at your data. Some of the data in your data set might simply be noise, and there's no point in assigning noise to good clusters, to good data, and drawing inferences from that. We'll come back to that much later in this lecture. OK. One of the fundamental approaches here is hierarchical clustering. Given n items and a distance metric: first, assign each item to its own cluster and initialize the distance matrix between clusters as the distances between items. Then find the closest pair of clusters, merge them into a single cluster, and compute new distances between clusters. So initially each item represents one cluster; if your data set has 100 items, you start out with 100 clusters. Then you look for the closest ones, merge them according to some distance metric, and recompute the distances to the merged cluster. You then keep comparing the similarity within clusters and between clusters, and you repeat that iteratively until all items are finally merged into a single cluster. Whoa, we could have done that right away, couldn't we? We could have just said it's all one single cluster. So why is this useful, if we end up with everything in one single cluster? Because we get a whole tree as well. What tree? A tree of how things were combined together at each step.
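As a minimal sketch of one way to handle the mixed-variable weighting problem just raised — not part of the lecture code — Gower's coefficient, as implemented in daisy() from the cluster package, rescales each dimension so that a wide-ranging numeric variable does not drown out a binary one; the example data frame is invented.

```r
# Mixed categorical/numeric data -> one dissimilarity per pair of people
# (a hedged sketch; the data frame is made up for illustration).
library(cluster)

people <- data.frame(
  age       = c(24, 31, 58, 29),
  same_city = factor(c("yes", "no", "no", "yes")),
  role      = factor(c("PhD", "postdoc", "PI", "PhD"))
)

d <- daisy(people, metric = "gower")  # dissimilarity object usable by hclust()
as.matrix(d)
```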
How it was combined together at each step, right. You get a tree from that: the tree is a recording of the merging steps you've done. The dendrogram, or tree, of hierarchical clustering, which you'll see in a short moment, is a recording of the sequence of steps you undertook until you finally arrived at the single cluster. The question then, of course, is how we interpret this tree, in the sense of doing something useful with it. OK, so there's a catch, and it's in this sentence: given n items and a distance metric. The question is, what is a metric? Mathematically speaking, a metric has very definite properties. A metric is a function d defined on pairs of points x, y from some space X, for instance vectors of real numbers, and it has to fulfill three conditions. First: the distance between two points x and y is 0 if and only if x equals y. So if you have two observations and they are the same, they are indistinguishable: two observations of the same element give you zero distance. And if you have zero distance, this means the two elements are the same. If two elements are different, then under a distance metric they can't have zero distance. That's important. Think of a graph, a network of things. If you think of a subway network, you could not have one subway station coincide with another subway station, with no distance between them; that could not be embodied in any kind of physical reality. So there is no such thing as zero distance between distinct elements. Now, if your data is structured such that you sometimes get zero distance between distinct elements that you would otherwise like to distinguish, then clustering is not something you can apply to that data set, because it's not metric. Second, symmetry. That's a very important property. Mathematically, symmetry says the distance between points x and y is exactly the same as the distance between points y and x, which in many real-world situations isn't actually the case. The distance of me going to work in the morning always seems a lot longer than me coming back from work in the evening; that seems a lot shorter. Well, that depends on the distance measure. If the distance is in kilometers on the map, it's the same distance. If the distance is my impression of it, it's a different distance, and then I no longer have something I can cluster in this sense, because my distance is not symmetric: what the distance between two points ultimately is becomes dependent on the order in which I look at them. And we always show clustering in these simplified examples of point clouds on two-dimensional maps. That's one of the weakest things you can do; there are usually much more interesting metrics you can apply to your data, including all kinds of categorical variables, not just Euclidean distances. So if symmetry is violated, it's not a metric, and clustering applications will fail. Most importantly — and there's an equals sign missing in the PDF slides — the triangle inequality: the distance between points x and y must be less than or equal to the distance between x and z plus the distance between z and y. What I'm encoding mathematically here is that going directly from one point to another is the shortest possible route; if you take a detour via point z, it can't be any shorter.
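Restated compactly, the three conditions a distance function d has to fulfill to be a metric on a set X:

```latex
\begin{aligned}
&\text{(identity)}            && d(x, y) = 0 \iff x = y \\
&\text{(symmetry)}            && d(x, y) = d(y, x) \\
&\text{(triangle inequality)} && d(x, y) \le d(x, z) + d(z, y)
\qquad \text{for all } x, y, z \in X
\end{aligned}
```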
And in order to apply clustering methods to data sets, you really have to think carefully about whether this condition is actually fulfilled. Definitely not all biological measurements fulfill this criterion. There are instances of graphs that very easily violate it; it just depends on how the distances are coded, and it's possible that a detour gives a shorter distance. If I put points on a graph, I'm completely free to weight the distances between them. OK, so say we have three points here, which we have determined under some experimental conditions. There's a distance between these two points, which is 10; a distance between these two, which is 2; and a distance between these two, which is also 2. Just some experimental values that we've observed — and you can imagine that comparing two genes or two proteins, you could have some kind of measurement that gives this kind of result. Now, the problem is that this is no longer metric, because all of a sudden, if I go directly from here to here, my distance is 10, but if I take the detour via the third point, my distance is only 4. These kinds of detour possibilities will mess up all cluster analysis, because whether something is close to something else, and that one is close to something else again, is no longer informative about whether the first two points are close or distant. Of course, you can still run a clustering on data like that, but the cluster results are no longer interpretable: distant neighbors can seem randomly close or distant, no matter what their relationship within the cluster is. This is often glossed over, and most people simply use some kind of Euclidean metric. Oh — what's a Euclidean metric? We're using this term a lot, and since you're not asking, I assume everybody knows what it is? No? If I'm using terms that you're not familiar with and you don't ask, that's not good. The Euclidean metric, in Cartesian space, is the metric that assumes orthogonal coordinate systems for the different dimensions — coordinate systems where the coordinate axes are at right angles to each other. You have points in this space, and the distance between two points can simply be calculated by applying Pythagoras' theorem: the distance between A and B is the square root of the sum of the squared differences along each dimension. And you can generalize that to higher dimensions; that's the Euclidean metric. So anything that can be embedded in a 2D plot, or a 3D space, or a 4D hyperspace and so on, where the embedding defines the distance between the points, can be analyzed under this Euclidean metric. If that's the case, these spaces are metric, and you don't need to worry about it. But very often, especially when you include categorical data and different weightings of it, it's no longer obvious whether these properties are fulfilled. Yes? Think of an influence on biochemical reactions: if you put A and B together, you might get some acceleration or deceleration of a rate constant, and the same for another pair. But there's no real reason why the two effects should be in any way related.
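A minimal sketch to make the two points just discussed concrete — the 10/2/2 example and the Pythagorean definition of Euclidean distance; the vectors a and b are invented for illustration:

```r
# (1) The three experimental "distances" 10, 2, 2 violate the triangle
#     inequality, so they cannot come from a metric:
d_AB <- 10; d_AC <- 2; d_CB <- 2
d_AB <= d_AC + d_CB           # FALSE: the detour via the third point is shorter

# (2) Euclidean distance is Pythagoras' theorem, generalized to n dimensions;
#     it agrees with what R's dist() computes.
a <- c(1, 2, 3)
b <- c(4, 6, 3)
sqrt(sum((a - b)^2))          # 5
dist(rbind(a, b))             # same value via dist()
```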
Because they might go through different interactions; it depends on how they're encoded. So it is certainly possible that such measurements are not Euclidean. As I said, whenever you have Euclidean distances, whenever you can map something into a Cartesian, rectangular space, you're fine: it's going to be metric. But if you can't, and that's often where interesting discriminatory analysis comes in, then you have to make sure that you're actually looking at a metric. OK. Single linkage clustering. This is one of the linkage methods that can be applied in hierarchical clustering: the distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster. So for this cluster here and that cluster there, the distance is the distance between their two closest points. Under single linkage, if you have a third cluster, its distances to the other two would each again be measured between their closest members. It already takes a little thinking about whether that is still metric, or whether it's possible to violate the triangle inequality with this. In complete linkage, you do something similar, except the other way around: the distance between two clusters is the greatest distance from any member of one cluster to any member of the other. And in average linkage, you take the average of all pairwise distances. So these are different ways of defining the distance between clusters. Now for an example, I'm going to look at the cell cycle data set, and I think you've already loaded that; this is basically the only time that I'm using one of the example data sets. Are you all aware of what that data set is, what the columns are, what the rows are? You probably loaded it yesterday. If not: expression levels of about 6,000 genes during the cell cycle, 17 time points, and so on. You can either load it from where it was, or, as pointed out in the slides, you can do something very cool and very simple with R. I actually found that data set just by googling the file name; it was somewhere at the University of Washington. You can pass the URL that contains the data set directly to the function that reads the data, and R will automatically download it via the internet and make it available. So you don't need to store your data on your own computer; you can fetch it on the fly within an R script from the internet. The other commands that we're going to go through are all on this page here, which you'll probably want to have open, because you can then copy and paste into your R window. See, this is quick — less than a second — and the data is here. Just to check that it's really here: how can I get the first few elements from that data set? Let's look at the first row. What do I need to type? Square bracket, 1, comma — and then? Nothing? Why doesn't that look incomplete? Because if the column index is left empty, R substitutes everything it has for it, so you get all the columns of row 1. In the real data set, all of these genes are also labeled by name.
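A minimal sketch of loading a data set directly from a URL, as described above; the URL, object name, and column layout here are placeholders, not the actual file used in the lecture.

```r
# Hedged sketch: the URL below is hypothetical, not the lecture's actual file.
url <- "http://example.edu/path/to/cellcycle_data.txt"
cellcycle <- read.table(url, header = TRUE)

head(cellcycle)       # first few rows
cellcycle[1, ]        # row 1, all columns (empty column index = everything)
dim(cellcycle)        # number of rows (genes) and columns
```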
We've just selected the first 50 rows and only columns 3 to 19 from the entire data set, to make it a little smaller, because the data set itself of course has about 6,000 genes. OK. The next thing is we calculate distances, using the Euclidean method. I'm using Michelle's computer here, so there's nothing pre-canned in this; it should work in exactly the same way for you. So what do we have here? The keyword is dist. What is dist in R? It's the name of a function. What does the function do? Let's see what dist does: we go to the R window and ask for help on it. I'll take you through a number of these R help screens; use it as an opportunity to make sure you're comfortable with the way things are described there. You'll be needing this at home a lot, and there's a certain template for how these pages are written and how to understand them. OK: this function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix. The rows of the data matrix — because the data is organized, and has to be organized here, so that each row represents one element. Then the usage section shows the function call. What's x? x is the data matrix that you're applying it to. Method: euclidean. So it simply takes all of the values for one row as if they were coordinates in an n-dimensional space and calculates the Euclidean distances between these n-dimensional points. If your data matrix had just two columns, it would be like the example plot I showed you, and you would be measuring the distances in that scatterplot. Then the help page describes the arguments at greater length; most of them have defaults. And there are different kinds of distances you can apply: euclidean, maximum, manhattan, canberra, binary, minkowski. Euclidean will be fine here. As in most cases, there's an example at the bottom of the help page that you can run on data and then play around with. OK, so here all we need to say is: calculate me a distance matrix from the data with the Euclidean method. What does that look like? This is the entire distance matrix — oh, the projector is still off. Good point. The distance values are encoded here. Now, once we have that, we can go on and cluster it using the function hclust, for hierarchical clustering. The input is the distance matrix, which defines all the distances between all the elements according to the Euclidean metric over the 17-dimensional values that we've supplied. Yes — the distance is calculated between every pair of rows. Exactly. Because once we have all the distances, we can look at which points fall closer to each other, i.e. which points are similar. There is no other way to evaluate whether two points are similar than to compare them, and dist does that comparison and stores the results for the hierarchical clustering. That's how it's calculated in this case; but generically, it's a measure of how similar two points in your data set are. And after we've clustered them, we can then of course look at the points and see what that similarity or difference means in practice. OK. So with this quick clustering command — again, hclust is the clustering function.
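A sketch of the distance step just described; the object names (sdat, d) are placeholders and carry on from the loading sketch above — the lecture's own script may use different names.

```r
sdat <- as.matrix(cellcycle[1:50, 3:19])   # first 50 genes, the 17 time points

?dist                                      # the help page walked through above
d <- dist(sdat, method = "euclidean")      # pairwise distances between rows

as.matrix(d)[1:5, 1:5]                     # peek at a corner of the distance matrix
```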
We hierarchically cluster the distances; the method is "single", and the initial "members" argument is NULL by default. All the details are in the help screen, which is set up in the same way. And we can plot the hierarchically clustered tree by entering plot(hc.single), and we get that magical diagram, the cluster dendrogram, which we've alluded to. Now maybe it makes a little more sense what we've been talking about: the leaves of the dendrogram correspond to the individual genes, and they're just encoded by their index. The clustering algorithm placed the closest ones together first, then recalculated distances and merged the next closest. This one happened to be close to this mini cluster here and formed a cluster of three elements, which then happened to be close to this mini cluster of two elements, so they were grouped together, and so on. Things keep percolating up from here; these two clusters can be joined, and so on and so on, until everything is clustered together into one large super cluster which contains everything. So how do we actually get clusters out of this? As an aside, I have to mention one thing. A dendrogram can be a tricky thing, because you can look at a dendrogram and be completely fooled about the relationships that the tree presents. That's because the distances you're interested in are only the distances on the y-axis; the positions along the x-axis are to a large degree arbitrary. Dendrograms are completely identical if you rotate them around any of their branches. So if I take this branch here and rotate this entire cluster around it, I find that gene 13 and gene 15 end up right next to each other on the page; if I don't do that, gene 21 lies next to it instead. But they are identical dendrograms. This basically tells you that the horizontal distance between leaves doesn't mean anything at all. The distance is embodied in the path along the dendrogram that you need to take between two leaves, and not where they end up on the paper once they're plotted out. Only the path connecting two leaves says something about the similarity represented in the dendrogram, not how close the leaves happen to lie on the page. So if we use this kind of cluster analysis to then arrange things in a heat map, in a gene expression plot, there is a second algorithm behind that which orders the leaves in the best possible way so that similar things lie close together, and then you get those nice red and green bands running through your microarray data. But that's not in the initial hierarchical clustering; that's an additional sorting and ordering step. Is the measure of the path just the vertical segments, not the horizontal ones? Right: the vertical segments on the path between two leaves are the measure of the distance. OK. Now, how do we get our clusters from this? We've completed the hierarchical clustering, and we see a dendrogram. Where are the clusters?
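A sketch of the single-linkage clustering and dendrogram steps above; the object names are placeholders consistent with the earlier sketches.

```r
hc.single <- hclust(d, method = "single")   # single linkage on the distance matrix
plot(hc.single)                             # the cluster dendrogram
```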
So if your PI, or you yourself, or your spouse asks you: how many clusters are in your data anyway — what would you say? Yes, or your granting agency. Exactly. Well, it really depends: how many do you want to pull out? The way we pull out clusters here is with the command rect.hclust, which draws nice little red boxes, and you can specify the parameter k, the number of clusters that you would like it to return. If k equals 2, the algorithm goes through the tree from top to bottom and draws a line through it, stopping when the line dissects the dendrogram in a way that gives you exactly two clusters. In this case you cluster everything into one group, and there seems to be a single outlier, gene 3; that's single linkage clustering with k equals 2. If k equals 3, the line is drawn a little lower, so you cut here and here, having already cut up there; you get the cluster that contains gene 3, a small cluster of a few genes, and all the rest in one big cluster. And if you want more, you can ask for k equals 4. Yes — the merges that happen at a larger distance are the ones that are easiest to see; we'll have a look at that in a moment, comparing this with different linkage methods to show what the differences in the clusters are. So again: k equals 4, k equals 5, k equals 25, whatever you want. You can cluster all the way down until k equals 50 and you get your 50 individual points back. And that of course raises the question: isn't this all completely arbitrary? What does it actually tell us that we have a cluster here, if we can have one large cluster that contains everything, or small clusters that contain practically nothing, and everything in between? Well, for that we have to look at the properties of the cluster members. And in order to do that, we need a single R command to classify the data: cutree. So cutree of our data object from the hierarchical clustering, at k equals 4, classifies our data according to the single linkage clustering, and we store that classification. Then, to plot all four parts of this classification, we set a graphics parameter that gives us two rows and two columns of little plots, and we use matplot on the transpose of the data, selecting the rows where the single linkage class is 1, or 2, or 3, or 4. The x-axis label is time, the y-axis label is log expression value, and we make that plot four times, once per class. Try that — you should all be able to reproduce it, just to make sure that you can do this at home. OK, it looks good. So this is the k equals 4 clustering. We wanted to look at this because we were asking ourselves whether it is in some way completely arbitrary, and I hope I can convince you that in some sense it isn't, because these lines actually look quite similar. There's not a lot happening here: the expression levels can be high or low, basically just moving along with a little bit of noise — but there's a peak around time point 10, then a drop, and then it slowly rises again to the level it had at the beginning. And that's basically the same behavior for all of the genes in this panel. The genes in this other panel have a tendency to drop in expression. And this is our gene 3, which is an outlier that doesn't really fit anywhere else, which might mean it's noise — or it might mean it is the single master regulator that controls everything else.
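A sketch of drawing the red boxes described above: rect.hclust() cuts the dendrogram so that exactly k clusters result and outlines them on the current plot.

```r
plot(hc.single)
rect.hclust(hc.single, k = 2)   # one big cluster plus the single outlier
plot(hc.single)
rect.hclust(hc.single, k = 4)   # the four-cluster cut discussed below
```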
So if it's noise, we can throw it out; and if it's the single master regulator that controls everything else, it's the most interesting gene in the whole data set. Just because a cluster is small certainly doesn't mean it's not interesting. A question: we didn't get how this information was extracted — from the clustering, or from the matrix? OK, that's important. How did we get these four panels? This here is the key command: cutree. It operates on hc.single, which is the data structure that the hierarchical clustering generated. So: we do the hierarchical clustering on the distance matrix, which we got from the data, and we get this data structure. cutree operates on that, with a parameter that tells it how many clusters I actually want to have as a result, and it puts the resulting classification into class.single. And then I extract rows from the data with a conditional statement: class.single for that row should be 1 in the first panel, 2 in the second, 3 in the third, and 4 in the fourth. That's the flow of information, and it's important. If you get lost in the R code, in why we do particular things here, it's important for you to ask; it's not going to get more obvious when you go home and just stare at the code — it only takes more time. OK. So again, looking at the tree and at the clusters we got from it: this is cluster number one, the one with the greatest number of mutually similar genes. This is cluster number two, which contains just four genes with falling expression, and so on. What you see is that the members of each cluster are very similar to each other, even though they are still linked to all the other genes higher up in the tree. Even if two clusters look more or less similar overall, the members of one cluster are much more similar to each other than they are to anything in the other cluster. That is what the clustering algorithm picks out and calls something from a different category, a different class, that we can then assign. OK. As an alternative, we can use complete linkage clustering. We do the same thing, hclust on the data, with method = "complete" this time. This means we take the furthest members of each pair of clusters as the cluster distance, and we put the result into its own data structure. A question: so you pick four clusters based on doing several analyses of these types of plots, and you pick the number of clusters that looks most sensible to you? That's roughly what you do in practice — unless you have prior information that says: I'm clustering data from seven different laboratory animals and I think they're all different, so let's try a clustering with seven clusters and see whether the data from the seven animals actually partitions into those clusters, or something like that. But if you're just analyzing the data as is, looking for underlying structure without an idea of how many clusters there are in the first place, you can play around. There is, however, a criterion, and we'll get to it later when we talk about model-based clustering: a criterion that you can apply quantitatively to ask what the best choice is.
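A sketch of the cutree/matplot flow just described, plus the complete-linkage alternative; object and label names are placeholders carried over from the earlier sketches, and details may differ from the lecture script.

```r
class.single <- cutree(hc.single, k = 4)        # cluster label (1-4) per gene

par(mfrow = c(2, 2))                            # 2 x 2 grid of panels
for (cl in 1:4) {
  matplot(t(sdat[class.single == cl, , drop = FALSE]),   # one line per gene in cluster cl
          type = "l", xlab = "time", ylab = "log expression value")
}

# The complete-linkage alternative mentioned above:
hc.complete <- hclust(d, method = "complete")
```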
It's basically a trade-off: you'd like the clusters to be as large and as well defined as possible, but you'd also like to have as few of them as possible in the data set. Simply applying Occam's razor. Occam's razor — everybody familiar? Never heard of it? No, it was not brought to you by Gillette. Occam's razor is a philosophical principle put forward by William of Occam, back in the 14th century. In Latin it says: entia non sunt multiplicanda praeter necessitatem, which you can translate, of course. It means you should not multiply entities beyond necessity: you should not postulate things if there's no good reason to postulate them. You should not have more clusters in your data than your data actually forces you to admit there are. The most economical explanation for something is the preferred one. That wasn't at all clear at the time; it was a very new and very revolutionary thought, and it's essentially what all of modern science is based on. It's called a razor because it cuts the embellishments off theories when the embellishments are not necessary. A lot of modern statistics is actually based on related ideas, like maximum likelihood, where one of the principles is to bring in as little extra information as possible when you estimate things. It was a very revolutionary thought at the time, and, probably like everything revolutionary at that time, it was considered heretical. Don't posit angels if there's nothing angelic to be observed. So: there are ways to come up with as few clusters as possible and make them as distinct as possible, and we'll look at that in a moment. OK, back to this here. Why do I need the keyword "complete" here — and I also have it here, and in the parameters? The answer is: don't think about this too deeply. That's just the way the code was written in Peter Dalgaard's book, or Rafa's, I can't remember; they just wrote it that way. The object doesn't have to be called hc.complete; it could be called fish.and.chips or anything else, and we could call rect.hclust on fish.and.chips with k equals 4 if we had named it that. So when you read code, there's again a trade-off: sometimes a name, like the dendrogram, suggests relationships or meanings that are just spurious. It was called hc.complete simply to make clear that it contains the hierarchical clustering with complete linkage. But don't be put off: that's just the name of whatever data structure was generated. The only place the keyword is actually needed is as the method argument. Could it be named anything? Yes, just any name — we could call it x, or just hc. And in fact there's a trade-off here, because when you read your code later, you'd like to be able to understand what it does. This is just good coding practice: write your code so that it's readable, put in comments liberally, and don't do anything too clever, because the next person who reads it is not going to appreciate how elegant your programming skills are; they're going to be confused about what you were trying to achieve. Don't use variable names that are too sparse, and don't use variable names that suggest things that aren't there. I actually think this is a poor choice of variable name.
I think putting the dot in the name is a poor choice to begin with, because in some languages the dot actually marks the separation between a class name and its method, or something like that; there is a possible semantic meaning to dots, and it becomes confusing depending on whether that convention is in use or not. So how is this run different from the other one? It's complete versus single linkage. Exactly. It's the same procedure, except that the distance between clusters is calculated in a slightly different way — and from the resulting tree we again cut out clusters and get a different set of groupings. OK, so that's what it looks like: this is the dendrogram for complete linkage with k equals 4. Different linkage, same data, different clusters. These are the clusters; let's look at them. This set was the single linkage result we just looked at, and this set was calculated with complete linkage. So which one do we prefer? Can't really tell, right? They each have features that look pretty good. I like how the steeply changing profiles are nicely grouped together here, and the shallower profiles with higher baselines are grouped together there; there seems to be no single-gene outlier cluster this time, and so on. Just looking at it, it's hard to say which is the better clustering, or whether 4 is even a good number of clusters for these data. So both of them would be correct? Yes, and that's an important point: both of them are essentially correct. Both are a different way to look at the data; both are a way to generate hypotheses. Single linkage, complete linkage, average linkage, and the other methods you can apply give you slightly different properties, and whether those properties are going to be good or bad for your data set depends. You can either say: clustering is just an approximation, we're not going to use it for much anyway, so we'll use the first reasonable method, perhaps single linkage because everybody uses it, publish our results, and not worry about it. Or you can say: we should probably try several clustering methods, look at the results, and get some idea of whether the results make biological sense to us. Because after all, it's a biological question you're asking. For instance, if after clustering your expression data set you annotate your genes with gene ontology terms, and you find that all your metabolic pathway genes fall into this cluster, whereas all your cell cycle regulators fall into that cluster, whereas in the other clustering they're spread all over the place, that's a strong indication that the first clustering gives you a more useful and sensible picture — and possibly also one that is more robust, so that for the unknown genes in your data set it may yield good functional predictions. But simply because the algorithm says this is the clustering doesn't mean it gives you back anything other than a mathematical property of your data set. There is no guarantee that that mathematical property coincides well with the biological properties. Yes — is there a good way to choose? There are more sophisticated methods, and when we talk briefly about model-based clustering, we'll come to one method that allows you to quantify at least the information that your clustering contains. But this is also why I was going on at the beginning about understanding your data.
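One quick way to compare the two clusterings just discussed — a hedged sketch, continuing the placeholder names from above — is to cross-tabulate the cluster labels from single and complete linkage:

```r
class.complete <- cutree(hc.complete, k = 4)
table(single = class.single, complete = class.complete)   # where do genes move between the two cuts?
```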
What I would do in a situation like that is take great pains to simulate the data I would expect to get for genes of different classes — thinking about how the expression profiles might differ — and then make a simulated data set with a known number of classes, where you actually know what the correct result is. Then apply your clustering algorithms and find out which of them gives you the known result back in the best possible way. That's the one I would have the most confidence in to work well on my real data set, whose properties are unknown. So in my mind, it's really, really important to be able to simulate data and produce synthetic data that you can analyze. If you don't do that, it's like doing an experiment for which you have no positive control. Only under circumstances where the mathematics behind something is so well understood that you can argue from first principles why a certain data set should behave in a particular way can you do without a positive control, because the positive control is in the mathematical properties: you just know what's correct. But in clustering you very often don't know what the right result is going to be, so you really should simulate. We could also use this kind of information, by the way, to revise the analysis and select columns that seem to be more informative than others. Because if we look at these columns here, they seem to be rather noisy; we should perhaps run the cluster analysis only on, say, time points 7 to 12, under the principle — which is always a sound principle — of analyzing signal in your data rather than noise. The more you can reduce the noise, and filter out what's not actually relevant before you start your clustering and other quantitative analyses, the more distinct your signals are going to be. OK, k-means — oh, five minutes, that's good. K-means is a different clustering method, different from hierarchical clustering. We assume there are K clusters in the data set, and the goal of k-means is to minimize the distances between cluster elements and the centroids of their clusters, within clusters relative to between clusters. In some ways it's similar to hierarchical clustering under average linkage, but you iterate it from random starting points until it converges. You divide the data into K clusters randomly, and you initialize the centroids with the mean of each cluster: the initial assignment is random, and then you just take the mean of each cluster and say, that's my cluster centroid, like in average linkage. Then you assign each item in the data set to the cluster with the closest centroid. So after you've calculated these first-step centroids — four points somewhere in your data set, if K is four — you reassign the elements in your data set to new clusters, simply so that every element belongs to the cluster represented by its closest centroid. The K is something you give it; it's not something the algorithm comes up with. If you call it arbitrary, it's only arbitrary to the degree that you've given it an arbitrary number. And once you've gone through reassigning all your points, you recalculate the centroids, so the centroids shift slightly to better represent the points now assigned to them. And then you go over it again.
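A minimal sketch — not from the lecture — of the simulation idea just described: generate expression-like profiles from two known classes, cluster them, and check whether the known labels are recovered; all parameters and shapes here are invented for illustration.

```r
set.seed(42)
n_per_class <- 25
t_points    <- 17

# Class 1: flat noisy profiles; class 2: the same noise plus a bump around t = 10.
flat   <- matrix(rnorm(n_per_class * t_points, mean = 0, sd = 0.3),
                 nrow = n_per_class)
bumped <- matrix(rnorm(n_per_class * t_points, mean = 0, sd = 0.3),
                 nrow = n_per_class) +
          matrix(rep(dnorm(1:t_points, mean = 10, sd = 2) * 5, each = n_per_class),
                 nrow = n_per_class)

sim        <- rbind(flat, bumped)
true_class <- rep(1:2, each = n_per_class)

# Does the clustering recover the known classes?
recovered <- cutree(hclust(dist(sim), method = "complete"), k = 2)
table(true_class, recovered)
```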
You have new centroids, you forget about the clusters you identified before, and you reassign all the elements to the closest centroid. Once that is done, you recalculate the centroids. If a lot of elements came from up there, the centroid is going to move slightly in that direction; it gets drawn, at least in principle, towards the regions that are densest around it. So the centroids wander around until the assignment stabilizes: once all objects have been assigned to their closest centroids and the centroids no longer move, the clustering has converged. K-means, for clustering, is a bit like ClustalW for multiple sequence alignment. It's a workhorse that everybody understands. If you put k-means in your paper as the method you clustered with, it's relatively unlikely that your referees are going to reject the paper because you used a very poor clustering method. We all know there are better clustering methods out there; still, k-means, like ClustalW, is the one people use. There are much better multiple sequence alignment procedures than ClustalW — T-Coffee, for instance, or other programs like that — yet people still use ClustalW because it's something like the de facto standard. K-means is a little bit like that, and it's also the standard that new clustering algorithms are compared against. So here's an example for k-means; I'm just going to pop this into my R window. First, set.seed(100). Why? What does this do? Forgotten already since yesterday? We'll go over it when you come back from lunch, because it's important: when you start simulating things and recording things in your lab notebook, you can informally use random numbers just to play around, where they're really random and different every time, but if you use them professionally, you want your random numbers to be reproducible. OK. So we simply draw 100 samples from a normal distribution centered on 0 with a standard deviation of 0.3 and put them into a matrix with two columns; these are 50 points with randomly distributed values. We draw another 100 values with mean 1 and the same standard deviation, and we combine the two to give a data set with two clusters that snakes along the diagonal like that. Then we do k-means clustering on that data set, where the initial centroids that k-means uses are simply random points within the data ranges. And since those starting points are random, you will notice that every single k-means run can give slightly different results. For instance, in this run the blue cluster here is very small; in that run there's a larger number of points in it. If blue is the first cluster, then these points are included in the first cluster here, but over there they would be included in the black cluster. So depending on where you start a k-means clustering, your cluster results may differ. Even though the centroids no longer move, they can be stuck in some kind of local minimum: k-means does not compute a globally best solution. It just starts at some point and moves along from there until it can't improve any more, and that can be a local minimum of the clustering criterion.
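A sketch of the k-means demonstration described above; it closely follows the standard example on R's ?kmeans help page, which the lecture code appears to be based on, but the exact seed and plotting details are assumptions.

```r
set.seed(100)

# Two groups of 50 two-dimensional points: one centered at (0, 0), one at (1, 1).
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

cl <- kmeans(x, centers = 2)          # K = 2 is supplied by us, not discovered
plot(x, col = cl$cluster)             # points colored by assigned cluster
points(cl$centers, col = 1:2, pch = 8, cex = 2)   # the two centroids
```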
So it's not always robust, and this is one of the drawbacks of k-means: it's sensitive to initial conditions. In fact, it's probably wise to recalculate several times with k-means and then ask yourself which of these are actually stable solutions. It's a little like computing a phylogenetic tree: you calculate the tree several times over and then ask, in a bootstrap analysis, which of the bifurcations are actually stable and well supported. Yes — could you use categorical data? You could; you just have to encode it as a number. Basically, k-means also uses a distance metric, a Euclidean distance over the columns of your vectors. I can't tell you the details by heart, but the way this is calculated is in the R help file for kmeans; look it up, or we can discuss it over lunch. Essentially it calculates distances within that space, but it only needs Euclidean distances between cluster members and centroids, not among all the cluster members. So while computing a full distance matrix has computational complexity of order n squared, because you have to compare everything with everything, k-means only calculates distances between members and centroids, which is much less than n squared, and you can therefore cluster larger data sets. Oh yes, the set.seed in the second block basically makes sure that, when you start the first clustering, you have the same initial conditions for the first run, because, as I told you, the initial set of centroids is chosen randomly. set.seed initializes the random number generator to a specific state, and then it makes its random choices for the first iteration, then for the second, third, and fourth. If I rerun the panels, they will all look slightly different from each other, but the whole set will look identical the first, second, and third time I run the whole thing. If I had put set.seed inside the loop here — no, sorry, that's not quite right either. Hmm, this works differently than I had remembered: apparently set.seed, placed there, does not pin down the choice of initial centroids for each panel, because you can get different results from panel to panel; it doesn't change how the k-means algorithm calculates its centroids, otherwise all four would have come out the same. I've run this several times: the whole set of four looks the same each run, but the four panels differ from one another in the same way every time. Question: if you are using real data — not randomly generated data — do you still have to use set.seed in that part of the code, to support the results? Yes, I would recommend that, because you can then reproduce your results. What you can also do, of course, is rerun with several different seeds: set.seed(100), 101, 102. Then you get different results, but each and every single one of them is reproducible if you have to redo your analysis, for instance because you have to reformat your paper for a different journal.
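A hedged sketch of the reproducibility point above, reusing the x matrix from the previous sketch: with a fixed seed the whole sequence of k-means runs is reproducible, while individual runs within a session can still land in different local minima; nstart is kmeans()'s built-in way to try several random starts and keep the best.

```r
set.seed(100)
runs <- lapply(1:4, function(i) kmeans(x, centers = 2))
sapply(runs, function(k) k$tot.withinss)   # may differ run to run (local minima)

# Ask kmeans() itself to try several random starts and keep the best solution:
set.seed(100)
best <- kmeans(x, centers = 2, nstart = 25)
best$tot.withinss
```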