We'll start this afternoon with clustering. Anybody care to define what a cluster is? Yes, it's a group. Any group? Usually they're more similar in some dimension. Usually? What do you mean by usually? Mostly the elements are more similar to each other than to elements across the split points, but I've seen clusters where that doesn't hold for every element. You've used a word there that I'm not sure everybody knows: you used the word partition. What do you mean by partition? Divide. Divide the data into groups. So that's one way to think about it: we partition the data. We might have some measure of success, some objective function, to partition the data in a special way that would then allow us to call the partitions we arrive at clusters. There's a bit of a problem with partitioning data in the first place. Is it a fair assumption that all data points actually make sense in terms of a cluster to which we would hope to ascribe some kind of biological significance? That's an assumption that's probably not always true; generally not, in fact. But it's how we proceed. Now, partitioning the data according to some objective function in order to arrive at clusters is basically just saying a cluster is a group of data that is grouped according to an objective function. What would that objective function be? I've heard the word similarity. Could one imagine an objective function to build a cluster? And that is, again, just rephrasing the question: what is a cluster, anyway? Why do we call something a cluster, rather than just a random arrangement of data points that I was so inclined to pick out of my data set? Can we look at the Euclidean distance? Can we look at the Euclidean distance? Sure, we can look at Euclidean distance; we can look at many different distances. But why would we do so? How would that help us? So I take two data points; they're points in a high-dimensional data set, so I can interpret all the values that annotate each one, all the features, as dimensions in some high-dimensional space. More generally speaking, a data point is a vector. And I take a second data point, also a vector over the same dimensions, but separated by some range of values in each of the dimensions. And then I simply calculate the Euclidean distance. Now, what does that tell me with reference to the idea of a cluster? What should a cluster be? Is there a measure of similarity? Because people would like to measure that. I'm assuming the points in a cluster should have similar properties across a lot of different measurements. Right. So Euclidean distance, essentially like correlation, is a measure of distance: it's larger if the points are dissimilar and smaller if they're very similar. Actually, it's not just a measure, it's a metric; we'll briefly define what that means in a moment. But once again, that doesn't help us define what a cluster is. It just tells us how to measure similarities between points: measures of so-called "between" similarity, and of "within" similarity. OK, I like that. That's usually how we think about clusters, not precisely in the words that you used, but the ideas of "between" and "within" always play a role. So if we define clusters, we want the clusters to be partitions with the property that distances of elements within a partition are smaller than the distances between the partitions.
Not necessarily between the elements of the individual partitions, but between some property that we derive from the clusters. In fact, if we have clusters that are located adjacent to each other, the distances between adjacent elements of different clusters can be smaller than the distances to some of the elements within a cluster. But still, from the overall structure of things, we would say that this one element, though it's close to that one, belongs to this cluster and the other one to that cluster. So the distances within an entire cluster can be relatively large, even though individual distances between clusters can be rather small. There are different ways to go about this, and in many instances it goes beyond just looking at individual pairs of distances. It's difficult to define, though, because it critically depends on how we partition in the first place: how many clusters we build, the degree of resolution or granularity that we apply to our data set, the separation that's inherent in the points, the exact algorithms that we use, and so on. So there are many different ways. And if we look at data visually and define clusters in that data by saying, well, these are points that are close together and the others don't really belong there, it's exceedingly hard to cast that into a proper mathematical algorithm, not least because the function that separates clusters well might not be the same globally; it might depend on the local environment of the data set. So there might need to be some kind of adaptation that depends on the actual features, which makes things very messy, and, not surprisingly, clustering is not a solved problem. There are many different clustering approaches, most of which work very, very well in specialized cases and fail miserably in others. But there are a number of principles and very simple algorithms that we can apply at the outset to, once again, start exploring our data: find subsets that have something to do with each other, explore these subsets, look at their individual properties, and figure out whether they make biological sense. Remember, exploratory data analysis is about hypothesis generation. What we'll do in this unit on clustering is try to understand clustering principles, work a little with some basic hierarchical and partitioning methods, talk about how to interpret clustering results in the first place, run a little bit of cluster quality control, and discuss alternatives. It's actually interesting: clusters are abundant among biological entities, and that's not trivial. If you think about proteins, or genes, or systems like that, there's nothing that says a protein has to be similar to another protein. They could all be equally dissimilar to each other. Or that a biological pathway is similar in some properties to another biological pathway; again, that's not necessary. You could define all of biology so that every element you look at is entirely unique and entirely different from anything else. But that's not how biology works. There's a lot of similarity in biology; there's a lot of reuse of modules and of ideas. Much of the similarity is due to homology: inheritance and evolutionary divergence create proteins, systems, and pathways that are in some way similar because they're derived from something else, and we can discover that.
Another reason for finding similarity, for defining clusters, is convergent evolution. There's only a limited number of ways in which you can hydrolyze a peptide bond. So by convergent evolution, nature has come up with methods that, at the catalytic core, look exactly the same, i.e., the catalytic triad. If you were to cluster the active sites of peptide hydrolases, you would find that some cluster in terms of the spatial arrangement of their active components for convergent reasons: the number of ways you can use amino acid side chains to hydrolyze peptide bonds is limited, and if evolution explores the space of possibilities a large number of times, it will in principle come up with the same solution more than once. And there's a third reason why things might be clustered, or might appear to be similar. One was homology, i.e., divergent evolution. Two is convergent evolution. The third important reason is random chance. If we look at a large number of elements, there are bound to be some that are similar to each other for no good reason whatsoever. And one of the challenges of cluster analysis is, of course: once we've built these clusters, do they mean anything? Can we interpret them in the context of biology? It always goes back to biology. We're always in the state of not knowing one thing and yet knowing a lot about something else, and we're trying to reconcile that. Technically, clustering is an example of unsupervised learning: we throw clustering algorithms at a data set without knowing anything about categories, and we don't know whether the clustering is right or wrong. Supervised learning, by contrast, means we have a training set where we know the results, and we can use the features of the known results to try to infer unknown results. Clustering is completely unsupervised. It's useful for the analysis of patterns in data; in the optimal case, it can lead to class discovery. And it is, as Lauren mentioned, a partitioning method: it partitions data into groups of elements that are more similar to each other than to elements in other groups. It's a completely general method. It can be applied to genes, to samples, to both at the same time, or whatever you want to do. As long as you have measurements, you can start clustering. One of the most basic and, indeed, one of the most interesting methods is hierarchical clustering. We'll work with that a bit, and I think we'll talk more about how it works in principle as we actually compute it and have some visuals to work with. In hierarchical clustering, you work with a number of items and a distance metric. First you calculate the distances between all of the individual elements. Then you find the closest pair of elements and merge them, defining some average measure that now describes the two elements jointly instead of individually. And then you do it again: you recompute the distances from that joint element to all the rest, look again for the closest pair, which might involve the merged element or the original elements, join them, and so on, until everything is joined into a single root. It looks very much like a phylogenetic tree. You compute this until all clusters have been merged into a single cluster. But what does a distance metric actually mean? What's a metric in the first place? In the mathematical sense, a metric has a very defined meaning.
It has to fulfill exactly three conditions. The first two are kind of trivial. The first one is the identity condition, and it says: if the distance between x and y is 0, then x is the same thing as y. Now, we might have two gene expression profiles that are numerically identical. They might be profiles of different genes, and the distance between them would still be 0. In the strict mathematical sense, that would be a violation. However, it just says that according to these expression profiles, the two genes are not distinguishable: from just looking at the profiles, we would never be able to tell them apart. The second feature of a metric is that it has to be symmetric: the distance between a point x and y has to be exactly the same as the distance between y and x. If you invert the order, it has to be the same. And the third condition is the triangle inequality, and that can be nasty, depending on the kind of measures you use. The triangle inequality says that the distance between x and y is always smaller than, or at best equal to, the distance between x and some point z plus the distance between z and y. Essentially, if we paraphrase that, it says: in our metric there are no shortcuts. If we take something other than the direct route, it can never be shorter than the direct route. Now, depending on the measure you use, that can't always be guaranteed. If we use the Euclidean distance metric or the like, then it's clear and mathematically proven. But if you use some esoteric function that spits out a number, say according to pathway membership or whatever, there's no guarantee that it's actually a metric unless you look at it closely. And what happens if you cluster with a measure that is not a metric? Well, things get clustered all the same. But it means that cluster membership is no longer interpretable, in a way. Because what cluster membership really gives us, once we have a distance between two elements, or membership in a cluster, is that we can make basically a triangulation inference about a third element: it has to be outside, or it has to be somewhere else. If the metric breaks down, that third element could be anywhere. It could be much closer than the first two, and somewhere different. So even though we would be generating something that looks like a clustering, an interpretation in terms of similarity would no longer be possible. So if you ever go out and define your own metrics, which would be interesting, because it's certainly interesting to use more biological and more meaningful functions to calculate distances than simply using Euclidean distance all the time, you still have to be careful, for the purpose of clustering, that you have a true metric to work with. There are a number of different ones. We've mentioned Euclidean. There's a metric called the Manhattan distance, which basically says that instead of taking the diagonal route, we take a route only along an orthogonal grid. A very interesting measure is 1 minus correlation. We can calculate a correlation coefficient between two profiles and subtract it from 1. Remember, the correlation coefficient ranges between minus 1 and 1, and a distance always has to be non-negative. With 1 minus correlation, we map the correlation coefficient into the interval from 0 to 2, and then we can use that as a distance.
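To make this concrete, here is a minimal sketch in R; the toy matrix m is made up purely for illustration. dist() provides Euclidean and Manhattan distances out of the box, and the correlation distance can be built with as.dist():

set.seed(112)                          # toy data: 5 "genes" x 6 "time points"
m <- matrix(rnorm(30), nrow = 5,
            dimnames = list(paste0("g", 1:5), paste0("t", 1:6)))

dEuc <- dist(m, method = "euclidean")  # straight-line distance between row vectors
dMan <- dist(m, method = "manhattan")  # sum of per-dimension differences, the "grid route"
dCor <- as.dist(1 - cor(t(m)))         # 1 minus correlation: maps [-1, 1] into [0, 2]

round(as.matrix(dCor), 2)              # identical profiles give 0, anticorrelated give 2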
And there's something called the Hamming distance, which we need for ordinal, binary, or categorical data, i.e., not the kind of real-valued data that we normally use. That's the count of positions in which two vectors differ. So over, say, five dimensions, the Hamming distance is 0 if all five dimensions are equal, and 2 if three of the five dimensions are equal, and so on. You've come across a closely related measure: the identity of aligned sequences. There, two identical aligned characters score one point for the identity count, and you sum over all positions that are identical; the Hamming distance, and the related edit distance, count the mismatched positions instead. So, does the choice of metric matter? Unfortunately, yes. Here's a comparison on a gene expression data set where we look at a heat map for each metric. Each of these points is colored and sorted according to the overall similarity of the data points in their expression vectors; one panel is Euclidean, one is Manhattan, one is 1 minus correlation. Red values mean the points are very similar; yellow and white values mean they are not very similar. So what does this mean? How do we interpret it? What do we find? Well, it's apparent that under this arrangement, where we group similar elements together in the heat map, we can get a block structure. The block structure basically says: the group of elements plotted along this stretch of the axis is similar to the group of elements plotted along that stretch of the axis. The group of elements in the middle is similar to itself, within its own group. There's a subgroup in here, another subgroup in there, and so on. So if we find block structure in these heat maps, it means there are similarities that our distance metric is able to pick up. If we take a distance metric, plot the heat map, and the heat map just looks random, that distance metric is not suitable for attempting clustering on that data set. If we force the algorithm, we will still find clusters, but we're probably not going to be happy with the results: even though we can separate the data into clusters and partition it, the partitions are not going to be very different. We'll get clusters where the internal structures, for example the expression profiles, look exactly the same between one cluster and another. So in this case, looking at these matrices, I would definitely go with 1 minus correlation. Now, unfortunately, the distance metric is not the only parameter that we can change. We also need a linkage method. Linkage methods describe how we compare clusters with each other once we've formed them. I told you that initially we start from all elements, then we combine elements into clusters, and then we calculate distances between clusters and elements, or clusters and clusters. There are different ways to do this. There's average linkage, where we take every element in one cluster, compute its distance to every element in the other cluster, and average over these distances. There's single linkage, which simply asks for the distance between the closest elements of the two clusters. Complete linkage asks for the longest distance between the elements of the clusters.
And we can also look at the distance between centroids, where we take some kind of weighted average of the elements of each cluster and then compute the distances between these weighted averages, between these means. So there are different ways to do this, and some of them are going to be better under some circumstances, and for some data sets, than others. It's not trivial to decide. You basically have to look at the results, and you can either apply mathematical methods that show whether the information within the clusters is good, or, ideally, you might have some idea about the underlying biology, and then you can ask, well, which of these methods leads, for example, to clusters that have the most homogeneous GO terms associated with them, or similar orthogonal information? Right, so with these preliminaries, we should try to cluster some data, and we'll cluster some expression data. I hope you've loaded this file; if not, it's loaded in the usual way, as the reda-clustering.R project file from GitHub. Anybody have trouble finding that? Holler if you do. So what's in the box here? We'll just go through it. The data I'd like to work with is, again, cell-cycle data, but from a different experiment. And this time it's not pre-arranged data but the entire data set of a cell-cycle experiment as deposited in GEO, the Gene Expression Omnibus, the NCBI database that stores gene expression data and makes it publicly available. GEO has a nice tool called GEO2R, and GEO2R is really easy to use. There's a YouTube tutorial on how to use it. I always teach it in my introductory undergraduate bioinformatics course, and nobody really has problems with it, so it must be very, very easy. You basically load the data set, define groups within the data set, and then run an analysis for differentially expressed genes. The link, the name, is here; that should be the correct one, I hope. Can anybody confirm that it works? Yeah. Right, so you can load the data sets in your browser, you can define groups, and you can use Bioconductor packages on them; the code runs on the NCBI servers and uses Bioconductor packages to identify differentially expressed genes. And the neat thing is, once you're done, you can also download the exact R code that was used to identify your differentially expressed genes, rerun the whole thing at home, modify it, and do more with the data. That's actually a very nice feature: the site has a code generator that creates code based on how you worked with the site. As far as I know, there is no version that supports RNA-seq data; GEO2R, as far as I know, only runs on microarray data. If anybody ever proves me wrong, it means the NCBI has added functionality, and I would be more than pleased, because it's really a neat tool to have. So that's the caveat at the moment: I believe it only works for microarray data. So how does it work? Well, this first part of the code, all the way to here, is taken essentially verbatim from GEO2R, with some comments added and some things rearranged a little; but it's the same functionality. At first, we source a program from the web. The R source() command that we normally use to run script files can also be pointed at web pages where a script is available.
So the Bioconductor project makes a script available on the web, biocLite.R, which is their own installer for Bioconductor packages. Once you have sourced this, you can run biocLite() to download and install packages. Now, when you do that, in case you already have old packages on your machine that are required as dependencies, you will sometimes be asked whether you would like to update all, some, or none of the packages. To that, you should always answer "all". If you don't update all of your dependent packages, there's a frustrating chance that something will subtly not work as intended in the package you're using. So in this case, I update all. Then it asks me: do I want to install, from sources, the packages which need compilation? Some packages in their newest versions have not yet been compiled as binaries for Windows or Mac or Linux computers, but they can be compiled locally, using the respective Fortran or C compilers for the parts that sit beneath the R layer. It doesn't mean the binary won't work; it just means there's a bleeding-edge version that is more recent than the last compiled one. So I usually answer no, I don't want to install from source; simply give me the packages as they are, the best compiled versions available. That usually works very well. If you hit bugs that look like real bugs, ones that have something to do with the system, you might try installing packages from source. But compiling a large number of packages from source takes a long time. So, by default: update all of the packages, but don't install from source those that would require it. So now it has downloaded Biobase and its dependencies, GEOquery and limma. I've already installed them, and then you load the usual libraries. The next thing one needs to do is download the data set GSE26922 from GEO, from the NCBI. Now, that doesn't always go without a hitch, and I'm not entirely sure why. It might be that they recognize 127 queries for the same thing originating from the same IP address and start shutting it down; I've seen in courses that getGEO doesn't always work as needed. It's a bit patchy. For the purpose of the workshop, it's probably better to load this data set from the backup that's in your files, simply by executing load() on GSE26922.RData, which contains a large ExpressionSet of 101.9 megabytes, compressed to 16 megabytes. So this is the real thing, a really large microarray data set. For your work at home, or for working with different expression sets or different GEO IDs, you would simply use getGEO with the appropriate ID here and set GSEMatrix to TRUE. OK, so let's see what we have. It's an ExpressionSet. It has assay data, with features across 18 samples. The samples have different sample names, there's metadata, and so on. If we look into the structure of the gset object, there's a lot of stuff going on. This is Bioconductor for you: it's extremely powerful, but I would say it has a learning curve. It uses an object-oriented system, the S4 class system, under the hood, which works a bit differently from the data structures we've worked with so far. But it's very powerful, and it really supports large-scale bioinformatics very well.
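As a recap, the loading steps just described look roughly like this in code. This is a sketch of the workflow of that era; biocLite.R has since been superseded by the BiocManager package, and the backup file name follows the description above:

# Install Bioconductor packages (the historical biocLite workflow):
source("https://bioconductor.org/biocLite.R")
biocLite(c("Biobase", "GEOquery", "limma"))

library(Biobase)
library(GEOquery)
library(limma)

# Download the series from GEO; getGEO() returns a list of ExpressionSets.
gset <- getGEO("GSE26922", GSEMatrix = TRUE)
gset <- gset[[1]]

# Fallback if GEO is unresponsive: load the local backup instead.
# load("GSE26922.RData")

str(gset, max.level = 2)   # peek at the S4 structure: assayData, phenoData, ...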
Now, what I'd like you to do: you should be able to figure out, simply given this identifier, GSE26922, what the 18 samples in this data set signify. What are they? So somebody gives you an expression data set, and now you're tasked with finding information about it. Again, this is the question about the underlying biology. These are not just numbers; they're the results of a biological experiment. In order to interpret them and use them correctly, we need some idea of what that experiment did and what the data columns mean. So where would you find that? GEO, perhaps. That's a good guess. One could Google for it; maybe that's also a valid approach. I would suspect that if Google works correctly on that identifier, it would lead you to GEO. So let's figure it out. What can you learn? What can you find about GSE26922? Can anybody paraphrase the experiment we're looking at here? Exactly. So HeLa cells are synchronized with a cell-cycle block. The block is released, then they start doing their thing, that is, dividing, and at various time points samples are taken. To learn more about what these samples and time points are, we find that information under the sample descriptions. The first three entries are blocked HeLa cells, biological replicates 1, 2, and 3. So this experiment has biological replicates, and they're the first three data sets. And then we have biological replicates 1, 2, and 3 at 2 hours after release, at 4 hours after release, and at 6, 8, and 12 hours after release. Each of these experiments, i.e., each biological replicate, has its own identifier for the actual microarray that was used, and all of them are grouped together in this GEO expression set. That's how such an expression set is structured. Good. That's important information to have when we analyze our data: the 18 data points we have here are six sets of three biological replicates each, of cell-cycle expression. Now, there are a couple of Bioconductor commands here that get everything running. We make labels that we can use later on. We define group names for all samples: the first three samples get a group name of G0, the next three get G1, then G2, G3, G4, and G5. We log-transform the data. We take the groups and treat them as factors, add that to the expression set description, and define a design for the differential expression experiment, i.e., a model matrix based on this description. Basically, this says: compare everything against everything, and tell me which of the compared groups show statistically significant differential expression. And then we want to return the genes with the highest differential expression. So once again, this experiment, too, treats the data points as independent; we're not actually doing a time-series or truly cyclical analysis. After that's all been defined, we fit the model matrix against the actual data and then calculate contrasts, i.e., G5 minus G0, G1 minus G0, and so on, to actually find the differentially expressed genes. After that, significance is evaluated at the 0.01 level, and the whole thing is adjusted by the false discovery rate, which takes multiple testing into account. We'll talk a little more about multiple testing later, but this is the false discovery rate. And we come up with the top table, tT, of the 250 most significantly differentially expressed genes in the data set. Mind you, these are differentially expressed genes; they are not necessarily regulated genes. They just show some expression difference between the groups.
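Condensed into code, the differential-expression steps just described look approximately like this. It's a sketch in the style of GEO2R output, under the assumptions stated above (six groups of three replicates, contrasts against G0); the variable names are illustrative:

# Assign each of the 18 samples to one of six groups (three replicates each).
groups <- factor(rep(c("G0", "G1", "G2", "G3", "G4", "G5"), each = 3))

ex <- exprs(gset)
ex <- log2(ex)                          # log-transform (if not already on a log scale)

design <- model.matrix(~ 0 + groups)    # one coefficient per group, no intercept
colnames(design) <- levels(groups)

fit <- lmFit(ex, design)                # fit the linear model
cont <- makeContrasts(G1 - G0, G2 - G0, G3 - G0, G4 - G0, G5 - G0,
                      levels = design)
fit2 <- eBayes(contrasts.fit(fit, cont))

# FDR-adjusted top table of the 250 most significant genes:
tT <- topTable(fit2, adjust.method = "fdr", p.value = 0.01, number = 250)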
And then the code loads some additional platform annotation. Platform annotation essentially says which spot corresponds to which gene, what these genes are, and what their properties are. And in the end, we have this top table. These are the six most significantly differentially expressed genes: a histone, CDC8, CCNE2, another histone, and so on, with adjusted p-values on the order of 8 × 10^-12. What does that even mean, a p-value in this context, for the most significantly differentially expressed genes? What probability are we measuring here? Any idea? You have differentially expressed genes, and you annotate them with a p-value. It's essentially the probability of seeing differences like these if there were no differential expression at all, i.e., if the expression values were constant across the groups. A low p-value says that probability is very small. And we estimate it from how variable the data are internally, within the groups that we defined, and how different the groups are from each other. The lower the variance within the replicates and the higher the difference between the group measurements, the smaller the p-value for the hypothesis that nothing is going on. Right. So let's look at the top gene and plot its original expression values. What do we find? Remember, these are six sets of triplicates. There's small variation within this triplicate; there's a high value here, a lower value here; this goes down, this also goes down, this goes up. So there's a significant difference between the groups, but within each set of replicates the measurements are very stable, very reproducible. And that gives the question of whether it is differentially expressed a very low p-value. The vertical axes are log expression levels, so log ratios. These would be the data values for the top three. So let's process this for cluster analysis. To use essentially the same approach as before, it's useful to take this table, with all the information it contains, and make a table that contains only numbers, with just a single value per group of biological replicates. So we'll take all the biological replicates, average them, and assign the result to a table that we will use, called dat. We'll also keep a table of gene symbols that we can use for annotation. We go through our top table, the 250 genes, and take the expression values of columns 1 to 3 for t equals 0, of columns 4 to 6, 7 to 9, and so on; we take the averages and put each into a column, and put all of that together. Now, a good way to work with this is to use gene IDs as row names, so that we can pull out individual elements simply by gene name. But before we do that, we have to figure out whether there are any duplicated names, and yes, indeed, there are. One kind of duplication is just empty strings; another kind is gene symbols that are repeated. We also notice that there are these slashes: they mark spots that measure the expression of more than one gene at a time. So we can remove those, set the row names, and also remove all rows for spots with isoforms, i.e., with these slash characters. And that completes the creation of an expression data set that we can use for clustering. If something is a significant intermediate result like this, you can use the write.csv() function to write it out and store it.
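A sketch of that post-processing; ex250, the column grouping, and the Gene.symbol column name are assumptions based on the description above, not the exact course code:

# Expression values of the 250 top genes, 18 columns in replicate order
# (hypothetical: pulled from the ExpressionSet by the top-table probe IDs):
ex250 <- exprs(gset)[rownames(tT), ]

# Average each set of three biological replicates into one time-point column.
dat <- sapply(seq(1, 16, by = 3), function(i) rowMeans(ex250[ , i:(i + 2)]))
colnames(dat) <- c("t0", "t2", "t4", "t6", "t8", "t12")

sym  <- tT$Gene.symbol
keep <- sym != "" & !duplicated(sym) & !grepl("///", sym)  # drop empty, duplicate,
dat  <- dat[keep, ]                                        #   and multi-gene ("///") spots
rownames(dat) <- sym[keep]          # rows now addressable by gene name

write.csv(dat, file = "dat.csv")    # store the intermediate result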
And if you need it again, you can read it back in the corresponding way, or you can just save it. Basically, this stores the data set as comma-separated values, which might be important if you want to go back and edit things and change things around. But you can also simply save the binary object the way we've done before, with the save and load functions, or saveRDS and readRDS, which is essentially the same thing. So we have this data; let's do a heat map. Default parameters, one heat map coming up. Yay. What does it mean? It's a heat map. So what are the colors? Presumably some of the colors mean high expression and some mean low expression, and I don't know which is which. So how do we find out? Yeah, we could add a legend. The legend would need to take the range of values that's available and print that out. Another way to find out is to look at a small number of genes, a small enough number that we can actually identify the names. In this case, FNIP1 is a gene name that appears at the top here and very clearly has some value at T0 and T2 that is either high or low, and a value at T6 and T8 that is similarly clearly either high or low. So let's just pull that out and have a look. We want the data where the row is FNIP1, and we want all the columns. OK, here we go: T0 and T2 have values around 10, T4 and T6 have values around 11, T8 is 11, and T12 is 10.9. OK, so these are log expression differences. Is yellow high or low? Is red high or low? Who doesn't know yet? Nobody will admit it. Who knows? Nobody will admit that either. OK: you see that T0, red, has a relatively low log expression value, and T4, orange, and T6, yellow, have higher values. So in this depiction, red means low log expression values and yellow means high log expression values. Red is low, yellow is high. Does white mean that data is missing? No, white means even higher; those are the highest values here. Is it difficult to have a legend shown in R? It's probably not difficult; I'd need to explore how to add a legend here. Maybe it's just a question of saying legend equals TRUE, but I wouldn't be surprised if it's not easy: the legend would need to refer to the colors, and I don't even see the colors as a parameter here. I think it's in the dot-dot-dot arguments. Anyway, it's not a trivial change of one parameter. What I've done here, just for readability of the illustration, is to plot only every fifth gene, with a sequence statement using by equals 5, retaining the overall structure of the clustering but making it more sparsely populated. So what does this tell us? For example, there are genes that are lower at T4 and T6 and higher at T0 and T2, and these include genes like FNIP1, MED13L, and NRIP1 here. Actually, at T4 and T6 these are not low but high; I thought it went the other way around. So these are yellow here and red over there. And there are other genes for which the inverse is true: the ones down here are highly expressed at T0 and then are repressed at the later times. So we can do a similar thing to what we did when defining genes from the PCA analysis: define the individual gene sets, which we just read out of the plot, and then use a parallel-coordinates plot to draw the expression profiles. So this is one set, and this is the other set.
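In code, that whole sequence, from the heat map to reading out individual genes to the parallel-coordinates plot, might look like this sketch; the two gene sets are illustrative picks read off the plot as described:

heatmap(as.matrix(dat))                 # default heat map: red low, yellow/white high

dat["FNIP1", ]                          # one row, all columns: check which colors are high

# Two sets of genes read off the heat map (illustrative selections):
up   <- c("FNIP1", "MED13L", "NRIP1")   # higher at T4/T6
down <- setdiff(rownames(dat), up)[1:3] # e.g. genes high at T0, repressed later

# Parallel coordinates: one line per gene across the six time points.
matplot(t(dat[up, ]), type = "l", lty = 1, col = "firebrick",
        xlab = "time point", ylab = "log expression")
matlines(t(dat[down, ]), lty = 1, col = "steelblue")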
So over the cell cycle, we have genes that get induced and genes that get repressed. These are the kinds of clusters that we find here. Now let's try to cluster them properly. For a hierarchical clustering, we first need to produce a distance table. Remember, it works on distances, so we don't compute distances individually every time we need one; we make a table of all pairwise distances once and then use that to build the tree. So the first and simplest thing is to calculate the distance matrix and then to cluster it. Hierarchical clustering, hclust() in R, is very quick, and the default plot procedure gives us this result: a hierarchical clustering of the top 250 differentially expressed genes from a cell-cycle experiment. Now, didn't we say before that a clustering is a partitioning? So how do we use this for partitioning? It seems we can pick out different groups at very different levels from this clustering of the data. Right, so what does that mean? It means we need a little more work to define what we actually mean by a cluster here. Do we want clusters that are very fine-grained, or clusters that are very coarse-grained? So let's try making different clusters. But first, let's have another look at the distance matrices. This is the one we've already seen, the Euclidean distance metric; it has a kind of block structure that we can use. This is the Canberra metric, which is much more all over the place, less well ordered. These are the maximum and Minkowski metrics. All of these are available here, and we basically need to look at the different heat maps and decide which one has the best contrast and the clearest block structure. This is 1 minus correlation. Now, 1 minus correlation is not one of the built-in methods, but we can very easily define our own procedure, our own code, to compute distances: 1 minus correlation is simply calculated by taking the correlation matrix, subtracting it from 1, and treating the result as a distance. It gives a block structure that looks like this. And I'm going to skip Minkowski here. So perhaps we could use the maximum distance, as having one of the more intriguing block structures, and use that for clustering. So now our hierarchical clustering is based on that maximum distance. Let's go back to our dendrogram. The idea behind actually generating clusters is to take the dendrogram and cut it at a particular level. For example, if we cut the dendrogram at this level, it naturally falls apart into two groups, and we could then say this defines two clusters. That's how two clusters can be derived from this dendrogram. If we say we would rather have five groups instead, the algorithm picks a different level to cut, like this. If we want 10 groups instead, or 20, or 50, we can do all of that with the same dendrogram. So what's the correct solution? As I said before, that really depends on the kind of question you're asking; there's no hard and fast rule for which of these is best. Obviously, there's a continuum. At one extreme of the continuum, everything falls into a single cluster; that is the least informative. At the other extreme, every cluster contains one and only one element, and that's also minimally informative.
So something between every element in its own cluster and all elements in the same cluster is best, and it really depends on the kind of analysis we want to do where we choose to cut. But let's retrieve the actual indices and use them to generate some parallel-coordinate plots, and check how different these expression profiles are in the first place. In order to get these, we use the cutree() function on our hierarchical clustering, specify that we'd like a k of 20, and assign the result to a variable which I call class. class now contains one number for each of the 250 genes, and that number defines a cluster membership. So SGO2 is in cluster number two, LRIF1 is also in cluster number two, KIFC1 is in cluster number two, and so on. How many genes are in each cluster? There's an easy way to check, using table(), or perhaps sort(table()) so the clusters are sorted by membership. Cluster number four has 48 members, cluster number two has 23 members, cluster number one has only a single member, as do 19 and 20, and so on. With these, the clusters are now defined. So we can use the cluster memberships to identify the names of the genes, use the names of the genes to pull out the actual expression values, and then put those into one of our parallel-coordinate plots. That's the code: a matrix plot of dat where class equals 10, and in the same way for clusters 2, 11, and 4. So: one set that appears kind of all over the place, simply increasing in expression overall; one cluster that increases during the cell cycle; one that decreases during the cell cycle; and one that looks very similar to the one that increased during the cell cycle. Mathematically, according to the metric we're using, they seem to be different enough to fall into different hierarchical clusters; visually inspecting them, we see no really convincing reason why they should not fall together. Perhaps, on average, this one is a little lower in its absolute values than that one. Now, I'd like you to do the same thing. Copy whatever code you need into the myscript.R file for this session, and then edit it so that you produce a hierarchical clustering, all the way to a plot like the one we just saw, but using a different distance method. We've used dMax, the maximum distance, here; I would like you to use Euclidean, or Minkowski, or Canberra, or 1 minus correlation, or one of the other ones, not the maximum again. So copy the code into myscript.R and edit it to run the same thing with a different distance metric, and then we can compare whether the different metric changed the results in any important way. Now, maybe I'll just step through one of the alternatives down here; I see that this has worked for most of you, which is awesome. Let's do the same thing with 1 minus correlation. These would be nine clusters from the 1 minus correlation distance, and I would identify some clusters here that are larger, and perhaps more interesting, than the ones we've seen with the Euclidean distance. But overall, the differences are not very compelling. Aesthetically it kind of looks nice, but the clusters are not very homogeneous, as we might expect them to be from a biological interpretation.
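Pulled together as one sketch, with the distance metric and the linkage method both exposed as the knobs discussed earlier; variable names mirror the narration, and the particular cluster numbers will differ from run to run:

dMax <- dist(dat, method = "maximum")          # try "euclidean", "manhattan", "canberra", ...
dCor <- as.dist(1 - cor(t(dat)))               # ... or the custom 1 minus correlation distance

hc <- hclust(dMax, method = "complete")        # linkage: "complete", "average", "single", ...
plot(hc, cex = 0.5)                            # the dendrogram

class <- cutree(hc, k = 20)                    # cut into 20 clusters
head(class)                                    # named vector: gene -> cluster number
sort(table(class), decreasing = TRUE)          # cluster sizes

# Parallel coordinates for one cluster:
matplot(t(dat[class == 10, , drop = FALSE]), type = "l", lty = 1,
        xlab = "time point", ylab = "log expression")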
One of the problems is that we require everything to be cut at the same level. It might make more sense to look at the information contained in the clusters, with information-theoretic methods, and dynamically adapt the cut level. There's a package called dynamicTreeCut which does that, and I have example code for it in the script. Another thing that becomes apparent is that the clustering seems to pull things apart that have similar shape but different levels, and that's because we didn't actually normalize our data; it's raw, as it is. So we could normalize the data and then try a clustering that is based entirely on shape, for example with the dynamic tree cut. If we cluster based purely on shape, we get clusters that are more convincing in one respect; but in another respect, I wouldn't really know why this one ends up differently from that one. So once again we have a similar problem: mathematically we can cluster this very cleanly, but it's not entirely clear how to interpret it biologically. One method that's quite interesting and robust across many applications is affinity propagation clustering. That was published in Science ten years ago by Brendan Frey of the University of Toronto. And if we run affinity propagation clustering and calculate a heat map from the result, this actually seems to give us the clearest, most defined, and most distinct block structure overall. If we can exploit that for clustering, we should be doing pretty well. In terms of the information that's there: two large subclusters seem to be important, maybe with a little internal structure, but overall, except for these two large subclusters, everything else is pretty much noise. And if we look at the corresponding plots, I think we now have a better separation into things that actually look dissimilar, and don't just seem dissimilar according to some mathematical method. So from the things we've seen and tried here, affinity propagation clustering seems to give the best results, at least visually, in this scenario. So which one should we use? This brings us to the question of cluster quality metrics: how can we actually measure how well defined and how informative these clusters are? There's a library for that, clValid, which requires a few sub-libraries, and that library runs clustering on the data points using a large number of different clustering methods: hierarchical, k-means, DIANA, FANNY, self-organizing maps, k-medoids with PAM, SOTA, CLARA, model-based clustering, and so on. After running all of these, it's possible to determine with validation measures which one was actually best. So instead of running them one by one, this is all done in one package, and all of the results are then made available. Let's see how that worked. Hm, it didn't run. Okay, something's not working here; I don't want to go deeply into troubleshooting this, and we'd need to adapt it for individual problems anyway. In any case, there's this package, clValid, and it in principle allows you to compare all of these different clustering methods and then apply mathematically founded measures that can distinguish the quality of the clusterings from each other, at least in terms of internal validation.
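For reference, here is roughly what those two approaches look like in code; a sketch assuming the apcluster and clValid packages and our dat matrix, with illustrative parameter choices:

# Affinity propagation (Frey & Dueck, Science 2007), via the apcluster package:
library(apcluster)
apRes <- apcluster(negDistMat(r = 2), dat,     # similarity = negative squared distances
                   includeSim = TRUE)          # keep similarities so heatmap() works
length(apRes@clusters)                         # number of clusters found
heatmap(apRes)                                 # block structure of the result

# Internal cluster validation across several methods, via clValid:
library(clValid)
vRes <- clValid(dat, nClust = 2:10,
                clMethods = c("hierarchical", "kmeans", "diana", "pam", "sota"),
                validation = "internal")
summary(vRes)                                  # connectivity, Dunn index, silhouette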
Now, we've encountered t-SNE. Is that an alternative? So let's have a brief look at t-SNE. We see some intriguing similarities and substructures here. The question is: are they meaningful? Now, I've run this many times. I mentioned before that the results of t-SNE depend on the starting values, and for some starting values it will simply come up with a much better embedding. So if we use this seed and work on this data, we can get a relatively good and consistent result from t-SNE. Well, it's not a clustering, it's an embedding, but we can work with it in a little more detail. So this runs for a little while, slightly changing the embedding, stochastically searching for better solutions. If you want to work with the same result, I've saved it into the file tsne.ref.dat; lines 699 to 701, if you just execute that code, will give you the same result without needing to actually run the iterations. Jesus, how many iterations am I asking for? Oh, 8,000; that seems excessive. Halfway done. Normally we like procedures that take about as long as it takes to brew a fresh pot of coffee, and we're not quite ready for a coffee break yet. Boris, I've forgotten what it was; please remind us what the seed actually means. What the seed means? The seed value is the value from which we start the random number generator. Once we set the seed for a random number generator, the random numbers that are generated are always the same for that particular seed. So the idea is: when I set a seed value here, this t-SNE embedding process will run in exactly the same way every time, and it will run the same way for me as it does for you. If you execute the same code with the same seed, you should get exactly the same result: the random numbers that follow from your specific choice of seed are the same. The seed can be any single number. If the seed is not defined, the system takes a random seed that it generates from parameters of your system. Normally, in order to make random numbers, the system collects what's called entropy, which is related to the time, process states, how you move the mouse, what the intervals between your key clicks were, and so on; things that are not predictable, and that therefore make the generator behave in the best possible way, producing something that looks like a random sequence even though it isn't.
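As a sketch of the embedding step, here is one way to run t-SNE reproducibly in R. The tsne package is an assumption about the implementation, and the seed and perplexity are illustrative; the 8,000 iterations match the run above:

library(tsne)

set.seed(112358)                       # fix the RNG so the embedding is reproducible
emb <- tsne(dat, k = 2, perplexity = 10, max_iter = 8000)

plot(emb, xlab = "", ylab = "", pch = 19, cex = 0.6)

# Save the result so the long run never has to be repeated:
save(emb, file = "tsne.ref.dat")
# load("tsne.ref.dat")                 # ... and restore it later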
Okay, so this is done. I've saved the data, and I can define a color spectrum. In the script I go on a little about why such a color spectrum is good; it's basically a rainbow-type spectrum. I mentioned that there are a number of built-in color spectra that we can use for categorical coloring. For example, the built-in rainbow spectrum looks like this. The heat colors look like this; we've used them in the heat map, they're the default there, and they go from red all the way to white. There are terrain colors, which emulate, from low to high, the shading of a terrain map, with lush green valleys at the bottom and barren hills and snow-capped peaks as we go higher. Topographic colors are similar, only with a deep blue ocean at one end. And cm colors, cyan-magenta, just go from cyan to magenta. There's a lot more on colors in the plotting reference script that is in one of the units. Now, this is code that generates color values loosely based on the Munsell color wheel, and that's an interesting way to look at color. Here the colors are not separated by some notion of red, green, and blue values; they're separated perceptually. These colors are meant to be perceptually equidistant. At least if you don't have any red-green blindness, you will find it easier to distinguish colors that come from one of these Munsell color wheels. For example, if we look at colors 14 and 15 from the rainbow spectrum and compare them with colors 14 and 15 from this wheel, I would say that at least to my eyes the former are virtually indistinguishable, while these blues are slightly more distinguishable. So the idea is to map into a color space where adjacent colors become more different; that's what it's useful for. So we can use that and plot the result of our t-SNE embedding. We actually find that the clusters we found with affinity propagation fall into similar classes under the t-SNE embedding as well; other classes are more mixed, and some are more different. So now we can do a similar thing to what we did with the Cho cell-cycle data: plot the numbers, squint at them, and identify which numbers we would like to compare. By default this is all plotted very large; we can plot it much smaller, so there's less overlap, and we can color the points and use that information to pick out some sets and do a parallel-coordinates plot in the same way. So these are the normalized matrices of three of these clusters, and it shows, to a degree, how these clusters fall into separate categories. But typing all these numbers by hand and figuring that out is really tedious and really error-prone. So wouldn't it be nice if we could work with these plots interactively? And the answer is: yes, it would be very nice, and we actually can. R supports interactive work with plots like that. There are two functions for interactive work with 2D plots: one is identify() and one is locator(). To use identify, we simply run identify(). What happens then is that our cursor changes to a crosshair, and we can click on individual points, and R remembers which points of the original data set we picked. You can keep doing that until you press escape, and then the points are labeled and the row numbers of the picked points are returned. That's really nice, because we can start doing things with it. For example, we could write a function that shows us what we picked, displays the actual profiles in a matrix plot, and finally returns the row numbers, so that we can use them, for example, to retrieve the genes. The first thing we'd like to do, if we want to get the data and display it, is to work with two windows. So we save the ID of the first window, open a second window, store its ID, and return focus to the first window. And then we define a function to pick genes. In principle, on entering the function, I set the focus to the second window and draw an empty plot; the empty plot will then accommodate the parallel coordinates of the actual expression values and labels. Then we go into the main picking loop, where we pick forever in a while loop, assign the values that we get, discover whether the escape key was pressed, and break from the loop if it was. And while we're in the picking loop, we just fetch the data, plot it in the parallel-coordinates plot, and display it. So that's actually pretty neat.
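Here is a compact sketch of such a picking function, under the assumption of two open graphics devices whose IDs we stored (devPlot and devProfile are hypothetical names); identify() returning an empty result is how R signals that escape was pressed:

dev.new(); devPlot <- dev.cur()        # window 1: the 2D embedding
dev.new(); devProfile <- dev.cur()     # window 2: the expression profiles

pickGenes <- function(xy, dat) {
  picked <- integer(0)
  dev.set(devProfile)                  # empty frame for the parallel coordinates
  plot(0, type = "n", xlim = c(1, ncol(dat)), ylim = range(dat),
       xlab = "time point", ylab = "log expression")
  dev.set(devPlot)
  repeat {                             # the main picking loop
    i <- identify(xy[, 1], xy[, 2], n = 1, plot = TRUE)
    if (length(i) == 0) break          # escape pressed: done picking
    picked <- c(picked, i)
    dev.set(devProfile)                # show the profiles picked so far
    matplot(t(dat[picked, , drop = FALSE]), type = "l", lty = 1)
    dev.set(devPlot)
  }
  picked                               # row numbers, e.g. to retrieve gene names
}

sel <- pickGenes(emb, dat)
rownames(dat)[sel]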
So let's have a look here. We start by reproducing our plot and starting to pick; let me make more space here. Okay, so let's look at this one, and here is its expression profile. And that one, and that one; they're very similar. And that one. And this one is from a different cluster, but it looks very close, so what does it do? It's here. Okay, I see; I can see a reason why it's in a different cluster. So this is this. And what about these blue ones: are they really different, given that they end up really different and far away? Something very different is going on there. Okay. So in this way we can start working with our two-dimensionally plotted data and actually discover and identify what these genes are. We can open second windows, we can open third windows, get more information, and play in interesting ways. And once we hit escape, we get the actual row numbers, so we can go back and fetch gene names or whatever. If you want an unpick option, you would need to code it, but the simple thing to do would be to remove the point from the list of picks and re-plot the parallel coordinates. There's no unpick function built in, but the way this code is written, it would be easy to add unpick functionality. In a similar vein: so far we still have to pick individual points. Wouldn't it be nice if we could just draw boundaries around them, especially if there are many genes that partially overlap (which would be good, because it means they're very similar in the plot), and select them all? For that, we can use the locator() function. locator gives us the x and y coordinates of the positions we click. So if we click into the plot a couple of times and then hit escape, we get a list of values, $x and $y, and we can use that to draw a polygon into the plot. Once the polygon is closed, we can determine what's inside and what's outside of it. That part is straightforward: we can use this function to draw lines, and if we capture the coordinates, it's really easy to close the polygon at the end; we simply connect the last point back to the first point. Then we have a shape, and that shape can be used to determine what's inside and what's outside. We then only need to go through our data points and determine, for each one, whether it's inside or outside the polygon. Now, if you think that's easy to code, you're wrong: inside/outside questions for polygons can be fiendishly difficult. Fortunately, there's a convenient function in the generalized additive modeling package, mgcv, that we can use. The function is called in.out(), and it makes the code very simple. In fact, I'm not going to go through the details, but it looks much like the code we wrote previously for the picking function. So: let's close the graphics windows, make a new device, a small empty frame for our matrix plots, set focus, replot our window, set focus on the second one, make an empty frame for our parallel coordinates, set the focus into this window, and use that function for our picks. Now, what shall we pick? Let's pick the points here. So now we've captured all the points in this polygon and put them on the profile plot. And let's contrast them with the points we get from over here. There we go. So this is a kind of more advanced version of our individual point picking.
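A sketch of the polygon selection, assuming the embedding coordinates are in emb as before; mgcv's in.out() takes the polygon boundary and a matrix of points and returns TRUE for the points that fall inside:

library(mgcv)   # for in.out()

plot(emb, pch = 19, cex = 0.6)

p <- locator(type = "l")               # click polygon vertices; press escape to finish
p$x <- c(p$x, p$x[1])                  # close the polygon:
p$y <- c(p$y, p$y[1])                  #   connect the last point back to the first
lines(p$x, p$y, col = "red")

bnd <- cbind(p$x, p$y)                 # boundary as a two-column matrix
inside <- in.out(bnd, cbind(emb[, 1], emb[, 2]))

points(emb[inside, ], col = "red")     # highlight the captured points ...
matplot(t(dat[inside, ]), type = "l", lty = 1)   # ... and plot their profiles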
We can work interactively with plots like that, extract and store information, and analyze it further. Okay. So in terms of clustering, I think that's all I wanted to go over. There's additional code in the script for 3D embedding; it's the same thing with t-SNE, only in three dimensions instead of two. That uses the RGL package, which is based on OpenGL graphics; on the Mac it runs on X11, and I'm not even sure what the underlying graphics layer is on Windows. It doesn't always work as expected, so I'm not even going to fire it up here; it depends a little on the current state of your operating system and what graphics libraries you have installed. But in terms of 2D plotting and clustering, I think we'll wrap up here. So, what have we done? Previously, we asked how to identify similar genes by looking at regression, by looking at correlations, and by working with dimension-reduction techniques like principal components analysis. In clustering, we generalize this and use it to pull out partitions from the data according to its internal structure. And we have to be careful, because of course there are many ways to partition the data, and not all of them are biologically relevant; it would be good to have criteria to identify which ones are the most relevant. In principle, techniques similar to those we used with dimension reduction apply. We can use basic clustering techniques; we can use t-stochastic neighbor embedding; and we can then use methods like the point picking or the locator function I've demonstrated to actually pull out the data sets and, in this way, do some manual further classification and further analysis.