The point where we left off was that we had installed libraries from Bioconductor: basic libraries for dealing with Bioconductor objects, a tool that can query and download GEO datasets from the NCBI repository, and a tool for statistical analysis. The dataset we're working with is a time-series analysis of HeLa cells, GSE26922. In that sense it's somewhat similar to the yeast dataset we've worked with, but it's much more recent and was done with a different technology.

Now, if we simply Google for the string GSE26922, we get the GEO accession viewer, which has information on the dataset and what it contains: periodic expression of cell-cycle genes, and so on. And there's a link, which is really nice: "Analyze with GEO2R". I'd like to briefly introduce this tool to you. You don't actually need to read all the codes and numbers. Basically, the dataset has microarray results for 18 samples: six time points with three biological replicates each. GEO2R is a tool that allows you to find differentially expressed genes in microarray datasets. You use the web interface to define groups, and then first look at the value distribution. What you expect is that the means are all the same and the variances are approximately the same, so there should not be any outliers in your dataset. This one is a nicely distributed, normalized dataset, suitable for biological analysis. Then you can run a script which gives you the most significantly differentially expressed genes, sorted by adjusted p-value. The top gene here is HIST1H2BM, a histone gene, which apparently has a highly significant expression change, and if we click on it, we get the actual distribution of values. The tool returns the top 250 of these. Now, the cool thing is that the entire procedure which calculates these values is available as an R script, linked at the bottom, and you can download that script and work with it. That's exactly what I've done for the first part of this tutorial: used exactly that procedure to reproduce what happens on GEO2R and find the top 250 differentially expressed genes in this dataset.

However, we found that simply downloading the dataset in the canonical way didn't work. There are odd problems with FTP access. We suspect it might have something to do with the OICR firewall, and it might have something to do with the FTP service at the NCBI itself, because the question of why this fails from time to time has been asked on the Bioconductor support site, and apparently the answer there was: yes, that happens, try again later. That's not what we want to do. For some quirky reason, some wrinkle in the fabric of time and space, Michelle was able to download the file on her computer, so we quickly converted it into an R data object and put it on the GitHub project site. If this first step, gset <- getGEO(...) and so on, fails for you, and I suspect it probably will, then simply pull the latest version of this RStudio project from GitHub; the file GSE26922.RData should appear. If you then load this file by its file name, you should receive the object gset, the one we would have assigned from this procedure, now simply recreated from stored data. Okay, is there still anybody who doesn't have their gset yet? Apparently not, excellent.
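In case it helps, here is a minimal sketch of that fallback logic, assuming, as described, that the stored file recreates an object named gset (getGEO() is from the Bioconductor package GEOquery):

    library(GEOquery)

    # try the canonical download first
    gset <- tryCatch(getGEO("GSE26922", GSEMatrix = TRUE),
                     error = function(e) NULL)

    if (is.null(gset)) {
      # FTP access failed: fall back to the stored object
      # (pull the project from GitHub first so the file exists)
      load("GSE26922.RData")   # recreates the object gset
    }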
Now we need to load that in. First we pull the appropriate ExpressionSet out of the gset list, and then we can have a look at what we have. The head of this object shows biological replicates, different versions of the same samples, and so on. The structure is that this is a factor, there are numeric and character columns, and we can use this for further analysis. So first we define proper group names, access the expression values, convert them to numeric, convert them to log values, and define the data we need for our analysis, essentially the experimental design. You have to be careful: this is written in a way where you can't go back, so if I've preloaded something previously, I just need to go through it again. If any of this fails along the way, you'll just need to retrace your steps.

These are basically all the commands of the standard workflow in the limma package for analyzing microarray data. The result is an object called TT, a "top table", which has the top 250 differentially expressed genes according to a p-value adjustment by false discovery rate. This p-value adjustment is important to compensate for multiple testing. We'll talk more about p-values later, but you probably all know that the p-value we're all looking for, to happily declare that something is significant, is 0.05, or 5%. When something has less than a 5% chance of occurring at random, we're inclined to say that something significant is going on; for example, a variation in my expression profile could be interpreted as true differential expression. But if we do a large number of tests, and in this case we do as many tests as we have individual genes, the chance of finding something that passes this 5% cutoff simply by random variation becomes exceedingly large. If we have 20,000 genes and apply a 5% cutoff to pure random variation, we'd have a thousand genes that are simply part of the random variation but appear to be significant. So the problem with multiple testing is that we have to adjust our p-values. The modern way to do this, and we'll talk a little more about that later, is to control the false discovery rate, and this is what's implemented in the limma package through the procedure called empirical Bayes, which produces the adjusted p-values.

The next three steps load NCBI platform annotations. What the gene set essentially contains are expression values for spots on microarrays, and we still need to identify what these spots correspond to; that information is in the platform annotation. Lines 118, 119, and 120 download that. Okay, apparently the interaction with NCBI is tricky today. I'll leave the part of the script that explains exactly how we formatted the file for you to read. We'll reconnect with the script at around line 220, after the point where the data has been generated. What you'll need to do is pull again; everybody, pull from the repository again. After the pull, GSE26922.rds should appear, and once the file has appeared, you need to read it in. Saving as RDS files is similar to saving as the RData files we've looked at before, with a slight difference: an RDS file holds a single data set, and when you read it in, you explicitly assign it.
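For orientation, here is a minimal sketch of the limma steps just described. The group labels and the assumption that the 18 columns are ordered by time point are mine, not from the GEO2R script, which does considerably more bookkeeping:

    library(limma)
    library(Biobase)

    # six time points, three replicates each (column order assumed)
    groups <- factor(rep(c("t0", "t2", "t4", "t6", "t8", "t12"), each = 3),
                     levels = c("t0", "t2", "t4", "t6", "t8", "t12"))
    design <- model.matrix(~ groups)           # t0 as baseline

    fit <- lmFit(exprs(gset[[1]]), design)     # linear model per gene
    fit <- eBayes(fit)                         # empirical Bayes moderation
    TT  <- topTable(fit, adjust.method = "fdr", number = 250)

And the difference between the two storage formats in a nutshell:

    # RData: can hold several objects; load() restores them
    # under their original names
    save(gset, file = "GSE26922.RData")
    load("GSE26922.RData")              # gset reappears, name is fixed

    # RDS: exactly one object; you choose the name when reading
    saveRDS(dat, file = "GSE26922.rds")
    dat <- readRDS("GSE26922.rds")      # explicit assignment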
So in this case, I would like you to use the function readRDS(), load the file GSE26922.rds, and assign it to the variable dat, not dat2 as in the original version here. When this is done, you should have the object dat available, which has gene names as row names and 239 entries selected from the top 250, with columns labeled t0, t2, t4, t6, t8, and t12 for the time points of the analysis. This is what we'll now use for clustering. Line 222 is the key here: your pull should have produced the RDS file, and then you load it into the variable dat, not dat2, with readRDS() and the file name. Okay, I hope we're ready to do some clustering now and actually have the data.

So what's clustering? Clustering is essentially an unsupervised machine-learning method. It works on data and tries to find categories into which the data can be suitably subdivided. The problem is "suitably". If you have a data set with, say, 50 elements, you could put everything into one category, and that wouldn't be very informative; or you could put every single element into its own category, 50 of them, and that wouldn't be informative either. So what you're trying to do is find some number of categories between one and 50 into which your elements can be subdivided. Now, there are many ways to subdivide your elements, and the question is how to do that in the best possible way. And whenever somebody tells you "do this in the best possible way", you immediately have to ask: best for what? That's the crux of clustering: there is no global definition of what "best" means here. It depends on your data, and it depends on what you want to do with the data; it depends on your idea of what "similar" means, in order to produce meaningful clusters. Because a cluster really is a set of elements that are more similar within the set than the sets are to one another: greater similarity within the set than between the sets. That's the idea of a cluster. What exactly this implies can be very different, because the notion of similarity underlying it all is not well defined. We can use similarity as a Pearson correlation of individual numbers, but that's a purely mathematical kind of similarity that makes no reference to any biological background knowledge you might have. For example, if we analyze only the expression values, and this is what we will do in this clustering, we lose the information that these are not just plain vectors but actually time series, so that there is a connection between the elements; they are not independent of each other, and that's something we ignore here. We treat the expression values in this clustering exercise as independent elements, but really we should be doing a time-series analysis that has some idea of the correlation between elements.

So let's have a very first quick look at our data set. I hope the data set you have loaded into the variable dat looks like this: the gene names, and then six columns of expression values, t0 to t12, the time points. And we can simply do a heat map of that, with one command, heatmap(), without any further explanation. If you see that, do you think it's somehow intuitive or obvious what we're seeing here? What are we seeing here?
Well, red and yellow and white colors. What could they mean? Yes, they could be expression levels, because after all that's what we put in. Essentially, the expression levels are encoded into a spectrum of colors, and the spectrum approximately corresponds to cooler and warmer colors. So presumably red values are low expression levels, and yellow or even white values correspond to higher expression levels. This is an encoding of values into color.

Then everything is arranged into columns, but the columns are no longer in the order we had before; it's now t8, t6, t0, and so on. The columns have been rearranged, and then we find these funny lines at the top. What could the lines mean? Have you ever seen lines like this before? Yes, you've seen them in a dendrogram, or more specifically probably in a phylogram that captures phylogenetic relationships. So this is a dendrogram, which, translating from the Greek, means a tree drawing or tree plot. It's an inverted tree: the root is at the top and the leaves are at the bottom. What it does is emphasize the relationships of similarity: these two columns are most similar, these two columns are most similar, these two columns are most similar, and then the average of these two columns is most similar to the average of those two columns, which in turn is similar to this one here. So there is a clear similarity structure. If we relate this back to our time points, this very simple heat map, without further parameters, seems to tell us that the first and last time points are rather similar, the second and fourth time points are rather similar, and the sixth and eighth time points are overall rather similar, because a large number of genes seem to have similar expression patterns at those time points. Now, to what degree we should actually expect t2 and t4 to be similar, I don't know, and why it didn't turn out that t0 and t2 are more similar, I don't know either. That depends on the details of the algorithm, and if you ran the algorithm with different parameters, this would probably change. We're just looking at what this shows us in principle. The other thing it shows us is this line of densely overlapping lettering, which presumably is the gene names, and another dendrogram on the side. So what this heat map does is cluster the columns, arranging them by similarity, and it also arranges the rows by similarity, ordering them so that groups of similar genes lie close together in the plot.

What you need to understand about dendrograms is that the information is in the topology, not in the ordering. Let me repeat that: the information is in the topology and not in the ordering. What does that mean? It means the information is in how the branches are connected, but not in how they are arranged at the bottom. You can flip the two subtrees at any internal node and get exactly the same tree. I could swap these two columns, and it would be the same tree; there's no difference between them. I could swap these two columns here, and again it would be the same tree. Or I could take this whole block and swap it in here, and again, the same tree. The algorithm somehow needs to decide how this swapping works out, but other than that, the ordering doesn't have any significance, and you have to be careful.
For example, imagine we had swapped these two columns. Then this block of red lines would have come to lie next to that block of red lines, and we might have concluded that there's some structure in our data, which there is not: the structure in the data is exactly the same whether swapped or not. So treat these things with some caution. When you try to interpret the block structure here, be aware that it is only valid to the degree that the blocks are actually connected within sub-branches. Anything that lies outside of, or between, sub-branches is not a valid block structure. For example, in this part here there's a connection running through it, and I could just as well swap it out.

Okay. If we replot the same thing, randomly picking only every fifth gene, it becomes more readable, but the structure is essentially the same. The actual range of values is from 5 to 13; these are the log expression values we're looking at. Now, there seems to be a set of genes that are low at t4 but high at t0 and t2, for example TPX2 and CCNA2, which I can just about read when I squint; these two genes here are quite similar. And there are also genes for which the inverse is true: high at t4 and t6 but low at t0 and t2, and that includes genes like MAP21 here. So this block of genes seems similar, and that block of genes seems similar. And just as we did with our microarray expression data before, we can say: well, they appear to have similar features, so maybe they also have similar expression patterns. We can plot the matrices, after defining them: this is our gene set one, and this is our gene set two, and lo and behold, that's actually the case: they vary in opposite ways. So what we saw just by eyeballing the heat map carries over to the actual expression values if we look at them as expression profiles.

Okay, so the first thing we can try here is hierarchical clustering. In hierarchical clustering, we start from a distance matrix. The distance matrix stores the result of calculating the distance, the degree of dissimilarity, of every element with every other element. So for 100 elements we get a matrix with 10,000 cells, 100 rows and 100 columns, and each cell contains the distance between two objects. There are many ways to compute such a distance; the default is the Euclidean distance. What's the Euclidean distance? Who still remembers their high-school geometry? In the plane, the Euclidean distance is the square root of the sum of the squared differences along the coordinate axes: sqrt((x1 - x2)^2 + (y1 - y2)^2), according to a very old theorem, Pythagoras' theorem, discovered a very long time ago. And this relationship generalizes to higher dimensions; in our case it needs to generalize to six dimensions. By taking the differences along each dimension, squaring them, adding them up, and taking the square root of the sum, I get a Euclidean distance. This is the most basic way to approach the notion of distance. So I can build a distance matrix, do hierarchical clustering, and plot the result: a cluster dendrogram. A minimal sketch of these steps follows below.
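The sketch assumes the six numeric time-point columns of dat sit in columns 2 to 7; adjust if your object differs:

    m  <- as.matrix(dat[ , 2:7])   # the six numeric time-point columns
    d  <- dist(m)                  # pairwise distances, Euclidean by default
    hc <- hclust(d)                # hierarchical clustering
    plot(hc)                       # the cluster dendrogram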
Now, let me say a little more about this notion of distances. There are many ways to calculate them, but there's one thing that's really, really important: if you use a specialized measure to calculate distances, you have to make sure that it actually is a metric in the mathematical sense. A metric requires three properties. The first is positivity: a distance is always greater than or equal to zero. The second is the identity property: two objects have a distance of zero exactly if they are the same object. And the third is the triangle inequality: for three points A, B, and C, the distance going directly from A to C can never be longer than going from A to B and then from B to C, i.e. d(A, C) <= d(A, B) + d(B, C). Or, paraphrasing: a detour is never a shortcut. You can define measures on data that violate the triangle inequality, and your analysis package will not complain if you use such a measure, because after all these are only numbers. However, if you do that, the implicit assumption that elements which are similar according to the distance measure end up clustered together need no longer be true. Something could look very similar to one object in a cluster and very dissimilar to its neighbor in the same cluster, because the cluster was computed through a slightly different path. Then everything we rely on when interpreting these clusters essentially breaks down.

Now, the metrics we use here, Euclidean, Canberra, maximum distance, Minkowski distance and so on, are all safe. They all work in the same way and give you somewhat different heat maps. This is the Euclidean one we just saw; I'm not even sure offhand how the Canberra metric is defined; this next one's distance function is the maximum. Anyway, we always get broadly similar clusterings.

But you're not confined to the built-in distance functions, and below is an example of defining your own: 1 minus the absolute Pearson correlation. This distance is small if the correlation is large and large if the correlation is small, so the distance is small exactly when the elements are considered very similar. The way to implement this is simply to define a function that computes this value on the measurements, and then call heatmap() and pass the function we've just defined as the distance function (see the sketch below). It's as simple as that: rather than plugging in a keyword for a function already known to R, you plug in the name of a function object. So this is the 1-minus-correlation map. A similar thing could be done with the maximal information coefficient that we've discussed.

So let's go on, using Dmax, and plot the cluster dendrogram. This seems to have a nicely distributed, rather even set of subdivisions, but obviously these aren't clusters yet. What does this have to do with clustering? It's a dendrogram. So how do we get clusters from a dendrogram? How could we define a cluster based on this tree structure? Any ideas? Well, these are genes, not species, and essentially that calculation has already been done: it has gone into computing the distance matrix, and the distance matrix has been used to build this tree. Things that are similar are grouped together in this tree.
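Here's what that custom distance function might look like as a sketch (cor() works column-wise, hence the transpose; m is the numeric matrix from above):

    # distance = 1 - |Pearson correlation| between gene profiles
    distCor <- function(x) as.dist(1 - abs(cor(t(x))))

    heatmap(m, distfun = distCor)   # pass the function, not a keyword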
But for this to be clustering, we said we are looking for sets of data that share a common property. So how do we get a cluster here? (A student points to the scale on the side, which indicates how distant elements are: this one is far apart from that one.) I'm looking for a somewhat more abstract answer. The question is really: how do I take a tree structure and make clusters from it? Mentally, I take a pair of scissors and I start cutting my tree. Imagine I cut the tree here: then this whole set of branches falls down, and I can say, that's a cluster. Everything that falls down if I cut at this level is one cluster, everything that falls down here is another cluster, and everything that falls down there is a third cluster. And that's exactly what the rect.hclust() function visualizes. If I say I would like two clusters, k = 2, it cuts at the top and the tree falls apart into two clusters: one is all these genes, the other all those genes. If I want five clusters, I cut slightly further down and get five cluster sets. And if I want ten clusters, I cut even further down and get ten (see the sketch below).

Now, that should give you some pause for thought, because didn't we say we want to find the best possible clustering? Apparently I can cluster this in some convincing way with two, five, ten, twenty, forty-two, any number of clusters. What's up with that? Well, the answer is: this is simply what the data looks like. Any kind of data can be clustered in different ways, and it's really not obvious to say that this clustering is better than that one. Sometimes you will have trees with two very clear branches where everything below is very similar; but in a well-distributed tree with lots of information, you can view the data as comprising two clusters or as comprising ten. In a sense, this corresponds to different degrees of resolution that we apply to our data. In hierarchical clustering, I actually think this is an advantage of the method, because if we look at biological data, by and large there is probably not an underlying mechanism that cleanly produces one, two, three, or five distinct categories. Usually the differences between our genes are rather continuous. In an expression profile, we might have genes expressed in one way, genes expressed in another way, and genes that are somewhere in the middle. Should those go into this cluster, or into that cluster, or should they form a cluster of their own? Hard to say; they are simply different. If we force everything to look black and white and fall into two clusters, the algorithms will do that; it's our responsibility to make sure that's actually meaningful. And if we use a partitioning clustering, which we'll meet in a moment, something like k-means or k-medoids, where we pre-define the number of clusters, the algorithm forces everything into five clusters or seven clusters, whatever parameter we give it. That's not necessarily a better way to do it, because it then becomes quite challenging to decide what the right, best possible number of clusters is. Okay.
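As a sketch, cutting the same tree at different depths (rect.hclust() draws the corresponding boxes onto an existing dendrogram plot):

    plot(hc)                 # hc from the earlier hclust() call
    rect.hclust(hc, k = 2)   # cut so the tree falls into two clusters
    rect.hclust(hc, k = 5)   # ... five
    rect.hclust(hc, k = 10)  # ... ten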
So that was a visualization of our clustering. To actually extract the elements, we use the function cutree(), and in this case I'm using it with k = 20 and assigning the output. What we get is, associated with every single gene, the number of the cluster it belongs to: HIST1H2BM is in cluster 1, CDC8 is in cluster 2, KIF14 is also in cluster 2, and so on. Basically this captures the output of the cutree() procedure: we cut at the level that gives us the desired number of clusters, and the cluster numbers are assigned to the individual gene names. We can tabulate this: we have 20 different clusters, and the table() function counts how often each cluster number occurs in the object. The number 1 occurs once, the number 2 occurs 24 times, the number 3 occurs four times, and so on. We can also sort that, which gives us the largest and smallest clusters in our data set.

Now we can use this to extract gene names and plot them in a parallel-coordinates plot, very similar to what we've done previously. The key is an expression like cls == 10 (whatever the vector of cluster assignments is called), which in the usual way gives us a vector of TRUEs and FALSEs, and those TRUEs and FALSEs are then used to subset our object dat, the one we painfully downloaded and recreated, which contains the expression values of our top 250 co-regulated genes. Then we transpose that and put it into a parallel-coordinates plot, and with a little bit of par() magic we put four of these plots onto the same screen (a sketch follows below). So that's what it looks like. This is now a clustering result: unlike what we did with PCA or our model-based correlations, where we looked for the most similar points, clustering requires assigning every data point to some cluster, and then we display the results. These are the four largest clusters if I cut into 20 clusters.

What do you think? Is this successful? Do you like the result? How would you interpret it? Well, the first thing I'd say is that if I look at the individual plots, I'm reasonably satisfied that each contains genes with similar expression profiles. What I'm less happy about is comparing cluster 2 and cluster 4: they look very similar to me. If you look carefully, you'll notice that the scale is somewhat different, so the profiles are offset; probably some of them are shallower and some have larger amplitude, and that is what puts them into different clusters. So once again we have a mathematical procedure that partitions our data in a mathematically defined way, which might not be what we would expect or desire when we approach the problem with our biological objectives and biological insight. As an alternative, we can use a different kind of clustering; here we get nine clusters in different colors from Ward's linkage clustering. And depending on our viewpoint, we can see the same problem: clusters that look similar to our biological eye but are sufficiently dissimilar, by the mathematical correlations and Ward's linkage, to be put into different clusters. We can try this with different methods and get similar results. So it's clear that these clusters are not as homogeneous as we might need them to be for biological interpretation.
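A minimal sketch of these steps, using base-R matplot() for the parallel-coordinates plots (the course script may use a different plotting helper):

    cls <- cutree(hc, k = 20)             # cluster number per gene
    head(cls)                             # named by gene
    sort(table(cls), decreasing = TRUE)   # cluster sizes, largest first

    par(mfrow = c(2, 2))                  # four plots on one screen
    for (k in as.numeric(names(sort(table(cls), decreasing = TRUE)))[1:4]) {
      matplot(t(m[cls == k, ]), type = "l", lty = 1,
              xlab = "time point", ylab = "log expression",
              main = sprintf("cluster %d", k))
    }
    par(mfrow = c(1, 1))                  # reset the layout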
Now, one reason this may be happening is that we require our cuts to always be at the same height. Maybe the structure of our clusters is such that we should really be cutting higher up or lower down, depending on the cluster itself. So the idea is to use not a constant cut height but a dynamic one, and there are statistical methods for that; in particular, there's a package called dynamicTreeCut, and if we apply it, we get a different result. But again, eyeballing it, we inherit the same kind of problems.

One thing the clustering algorithms do is pull apart profiles that have a similar shape but different absolute levels, and that's because we didn't normalize this data. So we could, for example, try clustering merely on profile shape, that is, on relative expression levels, simply by scaling all rows to between 0 and 1. We've talked about this scaling before; it's always the same: (x - min(x)) / (max(x) - min(x)), i.e. first align the values on the minimum and then divide by the range. If I do that, I get something like this, and I would say it's somewhat improved: there now seem to be clearer differences between the individual groups. For example, even though these look rather similar, the first and second values here are often lower than the first and second values there, the sixth value is often higher, and these things taken together make meaningful differences. This approach, working on normalized data with dynamic tree cutting, is probably as good as it gets with hierarchical clustering. The clusters now are of reasonable size, but from a biological point of view I still wouldn't be very happy arguing that some of them are really distinct.

So let's see whether partitioning clustering helps us. These are functions like k-means and k-medoids. Here, in principle, is how the k-means algorithm works. Think of a two-dimensional example: we plot points in two dimensions, and first we define the number of clusters we want, say four. Then we pick four coordinates in the plot completely at random and declare these to be the cluster centers. Then we say: every point belongs to the cluster whose center it is closest to. So we randomly place centers and assign points to them. After that, we take the clusters we now have and recalculate the centers, such that each new cluster center is the actual average coordinate of everything in its cluster. The cluster centers move. Then we do the same thing again: we assign every point to the closest of the new centers to define the next clustering, recalculate the centers after that, and so on, and we iterate until recalculating the cluster centers no longer changes the cluster memberships. At that point we say the algorithm has converged. This is relatively fast and can be done on very large datasets.
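A minimal sketch combining the 0-to-1 row scaling just described with a k-means run (kmeans() is base R; the choice of four centers is arbitrary here):

    # scale each row (gene) to [0, 1]: profile shape only
    scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
    mNorm   <- t(apply(m, 1, scale01))

    set.seed(112358)                 # the result depends on the random start
    km <- kmeans(mNorm, centers = 4)
    table(km$cluster)                # cluster sizes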
The problem is that it's a heuristic algorithm that depends on where our four cluster centers start out. It can get stuck in local minima; that is, you can get different clusters from different starting conditions. So what you need to do is try this several times and then either choose the clusters that make the most biological sense, or the clusters that appear most often, or something like that. Or you could say: well, my clusters don't seem to be stable; whenever I try this, I get a slightly different result, so probably this dataset can't be well clustered with k-means in the first place. That is also a possible outcome you want to be aware of.

Here's an example: plotting t0 against t6 and clustering into four separate clusters with k-means. I think it becomes very clear from this view that one of the problems with these clustering algorithms is: you will get clusters. That doesn't necessarily mean the dataset actually has cluster structure; in this two-dimensional representation the data looks very, very homogeneous to me, but you will get clusters. So using a clustering algorithm blindly and just accepting its result, without actually checking whether the clusters are the same or different and how similar or different they are, is probably not a good idea. If you do that, you're probably on the path to a paper retraction at some point.

An alternative to k-means is k-medoids. The algorithm works in exactly the same way, but rather than placing the cluster centers somewhere in the hypothetical space between the points, all operations use the coordinates of actual data points. This ensures that the cluster centers have an interpretation: the cluster center of a k-medoids classification is the most characteristic, most central gene of the respective group. A k-medoids clustering of the same data with four clusters looks like this; even though the algorithm is similar, you see that we get slightly different clusters. (A sketch of both calls follows at the end of this section.) There are other approaches, too. Brendan Frey of the U of T Computer Science Department got a method published in Science, wow, that's now eight years ago, time flies, called affinity propagation clustering. Essentially, this works by looking at every point and asking: do I have a neighbor? And if a point has a neighbor, it tries to hold hands with its neighbors and pull them closer together. After all the pulling is done, the things that have a mutual affinity end up in the same cluster. If we do this on a heat map of our data, we get a very nice block structure, and we can use this distance map to get this kind of clustering here.

So, which clustering should we be using? That's really hard to tell, but fortunately there are algorithms that help us evaluate it. There's no obvious biological criterion here to decide which of these clustering methods and strategies is better. Typically, it's always better to use some orthogonal information: for example, if you're looking at genes and you find that genes with shared transcription-factor binding sites end up in the same cluster, that would be a strong indication that the clustering actually makes biological sense.
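For reference, a minimal sketch of the two calls, assuming the cluster and apcluster packages are installed (mNorm is the scaled matrix from above; the negDistMat(r = 2) similarity is the usage shown in the apcluster documentation):

    library(cluster)          # pam(): "partitioning around medoids"
    pm <- pam(mNorm, k = 4)
    pm$medoids                # actual genes: the most central of each cluster

    library(apcluster)        # affinity propagation
    ap <- apcluster(negDistMat(r = 2), mNorm)
    length(ap@clusters)       # the number of clusters is found by the algorithm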
Now, we can't really say much about biological validation here, because we don't have that orthogonal information right now; but we can say something about mathematical validity, and there's a package for that called clValid, "cluster validation", which calculates mathematical quality metrics for the clusterings. What I'll do here is simply run a cluster validation on our data set, trying different numbers of clusters from k = 2 to k = 9 with a whole set of methods, hierarchical, k-means, self-organizing maps, k-medoids and so on, using internal validation. This clValid procedure runs all the clusterings for me and gives me results for the different values of k, and then I can use this table to make decisions. The metrics it computes are connectivity, the Dunn index, and the silhouette width, I'll get to those in a second, and it applies the different clustering approaches, hierarchical, k-means, Diana and so on, across these validation metrics. Let me show you the plots. For example, for the silhouette measure we see that it drops as we increase the number of clusters, and apparently one method has the lowest value, which is either the worst or the best; I'd have to look it up in the clValid documentation. Ideally, we would have hoped to see something like a dip, a value that falls and then rises again, which would tell me there is an intrinsic structure in the data that I can exploit: with too few clusters I get a poor result, with more clusters a better result, and with even more, a worse result again. That would tell me something about the optimal number of clusters, and that doesn't seem to be the case in my data.

I think it's a good time for a break, and I was just about to propose that; we'll reconvene after the coffee.

I, for one, wasn't really happy with those clustering results. We had a lot of clusters that all looked the same. So the question is, can we do something different? I've also taken this data through our beautiful t-stochastic neighbor embedding (t-SNE). We've already gone through how that works, with the crabs and the cell-cycle data. What I'd like to do now, after running t-SNE, is talk a little more about how we can use plots like this to actually identify elements. Because in the end, we don't just want to see that there are similar points here; we want to know what these points are. And I'd like to show you an example of how to work with such plots interactively.

The simplest way is exactly what we've done before: plot our points as numbers, squint at the numbers, and try to read what they are. Maybe we should make them a little smaller; now they're tiny and I can barely read them on my screen, but there's less overlap. So this is possible, but tedious, and even if we color them, it's still tedious to find out what these are. Incidentally, the colors here are derived from the affinity propagation clusters we had before. If we squint at this, we can, for example, pick three clusters from the lower right-hand corner as we did previously, define three different sets, and plot the expression profiles. So these are those expression profiles; the colors correspond to the colors in the previous plot.
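As an aside, a minimal sketch of how such an embedding and number plot could be produced (the Rtsne package and the perplexity value are my assumptions; ap is the affinity propagation result from before):

    library(Rtsne)
    set.seed(112358)
    ts <- Rtsne(mNorm, perplexity = 10)

    plot(ts$Y, type = "n", xlab = "", ylab = "")
    text(ts$Y, labels = seq_len(nrow(mNorm)), cex = 0.6,
         col = labels(ap, type = "enum"))   # color by affinity cluster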
So we have purple, orange, and yellow: these purple, orange, and yellow points over on this side. I think affinity propagation distinguishes these quite well, and t-SNE, even though it's not a clustering method, still has this intriguing property that in the two-dimensional representation, similar things come out close together. For example, this point cloud of purple numbers corresponds to this set of profiles. So even though the profiles are broadly similar, there are also clear differences, and you can see where those differences arise when you plot them this way. So the internal structure is well separated into clusters by affinity propagation, and t-SNE gives us a good indication of that.

But reading off these numbers is error-prone and tedious. Can we also select them interactively? Yes. R provides two functions for interactive work with 2D plots: identify() and locator(). identify() returns the row number of a picked point; to pick a point, we plot the data and then call identify(). Note that my mouse pointer now turns into a cross, and I can, for example, click on this point and click on that point and at some point press Escape to stop, and it tells me the row numbers of the two picked points. So that's one way to work interactively with plots: you click into the plot and identify the elements closest to where you clicked.

What we can do with that, in principle, is collect the information about these points and display it in a separate window as a parallel-coordinates plot, so we can explore the structure of our data at the same time as we interact with it. Now, I must admit I have no idea whether this works here: I developed it in plain R, and I don't know whether I can open separate graphics windows in RStudio. Oh, I can. Okay, that's great. So now we have two devices: one is our main graphics window and one is a secondary graphics window on the side; maybe I'll give it a little more space. And I define a function that sets the focus on my second window and plots into it, if there's anything to plot. So this just defines the function; let's re-plot the points and start picking. This is the window for our parallel-coordinates plot, and if I click on a point here, I get its parallel coordinates at the same time. I've also written the function so that the point I've picked gets marked. I can pick different ones and thus interactively explore the structure of my data, and afterwards the picked genes are available in a data structure, and I can print the row names for follow-up analysis. So this is one example of how to work interactively with graphics and plots, and I hope I've written enough comments into this little function that you can adapt it to whatever you need it for.

Now, we can use a conceptually similar procedure to not pick individual points but instead draw a frame around a set of points, using the function locator(). If we call locator() and just click a few times, we get x and y coordinates. So that looks straightforward. There's a type parameter that shows where our picks are and connects the points with lines, which draws me this polygon here. That's nice; the only thing we still need to do is close the polygon, and once the polygon is closed, we can draw it. So, something like this:
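A minimal sketch of that polygon picking, together with the point-in-polygon test that comes next (in.out() is from the mgcv package; xy, the plotted 2-D coordinates, is an assumption standing in for whatever embedding you are looking at):

    xy <- ts$Y                          # 2-D coordinates (assumed)
    plot(xy)

    p <- locator(type = "l")            # click vertices; press Esc to finish
    p <- cbind(c(p$x, p$x[1]),          # close the polygon by repeating
               c(p$y, p$y[1]))          # the first vertex at the end
    lines(p[(nrow(p) - 1):nrow(p), ])   # draw the closing edge

    library(mgcv)                       # in.out(): point-in-polygon test
    inside <- in.out(p, xy)             # logical: one value per point
    points(xy[inside, ], pch = 19)      # mark the points inside the polygon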
Now we have a closed polygon, which defines a set of points. Okay, this is nice. Once we have a polygon defined, it's not entirely trivial to determine which points are inside it and which are outside, but fortunately there are packages and algorithms for that: a convenient function in mgcv, the generalized additive models package, called in.out(). With that we can do essentially the same thing as before: depending on the focus of our windows, draw a polygon, capture its coordinates, close it, find all the points inside, and analyze those separately in the parallel-coordinates plot. I close my plot windows first, just to make sure we don't get confused about which frame has the focus, and open two frames: one for my original plot and one for the parallel-coordinates plot. Now we can select a cluster of points and finish by pressing Escape. Let's take this one here and surround it. So this is my surrounding polygon, and it now identifies all the points inside and plots them. In this example, once again, I see that there's a certain similarity between the points that t-SNE has brought into the same vicinity, and on the other hand, affinity propagation has also nicely distinguished a somewhat different structure in the time course between these two groups.

A similar thing can be done with t-stochastic neighbor embedding in three dimensions as well; the code is here, and you're welcome to try it out. But I think this is all that I wanted to say about clustering.