Okay, welcome back, everyone. Just so I have a sense of this, how many people are still working on their R and Bioconductor environment? Anybody? Just one? Okay, that's great. Anybody else having problems? Okay, very good. So this afternoon we're going to talk about clustering, classification, and feature selection. We'll start with an introduction to clustering, which really involves distance metrics and then two types of algorithms, hierarchical clustering and partitioning-based clustering. I'm also going to give you an example of classification, the major concepts involved in building a classifier, hopefully avoiding overfitting, and a little bit about cross-validation. And throughout, we'll talk about the concept of feature selection and how it helps in clustering and classification.

So, clustering is also called unsupervised learning, and it really involves the discovery of patterns in data. This is the situation where you have a data set and you don't know ahead of time what structure might be in it, so we often cluster in order to do what we call class discovery. For example, you take your whole gene expression data set, you cluster it, and lo and behold, some patients exhibit one profile and other patients exhibit a different one. You can think of this as grouping together the objects that are most similar or, conversely, least dissimilar. The objects can be genes, or patients, or samples, or both. So the question is: are there samples in my cohort of patients that can be subgrouped based on molecular profiling? You might want to find subgroups, but at the end of the day we're probably all here because we have some sort of clinical outcome data, and the final step is being able to associate the subgroups with clinical outcomes.

Now let's go over distance metrics. This is a key, critical step: you can't do clustering without a way to measure how similar, or dissimilar, two objects are. The classical distance function is Euclidean distance. Here, consider two patients, a patient x and a patient y, each with p features; these could be genes. We take the difference for each feature, square it, sum over all p features, and take the square root of that sum, and that's Euclidean distance. Another term for features is dimensions, so you sometimes hear about multidimensional or multivariate Euclidean distance, and that's how it's calculated. There's another distance function called Manhattan distance, which takes the absolute value of the difference for each feature between the two cases and sums those up. These two formulas tell you exactly how to calculate the two distances.
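Here's a minimal example of computing both in R; dist() works on the rows of a matrix, and the toy matrix below is made up for illustration:

# toy expression matrix: 4 patients (rows) x 6 genes (columns), made-up values
set.seed(1)
x <- matrix(rnorm(24), nrow = 4,
            dimnames = list(paste0("patient", 1:4), paste0("gene", 1:6)))

d.euc <- dist(x, method = "euclidean")   # square root of the sum of squared differences
d.man <- dist(x, method = "manhattan")   # sum of absolute differences
d.euc
d.man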
These distances are now taken as standard metrics and are rolled into virtually every clustering method, hierarchical or otherwise; you can simply choose Euclidean, or Manhattan, or others. A third metric that has some nice advantages for gene expression is one minus the correlation. For two samples we compute the Pearson correlation coefficient, which lies between minus one and one, and then take one minus that value. I won't show the formula here, because the correlation coefficient formula is kind of hairy, but people understand what it is. For standardized data this quantity is proportional to the squared Euclidean distance, but it has the very nice property of being invariant to the scale of measurement of the two samples, the two objects we're comparing. So if one sample has a higher dynamic range than the other, but the relative ordering of the genes is similar, the correlation will still be high, whereas Euclidean distance would be affected. This is really the principal reason why normalization of data is important: you try to make the dynamic range comparable across samples so that you can use metrics such as Euclidean distance effectively. Schematically, here are two samples with some genes up-regulated and some down-regulated: these two would be quite similar, and these two would be dissimilar. Yes, you can do it either way: you can compute it for each pair of genes or for each pair of patients, and in practice you do both.

So here is a heat map representation of a distance matrix. Each column is a patient and each row is a patient, and I've calculated the Euclidean distance between each pair of patients from the breast cancer gene expression data set that we'll work on in the lab; then I clustered them and sorted the patients according to that clustering. You can see that this set of patients is quite close together from a distance perspective and quite different from this block here. There might be some additional structure, but what you get out of this is that, from a Euclidean distance perspective, there are more or less two big blocks; hopefully everyone can see that. If you use Manhattan distance, you get a different kind of structure, and again the input data are identical and the clustering algorithm is identical; the only thing I've changed is the distance metric. So a different distance metric gives a different result. Now, it just so happens that this block and this block, put together, would probably correspond to this block here, so it's just a finer-level partition of the data.
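Before looking at the Pearson version, here's a sketch of how you might compute the correlation-based distance and draw this kind of heat map in R (the matrix is made up; in the lab you'd use the breast cancer expression matrix):

# rows = patients, columns = genes; cor() works on columns, so transpose first
set.seed(2)
x <- matrix(rnorm(200), nrow = 10)                     # 10 hypothetical patients x 20 genes
d.cor <- as.dist(1 - cor(t(x), method = "pearson"))    # 1 - Pearson correlation

# visualize the patient-by-patient distance matrix as a heat map
heatmap(as.matrix(d.cor), symm = TRUE, scale = "none")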
And finally, if you use Pearson, what's nice is that the distinguishing metric has a greater dynamic range, so when you visualize it you can really see that these samples are highly similar to one another and quite distinct from those. At the end of the day, what I want you to take away is that the distance metric matters: the choice of distance metric really does affect the result. So one shouldn't just blindly employ a distance metric; you should get to know your data beforehand and understand their properties. One of the first plots you'll make in the lab is simply the dynamic range of the expression values of each sample, which gives you a gauge of how well normalized, how comparable, the samples are. Yes, it's the same patients on the y axis, in the same ordering; that's why along the diagonal each patient is identical to itself. And yes, I think that really comes down to knowing your data, so it's important to do some exploratory data analysis ahead of time; we don't have time to cover that in this workshop. But if you quantile normalize your data, for example, and the dynamic range of all the samples is relatively similar, then you can probably safely use Euclidean distance. If you still want to be robust to that, you might use the Pearson correlation coefficient as your distance metric.

Here's another distance metric you might encounter; again, this is just to give you the language so that whether you're talking to the person analyzing your data or analyzing it yourself, you understand what the terms mean. Hamming distance is often used for ordinal, binary, or categorical data. Essentially it counts the number of features that differ between two samples. You might have reduced your gene expression data to up, neutral, or down, for example, and then for each feature between two samples you ask: are they the same or different? Continuous data? Yes.

Okay, that more or less covers the distance metrics I wanted to go over, so let's talk about approaches to clustering. There are really two main categories of clustering algorithms: partitioning-based and hierarchical. Under partitioning, you may have heard of k-means (how many people have heard of k-means?), and then there is k-medoids, also called partitioning around medoids, which is a closely related cousin of k-means; I'll explain the similarities and differences. Within the partitioning methods you also have model-based approaches, which I'll touch on briefly; they're more advanced from a statistical perspective. The hierarchical methods involve building nested clusters, where essentially you start with pairs and build up a tree to the root, and I'll show you how that works.

So, partitioning methods. To do partitioning, you need a data matrix as input, you need a distance function, and you need to specify ahead of time the number of groups you want to partition the data into.
And the output is essentially a group assignment for every object: you take each object and assign it to a group. Here's the algorithm; it's not showing up very well on the slide, but essentially you first initialize the group centers, also called centroids or, in k-medoids, medoids. Then you assign each object to the nearest centroid according to the distance metric: you take a point in space and say, this is my centroid, and I have other centroids as well; I compute the distance from each object to each centroid and assign the object to the closest one. Once you've made those assignments, you recompute the centroids, taking each group independently and computing its center; that might mean recalculating the mean of the group, or it might mean picking a new actual data point as the centroid. Then you repeat these last two steps until the assignments stabilize.

Here's an output on our breast cancer data again. You can see there's a well-separated group here, another group over here, and another over here, and then two groups that overlap and are really quite ambiguous: it's hard to determine which group the X's and the pluses would be better assigned to. I used five groups here, and one could argue that four groups might be better. This just gives you a flavor of what the method does, and in fact you're going to do this yourselves.

Let me sketch it on the board; hopefully everyone can see. Yes, this is all still unsupervised so far, and we'll get to supervised. Say we have two groups of data like this, and we initialize centroids, one over here and one over here. You calculate the distance between each point and each centroid and pick the closest. So this one would obviously be assigned here, this one here, this one may actually be assigned over here, this one here. This one is quite ambiguous, you don't really know, but because you've calculated the distance precisely, you have to choose one, so let's choose that, and the rest get assigned like this. Then we recalculate the centroids: these three points were assigned to that centroid, so its new position would probably be somewhere around here. You can quickly see that in just a few steps it converges and stabilizes. That's schematically how it works. Yes, initialization is actually a key part of this, and I'll talk about it in a minute. And sure, the distance matrix is a nice way to visualize the data: it doesn't matter how many dimensions you have, you can always plot a distance matrix, whereas a high-dimensional data set itself is difficult to visualize. So now let's look at the difference between k-means and k-medoids.
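Before comparing the two, here's a minimal from-scratch sketch in R of the alternating steps I just drew on the board (the toy data and k = 2 are assumptions; in practice you'd call the built-in functions shown a bit later):

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # one toy cloud of points
           matrix(rnorm(40, mean = 4), ncol = 2))   # a second toy cloud
k <- 2
centroids <- x[sample(nrow(x), k), ]                # step 0: initialize centers from the data

for (iter in 1:10) {
  # step 1: assign each object to the nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # step 2: recompute each centroid as the mean of its group
  centroids <- t(sapply(1:k, function(g) colMeans(x[assignment == g, , drop = FALSE])))
}
table(assignment)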
In k-means, the centroids are the means of the clusters; that's what I've shown here. In k-medoids, the centroids are actual data objects: you might pick this point as one centroid and maybe this one as another. The advantage of that is that you compute the distance matrix between the objects once and then just look it up each time, whereas in k-means the centroids need to be recomputed at every iteration. Initialization can also be difficult because the notion of a centroid might be unclear before you start; in a high-dimensional data set, how do you define what a centroid is? With k-medoids you just pick a case, a sample or a gene, and that's your centroid, which is much more interpretable. In R, the command for running k-means is simply kmeans, and for k-medoids it's pam, partitioning around medoids. We're going to use pam in the lab so you get a much better feel for how it works.

So, in general, what are the advantages and disadvantages of partitioning-based methods? An advantage is that the number of groups is well defined; there's no ambiguity as with hierarchical clustering, where, as you'll see, you have to post-process the result and say, okay, I'm going to cut the dendrogram at this point and that gives me so many groups. The problem is that you will always get exactly the number of groups you asked for, and as I showed in that plot, that may not always be appropriate. There are methods to intelligently choose the number of groups, but the disadvantage remains that you have to choose it a priori. With partitioning-based methods you get a clear, deterministic assignment of each object to a group, which is a nice advantage when the data are clean and well separated; but sometimes an object doesn't fit well into any cluster (these two objects, for example, are somewhat ambiguous), and that can be a problem. The algorithms for inference are quite simple, but the flip side is that they're often very sensitive to initialization: if I initialized here and here, I would do well, but if I initialized here and here, I would not. So what we often do is run multiple restarts and choose the run that separates the clusters best.
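Here's a minimal sketch of calling both partitioning functions in R; the nstart argument to kmeans gives the multiple random restarts just mentioned (the toy data and k = 2 are assumptions):

library(cluster)                              # provides pam()
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # toy two-cloud data again
           matrix(rnorm(40, mean = 4), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 25)     # 25 random restarts, keep the best run
km$cluster                                    # deterministic group assignment per object

pm <- pam(x, k = 2)                           # k-medoids: centers are actual data points
pm$clustering
pm$medoids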
Okay, now let's talk about agglomerative hierarchical clustering. Everyone has seen these types of heat maps before. As input we have this data matrix, and essentially what we do is reorder its rows and columns according to a distance metric. So what you need as input is a distance matrix and what we call a linkage method, which I'm going to describe. The output is a tree called a dendrogram that defines the relationships between objects and the distances between clusters; essentially it represents a nested sequence of clusters, where you start at the leaves and build the tree up to the root. So let's think about linkage methods for a minute. A linkage method defines how two clusters get pulled together, and the way you do that also affects the results, so let's go through the different types you may encounter. Say we have two groups, delineated by the green circles and the blue squares. Single linkage takes the minimum pairwise distance between any two objects, one green and one blue: you compute all the pairwise distances and the smallest one is what's used. Complete linkage takes the largest one. Then there's centroid linkage: here's the centroid of the green group, here's the centroid of the blue group, and you take the distance between them. And average linkage takes all the pairwise distances and averages them. Yes, between two clusters; it's the method for joining two clusters.

Another very nice linkage method is Ward's, which forms the partition that minimizes the loss associated with each merge. You can imagine that if you merge two groups together erroneously, there's a loss function describing the error involved in doing so, and here that error is defined as the error sum of squares (ESS). Consider ten objects with these scores. Within each group you take the mean, and then for each object you take its difference from the mean and square it. For this group of ten, the mean is 2.5, so the error sum of squares is each value minus 2.5, squared, summed over all the objects, which gives you this number. If you had a perfect grouping, you would put the zeros together, the twos together, the sixes together, and the fives together, giving four groups. Within each group every value equals the group mean (here the mean is zero, here it's two, and so on), so each group's ESS is zero, and the total ESS, the sum over all groups, is zero. So if you get the correct assignment you have zero error: for these ten scores, four clusters is a perfect clustering with an ESS of zero. We'll look at the difference between the linkage methods in a moment.
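Here's a small worked example of the error sum of squares in R; the actual ten scores from the slide aren't in the transcript, so the vector below is a made-up stand-in containing zeros, twos, fives, and sixes, and the point is just the calculation:

# hypothetical scores standing in for the slide's example
scores <- c(0, 0, 0, 2, 2, 2, 5, 5, 6, 6)

ess <- function(v) sum((v - mean(v))^2)   # error sum of squares for one group
ess(scores)                               # all ten objects treated as one group

# a "perfect" grouping: identical values grouped together, so every group ESS is 0
groups <- split(scores, scores)
sum(sapply(groups, ess))                  # total ESS = 0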
Okay, and I hope this didn't print this way in the slides, but this should be an open double quote rather than a straight one; I only just noticed. Anyway, let's look at linkage methods in action. Here I've clustered the samples in our breast cancer data set using single linkage, and you're going to do this exercise yourselves. With single linkage, what tends to happen is you get these long chains of clusters, which can be quite difficult to interpret. How would you actually interpret this? Does anybody have ideas of how you would visually subgroup it? There might be a group here, and maybe a group over here, but the rest just look like linear chains; this one is really quite difficult to interpret. And again, this is identical data and an identical distance metric; the only thing I've changed is the linkage method.

If you move from single linkage to complete linkage, which remember is the maximum distance between any pair, you get something that looks like this, and it's starting to look a lot more interpretable. One could imagine cutting the dendrogram here, or maybe even here, and saying there's a group here, maybe one here, one here, one here; this becomes a lot more pleasant to deal with. Here's centroid linkage. It has the problem that the merge heights are not monotonically increasing, so you get these really weird-looking clusters. The reason, which I'll come back to in a minute, is that once two clusters have been merged, you can't go back and undo the merge; that's the nature of hierarchical clustering. And this is average linkage; again, it has some fairly nice properties. Imagine a cluster here, one here, one here, one here, and then some individuals here. And this is Ward clustering, which is arguably the most interpretable: you can slice the dendrogram across here and get one, two, three, four, five, six groups. So again, the moral of the story is that the linkage method you choose really affects the results. In some ways a lot of this becomes subjective: you should know your data, know what you might expect to see, and choose something that makes sense in the output. It's hard to say which result is more correct, because we're doing discovery and don't know what we're trying to find, but just be aware that different linkage methods make a difference, and some may give a cleaner or more interpretable result. Any questions so far? Yes, that's right, with that chaining there are just too many branches.

So let's look at the advantages and disadvantages of hierarchical clustering. An advantage is that small clusters can be nested inside large ones, so sometimes we can pick out the really tight small clusters as the important ones. A disadvantage is that the clusters might not be naturally represented by a hierarchical structure, and once a case has been merged into a cluster, that can't be undone later, which can be problematic. Another advantage, in contrast to partitioning methods, is that there's no need to specify the number of groups ahead of time; but as I showed, you have to cut the dendrogram to produce clusters, and where to cut is often arbitrary. You look at the plot and say, okay, I think there are these groups here, and that's what I'll do, so it can be subjective. You can also use a number of different linkage methods, which I think is a strength. However, as we saw with centroid linkage, bottom-up clustering can produce poor structure at the top of the tree, because the early joins can't be undone. This is what we call a greedy algorithm: like the partitioning methods, it converges to a local optimum and often doesn't find the globally optimal solution. Yes, the vertical lines represent the distance between the clusters being joined, according to the linkage, so these two groups are quite distant from each other because they're joined by long vertical lines.
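A minimal sketch in R of running the same distance matrix through several linkage methods and then cutting the Ward tree into six groups (the matrix and the group count are assumptions; in the lab you'd use the breast cancer samples):

set.seed(3)
x <- matrix(rnorm(30 * 50), nrow = 30)        # 30 hypothetical samples x 50 genes
d <- dist(x)                                  # Euclidean distances between samples

hc.single   <- hclust(d, method = "single")   # tends to produce long chains
hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")
hc.ward     <- hclust(d, method = "ward.D2")  # Ward's minimum-variance linkage

plot(hc.ward)                                 # dendrogram
cutree(hc.ward, k = 6)                        # slice into six groups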
Okay, now let's briefly discuss model-based approaches to clustering. We assume the data are generated from a mixture of, say, k distributions, and the task is to infer a cluster assignment and the parameters of those distributions that best explain the data. Another way to say this is that we fit a model to the data and look for the fit that explains the data in the most parsimonious way. A classical example is a mixture of Gaussians, a mixture of normal distributions, for continuous data, and I'll show an example in a minute. The advantage is that you can draw on well-established probability theory, well-defined distributions, and statistics, and represent the data mathematically in a principled way.

Here's an example, returning to array CGH data. You really have three types of states along the chromosome: neutral, the points centered around zero; losses, the red squares here; and gains, the green squares. You can see the data are a bit noisy, so there are singleton probes that an expert looking at the profile would classify as neutral. You could imagine just drawing thresholds, say anything below minus 0.1 I call a loss, anything above plus 0.1 a gain, and everything else neutral, but then you would classify all these single outlier points as aberrant when in fact they're neutral. Instead, we can take advantage of probability distributions that model this noise. If you consider a distribution like this red curve that may have generated this data, a different distribution that may have generated the blue data, and a third that generated the green data, then we have a principled, quantitative measure of how likely each data point is to have been generated by each distribution. Does that make sense? Yes, the disadvantage of these approaches is that you have to specify the number of groups, the number of distributions, although there are principled methods for doing that, such as the Bayesian information criterion. And yes, you could do it that way, and then evaluate, again in a mathematically principled way, which model best fits the data using model selection; that's a bit advanced, but yes.
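The array CGH model I'm about to show is a custom method, but just as a generic illustration of fitting a mixture of Gaussians in R you could use the mclust package; the simulated log-ratios and the choice of up to three components below are assumptions for the example:

library(mclust)

# simulated log-ratios: losses around -0.5, neutral around 0, gains around +0.5
set.seed(4)
logratio <- c(rnorm(50, -0.5, 0.1), rnorm(200, 0, 0.1), rnorm(50, 0.5, 0.1))

fit <- Mclust(logratio, G = 1:3)   # fit mixtures of 1 to 3 normals, choose by BIC
summary(fit)
fit$classification                 # most likely state for each probe
head(fit$z)                        # posterior probability of each component per probe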
So here's an example: a model-based approach to clustering array CGH data that illustrates all of these concepts. Just ignore the model on the left, because I think it will be confusing, and look at the part on the right. Say you have a set of array CGH samples and we want to cluster them into groups. If there are recurrent alterations in one group versus another, we can try to detect those, and then infer what we call a profile that represents each group. It might look like this, where this curve is the probability of finding a gain at that position within the group, and the green curve is the probability of finding a loss. Then we can do what we call feature selection: which of these features is most discriminative between the groups? We end up with these sparse profiles, which then become a model for each group: this group is characterized by gains in this region and losses in these two regions, and that group by gains in this region and losses in those two regions. So what a model-based approach gives us, in addition to the clustering itself, the assignment of each object to a group, is a model that tells us what each group looks like, which is something we don't get from hierarchical clustering or partitioning-based methods. Similarly, we infer the shapes of the distributions: losses have a distribution that looks like this, gains like this, and neutral like this. So you really get a model for each group, and the nice thing is that you essentially get a classifier for free: when you have a new patient, you compare that patient to each of the models derived by model-based clustering and see which one it fits best. Does that make sense? Okay. And again, choosing the number of groups, which has already come up, becomes a model selection problem; you can use the Bayesian information criterion, and I've pointed to a reference here on model selection in bioinformatics.

Here's an applied example from some of my own work, taking that array CGH clustering method and applying it to a cohort of 106 follicular lymphoma patients. It's one of these heat maps where each row is a patient and each column is a probe, a feature, on the array. The data separate nicely into groups: this one is characterized by gains of chromosome 7, this one by gains of chromosome 18, this one by gains of 1p, then there's a group with gains of 6p and a loss of 6q, and one group in the middle that's quite sparse; it doesn't have much going on at all. We clustered the data, and we had a clinical endpoint, which was time to transformation of follicular lymphoma to a more aggressive lymphoma, diffuse large B-cell lymphoma (DLBCL), and we plotted the survival curves. So this is the whole pipeline: take the data, cluster it, and then look at the association with outcome. And there was a nice association here: the cases with the aberrations on chromosomes 6 and 7 had a much shorter time to transformation to DLBCL than the other groups. This is suggestive, it's a relatively small cohort, but at least it suggests there is some association, and without clustering the data we might not have seen it in the first place. Now, just to orient you: this is the heat map of the data.
And this plot up here shows the actual sparse profiles. Again, you get a representation of each group, and you can imagine taking a new patient, comparing it to this profile versus this one, this one, this one, and this one, seeing which it most closely matches, and perhaps making some prediction with regard to the prognosis of transformation.

Okay, let's spend a minute on feature selection. The reason sparse profiles help, as I mentioned earlier, is that in high-dimensional data sets most features, whether they're genes, SNP probe sets, or BAC clones, are actually uninformative. Some genes in an expression data set are simply not expressed at all; others are expressed ubiquitously and are always highly expressed no matter what, the so-called housekeeping genes. Or, in cancer, you have what we call passenger alterations, which are just a result of genomic instability and don't actually contribute to tumorigenesis. These are often considered biological noise, and we want to ignore them when we're trying to extract structure from the data. The message is that clustering, and also classification, has a much higher chance of success if these uninformative features are removed first.

So let's discuss some simple approaches. One can measure the variability of each gene's expression across all the samples and pick the intrinsically variable ones: genes that are uniform across the samples won't give you any information about how to separate those samples, whereas variable genes will. You can use different measures of variability, such as the interquartile range, entropy, or simply the standard deviation. Another approach is to require a minimum level of expression in some proportion of samples: for a gene to be kept, it has to be expressed above some level A in at least k of the samples. This is called k-over-A (or p-over-A) filtering, and we'll do it in the lab using the genefilter package; a sketch of both kinds of filter follows below. Those are some simple tricks, and actually important steps, to carry out when doing feature selection for clustering.
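Here's a minimal sketch of both filters with the genefilter Bioconductor package on a made-up expression matrix; the cutoffs (IQR above 0.5, and expression above A = 100 in at least k = 5 samples) are arbitrary choices for the example:

library(genefilter)

set.seed(5)
expr <- matrix(2^rnorm(1000 * 20, mean = 7), nrow = 1000,   # 1000 genes x 20 samples
               dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:20)))

# variability filter: keep genes whose interquartile range exceeds a cutoff
keep.iqr <- apply(expr, 1, IQR) > 0.5

# k-over-A filter: require expression above A = 100 in at least k = 5 samples
flist <- filterfun(kOverA(k = 5, A = 100))
keep.kA <- genefilter(expr, flist)

filtered <- expr[keep.iqr & keep.kA, ]
dim(filtered)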
Clustering certainly doesn't end there; I've really only given you a brief overview. There are more advanced topics: instead of the bottom-up clustering I showed for hierarchical methods, there are methods for top-down clustering; one can do biclustering, or two-way clustering, clustering the genes and samples simultaneously rather than sequentially as we're doing; there's principal components analysis; and there's the question of choosing the number of groups via model selection, where the Akaike information criterion, the Bayesian information criterion, the silhouette coefficient, and the gap statistic are terms you might see. I'll point you to some references, which I haven't compiled yet but promise I will, where you can look these up, and to a couple of textbooks; this is all textbook material. And finally, a topic of great interest to me as a computer scientist and probabilist is doing clustering and feature selection simultaneously. The model I showed actually adapts as it converges: it might change which features are important based on the clustering, and that choice of features might change the clustering, which in turn informs the feature selection again. So it's adaptive and considers both things at once, whereas most methods do the feature selection step first and then do the clustering, which can mean information loss. That's just something to consider.

Okay, a quick review of clustering. There are three main approaches: hierarchical, partitioning, and model-based. Feature selection is very important, and there are two reasons for that; one I didn't discuss before is that feature selection also reduces computational time. That matters because distance calculations are almost invariably an n-squared operation, meaning you have to look at every pair of objects in your data set. With 22,000 features that quickly gets into the billions of operations, whereas if you reduce it to, say, 100 features, you're talking about roughly 10,000 pairwise distances, which is of course orders of magnitude faster. To reiterate: the distance metric matters, and the linkage method matters in hierarchical clustering. And model-based approaches, although much more advanced from a statistical point of view, offer principled probabilistic models where you can unambiguously and mathematically describe what you've done.

Let's see what time we're at; the break is at 2:40? Okay, that's perfect. So we'll do 10 or 15 minutes on classification, then take a break. Yes, that's correct, now we get into supervised methods. And yes, people have done hybrids of model-based and hierarchical clustering; it's kind of on the fringes of statistics, though, and I think there's only one paper I've come across that does it. It's not that it wouldn't be a good idea, it would be trying to get the best of both worlds, but the machinery needs to be invented to do it properly.

Okay, so questions have come up about unsupervised versus supervised. This is supervised learning now: classification, also called discriminant analysis. The big difference is that we work from a set of objects with predefined classes. Recall that with clustering we were trying to discover classes; here we're given the classes along with the features. So one can think about building a classifier that distinguishes between, say, basal and luminal, or a good responder versus a poor responder. The task is to learn, from the features of the objects, the basis for discrimination, and then apply that to new data for which we don't know the classes. Classification is much more statistically and mathematically heavy than clustering, so I thought I'd illustrate the main points schematically. You could have, for example, patients whose gene expression profiles look something like this and who have a poor response, and patients who look like this and have a good response. You want to learn a classifier that essentially provides a model for poor response and a model for good response.
So when you have a new patient, you compare them to each of these two models and ask which one they fit more closely, or get some probability that they belong to this group versus that one. A very nice example of this with gene expression data is a paper by George Wright et al., from Lou Staudt's group, working on diffuse large B-cell lymphoma, published in PNAS in 2003. Essentially, this group built a classifier that can distinguish the cell of origin of DLBCL based on expression. The implication is that the cell of origin, ABC or GCB (I've forgotten what these stand for; GCB is presumably germinal center B-cell, and what's the A? any help from the pathologists? okay, it doesn't matter, and we can abstract this to any pair of types), shows quite a striking pattern that defines the ABC DLBCLs, and another that defines the GCB DLBCLs. You can learn a classifier that, given a new case, gives a quantitative output of which class it belongs to given its expression profile. So all of these patients would have a near 100% probability of being ABC and almost zero probability of being GCB, and vice versa: these GCB cases have a very high probability here and a low probability of being ABC. The pink curve is the probability of ABC and the blue curve is the probability of GCB. Why this is important, of course, is that GCB and ABC have different outcomes, so this becomes a prognostic marker: the GCB subtype does much better in terms of overall survival than the ABC subtype, and these are two different experiments, on different expression platforms, reaching the same conclusion.

So let's talk about how they did this, because it's simple and elegant. Did you find what it is? Activated B-cell. Okay, so germinal center B-cell versus activated B-cell; thank you, we all learned something. So this is a model that takes feature selection into account, but not purely deterministically: it doesn't just apply a threshold and ignore genes; it weights the contribution of each gene to the predictor, as shown here. The weight of gene j is determined by a t statistic: given a labeling of the data (we know which cases are ABC and which are GCB), we compute a t statistic comparing the distribution of that gene's expression in one group versus the other, and use it as the weight in what's called a linear predictor score. So this is the weight, and this is the expression level of that gene, and the score sums the weighted expression over the genes. We can then assume that the linear predictor scores follow two distinct distributions, one for ABC and one for GCB, and we can learn one of these for ABC and one for GCB, very much like the distributions I mentioned earlier. Schematically, you can imagine one distribution for GCB and one for ABC, and given a way of evaluating the density for a particular case, you can classify cases according to those distributions.
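As a rough sketch of the linear predictor score idea in R (this is an illustration of the concept, not the Wright et al. code; the expression matrix, class labels, and gene count are all made up):

library(genefilter)    # rowttests()

set.seed(6)
expr <- cbind(matrix(rnorm(100 * 10, mean = 0), ncol = 10),   # 10 "GCB"-like samples
              matrix(rnorm(100 * 10, mean = 1), ncol = 10))   # 10 "ABC"-like samples
classes <- factor(rep(c("GCB", "ABC"), each = 10))

# weight each gene by its t statistic between the two classes
tt <- rowttests(expr, classes)
w  <- tt$statistic

# linear predictor score per sample: sum over genes of weight * expression
lps <- as.vector(t(expr) %*% w)
lps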
Again, there's principled mathematical machinery here: you can use Bayes' rule to determine the probability that a sample comes from, say, group one. You take the linear predictor score of a given case and, using the parameters of the score distribution for group one, say ABC, evaluate its density under that distribution; dividing by the sum of the densities under the two distributions gives you the posterior probability that the case is in group one. So this is all very nice. I mentioned that they use weighting, but they also did some deterministic feature selection, using a method called cross-validation. This is an important technique I want to discuss because it deals with the problem of overfitting. You take the set of samples you want to learn a classifier from, use all but one of them as the training set, and leave one out for testing. You fit the model on the training data and then ask: can the classifier correctly predict the class of the remaining case? We know what that class is, because it came from the training data, so we can evaluate the prediction; and then we repeat this exhaustively, leaving out each sample in turn, which gives us an honest way of measuring the accuracy of the classifier. Then we repeat this using different sets and numbers of genes, again ranked by the t statistic (the top 50, the top 100, the top 150, et cetera), and see which set of genes gives the highest accuracy. That's the method they used to select their features, and I believe they ended up with something on the order of tens of features, I think around 40, that were the most discriminative. So you really collapse the data set down from 22,000 features to around 40 that give you a reasonable classification.

Just a few words on overfitting. In many biological settings the number of features is much larger than the number of samples, so important features may simply not be represented in the training data, and this can result in overfitting. Overfitting happens when a classifier discriminates well on its training data but does not generalize to external cohorts or independently derived data sets. So when you're building a classifier, validation in at least one external cohort is required before you believe the results. The reason some of these expression signatures have become dogma in the breast cancer world is that they have been validated in numerous data sets, on different platforms, and so on.

Some advanced topics: one can use Bayesian priors to regularize the parameter estimates of the model; I can talk more about that offline if people are interested. And some methods now integrate feature selection and classification in a unified analytical framework: in the same way I described for the CGH problem, people have done this for classification of gene expression. There's a really nice piece of work by Alexander Hartemink at Duke, with software into which you can import your labeled gene expression data, and it will simultaneously classify and do the feature selection, so at the end of the day you get not only a classifier but also the most discriminative set of genes, which can tell you something about the biology as well. And again, cross-validation should always be used when training a classifier.
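Here's a minimal leave-one-out cross-validation sketch in R, using a simple nearest-centroid rule as a stand-in classifier; the data and the classifier are made-up placeholders, and the point is the leave-one-out loop itself:

set.seed(7)
expr <- cbind(matrix(rnorm(50 * 8, mean = 0), ncol = 8),
              matrix(rnorm(50 * 8, mean = 1), ncol = 8))   # 50 genes x 16 samples
classes <- factor(rep(c("A", "B"), each = 8))

predicted <- character(ncol(expr))
for (i in seq_len(ncol(expr))) {
  train  <- expr[, -i]
  labels <- classes[-i]
  # "fit" on the training set: one mean expression profile per class
  centroids <- sapply(levels(classes), function(cl) rowMeans(train[, labels == cl]))
  # predict the held-out sample as its closest class centroid (Euclidean distance)
  d <- apply(centroids, 2, function(m) sqrt(sum((expr[, i] - m)^2)))
  predicted[i] <- names(which.min(d))
}
mean(predicted == classes)   # leave-one-out accuracy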
Now, some ways of evaluating a classifier. This is an example of a receiver operating characteristic curve, an ROC curve. Essentially, it plots the true positive rate against the false positive rate, and I'll go through how that's calculated. You need ground truth, a data set for which you know the classes, and some sort of probabilistic classifier, similar to the one I showed. You can't do this with a deterministic classifier that simply says "it's this" or "it's that"; you need a classifier that says it's this with some probability. The reason is that you can then set a series of probability thresholds and, at each threshold, compute the true positive rate, which is the proportion of positives in the ground truth set that were predicted as positive, and the false positive rate, which is the proportion of true negatives that were incorrectly predicted as positive. You do this for a range of thresholds and plot a curve that looks like this, and the closer the curve gets to the top left corner, the more accurate your classifier; if you get something along the diagonal, that's essentially as good as random, which is not good. Some other classification methods you may see in the literature are support vector machines, linear discriminant analysis, logistic regression, and random forests. I thought these were probably a bit too heavy-duty for this audience, but if you're interested I'd recommend these two review papers, which cover these topics and will point you to the original development of the methods. So I think it's a reasonable time to take some questions, and then we can take a break. Any questions?

On false discovery rate: that's usually set by the user, based on your tolerance for it. You can say, I have a tolerance of, let's say, a 1% false discovery rate, and based on this type of analysis, that lets you choose the probability threshold that gives you that false positive rate, and hence that false discovery rate, and then you apply that threshold to any new cases you see. Okay.
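Going back to the ROC curve, here's a minimal sketch of computing one by hand in R; the simulated probabilities and labels are made up, and in practice you might use a package such as ROCR or pROC, but the manual loop shows exactly what the true and false positive rates are:

set.seed(8)
truth <- rep(c(1, 0), each = 50)                 # ground-truth labels: 50 positives, 50 negatives
prob  <- c(rbeta(50, 4, 2), rbeta(50, 2, 4))     # classifier probabilities for each case

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(th) sum(prob >= th & truth == 1) / sum(truth == 1))
fpr <- sapply(thresholds, function(th) sum(prob >= th & truth == 0) / sum(truth == 0))

plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # the diagonal: no better than random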