All right, so I think this is approximately where we left off. Here's a comparison of our k-means results with complete-linkage results. Actually, this is an error on the slide: these are the single-linkage results, as you can easily verify. As we said, some aspects of the data are actually quite well represented after the classification into these regions. Many of the data points that look bimodal here fall into the same cluster. But there's a bit of a problem that Rafael pointed out in his last slide. If you look at this cluster here, there seem to be very disparate data points: a bimodal distribution and a rather flat distribution that all fall into the same cluster. Now, why aren't the flat ones over there and over here?

The problem is that since k-means uses Euclidean distances to update its centroids and to calculate the distance from the centroids to the data points, the whole thing depends on how the variance in your data is distributed. You can imagine that if the variance is very large in one dimension and very small in another, the only information available for clustering really is in the one dimension with the large variance. To treat all your dimensions equally, or at least not to inadvertently downweight their importance, you should really be rescaling to the variance, that is, standardizing your data, before you do k-means clustering. And once you do that, the clusters turn out rather differently. Here all the data have been standardized on the variances in the different dimensions, and the clusters look a lot more homogeneous. For instance, in this cluster here you really get this bimodal distribution clustering together very nicely, and you distinguish it well from the unimodal peak here. So standardizing the data before you do the clustering makes sure that all dimensions contribute equally to your clustering problem. So this is like converting the expression levels to z-scores? Yes, exactly.

OK. So these are useful techniques: fast and easy to implement. You do have to beware a little bit of the memory requirements for hierarchical clustering. Why? What does the memory requirement depend on for hierarchical clustering? What's the fundamental comparison that you use for hierarchical clustering? Remember? The distance matrix. So the memory requirement arises from the fact that you need a matrix of all pairwise distances, and it's a square matrix: for n data points, you need a distance matrix of size n squared. Clustering is also a bit ad hoc in defining the number of clusters, in defining exactly which distance metric to use, what constitutes a good clustering, and so on. So it is very useful as an exploratory data analysis method, but the results of cluster analysis are not necessarily the be-all and end-all of your analysis. Biological insight is still required.

Now I'm going to go rather briefly through the more technical aspects of model-based clustering, basically as one example of the more advanced, more insightful statistical methods that can be brought to the game. Model-based clustering is based on explicit probability models. Whereas in the other clustering methods you don't actually go in with any notion of what your classes and clusters should look like, model-based clustering is based on so-called normal mixture models.
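Before we get to mixture models, here is a minimal R sketch of the two practical points above: standardizing each dimension to z-scores before k-means, and the n-by-n pairwise distance matrix that hierarchical clustering needs. This is not the course's actual code; the matrix `expr` is a made-up stand-in for an expression data set, and k = 4 is an arbitrary choice.

```r
set.seed(1)                               # arbitrary seed for reproducibility
expr <- matrix(rnorm(200), nrow = 50)     # made-up data: 50 "genes" x 4 "conditions"

expr.std <- scale(expr)                   # center and scale each column (z-scores)

km <- kmeans(expr.std, centers = 4)       # k-means on the standardized data

d  <- dist(expr.std)                      # all pairwise distances: ~n^2/2 entries
hc <- hclust(d, method = "single")        # single-linkage hierarchical clustering
cl <- cutree(hc, k = 4)                   # cut the tree into 4 clusters

table(km$cluster, cl)                     # compare the two partitionings
```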
Basically, a normal mixture model is simply a set of different normal distributions, one associated with each class in your data, all added together. So if I take two normal distributions and plot them in the same window, I have a normal mixture model, because I've mixed these two normal distributions in one data set. Now, given such a normal mixture model, you can apply fairly sophisticated mathematics to compute the probability that the clusters are well represented by such a mixture, and then analyze what the properties of each contribution, each component of the mixture, would be. Mathematically, this comes down to an eigenvalue decomposition, where each component of the mixture is described by parameters that determine its volume, orientation, and shape.

And don't borrow slides from people who do that. Do you people ever put animations on your PowerPoint slides? They're the bane of the audience. See, this is what happens; hard to turn off, too, when you borrow a slide from someone, all these bells and whistles that interact with the audience.

Anyway, the parameters can be constrained so that the normal components all have equal volume and are spherical; or they can have unequal volume but still be spherical; or they can have equal volume, shape, and orientation; or they can be completely unconstrained in volume, shape, orientation, and so on. So as you allow larger numbers of parameters, as with Occam's razor, you get a better fit at the cost of needing more parameters to describe what you're doing. And there's a trade-off here, because with enough parameters you will always get a perfect fit, but it may just reflect your parameters and not the actual data; with too few parameters, your analysis will be too coarse and, again, will not represent the data well. You can do all of this in the mclust R package, and we'll go through an example of that.

The interesting thing, though, is that in this setting you can actually compare the different models using the so-called Bayesian Information Criterion. In the Bayesian paradigm, what this gives you is essentially the probability that a given model is correct after accounting for the data that you've seen and the prior probability that the model would have been correct in the first place. The important aspect of the Bayesian Information Criterion is that two different terms contribute. One is a measure of fit: how well does your model actually fit, predict, and represent the data? And from that you subtract a penalty term, which essentially reflects the number of parameters, and hence clusters, that you need to represent your data in the first place. So the fewer clusters you need, the better; but the better each of the clusters fits, also the better. If your clusters aren't good, you're going to get a poor measure of fit; but if you have too many clusters, even if they all fit perfectly, you're also going to get a low score. So this is where the trade-off between the number of clusters, the cluster parameters, and the quality of fit is handled in a mathematically rigorous way, with sound statistical theory behind it, and thus it can be used to choose both the number of clusters and the covariance parameterization. I'm going to skip over this here; it basically works in pretty much the same way.
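As a hedged sketch of what this looks like in practice, using only the mclust package named above and some made-up two-dimensional data (not the course data): Mclust() fits normal mixture models across several covariance parameterizations and numbers of components, and reports the combination with the best BIC.

```r
library(mclust)

set.seed(2)                                       # arbitrary seed
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # two made-up point clouds
           matrix(rnorm(100, mean = 4), ncol = 2))

fit <- Mclust(x)                  # tries several component numbers and models by default
summary(fit)                      # chosen model, number of clusters, BIC value
plot(fit, what = "BIC")           # compare BIC across models and cluster numbers
plot(fit, what = "classification")  # the resulting cluster assignment
```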
Among the different degrees of freedom that you can allow the normal mixtures, the parameterization that forces spherical components gives you this kind of clustering here, which, if you visually compare it back, is at least as convincing as the k-means result. But it turns out that you can get an even better information score under a different parameterization with a few more degrees of freedom, because then everything falls into just three clusters. And looking at this visually, I think you'll note that the three clusters now group these profiles together very, very nicely. So it seems to me the best method of all that we've seen so far, under the criterion that the best clustering is the one in which the members of each cluster are the most visually similar. Once again, the bottom line is: try different clustering methods. Model-based clustering is an attractive alternative, and it's easy to use with the mclust package, which is not installed by default; you simply have to type install.packages("mclust"), and that's on the course wiki. The best clustering method really depends on what you want to achieve.

But one thing that I wanted to talk about a little is that clustering in such circumstances may not even be the best approach to begin with. And this brings us to the whole idea of density estimation. Clustering is a partition method: clustering forces the data set to be partitioned into different sets. So let's simulate a data set; consider the following piece of R code. We set the seed to a particular value. We create a matrix, x1, which is an array of 70 uniformly distributed values in the range between 0 and 10, arranged as 35 rows and two columns. And we create a second matrix, call it x2, which is 30 normally distributed values centered on 5, with a standard deviation of 4.7, arranged as 15 rows and two columns.

Now, we calculate the ranges, the x ranges, for these values. What I'm doing here is superimposing two different distributions in two dimensions: one is a uniform distribution, one is a normal distribution, and there might be outliers. I want everything on my plot, so I use the command range() to find the smallest and largest values, and I store that in a variable; then the same thing for the y range, here for the first column and here for the second column. That way I can make sure that my plot actually covers both of these data sets. Then I plot them independently: this is a normal scatter plot of the first set, and I color them black. Then I set par(new=TRUE) and do a second plot; this new=TRUE causes the plot window not to be erased, so the second plot is overlaid on the first. That's how I get the second plot on top of the first. And in order to do that in the same coordinate frame, of course, I have to specify my limits on the x and y axes explicitly. I can't let R do what it normally does very well, automatically determine the ranges, because then the two plots would not end up in the same frame. The second set is now colored red; I suppress the display of axes, because the axes have already been drawn; I suppress the display of labels, because the labels have already been generated; I just add the points as they are, and the result looks like this. So this is my normal distribution, and this is my uniform one. This is how my data came about.
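Here is a sketch reconstructing that simulation. The seed is arbitrary, and the distribution parameters are taken from the spoken description, so they may not match the original script exactly.

```r
set.seed(53)                                                # arbitrary seed

x1 <- matrix(runif(70, min = 0, max = 10), nrow = 35, ncol = 2)   # uniform background
x2 <- matrix(rnorm(30, mean = 5, sd = 4.7), nrow = 15, ncol = 2)  # normal component
                                                            # (sd as stated in the lecture)

xr <- range(x1[, 1], x2[, 1])   # common x range so both plots share one coordinate frame
yr <- range(x1[, 2], x2[, 2])   # common y range

plot(x1, xlim = xr, ylim = yr, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)                 # don't erase: overlay the next plot on this one
plot(x2, xlim = xr, ylim = yr, col = "red",
     axes = FALSE, xlab = "", ylab = "")   # suppress axes and labels already drawn
```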
I only know, because I generated this synthetic data set from two different distributions, that I should be coloring one set red and one black. But that's not a luxury I normally have when I look at data. It would all be black or all red, and I would need to figure out what's going on in that data. Is there anything of interest? Now, it's obvious that if I start clustering this into, say, two or three clusters, I would probably get a cluster partition along here, or right along here, and so on. So there are certainly going to be many possible ways to cluster this, but it's probably equally obvious that I will not be able to consistently separate my data from the noise, because there's overlap between the uniform noise distribution and my data distribution. This is one synthetic example that illustrates a situation where it's probably not good to try to cluster, because partitioning is not how the data is structured. The data cannot cleanly be partitioned into two sets: we have data that we'd like to recover, and noise that we'd like to discard. Density estimation is an approach that might be much more suitable here.

So what's density estimation? Essentially, density estimation takes a discrete set of data points and tries to recover a probability distribution, the most likely distribution the discrete data points might have come from. If I take normal deviates, like rnorm(100) in R, I get a set of points that are distributed according to a Gaussian bell curve, and if I plot a histogram, that histogram is going to be jagged, with corners; it will sort of look like a Gaussian curve without itself being a Gaussian curve. Density estimation is the problem of reconstructing that smooth curve when all I have observed is the individual data points.

The example data here is taken, I think, from Peter Dalgaard's book, Introductory Statistics with R. Rafael has mentioned that book; it's also on the course wiki. I find it a really excellent introduction to working with R: it explains a lot of the code that you use, even things that look trivial at first. So if you're serious about this, I think it would be one of the first books to look at. At U of T, and I don't know about the other institutions, we have it online in the library. So if you go to the U of T library and look for the title of the book, you can connect to it online and read it on your computer wherever you are; you don't even need to buy it.

OK. So this is the data: a bimodal distribution of geyser eruptions in Yellowstone National Park. And the question is: there's a probability distribution behind this, but we have only sampled data from it, so how can we reconstruct it? The way this is done is to run a density estimation. What the density estimation does is take a number of so-called kernel functions, which in the Gaussian case are just bell curves, and add bell curves together until it can reproduce the observed data well enough. Now, there's a parameter in there called the bandwidth, and that's the question of how broad these bell curves are going to be. If they're very narrow and very pointed, every one of these bell curves is going to sit exactly on top of one observation, and they're not going to contribute to each other. If they're too broad, they're going to smear over the two peaks that we have in the histogram.
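As a small illustration of that idea, assuming nothing beyond base R: draw 100 normal deviates, plot their jagged histogram, and overlay a kernel density estimate alongside the true curve the points came from.

```r
set.seed(17)                              # arbitrary seed
x <- rnorm(100)                           # 100 points from a standard normal

hist(x, freq = FALSE)                     # freq = FALSE puts the histogram on the density scale
lines(density(x), col = "red", lwd = 2)   # kernel density estimate of the smooth curve
curve(dnorm(x), add = TRUE, lty = 2)      # the true underlying bell curve, for comparison
```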
So this is a little bit dependent on the resolution, the so-called bandwidth of the bell curves; I'll show you another example of that. But in R, bw="SJ" tells it to use an automatic procedure to optimally adjust the bandwidth. So essentially all you need to say is: we want the density estimate for these eruptions with this bandwidth parameter. And we can put that into a plot command, again overlaid over the histogram in this case: suppress the titles, suppress the labels, suppress the axes, and so on; color it red; and set line width lwd=3 to draw a very distinct, thick line. And that's what it looks like. The density estimation has now given us this bimodal probability density function that could have given rise to the histogram we actually observed in the data.

Now, why is that useful for clustering? Is everybody approximately clear on what this density estimation does? You have discrete data, usually never enough of it, and it's discrete anyway, and you're trying to establish the underlying distribution that could have generated the discrete samples you're observing. What about the setting of the... bandwidth? Here, this bandwidth parameter says use one of the automatic algorithms. You can also set it explicitly; it's actually interesting to play around with it and see how the curves then get more rugged and sharper, or broader and smoother. So you could actually try several and use the best one? Yes, yes, and there's an algorithm for how to choose it optimally.

Now, this is a little example from my own work. I'm interested in protein structure; we're interested in finding recurring patterns in protein structures. We want to take small patterns and then ask: do we have clusters, essentially, of similar patterns in the very large, multi-dimensional structure space of, say, all amino acid fragments that are five amino acids long and folded in some way? So this is one representation of structure space; if you've ever worked in structural biology, you know what a Ramachandran plot is. It essentially plots the torsional angles along the backbone of a protein structure. And if you plot these in aggregate, you see there are dense regions here, here, and here. This corresponds to the alpha helix, this corresponds to the beta strand, and there are also dense loop regions here. The problem we set ourselves is to find out how many interesting clusters there are in structure space, and where they are, that might tell us something about how proteins fold and why proteins fold. The way we've been approaching this is similar to kernel density estimation. It's a problem that's not very amenable to pure clustering procedures, because there's a lot in the data that's just noise and that would influence our clusters. What we'd really like to get at is just the best representatives of each of these, what we call structural motifs.

So this is the way we go about it. Suppose we just have a one-dimensional structure space and discrete observations. This is one particular piece of protein structure, this is another little fragment of protein structure, and so on: small, discrete observations. And then we use a Gaussian kernel to estimate the density in that structure space. Basically, what we do is associate a Gaussian kernel with each and every single one of these observations here.
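A sketch of that geyser example, assuming it uses R's built-in `faithful` data set (Old Faithful eruption durations); the lecture's exact code may differ slightly.

```r
# histogram of eruption durations, on the density scale
hist(faithful$eruptions, freq = FALSE, main = "", xlab = "eruption duration (min)")

d <- density(faithful$eruptions, bw = "SJ")   # automatic (Sheather-Jones) bandwidth
lines(d, col = "red", lwd = 3)                # thick red density curve over the histogram
```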
And then we sum over all the Gaussian functions, and we get this density estimate as a probability density function that could have given rise to all the observations. And now, in order to find the representatives of these clusters, of these structural motifs, we don't try to cut the data apart and pry it into different subsets; instead, we look for the local maxima in that density.

Yes? Do you pre-set a standard deviation for the kernels? The standard deviation of each particular kernel? Right now, these are just the individual observations on this axis, but the standard deviation of the Gaussian kernels is basically the same thing as the bandwidth, and that's, I think, the next slide; I'll get to that in a moment. There's no uniquely good and reasonable way to choose the standard deviation; well, actually, I'll get to that when I have the slide for it, that's an exaggeration.

So what's the best choice of this parameter, the standard deviation? If I use very broad curves, I basically smooth over all the differences in my structure space; there is some intermediate value that's quite reasonable; but if I use very narrow curves, then I get many small local maxima that I might not even want to resolve. You can think of this like resolution: looking at something at a very coarse-grained resolution or a very fine-grained resolution. Depending on what resolution you look at, you will get different results. But that's OK, because that's the way our data is structured: if we look at it in a very fine-grained way, we would want to see substructures, things falling apart again into small details; but if we look at the whole thing at once, we might only be interested in the very large picture. And if the kernels are maximally broad, the entire space will just have a single peak, and you'll find no other local maximum. So basically, tuning the bandwidth of these kernel functions for the density estimation gives you a way to look at your data in a smoother or in a more fine-grained way.

We then go along and just look for local maxima. Of course, if we were looking for the global maximum, we would find just a single point; but if we look for well-defined local maxima within some neighborhood, we can pick out points that are representative. Again, this is not clustering in the partitioning sense; we're looking for local maxima after a kernel density estimation. And then we can say: each of these local maxima defines a neighborhood of all fragments that are within some distance of it, and once we have the neighborhood, we can superimpose the fragments. This is one of the examples. This is rank 399 in our little database of what we've found: a fragment seven amino acids in length. This is the backbone superposition here, and we were only using the backbone for calculating the coordinate distances. But it turns out that, even though this was not a search criterion, there are very significantly non-random sequence propensities. This position here is always either phenylalanine or tyrosine; this position here has a high likelihood of being aspartic acid or glutamic acid; this position here is almost always serine; and so on. So why did nature choose these particular side chains and these particular sequences? Well, it's because these specific folds and shapes with that particular sequence give low-energy conformations; low-energy conformations have high probability; and since they have high probability, they show up as peaks in the density.
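As a minimal one-dimensional sketch of that peak-finding idea, with made-up data and a simple neighbour test for local maxima (this is an illustration, not our actual structure-space code):

```r
set.seed(42)                                           # arbitrary seed
obs <- c(rnorm(40, mean = 2, sd = 0.3),                # made-up "observations"
         rnorm(60, mean = 5, sd = 0.5))                # in a 1-D space

d <- density(obs)                                      # Gaussian kernel density estimate

# a point on the estimated curve is a local maximum if it is higher than both neighbours
is.peak <- c(FALSE, diff(sign(diff(d$y))) == -2, FALSE)
peaks <- d$x[is.peak]                                  # positions of the local maxima
peaks
```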
These are the peaks that evolution has populated as it explored its possibilities in structure space, and we can reproduce them with kernel density estimation; finding places where sequence and structure come together like this adds up to something like a representation of the protein folding code. So this is not the result of a clustering method, but the result of a very different approach: looking for local maxima after density estimation. And some of the problems that we look at are much better suited to that.

OK, so let's go back to the synthetic example. What does kernel density estimation do on this data set? I said we can cluster it; the code is there, and it's easy for you to try and see whether you can come up with meaningful clusters. But to run the kernel density estimation, we combine the two sets, the black points and the red points, into one joint matrix, because the density estimation has to run over all of them. We run the density estimation, but we only run it on the x coordinates; the density estimation here is one-dimensional, so we just project everything down onto the x coordinate. We would get a similar result for the y coordinate, and there are other ways to do this in two- or three-dimensional spaces. Then we plot this as an overlaid curve on the same plot, and it looks like this. This now shows us that there is something interesting going on in our data: a high peak in the data distribution. And then we could go back to the coordinates of that peak, rank everything according to distance from those coordinates, and start analyzing what it is. It won't completely separate signal from noise, because that separation just isn't in the data; but it will give us a good starting point for analyzing and grouping the points.

We've very much been scratching the surface regarding clustering algorithms; there are many, many others. And if there are many others, it probably means none of them is good for all situations. So clustering is a useful tool, but you can shoot yourself in the foot with it and come away with the impression that there's something in your data that really isn't there. Once again: use it with moderation, use it for exploratory data analysis. If you have to derive quantitative results, make sure that you use different approaches to clustering, see how robust your clusterings are, and see if you can get independent validation. That's always the most useful thing. For instance, if you cluster according to expression profiles and then you can independently see that you have an enrichment of Gene Ontology categories, that's a good validation that your clusters are actually meaningful and not just artifacts of your data. OK.
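To make that last projection-and-peak step concrete, here is a hedged sketch using the synthetic x1 and x2 from the earlier sketch (seed and parameters are assumptions); instead of overlaying the density curve as on the slide, it simply marks where the one-dimensional density along x peaks.

```r
set.seed(53)                                                # same arbitrary seed as before
x1 <- matrix(runif(70, 0, 10), nrow = 35, ncol = 2)         # uniform "noise"
x2 <- matrix(rnorm(30, mean = 5, sd = 4.7), nrow = 15, ncol = 2)  # normal "signal"

xy <- rbind(x1, x2)                       # combine both point sets into one matrix

dx <- density(xy[, 1])                    # 1-D kernel density along the x coordinate only

plot(xy, col = c(rep("black", 35), rep("red", 15)), xlab = "x", ylab = "y")
peak.x <- dx$x[which.max(dx$y)]           # where the estimated density is highest
abline(v = peak.x, lty = 2)               # mark the densest region along x in the scatter plot
```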