Good morning, everybody. So welcome to analysis using R again. We've got four modules. This morning, we're going to focus on exploratory data analysis, which I firmly believe is the foundation of data analysis. Oftentimes we find that when research is presented, you get the impression that the data just sort of came off whatever generated it, a sequencer or something. And then you say, well, how do I find my differentially expressed genes and whatnot? And then you say, oh, you've got to go use this package or that package. But there's a lot more to understanding your data. What we're hoping to do is set the foundation of how to think about your data: things you should be aware of that might be lurking in your data, things you need to control for before you actually go answer the question you want to answer. So we're going to start with these foundational concepts, and you'll see that over the two-day workshop, the subsequent modules build on them. The clustering and PCA will fall into the context of the EDA material, the generalized linear models are an extension of linear models, and then finally we'll talk about the thing we actually want to do, differential expression analysis. So, the learning objectives. We're going to introduce a lot of concepts this morning. We'll start the lecture, go through the concepts, and then talk about how to use R as a tool to get at different aspects of your data. I would love for you to primarily focus on getting the concepts right. We're going to go through the R in class anyway, and you'll have labs for it. And because you have all these notebooks, even if you don't completely get the R part, or don't have time to wrap your head around all the technical parts, if you understand the concepts you will know to ask those questions and reach for those tools when you go back to your desk and work on your own data. So at the end of this lecture, you should be able to conceptually describe the anatomy of a model. A model is what all of us are trying to probe in our studies, and we're going to talk a bit more about that. But first, you have to be able to name all the different parts of the model so you can tell whether you've addressed something or not, and what else is left. So you're going to be able to define these terms: response variable, explanatory variable; name broad sources of variation in your data; and know how to approach data exploration in a systematic way to find known sources of variation, some of which you want, and other sources of variation that you have to deal with in order to maximize the signal in your data. You'll appreciate the value of exploring missingness in your data. And then you're going to have a high-level understanding of clustering and be able to cluster your data. I'm going to talk a bit more about that. So we all have some kind of effect that we're trying to find in a study. Here is an example study where we say we are doing disease research and we've got two groups that we are interested in: responders versus non-responders, cancer versus normal, and so forth. And we've got some kind of measure that we're looking at, and we want to look at the differences between our groups in that measure. It could be RNA-seq, it could be DNA methylation, clinical variables, what have you. So at the heart of what we're trying to do is, can you see my pointer?
OK, yeah. So what we're saying is the reason we're doing this assay, the reason we're measuring this molecule in these two different groups, is because we think that the nature of the two groups is going to be reflected in the variable that we measure. So in this particular case, if we're measuring gene expression and our question is, are there significantly different genes in group A versus group B, disease versus control, then we're interested in measuring expression as a function of disease. Now, this notation might remind you of a bit of high school math where you're trying to fit a line to data points: y equals mx plus b, where you've got an intercept and then you've got a slope term. So here I just want you to think about these. We're going to define these terms a bit more, but that's where this idea comes from. You have a model. It's not necessarily a linear model, but it's a model, and there are terms that explain the outcome. The outcome that we want to measure is called the response variable. The factors that we think influence the response variable are called the explanatory variables. And then finally, our model has this intercept term, which is some kind of constant offset. Then you say, OK, the expression is a function of disease plus some unmodeled variation, some residual there. So that's conceptually what we're trying to do. Another way you might hear about this is that the variable you think is doing the explaining is called the independent variable, and the response variable depends on the value of the independent variable. So in this model framework, these terms beta 0 and beta 1 are called coefficients. You can also call them model weights. Beta 0 is the intercept; it's this additional term that shifts the data up or down. So that's our overall model.
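To make that anatomy concrete, here is a minimal sketch in R using simulated data; the variable names, group sizes, and effect sizes are made up purely for illustration.

```r
# Anatomy of a model on simulated data (all names/values illustrative).
set.seed(1)
disease <- factor(rep(c("control", "case"), each = 10),
                  levels = c("control", "case"))

# response = intercept (beta0) + disease effect (beta1) + random variation
expression <- 5 + 2 * (disease == "case") + rnorm(20)

# "expression ~ disease" reads: expression as a function of disease.
fit <- lm(expression ~ disease)
summary(fit)  # intercept estimate near 5, disease coefficient near 2
```

Covariates like age, sex, or batch get added on the right-hand side of that formula, which is exactly the expanded model we're about to build up to.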
So although in the ideal scenario your gene expression would be completely described by the explanatory variable of interest, in reality that's not always the case. You are going to have other sources of variation driving the signal in your data, the gene expression. There are two broad categories of variation. One is biological sources of variation, and examples of that include the sex of the patient, the age, genetic ancestry, and some other factors. If you are sampling a tissue that's a mixture of cells, that mixture can vary from sample to sample. So if I'm going to do brain transcriptomes, the brain is a mixture of neurons, oligodendrocytes, blood cells, and so forth, and that fraction is going to vary. These are biological sources of variation affecting your data. Then you have technical sources of variation, which in a lot of cases can exceed biological sources of variation, even though we do our best to control for them. What are some examples of technical sources of variation? Does anyone have any examples? You've got some pictures up here. Yeah, protocols for preserving your tissue. What else? Yeah, differences in sustaining the growth of the tissue in batches — again, variation in the protocols. It could also be who performed the experiment, or when they performed the experiment, the Monday effect. And then you have unknown sources of variation in your data. So you've got to appreciate that at the end of the day, you want to be able to catalog these other sources of variation, because if you don't build them into your model, then your model isn't able to specifically model the effect of your variable of interest, which is disease. So the big question is, when you get a data set, which of these is affecting your data? For these, we've got some tools, some of which we will introduce in this workshop, and what these tools allow you to do is visualize and quantify these various sources of variation. Some tools include clustering and dimensionality reduction, which is something we're going to cover this afternoon. You might have heard terms like PCA, UMAP, and t-SNE, very popular in single-cell genomics; that's dimensionality reduction. When the dust settles and you've done your EDA, this is what your expanded model might look like: you've got what you originally had, your intercept term and your disease term, but now you find that age and sex influence your expression. In disease transcriptomics, age and sex are usually just added as covariates to the model. And then you might find that there is a batch effect, so you have to find a way to quantify it and build that into your model. And you've still got the unmodeled leftover variation. So at the end of the day, you've got this kind of model, and all of these are your explanatory variables. Some are biological variables, some are technical variables, and the leftover term is what's called random variation. Now, the way we use the word random in common parlance is different from what statisticians mean. They're talking about drawing from a particular statistical distribution, where the sampling is at random. And when you pick a model — say you say, I'm gonna do differential expression analysis, I'm gonna use edgeR — it's making assumptions about what statistical distribution this random sampling draws from, and whether that best describes your dataset or not is something you need to consider. So that's what random variation means, and this in part determines whether we use a linear model or some other kind of model to fit our data. So those are the terms of the anatomy of a model. Another point to consider in data is missingness. So can you give me some examples of why you might have missingness in your data? Where would missingness come from? Sorry, what? Yeah, data entry errors, that's one example. Why else could data be missing? Yeah, quality control. If something went wrong with a particular sample, you're missing that data point because it didn't pass quality control. So the take-home message is that missingness happens, and it can happen for various reasons. If you're working with population cohorts and you've got a lot of clinical variables, some of them might be missing data, especially if you've got a questionnaire — sometimes people don't respond to all of the questions. If you have data pooled from multiple institutions, different institutions may collect different subsets of the data, so you've got missingness. Multi-omic data: say I did gene expression and proteomics, and some of the patients are missing the proteomics data. So these are possible reasons. What do you do when you have missingness in your data?
So one thing you can do is define a cutoff for what constitutes excessive missingness. We're gonna see some of this later, but you can say something like: if a patient is missing more than 80% of measures, then we have to exclude that patient. As for exactly what that cutoff is, I would suggest that, whatever your field is, you look to see if there's a convention for what's considered appropriate, or you come up with a justification based on your data. But you have to be very, very careful when you start excluding samples from your data, because that can bias your results. Another thing you can do is use a method called imputation to fill in missing values. There are different flavors of imputation — for example, look at the patients who are most similar and make an assumption about what this person's missing value could be. And of course, that has possible trade-offs. So the point is, missingness happens, there are ways to deal with it, you need to come up with an explicit strategy, and every strategy is gonna have a trade-off. So here are some visualizations of missing data. This is just some R code that takes a data table, and wherever there's a missing value, an NA, it's shown as white; otherwise it's shown as black. In example one, we've got unstructured missing data. There seem to be some assorted missing values, but there doesn't seem to be any particular pattern in the missingness. Here you've got another table where there's only missingness for one particular column and everything else is complete. So then you need to consider what that column is, how important it is to your design, and what you're gonna do about it. In this particular case, it happens to be age, so if you're gonna build that into your statistical model as a covariate, you have to come up with a plan. The most nefarious kind of missingness is biased, structured missingness: all of your cases are missing values for something, but your controls are not. Say you don't look at your missingness and you just apply imputation. Now you've got some filled-in values come up with by some algorithm, but because you didn't visualize your data, you didn't realize that what it filled in was for all the cases, and that biases your results. So the message is: look at your data, look at your missingness, and drill into it, so that if it seems to be biased, you understand how much it's affecting your two groups and can come up with an appropriate plan. Yeah, okay, so that's the thing. Where possible, look at your data.
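Here's a rough sketch of that kind of checkerboard missingness plot in base R; the matrix and the NA positions are simulated just to show the idea (the lab will share its own plotting function).

```r
# Simulate a small data table with unstructured missingness.
set.seed(2)
dat <- matrix(rnorm(200), nrow = 20,
              dimnames = list(paste0("sample", 1:20), paste0("var", 1:10)))
dat[sample(length(dat), 15)] <- NA   # sprinkle NAs at random

colSums(is.na(dat))                  # missing values per variable
rowSums(is.na(dat))                  # missing values per sample

# Checkerboard view: white = missing, black = observed.
image(t(1 * is.na(dat)), col = c("black", "white"), axes = FALSE,
      main = "Missingness map")
```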
So then what are the goals of exploratory data analysis? Coming back to our statistical model, remember we said we've got biological variation and technical variation, and when we come in, we're just excited about the thing we're excited about — oh, the disease effect, right? So the first goal of exploratory data analysis is to identify the magnitude of known biological and technical variation in your data. Is there a huge separation by sex of the samples? Do older and younger samples cluster differently? The second is to identify sources of unknown variation. This comes by visualization and tools like PCA, where you can see the data seem to be separated, but they're not separated based on any variable in your sample table — your metadata, as it's called. You want to identify and model these so that they are separated away from your disease effect, and then your variable of interest is more cleanly modeled. The third is picking up outlier samples — quality control things happen, so that's the other reason. And the fourth is to characterize missingness. For each of these four goals, we have specific tools, and we're gonna introduce you to these tools in the course of this workshop. To identify the magnitude of known biological and technical variation, you use tools like dimensionality reduction (PCA), clustering, and prior knowledge. So if you've got a genomic data set, you've also got the phenotype table for the samples — the sample metadata, as it's called. The data is your gene expression; the metadata is all the information about your samples: was it in the first batch or second batch, age, sex, treatment course, whatever's relevant to your study context. And then the action is you add the terms to the model. To identify sources of unknown variation, there are tools — again, PCA can identify this, and there are tools such as surrogate variable analysis. But the bottom line is, again, you identify that source of variation and build it into your model. Same thing with outliers: PCA and clustering to find outlier samples. And you can see there's a caution symbol there: one action you can take is to exclude the sample from analysis, but again, this could bias your study, so you might have to do things like run the analysis with the sample included and excluded and see how robust the results are. Finally, characterize missingness. Yes, we're gonna make all these slides available — we're gonna make the PDFs available. You're welcome to take screenshots, but we're gonna make them all available on the GitHub site, the website. So here's a workflow to structure your data exploration. As I said, in a genomic assay — say a gene expression assay — you get roughly 20,000 measures, 20,000 genes, per patient or per sample. If you're doing methylation analysis, you might get up to a million measures per sample. And then you have the metadata. So this is a recommended workflow to structure your exploration; it's not just an amorphous "I'm gonna go explore my data." First you ask: how many measures do I have in my metadata? How many measures do I have in my data? What is the distribution of the measures? So you do things like box plots, violin plots, scatter plots. We're gonna do a bit of this tomorrow morning, and those of you who took Intro to R might have done some plotting as well — but you have to look at your data; you can't just throw it into a black-box method. Is there missingness? Visualize it and decide on a solution. Then you look at the correlation between samples. If you've got tissue one versus tissue two, you expect tissue-one samples to be very similar to each other and different from tissue two. Does it look like that, or is there something else driving the structure? Does the data have natural groupings? How many major sources of variation do you have, and do they map to known biological variables or known technical variables?
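As a first pass through that checklist, here's a minimal sketch in R; `expr` (a genes-by-samples matrix) and `meta` (a one-row-per-sample data frame) are hypothetical names standing in for your own objects.

```r
# First-pass data exploration (object names are placeholders).
dim(expr)              # how many measures, how many samples?
head(meta)             # glimpse the sample metadata
summary(meta)          # distribution of each metadata variable

sum(is.na(meta))       # any missing metadata?
colSums(is.na(expr))   # missing measures per sample

# Distribution of the measures, one box per sample.
boxplot(expr, las = 2, main = "Per-sample distributions")

# Correlation between samples: do like samples look alike?
round(cor(expr, use = "pairwise.complete.obs"), 2)
```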
Maybe you go back to the lab and say to the person who ran the experiments: I'm seeing two different clusters — did you guys switch batches of your antibody or something like that? And they're like, oh yeah, we did. Okay, well, let's build that into the model. Are there batch effects? Attempt removal, or model them. Are there unknown, unmodeled sources of variation? And then you get your final model for what is called inferential testing. Inferential testing is what we do when we do things like differential expression with edgeR, or differential methylation. Sorry — edgeR is a software package for differential RNA analysis, so GSEA is downstream of edgeR, yes, okay. So what tools do you use to answer each of these questions? We're gonna go through this in our lab. For data exploration, you can use functions like dim() and head(), which show you a glimpse of your table, and summary(), which summarizes the statistical distribution of your variables, plus plots. For missingness, you can write code to count how many missing values there are, and a simple plotting function — we're gonna share a plotting function with you that shows missingness using that checkerboard kind of pattern. For correlation structure, you use clustering. To quantify sources of variation, we use dimensionality reduction; again, Delaram's gonna cover that this afternoon. For batch effects, again, you use these tools, clustering and dimensionality reduction — what they do is identify major sources of groupings or structure in your data. And then there's a tool I'm not gonna talk about during this workshop, which you can look up if you happen to need it: surrogate variable analysis. It will find additional sources of variation in your data that you haven't modeled. And there are tools to try to remove unwanted variation as well. Then you finally generate your model, and you go downstream to do your inferential testing. So that's EDA. Any questions about EDA before we move on to clustering? It can be hard to do this methodically, and if you have something like a structured way of approaching it, then you can tick the boxes as you go along and you know you're covering the bases. Also, do not skip this — it's really important.
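Before we move on, here's a hedged sketch of the two ways to handle a known batch effect just described; `expr` and the `disease` and `batch` columns of `meta` are hypothetical names, and limma's removeBatchEffect() is one example of a removal tool (it's generally used for visualization, while for inferential testing you keep batch in the model instead).

```r
# Option 1: model the batch -- add it as a covariate (one gene shown).
fit <- lm(expr["gene1", ] ~ meta$disease + meta$batch)

# Option 2: remove the batch signal before visualization/clustering.
# limma is a Bioconductor package: BiocManager::install("limma")
library(limma)
expr_adj <- removeBatchEffect(expr, batch = meta$batch)
```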
Okay, so clustering. We're not gonna go into too much detail on the math behind the clustering, but again, the idea is to give you an appreciation of the variety of tools that are out there, what purpose they serve, how to run through those in R, and how they fit into the big scheme of trying to make sense of your data: find known and unknown sources of variation and then finally do your differential-whatever analysis — differential expression analysis, et cetera. So the high-level purpose of clustering is to find natural groupings in your data. Why would you wanna find groupings in your data? Partly it's because your experimental design has naturally got some kind of group-level effect it's trying to find, and you wanna, first of all, see if your groupings are being reflected in the overall structure of the data. But other reasons might be to find batches in your data. If I've got mouse tissue A and mouse tissue B and I do clustering and find four clusters — all the Monday samples cluster together, all the Tuesday samples cluster together — then that's telling me something about what needs to be fixed. Another reason is to identify subtypes of patients. So if you are looking at a cellular readout like RNA or DNA methylation, and your hypothesis is that patient subtype changes disease course — some patients are gonna have a more aggressive outcome and some a less aggressive outcome, and that's gonna be reflected in their genomics — then you're hoping the genomics is going to show that, so you might wanna do clustering and then color-code the data points based on poor survivors versus good survivors. There's still a little bit of creativity there. And then you might wanna find groups of co-expressed genes, maybe because they're driving a particular program that's relevant for the biology of whatever you're studying. And this is an example of what the output of clustering might look like. This is called hierarchical clustering, and this thing is called a dendrogram. This is a data set where you've got two sets of three samples, the NCs and the Ms — we're gonna do this in more detail in R — and the branches connect samples that are more similar to each other. You can see how there seems to be a natural separation into these two groups. So that's an example. At the heart of all the clustering methods is this idea of computing a distance between samples. We say, oh, these samples are similar — that's what we say in colloquial terms — but in math you compute a distance metric between the samples. You need to find a way to quantify how similar or dissimilar observations are from each other. Once you've assigned a pairwise sample distance, you take a clustering algorithm, and the clustering algorithm organizes the visualization based on that distance metric. Does that make sense? Please feel free to stop and ask questions. If you're not getting something out of it, there's no use, so please ask — and I'm sure if you've got a question, other people have that question too; sometimes everybody tends to be shy. So the key is your distance metric. Different data types have different statistical distributions: clinical variables from a questionnaire, maybe, or physiological measures, versus gene expression data from a microarray, versus gene expression data from sequencing. All those different data types have different mathematical properties. For example, if you're doing sequencing-based data, you're getting counts; if you're doing microarray data, you're getting continuous-valued numbers. So how you compute distance may change based on that. My point is, don't just throw your data into a distance metric. If you're unsure about what distance metric to use, a good thing to do is to take two or three published papers — landmark papers in the field — and look at what kind of distance measures they've been using for that kind of data. Some data types may have many distance metrics, each of which comes with its own properties. A related term is similarity, which you can think of as the inverse of distance. Distance is a mathematical term: you say we've got points in space and they've got a particular distance between them. A distance metric all of us learned in high school is Euclidean distance.
So you've got a 2D plot, you've got data points on the 2D plot, and that's the simplest case: for points (x1, y1) and (x2, y2), the Euclidean distance is the square root of (x2 − x1)² plus (y2 − y1)². This formula can be extended to arbitrarily high dimensions. And when we say dimensions, we're not talking about wormholes in space — dimensions just means how many measures you have in your data. So for example, if you've got gene expression data and you're getting 20,000 measures per sample, you're working with 20,000 dimensions, and measures like Euclidean distance extend to 20,000 of them: instead of just x2 minus x1 and y2 minus y1, you keep on going, z2 minus z1, w2 minus w1, et cetera. So that's the Euclidean distance. Then you've got other forms of distance like the Mahalanobis distance, which is a type of Euclidean distance that takes into account the broad spread in the data — what we call the covariance structure of the data. It's another distance metric, normalized for that covariance structure. That's all I'm gonna say about that. Another kind of distance is the Manhattan distance, which instead of looking at the straight line between the two points, counts distances in blocks. I've never had to use Manhattan distance in my analysis, but some of you might. Now, all of these distance metrics are for continuous variables. What do you do if you've got a categorical variable? A categorical variable is, say, a patient could be of type A, B, or C. So how do you compute the distance between patients when your variables are categorical, or they're binary — yes/no, yes/no, a whole bunch of yes/no variables? You would use something like a Hamming distance, which is the number of mismatches between patients. What are some examples of common clustering approaches? You've got hierarchical clustering, which is the example I showed you earlier with the two groups separated out in pink and green. Then you've got K-means clustering, and there are many others. The point to make here is that hierarchical clustering may not be the best kind of clustering for all kinds of data; your data might be represented differently. For example, if you are looking at a network of interactions, or a network of similarities between patients — say a social network — how do you cluster that? You may not use something like hierarchical clustering; for that, you might use something like spectral clustering. So the point is, not all clustering approaches work for all kinds of data.
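Here's a quick sketch of those distance metrics on toy data; the matrices are simulated, and the hamming() helper is just an illustration of counting mismatches.

```r
# Toy data: 10 samples x 3 continuous measures.
set.seed(3)
x <- matrix(rnorm(30), nrow = 10)

dist(x, method = "euclidean")  # straight-line distance
dist(x, method = "manhattan")  # city-block distance

# Mahalanobis distance of each sample from the overall centroid,
# normalized for the covariance structure of the data.
mahalanobis(x, center = colMeans(x), cov = cov(x))

# Hamming distance for binary variables: count the mismatches.
y <- matrix(sample(0:1, 30, replace = TRUE), nrow = 10)
hamming <- function(a, b) sum(a != b)
hamming(y[1, ], y[2, ])
```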
Hierarchical clustering. This is the one we see often in genomics papers, where you've got the heat map with a multicolored grid and that dendrogram on the top. So how does this one work? First you build the dendrogram — and I'm gonna talk about how; the dendrogram is that branching tree pattern we saw a few slides ago. Then you choose a cutoff at a certain level of the dendrogram, and that's how you assign cluster membership. Because when you say "I'm clustering my samples," at the end of the day, each sample is gonna have a label attached to it: it belongs in cluster one, cluster two, cluster three. So if I were not to cut this tree, everybody would be in the same group. If I were to cut this tree here, then the samples on this side of the cut are gonna be in group one, samples on that side are gonna be in group two, and so forth. And you can go to increasing levels of granularity, as it's called. So then how do you build that dendrogram? First you start with pairwise distances between your samples, using a metric such as Euclidean distance. Now you have a number for all the pairwise distances: A to B, A to C, D to E, D to F, and so forth. Then what you do is take the samples with the smallest distance and put them together in their own branch. Then you take this subtree and ask: what is this subtree closest to? You compute some kind of average distance for it and connect it to what's nearest to it. So it builds the tree up in this bottom-up fashion, and it continues in this way: you keep taking the previous subtrees, finding what's most similar to them, and connecting them. At the end of the day, you have the overall dendrogram. So that's how hierarchical clustering works.
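In R, that whole procedure is a couple of lines. Here's a minimal sketch on simulated data mimicking the two-sets-of-three example; the NC/M sample names are just for illustration.

```r
# Two groups of three samples, 10 measures each (simulated).
set.seed(4)
x <- rbind(matrix(rnorm(30, mean = 0), nrow = 3),
           matrix(rnorm(30, mean = 5), nrow = 3))
rownames(x) <- c("NC1", "NC2", "NC3", "M1", "M2", "M3")

d  <- dist(x)      # pairwise Euclidean distances
hc <- hclust(d)    # build the dendrogram bottom-up
plot(hc)           # inspect the tree

cutree(hc, k = 2)  # cutting the tree assigns cluster membership
```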
Now we've got this very arresting animation here on the slide. This is a different type of clustering called K-means clustering. In hierarchical clustering, we don't tell it how many clusters we're looking for — we just let it build the tree. But in some cases, you might know in advance how many clusters you expect, and that is your K. So in K-means clustering, you tell the clustering algorithm: I am expecting to find three clusters in my data. Once you pick the number of clusters, the way the algorithm works is that it starts by randomly picking three points in that space (say K equals three). Those are called centroids — a centroid is basically like the mean, but in higher dimensions. Then, for each data point, it assigns it to the nearest centroid; it assigns it to that cluster. So where before you had just randomly thrown darts at the plot to get your centroids, now you've got a group of samples in each cluster. So you can take those samples and compute the center of that new cluster and say: this is my new centroid. Once you do that, you ask, do my samples still fall nearest to that new centroid? Is that still the nearest cluster for each sample, or does a sample's cluster assignment change? In this repeated fashion, you assign samples to clusters, compute the centroids of the new clusters, and then do it again and again and again. The idea is that at some point, you're not going to get much of a shift in where the cluster centers are. Think about it this way: if you've got three different clumps of data points on a plot, once you've assigned a sample to its true cluster, it is going to be closest to that cluster; it's no longer going to change its cluster membership. It stabilizes. So what K-means does is it starts randomly and keeps going till it stabilizes. And then, just for your information, there are other types of clustering, such as spectral clustering. Sometimes data is represented differently — in terms of the relationships between samples, like a network — and in that case, there are clustering algorithms that operate on properties of the network: in this case, the adjacency, or how close two samples are in the network. This is just to say that you can have data where K-means clustering is going to put a centroid here and here, because that's how it works, but the way similarity might really work for that data is based on spatial proximity. So how would you catch it if you had used the wrong clustering algorithm on your data? One way: just visualize the data. So the big take-home is that when we do complex workflows like bioinformatics and you have a lot of software packages, there's this temptation to just take your table and throw it into the software package — but the message is: always visualize your data and see if what's happening makes sense. The second thing that's important is bookkeeping; we're gonna talk about that separately. Anyway, there are many kinds of clustering methods.
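Here's a minimal K-means sketch in base R on simulated clumps; nstart asks kmeans() to try multiple random starts, since the algorithm begins at random.

```r
# Three simulated clumps in 2D.
set.seed(5)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))

km <- kmeans(x, centers = 3, nstart = 25)
km$cluster   # cluster assignment for each sample
km$centers   # final centroids

# Always visualize: does the clustering make sense for this data?
plot(x, col = km$cluster, pch = 19)
points(km$centers, pch = 4, cex = 2, lwd = 2)
```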
So then how do you decide on the number of clusters? Sometimes it's obvious — you're like, oh, I'm looking for A versus B, I'm looking for two clusters and that's it, I don't care what the data says. Do that at your own peril. But sometimes you've got a complex situation — say you're looking at tumors, they're heterogeneous, you need to find clusters — and you ask: how many clusters is the real number of clusters? So there are some metrics to help you identify which number of clusters seems to best separate the data. The level of separation could be really good if your data truly clusters apart, or it may be somewhere in between, where you've got some clusters that are kind of similar to each other. So here are some metrics we're gonna go through. Arbitrarily cutting the dendrogram by eye — we will do this in the lab. And in R, you have a package called clValid, which will compute these metrics for your clustering so you can see which clustering solution seems to work best for your data. The one that I've seen most commonly used is the silhouette width. To use it, you first need to do your clustering — and what does it mean to do the clustering? It means that your samples have been assigned a label that says: you are in cluster one, you're in cluster two, you're in cluster three. You've got cluster membership assigned. What the silhouette computes is: given this clustering solution, how similar is a sample to its buddies in the same cluster, as compared to samples in the cluster nearest to it? You can imagine cases where you've got two clusters that are kind of merged, with a lot of data points in between them, and you can imagine other cases where they're wide apart. So basically the formula is very simple. For each data point i in your dataset, take b(i), the average distance of that sample to the samples in the nearest neighboring cluster (and we've already defined the distance metric), and compare it to a(i), the average distance to all the other points that have the same cluster membership as i: the score is b(i) minus a(i). And then, because we like scales that are normalized, you divide by the larger of the two, so this gives you an index that goes between plus one and minus one. Best cluster separation: a(i), the within-cluster average distance, is near zero and b(i) is very large, so the term becomes very positive, and with the normalization it becomes plus one. Does that make sense? That's the intuition: this is large, this is small, it's positive, so it's plus one. Worst cluster separation: the opposite, minus one. And what this plot is doing is showing one row per sample with its silhouette score, sorted from high silhouette to low. You can see that subtype three seems to be mostly positive and subtype two seems to be mostly positive, but something's going on with subtype one, where a fraction of the samples have a negative silhouette score. The overall silhouette width for a clustering solution is the average of the silhouette scores over all samples; that's what that number is. Okay, so again, here's an example where you've got some data points in four clusters, and if we're computing the silhouette score for this red data point here, a(i) is the average distance to all of the points in the same cluster — these arrows don't point to every last neighbor, but that's the idea — and you compare that to the average distance to the data points in the neighboring cluster. Again, I haven't drawn every last green arrow here, but the point is you're comparing it to the neighbor. That's the score; it goes from minus one to plus one. The clValid package gives you a couple of other scores; the Dunn index is similar. These are all heuristic measures, so you're usually looking to see how many of them are in agreement with each other. If you have three metrics and two out of three are saying it's five clusters and the other one is saying it's two clusters, you come up with a decision — but whatever decision it is, you've got to have an explicit plan for how you decided the number of clusters. The Dunn index is a very similar heuristic, except that it goes from zero to infinity and you want to maximize it. And there's another one called connectivity, which counts what fraction of your nearest neighbors are not in the same cluster as you. You have to tell it what you mean by nearest neighbors — nearest 10 neighbors, nearest 20 neighbors; that's a parameter you change — and then it computes what fraction of your nearest neighbors are not in the same cluster. This metric should be minimized. So just be aware that these metrics have different scales: one goes from minus one to plus one, one goes to infinity, one needs to be minimized. Just be aware of which metric requires which consideration.
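Here's a sketch of computing silhouette widths in R with the cluster package (clValid wraps similar metrics); the data and the clustering are simulated.

```r
library(cluster)  # provides silhouette()

# Two simulated clumps, clustered with K-means.
set.seed(6)
x  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)

# silhouette() needs the cluster labels and the distance matrix.
sil <- silhouette(km$cluster, dist(x))
summary(sil)  # average silhouette width per cluster and overall
plot(sil)     # one bar per sample, sorted from high to low
```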
Okay, so that's it for the lecture part. Let's recap. What are the take-home messages? First, the goals of exploratory analysis are to flesh out this model — to understand your major sources of variation and covariation so they can be modeled appropriately and separated out from the effect you really care about. Sometimes that means you might need to redo the experiment, because there's so much technical artifact that the signal is really, really small — something major went wrong, and you need to catch that. If you don't, you're just gonna run the analysis and go, "no genes are differentially expressed, I don't get it." And of course, identify outliers and look at missing data. This is getting to know your data. EDA can be structured using a systematic approach like the one on the left. And finally, clustering is a tool you use to find natural groupings in the data. It requires a distance metric, and when you're done, each of your samples has a cluster assignment: one, two, three. There are different methods for clustering, and you can validate clustering using metrics. One thing I have not shown on the slide is that there are two conceptual ways to validate clusters: one's called internal validation and one's called external validation. Internal validation is what we've just discussed — silhouette score, Dunn index, connectivity; they have to do with just your data. External validation uses measures from outside the clustering, perhaps from another dataset. So if I've got three clusters and I'm expecting one of my clusters to be macrophages, another to be neurons, and another to be blood cells, I'm gonna color-code my clusters based on expression levels of some macrophage marker and some neuronal marker, and then I say, oh yeah, cluster three is my macrophages. So that's a form of external validation.
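As a last sketch, here's what that external-validation color-coding might look like in R; `expr`, `emb` (a 2D embedding such as the first two principal components), and the CD68 marker are all hypothetical stand-ins for your own objects.

```r
# Color a 2D embedding of the samples by a known marker gene
# (all object and gene names here are hypothetical).
marker <- expr["CD68", ]                            # e.g., macrophage marker
pal    <- colorRampPalette(c("grey80", "red"))(100)
bins   <- cut(marker, breaks = 100, labels = FALSE)

plot(emb, pch = 19, col = pal[bins],
     main = "Which cluster lights up for the macrophage marker?")
```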