Okay, thank you all for joining. I'm really excited to be your instructor this morning on day five of what I'm sure has been a fun and exciting workshop packed with information. I'm Lauren Erdman, a PhD student in computer science, starting my fourth year in the fall. I put this portion of the workshop together with my supervisor; we taught it together for years, and I've taken it over in the past couple of years. So I'm excited today to talk to you about similarity network fusion and clinical data integration, and then we're going to go deeply into survival analysis as well. This is all covered under a Creative Commons license, which I believe means you can share and remix this work — just make sure you attribute it, and share alike if you alter it. Like I said, this is going to be all about clinical integration. Of course, this is a huge field, so I'm not going to cover it completely in any way. I really encourage questions. We've got some awesome TAs, and I know they've been helping you all week; they'll probably be able to field your questions if you send them along on Slack. It's going to be a little harder for me to keep up on Slack, so also feel free to open your mic, especially during the lecture portion, if you have a question you'd like us to stop and talk about, or if there's a question on Slack that should get attention. TAs, feel free to open your mics and bring attention to those too. I'm happy to stop, and I really want everyone to be on the same page. Great. Without further ado, the learning objectives: a quick overview of what I'm talking about when I say clinical data, a quick review of single-data-type analysis, and then some data integration methods, such as concatenating and clustering your data, iCluster, and SNF.
As I say, in the afternoon you're going to be doing a lot of similar work. Shrata and I worked hard to make sure it's complementary — what she presents later will build on what I talk about now. So some of it may be review and some of it may be an extension, and a lot of these concepts may come up again in another session later today. We'll talk about advantages and drawbacks of the different methods you might choose, and then how survival analysis comes into the mix. The reason I'm including survival analysis here is to drive home how we make this clinical: how we build a model that can be used clinically, and evaluate it. Patient data — that's really what I'm talking about when I say clinical data. That could include genetic data, expression data, epigenetic data, microRNA or protein data. Those are omics data sets, but we could also have things that are just clinical, like chart data that patients fill out so you can assess them, maybe imaging that's been done on the patient, or even aspects of their diet. We've worked with nutritionists, for example, who have extensive diet information on their patients. But why would we want to integrate all of these types of data? One reason might be to identify more homogeneous subsets of patients. These might respond similarly to a given drug: if patients' features and clinical histories are similar, it may be that the same drug will have a similar impact on them. They also might have a similar prognosis — if all these features are similar, their actual clinical progression may be similar as well. And they may respond better to similar clinical management, so not just drugs, but other interventions that are done.
So you may want to group them and identify those similarities. One type of single-data-type analysis, which I'm sure you're all familiar with, is clustering. Here, this is showing gene expression data for 1,800 genes, from a PNAS paper in 2005. What they did was collect gene expression data and select the most variable genes — so as you can see, this list is not the full 18,000-plus genes — and perform hierarchical clustering. That's what you see up here: a dendrogram hierarchically clustering these genes, and they identify two clusters. Then they identify genes associated with the difference between these clusters using ANOVA, corrected for multiple hypothesis testing. That's what you see here: the genes that are really driving the separation between these groups. And from this, what they found, nice and neatly, is a difference in survival of the patients with glioblastoma multiforme. So it's a nice example of taking data that is very omics-based, finding a pattern in it, and then seeing that pattern propagate to a clinical outcome. Okay, so I just want to dig into this KM curve we're seeing. This is just plotting our survival data here. What's happening is that the survival for group one — the median survival — is much longer than it is for group two. The data being plotted shows the proportion of patients dying of glioblastoma multiforme, with each of these drops being an event. As the curve falls faster, more of those patients died sooner, whereas in group one the patients survived much longer — even at two years, more than half of that group was fine; they were still surviving.
So the way you could interpret this: the probability of surviving one year is 80% for group one — quite high — but only 20% for group two. Prognostically, these are very distinct groups. Sorry guys, just lost my clicker. There we go. The other thing we can see is censored observations. These are observations where we know this person lived at least half a year, but we don't actually have information about them beyond that point. That's why these tick marks are here: they show this is the information we have on them. We can include that in our model, but we can't say whether they survived beyond that point. [Question] Quick question, sorry, it's very early here. Is there a temporal component to the gene expression data? [Answer] That's a really good question. Off the top of my head, from what I can remember, no — I believe this was baseline. I think it was from their tumor. But it's a great question, because the clustering could be picking up just the progression of the tumor. It could be that for the people in group two, the gene expression data was collected at a different time point, so it looks further along — and maybe if you collected a stage-four tumor from the group-one folks, their remaining lifeline would look quite a bit shorter too. That's actually really important to consider when you're doing this kind of clustering: what pattern are you actually picking up? Are you picking up a difference in the tumor at a given point in time, or are you just collecting different samples from these patients, with some other aspect being reflected in the prognosis you find? Yeah, thank you.
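To make the censoring mechanics concrete, here's a minimal Kaplan-Meier estimator in pure Python. The survival times are made up for illustration, not from the paper — the point is just that censored patients leave the risk set without causing a drop in the curve:

```python
# Minimal Kaplan-Meier estimator sketch (illustrative data only).
# Each observation is (time, event): event=1 means a death was observed,
# event=0 means the patient was censored (lost to follow-up) at that time.

def kaplan_meier(observations):
    """Return (time, survival probability) steps of the KM estimator."""
    obs = sorted(observations)          # walk through event times in order
    n_at_risk = len(obs)
    survival = 1.0
    curve = [(0.0, 1.0)]
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = 0
        removed = 0
        # Handle ties: everyone with this exact time leaves the risk set.
        while i < len(obs) and obs[i][0] == t:
            deaths += obs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Only deaths produce a drop; censoring just shrinks the risk set.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Hypothetical group-2-like data: most deaths early, one censoring at 0.5 years.
data = [(0.2, 1), (0.3, 1), (0.5, 0), (0.6, 1), (1.1, 1)]
print(kaplan_meier(data))
```

Each returned pair is one step of the curve; the censored observation at 0.5 never adds a drop, it only shrinks the denominator for the later events.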
So, here there's work in Cancer Cell by Verhaak et al., where they wanted to go beyond single-data-type analysis. They still cluster gene expression data, I believe, but they select their genes based on more types of data: copy number alterations, mutations, and the gene expression itself. And what they find is that, when they do this, they're not actually able to differentiate these groups in terms of survival — as you can see, the p-values differentiating the groups from the baseline group are very large. So it doesn't always work out; sometimes you find things that are not related to your actual clinical outcome. It still identified proneural, neural, classical, and mesenchymal groups; they just didn't have significant differences in their survival. So now let's say you don't want to just use your gene expression data — like the earlier question brought up, there could be differences in that data, and you want to include other data to augment the clustering you're doing. Some approaches to doing this, which I'll go over here: concatenating and clustering your data, which is essentially taking the different data types, putting them all together in one mega data set, and clustering that; Shen et al.'s iCluster, a popular method that has been developed since 2009, though I'm just going to talk about the really basic version; and similarity network fusion, which I'm going to talk about in much more depth. There's more work coming on SNF too, and we'll use it as the example we implement later — really to get your feet wet taking a new tool off CRAN, using it, and understanding it more deeply.
This is not an exhaustive list — in fact, later today Shrata is going to show you even more of these kinds of techniques, so I'm not going to go too deeply into all of them; there's so much more to find. Okay, so concatenate-and-cluster is super simple; it's truly what it sounds like. You concatenate your data together, and then you cluster the combined data set. In the work I showed you before, they were only clustering gene expression data. Here you'd say: put these all together, treat this as one long row of features that each patient has, and now cluster your patients based on all of that data. Okay, so you've concatenated; now you can cluster. Here I'm going to talk quickly about hierarchical clustering. When you do concatenate-and-cluster — or actually any clustering — you'll have a distance matrix, which is what I show here. Let's say each of A through F is a patient. The distance matrix always has the same rows and columns, because it's comparing all to all. On the diagonal, the distance from an individual to themselves is zero — that's how it's interpreted. Also, the lower and upper triangles are equal: the distance from B to A is the same as the distance from A to B. That isn't always the case for the distance matrices you develop, but it's by far the most common type. Each of these numbers represents, for example for B to A, how far those two are from each other. And if we look across the matrix, the minimum distance — other than the zeros from individuals to themselves — is F to D. So this is the first agglomeration, the first merge, we make in hierarchical clustering; it's number one in the hierarchy.
So when I talk about hierarchy, I'm talking about dendrograms as you've seen them, or what we see here, where F and D are the first to get grouped together: at a distance of 0.5, D and F become one cluster. Then there are different ways to represent the distance from that merged cluster to other points: you could average it, or take the maximum or minimum distance. Each of these has its own implications, but here let's say we're averaging. The next merge as you move up is A to B, so those get grouped next. Okay. If you were to split your clusters at this point — say we cut at 0.75 — we'd have A and B in a group, D and F in a group, and C and E each in their own group; they don't cluster with anyone yet because they're far from the other individuals. As you go further up, E is actually closest to the D-F cluster, so it gets grouped in there. If you draw the line up here, we have D, F, and E all in one group, C in its own cluster, and A and B in their group. Okay, and then going further, C gets grouped in with that D-F-E group. So if we split it up here at two, we could say we've got one small cluster, A and B — which in our graph does seem a bit further from everyone — and then a bigger group that holds the majority: D, F, E, and C all grouped together. Okay, so that's how you interpret agglomerative, or hierarchical, clustering: you see it growing stepwise, grouping everyone into clusters, and at the end, at the largest distance, everybody is in a single group.
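As a concrete sketch of concatenate-and-cluster plus the agglomeration just described, here's a short Python example using scipy. The two data blocks are random stand-ins for, say, an expression matrix and a methylation matrix on the same six patients — nothing here is real data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Toy stand-ins for two data types measured on the same 6 patients (A-F).
rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 4))   # 6 patients x 4 genes (hypothetical)
methylation = rng.normal(size=(6, 3))  # 6 patients x 3 probes (hypothetical)

# "Concatenate and cluster": one long feature row per patient.
combined = np.hstack([expression, methylation])

# All-vs-all distance matrix: zero diagonal, equal upper/lower triangles.
dist = squareform(pdist(combined, metric="euclidean"))
assert np.allclose(np.diag(dist), 0) and np.allclose(dist, dist.T)

# Average-linkage agglomeration, then cut the dendrogram into two groups.
Z = linkage(pdist(combined), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Changing `t` (or switching `criterion` to `"distance"` with a cut height) is exactly the "where do you draw the line" choice from the dendrogram.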
So if we drew the line up at three, we'd say this entire group is one cluster. You can actually split it many different ways, and that's a nice flexibility that comes with hierarchical clustering. But with that, like I said, you can make the cut in many different places — what's the best one? There are lots of options. You can cut it by eye — and not just by eye, but by intuition, by what makes the most sense when you look at the clusters after the fact. That's valid: clustering is quite an art form, and I see it much more as data exploration, really understanding the data you have in hand. But another thing you may want to explore is the silhouette statistic, which I'll go over on the next slide. The gap statistic from Tibshirani is another really nice option, and they're all very similar. There are many more you can use too, which I encourage you to check out if you're interested — but if you use these basic ones, you'll be fine. It's more about understanding the data you have and what your clusters represent, similar to what we saw here: if you zoom out and look at the data, this A-B group is pretty far from the others, so it would be reasonable to draw the line up here and say, we've got this group that's far out, and then this group in here. But if it seems like C is a real outlier, it may also be reasonable to cut around here and say: we've got a D-F-E group that seems pretty close, C that's a bit of an outlier, and A and B that are truly on their own. So like I said, it's really about understanding the data you have. [TA] A couple of questions in Slack — maybe you could clarify for everyone.
[Question] What do the X and Y axes mean in the clustering, and which covariates are we clustering on? [Answer] Yeah, these are really good questions — thank you so much, Heather. These two questions are related: the covariates, the data you're clustering on, are what create this distance matrix, and they can be whatever. You could be clustering based on methylation, or on methylation and gene expression together. But in this graph we've got a two-dimensional data set, so it represents clustering on just two features — say gene one on the X axis and gene two on the Y axis. It's just much easier to interpret on two feature axes. The distances then represent the distance between A, defined by its X and Y coordinates, and B, defined by its X and Y coordinates — each point has two features used to create the distance value that represents how far they are from each other. So in this example it's only two features; it could be one methylation probe and one gene expression value. In practice it's usually many — like we saw before with those top gene expression values, it could be your entire gene expression array, or all of your methylation data. It can be whichever features you'd like; this specific example is just two arbitrary axes. Okay, I hope that was clarifying. All right, so the silhouette statistic. I really like this statistic for understanding the intuition of the clusters you've come up with — and as a nice piece of history, it's been around a very long time, since 1987.
It shows graphically how well each "pattern" — here, read that as an observation, a patient, an individual — sits in its cluster. For each patient in a given cluster, we take the average distance to the patients in the nearest other cluster, subtract the average distance to the patients in that individual's own cluster, and divide by the maximum of the two. So one thing right off the top: if an individual is closer to the individuals in the other clusters than to their own cluster, the statistic is negative, and it's essentially saying that maybe that person should be in a different cluster. If you're closer to people in other clusters than to individuals in your own, it's a bad sign. That's where negative versus positive comes in, and then you can look at the scale: if someone is pretty close to the other clusters and only about as close to their own, it's borderline — it's almost like they're not in a cluster at all. An example of that: let's say we put C in a cluster with E, and we look at the silhouette for E. E is actually closer to D and F, which are in a different cluster, so it might make sense to put E in the cluster with D and F. The silhouette statistic for E, if clustered with C, is probably going to be negative, because E is not very close to C but is quite close to D and F in another cluster. Okay, so what silhouette does is compute this individual by individual and produce a plot like this.
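Before looking at the plot, that per-individual computation can be sketched in a few lines of Python. The coordinates and cluster labels below are hypothetical, chosen to mimic the A-F example where C and E get grouped together even though E sits right next to D and F:

```python
import numpy as np

def silhouette(points, labels):
    """Per-observation silhouette: (b - a) / max(a, b), where a is the mean
    distance to the observation's own cluster and b is the mean distance to
    the nearest other cluster. Negative => closer to another cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i, lab in enumerate(labels):
        own = (labels == lab) & (np.arange(len(labels)) != i)
        if not own.any():            # singleton cluster: silhouette is 0
            scores.append(0.0)
            continue
        a = d[i, own].mean()
        b = min(d[i, labels == other].mean()
                for other in set(labels.tolist()) if other != lab)
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical 2-D coordinates: C clustered with E, even though E is next to D and F.
coords = {"A": (0, 3), "B": (0.5, 3), "C": (3, 2),
          "D": (5, 0), "E": (4.6, 0.4), "F": (5, 0.5)}
labels = ["AB", "AB", "CE", "DF", "CE", "DF"]
scores = dict(zip(coords, silhouette(list(coords.values()), labels)))
print(scores["E"])   # negative: E is closer to the D/F cluster than to C
```

E's negative value is exactly the "maybe this person belongs in a different cluster" signal described above.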
The plot shows the silhouette value for each individual. The subtype here is the grouping you've created in your clustering, and each line is one individual. So within each cluster, you can see how much those individuals belong in that cluster — or whether they're actually closer to individuals in other clusters. What we see really markedly here is that in cluster one, a large group of those patients maybe should be in a different cluster: they are not closer to individuals in their own cluster than they are to individuals in other clusters, and that's why they show up as negative. That's why I really like the silhouette statistic — it gives you a nice snapshot of whether the individuals you've clustered are close to the individuals they're clustered with, or whether maybe they should be in a different cluster because they're actually closer to individuals in other clusters. Or, if a lot of values are close to zero, maybe your data is on a continuum and isn't actually grouping well. That's also something you may find when clustering: you're just drawing lines in a continuous spectrum, and the groups aren't held together as tightly as you might expect. All right. So, iCluster. [Question] Okay, how do we reassign for bad values? [Answer] Do you mind rephrasing that? [Question] Sorry, I meant how do we reassign after a bad assignment — do we just reiterate over the process? [Answer] Exactly, yeah, no problem. You just reiterate. Sometimes you may find, if you have many clusters, that you're getting a lot of negative values, and maybe you just need fewer clusters — maybe there's one big cluster that everyone should be in, and then a small cluster that's kind of an outlier.
Yeah — and with the silhouette here, I might actually try merging groups, because if you're getting a lot of negatives, it means individuals are close to another group. You could also try a different random initialization, if it's a clustering method that jumps around a bit — maybe it just got you into a bad grouping. Hierarchical clustering, though, is not random, so rerunning it will not help; you should reproduce the same hierarchy. In that case, you may just want to cut it at a different point — you may find, like I showed before, that this wouldn't be the clustering you'd come up with. And if your groups still look wrong, you could also try a different clustering approach. So yes, you'd really want to iterate in different ways. But most often, if your silhouettes are negative, more groups is not what you want; you'll usually want fewer groups. Thank you. Yeah, of course. Okay, so iCluster. Again, iCluster continues to be developed, so I just want to go over the real basics. Essentially, it's fitting a Gaussian latent variable model — a mixture of Gaussians. A mixture of Gaussians has some kind of initialization — alarm going off here, sorry — and you can find that when you rerun it, you get slightly different clusters out. What iCluster is doing is building up that latent variable model using data from different sources, trying to find the same latent groups across different data types. And it uses sparsity regularization: if you're familiar with the lasso, it's basically trying to drop out features that are not useful, that are not actually contributing to the grouping you're developing.
And then the latent variable is shared — that latent variable is the grouping. The groups that are assigned are shared between the different data types, and the model iterates so that you're finding the same latent groups. The key thing is that they're shared: you're looking for information that is consistent across all of these different data types, groupings that come up as consistent between them. A drawback to that is — sorry, a question? [Question] What are latent groups — is that like collinearity? [Answer] Good question. Latent groups are different from collinearity. Latent groups really are groups — subtypes, clusters. What iCluster is trying to do is find clusters of patients that are shared between these different data types. If we took the GBM example from before: finding those two clusters, but finding them through the copy number data, the mRNA data, and the microRNA data — finding that consistent grouping between them. That would be two latent groups, and it wants to find those two groupings in all of those data types and have them reinforce each other. So it's different from collinearity that way, because it's not essentially the correlation between features. Awesome. So, there's a lot of manual preprocessing for iCluster, because you have these different data types and you can only load so much in memory — it takes about 1,500 genes, and I'm sure you all have a good enough grasp of the genome to know that's a very small number relative to the whole. So you're doing major feature selection upstream, and that will dictate what you find downstream: depending on how that feature selection was done, you're setting your course for what your results may be. There are many steps in the pipeline for that, and it's mostly done in the feature space.
So I'm actually going to read this one, because I'm trying to remember the point I was making. It's not combining the features to find a grouping — it's finding the groupings in the individual feature sets. So if there's a grouping you would only get by combining the features, you will not find it here. It has to be a grouping discoverable from, say, methylation and microRNA each on their own, without combining them; they just reinforce each other. And this is really important: it focuses on similarity across data types, which is a similar point to the previous one. What if there's complementary information? What if there is a grouping of patients that is very obvious in microRNA but not obvious in your methylation data? That complementarity will not be found; it won't be used. To me, that's a major constraint — and it's exactly what Wang et al. set out to improve with similarity network fusion. This is work by Bo Wang. He's actually faculty at Princess Margaret, in — oh shoot, now I'm going to butcher his department — but needless to say, Bo is back: he left us for Stanford, and he's now faculty at U of T. He developed similarity network fusion when he was in our lab, in 2014. The idea is that instead of using the features as they are, you first create patient similarity matrices, and then find the complementarity and the similarity between those matrices by combining them through fusion. I know those are a lot of airy-fairy words, so I'm going to go through it. Step one is to create data-type-specific similarity networks. Before, I talked about distance matrices; similarity matrices are essentially the inverse of that. Where before the diagonal of the distance matrix was zero — oh, thank you.
And that's in Laboratory Medicine and Pathobiology — for some reason I thought he was in Medical Biophysics, so I'm glad I didn't say it. So, the similarity matrix shows the most similarity on the diagonal, because an individual is most similar to themselves; it inverts the logic of the distance matrix. Here we've got our gene expression data, and instead of making a distance matrix, we're making a similarity matrix — we'll be doing this later in R. And what's important, since I'm going to be representing the similarity matrices as networks, is that these are the same thing: this heat map showing the similarity matrix is the network. When there's no connection drawn, the similarity is zero or very, very small; when there's a thick connection, the two patients are very similar, very closely related. We can think about these matrices as networks — you worked with Cytoscape yesterday, and similarity networks can be loaded into Cytoscape and graphed that way too. So here, we've got two networks for two different data types. What happens in the fusion iterations is that these matrices are sparsified — only the most similar individuals, the most similar linkings, are kept — and then, essentially by matrix multiplication, each one is multiplied through the other and updated. This is done multiple times, and that's what this is representing: the sparse matrix of one data type is multiplied through the other data type and updated, again and again, until they aren't changing much anymore — they converge to a matrix that isn't updating. And then there's just a linear combination at that point, because they're no longer changing and they're quite similar to each other.
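Here's a rough Python sketch of that fusion loop — an illustration of the idea, not the exact SNFtool implementation, and the two similarity matrices are made up:

```python
import numpy as np

def row_normalize(w):
    """Scale each row to sum to 1 (a transition-probability view of similarity)."""
    return w / w.sum(axis=1, keepdims=True)

def knn_kernel(w, k):
    """Sparsify: keep only each row's k strongest similarities (local structure)."""
    s = np.zeros_like(w)
    for i, row in enumerate(w):
        keep = np.argsort(row)[-k:]       # indices of the k most similar patients
        s[i, keep] = row[keep]
    return row_normalize(s)

def fuse(w_list, k=2, iterations=20):
    """Sketch of the SNF update: each data type's matrix is repeatedly diffused
    through the average of the other data types' matrices, and the converged
    matrices are finally combined linearly."""
    p = [row_normalize(w) for w in w_list]
    s = [knn_kernel(w, k) for w in w_list]
    for _ in range(iterations):
        p = [row_normalize(s[v] @ (sum(p[u] for u in range(len(p)) if u != v)
                                   / (len(p) - 1)) @ s[v].T)
             for v in range(len(p))]
    return sum(p) / len(p)

# Two hypothetical 4-patient similarity matrices: both link patients 0-1 and
# 2-3, but with different strengths (complementary evidence).
w1 = np.array([[1, .9, .1, .1], [.9, 1, .1, .1],
               [.1, .1, 1, .2], [.1, .1, .2, 1]], dtype=float)
w2 = np.array([[1, .3, .1, .1], [.3, 1, .1, .1],
               [.1, .1, 1, .8], [.1, .1, .8, 1]], dtype=float)
fused = fuse([w1, w2])
print(fused.round(3))
```

In the fused output, the 0-1 and 2-3 blocks reinforce each other across data types, while the low-information cross-block links shrink — the "links move to more extreme versions of themselves" behavior.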
And that results in your fused similarity network. In effect, it passes information between these two networks to create something that includes the complementary information. Say there's a very strong signal in this network — these patients very tightly linked — while in the other network the same link is weak: if it's exactly zero it may be dropped, but if it's small it won't be totally dropped, it will actually get reinforced, so that complementary information is included. And where there's a strong link here and a lighter link there, it gets reinforced too — that's shared information that gets boosted through SNF. But where there's only light, low-information linkage from both data types, those links end up getting dropped: the two patients aren't that related, and all the data types agree they're not that related. So all the links are being moved to more extreme versions of themselves, and you get a stronger signal from your data as a whole after running it through this procedure. So, here's a case study, and this is actually from the SNF paper. They fuse methylation data, gene expression data — mRNA expression — and microRNA expression data, and from each of these they get a similarity network. I'm going to take a moment to go more in depth on how you would actually interpret these. In this top one, we see pretty nice groupings — a block structure — and that's what you want to see in a good similarity network: very strong blocking of your different groups. These are patients here.
And so we see this group of patients is quite similar to each other, this group quite similar to each other, and this group dissimilar from the other groups but not super similar to each other either — they're kind of the "other" group. Okay. If we compare across data types, the microRNA data is not so structured: individuals are sort of all over the place, though this does seem like a bit of a cluster here, and these ones seem to cluster somewhat — but honestly, this whole section is only loosely related to each other. In terms of how I would interpret this: you're not getting great clustering out of the microRNA expression, but there may be some consistent group — maybe this small group down here that's consistently grouping together — and that will probably get boosted in the similarity network fusion. And that's what we see: we get this really tight group down here, and this group that tightens up here, which is actually combining information from all of these. Some of the small groups get combined to create it — in the gene expression data there seemed to be many more little clusters, and the fusion groups them all together. And then we have one big group that, again, is kind of our "other" group: they're the lightest color, they have the least similarity to each other. So again, the interpretation is: these patients are pretty heterogeneous, while these are quite homogeneous — very tightly linked, based on the brightness and the distinctness of that block. This figure here was actually done in Cytoscape, and it represents exactly this.
What's nice is they went ahead and colored the linkages between the patients based on what's actually supporting each linkage, so what's supporting the similarity. We can see there's a big clump here that seems to be supported by the gene expression and methylation data, so it seems like the microRNA data is not contributing a lot for a subset of these patients. There are some that it contributes to, though very few are getting the full contribution from every single data type. So the nice thing about SNF is it can add information from different data types and draw away from low-information data types without really worrying about it. When they went ahead and looked at the clinical properties of these subtypes, they were happy to find that this subtype here in particular has a longer survival time, so there's quite a distinction clinically between the groups they're finding. They also found that the age is lower, so it's a younger group of patients. And finally, the treatment response is really different for one of the subtypes as well: in subtype one, the treated versus untreated groups are really distinct. I'm not too sure what this last panel was about; it must be the treatment response, though it's only shown in subtype one, which would be this big group here. So, some advantages and disadvantages. We've talked through them, but to list them out: one great thing is you get integrative feature selection, and you can grow the network. That said, growing the network requires additional work, because you're doing everything in network space; when you have more patients, you're making it a much larger computational task, so I would say the computational limitation here is the number of patients. The last example I showed you had 200 patients, and what we're going to do today has 200 patients.
I've tried it with 5,000 patients, and it takes a long, long time to run. Bo is working on solutions to expedite this, but that's just something to bear in mind. It's also unsupervised, so even though I'm linking this all to supervised kinds of outcomes like survival or patient age or treatment response, at the base level SNF is unsupervised, and I think it's really important to set your expectations about what you're going to get from it: it's describing your data. One big thing is that if you have group effects or batch effects in your data, that's probably what you're going to find using SNF. It's a very effective way to find all the problems with your data, so I also recommend it for just exploring your data, even if it's not the main part of your analysis: finding batch effects, or effects that are consistent across a lot of your data types, which may really mess up your analysis down the line. I think it's very good to check your data for that. On the positive side, it creates a unified view of patients from multiple heterogeneous sources. Here these are all omics data types, but we're going to have non-omics data in the data integration we do in the lab. You could really include any kind of data here. Say we also had imaging data on these patients, some radiomic features from imaging done on their brains; then we could integrate that as well. It could be its own additional feature set: we make a similarity matrix for it, we fuse it, and we have an even larger fusion. So you don't have to have everything on the same omics scale; you can integrate totally different data into this, which is a huge flexibility.
And, repeating what I just said, there's no need to do gene pre-selection, because you're doing everything in the patient similarity space. The number of genes doesn't matter; it just matters that you get that similarity matrix out from those genes. That said, while you don't have to, in practice it is sometimes useful to do this pre-selection if that's what your actual research question is. If you're really interested in a specific pathway, subsetting your genes to that pathway, or subsetting all of your data to only the omics data in that pathway, would be useful to do, because it's what you're actually looking for. Similarly, in work I've done using similarity network fusion in neuroimaging, if there's a pathway or a certain circuit that you're interested in, including all the extra data, all those extra measurements from other parts of the brain, is not useful; it adds information that you don't actually care about. What I found in that analysis was that including everything gave me really broad-scale patterns that were like sex, or age, or something: really big things where I thought, okay, that's not really what I'm getting at here, I want something more clinically relevant. By making it only about specific circuitry, a specific portion of the brain, we were able to find much more interesting patient subtypes that had more to do with actual clinical features beyond just age and sex. So you may find that doing a pre-selection of some kind is very useful in practice for what you want to do, but you don't have to, and there's no computational limitation saying that you should. It's also robust to different types of noise. I didn't show it here, but it's pretty cool: it's essentially able to do a denoising. Say you have the same data with different types of noise added to it.
When you do fusion on it, it denoises it, so it's pretty nice for finding signal through noisy, diverse data. Question: after the network fusion, can you extract the features that separate the different groups? Yes, that's the integrative feature selection. What you can do here, Astrid (this is a great question), is the following. Here I'm showing different non-omics data, where they went through and evaluated how related these different clinical features are to the clusters that were found. Similarly, you could go through everything that was used to create the clusters and see how those features relate to the clusters you found. What I mean is essentially a t-test, or Welch's two-sample test, or a Kruskal-Wallis test, asking: are the means of each probe different between these groups? Then you rank the features by their p-values, and you find which features are driving the clustering, in essence. You can do that by hand: you just write a loop and go through. I don't think I'm going to show you that exact procedure in the R script today, but it should be easy to do. You essentially go through and test every single feature that you put through SNF and ask, is there a difference of means, or a difference of group counts if it's categorical data, between these groups, and then rank them. That's how you would see which features are really driving the clustering. There was also a great question about SNF working with missing values. That paper is forthcoming; I know Bo and a couple of students are working on it, and it should be out or submitted by the end of the summer, but right now, no, no missing data is allowed.
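As a sketch of that feature-ranking loop, here is a small pure-Python version that ranks features by the absolute Welch's two-sample t statistic between two clusters. This is a stand-in for the proper p-value ranking described above (in practice you'd use the R tests named there, or `scipy.stats`); the toy data, cluster labels, and function names are all made up for illustration.

```python
import math

def welch_t(xs, ys):
    # Welch's two-sample t statistic (unequal variances)
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    se = math.sqrt(var(xs) / len(xs) + var(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se

def rank_features(data, labels):
    # data: rows = patients, columns = features; labels: cluster id per patient
    n_features = len(data[0])
    scores = []
    for f in range(n_features):
        g1 = [row[f] for row, lab in zip(data, labels) if lab == 1]
        g2 = [row[f] for row, lab in zip(data, labels) if lab == 2]
        scores.append((abs(welch_t(g1, g2)), f))
    # biggest separation first: these are the features "driving" the clusters
    return [f for _, f in sorted(scores, reverse=True)]

# toy data: feature 0 separates the clusters, feature 1 does not
data = [[5.0, 1.0], [5.1, 1.2], [4.9, 0.8],   # cluster 1
        [0.0, 1.1], [0.1, 0.9], [-0.1, 1.0]]  # cluster 2
labels = [1, 1, 1, 2, 2, 2]
ranked = rank_features(data, labels)  # feature 0 ranks first
```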
You can impute your data, though with imputation it's important to know who was imputed when you evaluate things downstream. What I've found sometimes is that I'll do a median imputation, because I want a not-very-informative imputation, and then I'll find that one group is sort of my "other" group, and they're just the imputed group that didn't have a lot of information, maybe low-sample patients in my set. So you can do that. Another thing is that if you're doing imputation, you may be helping drive the similarity: if you're imputing based on other features from other patients and then predicting the missing value, you are contributing to the similarity between the patients that way. So that's where I would caution against an informative imputation. The most recent work coming out will have a different imputation method that lets you overcome that, but right now I would say it's a big constraint; missing values hurt a lot. Next question: when using omics data, what methods are available for scaling different data sets? Ah, yes. In terms of scaling, you're always going to standard-normalize within a data set, but the scale between the different data sets, as I should show here, is actually the similarity space, so it's dealt with. For example, in the case you gave of RNA-seq and flow cytometry, you would have an RNA-seq data set here and a flow cytometry data set here, and the different scaling between the two sets is dealt with by transforming each into a patient similarity network. So you don't have to worry about scale differences between data types.
But if you have continuous data, you will in general want a normalization procedure within each data type. I think about this with gene expression: certain genes are just expressed more, you just have many more copies of them, so they're distributed on a different scale than genes you observe less of. You may want to do a standard normalization of those, because you want to ask, is this gene relatively over-represented? If you don't scale, the genes that are expressed at higher values will impact the distance between patients much more, so you'd effectively be saying those genes are more important for determining how similar patients are to each other, and the genes not expressed at high levels are less important. That's what it would essentially do. If you normalize, you're saying these all have equal importance, and asking, for each gene, are you relatively over-expressed or under-expressed? So if you do that normalization within your data types and then do similarity network fusion, creating these similarity networks, the scaling difference between different data types is totally handled: there's no scale difference because everything is now on the patient similarity scale. Now, I showed you before how a lot of times you do your clustering and then check, after the fact, whether you've distinguished groups that are surviving at different rates. There's no intrinsic link between clustering and survival analysis, but because they're combined so often, I think it's really nice to go over how they're connected. And then in the lab, we're going to go over that in an implementation sense.
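Before moving on to survival, here is a minimal sketch of that within-data-type normalization step: z-score each gene, then build a patient similarity matrix. The Gaussian affinity kernel and the toy expression values are illustrative assumptions, not the exact kernel SNF uses (in R you'd call the SNFtool helpers instead).

```python
import math

def zscore_columns(X):
    # standard-normalize each feature (column) to mean 0, sd 1
    n = len(X)
    out = [row[:] for row in X]
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        m = sum(col) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for i in range(n):
            out[i][j] = (X[i][j] - m) / sd
    return out

def affinity(x, y, sigma=1.0):
    # Gaussian kernel on Euclidean distance -> a patient-patient similarity
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

# toy expression matrix: gene 0 lives on a much larger raw scale than gene 1
X = [[1000.0, 1.0],
     [1200.0, 2.0],
     [900.0, 3.0]]
Z = zscore_columns(X)
# after z-scoring, both genes contribute on the same scale to the distance
W = [[affinity(a, b) for b in Z] for a in Z]
```

Without the z-score step, gene 0's raw scale would dominate every pairwise distance; after it, both genes weigh in equally, which is the point made above.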
So: do your clustering, get your clusters, then do a survival analysis and evaluate your clusters based on some survival-based outcome. We're going to talk about survival data, hazard rates, survival functions, the Kaplan-Meier estimator, which we already looked at, and the log-rank test, with an example we're going to go through. This was previously cut, but I'm going to keep it; it's a little bit of math. And then the Cox proportional hazards model, which I'm sure many of you are familiar with; it is, I would say, by far the most common model used, even though the accelerated failure time models are also very popular, and I think very famously preferred by Cox himself. So if you find you're dissatisfied with the Cox model, even he would support you using an accelerated failure time model. I'm not going to go over those models today, but I really encourage you to look them up; there are great tutorials on the theory of those models, and the implementation in R is very basic, so if you can do the Cox modeling, you can do accelerated failure time modeling just fine. Now, survival data actually has multiple components to it. The first component is the time to an event, and in the simple setting it's one event. There are many types of survival modeling where you can have multiple events, or competing events, considering the fact that people live multifactorial lives: they don't die of one type of cancer and nothing else ever happens to them. But in its simplest form, you have one event, and you assume that over the long extent of time, at some point, that event will happen. And some data on patients may be missing, so, as we talked about before, those are censored observations. It could be that they've been lost to follow-up.
The event occurred at some point, but you never got to see it, so you know they survived at least a certain amount of time, but you don't know when the event actually occurred. We call those censored data. Uncensored data is when we observe the actual death time. For censored data, what we do know is that the event is beyond that time, and again, this is an assumption of survival modeling: you're assuming that at some point the event you're interested in actually happens, you just didn't see it. Really importantly, in creating a survival model, if you have censored data, you have to assume that the censoring is not informative. It could be that all the censored patients actually live much longer, or that they're all correlated with each other in some fashion; this must not be the case for the censoring assumption to hold. You want it to be like, this person randomly moved out of town, not, oh, this whole group of people went through a totally different drug trial, they all have a really different experience, and we've only followed them up to this point. So you need the censoring to be uncorrelated with the outcome, essentially. Now, your vision didn't just blur, this slide really is a bit blurry, but I really like this figure; at some point I'll make one myself. We have different cases, so these are individuals, and this is the survival data we're observing here. We have the days to last follow-up for the patients where we didn't actually get to see their event by the end of the study. So this would be a censored observation: we'd say six years is the event time for them, but the event indicator would be zero.
So we would say that at time six, we saw no event for patient four, but they lived at least that long. Whereas for patient five, we'd say they lived one year, and we observed that event: we observed that they died within a year. For patient three, we observed they died at four years. So these are our censored observations, and they're both censored at six years, though in general censored observations will be at different time points. There are two very important statistics here that I've been alluding to. The time to the event is reflected in your survival function, like we were seeing in the KM curve: the survival function is the probability of a person being alive at time T or any time beyond T. Whereas the hazard rate is the probability of an event happening at time T. Where the slide says "in the next instant," if you think about a derivative, it's looking at that point in time and asking, what's the likelihood of an event occurring right now? What's really nice is that if you look at your KM curve alongside a hazard curve, you can see how the likelihood of an event changes over time. Maybe there's a huge probability of death at one year, but after that year it goes down quite a bit; the hazard curve shows these transitional probabilities over the time course. Some examples of hazard rates: a constant hazard rate is like no aging, the idea being that you're equally likely to die at any point, whereas what we actually know is that as you age there's a higher hazard of death, though there's also a pretty high hazard of death very early in life, particularly in certain groups of patients. A positive hazard rate means the older you are, the more likely you are to die; essentially, this is what we're mostly familiar with.
And a negative hazard rate means the hazard is highest at birth, so if there's high infant mortality, the rate goes down as you age. We see this in certain populations, as I said before, where the likelihood of death very early on is high, and as you survive past those time points, your likelihood of dying becomes quite low. We can see this in the KM curve. It's not showing the hazards, so to speak, but it's reflecting the same data: it's plotting our survival function. Again, this is ancient from the perspective of everything else we're looking at here, but the KM estimator has stood the test of time, and it's just plotting your data. That is the survival curve: the probability that a member of a given population will have a lifetime exceeding T. You have the number of people at risk, which is just the number of people in your data still in the risk set at time T, and the number of actual deaths at time T. Then you say, of the people who were at risk, how many died, and therefore how many are surviving, and you aggregate that over time; that's what's represented here. This is what we were talking about before, where you can say, what's the probability you will be alive at 500 days, given each of the groups? We would say it's about 90% if you're in the experimental group and 100% if you're in the standard group. You can also look at it from the percentage side, say the median: the median follow-up time here for the experimental group was 1,000 days, and for the standard group it was actually a bit over 1,500. So you can slice this both ways. All right, now I'm going to go through a test you can do, and we're actually going to do this test later in R, but I want to go through the math of it, because I think it makes a lot of sense, and it's good to have this slide as a reference. Let's say we're comparing two groups, like we often are with survival.
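Before the log-rank math, the product-limit calculation just described can be sketched in a few lines. The `kaplan_meier` helper and the five-patient data set below are hypothetical; in practice you'd use `survfit` in R (or `lifelines` in Python). Note how censored patients (event = 0) still count toward the at-risk number until their censoring time.

```python
def kaplan_meier(times, events):
    # product-limit estimate: at each distinct observed event time t,
    #   S(t) = S(previous) * (1 - deaths_at_t / at_risk_at_t)
    s = 1.0
    curve = []
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for ti in times if ti >= t)   # n still in the risk set
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

# five patients; the ones at t=3 and t=5 are censored (event indicator 0)
times = [1, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
# survival steps down to 0.8, then 0.6, then 0.3 (up to float rounding)
```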
Here, j indexes the distinct event times across either group. For each j, we've got the number of people at risk: that's just the number of people, the same n as before, in each group, and the overall n_j combines the two groups, group one and group two. The O's are the observed numbers of events. So the n's are the total numbers of people at risk, and the O's are the actual events, with O_j being the total across both groups. The null hypothesis is that the two curves are the same, that the risk of an event in either group is no different, and that's what we want to test with the log-rank test. Given that O_j events happened, given there are events across both groups at time j, we know the number of events in one group is hypergeometric, and we can take its expected value. This expected value looks kind of weird at first, but all it's saying is: take the overall proportion of events, the overall events out of the overall number of people, and multiply it by the number of people in your individual group. That's just the expected value for a single group. The variance term is a bit more complex, but needless to say it's also derived from the hypergeometric distribution. From this, we can get a Z score, summing over the distinct event times, to see how far our observed values are from expected. And here it's not the overall O_j, it's the actual in-group observed value: is the event happening more than we would expect it to? It's standard normal distributed, very conveniently, and so from this you can see, am I observing more events in my group than I would expect, given that the null hypothesis is that all the groups have the same likelihood of an event happening?
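Those expected-value and variance formulas can be turned into a small two-group log-rank sketch. The chi-squared form below (the square of the Z statistic) is the standard construction, but the function name and toy data are illustrative; in the lab you'd call `survdiff` in R rather than hand-rolling this.

```python
def logrank_chi2(times1, events1, times2, events2):
    # two-group log-rank: at each distinct event time, compare the observed
    # events in group 1 (o1) with the hypergeometric expectation (e1)
    all_times = sorted(set([t for t, e in zip(times1, events1) if e == 1] +
                           [t for t, e in zip(times2, events2) if e == 1]))
    diff_sum, var_sum = 0.0, 0.0
    for t in all_times:
        n1 = sum(1 for ti in times1 if ti >= t)   # at risk in group 1
        n2 = sum(1 for ti in times2 if ti >= t)   # at risk in group 2
        nj = n1 + n2
        o1 = sum(1 for ti, e in zip(times1, events1) if ti == t and e == 1)
        o2 = sum(1 for ti, e in zip(times2, events2) if ti == t and e == 1)
        oj = o1 + o2
        e1 = oj * n1 / nj                          # expected events in group 1
        v = (oj * (n1 / nj) * (1 - n1 / nj) * (nj - oj) / (nj - 1)
             if nj > 1 else 0.0)                   # hypergeometric variance
        diff_sum += o1 - e1
        var_sum += v
    return diff_sum ** 2 / var_sum   # ~ chi-squared with 1 df under the null

# identical groups -> statistic near 0; clearly separated groups -> large
same = logrank_chi2([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
apart = logrank_chi2([1, 1, 2], [1, 1, 1], [5, 6, 7], [1, 1, 1])
```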
And from that, you can get your p-value from a lookup table. These are easily computable values, standard normal distributed, and then you can compute how likely it is, under the null hypothesis, that the distributions are the same. It's quite classical statistics. Again, sorry for a bit of blur here, but what I wanted to show you is the hazard ratio, and this is what's being used in Cox modeling. The hazard ratio is taking observed versus expected, the same statistics from the previous slide: our observed events in group one versus our computed expected events in group one, under the null hypothesis that the groups have the same likelihood of an event. The hazard ratio looks at the relative risk of having an event, relative to the expected likelihood of having an event, in each group, and compares them as a ratio. That is to say, if you have a hazard ratio of 0.43, the relative risk of a poor outcome under the condition of group one is 43% of that of group two, so group one has a lower risk than group two: 57% lower. If the hazard ratio is two, then group one is twice as likely to have an event as group two; the numerator is two times the size of the denominator. That's how these are interpreted. Now we come to the Cox model, and what it's modeling is actually this hazard rate. The Cox regression captures how multiple variables affect survival. It gives the hazard for a given individual at a given time. There are multiple ways to compute the baseline hazard here, but it can be thought of as kind of like an intercept: everybody has this baseline hazard rate, and the model says, given your covariates, how does this baseline hazard change?
What's really important is that the covariate term is exponentiated and multiplied by the baseline hazard, so it's a multiplicative effect on that hazard. Here, h_0(t) is the baseline hazard, and the predictors scale the hazard multiplicatively through the exponential. In the Cox model, each coefficient beta_j represents the increase in log hazard for a one-unit increase in that predictor, holding all other predictors constant. Equivalently, the hazard ratio for a one-unit increase in x_j is exp(beta_j): take one unit of increase in x_j, and you get a factor of exp(beta_j) multiplied onto your baseline hazard. If your coefficient is less than zero, then increasing x decreases the hazard, meaning longer survival times, similar to the interpretation before: if your hazard ratio is 0.43, it's a lower risk of an event, meaning longer survival. So, let's talk about using and interpreting Cox PH. Say we have the hazard for subject i with a set of predictors x_i, and we're comparing it to subject j with predictors x_j. When you compute the ratio of their hazards from your Cox model, the baseline hazard cancels, and the ratio between the subjects comes down to the exponent of beta times the difference in their predictors. So it's just like linear regression in that sense, and it can be interpreted as a percentage change in risk, similar to what we were talking about before we even brought up Cox: the hazard ratio is a percent difference between groups.
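A tiny worked example of that multiplicative interpretation; the coefficient and baseline hazard values here are made-up numbers chosen to match the treatment-versus-placebo, hazard-ratio-of-0.8 case on the slide.

```python
import math

beta = math.log(0.8)   # assumed fitted log hazard ratio for treatment
h0 = 0.05              # assumed baseline hazard at some time t

# Cox model: h(t | x) = h0(t) * exp(beta * x)
h_treated = h0 * math.exp(beta * 1)   # x = 1: active treatment
h_placebo = h0 * math.exp(beta * 0)   # x = 0: placebo, exp(0) = 1

hazard_ratio = h_treated / h_placebo   # baseline cancels, leaving exp(beta)
pct_change = (1 - hazard_ratio) * 100  # 20% lower hazard on treatment
```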
So if x is one when the treatment is active and zero when the treatment is a placebo, and the hazard ratio is 0.8, it means there's a 20% decrease in mortality risk using the treatment compared to placebo. The beta is activated when x equals one, and when x equals zero it's essentially not activated, so for the treated group you're multiplying your baseline hazard by 0.8. All right, so how would you actually evaluate your survival model? One way is the concordance index. Concordance captures the ability of your model to order the individuals correctly with respect to their survival time; it's essentially asking whether the model places them in the right order as to when the events happened. For the way it's computed, we're working over pairs of individuals. The denominator is just the number of all usable pairs. Over all pairs of individuals, any pair that's ordered correctly gets a one, and any that are tied, at the same time point, are not ordered correctly or incorrectly, so they get half a point. This computes how well ordered your patients are. Note that you could be way off in your survival time predictions and still have very good concordance; that's something to consider when you're using this metric, but it's the only metric that actually captures the ordering of your individuals. You can evaluate your model on its ability to get the time right, but if you want the order right, if you want to be able to say this person is more likely to die early and this person is more likely to die later, concordance is really going to capture that performance for you. Okay.
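The pair-counting just described can be sketched directly. `concordance_index` is a hypothetical helper taking predicted risk scores (higher risk should mean earlier event); real analyses would use `concordance` in R's survival package or `lifelines.utils.concordance_index`.

```python
def concordance_index(risk, times, events):
    # a pair (i, j) is usable when the earlier time belongs to an observed
    # event; score 1 if the higher-risk patient failed first, 0.5 for a tie
    # in the predictions (neither right nor wrong, so half credit)
    usable, score = 0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    score += 1.0      # correctly ordered pair
                elif risk[i] == risk[j]:
                    score += 0.5      # tied prediction
    return score / usable

# four patients who die at t = 1, 2, 3, 4 (all events observed)
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
perfect = concordance_index([4, 3, 2, 1], times, events)    # 1.0
reversed_ = concordance_index([1, 2, 3, 4], times, events)  # 0.0
```

As noted above, a model can get every survival time wrong by a wide margin and still score 1.0 here, because only the ordering of the pairs is checked.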