Hi, everyone. My name is Lauren Erdman. I'm actually doing my PhD, working with a data science group at SickKids. I have a master's in computer science and a master's in biostatistics, and I focus on a lot of problems, particularly applying machine learning to genomic and genetic data and also to clinical data. A lot of what we do in my supervisor Dr. Goldenberg's lab is integration of all these different data types. So here in this first slide you can see many of them, and these are just a subset of all the data types we work with. We thought it would be appropriate to end the whole workshop with this, because you've seen a lot of different data types, a lot of different uses, and a lot of different methods, and I'm hoping to tie it all together and talk about the implementation of this, how you can apply it to your own data, and how we've done it ourselves as well.

So first, some learning objectives. We're going to go over some clinical data that we work with, do a quick review of single data type analysis just to place, or benchmark, integrative analyses in that framework, and then I'm going to talk about different data integration methods, a subset of them. There are so many different ones, so by no means is this an exhaustive list. Then we'll discuss some advantages and drawbacks, and maybe you'll come up with some of your own beyond what I've listed. Then we'll go over some survival analysis, because that's been spoken about a lot, and in the lab we're going to implement both the data integration and the survival analysis, so you can see more concretely how it's done and how you might apply it to your own work.

So first, clinical data. There's so much, and again this is by no means exhaustive, but in this context this is what I'm referring to when I say clinical data: things like sex, family history, tumor staging and size, age at diagnosis, and time to recurrence, for example. Oftentimes you want to integrate these to get some useful prediction or some understanding of the underlying disease. Here's just one example, but there are many of these available: a prediction tool for breast cancer survival, where you can input all these different details about your breast cancer and see how likely you are to survive at different time points. There's so much different patient data, though, that it's a shame not to leverage all of it. One example, which was talked about even by Molly in the previous talk, is The Cancer Genome Atlas. They have all different kinds of data, and theoretically it would be very useful to integrate it and understand what's going on at a more meta level that involves all of these different things, because they're all part of the same system, right?

So why would we do this? Here are a few points. One is to identify more homogeneous subsets of patients: if you can cut your patients more ways, you can find smaller subsets, and subsets that are homogeneous across many different data types, so subsets that are defined by a similar methylation pattern or a similar pattern in gene expression. Maybe you want to ensure that both of those agree, so you want to integrate both of those data types into identifying those subtypes. The other is to help do better prediction, and we'll discuss of course what that would involve.
So, just some single data type analysis; this has been talked about a lot during this course, even in the last talk. You've got a single data type and you want to see whether it splits your patients in some meaningful way. This is a classic: we've got a heat map of gene expression, we perform hierarchical clustering, and we see there's a difference; you could test that with a t-test if you want, and you could also look at survival curves, which is another way to ask, does this differentiate my patients in some clinically important way? We're all very familiar with this, it's classic.

To segue from this into the survival analysis: when we're using these in a survival context, what we're evaluating is these two KM curves. Robin earlier was talking about the Kaplan-Meier survival curves, and I want to go more concretely into how to interpret them and what they mean. So here we've got GBM group one and GBM group two, and in the previous slide you saw that these were defined by the clusters discovered through hierarchical clustering of GBM gene expression. In group one, can anyone tell me if their prognosis, their survival, is better or worse? Better, yes. The way you can tell is to look at any horizontal line you draw, the easiest being the median line. The key is that at that line, half the people have died, or more generally have experienced an event. In one group that happens at not even half a year, and in the other group they've survived much longer: half of them have died just after two years. So it's easy to see, if you slice it that way, that one group has a way better prognosis than the other, so the clustering seems to have picked up on some signature that's meaningful for survival. Another way to look at it is to slice it at a year: at one year, what share of group two is surviving versus what share of group one? It's about 80 or 90% for group one, and group two is terrible, around 20%, so obviously a much worse prognosis.

Another interesting and actually very important aspect of survival analysis is the ability to integrate censored observations into your data. These are partially missing observations: no event has occurred that you've observed, but we can still use the information that the person survived up to that point. This makes survival analysis quite powerful for predicting prognosis, and we'll talk about it in more depth in the survival analysis section.
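To make the reading of a KM curve concrete, here is a minimal sketch using the lifelines package (an assumption on my part; any KM implementation would do). The times, event indicators, and group labels are made up; the point is just pulling out the median survival and the one-year survival probability for each group, the two ways of slicing the curves described above.

```python
# Reading median survival and one-year survival off KM curves, with lifelines
# (assumed available); times, events, and group labels are made up.
import numpy as np
from lifelines import KaplanMeierFitter

time  = np.array([0.3, 0.5, 1.2, 2.4, 3.0, 0.2, 0.4, 0.6, 1.0, 1.5])
event = np.array([1,   1,   0,   1,   0,   1,   1,   1,   0,   1])   # 0 = censored
group = np.array([1,   1,   1,   1,   1,   2,   2,   2,   2,   2])

for g in (1, 2):
    mask = group == g
    km = KaplanMeierFitter()
    km.fit(time[mask], event_observed=event[mask], label=f"GBM group {g}")
    # Median survival: the time at which the curve first drops to 0.5 or below.
    print(g, "median:", km.median_survival_time_)
    # Survival probability at one year, read off the fitted step function.
    print(g, "S(1 year):", km.predict(1.0))
```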
So, another option is single-data-type-driven integration. It's not a single data type analysis, but it's an analysis that's driven by a single data type initially. Maybe you've measured gene expression, so you have your mRNA here, and you discover some subtypes; then you look at mutations and decide to add more genes into your gene expression analysis; then you decide, oh, but there's copy number variation as well, so I'm interested in the gene expression of those genes too. So you have many different data types informing which genes you include, but at the end of the day it's still just gene expression. There are ways to integrate in this fashion, where it amounts to a filtering process, but at the end of the day it really is a single data type you're analyzing. In this example they didn't include methylation data, and the problem is that when you're doing this filtering for each individual data type, saying, okay, I only want to analyze gene expression but I want to filter my genes and add them iteratively, you're not including information from the other data types that may be important. So here, even though you're using the CNV information to decide which genes' expression to include, you're not directly including that CNV information, and likewise with mutations: you're not directly including them, you're just filtering through them. It would be nice to have a more direct way to include them. Another thing people have done is the same clustering they did on gene expression, but on methylation, and then compared it against the gene expression clustering. So again, there are many ways people have done this in a piecewise fashion.

Now I'm going to talk about more direct integration approaches, where people take the data from different data types, actually put it together, integrate it into a single analysis, and then take that forward to associate it with some prognostic indicator. We're going to go through concatenation and clustering, then iCluster, a very famous one, and then similarity network fusion, which was created in our lab. They're all different flavors of a very similar problem being solved in different ways.

So, concatenation. This one's really simple: you take your two, or however many, data types, squash them together, and cluster the result. I'm going to quickly talk about hierarchical clustering here: you concatenate, and then you hierarchically cluster. When you have a hierarchical clustering problem, you generate a distance matrix, and the idea of the distance matrix is that each entry compares two people. So here you have D and F: person D and person F are more similar to each other than any other pair in here, this is the lowest value. They're going to be the first pair that's clustered together, and that's what hierarchical clustering does. Pairwise, it looks for who is most similar, and it puts them together as it goes, so it actually doesn't cluster at all, it just creates a hierarchy of who's most similar to whom.
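As a rough sketch of the "concatenate, then hierarchically cluster" idea just described, here are two toy patient-by-feature matrices with rows in the same patient order; the standardization step is my own addition, so that one data type doesn't dominate purely by scale.

```python
# "Concatenate, then hierarchically cluster" on two toy data matrices.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expression  = rng.normal(size=(6, 100))   # patients x genes (toy)
methylation = rng.normal(size=(6, 50))    # patients x probes (toy)

z = lambda x: (x - x.mean(axis=0)) / x.std(axis=0)
combined = np.hstack([z(expression), z(methylation)])   # squash them together

# Pairwise patient distances, then agglomerative clustering: the two closest
# patients are merged first, then pairs and groups, building the hierarchy
# (a dendrogram) rather than a fixed set of clusters.
dist = pdist(combined, metric="euclidean")
tree = linkage(dist, method="average")

# Cutting the tree is a separate decision; here we ask for two clusters.
print(fcluster(tree, t=2, criterion="maxclust"))
```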
So, as you can see from this distance matrix, how that ends up looking is this: you've got your first pairing here, then your next pairing here, and your next pairing here, and here it's pairing an individual with a group, as you can see. They're put together, and then they're paired off again, right? So you can see how it's not actually clustered; it's just made into a hierarchy of the distances, and then you have to cut that at some point. We'll discuss some ways you can choose what cluster number to settle on; it's really nice that Molly talked about that as well, and I'm going to talk about some different ways you can do it.

So, deciding the number of clusters. You could cut the dendrogram, that's what this structure is called, by eye. If you were going to cut this one by eye, how many clusters do you think it would have? Yeah, probably two. The way you would argue that is to say, okay, this is a really long branch here, and this one is really long here, so it seems like these two groups are pretty different from each other, so maybe there are two distinct groups. But that's really hard to write up in a paper, so it's nice to have some metrics and numbers you can put on it.

Another option is the silhouette statistic. What the silhouette statistic does is take the distances within a cluster and compare them to the distances from those individuals to the other clusters. What that's asking is: does it make sense to have these people in their own cluster, or are they actually quite similar to everyone nearby as well? Am I cutting a cloud of data in half, or are these actually separate groups? Here's an example; parts of these slides are from my supervisor, and this one in particular I'm not great at communicating, but there are different example data sets here. The first is three clusters in two dimensions, and that seems all right: this one is definitely a cluster, and these ones, well, this one is pretty close to that one, so where you drew the line seems legitimate, but maybe not. The next one is six clusters in two dimensions: we've got two measurements on these people and we've cut them into six clusters. The problem is that this cluster definitely seems distinct, extremely distinct, but for these ones it doesn't make sense to have them in separate clusters; they really look like just three clusters. What the silhouette statistic can do is distinguish those cases. You can see for the first example it's very high; sorry, what's plotted here is percent accuracy, and the silhouette statistic is the green one, but the silhouette statistic is going to be quite high for the first case and noticeably different for the second, where you're actually slicing what seems to be one cluster into multiple pieces. That's the intuition behind it: you want to discover whether you're just arbitrarily cutting up your data. I don't know if you've been in this situation, but I've done a lot of clustering, and a lot of times you really want there to be subtypes in your data, but they just don't exist.
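Here's a minimal sketch of computing the silhouette statistic for a few candidate cluster numbers; k-means on toy two-dimensional data stands in for whatever clustering you actually ran, and the numbers are only illustrative.

```python
# The silhouette statistic for a few candidate cluster numbers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # two real clouds of points
               rng.normal(5, 1, (50, 2))])

for k in (2, 3, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # High when points sit closer to their own cluster than to the next-nearest
    # one; it drops when a single cloud is being over-split.
    print(k, silhouette_score(X, labels))

# Per-sample silhouettes (the bars in the classic silhouette plot); negative
# values flag individuals that arguably belong in a neighbouring cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_samples(X, labels)[:10])
```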
And the silhouette statistic is a great way to ask: am I just splitting a data cloud into arbitrary numbers of groups, and is that useful? If you are splitting one cloud into groups, you're doing yourself a disservice if you're trying to do prediction, because if it really is one data cloud, then the continuous-valued information, not the categorical labels, is much more informative. So you don't want to push yourself into a subtype analysis when you don't need to and your data doesn't support it, and the silhouette statistic is an easy way to say, no, you know what, I don't have data that supports this.

I also just wanted to talk about this graph; this is a pretty classic silhouette plot. What it's showing is the silhouette statistic on the x-axis, and these are the subtypes: this is subtype one, this is subtype two, and this is subtype three. Each bar is one individual in the cluster. When you have a negative silhouette, it means you should maybe be in a different cluster: you're very similar to people outside your group, and only somewhat similar to people in your own group. So here it points to there being some really nice clustering in these ones, but these ones look like there may be some admixture from the first cluster. It's nice to look at this and ask, how much does my data support a clustering? If you have a lot of data points with high silhouettes on this side, that's supportive of clustering; but I've seen a lot of data where most of it is negative, or near zero, and then your data just doesn't support a subtype analysis. So this is a really nice check to do upstream, just to save your own sanity, so you avoid building clusters and only afterwards going back and discovering that you don't actually have clusters in your data.

All right, consensus clustering. Consensus clustering amounts to re-sampling from your data. It's another way to ask: am I reliably putting the same people together, or do my cluster divisions depend on having certain people in the data set? Do I have some outlier group that's driving my clustering discovery, or do I actually have clusters that exist across everyone? So you re-sample, you cluster again, and you compute the number of times each pair of individuals clusters together. That's a nice check of the stability of your clustering, or more specifically the stability of your data in a clustering framework.
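A rough sketch of that re-sampling idea, with k-means standing in for whatever base clustering you use and toy data throughout; the consensus matrix counts how often each pair of patients lands in the same cluster across subsamples.

```python
# Consensus clustering by re-sampling: re-cluster subsamples of patients and
# count how often each pair ends up together.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(4, 1, (30, 5))])
n = X.shape[0]

together = np.zeros((n, n))   # times a pair was put in the same cluster
sampled  = np.zeros((n, n))   # times a pair appeared in the same subsample

for _ in range(100):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
    same = (labels[:, None] == labels[None, :]).astype(float)
    together[np.ix_(idx, idx)] += same
    sampled[np.ix_(idx, idx)] += 1

# Entries near 1 (always together) or 0 (never together) indicate a stable
# clustering; values around 0.5 suggest the split depends on who was sampled.
consensus = together / np.maximum(sampled, 1)
print(consensus.round(2))
```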
So now, back to our different integrative clustering approaches, starting with iCluster. What iCluster does is look for latent subtypes. It assumes that your data follows a Gaussian latent variable model, and the sparsity regularization isn't very important here, but basically what you're trying to find is a latent embedding. What that means is you're trying to find labels that exist across all your different data types, so you want to see if you can divide your sample into these subtypes, maybe four subtypes, and say, okay, yep, this subtype exists here, here, here, and here. Sometimes a subtype can be supported by one data type more than another, for example, which is nice, but at the end of the day it's looking for continuity in subtype discovery across the different data types.

Some drawbacks: iCluster is very computationally intensive. I think they've overcome that in part in more recent iterations, but at the end of the day it still is a very computationally intensive algorithm. What that amounts to is that you're restricted in the amount of data you can put into the algorithm; if you want to do some whole-genome analysis, you can't, you can only use about 1,500, maybe 3,000 genes, which for certain diseases is a massive limitation. There are also many steps involved; sorry, I'm just checking whether any of these other points are relevant. Right, and the other thing is that it's focusing on similarity across the data types. It's looking for subtypes where, say, having these four subtypes is supported in this data type and in this data type. But maybe this data type actually shows three subtypes and this one shows four; maybe there's a subtype that only really exists in the microRNA space, and you would only discover it in microRNA. Often you want to integrate that complementary information, because only one genomic data type might carry it, and that can end up being a major constraint.

So, enter similarity network fusion. The idea is that you construct patient similarity matrices and then you fuse multiple matrices. What that ends up looking like is: you have your data here, patients by gene expression, and you create a similarity matrix of these patients. You find that this group is very similar to each other, these two people are very similar to each other, and you put that in a similarity space. That space is actually a network: each of these points represents an edge between two patients. A node here is a patient, and an edge is the link, the affinity or similarity, between those patients, so a darker cell here corresponds to a darker edge here. You create a similarity network out of each one of your data types, and then you fuse them iteratively into a fused similarity network. The fusion is similar to what Robin was talking about this morning with graph diffusion: you're diffusing the different data types onto each other, and you're doing it iteratively, which means you're doing it a lot of times, but more importantly you're doing it on each of them in sequence, in a cycle, so that they become more and more similar with every iteration. Once they reach a level of similarity between them that is very high, or the difference between the different diffused networks is very small, you combine them into the fused similarity network. Now you have a network that represents information from all your data types, and the information you've integrated is both common and complementary.
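Here's a heavily simplified sketch of the cross-diffusion idea, not the published SNF algorithm itself: it skips SNF's local k-nearest-neighbour kernels and other details, and just lets each data type's normalized affinity matrix diffuse through the average of the others until they agree, then averages them into a fused network. The data, bandwidth choice, and iteration count are all placeholders.

```python
# A heavily simplified sketch of cross-diffusion between patient similarity
# networks (NOT the published SNF algorithm).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def affinity(X):
    # Patients x features -> patients x patients similarity (RBF kernel,
    # with a crude median-distance bandwidth heuristic).
    d = squareform(pdist(X))
    sigma = np.median(d[d > 0])
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def row_normalize(W):
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
views = [rng.normal(size=(20, 50)),    # e.g. gene expression (toy)
         rng.normal(size=(20, 30))]    # e.g. methylation (toy)
P = [row_normalize(affinity(X)) for X in views]

for _ in range(20):                    # iterate until the views agree
    P = [row_normalize(
            P[v]
            @ (sum(P[u] for u in range(len(P)) if u != v) / (len(P) - 1))
            @ P[v].T)
         for v in range(len(P))]

fused = sum(P) / len(P)                # fused, fully connected similarity network
print(fused.shape)
```

In practice you would use the authors' released implementation (the SNFtool R package, for example) rather than a from-scratch sketch like this.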
So you're integrating like in the iCluster framework, where you're saying, I want to see whether these subtypes exist, whether there's a subtype breakdown that holds across multiple data types, because I really want to boost that signal. But also, if there's a subtype signal that only exists in one data type, I want to keep that information: if it's a very strong division in my data, it's probably important, and it's something I don't want to wash out when I'm fusing everything.

Question: what's the cutoff? Oh, you don't have to pick a cutoff. What we did was actually spectral clustering, and we use the eigengap heuristic to choose the number of clusters. We convert it into a network that is fully connected, so that's another benefit: you don't have to cut at any point. The network is just this full graph here, so you keep all the information, you're not cutting it into chunks. These all have a link between them; it doesn't show it here, but theoretically all of these have a link, and the links represent the weights, the cell values here. So there's no cutoff, we don't cut it at all. And I don't love this animation, because these really are all linked, it's all connected. So you're integrating all the information from all the data types in terms of patient similarity, yes, exactly. Is it safe to say that P1 and P2 are the least similar in this entire network? Yeah, exactly, and in this picture P1 and P2 do seem to not be very similar. I'm not sure how much was done to make the drawing exact, sorry, this particular slide is not mine, but know that all of these weights exist.

Might different data sets have a different contribution to the final network? Yes. How do you weight them? By default we don't; you can weight them, but we actually haven't used a weighting method for this, because the hard thing is knowing which one to up-weight. For example here, would you want to up-weight your methylation, your gene expression, or your microRNA expression? Which one would you weight higher or lower? So right now we weight them all equally, and what you'll see is that some networks, like this one, carry a very diffuse signal, and what happens is that such a network won't contribute as much to the fused matrix, because if it's just noise going in, it's not going to differentiate the patients very much. Even when you're diffusing noise onto the network, it's still just noise, so it doesn't contribute to any clustering. After you've done the similarity network fusion, you can actually see how much each of the individual networks contributed to the fusion, and for that you would use the normalized mutual information that Molly described earlier: if I did the same clustering in this single data type versus the clustering on the fused matrix, how similar are those clusterings? How similar is the membership in each of those groups between the fused matrix and the individual data type?
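That check is easy to express with scikit-learn's normalized mutual information; the cluster labels below are made up, just to show the comparison of a single-data-type clustering against the fused clustering.

```python
# Normalized mutual information between a single-data-type clustering and the
# clustering of the fused network; the labels below are made up.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

fused_labels      = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
expression_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # agrees with the fusion
mirna_labels      = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])  # unrelated to the fusion

print(normalized_mutual_info_score(fused_labels, expression_labels))  # 1.0
print(normalized_mutual_info_score(fused_labels, mirna_labels))       # ~0
```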
So here you can see that the signal is boosted, in an extreme way actually. Where you would have had a very diffuse signal, and it seems like there are maybe four or five different subtypes here, this one looks like a fairly similar subtype with a pretty diffuse one here and some noise down here, and this one just looks like pure noise, you're able to boost the signal and wash out the noise when you're diffusing these networks onto each other. Then some clinical properties of those subtypes can be evaluated.

Question: after you have the final fused matrix, I mean, from this example it's very clear, but how do you actually mathematically define the number of clusters? So now you have an affinity matrix, and you can convert it to a distance matrix or keep it as an affinity matrix, and you can use all the clustering algorithms that apply: hierarchical clustering, spectral clustering, literally you name it, all it needs as input is a distance matrix or an affinity matrix. We found spectral clustering works best, and because of that we found the eigengap is the best tool for choosing the number of clusters, just because you're already looking at the spectrum when you use the eigengap. But you can use hierarchical clustering here, you can use really anything; you can even eyeball it and say there are three clusters and split them that way. At the end of the day you get a full network out of this, a full distance or similarity matrix if you will, and you can use that in any way you want; I'm going to talk about a few ways it's been used. And then you can look at clinical properties of the subtypes.
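As a sketch of the spectral clustering and eigengap step just described, here is spectral clustering run on a precomputed affinity matrix, with a simple eigengap heuristic on the normalized graph Laplacian to suggest the number of clusters; the toy affinity matrix below just stands in for a fused network like the one above.

```python
# Spectral clustering on a precomputed affinity matrix, with an eigengap
# heuristic to suggest the number of clusters.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def eigengap_k(W, max_k=10):
    d = W.sum(axis=1)
    # Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2)
    L = np.eye(len(W)) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    vals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(vals[:max_k + 1])
    return int(np.argmax(gaps)) + 1   # largest gap among the smallest eigenvalues

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (15, 5)), rng.normal(6, 1, (15, 5))])
D = squareform(pdist(X))
fused = np.exp(-D ** 2 / (2 * np.median(D[D > 0]) ** 2))   # toy affinity matrix

k = eigengap_k(fused)
labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0).fit_predict(fused)
print(k, labels)
```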
Question: do all patients have all the data types? Ideally, yes. If they don't and you impute, then it comes with all the classic imputation issues that everyone faces. It can be imputed, and what's interesting is that you can impute in the raw data itself or you can impute in the similarity space, so in these matrices themselves. But at the end of the day, if you impute, you'll want a comparison of that result with the result on your complete data set, if at all possible, because it's hard to know how much of what you discover is an artifact of your imputation. Especially when you come out with some tight clusters, it's kind of like, okay, are they actually very similar, or did I make them all similar because I imputed them, and that imputation was based on the similarity of these patients to each other?

Can you really quickly explain imputation? Yes, absolutely. Imputation can be done in a lot of ways. Some classic ones are just a linear regression: you run a regression on everyone you have data for, and then you use that fitted regression to predict the missing value for the people you don't have data on. A real issue here is, say we don't have any methylation data for certain people but we have their gene expression and their microRNA; if we use their gene expression and microRNA to impute their DNA methylation, now we're making them similar to each other, we're making that signal penetrate across those different data types, because their methylation becomes a function of the other data types. You've basically just constructed a similarity. So yes, ideally you would have the same data for all patients, and if you have missing data, ideally you'd have some of that data type: suppose you're missing some portion of a patient's DNA methylation, that's easier to impute, because you can impute it from other DNA methylation and you're not relying on the other data types. I think the most important thing to keep in mind with imputation is, what signal am I propagating, and what am I probably going to rediscover in my results once I've imputed this data set? Because you might just be constructing the results you end up getting. Another way to impute, which has a lot of the same issues as regression, is something like k-nearest neighbours: find the five people who are most similar to this person and let them vote on what the missing value should be. There are many different ways to do it, there are whole volumes written about imputation, but at the end of the day it's important to remember that you're predicting your values, and so you might be creating your end result.

So, what's nice about the subtypes, versus staying in the network space, is that you can split up your data and see if there's anything clinically actionable or interesting about these people. If you look at their survival probability and split them into two subtypes, or into multiple subtypes here, you can say, oh, it seems like this group is pretty similar to this other group, but this group seems to be doing well, they seem to have a better prognosis, and that may be interesting. You can look at their age, you can see if there's something that may be driving the differences, because your location, your age, different things like that will also be driving differences in your data, and it's important to discover even artifacts this way. It's also nice to look at your batch, for example, when you're looking at your subtypes after the fact, because your batch may be driving your whole subtype discovery, and again, that's not necessarily useful. So we did this in TCGA data: we applied SNF to it, we included gene expression, methylation, and microRNA, and we had some controls, so these are cancer patients and then we had normal samples here. We looked at the clusters in a latent space, the principal components: we took the PCA of the data and colored it by which cluster each patient belongs to.
And we saw it was really interesting: you could see where they lie close to each other, or maybe they're kind of on a spectrum and they're being divided up. In this one, where the clusters seem not to be very different, they also seem pretty similar in at least some of this space, and in this one, where they are very different, they don't necessarily seem extremely different here, so the cluster space doesn't necessarily translate into the clinical space. That's another thing that's very important: this is unsupervised, and it's based only on the genomic and genetic data, or whatever data you put in; actually we've put in nutrition data, neuroimaging data, you can really do it with anything, but the clusters you find might not be clinically very important. Like I said, they could be batch, they could be age, they could be sex, so it's very important to look at all these things and ask, did my clustering make sense, and is it important too?

So there are some advantages and disadvantages, some of which I've already discussed. It's nice because you can do some integrative feature selection, like I was saying about seeing after the fact which data type was most important. Another thing you might want to know is whether any particular genes were most important, and again, through normalized mutual information, you can see whether a particular gene seems to be driving the clustering, or whether a certain group of genes is most important for separating your patients into these subtypes. So it's a nice way to do some feature selection if those subtypes end up being clinically important. Growing the network requires extra work: you're in network space, so that's n-squared in the number of patients, and it can take a lot of time if you have several thousand patients you're trying to build a fused similarity network from; you'll want some processing power for that. And again, like I said before, it's unsupervised, and it's very hard to turn this into a supervised problem, because it's a non-linear fusion. You're seeing just what exists in your data; you're not asking, what exists in my data that's associated with survival? So if you don't include something in the clustering, it's not going to drive the clustering, and your clustering may or may not correspond to survival, or tumor location, or anything you would be particularly interested in. But some nice things: it creates a unified view of patients across multiple heterogeneous data spaces, so if you have nutrition data, for example, that you want to integrate with microRNA data, you'll have a common space to do that in; they don't both have to map to a gene or anything like that, which can be extremely challenging. There's no need to do any gene pre-selection, and it's robust to different types of noise.

So, survival analysis; we're going to switch gears pretty dramatically here. I'm going to talk about hazard rates, survival functions, the KM estimator, or Kaplan-Meier estimator, the log-rank test, which Robin also brought up and very nicely set me up for, and then we'll talk about the Cox proportional hazards model. A key thing here is that I'm going to talk about Cox proportional hazards, but there's also a whole class of survival analysis that deals with parametric survival functions, and those are extremely powerful for actually predicting the time to an event.
So if you're interested in when you're going to die, or when you're going to have your cancer onset, or when you're going to have a relapse, then a parametric model may actually be more appropriate. I just wanted to point that out, because Cox is commonly used, but, and I don't know if this is totally true, I've seen it in multiple places that even Cox preferred the parametric survival models to his own model, because of the ability to estimate when the thing will happen, not just how likely it is, what your hazard rate is at any given point. We'll talk more about that.

So, survival data: it's time to an event. It says single event here, but it could also be multiple events, or it could be competing risks; that's another type of survival model where you're saying, okay, either you're going to die or you're going to relapse, and you want to see which happens first, and if you die you won't have the relapse, right, so they're competing risks. The nice thing is that some data on patients may be missing, censored generally speaking, and you can still incorporate that data into your model. If you know at what point you no longer have information on a patient, you can say, well, they lived up to this point, or they had no onset up to this point, and that information can be integrated into the model, so you've not totally lost that sample. It's pretty flexible and powerful in that way. Just to look at it: here's the beginning of our study, and each one of these is an individual case, so we've got one, two, three, four, five here, and let's say this is the end of our study. These ones are going to be censored: we don't know when they have their event, whatever it would be, but we can still integrate their information into the model. Here we know this person had an event at month one, this one at month three, this one at month four, and when you do your survival analysis you can ask whether the people who had an earlier onset separate from the people who never had an onset, or had a later onset or a later death.

So, some important quantities. First, the event time t; time is pretty important for these models. The survival function makes sense intuitively: it's the probability that you're alive, or event-free, at a given time. The hazard rate is the probability of the event, dying, failure, cancer onset, happening at that instant, or in the next instant more precisely, so it's your instantaneous likelihood of experiencing an event. Some examples: a constant hazard rate isn't changing over time, so at any given time you're equally likely to experience an event. An increasing hazard rate is something we're all familiar with: as you age, you're more likely to have, for example, cancer onset, so the hazard increases over time. A decreasing hazard rate: the risk of dying is highest at birth, infant mortality, or the way I like to think of it, if you're male and you leave adolescence, now that you've passed adolescence you're more likely to live for a period of time, right? There are periods where you're at high risk and then the risk drops off, given that you've already lived that long, so that's a non-constant hazard rate.
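Those three shapes are easy to see with the Weibull hazard, h(t) = (k/lam) * (t/lam)^(k - 1), where the shape parameter k = 1 gives a constant hazard, k > 1 an increasing one, and k < 1 a decreasing one; a tiny sketch with arbitrary parameter values:

```python
# Constant, increasing, and decreasing hazard rates via the Weibull hazard.
import numpy as np

def weibull_hazard(t, k, lam=1.0):
    return (k / lam) * (t / lam) ** (k - 1)

t = np.linspace(0.1, 5, 5)
print("constant:  ", weibull_hazard(t, k=1.0))
print("increasing:", weibull_hazard(t, k=2.0))   # e.g. cancer risk rising with age
print("decreasing:", weibull_hazard(t, k=0.5))   # e.g. infant-mortality-like risk
```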
All right, the KM estimator. This is classic and very important for survival analyses. We've got the survival function, the probability that a member of a given population will have a lifetime at or exceeding a given time, and here's the function that actually estimates it: essentially a product, over the observed event times, of one minus the number of events divided by the number still at risk. It's not too important, because I think the graph is much more intuitive to understand: you're really just counting, and each step down here represents some failure or some event that has occurred. So up to this point, an event, and you drop down; now, in this population, this is the share of people who are alive or have not experienced an event, and this difference is the share who have. And then here you have your censored observations included, so you know that at least one person lived, or didn't experience an event, up to this point. It's a really easy way to just see, wow, these are really different, so it's a very intuitive way to show your results.

Yes, please; so the question is about the censored observations. Because we haven't observed what happened to them, the only information we integrate is the fact that nothing has happened up to that point, and that's the key thing with censored observations. They could have had an event right after that point; we don't know, we're just missing that information, but we know they at least lived, or at least had nothing happen to them, up to that point. That's why the curve stays constant through a censored observation and drops only when we observe a failure, a death, an event. So the drop is exactly the time you observe an event? Exactly, yes: every drop here is an observed failure, an observed onset.

All right, this slide actually used to be hidden, but then Robin talked about the log-rank test, so I'm not going to go into all of it. What I basically want to say is that with these log-rank tests, where we're testing the difference between the KM curves, we're doing a Gaussian approximation with that data, so you can use it to approximate a z-score and get a p-value for the difference between your curves. That's what it boils down to, if you need any intuition about what the log-rank test is. I just wanted to touch on it briefly because it was brought up earlier and I think it's very important, because KM curves are used all the time, and it's very nice to know when you're using a Gaussian approximation, for example, because that's asymptotic, which means that if you have a very low sample size this may not be a very powerful way to test the differences; it becomes more powerful, or more appropriate, as your sample size grows. So do you need more or fewer samples? It depends on your data, because if you have fewer samples it's more of a power question: if you have fewer samples and the groups are extremely distinct, then calling it significant is okay; you'll get pretty big confidence intervals, which is nice in that case, because you won't be very confident with few observations, but you're more likely to get a nice difference. So I would think of it more in a power context; it's hard to put a number on it because it really is context specific, not really a rule of thumb, though you can estimate power for these tests.
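For completeness, here's what that log-rank comparison of two groups looks like in code, again assuming lifelines is available and with made-up times and event indicators.

```python
# A minimal log-rank comparison of two groups with lifelines.
import numpy as np
from lifelines.statistics import logrank_test

time_1  = np.array([0.2, 0.4, 0.6, 1.0, 1.5])   # short-survival group
event_1 = np.array([1,   1,   1,   0,   1])     # 0 = censored
time_2  = np.array([1.8, 2.2, 2.9, 3.5, 4.0])   # long-survival group
event_2 = np.array([1,   0,   1,   1,   0])

result = logrank_test(time_1, time_2,
                      event_observed_A=event_1, event_observed_B=event_2)
print(result.test_statistic, result.p_value)
```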
All right, and then the hazard ratio. Where I talked about the hazard before, the hazard ratio is used a lot in comparing survival. What you're doing is comparing the likelihood of the event happening at a given time in one group, or one subset of your data, to that in another group, and that ratio is now the statistic you're interested in. You want to be able to say something like: people in this group are two times more likely than people in this other group to experience a failure at time t. So here, that would amount to, okay, it seems like they're maybe two times more likely, they're facing a much higher hazard in this group than in this group, and particularly here, the hazard ratio is going to be very high for this group relative to this one. You're just comparing them; it's a way to frame it as a comparison between your two groups, and it's actually a different way of comparing them than the log-rank test.

Question about the curves again: all the drops are observed, we've seen that that share of the group has experienced an event. And which ones are the black dots, this one for example, or this one? The black dots are scattered along the curve, so they don't correspond to any of the drops; they actually correspond to a lack of a drop, because you just haven't seen anything happen at that point. If you haven't observed something, how do you know where it goes, why does it vary? Because that point is where we stop seeing data on that person, so we know that up to that point nothing has happened to them, and now we can support that, in this group, nothing happened up to that point. It's all about adding information to the model, but it's incomplete information, so it won't support a drop there. Ideally, yes, ideally you have no missing data ever, in similarity network fusion, in survival analysis, writ large. And if you have a lot of censored data, you're cutting down your denominator, right, if you think of it as a fraction, so each drop will be much bigger, because each drop corresponds to the proportion of the remaining people in that group who experience the failure, or die, or what have you. So exactly, if you have a lot of black dots, each drop becomes larger in that group. And each black dot is an independent person? Yes, a different person each time.

Okay, so the hazard ratio. The reason the hazard ratio is very important is that it's used in the Cox proportional hazards model. What this model does is estimate the ratio of risks for your groups relative to each other, and we can do this on a continuous scale, so it can say, as you age, how does your hazard increase; you can compare, say, a five-year-old to a ten-year-old in terms of their hazard rate. But you can also do it for subtypes, so you can directly take your subtyping data, like we talked about before with the gene expression subtyping, or SNF subtypes that you've developed, put those in as individual groups, and compare their hazards to each other.
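Here's a minimal sketch of that kind of fit with lifelines' CoxPHFitter, using a made-up data frame with age as a continuous covariate and a hypothetical two-level subtype as the categorical one; subtype 1 becomes the baseline group after one-hot encoding.

```python
# A minimal Cox proportional hazards fit with lifelines; data are made up.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time":    [1.2, 0.4, 3.1, 2.0, 0.8, 4.5, 2.7, 1.1, 3.8, 0.9],
    "event":   [1,   1,   0,   1,   1,   0,   1,   1,   1,   1],
    "age":     [61,  70,  54,  66,  72,  49,  58,  75,  52,  68],
    "subtype": [1,   2,   1,   2,   1,   1,   2,   2,   1,   2],
})
# One-hot the subtype so subtype 1 is the baseline group.
df = pd.get_dummies(df, columns=["subtype"], drop_first=True)

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()   # the exp(coef) column is the hazard ratio per covariate
```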
And it has a really nice form; strictly speaking it's semi-parametric, but it has a form that allows you to estimate really nice confidence intervals and smooth functions describing how important, for example, age, a continuous value, is to your survival, or more importantly to your hazard, your likelihood of non-survival at any given time.

So, some intuition. Here's the model: the hazard for person i is h_i(t) = h_0(t) * exp(beta_1*x_i1 + ... + beta_p*x_ip). You have this baseline hazard value, h_0(t); the reason it's h-sub-zero and not h-sub-i is that it doesn't differentiate between individual people, everyone shares the same baseline hazard, and the idea is that the only thing differentiating them is their relative risk based on the covariates you give the model. These assumptions are all very important to keep in mind, and we're going to talk about evaluating them in your data when we go to model our survival data. This beta here represents a logged hazard ratio; it sits up in the exponent, and it says, if you have a one-unit increase in some predictor, how much will your hazard change, multiplicatively, for that one-unit increase. For categorical data, you have a baseline subtype, you compare everything to that one subtype, and you have a different beta value for membership in each subtype group; for age, it's just one year, how does that change your hazard. The hazard ratio for a one-unit increase is the exponential of that single parameter: if beta is less than zero your hazard is reduced, and if it's greater than zero your hazard increases. When you're comparing two people, the nice thing about having everything in the exponent, this is what I was getting at about its parametric part, is that you can just subtract the linear predictors from each other, so it becomes like a typical regression model in terms of interpreting the difference between the two. And it can be interpreted as a percentage change in risk: if your hazard ratio is 0.8, it means you have a 20% decrease in mortality risk with a one-unit increase in whatever your predictor was. So it's a really nice, clean interpretation.

For evaluating these models and their fit, there are actually a lot of different ways to do it, and the one I'd like to talk about, just because it's really unique to survival data, is the concordance index. Whenever you fit predictive models, you want to see how well your model predicts people and places them where they belong, how well it predicts the true values for those people. What's interesting here is that if you're trying to build a predictive model, you may or may not be interested in ordering, but if you're doing Cox proportional hazards, you're pretty much confined to it. If you're interested in predicting when something will happen, you'll more likely want to go to a parametric model, because what a concordance index tells you is, did I put everybody back in the correct order, and sometimes the order is less important than the time itself.
Right, because the order doesn't correspond to when something happened; it corresponds to when something happened relative to everybody else. No other metric captures the ordering of individuals, but if you're not interested in the ordering, if you're interested in the time or something else, then you really want to think about what model you're fitting and what you're trying to predict, because this may not be appropriate, Cox proportional hazards may not be appropriate. I know I'm talking a lot about when these models break and when things don't work, but I like to do that, because I feel like it's easy to give a talk where everything is great and useful, and the pitfalls end up being some of the most important things for actually using any of it.

So here's an example where SNF, iCluster, and the PAM50 clusters were used to try to predict survival, and this is the concordance index I talked about. We were trying to say: can we order these patients, can we predict when death will occur relative to everyone else? What we found was that we were able to marginally improve on it here, but this also speaks to whether it's maybe more appropriate to use the data in a continuous space. In this paper, what we did was use the full network, and this is where having a fully connected network is very important: you don't have to draw those divisions between the different groups in the fused network, you can use the full network to inform your prediction. What they did was propagate the patient through the network, which means you take a patient, ask who in the network they are most similar to, place them with those people, and then predict the order based on who they're most similar to, in a very simple framework. What we found was that we were able to make a massive gain, relatively speaking, in predictive accuracy by using all of that information. So sometimes, by dividing people into subtypes, you're actually throwing away a lot of information that's very important for what's going on. It's just another thing to think about: whenever you create a distance matrix or an affinity matrix, that's actually a fully connected network that you may be able to use for your study, and you don't have to cut it into groups. Sometimes groups are really important and really useful, and sometimes they're just not, and you do much better including all the information.

So does that mean 72% of the patients were placed in the correct order? Yes, relative to each other, and 28% were out of order. So we ordered the deaths correctly, but the thing is, again, if I'm a patient looking for a prediction tool to know when I'm going to have a cancer onset, or death, or something, I'm much more interested in knowing at what age I'm going to die, versus am I going to die before or after my colleague here, right? And at the end of the day, that is what is being predicted and shown here. So it's another thing to think about: what is the meaning behind this, and are people really interested in that, whether as a patient or a physician?
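And here is the concordance index itself, computed with lifelines on made-up observed times, event indicators, and predicted survival times; it's just the fraction of comparable pairs of patients that the predictions put in the right order.

```python
# The concordance index: out of all comparable pairs of patients, how often
# do the predictions rank the one who failed earlier as failing earlier?
import numpy as np
from lifelines.utils import concordance_index

observed_time  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
event          = np.array([1,   1,   0,   1,   1])    # 0 = censored
predicted_time = np.array([0.5, 1.5, 2.5, 3.5, 4.5])  # higher = predicted to survive longer

# 1.0 means every comparable pair is ordered correctly; 0.5 is random ordering.
print(concordance_index(observed_time, predicted_time, event))
```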
So, data integration in the future, and now. Simultaneous feature selection and data integration is an active area of research. By feature selection I mean going through the data and picking out what's most important, filtering your individual measurements, so maybe you're filtering genes, or methylation probes, or certain microRNAs, at the same time as integrating the data types with each other; that's an active area of research right now. Also supervised versus unsupervised approaches: there's an SNF-flavored supervised clustering approach, but the very basic problem with SNF is that there's no objective function to optimize our parameters over, so we cannot do it in a supervised fashion, and anything that's done in a supervised fashion is, for the most part, usually some linear combination. So it's not quite concatenation and clustering, but at the end of the day it is a kind of linear combination, concatenate and cluster, so one area of research is trying to escape linearity while getting supervised clustering. And then weights for the contributions of the different types of data; I'm glad you brought that up. We weight them the same right now. I haven't really come across a case where it makes sense to up-weight or down-weight some data type, because usually we're running SNF in a context where we don't know what's most important and we really want everything to have an equal vote. It's something we've thought about, and maybe you'll have cases where equal weighting doesn't make sense, and we'd love to hear about that.

Question: could it be that one kind of data suits one research question, and another kind of data suits another? Oh, definitely. For example, say you wanted to detect cancer early on: there are not that many mutations yet, so the DNA methylation pattern might be more informative to use, because that changes early on in cancer; whereas to predict therapy response you might go with mutations, because they are more predictive there. Absolutely, and the other thing is that we often want to discover what's most important on the face of it. That's actually a great example: we work on Li-Fraumeni syndrome, studying patients who have a germline p53 mutation, and the thing is, if something's not important, like your SNVs, we expect that to just show up, we expect the data to tell us it's not important, that it's not differentiating anyone. So that's the other reason why pre-weighting everything by our own biases doesn't necessarily make sense in this context for us. But it's an extremely good point: in some scenarios a data type may be important and in others it's not, and it's really context specific, but you would also hope that your data just shows that, so you don't have to bias it upstream.

Okay, so in a real-life scenario where you have to do the actual analysis, KM versus Cox, where would you use each? Always do a KM first; that's a first pass, everyone will recognize it, everyone will understand it, it's immediately interpretable. Then from the KM it's more a question of Cox versus a parametric survival model, and that comes down to what you care about: do you care about ordering people, do you care about the hazard over different time points, because the hazard function may be very interesting as well, or do you care about when this is going to happen, at what age, at what time interval do I expect some failure, or some onset of cancer, or what have you.
Okay, so that's all I have for the lecture.