Hi everyone. I'm Lauren Erdman. I work at SickKids and I'm a PhD student in Computer Science at U of T. Historically I've worked mostly with genomic data, and now I'm moving more into clinical and even imaging data. I work with Dr. Anna Goldenberg, and we've been teaching this module for the last few years, so I'm here to carry the torch for my supervisor today. And something that's not on my CV: I keep fish. I have freshwater tanks and saltwater tanks, with coral and crabs and fish, so if any of you are into that, I'd love to show you pictures and talk about them, just let me know. I'm also hoping to get to know you more as we work together, especially during the lab. As I'm sure you've seen about twelve times at this point, these slides carry a Creative Commons license, so you can use them and share them widely, with proper attribution of course. Today, this afternoon, we're going to talk about data integration. Specifically, we'll cover the different kinds of clinical and genomic data you might want to include; we'll look at single data type analyses, mostly from the perspective of clustering, that is, trying to find patterns among your patients in a large amount of data; and then we'll look at data integration methods, which are again primarily clustering methods: concatenate-and-cluster, iCluster, and similarity network fusion, which was developed in my supervisor's lab. We'll talk about the advantages and drawbacks of the different methods, and last we'll talk briefly about survival analysis. Then in the lab we're going to do similarity network fusion and survival analysis, both in R.
As you're certainly aware, especially across this week, there is a huge amount of available patient data, and much of it is relevant to cancer and to many other patient phenotypes. This includes genetic, epigenetic, and genomic data; proteomic data; questionnaires to understand your patient's phenotype; clinical data, maybe from EHRs or maybe systematically collected through a study; imaging data to identify the cancers; and of course things like diet and other lifestyle factors. When you want to pull these together it's really challenging, because they don't all map to the same unit. Genetic, gene expression, and even epigenetic data can usually be mapped to a gene, but not always; microRNA you can't map to a specific gene, or maybe you want to map it to the genes it targets; proteins cover just a subset of the genes, the protein-coding ones; and clinical data can't be mapped to an individual gene at all. So you want to integrate them somehow, and we're going to talk about how people have done this in some cases, though by no means exhaustively.

Some reasons you would want to integrate these data: maybe you want to identify a more homogeneous subset of patients, patients who respond to a similar drug or have a similar prognosis, where some will go on to require surgery and some will resolve without it, for example, or some will require different clinical management. We've worked with patients who have cancer predisposition syndromes, and we're trying to segment them into one group that needs more regular full-body MRIs because they're at higher risk of getting cancer sooner, and another group that maybe doesn't need to come in for scans as often. So for stratifying your patients into risk groups, this can be really helpful. Integrating your data also gives you a fuller picture of the patient: you're not just looking at their gene expression, you're also looking at their lifestyle, environmental exposures, or clinical history, and all of that is really important for the person you're seeing in that moment.

An example of a single data type analysis, for the purposes of this talk, is a GBM study by Liang et al., published in PNAS in 2005. They collected gene expression data, selected the most variably expressed genes, and performed hierarchical clustering on them. How many of you are familiar with hierarchical clustering? Okay, cool. We'll go into it in more depth, but for now this is just one way of dividing the patients in a data set based on who has more similar gene expression. They identified two clusters, along with the genes that really divide these patients into those two clusters, and they then looked at the survival patterns of the patients and found that the two groups have really different survival. How many of you are familiar with Kaplan-Meier survival curves? Awesome, a lot of you, that's perfect, so I'll just go over them briefly, and it will make the survival analysis part go pretty fast. Actually, I should ask first: which group has the more severe prognosis here? Which group is worse off?
Yes, exactly: group two is dying faster. If we look at survival at one year, about 80 percent of group one is still alive at one year, but in group two only about 20 percent are, so the majority of that group dies within the first year. So they were able to find a gene expression pattern that really differentiated the prognosis of cancer in these patients. The plot also shows the censored observations, and one really nice quality of survival analysis is that you can include information from incomplete observations: even if you don't know when an event happened for someone, you can still include their information in your data set, because you know that no event happened up to a certain point. We'll discuss that more later.

There are also ways to incorporate information from other data types into a single-data-type-driven analysis. I mentioned that Liang et al. chose the most variable genes to include in their hierarchical clustering. Another group, Verhaak et al., did something very similar in GBM, but instead of including only the most variably expressed genes, they also included genes carrying mutations, specific mutations they thought were especially deleterious, and genes showing copy number variation, adding the expression of those genes as well. So it's still all gene expression being clustered, but they're choosing what to include based on information from other data types. It's a flavor of data integration where you don't directly include the other data types, but you select features from a single data type using information from the others. They found some separation between groups; it doesn't show up very strikingly, but it led to the identification of the proneural, neural, classical, and mesenchymal subtypes of GBM. So this approach is valid and can uncover important signal in the data.

But again, what about methylation data? There are many other data types and pieces of information you might want to incorporate, and it's much less obvious how you would select features based on methylation. For mutations in cancer there are classical driver genes, so if you see a mutation in a known cancer driver gene it's clearly important to include, and similarly for copy number variation, if you see huge spikes you can assume something is really going wrong there. But with methylation we know the whole genome is changing its methylation signal in most cancers, and particularly in certain types of cancer, so how would you incorporate that information? It becomes much more challenging when you move into epigenetics. So we're going to talk about some different integration approaches, and as I said this is not exhaustive, but these are approaches that try to use all of the data types and combine them in the analysis, rather than using prior information to filter a single data type that you then cluster: concatenate-and-cluster, iCluster, and then similarity network fusion. Before getting into those, the sketch below shows the kind of single-data-type feature selection these expression studies start from.
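Just to make that concrete, here is a minimal sketch in R (the lab's language) of variance-based gene pre-selection. The matrix `expr` and the cutoff of 1,500 genes are hypothetical placeholders for illustration, not values taken from either paper.

```r
# Hypothetical gene expression matrix: rows = genes, columns = patients
set.seed(1)
expr <- matrix(rnorm(20000 * 50), nrow = 20000,
               dimnames = list(paste0("gene", 1:20000), paste0("pt", 1:50)))

# Rank genes by variance across patients and keep the most variable ones
gene_var  <- apply(expr, 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[1:1500]  # cutoff is arbitrary here
expr_top  <- expr[top_genes, ]

dim(expr_top)   # 1500 genes x 50 patients, ready for clustering
```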
So concatenate-and-cluster is pretty straightforward, as the name suggests. The first step: here I'm showing that we have methylation data and gene expression data, and we literally just concatenate them. We put them together, so now the features for a patient are all of their gene expression values and all of their methylation levels per probe, or we could combine the probes across genes rather than keeping it at the probe level. Can anyone tell me some issues that might arise if you just concatenate all of your gene expression data and all of your methylation data and then cluster your patients? Yep, you'll be double counting, which could be good or bad, because if you have that doubled signal you may actually want to boost it; it depends what you're hoping to find. So the question was: what problems may arise when you're just concatenating and then clustering? Roman brought up that you're going to be double counting your genes, and Garrett emphasized that point by noting that you're not even double counting equally, because some genes will have multiple probes within them and some genes may have no nearby probes, or none in their promoter. The other issue is scale: methylation arrays have roughly 450,000 to 850,000 probes, depending on whether you're using the 450K or the 850K array, compared to gene expression, where you're sitting at around 20,000 genes, and likely fewer after quality control. So your gene expression signal might be totally swamped by those 400,000-plus measurements versus 20,000; you might pull out no gene expression signal at all because it's all taken up by the methylation signal. That can be an issue when you do this kind of concatenation.

So the idea of concatenate-and-cluster is that you concatenate and then you cluster, and I'm just going to briefly go over hierarchical clustering. This is a distance matrix, and it's symmetric, which means the lower triangle here in green is the same as the upper triangle here in purple. If you look at this red cell, 0.5, it also appears up here; it shows how different d and f are, and they should be equally dissimilar in either view. The reason it's highlighted is that it's the lowest value, so if we're looking for the minimum distance, single linkage, we would say the closest pair is d and f, and it looks like the next closest involves e and d. As you build the hierarchical clusters, you're basically ordering the linkages: d and f join first, and as you allow more observations into the same group, e joins them, and then c is the next one that grows closer, but it's quite distinct, so it's added in a bit further out. Meanwhile a and b are much closer to each other than to any of these, so they get joined together early on as well, and at some point, as you increase the threshold, everything is grouped together. That threshold is what the dendrogram shows: at a dissimilarity of 0.5, d and f come together, at 1, e joins those two, and at about 1.5 you have a and b coming together. That's what the dendrogram represents. Putting the two steps together, a sketch of concatenate-and-cluster in R is below.
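Here's a minimal concatenate-and-cluster sketch, again with hypothetical `expr` and `meth` matrices (patients in rows this time) rather than real data; the per-block scaling is one simple guard against the larger methylation block swamping the expression signal, as discussed above.

```r
# Hypothetical patient-by-feature matrices (same patients, same row order)
set.seed(2)
expr <- matrix(rnorm(50 * 200),  nrow = 50)   # 50 patients x 200 expression features
meth <- matrix(rnorm(50 * 2000), nrow = 50)   # 50 patients x 2000 methylation probes

# Concatenate: each patient now has all expression + all methylation features.
# Scaling each block first is one crude way to keep the bigger block from dominating.
combined <- cbind(scale(expr), scale(meth))

# Hierarchical clustering on Euclidean distances between patients, single linkage
d  <- dist(combined, method = "euclidean")
hc <- hclust(d, method = "single")

plot(hc, labels = FALSE, main = "Concatenate-and-cluster dendrogram")
clusters <- cutree(hc, k = 2)   # cut the dendrogram into 2 patient groups
table(clusters)
```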
When you're trying to decide on the number of clusters, though, hierarchical clustering builds that whole tree, or dendrogram, so there are a few ways you can go. Cutting the dendrogram by eye is honestly respectable if you have clear clusters, which I would say you do here, with the a-b group and the d-f-e-c group; it's fine to do if very obvious groups are falling out of your dendrogram. But usually if it's that obvious, it will also be supported by statistics. One statistic you can use to decide the number of clusters is the silhouette statistic, which I'll talk about more in a moment. If you're doing spectral clustering, meaning you take the principal components of your data and cluster on those, or build a distance matrix off the first few PCs, then the eigengap is a great way to choose the number of clusters. And there are many more approaches, reviewed in a 2005 PhD thesis and in other papers, for choosing the number of clusters depending on which clustering method you're using.

The silhouette statistic I really like. First, a note: I edited my own slides but not these, and I found the word "pattern" confusing when I first read it, so to be clear, "pattern" here just means observation. The silhouette shows graphically how well each observation is classified into its cluster: for each observation in a given cluster, it compares how close you are to everyone else in your own cluster versus the average distance to observations in the other clusters. If you're more similar to everyone within your own cluster, your silhouette value is above zero; if it's below zero, down to negative one, it means you're actually closer to observations outside your own cluster. So, for example, the observations with negative values here could equally well be put in another cluster and do just as well, which suggests maybe you've cut too many clusters; you really want to maximize the statistic. Silhouette plots are usually shown like this, with cluster one, cluster two, cluster three: the plot divides your observations by cluster, orders them within each cluster, and lets you see where they drop off. Do you have half of a cluster whose members could just as easily belong anywhere else? That's really good to see if it's the case, because it may mean you're drawing a line in your data where no real boundary exists.

What is the distance? Yes, so the distance can be a lot of things. You could use a Euclidean distance, for example, which is based on the squared differences between the coordinates of two points, and there are all kinds of others: a Hamming distance is essentially a zero/one count of mismatches between two sequences. I think distance is really a function of what your data type is. If you have binary data, you want your distance to just be zero if two values are the same and one if they're not.
A Euclidean distance wouldn't make as much sense there, although it would behave similarly; but if you have continuous data you may want a Euclidean distance. There are also some really neat distances. You can use Euclidean distance if you have trajectory data, but you can also use things like the Fréchet distance. The analogy everyone uses for the Fréchet distance is walking a dog: your dog is on one course, that's one trajectory, and you're on another course, that's your trajectory, and the Fréchet distance is the maximum leash length you need for that dog, so it defines the distance between two trajectories as the largest separation between them along the walk. Pardon? Yes, for gene expression I would say the distance should probably be Euclidean, especially if the data are log-normalized, because then you have continuous, roughly normal data, and you just want the distance between values on a continuous scale, based on the squared differences between them. Sure, so if you're clustering genes, you would compute it over their expression levels across patients, and if you're clustering patients, you would compute it across all of their gene expression values between the two of them.

So the question was: if you have a set of options for the distance, and changing the distance changes the clusters you find, or even the number of clusters, so there isn't one unique clustering, what do you do then? Yes, that can happen, and then you'd really want to think about what your distance represents, and whether it's representing what you want it to represent in your data. If you just find clusters that way, that can be a good or a bad thing; it could be an artifact of using a distance that isn't actually appropriate. That's where you'd want to go into the theoretical aspects of the distance and what you're really measuring: are you assuming properties of your data, for example about its distribution, that aren't right to assume? Good question; that's super important, and choosing the right distance metric is incredibly important.

Oops, I thought I had removed this slide, but it's just to say that the silhouette scores in these two, or these three, clusters are going to be very low, whereas the silhouette scores in this one are likely to be well above zero. That's what the silhouette represents: if you have clusters where you're just dividing what really looks like one cluster, the silhouette will tell you; maybe the optimum here should actually be three clusters.
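To make the silhouette concrete, here's a short sketch using the `cluster` package, continuing from the hypothetical `d` and `hc` objects in the concatenate-and-cluster sketch above; the candidate values of k are arbitrary.

```r
library(cluster)  # provides silhouette()

# Compare average silhouette width across candidate numbers of clusters
for (k in 2:5) {
  labels <- cutree(hc, k = k)            # cut the dendrogram into k groups
  sil    <- silhouette(labels, d)        # per-observation silhouette widths
  cat("k =", k, " average silhouette width =",
      round(mean(sil[, "sil_width"]), 3), "\n")
}

# Silhouette plot for one choice of k: observations grouped and ordered by cluster,
# with negative widths flagging points that fit another cluster at least as well
plot(silhouette(cutree(hc, k = 3), d))
```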
All right, so iCluster is another method that's been developed and used quite a bit. It's a Gaussian latent variable model: basically it assumes there are multiple Gaussians in your data and those are your clusters, so your clusters are a mixture of Gaussians, and it tries to find those clusters within each of the data types, treating them as latent variables that all feed into the same clusters, with a regularization penalty on top. One really attractive aspect is that the sparsity regularization means much of the data will not contribute to your clusters at all, so when you form your clusters it tells you what was important for making them. The specific features, for example the genes whose copy number coefficient is not set to zero, are the ones that matter for the clusters you discover, so having the sparsity regularization means you can go back and say: this, this, and this were important for the clusters I found.

Some drawbacks of iCluster: because it's so computationally intensive, you have to select a subset of genes, methylation probes, microRNAs, and so on, so it can't use a very large amount of data unless you have incredibly high compute power or are able to work much closer to the machine. There's a lot of manual processing, because you need to pre-filter your genes; it takes roughly 1,500 genes at most, so you can't just throw in all of your gene expression. There are many steps in the pipeline, so you're making a lot of choices along the way, and depending on how you make those choices you may want to test the stability of your model. For example, just as Omid asked about choosing a distance metric, if you choose some hyperparameter in iCluster you might want to ask how sensitive your analysis is to that choice, which means you're now running multiple parallel analyses: if I change this 0.5 to 0.8, or change this threshold, does it change all of my findings? You'd want to know how sensitive your results are, and that can be a drawback. Also, the integration is mostly done in the feature space, so if there's a signal that is a combination of features across data types, it won't necessarily find it; it usually finds clusters driven by single groups of features, the copy number variation or the gene expression. So if there's a joint gene expression and copy number signal, or the methylation and gene expression signal Roman brought up, this won't necessarily capture it unless the signal is strong enough in the individual data types; it doesn't leverage patterns that exist only across the different data types.
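For reference, the method is available on Bioconductor, these days as iClusterPlus, an extension of the original iCluster. A heavily hedged sketch of a two-data-type run is below: the matrices are simulated stand-ins, the lambda values are arbitrary, and the exact arguments and defaults may differ between package versions, so check the package vignette before relying on this.

```r
# Sketch only: assumes the Bioconductor iClusterPlus package is installed.
# BiocManager::install("iClusterPlus")
library(iClusterPlus)

# Hypothetical pre-filtered matrices: rows = patients, columns = features
set.seed(3)
expr_sub <- matrix(rnorm(50 * 500), nrow = 50)   # e.g. top ~500 variable genes
cnv_sub  <- matrix(rnorm(50 * 300), nrow = 50)   # e.g. copy number for selected genes

# K is the number of latent variables (one fewer than the number of clusters here);
# lambda controls the sparsity penalty per data type (values below are arbitrary)
fit <- iClusterPlus(dt1 = expr_sub, dt2 = cnv_sub,
                    type = c("gaussian", "gaussian"),
                    K = 2, lambda = c(0.03, 0.03))

table(fit$clusters)   # cluster assignment per patient
# The non-zero coefficients in fit$beta (one set per data type) indicate which
# features drove the clustering, which is the interpretability benefit described above.
```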
So, similarity network fusion. Bo Wang, who is now the head of the AI research group at Princess Margaret, developed this method in Anna Goldenberg's lab, and his idea was to integrate data in the patient space. As I said before, it's tricky if you have, say, gene expression, genetic data, microRNA, and nutrition data, because they don't all map to a unit like a gene; but they do all map to a unit like a patient. So what you can do is put all of them into the patient space, which I'll describe in more detail, and then fuse them together in that space.

More concretely, putting the data in the patient space essentially means creating a distance or similarity matrix, or network, for each data type. These matrices are really just networks, where you have an individual patient and the cells in that patient's row and column are their connections to the other patients. If you think of this row and column as patient one, you can see they're not really connected with anyone, which is why they show up in the network as having no connections; a distance or similarity matrix can be represented as a network, which is why this is called similarity network fusion even though you'll see a lot of matrices here. It means patient one is very dissimilar from everyone, off on their own, whereas patient eight here is very connected to several people, meaning they're very similar to them in whatever data type was used. Here it would be gene expression, so patient eight's gene expression profile is really similar to this group of people, while patient one has a really different gene expression profile.

Then, with your data types in the patient space, you do the fusion, which is basically graph diffusion: you multiply each network by a sparse version of the other, meaning we knock out all the small connections and keep only the really strong ones, multiply this one by that one and vice versa, and iterate. What that ends up doing is growing any pattern that is either strong in a single data type or shared by multiple data types. So a medium connection, and when I say signal I mean a connection, say one like this that's moderate in this data type and in this data type, will be retained as you multiply through, and it will be up-weighted because it's shared across the two data types. But a connection like this one, which is pretty weak in this data type and absent in this one, will be pushed down. So it's a way of denoising the connections between your patients using multiple data types, and you get a fused matrix out, a single matrix or network representing clusters that carry signal from both of your data types.

This was done in the Nature Methods paper that Bo wrote with Dr. Goldenberg, in glioblastoma, integrating methylation, mRNA, and microRNA data. In the fused matrix you can see the denoising I described: you get a much stronger signal within your clusters and much lower connections outside them, so you're essentially finding the signal that's shared across the data types and up-weighting it in your data. This panel and this panel are the same thing, one represented as a network and one as a matrix. You can also look at which data types are actually driving the connections: for example, this little cluster here is very much driven by the combination of the methylation and microRNA signal, so this group is similar and that similarity is primarily driven by those two data types, while the gene expression data isn't contributing much to it, whereas in this other cluster there's a portion that is much more driven by the gene expression data.
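The method is implemented in the SNFtool R package. Here is a minimal sketch with two hypothetical data types; the number of neighbours K, the kernel parameter 0.5, and the number of fusion iterations are the package's usual example settings, not tuned values.

```r
library(SNFtool)

# Hypothetical patient-by-feature matrices for two data types (same patients)
set.seed(4)
expr <- matrix(rnorm(100 * 200), nrow = 100)
meth <- matrix(rnorm(100 * 300), nrow = 100)

# 1. Normalize each data type and compute patient-by-patient distances
D1 <- dist2(as.matrix(standardNormalization(expr)), as.matrix(standardNormalization(expr)))
D2 <- dist2(as.matrix(standardNormalization(meth)), as.matrix(standardNormalization(meth)))

# 2. Turn each distance matrix into a patient similarity network
K  <- 20                       # number of nearest neighbours kept per patient
W1 <- affinityMatrix(D1, K, 0.5)
W2 <- affinityMatrix(D2, K, 0.5)

# 3. Fuse the networks by iterative diffusion (the step described above)
W <- SNF(list(W1, W2), K, 20)

# 4. Cluster the fused network and inspect it
estimateNumberOfClustersGivenGraph(W, NUMC = 2:6)   # eigengap-based suggestions
groups <- spectralClustering(W, 3)                  # e.g. 3 clusters
displayClusters(W, groups)
```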
When they looked at the clinical properties of these subtypes, they found a real distinction in prognosis and also in patient attributes. Subtype three, which I believe is this one, is a younger group and seems to survive longer, and I believe it was the IDH1-mutant subtype, which is who these patients are, and they respond better to treatment; I'm not sure exactly what "treated versus untreated" means in this figure.

So, some advantages and disadvantages. It does integrative feature selection, so you don't necessarily have to choose upstream which data you're integrating, and that can be helpful just for interpretation. We applied it to neuroimaging data, and you can always find clusters of patients based on neuroimaging, especially structural MRI, but the question is what those clusters mean. In our case we were trying to find clusters corresponding to OCD subtypes, and we were just swamping our signal with other aspects of the patients: we got a huge gender and age signal when we clustered them. We found that if we were more specific about which regions we included in the analysis, restricting to regions known to be associated with OCD symptomatology, we were much better able to find clusters that were meaningful. So you don't have to choose your features upstream, but it can still be very useful to do so: with cancer, for example, you often have a big genome-wide signal, but with some phenotypes, and even some cancers, you may not have a huge signal across every gene, so choosing genes can help. Building the networks requires extra work, and it's computationally intensive if you have thousands of patients, so that can be a drawback. And it's unsupervised; just from my experience trying to publish similarity network fusion analyses, everybody wants to draw conclusions after the fact, but it's really just showing you what's in your data. Unsupervised analyses basically give you your data back in a different form; supervised analyses do too in some respects, but really unsupervised analyses do. It's hard to turn it into a supervised problem, other than asking afterwards whether you can separate groups, and even then that's observational, you're not really predicting anything. So that can be a constraint: it's really good for exploring and understanding your data and seeing whether a pattern persists across different data types, but if you're after a supervised analysis, I would not choose an unsupervised method to do it.

Sorry, is semi-supervised learning relevant there? So the question was whether semi-supervised learning is relevant for genetics data, and I certainly think it is. One huge benefit of semi-supervised learning is that you can use more data: maybe you don't have labels for some portion of your data, and you can then use your unlabeled data in addition to your labeled data. I have a colleague who wrote a paper called Dr.VAE,
and he developed a variational autoencoder, using a semi-supervised approach, to create an embedding of patients in gene expression space and predict how they would respond to different drugs. He was able to use both labeled and unlabeled gene expression and drug response data for that, so I certainly think there's a role for it. In that case I think he did a pre-selection where he chose the most variable genes, but it's a good question; I can't actually remember specifically how he solved that problem.

All right, switching over to survival data, a really different data type, but as you've seen, we often apply our similarity network fusion clusters to understanding patient survival, so survival analysis is really useful for that. Survival data is called time-to-event data, or time to a single event, and the analyses are in general called time-to-event analyses rather than survival. So you have a start point, like the beginning of a study or the beginning of a person's life, so age or years lived until you get cancer, or until you die, for example, and then you have something like days to last follow-up for the censored observations, because you have the end of the study and not everybody will have had a completed event at that point; not everyone will die before your study ends, generally speaking. One really important assumption about survival data, and this is partly why it's called survival data and not just time-to-event data, is that at some point the event will happen: even though we never observe that these people have died, we know they eventually will. This becomes an issue if you want to model time to surgery, or time to transplant, or some other time-to-event outcome, because if someone doesn't get surgery within the time of your study, that doesn't mean they will at some point in their life get that surgery or that transplant. That's why it's called survival data: we can reliably say that everyone we observe is going to die at some point. These are also called failure models; accelerated failure time models will model survival data, and I believe they were originally developed for modeling light bulb or device time to failure, where again you know the device will eventually fail, so it fits the assumptions. If you want to transfer these models to some other time-to-event situation, you have to make sure that this assumption, among others we'll talk about, is really satisfied, because it's a very important assumption you're making with these models. What's cool is that they let you use the censored observations: if you were just modeling the point at which people die and needed to know each endpoint, these people would have to be left out of your analysis, but survival modeling allows you to include them and simply say, I only have information up to this point, and I know they didn't die up to that point.
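In R, the survival package carries that censoring information in a Surv object: a time plus an event indicator, where 0 means the observation was censored at that time. A tiny made-up example:

```r
library(survival)

# Hypothetical follow-up data: time in years; status 1 = event (death) observed,
# status 0 = censored (still alive at last follow-up / end of study)
follow_up <- data.frame(
  time   = c(0.8, 2.1, 3.0, 1.4, 5.0, 4.2),
  status = c(1,   1,   0,   1,   0,   1)
)

Surv(follow_up$time, follow_up$status)
# Censored observations print with a "+", e.g. 3.0+ means
# "still event-free at 3.0 years, event time unknown beyond that"
```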
So we know they survived at least that long, which means you can include their attributes and their observations and increase the power of your study by including people without an observed final event. There are two important quantities in survival modeling. There's the survival function, the probability that someone is still alive at time t, so they're event-free up to that point. And there's the hazard rate, which is the probability of a person dying in the next instant; it's a limiting quantity, kind of like a derivative, asking what the probability is that you die in the very next moment. Some examples of hazard rates: a constant hazard rate, which you could think of as "no aging", where at any given time you have a completely constant likelihood of dying that just does not change with time; an increasing hazard rate, where the older you are the more likely you are to die, which is in general the hazard rate we encounter; and a decreasing hazard rate, which is what you see with high infant mortality, for example, where the hazard is high early in life and the longer you live, the longer you tend to live.

All the survival graphs we've looked at have been Kaplan-Meier curves, and the Kaplan-Meier estimator is just a really nice way to represent survival data. A Kaplan-Meier curve shows the probability that a member of a given group will have a lifetime exceeding t, as we saw, so it really just displays your data back to you: given the number of people at risk at each event time and the number of actual deaths, it calculates the proportion of your group surviving past that time. We already talked about this example, and something else you can read off the plot is the median survival: half the people in group two are expected to have died by, I can't quite see, it looks like before half a year, so it's pretty dire if you're in group two, whereas in group one half the people are expected to have died within about two years, which is good compared to group two but still not great.

Then you have the hazard ratio, which compares the two groups. Hazard ratios are really important when you're actually modeling your survival data, because they express the difference between your groups as a relative risk: the risk if you're in group one versus the risk if you're in group two. A hazard ratio of 0.43 means the hazard in group one is 43 percent of the hazard in group two, so group one is better off, with less than half the risk of the event at any given moment, and a hazard ratio of 2 means the hazard in group one is twice as high as in group two. (I initially read those the opposite way around on the slide; 0.43 is the protective direction here.)
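The Kaplan-Meier estimator behind those curves multiplies, over the event times t_i up to t, the fraction surviving each event: S(t) = product of (1 - d_i / n_i), where d_i is the number of deaths and n_i the number still at risk at t_i. In R, survfit() computes it and survdiff() gives a log-rank test between groups; the data frame here is a small simulated stand-in, not the GBM data from the slides.

```r
library(survival)

# Simulated stand-in: two groups with different event rates
set.seed(5)
km_data <- data.frame(
  time   = c(rexp(30, rate = 1.2), rexp(30, rate = 0.3)),   # first group fails faster
  status = rbinom(60, 1, 0.8),                              # roughly 20% censored
  group  = factor(rep(c("group2", "group1"), each = 30))
)

fit <- survfit(Surv(time, status) ~ group, data = km_data)   # Kaplan-Meier per group
print(fit)                                  # includes median survival per group
plot(fit, col = c("blue", "red"), xlab = "Years", ylab = "Survival probability")

survdiff(Surv(time, status) ~ group, data = km_data)         # log-rank test
```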
All right, so Cox proportional hazards, which is what we're going to use to actually model our data. Another modeling technique is accelerated failure time models, which I brought up before; they require more parameterization of your data, but apparently even Cox himself preferred accelerated failure time models, so I would consider either if you're doing an analysis, especially if, and we'll look at this in our data, your data don't meet the proportional hazards assumption. If you don't meet the assumptions of the Cox proportional hazards model, or you don't want to model hazards but instead want to model specific time points, that is, estimate at what time each person will have the event, accelerated failure time models can be very useful. But Cox models are extremely popular, and they're popular for a reason, so we'll go over them.

They capture how multiple variables, such as genes, clinical variables, or the clusters you identify, affect survival. You estimate the ratio of risks, and the way you do it is a Cox regression, where the hazard is a baseline hazard multiplied by the exponentiated prognostic index, h(t) = h0(t) * exp(b1*x1 + ... + bp*xp), and that linear predictor is where your linear model lives, so it's effectively a kind of generalized linear model that you're fitting. For intuition: a coefficient is a log hazard ratio, so for a one-unit increase in the predictor it's the increase in log hazard, in other words in your likelihood of having the event, and its exponent is the hazard ratio for a one-unit increase. If a beta value is less than zero, there's a decreased hazard and hence longer survival, so betas below zero identify protective features in your data, betas above zero identify deleterious features, and exponentiating them gives you the hazard ratio. Most software will report both the raw coefficients and the exponentiated coefficients, so you can interpret both. Hazard ratios are multiplicative: 2 means double the hazard and 0.5 means half the hazard. If you want the hazard ratio for a subject with one set of predictor values compared to a subject with another set, you take the ratio of their hazards, which works out to the exponent of the beta values times the difference in their characteristics, exp(b * (x1 - x2)), which is mathematically very convenient. Hazard ratios can also be interpreted as a percentage change in risk. I always found this example confusing, but I think it makes sense: if treatment is coded 1 for active and 0 for placebo, and the hazard ratio, so the exponent of the beta rather than the beta itself, is 0.8, it means you have a 20 percent decrease in mortality risk on the treatment, so the treatment is protective; the treated group carries 80 percent of the placebo group's hazard.
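In R this is coxph() from the survival package. A small sketch on simulated data, just to show where the coefficients and hazard ratios appear in the output; the covariates and effect sizes are made up.

```r
library(survival)

# Simulated stand-in: survival times that depend on age and a cluster label
set.seed(6)
n  <- 200
df <- data.frame(
  age     = rnorm(n, 60, 10),
  cluster = factor(sample(c("A", "B"), n, replace = TRUE))
)
risk      <- 0.03 * (df$age - 60) + 0.7 * (df$cluster == "B")   # made-up log-hazard effects
df$time   <- rexp(n, rate = 0.2 * exp(risk))
df$status <- rbinom(n, 1, 0.8)                                  # some censoring

fit <- coxph(Surv(time, status) ~ age + cluster, data = df)
summary(fit)
# "coef"      column: log hazard ratios (beta); < 0 protective, > 0 deleterious
# "exp(coef)" column: hazard ratios; e.g. 0.8 would mean ~20% lower hazard per unit
# summary() also reports the concordance, which is the next topic
```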
Which brings us to the concordance index. One challenge with survival models is expressing which model fits your data best, and one way people have approached this is to ask how well one survival model orders your data compared to another: does it put everyone in the correct order? The reason this is attractive for Cox models is that estimating the baseline hazard can be tricky, there are a lot of ways to do it, so you don't want to be judged on predicting the actual time to event, which can be very sensitive; instead you ask whether you get the relative timing right, that is, am I saying this group of patients will have an event before that group? Are you ordering them correctly? That's what the concordance index indicates: it captures your ability to order individuals correctly with respect to their survival time. The formula counts how many pairs are concordant, adds 0.5 times the number of tied pairs, where you estimate the two at the same time but they aren't actually tied, and divides by the total number of pairs, so it's just the proportion of pairs you order correctly out of all possible pairs. No other common metric captures the ordering of individuals in this way, but it's really important to be precise when you use the C-index: I've seen a lot of people treat it as a proxy for AUC, for example, and it's really not that; it's quite different, and its interpretation is very different from an AUC.
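As a toy illustration of that formula, here is a naive C-index in R under the simplifying assumption that every event is observed (no censoring); with censored data only "comparable" pairs count, which is what survival::concordance(), or the concordance reported by summary(coxph()), handles properly.

```r
# Naive concordance index, assuming no censoring: a higher predicted risk
# should go with a shorter survival time
c_index <- function(risk, time) {
  concordant <- 0; ties <- 0; total <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] == time[j]) next          # skip tied survival times in this toy version
    total   <- total + 1
    shorter <- if (time[i] < time[j]) i else j
    longer  <- if (time[i] < time[j]) j else i
    if (risk[shorter] > risk[longer])        concordant <- concordant + 1
    else if (risk[shorter] == risk[longer])  ties       <- ties + 1
  }
  (concordant + 0.5 * ties) / total
}

c_index(risk = c(3, 2, 1), time = c(1, 2, 3))   # perfect ordering  -> 1
c_index(risk = c(1, 1, 1), time = c(1, 2, 3))   # all tied predictions -> 0.5
```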