About myself: my background is in machine learning, and I started doing computational biology, or computational medicine actually, as a postdoc about nine or ten years ago. My work is in developing novel methodology for integrating different kinds of biological and clinical data, so trying to help the patient by bringing in more data about them. All right, so the learning objectives for our module today: we'll start simple and talk about different kinds of data individually, or at least how you could process each kind of data on its own for the purpose of identifying subtypes, for example. Then we'll talk about several data integration methods, and finally we'll conclude with survival analysis.

So clinical data might look like this; for a lot of cancer patients you would have all of these variables and more. Estrogen and progesterone levels are somewhat specific to breast cancer, but this is the kind of data that is already being used in publicly available predictive systems, which you can find online. Here's one of them. This is a prediction system which you can look up at this URL; I checked it today, it still works, and they already have a second version. You input the information it asks for, age at diagnosis, the mode of detection, tumor size, and several other categories, and it gives you a prediction, not only of survival for the patient, but of the benefit of different kinds of treatment for this patient. This is built from proprietary data and also from the public data that is currently available. So these predictive systems exist, and even with clinical data alone you can already build them, but clinical data is just a small drop in the ocean of all the other data that is becoming available.

In my practice as a machine learning person working at a hospital, I've seen all of the following kinds of data: genetic data, gene expression, epigenetic data, microRNAs, proteins. All of these layers we have been integrating already, but there are other kinds of data depending on the type of disease you're looking at. For neuropsychiatric diseases, of course, you will have a lot of questionnaire data. There's imaging data, which is now becoming available in the same realm for both cancer and non-cancer diseases such as neuropsychiatric disorders, and sometimes you have diet, especially when it's related to stomach cancer or inflammatory bowel disease and things like that. So the question is: can we really integrate all of this data to understand how to treat the patient and to refine their prognosis and diagnosis?

Here is public data that is available if you are developing methods or want to try out your methods, predictive as well as descriptive and exploratory approaches. You have probably all heard, or should have heard, of The Cancer Genome Atlas. The Cancer Genome Atlas is a phenomenal repository of publicly available cancer data covering 29 primary cancer sites, with 5 cancers having over 1,000 samples. With this kind of data you can already start doing something with more complex methods. Twelve cancers have more than 500 samples, and this is across multiple data categories. You can see that in the case of breast invasive carcinoma, over 1,000 samples have been genotyped. So this is the sample size for which this data is available.
So this is substantial. All right, why integrate patient data? Well, for one, and this was one of the first questions asked of us that we tried to solve, to identify more homogeneous subpopulations of cancer patients, so that we could potentially treat them differently rather than relying only on their cancer stage or something like that. The second is to try to help predict response, and not just predict response in general, but out of the array of drugs that have already been tested, maybe not even in patients but in cell lines, to decide which drug to give to a given patient.

So let's start simple, with single data type analysis. This is a paper published in 2005 in PNAS on glioblastoma, and I'll use glioblastoma as my running example throughout the different methodologies. This is a single data type analysis of 20 glioblastoma patients. Glioblastoma is an invasive, adult-onset brain tumor which is really lethal; there is no cure. Temozolomide is the standard line of treatment, and it doesn't work for the majority of patients; it delays death a little, but not by much, by several months or maybe a couple of years. So you see a graph here. This is the data that was collected: gene expression in 20 glioblastomas, which was then clustered to identify these two groups. Each row represents a gene, you can see the gene names on the right, and each column is a patient. What they did was hierarchical clustering of these patients, and they identified these two clusters; I'll talk later about how hierarchical clustering actually works. Not only did they find this difference according to the set of genes, and then identify through univariate analysis which genes were associated with these clusters, but these two groups also had very different survival prognoses.

Survival is something we'll be covering today, and if you work with cancer you will have seen survival data already. Here is the plot, and a little bit of history. These are Kaplan-Meier plots, usually called curves, tied to the Kaplan-Meier estimator published in 1958. Apparently Kaplan and Meier each submitted a separate paper on this estimator, and John Tukey convinced them to work together and submit one paper, which has since been cited over 50,000 times. So it's one of the most heavily used tools in cancer for capturing survival information. What you see here is that the probability of surviving for one year in group one is about 80 percent, and the probability of surviving a full year in group two is about 20 percent. So there is definitely a difference, a significant difference, between those two groups. There are also these dots; these are censored observations, and we'll talk about them later. These are individuals who either dropped out of the study or whom the researchers were not able to follow all the way through to the end.

At the next stage came some kind of data integration. People started with multiple types of data. Here they had 200 glioblastoma patients, and for those they had gene expression, mutations, copy number variants, and clinical data.
What they did they called an integrative analysis, but what they actually did was look at mRNA expression and cluster the patients, again, into essentially four major groups. The groups were named after the genes associated with them: proneural, neural, classical and mesenchymal. But the point is that even though they had additional data, they only analyzed it after they had identified their clusters. So really the subtypes they identified were largely gene expression driven, and that makes a difference.

What about methylation data, for example? They had methylation data, and if you look at their paper, they say methylation did not seem to be associated with the clusters. But if you analyze the methylation data separately, as in this 2012 paper, where they analyzed methylation on its own along with other kinds of epigenetic data, they identified a major methylation aberration, which in the end was tied to IDH1 mutation. All the GBM patients with an IDH1 mutation had hypermethylation basically across the whole genome. It is a very strong signature, but because it appeared in a small percentage of the population, about 10% or maybe 12%, the earlier study did not find it, because their groups were identified based on gene expression. So really you want this complementary information integrated together, as opposed to analyzing one data type at a time and then trying to see how one informs the other.

To that end, several different kinds of approaches have been used since then. One is concatenation followed by clustering, the simplest one: you take your data, you put all of your measurements together, and then you cluster the patients according to that. The second is iCluster, and the third is SNF, which we developed, and I'll tell you about each of them individually. So, concatenation: say you have gene expression and methylation across patients, and you mash them together. That's great, but the problem is that you have a relatively small signal; not all genes are informative about the patient subtypes. By mashing everything together you actually decrease the proportion of signal in your data, which makes it harder to identify the subtypes, because the features that would give you the subtypes are now a smaller fraction of the total number of features you are considering. Nevertheless, this was the predominant way of integrating data in the majority of the broad TCGA papers across cancers, and it is how they identified their subtypes: they concatenated the data and then did hierarchical clustering.

Very simply, hierarchical clustering works as follows. You construct the similarity matrix; here you can use Euclidean distance, correlation, different linkage criteria, whatever, people use different kinds of metrics, and in the packages you use you can always pick different distances. In the end you get a patient-by-patient similarity matrix. Hierarchical clustering then takes the pair with the smallest distance, meaning the most similar ones, and groups them together. So in this representation, D and F were the most similar ones in this two-dimensional space, which is just a representation, so they get grouped together, and you find their mean.
The next closest point to this mean is E, so it gets merged with them. At the same time, A and B are close together, so they get merged. You do this iterative merging until you've merged everything, and that is what this dendrogram represents. This tree is a dendrogram that records the mergings of the different patients, and essentially the similarity at different levels. You can cut it off here, and then you have as many clusters or groups as you have patients; you can cut it here, and you have only two groups, where you would say the patients in one group are more similar to each other than to the patients in the other group. This is essentially what the clustering does. It is very simple, but very, very powerful; I would never discount it.

So, deciding on the number of clusters: unfortunately, there is no silver bullet. There is no single metric that will tell you exactly how many clusters there are. What do people use most often? Arbitrarily cutting the dendrogram, like I said, at different levels, depending on your prior belief about how many groups you are looking for; the silhouette statistic, which I will describe; the eigengap; and so on. There are more methods covered, with different kinds of examples, in the PhD thesis referenced here, and I encourage those who are interested to look, but you will not find one metric that tells you exactly how many groups there are, unfortunately. I should also have listed the Tibshirani gap statistic, which is very commonly used as well.

The silhouette statistic captures the intuition of similarity within a cluster. Among these distances, a(i) is the average distance from individual i to everybody else in its cluster, and b(i) is the average distance to the individuals not in its cluster. We really want to minimize a(i) and maximize b(i), and that is how the metric is computed for every individual in every cluster i. It goes from minus one to one, one being a really good assignment. Minus one is almost the opposite of what you want, almost adversarial: you really did not put similar individuals in the same cluster. Zero means your clusters are half and half, essentially random. In the silhouette plots that packages usually produce, a positive score means the individual is close to everybody else in its cluster, whereas individuals with a negative silhouette score are further away from their cluster.

This brings up a philosophical problem: any clustering method will always assign everybody to a cluster. But ultimately you might have outliers that really don't belong to any cluster. This is something to keep in mind every time you analyze data: even after you cluster your data, you might not want to go with the result as-is. You might want to check which individuals don't seem to belong to their cluster; maybe they don't belong to any cluster, maybe they are your outliers, and you want to analyze them separately.
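As a side note, here is a minimal sketch in Python of the workflow just described: build a patient-by-patient distance, merge agglomeratively, cut the dendrogram at several levels, and score each cut with the silhouette statistic. The toy data, the planted "subtype", and the choice of Euclidean distance with average linkage are illustrative assumptions, not the settings of any of the studies discussed.

    # Minimal sketch: hierarchical clustering of patients plus silhouette scores.
    # Assumes a patients-by-features matrix; all names and settings are illustrative.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    from sklearn.metrics import silhouette_score, silhouette_samples

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 200))         # 60 patients x 200 features (e.g. expression)
    X[:20] += 1.5                          # plant a crude "subtype" for illustration

    Z = linkage(X, method="average", metric="euclidean")  # agglomerative merging
    # dendrogram(Z)                        # uncomment inside a plotting session

    for k in range(2, 6):                  # try several cuts of the dendrogram
        labels = fcluster(Z, t=k, criterion="maxclust")
        print(k, round(silhouette_score(X, labels), 3))

    labels = fcluster(Z, t=2, criterion="maxclust")
    s = silhouette_samples(X, labels)      # s(i) = (b(i) - a(i)) / max(a(i), b(i))
    print("possible outliers:", np.where(s < 0)[0])  # negative: closer to another cluster

The negative-silhouette check at the end is one simple way to flag the "individuals that don't seem to belong to any cluster" mentioned above.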
Here is an example, actually, that I've taken from Hussain Parthas' paper and master's thesis at the University of Waterloo. Essentially, these are two simple examples. In the first, in two dimensions, you have three clusters, which is pretty clear to the eye. Here in D, you have six clusters in two dimensions, but some of the clusters are also grouped together at a higher level, so there is a hierarchy of clusters as well. He compared all these different metrics for choosing the number of clusters; remember, this is the first scenario and this is the D scenario. The green curve is the silhouette metric, which is really the most common metric people use right now for determining the number of clusters. You can see that for the first scenario it performs pretty well, and for D it doesn't perform well at all. This is intuitive mathematically, but before you know what your clusters look like, it is very difficult to decide which metric to use, and it essentially becomes a chicken-and-egg problem. That is why people try different kinds of clustering.

Alternatively, you can do consensus clustering. Consensus clustering is also a very simple idea; these people managed to publish it, but people do it anyway, all the time, and it is even available in R as a package, I think. The idea is that you subsample your data: say you sample 80% of your individuals, you cluster those individuals, and you repeat, say, 1,000 times. It will take a while, but on a compute cluster it won't matter. Then you construct the consensus matrix: in other words, how often do two individuals appear in the same cluster, out of all the times they were sampled together? Given that they had a chance to be in the same cluster, did they end up there or not? There you will actually see individuals who keep meandering between different clusters depending on how you sampled your data, and those are your real outliers. It means you don't currently have enough data, of what you have considered and sampled, to tell you whether these individuals really belong to a cluster or not. So I propose to analyze the consensus matrix, find the connected components within it, or cluster the consensus matrix itself, rather than just doing one clustering and going with it.

Yes? So it's not just about the distance; the distance matrix itself will be stable. But once you subsample, you are removing some of the individuals. Say one individual was close to this one but not to that one; if you remove these individuals, it might now appear closer to that other cluster than to anybody in this cluster. So it's not really about the distance, it's about the stability under sampling and the density around the point you are sampling. Yeah?
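Returning to the consensus clustering procedure described a moment ago, here is a minimal sketch of the subsampling idea. The 80% subsampling rate, k-means as the base clusterer, and the number of repeats are illustrative assumptions, not the settings of any particular package.

    # Minimal sketch of consensus clustering by repeated subsampling.
    # Co-clustering frequency is computed only over runs where both patients were sampled.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))                 # 100 patients x 50 features (toy data)
    n, k, n_runs, frac = X.shape[0], 3, 200, 0.8

    together = np.zeros((n, n))                    # times i and j landed in the same cluster
    sampled = np.zeros((n, n))                     # times i and j were subsampled together

    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same

    consensus = np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)
    # Patients whose consensus values never settle near 0 or 1 keep "meandering" between
    # clusters -- candidate outliers; the consensus matrix itself can also be re-clustered.
    instability = np.mean((consensus > 0.2) & (consensus < 0.8), axis=1)
    print("most unstable patients:", np.argsort(instability)[-5:])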
All right. A very powerful tool, which is now used in a lot of TCGA papers, is iCluster, introduced by Shen et al. in 2009. This is a latent variable model; in some sense it is like factor analysis, and I'll just give you the intuition for it. What they assume is that you may have many different types of measurements, and they are not going to lump them together; instead they assume that ultimately these measurements should result in the same partitioning of the individuals. So in some latent space there is a true clustering, which all of these measurements reflect at some level, and this is what they find: a latent z such that each data type x1, x2, and so on can be projected onto z, and all of them give the same projection. It is an optimization problem; they identify z, and z is what gives you the clustering in the end.

The problem with this approach is that even though the original claim was that complementary data types would both be captured, it does not actually work so well. I had some simulations, but I removed them; the point is that this method does not capture complementary types of measurements very well.

So for the existing methods there are several problems. If you do concatenation, you essentially destroy the structure of each individual set of measurements, and whatever you find might not be optimal. If you use more sophisticated methods like Bayesian methods or iCluster, they have a limit on the number of features for which they work, so you have to pre-select the measurements somehow beforehand; it is not part of their pipeline. Maybe you pick the most variable features, but how do you know the most variable ones are the ones that are really driving the cancer? So pre-selection becomes a challenge. Because of that, there are also many steps in the pipeline, and at each step you have to look at the data and decide how to interpret it. Another problem: if you have, for example, dietary information on the patients versus genetic information, how do you actually combine those in a way that makes sense? They are on different scales and come from different situations.

So I will now tell you about similarity network fusion, SNF, which is the method we developed with my student in 2014. The idea is very simple. First, we construct a similarity matrix of patients, just like you would for, say, hierarchical clustering. Second, we combine these matrices together. Once you are in the patient space and not in the measurement space, the scale doesn't matter, and a lot of these problems go away. How do we construct the similarity networks? You take each individual type of data and compute your similarity matrix according to your favorite distance. We used a kernelized Euclidean distance; you can use correlation, you can use a chi-square distance for categorical variables, et cetera. Finally, if you think about this patient similarity matrix and sparsify it a bit, for example by removing the individuals that are not very similar to you or that have almost zero correlation with you, it becomes equivalent to a network. That is why it is called similarity network fusion: the network is basically a representation of what this matrix captures.

All right, so that is the construction of the similarity matrices, or networks, and then we combine those networks. The way that happens is as follows; there are a couple of things here. It works like graph diffusion, if any of you have seen this in physics or elsewhere, it is a common concept: graph diffusion, or random walks on graphs. This approach simply proposes to walk across multiple graphs at the same time. That is all it does.
But the point is that at each step we make each matrix, or each network, more similar to the others. Basically, we take this network and multiply it by that network, and we diffuse the information, so the resulting matrix has the properties of both networks. For example, here we did not have any edges connecting these individuals, but over there we had a strong similarity between all of them, so that information propagates into our first graph. In this graph we did not have the connection between these three individuals, but it turned out there was a very strong similarity between them in another data set, and that information also propagates. As a result, first of all, we are guaranteed to converge to a single matrix, because we keep making each network more similar to the others at every step. But we are also able to capture this complementarity and remove quite a bit of the noise. By noise I mean similarity that is not supported by multiple types of data and is not very strong; all of that goes away. So if there is actual structure in your data, it comes out more clearly.

Here is an example, the glioblastoma study. We had 215 patients, and for those we had DNA methylation, mRNA expression and microRNA expression. You can see here the actual matrices we ended up with, the similarities between the real patients, and you can see from these matrices that the structure of the similarity is very different across the different types of data. This is as expected; it is why, for example, if you cluster gene expression you do not see the same signal as in methylation, et cetera. This is what we had already seen in the literature. MicroRNA, interestingly, gives a very diffuse signal; it is very hard to see whether there is any clusterability in it at all. Here I show the networks that correspond to these matrices. The layout is not specific to each individual data set; it is taken from the final fusion, just so you can compare them visually.

And this is the fused matrix. You can see that a lot of the scattered noise disappears. The expression data had a lot of similarity, but it was very weak and was not supported by the methylation and microRNA data. MicroRNA also had a lot of noise, but it was not supported by the expression or methylation data, so it went away, and the structure became much more obvious. This tiny little cluster actually turned out to be the IDH1 subtype defined by methylation. The colors of the edges indicate which data types support each similarity: there are very few black edges, which are supported by all of the data sets, and many more of these pink, magenta edges, supported by both DNA methylation and mRNA. And there are some very interesting patterns supported only by microRNAs, for example here. If you go back, you can see there is quite a bit of signal in the microRNA data, but unless you know what to look for, it is really hard to identify. That is the problem.
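Stepping back from the example for a moment, here is a simplified sketch of the cross-diffusion idea just described: build one patient-by-patient affinity per data type, then repeatedly diffuse each network through the average of the others until they agree. The published SNF additionally uses a sparse k-nearest-neighbour "local" kernel in the propagation step and a specific scaled kernel, so this stripped-down version is for intuition only; in practice the SNFtool package should be used. All data, function names, and parameters below are illustrative assumptions.

    # Simplified sketch of the similarity-network-fusion idea (cross-diffusion across views).
    import numpy as np
    from scipy.spatial.distance import squareform, pdist

    def affinity(X, sigma_scale=0.5):
        """Patient-by-patient affinity from a kernelized Euclidean distance (illustrative)."""
        D = squareform(pdist(X, metric="euclidean"))
        sigma = sigma_scale * np.mean(D)
        W = np.exp(-(D ** 2) / (2 * sigma ** 2))
        return W / W.sum(axis=1, keepdims=True)       # row-normalize to a transition matrix

    def fuse(views, n_iter=20):
        """Iteratively make each view's network more similar to the average of the others."""
        P = [affinity(X) for X in views]
        for _ in range(n_iter):
            P_new = []
            for v in range(len(P)):
                others = np.mean([P[u] for u in range(len(P)) if u != v], axis=0)
                Q = P[v] @ others @ P[v].T            # diffuse this view through the others
                P_new.append(Q / Q.sum(axis=1, keepdims=True))
            P = P_new
        return np.mean(P, axis=0)                     # fused patient similarity network

    rng = np.random.default_rng(0)
    expr, meth, mirna = (rng.normal(size=(80, d)) for d in (500, 300, 100))  # toy "views"
    fused = fuse([expr, meth, mirna])
    print(fused.shape)                                # 80 x 80 fused similarity matrix

Weak similarities that only one view supports get washed out over the iterations, while similarities supported by several views reinforce each other, which is the behaviour described above.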
And in this fused network you can actually start analyzing things: here there is a pocket of individuals whose similarity is supported by microRNA and DNA methylation. You can imagine that when doctors make judgments about how to treat different patients, depending on which patients they are similar to, their response may be predicted well by the patients who are similar to them, and not at all by the others. That is what this matrix indicates: even if it looks like there are three clusters, that does not necessarily help us predict how the individuals in each cluster will respond to different drugs.

Okay, so no, we did not weight the different data types, and I get this question every time. The reason we did not is that we had no idea which data types would, or should, be more prominent. What people usually say is, "I don't trust my CNV data, can we weight it down?" But the reality is that if the CNV data is basically random, it should not affect the method too much. If all of your data is random, you will get random back, no question. But if one of the data sets is random and the others actually have structure that you can capture, the third one should not matter too much. At the end of the day, if you have a real belief that methylation is more important and is driving the disease, then maybe you want to upweight it, and that is not difficult in the method. But the reasons people want to upweight or downweight different types of measurements are not necessarily the reasons they should be downweighted, because we are trying to capture any kind of signal, like the microRNA here. We would have downweighted microRNA based on its matrix, which looks like random noise, and then we would not have captured some of the similarities that are due to microRNAs; we did not know that a priori.

So that is my point: this is exploratory analysis, and you want to get the most out of your data. Maybe afterwards you will say, okay, it looks like this data type is not really contributing much, so I will treat it separately. But we had situations, for example, when we were analyzing a disease where the cytokine data really did not match the rest of the data we had. We had expression data and clinical data, and expression and clinical matched each other better than either matched the cytokine data. I said, wait a minute, what is going on? It turned out the processing was wrong: the values had been exponentiated, so it looked like there were just outliers in that data. Once we reprocessed that data without the exponentiation and took the outliers out, we could see that it corresponded very nicely with the expression data. So I think for computer scientists it is easy to say, I don't know anything about this data, I will just weight everything equally. For biologists, you might have a prior and want to use some of it. But for exploratory analysis, I think it can be played with a bit.

OK, I think I might skip this. This just shows the clinical properties of the three clusters. Blue is the smallest cluster, and that is the IDH1 one; it is very well known that those patients are younger and have a better prognosis.
The other clusters are basically the same: even though they look biologically quite distinct, their survival is the same and the age distribution is the same, so maybe something else distinguishes them. Interestingly, temozolomide seemed to have an effect only in subtype 1, the big subtype, but not in the IDH1 subtype and not in the medium-sized subtype we found. So there are definitely more types of data that could be collected in glioblastoma, et cetera.

OK, we then applied this to five different cancer data sets, and let me show you what we got; we found similar things. This is the glioblastoma with 215 samples, and there were also breast invasive carcinoma, kidney renal clear cell carcinoma, lung squamous cell carcinoma, and colon adenocarcinoma. Even with a smaller number of patients, which is nice, we were able to find something. These are the clusterings shown in a PCA, using the first three components of the matrix: the clustering of the individuals for glioblastoma, breast carcinoma, and kidney, and the corresponding survival curves. Colon adenocarcinoma was the smallest data set, and it still seemed to separate into three clusters fairly well. Also, something interesting that you should note when you do this kind of analysis is what happens with tiny clusters. For kidney renal clear cell carcinoma, for example, we had a tiny cluster, and when we looked at it, the p-values for the survival analysis were jumping up and down. The problem is that with very small clusters, moving one individual out of the cluster can have a huge effect on the p-value. So the p-values are not very stable when you have very, very small clusters; you should take care of that and note it in your analysis.

There are certain benefits and disadvantages. For example, if you get a new cohort and want to integrate it, you rerun SNF, which may be considered a disadvantage, because a new cohort might give you a completely different partition of your patients. Deciding which measurements are important for each specific cluster is done in the usual way: we look at each measurement and see how much it supports the clusters. We used a Kruskal-Wallis test, and it seemed to work well for us (a small sketch of this per-feature screen appears below), but it is still univariate, feature-by-feature testing, which is ultimately not what you want; you might want to look at pathways, et cetera, and that work is ongoing. But the benefit of patient networks is that even if you do not cluster them, they are already informative. They can tell you, yes, there look to be some potential clusters, but ultimately the data is more heterogeneous than that, so you might want to stay in the continuous space. To me, a network is like a continuous version of clustering, like a continuous versus a discrete variable: yes, you can discretize your variable, but you get a lot more information if you keep it continuous. Looking at all the similarities in a continuous space gives you a better understanding of your disease than trying to partition it, because in reality, in a lot of the cancer cases we looked at, new individuals tend to fill the gaps in the network rather than support the idea that there is a subtype 1 and a subtype 2. The new individuals turn out to sit somewhere in the middle between subtype 1 and subtype 2, and this keeps happening with a lot of different diseases.
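On the point above about deciding which measurements support each cluster, here is a minimal sketch of the per-feature Kruskal-Wallis screen against the cluster labels, assuming SciPy; the toy data, planted feature, and label vector are illustrative assumptions.

    # Minimal sketch: which features are associated with the cluster assignment?
    # One Kruskal-Wallis test per feature against the cluster labels (univariate screen).
    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(0)
    X = rng.normal(size=(90, 40))                 # 90 patients x 40 features (toy data)
    labels = np.repeat([0, 1, 2], 30)             # e.g. cluster labels from the fused network
    X[labels == 0, 0] += 2.0                      # plant one cluster-specific feature

    pvals = np.array([
        kruskal(*(X[labels == g, j] for g in np.unique(labels))).pvalue
        for j in range(X.shape[1])
    ])
    order = np.argsort(pvals)
    print("top features by Kruskal-Wallis p-value:", order[:5], pvals[order[:5]])
    # In practice these p-values should be corrected for multiple testing (e.g. FDR).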
If you want to use it, there is the SNFtool package on CRAN, and it tends to work fairly well; if there are questions, you can always ask. So this concludes the section on integrative methods, and I will talk about survival analysis next. Any questions? Yes? About the simplest integration approach, the concatenation? No, there they just standardize the variables. Individually? Yes, each variable is centered and standardized individually and then combined. The way we do it, by contrast, is to normalize each similarity matrix by the sum of its entries, so that the matrices are comparable to each other when we combine them; with concatenation, they just standardize each variable. Yes? If you run SNFtool and you get these three classes, how do you go about finding out which factors were actually important in creating these classes, how do you analyze that? So, we take each feature and do a Kruskal-Wallis test against the clustering assignment of the individuals; it is essentially the simplest thing we could do. We also used NMI, normalized mutual information; in the paper we used normalized mutual information, but you can do it with a Kruskal-Wallis test, it doesn't matter. Yes? OK.

So, survival. Here is what we will cover: hazard rates, not to be confused with hazard ratios; the survival function; the Kaplan-Meier estimator; the log-rank test; and the Cox proportional hazards model. After that I will link it back to the network-based analysis example we did.

Survival data, as we have already seen, is a time to a single event: either death, as in a lot of invasive cancers, or time to treatment failure, for example, or time to metastasis. It really depends on what outcome variable you are interested in, but it is a time to a single event. If the outcome information for a patient is missing, we call the data censored, and to be precise we call it right-censored, because we know that if the event happened, it happened after our last observation. In a lot of cases we also have to assume that the censoring happened essentially at random: there is no pattern, and the censoring was not due to the disease of interest itself but to some other circumstance, somebody moving, for example, so that following them was no longer possible. Here is an example of right censoring: you have days to the last follow-up, and the last follow-up is here, but you do not actually know whether the patient died, or at which point they died, et cetera. So you have the information for patients 1, 3, and 5, but not for 2 and 4, which are censored.

There are two important statistics that survival analysis usually operates with. Say the event happens at time t. There is the survival function, which measures the probability that a person is still alive at time t. And there is the hazard rate, which is different from the hazard ratio: the hazard rate is the probability of the event happening to a person right after time t, given survival up to time t. It is a limiting quantity, where "right after" means the next instant after time t, with the interval delta t going to 0.
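Written out, the two quantities just described are the standard definitions:

    S(t) = \Pr(T > t),
    \qquad
    h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}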
So it is basically the probability of the event occurring between t and t plus delta t: what is the hazard that the event happens right after the time at which you have observed the patient? To get more intuition: a constant hazard rate essentially means no aging, nothing changes with time. An increasing hazard rate, which is the usual case in cancer, means the older you are, the more at risk you are. And there are cases with a decreasing hazard rate, for example right after birth: the hazard for a newborn is highest at birth, and the older the infant gets, the more likely it is to survive.

All right, so this is the Kaplan-Meier estimator, which I have already mentioned. You want the probability that an individual, a member of a given population, has a lifetime exceeding t, that is, does not die by time t. At each event time you have the number of people at risk of dying and the number of people who actually died, and the estimator is essentially a product of the corresponding survival probabilities; that is what this plot shows. This is the estimator that you visualize here: every step in the step function is a member of population one, or population two, dying.

The hazard ratio is a very important measure when people look at survival. The hazard ratio compares two groups that differ, for example, in treatment: you look at the observed versus expected death rate under a given model in population one versus population two. If the hazard ratio is one, there is no difference between the two populations. If the hazard ratio is below one, population one is at relatively lower risk than population two. And if, say, the ratio equals 2, population one has twice the rate of death, or of treatment failure, of population two.

I think this is one of the final things: another very common model is the Cox proportional hazards model. The Cox model allows you to incorporate multiple predictors of the hazard rate. You have multiple variables, x1 through xp, and each of them contributes a weight beta to the log hazard, and that is what you estimate. If a beta is negative, the corresponding variable has a protective effect, it makes it less likely for an individual to die; if beta is positive, the variable contributes to the hazard, a negative effect on survival. h0 is the baseline hazard, and the point of the Cox regression model is that you do not actually have to estimate it, which is very nice. It also means you can really only compute hazard ratios: if you want to compute anything else from these statistics or betas, you do have to know the baseline hazard, and that becomes more problematic.

All right. So, using and interpreting the Cox model: you can look at the hazard rate in population one and the hazard rate in population two, and these are your two models. If you have the same measurements in the two populations, you actually look at the difference.
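In symbols, the estimator and models just described are the standard formulations:

    \hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right),
    \quad d_i = \text{deaths at time } t_i,\ n_i = \text{number at risk just before } t_i

    h(t \mid x) = h_0(t)\, \exp(\beta_1 x_1 + \cdots + \beta_p x_p)

    \mathrm{HR} = \frac{h(t \mid x^{(1)})}{h(t \mid x^{(2)})}
                = \exp\!\left( \beta^{\top} \left( x^{(1)} - x^{(2)} \right) \right)

Note that the baseline hazard h_0(t) cancels in the ratio, which is exactly why it never has to be estimated.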
It ends up giving you an intuition about how a difference in measurements between the two populations translates into the risk of a person dying. In the extreme case, say x equals 1 if the treatment is active and x equals 0 if the patient received placebo; then if the hazard ratio is estimated to be, say, 0.8 or 0.7, it means a 20% or 30% decrease in mortality on treatment compared to placebo. So this is how you would interpret the Cox proportional hazards model, and I think you will have a chance to practice that in your lab; you will be computing it.

And finally, something that I think does not get used often enough in survival analysis: the concordance index. The concordance index captures the ability to order individuals correctly with respect to their survival. If you place an individual who lives longer next to one who lives shorter and then again one who lives longer, that should give a lower concordance index than ordering them correctly. It is the only metric in survival analysis that looks at pairs of individuals, as opposed to individual or population-wide statistics.

I want to give you an example of something we did with SNF. This is a breast cancer example, the METABRIC data, which is very good to know about if you work with breast cancer. They have CNV and expression data, and this was a little while ago when we did this experiment; they now also have microRNA data for these individuals, which has been published, and maybe methylation data as well, though I do not know whether that has been released. The point is that they have almost 2,000 individuals with breast cancer: the discovery cohort is about 1,000 patients, and the validation cohort is about as large. So this is a very substantial data set. The original goal of this data set was also to identify subtypes in breast cancer. According to PAM50, which is a clinically approved classifier, there are five clusters, and you can see the p-value for the discovery cohort and the p-value for the validation cohort from the log-rank test on survival, and the CI, the concordance index, for the discovery and validation cohorts. You can see that the p-values fluctuate depending on how many clusters you have. iCluster is the other method I mentioned; there was a Nature paper in 2012, and they called the result IntClust, but it is an iCluster-style approach that they used. They found 10 clusters, and they now say there are 10, maybe 11, clusters, and that this is how many subtypes of breast cancer there are. You can see that their p-values are certainly lower, but also their concordance is higher, which means iCluster was able to partition the population into more homogeneous subgroups where survival is better ordered than with PAM50. And we did the same: we integrated these two types of data using SNF and clustered the network into five clusters and into 10 clusters. Even though you get a somewhat more stable result, and maybe the same result as iCluster with five and 10 clusters, you can see that the concordance index is basically the same.
Maybe it is a tiny bit better than iCluster, but it is pretty much exactly the same whether you use five clusters or 10 clusters. So clustering our network further did not actually yield a better ordering of the individuals according to their survival. The p-values are more sensitive, but the concordance index showed us that it does not really matter. This was done in response to a reviewer asking, so how many subtypes are there in breast cancer? And basically we responded that it is the wrong question to ask, because I personally envision a world where you have one big network of all the patients in the world, you place a new patient into it, and you use the whole network to make a prediction for that new patient directly. Compare that with what is done now: you have a patient who does not really fit anywhere but fits best with subtype 1, so we group them with subtype 1, forget all their other characteristics, and use the characteristics of that subpopulation to build predictors or to tell the patient how their disease is going to progress. And that is wrong, because we forget all of the personalized measurements, both biological and clinical, that we have taken for this individual. Whereas if you use the whole integrated network, one very positive thing is that you can take an individual and also use the information from the individuals that are not similar to this patient. Once you group people only with those like them, you essentially throw away the information from the individuals who are unlike them; with the whole network you can use all of that information on a more continuous basis, whether a patient is similar or dissimilar.

So what we did, of course, with the network you cannot get a p-value because you cannot do the grouping, but we computed the concordance index by adding a little bit of regularization to the Cox model, saying that people who are close together in the network should have similar survival and people who are far apart should have different survival. That is basically the regularization we added; the optimization is the same, it is just a modified Cox model. And you can see that with the same network we used to cluster into five or 10 clusters, the result here is much better. It is exactly the same data: you have already embedded your individuals into this common space, and using the exact same data you can improve the ordering of the individuals according to their survival, just by using this information continuously.
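The modified Cox objective was not written out in the talk; one common way to encode "similar patients should get similar risk scores" is a Laplacian-style penalty on the linear predictors, added to the Cox partial log-likelihood. A sketch of that general shape, as an assumption rather than the speaker's exact formulation:

    \hat{\beta} = \arg\max_{\beta}\;
        \ell_{\text{Cox}}(\beta)
        \;-\; \lambda \sum_{i,j} W_{ij} \left( \beta^{\top} x_i - \beta^{\top} x_j \right)^2

where W_{ij} is the fused patient-similarity network and \lambda controls how strongly network neighbours are pushed toward similar predicted risks.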
So this was my reason for thinking that networks are the future. Building that, though, is going to be really hard, because sometimes the data cannot be shared to compare individuals across hospitals, so it would be interesting to look at what summary statistics you would need to be able to compare patients and build such a network beyond a single hospital, let us say. All right, so this concludes my portion; I guess I always speak faster than I expect. What we are looking at now, for people who are interested: there are already papers being submitted that do simultaneous feature selection and integration, and this is very, very nice. And what I think we need is to look at pathways, or at sets of features, not just taking a pathway and looking at its gene expression, but looking at gene expression, methylation, this heterogeneous type of information, for groups of genes. That would be very nice. There is also a supervised version of the patient-network type of approach, where you combine networks based on the outcome you are interested in. Gary Bader has developed and published this work; I do not remember exactly where, it may still be a preprint or already in print. It is called netDx, I think; I can look it up and confirm. It combines the networks in a linear fashion, but at least if you have a specific outcome according to which you want to combine the patient networks, you can do that. And regarding the weighting of the contributions, which somebody in the audience asked about: if there is enough demand for it, we have an idea of how to derive the weights, but they would still come from the data. If you already know how you want to weight the different data types, it is very simple, because you can just upweight the corresponding network as it goes into the method. If you want to learn how you should weight the different types of measurements, that would require a somewhat different methodology.

Okay, I am happy to take any questions if you have them. All right, yes, question. What were your external validations in that process; do you do anything with your network, tune it, when you get to the external validation, or is it completely fixed and just applied, like an equation? Like an equation, yes, though it depends. For the survival data we had to fix it. In general it is an exploratory tool, so it combines whatever data you give it. What we usually do, if we really want subtypes, is get the subtypes from the network, then derive a classifier based on all the features that assigns samples to those subtypes, and then apply that classifier on the new data set, the validation set. We have just done this for medulloblastoma, which is coming out in Cancer Cell. I ask because sometimes with external sets there are issues with generalizability and people start fine-tuning the model. Yeah, yeah. I just wanted to know what your numbers were actually representing there. That is the fixed model, yes; that is the model that was used for the measure.

Can I ask, can the model accommodate missing data? If you are combining data that is missing for some cases, can it accommodate that? That is a good point. There are two types of missing data: data missing at random, where one particular measurement is missing for an individual, and a whole patient missing from one data type, for example you have methylation data for a patient but not gene expression. So yes, it is not part of the package yet, but we have done the analysis using essentially all available imputation approaches, and what we found was that imputing on the original data does not do as well as imputing on the similarity matrix.
So we have several approaches; actually, it is not even a single approach, it is a pipeline, which basically tells you which method is best for your data and what you would have gotten if you had had all the data originally. The pipeline works like this: we subset the data to the portion of patients for whom we have all of the possible measurements, we evaluate the imputation approaches on that subset, and then we say, okay, this is the best imputation approach for this data given its characteristics. Then, using that approach, we impute, not whole patients but similarities. We do not actually impute whole patients, and we do not propose that you do, because, and this was a very interesting exercise, you could impute complete garbage, random stuff, and the p-value would still go up. So at the end of the day it is not good to evaluate the final product only on, say, the survival data; if you have a better outcome measure, that is usually great. But there is a way, and we can share the code, we should just make it into a package, to impute on the similarity matrix, and that works fairly well. Yeah.

So, is there any machine learning here? Well, iCluster is arguably machine learning: it is a latent embedding. SNF is machine learning. There are other methods that are machine learning that do subtyping, all kinds of clustering; it really depends. For the integration purpose specifically there is not so much. There is multiple kernel learning, which we actually compared to in our paper. What we do is somewhat similar to multiple kernel learning, but kernel learning is a linear combination, whereas ours is iterative, and because it is iterative it becomes highly nonlinear in the end. Drug response prediction is a different question: that is a supervised question, which means you have a label and you are trying to predict the drug response, and depending on the kind of data you have, you can use any of the standard machine learning classifiers or regression methods to predict it. What we are trying to do in my lab now is to build a deep learning method which would take cell line information together with patient information and combine the two. We actually have a paper on bioRxiv that shows how multiple classifiers perform for drug response prediction when you just combine patients and cell lines together, but the reality is that it is not ideal, because what we have discovered is that cell lines have a systematic bias, and combining them with patients directly is a little bit like mixing apples and oranges. So we are trying to build a better system, again with a latent embedding, where there is similarity between cell lines and patients, but not in the original space; assuming similarity in the original space actually hurts the performance. But there is a lot of work on that. For drug response prediction in cell lines there are reviews, I think from OHSU in Oregon, several papers at the Pacific Symposium on Biocomputing, and the DREAM challenge.
They ran a whole bunch of classifiers and looked across tissues, across drugs, across classifiers. So there is really a lot of work, but these are very different questions: subtypes are something you are trying to discover, you do not have a label, it is a completely unsupervised question. Drug response is usually treated as a supervised problem, or we are now treating it as a semi-supervised problem, where you use a classifier to predict something for which you have already observed some outcomes, and you are just trying to build a generalizable one.