So, I would like to share with you something that's actually very exciting to me, because data integration is my area of research and this is what we do in my lab: we develop novel methods for data integration. We will go from the clinical data to the omics data to the data integration methods, but I will first share with you the state of the art in this field: what has been done before, how people have approached this problem, what we are doing now, and where it is all going. I will also talk a little bit about survival analysis, just some intuition about it, because that is a very important part of data integration and the evaluation against clinical data, and then you will do a lot more hands-on work on that with Lauren. So, this is an example of the kind of clinical data that is regularly available. It is just a snapshot of a few variables, but you will get race, sex, family history, information about what kind of treatment they had, maybe hormone therapy or the status of certain proteins, the stage and size, and you also get a lot of information about survival: first of all whether it is the primary tumor or whether the tumor has recurred, and then all the times associated with that. So the time of the original diagnosis, the time of the overall outcome or the outcome of the disease, and these are different because the patient might have died not of the cancer but of other causes, so those two times can differ, plus the status of recurrence and then the time to recurrence. There are actually a lot of publicly available tools that now take this data and can do inference with it. This one is called PREDICT; it is a tool for breast cancer and it is available online for anybody who wants to try it. You put in some of the information about the patient, including the age, the status, some of the chemo regimen, et cetera, and the grade, so information about the tumor and the patient, and what it does is give you estimates for this patient: the probabilities of survival, how likely they are to survive after five years and after ten years, and what the benefit of different kinds of therapies would be. In this case, it looks like chemo would be beneficial for this patient. I think the reality is that the situation is a lot more complicated, and that is the reason the field of integration came about: not all the patients with the same clinical variables will have exactly the same survival and outcome. So that is why a lot more data is being collected now. What is being collected? For each patient, well, genetic information. In clinical genetics labs, I believe in at least the big hospitals, people are collecting at least exomes, sometimes whole genomes as well. The transcriptome is being collected, in both clinical and research settings; epigenetic data, usually DNA methylation but sometimes other epigenetic marks as well; microRNAs; proteomes, or protein expression. Then you have the clinical data you had before, but for some of the diseases there is additional information such as extensive questionnaires. For neuropsychiatric diseases you definitely have this information.
And I think there has been a push to have a general psychiatric evaluation for a lot of patients who don't necessarily present with a psychiatric disease, just to have a broad understanding of the background and comorbidities and so on. There is also, of course, imaging data; again, for neuropsychiatric disorders it is very common, and there is MRI for some of the chronic diseases as well, et cetera. And finally, especially in cases such as IBD and related conditions, there is diet information available. I have actually seen all of this data, maybe not for the exact same patient but across a couple of different cohorts; I have seen all of this data in my lab, where people come and say, well, how do we integrate all of this data to make sense of it, to help patients. Some of this data is publicly available. TCGA, which has probably been mentioned before in this workshop, is an amazing resource. This is The Cancer Genome Atlas, and they now have 33 different cancers, with more than 500 patients for many of them, and the cool thing is that they have all of these different modalities. They have collected this data, stored it, and made it publicly available. They have the exome data (this one is protected access, which is why it is marked), they have SNP arrays, methylation data, mRNA microarrays, clinical data, and they also have proteome data, protein levels for around 180 proteins, available for a lot of these cancers. And I believe that in the lab afterwards you will be playing with one of these datasets. So why do we want to integrate patient data? Anybody want to say anything? Why would we want to integrate patient data? Why is it not enough to look at the transcriptome if we believe that there is a regulatory problem? Yeah? Yeah, that's exactly right. So actually, a lot of times, at least in the research setting, I have been asked by clinicians and research scientists to identify subsets of the population that are more similar to each other. We can try to predict survival based on general population characteristics, but if we knew that we have a set of patients who are very much like this new patient and for whom we already know the outcome, it would be a lot easier to make accurate predictions for the new patient. And this is what ultimately people want to do with this integration: to identify subsets of patients. It's called disease subtyping, and it happens very often, and it leads to exactly the same thing you said: identify patients for whom we can predict the trajectory, what will happen to them later, how they will respond to drugs, et cetera. So here is a bit of history. Originally, of course, people started with small sample sizes and single-data-type analysis. This particular example is a paper from PNAS in 2005; I don't think it is possible to publish in a high-profile journal with 20 cases anymore unless it is a really rare disorder. GBM, glioblastoma multiforme, is a very aggressive and invasive adult brain tumor, and it is lethal. There is a standard of care, temozolomide, which extends people's lives by maybe half a year to a year, something like that, but they all die. So this is a very important cancer and it has been studied a lot for a long time.
And so this is essentially the standard pipeline of what was done then and is still being done for a lot of data sets now. People collected gene expression, in this case for 18,000 genes. They selected the most variable genes, essentially because genes that don't vary across the whole population are not likely to be predictive of differences in outcomes. They performed hierarchical clustering, and for those of you who have not seen how hierarchical clustering actually works, I will show in a little bit exactly how it happens. They identified clusters in the population using a subset of the genes. Here there are two clusters, a blue cluster and an orange cluster. On the right you have a heat map: every column is a patient and every row is a gene, and on top is the clustering dendrogram. Here they decided to split right at the top, so you have two clusters, and you can see that the blue cluster has a very different expression pattern from the orange cluster; even though there is still variability within each cluster, the difference between the two clusters is much bigger. The way these clusters are evaluated for their clinical relevance is that people look at the survival curves. What the survival curves show, essentially, is that there are two groups, a blue group and an orange group, and you can read the curve and say, okay, I have years along the x-axis and the probability of survival along the y-axis, and as I go to, for example, year one, how likely is my patient to survive if they belong to group two? It's roughly 20%; if they belong to group one, it's roughly 80%. And so what then happens is that if a new patient comes in, they use essentially this classifier, with the genes they selected, to predict whether the patient belongs to group one or group two, and based on that prediction they say, these are my survival curves, and that is how I can predict and assess the patient's risk. One more thing about these plots: the black dots here, usually drawn as vertical dashes, are censored observations. That means the information about that patient is no longer available beyond that point; it is only known that the patient survived beyond the last point of observation, but the time to the actual event is not known for that patient. So what happened after? This was the single-data-type analysis. The next wave was integrative analysis that was still based on a single data type. What happened? This is a very famous paper from 2010, Verhaak et al. in Cancer Cell. They had 200 GBM patients, and they also decided to go with gene expression as the primary basis for the decision about the clustering of the population. So what they did was take the mRNA and cluster the patients in a very similar fashion, and then they said, now we will try to either explain these clusters or add some of the genes that may further differentiate these patients. So these clusters here were essentially dictated purely by gene expression, but a few more genes were examined for their predictive value for these clusters.
So even though they had multiple different modalities of data, they didn't really use them to make a joint decision; it was primarily driven by a single type. That's why I call it single-data-type integration. And what happened was that the paper was hailed because, even though there was no real difference in the survival curves shown on the right here, the clusters that they identified made biological sense. They were called the proneural, neural, classical, and mesenchymal groups, and this was based on the genes that differentiated these clusters from all the rest. This plot I pulled from the supplementary. They basically talked about the biology of the disease and its importance, and they replicated the result: they had an independent set where they identified the same clusters given the genes associated with these different subtypes, I think about 80 genes per cluster. So in essence it does make sense. I have talked a lot to clinicians about the validity of evaluating by survival data, and it does make sense that if the biology of the individual is different, even if the survival is not necessarily different, they might react very differently to treatment, and that is a very important point for a clinician. The biology might dictate the standard of treatment for a specific individual; that is where the value of this analysis was. Most of the mutation analyses that were done, the somatic mutations, were in these papers associated with the clusters afterwards, not used independently. So, interestingly, in this paper they mentioned that they had methylation data, but they didn't find any differences; it was mentioned in passing, as if they found no use for the methylation data in GBM. And there was a paper that came out just two years later where they did a similar analysis, but they let the methylation drive the cluster identification, and then they found this hypermethylated cluster. You can see that the first band here is the DNA methylation probes, and most of them are hypermethylated. This is now very well known in the GBM community as the IDH1-mutant GBM subgroup: they looked at all of these patients that had hypermethylation and identified that almost all of them, except for one, had this IDH1 mutation. And I think what's interesting is that the same mutation does not give the same hypermethylated status in leukemia; it's present in a lot of leukemia patients, but it doesn't give the same hypermethylation profile as it does in GBM. But in GBM it's a very significant and very obvious pattern. All right, so this is up until about 2012-13. These were the papers coming out in the clinical literature that had a lot of different types of data and were trying to use them, but using one of the data types to drive the analysis, not really doing it jointly. So around then, over the last five or six years, people started looking at integration approaches, and I will only talk about the three most commonly used ones. One is concatenation and clustering; this is by far the most commonly used approach, by the Broad and by the TCGA community itself. The second one is iCluster, which is also quite commonly used in the literature right now.
The method came out in 2009, but I think in the cancer community it started to be used later. And SNF is the similarity network fusion that I will also talk about; that's the method that was developed in my lab and published in Nature Methods two years ago, and it is also now used in TCGA analyses by the consortium. So the first and the simplest is concatenation. You take the patient data, with the patients as rows, and you simply concatenate all the measurements that you have for each patient. It could be gene expression, methylation, clinical, imaging, whatever; you just make one big vector. You might want to normalize each data type individually, but then everything is concatenated together. The one problem I will mention, and I will show you the comparison between the performance of the methods later, is something you can see immediately: the structure intrinsic to each of the individual data types is lost. You might have picked a thousand genes and 15,000 methylation probes, and now you have blocks of signal of very unequal size, and a single probe will not be as strong a signal as a block of ten strongly correlated genes. So it is much harder to pick up an individual probe that is an important and significant contributor than a block of ten genes that are all strongly correlated. The problem is that the structure of each data set, the structure that is informative about what happens with the patients and the biology of the patient, becomes diluted. That is part of the problem with concatenating this data. So, as I promised, I will explain hierarchical clustering. This is essentially a similarity matrix of six different patients: you have patients A through F on the columns and the rows, the same patients, and the matrix shows how similar they are. You could use a correlation, for example, or Euclidean distances between them. Then you say, okay, patient A is similar to patient B with a similarity of, say, 0.71; it doesn't really matter what the actual value is, as long as the relative values are meaningful. So, for example, in minimum-distance, also called single-linkage, hierarchical clustering, you pick the minimum distance from the whole matrix and merge those two first. Here you see that patients D and F have the smallest distance between them, so you merge them first and they become, in effect, a single patient; then you update the matrix and continue. The next patients to be merged are likely A and B, and this is what you see here. On the left is basically this space with the patients A, B, C, D, E, F, showing how similar they are to each other, and on the right is the usual hierarchical clustering dendrogram. What you see in the heights of these merges is which merge happened first; the height is indicative of the distance at which the merge happened. So after D and F got merged and A and B got merged, the next closest point or cluster gets merged third, and so on. And the A-B group was very, very different, you can see it in this two-dimensional space as well; it was very different from the rest of them, and that's why it got merged with the others last. So when hierarchical clustering happens, you start with every single individual being their own cluster.
And you end up with one cluster containing all the individuals. Where? These are essentially the values of these points; this is how you calculate how similar they are in space, and this is a representation of them, it could be a projection: if A is a patient and they had two features, X and Y, these would be the values of those features. So this is what happens, and then, once you have grouped everybody together, you have to decide how many clusters there really are. This is an eternal question, and I have to tell you, as a computer scientist, that there is no single answer to it. There is no one answer to how many clusters there are, because depending on what you care about and how you formulate your objective for what each cluster should represent, the number of clusters will be different. That is natural, but it makes automating the identification of clusters very difficult. So here are some very standard procedures. For hierarchical clustering, people do it by eye; honestly, they most often do it by eye, and I'll show you an example of why sometimes that works better than anything else, although it is then hard to replicate the analysis in an automated fashion. Then there is the silhouette statistic, which is also very commonly used, but again it has advantages and disadvantages. The eigengap is for spectral clustering: if you have a graph and you cluster the graph, the eigengap is most commonly used. If anybody knows PCA, the eigengap works with the eigenvalues associated with each of the eigenvectors: you pick the biggest gap between consecutive eigenvalues and choose that as the number of clusters. There are many more, and there are reviews of all the different statistics used to pick the number of clusters; these are just the most common ones in the cancer subtyping literature. The silhouette statistic is from 1987, presented first by Rousseeuw, and it is very simple. It basically compares the distances within a cluster to the distances from each point in the cluster to the other clusters. For each point you compute the average distance to its companions, the other points or patients in its own cluster, and the average distance to the points in the nearest other cluster, and the silhouette is the normalized difference between the two. So if the distance to another cluster is small, meaning this point is really close to another cluster, you will end up with a negative silhouette, which means my point, which is in cluster A, is more similar to cluster B than to cluster A, so maybe the clustering was not correct. This statistic, or metric, goes between minus one and one. One means it is perfect: all of the points in my cluster are more similar to each other than to any point outside the cluster. Minus one obviously means everything is reversed, and zero, I guess, is borderline; you have to look at it.
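To make the mechanics concrete, here is a minimal sketch in Python of the two pieces just described: single-linkage hierarchical clustering on a patient distance matrix, and the silhouette metric used to compare different numbers of clusters. The data here are synthetic, purely for illustration; in practice you would start from your own expression matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

# Toy data: 60 "patients" x 200 "genes" drawn from three shifted distributions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(20, 200)) for m in (-2, 0, 2)])

# Pairwise patient distances (Euclidean here; correlation-based is also common)
D = pdist(X, metric="euclidean")

# Single-linkage (minimum-distance) hierarchical clustering
Z = linkage(D, method="single")

# Cut the tree at different numbers of clusters and score each with the silhouette
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    s = silhouette_score(squareform(D), labels, metric="precomputed")
    print(f"k = {k}: mean silhouette = {s:.2f}")
```

Picking the k with the best mean silhouette is one of the automated alternatives to choosing the cut in the dendrogram by eye.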
So here's an example. There was a paper that looked at four different scenarios: three clusters in two dimensions, which is what picture A shows; B is three clusters in 10 dimensions, with 50 observations per cluster, and there is no good way to draw 10 dimensions, so there are no pictures for B and C; C is four clusters in 10 dimensions with randomly chosen centers; and D is six clusters in two dimensions. They looked at all kinds of different metrics for choosing the number of clusters, and our favorites are among them. Panels A, B, C, and D here are those different scenarios. Silhouette is the green one, and you can see that for scenario A, where the clusters are nicely separated in space, a lot of the different metrics perform very similarly, and perform relatively well. For scenario D, though, silhouette would pick three clusters, not six. With hierarchical clustering, if you did it by eye and you knew that there were six clusters, you could have picked the six clusters. But this is what happens: each of these clusters is easy to tell apart locally, but the pairs are essentially equally far away from everything else, so there is no real way to tell that these are all separate clusters, and each nearby pair just collapses into one cluster. By eye, would you pick three clusters? Yes, by eye you would also pick three clusters; you'd have to have prior information that there are actually six. Here you can see the colors; if I didn't use colors and drew them all in one color, you would see three clusters. What about k-means? So k-means assumes some kind of Gaussianity, so if your clusters do not follow that, it can fail; here this problem looks easily resolved with k-means, but for k-means you also have to give the number of clusters. It doesn't pick the number of clusters: it's k-means, the k; you assume k centers, and then you figure out the assignment. There is X-means, which was developed in the lab where I worked as a PhD student more than 10 years ago, which searches over the space and decides how many clusters there really are, but for k-means you have to give the number of clusters a priori. Could you, for instance, plot the data beforehand, take a look at it by eye, and then input whatever you think is the best number of clusters? In cases like A and D, where you have two-dimensional data, maybe, but most often we are not working with two-dimensional data anymore; that's the problem. Yeah. So another important thing, if you ever do this kind of bioinformatics analysis, is to figure out the robustness. Usually you get a sample, you cluster it, and you say, these are my clusters of the sample, these are my subtypes. But the reality is that a lot of clustering methods will put every single point into some cluster, and maybe some of these points are outliers: they're just far enough from every cluster that by chance they land in some cluster, but it's not meaningful, and it actually distorts the downstream statistics a bit too. So what people have been using is consensus clustering; the method has been published, but everybody had been doing something like it anyway. It is a way of establishing the robustness of the clusters. Instead of clustering once and having one clustering, you re-sample, say take 80% of your data, cluster it, note whether each pair of individuals is in the same cluster or not, then re-sample again and cluster, re-sample and cluster, and you do it hundreds, maybe thousands of times. Then you have a matrix of how often patients have appeared in the same cluster, given that they actually had a chance to appear in the same cluster. So entry (i, j) of this matrix is, for two patients i and j, how often they appeared in the same cluster given that they actually had a chance: maybe they weren't sampled at the same time, so they were not in the same subsample and couldn't be in the same cluster. So, given that they were both sampled, did they appear in the same cluster or not? And what happens is, you get the core of each cluster, the core that you roughly believe in, and then you can threshold: you say, I want the patients who appear in the same cluster 80% of the time, and I only trust those; the rest I simply cannot classify. This gives you more robust statistics about the biology later, when you are trying to figure out which genes are associated with the cluster: you will not be putting in the values for the patients that are outliers, you will just be using the core of the group. And basically, in my lab, we always do this kind of robustness analysis for the clusters.
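Here is a minimal sketch of that resampling idea, under simplifying assumptions: k-means as the base clustering and 80% subsampling. Real consensus clustering implementations add more bookkeeping, but the co-clustering matrix is built in exactly this spirit.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, k, n_iter=200, frac=0.8, seed=0):
    """Fraction of runs in which each pair of samples lands in the same cluster,
    counted only over the runs in which both samples were actually drawn."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))   # times i and j were clustered together
    sampled = np.zeros((n, n))    # times i and j were both in the subsample
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    return together / np.maximum(sampled, 1)

# M = consensus_matrix(X, k=3)
# "Core" patients are those whose pairs co-cluster in, say, >= 80% of shared runs:
# core_pairs = M >= 0.8
```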
All right, iCluster I will not talk about in detail. It's a Bayesian method; if any of you are familiar with factor analysis, it's very similar to that. The intuition behind the method is that there are some intrinsic true tumor subtypes, Z; they're hidden, they're latent, they're not known. But because you are working with the same population of patients, it is assumed that Z is the same across all of the different data modalities that you have. So here you have these Z subtypes, and here you have, say, copy number variation; it's possible that it looks very different, but you are constrained to push these individuals into the same clusters as you would with, say, the epigenetic data. So the method does an iterative inference to figure out what the clusters truly are given all of this data. This is truly a data integration method: they are not concatenating, they are trying to find a latent space that captures the structure of all of the data sets well. Their original analysis and their original paper said that they can capture the complementarity of the data as well, but in our experience, when we did a lot of simulations and ran different kinds of methods, we found that it didn't necessarily work for complementary data sets. If you have some complementarity and you want to capture both signals, it is hard to capture with this method. Usually if there is one strong modality, say methylation, that has strong structure, and the others don't necessarily have that structure, the analysis might be driven by the methylation data. Is this method somehow supervised? No, this is unsupervised, really unsupervised, and the reason is that at no point do they use any information about the clusters; they really are trying to find this latent embedding, this latent space that they don't see, which is common to the data types. What they do is infer a common representation and simultaneous projections from each of these data sets onto that latent space, all at the same time; yes, all of them happen at the same time. This method was also referred to as IntClust in the Curtis et al. Nature paper on breast cancer; I don't know why the different name, but it's the same method.
So some of the drawbacks of the methods we have discussed are that a lot of times they require manual processing, a lot of feature pre-selection; sometimes a paper says, we selected the 1,500 most variable features. We can do that, right? But if we have to replicate the analysis in a paper, and they said we picked this feature for this reason and that feature for that reason, it becomes very hard to replicate. The more pre-filtering happens beforehand, the harder it is to use the method again in a different context. There are many steps in these pipelines: a lot of filtering, manual curation, maybe picking the number of clusters by eye, so it becomes hard to automate the analysis if you want such a pipeline in your lab. And most of them, it's true, look at each individual feature, not combinations of features; so if there are combinations of features across different data modalities, that might be lost. So given all this, we proposed the similarity network fusion approach, which integrates data in the patient space, and the whole method fits on this one slide. The first step is to construct patient similarity matrices, and the second step is to fuse the multiple matrices. So what is a patient similarity matrix? It is the same as we have seen before. We start with a patients-by-mRNA-expression matrix, and the similarity matrix is the same kind of matrix that went into the hierarchical clustering you saw. We usually use a kind of Euclidean distance here, and the darker the spot, the more similar the patients are; the lighter, the less similar. When you correlate individuals you will have a lot of close-to-zero but non-zero similarities, and if you essentially zero those out, if you sparsify the matrix, you will have zeros corresponding to no edges and non-zero entries corresponding to edges in a graph. It's essentially the same thing, just a different representation: this matrix captures exactly what this weighted graph captures. In the weighted graph, each node is a patient and each edge represents how similar those patients are. And so we construct as many of these graphs, or similarity matrices, as there are data modalities; here we have two, and we integrate them. The idea for the integration is a random walk. Everybody is familiar with a random walk on a single graph: you start at one node, you walk around, and you may come back to the same node. Here it is a random walk across graphs, a kind of diffusion across graphs: you start at this node and say, okay, I have no edge here in this graph, so I walk somewhere else, but in the other graph I do have an edge. In every step, and this can be captured by a standard matrix multiplication, as all random walks can be, you update this graph with the information from that graph, and that graph with the information from this one. So in an iterative fashion you refine each of the graphs to be more similar to the others, and the byproduct is that similarities that are very weak in one of the graphs go away, because they are not supported by another graph and they were weak to begin with, so they just disappear. You remove a lot of noise that way, but the similarities that are very strong permeate into the other graphs. And because at every single step the graphs are guaranteed to become more similar to each other, you are also provably guaranteed to converge to a single graph which is supported by all of the data that you originally had.
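Here is a very simplified sketch of that idea in Python: build one patient affinity matrix per modality, then iteratively pull each modality's network toward the others so that edges supported by several modalities get reinforced. The real SNF update uses local, k-nearest-neighbor kernels and a slightly different message-passing step, so treat this purely as an illustration of the diffusion intuition, not as the published algorithm; the variable names in the usage comment are placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist

def affinity(X, sigma=1.0):
    """Gaussian affinity between patients (rows of X), normalized to row-sum 1."""
    d = cdist(X, X, metric="euclidean")
    W = np.exp(-(d ** 2) / (2 * sigma ** 2))
    return W / W.sum(axis=1, keepdims=True)

def naive_fusion(views, n_iter=20):
    """Iteratively diffuse each modality's similarities through the average of
    the other modalities' networks (simplified stand-in for SNF)."""
    P = [affinity(X) for X in views]
    for _ in range(n_iter):
        new_P = []
        for v in range(len(P)):
            others = [P[u] for u in range(len(P)) if u != v]
            avg_others = sum(others) / len(others)
            updated = P[v] @ avg_others @ P[v].T   # cross-network diffusion step
            new_P.append(updated / updated.sum(axis=1, keepdims=True))
        P = new_P
    return sum(P) / len(P)   # the fused patient-similarity network

# Example (placeholder names), one patients-by-features matrix per modality:
# fused = naive_fusion([expr_matrix, methylation_matrix, mirna_matrix])
```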
So the main idea behind this approach was that once we represent each of our individual data modalities by the similarities of patients, we are in the same space, a patient similarity space. And in a patient similarity space we can integrate anything, whether it's similarity based on diet, similarity based on DNA methylation, similarity based on imaging, or questionnaire data. We have actually done this, combining imaging and genetic data and some questionnaire data as well. So that was the idea: it doesn't need to be mapped to genes, it doesn't need to be mapped to a common unit; you can combine any number of data modalities using this kind of approach, okay? So I will show you the experiments. First I will show you the simulations, primarily so you can see how the different methods compare in different scenarios, and then I'll show you some of the results we got from the TCGA data. We compared methods: concatenation, iCluster, PSDF, and multiple kernel learning. The reason I didn't talk about the last two is that they are not used as often. PSDF is a very nice method, also a Bayesian latent factor method, but it only works for two different views, so it is not as scalable; I think they had an extension to that. And multiple kernel learning is used a lot for imaging data, but not in this kind of scenario where you want to integrate omics data. So here is one simulation. The ground truth is here in the middle: we have two clusters in this two-dimensional space, and we are trying to identify them from two imperfect views, two imperfect data modalities that we generated, data type one and data type two. What happens is that each data set captures one of the clusters pretty well and is misleading about some points in the other; this is the complementarity. One of the data views contains good information about one of the clusters, and the other view contains good information about the other cluster. So these are the results, plotted against how many points were swapped between the two clusters. The green one is iCluster and the black one is concatenation, and you can see that concatenation loses quite a bit of accuracy, and again, this is due to the loss of each data set's structure when you concatenate them. Here's another scenario, also very common in biology and in omics data. The ground truth again is in the middle, but now we add noise to the data: Gaussian noise on top and gamma noise on the bottom. The difference between gamma noise and Gaussian noise is that gamma noise has a tail; it's a skewed type of noise.
So you have the structure of the data captured by both of these imperfect modalities, data type one and data type two, but the noise patterns are different. On the left we are scaling up the Gaussian noise, keeping the gamma noise constant, and on the right we are scaling up the gamma noise. You can see, quite interestingly, that concatenation does okay with a low level of gamma noise, but again, the tail makes a difference: as we add more and more noise, it gets progressively much worse, whereas with the Gaussian noise it is a lot more stable, a gradual decline. The other methods, iCluster and SNF, perform much better here. So, with the TCGA data we had five different cancers, and this is a somewhat older analysis, so we had fewer patients than are available now for a lot of these cancers; two years ago it was anywhere between 92 and 250 patients. One important point for me, working with clinicians and trying to integrate data: they always ask how big a sample size we need, how many patients we need to be able to make the inference. With similarity network fusion we actually don't need as many patients; if there is some signal, we will be able to find it, because again, we work in a patient space, not in a feature space. We had three data modalities, mRNA, DNA methylation, and microRNA, for all five cancers. So this is the case study of glioblastoma, and this is the real data. The top is the DNA methylation; you have both the similarity matrix and the graph that corresponds to it. And you can see that it is the same set of patients, but the patterns of similarity between these patients are very different in each of the data types. So while it's possible that some patients look similar in all of these data sets, it is definitely not the case that the similarity is global. Yeah, question? Yes, you are right, the graphs on the right look similar to each other. The layout of the graph is based on the joint, fused network, so that you could visually compare them; you shouldn't look at the positions of the nodes, you should look at the edges, and those look quite different. For example, here there is a much stronger similarity between the small cluster and the larger cluster, whereas in mRNA expression the second and third clusters are a lot more similar to each other than to the small cluster. You see? The embedding of the nodes, the visualization, is based on the fused network so that you could actually compare; the node positions are exactly the same, so you can compare the edges that are left. You can see that, based on the DNA methylation, some of those similarities are still left here because they were very strong. And here we have the fused network, with the edges colored by which data types support the similarity of each patient pair, and you can see that the majority of them are based on a combination of DNA methylation and mRNA: both support the similarity between those patients.
Interestingly, there are also some subsets of these yellow edges, which are mRNA and DNA methylation, and there is the big pocket of green, which is the microRNA. And I think what's important is that this black is the color supported by all of the data types, and we actually had to go in and check: there are maybe a few such edges. There is an edge here, it's hard to see, but if I pointed it out maybe you'd see it; there's one here, there are a couple here. But the reality is that the majority of edges are supported by two of the data types, not by all of them. Yeah? And with this approach, what do you do when some of the data are missing? So you can handle that; we have follow-up work, which we are writing up now. It depends on a couple of issues. We looked at maybe ten different ways of dealing with imputation, because now you are imputing patients, not imputing values for a patient anymore. The standard scenario in a clinical setting is that you have maybe 10% of the data missing for a patient and you fill it in; in this case you have whole patients missing from a data type. There are different ways to deal with it, and basically what we show is that if you want to get the same kind of clustering as you would have gotten with all of the data, the best thing is to impute the similarities: impute the similarity of the patients rather than imputing the actual values, because there are no methods that impute the actual values accurately. This is what the imputation methods face: the majority of, say, gene expression looks like white noise, it's all around the same small value except for a few outliers, and those outliers you have no chance of imputing, you have no idea which values would be outliers. So you basically impute the mean every time, no matter which method you use, and imputing the mean is not very informative for these values. So we are proposing an idea of how to figure out which imputation method to use given your data, using a subset of it and simulating the scenario. Okay, and what we also found was that this small cluster was the IDH1 cluster, so it is possible that the methylation data still had an effect in our clustering procedure. This is the cluster that had better prognosis and was a younger cohort, as was known. What was also interesting is that temozolomide, which is the standard of treatment, didn't make a difference in any of the groups except subtype one: in subtype one, treated and non-treated patients actually had quite different prognoses, unlike in the other two, larger groups. Yeah? For the IDH1 subtype, how did you compare them? Since they're apparently young, they're probably going to be fairly healthy and they're all going to be treated, so how big was the group that was untreated with IDH1? I don't remember exactly; we wouldn't have had significance for one of two reasons: either there really is no difference even with enough data, or there is not enough data to reach significance. So I don't remember whether it was the case that we simply couldn't tell for IDH1. Yeah. You found out, first you built the networks, correct? Yeah. And you found that there were three subnetworks, and then you found out that the smaller one was the IDH1-positive one.
Yeah, we just checked their mutation status. So you just took the methylation, mRNA, and microRNA data and you found that? Yeah. Which is not hard to imagine, because we had the methylation data, and you saw that the methylation data had a strong signal, so you were going to get some of that clustering on its own without having put the mutation status into the system. Right, yeah. How did these subtypes match the other, expression-based subtypes? The IDH1 one mapped fairly well; for the others, it looked like one of them was mesenchymal, but I don't know: those subtypes are based on gene expression, and if we just clustered the gene expression we would have gotten those subtypes, whereas we are using three different types of data. There's a question about this survival graph: you were talking about temozolomide. Why talk about temozolomide rather than just saying these patients do better or worse? This is a group, treated versus untreated, within this big cluster. It's not just their survival: it's the group that was treated, in red, versus the group that was untreated, in black, and the treated ones did better. Oh, so you said they did better? Yes, yes. They didn't in the other groups, or rather we didn't have enough evidence to say that the treated ones did better there. And this was the case here. We did this analysis on the five cancers that we had; this is GBM, and you can see on top it's just a regular PCA plot of the data, which shows that the clusters separate fairly well, and below you can see the survival. The survival separation here was better than when using each of the individual data modalities independently, yeah. All right, yeah? Yeah, so I don't have that plot, and I'm not sure whether we have anything in the lab exercise for the feature selection, the NMI. The way we did it: once you get the clusters, you can do the usual thing, you can do your t-tests. What we actually found, and I'm sorry I don't have the slide, it would have been interesting for your question, is that the t-tests basically look at whether the distribution is different between cluster one and cluster two, but that doesn't correspond to the actual similarities according to all of the features; it doesn't capture the intrinsic structure within the cluster. And so the features that came out of the usual ANOVA and t-testing analyses were not very clean; they did not differentiate well between all the clusters. But when we did something that captured more of the structure, which was, for each of the features, to look at how it contributed to the clustering itself, we looked at the mutual information with the clustering, the normalized mutual information, and the features that we got, the top 1%, really aligned with the clustering. It actually turned out that there was a really strong microRNA signature; some of it had been seen before and some of it was new. Yeah.
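A small sketch of that feature-scoring step, under assumptions: each continuous feature is discretized into bins and scored by its normalized mutual information with the cluster labels. This is the general idea described above, not necessarily the exact procedure used in the paper, and the names in the usage comment are placeholders.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def rank_features_by_nmi(X, cluster_labels, n_bins=10):
    """Score each feature (column of X) by its NMI with the cluster assignment."""
    scores = []
    for j in range(X.shape[1]):
        # discretize the continuous feature into roughly equal-frequency bins
        ranks = np.argsort(np.argsort(X[:, j]))
        binned = (ranks * n_bins) // X.shape[0]
        scores.append(normalized_mutual_info_score(cluster_labels, binned))
    order = np.argsort(scores)[::-1]
    return order, np.array(scores)[order]

# top_idx, top_scores = rank_features_by_nmi(expression, labels)
# top 1% of features: top_idx[: max(1, expression.shape[1] // 100)]
```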
So Lauren will talk to you about SNF too. The good thing about SNF is that you don't have to do feature pre-selection at all; you don't have to pre-select the genes you will be working with, unless the signal is very weak. For example, if you have one feature that matters out of 20,000 and you correlate patients over all 20,000, of course that one feature is not going to contribute much to the patient similarities. But as long as the structure of the data is predictive, is associated with the disease, you can capture it without doing the pre-selection. And it's quite scalable: I think we looked at 6,000 individuals, and the integration ran within 10 minutes or so; the longest part was computing the patient similarities for the 6,000 patients, since it's a 6,000-by-6,000 matrix. All right, so, yeah? Did you apply it to all the cancer types from TCGA? Why did you choose only five of them? We looked at the biggest ones at the time we were doing the analysis, the biggest ones without an embargo. We did it on two more cancers, but they had an embargo, so we didn't pursue them; the data were available, though. And right now we are kind of moving on, because we are method development people. All right. So this concludes the methods for data integration, and I will switch to survival analysis. Some of you already know it very well; I will just talk about it briefly, because it is such an important part of the evaluation of all of these integrative methods. It's good to have the intuition, and then you will do more hands-on work with Lauren. So, survival data is characterized by the time to a single event. I talked about censoring; this is where some of the patient information can be missing, and we assume that this missingness is non-informative. Non-informative means that it is not the patients who are doing worse because of the cancer who drop out; we assume that we are missing patients at random, due to moves, due to other reasons. For uncensored data, we assume that we have actually observed the time of the event. So this is what it looks like: for the first patient, we definitely have the information; for the second patient, the study ended, the patient had not died, and there was no further follow-up, so we only know that the patient survived beyond the end of the study. That is censored, right-censored, information. It happens that there is a lot of censoring in the majority of the TCGA data, for example, but more important is that there was just very little follow-up. So for people who study, for example, breast cancer and want to perform this kind of analysis on breast cancer patients, there is the METABRIC data set, which is publicly available; you have to put in a request, but it is very easy to get, and it actually has much better clinical data associated with the samples. There are two important quantities we will talk about. There is survival: the survival function is just the probability of a person surviving beyond time x, S(x) = P(T > x). And there is the hazard rate, which basically says: I know the person was alive at time x; what is the rate at which the event happens in the next instant? It is the limit, as delta x goes to zero, of P(x <= T < x + delta x | T >= x) divided by delta x, so essentially x plus a little bit: what is the chance of the event in the very next instant, given that I know the person is alive now? So it's a rate; that's the difference, essentially. What are the possible values of the hazard rate? Can it be negative?
The hazard rate itself is non-negative; it is continuous, it can take any value on the positive reals, and it can increase or decrease over time. One example of a decreasing hazard, since you asked, is when the risk of dying decreases as time goes by; that definitely happens for infant mortality, for example, where babies are at much higher risk of dying right after birth than later on. So that's a decreasing hazard rate. A constant hazard rate is kind of nice: it means no aging, your chance of surviving the next instant is the same as it was before. But usually, with the aging process, the hazard of dying later is higher, so it increases. Is it basically a measure of the slope of the hazard? The hazard rate? Kind of; you will see more about it, and I think it will become more intuitive, but it is an instantaneous rate in time: what is the chance of the event now, given that you have survived up to now. Some of the further examples will help illustrate this. So, Kaplan-Meier estimators go with the Kaplan-Meier curves that I have been showing you. These are very commonly used, and the estimator is basically the probability of surviving: S-hat(t) is the product, over all event times t_i up to t, of (n_i - d_i) / n_i, where n_i is the number of people at risk at time t_i and d_i is the number of people who actually died at t_i. So it accumulates the survival rate over time; that is why there is a product. So now you know that the Kaplan-Meier curve is essentially plotting this survival estimate, and we have already discussed how to read it. Now say we have two survival curves: previously we had two groups, an orange and a blue group, and we need to know whether they are really significantly different, or whether our eyes are playing tricks on us. Especially when the curves overlap, it is really hard to tell whether there is a significant difference in survival or not. There is an actual test for this, the log-rank test, which is most commonly used for this purpose. The summary statistics for it are quite simple: the number of people at risk in each group and the number of observed events in each group. Then you assume that the two curves are the same; this is your null hypothesis, and you work to reject it. If you assume the two curves are the same, then the number of events in one group at each time point is captured by the hypergeometric distribution, and once you know it's a hypergeometric distribution, you know the mean and the variance of that distribution from the summary statistics, the usual analysis. So this is the mean of the hypergeometric distribution and this is the variance, and the test statistic is compared to a standard normal. Sometimes it is written as a chi-square, but that is when you are looking at the actual counts, the frequencies; these are the probabilities. And then you can look up the p-value and decide whether the difference is significant or not.
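Here is a small sketch of how you might produce those curves and that test in Python, assuming the lifelines package is available; the file and column names are made up for illustration.

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Assumed columns: 'time' (follow-up in years), 'event' (1 = died, 0 = censored),
# and 'group' (the cluster / subtype assignment). "survival.csv" is a placeholder.
df = pd.read_csv("survival.csv")

g1 = df[df["group"] == 1]
g2 = df[df["group"] == 2]

# Kaplan-Meier curves for the two groups
kmf = KaplanMeierFitter()
ax = kmf.fit(g1["time"], event_observed=g1["event"], label="group 1").plot_survival_function()
kmf.fit(g2["time"], event_observed=g2["event"], label="group 2").plot_survival_function(ax=ax)

# Log-rank test: null hypothesis is that the two survival curves are the same
result = logrank_test(g1["time"], g2["time"],
                      event_observed_A=g1["event"], event_observed_B=g2["event"])
print(result.p_value)
```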
All right, so coming back to the hazard ratio. This is the hazard ratio for the two groups, and it is the ratio of observed to expected events in group one versus observed to expected events in group two. If you have a value less than one, for example 0.43, that means the hazard in group one is reduced compared to group two: it is 43% of the risk in group two. If it is greater than one, say two, it means that the hazard of dying, or of not responding to treatment, the failure rate, in group one is twice the failure rate in group two. So far you are only dealing with two groups maximum? Yes, that's a good point. For the hazard ratio there is an extension to a larger number of groups, but I have never seen it used; I have only seen it used for pairs. Is there a test for more groups, or do you basically just compare them all to each other to find out if there is a difference? There is: the Cox PH model, which is what people usually use, and I will talk about it now. There you can actually look at multiple markers, multiple biomarkers and multiple groups. With the hazard ratio and things like that, you only look at pairs; you can look at all pairs, but then it is sometimes hard to summarize. So, another very common way to assess and estimate the hazard is the Cox proportional hazards model, called Cox PH. You definitely use this when you have multiple variables whose effect on survival or response to treatment you want to assess. The idea is that you have a baseline hazard rate, h_0(t), the general risk in the population, like a prior in a Bayesian sense, and every marker multiplies it exponentially: h(t | x) = h_0(t) exp(beta_1 x_1 + ... + beta_p x_p). A marker increases the hazard if its beta is positive, exponentially increasing the risk of the individual. So the betas are the coefficients for the markers: each one represents the log of the hazard ratio for one unit of change in that marker, or equivalently, e to the beta is the hazard ratio per unit of change. So if your marker is a gene expression value and it increases, let us say, then the hazard also increases; and if beta is negative, of course, it means the risk decreases with that marker. The hazard ratio between two subjects is then actually quite nice: the baseline hazard cancels out, and the ratio depends only on the difference between the two subjects' marker values. So, just as an example: if you have x equal to one when the treatment is active and zero when the treatment is placebo, and the hazard ratio comes out at 0.8, that means a 20% decrease in mortality if you use the treatment versus the placebo. This is the intuition for the hazard ratio. And finally, when we talk about survival, people usually talk a lot about hazard ratios and survival probabilities, but they don't talk about the concordance index. The concordance index, even though it is a very simple statistic, is based on the number of concordant pairs: when your model predicts the survival for, say, i and j, and you predict the survival of i to be greater than the survival of j, and in reality that is also true, then that pair is concordant with reality. So basically it measures whether your model can order individuals according to their survival accurately, and I feel like it is a better statistic to use when you are actually trying to tell how good your model is.
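And here is a sketch of fitting the Cox proportional hazards model and reading off the hazard ratios and the concordance index, again assuming lifelines and made-up column names.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Assumed columns: 'time', 'event', plus covariates such as 'age', 'treatment', 'marker'.
# "survival.csv" is a placeholder file name.
df = pd.read_csv("survival.csv")

cph = CoxPHFitter()
cph.fit(df[["time", "event", "age", "treatment", "marker"]],
        duration_col="time", event_col="event")

cph.print_summary()              # betas, exp(beta) = hazard ratios, p-values
print(cph.concordance_index_)    # ability of the model to order patients by survival

# exp(coef) < 1 for 'treatment' would mean treated patients have a reduced hazard.
```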
So I want to give you an example of using and comparing different models with these two statistics. This is the METABRIC cohort that I mentioned: breast cancer, with CNV and expression data. They now have more; they definitely have microRNA data, they have a little bit of exome, I think they selected a few genes to sequence, and they might have methylation data by now as well. They had almost 1,000 patients in the discovery cohort and about 1,000 in the validation cohort, so it's a very rich dataset, and as I said, it has much better follow-up than the TCGA data. PAM50 is a clinically approved classifier: 50 genes that are used to assign people to different subtypes. About PAM50, I just heard a talk by the person who actually invented this signature, and it took him 13 years to move it from the lab where they discovered it into the clinic; it was very depressing. So this is PAM50, and it assumes there are five subtypes in breast cancer. And this is iCluster, which was called IntClust in that Nature paper; they said that in breast cancer there are 10 clusters, and they associated some of them with amplifications and others with differential expression. And this is SNF. We said, we don't know how many subtypes there truly are, so we constructed the network and partitioned it into both five clusters and 10 clusters. What you see on top are the p-values for the discovery and validation cohorts from the Cox PH model; you will see in your lab how you get these p-values for your model. Basically, the p-value says that the survival curve in one of the groups is significantly different from the curve in another group. Of course there are biases: if you have a very small group, the p-value can vary greatly, so it is much less reliable. But you can see that in terms of p-values they all look significant, all of them, right? According to the concordance index, which is the ability of the model to order individuals by survival, iCluster did a lot better, but it also had more clusters; and SNF did a little bit better than iCluster, but it was basically the same for five clusters or 10 clusters. So what this says is that our ability to order people with respect to their survival is limited once we start to cluster the data. So we were asked, this was one of the reviews of our paper, how many subtypes are there really in breast cancer? Can you answer that question? And what we said is that maybe it is not a fixed number: the more patients you see, and you see thousands of patients, the more it looks like a bit of a gradient. So what we proposed was not to do what is currently done, taking a new patient, putting them into a subtype even though they don't quite fit, and then using the survival associated with that subtype, but instead to take the patient, integrate them into the network, and then predict survival for that patient using the whole network. The difference between the previous approach and the new approach is that the previous, clustering approach says, okay, I assume that my patient is similar to all of this little group of patients I have seen that are in the subtype.
The difference between the previous approach and the new one is this. The clustering approach kind of says: okay, I assume my patient is similar to all of the patients in the little group I have seen before, the subtype. What the network does is say, first of all, I have weights telling me how similar and how different the patients are, so it's a much more continuous scale, but I also know who my patient is not similar to, and I can use that information; in clustering you don't really use that. And what we said is that it's only a slight change in the regularization of the Cox PH model, but the point is that when you use the network, and the weights of similarity between patients that we estimated, you get a much better ability to order patients with respect to survival than with the same network clustered into smaller subtypes. So it's a discrete version of a variable versus a continuous version of a variable. And what we are basically saying, especially when we deal with neuropsychiatric disorders and the like, where it's clearly a spectrum and it's really not obvious that there are very distinct subtypes, is that maybe it's more useful to represent things as a network and to use the information encoded in the whole network. What do you mean by ordering patients by survival? Do you mean placing a new patient where they are going to fall on the survival range? So when you build a model and you are trying to assess its quality, you ask: did I predict the ordering of two patients correctly? I already know their survival. Did my model predict the survival of one being better than the survival of the other? Yeah. How do you actually compare different clusterings with the concordance index if you're just looking at the number? Is there a p-value you can generate? Because if stage and grade alone give a concordance index of, say, 0.7, and a fancy clustering with genetic information gets it up to 0.71, it looks higher, but is that really statistically significant, or does the genetic data actually add anything? So, the concordance index is just another statistic that comes out of your Cox PH model. There is already a p-value associated with the Cox PH model, and you can use that. But if you compare one model to another model? The index goes up, and you don't know whether that increase is significant. That p-value evaluates the whole model; the concordance index is yet another statistic on top of it. But if you have, say, 0.72 versus 0.74 across cohorts? I understand. And if you have different patients or different characteristics in your validation cohort, could the comparison even be reversed? Yes; what I'm saying is that the concordance index is just one of the statistics. You can't look at the concordance index alone and say this is good enough; it's just another way to evaluate your model's performance. Yep. So for data integration in my field, one direction, and this relates to the earlier question about pathways and associations, is to try not just to get a pathway or a gene, but to figure out how the methylation and the gene expression and the microRNAs fit together, whether there is a mechanism associated with a specific cluster. So it's a kind of feature integration, measurement integration, at the same time as the data integration. That would be nice, and it's not available right now.
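To spell out that "slight change in regularization" in rough form, here is a sketch, my own reconstruction from the description rather than the exact formulation in the paper, of a Cox partial log-likelihood with an added network smoothness term, where W holds the estimated patient-to-patient similarity weights:

```python
import numpy as np

def network_cox_objective(beta, X, time, event, W, lam=1.0):
    """Negative Cox partial log-likelihood plus a network-smoothness penalty.

    X: patients-by-features array, time/event: numpy arrays,
    W: patients-by-patients similarity weights from the fused network.
    """
    eta = X @ beta                             # linear predictor (log relative hazard)
    order = np.argsort(-time)                  # sort by decreasing time so the
    eta_o, event_o = eta[order], event[order]  # risk set is a running prefix
    log_risk = np.logaddexp.accumulate(eta_o)  # log sum of exp(eta) over each risk set
    neg_pll = -np.sum(event_o * (eta_o - log_risk))
    # Penalty: the more similar two patients are (large W[i, j]), the more the
    # model pays for giving them different predicted risks eta_i and eta_j.
    penalty = 0.5 * np.sum(W * (eta[:, None] - eta[None, :]) ** 2)
    return neg_pll + lam * penalty

# Minimizing this over beta (e.g. with scipy.optimize.minimize) gives a Cox-style
# predictive model that uses the whole network instead of hard cluster labels.
```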
There is another direction where we are going with this, which is trying to predict response to different treatments for patients. If we have a patient, and we have a basket trial, which is the best treatment to give to that patient? If we have that kind of question, it's possible that for one of the drugs, at least with the network approach, you need to compare patients according to one pathway, but for another drug you need to compare them according to a different pathway. So how do we build that in dynamically? So this is all from me. It says we're on a coffee break and not a working session, but I would be happy to take more questions. Well, with respect to drug response: we actually tried this, and it seems that if there is strong signal in the data, we can do as well with an unsupervised method as we do with a supervised method. But it always worries me, because I think supervised methods should do better. Yeah? For the brain tumors, you had the three sorts of networks. Do you build each network independently first? For the final result, no, it's one network, and we can cluster that network. But if you build them separately, do you get similar-looking networks on their own? Let me bring up the appropriate slide. For this particular approach, you do build the network independently for each of the individual data modalities first, and they do look different, and you would get different clusterings. And it's actually interesting: we're doing some work with Lawrence, and it seems that where it matters is the borderline cases, the cases where gene expression would put a patient in one cluster and methylation would put them in another. That is where the integration really can be informative, because for some of the core clusters there are patients who are so different from everyone else that you could use just gene expression or just methylation to place them. So why is the topology similar but the edges not? The topology is similar by design; it's drawn that way because otherwise you couldn't tell how similar or different the networks are. If we changed the positions of the nodes as well, it would be impossible to compare them visually; it's a visual aid. So the layout is just a visual aid? Yes; the node positions here are taken from the fused network, so the topology is based on the fused network. Yeah? And how do you do your survival analysis for the network? It's the same: we use the eigengap or silhouette, we figure out what the clusters are, and then it's the same as you would do for any subtyping; you already have your subtypes of patients. But you don't put each patient into one cluster for that very last analysis, do you? Oh, you mean the very last result, for the network? Right, for that we do not cluster at all. We added a regularization term, so it's just a small addition to the Cox model. I don't have the formulas here, but you basically add a small regularization that uses the weights between pairs of patients: the more similar two patients are in the network, the higher the penalty for predicting very different outcomes for them. So essentially it's a predictive model, a regression with a penalty based on the weights between samples.
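Going back to the question about building one network per data type first, here is a simplified sketch of such a per-modality patient similarity network. The plain Gaussian-style kernel below stands in for the scaled kernel SNF actually uses, and averaging the matrices at the end is only a crude placeholder for SNF's iterative fusion step.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def affinity(data, sigma=0.5):
    """Patients-by-features matrix -> patients-by-patients similarity network."""
    d = squareform(pdist(data, metric="euclidean"))  # pairwise patient distances
    scale = np.mean(d) + 1e-12                       # normalize by the typical distance
    return np.exp(-(d / scale) ** 2 / (2 * sigma ** 2))

# One network per modality, each over the same set of patients, e.g.:
#   W_expr = affinity(expression_matrix)
#   W_meth = affinity(methylation_matrix)
# Crude stand-in for the fusion step (SNF itself fuses them iteratively):
#   W_fused = (W_expr + W_meth) / 2
```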
It's the same network, this network; I mean, this one is the glioblastoma network, but it is a network, so you can look at the weights. In terms of a Kaplan-Meier curve? No, you can't, there's no way; you would just have a list of all patients with their predicted survival. I don't know whether it could be useful visually. But it's made for the regression, isn't it? Yes, it's just mathematical, just a slight change in the regression. Yeah? In one of your last graphs, you showed that your network gives a concordance index of around 0.72, compared to about 0.56 for the clustered version. Mm-hmm. What's the clinical implication of that? Just that you can order individuals with respect to survival better: if you don't subtype the patients, and a new patient comes in and you try to predict their survival based on the existing network, you will do better. Yes, but is that difference, 0.56 to 0.72, a big difference or a small difference? I see, so that's the same kind of question as was asked before: would it make a huge clinical difference? No, it won't make a huge clinical difference; I don't know what a huge clinical difference would be. At the level of the individual patient, a single prediction can still be wrong, so a population-level statistic only tells you so much about that one person. Right, but it's still only about 70% of pairs ordered correctly. For an individual patient we don't know exactly how likely we are to get the prediction right, beyond that 70%. But it definitely makes a difference: being able to order 72% of pairs correctly versus 56% of the pairs is a big difference. Yes, we are able to order people with respect to their survival better. But that sounds like a huge difference for the population. It's a difference; as we discussed, there is no statistical test associated with it. And 0.56 is pretty poor, it's barely better than random chance; what I showed is only somewhat better, right? But it's better than what was there before, which was nothing. Are there further questions? More questions? Fewer questions? No? Okay. And thank you.