Hi everybody. I want to give you a bit of background on myself. I studied machine learning, that's what my PhD was in, and then I switched to computational biology. So I come into this field from computer science, and what I mostly do is develop computational methods, machine learning methods, to integrate data and get more informed views of the patients and patient cohorts that we study. Today's objectives are: first, I'll talk quickly about single-data-type analysis from the computational perspective, which is always good to do for data exploration, even if you are working on data integration. Then I will mention the three most commonly used data integration approaches that I've seen, and I will talk briefly about feature selection and classification problems. So this is the available patient data. It might not be available for every single patient in every disease, but across several diseases all of it is definitely available: DNA, as expected, mRNA expression, epigenetic data, so DNA methylation but also other epigenetic marks, microRNA data, and protein data. This is all omics, and a lot of it is now available for large sets of patients, especially in cancer. On the phenotype side, where you can derive different kinds of phenotypes, there is all kinds of clinical information, including tests, and questionnaire data in neuropsychiatry; if any of you work there, then for sure you've seen a lot of questionnaire data. Imaging, of course, especially in neuropsychiatry. And right now, for anything related to autoimmune disease, especially inflammatory bowel disease, you would see diet data being collected and also informing decisions. This data is not just sitting in hospitals; some of it is publicly available, and the best such repository, to my knowledge, is The Cancer Genome Atlas (TCGA), where for the same set of individuals, the cases, you will have exome data, SNP data, methylation, mRNA, and microRNAs. And if you take the overlap of all of this data, so not every data type is available for every patient, you can still have all of it for over 500 patients. These are cohorts large enough to start playing with and understanding different integrative methods, if that's your goal. So why do we integrate patient data at all? Hopefully integrating patient data will help us make more informed decisions, something we can then help doctors with. Apparently it has been analyzed and published that the human capacity for weighing multiple variables at a time tops out at about five: we can put together five variables and make a decision based on them. We are talking about thousands, right? So this has to be done computationally, and that's why we are doing this. So let's start with single-data-type analysis. Let me go slowly, and I would really like this to be an interactive session, not just me talking at you, so please ask questions, because the goal is to understand the kinds of analysis that you can do with this data. This is a paper that was published in 2005 in PNAS; it looked at 20 glioblastoma tumors and gene expression on 18,000 genes. So what did they do? They collected gene expression, they looked at the most variable genes across their collection, and they performed hierarchical clustering.
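As a rough sketch of that 2005-style workflow, selecting the most variable genes and hierarchically clustering the patients, here is what it can look like; the array names, the number of genes kept, and the correlation distance are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 18000))   # patients x genes (placeholder data)

# Keep the most variable genes across the cohort
n_top_genes = 500
top = np.argsort(expr.var(axis=0))[-n_top_genes:]
X = expr[:, top]

# Cluster patients using 1 - Pearson correlation as the distance
dist = pdist(X, metric="correlation")
Z = linkage(dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # e.g. ask for two clusters
print(labels)
```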
I'll go briefly through what hierarchical clustering is in a moment. Here they identified clusters. They identified clusters because they were looking at the heterogeneity of this cancer and trying to find homogeneous subsets among the patients they had in hand. When you're comparing disease versus healthy individuals, it's the same kind of problem: you're trying to identify genes that are most differential between one group, the diseased individuals, and the other group, the healthy individuals. So in some sense this is similar; you can skip the clustering step and the rest stays the same. Then they identified the genes that were most different between the clusters and applied a multiple hypothesis correction. I won't go through the correction for multiple hypotheses; is it part of the workshop? I don't remember. It will probably be mentioned, though not directly. The point is that if you are running hundreds or thousands of tests, it's very likely that some of them will come up significant purely by chance, and that's why the correction is so important. So they identified these genes, and you can see in the picture, on the left for you, right for me, that there are several genes which pretty clearly identify a set of patients: mostly red for the blue group and mostly green for the orange group. So this is a simple analysis, and it's still being done. But there are a few problems with it. What if you have more data? If you have more data modalities, they will tell you something else about these patients, and that might not align with the groups you find from gene expression. So how do people handle that? How did people used to handle it? The way it was done, for example, in this 2010 paper, a classic glioblastoma paper: they had 200 glioblastoma patients, they took essentially the same kind of mRNA expression, and they clustered it in a similar way. They identified four clusters, and I think each signature was about 80 genes per cluster. You can see it here in red. Can you see? Perfect. So these are the four clusters, and based on the genes that correlated with those clusters, they decided that these are the proneural, neural, classical, and mesenchymal groups. The problem here was that this particular analysis was all based on gene expression, so gene expression was essentially driving the groups. They identified the groups based on one data type and then looked at mutations. If the mutations correlated with the clustering they found, they said, okay, these mutations agree and inform our clustering; if they didn't, they didn't inform it. So maybe they added a few genes to refine their clustering, but the clusters were defined by that one single data type. And what was interesting is that they actually had methylation data available when they were doing this analysis, and they said the methylation data was just not informative. The reason was that gene expression and methylation did not correlate very well on this set of patients and produced different clusterings, and they didn't know how to reconcile that.
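Before moving on, here is a minimal sketch of the per-gene test with multiple-testing correction mentioned above. It assumes `X` (patients x genes) and binary cluster `labels` coming from a clustering step like the one just described; the function name is illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def differential_genes(X, labels, alpha=0.05):
    g1, g2 = X[labels == 1], X[labels == 2]
    pvals = np.array([ttest_ind(g1[:, j], g2[:, j]).pvalue
                      for j in range(X.shape[1])])
    # Benjamini-Hochberg: with thousands of tests, some raw p-values
    # will look significant purely by chance.
    reject, qvals, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.where(reject)[0], qvals
```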
So in 2012 there was contrary evidence, also in glioblastoma. It was a glioblastoma study that looked very carefully at the epigenetic data, and they found this; this is one of the ways they confirmed the IDH1 subtype. If any of you have seen glioblastoma data or heard about it, the IDH1 subtype is basically the only subtype in glioblastoma that is really well defined: this particular mutation causes hypermethylation across many sites across the genome. The x-axis here is patients, the full set is 210 patients, and the IDH1 subtype is a very, very small subtype, maybe 10% of the patients or a bit less. You could still identify it because the methylation signature was so strong. So this is what happens when we start integrating data while looking at it from the perspective of a single data stream that drives the analysis: we can miss this kind of thing. So I will tell you about the different types of approaches. The first, and by far the most common in the TCGA papers and other publications, is to concatenate and cluster. You can imagine what this is, but I will go into a little bit of detail. Then there is a more sophisticated method called iCluster, which is a latent factor type of analysis, and similarity network fusion, which is the method developed in our lab. Concatenation is simple: you have patients as rows and your measurements as columns, so you have your gene expression here, you have methylation, and you just group them together and treat it all as one vector per patient, regardless of data type. The problems with that, which we found in our own lab, are that if the structure within each data type is different, you wash it out: if there is some correlation between genes that you don't want to lose, or some difference in the structure of the measurements between types, you lose it. And additionally, if only a few genes are important, you have just increased the number of measurements by a lot, so that is also a problem. So what people do is concatenate the data and then do hierarchical clustering. Many of you have probably used hierarchical clustering; this is approximately how it works. Here you have, let's say, six individuals, and this is already a similarity matrix: you go from a patients-by-features matrix to a patients-by-patients matrix by either multiplying them or computing correlations between the feature vectors. So here you have a six-by-six matrix over patients, and you look for the smallest distance, which is between these two patients. It's represented here both in space and as a graph; the green tree that you see is the most common way to visualize hierarchical clustering. The point is that D and F here are the closest ones, so you merge them first, take the mean, then compress the matrix and look for the next smallest distance. The next smallest distance is between the mean of D and F, and E, so those are the ones you merge second, and you can see it here. You do this for all pairs; A and B are also close, so you merge them as well, and so on.
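Here is a small sketch of what feeds into this walkthrough: concatenating the data types into one long vector per patient and turning that into the patient-by-patient correlation matrix. The block names and the per-block standardization are assumptions for illustration, not a prescribed recipe.

```python
import numpy as np
from scipy.stats import zscore

def concatenated_patient_similarity(blocks):
    """blocks: list of (patients x features) arrays, same patients in each."""
    scaled = [zscore(b, axis=0) for b in blocks]  # crude attempt to balance scales
    X = np.hstack(scaled)                         # one concatenated vector per patient
    return np.corrcoef(X)                         # patients x patients similarity
```

A distance for the merging steps can then be taken as one minus this correlation.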
So you basically go through, compare all the pairs, and compress the matrix as you go; at the end you have only two meta-groups left, the two means from these two groups in this example, and then you merge those. That's how you get the full tree. To decide on a number of clusters in this kind of clustering, what people most commonly do is actually cut arbitrarily, by eye. So here you have this diagram, and what people would do is cut it into two clusters, because that looks most reasonable from the dendrogram. Yes, this tree graph is called the dendrogram. Another measure that is very commonly used is the silhouette statistic. The eigengap is used for spectral clustering: if you have a graph you are trying to cluster, you look at the differences between the eigenvalues in the spectral decomposition of that graph; that's if you're using spectral clustering. There are many more. There's a thesis that simulated different scenarios and compared the metrics, maybe 15 of them, on those scenarios, and basically there is not one metric that is good for every single scenario. Some of them work well in a Euclidean space, and some work well on a manifold, where it's not so straightforward: clusters may look very close in a 2D projection but be further apart when you consider the high-dimensional space. So there are many examples, and there is no single good metric. And if somebody tells you, I found your clusters, you can ask them how they found them and which metric they used, and you can test other metrics to be sure. Can we have access to that PhD thesis? Yes, it's online; you can just search for it, Jan, PhD thesis, 2005, I think it's from Waterloo. So here I want to tell you a little about the silhouette, because it is very commonly used, especially if you don't use hierarchical clustering, or to verify whether you trust the clustering you get. The silhouette statistic was originally published in 1987, and the idea behind it is to figure out whether points are closer to each other within your cluster than they are to points in other clusters. That is what a(i) and b(i) stand for: a(i) is the average distance from individual i to all other patients within its own cluster, and b(i) is the average distance from i to the patients in the nearest other cluster, and the silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)). So, for example, in our previous example D and F and their neighbors will be closer to each other than to this other cluster, but it's not always the case. The silhouette ranges between minus one and one. If it's one, it's an excellent assignment; if it's minus one, it's a really bad assignment, maybe you want to do exactly the opposite. Zero is borderline; it means there is not much agreement in your cluster, your points are about as close to each other within the cluster as they are to the points in the next cluster. So this is an example of what silhouette value plots look like (there's a short sketch of computing this below). Basically, if the value is positive, it means those points are closer to each other within their cluster.
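And here is the short sketch I mentioned: cutting the dendrogram at different numbers of clusters and scoring each cut with the silhouette. It assumes a precomputed patient-by-patient distance matrix `D`, for example one minus the correlation matrix from the earlier sketch; the function name and the range of k are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(D, k_range=range(2, 7)):
    Z = linkage(squareform(D, checks=False), method="average")
    scores = {}
    for k in k_range:
        labels = fcluster(Z, t=k, criterion="maxclust")
        # silhouette in [-1, 1]: closer to 1 means points sit well inside
        # their own cluster relative to the nearest other cluster
        scores[k] = silhouette_score(D, labels, metric="precomputed")
    return max(scores, key=scores.get), scores
```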
And the negative part means that, within this one cluster, in this three-cluster case, those points are on average closer to other clusters than to the other points in their own cluster. It's on average. This is actually a very important point about clustering: clustering methods place everything into some cluster. They assign everything, whether or not the data clusters naturally. You say you want three clusters, and they will find three clusters for you. The silhouette lets you evaluate how good those clusters actually are, and that's a point to always keep in mind. Yeah? What is a pattern in this case, is it just all the features? The pattern here is just a point. Okay, just a point. Yes, in this example it's just a point in space: pattern, entity, object. It can be other things, but in this particular case we are talking about each individual point in your clustering. So, one way to see how robust your clustering is, is an idea called consensus clustering. This was published in 2003, but people have been using it for a long time. Basically, you take the set of features or samples that you want to cluster and you subsample: what if I take 80% of my patients, will they still end up in the same cluster or not? And you do it a thousand times: you take a different 80% or 70% of your individuals, cluster them a thousand times, and then you construct a consensus matrix. The consensus matrix tells you, for individuals that were sampled together, how often they landed in the same cluster. An important point here is that the cluster number itself doesn't matter; you can permute the cluster labels. Cluster one in one run, cluster three in another run, that doesn't matter; what matters is whether the individuals end up in the same cluster together. So you look at all pairs of individuals and construct a consensus matrix, something like this. Imagine these numbers are how often individual A appeared in the same cluster as individual B: suppose that 70% of the time they were sampled together, they were also clustered together; that's pretty good evidence that this is a stable pairing. Now, this example matrix isn't quite right, because normally all the values would be between zero and one: you cannot cluster together more than 100% of the time. But the point is, the smaller the number, the less evidence you have, the less robust that pair is. And sometimes it's useful to identify the individuals that don't cluster consistently with anybody and say: these are my outliers, these are not individuals that should belong to a cluster, maybe consider them separately. This way you identify the cores of the clusters, the parts you are fairly certain about. Yeah? Does this work equally well on all sample sizes, or does that matter? Well, the smaller the sample size, the less variation you get when you subsample. If you have 10 points and you sample 80% of them, you get 8 points, so there is a huge overlap between your subsamples, each one carries less evidence, and you will have less confidence than if you had a much larger cohort.
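Here is a minimal sketch of the consensus-matrix idea: subsample the patients many times, recluster each subsample, and count how often each pair ends up in the same cluster among the runs where both were sampled. Function and parameter names are illustrative, and the inner clustering is just a plain hierarchical clustering standing in for whatever method you actually use.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def consensus_matrix(X, k, n_iter=1000, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))   # times a pair clustered together
    sampled = np.zeros((n, n))    # times a pair was sampled together
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = fcluster(linkage(pdist(X[idx]), method="average"),
                          t=k, criterion="maxclust")
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    return together / np.maximum(sampled, 1)   # entries in [0, 1]
```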
Okay, so this is the most standard approach, concatenation and clustering, and using consensus clustering improves the robustness of the results. I would recommend always doing consensus clustering; I think there is also an R package that does it. It takes much longer, of course, because it runs the clustering procedure a thousand times, but it's worth it for the stability of the result. Okay, so the next type of tool is iCluster. This is more of a latent factor model; if you are familiar with those, great. What this approach does, and I will mention it briefly because it's also used in TCGA papers and I've seen it in a lot of different publications recently, is try to identify some kind of latent space, a latent embedding, which is common to all data types. Let's say we believe that our patients should all cluster in the same way, given any type of data on these patients. If that is the case, then there is an optimization problem where they try to identify a latent variable Z, which represents that latent embedding. We don't know what it is, but every one of the data types we have is informative about it. So they identify the common Z, and W1, W2, up to WM are the projections of your original observed data, say copy number data, onto this latent space. That's what they do, and that's what they find: a clustering in the latent space. The problem with this approach, even though the paper suggests otherwise, is that it effectively ends up looking at the similarity between the data sets, and if there is complementarity between them, that is a problem. iCluster is also difficult because the complexity of the method depends on the number of measurements, the number of features, so you can't use your 20,000 genes or all of your methylation probes in iCluster; you have to select about 1,500 features. How you select them is magic. Usually you take the most variable ones or something like that, maybe a t-test with respect to some held-out group. This is not part of the method itself, but you have to pre-select your features and then do the clustering in the latent space using those features. So there are many steps in this pipeline, including a feature pre-selection that differs between analyses, et cetera. And, as I already mentioned, these methods don't really take the complementarity of the data into account. So, seeing all that, we developed a different methodology, called similarity network fusion. It consists of two steps. The first step is, for each data type, to construct a patient similarity matrix, similar to what you do for hierarchical clustering. Then we fuse these multiple matrices using a nonlinear approach. So this is the first step: you go from the patient-by-mRNA, or gene expression, matrix and construct the similarity according to gene expression alone. The darker spots here mean that these patients are more similar according to the gene expression, to the genes that you have in your sample.
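A rough sketch of this first step, turning one patients-by-features matrix into a patient-by-patient similarity matrix, might look like the following. Here I use a Gaussian kernel on Euclidean distances with a crude bandwidth, in the spirit of the SNF paper; the exact kernel and parameters in the published method differ, so treat the choices below as assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def affinity_matrix(X, sigma_scale=0.5):
    """X: patients x features for one data type."""
    D = squareform(pdist(X, metric="euclidean"))
    sigma = sigma_scale * np.mean(D[D > 0])       # crude bandwidth choice
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))      # larger value = more similar
    np.fill_diagonal(W, 1.0)
    return W
```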
Now, if you just compute correlations, this matrix is likely to be full, with no zero entries, but you can sparsify it. There are generally two ways to sparsify such matrices. One is to cut by small values: if you have a 0.01 correlation, maybe you don't care about it, maybe you don't even care about a 10% correlation. The way we sparsify here is by k nearest neighbors: we keep, say, the 20 most similar individuals regardless of their values and throw away the rest, and that helps with numerical stability when we combine the matrices. You do this for every single data type, so you have these two similarity matrices, for example one from gene expression and one from DNA methylation; there's a legend at the bottom. The second step is to iterate: we iteratively make the matrices more similar to each other. The way we do that is based on graph diffusion: we take one matrix and multiply it by the matrix we want to make it more similar to, which is a bit like starting at a node in one graph, continuing the walk in another graph, and coming back to the first graph. We do this iteratively until it converges. What this procedure does is help you get rid of the noise: if there are edges here that are very weak, weak correlations that are not supported by most of the data, they disappear as you make the matrices more similar. On the other hand, if you have a very strong correlation between these individuals that is not supported by all of the data sets, but it is so strong, it actually permeates all the levels, and we keep this complementary signal introduced by the mRNA-based similarity in the other types of data. The process converges, and at the end of the day we have one matrix, one corresponding network, which is supported by all the different data types. Are there any questions before I present some of the results using this approach? Yeah? So, do we lose information that only one measurement type supports? Whether you keep a weak correlation or not is numerical; there is no particular threshold that says, if it's 10% we lose it, if it's 20% we keep it. So it is possible that one type of measurement carries the information, but if we have 10 types of measurements and the other nine do not support that similarity, then of course we lose it. It definitely happens, yes. But it is a question of how much support there is across the different types, and we definitely have the opposite situation too; I'll show it in a real example, it doesn't always disappear. So these are the TCGA data types. We looked at five different cancers. This is the glioblastoma I've been talking about: 215 patients, a slightly different cohort, with three types of measurements, mRNA, methylation, and microRNA, and this is the number of measurements in each. We did the analysis in 2014. The others are breast invasive carcinoma, kidney renal clear cell carcinoma, lung squamous cell carcinoma, and colon adenocarcinoma.
So we have five different cancers here, and we also have controls, for whom some of the same data was measured. Actually, not healthy individuals but healthy samples from the same individuals; obviously nobody does a biopsy on healthy individuals. So this is the glioblastoma case, and this is the patient-by-patient similarity: the full matrix, and the corresponding graph. The topology of the graph doesn't really matter, because the layout was taken from the fused matrix, but it makes things a bit easier to compare. You can see, and this goes back to the question that was just asked, that there is some similarity according to mRNA which is not really there in methylation, but some of it is there in the microRNAs, for example. And there is a lot of similarity according to methylation between those two clusters which is not supported as strongly by gene expression or microRNAs, so a lot of that would go away. What you can see is that each data type provides a different structure of how patients are similar to each other, and the different types also correspond to some degree: every type shows these clusters. But mRNA, in this particular case, did not yield such a strong structure; maybe there is some structure here, but not as strong as in the other data types. So by combining, we hope to take advantage of all the evidence that we have. And this is the fused matrix. It looks quite a bit cleaner, because the off-diagonal noise, the similarities that did not correspond across the different data types, went away. You can see that the clusters are similar to each other to some extent. What's important and interesting here is that even though this data set does cluster, you can also look within the clusters; sometimes you will get a disease that looks like one of these clusters, with no intrinsic sub-clustering, at least not according to the data. And notice that each edge is colored by the type of data that supports it the most: green is microRNA, pink is DNA methylation, blue is mRNA. What's interesting is that there are pockets where microRNA supports the similarity between patients within a cluster; the clusters themselves are very heterogeneous. For example, there's a whole pocket here of patients who are similar to the other patients overall, but similar to each other specifically according to microRNA and DNA methylation; there is some signature shared between these patients. So when we look at patients as a whole, we can't necessarily distill it down to five variables that we can profile. It's a heterogeneous disease, and even if there are some better-defined overall groupings, it's still very heterogeneous within each cluster. So we looked at the clinical information, and I think you will have an example of that in the workshop. For example, this is survival, and this is subtype 3, here in blue, which is the IDH1 subtype I talked about. This was actually very interesting. Of course, I'm not a glioblastoma researcher, but I had a meeting with a glioblastoma researcher who was visiting here, and she came into my office, saw this plot, and said, this is IDH1. I said, what is IDH1?
So we checked whether this is IDH1, and we hadn't incorporated the mutation data here, but every single one of the patients for whom we had mutation data was an IDH1 mutant. This comes primarily from the methylation, the very strong methylation signature that we had. In glioblastoma there is this very clean relationship; in leukemia the relationship is more complicated, and there are also IDH1 mutations that do not lead to hypermethylation of almost the whole genome. But the point is that you can sometimes recover biological signal if it permeates the other types of data you have collected, even by proxy. IDH1 patients are usually younger, but they have a better prognosis. What was also interesting is that subtype 1, this big subtype 1, was the only one that actually seemed to respond in some way to temozolomide. Temozolomide is the standard treatment in glioblastoma; the other subtypes did not seem responsive to it. When you look at the data on the three different levels, do they still map into the same clusters, or the same genes mapping into those clusters? No, they actually weren't; we did look at that. Since I'll be talking about feature selection anyway, and there's almost time, let me talk about that now and come back to the rest. So this is the feature selection. This is where we looked back: we have our three clusters, and we looked back at mRNA expression, DNA methylation, and microRNA expression. This is the standard pairwise t-test analysis that you would do: this cluster versus the rest, looking at each individual gene and asking which genes seem most different. And this was actually interesting, because we did this analysis with a t-test, this is the figure we submitted with the paper, and the reviewer said, well, it looks like you don't have signal, what is this figure? Heat maps are intrinsically hard to interpret; different people see different patterns. But we started thinking about why we were not picking up stronger patterns, and what we realized was that these pairwise tests are missing something: these are not the genes that correspond to the three-way analysis. And even further, what would be nice, which we actually didn't do, would be to check whether the patients are as similar according to a particular gene as they are in the fused matrix. So even within a cluster, you want to keep identifying genes that capture the full similarity across the patients, rather than just the differences between clusters. But even when we just did this three-way analysis using normalized mutual information, you could see a much better set of genes. And yes, there were a few genes, on the order of five to ten, that actually corresponded between methylation and mRNA: genes that had abnormal methylation in the promoter and differential gene expression, but something like five out of this set of 250 genes. So the answer is that different biology is being recovered from each of the individual data sets. So what were the initial subtypes, the clusters, based on, which data? The initial ones, in this analysis, were based on all of it: all three types of data, all of the data.
Yeah, I skipped through this, but these are PCA plots, which are very useful for visualizing clusterings: you can see how spread out the clusterings are, and in BRCA it looks like this is a distinct cluster. One more thing I want to bring up: if you work in cancer and do survival analysis, and your clusters are very small, for example this one cluster in kidney renal clear cell carcinoma, then if you move a single individual the p-values go crazy, because p-values are very sensitive to the size of the support you have. The way we discovered this was by looking at the minus log10 p-value as a function of the number of features, and at some point the p-value just dropped: we added maybe 50 more features and it dropped, and we said, well, this is very unstable, how come? In the other cancers the result didn't really depend on which features we selected, we didn't pre-select features for this analysis, but in kidney it seemed to matter in some very strange way, and what we discovered was this really tiny cluster that made the p-values unstable. So that's something else to keep in mind. All right, and I have a couple of minutes. Yes, this is NMI, normalized mutual information; you can look it up. It's an information-theoretic measure, and it's what we used in that paper to identify the features, but you can also use Kruskal-Wallis if you want to do the analysis. So what I want to mention now is that, especially in cancer work, what people want to do is build classifiers. I've identified the subtypes, say subtype 1, subtype 2, subtype 3, maybe a novel subtype. How do I create a clinical test that says: here are my 50 genes, I want to assay these 50 genes and be able to label a new patient that comes in as subtype 1, subtype 2, or subtype 3? This problem is called classification. The way people typically do it is they take the features you've identified, through t-tests, Kruskal-Wallis, or NMI, the features that were significant in this kind of analysis and associate with the clusters, and then they build a classifier; let's say a Random Forest, which is a very common and well-loved classifier and works really well. And I have a quiz question for you: what is the problem with this approach? You've identified your subtypes, you take the features that seem to be associated with the subtypes, and then you build a classifier using those features. What is the problem? Yes, overfitting. In machine learning this is called leakage: you have a leakage problem. You use the same data to create the clusters and then the same data to create the classifier, and that is a big problem. The problem it creates is with generalizability: if you have a different cohort of patients with the same disease or outcome, your results might not generalize, because you've overfitted to the data on which you built your classifier. So how do we fix this? One way, which doesn't solve it fully, is to take a subset of the original data, run SNF just on that subset, and then identify the features and train the classifier only on that subset.
At the same time, you can run SNF on the full data and check whether the classifier you've learned predicts the same classes as you would get with the whole data. Obviously, when a new patient comes in you don't have the original SNF any more, you're just left with the classifier, and that's what you ultimately want. So you can compare the labels: this is asking, if I had all of the data I needed to get maybe the best subtype labels, does my classifier still predict those labels? Of course, there is a bit of leakage here as well, but you don't have any other data. So this is a better approach than the original one, which some people do use. It's very important to separate your training and your test set so you know how well your method generalizes. Very important. And I think it's a very common problem in the published literature that people just report training results; you then don't actually know how well it will perform on a cohort outside your hospital. So this is a better solution; it's not ideal, but if you don't have any other data, it's one way you can do it. So with that, there are some advantages to the networks. I propose this as an exploratory analysis of your data. What we do is look at our original data, each individual data type separately, compare what classes you would get from each individual type, and then combine all the data together. Another useful thing about networks is that you can visualize them and see: does it look like there are real clusters, or does it look like one big block? Because if you just cluster and then look at the results, you don't actually get a sense of how bad the clustering might be. So the idea is that you have a way to visualize your data for exploratory analysis, and at the end of the day, if it looks like there is some clustering, you cluster and you have some certainty about it. This approach scales very well; you don't need to pre-select features, and it also scales well with respect to the number of data types you are trying to incorporate. One of the plots I skipped showed that the time it takes is about the same as hierarchical clustering: the biggest time sink is constructing the similarity matrix, and once you've constructed it, the rest runs really fast. All right, so for the future, I think simultaneous feature selection and data integration would be great. The problem with it is that, depending on the question you are trying to answer in the end, you want a different objective for which features you care about. For example, if you are trying to build a drug response predictor, then depending on the drug you want to incorporate different pathways, so the final network will differ depending on what you use to compare the patients. Some patients will respond in exactly the same way according to one pathway, and that information will not be informative at all if you are trying to answer a question about a completely different drug that has nothing to do with that pathway. So that's important to keep in mind. And that's it for me. Questions? So, you can certainly do cross-validation for your classifier as part of learning the classifier.
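To make that concrete, here is a sketch of keeping feature selection and classifier training strictly inside a training split, so the held-out patients never influence which genes are chosen. Mutual information stands in for the NMI-style feature scoring mentioned earlier, and a Random Forest is the classifier; names and parameters are illustrative, and the subtype labels themselves still carry some leakage if they came from clustering all of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

def fit_subtype_classifier(X, y, n_features=50, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = make_pipeline(
        SelectKBest(mutual_info_classif, k=n_features),   # selection sees only X_tr
        RandomForestClassifier(n_estimators=500, random_state=seed))
    clf.fit(X_tr, y_tr)
    return clf, accuracy_score(y_te, clf.predict(X_te))   # held-out performance
```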
But cross-validation alone doesn't remove the question of generalizability if you've already selected the features from exactly the same data. So it's really a question of whether you have any data leakage: is there any information being used in your classifier that you've already seen, for example because you pre-selected the features that go into the classifier on all of the data? If you have a held-out set that you have never looked at, that's fine. Yeah. What about measurements being on completely different scales? I would recommend that if they're on different scales and you don't want to standardize the variables to zero mean, you separate them into different matrices, because for this method it doesn't really matter how many matrices you combine. Separate them and compute similarities within variables of the same, comparable scale, because otherwise, numerically, the larger-scale variables will dominate. For concatenation, people just normalize: they either standardize the variables to zero mean and unit variance or normalize them to lie between minus one and one, or zero and one. That kind of works. The important step here is also to remove outliers. What we've done with some of these data is remove the outliers and re-impute. The reason is that if you, for example, scale something to be between zero and one and there is a huge outlier, then the outlier sits at one and everything else sits near zero, completely indistinguishable numerically. So you want to keep track of the outliers. And actually, I think in the workshop you will... no, there are no outliers in the workshop data? Okay, great, no outliers. We have definitely encountered cases where we construct a similarity matrix and see two individuals who are similar to each other and to nobody else, and we say, well, this doesn't seem right. You go back and find that these individuals had outliers in some of their features that squashed everything else to zero, and once you remove those outliers you get much better results. So, when you're combining the data, the iterations make the structures more similar, but how exactly? It depends on your background, but I can try to give some intuition. You know how, if you multiply a matrix by itself, it gives you all the walks of length two from each node, the second-order neighbors? And if you keep multiplying, raising the matrix to the power n gives you the walks of length n in that same graph. We use the same idea, except we multiply by another matrix, the second matrix, and come back to the first one. The multiplication goes like this: you start here, you multiply by the other matrix, you do the walk there. The random walk is not the best illustration because it's hard to visualize directly, but the math is the same: you start with this matrix and ask, what are all the walks I can do from this node in the other matrix, while at the same time preserving my own second-order structure? So it's a second-order walk perturbed by the other evidence.
And actually, I didn't talk about this, but if you have multiple matrices, what we do is multiply by the average of all the other matrices. Because we sparsified the matrices with k nearest neighbors, this is relatively stable, so you can actually do that. One thing that happens, of course, is that if you take random matrices and start combining them, you get a random matrix back. We experienced this once when we combined one matrix with something that really didn't correspond: we were trying to combine cytokine and gene expression data and they didn't correspond, which was very surprising to us. But then we looked at the cytokine data, and we had exactly the scale and outlier problem, so we went back and said we need to reprocess the cytokine data. We went back to our collaborators, started working with the original data before it had been transformed, and then we were able to combine it beautifully. So the way we check is to look at the two matrices and at the NMI between them: are they likely to produce similar-looking clusterings or not? Formally, you can look at the formulas and see that this is exactly what happens, but I don't have the formulas on these slides, unfortunately; they're in the Nature Methods paper if you want to look. Are there disease types or data sets you've seen where this method doesn't work that well because the signal isn't as strong in all the different data types? Is it only for diseases where there's a really strong signal? So, the places where it doesn't work: it's much harder with DNA data. If you take genetic data and compare patients according to it, you just get ethnicity back, and ethnicity is usually not what you're looking for when trying to infer the biology of your disease. It could be correlated with the biology, but it's not usually the kind of signal you want, so we've struggled with how to properly incorporate DNA data at the SNP level, for example. You take GWAS data with 500,000 or a million SNPs, and what do you correlate? There you have to pre-select. Basically, the underlying assumption here is that the features you are combining are informative about the signal you're looking for in the data. If they're not informative overall, like the signal in a GWAS maybe, I don't know, there are different ideas about what complex diseases can be attributed to, but if you imagine that there are 10 relevant genes and you are looking at 20,000, then you're not likely to identify the similarity that's relevant to the disease. So the underlying assumption is that the features that go into computing the similarity are informative about the disease. Largely. You can have some noise, obviously, but largely informative.
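To make the diffusion intuition from those last answers a bit more concrete, here is a minimal sketch of the cross-diffusion idea as I read it: each network is pushed through the k-nearest-neighbour version of itself and the average of the other networks. The normalization, kernel, and stopping details are simplified compared with the published SNF implementation, so treat this as an illustration of the idea rather than the method itself.

```python
import numpy as np

def knn_kernel(W, k=20):
    """Keep only each patient's k most similar neighbours, then row-normalize."""
    S = np.zeros_like(W)
    for i, row in enumerate(W):
        nn = np.argsort(row)[-k:]
        S[i, nn] = row[nn]
    return S / S.sum(axis=1, keepdims=True)

def fuse(affinities, k=20, n_iter=20):
    """affinities: list of patient-by-patient similarity matrices, one per data type."""
    P = [W / W.sum(axis=1, keepdims=True) for W in affinities]
    S = [knn_kernel(W, k) for W in affinities]
    for _ in range(n_iter):
        P_new = []
        for v in range(len(P)):
            others = np.mean([P[u] for u in range(len(P)) if u != v], axis=0)
            Pv = S[v] @ others @ S[v].T        # walk out through the other networks
            P_new.append(Pv / Pv.sum(axis=1, keepdims=True))
        P = P_new
    return np.mean(P, axis=0)                  # the fused patient network
```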