I'm Ilya Shmulevich, and I represent the Institute for Systems Biology GDAC, which is joint with MD Anderson Cancer Center. It's a pleasure for me to be here, and it's a great opportunity to talk to all of you. I thank Linda and Elaine for the kind invitation to talk here. The title of my talk is Integrative Analysis and Interactive Exploration of Data from TCGA. So what I'd like to focus on is really the nature of the data and how we analyze it. TCGA is an exceptional project in that it has highly heterogeneous data, perhaps more so than any other project that I have seen. There is just so much rich data in this project on, as you know, expression, methylation, copy number variation, and clinical data. And all these data are really heterogeneous. That means that they are continuous, they're discrete, they're categorical, and there are parts of the data that are simply missing. This is not a bad thing, they're simply not there. And so analyzing these types of data in an integrated way is a real challenge, both from a statistical point of view and from a computational point of view. And that's what I'd like to focus my talk on today.
And I will start with pairwise analysis. By that I mean looking at correlations, if you will, between different data elements. So for example, between gene expression and methylation, or between a mutation and some clinical variable or clinical outcome. That type of analysis is commonly done. It's even done on single data types. So for example, we often look for correlations between gene expressions and then we perform unsupervised analysis like clustering. And within TCGA, we also look for correlations between different data types, for example between expression and methylation or between expression and copy number variation.
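As a rough illustration of the pairwise idea, here is a minimal Python sketch of a correlation between two continuous features that simply skips samples where either value is missing. The function name, the use of `None` for missing values, and the toy numbers are assumptions made for this example; the talk does not say which statistic or missing-data convention the actual pipeline uses.

```python
import math

def pearson_pairwise(x, y):
    """Pearson correlation over samples where both values are present.

    Missing values are represented by None (an assumption for this sketch).
    Returns None when too few complete pairs remain or a feature is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None  # too few complete pairs to say anything
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant feature has no defined correlation
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values: methylation level and expression of one gene
# across six samples, each with one missing measurement.
methylation = [0.1, 0.2, None, 0.6, 0.8, 0.9]
expression = [9.0, 8.5, 7.0, 5.0, None, 2.0]
r = pearson_pairwise(methylation, expression)
```

Only the four samples with both measurements contribute here, which is one simple way to use incomplete data without imputing it.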
And there are a number of analysis modules that routinely do this kind of analysis in Firehose, the Broad's Firehose pipeline. And so this is very much in the same spirit, where we actually look at all the possible data that we have. We compile a big feature matrix, illustrated in this cartoon figure, where we have clinical information and tumor characteristics, expression, methylation. And we put it all in one big matrix and we start to dig through it and find associations. I should also say that part of our data is data produced by others. So for example, if you have an algorithm that does some pathway analysis, like PARADIGM from UCSC, that algorithm produces outputs. It produces pathway activities or levels. And for us, that just becomes more data. We can throw that into our matrix and analyze it jointly with all of the other data.
So let me give a few examples of these kinds of analyses. One was already mentioned several times, and Peter showed it in his talk this morning. IDH1 status, whether it's mutated or non-mutated, is related to the glioma CpG island methylator phenotype. And this is just the two-by-two contingency table that Peter showed, but shown graphically, where every little circle here, every dot, is one sample. So you can see that this box here is completely empty, meaning that if IDH1 is mutated, then there are no samples that are not CIMP, that do not have the CpG island methylator phenotype. And that's a pretty strong relationship. So that's an example of a categorical association. Another example, of a categorical or discrete association with a continuous variable, is the expression level of ESR1 in breast cancer, which is associated with the different subtypes of breast cancer. And we know that elevated expression of ESR1 is one of the most distinguishing features of the luminal subtype of breast cancer. This picture shows p53, simply showing that most of the mutations occur in the DNA binding domain.
So if we look at the mutation as a feature, then we are able to see that those samples where p53 is mutated in the DNA binding domain exhibit lower expression levels of the targets of that transcription factor, p53. In other words, you would expect that a mutation in the DNA binding domain will affect the downstream target expression, and we see evidence for that in the data. In ovarian cancer, one example of two continuous variables is the expression of RAB25, which was reported on in the TCGA ovarian paper, and its DNA methylation value. Its expression is known to be driven by its methylation. This gene, by the way, is important in invasion and cell survival. And as you can see, you have this negative slope, meaning that high methylation goes with lower expression, as one might expect. It's a silencing mechanism. Another example is this gene PRAC in colorectal cancer, which is strongly correlated with anatomic organ subdivision. Anatomic organ subdivision is just another feature variable for us, and it's a categorical feature. It basically tells you where in the colon the tumor was found: transverse colon, descending colon, rectum, and so on. And you can see that PRAC, which is shown in the blue curve, tracks very nicely with this anatomic subdivision. There are several CpG probes that also track nicely with it.
So obviously, part of the goal of this type of research is not only to find clinical associations, although those are really important, but also to understand the mechanisms and the function. As Dr. Lander said this morning, that's one of the major goals, really, to go into the mechanisms and understand how these disruptions cause the molecular networks in cancer to dysfunction. So for example, here in colorectal cancer, we know that the Wnt signaling pathway is frequently aberrant.
So there are strong associations in the data between the Wnt inhibitory factor and a number of CpGs. It's known to be epigenetically silenced in many cancers. There are also relationships with RNA transport molecules and molecules involved in transcription. So this kind of analysis can start to give you insights into how these disruptions are actually functionally affecting the behavior of the networks.
This heat map shows an all-by-all comparison of a bunch of clinical features and tumor features. What I mean by a tumor feature is a feature that specifies a particular characteristic of the sample. So for example, whether it's MSI, microsatellite unstable, or whether it is in some subtype, for example an expression subtype. For us that just becomes a label. And also a bunch of clinical features. And so you can see that there are obviously non-trivial relationships here. There are lots of features that are positively and negatively correlated with each other. Here is a slightly zoomed-in view of the same thing, again on colorectal cancer, showing some of the relationships between these clinical features. The green lines, I believe, are positive associations and the red are negative. So for example, we see things that we already know and expect, such as the exclusivity of chromosomal instability and the CpG island methylator phenotype. We see that MLH1 methylation, and this is an important DNA mismatch repair gene, goes with microsatellite instability and the methylator phenotype as well. And another example is BRAF mutations associated with the MSI or CIMP expression clusters. So we see all of that coming out of this kind of analysis very nicely. But we also see some new things that are perhaps unexpected or have not been discovered previously. So, staying a little bit with colorectal cancer, what we're showing here in panel A is one such example, apolipoprotein L6. This is a gene found only in humans.
And its associations are with clinically important features such as lymphatic invasion, histological type (mucinous or non-mucinous), tumor stage, low grade or high grade, whether there is distant metastasis, vascular invasion, those kinds of features that actually tell us something about the severity or aggressiveness of the cancer. And this particular example, APOL6, exhibits really strong associations with a number of these clinical features. So in other words it's sort of correlated with aggressiveness, if you will, of the cancer. Panel B shows a few other examples: APC, p53, PI3 kinase, BRAF, and a new one, FBXW7, which the colorectal working group recently found to be significantly mutated. We see in this heat map that some of these are strongly correlated or anti-correlated with some of these clinical features. For example, the loss of the armadillo repeat in APC is strongly associated with vascular invasion. And I mentioned FBXW7. This is the structure of FBXW7. It's called a beta propeller structure. Most of the mutations actually map inside this central region, this binding pocket, which is important in ubiquitin-mediated degradation, and most likely impinges on the Wnt signaling pathway and beta-catenin degradation.
And on the bottom panel here you see apolipoprotein L6 right here, where my mouse cursor is hovering. The gene expressions are shown in blue. The green ones are the methylation features, so basically the CpG probes. And the red one is a mutation in EP300. So these are all features that are associated with these clinical outcomes of aggressiveness. And I'll tell you how we compute that in a second. But what's important to note here is that they are spatially clustered. They occur in a genomic context, okay? So we often see these features occurring together in clusters. So here's another view of this.
So we took some of these features that I mentioned, like tumor stage, distant metastasis, lymphatic invasion, and so on. And we found the molecular features that are strongly associated with these clinical features. But we also combined them into sort of a summarized feature of aggressiveness, for lack of a better word, or you could say tumor severity. So in other words, if it's a high-grade tumor, and there's distant metastasis, and it's mucinous, and so on, then it's more likely to be an aggressive tumor. And of course that's expected, because these things often go hand in hand, and these clinical features are clearly correlated, and I'll come back to that later. And here in the heat map, you see that there are some features, gene expressions or amplifications or deletions, that are associated positively or negatively with aggressive disease. These are the red and the blue parts. One example is this SCN5A voltage-gated sodium channel regulator. It's been shown to be important in colon cancer invasion, so it's nice that it pops up here in this list. And this picture just shows you a snapshot from our tool, which I will show you in a moment, which indeed shows you that these features occur in chromosomally clustered or genomically clustered regions.
So here's another view of this, another snapshot from our interactive exploratory tool. The features that are associated with aggressiveness, let us call it aggressiveness, are shown on the outside of the circle. So again, the blue features are gene expressions, the green are methylations, yellow are copy number variation segments, and purple are microRNA expressions. There is one purple right here. So we see that these features are indeed clustered spatially. Their aggressiveness score is computed by Fisher's method of combining p-values, and I won't go into the details here. But the higher the score, the more correlated the feature is with this aggressiveness characteristic.
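For readers unfamiliar with Fisher's method, the following is a minimal sketch of the standard, unweighted version: the statistic is minus two times the sum of the log p-values, referred to a chi-square distribution with 2k degrees of freedom, which has a closed-form tail for even degrees of freedom. The function name and example p-values are hypothetical, the method assumes the p-values are independent, and how the tool turns the combined p-value into its displayed score is not stated in the talk.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail is exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total

# Hypothetical p-values for one molecular feature against several
# clinical features (stage, metastasis, lymphatic invasion, ...).
combined = fisher_combined_pvalue([0.04, 0.01, 0.20, 0.03])
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the formula.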
So these features are shown with the red and blue bubbles. The more red the bubble is, the more correlated or associated it is with aggressiveness, and the more blue, the more anti-correlated it is. And here you see a little segment of the q arm of chromosome 20.
So now I'm going to do something that some people don't recommend that you do in a live talk. I'm going to give a little demo. So this is the tool that shows you these aggressiveness-associated features. It just works in a web browser, so you can all try it. And as I said, the red features are the ones that are correlated, and the blue ones are the ones that are anti-correlated. And here you can hover your mouse over all of the features and actually see what they are. Little boxes pop up that tell you what the feature is and what its aggressiveness score is. And if I click on it, I can zoom down to a genomic context here, where I can see, in a sort of linear genome browser view, where these features are located. Let me close this. So you can see, for example, that, let's see, this one is the one I just mentioned, apolipoprotein L6. And it's in the vicinity of other features, APOL1, APOL4, some of the methylation probes, and this one is the EP300 non-silent mutation. So these little clusters of features associated with aggressiveness are what we call clinically associated genomic hotspots. And by the way, I should say it's not only explained by the non-uniform density of genes in the genome. We corrected for that, and we still see an enrichment or clustering of these hotspots. Resume slide show.
This is another example from breast cancer showing similar types of associations, but this time with the PAM50 calls of the different subtypes: basal, luminal A, luminal B, and HER2. So the idea is the same. Instead of aggressiveness, here we have the different subtypes. So this is a very general approach. It can be used to find associations with anything you like. OK.
Well, so someone asked me at the poster session: OK, that's fine, but surely there must be cases where one feature might not be strongly associated with another feature, but if you combine it with a third feature, then it will be very predictive. So in other words, A is not strongly associated with C. B is not strongly associated with C. But A and B together give you a really good prediction of the value of C. These kinds of multivariate relationships surely must exist in the data. I think it's almost axiomatic that it must be so, primarily because we know that the data are generated by underlying networks and systems, whether they're disrupted or not. So clearly, there are strong correlations and multivariate relationships that we should expect to see in the data. And indeed, we do. So we generalize this type of pairwise analysis to a multivariate analysis. And one of the main problems, or challenges, I should say, is how to do such a multivariate analysis with such heterogeneous data: continuous, discrete, categorical, OK, with very different statistical characteristics and distributions. The method that we have chosen is based on random forest regression. It is a statistical inference method based on ensembles of decision trees. And the algorithm that we're developing is called RF-ACE.
So let me just very briefly return to some of the challenges that the TCGA data pose. I already mentioned that they're mixed data types, continuous, discrete, categorical. Clinical data is categorical, gene expression data is continuous. There are thousands of features and hundreds or thousands of samples. We expect relationships to be nonlinear, noisy, and multivariate. Features are also correlated, sometimes very strongly. For example, proximal, nearby methylation probes are frequently going to be very correlated. So we have to be able to deal with those kinds of redundancies or correlations in the data.
And of course, there's also missing data. So RF-ACE, which is available here on this Google Code page, handles mixed variable types very naturally. It does not require imputation of missing data. I'm always worried about imputing missing data, so it's good that this method handles it naturally without needing to impute. We take one target feature and we try to predict it using all the other thousands or tens of thousands of features. Looking at all possible combinations of predictors is impossible. Even if you limit yourself to three or four predictor features, the number of combinations would still be astronomical. So this method is based on random subsampling rather than a full brute-force combinatorial search. Statistical testing in this method removes redundant features, and it allows you to assign an importance p-value to each candidate predictor. And Timo Erkkilä will be giving a talk about this in more detail tomorrow morning at 11:30.
I don't have time to go into the details of this method, but let me just say that it is a decision tree-based method. When I say target, I mean a feature that we want to predict. That's what we call a target feature. We select the target feature and we split the data, and the way we split the data is to maximize the so-called homogeneity of the splits, or to decrease the impurity. This is a term from random forests. So in other words, you can think of it as a tree going from the root node to the two child nodes, and the data in the child nodes will be more or less homogeneous. And then we repeat the splitting procedure for the child nodes in an iterative manner until some stopping criterion is reached.
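To make the splitting step concrete, here is a small sketch of choosing a threshold on one continuous predictor so that the child nodes are as homogeneous as possible. It uses variance of the target as the impurity, which is a common choice for regression trees; that choice, the function names, and the toy data are assumptions for illustration and are not taken from RF-ACE itself.

```python
def variance(values):
    """Population variance, used here as the impurity of a node."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split(feature, target):
    """Return (threshold, impurity_decrease) for the best binary split.

    Candidate thresholds are midpoints between consecutive distinct
    feature values. Returns (None, 0.0) if no split reduces impurity.
    """
    pairs = sorted(zip(feature, target))
    n = len(pairs)
    parent = variance([t for _, t in pairs])
    best_threshold, best_decrease = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        # Size-weighted impurity of the two child nodes.
        child = (len(left) * variance(left) + len(right) * variance(right)) / n
        decrease = parent - child
        if decrease > best_decrease:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_decrease = decrease
    return best_threshold, best_decrease

# Hypothetical toy data: a predictor whose low and high values go with
# clearly different target levels.
predictor = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
target_values = [5.0, 5.1, 4.9, 1.0, 1.1, 0.9]
threshold, decrease = best_split(predictor, target_values)
```

A full tree would apply the same search recursively to each child node, over many candidate predictors, until a stopping criterion such as a minimum node size is met.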
And then at the end, we look at all the leaf nodes, which are colored in green here, and we perform our regression, either by taking the mean of the data in those leaves or, if it's a classification problem, by taking a majority vote, the mode. Then we perform the same thing many, many times in a bootstrap manner. So we create many bootstrap samples by bootstrapping the data matrix and repeat this tree-building procedure on each bootstrap sample. This bootstrap aggregation, also called bagging, allows you to improve the identification of important features. This is why it's called a forest, because there are many trees. And then we go one step further and create many forests, in order to observe how important the features are in comparison to artificial contrasts, basically shuffled features. So we can get a sense of how surprised we should be to see a particular importance value for a particular feature predicting a target feature.
So this is a very computationally intensive algorithm. We typically run it on a large cluster of 1,100 nodes, and it takes hours to run on this cluster. But at the end of the day, you get millions of these associations, and these are multivariate associations. The way that we can explore them is by looking at the importance scores. So in other words, if feature A was a good predictor of feature B, even though it may have had other partners in predicting feature B rather than doing it by itself, then it will have a high importance score. And if it has a high importance score, we will represent it by an edge on this circular Circos diagram. So in other words, we put all the features around the genome. These are the chromosomes. And if one feature is an important predictor, a good predictor, of another feature, we draw an edge.
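The comparison against shuffled "artificial contrast" features can be sketched as a permutation test: compute an importance score for the real feature, recompute it for many shuffled copies, and ask how often a shuffled copy does as well. In this sketch the absolute Pearson correlation is only a stand-in for the forest-derived importance, and the function names, number of contrasts, and toy data are all assumptions; the actual RF-ACE procedure is more involved.

```python
import math
import random

def abs_correlation(x, y):
    """Stand-in importance score: absolute Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def contrast_pvalue(feature, target, score, n_contrasts=200, seed=0):
    """Empirical p-value of a feature's score against shuffled contrasts.

    Each contrast is a shuffled copy of the feature, which keeps its
    distribution but breaks any relationship with the target.
    """
    rng = random.Random(seed)
    observed = score(feature, target)
    at_least_as_good = 0
    for _ in range(n_contrasts):
        contrast = list(feature)
        rng.shuffle(contrast)
        if score(contrast, target) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_contrasts + 1)

# Hypothetical toy data: a feature that predicts the target perfectly.
feature = [float(i) for i in range(20)]
target = [2.0 * v + 1.0 for v in feature]
p = contrast_pvalue(feature, target, abs_correlation)
```

With 200 contrasts the smallest achievable p-value is 1/201, so resolving very small p-values requires many more contrasts, which is part of why this kind of analysis is computationally expensive.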
And we also color the edge based on the type of features that are associated, whether it's gene expression or mutations or copy number variation. And of course, as you can see, there are hotspots. Here I mean something different by the word hotspots: there are certain features that are strongly associated with lots of other features. So you see these bundles coming out. One of them is ESR1, and this is in breast cancer. Another is BRCA1 or HER2 here on chromosome 17. GATA3 is another one of these bundles. This is all nice, but the problem is that these images are static, okay? You can zoom into them like this. This is chromosome 17. You can see what features are located there. You can see what they're connected to, but still it's a static picture. So it's very hard to really explore this type of association data interactively and drill down into the data and ask questions.
So what we have done is create an interactive tool that we call Regulome Explorer, which is also a web application. So basically you just use your browser, and it allows you to explore these multivariate relationships in an interactive manner. And what I'd like to do now is just show you a brief one-minute movie rather than doing the live demo. So here you can load in a data set. Here we're going to choose glioblastoma, but we have several data sets loaded up here. It loads the associations, and here they are. And then by hovering the mouse cursor over the different features or over the edges, you can see the associations. By clicking on an edge, you will see the supporting data in a scatter plot, and here you can see the different patient samples being highlighted. Then you can filter based on names. You can use wildcards or you can type in a name. You can type in chromosomal coordinates. And this is FBXW7, which I showed you earlier. And here it is.
These are the associations with FBXW7, and by clicking there, you can also drill down to a zoomed-in linear browser, as I showed you before, and the same type of information is available through the linear browser. So if you click on one of the associations, you can see the supporting scatter plots of the data. And the reason we have this linear browser is that some of the associations are so closely located on the genome that you cannot see them on the big circle. We also have other views of the same associations. So for example, this is the more traditional network view. This is using Cytoscape Web. And these are the same associations shown in a network view. And then we also have a more traditional table view or spreadsheet view that can be sorted by different columns: by importance score, by correlation, by name. And again, clicking on any row brings up the data. Okay. And then finally, this is zoomable. This is using SVG graphics, so using your mouse wheel, you can zoom in and look at the details more closely. Clicking on one of the features will take you to the UCSC Genome Browser, where you can see the information about that particular feature and where it's located. Okay.
So here's an example from GBM showing IDH1 mutations. Again, I started my talk with this, showing the association with the CpG island methylator phenotype. One thing I did not show is that these clinical associations are represented by these little bubbles in the inner ring of the circle. So if I click on such a bubble, I will get this supporting information. In this case, since this is a categorical variable, you see this two-by-two contingency table shown graphically. We also have box plots, depending on the data type. Here's another example in GBM showing p53 mutations being associated with CDKN2A and TP53 methylation. So this is an example of two different data types: mutation in p53 and methylation of some other gene. And here are the box plots that I mentioned.
So if you click on one of these edges, you'll get this supporting information in the form of this box plot. They're just like box plots, except they also show you the density of the points on the side, so you can see where there are more points. So it actually has more information than the traditional box plot. And here's one final example, showing associations, again in glioblastoma, between several genes: insulin-like growth factor binding protein 2, which has been well known for many years to be associated with high-grade glioma, retinol binding protein 1, FMP2, which is associated with methylation of HER2, for example, and also days to recurrence, which is a clinical feature. So this is just to show that you can start to explore these kinds of multivariate associations using this tool to try to understand underlying mechanisms and how they're disrupted.
But obviously we need other information in order to make sense of these associations. We need information from the literature, right? We don't have all the literature in our heads. We need information from protein-protein interactions. We need information from transcriptional regulatory networks, and so on. So you can imagine that there are many types of associations that would be important to know about when we look at the associations derived from the TCGA data. One such approach that we have taken is to use a semantic measure of similarity from the PubMed literature called the Normalized MEDLINE Distance. This is based on published work of other authors called the Normalized Google Distance, and I don't have time to go into the details of how this is computed, but it's basically a number that tells you how similar two terms are in PubMed, okay? If it's infinite, that means that they're completely dissimilar and they never co-occur. And if it's zero, it means it's the same term. So a very small number means they're very much related.
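For reference, the published Normalized Google Distance is computed from three counts: how many documents mention each term, how many mention both, and the total number of documents. The sketch below applies that formula to document counts; it assumes the Normalized MEDLINE Distance uses the same formula with PubMed counts, which the talk implies but does not spell out, and the function name and the counts are made up for illustration.

```python
import math

def normalized_distance(count_x, count_y, count_xy, total_docs):
    """Normalized Google Distance formula applied to document counts.

    count_x, count_y: number of documents mentioning each term
    count_xy:         number of documents mentioning both terms
    total_docs:       size of the corpus (must exceed both term counts)
    """
    if count_x <= 0 or count_y <= 0:
        raise ValueError("each term must occur at least once")
    if count_xy == 0:
        return math.inf  # the terms never co-occur
    log_x, log_y = math.log(count_x), math.log(count_y)
    log_xy = math.log(count_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(total_docs) - min(log_x, log_y))

# Hypothetical counts in a corpus of ten million abstracts.
corpus_size = 10_000_000
close_terms = normalized_distance(1000, 800, 600, corpus_size)
distant_terms = normalized_distance(1000, 800, 10, corpus_size)
```

The two extremes match the description above: a term compared with itself gives zero, and two terms that never appear together give infinity.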
So now we can start to look at these associations in conjunction with literature and other types of information like protein-protein interactions. So for instance, the thick purple edges are the associations that we found both in the random forest regression and in the literature. The dashed purple arrows are the ones that are in the random forest regression, but not in the literature. The triangles are small molecules that are associated with these genes in the literature. The dark color indicates that the gene is significantly mutated in this particular cancer. The diamond means it's a transcription factor. And finally, the yellow lines mean that there's a putative protein-protein interaction based on the DOMINE database. It's a domain-domain interaction database. So these kinds of tools can be used to explore not only the random forest-derived associations, but also the literature and the protein-protein associations in that context.
This tool we call PubCrawl, and it is going to become part of our Regulome Explorer tool. Right now it is a standalone tool. And this is just an example showing a de novo network for retinoic acid. So you can type in a de novo term, like retinoic acid or glioblastoma, and it will construct a network for you based on these literature associations. Here's one for Wnt signaling, the Wnt pathway. And what's really cool about this is that up here, you see the KEGG pathway. This is the canonical KEGG pathway for Wnt. On the bottom, you see that the pathway is essentially reconstructed just from the literature and from the protein-protein interaction database. So we see APC here, and the Axins, and LEF1, and TCF7, TCF7L2. So basically almost the entire Wnt signaling pathway is kind of automatically reconstructed just from PubCrawl. I don't have time to show you the movie. I know I'm out of time.
This is just to show you that we have this site, explorer.cancerregulome.org, where you can explore these tools, both the pairwise associations, where we have the colorectal cancer aggressiveness already loaded, and the random forest analysis for a bunch of different cancers. Right now we have ovarian, breast, GBM, colon. I believe that's it. And we also have a link to PubCrawl that you're welcome to play around with. And I'd like to finish by acknowledging all the people that did this work. So, our GDAC: the work was done primarily by Brady Bernard, Sheila Reynolds, Vesteinn Thorsson, Dick Kreisberg, and Timo Erkkilä, who have made huge contributions to this work, both in terms of the scientific analysis and in terms of the software development. And I should say that TCGA is also an amazing project, not only because of the data and the scientific problems, but also because of the nature of the collaborations. Because we're working in such large groups, with so many different multidisciplinary groups and so many different types of data, it's really an amazing experience. And I'd like to thank everybody that we're working with in the different disease working groups, colorectal and breast and GBM. I don't have time to name everyone, but thank you, and thanks for your attention. Okay, I can take one or two questions, if any. We're already running about three minutes late, so since I'm the session chair, I should try to keep myself on time. Yes?
Great talk. So you're combining different features to create an overall measure of aggressiveness. Now some of these features, as you have alluded to, are correlated. So when you are trying to use, say, Fisher's test or other methods, that kind of assumes that all these variables are independent. Now, how do you plan to address that?
Right, very good question. So in random forest, we naturally address this question.
The random forest can handle correlated or redundant features, but your question was about the pairwise analysis with aggressiveness, where we didn't use the random forest analysis. And there, it is true that the assumption was essentially that the features are more or less independent, which we know they're not. But the way that we combine the associated p-values is by using a weighted Fisher's method. We basically assigned weights that are proportional to one over the log p-value of that particular clinical feature association. So in other words, if some particular clinical feature has a really strong association with lots of other features, then it gets weighted down a little bit, because otherwise it would swamp all the other features. But you have a very good point. And to address that, we need to move to something like random forest, okay? Any other? Yes, Rehan.
Hello, good talk, Ilya.
Thanks.
So in the associations, do you use any FDR criterion, any cutoffs that the users can specify, or do you specify them?
Yes. In the Regulome Explorer tool, which I didn't show, there is a box in the right corner that allows you to filter based on importance scores, based on p-values, and even based on correlations. Not FDR, but yes, we do have a way for you to filter and tune the level of stringency.
Yeah, with so many associations, I would imagine that a lot of them might be false discoveries. So it'd be good to have a criterion like that.
Absolutely, thanks.
Thank you.
Okay, great. So let's go on with our next speaker.