Good, so I think this is an appropriate time for me to give this talk, because I'm going to be talking about how a lot of the things that work in single-cohort analyses don't necessarily scale as we add more data and try to integrate across different data sets. I'm going to give you a simple example looking at tumor-normal phenotypes, but I think the general lesson applies to many things.
So TCGA is both a blessing and a curse. We've been talking the last couple of days about all the really nice things about it, and it really is an unprecedented data set for doing lots of different things. But as you can see from this figure, from a methods standpoint, one of the things we've wanted to do with it is to integrate across different data types and different tissue types to look at networks and pathways in the pan-cancer context. It turns out that this is actually a really hard thing to do. In our lab, we've spent a long time trying to do this, and it gets really complicated really quickly. The methods that work in one cohort don't work the same way across many cohorts, and the methods that work for one data type don't work the same way when integrating across many data types. So I think part of the problem is that we need to think about this from a different perspective.
So how do we get past this? Taking from a lot of the work that Google has trumpeted, when we scale up, when we get to truly big data, often simple models will get you most of the way. Complicated models may give you better results, but they're going to be much harder to integrate with different data sets and much harder to put into production. Additionally, we need to really work on incremental and transparent methods. I'm a graduate student. I don't have a full team of software engineers to check all my pipelines and make sure everything's all right.
Doing these complicated analyses, it's really hard to find the bugs. It's really hard to do all the tweaks that are required to bring a project to completion.
So the goal of this study is going to be relatively straightforward. We're going to look at the tumor phenotype, and that's going to be the mRNA, microRNA, and methylation data, and we're going to try to get at the scope of the events observed in specific tissue cohorts, as well as in a pan-cancer context. We're going to do this by going the opposite way from how we normally do these studies. We're going to first look at what all cancers have in common, and then, and only then, look at what is specific to individual cancers.
Traditionally, when we're talking about something like differential expression, in 2002 someone might do gene expression profiling and highlight a handful of genes in one specific cancer. Then a few years later there would be a couple of different cancer cohorts, and maybe a meta-analysis or review would integrate these and highlight the genes that are ubiquitously turned on or off in cancers and those that are tissue specific. But this data set really allows for a unified analysis that does this all at once with a shared data set.
So, sorry this isn't coming out too clear, but on the right we're going to start by defining a very simple model of differential expression. This is an example of one gene's profile across TCGA, so this is going to be about 8,000 patients. Here you can see that about 600 patients have matched adjacent normal tissue samples. So the first thing we're going to do is take this data and throw away about 90% of it. We're only going to look at the matched tumor-normal samples. And then we're going to do something that throws away even more data. We're only going to observe whether a gene has higher or lower expression in the tumor compared to the normal.
And then finally we're going to count how often that happens across a given cohort. We're going to call this statistic fraction overexpressed, and it gets at how often the gene is overexpressed in a given context. We'd expect that if the gene were under random regulation, this would happen half the time.
So there are a lot of advantages to this method that may not be obvious immediately. Anyone who's done pan-cancer analysis with expression data knows that tissue-specific baseline expression is really hard to overcome. This is not going to be sensitive to that, because we're only looking within paired samples and we're not incorporating the magnitudes at all. This test statistic is going to be much easier to interpret. It's a fraction from zero to one. And we're going to be able to carry it forward to integrative analyses and have a common baseline to compare across different data layers and different data types.
Obviously we're going to have some disadvantages from throwing away this much data. Those are that we're less sensitive to the magnitude of differential expression and obviously less powered. But in preliminary analysis, we found that this very simple metric, given enough samples (and as we know, in the future we're only going to get more samples), actually compares very well to a paired t-test, which uses the full data. We get about 99% correlation. And this is within one specific cohort. You can imagine that if we were trying to carry the paired t-test over to a pan-cancer analysis, we would have to look at things like tumor purity differences in the tissues. We would have to look at margin effects. All these different factors aren't going to be that big of an issue in our analysis, because our sole assumption is that our tumor sample has a higher concentration of tumor cells than our normal sample.
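The statistic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, dropping tied pairs is an assumption (the talk does not say how ties are handled), and the exact sign test is one natural way to formalize the "random regulation happens half the time" baseline.

```python
from math import comb


def fraction_overexpressed(tumor, normal):
    """Fraction of matched pairs in which tumor expression exceeds normal.

    `tumor` and `normal` hold one gene's expression values, and index i
    refers to the same patient in both lists. Magnitudes are ignored;
    only the direction of change is counted. Tied pairs are dropped,
    which is an assumption made for this sketch.
    """
    if len(tumor) != len(normal):
        raise ValueError("tumor and normal must be paired")
    ups = sum(t > n for t, n in zip(tumor, normal))
    downs = sum(t < n for t, n in zip(tumor, normal))
    informative = ups + downs
    if informative == 0:
        return float("nan")
    return ups / informative


def sign_test_pvalue(ups, n):
    """Two-sided exact binomial p-value against the coin-flip null (p = 0.5)."""
    k = max(ups, n - ups)  # fold onto the upper tail; the null is symmetric
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy values for one gene in four matched patients.
tumor_values = [5.0, 7.0, 3.0, 9.0]
normal_values = [4.0, 6.0, 4.0, 2.0]
print(fraction_overexpressed(tumor_values, normal_values))
```

Because only the sign of each within-patient difference is used, adding a tissue-specific offset to both samples of a pair leaves the result unchanged, which is the robustness to baseline expression described above.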
So we're going to start out by trying to get at what's common across all different cancers. We can apply this across the methylation data, the mRNA data, and the microRNA data. You can see that we get a full profile. 50% is our random model. Immediately you can see a skew in the microRNAs. This data tells us that microRNAs tend to be overexpressed in this pan-cancer context.
When we look into the details, we see a lot of the players that we've seen over and over in these studies. We can see that a number of the alcohol dehydrogenases are turned off ubiquitously across all different tumor types. Additionally, we see a number of things that are turned on across many different tumor types. So here's microRNA-21, and you can see it's turned on in every tissue type except for maybe kidney chromophobe cancer.
We can carry this forward. What we get here is a simple metric: for every gene, a fraction of how often the gene is turned on or off across this pan-cancer cohort. We can use this to do things like pathway analysis and network analysis. We see an enrichment of mitotic genes, and a lot of the metabolic pathways are turned down in tumors.
Additionally, we can do functional enrichment. This is on the methylation data. We can see that PRC2 probes have an enrichment for hypermethylation, whereas we see a decrease in promoter hypermethylation across the full set. So promoters may be methylated to silence a gene in a specific patient's tumor, but these things aren't generally conserved across many different cancers.
Finally, we want to see how well this type of metric would replicate. We were able to get about eight microarray data sets for about 900 subjects. The data is not as good because these are older microarrays, but we do get pretty good replication, about 76% correlation. So now we've established what all cancers have in common.
If we were to look at individual cancers without this knowledge, these are the things that we would get. But with this knowledge, we can go further to see what differentiates cancers.
But first, we're going to explore another hypothesis. We know that the genes we see over and over again are probably associated with growth and proliferation, because these are common processes that we see in all different tumors. But we haven't looked at the levels. So now we want to bring back the levels and ask whether there is differential usage of these genes, and whether it matters.
We can do this by using our fraction overexpressed profile. This is our fraction overexpressed for 20,000 genes. We can correlate this with each of our patient profiles, which we normalize to the normal tissue expression. So you can see this is one patient with a matched sample. In their normal sample there's not much correlation with this statistic, which is by construction. Whereas in the tumor, we see a pretty solid correlation, and what we can interpret from this is that in this patient, the genes that are always turned on in cancers tend to have higher expression levels, and the genes that are always turned off tend to have lower expression levels. As another sanity check, across all of our matched patients, 99% of them have lower values for this correlation in the normal compared to the tumor.
But you can also see that there is a pretty big spread of this correlation level amongst the tumors. This analysis doesn't depend on the matched normal data at all, so we can apply it across all 8,000 tumors that have expression levels. We see a pretty big spread, with some tumors, such as colon and rectal cancer, having pretty uniformly high correlation to this tumor growth signal. Others have a bigger spread and pretty low correlations.
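The per-patient score described above can be sketched as a correlation between the pan-cancer profile and one sample's normalized expression. The details here are assumptions for illustration: the talk does not say which correlation coefficient or which normalization was used, so this sketch uses Pearson correlation and subtracts a per-gene normal-tissue reference value. All names and numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / sqrt(sxx * syy)


def signature_score(sample, normal_reference, fraction_profile):
    """Correlate one sample with the fraction-overexpressed profile.

    All three arguments are dicts keyed by gene name. The sample is
    centered on a normal-tissue reference before correlating; this
    normalization is an assumption of the sketch.
    """
    genes = sorted(set(sample) & set(normal_reference) & set(fraction_profile))
    centered = [sample[g] - normal_reference[g] for g in genes]
    profile = [fraction_profile[g] for g in genes]
    return pearson(centered, profile)


# Hypothetical three-gene example.
profile = {"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.1}
reference = {"GENE_A": 5.0, "GENE_B": 5.0, "GENE_C": 5.0}
tumor = {"GENE_A": 8.0, "GENE_B": 5.0, "GENE_C": 2.0}
print(signature_score(tumor, reference, profile))
```

Since the score needs only a reference value per gene and not a matched normal from the same patient, it can be computed for any tumor with expression data, which is what lets the analysis extend beyond the matched subset.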
And finally, we can look at whether this is associated with outcomes. Like many of these growth and proliferation signatures, we see a high hazard ratio in many of these tumors which have lower baseline correlation levels. The way to interpret this is: a hazard ratio of about 2, this is in kidney cancer, means that patients with a one standard deviation higher correlation to this tumor growth signature have about a 2-fold increased risk of death. And there's a pretty big effect size for many of these tissue cohorts, as you can see down here.
So now we want to really get at the differential pathways activated in different cancers. We can do this with another very similar analysis. We're looking at our breast cancer cohort. Here we have our fraction overexpressed metric for all the genes in the breast cancer cohort compared to the same metric for all of the other tissues pooled. You can see that there is a pretty strong correlation between these two things, which is to be expected. Most things are shared between breast cancer and the rest of the tissues. But there are some genes off the diagonal, and those might be interesting for seeing what is turned on or off in breast cancer but not in other tissues.
One of the first things that comes up, this gene right here, is MET, which we've been talking a lot about these last couple of days. We can see MET has very high overexpression in kidney papillary cell carcinoma, which is to be expected. That's where we know it has clear oncogenic activity. It's also high in a number of other tissues, and the tissues that have the highest differential expression are generally the ones where we see familial germline mutations, so we know that it has oncogenic activity in those tissues. Whereas in breast cancer and in prostate cancer, we actually see a decrease in expression in the tumors compared to the normal tissue.
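The cohort-versus-rest comparison described above can be sketched as follows. This is a hedged illustration rather than the study's code: the input layout, the function names, and the `min_gap` threshold for calling a gene "off the diagonal" are all assumptions, and the counts are made-up toy numbers with placeholder cohort and gene names.

```python
def cohort_vs_rest(counts, cohort):
    """Pair each gene's fraction overexpressed in one cohort with the
    same fraction computed over all other cohorts pooled.

    `counts` maps cohort name -> {gene: (n_overexpressed, n_pairs)}.
    Returns {gene: (fraction_in_cohort, fraction_in_rest)}.
    """
    result = {}
    for gene, (up, n) in counts[cohort].items():
        rest_up = rest_n = 0
        for other, genes in counts.items():
            if other == cohort or gene not in genes:
                continue
            other_up, other_n = genes[gene]
            rest_up += other_up
            rest_n += other_n
        if n == 0 or rest_n == 0:
            continue  # no matched pairs on one side, nothing to compare
        result[gene] = (up / n, rest_up / rest_n)
    return result


def off_diagonal(fractions, min_gap=0.3):
    """Genes whose two fractions differ by at least `min_gap`,
    sorted with the largest gap first."""
    hits = [(g, a, b) for g, (a, b) in fractions.items() if abs(a - b) >= min_gap]
    return sorted(hits, key=lambda h: abs(h[1] - h[2]), reverse=True)


# Made-up counts for three placeholder cohorts and two placeholder genes.
toy_counts = {
    "cohort_x": {"GENE_A": (20, 100), "GENE_B": (95, 100)},
    "cohort_y": {"GENE_A": (30, 30), "GENE_B": (28, 30)},
    "cohort_z": {"GENE_A": (40, 50), "GENE_B": (48, 50)},
}
for gene, in_cohort, in_rest in off_diagonal(cohort_vs_rest(toy_counts, "cohort_x")):
    print(gene, in_cohort, in_rest)
```

Pooling counts rather than averaging per-cohort fractions weights each cohort by its number of matched pairs, which is one reasonable reading of "all of the other tissues pooled". The same two functions apply unchanged if the groups are defined by a genomic feature instead of a tissue.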
Additionally, a lot of what we've shown in TCGA is that not everything is tissue-specific. The genomic context matters. So we can group by the genomic context and look for things that are differentially activated in different contexts. Here we're dividing based on p53 mutation status. We can see, again, that while most things agree, there are some genes off the diagonal. MCM8 has no differential expression in p53 wild-type patients, whereas it does have higher expression levels in the tumor in p53 mutants. In contrast, MDM2, which we know has a strong interaction with p53, is overexpressed in the tumor in p53 wild-type patients, whereas there is no differential expression in p53 mutants.
So in summary, we've described a very simple metric for studying the tumor phenotype. We've used this to define a list of differentially expressed genes, microRNAs, and methylation sites in this pan-cancer context. We've used these features to stratify patient outcomes and to define both tissue-specific and driver-specific changes in cancer. We're working on putting this manuscript together, but I encourage you to think about the specificity of results in these targeted analyses, to really understand what is specific to a given tissue context or a given model versus the general features of cancer as a whole. And with that, this is the Ideker Lab, and I'll open it up to questions.
Very nice talk, thank you. I'm curious, why not assign a statistical significance score to your statistic, maybe by permutations?
So our statistic is basically a coin flip, and the value is reported here. When a gene is turned on 90% of the time, that is extremely significant. We get significance at about 55%.
And so in this realm, we're really more worried about the effect size than the p-values that we get on these things, because a traditional differential expression analysis has power with maybe 10 or 20 samples. We're using 600. So we're past the realm where significance matters. Anything that we're looking at is going to be very significant.
Okay, so we are going to have a break. After the break, we'll have a workshop. But before that, on behalf of the TCGA project team, I want to thank Julia for taking the time to organize this productive scientific symposium. Julia, thank you. We also would like to thank Josh Shapiro, Jennifer, and the colleagues from Capital Consulting for getting us organized, so we can have a hotel and we can have our slides. Also, we'd like to thank the TCGA project team, especially J.C. Carling, for the work they do behind the scenes to get this going. And finally, I'd like to thank my co-chair, Katie Holy, where is she, for her excellent work. Thank you, everyone, for coming to this meeting, and see you in the future.