Hi, so I'm going to talk about using the ENCODE data for cancer genomics. First of all, I'd just like to point out that lots of the ENCODE data is relevant to cancer. Many of the ENCODE cell lines, as you probably know, are cancer cell lines, for instance HepG2, A549, MCF-7, and so forth. These can often be easily paired with a normal, and the pairing gives you a sense of the oncogenic transformation. Furthermore, a number of these cell lines, K562, HepG2, and so forth, are extremely data-rich. They have a tremendous number of assays on them and are really unprecedented in that way for studying oncogenesis.

So I'm going to talk about using this data in four ways to look at cancer genomics. First, I'm going to talk about background mutation rate correction. Then I'm going to talk about studying network rewiring, variant prioritization, and then looking at drivers of differential expression.

So first of all, background mutation rate estimation. One of the main ways we find drivers in cancer is by looking at mutational recurrence within a cohort. One of the problems, though, is that this mutational recurrence can often be confounded by genomic covariates such as replication timing. We have a lot more mutations in late-replicating regions than in early-replicating regions. If we look only at recurrence, we might be fooled into thinking that a region is a driver region because it has a lot of mutations, when it simply has lots of mutations because it's late replicating. So we have to take this into account.

Now, the mutation rate is correlated with many different types of genomic signals in addition to replication timing. For instance, it's also correlated with open chromatin, and ENCODE has, of course, a wealth of this genomic signal data. So we've shown how you can put a lot of this data together into a model to estimate the background mutation rate.
Basically, we do a principal component analysis of all the different types of ENCODE signal data, and then we use these principal components to estimate the background mutation rate. One of the main things we see is that we need lots of PCs, often 10 or 20, to accurately estimate the background mutation rate. And then, of course, we can look at the signals that are really driving these PCs. There are things like replication timing, of course, but a lot of the histone mark data is also very important for this estimation.

So what do we do with an accurate estimate of the background mutation rate? Well, we can use it in a model to look at recurrent mutations and define drivers. A simple model that doesn't do that would be, for instance, a binomial model, which assumes a constant mutation rate across the genome. We're going to contrast things with this. We have two different models, a beta-binomial model and a negative binomial model, that allow the mutation rate to vary across genomic bins, and to vary with genomic covariates such as replication timing, histone mark data, and so forth.

So what do we get when we look at this? Well, here's, for instance, the empirical distribution of mutations in genomic bins. Here's the simple binomial model, and obviously it doesn't fit very well. And here's our corrected model. You can see it fits a bit better. This is for the beta-binomial; you can do the same thing for the negative binomial. And then when you come to actually finding recurrently mutated regions in the genome, having this better model deflates your p-values, so you don't get inflated significance. For instance, if you have a simple binomial model, you'll call lots of regions of the genome as recurrently mutated, whereas if you have our beta-binomial model that takes into account the genomic covariates, you'll be much more stingy in calling recurrently mutated regions.
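
The contrast between the two background models can be illustrated with a minimal sketch. It is not the model from the talk: the function names, the `concentration` parameter, and all the numbers are hypothetical, and `local_rate` is simply supplied by hand where a real analysis would predict it from genomic covariates (for example a regression on principal components of the signal tracks). The sketch only shows why a constant-rate binomial calls a late-replicating bin significant while an overdispersed beta-binomial with a bin-specific rate does not.

```python
import math

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def binom_pmf(k, n, p):
    """Binomial pmf: one constant mutation rate p for every bin."""
    return math.exp(log_choose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p))

def betabinom_pmf(k, n, a, b):
    """Beta-binomial pmf: the bin's rate is itself Beta(a, b) distributed."""
    return math.exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))

def upper_tail(pmf, k_obs, n, *params):
    """P(X >= k_obs): the recurrence p-value for a bin with k_obs mutations."""
    return min(1.0, sum(pmf(k, n, *params) for k in range(k_obs, n + 1)))

def bin_pvalues(k_obs, n_sites, genome_rate, local_rate, concentration):
    """Return (binomial p-value, beta-binomial p-value) for one genomic bin.

    local_rate is the covariate-adjusted expected rate for this bin
    (hypothetical here). concentration = a + b controls overdispersion:
    smaller values mean more variance around local_rate.
    """
    a = local_rate * concentration
    b = (1.0 - local_rate) * concentration
    p_binom = upper_tail(binom_pmf, k_obs, n_sites, genome_rate)
    p_betabinom = upper_tail(betabinom_pmf, k_obs, n_sites, a, b)
    return p_binom, p_betabinom

# Hypothetical late-replicating bin: 25 mutations over 2000 sites, where the
# genome-wide rate predicts about 10 and the covariates predict about 20.
p_binom, p_betabinom = bin_pvalues(
    k_obs=25, n_sites=2000, genome_rate=0.005, local_rate=0.01, concentration=200.0
)
print(p_binom, p_betabinom)
```

With these made-up numbers the binomial p-value comes out several orders of magnitude smaller than the beta-binomial one, which is the sense in which the covariate-aware model is more stingy.
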
You can see this, of course, in other types of graphical format.

Okay, so let's go on to the next topic, which is genomic rewiring in cancer. So let me focus on one particular transformation, the one between GM12878 and K562. GM12878 represents a normal blood cell; K562 represents a white blood cell tumor. For these cell lines, we have tremendous amounts of ENCODE TF ChIP data, so we can actually look at the changes in regulation and at how the regulatory network rewires. For instance, if we look at the tumor, we have some connections there. If we look at the normal, we'll have different connections. If we compare them, we'll see that some of the connections are lost, some are gained, and some are retained.

Now, here's a kind of global picture of that. It's very complicated here, and this just shows the 109 regulators, not all their targets, which would be too complicated for a picture. But even here, you can see the things that rewire a lot and all the changed connections. It's a useful picture to look at. However, these pictures are often complicated, and if you include all the targets, they get very complicated. So we tried to develop ways of simplifying these pictures. One way of doing this is not to look at the change in regulation of all the individual targets, but to group the targets into pathways or modules or gene communities, and then look at the change in regulation of these communities, okay? The idea is that there are far fewer communities than there are targets.

Now, there's a nice mathematical trick for doing this. It's called latent Dirichlet allocation, which is a method often used in text processing, where you look for the underlying topics in documents. Here we think of the documents as the transcription factors and the topics as these modules.
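
The document and topic analogy can be made concrete with a small sketch. Everything in it is an assumption for illustration: the toy network, the TF and gene names, the hyperparameters, the choice of a collapsed Gibbs sampler, and the idea of fitting one model over (TF, cell line) documents so that communities are shared and a TF's mixtures can be compared by total variation distance. It is not the procedure used in the work described here.

```python
import random

def fit_lda(docs, n_topics, n_iter=200, alpha=0.5, beta=0.1, seed=0):
    """Collapsed Gibbs sampler for latent Dirichlet allocation.

    docs maps a document id to a list of words. Here a document is a
    (TF, cell line) pair and its words are that TF's target genes.
    Returns {document id: community mixture}, each mixture summing to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for words in docs.values() for w in words})
    word_id = {w: i for i, w in enumerate(vocab)}
    doc_ids = list(docs)
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in doc_ids]          # counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # counts per community
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(doc_ids):                      # random initial labels
        labels = []
        for w in docs[doc]:
            z = rng.randrange(n_topics)
            labels.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assign.append(labels)

    for _ in range(n_iter):
        for d, doc in enumerate(doc_ids):
            for i, w in enumerate(docs[doc]):
                wi, z = word_id[w], assign[d][i]
                # Remove this edge, then resample its community label.
                doc_topic[d][z] -= 1
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wi] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1

    return {
        doc: [(doc_topic[d][k] + alpha) / (len(docs[doc]) + n_topics * alpha)
              for k in range(n_topics)]
        for d, doc in enumerate(doc_ids)
    }

def rewiring_score(mixtures, tf, normal="GM12878", tumor="K562"):
    """Total variation distance between a TF's community mixtures (0 to 1)."""
    p, q = mixtures[(tf, normal)], mixtures[(tf, tumor)]
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical toy network: TF1 switches gene community, TF2 and TF3 do not.
toy_network = {
    ("TF1", "GM12878"): ["a1", "a2", "a3", "a4"],
    ("TF1", "K562"): ["b1", "b2", "b3", "b4"],
    ("TF2", "GM12878"): ["a1", "a2", "a3", "a4"],
    ("TF2", "K562"): ["a1", "a2", "a3", "a4"],
    ("TF3", "GM12878"): ["b1", "b2", "b3", "b4"],
    ("TF3", "K562"): ["b1", "b2", "b3", "b4"],
}
mixtures = fit_lda(toy_network, n_topics=2)
scores = {tf: rewiring_score(mixtures, tf) for tf in ("TF1", "TF2", "TF3")}
print(scores)
```

One would expect TF1 to get the largest score because its targets move from one community to the other, but a sampler run on this little data is noisy, so the toy output should not be read as a result.
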
So we repurpose this allocation method, and here's what we find when we look at things. Let's just focus on this GM12878 to K562 transition first. We have all the different transcription factors here. For instance, let's take a look at MYC. We can look at the edges that are gained in cancer; this is our model for CML, chronic myelogenous leukemia. We can see lots are gained and very few are lost. Conversely, NBN is a factor that loses a lot of edges. We can just sort all the factors by how much they gain or lose edges. But perhaps more useful, we can also sort them in terms of these gene communities, and this is a useful way to understand which of the factors are really associated with the changes in oncogenesis.

Okay, let's go on to the next topic. I'm just giving you a quick run-through. We're going to look at variant prioritization. Variant prioritization, obviously, is a big topic for ENCODE in general; people are always using the ENCODE data to prioritize variants. I'm going to focus here on just one particular type of data that's new in ENCODE: the RNA binding protein data. Before the last ENCODE there was not very much of it, and now there's quite a bit, from the eCLIP experiments. Two things are useful to know about this. First, the eCLIP experiments cover a large fraction of genomic real estate, actually more than the CDS. Second, these experiments are very precise; the binding sites in them are actually much more precise than TF binding sites.

So there's lots of annotation data, and we tried to build a simple pipeline, which we call RADAR, for putting all this data together to annotate variants. We put the binding data together with conservation data, both cross-organism and within the human population. Since we're looking at RNAs, we have to include secondary structure. We use all these to make a score.
We have this entropy weighting scheme for putting all these things together to make a score. And then we also combine it with a tissue-specific score that's related to the cell line, the tissue context, that you're looking at.

Just a little bit more on these features. You can see that the binding sites of each RNA binding protein are conserved to a different degree, which really highlights why you want to use conservation. Also, a lot of the binding proteins, as I'll show you in a bit, come together to make a network, and there's a lot of co-binding among these proteins. So we obviously want to weight hubs in the RNA binding protein network more highly than non-hubs, and we take that into account in our procedure. Briefly, we find that our procedure gives higher scores to COSMIC genes, known cancer genes, than to other genes, and it also tends to give higher scores to recurrently mutated regions, for instance in breast cancer, than to other regions, which gives us some confidence that the score is useful in a cancer context.

And now to the last topic, which is finding regulatory drivers of differential expression. I've told you in the second topic how, in oncogenesis, the actual regulatory connections between a TF and its targets might change: rewiring. But you could also have a situation where the targets remain the same but the expression of the target genes changes a lot, okay? Here, what we want to do is find the TFs that are associated with the target genes that change most in cancer. So we use the TCGA data, which shows how gene expression changes in cancer, and we use the ENCODE regulatory network to associate regulators with targets. Then we build a simple regression model that tries to explain the change in gene expression from all the regulators, and we can find the regulators that are most highly weighted in this model. And that's what we plot here.
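
A regression of this kind can be sketched as follows. This is an illustration under stated assumptions, not the model from the talk: the regulator and gene names, the fold-change values, the binary target indicators, the absence of an intercept, and the small ridge term (added only for numerical stability) are all hypothetical. Each gene's expression change is modeled as the sum of the weights of the regulators that target it.

```python
def fit_regulator_weights(targets, log_fold_change, regulators, ridge=1e-6):
    """Least-squares fit of expression change on regulator-target indicators.

    targets maps each regulator to the set of genes it binds (from a
    regulatory network); log_fold_change maps each gene to its tumor vs
    normal expression change. Returns {regulator: regression weight}.
    """
    genes = sorted(log_fold_change)
    # Design matrix: one row per gene, one 0/1 column per regulator.
    X = [[1.0 if g in targets[r] else 0.0 for r in regulators] for g in genes]
    y = [log_fold_change[g] for g in genes]
    p = len(regulators)

    # Normal equations: (X^T X + ridge * I) w = X^T y.
    A = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    w = [0.0] * p
    for i in reversed(range(p)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, p))) / A[i][i]
    return dict(zip(regulators, w))

# Hypothetical toy data, built so that TF_A's targets go up by 1.0, TF_B's
# targets go down by 0.5, and TF_C has no effect.
toy_targets = {
    "TF_A": {"g1", "g2", "g3"},
    "TF_B": {"g3", "g4"},
    "TF_C": {"g1", "g5", "g6"},
}
toy_change = {"g1": 1.0, "g2": 1.0, "g3": 0.5, "g4": -0.5, "g5": 0.0, "g6": 0.0}

weights = fit_regulator_weights(toy_targets, toy_change, ["TF_A", "TF_B", "TF_C"])
ranked = sorted(weights, key=lambda r: abs(weights[r]), reverse=True)
print(weights, ranked)
```

Ranking by the size of the fitted weight is one simple way to read off which regulators are most associated with the expression change; running such a fit per cancer type would give a table of coefficients with one row per regulator and one column per cancer.
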
So we have all the different regulators and all the different cancers, and this just shows the regression coefficients, that is, which regulators get most highly weighted. You can see that some factors are really associated with upregulation in cancer; MYC is very famous for being associated with that. And some are more differential. We can also do the same for RNA binding proteins and their network. Here we find that SUB1 is the RNA binding protein that's most associated with upregulation in cancer. It turns out you can also look at survival curves. A well-known one is for MYC: if you have more active MYC, you tend to have differential survival. But you can see the same thing for SUB1 among the RNA binding proteins.

Now, to put the RNA binding proteins and the TFs together, we can build up a regulatory hierarchy, just like we've shown for TFs. Here we find that the things at the top of the hierarchy, both the RBPs and the TFs, tend to drive gene expression the most. So here are the TFs, here are the RBPs, and here's the average correlation with change in gene expression. There's a lot of cross-talk between the TF and RBP networks. Just to highlight, here are MYC and SUB1. We just point out that these can really act together on the same genes, in effect promoting both transcription and post-transcriptional regulation.

Okay, so with that, I'm going to conclude on my four topics, and just go through them once more. So what have I told you? I've told you how the ENCODE data can be used for cancer genomics. First, it's useful for developing accurate models of the background mutation rate. We've developed these parametric models for doing this. One of the key things is that we have to use lots of data, often 10 or 20 different principal components, to accurately model the background mutation rate.
The next thing I've told you about is how lots of the ENCODE cell lines are associated with cancer. We can pair them with normals and look at the rewiring of the TF regulatory network in some of these cell lines with lots of data, particularly those associated with CML. Here we see lots of complex rewiring. It's very interesting. And we've developed this procedure based on latent Dirichlet allocation to simplify this, to look at the changes in modules or gene communities.

The third thing we talked about is using the ENCODE data for variant prioritization. We have a pipeline called RADAR, which uses the RNA binding protein data along with secondary structure, conservation, and so forth to prioritize variants.

And then finally, I talked about using the ENCODE data to find the regulators, RNA binding proteins or TFs, that are associated with driving differential expression, by using a simple regression model, and then putting these regulators into a hierarchy and seeing that the ones that drive differential expression tend to be at the top. There's also a bit of cross-talk between the RBPs and the TFs.

OK, with that, I think I'll conclude. I thank you for your attention. I want to acknowledge the people who worked on this. The principal scientist who led most of this work was Jing Zhang, who was an associate research scientist with me for a number of years. She's now moved on to a faculty position at Irvine. She worked very closely with two graduate students in the lab, Jason Liu and Dr. Lee. I show all the different tools that we developed, with their URLs, here. And I'll also say we have lots of openings for people who want to do ENCODE-related research in the lab, and you can just go to our jobs page if you're interested. Thank you very much for your attention.