Okay, good morning. So, continuing on the cancer theme and non-coding drivers. We are fundamentally interested in mutations: both in mutations as a tool, as the two previous speakers described very well, for finding functional elements, cancer drivers, and markers, and also in mutation as a process in itself. And we're working with two different types of data: germline mutations and somatic mutations in cancer. The idea is that we have genomes of parents and children, or a control cell and a cancer cell from the same patient, we sequence the DNA, and we interpret the sequence differences between the two as de novo mutations. So these are fundamentally the data. And there are multiple reasons to study these data. One is statistical genetics of cancer, so using it in an applied fashion, and this is where I'm going to focus for the purpose of this talk. We're also thinking about evolution, in two different flavors, right? One is that mutation rate is a very important parameter in evolutionary studies, so any type of evolutionary inference depends on mutation rate. The other is that because mutation rate is controlled by the fidelity of the DNA replication process and by the efficiency of DNA repair, it's a selectable phenotype, it has a very special place in evolutionary theory, and we're interested in the evolution of mutation rate itself. We're also interested in learning about the biology of both DNA repair and DNA replication by looking through sequencing data. So we're interested for many different reasons, but in today's talk I'll be talking mostly about statistical genetics inference. And I'll start with a very simple statement, that the problem is very complex, because most of what we know about gene mapping and detection of natural selection is not applicable to our situation, right?
So for one second, think about what methods you know for gene mapping and for detection of natural selection, and you would think that, oh, we have statistical linkage, we have statistical association, we have the whole toolkit of statistical genetics. The problem is that all of these methods fundamentally rely on recombination, right? They're only applicable to sexual systems. They are not applicable to asexual systems. It's the same story with methods to detect selection, because what cancer genomics does in its applied wing is try to find cancer drivers, try to infer functional elements, now non-coding elements, that are under selection and are not mutation hotspots. And most methods available in the field, like selective sweeps and extended haplotypes, are similarly limited to sexual populations. They also use recombination as the vehicle. So what about cancer? We have only one way to detect selection or to do gene mapping in an asexual system, and this is recurrence. So we look at an individual gene, or an individual regulatory region, an individual functional element identified by ENCODE, and we look at different samples from different patients, and we say, oh, I see a significantly mutated gene, a significantly mutated regulatory element, because I see more mutations in this region than I expect by chance. So it's a great idea, of course, right? But there is one fundamental problem. This signature of selection is completely confounded by mutation rate variation, right? I may wonder whether this gene is a cancer driver gene or a mutational hotspot, where the specific variants I detect in patients happen to mutate more frequently than expected for some reason other than selection. There is no way for me to statistically discriminate between the two, right? So I have to model mutation rate. There is no way out.
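To make the confounding concrete, here is a minimal sketch of a recurrence test, assuming mutation counts in a region are Poisson with mean equal to rate times length times number of samples. The function names and all numbers are hypothetical and are not taken from the talk; the point is only that the same observed count looks like a driver under a genome-wide average rate and unremarkable under a higher local rate.

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the tail directly."""
    if k <= 0:
        return 1.0
    # First tail term P(X = k), computed in log space for stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Add successive terms until they no longer change the sum.
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)

def recurrence_p_value(observed, rate_per_bp_per_sample, length_bp, n_samples):
    """P-value for seeing at least `observed` mutations in one region."""
    expected = rate_per_bp_per_sample * length_bp * n_samples
    return poisson_upper_tail(observed, expected)

# Hypothetical element: 1 kb long, 500 tumors, 12 mutations observed.
observed = 12
# Null model built from a genome-wide average rate (expected count 1).
p_average_rate = recurrence_p_value(observed, 2e-6, 1000, 500)
# Same data, but the local rate is really 8x higher (expected count 8).
p_local_rate = recurrence_p_value(observed, 1.6e-5, 1000, 500)

print(f"p under average rate: {p_average_rate:.2e}")
print(f"p under local rate:   {p_local_rate:.2f}")
```

Nothing in the count itself distinguishes the two scenarios; only the assumed rate does, which is why the rate has to be modeled.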
One idea is that I would look at mutations happening in nonfunctional regions, so neutral regions, use them as a control to build a model, and apply this model to the functional sites. So I would contrast something functional with something nonfunctional. This idea would be great; however, if my mutation rate variation is correlated with functionality, however I define it, this would not work, because again mutation rate variation would completely confound my inference. Another idea is, oh, I can take a very large number of samples, and then I would look at a specific subset of samples or a specific patient, and I would observe that these mutations don't conform to the null model that I learned on the other patients. This idea would not work well if mutation rate variation is sample specific, right? So I'd like to argue that both of these are present, and this poses a fundamental problem for statistical methods. Again, I study mutations, right? I don't need to explain that this is an important thing. I don't want to be very pessimistic. We're all in the same boat. We have to deal with that, but fundamentally we have to think very carefully about mutation rate variation if we want to do cancer genomics. Now all of this is exacerbated in the search for noncoding drivers. With the PCAWG project, more and more cancer genomes rather than exomes are coming out and the field is poised to analyze them, and we just heard two excellent talks about noncoding drivers. In my lab, we're fundamentally interested in those, but because we define these putative functional regions by signatures of chromatin, by epigenetic variables, if those also change mutational processes, right, our search is completely confounded by this effect. So this is what I'd like to discuss, and again, I don't want to sound very pessimistic, but I think the problem is statistically complicated, and I will go through the data that support that.
Okay, so our first work with ENCODE data, I believe six or seven years back, in collaboration with John Stam, was finding the correlation between replication timing and mutation density, first in germline, and then, in collaboration with Eric Lander and Gad Getz, we observed the same effect in cancer data. So in late-replicating regions there are more mutations than in early-replicating regions, and there are several biological models. I don't really have time to go through the biology, because I don't think we have settled the argument on that, but this observation is fairly ubiquitous. Across cancer types, across data sets, people observe the same effect. Another effect is correlation with expression level. And Mike Snyder just mentioned both of those effects. We believe we know the culprit. We think that this effect is mediated by the activity of transcription-coupled repair. Again, we're not sure, but this is the most likely explanation. Just to remind you about the biology of this, nucleotide excision repair is the pathway that brings the TFIIH helicase to the lesion and unwinds the DNA. There is an excision step, and there is a re-synthesis and ligation step. Now the key question here is how the lesion is identified. And there are two different ways the lesion is identified. One is a stalled polymerase, right? Transcription goes on, the polymerase stumbles on the lesion and cannot proceed forward. And what it does, it recruits this downstream pathway. What it means is that the repair, and this is a highly efficient repair pathway, is being brought to the lesion at the time of transcription. So more transcription, more repair, fewer mutations. The other pathway is global genome repair, where the XPC complex just scans DNA randomly. Okay, so transcription, replication; now let's look at chromatin. Again, at the scale of one-megabase windows, we find a dependency of mutation density in cancers on chromatin accessibility. And this is DNase data, shown here for melanoma.
A couple of interesting points here. One is that the effect is not limited to DNase sensitivity. We don't know the causality, right? We don't know which specific biological factor drives it, but the general observation is that all activating marks, so marks of active chromatin, of an active genome, anti-correlate with mutation density. And repressive marks positively correlate with mutation density, right? So what's happening is that in every place in the genome that matters, highly expressed genes, active regions of early replication, regions of active regulation and transcription, and, as I'll mention in a second, regulatory elements, mutation density is reduced. So this is the basic observation. Now, this is work from last year; I think I already mentioned it at this meeting. If we combine all of these marks collectively, we can explain a very large proportion of the variability of mutation rate at the megabase scale. In some cases, more than 80% of the variation can be explained by a random forest model, and this is true cross-validation, not out-of-bag. And the signal primarily comes from marks of the relevant cell type of origin or relevant tissue; in this case, the melanoma signal is dominated by elements in melanocytes. This is Epigenome Roadmap data. Okay, so this is what's happening at the megabase level. And I would like to mention a story from a couple of years back. We also looked at specific regulatory elements, at DNase hypersensitive elements, most of which we believe are involved in regulation. And what we found at that time is that in every single sample we analyzed, the density of mutations within DNase hypersensitive sites was lower than the density of mutations outside, even in the flanking regions, right? And this is per sample, so it depends on the number of individual tumor genomes in the sample; it's not proportional to the number of mutations. But again, we have multiple myeloma, colon cancer, melanoma, and so forth.
And at that point, the hypothesis we came up with was that the global genome arm, and maybe also the transcription-coupled repair arm, because some enhancers are known to be transcribed, but definitely the global genome arm, is much more efficient where chromatin is open, right? Access to DNA is facilitated for that complex, and recruitment of the downstream nucleotide excision repair is more efficient. And we used genetic evidence, splitting melanoma genomes into those that are deficient in nucleotide excision repair and those where nucleotide excision repair is seemingly intact. And we saw that there is enrichment of nucleotide excision repair deficient samples among samples where the effect is absent or very weak. So now, recently, there was a cell paper on experimental work. And these two papers, I think, are amazing. What they show is that the effect is indeed mediated by nucleotide excision repair. They use the XR-seq approach. And what is more interesting and important for this story is that they identified that even though in distal enhancers the mutation density is lower due to the activity of nucleotide excision repair, at active transcription factor binding sites, where the transcription factor is bound tightly, access for repair is limited. And many of them form mutational hotspots. Right? And this is specifically in melanoma, where nucleotide excision repair is very active, repairing UV lesions, and in cancers associated with smoking, for example. Right? So again, I'm not saying that all hypermutable regulatory regions are false positives, but what's happening here is that you find an active transcription factor binding site within a DNase hypersensitive region, and you see that it is a mutational hotspot compared to the flanks. And this is purely explained by repair activity, rather than by this particular binding site being a cancer driver or these nucleotide changes being important for cancer development.
Right? So there is this correlation between functional activity and mutation rate, which creates a statistical problem. Okay. So moving on to newer work. What we see, again, is that overall mutation density is low in early-replicating regions, active regulatory elements, and highly expressed genes. However, this is an aggregate. So what if we look at individual mutagens, at individual mechanisms? And how about individual samples? If I look at an individual cancer genome, is the signature present, is it stable, or do we have a lot of variation among individual samples? Right? And we selected an example which is very well understood and can be statistically detected, and this example is APOBEC. I don't know how many people in the audience are familiar with APOBEC mutagenesis. This is an exciting story. Some are, most aren't, so I'll go through this real quick. Usually when we think about cancer mutagenesis, we think about exogenous factors: smoking, UV, certain carcinogens, and things like that. Right? APOBEC is our own human protein. It's actually a family of proteins. It looks like the primary player is APOBEC3A. These are cytidine deaminases, and APOBEC stands for apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like. It was identified as an RNA editing gene, but it's involved in DNA editing, and the hypothesis is that its functional role is in innate immunity, where single-stranded DNA is being mutated. The problem is that it also mutates our own single-stranded DNA, and this family of our own proteins plays an important role in cancer mutagenesis. Why can we identify its signature? Because in experimental systems and in cancer genomes, APOBEC creates strand-coordinated mutation clusters, since it acts on single-stranded DNA. If I see a strand-coordinated cluster, I know that it was probably produced by APOBEC. It also has a characteristic signature, which doesn't carry a lot of information content, but is better than, say, the UV mutation signature or other mutation signatures.
So if I see a cluster of this type of changes that are strand-coordinated, I have a rational hypothesis that APOBEC is in play. This signature was shown by the work of Mike Stratton's group, Dmitry Gordenin's group, and others to be correlated with expression levels, and there is an association of GWAS signals at APOBEC with the presence of this signature in tumors. So I think we have a pretty good amount of evidence that APOBEC is a player. Now, if I look at enrichment of the APOBEC signature, it varies by cancer type. Some cancers, especially those that are associated with viruses, have a lot of enrichment, and others have almost none. Breast and lung cancers have sizable enrichment of clusters, but it also varies by sample within cancer type. So for some reason, in some patients APOBEC is expressed at higher levels and is active, and in other patients it's not. I really don't know what underlies this observation, but there is huge heterogeneity among samples as soon as you stratify cancer genomes by the enrichment of clusters and of APOBEC signatures. So APOBEC activity results in sample-specific mutation properties. Now, this is one example; you can think of many others, right? As soon as you have a mutation signature associated with a specific repair mechanism, and this repair mechanism can become compromised in a subset of patients, you have the same story. This was a story about a mutagen, but in many cases changes in polymerase delta, polymerase epsilon, or repair enzymes can generate sample-specific mutation processes, so making use of some patients as controls for others is very difficult in cancer genomics. So what do we see in APOBEC samples? First, we see an absolutely inverse relationship with replication timing and chromatin accessibility, specifically for strand-coordinated clusters with the APOBEC signature. And this is in lung cancer, and the same pattern is observed in breast cancer data.
However, probably only a minority of APOBEC mutations happen in clusters. Many of them happen individually. So for that, we did the following analysis. We can look at the dependency on DNA replication timing of APOBEC signature mutations versus other mutations. And what we see is that the larger the APOBEC signature enrichment, the smaller the slope, and it's actually negative for samples with a very high density of APOBEC mutations, right? So it looks like whether a specific mutation type is positively or negatively correlated with replication timing depends on the mutagen. And there are some biological stories for why: because APOBEC needs a source of single-stranded DNA, it might be biased towards early replication. Being a little more sophisticated, we came up with a mixture model where we say that some mutations in the APOBEC signature are generated by APOBEC activity and others by other mutagens, or are just spontaneous mutations. And we know the enrichment, which we can estimate separately as a parameter. So we can fit this model to the data. And what we see here, again, if I look at replication timing versus APOBEC signature mutation enrichment, is that for samples with low enrichment I have a very strong dependence on replication timing, and a much weaker dependence for samples with high enrichment. And as I just said, in samples with very high enrichment, it actually becomes a negative slope. So the conclusion of this story is that the effects of epigenomic features on mutation may be mutagen dependent, right? So the idea that we would just regress out replication timing or expression may not be sufficient in many analyses, because it really depends on what the underlying mutational process is for this particular sample. APOBEC mutations, for example, are unique in their genomic distribution. And it looks like mutation models have to be sample specific, right? So this is all a set of sort of pessimistic notes.
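The mixture idea can be illustrated with a toy calculation. This is a sketch under assumptions that are not stated in the talk: that an enrichment E of signature-context mutations over expectation implies a fraction 1 - 1/E of them come from APOBEC, and that each component has its own fixed slope against replication timing. The slope values and function names are hypothetical.

```python
def apobec_fraction(enrichment):
    """Fraction of signature-context mutations attributed to APOBEC.

    Assumption for illustration: with enrichment E = observed / expected,
    the excess over expectation, 1 - 1/E, is due to APOBEC.
    """
    if enrichment <= 1.0:
        return 0.0
    return 1.0 - 1.0 / enrichment

def mixture_slope(enrichment, slope_background, slope_apobec):
    """Slope of mutation density versus replication timing for a mixture."""
    f = apobec_fraction(enrichment)
    return (1.0 - f) * slope_background + f * slope_apobec

# Hypothetical component slopes: background mutations increase toward late
# replication (+1.0), APOBEC mutations favor early replication (-0.5).
for enrichment in (1.0, 2.0, 5.0):
    slope = mixture_slope(enrichment, slope_background=1.0, slope_apobec=-0.5)
    print(f"enrichment {enrichment:.0f}: slope {slope:+.2f}")
```

With these made-up values the slope weakens as enrichment grows and changes sign at high enrichment, which is the qualitative pattern described above.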
But as I said, we're in the same boat with everybody else, so we have to find noncoding drivers, right? We know it's very difficult. We have to try. Now I have to run. OK, so I'll try to talk very fast. This work is in collaboration with Kelly Frazer's lab in San Diego, with Matteo D'Antonio. What we tried to do is cluster regulatory elements marked by DNase I hypersensitivity by all the covariates we could come up with, in as homogeneous a set of breast cancer samples as we could. We assume Poisson statistics within clusters, and we use non-matching tissue types as a control to derive the FDR. Now there is good news and bad news. The good news is that Kelly's lab found a lot of corroborating evidence and experimental work to suggest that there is signal in the data. The bad news: as a statistician, what I wanted to see is that my control data are exactly Poisson distributed, because this would mean that I have accounted for every single covariate and I know the behavior of the data. And we never achieved that. It's always inflated. In this approach, we've never seen a clean and nice Poisson distribution. Now, having said that, if we look at random controls and non-matching tissue types, and we look at p-value enrichment specifically for breast cancer DHSs in breast cancer samples, we obviously have enrichment in a number of different data sets. For example, if we look at aberrantly expressed genes, significantly mutated regions are enriched. And then we looked at independent data sets and ended up with 16 elements where the evidence seems to be sufficiently solid. However, this is nagging: I cannot come up with a clean statistical model even with as many covariates as I tried to use. So, using my last 30 seconds, I'll try to walk you through ongoing work, where we decided to give up on estimating the local mutation rate as a point estimate. It's just very difficult.
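Two of the checks described above can be sketched in a few lines. The first is a dispersion index (variance over mean), which is close to 1 for Poisson counts and above 1 when the counts are inflated. The second is an empirical FDR from a control set; the talk does not say how the FDR was derived from non-matching tissues, so the ratio estimator here is an assumption, and all numbers are invented.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by mean; about 1 if counts are Poisson."""
    return variance(counts) / mean(counts)

def empirical_fdr(test_p_values, control_p_values, threshold):
    """Estimate FDR at a p-value threshold using a control set as the null.

    Assumed estimator: the fraction of control elements passing the
    threshold, scaled to the test set size, divided by the number of
    test elements passing.
    """
    test_hits = sum(p <= threshold for p in test_p_values)
    if test_hits == 0:
        return 0.0
    control_rate = sum(p <= threshold for p in control_p_values) / len(control_p_values)
    expected_false = control_rate * len(test_p_values)
    return min(1.0, expected_false / test_hits)

# Hypothetical mutation counts for elements in one covariate cluster.
cluster_counts = [0, 0, 0, 8]
print("dispersion index:", dispersion_index(cluster_counts))

# Hypothetical p-values for matching (test) and non-matching (control) DHSs.
test_p = [0.001, 0.002, 0.5, 0.9]
control_p = [0.001, 0.3, 0.6, 0.8]
print("empirical FDR at 0.01:", empirical_fdr(test_p, control_p, 0.01))
```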
You're looking at significantly mutated regions, and if your model is even a little wrong, you're still riding on the tail of the distribution and you generate false positives. So we decided that we have to model the set of samples in hand, rather than the general process, and we opted for a hierarchical model. This is the idea. We take all the data across genomes. We fit a parametric model for mutation rate heterogeneity overall, so the hyperparameters. It took us two years to find a very good parametric solution for that. Then we take known covariates, which are still good information, and sometimes the density of neutral mutations in the local region, if we think we have some reasonable proxies for neutral mutations. We don't fully believe any of this, but we can use it to generate posteriors for the set of functional sites. So it's not like I'm saying that I have an exact model and this is my p-value. This is a much more flexible approach. Of course, the only way to test it right now is to look at coding variation. So here is how it looks for different cancer types; for example, this is lung adenocarcinoma. This model has several parameters. We fit it on synonymous sites and then we just recompute for the target. So for non-synonymous mutations, missense and nonsense, this is a zero-parameter fit. We basically just take the synonymous model and scale it up. And as you see, it fits pretty well. However, there is some signature of possibly negative selection and possibly positive selection. I don't have time to go through all the fits for different cancer types. But the basic result is that if we use the model and compare our ability to find known cancer genes against MutSig, we usually get a greater AUC.
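A minimal sketch of this kind of hierarchical scheme follows, with a swapped-in simplification: the talk does not name the parametric family, so a gamma prior on the per-gene rate is assumed here purely because it gives a closed form. Synonymous counts update the prior, and the nonsynonymous count is then judged against the resulting negative binomial predictive distribution with no extra fitted parameters. All names and numbers are hypothetical.

```python
import math

def neg_binom_pmf(k, r, p):
    """P(K = k) for a negative binomial with real-valued size r and probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def predictive_tails(nonsyn_count, syn_count, syn_target, nonsyn_target,
                     prior_shape, prior_rate):
    """Lower and upper posterior-predictive tails for a nonsynonymous count.

    Assumed model: rate ~ Gamma(prior_shape, prior_rate) with hyperparameters
    fitted elsewhere; synonymous count ~ Poisson(rate * syn_target);
    nonsynonymous count ~ Poisson(rate * nonsyn_target) under neutrality.
    """
    # Gamma posterior after observing the synonymous sites.
    shape = prior_shape + syn_count
    rate = prior_rate + syn_target
    # Integrating the Poisson over the gamma posterior gives a negative binomial.
    p = rate / (rate + nonsyn_target)
    lower = sum(neg_binom_pmf(k, shape, p) for k in range(nonsyn_count + 1))
    upper = 1.0 - lower + neg_binom_pmf(nonsyn_count, shape, p)
    return lower, upper  # P(N <= n) for a deficit, P(N >= n) for an excess

# Hypothetical gene: 3 synonymous mutations over a target of 500,
# nonsynonymous target of 1500, prior Gamma(2, 1000).
deficit_p, excess_p = predictive_tails(20, 3, 500, 1500, 2.0, 1000.0)
print(f"P(N >= 20) = {excess_p:.2e}")
```

The upper tail is the evidence for an excess (possible positive selection) and the lower tail for a deficit (possible negative selection); because the rate is integrated out instead of plugged in as a point estimate, a slightly wrong local estimate has less leverage on the tail.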
And what was of interest to us is finding negative selection. If we look at cell-essential genes according to a CRISPR screen, where most of the signal is driven by obvious things like ribosomal proteins and so forth, we also find evidence for negative selection in the same model. And this again goes across several cancer types, where we see both an increase in AUC for positive selection and some, however weak, signal for negative selection. So the whole story is that understanding mutation rate heterogeneity helps us understand basic biology but also develop methods for cancer genomics. Mutation rate varies by cancer type and by sample. Epigenomic features are key to understanding it, and we obviously need better statistical approaches, because there is a lot of confounding between the functional ENCODE features we use to assign function and the same functional ENCODE features influencing mutation rate. And that's the major complexity in the field. Thanks to my lab, we're very careful thinking about our projects. Most of the work was done by Paz Polak, who has now left the lab, and by Donate Weghorn, who is a current postdoc, and of course John and Bob, Rosa Karlić and Amnon Koren, Dmitry, Marat, and Steve on the APOBEC project, and Kelly Frazer and Matteo D'Antonio. I decided to include non-coding drivers at the last minute, so they should be on the slide.
Yeah, that was a great talk. I have a question about the essential genes. I didn't get that quite clearly. Most genes actually don't show mutations in cancer samples. So it's a very hard problem to identify the ones that have fewer than expected, because most don't show mutations. How were you getting that?
Right, so the story is this. We have been interested in identifying negative selection in cancer for a long time. The signal is amazingly weak. And what happens is that a simple method, where again you estimate the local mutation rate and look at the deficit, has very low power.
So apparently the hierarchical model has more power, and quantifying is still very difficult, but what is done here is that we're showing the AUC for a collective group of genes that show the effect in the CRISPR screen, right? So those are functionally significant genes. And again, I cannot pass Bonferroni. I cannot give you a new non-oncogene addiction gene which is a great target for therapeutic intervention. But for the first time we see a significant change for a group of genes, right? So it's a combination of, I think, greater power of the statistic and the ability to have a control data set.
I know why you threw out the local context correction, but it seems like in the cancers where there are lots of mutations you could actually bring that in. Some of the cancers have 100,000 or more, some of the esophageal ones and things like that. So there's enough data, I think, that you could probably get a meaningful correction. Have you considered that?
Yeah, correction, right. So this is what I did sweep under the rug. I think what you're asking about is context-specific mutation correction. Right, so this model is done by sample or by cancer type; we tried both. I originally expected a huge boost from doing it by sample. Of course you're working with less data. We didn't see a huge boost so far. It's working, but the improvement is not to the point that it's night and day. The problem I see here is that regional covariates are also sample-specific, and those are much more difficult to deal with, right? So this is why our idea at the moment is to try to find the signature of selection on the sample of genomes. The nice thing about a sample of genomes is that if you get your estimate of the mutation rate, or the distribution of the rate, right, then locally it's Poisson. So if I sum across samples, it's still Poisson. So I can model a Poisson with a randomized rate for this specific set of samples rather than for a cancer type or an individual genome, right?
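The property being relied on here, that independent Poisson counts add up to a Poisson count with the summed rate, can be checked numerically. The sketch below convolves two Poisson distributions and compares the result with a single Poisson; the per-sample rates are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def summed_count_pmf(k, lam_a, lam_b):
    """P(A + B = k) for independent A ~ Poisson(lam_a), B ~ Poisson(lam_b)."""
    return sum(poisson_pmf(i, lam_a) * poisson_pmf(k - i, lam_b)
               for i in range(k + 1))

# Hypothetical expected counts at one element in two tumor samples.
lam_sample_1, lam_sample_2 = 0.4, 1.1
for k in range(5):
    by_convolution = summed_count_pmf(k, lam_sample_1, lam_sample_2)
    single_poisson = poisson_pmf(k, lam_sample_1 + lam_sample_2)
    print(k, round(by_convolution, 6), round(single_poisson, 6))
```

The two columns agree, so samples with different rates can be pooled at a given element as long as each sample's expected count is modeled correctly.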
And this solves a bunch of problems. I think we cut power with this approach, but we definitely cut down some false positives.
Great point, Mark. Quick question for you. You point out the need to take into account the chromatin state, the epigenetic state, to effectively account for the rate of mutations. Could you bypass that, or improve on that, by simply expanding the size of the nucleotide context looked at for defining signatures of mutations? Because right now we're restricting to three nucleotides, which doesn't really relate to the biology that's specific to what goes on in regulatory elements, for instance.
Right, so there are two different stories, right? One is, again, context specificity. The association of mutation rate with chromatin, I believe, is not only a context-specificity phenomenon. It's not only driven by sequence features. And you can see this by looking at different genomes and different cells where DNase hypersensitivity was measured. So we look, for example, at breast-specific elements versus a liver genome, and liver-specific elements versus a breast genome. And it is clear that it is the activity of the element, actually the openness of chromatin or tight binding by transcription factors, that drives a lot of this, right? Some of this is indeed driven by the context. There is a very recent paper in Nature Genetics by Ben Voight suggesting, in germline data, that a 7-mer context outperforms just the neighboring nucleotides. I think it's a convincing paper. Of course, the only reason they could have enough power is that they don't use de novo mutations, they don't use cancer data, they use population sequencing. Whether this extends to cancer data sets, and whether extending the model to a larger context would improve the signal, I don't know; we never tried. And we never tried because of data sparsity: you just can't learn the model.
I was wondering, how do you choose what your wrong tissue type is, and how much of an effect does that have?
How do we choose what, sorry?
Like your wrong tissue type?
You're right, so this is, again, the complicated part, and this is why I believe our models, the simple models, not the hierarchical models, don't fully work. So you can take DNase data and you can say, okay, I have, for example, breast cancer and I have liver cancer. We look at a whole bunch; I'm just giving you two as an example. So I can have the following sets of DNase peaks, right? Shared; specific to breast, not in liver; specific to liver, not in breast. And I can look at all three separately, and I can see how my signatures, my mutational model, change. So this is what we've done. Now the caveat here, of course, is that this takes into account sequence features, it takes into account some general features. But if the activity of the element is driving the mutational process, we cannot take this into account. The other consideration, and this is what Kelly noted to me: say there is a liver cancer element which is not active in breast tissue. How do you know that in some breast cancer samples this element is not open, is not working? I don't, right? So at this point, I give up and move to modeling the overall distribution parametrically, rather than thinking about what's happening in a specific element, because I just don't think this is doable at this point in time.
Great, thanks. I have a question, and I'm not from cancer genetics. But it seems to me that within cancer, the mutation rate would be heterogeneous between the cells of a tumor, and much more homogeneous within normal tissue. Am I correct?
Sorry, I'm not sure I-
The cancer cells are all in different stages of differentiation or mutation and so on. And therefore the mutation rate in each cell, and the mutation potential in each of those cells, would be different.
Right, that's an excellent question.
So first of all, we do not look at heterogeneity for now. We don't have single-cell resolution polymorphism data in tumors. I'd love to have these data. We are not early adopters of new-technology data sets, so we think we should wait a little bit. Now, what we see are mutations that are either fixed or reach very high frequency in tumors. We think most of them are early. And again, we're thinking about modeling the sample, not the individual progression. Now, I was extremely interested in the model where mutation rate evolves with progression through stages. The reason I was interested is our interest in the evolution of mutation rate, because you can imagine that secondary selection, selection on mutation rate, may be much more efficient than selection on any individual deleterious change. I don't think we saw evidence of that. So we looked into this. I think I presented this last year at this meeting. I don't think there is sufficient evidence to corroborate the model. Again, theoretically that's a beautiful model, but I'm just not sure we see evidence of that type of heterogeneity in the data.