Okay, we'll go ahead and jump into the next session. We're really just blending sessions three and four into one, since we're all on the same theme. Ekta Khurana is going to continue on the themes from the first few talks, and then we're going to blend into functional analysis with Tim Reddy and Matt Friedman.
I would like to start by thanking the organizers for inviting me to join such a great list of speakers. I'm going to continue on the cancer theme that you heard before the break, so a lot of this will be a revision of what you heard, because it's in a similar spirit. But hopefully I can talk about some of the things we are doing in the lab, so I'll be discussing the methods that we are developing. I will start with this slide. I think everybody in this room knows that the cost of whole-genome sequencing has been dropping, but the reason I show this slide is to stress the availability of cancer whole genomes. In 2008 we had the first cancer whole-genome sequence, and now we have more than 3,000 whole-genome sequences available just from ICGC and TCGA. Some of you might know about the effort called the Pan-Cancer Analysis of Whole Genomes, which is a collaboration between ICGC and TCGA. The main motivation behind this effort was that a lot of the cancer whole genomes that researchers have sequenced have not really been analyzed. The sequencing was done, but in many cases even the mutations were not called. I think the major reason is that, until very recently, people didn't know what to do with non-coding regions in cancer. Now, as you can see from this session and from the interest of everybody here, there is greatly increased interest in non-coding regions in cancer, which was the motivation behind this big consortium effort, which is now close to a thousand people, I think.
So my lab and I are very active members of this effort. Variant calls are going to be ready soon from 3,000 whole genomes, and then we are going to analyze them and hopefully have results from this scale of data sometime soon. Some of the slides that I present are going to echo, like I said, what the speakers before me said about non-coding variants in the cancer genome. This is something that Matthew also showed: we know that most of the variants found when you sequence cancer whole genomes lie in non-coding regions, not surprisingly, because most of the genome is non-coding. This is showing the ratio for all these different cancer types from TCGA, and as you can see, in some samples 96 to 99.9% of variants lie in non-coding regions. As you have been hearing throughout this session and even yesterday, there are many modes by which non-coding sequence variants can have an impact on disease, in this case cancer, but one of the most common modes is the disruption of transcription factor binding. The most famous example, the one that I think made non-coding regions popular in the cancer community, is that of the TERT promoter. In 2013 two papers were published back to back in Science that described highly recurrent mutations in the promoter of the TERT gene in melanoma. This started a wave of everybody looking at non-coding regions, and I think within the next two years there were about 50 papers reporting TERT promoter mutations in many different types of cancer. This is a table from Killela et al. that shows that the TERT promoter is mutated in all these different cancer types, and in some at really high frequencies, like 79.1% here.
As a result, TERT promoter mutations have actually made it to the clinical stage: for example, the Memorial Sloan Kettering panel, which tests for cancer mutations in large cohorts, tests for the presence of TERT promoter mutations too. Another famous example that was highlighted recently is the creation of a MYB motif that drives TAL1 overexpression in T-ALL. These were a few of the early examples, and now, as you already heard in the three wonderful talks before me, we are finding many examples in the non-coding genome. I'm going to talk about some of the methods that we have developed and what we are identifying using those methods. Before I start on the methods, there is a message that was echoed by all the talks before, and especially in Shamil's talk, about the covariates of mutation rates. As Shamil explained very beautifully, there is a lot of heterogeneity in mutation rates across the genome, and there are factors that lead to this heterogeneity, especially in non-coding regions. In the example that Shamil also described, regions of open chromatin generally show lower mutation density, but if you look within those regions at the exact places where transcription factors bind, they show higher density. This was shown beautifully in two recent papers in Nature: transcription factor binding impairs the binding of the nucleotide excision repair proteins, which is why you see this specific increase in mutation density in the cancer types where nucleotide excision repair is a prominent repair mechanism, namely lung cancer and melanoma.
These phenomena are very new. These papers were published this year, so we are just beginning to understand the covariates of mutation rates in this complex non-coding genome. A message that I'm repeating from the earlier talks is that we really have to account for these covariates when we build our computational models to identify drivers, because otherwise they can lead to a lot of false positive signals, which are hotspots of mutations but not necessarily drivers. The basic idea of how we look for drivers is to look for positive selection, which shows up as recurrence across multiple genomes. If this recurrence happens because of these mutational mechanisms and not because the mutations were positively selected, then we will conclude that there are many drivers, which is not true. I'm going to discuss two major methods. The first is something that I developed when I was a postdoc in Mark Gerstein's lab at Yale, which we call FunSeq, because it's so much fun to run, or because it stands for function-based prioritization of sequence variants. Then I'm going to discuss another method that we are developing in my lab, which combines the signals of functional importance and recurrence and accounts for the covariates to identify driver elements, and gives a p-value for how likely it is that a region is a driver. That one we are calling CompositeDriver. This is a slide combining some of the things that we heard yesterday and today about germline genomes, GWAS and somatic genomes.
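As a rough illustration of the idea described in this talk (a composite score built from functional impact and recurrence, compared against a background drawn from the same type of element, followed by Benjamini-Hochberg correction), here is a minimal Python sketch. It is not the CompositeDriver implementation: the function names, the toy scores and the simple resampling scheme are all assumptions made for illustration, and the real method accounts for more covariates than element type.

```python
import random

def composite_score(mutation_scores):
    """Sum of functional-impact scores over all mutations in one element.

    A site mutated in k samples appears k times in the list, so the sum
    equals (functional score x recurrence) added up over sites.
    """
    return sum(mutation_scores)

def empirical_p(observed_scores, matched_pool, n_draws=1000, seed=0):
    """Empirical p-value against a covariate-matched null (assumed scheme).

    matched_pool holds functional scores of mutations seen in other
    elements of the same type (promoters vs promoters, and so on).
    """
    rng = random.Random(seed)
    observed = composite_score(observed_scores)
    n = len(observed_scores)
    hits = 0
    for _ in range(n_draws):
        # Draw the same number of mutations from the matched background.
        null = composite_score(rng.choices(matched_pool, k=n))
        if null >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_draws + 1)

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy data: scores of mutations in two promoters and a pool
# of scores from other promoters.
promoter_pool = [0.5, 1.0, 1.5, 2.0, 1.0, 0.5]
p_values = [
    empirical_p([3.0, 3.0, 3.0], promoter_pool),
    empirical_p([0.5, 1.0], promoter_pool),
]
q_values = benjamini_hochberg(p_values)
```

Drawing the null only from elements of the same type is the simplest way to respect the point made above, that recurrence has to be judged against regions with comparable background mutation behavior.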
This is just to give you an idea of how we try to identify non-coding variants associated with cancer. For germline variants associated with cancer susceptibility, the most common approach is GWAS for common variants and whole-genome sequencing for rare variants. When we talk about drivers in the somatic cancer genome, we need whole-genome sequencing to identify mutations in the non-coding genome. After we have the data collected, there are statistical tests for enrichment. So this is a very broad outline of how we generally identify non-coding variants associated with cancer. Then there are many methods for computational prioritization and functional interpretation, including ours, FunSeq, which I'm going to talk about. This is followed by experimental validation to make sure that the prediction is really true and the variant has a functional impact, which hopefully will lead to a lot of translation to the clinic. It has led to one so far, but hopefully we'll see more soon. First I'm going to talk about our approach, FunSeq, which is for functional prioritization. I would just like to note that this approach can be used not only for somatic variants but also for germline variants, to infer the functional impact of non-coding mutations. The idea behind this approach was to look for regions in the non-coding genome that are under negative selection, and by negative selection I mean negative selection in germline genomes, not in somatic genomes. The idea was that if we understand the patterns of polymorphisms in healthy human genomes, then we can use those patterns to identify high functional impact mutations or, in the case of cancer, the driver mutations. But first we need to
understand which regions in the genome are normally under negative selection, that is, which regions do not tolerate mutations. I'm sure everybody knows that evolutionary conservation across multiple species has been used to detect signals of negative selection. Besides that, and importantly, we also included the signal across the human population from the 1000 Genomes data. The idea was that if we find regions that are depleted of common variants, or in other words enriched for rare variants, which is the same thing, then we have identified regions where mutations are going to have a stronger functional impact. We first did this for coding genes, because we know the functional impacts of mutations in coding genes well, and we wanted to be sure that the metrics work before applying them to the rather unexplored non-coding world. What you're looking at here is the fraction of rare alleles on the y-axis, and these are the different categories of coding genes. We saw a very clear signal: the genes in which mutations are not expected to have a big functional impact, the loss-of-function-tolerant genes, show a depletion of rare variants, whereas the disease genes, and especially the cancer driver genes at the end of the spectrum, show an enrichment of rare variants compared to the random background, which is here. These signals were small but statistically significant, which told us that we can use this metric in the non-coding parts of the genome to discover new things. So that's what we did. Like I said, this is germline, organism-level negative selection in non-coding elements. Here you're looking at the different categories of non-coding regions, all from ENCODE data. This is the random background, the fraction of rare
variants in the entire genome, and this is what you just saw for the coding variants. As you can see, the different non-coding categories (enhancers, DNase hypersensitivity sites, transcription factor binding sites, non-coding RNAs) show a small but statistically significant enrichment of rare variants, showing that they are under some selective constraint. This message was also shown very nicely in the previous ENCODE paper and in some other papers that came out around that time, before ours. The advantage that we had was that we had about a thousand samples from the 1000 Genomes Project, as opposed to the previous papers, which were working with hundreds of samples, so we had much more statistical power. That meant that besides looking at these very broad categories, we could start looking at higher resolution. For example, instead of looking at all transcription factor binding sites together, we could look at the binding sites of specific transcription factor families. Once you start zooming in, you see that the constraints differ a lot between transcription factor families. With this sample size we could go to even higher resolution, looking at the specific SNPs that conserve or break transcription factor motifs. I'm showing the result for two transcription factor families, but it was consistent across all the families that we checked. The consistent trend was that the SNPs that break the motif tend to be under stronger selection than the ones that do not. This was based on the PWMs, it is what we expected, and we saw a nice clear signal, which had also been shown in some other papers before. So it was good to see that this holds across the SNP data and all the transcription factor families. Then we could also look at the tissue specificity of these regulatory regions. So
what you're looking at here is the DNase hypersensitivity sites from all these different tissues. This was all data from John Stamatoyannopoulos and from ENCODE. These are the coding genes, namely the genes that are expressed specifically in certain tissues. On the y-axis is the fraction of rare variants, and as you can see, the fraction is different for the different tissues, so we see a lot of tissue-specific selective constraint too. What we saw is that ubiquitously expressed genes and ubiquitously bound regions generally tend to show stronger selection, that there is a difference in constraint among tissues, as you can see on the slide, and that constraints in coding genes and regulatory regions are correlated across tissues. We only had data for both the regulatory regions and the tissue-specific genes for six tissues, which are marked in the same color in this plot, but for the tissues where we had that data, they were correlated, so that was nice. Then the question that we wanted to ask, and I'm sure everybody wanted to know, is: what are the specific non-coding categories that are under very strong selection? Because even these broad regions were nowhere close to the coding genes, even if you look at the binding sites of specific transcription factor families. I don't have time to go into the details of the statistics, but this is all published work, so you can look it up. We divided the ENCODE annotations into about 700 categories, and we then permuted all these categories in the genome, keeping the underlying SNP structure constant to take linkage disequilibrium into account. We then identified the categories that come up as under negative selection relative to this null model, which we got from permuting the categories. So after we
had that, we looked at the top categories that are under very strong, significant selection, and this is what you're looking at. The top 25 categories, which we call the sensitive categories, show a high fraction of rare variants, of course, because these were the top categories that came up. If you look specifically at the very top, the top five categories, which we call the ultra-sensitive regions, they show a very high fraction of rare variants. An independent validation came from the fact that known disease-causing mutations from the Human Gene Mutation Database are strongly enriched in the sensitive and ultra-sensitive regions. Remember, we didn't use any disease data to arrive at these categories. The only two kinds of data we used were 1000 Genomes polymorphisms and ENCODE annotations. Based on those we said, okay, these categories cannot tolerate mutations, and then we looked at the known non-coding mutations in the Human Gene Mutation Database and found that they are strongly enriched there, providing independent validation that these are important regions. Of course the question is, what are these regions? They contain binding peaks of some general transcription factors, for example FAM48A, core motifs of some TF families like JUN and GATA, and DNase hypersensitivity sites in spinal cord and connective tissue. I find it very interesting that some of the top categories, the ones that are most resistant to mutations, are not transcription factor motifs, where we think mutations disturb the PWM, but rather DNase hypersensitivity sites and a transcription factor that does not bind a specific sequence. That suggests that mutations here may alter the chromatin structure, but we don't know that; this is just based on the categories that we get. Okay, so I'm going to
briefly talk about regulatory networks, because that's something we use in our approach, but I don't have time to go into a lot of detail. We constructed a regulatory network based, again, on ChIP-seq data from ENCODE. We knew the transcription factor binding sites for 119 TFs, and we knew the regions that are assigned to target genes. Assigning promoters to target genes is relatively easy, and assigning enhancers to target genes is more complex. There were a lot of talks about that yesterday, and Zhiping explained very beautifully what's being done now, so I won't go into the details, but this was a correlation-based method. Based on these connections we built a network that consisted of 119 TFs and 9,000 target genes, with 28,000 interactions between them. In this network we define the in-degree and the out-degree of the transcription factors and target genes, and what we saw consistently was that the genes that are connected to more genes in the network tend to be under stronger selection. We saw this using many different statistics, and the one I'm showing here compares loss-of-function-tolerant genes and essential genes. Essential genes tend to be more central and bigger, which means they're connected to more nodes, and the loss-of-function-tolerant genes tend to be smaller, because the size of a node is scaled by its total degree. This movie was made by Zena, who is also in the audience here. With that I have given you some of the big features, but there were other features that went into the scheme, which we call FunSeq. The idea behind the scheme is: we have identified all these regions in the genome that are resistant to mutations, so can we put all of this together into an algorithm to say, if you give me a set of mutations, which one of them is
more likely to have a strong impact? Just as an example, you start with the somatic variants from a cancer genome, which is thousands of variants. You first filter out the 1000 Genomes common variants. Then you look at the different functional annotations, basically from ENCODE. Then you look at the ones that lie in the sensitive and ultra-sensitive regions that we derived. Then you look at the ones that specifically disrupt transcription factor binding, and then at the ones that target highly connected genes in the regulatory network. In the end, if you started with thousands of variants and go through the scheme, you reach about five to ten variants per cancer genome. Now you are at a very reasonable number of variants, which you can take to the lab for functional validation to see whether they really have a functional impact. This part of the scheme can also be applied to rare germline variants; there is nothing specific to cancer. But if we have cancer genomes, we can also include the signal of positive selection, which is recurrence in multiple samples. That was the original FunSeq scheme. We then developed the scheme further into what we call FunSeq2, which was published in a Genome Biology paper in 2014. The idea was that instead of treating all these features as binary, we converted them into a weighted scoring scheme. The weights are again trained on the 1000 Genomes data: if a feature is more likely to overlap common polymorphisms in the 1000 Genomes data, then it's less likely to contribute to a strong deleterious functional impact. So the weight is derived from the probability of the feature overlapping natural polymorphisms. This is an entropy-based method, and
using this we get a weight for each feature that I described on the previous slide, and by summing those weights over all the features, each variant gets a score. This is available from this website, which is also a web server where you can upload your variants and get the results. If you want the latest scheme that we are using, with a more extensive set of annotations that also includes Roadmap annotations, which is the set we are currently using for PCAWG, you can get it from GitHub. Okay, so now I'm going to talk about the method that we are developing, which accounts for all these complex covariates that we heard about from Shamil and that I described a bit earlier. FunSeq looks at the functional interpretation, and we want to include the signals of positive selection from the cancer genome while accounting for all the mutational heterogeneity. We are calling this CompositeDriver because it's a composite signal based on the functional and recurrence signals. In the original FunSeq scheme recurrence was included naively, but now we are doing it in a more sophisticated fashion, which is to ask: given a number of mutations in a functional element, what is the likelihood that these mutations are more recurrent and more functional than you would expect randomly? For this random expectation we have to take care of all the covariates in the feature space. If we are looking at promoters, we have to compare with regions that are also promoters, and if we are talking about enhancers, we have to compare with mutations in enhancers. The basic idea of the scheme is shown here for a coding gene, where these are the exons, but it's the same scheme for promoters and enhancers. You have the mutations, we give a functional impact score to each mutation, and
then we look at the recurrence of mutations across the multiple samples, and we compute a composite score for each element, which is the functional impact score times recurrence. Then we ask: given this number of mutations, if I were to draw mutations from a random set that is informed by these covariates, so that the random set is picked from the same type of element in the feature space, what would I get? That's how we generate the null background, which gives us the p-value for the composite score being as high as we see for a given element, and the Benjamini-Hochberg method is used for multiple hypothesis testing. I think you have heard in a lot of talks that QQ plots are basically a way of checking that our test statistic is performing well and that the p-values are not inflated. As you can see, we get nice QQ plots here. This is the result for lung cancer samples, and we are applying this method to many different cancer types, in fact all the PCAWG samples. This is data from TCGA, and as you can see, for the coding sequence we get KRAS, which is a very well-known, very important lung cancer gene, and if you look at the promoters and the lincRNAs we get novel candidates. This shows that our scheme is working: we are getting the known candidates, and then we are getting novel candidates. I would just like to comment briefly that some of these candidates are indeed very interesting, for example the WDR74 promoter, which we also found in prostate cancer, as I'm about to describe, and which has now been reported in many other cancer types, including, I think, breast cancer in the recent Stratton paper. For lincRNAs we are identifying NEAT1 and MALAT1 in lung cancer and actually in many different cancer types. We are still doing further functional validation on those to see whether these mutations are
really having a driver effect. So that was the result for lung cancer, and this is the result from 180 prostate cancer samples, again with data from ICGC and these published papers. As you can see, the QQ plots look very nice, and the gene that we identify is SPOP, which is the only known gene that is significantly mutated. There are other important genes in prostate cancer, but they are important because of rearrangements, not point mutations, so this is the only important gene in that sense, and we identified it. We then identified new promoters and enhancers, and in the same spirit as what we heard already, we did functional validation using reporter assays. This is for the WDR74 promoter, which we actually reported in the Science paper. Remember that we didn't have a very big cohort at that time, in 2013, so we identified it purely based on the functional impact scheme, FunSeq. Now many papers that looked at larger cohorts, including William Lee's paper and, like I said, the recent breast cancer paper, are finding this promoter to be highly mutated. The other two examples are the red promoter and this gene's promoter, which has too long a name, but when we validated them we saw the same effect that we predicted based on our CompositeDriver scheme: here we see increased activity, and here we see decreased activity because of the mutations. With that, I would first like to thank the 1000 Genomes Functional Interpretation group, because I was leading this group as part of the 1000 Genomes Consortium and the FunSeq scheme was developed there. Yao, who is now at Bina, was a major contributor, and this was when I was in Mark's lab. I would then like to thank all the people in my lab and my collaborators at Weill Cornell, with whom we are actively applying this CompositeDriver scheme to many different cancer types and following up with detailed functional validation. And just a plug to say that I'm looking for postdocs. So thanks a
lot.
I think this question maybe applies to the other talks as well. So when you use the histone marks, I mean the different elements...
Sorry, can you start from the beginning? I can't hear.
Yeah, basically the question is: when you use the different functional regions to look at the mutations or variants, you basically use the normal tissue, right? In the cancer case, the cancer has a different landscape of histone modification marks. So how much information can you get if you use the normal as a reference, when in the cancer the enhancers or promoters can be changed?
Right.
How do you think about this kind of thing?
No, that's a very interesting question, and I would like to answer it in two ways. First, because of the nice work of Polak et al., and Shamil is here, you can see that the histone modifications show a nice history of the cancer genome landscape. Even though they might have changed, the relationship you see is between mutations and histone modifications obtained from the normal cell type. In their random forest models, where they predict mutation rates, the histone modifications come from the normal cell, not from the tumor cell. As the cell progresses towards the cancer state, the original histone modifications are those of the normal cell state, and that's where the somatic mutations start accumulating. So all the covariates that you see are related to the modifications of the normal cellular state: when you predict mutation density from histone modifications and DNase hypersensitivity, those are from the normal state. That's one comment. The second comment is about what we are doing right now, which is something I didn't discuss: we look at these elements in the normal state, and then we actually see how the
enhancers and the promoters themselves have changed. The main data source that we have right now is DNA methylation, so we can see which regions became hypomethylated or hypermethylated, meaning their regulatory activity changed. Fortunately we have this data: as I showed in the PCAWG slide, almost half of the PCAWG samples have matching methylation and RNA data, so it's a wonderful resource to see how the regulatory elements change from the normal state to the tumor state, and we are using that in our models. That's something I didn't discuss.
Yeah, because, as in the talk yesterday, for lung cancer you may just use the adenocarcinoma, but in lung cancer there is another type, squamous, and that cancer is very different. I don't think in the lung there is that kind of normal squamous tissue. So in that case, if you use histone modification marks from ENCODE, I don't know which tissue you are going to use.
Yeah, so that's a highly discussed question. I think Nicholas was asking that in the previous talk too. From Shamil's lab we have the work from Paz Polak, which is based on the mutations, but if you look at the expression profiles, that can also give you a very nice idea of which tissue you should use. Even though the expression of a lot of genes changes in the tumor sample, if you match the expression from the tumor with all the different GTEx tissues and do principal component analysis, the tumor tissues still aggregate with the normal tissue type. So you can choose based on that: okay, this looks like the tissue type that I should use.
Thank you. Yeah, and nice plug for DNA methylation.
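The last answer, choosing a reference normal tissue by matching the tumor's expression profile against normal tissue profiles, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a nearest-profile Pearson correlation on log-transformed expression instead of the principal component analysis mentioned in the answer, and the gene names, tissue names and expression values below are made up, not taken from GTEx.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def closest_tissue(tumor, references):
    """Pick the reference tissue whose profile correlates best with the tumor.

    tumor: dict mapping gene -> expression level.
    references: dict mapping tissue name -> dict of gene -> expression level.
    Only genes present in both profiles are compared.
    """
    best_name, best_r = None, -2.0
    for name, profile in references.items():
        genes = sorted(set(tumor) & set(profile))
        if len(genes) < 2:
            continue  # not enough shared genes to correlate
        # log2(x + 1) dampens the influence of very highly expressed genes.
        t = [math.log2(tumor[g] + 1) for g in genes]
        r = [math.log2(profile[g] + 1) for g in genes]
        corr = pearson(t, r)
        if corr > best_r:
            best_name, best_r = name, corr
    return best_name, best_r

# Hypothetical toy profiles (placeholder genes and values).
tumor_profile = {"geneA": 100, "geneB": 5, "geneC": 50, "geneD": 1}
normal_profiles = {
    "lung": {"geneA": 80, "geneB": 10, "geneC": 40, "geneD": 2},
    "liver": {"geneA": 1, "geneB": 90, "geneC": 3, "geneD": 70},
}
tissue, r = closest_tissue(tumor_profile, normal_profiles)
```

With real data one would use many more genes and could replace the correlation step with a PCA projection, as described in the answer; the correlation version is used here only because it is self-contained.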