Hi. So I'm a postdoc in the Rätsch Lab at the Sloan Kettering Institute, and I will be showing you some really cool results of our analysis on the TCGA data, where we're trying to take a pan-cancer view on alternative splicing. So let me give you a little bit of motivation here. This is a figure from a Nature review paper from 2012, and it shows you how, if you have alternative splicing, you can have a few genes where you have alternative 3' ends or exon skips, which may change certain domains in the gene and can in fact change the function of the gene. You can switch from having pro-apoptotic to anti-apoptotic functions, and you can also change function from being angiogenic to anti-angiogenic. So there's a certain motivation to look into these things as potential targets for cancer treatment.
We therefore set out with the following goals. We want to identify cancer-specific splicing patterns based on the TCGA RNA-seq data. We're also interested in identifying variants regulating splicing in the same genes, so trying to find cis-associations, essentially, and then also trying to find variants regulating splicing in other cancer genes. The TCGA data is particularly suited because it has RNA-seq data and the matching exome data. From the RNA-seq, we find and quantify the splicing events, and the exome data does in fact cover enough of the flanking regions into the introns to actually have variant calls there and try to find cis-associations.
However, there's one problem with this. Unfortunately, the TCGA data has not been uniformly processed, and so we set out to do the following, and this was quite a major effort on our side. We have basically gone back and re-analyzed all of the raw sequencing data. We have taken all the exome data and the RNA-seq data and designed our own pipeline to remap everything and do joint variant calling using the Unified Genotyper, but also trying to find somatic variants using MuTect.
And then we have done splicing quantification using SplAdder. It's a tool developed in the Rätsch Lab to find alternative splicing events and add them to the annotation of already known splice events. So let me walk you through our approach of finding new splicing events. We basically start off with building a splice graph from the known splice annotations, and then, using the RNA-seq data, we try to find new splice junctions where there is read support for new events. We then count the reads supporting an event. This is an example of an exon skip, so in this case we find the reads supporting this alternative exon and normalize by the reads covering the whole region, and this gives us something we call a splicing index, or percent spliced in.
And so here you can see an overview of our efforts. On the very left you can see the already annotated splice events, broken down into the different types. Here is what we have filtered down to; this is actually a very stringent filter, giving a set of events in which we have high confidence based on the data we have seen. And this is what we have generated using SplAdder, again filtered with the same thresholds, adding a lot of new splicing events to the already existing annotation.
And so in order to find cancer-specific splicing events, you can see here a heat map where we have broken down the splicing events by tumor type and sorted them. The color here represents the fraction of samples which support a specific exon skip, and on the left you can see a comparison of whether we see that particular event in the ENCODE data or in the normal RNA-seq data available from the TCGA.
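To make the splicing index and the per-tumor-type fractions a bit more concrete, here is a minimal Python sketch. It is not the lab's actual pipeline: the function names, the averaging of the two inclusion junctions, the `min_psi` threshold, and all read counts are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def psi_exon_skip(incl_left, incl_right, skip):
    """Percent spliced in (as a fraction) for one exon skip in one sample.

    incl_left / incl_right: reads on the two junctions that include the exon.
    skip: reads on the junction that skips the exon.
    Assumption: inclusion support is the mean of the two inclusion junctions.
    Returns None when no read covers the event.
    """
    inclusion = (incl_left + incl_right) / 2.0
    total = inclusion + skip
    if total == 0:
        return None
    return inclusion / total


def fraction_supporting(samples, min_psi=0.1):
    """Fraction of samples per tumor type whose PSI is at least min_psi.

    samples: iterable of (tumor_type, psi) pairs; psi may be None.
    min_psi is a hypothetical threshold, not one given in the talk.
    """
    counts = defaultdict(lambda: [0, 0])  # tumor_type -> [supporting, total]
    for tumor_type, psi in samples:
        if psi is None:
            continue  # event not covered in this sample
        counts[tumor_type][1] += 1
        if psi >= min_psi:
            counts[tumor_type][0] += 1
    return {t: s / n for t, (s, n) in counts.items()}


# Made-up read counts for a single exon-skip event in four samples.
samples = [
    ("BRCA", psi_exon_skip(30, 30, 10)),
    ("BRCA", psi_exon_skip(0, 0, 40)),
    ("LUAD", psi_exon_skip(5, 7, 6)),
    ("LUAD", psi_exon_skip(0, 0, 0)),
]
fractions = fraction_supporting(samples)
print(fractions)
```

One such fraction per event and tumor type would correspond to one cell of the heat map.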
Unfortunately, there are not many matched normals available, so you have to take this with a grain of salt, but you can definitely see a bunch of cancer-specific, in some cases possibly tissue-specific, alternative splicing events. So basically, this gets us to our first goal. We have analyzed the RNA-seq splicing events and detected new splicing events that occur frequently in specific cancer types. Of course, we still need some more independent validation, but there are very interesting cases which may be suited as potential targets for treatment.
So let me go to the next goal, which is trying to take a statistical genetics view on finding variants associated with alternative splicing. Here you can see on the left what people have done so far in the statistical genetics field in doing QTL analysis on next-generation sequencing data, with GEUVADIS being one of the most recent ones. If you look on the right, the different colors represent the different cancer types, so there is heterogeneity, but you already have quite a large data set available to do this type of analysis. So it is perfectly suited to understanding the tissue and cancer specificity of splicing, as well as finding cancer associations, which has been possible only to a limited extent due to the lack of power in previous projects.
However, there's a problem if you're doing this on this type of data. Of course, it's noisier: you have heterogeneity, you have the purity of the samples, and the many rare events due to the somatic mutations definitely cause a problem. I won't be talking much about rare variant analysis. We have done this as well; come talk to me if you're interested in this. But I want to focus here more on the common variant analysis, where we're basically taking the typical linear regression approach, but going a step further and using something called a linear mixed model.
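As a rough illustration of the plain linear regression step, the following Python sketch regresses a splicing index on allele dosage at one variant. It is deliberately not the linear mixed model itself: the mixed model would add a random effect for sample relatedness or structure, which is left out here. The function name and all numbers are made up for illustration.

```python
def ols_slope(genotypes, psi_values):
    """Ordinary least squares fit of PSI on allele dosage (0, 1 or 2).

    Returns (slope, intercept). The slope is the estimated change in PSI
    per copy of the alternate allele.
    """
    n = len(genotypes)
    if n != len(psi_values) or n < 2:
        raise ValueError("need two equally long inputs with at least 2 samples")
    mean_g = sum(genotypes) / n
    mean_p = sum(psi_values) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype does not vary across samples")
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, psi_values))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_g
    return slope, intercept


# Hypothetical data: exon inclusion drops with each alternate allele.
genotypes = [0, 0, 1, 1, 2, 2]
psi_values = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
slope, intercept = ols_slope(genotypes, psi_values)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A clearly negative slope, as in this toy data, is the kind of signal an association test would then assess for significance.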
So you're basically just looking for a correlation between an exon skip here and a variant you observe in your data. But what is often a problem is something we call population structure, where your data set is composed of patients from different populations, and you have a lot of population-specific variants. If you do a PCA on the germline data, you can see how your populations nicely separate. Doing this on the somatic variants, you actually don't get that type of separation, but nevertheless, the structure in cancer requires you to account for it. And if you now color this PCA by cancer type, you can see a little bit better how your cancer types cluster based on the somatic variants, which is quite interesting, and we believe that it is necessary to account for this when doing an association analysis.
So let me show you an example here. This is a cis-association in SNRPC. It encodes a protein which is essential for the formation of the spliceosome, so this is why it's particularly interesting. On the bottom here, you can see the gene structure with the three exons. This is the exon which is being skipped. You can see on the top right the different splicing indices. You can see how the inclusion of the exon goes down, or respectively up, from the reference to the alternate allele. These are the sample sizes we have in the TCGA data. And here on the left, you can see the SNPs and the associated P-values. Each color represents a cancer type here. And you can see how you have a nice cis-association right around the splice junction, which is associated with the change in exon usage in the data we have seen.
And so if you do this across multiple cancer types, you get to this nice plot here. So this is basically just a plot on a set of cancer census genes.
And we have tried to indicate here on the left the different cancer types. Each blue dot marks a cis-association within the cancer type for a particular gene. Particularly remarkable is MMAB, where we can replicate the cis-association in each of the cancer types just using a subset of samples. On the right side you can see the same thing for splicing-related genes. You can see here some of the interesting ones, like RBM25. And we can replicate a lot of these cis-associations across different cancer types. This is all at 5% FDR.
We also look at trans-associations, and this is an example. This is an interesting gene; it's directly related to p53-mediated apoptosis. If you look at trans-associations at the same FDR threshold of 5%, we find various other factors connected to alternative splicing in this gene. Here you can see that the loop itself indicates a cis-association, and you can see we found some other links in the data set supporting several trans-associations. This is a subset of what we have found so far. We find several trans-associations in various genes. Of course, most of the associations will be in cis; that is simply where you have the most power, and most of the signal is coming from that. And we certainly don't believe that all of them are right. They definitely still need some confirmation, and there's a certain error rate associated with it. But the truth will probably lie somewhere in the middle.
So let me go to conclusions. We have developed a new resource of novel and known alternative splice events, and we will try to make this available to the community as soon as we can. We identified cancer-specific isoforms that appear to be rarely expressed; you have seen that in the heat map. We have a slightly updated version now where we are actually accounting for tissue specificity.
We performed common variant associations to map splicing phenotypes, and the sample size in TCGA enables us to detect a bunch of trans-associations. Again, not all of them may be functional, or maybe even correct, but some of them certainly will be. So we're definitely looking further into that, and some of them still need validation. Again, particularly in trans, they are usually hard to find, and there certainly may be some spurious associations.
So let me acknowledge particularly this group, which has been working with me on this project. In particular, I want to emphasize Andre here. He has been particularly helpful in handling the next-generation sequencing data. We have been working with 3,000 samples, and it has been a huge amount of work. It's a relatively small group for this scale of project. And with that, thank you. I'm taking questions.
Two questions. One is, what's the frequency or prevalence of those kinds of specific splicing events? In other words, do you see a small group of genes with high enough frequency in tumors that they could be novel therapeutic targets? A second question is, did you do any association with somatic mutations, especially those in the splicing machinery?
So yeah, these are two questions. First, we're definitely looking into trying to filter down these events. Right now, we're just looking at the subset of genes which seem to be cancer- or tissue-specific. We're trying in particular to sort out what's tissue-specific and what's cancer-specific using the normal data available. As for somatic variant association, we are working on a rare variant association using the somatic SNPs, but we have also associated all the common somatic variants available in the data set. So there's a certain frequency threshold where you just have to cut off, otherwise you're just increasing the amount of spurious associations you get.
But for the common somatic variants, it's already available, and the rare variant analysis is currently in progress.
So I think on your first slide, you talked about recalling all the somatic mutations as well as splicing events using the same pipeline, pan-cancer. How important, in your opinion, is that, versus a lot of people using individual calls from different working groups?
It depends on what you want to do. We are particularly interested in also looking across cancer types. If you're doing that and using QCs which were done separately for each cancer type, there's definitely a difference. Particularly if you're looking at mutation calls, you can sometimes see that there's an increase of calls in certain cancer types. For example, all your lung cancers are hypermutators, right? And sometimes the thresholds have been more stringent, but you want to have uniform thresholds to actually analyze this. So I think for that type of analysis, it's important. It's not necessary for every type of analysis. It really depends on what you do.
Are you guys going to make those mutation calls available?
We can make the somatic variant calls available. The germline calls, however, are HIPAA-protected.
Okay, Aaron. Next speaker is Dmitry Gordenin from NIEHS, and he will talk to us about pan-cancer analysis of APOBEC mutagenesis.