Hi everyone, I'm Chai Banlamoudi. I'm a graduate student in Kevin White's lab at the University of Chicago. First, I want to thank the organizers for giving me the opportunity to present my work. I'm going to talk about the work we've been doing to identify recurrent fusions in the TCGA cohort. The first part of my talk is on the approach we take to identify these fusions, then I'll talk about some of the results, and I'll end with some of the functional validations we've done.

COSMIC maintains a database of known fusion genes. If you look at it, it's similar to some of the power-law distributions we saw earlier today. About 1% of them, less than 10% of them, have a frequency greater than 1%, which is kind of interesting. A few interesting things come out of this. If you look at tissue specificity, some of these known recurrent fusion genes, for example TMPRSS2-ERG, are very specific to prostate cancer, while fusions like FGFR3-TACC3 you see across multiple different tissue types. So there are a few future directions, like how this distribution is going to change with larger sample sizes and more sensitive methods to detect fusions.

There are two pieces of evidence that we need to make a fusion call. Reads whose mates map to two different genes are called discordant reads, and reads that map across the fusion junction we call anchor reads. The specificity of an anchor read is determined by its anchor length, which is the minimum of the two overhang regions, left and right.

As you know, one of the main challenges in fusion discovery is the high false positive rate. This is primarily because of misalignment of reads to the fusion junction, and it can happen in several different ways.
First, the read that maps to this junction can instead get aligned entirely to one of the two genes of the fusion. It can also align to another gene, call it gene C, or to another transcribed locus. It's incredibly hard to computationally identify these sorts of artifacts. So we attempted to control for some of these artifacts, and we implemented this approach in a method called MOJO. It identifies fusions at canonical exon-exon junctions. The basic approach is similar to most of the other methods out there; the novelty is in the way we filter for these various artifacts and in the approach we take to alignment to identify candidate fusion genes. I have a poster on this, so I'm not going to go into detail; please feel free to come by.

The obvious question is how we evaluate this. I evaluated MOJO in comparison with nine other published methods using a compendium of 18 cell line transcriptomes that have validated fusions. There's a list of 137, 131 fusions that are known with experimental validation, and I'm comparing MOJO with the other published methods on these. I want to make the point that the parameters are kept consistent across all these methods. If you look at the methods that show similar sensitivity, you still see significant discordance between the fusions they call.

So we're showing very good sensitivity, but what about the false positive rate? Instead of false positives, I'm going to call these low-confidence fusions, because we don't have experimental evidence to say that they are false positives. You'll notice that some of these calls are shared by more than one of the methods; these could be polymorphic, or they could be new fusions that haven't been validated yet. But you see a lot of fusion calls that
are singletons, that is, specific to each of these methods. I should point out that some of these methods are designed to identify fusions involving intronic regions and breaks in the middle of exons, so they have a larger search space and a larger number of calls.

What I'm showing on the right is a false discovery rate plot. The anchor length is a proxy for confidence in a fusion call, and with an increasing number of anchor reads you expect to gain more specificity. But I say it's a proxy, and it's a weak proxy, because you can have a fusion call with only one anchor read simply because the fusion is lowly expressed.

We've taken this algorithm and run it on nearly 7,200 tumor transcriptomes. One thing I want to point out is that the median runtime across all these methods is between four and five hours for an 80-million paired-end read sample, run on a machine where each node has 24 cores and 32 gigs of memory.

Despite our efforts to control for false positives, since we are looking at a large number of samples, we're inevitably going to see some recurrent fusions that are false positives. The way we accounted for that is as follows. We ran MOJO in the highest-sensitivity mode, requiring only two discordant reads and one anchor read of 10 base pairs. Then, for singleton calls, the ones without recurrence, we require two proper anchor reads. We define a proper anchor read as a read mapping to the junction whose larger overhang maps to the gene other than the one its paired read maps to. For the recurrent fusions, we pool the reads across all samples and look for a roughly uniform distribution at the fusion junction, and we filter out reads that are biased toward gene A or gene B, which are most likely alignment artifacts. We then similarly require at least one high-stringency, high-confidence fusion call. So we compile these
fusion calls, and then, to account for some of the annotation artifacts in the human genome and transcriptome, and also to take out polymorphic fusions, we ran our algorithm on nearly 1,800 normal transcriptomes from the Genotype-Tissue Expression (GTEx) database and filtered out all calls that overlap with these, and similarly with the 674 TCGA paired normal transcriptomes. In the end we identified 15,600 high-confidence somatic fusion calls.

Are we capturing all the known ones? I compiled a list of fusions from the different marker papers, and we are capturing most of them. The only one we missed is a non-canonical fusion that has a break in the middle of an exon. Interestingly, we're also capturing some of the fusions with much higher sensitivity. For example, we are finding FGFR3-TACC3 at 20% frequency in glioblastoma, significantly higher than what's been reported before, and we're finding some of the other fusions at somewhat higher frequency as well. It's interesting to note that 11 of the 26 tumor types we analyzed have at least one FGFR3-TACC3 fusion.

Looking at the distribution of these fusion calls, breast and ovarian have a large number of samples, so you see a higher number of calls, most likely also because they have a lot of copy number alterations. If you look at known COSMIC and novel fusions, approximately 9% of all the fusion calls involve a COSMIC gene, that is, a known cancer gene or known fusion gene. Then there is the distribution of the number of fusions per sample across the different tumor types. Some of the solid tumors have a higher number of fusion calls per tumor, and some tumor types, like thyroid and adrenal, show very few fusions.

One thing to look at is whether these fusions are caused by local rearrangements or by distal, long-range rearrangements, like translocations or
rearrangements farther than 10 megabases. The interesting thing you see, again with thyroid cancer, is that it has far fewer rearrangements, but most of them are distal. Some of the leukemias, as expected, show a higher number of translocations. If you look at the distribution of samples that have at least one fusion, or more than one fusion, you see a pattern where tumors that are very highly rearranged, like ovarian, have a large number of fusions. Some of them have more than 20; these are candidates for chromothripsis and so on. Another thing is that more than 75% of the thyroid tumors do not have a fusion, yet quite a large fraction of them have an unknown, potentially driver, fusion. So again, this is the same pattern with ovarian; I'm just reiterating the large number of rearrangements in the ovarian tumors. You see fewer rearrangements but a large number of translocations here, and with thyroid, fewer rearrangements and many known cancer genes.

These are some of the novel fusions we found. There are some interesting ones in here. For example, this fusion involving BMPR1B is seen in breast, ovarian, and prostate, which are all hormone-driven cancers. Some fusions, like ESR1-CCDC, are found in the MCF7 cell line too. Some of them are interesting, and we're following up with functional validation on some of these.

Our approach for functional validation is that we synthesize a fusion construct, package it into viral particles, create stable cell lines expressing the fusion construct, and perform three different types of assays: proliferation, invasion, and evasion of apoptosis. I'm going to show some preliminary data on 11 fusion constructs we built using this pipeline in the MCF10A cell line. We did proliferation assays and assayed two different phenotypes, EdU and CellTiter-Glo, one for DNA synthesis and one for ATP metabolism. So
these are some of the results. We have two positive controls here, FGFR3-TACC3 and PML-RARA. You can see that some of these fusions show greater proliferation than FGFR3-TACC3. You can argue that MCF10A is not the endogenous environment for FGFR3-TACC3, yet it's interesting. We are replicating these assays in NIH3T3 cells. You'll notice one oddity here: we have two isoforms of this fusion, one without a DNA-binding domain and one with a DNA-binding domain, and the one with the DNA-binding domain shows a significant increase in proliferation.

In summary, we're repeating these assays in NIH3T3 cells, we're overexpressing the full-length individual genes as controls, and we'll then follow up with further functional validation, analysis, and so on. Another thing I'd bring up is that recurrently fused genes, not the fused gene pairs but the individual genes, are also interesting. And finally, integrating copy number alterations, mutations, and fusion events is of significant interest, looking at the bigger picture of genomic instability and what drives these alterations.

Finally, I want to thank Patien Lin, a postdoc in our lab, who spent an incredible amount of time and effort setting up these experimental validations and following up on them. I want to thank my advisor, Kevin, and our helpful lab mates, the Bionimbus team for providing the TCGA data, and the Beagle supercomputer for providing computing resources. Thank you very much.

I'm interested in what the influence of the depth of RNA sequencing would be on your discovery. With the amount that we've usually been doing in TCGA, has the number of reads maxed out your discovery rate, or can you model whether you'd do a lot better with much deeper RNA sequencing?

Absolutely. The deeper you sequence, the more clonal events you tend to pick up. A large fraction of our fusions have only one or two supporting reads, and
we're able to pick them up because we have some samples that are deeply sequenced, or because in those samples it is actually a clonal event rather than a subclonal event.

So then, using that, can you model, if we were going to do a new project, how deep we would want to go?

Sure. If you have an estimate of the clonality, or what sort of clonal events, or the purity you expect for a certain tumor type, you can model it. We have enough data to look at that.

Have you tried to validate some of the fusions by RT-PCR followed by Sanger sequencing?

That's a very good point. We don't have material for some, or most, of these, so the way we are approaching it is by looking at the whole genome sequencing.

I ask this question because I'm a little bit concerned about some calls, in particular the FGFR3-TACC3 fusion in GBM. We have been studying this fusion for quite a long time, and we definitely know that the frequency is not 20%. The problem might be that in some batches of RNA-seq tumors there are real positives with very high expression of the fusion gene that contaminate, at a lower level, all the other samples in that batch. So I would suggest that you look at the possibility of doing the validations.

Yeah, contamination is pretty hard to...

I wish you were right in that case, because that would actually make therapeutic trials so much easier if we had that frequency, but unfortunately that's not the case.

I think we need to move on. We'll take the remaining questions after the talk or at the poster session. Our next speaker is Audrey Foo from the University of Chicago, presenting on widespread genetic epistasis among cancer genes.
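
A note on the evidence model described in the talk: the anchor length (the minimum of the left and right overhangs of a junction-spanning read) and the highest-sensitivity thresholds (two discordant reads, one anchor read of 10 base pairs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MOJO's implementation; the names `JunctionRead`, `anchor_length`, and `passes_high_sensitivity` are hypothetical, and the "proper anchor read" and pooled-uniformity filters mentioned in the talk are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JunctionRead:
    """A read spanning a candidate fusion junction (hypothetical structure)."""
    left_overhang: int   # bases aligned on the gene A side of the junction
    right_overhang: int  # bases aligned on the gene B side of the junction


def anchor_length(read: JunctionRead) -> int:
    # As defined in the talk: the minimum of the two overhang regions.
    return min(read.left_overhang, read.right_overhang)


def passes_high_sensitivity(
    n_discordant: int,
    junction_reads: List[JunctionRead],
    min_discordant: int = 2,    # "two discordant reads" from the talk
    min_anchor_reads: int = 1,  # "one anchor read" from the talk
    min_anchor_len: int = 10,   # "of 10 base pairs" from the talk
) -> bool:
    # Count junction reads whose anchor length meets the minimum.
    n_anchor = sum(1 for r in junction_reads if anchor_length(r) >= min_anchor_len)
    return n_discordant >= min_discordant and n_anchor >= min_anchor_reads


if __name__ == "__main__":
    reads = [JunctionRead(60, 15), JunctionRead(70, 5)]
    print([anchor_length(r) for r in reads])
    print(passes_high_sensitivity(2, reads))
```

Taking the minimum rather than the sum reflects the point made in the talk: a read is only as specific as its shorter side, since a short overhang can match many places in the transcriptome by chance.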