Okay, welcome to the module on gene fusions. I hope everyone had a great lunch. In this module, we're going to learn about the impact of gene fusions in cancer and about the different types of evidence for gene fusions, focusing mainly on RNA-seq. We're going to learn a little bit about how the available detection methods work and what types of false positives they produce, and also how to assess a gene fusion's potential function. So first, we'll define a gene fusion as a novel gene formed by the fusion of two distinct wild-type genes. In cancer, of course, these are produced by genomic rearrangements. The canonical example is the BCR-ABL1 fusion that defines chronic myelogenous leukemia. This is an interesting rearrangement because it was basically the first somatic change linked to cancer, identified some 40 or 50 years ago. It's also one of the success stories, because we have drugs that target it. So we know that gene fusions are relevant to clinical features in cancer. They're diagnostic markers; BCR-ABL1 is our canonical example of this. They're targetable; we have a drug, imatinib, that targets BCR-ABL1. And there has also been a lot of excitement recently about discovering new gene fusions. This was spurred by a few things. About five or six years ago, TMPRSS2-ERG fusions were discovered in prostate cancer. Before then, fusions were thought to drive only leukemias, sarcomas, and other rarer cancers, but then they were discovered in solid tumors. So there was a hope that for the more common solid tumors we could also find gene fusions that could be targeted the same way we target BCR-ABL1. At about the same time as the discovery of TMPRSS2-ERG, new platforms became available for discovering novel sequences, including gene fusions; those include RNA-seq and genome sequencing.
So we know that in some cancers gene fusions are initiators of carcinogenesis, because they correlate with cancer phenotype. If you successfully treat some of these cancers, the fusion products are eradicated and are no longer detectable in the bloodstream. We see in mouse models that gene fusions produce neoplastic disorders, and in cell lines, if we silence the gene fusions, we reverse tumorigenesis in some instances. We can think of a couple of classes of gene fusions. One is upregulation of proto-oncogenes. These are genes whose transcription is normally very tightly regulated. If we translocate, for instance, MYC to a different locus that juxtaposes it with a new promoter, resulting in overexpression of MYC, then that can result in tumorigenesis, and that's exactly what happens in Burkitt's lymphoma. The other type of fusion we can get is a chimeric gene, where a novel fusion gene is created that co-opts the domains of two wild-type genes. So it's taking functionality from two different genes and creating a new gene. If we look at the genes that commonly form gene fusions as partners, we see a few different classes. Tyrosine kinases are quite common. Transcription factors are another common class of gene fusion partners. And then, more generally, oncogenes that are normally tightly regulated but become upregulated because of a gene fusion are another class of partners. If we map out the connections between genes, creating a network in which two genes are connected by an edge if they form a gene fusion, then they form a scale-free network. That is basically telling us that there are a few genes that are very promiscuous, forming gene fusions with a very large number of partners, while the majority of genes have just one or two partners. In terms of the genomic effects of gene... Oh, yeah.
So when a gene forms a partner, is there any specific sequence that's just causing other... having partner preference or it's just... Yeah, the question is, is there a specific sequence that is associated with these fusions, maybe at the breakpoint? I don't think that's the case. If you look at the breakpoints themselves, there could be a signature of the mechanism by which the actual genomic DNA was broken and then rejoined. So there could be a sequence signature there. But in the fusion transcript itself, the genomic breakpoint is usually spliced out. So what you end up with at the fusion boundary is just the sequences of each gene. Like, gene A forms a fusion with B, but A is not forming with C or D or anything. There is a preference for A to form a... Yeah. So is there any sequence similarity between the genes? I see what you mean. So I think the preference that we're seeing in how these gene fusion networks come about is basically because of the function of the genes, not necessarily sequence. At the sequence level, there's no similarity. It's just that, at a higher level, the fusion is bringing together groups of exons that have some specific functions. Okay, so we can leverage three different signals to identify gene fusions: chimeric DNA sequence, RNA sequence, and expression change. For the first approach, expression arrays, this is actually how the TMPRSS2-ERG fusions were discovered five or six years ago. I think RNA-seq was just coming about, but in this case they used expression arrays with something called COPA, which actually hasn't really been used since this discovery, because RNA-seq came on the scene and people started using that. Anyway, they basically looked for outlier expression and used that to identify a candidate set of genes, and then they further restricted their analysis to one particular set of genes.
So that was based purely on expression. Of course, we can also do genome sequencing. This has led to the discovery of some fusions in colorectal adenocarcinomas, but most of the fusions discovered have not been found with genome sequencing, I think because it's a little bit more expensive. We also don't get expression information, so we can find translocations, but we don't know whether a translocation is actually resulting in a fusion product, that is, a fusion transcript that's being expressed and then turned into a protein. And then mRNA-seq, I think, is currently the method of choice for discovering fusions. The benefits here are that it's relatively inexpensive compared to whole genome sequencing, and it gives us information about expression, so we can find the chimeric fusion transcript and also see whether that fusion transcript is highly expressed and therefore maybe relevant to the biology of the cancer. Okay, so going into some specifics of RNA-seq, this is going to be a little bit similar to what Jared talked about, so I'll go over it quickly, and then you guys can ask questions. The difference between RNA-seq and, say, whole genome sequencing is just in how we select the molecules that we're going to put on the sequencing machine. For RNA-seq, we're isolating mRNA by doing a poly-A pull-down and then doing reverse transcription. One of the important things here is reverse transcription. During that process, we take single-stranded mRNA and turn it into double-stranded cDNA, but for the resulting cDNA molecules we don't know which of the two strands came from the original mRNA, so we lose strand information in most cases. There are library protocols that get around this, but I don't think they're quite as common as the regular mRNA-seq libraries, for which you lose the strand information. So now we have a collection of reads that we've sequenced, and some of them may come from fusion transcripts.
And we can classify the types of reads that would come from a fusion transcript. We get wild-type reads, which basically just appear as if they come from one gene or the other. We get what we call spanning reads, or I think Jared called them discordant reads, where one end maps entirely to one gene and the other end to another gene. And then we get split reads, where the actual read itself is split by the fusion boundary. All of these are evidence for gene fusions; I guess wild-type reads are evidence just in terms of the expression information that they yield. The spanning and split reads? Yeah. Because it looks like spanning reads will span the gene fusion boundary? Yeah. And the split read? So for a split read, the fusion boundary occurs within the read sequence itself, not within the unsequenced portion of the fragment in the middle. That's the distinction. But the spanning read also has both colors, right? Yeah. So both of them are discordant, you could say, in that they can't be mapped to one continuous location in the genome. But there's a small difference in that for the spanning reads, if we independently map each end, we can get a full contiguous alignment of each end to the original transcripts. For the split reads, it's a little bit more difficult to map them, because there's a fusion boundary in the middle. Okay, I'll talk a little bit about assembly, but it's not used as much for fusion discovery, so we're going to talk mainly about alignment. I guess I can contrast assembly and alignment by saying that there are two parts to the process by which you predict fusions or, in general, aberrant sequences. The first is to cluster reads that support the same change or mutation, and the second is to align sequences to the reference and compare those sequences. And we're kind of reversing the order of these between alignment and assembly.
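The three read classes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular fusion caller works: the `Alignment` record and its fields are hypothetical, and it assumes each read end has already been aligned to gene sequences, possibly in more than one piece.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    """One aligned piece of a read end (hypothetical record, for illustration)."""
    gene: str        # gene the piece aligns to
    read_start: int  # first aligned base in the read (0-based)
    read_end: int    # one past the last aligned base in the read


def classify_pair(end1: List[Alignment], end2: List[Alignment]) -> str:
    """Classify a read pair as 'unmapped', 'split', 'spanning', or 'wild-type'."""
    genes1 = {a.gene for a in end1}
    genes2 = {a.gene for a in end2}
    if not genes1 or not genes2:
        return "unmapped"
    # Split: one sequenced end aligns in pieces to two different genes,
    # so the fusion boundary lies inside the read itself.
    if len(genes1) > 1 or len(genes2) > 1:
        return "split"
    # Spanning: each end aligns to a single gene, but the two genes differ,
    # so the boundary lies in the unsequenced middle of the fragment.
    if genes1 != genes2:
        return "spanning"
    return "wild-type"
```

For example, a pair with one end in BCR and the other in ABL1 would come back as "spanning", while a pair in which one end aligns partly to BCR and partly to ABL1 would come back as "split".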
In assembly, we first group the reads together into contigs by looking at reads that overlap, and then we align the contigs to the reference genome to find out where they differ and nominate changes like gene fusions. With alignment, we do the mapping first, independently for each read, to find alignments that support a mutation, or a translocation if the reads are discordantly mapped, and then we cluster those after the fact. I'll mainly talk about alignment because the majority of the tools are alignment-based. Okay, so with alignment, we have the problem of having these discordant reads that we have to map back to the genome, or assign back to genomic loci. If we can do that, then we can nominate these fusion transcripts. The problem here is that some of our reads are going to be split by introns. So you can see on the... Do we have a laser pointer? Oh, that's perfect. Okay, so this read on the right is split by this intron in gene Y. So this is going to make mapping difficult. And on the left here, this read is split by the fusion boundary. So in order to fully assign this read to the genomic loci that it originated from, we'd have to take all of these different segments and find their locations in the genome. We can make the problem a little bit easier by assigning these reads to gene sequences themselves. If we assume that we know all of the gene models, so we know all of the wild-type transcripts, then we don't have to deal with the fact that this read on the right is split by this intron in gene Y. And that kind of alludes to the fact that the choice of reference is important for RNA-seq, and this is not in the...
This is on the wiki, in a slightly updated version of the slides; I apologize for changing the slides. But this basically shows that these guys were comparing Ensembl, RefGene and UCSC, and generally comparing using both the transcriptome and the genome as a reference against using the transcriptome only or the genome only. What they show is that using both the transcriptome and the genome as the reference to which we align means that about 95% of the reads have a mapping to either the transcriptome or the genome, compared with about 90% of the reads mapping if we map only to the genome and not to any of the gene models. So combining those two references is going to give us more power, and that's what most of these tools do. Yeah. If we lose the strandedness of the transcript when we actually turn it into cDNA, how do we know whether the gene is on the positive or the negative strand? Whether that transcript corresponds to a gene on the positive strand... That's a really good question. I think you have to infer it from additional information. I think one of the strongest signals you can get is this: say you predict a fusion contig and then you align it back to the genome. If, when you align it back, the splicing looks like the splicing of an existing gene, then that's a good sign that it's on the same strand as that existing gene. I think we'll have an example of that in the lab. On the topic of splicing, does the mapping to known genes take splice variants into account? In these... In fact, what you choose... Yeah. So, for instance, in this study they're mapping back to all of the splice variants that are in these particular databases. So if the intron that your read aligns across is in that database, then it should map reasonably well.
But of course, with novel splicing, that's going to be pretty much the same problem as finding a fusion read, except that it's restricted to a region the size of a gene. Yeah. Okay, so when we think about this problem, we can think of the alignment problem as being quite easy if we're just looking for exact matches. It gets a little bit harder when we're looking for alignments with some indels and mismatches. And if we're looking for non-contiguous alignments, and maybe also accounting for indels and mismatches, this is perhaps the hardest alignment problem we can think of. Generally, what we do is leverage the fact that we know how to solve the easy problem in order to solve the hard problem. For instance, one of the strategies that is quite widely used is to just segment the reads. Given the original read sequence, we chop it up into, say, three pieces, and then hopefully individual pieces from this read align reasonably well to the reference genome, and we can leverage existing tools to align those segments back to the genome. We can also do something where, instead of modifying the reads, we modify the reference. So if we know a priori that there's a translocation between two genes, then we can try to align our reads to a merged reference, basically one that contains just those two gene sequences. A combination of these approaches, which also takes information from paired-end reads, is I think what's most common. If we can independently map this green end to gene X and this orange end to gene Y, then we know that there is perhaps a gene fusion involving gene X and gene Y. We don't know exactly where the fusion boundary is, but then we can use the previous approach, where we in some way create a pseudo-reference out of gene X and gene Y and then do a more sensitive alignment to get a split read that tells us the exact fusion boundary between gene X and gene Y.
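To make the last step concrete, here is a minimal sketch of locating a fusion boundary once spanning reads have nominated gene X and gene Y. It is a deliberate simplification and not the method of any specific tool: it uses exact string matching only, whereas real aligners tolerate mismatches and indels. The function name `find_split` and the `min_anchor` parameter are made up for this example.

```python
from typing import Optional, Tuple


def find_split(read: str, gene_x: str, gene_y: str,
               min_anchor: int = 8) -> Optional[Tuple[int, int, int]]:
    """Find a position k such that read[:k] occurs in gene_x and read[k:]
    occurs in gene_y, with at least min_anchor bases on each side.

    Returns (k, boundary_in_x, boundary_in_y), where boundary_in_x is one
    past the last gene_x base used and boundary_in_y is the first gene_y
    base used, or None if no exact split exists.
    """
    for k in range(min_anchor, len(read) - min_anchor + 1):
        x_pos = gene_x.find(read[:k])
        if x_pos == -1:
            # If this prefix is absent, every longer prefix is absent too.
            break
        y_pos = gene_y.find(read[k:])
        if y_pos != -1:
            return k, x_pos + k, y_pos
    return None
```

When the bases flanking the boundary are identical in both genes, several values of k are equally valid; this sketch simply reports the smallest one.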
And I guess this slide is more for reference as to what the available tools use. So in the first column, most of them produce an exact sequence. There are quite a few of these fusion discovery tools, as you can see, and I think this is probably not all of them. Even one of the ones that we're using in the lab, STAR-Fusion, is not on here. The second column is mostly methodological: whether or not the reads are segmented before they're mapped back to the genome. The third column is whether or not they leverage paired-end information, and the fourth column is whether or not they use an approximate reference scheme, as we're showing here. An alternative to this idea is that, once we nominate gene X and gene Y as being fused, we can just concatenate all of the pairs of exons between gene X and gene Y and look for reads that map exactly to those exon boundaries. That's this column. And on the far right is whether or not they search for an exact fusion boundary while accounting for the possibility that there could be mismatches in the reads that map to the fusion boundary. There are a few assembly-based tools; I think Trans-ABySS and Trinity have been used to find fusions in various studies. Obviously, there are two steps to the assembly method: you assemble some contigs, and then you have to sift through those contigs to identify the ones that are fusions. This has been done using GMAP, Dissect, or Barnacle. Based on this one study that independently evaluated a number of these tools, you can see that the most sensitive tools also produce a lot of results. The two most sensitive are TopHat-Fusion and ChimeraScan. In this study, ChimeraScan produces 13,000 fusion predictions and TopHat-Fusion produces 136,000. So there's still a bit of work to be done in terms of producing a reasonable number of fusion predictions with good accuracy.
So my own tool, deFuse, is reasonably sensitive but thankfully produces only 900 fusion predictions, which is still quite a lot to parse for one sample. And they also showed in the same paper that you can't rely entirely on simulated data for understanding the sensitivity of these methods; the real data is too complex. Now, sources of false positives. We have alignment artifacts, which are a huge source of false positives, and chimeric reads that come from the molecular biology. We can get template switching by the reverse transcriptase, and we can also get ligation artifacts. These are usually random, so they don't produce a large number of reads, and just filtering out predictions that have few supporting reads gets rid of them. And then, of course, we have natural sources of rearrangements, such as immunoglobulin rearrangements. If your sample contains a lot of immune infiltration, that'll be a problem. There are also transposons, mitochondrial insertions, et cetera. Okay, so the solution we've found for alignment artifacts is to calculate features of the supporting alignments and then use either heuristic filters or machine learning techniques. That's what we used in deFuse, although those techniques have been applied in pretty much all of the fusion discovery tools. I'll go through a few of the features that best distinguish false positives from true fusions. What we're showing here is a histogram, with the positive examples in green and the negative examples in red. I'm just showing that there's some separation between positives and negatives for each of these features. The first feature is how well the reads are distributed across the fusion boundary.
In a true fusion, we expect that the aligned reads should be well distributed across the fusion boundary, because there's no reason that a particular location should capture all of the reads unless they're PCR duplicates or an artifact. The second feature that we can calculate is whether or not the fusion boundary coincides with an exon boundary with a known splice signal. What usually happens is that you get a translocation with breakpoints in the middle of two introns, so it brings together two intron sequences. Then, in the fusion transcript, that chimeric intron is spliced out. So what you end up with in RNA-seq is an exon boundary fused to another exon boundary. If you then look in the genome at the fusion boundary, you should see the splice signal on one side and the matching splice signal on the other, so GT to AG, which is the most common. The next one is the same as with the genome rearrangements that we saw in the previous lecture and lab. If you have too many possible alignments for your supporting reads, then it's a good sign that you have an artifact; either that, or you can't reasonably tell which of the possible mapping locations is your actual gene fusion. Usually these don't validate. Another thing people do, in both assembly-based and mapping-based methods, is to look at how well the reads align to the assembled fusion contig. That can include whether or not the paired-end reads align with a distance between them that is what we expect given the fragment length distribution. This is pretty much the same as one of the other slides, where we were looking at how well the spanning reads are distributed across the fusion boundary. And finally, another strong signal of a false positive is where we can't cleanly assign one part of the fusion contig sequence to one gene and a distinct part to the other gene.
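Two of these features are simple enough to sketch directly. The code below is illustrative only and is not taken from deFuse or any other tool; the function names and coordinate conventions are assumptions made for this example. Both genomic sequences are assumed to be given on the transcribed strand, and contig spans are half-open intervals in contig coordinates.

```python
from typing import Tuple


def has_canonical_splice_signal(seq_5p: str, donor_pos: int,
                                seq_3p: str, acceptor_pos: int) -> bool:
    """Check for GT just after the last 5-prime exonic base and AG just
    before the first 3-prime exonic base.

    donor_pos    : index in seq_5p of the first intronic base
    acceptor_pos : index in seq_3p of the first exonic base
    """
    donor = seq_5p[donor_pos:donor_pos + 2].upper()
    acceptor = seq_3p[max(0, acceptor_pos - 2):acceptor_pos].upper()
    return donor == "GT" and acceptor == "AG"


def overlap_fraction(contig_len: int,
                     span_a: Tuple[int, int],
                     span_b: Tuple[int, int]) -> float:
    """Fraction of the fusion contig that aligns to BOTH genes.

    span_a and span_b are the (start, end) parts of the contig that align
    to gene A and gene B. A large value means the contig cannot be cleanly
    divided between the two genes.
    """
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    return max(0, end - start) / contig_len
```

A prediction with a canonical splice signal and an overlap fraction near zero would look more like a true fusion under these two features; what threshold to apply to the overlap is a judgment call that this sketch does not settle.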
If we have a significant amount of overlap, say 75% of the contig maps to one gene and an overlapping 75% maps to the other gene, so that 50% is overlapping, then this is a good sign of a false positive. And I'll show you an example of this in the lab. Okay, so we also have natural sources of rearrangement. A good way to get rid of these is through database searches. And we also have transcription-induced chimeras, or read-throughs. What happens to produce these is that the gene is not rearranged, but when it's being transcribed, a transcription stop site is skipped and transcription reads through into the next gene. So those two genes are co-transcribed, and it looks as if there's a gene fusion between adjacent genes. These are very common, even in benign samples. Yep. And are those transcription-induced chimeras easily filtered out by most of these programs? Yeah, in some of them you have an option to filter them, and in some of them they're flagged. I think all of the people who have designed these tools are cognizant of them, because they're very common. They make up the bulk of the predictions. Okay, so when we want to prioritize real gene fusions, we can look at a number of things. Expression: see if the fusion is highly expressed, particularly if the 3-prime gene is highly expressed, as that's often the one that carries the function in the fusion gene. Recurrence: whether or not it's seen across multiple samples. We can look for a corroborating rearrangement if we have any information about the genome. We can look at the function, particularly of the 3-prime gene: is it a kinase, and could it serve as a drug target? And we can look at whether or not the function is preserved by the fusion.
So, whether or not the breakpoint occurs within an intron, and also whether or not it preserves the reading frame of both genes. That's a little bit complicated, and I'll go into it in one of the next slides. Looking at expression, we can look not only at whether or not the 3-prime gene is highly expressed, but also at whether or not the expression is interrupted, that is, whether we see a large discontinuity in expression across one of the genes for which we've predicted a fusion. Here, for instance, in HNF1A, we see that expression starts right at the breakpoint that we predicted from the mRNA sequence. Before this breakpoint, there's pretty much no expression of the 5-prime exons, probably because only the fusion version of HNF1A is being expressed, and the fusion version has only these last 3-prime exons after the breakpoint. We can also look at things that are recurrent. Here's a good example where they looked across multiple tumor types, prostate, thyroid, et cetera, and found fusions that were recurrent in the sense that a single partner, BRAF, was always fused to another 5-prime gene, and they would have actually missed this if they hadn't looked across multiple cancer types. On this slide, I'm just looking at two different data sets that show that read-throughs are quite common in benign samples. On the right, we have maybe 20 fusions that we've predicted in LNCaP, and then we've looked across some benign and other tumor samples, and we see that the ones shown with this blue annotation here are all recurrent. The only one that is between adjacent genes and is not recurrent is this one, which is also associated with a deletion between those two genes.
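The two checks discussed here, whether a predicted fusion joins adjacent genes and whether it recurs across samples, can be sketched as follows. This is a rough illustration under stated assumptions, not the logic of any particular tool: the `Gene` record is hypothetical, and adjacency is approximated by a distance cutoff (`max_gap`, an arbitrary value) rather than by checking that no other gene lies in between.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gene:
    """Minimal gene annotation (hypothetical record, for illustration)."""
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int


def is_readthrough_candidate(five_prime: Gene, three_prime: Gene,
                             max_gap: int = 100_000) -> bool:
    """True if the 3-prime gene lies shortly downstream of the 5-prime gene
    on the same chromosome and strand, as expected for a read-through."""
    if five_prime.chrom != three_prime.chrom:
        return False
    if five_prime.strand != three_prime.strand:
        return False
    if five_prime.strand == "+":
        gap = three_prime.start - five_prime.end
    else:
        gap = five_prime.start - three_prime.end
    return 0 <= gap <= max_gap


def recurrence(calls_by_sample: Dict[str, List[Tuple[str, str]]]) -> Counter:
    """Count in how many samples each (5-prime, 3-prime) gene pair is called."""
    counts: Counter = Counter()
    for pairs in calls_by_sample.values():
        counts.update(set(pairs))  # count each pair at most once per sample
    return counts
```

A pair that passes `is_readthrough_candidate` and also recurs in benign samples would be a natural one to flag rather than discard outright.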
So generally, these read-throughs are recurrent, not necessarily across cancer types, but across tumors and across benign samples, and they're generally not considered to be drivers of oncogenesis, although there's one counter-example to that. That counter-example is SLC45A3-ELK4. This one was discovered in prostate cancer, and it has been associated with cancer cell proliferation. So not all read-throughs are something that we should discount. Related to read-throughs, another thing we can look at to identify true fusions is the distribution of which exons are fused in the two genes. We see that for the majority of read-throughs, it's the second-to-last exon of the upstream gene that is fused to the second exon of the downstream gene. So that's a good sign that perhaps this is not a functional gene fusion; it's just a missed transcription stop site. And, yep. Yeah, I think a promoter exchange would be more likely to have one of the first few exons of the upstream gene, the 5-prime gene, fused to one of the first few exons of the downstream gene. So that would... I'm not sure if there's a picture of that on this particular slide, actually. So the point of this graph is that read-throughs mostly look like second-to-last to second, whereas the ones that are actually created by intrachromosomal translocations show any distribution of exons to exons. Okay, reading frame is also important for assessing function. Say we have a fusion gene where the exons come together such that we break the original wild-type transcripts exactly at a codon boundary in both the 5-prime and the 3-prime gene. In this obvious, simple case, we'll definitely preserve the reading frame, and the peptide sequence after the fusion boundary will be the same as what is produced from a wild-type copy of the 3-prime gene.
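The frame check itself comes down to modular arithmetic. As a minimal sketch, assuming we know how many coding bases the 5-prime gene contributes before the boundary and where in its own coding sequence the 3-prime gene resumes (both parameter names are made up for this example), the 3-prime frame is preserved exactly when the two counts agree modulo 3. Breaking both genes at a codon boundary is the special case where both remainders are zero.

```python
def three_prime_in_frame(cds_bases_5p: int, cds_offset_3p: int) -> bool:
    """Return True if the 3-prime gene keeps its original reading frame.

    cds_bases_5p  : number of coding bases retained from the 5-prime gene,
                    counted from its start codon up to the fusion boundary
    cds_offset_3p : 0-based offset, within the 3-prime gene's own coding
                    sequence, of the first base retained after the boundary

    A 3-prime base at original CDS offset m ends up at fused-CDS offset
    cds_bases_5p + (m - cds_offset_3p), so its codon position is unchanged
    exactly when (cds_bases_5p - cds_offset_3p) is a multiple of 3.
    """
    return (cds_bases_5p - cds_offset_3p) % 3 == 0
```

This only checks the frame. It does not check whether the boundary falls in coding sequence at all, or whether the codon formed at the junction happens to be a stop codon, both of which matter for function.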
And correspondingly, we can also think of an example in which we don't break exactly at a codon boundary but in the middle of a codon in each gene, in such a way that when we bring these two transcripts together we produce a new codon at the fusion boundary, but the subsequent codons are exactly as they should be in the original 3-prime transcript. In these two cases, we're preserving the reading frame of the 3-prime gene and preserving the function. And then, of course, in other cases, if there's a mismatch in how the codons are broken, we get nonsense after the fusion boundary in the 3-prime gene. Since the 3-prime gene is the one that's often associated with bringing the majority of the function to the fusion gene, that's quite important. Okay, so we can also look at the associated rearrangements to tell us a little bit about the fusions. For instance, this is an example that we found in a prostate cancer where we had what we thought were two independent fusions, one involving SHANK2 and one involving MYC. Looking in the genome, we see that there's actually a complex rearrangement that simultaneously breaks three different chromosomes at four different loci, rearranges them, and produces these two fusion transcripts. So that tells us a little bit about how these fusions were formed. Another thing we can look at is the structure of a single breakpoint. We can find examples of complex breakpoints, where we have insertions of disparate sequence at the breakpoint. In this case, it's not a simple breakpoint between SAMD12 and this gene, PHF20L1, because there's one kilobase of this other gene and one kilobase of this intergenic sequence inserted at the breakpoint. This is a good example of why we can't rely on whole genome sequencing alone, because we wouldn't discover the connection between SAMD12 and PHF20L1 just from the genome sequencing.
We'd just discover some independent breakpoints between all these other regions. Some considerations for experimental design. I think larger cohorts are always going to give you more power to detect rarer fusions, and I think fusions are often going to be rare, if we're being realistic. People have been looking for them a lot already, so it's unlikely we're going to find something that's in 50% of ovarian cancer tumors, because it would have shown up by now, for sure. So I think RNA-seq and fusion discovery hold a lot of promise for finding these rarer fusion genes that can be targeted in a way that's sort of patient-specific, and for that we need larger cohort sizes. And with a larger cohort, we can also filter artifacts. If we see something that's in, say, 75% of a cohort of 100 patients, a lot of the time we can actually filter it out, because the artifacts that come out of these RNA-seq experiments tend to be very prevalent across a large number of patients. Okay, so some encouraging results from a recent study: they discovered this ALK fusion in lung cancer, and now there are phase three clinical trials for crizotinib, which is looking like it could successfully treat this disease. And finally, I think in the future this field is going to move towards looking for the actual protein products of these fusion transcripts. In onco-proteogenomics, what people currently do is take mass-spec data, shown up here on the top left, and combine it with a nucleotide database to find short peptide sequences that they can map back to the genome, and then they use these peptide sequences to say which peptides are in their sample based on the mass-spec data. And so this is...
it's a very indirect way of trying to understand the protein content of a sample, because we have to know the nucleotide sequence that we're searching for. So some future studies are very likely to leverage this by predicting fusion transcripts from mRNA-seq, taking those fusion transcript sequences, and augmenting the nucleotide database with them so that they can look in their sample for the associated protein product. I'm sure that in the next few years there are going to be a couple of papers on this. All right.