Okay, the last section we're going to cover is another new section, probably the newest, and it's to do with gene fusion discovery. I need to give credit to Chris Maher, another faculty member at Washington University affiliated with The Genome Institute, who very generously let me use a lot of his slides. He's really an expert on fusion detection: he wrote one of the popular fusion detection algorithms and gives a lot of talks on the subject, so I shamelessly stole many of his slides. Just to remind you, we are on the final module, and this part won't be quite as much of a working example. Well, I hope it is a working example when you have time to run it, but it takes quite a while to run and I wasn't able to get it down to a bite-size, reasonable running time, so we're going to have to pretend a little bit with the tutorial. The learning objectives for this lecture: we're going to discuss the relevance of gene fusions in cancer (gene fusions are probably not only important in cancer, but that's certainly where you think of them most often), RNA-seq as one of the good approaches for detecting gene fusions, and some criteria for removing false positives and prioritizing your candidates. One of the biggest challenges with gene fusions is that no matter how clever your prediction method is, you generally always have more false positives than true positives, so you need to filter out the false positives, and then you still need to narrow down to the candidates that are actually likely to be interesting in terms of biology.
So just a quick introduction to gene fusions in cancer. They obviously have clinical significance: they can sometimes be ideal diagnostic or prognostic markers. For example, about 95% of people with CML, chronic myelogenous leukemia, harbor the BCR-ABL fusion. This is the fusion resulting from the Philadelphia chromosome, or Philadelphia translocation. There's another slide that explains what it does, so I'll save the explanation of BCR-ABL. Another reason fusions are interesting in terms of clinical significance is that they can be good targets for therapy; BCR-ABL is an example of that as well, and there are other fusions which indicate a good treatment course. There's also a kind of paradox amongst the common epithelial cancers: to date, roughly 80% of known gene fusions in cancer have been found within only about 10% of all human cancers. So what's been discovered is heavily biased towards blood cancers like lymphomas and leukemias, though we're starting to find them more and more in solid tumors as we get better at looking for them. The common epithelial cancers account for 80% of cancer-related deaths, yet contribute only 10% of known gene fusions, so there might be some interesting discoveries still to be made there. So how do gene fusions work, and what are the consequences of a gene fusion?
One, like the BCR-ABL example, is the classic case of creating a fusion protein: you have some exons of one gene being fused together with some downstream exons of another gene. In a case like BCR-ABL, you have an important tyrosine kinase domain here that's essentially being super-activated, constitutively expressed and activated, by the fusion with the other gene. That ramps up the kinase activity, causes all kinds of growth factor pathways to be activated, and leads to tumorigenesis. A related mechanism is when you have the regulatory region, so not the coding sequence but just the upstream region of a gene, being fused onto another gene. Here you have an oncogene like MYC being upregulated by being fused to the regulatory features of IgH. Both of these are fairly common mechanisms of fusions. That slide's really fuzzy, sorry. This is just to give a little history of fusion detection. In these slides, "fusion" and "chimera" are synonymous, interchangeable words. We actually started looking for fusions in Sanger data, before 454, so with longer sequence data. With long single reads, or short single reads in some cases, we were looking for a single read to span a breakpoint: reads that don't align properly to the reference genome, taking the unaligned reads and then trying to map each part of the partial alignment to where it goes in the genome. With Illumina reads alone, just short single reads, it's a little bit trickier because you don't tend to have those big long alignments, but the concept is basically the same.
You get some unaligned reads, or a read that's partially mapping but has mismatches at one end, and you go through a kind of iterative process to identify cases where you've got alignments against more than one chromosome, or against more than one part of a chromosome that appear too far apart. The concept hasn't changed much, but with paired-end reads we get an extra kind of evidence. It's too small to really see here, but you have two kinds of situations: one where a read actually spans the fusion breakpoint, which is similar to what we were looking at with single reads, and then paired reads which encompass the fusion. Those don't actually cross the breakpoint, but one read is on one side and the other read is on the other side. From that you get two kinds of evidence, the so-called spanning reads and the encompassing reads. This slide makes it a little easier to see. If this is the breakpoint here, we've got these mate pairs which are encompassing the fusion junction: these reads are on the left side of the fusion and these reads are on the right side. And then we also have some mate pairs where one of the mates is actually spanning the fusion, where the read is broken across the junction. Most of the fusion detection algorithms, including TopHat-Fusion, which we're going to look at, and ChimeraScan, which Chris developed, are looking for some amount of evidence from both of these categories. And you can detect some fairly complicated rearrangements with these approaches, not just a simple fusion. So a lot of gene fusions have actually been discovered using RNA sequencing. There's a list from a review in 2012 covering a wide range of different tumor types, and you'll notice the sequencing technology is listed: some of these were identified with whole genome sequencing, but the majority have actually come from RNA sequencing.
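To make the two evidence classes concrete, here's a minimal sketch (not TopHat-Fusion's or ChimeraScan's actual code) of how a mate pair could be classified against a candidate breakpoint. Coordinates are simplified to a single axis, whereas a real implementation works with alignments to two distinct genomic loci.

```python
# Hedged sketch: classify a mate pair relative to a candidate fusion
# breakpoint. Coordinates are on one simplified axis; real tools work with
# alignments to two distinct genomic loci.

def classify_pair(r1_start, r1_end, r2_start, r2_end, breakpoint):
    """Return 'spanning', 'encompassing', or 'neither' for one mate pair."""
    def crosses(start, end):
        # a read spans the junction if its alignment covers the breakpoint
        return start < breakpoint < end

    if crosses(r1_start, r1_end) or crosses(r2_start, r2_end):
        return "spanning"
    # encompassing: one mate entirely left of the junction, the other
    # entirely right, so the pair straddles it without crossing it
    one_left = min(r1_end, r2_end) <= breakpoint
    one_right = max(r1_start, r2_start) >= breakpoint
    if one_left and one_right:
        return "encompassing"
    return "neither"
```

A caller would run this over every pair near a candidate junction and tally the two supporting categories separately, since most tools set separate minimums for each.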
So RNA sequencing has been really useful for identifying novel fusion genes in tumors. Do you see them in non-tumor cells? I think people have done surveys of non-tumor cells, but you don't see that many of them. They definitely happen, just naturally, but you don't see as much genomic rearrangement in normal cells as you do in tumor cells; that's a feature of cancer. There are translocations, and sometimes those translocations cause fusions, which in many cases maybe don't matter: they just cause a nonsense product and don't cause any problems. But every once in a while, one of those translocations will cause a fusion that activates an oncogene, and then you potentially have a tumor. So actually we're going to talk about that as a kind of false positive. I don't know if I've read any papers about it; I doubt they're generally very highly expressed, but I think they could be. [Audience question, roughly:] Similarly, if you've got a normal tissue where you're not expecting a lot of fusions, can you use this to check whether sequences have been stitched together when they shouldn't be? We tried to develop an assembly ourselves, and when I BLASTed a random contig there were chimeric joins that shouldn't have been there. Is there a pipeline to flag that kind of artifact, rubbish DNA or something? I guess you might consider making a practice of running your RNA-seq data through a fusion detection pipeline, whether you expect there to be fusions or not. Then you'll start to get a sense of what's a reasonable number of fusions for a normal tissue of type X or whatever, and you might see some red flags if you're trying a new assembler and it's way off the map compared to what you've seen with other assemblers.
The only thing I would say is that these pipelines for fusion detection are much less polished than, for example, the TopHat/Cufflinks pipelines. They're also slower, and they're famous for producing crazy amounts of false positives. So the data is already noisy, and using it to assess noise will be a little bit tricky. With the fusion detection pipelines you tend to see a lot of manual massaging and post-processing going into getting down to, sometimes, finally deciding you don't believe any of them, or maybe you find one or two or half a dozen that seem convincing. But you're usually starting from lists of hundreds, thousands, or tens of thousands of potential fusions. So it could work, but you'd have to get a good sense of what the variability in the results looks like even when it's working properly on well-assembled data. Do you have to have a lot of coverage to account for tumor heterogeneity in the samples you're dealing with? Yeah, you do. Fusions are hard to detect: there's this one breakpoint and you're hoping to get a few reads spanning it, so already you need good coverage to give yourself a chance of detecting it. We see with structural variant (SV) detection at the DNA level that we miss a lot, or we're only now getting to the point where we feel confident we're catching them with 30, 40, 50x coverage. With RNA, if it's a highly expressed fusion, you get a lot more chances to detect it; if it's lowly expressed, you will quite likely miss it, just like you're going to miss SVs in DNA data at low coverage. And if you have tumor heterogeneity on top of that, it just compounds the problem. If the fusion is in a subclone at 10% frequency, or if you've got maybe 50% tumor purity and then heterogeneity within that, it's exactly the same argument you can make for rare splice forms and rare variants.
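The coverage argument can be put in back-of-envelope numbers. This is a hedged sketch, assuming each sequenced fragment independently covers the junction and using a Poisson approximation; the depth, purity, and subclone values are illustrative, not from any real dataset.

```python
import math

# Hedged back-of-envelope model: probability of seeing at least `min_reads`
# junction-supporting reads, treating the supporting-read count as Poisson.
# `junction_frac` is the assumed fraction of fragments at the locus that
# would cover the breakpoint; all numbers below are illustrative.

def p_detect(depth, purity, subclone_frac, junction_frac, min_reads=1):
    lam = depth * purity * subclone_frac * junction_frac  # expected reads
    p_below = sum(math.exp(-lam) * lam**k / math.factorial(k)
                  for k in range(min_reads))
    return 1.0 - p_below

# 100x depth, pure tumor, clonal fusion: detection is nearly certain
p_good = p_detect(100, 1.0, 1.0, 0.05)
# 50% purity and a 10% subclone: the same fusion is easy to miss
p_bad = p_detect(100, 0.5, 0.1, 0.05)
```

The point of the toy model is just that purity and subclone fraction multiply straight into the expected read count, so halving purity and dropping to a 10% subclone cuts the expected evidence twentyfold.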
But yeah, coverage really matters for detecting them, because they're tricky to detect. If you look at the default settings for TopHat-Fusion, I think they're only asking for a minimum of one spanning read and two encompassing reads, and a total of five reads of either type. So you can see how low they've set that, to hopefully catch things. I think if we had better data that consistently had more coverage, everyone would probably feel more comfortable with allowing as little as one spanning read as evidence, but the fact that the parameters are set that way is probably indicative of the suboptimal coverage levels you can expect. So, gene fusions discovered: we went through the different technologies, and this is a list of some of the tools that are available. TopHat-Fusion is the one we're going to try to run today, or look at how it would be run. Other ones I'm vaguely familiar with are deFuse, and ChimeraScan, which is the one Chris Maher developed and pretty much what we use at WashU; we're trying to get it automated in our pipelines, and he's running a lot of it through his own somewhat manual process in his lab. How does it differ from TopHat-Fusion? I think they're all conceptually similar, and how they differ is in the details; you'd have to read the papers and really study them to know. I know that ChimeraScan and TopHat-Fusion conceptually work very similarly. I think we have a comparison here. This was actually not from Chris's lab; someone else did this analysis, in a paper called "State-of-the-art fusion-finder algorithms sensitivity and specificity". They looked at a set of 27 validated fusions in a dataset and compared many of the fusion finders from the last slide, and they actually found that ChimeraScan was one of the best. A couple of things to note. First, none of them detect all 27 of the known fusions. The best we see is actually TopHat-Fusion finding 19, but it's got a bunch of them wrong.
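Going back to those defaults, a minimal read-support filter might look like the sketch below. The thresholds mirror the values as recalled in the lecture (at least one spanning read, two encompassing reads, five total); check the TopHat-Fusion documentation for the real parameter names, and note the candidate records here are made up.

```python
# Hedged sketch of a read-support filter over fusion candidates.
# Thresholds follow the defaults as recalled in the lecture; candidate
# records are invented for illustration.

def passes_support_filter(spanning, encompassing,
                          min_span=1, min_enc=2, min_total=5):
    return (spanning >= min_span
            and encompassing >= min_enc
            and spanning + encompassing >= min_total)

candidates = [
    {"fusion": "BCR-ABL1",    "spanning": 4, "encompassing": 9},
    {"fusion": "GENE1-GENE2", "spanning": 1, "encompassing": 1},  # too weak
    {"fusion": "GENE3-GENE4", "spanning": 2, "encompassing": 2},  # total < 5
]
kept = [c["fusion"] for c in candidates
        if passes_support_filter(c["spanning"], c["encompassing"])]
```

In practice this kind of filter is the very first pass; everything that survives still goes through the normal-subtraction and prioritization steps discussed below.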
ChimeraScan detects about the same number but, sorry, gets none of them wrong. I guess this is why Chris was happy to have this slide saying ChimeraScan is one of the best performing: basically everything it detects is a real fusion, and it's detecting almost as many as the best performer in terms of total detections. But none of them detect all of the fusions, and most of them have problems. So a big issue, like I said, is prioritizing the gene fusion predictions. You're going to run this and get a ton of potential fusions, and the first major task is to remove false positives. One thing you can do, if you're fortunate enough to have normal samples, is to exclude fusions that are predicted in the normal, if you're looking for tumor-specific events. In the absence of adjacent normal tissue, which is almost always (it's quite unusual to be lucky enough to have matched normal RNA-seq data), you can do things like run fusion detection on a compendium of normal tissues. Chris has suggested the Human Body Map project to identify fusions in normal tissues. If you see fusions coming up in normal tissue, or across a wide array of different tissues, it's more likely that they're false positives, and in any case not tumor-specific. So he recommends this; you can get that data from the Human Body Map project. [Audience question, roughly:] I was reading the paper, and I'm just wondering: they said that when they were assessing their false positive rate, they looked at a normal cell line and said a normal cell should have no fusions. But a normal cell could have fusions; not all fusions cause cancer, right? Yeah, it could have fusions. It's probably not strictly true that you would expect no fusions in the normal. Maybe it has none, but I don't know that they can certainly guarantee that.
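The normal-subtraction idea can be sketched very simply; the gene pairs below are illustrative, not actual Body Map results.

```python
# Hedged sketch of panel-of-normals subtraction: drop any tumor fusion call
# that also appears in normal tissue. Gene pairs are invented examples.

def filter_against_normals(tumor_fusions, normal_fusions):
    # sort each pair so A-B and B-A are treated as the same event
    normal_set = {tuple(sorted(pair)) for pair in normal_fusions}
    return [pair for pair in tumor_fusions
            if tuple(sorted(pair)) not in normal_set]

tumor_calls = [("TMPRSS2", "ERG"), ("GENEA", "GENEB")]
normal_calls = [("GENEA", "GENEB")]  # recurs across normals -> suspect
somatic_candidates = filter_against_normals(tumor_calls, normal_calls)
```

The sorted-pair normalization is a design choice worth noting: some pipelines instead keep the 5'/3' orientation, since a fusion and its reciprocal can be biologically different events.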
And I guess even if you have some normal cells with fusions, they won't necessarily express an RNA product. It's probably a generally safe assumption, but it's not 100%. What kind of rearrangements would you see in normal cells? Well, immunoglobulin genes need to rearrange, for example. So that's a source of a lot of false positives that you have to filter out. There could also be misassemblies in your data. But one convenience you have when studying tumors is this clonal expansion phenomenon, where something causes a tumor cell to proliferate like crazy, so you're actually getting a concentration of the very thing you're interested in. The main problem is that other events that happened right before that expansion also get carried along for the ride, so you have to figure out which one is the driver and which is the passenger. But you do have that advantage, compared to a population of normal cells where a few random stochastic events occur in some of them; they're not expanding, so they don't rise to the top. You might also get some fusions with very low coverage from ligation artifacts that can occur during library preparation, so that's another source of false positives. And it's important to remember that reads which support fusions may not agree with the fragment size distribution of the library. Actually, that's one of the ways we can spot potentially interesting reads in terms of a fusion: in IGV there's a mode you can turn on that colors reads that don't agree with the expected fragment size, and sometimes that's a good way of spotting regions where there's a rearrangement or fusion. Another thing you can do is look for expression imbalances. A lot of times that's what happens with a fusion; and in fact you can look not only for expression imbalances but also copy number imbalances.
At the breakpoint of the fusion it's quite common to see, in addition to the fusion, amplification of one side. So you'll see a very clear pattern in the Illumina coverage that looks like this: there's a certain coverage level, then a big spike to a higher level, and the breakpoint of that spike is the same breakpoint where you predicted the fusion. You'll also potentially get a spike in the expression levels. Here he's showing an example of expression for exons 2, 3, 4, 5, 6, 7 and so on, where there's a big spike right around exons 8 to 18: those exons are being massively upregulated and the rest aren't. That can sometimes be a way of confirming a fusion, if you see a change in expression that matches the predicted breakpoint from the fusion data. Yeah, it can be. I would definitely go and look at the expression estimates at an exon-by-exon level along with your fusion predictions. Another good way of prioritizing your fusions is recurrence. This is the most popular in a way; everyone would love to find a fusion that is recurrent across many samples, and that's an ideal candidate for further screening and potentially functional validation. You're not often that lucky, but if you're evaluating a set of 50 tumors and you see a fusion occurring in 10% of them, that's kind of like Yahtzee. A lot of different kinds of fusions can occur, and here we're showing some of the different situations. The common one people think about is an interchromosomal translocation, where you have a fusion between part of gene A on one chromosome and part of gene B, or in this case gene X, on another chromosome. You can also have interchromosomal complex rearrangements: when breakpoints occur and chromosomes are being translocated, there can often be inversions or amplifications as well.
And so the fusion product that results can be more complicated than just a simple end-to-end joining of exons from two genes. Similarly, you can have intrachromosomal rearrangements like deletions: a big chunk of the chromosome drops out and fuses two genes together that would normally be quite distant. Again, there can be complex rearrangements occurring within a chromosome that cause fusions. And then you can have read-throughs, which can be caused by splicing, where basically you're getting transcripts that just read from one gene into the next through, let's say, aberrant splicing. That can look like a fusion, and you may or may not be interested in those: some might be biologically interesting, some might just be noisy transcription that causes false positives in your data. This is a challenge, because read-through transcription actually occurs fairly frequently, much more frequently than the inter- and intrachromosomal rearrangements we tend to be interested in. Here's a population of different samples across different tissues, and these are the numbers of samples which have various read-through events compared to inter- or intrachromosomal events. The inter- and intrachromosomal rearrangements here appear generally in just one cohort, or even just a single sample, whereas the read-throughs typically occur in multiple samples. And you have to be careful, because even though a read-through occurs across multiple samples, sometimes it's at a fairly low level and it can end up looking tumor-specific just by chance. So you see it in a few of your tumors and not in your normals and think, ah, this is a tumor-specific event, but really it's not; you just didn't happen to see it in the normals.
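Tallying recurrence across a cohort, as described earlier, is straightforward; the sample names and fusion pairs here are invented for illustration.

```python
from collections import Counter

# Hedged sketch: count, per fusion, how many cohort samples carry it.
# Sample names and gene pairs are made up for illustration.

def recurrent_fusions(calls_by_sample, min_samples=2):
    """calls_by_sample: {sample_id: iterable of fusion pairs}."""
    counts = Counter()
    for fusions in calls_by_sample.values():
        counts.update(set(fusions))  # count each fusion once per sample
    return {pair: n for pair, n in counts.items() if n >= min_samples}

cohort = {
    "tumor1": {("GENE1", "GENE2"), ("GENE3", "GENE4")},
    "tumor2": {("GENE1", "GENE2")},
    "tumor3": {("GENE5", "GENE6")},
}
hits = recurrent_fusions(cohort)
```

A real pipeline would also compare these counts against the matched-normal or panel-of-normals counts, precisely because of the read-through caveat above: recurrence alone doesn't prove tumor-specificity.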
Another piece of evidence you can use to prioritize or validate your fusions is the distribution of exon-exon junctions across the different classes. There are certain biases: read-through events are biased towards skipping the last exon of the five-prime partner, which is just a common pattern of read-throughs, whereas the inter- and intrachromosomal rearrangements are biased more towards apparently random exon-exon fusions. Many of the recurrent read-throughs are thought to be likely splicing events, so you might be interested in those or you might not be. A third way of prioritizing your fusions is to look at the predicted functional effect, or to see if they're functionally recurrent. If there's selective pressure to alter a gene in order to achieve a similar consequence, you might see that same gene participating in fusions again and again. So you'll see the same three-prime partner appearing with different five-prime partners, because maybe these are various genes where just the regulatory elements are being attached, and the main thing is up-regulation of the three-prime partner gene. You tend to see that with BRAF, for example: there are a lot of fusions where BRAF is being fused and up-regulated by all kinds of different five-prime partners. The same can occur at the other end, where you have over-representation of a five-prime partner with different three-prime partners. So that can be another clue to the functionality of the fusion, and therefore make it higher priority. This slide illustrates that using BRAF. You can see BRAF being fused with a whole bunch of different five-prime partners, and these were occurring kind of spottily within different tumor types; people were just seeing one here and there. It wasn't until you looked across a large set that you realized this is an important and recurrent event occurring in a large number of different tumor types.
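A functional-recurrence check, flagging genes that keep appearing as the three-prime (or five-prime) partner with many different partners on the other side of the junction, can be sketched like this; the pair list is illustrative, not real data.

```python
from collections import defaultdict

# Hedged sketch of functional recurrence: flag genes seen with many distinct
# partners on the opposite side of the junction, like the BRAF pattern
# described above. Partner gene names here are invented.

def promiscuous_partners(fusions, min_distinct=3):
    """fusions: list of (five_prime_gene, three_prime_gene) tuples."""
    partners_of_3p = defaultdict(set)
    partners_of_5p = defaultdict(set)
    for g5, g3 in fusions:
        partners_of_3p[g3].add(g5)
        partners_of_5p[g5].add(g3)
    flag_3p = {g for g, p in partners_of_3p.items() if len(p) >= min_distinct}
    flag_5p = {g for g, p in partners_of_5p.items() if len(p) >= min_distinct}
    return flag_3p, flag_5p

pairs = [("GENE1", "BRAF"), ("GENE2", "BRAF"), ("GENE3", "BRAF"),
         ("GENE4", "GENE5")]
recurrent_3p, recurrent_5p = promiscuous_partners(pairs)
```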
So sometimes integrating with larger data collections can reveal functionally relevant fusions. In our fusion detection pipeline, whenever we see a fusion, even if it's a singleton, if it looks real we compare it against basically a database of previously reported fusions, and use that to assess whether it might be significant or interesting for that particular tumor. Whether the fusion is in frame or not can also tell you a lot about whether the fusion is important. Typically we're most interested in fusions that result in in-frame predicted proteins, because if the protein is going to do something and drive tumorigenesis, and especially if you're going to target it with a drug, you want it to be translated into a real protein. But it's not necessarily the case that you should exclude fusions which are not in frame. For example TP53, the tumor suppressor everyone knows is really important in cancer, commonly participates in fusions with different three-prime partners, and it's usually happening around the first exon. What's happening there is TP53 is being fused very early in the gene to various other partners, which effectively truncates it, and in this case it actually makes more sense for the fusion to be out of frame, because that causes a non-functional protein, and that's just a way of shutting down TP53, like the TP53 deletions you see. So another way of knocking out TP53 is fusing close to the five-prime end with some other gene, causing a nonsense product. Then there are gene annotation databases, which we mentioned briefly: basically, are either of the genes within the fusion interesting in terms of cancer? Do they involve a kinase in the three-prime position, following the pattern of BCR-ABL? Could they serve as a drug target? Have they been shown to be rearranged in other cancer types?
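The in-frame question reduces to codon arithmetic. This is a simplified sketch, assuming you already know how many coding bases of the five-prime partner are retained and the CDS offset of the three-prime partner's junction; it ignores complications like stop codons introduced at the junction itself.

```python
# Hedged sketch of the frame check: the fused transcript stays in the
# 3' partner's original reading frame when the retained 5'-partner coding
# length and the 3' junction's CDS offset agree modulo 3 (the codon size).
# Junction-introduced stop codons and UTR breakpoints are ignored here.

def is_in_frame(retained_5p_cds_bases, junction_3p_cds_offset):
    return retained_5p_cds_bases % 3 == junction_3p_cds_offset % 3
```

So a fusion keeping 300 coding bases of the five-prime gene and joining at CDS position 150 of the three-prime gene stays in frame, while keeping 301 bases shifts the downstream frame, which, as with the TP53 example, may be exactly the biologically relevant outcome.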
So when we get our final list of predicted fusion partners, we usually go through this exercise of asking: are they cancer genes? Are they thought to be, or known to be, druggable? And we search the literature and some databases to see if they've been previously implicated as rearranged. So the general pipeline, or flow, of what I just described: you start with your fusion predictions from TopHat-Fusion or ChimeraScan or some other software. You typically filter false positives, subtracting adjacent-normal predictions if you have adjacent normal data, and utilizing different filters. You look for recurrence, both overall recurrence across the patient set and maybe a screen for functional recurrence, which could be, say, genes from the same pathway being involved in fusions. Then for singletons you might prioritize based on whether they're in frame, or other criteria. And then you would typically validate in the index case, the case where the fusion was discovered, and then screen additional samples. So you might design a PCR experiment or a FISH experiment to go and look through a large set of maybe 100 or 1,000 other tumors of a similar type, hoping to detect that same event occurring elsewhere. And then of course there's functional validation: in cell lines, for example, you could express the fusion product and see what effect it has, and so on. TopHat-Fusion was chosen mainly because it comes with the TopHat 2 installation, so we didn't have to do any extra work there and thought it would be convenient, and it has a relatively simple installation and run procedure. The problem we've had with it is that it requires a lot of extra data files, including some very, very large ones, and it takes a long time to run. I don't think we would have had a much different experience with other software, though; I think that's just the way it is with fusion detectors.
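The flow just described, filter false positives, then rank by recurrence and frame, can be condensed into a sketch like this. The field names and helper logic are hypothetical stand-ins, since each tool has its own report format.

```python
# Hedged, condensed sketch of the prioritization flow: subtract normals,
# apply a read-support floor, then rank by cohort recurrence and frame.
# All field names and records are hypothetical, not any tool's real output.

def prioritize(candidates, normal_fusions, cohort_counts):
    ranked = []
    for c in candidates:
        if c["pair"] in normal_fusions:            # seen in normals: drop
            continue
        if c["spanning"] + c["encompassing"] < 5:  # too little read support
            continue
        # rank recurrent fusions first, then in-frame singletons
        key = (cohort_counts.get(c["pair"], 1), c["in_frame"])
        ranked.append((key, c["pair"]))
    return [pair for key, pair in sorted(ranked, reverse=True)]

calls = [
    {"pair": "GENE1-GENE2", "spanning": 3, "encompassing": 4, "in_frame": True},
    {"pair": "GENE3-GENE4", "spanning": 5, "encompassing": 5, "in_frame": False},
    {"pair": "GENE5-GENE6", "spanning": 1, "encompassing": 1, "in_frame": True},
    {"pair": "GENE7-GENE8", "spanning": 9, "encompassing": 9, "in_frame": True},
]
result = prioritize(calls, {"GENE7-GENE8"}, {"GENE3-GENE4": 6})
```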
So how does TopHat-Fusion work? At a very simplistic level, it starts with the initially unmapped reads. Reads that map perfectly to the genome and/or transcriptome we're usually not interested in; they aren't likely to be evidence of any kind of fusion. So we're looking for stuff that didn't map, or didn't map well, using the conventional mapping procedure. Then we take those initially unmapped reads, and in this case we have some clue, based on the segmented nature of TopHat's alignment, that maybe part of a read aligns to one chromosome and part aligns to another. A fusion database gets created, and you basically try to align reads across those candidate fusion breakpoints. So you have these unaligned reads, you create these segments, and then you go back to other reads and try to align them to those same fusion breakpoints, to see if you can build up further evidence for the fusion event. And there are several criteria that TopHat-Fusion uses, like we talked about. There are some reads that map properly along the transcript in that region, that don't support the breakpoint and just map normally there, and there are other reads that either span or encompass the predicted breakpoint. The latter are supporting reads and the former are contradicting reads, and it looks for a preponderance of supporting reads. It also has some other criteria, and this is one thing that really tripped me up when I was trying to make a simulated dataset with fusions: even stacking the deck and creating a really obvious fusion, I had a hard time getting it to pass these various criteria. It's looking for supporting reads that have an even distribution across the fusion breakpoint, and that distribution also has to cover a sufficiently wide window.
So if you have a bunch of reads crossing the breakpoint but nothing to the left or right, if it's too tight a distribution, it won't qualify, and if there are gaps it may be disqualified. I guess this is something they developed to avoid obvious or consistent false positives that result from other kinds of events besides real fusions. But it made it hard to make a fake dataset.
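That distribution criterion can be sketched roughly like this; it's a guess at the idea, not TopHat-Fusion's actual implementation, and the window and gap thresholds are made-up illustrative values.

```python
# Hedged guess at the kind of distribution check described above: the start
# positions of junction-spanning reads must cover a reasonably wide window
# without large gaps. Thresholds are invented illustrative values, not
# TopHat-Fusion's real parameters.

def evenly_distributed(read_starts, min_window=30, max_gap=15):
    starts = sorted(read_starts)
    if len(starts) < 2:
        return False
    if starts[-1] - starts[0] < min_window:  # too tight a pile-up
        return False
    # no big holes between consecutive read start positions
    return all(b - a <= max_gap for a, b in zip(starts, starts[1:]))
```

Under this toy rule, reads starting at positions 0, 10, 20, and 35 around a junction would qualify, while four reads stacked at nearly the same position, or two reads separated by a large gap, would not, which matches why a hand-built "obvious" simulated fusion with tightly stacked reads can fail the filter.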