All right. Good morning, everyone. I'm Brian Haas. I'm here from the Broad Institute, and I'm going to teach this morning's module on gene fusion discovery in cancer. Just to mention up front: the person who usually teaches this lecture is Andrew McPherson. He's done it at least once or twice in the past, and this is the first time I'm doing it here, so I apologize in advance if it's a little rough. I'm actually borrowing a lot of slides from Andrew, and I've basically injected a lot of my own material into them. So give Andrew credit where credit is due. Here's the contact information for Andrew. Here's my contact information.
So we have a few learning objectives on fusions. We want to be able to explore the impact of gene fusions in cancer. We're going to learn about the different types of evidence for gene fusions, and understand the available detection methods and the different bioinformatics tools that are available. And it's a messy problem. It's challenging. You can end up finding important cancer fusions, but you can also get lots of artifacts. One of the big challenges in the field is dealing with all the false positives, so we'll talk a little bit about that. And you should be able to assess a gene fusion's potential function: how might it be contributing to tumorigenesis?
First we'll define what a gene fusion is. A gene fusion is a fusion between two genes. How do you get it? Oftentimes what happens is you have a translocation between two chromosomes. This is an example of a balanced translocation. You can think of it as like a recombination event, but one that's not supposed to happen. It's not really recombination; some fracturing happens in the genome, and the repair system doesn't do a perfect job when it tries to put things back together again.
And you can end up with these kinds of translocations. In this case, you've got chromosome 9. This is actually the most famous fusion: the BCR-ABL fusion that you find in chronic myelogenous leukemia. About 95% of patients have this particular chromosomal aberration. It involves a translocation between chromosome 9 and chromosome 22, which puts the ABL gene and the BCR gene right next to each other and creates a fusion event. It creates a whole new chromosome, and they actually give it a name: the Philadelphia chromosome. So you end up with these two different chromosomes. 9 ends up being a little bit bigger than it was before, and 22 ends up being this chimera with the BCR-ABL fusion gene. So these are the kinds of things that happen. I guess this would be a balanced rearrangement, but not all fusions are due to these kinds of rearrangements.
Why are these important to us? Because these kinds of events are highly relevant to cancer biology. In many cases, these fusion genes generate fusion transcripts, which may or may not generate fusion proteins. And in any case, these fusions can drive cancer. They can actually be drivers of cancer. Like I said before, BCR-ABL is the most famous case, and again, it's in 95% of cases. But the nice thing about this one is that it can actually be treated. It creates a new kind of kinase, and we'll talk about this. It creates a fusion kinase that's basically unregulated, but you can treat it with a drug that is a kinase inhibitor, and that can be highly effective.
You find fusions in both liquid tumors, these are leukemias, and in some solid tumors. They're probably better known for liquid tumors like leukemias, but you do find them in some solid tumors as well. And in some cases, they can actually be the signature driving element of that particular cancer type.
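The reciprocal exchange described above can be sketched as a toy model. This is only an illustration under stated assumptions: the segment labels are hypothetical placeholders, not real coordinates, and each chromosome is reduced to an ordered list of named pieces running from the p-arm end to the q-arm end.

```python
def reciprocal_translocation(chrom_a, chrom_b, break_a, break_b):
    """Exchange the segments distal to each breakpoint.

    chrom_a / chrom_b are lists of segment labels ordered from the p-arm
    end to the q-arm end; break_a / break_b are the list indices at which
    each chromosome is cut.
    """
    der_a = chrom_a[:break_a] + chrom_b[break_b:]
    der_b = chrom_b[:break_b] + chrom_a[break_a:]
    return der_a, der_b


# Hypothetical, highly simplified segment labels (assumption, not annotation).
chr9 = ["9p", "9q_proximal", "ABL_5prime", "ABL_3prime", "9q_tip"]
chr22 = ["22p", "BCR_5prime", "BCR_3prime", "22q_distal"]

# Cut inside ABL on chromosome 9 and inside BCR on chromosome 22.
der9, der22 = reciprocal_translocation(chr9, chr22, break_a=3, break_b=2)

print("der(9): ", der9)
print("der(22):", der22)  # the Philadelphia chromosome in this toy model
```

In this toy model the derivative chromosome 22 carries the 5' part of BCR directly next to the 3' part of ABL, which is the adjacency that gives rise to the fusion gene. Nothing is lost or gained overall, which is what makes the exchange balanced.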
It's just like the hallmark of that cancer type, such as the BCR-ABL case with CML. In prostate cancer, in about 50% of prostate cancers, you find a TMPRSS2-ERG fusion. This would be in the category of the solid tumors. You find another kinase fusion, EML4-ALK, in non-small cell lung carcinoma. It's only 4% of the cases, but this is relevant because for the 4% of patients that have this diagnostic feature, you can actually try to treat it with kinase inhibitors. There's evidence that it can improve the patient outcome. There's another tumor, which is actually kind of a rare cancer: fibrolamellar hepatocellular carcinoma. It's difficult to say. But 100% of these tumors are driven by this fusion. And there are some others too that we'll talk about that are like this, where one fusion is the defining hallmark for that cancer. And then there's a certain brain cancer, glioblastoma: 8.3% of patients have this FGFR3-TACC3 fusion.
So it's important to be able to identify these, because then you know what kind of cancer you're dealing with. Not all cancers are the same. You might have a cancer type like a brain cancer, and depending upon what's actually driving that brain cancer, it's going to lead to different treatment options. Prognostics are important too. What's your expected survival if you have these different kinds of events driving the cancer? Some cancers are more dangerous to have than others, and a lot of times it's just going to depend on whatever the event is that's driving that tumor.
So what is the evidence that these gene fusions are initiating cancer? Well, for one, they correlate with the cancer phenotype. We know that you can successfully treat some of these, like with the kinase inhibitors in CML, or even in lung adenocarcinoma: the fusion involves a kinase, you treat with kinase inhibitors, and oftentimes that can improve the outcome.
You can take that gene fusion, put it in a cell, inject that cell into a mouse, and see if it develops into a tumor. That's more proof that the fusion is playing a role in driving that tumor. There are also experiments where you try to silence it. If you know that the fusion exists, you can try to silence it with microRNAs, short hairpin RNAs, or other techniques, and you can show that you drive down cellular proliferation. So there's good experimental evidence that these things are the driving events. We'll see some examples of that as we go forward.
So what's the molecular mechanism that's involved, and how can a fusion drive cancer? It's really just like with mutations in cancer, and you guys already learned about somatic mutations and some of the other things that are responsible for driving cancer. There are really two key elements here. In some cases you're activating an oncogene, and the other option is you're deactivating a tumor suppressor. Either of these paths, separately or together, can drive the phenotype. So how are fusions going to do that?
Oh, that is my little buddy. TIMTOWTDI. Anyone know what TIMTOWTDI is? Any old-school Perl programmers? Ann, you don't know? I thought you were raising your hand, like, yeah, I know what it is. Ah, just stretching. Anyone know? Somebody knows. No? All right: there's more than one way to do it. That's what this is. There's more than one way to do it, and if you're an old Perl scripter like I am, this is sort of the motto of Perl programmers. No one really uses Perl anymore. I'm switching to Python now. I figured I'd try to inform everyone about this and carry the Perl legacy forward. And it just seems like something this guy would say.
So, all right, we've got some examples of all the different ways in which you can accomplish those goals: knocking out tumor suppressors or activating oncogenes. This is just one of a bunch of examples. We'll start with the most famous one, BCR-ABL. Here we have the BCR gene and we have the ABL gene, and what's important here is that the ABL gene encodes a tyrosine kinase. This thing is really kind of crapping out. No worries. Can I do this? I think I can. So we have the tyrosine kinase in this ABL gene, and you can see these little arrows here point to where breakpoints occur when you find these fusions. So it's not just one place where you might find a breakpoint in the DNA. There are several locations, and over here there are a few different locations that typically light up. When you create a fusion between these two genes, you end up with this, which is the fusion product. What this actually does is create a fusion protein, which has the N-terminus of the BCR protein with basically the kinase region of the ABL protein. And what this does is deregulate the kinase. You end up with this fusion protein that has kinase activity but lacks the regulatory components. So it's just constitutively active. And what is it doing? It's driving cellular proliferation. So that's how this one works: generating this fusion protein.
In the case of TMPRSS2-ERG, which I mentioned earlier, about 50% of prostate cancers have this particular fusion. It involves TMPRSS2 and an ETS family transcription factor coming together. Most of them are TMPRSS2-ERG; about 50% have this TMPRSS2-ERG fusion. Another 4% or so, I think, have ETV1 fused on instead, but they belong to the same family, the ETS family of transcription factors.
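The BCR-ABL fusion protein described above, with the N-terminus of one protein joined to the kinase region of the other and the regulatory part left behind, can be sketched as a simple domain-joining exercise. This is a minimal sketch under stated assumptions: the domain labels are generic placeholders chosen for illustration, not real domain annotations, and `fuse_domains` is a hypothetical helper.

```python
def fuse_domains(five_prime_domains, three_prime_domains, keep_5p, drop_3p):
    """Build the domain list of a fusion protein.

    Keeps the first `keep_5p` domains of the 5' partner and appends the
    3' partner's domains after skipping its first `drop_3p` domains.
    """
    return five_prime_domains[:keep_5p] + three_prime_domains[drop_3p:]


# Placeholder domain labels (assumptions for illustration only).
bcr_like = ["BCR_N_terminus", "BCR_C_terminus"]
abl_like = ["ABL_regulatory_region", "ABL_kinase_domain"]

fusion = fuse_domains(bcr_like, abl_like, keep_5p=1, drop_3p=1)

has_kinase = "ABL_kinase_domain" in fusion
has_regulation = "ABL_regulatory_region" in fusion

print("fusion domains:", fusion)
print("kinase retained:", has_kinase)
print("regulatory region retained:", has_regulation)
```

A kinase domain that is retained while its regulatory region is lost is the pattern the lecture describes as a constitutively active kinase.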
You find there are a lot of fusions out there where the ETS family of transcription factors is involved. They're actually pretty promiscuous when it comes to fusing with other things, and in different cancers they show up quite often in different kinds of fusions. This one's kind of interesting. The way this is drawn, you have the coding region shown in the darker color, and then the non-coding regions. These are the untranslated regions, the untranslated exons: 5' UTR, 3' UTR. And here on ETV1 we have the 5' UTR and 3' UTR. What happens here is that the first exon of TMPRSS2, which is actually a non-coding exon, comes together with ETV1, or in this case ERG, to form this fusion transcript. So you get most of the coding region for the transcription factor, but the 5' UTR is different. What's also different here, and actually the key part of this, is that the promoter of TMPRSS2 is right upstream, so now you have this fusion transcript being driven by a different promoter. It's being driven by the TMPRSS2 promoter, and TMPRSS2 is highly expressed in prostate. So what you end up with is this transcription factor that has basically lost its original regulation and is now being driven at full throttle by the TMPRSS2 promoter, and that drives the cellular proliferation. So there's no fusion protein here. It's just most of the regular protein. In this case you only have half the protein, because the fusion event occurs right here, but you have enough of it that you have the DNA binding domain, and it's going to go on and do its thing at full speed.
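The distinction drawn above, between a 5' partner that contributes only a non-coding exon plus its promoter and one that contributes coding sequence, can be sketched as a small classifier. This is a hedged illustration: the exon models are hypothetical toy structures rather than real gene annotations, and `five_prime_contribution` is an invented helper name.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    number: int
    coding: bool  # True if the exon carries any protein-coding sequence


def five_prime_contribution(exons, last_exon_kept):
    """Classify what the 5' partner contributes to a fusion transcript.

    `exons` is the 5' partner's exon list; `last_exon_kept` is the number
    of the last exon retained upstream of the fusion junction.
    """
    kept = [e for e in exons if e.number <= last_exon_kept]
    if not kept:
        raise ValueError("the 5' partner must contribute at least one exon")
    if any(e.coding for e in kept):
        return "coding sequence kept: candidate fusion protein"
    return "non-coding exons only: promoter / 5' UTR swap"


# Hypothetical toy gene models (assumptions, not real annotations).
promoter_donor = [Exon(1, coding=False), Exon(2, coding=True), Exon(3, coding=True)]
coding_donor = [Exon(1, coding=True), Exon(2, coding=True), Exon(3, coding=True)]

print(five_prime_contribution(promoter_donor, last_exon_kept=1))
print(five_prime_contribution(coding_donor, last_exon_kept=2))
```

The first call mirrors the promoter-swap situation, where only a non-coding first exon is joined to the partner; the second mirrors a fusion in which both partners contribute coding sequence.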
Another interesting example is the IGH-MYC fusion that you find in Burkitt's lymphoma. It's another one of those signature fusions. Here you have the MYC transcription factor, and here you have IGH, which is the immunoglobulin heavy chain. The fusion event here is kind of similar: you end up putting at least part of the MYC transcription factor under different regulatory controls. Whatever enhancers or promoters are actually driving IGH are now going to be driving MYC. So there's really not much of a fusion protein here, if any. It's mostly just a regulatory thing. And you find there's a huge diversity of these things. You find fragments of genes inserting into the IGH region, and it's bad news when that induces expression.
There's another one, and this one is actually really interesting. This is a case where you have the MYB transcription factor with the NFIB transcription factor. These are both transcription factors. You can see how the fusion event occurs here: you're basically getting the full MYB protein. You're getting the full 5' end, all the promoter information, everything like that is intact up here, all upstream. You're missing just a tiny little bit here at the 3' end of the coding region, and you're missing almost all of the NFIB protein.
You capture just a very little tail piece of this, and you end up with a fusion protein, but it's basically just the MYB protein, because you're only getting this tiny little bit at the end here. The key part here is really regulation, because in this MYB transcript the 3' UTR has microRNA binding sites. So you actually have microRNAs that are involved in post-transcriptionally downregulating MYB expression. And when you make this fusion, what are you doing? You're losing the whole 3' end here. So you're losing the microRNA binding motifs, and that's really the key thing. They do a little experiment here. You can think of this as being like the wild type, though it's not really wild type, since it actually has MYB in multiple copies, and then here you have just the fusion. If you transfect with microRNA oligos, what you'll see is that there's a decrease in expression of MYB. But in the case of the fusion, because it's lost that regulatory region of the transcript, you really don't see much of an effect at all. So MYB responds; the fusion does not.
Another example. I only have a few more of these, but I could do like two hours of just going through all the examples, because I think it's interesting. Just a couple more. Here's the LACTB2-NCOA2 fusion. It's in colorectal cancer. In this case you have, let's see, OK: NCOA2 is a transcriptional activator, and, I'm trying to remember how this one works, this is actually a tumor suppressor in this case. So NCOA2 is a tumor suppressor. LACTB2 is an endonuclease, but it's not really critical what that is doing. The key thing here is that the fusion disrupts a tumor suppressor, and because it's disrupting a tumor suppressor, it's basically then activating cell proliferation. And you have evidence for that here, where if you enforce expression of the tumor suppressor
and look at this measure of cellular proliferation, you can see that there's a significant decrease in the amount of cellular proliferation. It's basically just restoring NCOA2 functionality. So I guess it's a nice example where you're not making a fusion protein. You are making a fusion transcript, but the fusion transcript doesn't seem to be doing much biologically. The key thing here is that a tumor suppressor is being knocked out as a result of the fusion event.
Here's another interesting one, because it actually involves a few different mechanisms. This involves the MYB transcription factor and QKI, which is an RNA processing gene. You can see the fusion event that happens here: you basically get the N-terminus of MYB and you get the very 3' end of QKI, mostly just knocking out QKI. But here you have H3K27 acetylation peaks. This is a transcriptional activation mark, an epigenetic mark. What happens is that this fusion event brings this epigenetic activation mark to the MYB transcription factor, and this is one of the ways in which it's going to start activating the fusion transcript. So it's really the epigenetics that's playing a key role here: having this enhancer here is now going to turn on MYB when it would normally not be on. This actually makes a functional fusion protein, and what makes this worse is that this protein has an autoregulatory feedback loop on the MYB promoter. So the acetylation is turning this on, it's making the protein, and the protein's going back and turning it on even more. And QKI turns out to be in the category of tumor suppressors, so you end up knocking that out too. So it's a combination of these three different things that are thought to play a role here in the disease.
Then there's EWS-FLI1 in Ewing sarcoma. Here you
have an RNA binding protein, which is also a transcriptional activator, and we have another ETS family transcription factor, FLI1. This is a huge family of transcription factors, and again they seem to show up time and time again in these different fusions. Here you make a functional fusion protein, and you can see this in this little illustration. This is the fusion protein. It binds DNA through the DNA binding domain of the FLI1 transcription factor, and you have the transcriptional activation domain of EWS. So this is basically going to help drive transcription at places where the transcription factor would normally bind. And what is this going to do? It upregulates Aurora kinase and cyclin D1, and these are involved in the cell cycle. So this is turning on the cell cycle, and it's going to start proliferation.
I think this is the last one. This is another really important one, another one of those signature fusions where, if you have it, it basically means you have this cancer. Everyone who has this cancer is found to have this fusion, to my knowledge. This is really interesting too, because it has a very different strategy for the underlying biology and how the fusion actually drives cancer. You have SS18, which is a subunit of the SWI/SNF chromatin remodeling complex, and you have the SSX gene, which is a transcription factor. This makes a functional fusion protein as well. You get this fusion protein here that combines different domains. What happens is that SS18, which is part of the SWI/SNF chromatin remodeling complex, is now anchored to the SSX transcription factor. So the SSX transcription factor is now going to recruit the chromatin remodeling complex to the genome at sites where it would normally bind. And there are
certain regions in the genome that are repressed by epigenetics involving the polycomb repressive complex. Especially heterochromatin, which is really tightly wound-up DNA: you have the polycomb complex involved in trying to keep it all nice and compact. Well, this is now recruiting the SWI/SNF complex to regions that are normally heterochromatin. It's basically helping to unwind it and open it up. What's that going to do? It makes it transcriptionally active. And because it's opening up areas that have a lot of developmental regulatory genes, it's activating parts of the genome that are not normally activated, and that's driving proliferation and causing cancer. And this is a pretty serious one. So this one's really all about chromatin remodeling and changing the epigenetic structures.
So there are a few different mechanisms. Basically, taking all these examples and looking at the different strategies that fusions have for driving cancer, they're targeting every aspect of biology you can think of, as far as the central dogma of biology goes. At the protein level, we're interfering with cell signaling cascades; a lot of times this happens through kinases. At the RNA level, we have transcriptional activation with different transcription factors; lots of these fusions have transcription factors bound up in them, and that's just how they're operating. There's post-transcriptional regulation, by removing microRNA binding sites from the RNA. And then of course at the DNA level, we're looking at things like chromatin remodeling, repositioning of enhancers, changing the epigenetic marks.
Detectability. You have different signatures of these events in the genome. You have chimeric DNA sequences that are formed as a result of fusing different regions of the genome together. A lot of times these produce fusion transcripts, so you can detect that as being a
chimeric mRNA sequence, a fusion transcript sequence. There are other hallmarks or effects that you can detect too, like expression changes. Genes that aren't normally expressed in a certain cancer type you might find being very highly expressed, or expressed as an outlier. That would be a good hint that there's something going on. It could be hinting at a fusion having happened, and that being involved in driving the transcription.
There are different discovery platforms that have been used for observing fusion events. Some of the earliest ones just involved seeing strange-looking chromosomes under the microscope. You have what's called the karyotype, where you take the highly compacted chromosome structures and just look at them under the microscope. You can stain them with different stains. This is called G-banding. You'd think G-banding would involve having bands at the GC-rich sequences, but it's actually the other way around: it's the AT-rich regions that are the dark bands when you do G-banding. So, just an example of a karyotype: you can see the chromosomes under the microscope, and you can see the positions where you have arrows here. There's clearly something going on. Here's a big chromosome with a smaller one next to it. You can look at the banding patterns and try to see, OK, it looks like this piece of this chromosome ended up on this chromosome. A lot of times these are the initiating events that led to cancer. So these are the kinds of things you can look for.
There are other approaches. It's kind of hard; you have to have a very well-trained eye to see these banding patterns and figure out which chromosome is which. There are other techniques where you do sort of a chromosome painting
strategy, or spectral karyotyping. Here we basically have markers that have different fluorophores attached to them, and these markers correspond to different chromosomes, so you can effectively just light up the chromosomes according to these different markers. I think it's a lot easier to look at this. You can see, OK, here's a big yellow with an orange, that's clearly different, or white with the gray. It's a lot easier to look at it this way than to look at the banding patterns. I guess it's just one of the more modern ways of trying to do these things.
Rather than looking at entire chromosomes, you might have specific events that you want to be able to study. For example, our favorite, BCR-ABL. You can have a marker that has a fluorophore attached to it, green for BCR and red for ABL in this case, and it's basically just that region of that gene that you're looking at. You can do FISH, which is fluorescence in situ hybridization: you hybridize that marker to the chromosome, and then you can look under a fluorescence microscope and see where these regions are on the chromosomes. You see cases where they're lighting up together in the same place, where they shouldn't be; they should really be in separate places. That would be evidence that you actually have a fusion gene. So it's another nice diagnostic way of looking at these things.
Expression arrays have been used too. Or not just expression arrays: expression in general has been used to hint at what could be the driving event in cancer. Some earlier studies, this is back in 2005, would compare cancer to normal and say, OK, what are the genes that are expression outliers? And those genes that are expression outliers,
maybe they're the ones that are really driving the phenotype in this case. So they developed this method called COPA, cancer outlier profile analysis. I think of this a little bit differently, but the way to think of it is that it has to do with standard normal distributions, Gaussian distributions, where you have z-scores. You have your normal samples, and you can think of the expression as being like a normal plot flipped over on its side. Here you have normal prostate, and zero is basically the center of the distribution. You're looking at the expression level for a specific gene, in this case ETV1 and in this case ERG. In the normal case you basically have a normal distribution; most cases are going to be centered around the mean, which in this case is zero because it's been standardized. But you want to look for outliers. If we look at prostate cancer, there are a lot of expression profiles that fall in similar ranges to, let me just get out my own pointer, similar intensities to the normal. The distributions aren't that different. But then you have these outliers here. If you think about what kind of z-scores these guys would have, they'd be high, and they'd probably be significant. So this is what it's really looking for: do you find these expression outliers? In this case, in ETV1 you find some expression outliers, and ERG is another one that showed up as having more expression outliers than most other genes. So these came up as two candidates that maybe have something to do with the underlying prostate cancer biology. And another thing is, you
know, do you see these in the same patients, or do you see them in different... Oh, a question? OK. Yes, I'm sorry, they're different samples. These are different patients. So here you have patient one, patient two, patient three. These are all different patients, and they're just ranked according to their expression intensity, which has been centered at the mean. So zero here is actually the mean; they're all mean-centered expression values. Again, just think of this as, if I could draw something on here I would, a normal distribution, and that's the mean of the distribution, and you have crazy outliers up here. That's all that's doing.
Another thing you could ask is, if you have these expression outliers, are you seeing the same outliers in the same patient sample, or are you finding them in different patient samples? Do they go together, or do you find them separately? In this case you find them in separate cases. So if these are playing a role in cancer, they could be different drivers in the different patient samples. But all we know is that it's highly expressed. All we know is that it's an expression outlier, and it might have something to do with cancer. We don't know it's a fusion. Maybe something else happened. It could be a mutation in the promoter element or something that then induced the expression of this thing. It could be some other chromatin remodeling that happened and caused the expression to go way up. You really don't know. But because this is prostate cancer, they figured it might have something to do with fusions, because maybe we know a little bit more about prostate
cancer, and we suspect that fusions might be involved. There are some molecular techniques. There's a technique called RACE, rapid amplification of cDNA ends, where if you have part of a transcript, you can basically piece together the rest. It's a fancy RT-PCR method to get the full-length transcript if you just know a little piece of it. So if we know what the ETV1 gene is, or we know what ERG is, we can fish out the full-length transcript that encodes ETV1 and see what else is attached to it. And they did that, and what did they find? They found TMPRSS2 at the 5' end. That's one of the ways they discovered these TMPRSS2-ERG and TMPRSS2-ETV1 fusion transcripts in prostate cancer.
We can do it other ways, more direct ways. Doing microarrays, or just doing gene expression, and then having to do RACE or some other technique to find the fusion transcripts is a lot of work. Instead, maybe just do whole genome sequencing: sequence everything, and then see whether we find evidence for fusions when we assemble the reads, or just take the reads and align them to the genome and see whether there's evidence for breakpoints that suggest fusion events. That's highly effective. It's a great way to do it. But even though they keep saying sequencing is almost free, and sequencing keeps getting cheaper and cheaper, it's still not exactly free. You're still talking about hundreds or thousands of dollars. If you want to do whole genome sequencing, it can be expensive. It might cost you a grand, but it's a grand per sample. It'll get cheaper, and we'll see more and more of it in the future, but even today it's not like it's free. And there are other methods that are cheaper. So those are cheaper
and highly effective, and we'll talk about those too. Just an example here of a study that was done in 2011 with whole genome sequencing, where they discovered an important fusion in colorectal cancer. If you do genome sequencing, it's going to give you evidence about where fusion events might occur. You can have lots of genome instability, like you do in some of these cases, like this case here. You can see this is a Circos plot, and you have all the rearrangements shown: inter-chromosomal rearrangements, and intra-chromosomal rearrangements as these little green ticks. There's lots going on in this one cancer sample. Lots of genome instability; you see tons of rearrangements. And then there's another sample up here where there's hardly anything going on, just a few little things.
If you do whole genome sequencing, you're going to find all these events. But some might be important, some might be relevant to cancer. They might be the initiator or the driving event. Others could be just random passenger mutations that aren't necessarily selected for in any way; it's just an event that happened to be there. So being able to differentiate between driver variants and passenger, sort of neutral, events is something that's of great interest. You're not really going to get a lot of that information from doing genome sequencing. You get breakpoint information, but it's not going to tell you exactly that this is the biology that's driving it. Why? Because it's not going to give you any expression information. If there's a fusion transcript, you're not going to know what that fusion transcript is. All you're going to know is that there are some rearrangements that have happened at the larger genomic scale. However, if we do RNA sequencing in addition to genome sequencing, or
just do RNA sequencing alone, it's cheap, especially compared to doing genome sequencing. Why is it cheap? Well, because the coding part of the genome is just a couple percent of the genome size, two or three percent, so the targets that you're sequencing are a small fraction of the total genomic material that's available. So it's cheaper, and you get expression information. It's one of those multifaceted data types where you're not only getting sequence information but also expression information. And if there is a fusion transcript that is driving tumorigenesis, we can actually capture that fusion transcript in our RNA-seq, and we'll know that it's there. It's better at pinpointing what could be driving the tumor in this case, but it's not as comprehensive as genome sequencing. If that fusion is driving cancer, we're relying on the fact that it's got to be expressed, and hopefully at reasonable levels where we can detect it. There could be other rearrangements that are happening and that are irrelevant to cancer, but if there's no expression then we're not going to see them; we would need genome sequencing in order to see them. And there are tons of fusions that have been detected in this manner. This is just one example, one of the earlier studies, in 2011, where they detected a whole bunch of MAST and NOTCH fusions in breast cancer. So there's been a massive increase in fusion discovery over the years, and a lot of this has been driven mostly by the technology improvements. You can see there are two kinds of fusions that have been categorized here. You have the guided ones: this is based on cytogenetics, looking at karyotypes, or more targeted approaches where you do something like RACE to pull them out, or
using FISH to find them. And you have unbiased approaches, which involve sequencing: genome sequencing, RNA sequencing. You can see that right around when next-generation sequencing became available, around 2008, you start seeing a rapid increase in the number of fusions that folks have been detecting in different cancer tissues. The current estimate is over 20,000, but it depends on where you get your estimate; it could be 30,000, could be 40,000. This is one of the things that we'll get into. So how is RNA-seq data generated? This is a basic overview. You start with your total RNA, which is going to be something like 95 percent ribosomal RNA, so most of the RNA you isolate from the cell is actually not the stuff you're probably going to be interested in. What we typically do is find a way to deplete all of that ribosomal RNA. If we're interested in protein-coding genes, which is what most of us are interested in, we can do something like poly-A capture, basically grabbing onto the poly-A tail of transcripts. That way we get rid of all the ribosomal RNA and lots of other non-coding RNAs, because ribosomal RNA is not polyadenylated and a lot of non-coding RNAs are also not polyadenylated. So just by grabbing onto the poly-A tail you capture those protein-coding transcripts for the most part. There are some non-coding transcripts that are polyadenylated too, but primarily we're grabbing the coding transcripts. Then we turn them into cDNA through reverse transcription, fragment them using either an enzymatic or some mechanical method, and then grab sizes that we can easily sequence with Illumina; we're looking at maybe 300-base fragments on average. And when you do the sequencing, you've got choices as far as how long the reads are that you want to sequence. Do you want to do really short
read sequencing, like 25 bases, or do you want to do longer reads, 100 or 150 bases? Do you want to do single-end or paired-end sequencing? Typically for fusions you're going to be better off with longer reads, because read mapping is one of the complexities, especially if you're trying to do read mapping around breakpoints in cancer. So longer reads are going to be important, anywhere from 75 to 150 bases; the longer the reads the better, typically. And doing paired-end sequencing is important too, because with paired ends you get cases where the read sequence itself will span the breakpoint, and other cases where the paired-end reads will straddle the breakpoint. We're going to use both of those kinds of evidence, and I'll show you examples of that too. So in this case we've got chromosome A and chromosome B that came together to form a fusion gene and a fusion transcript; we have exons of gene X and exons of gene Y. If we do RNA-seq on that, we're going to end up with lots of RNA-seq reads, and some of these reads are going to encode the sequence around the fusion, the evidence for the fusion. Here we have these four cases. In this case here, the fragment was derived completely from gene X, the green gene, so there's no evidence of a fusion in this read. Down here, and it's not a single read, it's actually paired reads, since you get a read from each end of the fragment, this fragment came entirely from gene Y, and there's no evidence of a fusion there either. But these two in the middle are different. In this fragment (I keep saying reads instead of fragments; this is a fragment), the left read comes from the green gene, and the right read of this paired-end mate comes from the
from gene Y. So here we have evidence that there might be a fusion event. Just from this one read pair alone we have evidence that there could be a fusion. This would be called a spanning read, because the read itself does not actually cross the breakpoint. You just have one fragment end aligning on one side and the other fragment end aligning on the other side, straddling the breakpoint. That's our spanning read category. And we have split reads. Here we have a fragment where the right read comes entirely from gene Y, but part of the left read aligns to gene X and another little piece of that read aligns to gene Y. This goes in our split read category. It's not straddling the breakpoint; it actually crosses the breakpoint. This is important because when we align this read back and see that the green part of the read aligns over here and the red part aligns over there, that tells us with single-nucleotide precision where the breakpoint in the transcript is. The goal for a lot of bioinformatics tools operating on these data is to take these reads and infer the fusion transcript that they came from, and then, based on these fusion transcripts, infer what breakage might have happened at the chromosome level that could have resulted in this fusion transcript being generated. These are really the key goals here. There have been lots of developments in bioinformatics tools over the last 10 years or so to tackle this problem. It's been a very competitive area, and there are a couple of key strategies that have been used. One
strategy is starting from the RNA-seq reads themselves. One of the first things we can do is align the reads directly to the genome and look for these different flavors of reads. We can look for spanning fragments that align discordantly to the genome, such that one read of the fragment aligns to one chromosome and the other read of that fragment aligns to another chromosome, or to another gene on the same chromosome but in a very distant part, or in a different orientation, or something else that's not concordant. It's discordant, suggesting that there's something unexpected here, and it could be that there's a fusion. We also look for the other flavor, which is the split reads. I call them junction reads; you'll find lots of different names for these things, and split read is probably the better term. Here you've got one read from the fragment where part of the read aligns here and part of it aligns over there, and this is telling us that the breakpoint must join this exon with that exon. That would be the breakpoint at the transcript level. So the spanning reads are really good for saying that there could be a fusion event somewhere in this fuzzy area, but the split reads give you the transcript breakpoint precision, which is what's needed. The alternative approach that some have taken is to take all your RNA-seq reads and, in a genome-free way, do a de novo transcriptome assembly: reconstruct all the transcripts straight from the reads, without using the genome. Once you've reconstructed the transcripts, you align those transcripts to the genome and see whether they align as you'd expect or whether they align in a chimeric kind of way. If you have this case where you've assembled a nice transcript, which could be 2 kb or 3 kb long, but
half of it aligns to one chromosome and the other half aligns to another chromosome, that would be suggestive of there being a fusion. But you have to do this in a very careful way, because it's very easy to get lots of artifacts when you do de novo transcriptome assembly. You can end up with lots of chimeric transcripts, and not every chimeric transcript is going to be indicative of a fusion transcript. With some more careful analysis, though, you can be somewhat effective doing it this way. When you have your reads and you want to align them, you have choices available to you: you can align just to the genome, or you can align just to the transcriptome, and there are different challenges. In this case here you have your chimeric read, and you can align it to the genome and you'll see part of the read aligned here and part here. Or, instead of aligning to the genome, we can create a database that has the reference transcripts in it, and instead of searching the genome we just search the transcriptome. It's a lot easier, because you don't have to worry about introns. In this case you'll see part of the read aligns to this gene and part of it aligns over to gene Y, but you don't have to worry about the splicing. You just make the assumption that you know all your targets, and you want to quickly identify the reads that are split reads. One quick point to make here is that if you have spanning fragments but no split reads, you're not going to know exactly where the junction is; you're not going to know where the breakpoint is. And as an aside, even if you do have a split read and it's telling you where the breakpoint is at the transcript level, it's not telling you much about where the actual physical breakpoint is in
the chromosome. In the chromosome it could be somewhere within the introns; there are lots of places you could have the breakpoint at the chromosome level, because at the transcript level we're really just looking at what happens after splicing. The choice of reference that you use, whether you use the transcriptome only or some combination of the transcriptome and the genome, is going to impact how many reads actually align, and also the accuracy of the read alignments. This is work demonstrating that if you have all these RNA-seq data sets from different tissues and you align them to the different transcriptome-only targets, you have these different reference databases like Ensembl, UCSC, or RefGene, these reference transcript sets. No one can agree on exactly what the proper reference set should be, so everyone has their own, and some are more comprehensive than others. Ensembl has many more transcripts than UCSC, and because of that you'll end up with higher mapping rates of your reads to that data set than you might to RefGene or UCSC. But if you include the transcriptome and the genome and look at the percent of reads that got mapped, you'll see a much higher mapping rate. In large part this is because there are a lot of reads that correspond to transcripts that are just not included in your reference transcript set; the reference transcripts are just not going to be as comprehensive as you'd like or hope. So you get much more information if you search a combination of the transcriptome and the genome. There are going to be other limitations too. It's not just two choices; there are actually three choices here. You can
search the genome alone, you can search the transcriptome alone, or you can search some combination of the transcriptome and the genome. There are different reasons why you'd want to do each. In the case of genome only, the challenge is mapping these reads, especially short reads, because when you're mapping short reads to the genome you have to take introns into account. You can take into account known introns from your reference annotations, but there could be novel splicing events that are not included in those reference annotation sets, and those are going to be harder to identify from short reads. So read mapping to the genome is more challenging than aligning directly to the transcriptome. But if you're searching the transcriptome only, you have other issues to deal with. Mismapping is one of them. If you search a read against the transcriptome only, it might find an alignment, but it might not be the best alignment; if you had searched the genome, it might have been placed somewhere else. The only reason it's giving you that alignment to the transcriptome is that it's the best alignment it could get when it only had the transcriptome to work from. If you searched the genome, the read would end up in a different place, and it may have been derived from that different location. So read mismapping is one of the things you have to worry about. The other issue is that if you're searching the transcriptome only, you're limited to whatever the state of that transcriptome is; you're limited to that knowledge base. How comprehensive is that transcriptome set? Does it include every possible isoform that might exist within your sample? The answer is generally no. So, given a choice,
the most rigorous approach is to use some combination. You have the genome with annotations, and in addition you have the transcriptome, and you can search them both together and have intelligent ways of capturing the alignments and identifying those reads that are aligning in ways you wouldn't expect and that could be suggestive, if not indicative, of a fusion event. Lots of tools: like I said, this has been going on for a number of years now, and many, many tools have been developed. I threw my hat in the ring about a year and a half ago, maybe a couple of years ago, so we have our own tool for this, and you'll be hearing more about that. You'll actually be using the tools that we developed in the lab in another hour. There's a paper from 2012 that listed a number of tools, and that wasn't even all the tools that were available in 2012; there were probably two or three times this many. These are the ones that were popular at the time, and you still see more and more being developed all the time. The tool that I was involved in developing, along with Alex Dobin, who's at Cold Spring Harbor (not this campus, but he's nearby), is STAR-Fusion. Alex Dobin wrote a tool called STAR to align reads to the genome. It's very fast, it's very popular, and it generates lots of useful information about the reads that are aligning in discordant ways. So I worked with him to develop this tool called STAR-Fusion, which takes that information, maps it to genes, and applies various kinds of filters, and we'll talk about these different kinds of filters, to try to identify those fusions that are most likely to be correct and most relevant to cancer biology. But like I said, this is a very active area of research. It's a problem that's messier than I ever thought it would be, and we'll talk about all the reasons why. When all else fails: this kind of cracks me up,
because this is a paper, and when was this published? It wasn't that long ago. I don't know if it's on there; it's like 2011 or 2012, I think. They had a fusion that they knew probably did exist; they had evidence that it would exist. But when they ran the different fusion-finding tools that were available at the time, it didn't turn up. Instead a whole bunch of others turned up. They still had tons of predictions, but not a single one of those predictions was the one they expected to find, which turned out to be the actual driving event in this cancer type. So you can just imagine the frustration. But the thing that cracks me up, and what they did makes sense, is this: if there's a fusion event, then I should be able to detect part of a read that supports that fusion event in my RNA-seq data. So they created this little sequence, the expected sequence around that breakpoint, and they took that and basically did a string match. They ran Unix grep. This is just a standard grep command; you can grep any kind of text document and say, find this word in the text. So on the command line they ran grep, gave it this roughly 40-base sequence, and gave it the FASTQ files, and voilà, they found reads that supported their fusion event. It's not a fusion-finding tool. This is just basic search, like when you do a search in a Word document, and this was their technique. Anyway, they got a PLoS ONE paper out of it. It's hilarious, but at the same time it's super frustrating, because the tools should be able to find this. It turns out this is not an easy one to find anyway. This is not like a BCR-ABL or TMPRSS2-ERG or some of the
other ones that you see showing up all the time; those are relatively easy to identify. For whatever reason, and I don't remember the exact reason, I think it had to do with repeat structures, this is a tricky one to find. But it's nice that when all else fails you can just get back to the very basics. So normally you're trying to do this, but I don't know what you're doing. That's all right, that's all right. How did they have a good idea of what their fusion was? Yeah, so I think they had some cytogenetic information going into this, so they knew what chromosome locations were involved, and there had been some previous work demonstrating that this kind of fusion showed up in this kind of cancer before. So they went into it with this previous knowledge and expectations of what they should be finding. So a couple of years ago I got involved in this, and you'll hear about a lot of different tools that we've been developing. Again, there are 30-some tools out there, and I had questions just like a lot of other people had: how well do these other tools work? What's the merit of the tools that we're developing? Where are we actually adding value? What are the key challenges here that haven't been met before? There were a lot of tools out there, but one of the big challenges is that they weren't very fast. There are some tools where you give it a cancer sample and it could crank for not hours but days, and there are some cases where it would go on for weeks. There are other cases where it might run for days and then just crash. There are just a lot of frustrating elements to this. So
they're slow, and they require huge amounts of resources to get them set up and running. The nice thing about STAR is that it's super fast, and that's one of the things we wanted to capitalize on. We wanted a tool that would not only be accurate but also very fast, so we could get through lots of samples in a short period of time. So those are really the two key things we were after: accuracy and speed. One of the things I did was benchmark all the different tools that are out there, and I came up with some ways of benchmarking using both simulated data and genuine data. In the case of simulated data, I simulated thousands of fusions. There were other benchmarking papers published before this, and a lot of times they use really small data sets, where the number of fusions in the true positive set is maybe 20 or 30 or 40. I didn't think that was enough. I wanted thousands of these things, to give the tools a lot to work on. So we did five replicates at 2,500 simulated fusions and then 30 million paired-end simulated reads. One of the nice things about simulated data is that you know ahead of time what all the correct answers are. When you work with real data, you don't always know. You'll know that some of the things are going to be true fusions, but for every fusion that you predict, you're not necessarily going to know whether it was a true fusion or not. And you don't want to have to go and PCR-validate everything. That's another thing people do, but it's expensive and it's done in a lower-throughput kind of way. So simulated data is nice, but at the same time simulated data is not
the end-all, because working with real data is very different from working with simulated data. Simulated data is way cleaner, and it's way easier to find fusions in it than in genuine data. For genuine data we used 65 cancer cell lines, and we took an approach where at least three or four tools had to agree. I forget the number; I think we did three, four, and five, and tried different ways of doing this. If three, four, or five of the 15 different tools we were using all agreed, then we'd decide that's going to be the true set. And then the false positives would be all the things that were identified uniquely. That's just one way of coming up with a way to benchmark these things with genuine data. Then you calculate metrics like precision and recall. For precision you're looking at true positives and false positives, and for recall you're looking at true positives and false negatives. And you can do these plots, which are kind of like ROC plots, receiver operating characteristic plots. You can think of it as plotting true positives against false positives, and you can compute the accuracy as the area under the curve. It's kind of complicated, but in short you're just plotting out true positives and false positives and then taking the area under the curve as your measure of accuracy. We did that, and of course I wouldn't be showing you this if we didn't do well; I'd go hide under a rock somewhere. But we did. This is simulated data with short reads and long reads, and we did quite well compared to a lot of the competition. And on the real data we did exceptionally well compared to a lot of the other ones, so we're quite happy with that. You're going to have some time in the lab to experiment with STAR-Fusion, and there are some other tools
that I'll talk about. Yeah, sure. So what's the magic, the secret sauce, right? It's a fail-safe: in the end it runs grep. Well, there are a couple of key reasons for that. One is that the STAR aligner has a lot built into it to find these mismapped reads. It's really rigorous; it's a maximum unique match mapping algorithm. So I think a lot of this comes from STAR just being as accurate as it is with its read alignment strategy. The second piece is how we do filtering, and we'll talk about the sources of false positives. So not only are we sensitive at finding the events when there's good evidence for them, because of the STAR aligner, but we also have a number of advanced filters baked into the system to decide what's a good fusion versus what's likely to be a false positive, and to weed those out. Those are really the two key pieces. Another question? Yes. So in this case, what we did was take all the predictions, and you can imagine making a giant Venn diagram of all the predictions from all the different methods and looking at the overlaps. If there were at least three or four programs that all agreed, that all found the same fusion, we put that in a truth set. All the cases where any program predicted something uniquely, we put in the false positive set. Anything in between is just ignored. So you get your truth set and you get your false positives, and once you have your truth set and your false positive set, your positive and your negative sets, you can benchmark the programs. We did that requiring at least two overlaps, or three overlaps, and so on, and presented the results at each of these different stages. So it's not like we just did, if we
used an overlap of four programs then we do best, but if we use three then we're not best. We tried lots of different things just to show that the results were robust. So how was your overlap, the numbers? Because I did integrate and had so few, and the overlap was tiny. Right, yeah. So I was very puzzled why there is no overlap. Yeah, it's one of those really frustrating things. So you're showing percentages here? Well, these are AUC values. Yes. So when you're making this plot, every point on the plot involves some number of true positives and some number of false positives. You measure the accuracy of each program using some threshold of evidence. So you say, at this data point you have to have at least, say, five reads supporting the fusion. You apply that criterion to the different programs, count how many true you have and how many false you have, and do your precision and recall computation. But when you're doing this Venn diagram of how things overlap, there are usually a lot of false positives, and the number of false positives you have is going to depend on your minimum evidence support. If you require at least, say, five reads that support a fusion, then the area of agreement usually starts to get bigger relative to all the ones where they disagree. The issue really has to do with those fusions that have the least support. The programs kind of go wild in those areas, and that's how you end up with these huge numbers of predictions.
Just to give an example, TopHat-Fusion: there have been a lot of papers in the past that have shown that TopHat-Fusion is an awful tool, or ChimeraScan, that's another one. With ChimeraScan I get thousands of fusion predictions, whereas if I ran PRADA I might get 15 or 20. So one tool is telling me there are 20 predictions and another is telling me there are a thousand. But if you look at the evidence threshold, if you take TopHat-Fusion or ChimeraScan and say, I'm going to require at least three or four or five reads supporting that fusion, then the number of predictions goes down drastically, very fast. They're actually not so bad. But if you're running them at maximum sensitivity, you get these huge lists of fusions, and it makes it look like these programs are not as good as they really are. So that's another key issue. When you're making these plots, you're not just taking all the results from running that program; you're benchmarking them at different levels of support for the fusions, like at least five fusion reads, or at least six or seven or ten, and each time you'll get a different accuracy level. If you're curious, there's lots more to this. I'm just showing you one figure here, maybe two figures, from our paper, but we have a paper on bioRxiv that really walks through how we did this work, and hopefully later this year we'll have another paper that's more comprehensive than this. Any other questions before we go on? Okay. And do we do a coffee break at some point, or do we just wait and do that between? What's that? In 15 minutes, okay. No, I was just curious; I want to see how much time I have left. Okay, 9:44. So ten o'clock is okay, and the lab starts at 10:30. Okay, so I have 15 minutes to wrap this up. That's what I'm trying to figure out. Yeah?
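As a rough illustration of the benchmarking idea described above (a consensus truth set, uniquely predicted fusions treated as false positives, and precision and recall recomputed at different minimum read support), here is a minimal Python sketch. It is not the code used for the benchmark in the talk. The tool names, the `GENE*` fusions, and all read counts are made up, and the function names `consensus_sets` and `precision_recall` are hypothetical.

```python
from collections import Counter

# Toy predictions: tool -> {fusion name: number of supporting reads}.
# Tool names, the GENE* fusions, and all read counts are invented for illustration.
predictions = {
    "toolA": {"BCR--ABL1": 40, "TMPRSS2--ERG": 12, "GENE1--GENE2": 1},
    "toolB": {"BCR--ABL1": 35, "TMPRSS2--ERG": 9, "GENE3--GENE4": 2},
    "toolC": {"BCR--ABL1": 50, "TMPRSS2--ERG": 3, "GENE5--GENE6": 1, "GENE7--GENE8": 1},
}


def consensus_sets(predictions, min_agree=3):
    """Truth set: fusions called by at least min_agree tools.
    False-positive set: fusions called by exactly one tool.
    Anything in between is ignored."""
    counts = Counter(f for calls in predictions.values() for f in calls)
    truth = {f for f, n in counts.items() if n >= min_agree}
    unique = {f for f, n in counts.items() if n == 1}
    return truth, unique


def precision_recall(calls, truth, false_set, min_reads):
    """Score one tool's calls after requiring at least min_reads supporting reads."""
    kept = {f for f, reads in calls.items() if reads >= min_reads}
    tp = len(kept & truth)
    fp = len(kept & false_set)
    fn = len(truth - kept)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


truth, false_set = consensus_sets(predictions, min_agree=3)
for min_reads in (1, 2, 5):
    p, r = precision_recall(predictions["toolC"], truth, false_set, min_reads)
    print(f"toolC, min_reads={min_reads}: precision={p:.2f} recall={r:.2f}")
```

In this toy data, raising the minimum read support removes the singly supported calls first, which is the effect described for the tools that look bad at maximum sensitivity. Sweeping the threshold and plotting the resulting points would give the kind of curve whose area is used as the accuracy measure.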
Are these tools... What's that? You're asking how these things are different from each other as far as the aligners? So do these fusion detection tools work on the aligned BAMs, or do they work on the FASTQ files? Oh, so the inputs to all of these are typically FASTQ files. But then they'll use some aligner, or some collection of aligners. Some of these tools use three or four different alignment tools and then combine the results, or they do it in a hierarchical way: if they don't find something using this tool, they go to the next tool. So there are lots of different choices. TopHat-Fusion uses the TopHat alignment tool, STAR-Fusion uses the STAR aligner, there are others that have used Bowtie, and SOAPfuse uses SOAP under the hood. So there's an alignment tool, someone's got a fusion thing that they slapped onto it, and they called it something-fusion. That's how it works. Okay, so I've got to move a little faster here, with just 15 minutes. Sources of false positives: we have technical artifacts and we have biological artifacts. For technical artifacts, there are alignment artifacts. If you have homologous genes, then it's very easy to have reads mismap or show up as split read alignments between related genes. If you have genes that are highly expressed, they're more apt to have reads with errors in them, and if you have reads with errors in them, it's very easy to map them to the wrong places in the genome, because with those errors they end up mapping better somewhere else, or providing you with evidence for a fusion when really the fusion doesn't exist. You have reverse transcriptase template switching. With a lot of these, what we're generating is RNA-seq, and we're doing our next-gen
sequencing, these are steps that involve reverse transcriptase in making our RNA into cDNA, but there are also other PCR-based steps. Because you're doing PCR, you can end up with mispriming in the different PCR steps, and if you have mispriming going on, that can make it look like you have a fusion when you really don't. It's a sequencing artifact or a library construction artifact. As far as biological artifacts go, there is of course natural genetic variation, and that's not all encapsulated within the reference genome sequence. People are trying to make it so that you can do effective searches on the human genome taking into account all the different variation that exists, but it's a complex problem. Most of us right now will just download a single FASTA file and run our tools given that single reference. Well, we know that's just one model, and there's lots of variation in the human population. Because of that, when we take a sample from one person, it doesn't necessarily match the reference exactly. There is natural copy number variation and other kinds of variation, and there are hypervariable regions like the MHC region, HLA. You end up with lots of reads that are mismapping and showing up as fusions with HLA, and most of the time those are just false positives. Then you have transcription-induced chimeras. The problem here is that they could actually be cancer related, but they might not be. You have lots of genes that are close together on the genome, and you'll find that they form these fusion transcripts just because of how transcription occurs. Sometimes it's just natural. But sometimes there's a small deletion or something in the genome, and that forms these fusion
transcripts between neighboring genes that wouldn't normally exist, and there's evidence that some of these could actually be driving cancer. So you have to be careful; it's a double-edged sword. And we have trans-splicing. You can have actual fusion transcripts being generated that don't involve any rearrangement at the DNA level. Sometimes the splicing machinery just makes mistakes, and sometimes it's actually intentional. So there are trans-spliced products that are actually important for biology, or supposedly important for biology. That's what the papers tell us. I have reservations about some of these things. So how do we mitigate the different artifacts that show up? We can do it in a bioinformatic way, and this is one of the things we do with STAR-Fusion. We have all these different screens. You can take into account repeats in the genome. You can have lists of what are called red herrings. These are lists of fusions that, if you see them, maybe you should just ignore, because they're not meaningful, and I'll say more about that. For read-throughs, you could just have a minimum distance threshold: if you have a fusion between two genes, and those genes are pretty close on the genome and in the same orientation, maybe we'll just ignore them or put them in a separate pile. Consider the strength of the evidence, like we talked about earlier. You can have different numbers of reads supporting these fusion events. If you only have one or two reads supporting that fusion, maybe you're not going to trust it so much, as opposed to if you have 50 reads or 100 reads supporting that fusion event. Having one or two reads, maybe, maybe not. You have to think harder about that, because we do know the false positive rate goes up considerably with these tools when you have very little support.
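The screens just described (a red-herring list, a minimum read count, and a separate pile for close neighbors in the same orientation) can be sketched as a simple triage function. This is a minimal sketch under stated assumptions, not how STAR-Fusion implements its filters: the `FusionCall` fields, the `triage` function, and the default thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FusionCall:
    # Hypothetical record layout for one predicted fusion.
    left_gene: str
    right_gene: str
    left_chrom: str
    right_chrom: str
    left_pos: int
    right_pos: int
    same_orientation: bool
    supporting_reads: int


def triage(call, red_herrings, min_reads=3, min_neighbor_dist=100_000):
    """Sort a fusion call into a pile; the thresholds are illustrative guesses."""
    if (call.left_gene, call.right_gene) in red_herrings:
        return "red_herring"
    if call.supporting_reads < min_reads:
        return "weak_evidence"
    # Same chromosome, same orientation, and close together: a likely read-through.
    if (call.left_chrom == call.right_chrom
            and call.same_orientation
            and abs(call.left_pos - call.right_pos) < min_neighbor_dist):
        return "neighbor_pile"
    return "keep"


# Made-up calls with made-up coordinates, for illustration only.
example_red_herrings = {("GENEX", "GENEY")}
call_strong = FusionCall("BCR", "ABL1", "chr22", "chr9", 1_000, 2_000, False, 50)
call_weak = FusionCall("GENE1", "GENE2", "chr1", "chr5", 1_000, 2_000, False, 1)
call_neighbor = FusionCall("GENE3", "GENE4", "chr2", "chr2", 10_000, 40_000, True, 12)
call_herring = FusionCall("GENEX", "GENEY", "chr3", "chr7", 1_000, 2_000, False, 80)
```

Returning a label instead of dropping calls keeps the "separate pile" option open, so nothing is discarded before someone has looked at it.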
Examine the transcript breakpoint. Do you have actual splicing going on at the breakpoint where the fusion transcript is? A lot of times, when you have fusion transcripts that are generated from translocations, those translocations happen within introns. So when you find a fusion transcript, it's natural, normal splicing that generates that fusion transcript; it's just in a different genomic context. If you have a breakpoint in your fusion transcript and it's not at canonical splice sites, it could be that the translocation happened between exonic regions, and that's why the breakpoint is not at splice sites. But more often than not, it's because you have an artifact from mispriming or reverse transcription slippage or something else going on, and it's not real. That's not always the case, though. You can do a supervised fusion analysis. If you know what you're looking for, run grep. We have other ways to do that now too, so it's not just grep, but that's the general idea. If you have a list of fusions, a panel that you're super interested in, and you want to make sure you do the best job you can at finding those, you can do a supervised analysis and capture any evidence for those specific fusions. We have a tool called FusionInspector that does that, and there are other tools being developed to do similar kinds of things. You can characterize the evidence. You can re-score the fusion. You can look at the expression of the fusion versus the non-fused versions of those alleles. And we look at the functional impact, so we can better characterize that fusion: is it an in-frame fusion event that could give us a fusion protein, or does it look like it's disrupting one or both of the genes? And also just facilitating the visualization of the data, to make it
easier for us to study the evidence that exists. So we have a tool called FusionInspector that we developed. You basically just give it the list of fusions that you're interested in. Those fusion predictions could come from other tools: you could run STAR-Fusion or TopHat-Fusion or whatever your favorite fusion tool is, get a list of candidates, and give it to the tool. What the tool will do is create these mini fusion contigs, where it takes the genes that are supposedly fused and puts them in the same orientation. They're not fused; they're complete genes, but in the context of this mini genome they're laid out side by side, as compared to in the regular genome. Then you can realign the reads to the whole genome including these mini fusion contigs, and you can see whether there are reads that actually align better in the context of the mini fusion contigs than in the whole genome. These are reads that would normally align as discordant in the whole genome, but in the context of these mini fusion contigs they're going to align as concordant. This is a nice way to recapture that evidence: the reads are going to align normally, but better in this context than in the whole genome. Once we have that, it's easier to visualize these things. We can use our Trinity software, which does de novo transcript reconstruction, to reconstruct the fusion transcripts and include that evidence as well. We can put that into IGV, and you're going to actually do this later during the lab. In IGV we can look at any given mini fusion contig. In this case we have BCR-ABL again, it's our favorite, and here we have all the reads aligning to this contig. Then, as the evidence for the fusion, we have the split reads and we have the spanning fragments that
show us where the breakpoint is in the context of the reference gene structures. This is a really great way to be able to evaluate that. There's another tool that's under review, don't ask me how I know that. It's also on bioRxiv, and it does something very similar, so it can give you output that you can put in IGV. It works in the context of what are called supertranscripts, which we're not going to cover, but it's just a way of taking your transcript data and turning it into a genome-like situation, which makes it easier to view and interrogate the evidence. If you want to prioritize these fusions, there are a number of ways we can do that. Again, you get these long lists sometimes, and you've got to figure out which fusions you're going to care about and which ones you're going to pursue. We can look at the expression: is there an expression outlier here, which would suggest that it could be playing an important functional role? If we're doing a cancer study, is it recurrent? Do we see the same fusion pairs showing up time and time again in different samples? If we have a fusion that's a hallmark of a cancer type, we'll find it in every single sample, and that is the case for some of these cancers. In some cases we'll find that just one member of the fusion pair is the same, but its partner might be different. That can also happen. And do we have corroborating rearrangements? Do we have a case where it's a balanced rearrangement? Say we have chromosomes laid out like this, A B and C D, and we find a fusion between A and D. If we find A and D, do we also find the reciprocal fusion from the balanced rearrangement, C and B together? That would give us strong evidence that this is actually a fusion event. Not only is it likely to be real, but it's likely to involve a translocation like this, a balanced translocation
like this at the genome level. Look at the function of the genes. Have they been previously implicated in cancer? Do we find a known cancer gene? Do we find a kinase? Finding a kinase is important because we might be able to actually treat it with kinase inhibitors, so in the clinic that would be an important thing to know. Is it an in-frame fusion? And what's the evidence supporting it, in terms of the number of reads and additional data from DNA? So these are genes that tend to show up time and time again in these different fusion contexts and different cancers: certain transcription factors, like the ETS family transcription factors. And if you don't find the same pair over and over again, a lot of times you'll find one of these. There are some very large-scale studies that have been performed, and this has really been the goal of some of these large cancer projects: to characterize all the genetic variations that are relevant to cancer. So we have dozens of different cancer types, and you have hundreds or thousands of samples for each of these cancer types, and they're characterizing them at different levels, the DNA level, the RNA level, epigenetics, basically doing everything. There are lots of different papers coming out showcasing findings at the pan-cancer level. There's another big paper that's going to be coming out, probably later this year, for the pan-cancer analysis that goes beyond this earlier TCGA study. But this has been hugely useful, because for each cancer type we can look and see which fusions are characteristic of that cancer type. Are we seeing the same fusion showing up over and over again, or is it cancer-type specific? We can look at the partners that are involved. Some cancer types have a lot of fusions; other cancer types have few fusions. We can correlate the number of fusions that we have with: is it genomic instability, or is
it balanced rearrangements that are going on here? So there are a lot of really interesting things going on here. These are different cancer types, thyroid cancer, bladder cancer, and this is the percent of samples that have any fusion. You can see most bladder cancer samples have a fusion; very few thyroid cancers do. Here each sample is plotted with the number of fusions that it has. This is showing a measure of genome instability. We can see that in ovarian cancer there's lots of genome instability, in bladder cancer it's also significant, but thyroid cancer is actually very quiet. So it's a quiet type with very few fusions. You can look at copy number variation: is copy number variation involved in these fusions? A lot of times when you have fusion events you also see copy number variation nearby; it's associated with it. But there are cases, like in AML, where you have these balanced translocations that do not involve copy number variation. So there are a lot of interesting things going on there. We're building up these huge databases of fusions. So we have ChimerDB, where they're collecting fusions. This is not my work; they are collecting information about fusions. They have expert curators collecting these things, and they have counts here of the number of fusions in their databases. There are about a thousand that have been curated; from scraping PubMed they have about three thousand; and if you look at all the predicted ones coming from running these different prediction tools on these large-scale studies, they say it's now 30,000 gene pairs. My guess is that a lot of these are not relevant. A lot of them are probably passengers, or they're false positives or artifacts. I'm sure there's a lot of junk in here. But ChimerKB, the knowledge base, is going to be high quality, and ChimerPub is going to be high quality, so those are two really great data sets. This one is also useful, but the more sequencing data
we have and the more samples we have, it's going to continue to grow, like everything else, at a very high rate. Some earlier studies looked at relationships among the different fusion partners. This is not using NGS data; this is using cytogenetic data and earlier experimental kinds of data. You find that there are some genes that are promiscuous, but this is kind of biased, because they're doing RACE and they have very specific targets they're looking at, so there's a lot of over-connectivity in here due to that bias. With more modern approaches, with next-generation sequencing, you don't find these huge connections. If you look at specific cancer types and the fusions that are being found, you find that they're more loosely connected. You still find hubs. There are little hubs here, where you find the same gene over and over again fused with other things relevant to cancer. But in ovarian cancer, for example, something like 97 or 98 percent of the fusions that you find do not form these big, vast networks. And these are all predicted from RNA-seq kinds of studies, so it's not like they've all been characterized and shown to be relevant to cancer biology. So are they passengers, are they drivers, are they artifacts? This is one of the things that a lot of us are really focused on right now: trying to clean this stuff up and figure out what's important and what's not. There are other databases that you can just go and surf, or scrape, or download. You've got to be careful, because in some cases they have commercial licensing and other kinds of criteria. It makes it hard, when you have commercial licensing, for people like me to develop tools that are built around these kinds of databases, because I'm a free software kind of
guy, an open data kind of guy, and I try to keep the barrier to tool use and accessibility as low as possible. When you have to sign a license to do anything, it really gets in the way. We have a tool called FusionAnnotator, where I've basically gone and scraped what I could from the stuff that is freely accessible. If you have a fusion pair, you can just run it through FusionAnnotator, and it will tell you whether it has shown up in these different databases and whether it's been previously shown to be relevant to cancer biology, those kinds of things. I find that's been a really helpful tool, and it's incorporated into our tools and the reports. Just a few more slides and then we'll finish up. Other hints at which fusions are likely to be real and not artifacts: you can look at expression data. This is an example where you have an expression profile along the genome for this gene, and it ends abruptly; it's truncated. The expression evidence ends long before you'd expect it to end. You'd expect it to end down here at the transcription stop site, but it doesn't. And then the one over here starts much later than you'd expect: you'd expect it to start over here, but it starts over here. So if you have expression evidence along with other evidence for this fusion, that's also helpful to convince yourself that it's not an artifact. There are some metrics that I'm including in some of our software to try to weigh the evidence for the fusion versus the evidence against the fusion, to help us prioritize some of these things. I'm not going to have time to get into this right now, but this is one of the metrics that we're going to be looking at. Reading frame preservation is key. Are we making a fusion protein or are we not making a fusion protein? If we're making a fusion protein, that has different ramifications than
if we're not making one. So we can look at that. We can see, given the fusion event between these transcripts, are we breaking a codon, are we shifting the reading frame? Those are all things that we can look at, and we have tools for that too. We have this examine-coding-effect functionality, which you'll run later, and that will tell you: yes, this fusion here, BCR-ABL, is in frame, here's the fusion protein sequence, here are the domains that we find in the combined protein sequence. We can give you all that information. Beware of red herrings. These are the ones that are likely to be false positives. We need to be careful here, though. There are lots of fusions that can occur naturally, or that might just show up as regular kinds of artifacts you might not want to trust. So we can look at GTEx, which is an RNA-seq project for normal tissues, and see what kinds of fusions we find in normal tissues. If we see those showing up time and time again, and we also see them in our cancer data sets, then maybe we're going to discard them, or maybe put them in a separate pile, because if we're seeing it in normal tissues then it's probably not relevant to the cancer. You've got to be careful, though, and I'll say more about that. But anyway, there are these different databases of basically normal variation that you might expect to see, and we can take that into account when we're doing filtering. There are caveats to this. If you run fusion predictors on GTEx, you might find that there are a few samples in GTEx that have BCR-ABL showing up. But wait a minute, these aren't leukemia patients. These are normal people; they died of natural causes. So why might that be? It could be a false positive. It could be someone who was starting to come down with leukemia but hadn't presented any symptoms yet. People are carriers for things all the time, right, and so it's not
until later in life when they kick in and actually start causing disease. So you've got to be careful about some of these things. And you have other cases too. This is a really confusing one. Here's this fusion, JAZF1-JJAZ1, or however you say it. It's a weird one because it shows up very early in development, it results from trans-splicing, and it's apparently important for biology that you have this trans-spliced product. This is another one where it's just too weird and too science-fiction-y for me to fully believe, but it could very well be real. They show evidence; it's a Science paper from 2008. It's just one of those things that strikes you as crazy bizarre, and if it's real, then it's a crazy bizarre real thing that happens. But it's trans-splicing, so it doesn't involve translocations or anything; it's just a normal trans-spliced product that is apparently important for biology. The other issue is you can't trust half of the things that are published these days anyway. You know that too, right? So that's another thing that goes into my thought process with a lot of these. Except for all my papers; you can trust all my papers, right? But they show that this is relevant to cancer: if you actually have a translocation that generates the fusion product, it can be relevant to cancer. So not every time you see it is it cancer related, and it's just very weird. There's another one I have to tell you about that is also very weird. It's called KANSL1-ARL17A. You see it all over the place. You'll find it in all different cell lines, you'll find it in cancer samples, you'll find it in normal samples. It's a natural variant: 30% of people of European ancestry have this as a natural fusion, all
right? It just shows up. And there's this paper saying it's statistically enriched in glioma samples. Glioma is a brain cancer. This would be a big deal. Honestly, why isn't this a Science paper? Why isn't this a Nature paper? That's one of the first things that comes to mind. So I'm going to pick on this paper a little bit. I have this figure here. In the figure this label is supposed to be glioma, not total, so I X'd that out and put glioma there. They have the number of normal samples that have the fusion and the number of brain cancer samples that have the fusion, and you see the percentages at the back: 52 percent versus 12 percent. The numbers there are the numbers of samples that they had. And they're saying that because 52 percent of glioma samples and 12 percent of normal samples have this fusion, it's statistically enriched. Here's a little R code for you: if you run a little Fisher's exact test on this, you get a p-value of 0.02. In the paper they say it's less than 0.01, but maybe they did a one-sided test or something, I don't know. I'm doing a two-sided test and it's 0.02. This looks significant. So the first thing I'm thinking is, okay, this is why it's not a Nature paper, and I'm not sure I believe this. And then you have to ask yourself a question too. In the normal samples that don't have glioma, there are two that have the fusion, out of 17 total. So two have the fusion and 15 don't. What if we had three? Maybe instead of two we had three. Well, if you do stats on that, the p-value is 0.06. So now you'd say this is not so significant. It's just one of those things where it's like, let's get some more samples.
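The sensitivity check described here can be reproduced in outline with a small two-sided Fisher's exact test. The talk used R; the sketch below is plain Python so that it is self-contained. The normal counts (2 of 17, then 3 of 17) come from the talk, but the glioma counts are not stated in the transcript, so the 11-of-21 split used here (about 52 percent) is a hypothetical placeholder and the resulting p-values should not be read as the paper's or the speaker's numbers.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalized probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Integer arithmetic keeps the "no more likely than observed" test exact.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return float(Fraction(extreme, comb(row1 + row2, col1)))


# Glioma counts are ASSUMED for illustration (11 with the fusion, 10 without).
GLIOMA_WITH, GLIOMA_WITHOUT = 11, 10

# Normal samples as quoted in the talk: 2 of 17 carry the fusion.
p_two_normals = fisher_exact_two_sided(GLIOMA_WITH, GLIOMA_WITHOUT, 2, 15)
# The "what if it were three?" scenario: 3 of 17.
p_three_normals = fisher_exact_two_sided(GLIOMA_WITH, GLIOMA_WITHOUT, 3, 14)
```

Whatever the true glioma counts are, the pattern is the same: with a normal group this small, moving a single sample from one cell to another shifts the p-value noticeably, which is the argument for collecting more samples.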
Let's look at this a little bit further. I'm a person of European ancestry, obviously. If I'm one of those people, I have a 30 percent chance of having this fusion, and I'm not going to be super worried about having an increased probability of coming down with cancer because of this. One of the last slides here is just to say that if you want to do the most rigorous job you can in fusion detection, RNA is great. Again, it's cheap, it's effective, and you can get things that are biologically relevant because they're expressed. But you really need the DNA data and the RNA-seq data in order to get the full picture. With the DNA data you can generate these breakpoint graphs for complex rearrangements that involve more than just putting two chromosome pieces together. You can have very, very complex rearrangements, and having the DNA data is going to help you. But having the DNA data alone is not going to give you all the information either, because with the RNA data you see just the parts that are expressed, with the introns spliced out. So you can have much more complex fusion events going on, and the RNA-seq data might be giving you just the tip of the iceberg, but hopefully it's the biologically relevant tip of the iceberg in a lot of cases. Other considerations: if you're going to do these kinds of studies, how many samples do you need? It can get expensive. If our method costs a dollar per sample on the farm, and I have 10,000 samples to run, I'm easily racking up 10 grand worth of debt. And I've done that. I spent 13 grand in two days running things on the cloud, so it really doesn't take much effort to rack up some costs. Thankfully it wasn't my own 13 grand; NCI has been good to me. Fusion partners: are they known? If they're known, you can do things like targeted capture, so it's more of a supervised kind of analysis. How sensitive do you need to be? How long do
your reads need to be? How many reads do you need? These are all important things too, because there are samples that I'm running through right now that are supposed to have fusions in them and I'm not finding them. It's because they're so drowned out by all the other transcripts that I'm just not seeing them. But the company says that they're there, and this is one of the ongoing battles I'm having. Experimental design considerations: it's good to validate predictions when you can. Don't just pick one favorite fusion finder and run only that. You might like our STAR-Fusion software, but I'm not going to tell you to just use STAR-Fusion. Use STAR-Fusion, use FusionCatcher, use ChimPipe, use some of the other ones that have been demonstrated to be highly effective, because we're not going to capture everything. I know that, and I know that there are certain dark areas that we're not very good at yet, and it's a constantly moving target to make these things better. This is what you're going to do during the lab later. You'll take RNA-seq data and use STAR-Fusion to find initial fusion predictions. We're going to run those through our FusionInspector software, so you can validate those fusion predictions in silico and evaluate the evidence manually using IGV. And then we have our FusionAnnotator software and fusion coding effect software, which will tell you whether that fusion has been seen before in cancer, and whether it's in-frame or looks frame-shifted. So that is it.
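The in-frame versus frame-shifted question comes down to simple modular arithmetic on the coding sequence. The sketch below illustrates the principle only and is not the lab software's implementation; the function names and both parameters are hypothetical. It assumes you already know how many coding bases the 5' partner contributes before the breakpoint and how many coding bases of the 3' partner lie before the breakpoint in its own transcript.

```python
def is_in_frame(left_cds_bases: int, right_cds_offset: int) -> bool:
    """Check whether the 3' partner keeps its native reading frame.

    left_cds_bases: coding bases contributed by the 5' gene up to the breakpoint.
    right_cds_offset: coding bases of the 3' gene that precede the breakpoint
        in its own transcript (the part that is lost in the fusion).
    The first retained base of the 3' gene sits at codon position
    right_cds_offset % 3 natively and left_cds_bases % 3 in the fusion;
    the frame is preserved when the two agree.
    """
    return left_cds_bases % 3 == right_cds_offset % 3


def frame_label(left_cds_bases: int, right_cds_offset: int) -> str:
    # Hypothetical labels, chosen for readability.
    if is_in_frame(left_cds_bases, right_cds_offset):
        return "in-frame"
    return "frame-shifted"


# Made-up breakpoints, for illustration only.
label_a = frame_label(300, 150)   # both breakpoints fall on codon boundaries
label_b = frame_label(301, 150)   # the 5' partner adds one extra base
label_c = frame_label(301, 4)     # both breakpoints fall one base into a codon
```

A frame-preserving junction is necessary for a full-length fusion protein but not sufficient: a real check would also translate across the junction and look for premature stop codons.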