Hello everyone, I'm Brian Haas. I come from the Broad Institute in Cambridge, Massachusetts. A little bit about myself: I've been doing bioinformatics since they started calling it bioinformatics, I think, in the late 90s. Mostly I do a lot of software development, bioinformatics tool development. My original training was actually in molecular biology and biochemistry. But I decided that I didn't like working with radioactively labeled bases and pouring gels; it wasn't fun for me. But I loved doing bioinformatics, you know, and programming was a lot of fun. So I just really enjoyed doing that. So I put down the Pipetman in early 1999 and have been typing on the laptop ever since. I started my career in bioinformatics at a place called The Institute for Genomic Research, or TIGR, which is down in Maryland. It's famous for having sequenced one of the first bacterial genomes in the mid-90s. It eventually became the J. Craig Venter Institute in, I think, 2003 maybe. I left there in 2007 and went to the Broad Institute. I've been at the Broad Institute ever since.
So what's not on my CV? I was supposed to tell you something that's not on my CV. And one thing that comes to mind is that my house basically looks like a gift shop on the inside. Because I get lots of opportunities to travel, and every time I travel somewhere, I've always got to bring home a bunch of trinkets. So, you know, you walk in my house, there are just trinkets from all over the place, hanging on the wall, on the shelves, you name it. It drives my wife nuts. So anyway, that's what I do. I travel. I collect things that collect dust.
All right, so today I'm here to teach you about fusions. Fusions of the genome, fusions related to cancer, how we find fusions, why they're relevant, why we care. And this is the second year I'm teaching this module. Before that it was taught by Andrew McPherson, who's an expert in fusions and cancer and has developed tools in this area.
And I basically took over from him a couple of years ago to teach this. So some of the slides are actually his slides, and you'll see his initials in the bottom corner of the slide if I reused his material. And I incorporated my own material in this as well. So contact information is there.
So, learning objectives of the module. Let's see if I can just move this around a little bit real quick. Okay. So we want to explore the impact of gene fusions in cancer. We want to learn about the types of evidence that we have for finding gene fusions. Understand what it's like to use the different tools for finding fusions. Identify the common sources of false positives. And assess a gene fusion's potential function.
This is the definition of a gene fusion. It's formed by a fusion between two distinct, so-called wild-type genes, the normal versions of the genes. And this happens in cancer quite often. It happens through somatic genome rearrangements. So basically you have translocations that happen within chromosomes or between chromosomes that put two genes together that aren't normally found together, creating a new gene product.
And the best example of this is something that was discovered back in, I think it was 1960. It's called the Philadelphia chromosome. Here you have a translocation, actually a reciprocal translocation, between chromosome 9 and chromosome 22, generating this longer version of chromosome 9 and this tiny version here, a chromosome 22 and chromosome 9 chimera. And this little one here is named the Philadelphia chromosome. And basically this translocation puts these two genes, this BCR gene and this ABL gene, together. Half of the BCR gene is connected with half of the ABL gene, creating this new gene product. This is a fusion between the two genes. And you find this in about 95% of chronic myelogenous leukemia samples.
And it's considered the hallmark of this disease, this cancer. That's probably one of the best-known examples of this. There are many other fusions that you find in cancer. Some of them you find quite often in certain cancer types, and they're considered to be the hallmarks of those cancer types. Others you find at maybe smaller frequencies, but they're still highly relevant. In many cases they're thought to be, or known to be, drivers of the cancer phenotype. And in some cases they can be treatable. So it's really important to identify them in patient samples, because it's going to have an impact on their treatment and also on their prognosis.
BCR-ABL1 is the best-known example, with around 95% of cases. It can be treated with tyrosine kinase inhibitors that specifically target this type of cancer. Another well-known fusion is TMPRSS2-ERG, which is a fusion found in about half of prostate cancers. So you find fusions not only in leukemias and what we call liquid tumors, basically tumors in your blood. You also find them in some solid tumors as well. Prostate cancer is probably the best example of a fusion that you find in a solid tumor. There's EML4-ALK, another kinase fusion, that you find in a small percentage of non-small cell lung carcinoma cases. It's important to identify this one too, because this is another example of a treatable one, one you can treat with kinase inhibitors. There's the DNAJB1 fusion, which they find in 100% of cases for this type of liver cancer, fibrolamellar hepatocellular carcinoma. 100% of the cases. This is another hallmark fusion. And there are some other examples that I'll walk you through later. Another one in brain cancer, FGFR3-TACC3, is well known, at 8%. So we'll see some examples of these.
And there's basic evidence that gene fusions are playing an active role in tumorigenesis. The evidence comes from the fact that we find that they correlate with cancer phenotype.
We know correlation is not necessarily causation, but there's other evidence too. We know that we can treat them. So if you have a kinase fusion, you can treat it with a kinase inhibitor, and that effectively treats the cancer. There are other studies in mouse, where you take a fusion and you put it into a mouse, and it can produce a tumor. And you can also silence fusion transcripts. You can use techniques like microRNAs or short hairpin RNAs, shRNAs, to reduce the tumors.
So how can fusions drive cancer? And you can basically ask, how can anything drive cancer? There are really two key mechanisms that are involved. There might be others too, but the two primary ways are these: one is that you can somehow activate an oncogene, and the other is that you can deactivate a tumor suppressor. So those are the two main paths. So how do fusions do this?
Well, anyone know what that means? No? Nobody knows? No one's googled it yet to find out what it means? All right. It means "there is more than one way to do it." Okay. Yeah, so if you were a Perl scripter from the late 90s, you'd know what this means. If you weren't a Perl scripter from the late 90s, you probably wouldn't. So there aren't any Perl scripters from the late 90s in this room, is what I'm taking from this. Anyway, TIMTOWTDI is the acronym that goes with the Perl programming language. And Perl was the most popular scripting language for bioinformatics when they started calling it bioinformatics. Now everyone's doing Python, including myself to a large degree. All right, so you learned something. If you learn anything during this workshop, you know, this is TIMTOWTDI. It's gold right there. Okay.
So I have a number of examples here of how fusions are involved, where the fusion was either thought to be involved or experimentally demonstrated to have a molecular mechanism that's driving cancer. So I'm going to walk you through these. Some we'll spend more time on than others.
And by the eighth one, you're probably really bored. So I'll try to get through the last couple very quickly.
So we'll start with the most famous one, right? BCR-ABL1. Here you have this ABL1, which is a tyrosine kinase. And you have this BCR gene. And you can see these arrows here point to where fusion breakpoints commonly occur when they find these fusions in patient samples. All right, so there are a few different places where you have the actual fusion event. And when you create this fusion, you're basically putting these exons here from the BCR gene together with the tail end here of this ABL1 gene. And what's interesting about this is that this creates an in-frame protein fusion. All right, so we actually get a fusion protein out of this. We get a new protein where part of the protein is from the BCR gene and the other part of the protein includes the tyrosine kinase domain. And this fusion protein is missing important regulatory controls that would be at the N-terminus of this ABL1 gene. So you end up with this kinase protein, which is constitutively active. And what does it do? Well, it drives cellular proliferation.
Not all fusions create fusion proteins. In some cases, you're not creating a functional protein at all. There are lots of different ways to do this. Creating an active new protein product that has altered functionality, that's just one way to do it. And we'll see that there are other ways to do it. Again, you're doing one of two things. You're either creating an oncogene or stimulating an oncogene, or you're knocking out a tumor suppressor. So this is going to fit in one of those categories. Which category is it going to be? We're stimulating an oncogene, basically.
All right, so the other really well-known one is this TMPRSS2-ERG. ERG, or another option you can have here is ETV1.
ERG and ETV1 are both transcription factors, and they're both from the same family, the ETS family of transcription factors. And in prostate cancer, when you find the TMPRSS2 fusion, most of the time it's going to be with the ERG gene. A small percentage of the time it will be something else. It could be ETV1. Let's look at either of these fusions, starting with TMPRSS2-ETV1. This is on top. And you can see where the actual breakpoints are happening here. The way this is drawn is you have the coding region in a darker color, and you have the 5-prime untranslated region and the 3-prime untranslated region, the UTRs, in a lighter color. So the pink here is the UTR, and the red is the coding region of TMPRSS2, which is a serine protease. And in ETV1, in dark green you've got the coding region, and in light green or yellow you've got the untranslated regions.
So if you look at the fusion products here, you have the transcript. What you can see here is that you're not getting any of the coding region from TMPRSS2. You're just getting this first exon, which happens to be an untranslated exon. It's part of the 5-prime UTR. But if you look at ETV1, you're basically getting the second half of the coding region. You're missing the first part, the N-terminus of the ETV1 protein, but you're getting the whole C-terminus here in this fusion product. So what you're doing is you're making this fusion transcript that basically encodes the second half of the transcription factor. But the point here is really that you're not necessarily making a new protein; you're basically making part of a protein. The key is that it's now driven by a different promoter. So by making this fusion event, you effectively put the ETV1 DNA binding domain under control of the TMPRSS2 promoter. This is a very active promoter, because the TMPRSS2 gene is upregulated in prostate.
So basically you're overexpressing the second half of this. The same thing happens here. If you look at ERG, you're actually getting more of ERG. You're almost getting the entire ERG coding region here, but you're still getting just the 5-prime exon of TMPRSS2. So ultimately, the TMPRSS2 promoter is driving overexpression of this transcription factor, and that's really the key driving event in prostate cancer for that fusion.
Okay, so another one, IGH-MYC. MYC is a transcription factor, and MYC is a well-known oncogene. It's probably one of the best-known oncogenes. And IGH is the immunoglobulin heavy chain; B cells produce antibodies. And what happens here is a fusion where part of the IGH locus gets fused with the tail end of MYC. And it's another case where it's really not a fusion protein that you're interested in. It's actually that the promoter for IGH is now driving a good portion of MYC. All right, and that's actually what's now driving cellular proliferation.
This one is really interesting. Here you have the MYB transcription factor, and you have this NFIB transcription factor. And if you look at the structure of the MYB transcription factor, you see you get the whole coding region here. You have this 3-prime UTR, and this 3-prime UTR has microRNA binding sites. Okay, so microRNAs are actually involved in post-transcriptional regulation of the MYB transcription factor. With NFIB, I don't think it really matters what's happening with NFIB. You can see when you create this fusion here, you're only getting this tiny little piece of it, really the very tip of the C-terminus here in this fusion protein. You're basically getting almost all of MYB's coding region. But the key here is that you're knocking out the 3-prime UTR; you're not including any of the 3-prime UTR from MYB that has those microRNA binding sites.
So you end up with this fusion product, which is essentially MYB without the microRNA regulation. All right, so now you have loss of post-transcriptional control, and that's now going to be the driving event. There's some experimental evidence for this that I just wanted to show you. Here you have the wild-type MYB, and if you treat it with microRNA oligos to try to shut it down, you'll see a decrease in expression. Over here you've got the fusion, and the fusion is not affected by introducing the microRNAs. So that's a little bit of evidence for that one.
Here's another one; it's a complex fusion. This is a fusion you find in colorectal cancer, and in this case you have this NCOA2 transcriptional co-activator, which is thought to be a tumor suppressor. This is a really complex fusion; it actually involves three different chromosomal regions here. You can see there's orange, blue, and green that all come together to create this new fusion structure. The fusion transcript actually only includes exons from the orange and the green. So the little blue section here, although it's part of the fusion gene, is not contributing to the transcript. And it turns out that all this fusion is really doing is knocking out NCOA2. NCOA2 is considered to be a tumor suppressor, and we're knocking out a tumor suppressor, so that's again one of our key mechanisms for how fusions can contribute to cancer. Now there's some evidence for this where you basically have a growth curve, and it shows you that if you overexpress the normal NCOA2 gene, not the fusion, but the normal one, you get a reduction in the cell index. So if you overexpress it, then the cells aren't growing as fast. So that's the idea.
There's just a couple more here. This is where everyone gets really bored, like, when is this going to stop? I've seen enough. But trust me, there's a couple here that are really interesting.
So it's worth going through. I keep wanting to take some out, because it does seem like we spend so much time on this, but each one is just really interesting in its own way. So I haven't figured out which ones to remove yet.
Okay, so this one is interesting because it involves an epigenetic component. Here we have MYB, another well-known oncogene that we've seen before, with this QKI RNA processing gene. And if you look at the normal genes in the genome, here you have the MYB transcription factor, and here you have the QKI gene. If you look at the epigenetic signatures, the epigenetic marks around these genes, what you see is that the QKI gene has got tons of these H3K27 acetylation peaks. This is a histone acetylation, an epigenetic mark that's consistent with being expressed, like very well expressed. You might consider it an enhancer mark. You don't see any of this histone acetylation for MYB. But when you create this fusion gene, you're basically just taking the tail end of QKI and tacking it on to MYB, and you're going to inherit the H3K27 acetylation at the 3' end. But you can see it's now also showing up at the beginning of the MYB transcription factor. So some of that signal has sort of propagated over. So now this enhancer mark is helping to drive expression of MYB.
And there are actually a few different things happening here, so that's why this one's interesting. There are multiple mechanisms that are involved, or at least thought to be involved, in contributing to cancer. The first one here is just inheriting the histone acetylation marks. But the other thing is that the fusion protein can drive its own transcription as well. There's an autoregulatory feedback loop where the fusion protein comes back and activates the MYB promoter. So that's the second way this happens.
And the third: we're basically knocking out the QKI RNA processing gene, and that turns out to be a tumor suppressor, so by knocking that out we're also contributing. So it's sort of a triple whammy here as far as how it's driving cancer.
This one here, EWS-FLI1, is another signature fusion, one you find in Ewing's sarcoma. And this is an interesting case. You have the EWS gene, which is an RNA binding gene, and it's a transcriptional activator. So you have a transcriptional activation part here at the N-terminus, and you have the RNA binding domain here at the C-terminus. And then you have FLI1, an ETS family transcription factor. And we've seen ETS family transcription factors before, all right, like in TMPRSS2-ERG; ETV1 is another one that's part of the same family. And what happens here is that when you create this fusion, you're basically removing the RNA binding domain from EWS, but you're attaching its transcriptional activation domain onto the DNA binding domain of a different transcription factor. And you can just picture this here. So now you get this transcription factor FLI1 binding where it likes to bind in the genome, and now it's taking with it the transcriptional activation domain of this other protein. So what's going to happen? You get transcriptional activation at all the places in the genome where that transcription factor likes to bind that aren't normally activated by this transcriptional activator. And of course, that has consequences. In this case, there's a kinase that happens to be upregulated due to this, and that's going to drive cellular proliferation.
I'm pretty sure this is the last one. No guarantees, though. All right, so this is another one that's important because it's another signature fusion. You find this in synovial sarcoma. And actually, we have collaborators that focus on studying this cancer and actually see patients with this disease. It's pretty bad.
But every patient sample has this specific fusion. All right, so it's a perfect hallmark of this disease. And the reason why this one is interesting is not because it's knocking out... Well, it's interesting because it involves several components, but the key reason why it's interesting is because it involves epigenetic marks. Basically, we have regions of the genome that are tightly compacted into heterochromatin and are not being expressed. And this fusion gene will recruit a chromatin remodeling complex into these regions. And it does chromatin remodeling. It basically opens up these compact regions of the genome and allows transcription to occur. Now you start getting expression of genes that normally would be really tightly wound up and not expressed.
So you have these two genes, SS18 and SSX. SS18 is part of what's called the SWI/SNF chromatin remodeling complex. And you have this SSX, which is a transcription factor. This is actually a functional fusion protein in this case. You basically create this SS18-SSX fusion protein, and you can see this SWI/SNF complex now has this subunit that is the fusion protein. This complex is now going to be recruited to where SSX binds in the genome. And when it does that, well, SWI/SNF gets to work. It starts remodeling the chromatin in those areas. We start seeing transcription, and bad things happen. I think that's the last example.
So, lots of ways we can do this. We can impact cellular proliferation at the protein level by messing around with the kinase cascades, the signaling pathways. We can do this at the RNA level by manipulating transcription factors, or through post-transcriptional regulation by removing regulatory motifs, like we saw with removing the microRNA binding sites. Then there's also the level of epigenetic marks and chromatin remodeling.
We can reposition enhancers, and we can perform chromatin remodeling in areas that normally would be tightly wound up in heterochromatin. So there are lots of ways this can happen. And that's where this little guy comes in. There's more than one way to do it. Perfect.
So what are the genomic effects of fusions? Now this is more about the signatures. We have different ways in which we can detect gene fusions. We have different technologies. We have different molecules that we can look for. We can detect fusions at the level of the DNA sequence, looking at the genome. We can do it by looking for chimeric transcripts, because if there's a translocation that creates a fusion gene and the fusion gene is expressed, then hopefully we can find it at the transcript level. Also, if it's the driving event, if it's the fusion that is responsible for the phenotype, then hopefully we'll find good evidence of expression that we can capture at the transcript level. We might also be able to detect it indirectly by expression changes. We might see that certain genes are expressed differently in certain cancers, and maybe that's a signature. Maybe there's a fusion event that repositioned a promoter element, and that's really what's involved. So there are a few different ways in which you can try to detect the fusion events or get hints that fusions might be involved.
One of the earliest ways to look for fusions is to look at the chromosomes themselves and see if there's any evidence of structural abnormalities. So you can do a karyotype. You can look at chromosome banding patterns; G-bands, they call them, even though I think it's the AT-rich regions that light up. You can do spectral karyotyping. This is an easier way to do this kind of analysis, where you basically paint the chromosomes by using probes that have certain colored fluorophores attached to them.
It just makes it, I think, a lot easier to detect the fusion events. Like here, you can see there's a yellow and an orange. You know, you have to be an expert to look at these regular karyotypes and see what's happening. But I think anyone could look at this and say, okay, there's an orange thing attached to a yellow thing, and that looks different, right? So I kind of like this. There's another one here where you have gray attached to white. So that just makes it a lot easier.
These are unbiased approaches. They don't have any kind of preference. But you can also target, you can home in on, specific fusions you're interested in. So if someone comes in and it looks like it might be a case of CML, chronic myelogenous leukemia, and you want to know if they have the BCR-ABL1 fusion, you can do a targeted assay. You can use probes that are specifically designed to find those specific fusions. And again, you can light them up with fluorophores. You can look at the chromosomes with the colors attached. There's a yellow and a red, or basically green and red. Green and red make yellow. So that's the evidence for the fusion. This is the FISH, fluorescence in situ hybridization, approach. It's targeted, it's low throughput, but it can be useful.
Another thing that was useful early on, especially with microarrays, is just to look for expression outliers. There's an analysis called COPA, cancer outlier profile analysis. Essentially what you're doing is you're looking at expression for all genes, and you're comparing normal tissues or normal samples to cancer samples. And you ask the question, are there certain genes that look like expression outliers compared to the normal samples? And this is from that kind of study. I don't really care for these plots. They make it hard to figure out what's going on.
But you can just picture it like there's a normal distribution of expression for normal samples, and you have the same sort of distribution for tumor, and you want to be able to compare them. These distributions are basically ranked. These are individual patient samples along the X axis. And for each patient sample, you're looking at how much that gene's expression deviates from the mean. So all the samples are centered here at the mean, so the mean is zero. These patient samples are normal prostate. Some patient samples have expression that's below normal. Some have it above normal. Then if you look at the prostate cancer cases, you'll see there are a lot of them that are below normal and some that are above the mean. But you see that there's a whole bunch of samples here that have ETV1 really highly expressed compared to what you're seeing here at the top for normal. And then if you look at lymph node metastasis, you'll see that even compared to the extent of the normal range here, you're getting a couple of outliers. But this is really the key: you're finding a lot of patient samples here that appear to be outliers. And it's the same thing if you look at the ERG gene. You've got normal, you've got tumor, and you've got a whole bunch of tumor samples here that have expression that is a far outlier compared to what you're seeing with the normal samples. And with the metastasis you're seeing lots of outliers too.
So there's something going on here. You can see it's an outlier. We don't really know why it's an outlier. Maybe it's because there's a fusion. It could be that there's a promoter that gets fused on and that's driving the expression now. It could be something else too. You could have a somatic mutation in the promoter region and that's driving it. Biology is rich with different ways to do things. One thing we could do, and what they did here, was to ask: are these patterns mutually exclusive or not?
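
To make the outlier idea concrete, here is a minimal Python sketch of a COPA-style score for a single gene. It assumes the commonly described recipe (median-center the values, scale by the median absolute deviation, then read off an upper percentile), which uses the median rather than the mean mentioned above. The function names, the choice of the 90th percentile, and the toy numbers are all made up for illustration and are not taken from the study on the slide.

```python
from statistics import median


def copa_transform(values):
    """Median-center one gene's expression values and scale by the MAD."""
    med = median(values)
    centered = [v - med for v in values]
    mad = median(abs(c) for c in centered)
    if mad == 0:
        mad = 1.0  # assumption: leave flat genes unscaled instead of dividing by zero
    return [c / mad for c in centered]


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list (q in 0..100)."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def copa_score(values, q=90):
    """Upper-percentile value of the transformed expression; high means outlier-like."""
    return percentile(sorted(copa_transform(values)), q)


# Hypothetical expression values across ten samples for two genes.
quiet_gene = [10, 12, 8, 11, 9, 10, 13, 7, 11, 9]
outlier_gene = [10, 12, 8, 11, 9, 10, 13, 7, 90, 110]  # two samples far above the rest

print("quiet gene score:  ", copa_score(quiet_gene))
print("outlier gene score:", copa_score(outlier_gene))
```

Ranking genes by a score like this pulls up genes that are very high in only a subset of samples, which is the pattern described for ETV1 and ERG. A mean-based z-score would be diluted by exactly those few extreme samples, which is one reason to prefer the median and MAD here.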
When you find ETV1 as an expression outlier, are you also seeing ERG as an expression outlier in that same sample? Or are they in different samples? So what they did here was just plot ETV1 against ERG for each patient. Each dot is basically a patient. You have an expression level of ETV1 and an expression level of ERG, and you basically see that when you have ERG as an expression outlier, ETV1 is not an expression outlier. And vice versa: when you have ETV1 as an expression outlier, you basically have a pretty normal level of ERG expression. So something here is driving this.
And then they used a technique called RACE. It's an earlier approach where, if you have just any sequence for a gene, like in the middle of the gene, you can do RT-PCR and basically get the whole full-length product. You can sequence that and see what it looks like. So in this case you could take ERG or take ETV1 and run RACE to get the full-length transcript and see what it looks like. And what do you find? You find a fusion transcript, and this is actually how they discovered TMPRSS2 as being a fusion partner with ERG or ETV1. So that was pretty clever.
Other discovery platforms for doing this: you can use genome sequencing. This is going to be comprehensive, because you're looking at the whole genome. If there are any rearrangements there, hopefully you can detect them within the genome sequencing data. There are two main issues here. One is that it's kind of expensive to do whole genome sequencing. Even though prices keep plummeting, compared to other approaches it's still relatively expensive. And also, it doesn't give you any functional information with regard to expression. Because you could have lots of fusions. In tumors, in patient samples, in cancer cell lines, the genomes are very dynamic, and you'll find tons and tons of rearrangements in some of the samples. Not every rearrangement is necessarily contributing to cancer.
Or it might be that you have one fusion driver, but then you've got like a hundred other events that happened that are just neutral or sort of silent. They don't really contribute to cancer in a big way. So how do you differentiate between a fusion you find that is actually contributing to cancer and the other fusions that are maybe just there because of rearrangements that happened because of the chromosome dynamics? So it's useful to have expression information, or to have a way to home in on fusions that might actually be important and contributing.
Here's an example; here we have Circos plots. Everyone familiar with Circos plots? You see these a lot in the literature; they're pretty common. It's basically just showing, using DNA sequencing data, different rearrangements that take place between chromosomes, which are shown in a pinkish color, I guess. And then intrachromosomal rearrangements are shown in green. And what you can see here is that there are some tumors that are very noisy, with a lot of rearrangements going on, and there are others that are fairly silent, with only a few events. So you have a case like this where there are just tons of rearrangements going on. You can imagine that there are going to be lots of fusion genes created here. Which are the ones that we care about? Which are the ones that are contributing to cancer? If this is a patient sample, are there any fusions that we might be able to base treatment on?
So that's why we might use other methods. mRNA sequencing is one of the most popular methods for doing this, to just get at the transcriptome. It's inexpensive, because when you sequence the transcriptome, you're really sequencing only a small percentage of the genome. Only 2 or 3% of the genome actually corresponds to mRNAs.
A nice thing about this is that if there's a fusion that is well expressed and it's contributing to cancer, then hopefully we'll find it, because it's expressed. There are other fusions in the genome that are not really contributing. If they're not expressed, we're not going to detect them. So it helps us to put the lamppost next to those fusion events that might be most meaningful. Okay. This is just one study here where they discovered fusions this way, one of the earliest applications back in 2008-2009, where, basically along with the development of RNA-seq itself, they showed that you can actually detect fusion transcripts with RNA-seq data.
And how many fusions are there? We've been cataloging these things for a while. There are a lot of them, and there's been an explosive growth in the size of these catalogs since the invention of next generation sequencing, or the application of next generation sequencing. So this is just from a paper from 2015. Here we have guided approaches; this would be more like looking at karyotypes or doing direct PCR-based approaches for targeting fusions. And you'll see we've got relatively small numbers. The catalogs hadn't really been growing all that much. And once you get to around 2009, when Illumina comes on the scene along with these other approaches, we see this massive increase in the number of fusions that people are finding. This is 2015. It tops out at around 8,000 fusions. The number of fusions that we have now eclipses that in a big way. This is from 2018: there's a collection that had 21,000 fusion genes. And it depends on what resource you're using, because it could easily be 100,000. Right now it's somewhere between 20,000 and probably 50,000.
Yeah, question? Yeah, my question is, just looking at the fusion landscape and knowing all the different callers in the literature, it's said that a lot of them are error-prone.
So looking at this number, could this include false positives as well? Oh yeah, absolutely. We just don't know for a fact. Yeah, so how many of these actually matter? Well, how many of them matter? That's one question. How many of them are real versus being artifacts? That's another question. So these are all things that we're dealing with right now. But before we get to that, just to follow up real quick, the key here is that the number of fusions that we're finding is still far smaller than the number of fusions that are possible. If you say there are 20,000 genes, then 20,000 times 20,000 is what? I don't know. What is it? 400 million? I don't know. It's a big number, right? And 21,000 is a lot smaller than that. So it's not like we're seeing every possible fusion under the sun. But anyway, the numbers keep growing and we really don't know which are true versus false. You mentioned you can use RNA-seq for most of the fusions, right? I guess I'm wondering, going back to what you were talking about earlier, outside of doing whole genome and RNA-seq, are there other ways of doing that? Right. So what other methods could you use to detect those kinds of fusions besides RNA-seq and whole-genome sequencing? So one of the ways is just having a targeted approach. If you know that one of the partners is involved, then you can see what the other possible partners are. There are different approaches for doing that. The earlier approach was called RACE, rapid amplification of cDNA ends. It's a PCR-type approach. If you know one partner and you want to know what the other possible partners are, you can try to figure that out by doing sequencing based on that. There are also panels of genes that are commonly used with PCR-type approaches for doing fusion diagnostics. These are things that would be used in the clinic. So there's a company called Archer that has a panel to do this. ArcherDX, I think it's called.
But there are a few different solutions out there. They're targeted to specific genes in their collection. Sometimes they have specific fusion pairs that they're looking for. In other cases they know that one partner is sort of promiscuous and you'll find it linked up with a bunch of others, so it'll basically just try to figure out what those others are. But really it's PCR-based techniques at that level. Unless you want to do things like karyotypes and other more visual assays. That's the other way to do it. Any other questions? No? Okay. How is RNA-seq data generated? You guys probably saw this, maybe. No? Yes? It's part of the workshop. There's a lot of stuff like measuring expression, and you use RNA-seq to do that. So yeah. You just start with total RNA, do a poly-A capture, fragment, and size select. Then you do paired-end sequencing. So basically you do Illumina sequencing. But you guys are already experts in that. Alright. So suppose we have a fusion gene. This is a fusion gene up here. We've got gene X fused with gene Y. And if we transcribe this and splice out all the introns, we end up with this fusion transcript. Alright. Now if we do RNA-seq, we're going to get reads that look like these different flavors, let's call them. We'll say we do paired-end sequencing here. So let's just look at this top fragment here. This is one RNA-seq fragment. We're doing paired-end sequencing, so we get a little green read on each end. And when we map this to the fusion transcript, or we map it back to the genome, we're going to see that it corresponds to gene X. Alright. Both of these reads are going to correspond to gene X. Then we have this flavor, where we have read one and read two. Read one maps to the green gene X, but the orange read maps to gene Y. Alright. So this would be called a discordant read pair, because the reads are going to different genes, not where you would expect them to. Each read aligns entirely to its target. Alright.
This entire green read here maps entirely to gene X. And this orange read maps entirely to gene Y. Now we have another flavor. Alright. We have one read over here from this fragment, the left read, or read one. Part of that read maps to the green. Alright. And part of that read maps to the orange. This is a split read alignment. This is also evidence that there's a fusion there. But if we look at the orange, the right read here, that entire read aligns to gene Y. And the reason there's this little line here is that it probably aligns across a junction, like right here. A splice junction. Alright. Then we have a fourth category here, where we have both reads and they both align to gene Y. They're both orange. Alright. So basically there are different categories. We have the category of just being a normal read. It's aligning properly, a properly paired alignment. This top one here would give us a properly paired alignment, so we wouldn't suspect there's anything wrong. Alright. Looks good. And the bottom one is going to be just like a normal RNA-seq read alignment. Alright. So nothing tells us anything about there being a fusion; it just aligns and there's no flag set. It just aligns as a normal read. But these two fragments in the middle, these are really our evidence for fusion transcripts. Alright. And they come in two types. There's the type where you have just discordant read pairs, which is this one: one read is aligning somewhere and its mate is aligning in a place where you wouldn't expect it to be. And then we have the other category, which is the split reads, where, within one read sequence itself, parts of that read are aligning to different places that you wouldn't expect. So we're basically looking for both of these types of reads: the spanning fragments, as they're called, and the split reads, or junction reads. Okay.
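As a rough illustration of the read categories just described, here is a minimal Python sketch. It assumes each read has already been aligned and summarized as the list of genes its alignment segments touch; the `AlignedRead` class and its `genes` field are hypothetical names made up for this example, not part of any aligner or fusion caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedRead:
    # Genes hit by the read's alignment segments, in read order (hypothetical field).
    # ["X"] means the whole read aligns within gene X;
    # ["X", "Y"] means the read itself is split between X and Y.
    genes: List[str]


def classify_fragment(read1: AlignedRead, read2: AlignedRead) -> str:
    """Label one paired-end fragment as 'unaligned', 'split', 'spanning', or 'normal'."""
    if not read1.genes or not read2.genes:
        return "unaligned"
    # Split (junction) read: a single read aligns partly to one gene, partly to another.
    if len(set(read1.genes)) > 1 or len(set(read2.genes)) > 1:
        return "split"
    # Spanning fragment (discordant pair): each read aligns entirely to one gene,
    # but the two reads land in different genes.
    if read1.genes[0] != read2.genes[0]:
        return "spanning"
    # Properly paired: both reads fall in the same gene.
    return "normal"


# The four flavors from the slide, top to bottom.
examples = {
    "both in X": (AlignedRead(["X"]), AlignedRead(["X"])),
    "X then Y": (AlignedRead(["X"]), AlignedRead(["Y"])),
    "split X|Y then Y": (AlignedRead(["X", "Y"]), AlignedRead(["Y"])),
    "both in Y": (AlignedRead(["Y"]), AlignedRead(["Y"])),
}
labels = {name: classify_fragment(r1, r2) for name, (r1, r2) in examples.items()}
print(labels)
```

Real fusion callers work from alignment records rather than pre-computed gene lists, so treat this only as a way to keep the four categories straight.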
So when using RNA-seq to find fusions, the goal is really just to take all the RNA-seq data that we get and then infer fusion genes from those split reads and those discordant reads. That's the job of all these different tools. And then once we have the fusion transcript, we're basically inferring that there's probably a fusion event that happened at the chromosome level. That's not always the case. You have something called trans-splicing, which would give you the same kind of fusion transcript but does not involve any chromosomal rearrangements. The splicing machinery can make mistakes and accidentally splice things together that maybe shouldn't go together. Maybe it's neutral and there's no impact. In other cases it might make a trans-spliced product that's actually playing an important role in the biology. But at the transcript level, you cannot differentiate trans-splicing from a gene fusion event. You can only get that if you look at the DNA level. But since trans-splicing is so rare, and given that we're studying cancer, we're coming in with a high probability that if we find a fusion transcript, it's due to a chromosomal rearrangement. There's probably some good Bayesian statistics argument about having a high prior or something like that. That's beyond me. Alright, so with RNA-seq data, there are a couple of key ways in which we might go about detecting these fusions, and this is the stuff you're going to be doing after lunch when we do our fusion lab. You'll be playing around with these different approaches. We'll start with our RNA-seq reads and we're going to align them to the genome. We're going to see, is there evidence that we might have a fusion? We're going to have those two flavors of reads. We're going to have the discordant read pairs.
That's a spanning fragment, where one read is aligned to one gene and the other read is aligned to a different gene. And we have split reads, where a single read has a split alignment between two genes. These are the two flavors of evidence that we're looking for to find fusions. The other way to do it is to first do de novo transcriptome assembly. We can try to reconstruct transcripts in a genome-free way, directly from the reads, and once we have those reconstructed transcripts, some of them might actually be fusion transcripts that we've reconstructed. We can detect them as fusion transcripts by aligning those transcripts, not the reads but the transcripts, to the genome sequence and seeing if we have evidence that there's a chimeric transcript. That'll give us evidence for a fusion. There are also assembly artifacts. I'm part of a group that built one of the popular assemblers, and they do generate lots of artifacts too. So telling the difference between an artifact and a real fusion has some challenges. But that can give us good evidence for a fusion. This is just driving the same point home. When you do the alignments, you're basically starting with these reads, and depending upon where the read was derived from in the fusion transcript, you could have a split read going to the genome. Or you can align to the transcripts themselves instead of the genome as a target, and in that case you find the same thing: part of the read is aligning to one gene, and part of the read is aligning to a different gene. Basically the key point here is that there are two ways to do this. One is to align to the genome, where you have to take into account the introns in the genes. The other way is to align to the transcriptome, where you don't have to worry about introns. You just align directly to the transcripts. But regardless of the approach you're taking, you're still looking for evidence from the reads to support your underlying fusions.
With the paired ends, in the case where you don't have a split read, you have a general idea that these two genes might go together, but not where the breakpoint is. You really need that split read, or junction read, in order to define where the breakpoint is in the transcripts. The other point here is that if you have a split read, you know which exons go together at the transcript level. But at the DNA level, the genome level, you don't necessarily know where that breakpoint is. Because the human genome is huge. Introns can be 10, 20, 30 kb long. Some of them are almost 100 kb or even longer than 100 kb. And you don't know exactly where that breakpoint is in the genome. You're not going to have that level of precision. You'll have a general idea, but you won't know. So even though you can detect the breakpoint in the transcripts, you don't necessarily know where that breakpoint is exactly in the genome. And sometimes you'll have alternatively spliced isoforms of fusion transcripts that give you different breakpoints at the transcript level. But again, at the genome level you're not going to know for sure. You'll have some hints. Okay, so if you map to the transcriptome, you're going to get a different alignment, a different percentage of your reads that align, than if you align to the genome. The choice of annotation you use, like whether you use UCSC annotations or RefSeq annotations, whatever you choose for your reference, is going to make a difference as well in terms of how many reads you're going to map and what your sensitivity is going to be for pulling out these fusion transcripts. In general we tend to use an approach where we use both the genome and the transcriptome. Instead of using the genome only or the transcriptome only, we can use tools that are basically targeting the genome but are transcriptome-aware. You're also giving it your reference annotation.
GENCODE is what I use, and that's what I recommend as your source for reference annotations. It's a very good quality product. Give it the genome, give it your reference GTF file, and use a tool that aligns the reads to the genome in a transcript-aware way, where it's taking into account the known splice junctions. That's going to be your best approach going forward. Yeah, lots of tools have been... Yeah, go ahead. Doesn't matter. It doesn't take into account anything to do with the coding level. Yeah, okay. Again, some of your fusions are going to give you fusion proteins. Others are going to give you basically just knockouts and such. Yeah, okay. So some of the first tools came out around 2008-2009. These are the ones that are currently available. I'm sure there are a lot more. These are the ones that I've come across. But occasionally you'll find a new one that was published in some obscure journal, and I can't get the paper because the PDF is behind a paywall. So there are just many, many tools available to do this. Every so often I just do a PubMed search and find a bunch of new ones. I've been working in this area for a few years now, and my general experience, especially when I first got started with this a few years ago, is that you pick a few different programs and they have different challenges in terms of their setup. When you run them there are different challenges, and when you get the results, you put the results of three programs in a Venn diagram. You've got one tool that's going to predict a whole bunch of fusions. Another predicts a moderate number, and another a small number. If you compare them, the predictions that they agree upon are generally a small fraction of all the fusions. Which one do you trust? Do you just take the intersection of two or more? How do you do this? Do you have to run three tools? Do you have to run four tools?
You've got like 30 or 40 tools to choose from now. There are a lot of tools out there. How do you do this? When you're running these things, there are other logistical challenges. One might take weeks to run. It'll probably crash in that time, or you get kicked out of your grid submission system or whatever you're using. And that's if the power doesn't go out. Another one you can't run on your laptop; you can only run it on a sophisticated compute farm using a specialized grid submission system. It's not set up for anyone to just come along, set it up, and give it a whirl. You have to have the environment that the people who developed the software are tuned into. It really limits the accessibility. Or maybe you can get it to run, but it takes you a long time and a lot of effort to get it going on your system. There are others where dinosaurs jump out and try to eat you up while you're trying to get it set up. There are just lots of things to take into consideration here. When all this fails... I love this paper. I think we're recording, so I'm going to stop saying certain things. What I like about this is that they ran a few different programs to find a fusion. They had a specific fusion they were looking for. They had some evidence going into it that this fusion exists. But when they ran these different programs, none of the programs called that fusion. So they basically turned to using grep. I heard you guys talking about grep earlier. It's just a search utility on Linux. You give it a text string, and it'll just find it. They took the fusion sequence they would expect to find if this fusion exists, and they basically just grepped for that sequence, like a 30 or 40 base pair sequence, in the raw reads. You just take a FASTQ file and you grep on it. This is like the last resort. You've tried everything else, so you say, I'm just going to do perfect string matching on this thing and see if I can find it. Lo and behold, they found evidence for it.
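The grep approach can be sketched in a few lines of Python. This is only an illustration of exact string matching against raw reads, assuming a plain four-line-per-record FASTQ; the junction sequence and reads below are made up. Unlike a bare grep, the sketch also checks the reverse complement, since a read can come from either strand.

```python
import io

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_supporting_reads(fastq_handle, junction_seq: str):
    """Yield (read_name, read_seq) for reads containing junction_seq on either strand.

    Assumes a simple 4-line-per-record FASTQ (no wrapped sequence lines).
    """
    forward = junction_seq.upper()
    targets = {forward, reverse_complement(forward)}
    while True:
        header = fastq_handle.readline()
        if not header:
            break
        seq = fastq_handle.readline().strip().upper()
        fastq_handle.readline()  # '+' separator line
        fastq_handle.readline()  # quality line, not needed for exact matching
        if any(target in seq for target in targets):
            yield header.strip()[1:], seq


# Tiny in-memory FASTQ with a made-up 10-base junction sequence.
junction = "AAACCCGGGT"
fastq_text = (
    "@r1\nTTAAACCCGGGTCC\n+\nIIIIIIIIIIIIII\n"   # contains the junction as-is
    "@r2\nGGACCCGGGTTTAA\n+\nIIIIIIIIIIIIII\n"   # contains its reverse complement
    "@r3\nACGTACGTACGTAC\n+\nIIIIIIIIIIIIII\n"   # no match
)
hits = list(find_supporting_reads(io.StringIO(fastq_text), junction))
print([name for name, _ in hits])
```

Exact matching like this tolerates no sequencing errors or variants, which is part of why it is a last resort rather than a first choice.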
They found some reads that support this fusion. So there is evidence in there, but the tools didn't find it, for whatever reason. So it's unfortunate. I'm trying to get hold of this data set. I'm curious to see how some of our tools will behave with it. It turns out this is just not a regular, typical fusion. This isn't like BCR-ABL or TMPRSS2-ERG or one of the common fusions that we're familiar with, which are pretty clean and pretty easy to find. There are some complexities with this guy. I'm very curious about it. But I love this paper. I just love this paper. Anyway. Yeah. Sources of false positives. You have different kinds of artifacts. There are alignment artifacts. You have chimeric read artifacts just from your library prep. You get ligation artifacts and reverse transcriptase template switching. And there are biological artifacts. We're searching the regular standard reference genome. Now, some of the newer alignment tools will take into account natural variation, the natural human variation that exists. So you're actually searching something like a graph genome instead of just a linear set of characters. We know there's lots of structural variation. There are polymorphisms. There are differences between individuals. Some of these tools can actually capitalize on that, so you actually get better alignments. But if you don't take that into account, you'll get false evidence. Sometimes you'll get mismappings that are indicative of a fusion event. If you had included the fact that there are polymorphisms at this place in different populations, or structural variations that we know about, then you would have gotten the correct alignment and it would not have been a false mapping suggesting a fusion. And you have things like trans-splicing that occur. So there are a lot of peculiarities there. There are things you can do. I know we only have a few minutes left before we go to lunch, so I'm going to try to run through them.
So there are various filters that some of the newer tools will apply. They screen things against databases of known artifacts, or fusions that routinely turn up in normal data sets. Consider the strength of the evidence. If you have 100 million reads and you only find one or two reads supporting a fusion event, is that really relevant? It could be just that the deeper you sequence, the more artifacts you're going to see. Eventually you're going to have some of these artifacts agreeing with each other. So you really should take that into account. It's kind of like expression. If you have a gene that's lowly expressed, you're going to pick up evidence for it as you sequence deeper. But eventually you sequence so deep that it looks like the entire human genome is expressed. And we know that that's not the case. That caused a lot of controversy a few years ago. What is meaningful expression? Is every base in the human genome transcribed? There's meaningful expression, and then there's just basal transcription that probably makes no difference, in addition to artifacts. Taking that into account is important. Some other things: we have a tool called FusionInspector that will do a supervised view. So if you give it a fusion pair and say, okay, find me the evidence for the BCR-ABL fusion, it will do that. It will make fusion contigs and allow us to visualize them. We can put it up in IGV. We can see the fusion genes. We can see the evidence that corresponds to the fusion partners. We can do de novo transcriptome assembly to reconstruct those fusions. But I really like this. It's one thing to look at a report and see, okay, I've got two reads that support this as junction reads. I want to see the evidence. Show me the reads and where they align. That's something we're going to play with in the afternoon. There are other visualization tools you can use. We can prioritize fusion candidates in a number of ways.
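To make the point about evidence strength versus sequencing depth concrete, here is a minimal sketch that scales the supporting read count by library size. The function names and the 0.1 cutoff are assumptions chosen for illustration, not the settings of any particular tool.

```python
def fusion_fragments_per_million(junction_reads: int,
                                 spanning_frags: int,
                                 total_reads: int) -> float:
    """Supporting fragments per million sequenced reads (illustrative metric)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return (junction_reads + spanning_frags) / (total_reads / 1_000_000)


def passes_evidence_filter(junction_reads: int,
                           spanning_frags: int,
                           total_reads: int,
                           min_per_million: float = 0.1) -> bool:
    """Keep a candidate only if its depth-normalized support clears the cutoff.

    min_per_million=0.1 is a hypothetical threshold for this example.
    """
    support = fusion_fragments_per_million(junction_reads, spanning_frags, total_reads)
    return support >= min_per_million


# Two supporting fragments mean different things at different depths.
deep = fusion_fragments_per_million(1, 1, 100_000_000)    # 100 million reads
shallow = fusion_fragments_per_million(1, 1, 10_000_000)  # 10 million reads
print(deep, shallow)
print(passes_evidence_filter(1, 1, 100_000_000), passes_evidence_filter(1, 1, 10_000_000))
```

The same two reads clear the cutoff in the smaller library but not in the deeper one, which is the intuition behind asking whether one or two reads out of 100 million are really relevant.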
If you're screening hundreds of patient samples that are all a certain type of cancer, and you find that there's a certain fusion that keeps showing up multiple times in this cancer, then you might consider that to be a good indicator that it's playing a role in that cancer. If you only see it once, then depending on the expression, you might or might not consider it a candidate. Balanced rearrangements. If you find a fusion event, you might find the reciprocal fusion event. That gives you more evidence to support that a translocation happened. Strength of the evidence. The types of genes that are involved: if you find kinases, you might want to consider that, because then you have kinase inhibitors. They're treatable in some cases. Other genes that show up are transcription factors that are the usual suspects. Take them into consideration. There are papers that have been published in the last few years that are doing pan-cancer studies across all cancers, like in TCGA. So there are lots of good resources being developed. And also more insights into what kinds of fusions you find in different cancers. What fusions are relevant to specific cancers? What fusions are maybe relevant to cancer, but you find them in a bunch of different cancers, so they aren't really cancer-specific in that way? Correlations of fusions with measurements of genome instability. So these are all the kinds of things that people are looking into right now. People are building large collections of fusions, big databases now, like ChimerSeq. This is the one that has 30,000 gene pairs. But again, most of those are just predicted. They don't have experimental evidence; they're just predicted by different programs. You might find some promiscuous partners, genes that like to fuse with other genes. Depending upon what study you're looking at, you'll find some cancers where certain gene partners show up time and time again. There's a pair that are always fused.
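Two of these prioritization ideas, recurrence across samples and promiscuous partners, come down to simple counting. The sketch below assumes fusion calls are available as (left gene, right gene) pairs per sample; the sample names and gene names are placeholders, not real data.

```python
from collections import Counter, defaultdict


def recurrent_fusions(calls_by_sample):
    """Count how many samples each fusion pair was called in."""
    counts = Counter()
    for pairs in calls_by_sample.values():
        for pair in set(pairs):  # count each pair once per sample
            counts[pair] += 1
    return counts


def partner_counts(calls_by_sample):
    """Count the distinct fusion partners seen for each gene."""
    partners = defaultdict(set)
    for pairs in calls_by_sample.values():
        for left, right in pairs:
            partners[left].add(right)
            partners[right].add(left)
    return {gene: len(others) for gene, others in partners.items()}


# Placeholder calls for three samples of the same cancer type.
calls = {
    "sample1": [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")],
    "sample2": [("GENE_A", "GENE_B"), ("GENE_A", "GENE_E")],
    "sample3": [("GENE_A", "GENE_B"), ("GENE_A", "GENE_F")],
}
recurrence = recurrent_fusions(calls)
promiscuity = partner_counts(calls)
print(recurrence.most_common(1))  # the pair seen in the most samples
print(promiscuity["GENE_A"])      # number of distinct partners for GENE_A
```

A pair that recurs in many samples and a gene with many different partners are both worth a second look, though neither count by itself says the fusion is real or functional.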
Or one of the genes is often found fused with other partners. So we can take that into account. Fusion databases, so TCGA, COSMIC. There are these nice collections of fusions that are known to be relevant to cancer biology. We've got a tool that you'll play with later called FusionAnnotator. You give it the fusion gene pair, and it'll tell you, hey, we've seen this in COSMIC, or we've seen this in TCGA and other databases. So it's basically my way of trying to easily annotate fusions according to what's been previously reported. It'll be able to flag fusions that could be relevant, particularly when studying patient samples. If you have expression information, look at it and see if it supports the fusion. Open it in IGV. Look at the expression profile. If you have a prediction for a fusion that looks like this, and you see that at the predicted breakpoint we basically have a loss of expression, that's a good sign that there's something peculiar going on here, and maybe it is a fusion gene that's well supported by the RNA-seq data. So that's a good thing to do. Reading frames. We have a tool, well, it's an option, to actually examine coding effects. So if you're curious whether this fusion event makes a fusion protein, that's something we might be interested in. We have some reports that will basically indicate whether this looks like a fusion protein. We have databases of fusions that are probably not relevant to cancer. These are just fusions that show up in normal data sets, like GTEx. GTEx is a great collection of transcriptomes from normal samples. If we see one of those in a cancer sample, maybe we're going to discard it because it's really not critical. All right, let me just see what else we have here. People are getting hungry. All right, so this is what we're going to do after lunch. We're going to take our RNA-seq data. We're going to use STAR-Fusion to find fusion predictions.
We're going to use FusionInspector to visualize the evidence for those predictions. Then we'll use FusionAnnotator to see if any of them have shown up before in different databases or are known to be relevant to cancer biology. And we'll look at the fusion coding effect to see if it looks like it's making a fusion protein or not.
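For the coding-effect question, the core check is whether the reading frame of the 5' partner continues into the 3' partner. The sketch below shows only that arithmetic, under the assumption that you already know how many coding bases each side contributes at the junction; the function and parameter names are hypothetical and this is not how any specific tool implements it.

```python
def is_in_frame(left_cds_bases: int, right_cds_offset: int) -> bool:
    """Check whether a fusion junction preserves the reading frame.

    left_cds_bases:   coding bases retained from the 5' partner, counted from
                      its start codon up to the junction.
    right_cds_offset: coding bases of the 3' partner that lie upstream of the
                      junction and are therefore lost (0 = its CDS starts at
                      the junction).
    The frame is preserved when the first retained base of the 3' partner
    lands on the same codon position it had in its own transcript.
    """
    if left_cds_bases < 0 or right_cds_offset < 0:
        raise ValueError("coding base counts must be non-negative")
    return left_cds_bases % 3 == right_cds_offset % 3


# 5' partner contributes 300 coding bases (100 whole codons).
print(is_in_frame(300, 150))  # 3' partner resumes on a codon boundary: in frame
print(is_in_frame(300, 151))  # 3' partner resumes one base into a codon: frameshift
print(is_in_frame(301, 151))  # both sides break after codon position 1: in frame
```

An in-frame junction suggests a possible fusion protein, while a frameshift usually points toward a truncated product, which connects to the earlier remark that some fusions give fusion proteins and others are basically knockouts.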