Okay, welcome everyone to module four. This module follows on pretty nicely from what Jared presented to you guys. It's about one of the most important effects of genome rearrangements, which is gene fusions. They're very important events in cancer, so we're gonna talk about these events and how to detect them. So here's an outline of our objectives. We'll explore the impact of gene fusions in cancer and learn about the different types of evidence we'll be looking for when we're predicting these events. We'll try to understand the differences between the available detection methods and tools. We'll understand how to identify false positives, which is an important aspect of working with these data sets, because they're often replete with false positives. And then we'll look a little bit at assessing a gene fusion's potential oncogenicity.
So just to give a broad definition, a gene fusion is a novel gene formed by the fusion of two distinct wild-type genes. The canonical example here is BCR-ABL, which is probably the first somatic event found in any cancer that was attributed to the cancer biology. Originally it was found as something called the Philadelphia chromosome, discovered by Nowell and Hungerford in Philadelphia, hence the name. And then later they found that this translocation actually fused the BCR and ABL genes and produced this gene fusion.
They're relevant in cancer as prognostic markers. Taking our BCR-ABL example, it's a prognostic marker for 90 to 95% of CML patients. BCR-ABL1 is also a success story for being able to target drugs towards these gene fusions, because we can now target the gene fusion with imatinib. And I think one of the reasons for the resurgence in excitement about gene fusions is the newly available sequencing platforms that make it easier to identify these events.
The evidence that these are important events in cancer is that they correlate with the cancer phenotype, which is why they're such good prognostic markers in some types of cancer. Successful treatment of the cancer will often eradicate any evidence of the gene fusion. So if we treat CML with imatinib, then we can no longer detect the Philadelphia chromosome in the patient's blood. When we're using, say, mouse models or cell lines, if we transfect with a gene fusion, then we will often produce a neoplastic disorder. And silencing fusion transcripts will reverse tumorigenesis in model systems.
So if we want to classify gene fusions, what kind of different classes can we come up with? One is deregulation of a proto-oncogene. The example here is MYC fusing to IGH. The five prime regulatory elements of IGH get fused to the three prime exons of MYC, producing a functional MYC that's up-regulated because now it's controlled by the regulatory elements of IGH. We can also have alternate forms of deregulation. For instance, there's the recently discovered MYB fusion, in which MYB is up-regulated because its three prime UTR is replaced. The three prime UTR contains microRNA binding sites, and thus the cell is no longer able to regulate MYB expression and translation through the microRNA binding mechanism.
Another class of fusions is the formation of a chimeric hybrid gene. This is where the fusion gene itself is more than just the sum of the two parts. BCR-ABL1 is an example of this: putting these two genes together produces a new gene with a sort of distinct function. Another way these events can affect the biology of the cancer is, for instance, when we produce a gene fusion that's non-functional and the original function of one of the partners, perhaps a tumor suppressor function, is lost.
And I think one of the most interesting examples I've seen recently is this MYB-QKI rearrangement, where really all three of the mechanisms we just described come into play. We've already seen that MYB can lose its three prime UTR and be deregulated, so that's happening. We're also disabling QKI, which is a tumor suppressor, by cutting it in half with this rearrangement. And then the new fusion gene, MYB-QKI, has a distinct fusion gene function.
So when we're thinking about how to find these events, what are the available types of evidence? What are the genomic effects that we can look for? Chimeric DNA sequence, chimeric mRNA sequence, and also changes in expression of the genes involved.
The discovery platforms started out, I guess 60 or so years ago, with cytogenetics. These techniques are labor-intensive and low throughput, but they were the ones that originally led to discovering gene fusions. They include chromosome banding analysis and spectral karyotyping, which basically enable us to look at a high level at which chromosomes are translocated, although then there's a lot of work that has to go into finding the actual genes that have been involved and disrupted. Another cytogenetic technique is fluorescence in situ hybridization, so FISH. This is still quite relevant for fusion discovery, I'd say, because it's very useful as a validation technique. For this technique, we basically take probes that target specific genes and fluoresce with different colors. And then we produce an image in which we can see whether or not two genes are coincident because their two colors overlap, as you can see here, where we're showing the signal of BCR-ABL, which is the green and red that's overlapping.
Okay, so I'll mention expression arrays as a discovery platform, basically because expression arrays were used to find the ETS fusions in prostate cancer. They were used for that very big discovery, but I would say they're no longer used in this way; RNA-seq is used more than expression arrays for fusion discovery. What they did in this analysis was look for outlier expression of specific genes. They found that ETV1 and ERG, the green parts of this histogram, showed outlier expression, and that the outlier expression was mutually exclusive across their samples. Then they were able to attribute that finding to a translocation that put either of those genes next to TMPRSS2. This particular fusion is pretty important because it was one of the first found in a solid tumor. Previously, gene fusions were thought of as events that happened in sarcomas and blood disorders, so this got everyone pretty excited about trying to find other recurrent fusions in other solid tumors. Since solid tumors make up the vast majority of the morbidity due to cancer, perhaps there were other fusions to discover there.
Genome sequencing has also been used to discover fusions, such as these in colorectal cancer. It's comprehensive, although one of the issues whenever we find a breakpoint, as you guys did in the previous lab, is that we're never sure of the larger picture. Maybe it connects two genes, but there could be other breakpoints, it could be a complex rearrangement or just an insertion leaving the genes intact, and we don't know whether or not the product is expressed.
And then finally, mRNA sequencing is probably the platform of choice to discover gene fusions, because it's inexpensive and it provides information about the expression of the genes involved. It gives you an exact nucleotide level prediction of the fusion sequence.
Although it doesn't provide quite as much information as genome sequencing: it doesn't give you the translocation breakpoints, et cetera. I'm just showing an example here of MAST and NOTCH fusions that were found in breast cancer, but it's been used to find many thousands of fusions in recent years. So this graph is showing that, using guided approaches, we were sort of trundling along, finding maybe 50 fusions a year, and as soon as we started using mRNA sequencing, I guess it was 2014, there were reportedly 7,800 fusions discovered.
So, yes, I guess if it's not being expressed in the sample at the time point when you take your sample, there's a question of whether or not it's relevant. I guess there is a question of evolution there too, because it could be relevant in a specific microenvironment or after some selective pressure like a drug. So you could miss the relevance to a specific environment. Yeah, that's something to be careful of.
So I'll just review how RNA-seq data is generated. In general with the sequencing platforms, specifically with Illumina, you often take some kind of different library preparation and tack it onto the regular sequencing chemistry pipeline to get a different kind of sequencing. In this case, we're targeting the messenger RNA by first doing a pull down, where we pull down anything with a poly-A tail. Then we're doing reverse transcription to turn it into cDNA, and then we're running just the regular Illumina protocol, where we fragment the cDNA, do a size selection step, and then sequence each end.
Okay, so what happens when we apply that to the transcriptome of a tumor sample? Well, say we have a rearrangement that combines chromosome A and B.
The first thing that happens, if we have a fusion between gene X and gene Y on A and B, is that the fusion gene will be transcribed, and then the splicing machinery will take effect and splice out the introns. So what we're actually sequencing is the transcript shown here in the middle. The way I've drawn it here is actually the way this happens in most of the functional fusions. Because the introns make up the vast majority of the gene sequence, most often the translocation breakpoint falls within an intron of each of the two genes, and then the actual genomic breakpoint is spliced out by the splicing machinery. So often what we're sequencing is not the actual genomic break; we're sequencing the boundary between the two genes at the exon boundaries.
Then we sequence these transcripts and we can get three types of reads. There are wild type reads, which are indistinguishable from reads that come from the wild type gene X and gene Y. There are what I call spanning reads, and I think we talked about this type of read last module: these are ones where the entire read on one end maps to one gene and the read on the other end maps to the other gene. And then there are split reads, where one end maps entirely to one gene and the other end is half and half gene X, gene Y.
So to recover these events, broadly speaking, there are two possible ways to deal with this data. One is alignment, which I think is predominantly what people do. I'll also mention assembly. The way I think about it, these are two opposite ways of doing this. In the alignment based approach, we independently align each read and then cluster that data into contigs or transcripts based on the alignment of those reads to the reference genome, and that's this path on the left.
With assembly, we instead first cluster the reads according to how similar they look, then turn the resulting clustered reads into longer contigs, and then align those to the reference genome. The way in which they're grouped together is just by looking at overlaps between the reads and sort of tiling them out to make longer sequences. Okay, so I'm gonna talk mainly about alignment, because those are predominantly the methods that have been used to discover fusions thus far in the literature.
One of the problems here is, again, splicing, when we're trying to align RNA-seq data and especially when we're trying to look for fusions. There are two ways in which we have to deal with non-contiguous alignments of pieces of the reads. One is because our reads come from transcript sequences. If we try to align these to the genome, then some of those reads will be split by the introns, so half of the read will map to one exon and half to an adjacent exon. And then the other, of course, is the reads that we're looking for, the ones split by the fusion, where one part will map to gene X and one to gene Y.
We can make our lives a little bit easier if we assume we know everything about the gene models. We can align just to the gene transcript sequences themselves, and that gets around the problem of aligning across introns. But in general, what people do is align to a combined reference that is both the transcriptome and the genome. And here we're showing that in terms of mapping rates, which is what's on the left: using the transcriptome and the genome, we're getting the highest mapping rates. This middle column is with no transcriptome, and we're getting poorer mapping rates. And we're definitely doing better with a transcriptome and genome reference than with the transcriptome only, where we're not mapping as many reads.
And on the right, we're showing a similar situation for the percentage of reads that cannot be mapped uniquely. For this, using just the transcriptome, and especially RefGene and UCSC, we actually get a lower number of reads that are not uniquely mappable, but this is at the expense of not being able to map a lot of the reads at all. So generally a transcriptome and genome reference is preferable.
Yes, so genome alone is this column. It's a bit of an awkward figure, but yeah, "none" down here means they're not using a transcript database at all, and the same over here. So on the left we have the percentage of reads that can be mapped, and on the right, the percentage of reads you cannot map uniquely.
Okay, so in a lot of the tools we'll talk about, the first step is to find spanning reads. These are reads where one end maps fully to gene X and one to gene Y. This gives us preliminary information about the approximate region where the breakpoint occurs, and just in general tells us that perhaps there's an event involving gene X and gene Y. But the problem with just this analysis is that it doesn't give us an exact fusion transcript sequence. It just tells us that gene X and gene Y are somehow involved in a fusion. So the next step in most of these tools is to refine the approximate breakpoint that you get from the spanning reads by trying to find reads that are exactly split by the fusion boundary. I think the earliest tools would do this by taking all of the exons from gene X and all of the exons from gene Y, pairing them up in all possible combinations, and then doing a subsequent alignment step, using, say, BWA, to align to all of those possible combinations and find out which pair of exons is supported by a split read. So that's one way you can do it.
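As a rough illustration of that all-pairs exon strategy, here is a minimal Python sketch. It is not how any particular tool implements it: it assumes gene X is the five prime partner, it uses exact substring matching in place of a real aligner, and the function names, the `flank` length and the `min_anchor` threshold are all made up for this example.

```python
from itertools import product


def build_junction_library(exons_x, exons_y, flank=20):
    """Pair every exon of gene X (assumed 5' partner) with every exon of gene Y.

    exons_x / exons_y map an exon name to its sequence. Each junction keeps
    only the last `flank` bases of the X exon and the first `flank` bases of
    the Y exon, plus the position of the boundary inside the junction.
    """
    library = {}
    for (name_x, seq_x), (name_y, seq_y) in product(exons_x.items(), exons_y.items()):
        left = seq_x[-flank:]
        right = seq_y[:flank]
        library[(name_x, name_y)] = (left + right, len(left))
    return library


def count_split_read_support(reads, library, min_anchor=5):
    """Count reads that match a junction exactly and cross its boundary.

    A read only counts if at least `min_anchor` bases fall on each side of
    the boundary. Exact matching is a simplification; a real pipeline would
    use an aligner that tolerates mismatches.
    """
    support = {}
    for read in reads:
        for pair, (junction, boundary) in library.items():
            pos = junction.find(read)
            if pos == -1:
                continue
            if pos + min_anchor <= boundary <= pos + len(read) - min_anchor:
                support[pair] = support.get(pair, 0) + 1
    return support
```

With two toy exons per gene, a read that crosses the X2/Y1 junction is counted for that pair only, and a read lying entirely inside one exon is ignored. The number of junctions grows as the product of the exon counts, which is one reason later tools moved to more targeted strategies.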
I guess there are other more refined ways that have also been proposed. In general, this is a hard problem though. I may have switched the slides around on you guys, sorry about that. So matching a subsequence exactly is mostly a solved problem: there are now algorithms that can do this quite quickly, and they can do it in quite a bit less memory than previously, using compressed suffix arrays, et cetera. When we get to more features that distinguish the read from the reference sequence, such as indels and mismatches, then this problem becomes a little bit harder. And I would say the hardest problem is when you also have the possibility that sub-segments map to different parts of the reference. This is why split read analysis just by itself is quite difficult.
So what these tools generally do is try to take the hard problem and turn it into the easier problem by segmenting the reads, basically taking small sub-segments and then trying to align each of those sub-segments exactly. So here we're taking this read at the top and chopping it up into three pieces, two of which map exactly. Then, to identify what the exact alignment is, we do a more refined analysis using the information about where each of these segments maps. One way we can do that is to use something called dynamic programming, which is the most sensitive way of producing an alignment, but it's also the most time consuming. It's not really tractable to take one read and use dynamic programming to align it to the full reference genome.
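To make the segmentation idea concrete, here is a small Python sketch of chopping a read into pieces and looking each piece up exactly. It is only an illustration: the function names are hypothetical, and `str.find` over a dictionary of sequences stands in for the indexed exact-match search a real aligner would use.

```python
def segment_read(read, n_segments=3):
    """Chop a read into n_segments pieces; the last piece takes any remainder."""
    size = len(read) // n_segments
    segments = []
    for i in range(n_segments):
        start = i * size
        end = len(read) if i == n_segments - 1 else start + size
        segments.append((start, read[start:end]))
    return segments


def exact_hits(segment, references):
    """Return every (reference name, position) where the segment matches exactly."""
    hits = []
    for name, seq in references.items():
        pos = seq.find(segment)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(segment, pos + 1)
    return hits


def locate_segments(read, references, n_segments=3):
    """Map each segment of the read independently, keeping its offset in the read."""
    return [
        (offset, exact_hits(segment, references))
        for offset, segment in segment_read(read, n_segments)
    ]
```

For a read built from the end of one toy gene and the start of another, the first segment hits only the first gene, the last segment hits only the second, and the middle segment has no exact hit because it straddles the boundary. That middle piece is what the more sensitive refinement step has to resolve.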
So instead, based on the information that we've found from mapping the sub-segments, we can construct a sort of pseudo reference by taking the two genes and putting them together approximately, and then use dynamic programming on this smaller pseudo reference sequence.
Sure, yep, yep, definitely. Yes, so then you inevitably want to have a set of reads that align exactly across the fusion boundary. Especially with fusions, it's pretty important to have the exact transcript sequence so that you can understand what its function is.
Okay, so this table is more for reference. There are quite a lot of tools that will do fusion discovery for you, including my own, at the bottom here, called deFuse. I'll describe the columns really briefly. Which tools produce an exact sequence? Which tools use the strategy of breaking the read up into segments to get a better idea of what the exact fusion sequence is? Which ones leverage paired end read information? Which ones use this technique where they reconstruct the exact fusion sequence using an approximate reference, versus the earlier alternative, which is to combine all pairs of exons and try a regular alignment to that database of all pairs of exons? And then the last column just tells you whether or not the secondary alignment to find split reads is robust to, say, a split read that also has a small indel or a small mismatch in it.
There are also assembly-based methods for finding gene fusions. To assemble, you can use something like Trans-ABySS or Trinity. Then the process is basically to map the contigs that you produce from those assembly tools using something like GMAP or BLAT, and then finally post-process, trying to understand which ones are false positives versus which ones are true fusions. There's a pipeline called Barnacle that will do that for you.
I think one of the problems we've had when using assembly techniques is that they're good at producing long consensus contigs for the main transcript in your sample that represents a particular gene. But the splice variants of those transcripts, and in some cases the fusion transcript, are just resolved as very short subsequences that give the alternate path. Say you have a full transcript reconstructed for gene A and a full transcript reconstructed for gene B; then the fusion information in the contigs that you get out of these tools is just a short subsequence of A and a short subsequence of B. And so it's just as difficult to find a fusion based on those short subsequences as it is to find a fusion based on the original reads. So I think the long and short of it is that it's not a completely solved problem.
So you're asking about heterogeneity, and yeah. Heterogeneity is, I think, something that's tractable when looking at genome sequences, because we have some coverage expectations. We don't really have those with RNA-seq. Things are expressed at such drastically different levels that sometimes it's difficult to even distinguish the normal population that contaminates your sample from the tumor population. So really I don't think we're gonna be able to understand heterogeneity from RNA-seq until we do single cell RNA-seq, that's my opinion. Yes, yeah, I have a whole sequence of slides on this.
Okay, continuing on with all the different tools that are available. There have been a number of evaluations. I think what comes out of these is that the number of predictions coming out of a lot of the tools is sometimes very large, and most of those are likely false positives. This is showing a comparison between six different tools. TopHat-Fusion here is producing 136,000 predicted gene fusions.
deFuse, my own tool, is producing 915, which is still probably too many to trawl through. And they have variable performance in terms of pulling out all of the actual gene fusions in the data set that they're presenting here.
Oh, so yes, what is the key here, you mean? How do you know? The ground truth here, I think, was that they took a cell line that was fairly well understood, in that they had done a lot of PCR experiments to try and validate the fusions, and then just tried to understand how many of those fusions were rediscovered by these different tools. Does that answer your question? And then on the right I'm just showing that the main result from the simulations they did in this paper is that simulations are not very reliable. Simulations of RNA-seq data are often too perfect.
Yeah, I think there were 19 original fusions, and ChimeraScan identified all of them in this data set, although it did produce 13,000 other predictions. So there's a question of sensitivity versus specificity there. It's not the most specific.
Yes, this is a bit confusing, and I would just look at them all, because what they've done here is try to look at which tools produce a prediction of which is the five prime gene and which is the three prime gene, but a lot of tools don't even bother making that prediction. So I don't think that part of the results has any validity, personally.
So RNA-seq, in my opinion, is a pretty difficult type of data to work with. Data that comes from different cell types will have different artifacts, because a whole new set of genes is being expressed, and it is quite tricky.
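To put rough numbers on that sensitivity versus specificity trade-off, here is a small worked example in Python. It uses the approximate figures quoted above for ChimeraScan (19 validated fusions, all recovered, plus about 13,000 other predictions) and, as a simplifying assumption, treats every one of those other predictions as a false positive, which may be too harsh since some could be real but unvalidated.

```python
def precision_and_recall(true_positives, false_positives, false_negatives):
    """Precision: fraction of predictions that are real.
    Recall (sensitivity): fraction of real fusions that were predicted."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Approximate figures from the talk; the 13,000 "other" predictions are
# assumed here to all be false positives.
precision, recall = precision_and_recall(
    true_positives=19, false_positives=13_000, false_negatives=0
)
print(f"recall = {recall:.0%}, precision = {precision:.2%}")
```

Under that assumption recall is 100% while precision is roughly 0.15%, so only about one prediction in 700 would be a validated fusion. That is why the filtering of false positives matters so much.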
So I'll just list some of the different sources of false positives that come through in this data, because often you'll produce some predictions and then you're only really a quarter of the way there: you have to try and understand where the false positives are coming from and filter things down.
The technical artifacts here are alignment artifacts. You could have two genes that are quite similar, say A and A prime, and one end maps to A and the other end has a mismatch, so that instead of mapping next door in A, it maps to A prime. So homologous genes can produce a lot of artifacts. And that is confounded by high expression. If you have a lot of reads produced by a particular gene, perhaps a small percentage of those reads have errors in them that will cause them to map to another part of the genome and nominate a gene fusion. So that's something to be careful of.
Within the chemistry, there are ways in which we can get what look like chimeric reads. During reverse transcription you can get template switching, and during library preparation you can have pieces of sequence that randomly ligate together. The good thing about these types of artifacts is that they're quite random, so just by clustering reads together and filtering things that have very low read support, you can usually remove things that are produced in the chemistry.
I also need to mention biological artifacts such as natural sources of rearrangement. IG rearrangements, transposons, mitochondrial insertions, things like this are natural sources of rearrangement, and they can often be mistaken for gene fusions.
Yes, I would have a mixed answer to that. So if you increase the depth, do you get more artifacts or fewer artifacts? Some of the artifacts are definitely generated by regions with very high depth, as a consequence of having a lot of reads in that one region.
So if you have even a tiny percentage of reads that don't map correctly, that's still gonna look like a lot of read support for a false positive fusion. But if you do have higher coverage, then you can have a higher threshold on things like read count, and so you can get rid of some of the more random artifacts. But always, when you're sequencing to more depth, you have to worry about whether there's some confounding factor that is non-random. Things that are randomly distributed will just be filtered out as background noise. But for things that are completely non-random, if there's some process that's taking reads and putting them in a specific location because of a specific repeat or something, then those won't get any better with more sequencing.
Yes, and the last thing I'll mention about biological artifacts is transcription induced chimeras, or read-throughs. These happen in benign tissue and in tumor tissue. You get pairs of genes where, perhaps because the chromatin is open, those genes end up being transcribed together, and then the splicing machinery splices them as if it's a regular gene, and you essentially get a very long gene that's basically two adjacent genes.
All right, so dealing with alignment artifacts first, how do we identify these and robustly filter them out? The approach we used with deFuse was to train a classifier. The first thing we did was to predict a number of fusions, pick among them as randomly as we could, and then validate them, so that we could understand what the signal is for a false positive and what the signal is for a true positive. The next step was to train a classifier that could do the filtering automatically for us. I'll just briefly describe some of the indicators of an artifact, and these are presented in order of their importance according to training the classifier.
So the first is whether the reads that you've aligned to the fusion transcript stack up all in the same place or tile uniformly across your fusion boundary. If they're tiling at different locations across your fusion boundary, then this is an indicator of a good prediction. In the histograms here, towards the right is well distributed across the fusion boundary and towards the left is not well distributed, so all in one place, and red is the false positives and green is the true positives.
Another signal of a good prediction is whether or not we have a canonical splice site signal at the fusion boundary. The splicing machinery looks for the signal of GT at one end and AG at the other, and removes what's in between as an intron. So that is a good indicator of a true positive.
If we cannot uniquely assign most of the reads at one end to one gene, especially if one end always has many other mappings for the reads that support our fusion, then that's a sign that we have a false positive.
Another thing that people do is reconstruct a fusion transcript. This is something people do with both assembly and alignment based approaches: you reconstruct your fusion transcript, realign your reads back to it, and for the ones that span or are split by the fusion boundary, you look at whether or not those reads have the same fragment length distribution as your wild type reads.
Finally, analogous to how split reads distribute across your fusion boundary, we can look at whether spanning reads align all in one place or are well distributed across your fusion boundary.
So, sorry, what is your question? What's the difference between those two? Is this one examining whether they're well distributed or not? Yeah, totally. Okay. Straddle.
Yeah, so this one is looking at how long the fragments are when you map them back. Oh, the fragments. Yeah, so what is the insert size when you map them back to the fusion transcript? And the other one is just looking at how well they distribute across the fusion boundary, yeah.
Okay, and finally the last indicator of an artifact is something that's a little bit tricky. If we look at how well each side of the fusion maps back to the genome, a good indicator of a solid fusion prediction is that we can take the predicted fusion transcript and divide it into one half that goes to one gene and the other half that goes to the other gene, almost precisely, with a distinct fusion boundary. In comparison to that, it's possible that, say, the first 75% maps to one gene and the last 75% maps to the other gene. Then we have this big chunk in the middle that could go either way: it could go to gene A or gene B. If that's true, then that's often an indicator of an alignment artifact producing a fusion. If you take two random subsequences of the genome, concatenate them together, and align them back to the genome, you wouldn't expect much of the sequence from this location to also be able to align to this location, right? Just by random chance, maybe you would expect a couple of nucleotides from sequence B to also be able to map to sequence A. But unless there's quite a lot of repetitiveness, the sequence at the fusion boundary should map uniquely.
So far, we're all separated by interest. Yeah. If there's actually a fusion. We'll get some experience in the lab with this kind of thing. Yes, but by chance you could have, say, sequence A and sequence B.
This could have a T at the end of it that could map equally well on this side or this side, giving some ambiguity in where the fusion boundary is. Yeah, we can work through that in the lab tutorial though.
Okay, so for natural sources of rearrangement, I think the thing to do is to annotate the things you predict as comprehensively as you can from databases such as RepeatMasker and IG gene lists, and be able to filter out things that you identify as not necessarily gene fusions but as coming from sources of rearrangement other than chromosomal translocations. And read-throughs are easily identified because they involve adjacent genes, so those can be flagged.
To prioritize candidates from any prediction tool, we can look at expression. So how highly expressed is the fusion, and specifically the three prime gene, which is often the gene that provides most of the function to a gene fusion? And if we look at the expression of the exons across each of the genes, is there a change in expression at the fusion boundary? If the fusion itself is much more highly expressed than either of the wild type genes, then you should expect the three prime exons of one gene to be highly expressed and the five prime exons of the other gene to be highly expressed. For recurrence, we can look at whether the same fusion pair has been identified across our cohort, or whether one gene is consistently fused to other genes across our cohort. And if we have whole genome data, we can also look for a corroborating rearrangement.
We can look at gene function. Has the gene been implicated in cancer? Does it involve a kinase? Could it serve as a drug target? Those may be things you would want to annotate your data set with. And also, does your fusion preserve the original function of the two genes? This can be done by looking at domains.
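Going back to the point about flagging read-throughs, here is a minimal Python sketch of what such a flag could look like. It is an assumption-laden illustration: the `Gene` record, the function name and the `max_gap` threshold are all hypothetical, and genomic distance is used as a stand-in for "adjacent", whereas a real annotation step would check that no other gene lies in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate


def is_possible_read_through(gene_5p, gene_3p, max_gap=100_000):
    """Flag a fusion whose partners look like neighbours transcribed together.

    The partners must be on the same chromosome and strand, with the 3'
    partner downstream of the 5' partner (in the direction of transcription)
    and no more than max_gap bases away.
    """
    if gene_5p.chrom != gene_3p.chrom or gene_5p.strand != gene_3p.strand:
        return False
    if gene_5p.strand == "+":
        gap = gene_3p.start - gene_5p.end
    else:
        # On the minus strand, downstream means lower coordinates.
        gap = gene_5p.start - gene_3p.end
    return 0 <= gap <= max_gap
```

A candidate flagged this way is not necessarily a false positive, since a small deletion between neighbouring genes would look the same, so it is safer to flag these than to discard them outright.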
One of the most important things, though, is whether fusing these two genes together preserves the reading frame of each of those genes, because there are ways in which they can fuse where the three prime gene especially, when translated, becomes nonsense and is not the original peptide sequence for that part of the gene.
When looking at which gene fusion partners are consistently fused, the ones we know are functional are often tyrosine kinases; that's one class of gene fusions. Transcription factors are another class that are commonly fused. This includes the ones that were found in prostate cancer, and oncogenes that are upregulated. We can also build a network of the gene fusions that have been found to date. This network was actually produced, I guess it was nine years ago, in this sort of survey paper. And what they found was that the fusions were highly connected, in a few clusters centered around a few genes. In this paper they postulated that maybe with a lot more work on gene fusions we'd be able to connect all of these clusters, these three clusters specifically, together to form a single network of gene fusions, which is not actually what ended up happening. When we look at something like ovarian cancer, a lot of work has been done since that original picture on trying to identify gene fusions. And what they find is that generally the gene fusion network is very disconnected. One of the reasons for that is just the guided approaches that they were using for gene fusions before they started doing sequencing. And the other reason is that possibly a lot of these fusions are non-functional, created by genomic instability. There have been some landscape papers very recently where they've sequenced a lot of tumors and looked for patterns of fusions across these cancers.
And what they do find is that a lot of the fusions are associated with copy number change. And perhaps that is a sign that a lot of the newly discovered gene fusions are passengers and the product of genome instability.
I'll mention a couple of the databases that are available that provide some details about gene fusions that have already been discovered. The TCGA gene fusion portal provides all of the data that has been identified and associated with this fusion landscape paper. So you can query that. You can't download it, unfortunately. You can also look at COSMIC, which has a more restricted set of gene fusions that are curated from publications. That one is downloadable. ChimerDB pulls together a number of different sources and, I would say, is one of the most useful ones. And ConjoinG is a database primarily for conjoined genes in healthy tissues. So read-throughs.
Okay, continuing in the vein of what to look for in terms of function of a gene fusion. One of the things we can do is, again, look at the expression across the exons of the genes that are involved in your gene fusion. And here, for instance, we're showing that for this particular gene fusion, R3, CRAD, and this other one involving HNF1A, if you look across the expression of each of those genes, you see a very stark change point when you get to the break point. And that's showing that there's really no expression of the wild-type gene. All of the expression is coming from the fusion gene. We can look, sir. So what's happened is there's been a translocation exactly at that break point. You can think of this almost as a deletion: that whole end of that transcript maybe doesn't even exist anymore, or perhaps it's translocated somewhere else, and where it's been translocated to, there's no promoter that would actually lead to expression of those sequences. Yeah?
I don't know if genome shattering, but when you put this with genome shattering, is it important to do something with this? Yeah. Would it, other than this thing? So you would, yeah, this is exactly what you would see, say, for R3 at the top, say it was split by some kind of genome-shattering event. But maybe it's translocated somewhere, not as part of a gene fusion, it's just moved to another part of the genome and the five prime sequence is perhaps in antisense. So I guess we wouldn't class that as a gene fusion, it would just be a gene disruption. But that can definitely happen. And it would look the same as this. So I mean, really what we'd have to do is look at R3 and also, what we're not showing, the expression of CRAT. And we would expect that we would see just the five prime exon, sorry, the three prime exons expressed for CRAT. So we would have to do this game where we match up the two genes' expression.
Okay, so I might go a bit faster so that we can get to the tutorial quickly. So in this slide, I'm just describing that if we look across different cancers, we could end up finding the same fusion partner fused to many different genes, but relevant in the same way to all of these diverse tumor types. And that's what they found for BRAF, for prostate and several other cancer types.
When we're looking for read-throughs, we have to be quite wary, because they are a very recurrent real event, including in benign tissue. At the bottom here, we're showing all of the read-throughs and the histogram of how many samples they're found in. And then at the top, the events that are not read-throughs, and the only thing that's recurrent here is TMPRSS2-ERG, which is our known biologically relevant fusion in prostate cancer, which is what these samples are.
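For readers who want something concrete, flagging a predicted fusion as a possible read-through could look roughly like the sketch below. It is an illustration under stated assumptions, not how any specific tool does it. The annotation table `genes`, the function name `is_readthrough_candidate`, the placeholder gene names, and the `max_gap` threshold are all hypothetical. It treats a pair as a read-through candidate when the three prime gene lies directly downstream of the five prime gene on the same chromosome and strand with no other same-strand gene in between.

```python
# Hypothetical annotation: gene -> (chromosome, strand, start, end).
genes = {
    "GENE_A": ("chr1", "+", 1000, 5000),
    "GENE_B": ("chr1", "+", 7000, 12000),
    "GENE_C": ("chr1", "+", 20000, 30000),
    "GENE_D": ("chr2", "+", 1000, 4000),
}


def is_readthrough_candidate(five_prime, three_prime, genes, max_gap=50000):
    """Flag (not filter) a fusion whose partners are adjacent same-strand genes.

    max_gap is an arbitrary illustrative threshold in base pairs.
    """
    chrom5, strand5, start5, end5 = genes[five_prime]
    chrom3, strand3, start3, end3 = genes[three_prime]
    if chrom5 != chrom3 or strand5 != strand3:
        return False
    # The 3' partner must sit downstream in the direction of transcription.
    if strand5 == "+":
        lo, hi = end5, start3
    else:
        lo, hi = end3, start5
    gap = hi - lo
    if gap < 0 or gap > max_gap:
        return False
    # Any other same-strand gene lying fully between them breaks adjacency.
    for name, (chrom, strand, start, end) in genes.items():
        if name in (five_prime, three_prime):
            continue
        if chrom == chrom5 and strand == strand5 and start >= lo and end <= hi:
            return False
    return True


print(is_readthrough_candidate("GENE_A", "GENE_B", genes))  # adjacent pair
print(is_readthrough_candidate("GENE_A", "GENE_C", genes))  # GENE_B in between
```

The function only returns a flag. Whether a flagged event is then dropped is a separate decision, since an adjacent-gene fusion can still be produced by a real rearrangement.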
We have to be careful if we were just blanket filtering things out, because we could end up removing something that's produced by a small deletion, which is what I'm showing in this part of the slide on the right. So we have things that are recurrent on the bottom that are read-throughs, shown by this blue annotation. And then we have things at the top that are specific to our one cell line that are shown with the green annotation as rearrangement-based. And then we have this one event that is both a read-through and produced by a rearrangement and is very specific to this sample. So clearly, just because a tool is predicting a fusion of two adjacent genes, it doesn't mean it's produced specifically because those genes are adjacent. There needs to be something else going on sometimes. For instance, if the gene generates it. Yeah, potentially, yeah, yeah, definitely. I'm not sure that's the case for this example that I showed, but there's also at least one example of a gene fusion that is a read-through that has had some implication in prostate cancer biology. And that's this SLC45A3-ELK4. On the right here we're showing that the tissue specificity of the expression of this gene fusion is due to the specificity of the expression of SLC45A3. And it's also been shown to have some role in cell proliferation in prostate cancer.
Another feature we can look at when we're trying to understand the biology of these fusions is where in each of the genes the splicing happens when we're splicing together these two genes to make a gene fusion. I'm gonna just skip over this slide. And I think one of the most interesting things we can calculate is whether or not the gene fusion has a preserved open reading frame across the fusion transcript. So if we bring together two transcript sequences, about, I guess it's one in nine times, we will have this situation where the codons exactly line up.
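The counting behind that one-in-nine figure can be sketched by enumerating the codon phase on each side of the junction. This is a simplified illustration, not the method of any particular fusion caller. The function names and the phase convention are assumptions made for the example: `phase5` is the number of coding bases left hanging at the end of the five prime piece (coding length mod 3), and `phase3` is the position within its codon of the first retained base of the three prime piece. The sketch treats all nine combinations as equally likely, which is a simplification, and it does not check whether a codon formed at the junction happens to be a stop codon.

```python
from collections import Counter
from itertools import product


def classify_junction(phase5, phase3):
    """Classify a fusion junction from the codon phase on each side."""
    if phase5 not in (0, 1, 2) or phase3 not in (0, 1, 2):
        raise ValueError("phases must be 0, 1 or 2")
    if phase5 != phase3:
        return "frameshift"      # 3' codons are read out of frame
    if phase5 == 0:
        return "exact"           # break falls between codons on both sides
    return "chimeric_codon"      # one new codon at the junction, 3' frame kept


def classify_from_offsets(cds_bases_5prime, first_retained_cds_index_3prime):
    """Same classification from coding-sequence coordinates (0-based index)."""
    return classify_junction(cds_bases_5prime % 3,
                             first_retained_cds_index_3prime % 3)


# Enumerate all nine phase combinations.
counts = Counter(classify_junction(a, b) for a, b in product(range(3), repeat=2))
print(dict(counts))
```

Under these assumptions the enumeration gives one exact case, two in-frame cases with a new codon at the junction, and six frameshifts out of nine.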
And in that situation we'll preserve all of the codons: when they're translated, they'll be the original codons of the original transcript sequences. Also, two out of nine times we have the situation in which we get a new codon right at the fusion boundary that belongs to neither gene, but the sequences are brought together in such a way that the remaining three prime codons are the original codons of the three prime gene. And then two thirds of the time, so the majority of the time, we have a situation in which all of the three prime codons are out of frame and just nonsense. So we look for these two situations at the top. If we see a prediction that looks like the situation at the bottom, then that's a sign that there's not gonna be any biological function that's preserved. Yeah. But for the first two events, so I guess potentially in this situation on the top right, yeah, you could end up with a codon that's, yeah, it could be the same as the original one from the gene. Yeah, that's true.
Okay, so we can look in the genome for evidence that will help us understand these events. Here at the bottom I'm showing some evidence that we should be careful when we're doing this, because what I'm showing is a fusion transcript between SAMD12 and this PHF gene. So this fusion transcript at the top involves these two genes, but in the genome what we actually had that produced this fusion transcript is a translocation between those two genes where, at the break point, we have the insertion of a 1 kb sequence from another part of the genome and a 1 kb sequence from a third part of the genome. So if we are looking just at individual break points, we'll never see SAMD12 directly connected to PHF, and yet in the transcriptome that's what we see. So sometimes you have to account for the complexity of the break points when you're looking for corroborating evidence in the genome. If you don't have the classical reads.
So yeah, you'll never find reads in the genome that connect SAMD12 and PHF unless you have very long reads, which is not the case with sequencing data most of the time. So what we did for this is we had to look for paths through the genome graph, so we were looking for a path that goes from SAMD12 via some other path through the genome to PHF. And this top example is an example of a complex rearrangement that produces multiple gene fusions. So this happens in prostate cancer, where the DNA gets cut at four different locations, permuted, and then rejoined to produce two gene fusions on two different chromosomes.
Okay, considerations when you're designing experiments. I would say if you have a large cohort size then some of these tools are a bit prohibitive in terms of their computational costs. So you can use some simpler methods, such as the methods that don't do a sort of refined alignment. They just join pairs of exons and do a standard alignment, things like that, and then filter recurrent artifacts. If you know a fusion partner, then a capture experiment is often more efficient. Artifacts are very prevalent, so we have to be very careful when we're getting excited about finding an event. I would say spot checking individual events is really important, and manually checking supporting reads. As we've seen, it's important with SNVs and other events and rearrangements to do that in IGV. It's the same with gene fusions. Definitely leverage multiple computational methods, and validation is always important, at least for a subset of your predictions.
So for recent advances, I think this happened last year, crizotinib passed phase three trials, so we're starting to make some gains clinically from being able to identify gene fusions and then treat them with drugs.
And finally, future research in gene fusions, I think, is going to look at how they affect the proteome, and one of the tools that people are starting to use for this is mass spec data. And so now I think what we're gonna do, unless anyone has any idea.