Okay, part two of this module is gene fusions, so this is looking at how rearrangements affect the transcriptome. The definition of a gene fusion is just a novel gene formed by the fusion of two distinct wild-type genes. So from a translocation or some other type of rearrangement event, we have two normally distinct genes that are brought together, and we have the formation of a gene that is a combination of the two.

These we know are very relevant to the clinical features of some cancers. I think the best example here is the BCR-ABL1 fusion, one of the first somatic events to be discovered. Originally it was discovered through cytogenetics: the Philadelphia chromosome, the translocation that creates this gene fusion, was discovered, and now it's held up as this example of how targeted therapies can work, because we've created a small molecule that inhibits the BCR-ABL1 gene fusion product. That drug is called imatinib, and it's very effective for treating CML patients.

Another aspect of these gene fusions is that they can be used as prognostic markers, usually by looking at FISH results or similar cytogenetic results, so pathologists will look at these results to try and identify whether or not a patient has a cancer that's driven by one of these gene fusions. Then recent developments in RNA sequencing and genome sequencing have generated a lot of excitement about finding more of these events.

Further evidence that gene fusions are clinically relevant: they're initiators of carcinogenesis, and we see that because they correlate with the cancer phenotype. If we successfully treat a patient whose tumor harbors one of these gene fusions, then that eradicates any products that are associated with the gene fusion. Gene fusions produce neoplastic disorders when they're put into mouse models, and silencing fusion transcripts will reverse the tumorigenic process.

There are several different classes of gene fusion.
The first shown here is deregulation of a proto-oncogene. We have a translocation that brings together the promoter region of one gene and the functional domains of another gene, and that means the expression of this new fusion gene is driven by a promoter that is perhaps upregulated, as in this example where IGH now drives overexpression of MYC in Burkitt lymphoma. We can have other forms of deregulation, such as a microRNA binding site that is swapped between two genes, making it difficult for the cell to use translational repression through this microRNA binding site to inhibit the translation of a particular transcript, and that leads to an overabundance of the protein products of a particular gene.

So it's not only the case that a gene fusion will just result in overexpression of the functional domains of the three-prime gene. You can also have, as in the BCR-ABL1 example, functional domains from both genes forming a new gene fusion with a novel function within the cell.

Another class of gene fusions is just a disruption of a tumor suppressor gene. In this case, the fusion transcript that we're seeing is really evidence that the wild-type genes are being disrupted, and this is just through the translocation interrupting the gene so it's no longer functional.

And then of course, this was published recently as an example where we have multiple mechanisms caused by just one reciprocal rearrangement, or actually an inversion in this case. If you remember from the previous part of the module, an inversion will create two breakpoints.
So in that way it can create two gene fusions, and for this particular example it does: the result of one of those gene fusions is silencing of the tumor suppressor, and the result of the other is this new fusion transcript that has some oncogenic function.

In terms of discovering gene fusions, we've seen a massive increase in the number of gene fusions that have been discovered since the advent of RNA sequencing. This seems to be the platform that is used most commonly to identify these events. I think genome sequencing is also possible, but for reasons I'll show you in a subsequent slide, it's not always apparent which gene fusion relates to a breakpoint you find in whole genome sequencing.

I think you will cover this also in the expression module, but I'll just briefly detail the steps here. It's pretty much the same as whole genome sequencing, but we have this added step where we have to get our mRNA, which we pull down using a poly-A pull-down. Then we reverse transcribe the mRNA into cDNA, and then we just follow the regular sequencing protocol on that cDNA, which involves fragmenting, size-selecting, and then paired-end read sequencing.
Okay, so what happens in the steps from a translocation that creates a gene fusion all the way to what we see, which is the discordant reads? Well, we have an additional issue to deal with in RNA sequencing when we're looking for gene fusions. We'll have a chromosomal rearrangement, say a reciprocal translocation, that brings together gene X and gene Y on chromosomes A and B. Then the transcript that is produced from this fusion gene splices out the introns. For most of the gene fusions that have some kind of relevant biological function, the breakpoint occurs in the middle of one of the introns; that's just more likely to happen because introns are larger. So the breakpoint itself gets spliced out, and we end up with this fusion transcript which is partly the exons from gene X and partly the exons from gene Y.

Then RNA sequencing is applied to this fusion transcript and, similar to whole genome sequencing, we get wild-type reads, so concordant reads, and then we get what I will call spanning reads and split reads, which are equivalently called discordant reads and split reads.

Again we have two choices, similar to whole genome sequencing: we can do an alignment-based approach in our analysis of RNA sequencing, or an assembly of the RNA sequencing. Because the space of possible sequences is smaller, it's actually quite a bit more tractable to do an assembly of an RNA-seq data set than it is for a whole genome sequencing data set. So this is definitely an approach that is used in the field: do an assembly of the transcriptome, then map the transcripts that we've assembled back to the genome, and identify ones that involve the exons of one gene and another gene as possible fusions. But I think it's probably still more common to use an alignment-based approach, where we take the reads, try to align them back to the genome and the transcriptome, and then
assemble transcripts by looking at clusters of aligned reads that support the same fusion event.

But there is this problem: the sequences undergo splicing before we get to observe them as read sequences. It's no longer the case that we're just looking at reads that come directly from the genome; there's this extra process of splicing that complicates things. So the question is, what reference do we use? We can align to the genome, in which case we have to deal with split reads that are not only caused by a fusion, as you can see on the left side of this paired-end read in the middle here, but we also have to deal with the problem of aligning reads that cross a splicing boundary, that span an intron. To get around this problem we can also align to transcripts, so cDNA sequences of known genes with known splicing patterns, and this helps to recover a lot of the alignments that normally would be difficult. This slide is just some evidence that the optimal way of doing this is to align to both the genome and the transcriptome. I'll quickly skip through that.

So false positives in gene fusion prediction come, similar to whole genome sequencing, from alignment artifacts, especially homologous genes. And I think the problem that's unique to RNA-seq is that we have very high expression in parts of the transcriptome, especially ribosomal RNA. People generally try to filter out ribosomal RNA in the molecular biology steps, but still some of that gets through, and because these transcripts are so highly expressed, they're just more prone to producing reads that are erroneously mapped and perhaps suggest a gene fusion. So I think one of the computational steps is identifying sources of these highly expressed regions of the transcriptome, like the ribosomal RNA, and then just removing those.
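
As a rough illustration of that last computational step, here is a minimal Python sketch that drops predicted fusions in which either partner is on an exclusion list of highly expressed genes. The `FusionCall` record, the function name, and the contents of the exclusion set are assumptions made for this example (only the two mitochondrial rRNA symbols are listed, as placeholders); a real exclusion list would be built from a gene annotation, and real fusion callers have their own output formats.

```python
from typing import Iterable, List, NamedTuple, Set


class FusionCall(NamedTuple):
    """Hypothetical record for one predicted gene fusion."""
    five_prime_gene: str
    three_prime_gene: str
    supporting_reads: int


# Illustrative exclusion list only: the two mitochondrial rRNA genes.
# In practice this set would come from an annotation of rRNA and other
# very highly expressed genes.
HIGHLY_EXPRESSED_GENES: Set[str] = {"MT-RNR1", "MT-RNR2"}


def drop_excluded_calls(
    calls: Iterable[FusionCall],
    excluded: Set[str] = HIGHLY_EXPRESSED_GENES,
) -> List[FusionCall]:
    """Keep only calls where neither partner gene is on the exclusion list."""
    return [
        call
        for call in calls
        if call.five_prime_gene not in excluded
        and call.three_prime_gene not in excluded
    ]


calls = [
    FusionCall("BCR", "ABL1", 42),
    FusionCall("MT-RNR2", "ABL1", 7),   # partner is ribosomal RNA
    FusionCall("TMPRSS2", "ERG", 15),
]
kept = drop_excluded_calls(calls)
```

Filtering on gene identity is the simplest possible version of this step; it says nothing about homologous genes, which would need features of the alignments themselves.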
There are also a couple of processes that generate low-level sequencing noise. Reverse transcriptase template switching is one of them, and then there are also ligation artifacts. These generally just produce very small numbers of uncorrelated chimeric reads, and the predictions that come from those can usually be filtered out pretty easily by looking only at predictions that have a reasonable number of supporting reads.

Another source of artifacts is natural rearrangement in the transcriptome, such as IG rearrangements. If we end up sequencing a lot of immune cells, then sometimes what we end up with in our RNA sequencing is a lot of IG rearrangements that are producing expressed transcripts, and those will end up looking like fusions. We can filter those out by looking specifically at the biology that could lead to the particular artifacts.

One of the other interesting ones, which is quite prevalent, is something called a transcription-induced chimera, or a read-through. In a tumor genome, for various reasons such as open chromatin, transcription is more active across the genome, and very frequently you have adjacent genes that are co-transcribed, just because the various mechanisms that would prevent this from happening are disrupted in the cancer cell, and maybe you get more frequent skipping of a transcription stop site. So a lot of the events that you will predict when you're using a gene fusion tool are just adjacent genes that are co-transcribed. These are called read-throughs, and they're quite common in both tumor and normal or benign samples.

So to reduce these artifacts: the technique used for alignment artifacts is to calculate features of the alignments that are supporting a gene fusion and use hard filtering or
some machine learning to classify these as true or false. For natural sources of rearrangement, it's just important to find the relevant database, such as gene lists of IG genes, and annotate and then filter those. And transcription-induced chimeras, or read-throughs, can be easily identified as just involving adjacent genes.

Now, to prioritize lists of gene fusions, I think... oh, go ahead. Yeah, right. There's not really a coverage in RNA-seq, because it's so dependent on expression, so there are other measures, like reads per kilobase per million mapped reads, RPKM. But okay, so the question is, for fusions, what is an appropriate threshold for the number of reads you would expect, to separate artifacts from something that's actual biology. Right, so something that's just subclonal, yeah, it may only show up in, you know... yeah. I mean, it could have low expression and still somehow be relevant, or maybe it was historically relevant in that tumor. I could give you a ballpark number: between five and ten reads at least, which would remove those types of artifacts, the ligation artifacts and reverse transcriptase template switching. But yep, that's true. Yeah. Sure.

All right. To prioritize gene fusions, we generally look at the expression of the exons that are brought together by the predicted gene fusion and look at whether those are highly expressed. If the expression is interrupted when we look across the wild-type gene, that would imply that the wild-type gene is less expressed compared to the fusion. We can also look at recurrence across multiple samples. Recurrence can be looked at in terms of the pair of genes, and it's also possible that one of the genes in the fused pair is frequently fused to other genes, and that can be an indicator that it's a relevant fusion in the samples that we're looking at. And then of course we can look
at any corroborating rearrangements that we identify in the genome, if we have matched whole genome sequencing. We can look at gene functions, such as whether or not the gene is implicated in cancer, from say COSMIC. Kinases are frequently involved in gene fusions, and they are usually in the three-prime position. And also, of course, whether one of the genes is already a drug target.

Another way of prioritizing these is to try and understand whether or not the fusion transcript would actually be translated into a fusion protein that has some relevant function. The way this is assessed is by looking at whether or not the codons of the three-prime gene would be successfully translated. What happens is a breakpoint usually occurs in an intron of the fusion gene, and then at the junction between the exons of those two genes, if we have a frameshift, all of the downstream codons will be nonsense. So this is something to look for: whether or not the breakpoint and the fusion boundary preserve the reading frame of both genes. I'm just going to grab some water.

All right, so it's somewhat interesting to analyze the gene fusion partners. If you're looking at a novel gene fusion, you can analyze what the partners' general function is, and then look at how the gene fusion that you're assessing relates to the fused partners that have been discovered so far in the literature. Tyrosine kinases are frequently involved in fusions, again usually as the three-prime partner, because it's upregulation of some tyrosine kinase that changes how signals are propagated through the cell and perhaps how the regulation of different cell processes is disrupted. Transcription factors, again, are also frequently involved in gene fusions, and then oncogenes are frequently upregulated by something called a
promoter exchange.

So about maybe eight years ago, Felix Mitelman looked at all the gene fusions that had been discovered so far with cytogenetic methods and built a gene fusion network out of these gene fusion partners, where he took all of the genes and just drew edges between those genes if they were fused in some sample. From this analysis they ended up with three larger clusters, and within those clusters there were predominantly isolated or singleton genes that were connected to just one other gene, and then there was a smaller number of genes that were very promiscuous, so they would fuse to multiple partners in multiple different samples.

One of the possible reasons for this is that they were using targeted approaches. Say, for the MLL gene, if you have RACE assays or FISH probes with which you're looking for everything that's connected to that particular gene, that's going to bias you toward only identifying partners of MLL. So that was one theory for why the network was so connected in that way. Now that we have discovered many, many more gene fusions, we find that the fusion network, especially in cancers such as ovarian cancer, is much more diffuse and less connected. It's more the case that we have just isolated pairs of gene fusions, and then of course the question is what function, if any, these have.

I think the main result of these recent pan-cancer gene fusion studies is that we have many more predictions, but now we realize that these events are frequent, and we need a lot more work to understand which of them are just passengers of genome instability.

So would it be the same sort of ratio we see for point mutations, very small? I think it would probably be a higher ratio than that, because for the fusions that we're talking about, there's a number of
things that have to happen before they're classified as a fusion. So it has to have a breakpoint... yeah, it's more restricted than just, say, all of the breakpoints that involve genes, because then they have to bring the proteins together in a way that the downstream protein is not nonsense, et cetera. But it's still a large fraction of them that have got to be passengers.

So the question is whether or not there's a background level of germline gene fusions. That's a very good question. Certainly there are read-throughs. I'm not sure I would classify those as gene fusions, but read-throughs do happen in normal samples, and there must be gene fusions that happen very rarely in...

So you're saying that you can't just sequence the blood and then compare to the reference of a healthy person if you're talking about hereditary hematological malignancies? Yeah, so the classic example is leukemia: if you take a blood sample from the leukemia patient, you don't have normal in there. Right, right, yeah, so then you have to find another tissue. So it's very hard to find that kind of normal; some use skin as a standard. Hmm. And there's a lot of discussion, because we're trying to set it up locally, and microarray is a standard, but one of the things I wondered was whether you can do something like third-generation long-read sequencing, something with just long sequence reads. That would potentially depend on the size. Sorry, I missed that.

Yeah, so microarrays won't give you the sequence level. But if you think about TMPRSS2-ERG, found in prostate cancer, that was actually picked up by expression arrays, using an analysis of just outlier expression, and then of course they went and validated the sequence. Nowadays, though, I guess you would just use RNA sequencing, because you get expression and then you get the sequence-level information.

Okay, so more on these landscape papers.
This figure is just showing how genome instability is related to the number of gene fusions that are found in particular cancers. In the middle plot we're showing some measurement of genome instability that's rising as we go from left to right, and then we see at the top that the proportion of samples in which we find gene fusions is rising also.

The bottom figure here is drawing the distinction between balanced and unbalanced rearrangement events that produce gene fusions. If you look in general at the gene fusions that have been discovered and are known to have some functional impact, then more frequently those are created by balanced rearrangement events. The reason seems to be that for unbalanced rearrangement events, the effect is predominantly going to be a change in copy number, and so that's going to be what is driving the cancer. But for balanced events, like reciprocal translocations, inversions, things like that, the effect is not going to be in the copy number space; it's going to be at the boundary, at the breakpoint, and that's going to be more often something that looks like a fusion or an interruption of a gene.

So there are several databases in which we can look up information about gene fusions that have already been discovered, including the TCGA gene fusion portal, which basically takes the data from the landscape paper in the previous slide and makes it searchable. You can also search for information about gene fusions in COSMIC. There's another database called ChimerDB that builds on top of COSMIC, and a database called ConjoinG, which is more for read-throughs or transcription-induced chimeras.

Okay, so now a little bit more about trying to understand the impact of gene fusions on the transcriptome and how we generally prioritize these events. One of the things that happens is, if we look at expression of the exons of genes that are involved in a gene fusion, across the gene,
then often we'll find that at the breakpoint there's a transition in expression, because the wild type is not being expressed or is being expressed at a very low level, whereas the fusion transcript is being expressed at a high level. So this is evidence that perhaps these particular gene fusions are at least highly expressed; we don't know whether or not they're impacting the cancer.

We can have groups of gene fusions that are similar just in the three-prime gene that is upregulated. Usually what happens is we have a promoter that is specific to a cancer and is upregulating the same three-prime gene. So across multiple different cancers that three-prime gene is common, and the mechanism for creating increased expression of that oncogene is just a translocation of a promoter in front of that gene. And of course the promoter that is selected is dependent on the cancer, because different tissue types will have different genes that are turned on in that particular cancer.

So it can be interesting to look at this. In this figure we're showing a set of fusions that were known that involved BRAF, and then in, I believe, gastric tumors and prostate tumors, the authors at the bottom here found additional BRAF fusions that hadn't been discovered yet. They were able to say, well, since BRAF is involved as a partner in these other cancers, then these fusions are probably functional in the cancers we found them in.

As I was saying before, read-throughs are these types of events that you will find to be very prevalent if you're looking at the results of gene fusion analysis tools, but they're less likely to be functional. In the figure on the left here we're just showing what is the most common combination of exons for read-throughs, that is, for intra-chromosomal events, which are in pale blue here. For read-through events, most commonly it's, say, the n minus one exon of
the upstream gene is joined to the second exon of the downstream gene. So basically we are skipping the final exon of the five-prime gene and the first exon of the three-prime gene. Whereas for inter-chromosomal events, this dark blue bar here, it can be any combination of the exons.

And then on the right we're showing, across different tissue types, the prevalence of particular inter-chromosomal fusion events, and then at the bottom here, read-through events. You can see that the prevalence across samples is quite high for read-through events, and then, across these samples, for all of these events, they are present usually in just one or two samples, except for this TMPRSS2-ERG.

I mentioned this briefly, but to understand whether or not the function of the three-prime gene is preserved after a fusion, we have to look at whether the resulting fusion transcript would be translated correctly or incorrectly for the three-prime exons, and there's a number of possibilities. We could have a fusion transcript for which the exons join together such that the codons are perfectly aligned, and we go from a fully formed codon in the five-prime gene directly to a fully formed codon in the three-prime gene. This is about a one in nine chance at random. Also possible is that the exons are brought together such that just the codon right at the fusion boundary is nonsense, but then we continue on with the exons of the three-prime gene being properly formed, so everything is aligned except for this middle codon. And then two out of three times we have this possibility where the exons are brought together in a way that the three-prime protein sequence is effectively nonsense.

Between... so the one that I have my mouse over and the
bottom one. Yeah, so the bottom one is the one for which we lose the function of the three-prime gene, and then for this top right one there's probably no loss of function of the three-prime domains, because generally one codon is not going to affect the function, unless it is interrupting a particular domain. In this example? Yeah. No, it's just a nonsense codon, or I should say missense. Yeah, sorry about that.

So I showed this figure at the top in the slide on complex rearrangements, chromoplexy, in the rearrangement section. I'll just highlight it here again because it is an example of how we can get additional information by looking in whole genome sequencing at the rearrangements that are causative of the fusions. Here we see that there is this complex event where four loci have been broken, permuted, and then rejoined, and of the resulting four gene fusions, two actually involve genes with known oncogenic function, and those two are, in this sample, putatively affecting the biology of the cancer. I think this particular example is interesting because it shows an example of a double hit, where one event is creating two gene fusions simultaneously. So we can definitely get additional information by looking in the rearrangement space when we're analyzing fusions.

Another example, below here, is a gene fusion that we would not have picked up if we were analyzing the rearrangements in isolation. What we're showing is the gene fusion at the top, and you'll see that the intron in between the first and second exon is where the breakpoint is residing, and the breakpoint is complex. It's not just a clean break of two genes brought together; in the middle of the breakpoint there are two other one-kilobase regions, and so if we were
to analyze the whole genome data, we would see these three breakpoints in isolation, but we would never find the connection between the two genes that we found using RNA-seq. This is just something to be aware of.

All right, so to give a couple of forward-looking slides. Recently we had a drug called crizotinib that passed phase three trials, or it's in phase three trials. This is targeting the EML4-ALK fusion in lung cancer, which is active in about five percent of lung cancers, and so this is, I guess, where our gene fusion prediction from RNA-seq is coming to fruition and producing results in terms of targeted therapies.

And then finally, also interesting as a future development, is how we're now using gene fusions in the context of onco-proteogenomics: going into mass spec data to try and identify which fusions are producing peptides. The details of this are in the slide, but I think the interesting way in which this could help cancer patients is that, if we can identify fusions, even if they're coming from genome instability, and if they're clonal, as in they're present in all of the cancer cells, and they're producing novel peptides, then these can perhaps be used in immunotherapy. So targeting those peptides in immunotherapy could potentially be a solution for some of these patients.

Okay, so that's the end of the lecture part, and now I'll start the gene fusion discovery and characterization lab.
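
The reading-frame check described earlier in the lecture can be written down in a few lines. The following Python sketch is only an illustration under simplifying assumptions: the junction is assumed to fall inside the coding sequence of both genes, and stop codons or other effects at the junction itself are ignored. The function and parameter names are made up for this example. It classifies a junction from two counts of coding bases, then enumerates all nine phase combinations, which reproduces the figures quoted in the lecture (one in nine with codons intact, two in three out of frame).

```python
from collections import Counter
from itertools import product


def classify_junction(five_prime_cds_bases: int, three_prime_cds_offset: int) -> str:
    """Classify a fusion junction by its effect on the 3' reading frame.

    five_prime_cds_bases: coding bases of the 5' gene retained before the
        junction in the fusion transcript.
    three_prime_cds_offset: coding bases of the 3' gene that lie upstream of
        the junction in its wild-type transcript (so they are lost).
    """
    end_phase = five_prime_cds_bases % 3      # bases of a partial codon left by the 5' gene
    start_phase = three_prime_cds_offset % 3  # position within its codon of the first 3' base
    if end_phase != start_phase:
        return "frameshift"                   # all downstream codons are misread
    if end_phase == 0:
        return "in-frame, codons intact"      # junction falls between two whole codons
    return "in-frame, junction codon altered"  # only the codon spanning the junction changes


# Enumerate every combination of phases, treating each as equally likely.
counts = Counter(
    classify_junction(end, start) for end, start in product(range(3), repeat=2)
)

for label, n in sorted(counts.items()):
    print(f"{label}: {n}/9")
```

In a real analysis the two counts would come from the transcript annotation (exon coordinates and CDS start) of whichever isoforms support the fusion, and a junction in an untranslated region would need separate handling.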