Okay. Good morning, everyone. My name is Babak Aloui. I'm a PhD student in the Eric Larson group at the University of Gothenburg. First, I would like to thank the organizers for giving me this opportunity to talk here about my project. Today I will talk about somatic structural alterations and their influence on gene expression.
Cancer is a disease of the genome: point mutations, chromosomal rearrangements, and copy number changes all contribute to activating oncogenes and inactivating tumor suppressors. Structural variations, or chromosomal rearrangements, can result in the formation of oncogenic fusion genes. There is also recent data suggesting that they can alter expression in other ways, such as by shuffling regulatory regions around the transcription start site. Copy number changes and structural variations have typically been studied in isolation, but in fact they are tightly connected: a copy number change arises as a consequence of a structural variation in the genome.
As you all know, there are now several pan-cancer studies thoroughly exploring mutations and copy number, many of them using TCGA data sets. You can easily explore them, for example, using cBioPortal, as shown here for the EGFR gene. However, when it comes to structural variations, there are few systematic studies investigating these events across multiple cancers. This motivated us to comprehensively map structural variations in cancer genomes using TCGA whole genome sequencing data. We also wanted to investigate the relationship between copy number changes and structural variations, as the structural basis of copy number changes is not well explored. Additionally, from the central dogma of biology, we know that for any genomic alteration to be important in cancer, it should have an effect on the RNA produced by the tumor. Because of that, we wanted to systematically explore the global effect of structural changes on the tumor transcriptome.
To do all of this, we used 600 cancer patients across 18 different cancer types and set up a pipeline to integrate copy number data and expression data with structural variations determined from whole genome sequencing. As it turns out, detecting structural variations was not without challenges. We did this using available tools, on the computational resources available in our own lab. Apart from being a massive computational effort, it is also a challenge to get clean, good results, and not all the available tools will provide that. The problem is that it's difficult to know whether the results are good or not. However, by combining copy number data and structural variation data, we believe we found a way to evaluate and improve the quality of the results. By combining these two data types, we could also investigate the structural basis of copy number alterations. As an important part of the analysis, we looked into structural alterations in regulatory regions and their effect on mRNA. And last, we combined DNA and RNA data to get better functional fusions for cancer.
As I said earlier, whole-genome-based structural variation detection was not trivial, and we didn't like the fact that different tools were giving different results. So we needed a way to benchmark them, and we didn't have the answer. To get around this, we used the array-based copy number data. We know that a copy number change is the result of a structural alteration in the genome, and because of that, the copy number data tells us about some of the structural breakpoints in the samples. Of course, this is not perfect, but it still provides a true positive set, and we know that a perfect structural variation detection tool should basically be able to identify all of these copy number breakpoints. So one of the things we did was to calculate a sensitivity score based on the overlap between copy number breakpoints and structural variation breakpoints.
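As a rough illustration of the sensitivity score just described, here is a minimal Python sketch. The function name, the `(chromosome, position)` tuple representation, and the 10 kb matching window are assumptions made for illustration; the talk does not state which tolerance or data format was actually used.

```python
from bisect import bisect_left
from collections import defaultdict


def breakpoint_sensitivity(cn_breakpoints, sv_breakpoints, window=10_000):
    """Fraction of copy number breakpoints that have at least one
    structural variation breakpoint within `window` bp on the same chromosome.

    Both inputs are lists of (chromosome, position) tuples.
    `window` is a hypothetical tolerance, not a value from the talk.
    """
    # Index SV breakpoint positions per chromosome, sorted for binary search.
    sv_by_chrom = defaultdict(list)
    for chrom, pos in sv_breakpoints:
        sv_by_chrom[chrom].append(pos)
    for positions in sv_by_chrom.values():
        positions.sort()

    if not cn_breakpoints:
        return 0.0

    matched = 0
    for chrom, pos in cn_breakpoints:
        positions = sv_by_chrom.get(chrom, [])
        # First SV breakpoint at or after the left edge of the window.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            matched += 1
    return matched / len(cn_breakpoints)


# Toy coordinates, not real data.
cn = [("chr1", 1_000_000), ("chr1", 5_000_000), ("chr2", 2_000_000)]
sv = [("chr1", 1_004_000), ("chr1", 4_995_000), ("chr2", 2_500_000)]
score = breakpoint_sensitivity(cn, sv)  # 2 of 3 copy number breakpoints matched
```

Running the same function on copy number breakpoints with shuffled positions would give a baseline for chance overlap, which is one way a randomized control could be set up.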
After applying this method to four different tools, we came to the conclusion that Meerkat gives the highest sensitivity and is also the most specific tool, since it had the lowest correlation with randomized copy number breakpoints. We could also improve the performance by adding some post-filters to our pipeline, such as removing breakpoints in repeat regions. We then applied this pipeline to 600 whole genome sequencing samples and saw that around 35% of array-based copy number breakpoints can be explained structurally. This number varied between cancer types, and it appears that the cancer types with low scores have mostly arm-level or chromosome-level copy number changes. Among the copy number segments that had support from our structural variation data, almost all the deleted regions were classified as deletions in whole genome sequencing, which increased our confidence in the results. Interestingly, we also saw that most of the copy number amplifications are tandem duplications. For those of you who don't know, a tandem duplication is a piece of DNA that is duplicated and inserted adjacent to itself.
Interestingly, when we compared the two data types, we found that some copy number changes that look simple are in fact more complicated than they seem. In this example, we have two copy-number-deleted regions with a copy-number-neutral region between them. But when we included the structural variation data, we saw that there was only one tandem duplication in this region. We could explain this by an arm-level deletion in one allele and a tandem duplication in the other allele, which neutralized the amplitude for that region. Here is another example, but here the deletion covers the whole chromosome. Again, the tandem duplications neutralized the amplitude of those regions.
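The arithmetic behind these examples can be sketched in a few lines of Python. This is only a toy model under stated assumptions: two alleles labeled "A" and "B", one copy each to start with, arbitrary coordinate units, and a hypothetical `copy_number_profile` helper that is not part of any tool mentioned in the talk.

```python
def copy_number_profile(length, events, copies_per_allele=1):
    """Return merged (start, end, total_copies) segments for a two-allele region.

    events: list of (allele, start, end, delta) with half-open [start, end)
    coordinates inside [0, length]; delta is -1 for a deletion and +1 for a
    tandem duplication. All names and units here are illustrative.
    """
    # Every event boundary is a potential change point in total copy number.
    cuts = sorted({0, length} | {p for _, s, e, _ in events for p in (s, e)})
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        copies = {"A": copies_per_allele, "B": copies_per_allele}
        for allele, s, e, delta in events:
            if s <= start and end <= e:
                copies[allele] += delta
        total = copies["A"] + copies["B"]
        # Merge with the previous segment if the total did not change.
        if segments and segments[-1][2] == total:
            segments[-1] = (segments[-1][0], end, total)
        else:
            segments.append((start, end, total))
    return segments


# Arm-level deletion on allele A, one tandem duplication on allele B.
events = [("A", 0, 100, -1), ("B", 40, 60, +1)]
profile = copy_number_profile(100, events)
# profile == [(0, 40, 1), (40, 60, 2), (60, 100, 1)]
```

An array sees only the total, so it reports a loss, a neutral stretch, and another loss, even though the only focal event is the single tandem duplication.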
For structural alterations to be important in cancer, they need to influence transcription in some way, by altering mRNA level or structure. What we did next was to systematically investigate the influence on mRNA of structural changes in regulatory regions near the transcription start site. This is still work in progress and we are digging into the data at the moment, but to give you a couple of examples: we found structural alterations in the TERT promoter region in kidney chromophobe cancer, and these alterations were associated with strong transcriptional activity. This has been described recently in a paper in Oncocyan's journal. Another example is FUBP1 in breast cancer. Again, we saw a strong association with mRNA level in the samples with structural alterations in the promoter regulatory region.
Another way structural alterations can affect mRNA is by swapping strong and weak promoters in the context of gene fusions. We did a systematic screen for these cases to see the effect on expression, and we found that in many cases these structural alterations changed transcriptional activity. Here we used quantile-quantile plots to show the observed fold change in expression of the gene in the sample with the structural change, relative to the other samples, against the expected fold change obtained using randomly picked samples. The blue line and the dotted lines show the median and the 90% confidence interval of the null distribution, respectively. As you see here, in many cases these structural changes alter the expression level. A well-known example of this is the RET-CCDC6 fusion in thyroid carcinoma. This has been described before, but we can see that this fusion happens through an inversion, and the RET gene is activated by hijacking the strong promoter of the CCDC6 gene.
Knowing about the structural basis of gene fusions gave us a reason to look into gene fusions using our structural variation data. Lately we've seen a lot of studies using RNA-seq to identify gene fusions.
We did this too. We applied FusionCatcher to TCGA RNA-seq data, but looking at the RNA-seq results we came to the conclusion that the data is very noisy and the specificity is very low. So we used the intersection of DNA and RNA to identify cancer-relevant gene fusions. Again, we needed a way to benchmark our approach, and for that we used the Cancer Gene Census, the COSMIC cancer genes, as a metric. Using only the DNA data gave us 8% overlap with the known cancer genes, and when we used in-frame fusions from RNA-seq we again had the same specificity. However, when we used the intersection of these two data types, we had much higher specificity, with more cancer-relevant fusions. Using this approach we could identify several known fusions in different cancers, but also some novel functional fusions. To give you an example, we found PAX8-NRF2 fusions in thyroid carcinoma. This happens through a tandem duplication, and the NRF2 gene is activated by losing its KEAP1 binding sites in this fusion, as shown here.
So, to summarize: array-based copy number data is useful for optimizing structural variation detection; most copy number amplifications are due to tandem duplications; shuffling of regulatory regions such as promoters and enhancers impacts expression levels globally; and detection of fusions can be improved by combining whole genome sequencing and RNA-seq. I would like to thank my lab members and colleagues. Thank you for listening.
Can you comment on the use of COSMIC to assess specificity? It seems like the hypothesis there is that you're more likely to get fusions in cancer genes, and I guess I'm not sure why that would be.
Using only RNA-seq data, we will find a lot of different fusions, but a lot of them happen randomly or are not functional in cancer.
So we need a metric to see whether these fusions are functional. The ones that are functional are most likely in genes that are relevant to cancer, so we use that to get rid of the noise in the data.
Okay, so specificity with respect to functional.
Yes, exactly. Thank you.
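The fusion benchmark discussed in the talk and in this exchange can be sketched roughly as follows. This is a minimal illustration, not the pipeline that was actually used: the function names are hypothetical, the gene pairs other than RET and CCDC6 are placeholders, and the toy cancer gene set stands in for the Cancer Gene Census.

```python
def normalize(fusions):
    """Represent each fusion as an order-independent pair of gene names."""
    return {tuple(sorted(pair)) for pair in fusions}


def cancer_gene_fraction(fusions, cancer_genes):
    """Fraction of fusions with at least one partner in the cancer gene set."""
    fusions = normalize(fusions)
    if not fusions:
        return 0.0
    hits = sum(1 for a, b in fusions if a in cancer_genes or b in cancer_genes)
    return hits / len(fusions)


def intersect_fusions(dna_fusions, rna_fusions):
    """Fusions supported by both DNA breakpoints and RNA-seq calls."""
    return normalize(dna_fusions) & normalize(rna_fusions)


# Toy inputs, not real results; GENE_* names are placeholders.
cancer_genes = {"RET"}
dna = [("CCDC6", "RET"), ("GENE_1", "GENE_2"), ("GENE_3", "GENE_4")]
rna = [("RET", "CCDC6"), ("GENE_5", "GENE_6"), ("GENE_1", "GENE_7")]

both = intersect_fusions(dna, rna)
dna_fraction = cancer_gene_fraction(dna, cancer_genes)
rna_fraction = cancer_gene_fraction(rna, cancer_genes)
both_fraction = cancer_gene_fraction(both, cancer_genes)
```

In this toy case the DNA-only and RNA-only sets each have one cancer-gene fusion out of three, while the intersection keeps only the fusion seen in both data types, which mirrors the enrichment described in the talk without reproducing its numbers.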