Can you see my screen?

That looks great. Thank you for putting that up there. You have 15 minutes. Take it away.

Thank you. Hello everyone. I'm Ji-Eun Park, a graduate student at UNC Chapel Hill, and today I'd like to share my recent project and findings on associating risk factors with mutational signatures. I noticed that a lot of signatures were presented in this session, so I want to start off by clarifying what mutational signatures are.

In the human body, mutations, or specifically somatic mutations, occur throughout life, and these mutations can be driven by mutagenic processes such as aging or smoking. Additional mutations can also arise from existing mutations due to, for instance, APOBEC gene activity or homologous recombination deficiency. The thing is that each of these mutagenic processes leaves a unique pattern in the DNA, and we call these patterns mutational signatures. Also, as somatic mutations accumulate and become excessive, this eventually develops into cancer. Since whole-genome or exome sequencing allows us to extract mutation information, we can now analyze the composition of mutational signatures in patients' mutational profiles, or counts, which would be invaluable in understanding cancer development.

So, how do we find these signatures from the sequencing data we have? Thanks to a collaboration of institutes, including WSI and UC San Diego, there is already a set of well-defined signatures found from large-cohort pan-cancer data, stored in a public website named the Catalogue of Somatic Mutations in Cancer, or COSMIC for short. These signatures were identified using a non-negative matrix factorization algorithm, where they factorize the mutational counts into two matrices, one representing the mutational signatures and the other what is called the contribution. The contribution basically shows how much each mutational signature contributed to each sample's mutational counts.
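As a rough illustration of the factorization just described, here is a minimal sketch of non-negative matrix factorization using the classic multiplicative updates. It is not the pipeline used to build the COSMIC catalogue; the toy count matrix, the function names, and the choice of two signatures are all assumptions made for the example.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared reconstruction error ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, k, n_iter=200, seed=0, eps=1e-12):
    """Factorize V (mutation types x samples) into W (types x k) and H (k x samples)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # Update contributions: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # Update signatures: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    # Rescale so each signature (column of W) sums to 1; move the scale into H.
    col_sums = [sum(W[i][j] for i in range(m)) for j in range(k)]
    W = [[W[i][j] / col_sums[j] for j in range(k)] for i in range(m)]
    H = [[H[j][s] * col_sums[j] for s in range(n)] for j in range(k)]
    return W, H

# Hypothetical toy counts: 4 mutation types (rows) x 3 samples (columns).
V = [[10, 2, 6], [1, 8, 4], [9, 1, 5], [2, 9, 6]]
W, H = nmf(V, k=2)
print("signatures (columns sum to 1):", W)
print("contributions per sample:", H)
```

In this sketch the columns of `W` play the role of signatures and the columns of `H` play the role of per-sample contributions. Real analyses work with 96 mutation types and many more samples.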
So, you can kind of understand it as an estimated weight of the signatures that built up each sample's mutational profile, or counts. The mutational counts and signatures are often summarized into 96 dimensions, where each dimension represents a single-base substitution at the mutation site together with its 5' and 3' flanking bases. A mutational signature looks like the figure down below here, which is basically a vector of the composition over the 96 dimensions, which adds up to 1. So far, COSMIC has identified about 94 signatures, with some additional sub-signatures.

As the interest in mutational signatures increased in cancer studies, more and more methods were developed to find de novo, or novel, signatures using other algorithms like variations of NMF, expectation maximization, or mixed membership models. While finding signatures itself is important, it is also important to identify the associated etiology in order to better understand cancer development. Many of the COSMIC signatures already have identified etiologies. For example, SBS1 is related to spontaneous deamination of 5-methylcytosine, which relates to aging, and SBS2 is related to APOBEC enzyme activity. However, a great number of signatures still have unknown etiologies, like SBS94, and the same goes for signatures found by de novo mutational signature methods. So developing a method that helps find the related etiologies seems to be a crucial challenge now.

Some methods have been trying to tackle this problem, of course, by trying to associate risk factors that may help find etiologies. Initially, some de novo methods used their estimated contributions to do a simple two-group comparison with Wilcoxon rank-sum tests on a binary risk factor, for example, non-smoking versus smoking groups.
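To make the two-group comparison concrete, here is a minimal sketch of a Wilcoxon rank-sum test applied to one signature's estimated contributions. It uses the normal approximation without tie or continuity corrections, and the contribution values for the two groups are made up for illustration, not taken from any study.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no corrections)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2   # Mann-Whitney U for group x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value
    return u1, z, p

# Hypothetical estimated contributions of one signature in each group.
non_smokers = [0.05, 0.08, 0.10, 0.07, 0.12, 0.06]
smokers = [0.20, 0.25, 0.18, 0.30, 0.22, 0.15]
u, z, p = rank_sum_test(non_smokers, smokers)
print(f"U = {u}, z = {z:.2f}, p = {p:.4f}")
```

Note that this test treats the estimated contributions as if they were observed exactly, so it ignores any uncertainty in how they were estimated.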
This is very intuitive and can be nicely added to existing methods, but the authors of a more recent method named HiLDA pointed out that these methods may lack power. This is because, depending on the samples used to estimate the contributions, the contributions may vary highly. So taking the analysis results without considering that uncertainty may lead to a loss of test power. To overcome this, HiLDA used a unified hierarchical Bayesian LDA model, which can detect de novo signatures and test for differences between binary groups while accounting for the uncertainty in the estimated contributions.

We really like this idea of HiLDA, but we also wanted to change and add more features. So we developed what we call diffsig, and what it does can be summarized as threefold. One is that we consider the uncertainties derived from sampling, like HiLDA, using a Bayesian Dirichlet hierarchical model specified in the Stan programming language using RStan. Second, we allow for more than one risk factor at a time, which can be of any type, binary, categorical, or continuous, instead of allowing just a single binary variable. And last but not least, we allow the use of a predefined set of signatures instead of including a layer in our model to find our own signatures, so that it is more comparable to previous studies based on COSMIC signatures. We also allow users to use their own preferred signatures from a de novo method.

Due to the time limit, I will skip the details of the model and the simulation results and jump right into the real data results. Here we used TCGA breast cancer data, which after filtering had about 900 samples with a normal range of mutation counts. To validate our model, we tested risk factors for which we have prior knowledge of the associations with the breast-cancer-related COSMIC signatures, which are SBS2, 3, 5, 8, and 13.
One risk factor we used here was a continuous variable, the HRD score, which is a metric calculated as the sum of three independent DNA-based measures of genomic instability. The HRD scores and the COSMIC signatures are independent, or orthogonal, but higher HRD scores indicate greater homologous recombination deficiency. So we expected that the HRD score as a risk factor would have a high association with the HRD signature from COSMIC, which is SBS3. And that is what you're seeing in this figure on the right. You can see that the HRD score has the highest association with SBS3 while having negative or zero associations with the other signatures.

Then we tested a risk factor variable that represents the molecular subtype of each sample. Each breast cancer sample belongs to one of the molecular subtypes: basal-like, HER2, luminal A, or luminal B. From a previous study by Pitt et al., it is known that HER2 samples have much higher APOBEC mutagenesis compared to the basal subtype, which you can see in the figure on the right, where the lengths of the white shaded areas differ greatly between basal and HER2. So what we did is we subsetted to only the basal-like and HER2-enriched samples and tested diffsig with a binary indicator of HER2 subtype. As you can see in this figure in the middle, HER2 samples, compared to basal-like samples, have a higher association with the APOBEC signatures, which are SBS2 and 13, compared to the other three.

As an additional validation, we also confirmed that the same pattern was found with the HiLDA model that inspired us when applying this binary indicator. But note here that HiLDA signatures are de novo, so here we compared the HiLDA signatures that share the most mutational contexts with COSMIC, and that matching was done by eye. Further on, we wanted to validate that our model works well with various types of risk factors.
So we tested a continuous version of the molecular subtype, which is the correlation metric to the centroid of the HER2 subtype. We also saw that the APOBEC signatures have the highest association with the continuous HER2 subtype, which matches the results that we saw in the previous slide.

So in a nutshell, we have seen that our model is capable of accurately finding risk factors associated with mutational signatures across various types of risk factors, and we anticipate that it will aid the process of connecting etiologies with mutational signatures. Lastly, I want to briefly acknowledge my advisors, Dr. Love and Dr. Wu, all my collaborators, and the amazing Love lab members and genomics group. Our Bioconductor R package, diffsig, is still a work in progress, so please stay tuned if you're interested. Thank you so much for listening.

Thanks very much. Questions from the audience? Yes. Come to the microphone.

Hello. Can you hear me? I guess so. Yes. Thank you for your talk. I had a question about whether you think a similar model could be applied to copy number mutational signatures as well.

So our research was kind of limited to single-base substitutions. We haven't really applied it to double-base substitutions or copy number. But we naively believe that, since it's a matter of dimensions, it can be applied, though there might be some biological obstacles that we have to work through to use copy number variation. But that would be very interesting. That was something that we were curious about as well.

Thank you.

I noticed that RStan is part of your toolkit here. You're showing certain box plots relative to some parameter called beta, which I didn't quite follow, to show some of your comparisons. And I'm wondering, are those posterior distribution summaries for a parameter out of the RStan model?
Yeah, those are the posterior estimates of our parameter, where beta is the association between a risk factor and the mutational signatures.

Right, yeah. And what kinds of run times do you have when you're doing the RStan MCMC?

It depends on the size of the data, and it also depends on the type of the risk factors. I can't really remember off the top of my head how long it took for the real data sets, but for the simulated data sets, we tried sample sizes of 50, 100, 200, and up to 1,000. And I think the maximum, for 1,000 samples, was within two hours or so.

Okay. And are the convergence criteria for the sampling part of the model, or do you just pre-specify the number of samples you're going to use?

We have to pre-specify the number of samples. From the simulations and from the real data set, we have confirmed that the convergence diagnostic, R-hat, was mostly below 1.03, which we are very happy about. But it will depend on the data set, and we would love to try diffsig on other mutational data. So if anyone has any nice mutational data that they want to try out, please reach out to me.

Sounds great. Any other questions? And anything in the chat? No, our chatters are quiet. Well, I want to thank all the speakers for a wonderful session. And I think with that, we'll bring it to a close. One more thanks. Bye bye.
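For readers unfamiliar with the R-hat diagnostic mentioned in the Q&A, here is a minimal sketch of the classic Gelman-Rubin potential scale reduction factor. Stan reports a more refined split, rank-normalized version, so this is only an approximation of the idea. The simulated chains are hypothetical and are not output from the speaker's model.

```python
import random
import statistics

def r_hat(chains):
    """Classic (non-split) Gelman-Rubin R-hat for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    var_plus = (n - 1) / n * within + between / n  # pooled variance estimate
    return (var_plus / within) ** 0.5

rng = random.Random(1)

# Four hypothetical chains that all sample the same distribution: R-hat near 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]

# Four hypothetical chains stuck around different values: R-hat well above 1.
stuck = [[rng.gauss(mu, 1.0) for _ in range(500)] for mu in (0.0, 0.0, 3.0, 3.0)]

print(f"R-hat, well-mixed chains: {r_hat(mixed):.3f}")
print(f"R-hat, stuck chains:      {r_hat(stuck):.3f}")
```

Values close to 1 suggest that the chains agree with each other. Values well above 1 suggest that they have not converged to the same distribution.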