Good afternoon, everyone. I am sort of hoping or assuming that somebody else is going to start my presentation. Okay, great. So my name is Theo Knijnenburg. I'm going to talk about an ongoing research project where we are trying to identify mutation hotspots in proteins using TCGA data. Here we go.
We had a very nice presentation this morning by Angela Brooks, who was also investigating these mutation hotspots, and she had a nice experimental approach to do that. We are using a statistical approach, basically using TCGA data as well as cancer cell line data to associate with the mutation hotspots that we find. The actual work here has been done by William Poole, who is a student of Brady Bernard and myself, and we are all part of the lab of Ilya Shmulevich at the Institute for Systems Biology in Seattle.
Okay, so here we are looking at the drug response of about 900 cancer cell lines to the drug vemurafenib, which is an inhibitor of V600E BRAF mutants, right? So the mutants that have an amino acid substitution of V to E, valine to glutamate, at amino acid position 600 in the BRAF protein. These 900 cell lines are divided into four groups based on their BRAF mutation status, right?
So on the left-hand side, we have about 800 cell lines or so which are BRAF wild type. They have a high IC50; the IC50 is the concentration of the drug at which about 50% of the cells are killed. So these BRAF wild types are relatively resistant to the drug. Then we have a group of cell lines which have a mutation in a hotspot around amino acid 465. We have a pretty large group of cell lines which have mutations in the hotspot around amino acid position 600, so these include all the V600E mutants. And then we have a bunch of cell lines with mutations that fall outside of hotspots. We used TCGA data and our algorithm to infer these mutation hotspots.
It is pretty clear to see that the location of the mutation matters in this particular case. Not surprisingly, the cell lines which have a mutation around position 600 are very sensitive to this drug. Most of these are colorectal, skin and thyroid cancer cell lines. Now, the cell lines which have BRAF mutations in the other hotspot, or even outside of hotspots, are much less sensitive to this drug, but they are more sensitive than the wild types.
So with this well-known and well-understood example in the back of our minds, we set out to identify these mutation hotspots across all the genes in TCGA with a decent mutation frequency. These mutation hotspots in our case are defined as regions of high mutation density on the linear amino acid sequence of the protein. Right, so they can be much bigger, as you will see, than one amino acid. EGFR has been used quite a lot today as an example.
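To make "regions of high mutation density on the linear amino acid sequence" concrete, here is a minimal sketch. It is not the algorithm from the talk: it simply smooths per-residue mutation counts with a Gaussian kernel at one assumed bandwidth and reports contiguous stretches above an assumed threshold. The function names, the toy positions, the bandwidth and the threshold are all hypothetical.

```python
import math

def mutation_density(positions, protein_length, bandwidth):
    """Gaussian-smoothed mutation count at each residue (1-based positions)."""
    density = [0.0] * protein_length
    for residue in range(1, protein_length + 1):
        for pos in positions:
            z = (residue - pos) / bandwidth
            density[residue - 1] += math.exp(-0.5 * z * z)
    return density

def dense_regions(density, threshold):
    """Return (start, end) residue ranges, 1-based and inclusive, where the
    smoothed density is at or above the threshold."""
    regions, start = [], None
    for i, value in enumerate(density, start=1):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(density)))
    return regions

# Toy data (assumption): mutated residues in a hypothetical 100-residue protein.
toy_positions = [20, 21, 21, 22, 23, 60, 80, 80, 80, 81]
toy_density = mutation_density(toy_positions, protein_length=100, bandwidth=2.0)
toy_regions = dense_regions(toy_density, threshold=2.0)
print(toy_regions)
```

With these toy settings the isolated mutation at residue 60 does not form a region, while the two groups of nearby mutations do, which shows how a region defined this way can span several residues.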
I'm also using it. EGFR is of course a well-known cancer gene. It is aberrated in quite a few tumor types; in most cases it is amplified on the copy number level. However, when you look at the PanCan-11, which is the PanCan-12 minus the lowly mutated blood cancer AML, we find quite a large number of mutations in this gene, and they are distributed as you can see over here. These are both synonymous as well as non-synonymous mutations. There clearly are three peaks here. These are amino acid positions where this gene is very frequently mutated. But besides the three peaks you also see regions where the mutations are quite a bit more sparse or where they are more dense.
So our algorithm smooths this mutation count at multiple different bandwidths, and then it uses the local maxima of these smoothed signals as seeds for a mixture model, which is a mixture of multiple Gaussians as well as a uniform background distribution. Using expectation maximization we then find, at each scale, the clusters with high mutation density. The final step in our algorithm is a greedy approach to identify locally those clusters that optimize the Akaike information criterion, which is indicated by the red clusters that you see over there, and we require these red clusters to have at least five non-synonymous mutations.
We find in general that the hotspots that we have identified overlap very well with protein domains. You can see that here for EGFR, and in general, when we look at all the hotspots that we have found across all the genes, about 75% of the mutations in the hotspots overlap with protein domains, which from a statistical point of view is very significant. So we think that we have created a pretty robust algorithm to identify these mutation hotspots across many hundreds or even thousands of genes in TCGA, which actually have quite a variation in their mutation frequency as well as in their spatial distribution.
So, some global characteristics of these mutation hotspots. Of the 20,000
hotspots that we identified in about 2,500 genes, we see that they vary quite a bit in size. They go from, say, one amino acid all the way up to hotspots which are more than 500 amino acids. In that case, I'm not sure if the term hotspot is actually a good term to use; we can also use the word cluster. Most of these hotspots are between 10 and 50 amino acids. In comparison, OncodriveCLUST, which is a method developed in the group of Núria López-Bigas, is confined to a much smaller scale. Their clusters are usually smaller than 10 amino acids, and many clusters or hotspots are only one amino acid. On average, our multi-scale clusters contain between 5 and 50 mutations, which is also a little bit larger than the OncodriveCLUST clusters. However, in general, we do find clusters at the same locations, as evidenced by the fact that 84% of the OncodriveCLUST clusters overlap with the multi-scale clusters.
Okay, so now our main question in this research project was: are these mutation hotspots functional? The way in which we try to establish the functionality is by doing these large-scale statistical analyses with relevant data. I've already shown you the association or the overlap with protein domains and a little bit of the drug response, and if I have time left I will show some more examples of drug response. But now I want to highlight some associations with gene expression data and with signaling pathways.
The way in which we can create these statistical associations between the mutation hotspots and gene expression data is basically by making binary vectors out of our mutation hotspots.
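The binary-vector encoding can be sketched as follows. This is an illustration under stated assumptions, not the group's pairwise test engine: the sample names, mutation positions and expression values are made up, the helper names are hypothetical, and a plain Pearson correlation stands in for whatever association statistic the actual analysis uses.

```python
import math

def indicator(sample_positions, samples, region=None):
    """Binary vector over samples: 1 if the sample has a mutation inside
    region (start, end); with region=None any mutation in the gene counts."""
    vector = []
    for sample in samples:
        positions = sample_positions.get(sample, [])
        if region is None:
            hit = len(positions) > 0
        else:
            hit = any(region[0] <= p <= region[1] for p in positions)
        vector.append(int(hit))
    return vector

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data (assumption): mutated residues per sample for one gene, and the
# expression of one other gene in the same six samples.
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
toy_mutations = {"S1": [12], "S2": [13], "S3": [146], "S5": [12, 61]}
toy_expression = [2.1, 1.9, 4.8, 5.2, 2.3, 5.0]

gene_vector = indicator(toy_mutations, samples)               # any mutation
hotspot_vector = indicator(toy_mutations, samples, (12, 13))  # hotspot only

r_gene = pearson(gene_vector, toy_expression)
r_hotspot = pearson(hotspot_vector, toy_expression)
print(gene_vector, hotspot_vector, round(r_gene, 3), round(r_hotspot, 3))
```

In this toy case the hotspot-level vector correlates more strongly with expression than the gene-level vector, because the one sample mutated outside the hotspot behaves like a wild type. Stacking such vectors for all hotspots gives one binary matrix per tumor type.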
So for each sample in a TCGA tumor type we can ask the question: does the sample have a mutation in this particular hotspot? This way we can create these binary vectors for all our hotspots. So we have a very big binary matrix here, say for one particular tumor type, and then we can do a correlation analysis with a gene expression data set, basically the gene expression profiles, which gives us in the end pairwise correlations and p-values between hotspots and genes. For this we use our pairwise statistical test engine that also underlies Regulome Explorer, which is a tool that some of you might be familiar with.
Okay, I will just highlight briefly two examples. This is a thing which we find quite frequently. This is an example in the TCGA endometrial cancer, UCEC. We see that there's a slightly stronger association when you look at a hotspot versus looking at all mutations in the gene. In this particular case the expression levels of the gene CAMK2B are lower in the KRAS mutants, the 51 KRAS mutants, compared to the wild type. However, when we then focus on the KRAS hotspot, which is around amino acids 12 and 13, we find a slightly stronger association. Right, and this is a thing that we find a lot.
What we find much less frequently, but which is also quite interesting, is where we have a much stronger association when we look at a particular mutation hotspot. In this case, we're looking at the expression levels of the gene GLIS2, and they are higher for the 24 PPP2R1A mutants compared to the wild types. The p-value is not very strong though, so it's moderately significant.
However, if we were then to focus on the 10 samples which have mutations in this particular hotspot that we found, we find a much stronger association. Finding these significant associations only in hotspots occurs quite infrequently: 60 cases in UCEC. Finding them both in the gene and the hotspot occurs quite a lot, and also only in the gene, so we don't find it in the mutation hotspot. In that case, many of these mutation hotspots only contain a very small number of the samples and you have much less statistical power to detect something.
Okay, so let's move now to the next thing that we did, which was look at pathways. The way in which we try to assess the statistical association between these mutation hotspots and signaling pathways is by taking the pairwise p-values from the previous analysis between hotspots and genes, and then using the membership of genes in the NCI PID pathways, these manually annotated cancer signaling pathways, to basically combine the p-values of the genes in the pathway to get to p-values between hotspots and pathways. Now here it is important to note that the p-values of the genes in the pathway are not independent, because the expression profiles of genes occurring in the same pathway are often quite correlated to each other. So if one were to use Fisher's way of combining p-values, one would get a lot of false positives. So one thing that William did was successfully implement Brown's method to compute the combination of dependent p-values. We think that this is actually a very interesting contribution of his work, but I will not go into more detail about that here.
So let's look at an example. In this case we are looking at the statistical association between PTEN hotspots and signaling pathways in the TCGA glioblastoma data set. Here we have a heat map with p-values, where the low and significant p-values are depicted in dark red.
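The difference between Fisher's and Brown's way of combining p-values can be sketched as below. This is a textbook-style illustration, not William's implementation: it uses Brown's original 1975 approximation for the covariance terms, takes the gene-gene correlation matrix as given, and uses made-up p-values. The function names are hypothetical, and the chi-square tail probability is computed with a small hand-written incomplete gamma routine to avoid external dependencies.

```python
import math

def gamma_q(a, x):
    """Regularized upper incomplete gamma function Q(a, x)."""
    if x <= 0:
        return 1.0
    log_pref = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1:                      # series expansion for P(a, x)
        term = total = 1.0 / a
        n = a
        for _ in range(500):
            n += 1
            term *= x / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_pref)
    tiny = 1e-300                      # Lentz continued fraction for Q(a, x)
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 500):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_pref)

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with df degrees of freedom."""
    return gamma_q(df / 2.0, x / 2.0)

def fisher_combine(pvalues):
    """Fisher's method: assumes the p-values are independent."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf(stat, 2 * len(pvalues))

def brown_combine(pvalues, corr):
    """Brown's method: rescales Fisher's statistic using the correlations
    between the underlying variables (Brown's 1975 covariance approximation)."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    expected = 2.0 * k
    variance = 4.0 * k
    for i in range(k):
        for j in range(i + 1, k):
            rho = corr[i][j]
            cov = rho * (3.25 + 0.75 * rho) if rho >= 0 else rho * (3.27 + 0.71 * rho)
            variance += 2.0 * cov
    scale = variance / (2.0 * expected)
    df = 2.0 * expected ** 2 / variance
    return chi2_sf(stat / scale, df)

# Toy data (assumption): four gene-level p-values from one pathway whose
# expression profiles are all pairwise correlated at 0.8.
toy_p = [0.01, 0.02, 0.03, 0.04]
toy_corr = [[1.0 if i == j else 0.8 for j in range(4)] for i in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

p_fisher = fisher_combine(toy_p)
p_brown = brown_combine(toy_p, toy_corr)
print(p_fisher, p_brown)
```

With positively correlated genes the Brown-combined p-value comes out larger, that is less significant, than the Fisher-combined one, which is the false-positive inflation described above. With an identity correlation matrix the two methods coincide.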
So dark red means that there is a statistical association between this mutation hotspot and the pathway, which means that there's a statistical association between the samples which are mutated in this hotspot and the genes in the particular pathway having correlated expression profiles.
Okay, so there are three things that I would like to point out here. One, looking at PTEN, there are actually 27 samples which have a PTEN deletion on the copy number level. So we used this binary feature of homozygous deletions and also ran it in our pairwise calculations. If you then compare that particular feature to the feature which basically says, is there a mutation anywhere in the gene, we see very different pathways light up. So basically this is telling us that deleting PTEN or having a mutation in PTEN has a very different functional consequence if we look at the pathway level.
Second observation: when comparing a particular mutation hotspot over here, we see that the pattern is very similar to just having a mutation somewhere in the PTEN gene, and although there are many fewer samples in this particular hotspot, many of the p-values are actually more significant, indicating that there is actually a strong relationship when we look at that particular hotspot.
And then finally, looking across different hotspots, we see different patterns of pathways light up, once again indicating that it really matters where you find these particular hotspots. Sorry, where you find the particular mutations. In the case of PTEN this is quite interesting, because these hotspots can be directly related to the protein structure of PTEN. Right.
So for example the hotspot around amino acid 330 is in this c-alpha 2 domain, and the one here around 170 is basically this TI loop. So one can begin to think about the interplay between structural changes to the protein and what this means on the pathway level, and maybe even beyond, on the cellular or phenotypic level.
This brings me back to the drug response, with only 30 seconds to go, so I'm going to do this very quickly. In this case we are looking at mutations in PI3 kinase, and we see that, depending on the drug that we are looking at, different hotspots sensitize to different drugs.
Okay, so very quickly summarized: we have developed a novel multi-scale clustering algorithm to robustly identify mutation hotspots in genes, and uncovered many statistical associations between these hotspots and what we think are relevant data sets in terms of gene expression, signaling pathways and drug response. We want to take this work, which we have done now on PanCan-11, to all the cancer types in TCGA, which I think is going to be called the PanCan Atlas. We want to integrate these mutation hotspots in our tool Regulome Explorer, such that the associations with these hotspots are available to everybody that is interested in querying them. And write this up and make it available.
Once again, William did the work here. I want to thank Brady, Ilya, Sheila, Vesteinn and all the members of our GDAC at MD Anderson and ISB, and the organizers for the opportunity to present our work here. If you have questions, now is a good time to ask, but also tomorrow at poster number 57.
Thank you so much.
Have you tried to figure out how different types of mutations within the same gene might affect, like, I guess that last part you were showing might be able to distinguish, you know, inactivating mutations from potentially activating mutations that might occur within the same type of gene?
Yeah, that's a good question, and indeed all our hotspots are annotated with the types of mutations that are in there. Maybe I should have made this more clear: when we consider hotspots, we only consider the non-synonymous mutations. But of course we make a difference between the missense and all the other types of mutations, and we do see large differences. Things that you would expect, such as in, say, the tumor suppressor genes we find somewhat larger hotspots, which are less dense, and there are more missense mutations in there.
As I understood, you so far looked at protein coding sequence, but could you also annotate hotspots along the whole genome? Would that make sense?
Yeah, in principle this algorithm can indeed be applied on the DNA sequence, and it would be quite interesting to also find hotspots there. The interpretation is going to be quite a bit more challenging, I would say, but it's definitely possible to apply the algorithm on DNA sequences. Yeah.
Tilt it up so we can hear you. Okay. There we go. Okay. For the p-values which you had, were they multiple-testing corrected? Because you have quite a few tests and they were only in the order of 10 to the minus 4 or something.
Yes, so you're referring to the gene expression relationships.
Those were not corrected for multiple testing. However, the number of genes which I tried in this particular case for this presentation would actually amount to, say, a Bonferroni correction of around 10 to the power minus 4. So the associations that I showed there would be significant at a family-wise error rate of 1 in this particular case.
Okay, and did you look at correlations between these somatic mutations? Because you have quite a bit of correlation structure, and then you would get quite a few false positives.
Do you mean in between the different hotspots?
Yeah.
Yeah, I think that this is definitely the case, that we have to be very stringent with that. Yeah.
Thank you.
So I just want to thank all the speakers again for this afternoon session. The last thing that we have for today is a series of workshops that are starting at four o'clock. We have three different workshops. They're going to run twice, so once at four o'clock and once at five o'clock. So if there's one or two of them that you would like to attend, you have the time. The first workshop is the cBioPortal for Cancer Genomics, which will be in Balcony B. Workshop two is the writing and approval of dbGaP controlled access requests. That's in Balcony C. Those are up the stairs and around the corner from here. And then the third workshop will remain in this room, and that is TCGA imaging resources. So we've got about a half hour as they get set up for these workshops, and then we can reconvene for whatever workshop you would like to attend. Oh wait, and tomorrow morning we will reconvene here at nine o'clock for session three, and the pancreas AWG will be meeting at four o'clock outside, and we will walk over to where we're going. Anything else, JC? Okay. That is all my announcements. Thank you guys.