I'm really excited to talk to everyone about the work we have done recently. It's about an algorithm to identify combinations of mutually exclusive alterations in cancer.

We know the key challenge in cancer sequencing projects is how to distinguish the driver mutations, which are responsible for cancer, from the passenger mutations, which are just randomly distributed across the cancer genome and do not contribute to cancer. This is a particularly difficult problem because in a typical tumor the number of passenger mutations is far larger than the number of driver mutations. One common strategy for solving this problem is to look for recurrence across a large number of samples. The idea is that a mutated gene found in more samples than we would expect is a good candidate for harboring driver mutations. Based on this idea, several sophisticated methods have been developed. No matter which method you use, if you plot the significance score for each gene, you get a distribution like this one: a few genes that are extremely significant and a long tail of rarely mutated genes. This long-tail phenomenon complicates the problem of finding driver mutations, because we know that some genes in the long tail do harbor driver mutations.

One explanation for why there are driver mutations in the long tail may be that driver mutations target pathways rather than individual genes. As you can see here, several TCGA studies have demonstrated that driver mutations target pathways rather than individual genes. To test combinations of mutations, there are several approaches. As you can see here, they vary in terms of the prior knowledge they use. You can either test the enrichment of mutated genes in known pathways, or you can identify significantly mutated subnetworks in a large-scale PPI network. However, such prior knowledge is incomplete.
So this makes it very difficult to detect novel pathways. In today's talk, I will mainly focus on the approach that identifies combinations of driver mutations de novo, without using any prior knowledge. One limitation of this de novo approach, as you can see here, is that there are just too many hypotheses to test while still retaining statistical power. To overcome this limitation, one promising approach is to restrict the possible combinations of mutations by focusing on those that exhibit particular patterns.

Here is one particular pattern that we can use: mutual exclusivity between driver mutations in a cancer pathway. Here I'm showing you a previous study, which demonstrated that the RAS/RTK signaling pathway contains mutually exclusive mutations. As you can see from this mutation matrix of the corresponding genes in the pathway, most patients have only one mutation. We call this exclusivity. In fact, in the past few years, several approaches have been introduced to identify this mutually exclusive pattern. Dendrix and RME first came out at almost the same time, and you can see there are many improved approaches for identifying mutually exclusive mutations.

So how do these current methods score exclusivity? Let's state the problem in a formal way: given a binary mutation matrix A, find a combination M of genes by considering exclusivity. Dendrix, Multi-Dendrix, and RME consider exclusivity and coverage simultaneously to score gene sets. muex uses a generative model to score exclusivity. MEMo uses a permutation test with coverage as the test statistic to score gene sets or combinations; however, MEMo only considers combinations that appear in an interaction network. And these methods work pretty well.
As you can see, MEMo and Dendrix have been applied in several TCGA studies here.

One limitation we found is that the Dendrix weight function tends to favor highly mutated genes. Here I give you two examples. The two mutation matrices here have the same coverage and perfect exclusivity, and you can see Dendrix cannot distinguish these two combinations, even though in the top matrix the mutually exclusive signal is mostly dominated by one highly mutated gene. In fact, if you run Dendrix on real cancer data like glioblastoma, you will find that a gene set containing the highly mutated gene EGFR plus some random genes gets the same score as a gene set containing many well-known cancer genes. muex is an approach published last year that recognized this problem and tried to solve it, but you can see it ends up with the same problem. This motivated us to come up with a better scoring function.

So we propose a new algorithm called CoMEt for identifying driver mutations, based on these contributions, and in the rest of the talk I will give you an overview of them. Let's first talk about the first two: the new scoring function, and identifying multiple combinations simultaneously.

We define the problem of scoring one combination of mutations M as follows: given a binary mutation matrix A, find a combination M of k genes with significantly mutually exclusive mutations, conditioned on the number of mutations in each gene. To put this in a formal way, we turn the mutation matrix into a two-by-two contingency table X_M, where each cell tells you the number of samples with a given combination of whether each gene is mutated or not.
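
As a rough illustration of the table just described, here is a minimal sketch of how the cell counts could be tallied. It assumes a hypothetical data layout (a dictionary from gene name to the set of mutated sample IDs); the function name `contingency_table`, the gene names, and the toy samples are made up for illustration and are not taken from the CoMEt software.

```python
from collections import Counter
from itertools import product

def contingency_table(mutated_in, samples, genes):
    """Map each 0/1 mutation pattern over `genes` to the number of samples showing it."""
    counts = Counter(
        tuple(int(sample in mutated_in[gene]) for gene in genes)
        for sample in samples
    )
    # Fill in empty cells so the table always has 2**k entries.
    return {cell: counts.get(cell, 0) for cell in product((0, 1), repeat=len(genes))}

# Toy data (illustrative only): 8 samples and 2 genes.
samples = [f"s{i}" for i in range(8)]
mutated_in = {
    "geneA": {"s0", "s1", "s2"},
    "geneB": {"s2", "s3", "s4", "s5"},
}
table = contingency_table(mutated_in, samples, ["geneA", "geneB"])

# Cell (1, 1) counts samples mutated in both genes; cells (1, 0) and (0, 1)
# count samples mutated in exactly one gene.
for cell, count in sorted(table.items()):
    print(cell, count)
```

With two genes this gives the four cells of the two-by-two table; passing k genes gives the 2^k cells of the higher-dimensional table.
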
So you can see that the orange cell here tells you the number of samples in which both genes are mutated, and the blue cells tell you the number of samples mutated in exactly one of the genes, which are the exclusive mutations. Back to our problem: how do you compute the significance of the observed mutual exclusivity? Probably everyone knows this: we can just do a one-sided Fisher's exact test for independence, because there is only one degree of freedom in a two-by-two contingency table, which lets us detect non-independence toward either more co-occurrence or more exclusivity. So you can define the score for a gene set by summing the tail probability in the direction of more exclusivity. This Fisher's exact test for independence on a pair of genes has already been widely used in several cancer studies, as you can see here.

So we ask: what if you have a combination with three or more genes? In the same way, we turn the mutation matrix into a contingency table X_M. Here the table is not two-by-two; it is a higher-dimensional 2^k table, in this case a 2^3 table. We ask the same question: can we run the one-sided Fisher's exact test in the same way? The answer is no, because there are just too many degrees of freedom. For a one-sided Fisher's exact test there are too many ways in which the corresponding random variables can be non-independent. So rather than testing independence, we define our test statistic as the sum of the exclusive entries in this table, which is the sum of these blue cells. Then we can enumerate all the tables and sum the probabilities of the tables that are as exclusive as or more exclusive than the observed test statistic.
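
For the pairwise case, the one-sided test described above can be sketched with the hypergeometric distribution. This is a minimal illustration under stated assumptions, not the CoMEt implementation: the function name `exclusivity_pvalue` and its arguments are hypothetical, and it covers only two genes. The k-gene case requires enumerating 2^k tables and is not attempted here.

```python
from math import comb

def exclusivity_pvalue(n_samples, n_a, n_b, n_both):
    """One-sided Fisher's exact test toward exclusivity for a pair of genes.

    Returns P(X <= n_both), where X is the number of samples mutated in both
    genes when the per-gene mutation counts n_a and n_b are held fixed.
    Fewer co-mutated samples means more exclusive, so this is the tail
    "as exclusive as or more exclusive than" the observed table.
    """
    lowest = max(0, n_a + n_b - n_samples)  # smallest feasible overlap
    total = comb(n_samples, n_b)            # ways to place gene B's mutations
    tail = sum(
        comb(n_a, x) * comb(n_samples - n_a, n_b - x)
        for x in range(lowest, n_both + 1)
    )
    return tail / total

# Toy numbers (illustrative only).
p_partial = exclusivity_pvalue(n_samples=8, n_a=3, n_b=4, n_both=1)
p_perfect = exclusivity_pvalue(n_samples=10, n_a=5, n_b=5, n_both=0)
print(p_partial, p_perfect)
```

Because the margins are fixed, the number of exclusive samples equals `n_a + n_b - 2 * n_both`, so summing over smaller overlaps is the same as summing over larger values of the exclusive-cell statistic.
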
However, there is a bottleneck: as you can see, the number of tables you have to enumerate, shown here on the y-axis, increases exponentially when k equals four or higher. There are several approaches that use dynamic programming or a branch-and-bound strategy to enumerate tables efficiently, but they are only suitable for R-by-C contingency tables. As far as we know, there is no algorithm that can efficiently enumerate all 2^k contingency tables. So we propose a new algorithm that can efficiently enumerate 2^k tables.

I will give you the intuition here. First, do we actually need to enumerate all the tables, like this gray bar here? Let's go back to the case we are most interested in, which is the perfectly exclusive case. Since it is an extreme case, the score of this combination is equal to the probability of the table we observe. This motivates us: we don't need to enumerate all the tables. We can start enumerating from the test statistic. So we came up with a new tail enumeration procedure that enumerates tables only from the observed statistic to the most exclusive case, and we sum over them to get the score. What about the more co-occurring cases? There may still be too many tables to enumerate for those contingency tables, so we use approximations: a permutation test and a binomial test.

To summarize, for gene sets or combinations in the exclusive case we use the exact 2^k tail enumeration algorithm to get the score, which gives almost perfect accuracy and is very fast. For the co-occurring case we use the approximation, which gives good accuracy and is very fast as well. So I have just told you about the new weight function for CoMEt. Next I'm going to briefly describe the CoMEt pipeline.
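
The permutation approximation mentioned above can be illustrated with a generic Monte Carlo sketch. It assumes that each gene's mutations are shuffled independently across samples, which keeps the number of mutations per gene fixed, and that the test statistic is the number of samples with exactly one mutated gene. The function names, the toy matrices, and the add-one correction are illustrative choices, not details taken from the talk or from the CoMEt software. The sketch is run on both an exclusive and a co-occurring toy matrix purely to show the contrast.

```python
import random

def exclusive_count(rows):
    """Number of samples (columns) in which exactly one gene is mutated."""
    return sum(1 for column in zip(*rows) if sum(column) == 1)

def permutation_pvalue(rows, n_perm=2000, seed=0):
    """Estimate P(T >= observed T) by shuffling each gene's row independently."""
    rng = random.Random(seed)
    observed = exclusive_count(rows)
    hits = 0
    for _ in range(n_perm):
        # rng.sample(row, len(row)) returns a shuffled copy, so each gene
        # keeps the same number of mutations.
        shuffled = [rng.sample(row, len(row)) for row in rows]
        if exclusive_count(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy matrices (illustrative only): 3 genes by 20 samples, 6 mutations per gene.
n = 20
exclusive = [[1 if start <= i < start + 6 else 0 for i in range(n)] for start in (0, 6, 12)]
cooccurring = [[1 if i < 6 else 0 for i in range(n)] for _ in range(3)]

p_excl = permutation_pvalue(exclusive)
p_cooc = permutation_pvalue(cooccurring)
print(p_excl, p_cooc)
```

In the co-occurring toy matrix no sample has exactly one mutation, so every permutation is at least as exclusive as the observed data and the estimate is 1; in the perfectly exclusive matrix almost no permutation reaches the observed statistic.
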
We can take multiple types of alterations and turn them into a binary matrix, and CoMEt is able to identify multiple pathways simultaneously by multiplying their weights. It is typically impossible to enumerate all possible combinations in the mutation matrix, so we use MCMC to sample combinations in proportion to their weight.

Next, how do we summarize the results? Here is a table showing the sampling results from running CoMEt to identify two pathways, each with four genes. We summarize the results using the marginal probability graph, which is a complete graph with weighted edges. Each edge (M1, M2) is weighted by how often gene M1 is sampled in the same combination as gene M2. This reveals consensus subgraphs with high sampling frequency. One advantage of this approach is that it allows us to discover complex relationships; I will show you this later in the real results. The second advantage is that it can identify modules of a different size than the one specified in our parameters. For example, if there really is a gene set of size five, you can see that four of those five genes will form a combination and float around in the top sets, and finally they will be summarized as a clique and reported in the marginal probability graph.

We ran our approach on simulated data and real data. In the simulation, we implant one pathway into the simulated data, along with noise, and we run each algorithm on 25 simulated data sets for each coverage of the implanted pathway. The x-axis here is the coverage of the implanted pathway, from low to high. We then examine the true positives and false positives between the implanted pathway and the predicted gene sets. As you can see, CoMEt, in blue here, performs very well at almost all coverages.
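
Looking back at the summary step described above, the edge weights of the marginal probability graph could be computed roughly as follows. This is a sketch under assumptions: the sampler output is represented as a hypothetical list of (collection, frequency) pairs, where each collection is a list of gene sets, and the gene names and frequencies are invented for illustration rather than taken from any real run.

```python
from collections import Counter
from itertools import combinations

def marginal_probability_graph(sampled):
    """Weight each gene pair by the fraction of samples placing both genes in the same set."""
    total = sum(freq for _, freq in sampled)
    weights = Counter()
    for collection, freq in sampled:
        for gene_set in collection:
            # Sort so each unordered pair always maps to the same key.
            for g1, g2 in combinations(sorted(gene_set), 2):
                weights[(g1, g2)] += freq
    return {edge: weight / total for edge, weight in weights.items()}

# Toy sampler output (illustrative only): collections of two gene sets each.
sampled = [
    ([{"g1", "g2", "g3"}, {"g4", "g5", "g6"}], 6),
    ([{"g1", "g2", "g4"}, {"g3", "g5", "g6"}], 3),
    ([{"g1", "g3", "g5"}, {"g2", "g4", "g6"}], 1),
]
graph = marginal_probability_graph(sampled)

# Keep only edges seen together in at least half of the samples.
strong_edges = sorted(edge for edge, weight in graph.items() if weight >= 0.5)
print(strong_edges)
```

Thresholding the edge weights, as in the last step, is one simple way to expose the high-frequency consensus subgraphs; the cutoff of 0.5 is an arbitrary choice for this toy example.
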
And Dendrix and muex only perform well on the high-coverage sets. muex performs pretty well at low coverage, but you can see its F-measure starts decreasing on the high-coverage sets.

We also ran CoMEt on real cancer data. Here the graph is the marginal probability graph, so each edge represents mutual exclusivity, and no prior knowledge is used. If you put these two modules side by side with the TCGA study, you can see there is a very high overlap with the RTK/RAS/PI(3)K signaling pathway. And in the other module here, you can see this is p53 signaling and here is the RB signaling pathway, bridged by CDKN2A. This overlapping structure may be explained by the different isoforms of CDKN2A involved in the RB and p53 signaling pathways. Another interesting observation is that CDKN2A here is a deletion. You can also see the co-occurrence between the pairs here; this is the co-occurrence p-value, and it is much stronger than the co-occurrence between any individual genes. This suggests that a copy number deletion of CDKN2A affects both isoforms and results in the alteration of both signaling pathways, and that alternatively you can target one gene in each of the pairs to alter both signaling pathways.

While developing this algorithm, we also observed that mutual exclusivity is not found only inside cancer pathways. We also found mutual exclusivity between subtype-enriched mutations. For example, PIK3CA, TP53, and ERBB2 are highly mutated in different subtypes of breast cancer. So if you run CoMEt, you will end up with a clique like this one telling you they are mutually exclusive. This motivated us to ask: can we simultaneously identify subtype-specific mutations and pathways together? So we create a mutation matrix by implanting predefined subtypes.
For each subtype we add a new row to the matrix that is marked as mutated in all samples except those of the given subtype, so that we can identify mutually exclusive patterns and subtypes simultaneously. Here we ran TCGA breast cancer with four molecular subtypes. Not surprisingly, you can see ERBB2 and CCND1 are highly associated with the HER2-enriched and luminal B subtypes, respectively. And there is a complex structure in the middle, which reveals subtle relationships due to subtype. For example, AKT1, PIK3CA, and CDH1 are highly related to luminal A, and TP53 is highly associated with basal-like. It also reveals some subtle relationships due to pathways. You can see MAP3K1, MCL1, AKT1, PIK3CA, and PTEN here form a strong exclusive set, suggesting there is a pathway related to luminal A. Taken together, CoMEt can jointly capture exclusivity both within cancer pathways and among subtype-enriched mutations.

I don't have time to go through all the data sets we ran, so I encourage you to go to poster 32. We ran AML and gastric cancer with subtypes as well.

To summarize, we have a new scoring function, we simultaneously analyze multiple combinations, we summarize the results over high-scoring collections, and we outperform other methods on simulated data and real data. I encourage you to download the paper for more detail, and you can also download the software to run the 2^k contingency table test. I would like to thank my advisor Ben Raphael, and also Max Leiserson, who is a co-author with equal contribution to this work, as well as Fabio Vandin and the other lab members. Thank you.

We have time for one or two questions.

I have a couple of questions. So basically you have defined a new criterion to assess the significance of a group of genes.
So do you use this criterion to scan all of the possible combinations in your data?

No, we didn't. There are just too many combinations to test, right? Say you have a thousand genes; you have a thousand choose two, choose three combinations to test. So we use MCMC sampling, sampling combinations in proportion to their weight, and you end up with the high-scoring sets.

And then how do you solve the multiple testing problem? Or, putting it another way, how do you calculate the family-wise error rate here?

Sorry, can you say that again?

How do you correct for multiple testing here? Because even if you use MCMC, it essentially scans all of the possible combinations.

We compare our results with random data sets. We only select the collections that score higher than in the random data sets, and then we only use those significant collections to plot the marginal probability graph. So there is indeed a significance test, but I didn't show it here.

Thanks, that was a great talk. The mutual exclusivity hypothesis: am I correct that it assumes a sort of homogeneous tumor population? Because we often see, when we sample multiple pieces of a tumor, convergent mutations to the same pathway. Would that throw off your analysis if you see that?

Yeah, so what you are talking about is subclonal things, right?

Yeah.

That's a good question. We haven't considered that in our data, but I can tell you that if you take a known pathway and there are subclonal things in one patient, there should be a lot of co-occurrence, right? We didn't see that so far. I don't know why; maybe all the samples we get are from the main subpopulation, or there's just no such problem in the pathway.

So our last talk of this session will be Dr. Maryellen Giger from the University of Chicago, who's going to tell us about some of the work we've been doing looking at the radiology information we have with TCGA data.