Great, thank you. I'm really excited to talk to you today about some work we've been doing on inferring intratumor heterogeneity. Many types of tumors are highly heterogeneous, containing multiple distinct populations of tumor cells, each with their own complement of somatic mutations. This intratumor heterogeneity arises because tumors evolve over time, as different descendants of the original or founding cell acquire new mutations, which they then pass on to their progeny. Furthermore, many tumors contain admixture by normal cells. There have been a number of recent studies that have really begun to highlight the extent of this intratumor heterogeneity, using a variety of different methods, from ultra-deep sequencing to multi-sectioning of different tumor samples to even doing further analysis of single tumor samples.
And so given the proliferation of sequencing data at modest coverage from things like TCGA, this really led us to ask the question: how can you actually infer tumor composition from a single mixed tumor sample? There's been a lot of really exciting work in this area recently, and most of these methods fall into two different categories. The first is what I'll call SNV-based methods, which cluster variant allele frequencies, or the fraction of reads that indicate an SNV, in order to determine tumor composition. However, because these methods rely on looking at individual points in the genome, they often require higher coverage to be able to overcome the variance in the data. In contrast, there are also CNA-based methods, which look at larger regions of the genome that have potentially been duplicated or deleted. Two of the older and more popular methods are ABSOLUTE and ASCAT. However, both of these methods were originally designed for SNP array data, and furthermore, they don't explicitly consider multiple subclonal populations.
So let's actually take a look at how copy number aberrations affect sequencing data. We generally identify copy number aberrations by looking at changes in read depth over genomic intervals. When we look at, in this example here, a heterozygous deletion, we'll see that the decrease in read depth is proportional to the fraction of tumor cells in the sample. So as the fraction of tumor cells in the sample decreases, the change in read depth gets smaller. Tumor samples also may contain multiple different tumor populations, and this type of intratumor heterogeneity has traditionally been viewed as problematic when inferring copy number aberrations. However, as sequencing costs have declined, the number of reads available to detect these subtle shifts in read depth has really made read depth a very strong signal in sequencing data. This is especially true for some of these larger copy number aberrations, where many thousands to millions of reads have been perturbed. Additionally, if a tumor population contains multiple copy number aberrations, we'd expect this shift in read depth to be consistent across all of them.
And so it's really these observations that have motivated the work that we've done in our group. In particular, we've devised a probabilistic model of sequencing data when the different tumor populations contain different copy number aberrations. So consider this example here, where we have a true mixture of two different tumor populations, and we've partitioned our genome into three equal-length intervals: blue, red, and yellow. The data that we observe from a sequencing experiment would be read depth information, where we can count the number of reads that align to each one of these intervals in our genomic partition. What we'd really like to do is to be able to go from this observed data to infer something about our true underlying mixture.
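As a rough illustration of the point about a heterozygous deletion and tumor fraction, here is a minimal sketch of the expected read depth ratio for a single interval. The function name `expected_depth_ratio` and the example fractions are made up for illustration, and the sketch assumes every cell is either a normal cell with two copies or a tumor cell with the given copy number. It also ignores the genome-wide normalization effect, so it is a simplification rather than the model from the talk.

```python
def expected_depth_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected tumor/normal read depth ratio for one interval.

    Assumes (hypothetically) a two-component sample: a fraction
    `tumor_fraction` of cells carry `tumor_copies` copies of the interval,
    and the remaining cells carry `normal_copies` copies.
    """
    mixed_copies = (1 - tumor_fraction) * normal_copies + tumor_fraction * tumor_copies
    return mixed_copies / normal_copies


# Heterozygous deletion (one copy left) at decreasing tumor fractions:
# the lower the tumor fraction, the closer the ratio stays to 1.
for fraction in (1.0, 0.6, 0.2):
    print(fraction, round(expected_depth_ratio(1, fraction), 3))
```

The drop below a ratio of 1 is half the tumor fraction in this setting, which is why low-purity samples show only a subtle shift.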
And so we're going to do this by first parameterizing our true mixture using two parameters. The first parameter is what we call our interval count matrix C. In C, the columns are integer values that represent the number of copies of each genomic interval in each subpopulation in the sample. Our second parameter is what we call a genome mixing vector mu, which describes the proportion of the different subpopulations in our mixture. So if we know C and mu, we'd like to be able to infer something about our observed read depth information.
So let's take a closer look at what happens with our sequencing data. Under typical assumptions, we're going to assume that our reads are sampled uniformly across the genomes in our sample. When we do this, the probability of any read aligning to any particular interval is going to be proportional to the amount of DNA from that interval in the entire tumor sample. However, it's really important to note that these probabilities are not actually independent values. Consider the case of a relatively large deletion. Reads that no longer align to this region because of the deletion may lead to an increase in observed read depth in other regions of the genome. So it's actually better to model this type of read depth data using a multinomial rather than independent binomials. When we do this, we can measure the probability of a read aligning to any particular interval just using our parameters C and mu. To complete our generative model, we can then model our complete observed read depth information as draws from a particular multinomial distribution that, again, depends only on our parameters C and mu.
So now, again, we'd like to be able to go in the other direction: given observed information about our read depth, we want to be able to infer C and mu. And so we do this using a maximum likelihood approach.
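The generative model just described can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: the function names, the example values of C and mu, and the read count are all hypothetical, intervals are assumed to have equal length, and the log-likelihood omits the constant multinomial coefficient.

```python
import math
import random


def interval_probabilities(C, mu):
    """Multinomial parameters p for equal-length intervals.

    C[j][k] is the copy number of interval j in population k (one column per
    population), and mu[k] is the fraction of cells from population k.
    """
    dna = [sum(c * m for c, m in zip(row, mu)) for row in C]
    total = sum(dna)
    return [d / total for d in dna]


def sample_read_counts(p, n_reads, rng):
    """Draw per-interval read counts from a multinomial with parameters p."""
    counts = [0] * len(p)
    for j in rng.choices(range(len(p)), weights=p, k=n_reads):
        counts[j] += 1
    return counts


def log_likelihood(counts, p):
    """Multinomial log-likelihood, up to the constant multinomial coefficient."""
    return sum(n * math.log(q) for n, q in zip(counts, p) if n > 0)


# Hypothetical example: three intervals, columns are (normal, tumor);
# the tumor has lost one copy of the middle interval and makes up 60% of cells.
C = [[2, 2], [2, 1], [2, 2]]
mu = [0.4, 0.6]

p = interval_probabilities(C, mu)
counts = sample_read_counts(p, 100_000, random.Random(0))
print([round(q, 3) for q in p])
print(counts)
# The true parameters explain the counts better than a flat, no-aberration model.
print(log_likelihood(counts, p) > log_likelihood(counts, [1 / 3] * 3))
```

Note that the deletion in the middle interval pushes the probabilities of the other two intervals above one third, which is the dependence between intervals that motivates a multinomial over independent binomials.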
However, it's important to note that this is actually not an easily solvable problem. We had to devise a special coordinate transform that allows us to solve this as a disjunction of different convex optimizations. I also want to point out that the problem is not identifiable in C and mu. It's only identifiable in the space of multinomial parameters P. We put this all together in our algorithm, which is called Tumor Heterogeneity Analysis, or THetA. THetA is a polynomial-time algorithm in the case when we consider a single tumor population and normal cells, and it can be extended to any number of different tumor populations. However, I do have to note that when you have multiple tumor populations, the runtime is exponential in the number of intervals in your partition.
So to demonstrate the efficacy of THetA, let's take a look at some simulated data. This is a simulation where I created a mixture of three different populations: normal cells and two different tumor populations, in 43% of cells and 20% of cells. What you see here are the actual copy number profiles for the different populations, where the color corresponds to which population it comes from. We applied THetA to this data set and found we did pretty well. THetA inferred the tumor purity within 3% of the true purity. And while there's some noise in the copy number aberrations predicted, we were able to recover most of the large copy number aberrations.
For comparison, we also applied ABSOLUTE to this same simulation. However, I do want to note that while ABSOLUTE has recently announced a new version, we weren't able to get a copy of that due to licensing issues, so the results I'm showing you here are for an older version. ABSOLUTE returns a collection of different solutions with different purities and different likelihoods, which an analyst can then go through and annotate.
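To make the identifiability remark concrete, here is a small hypothetical example (not taken from the talk) in which two different pairs of C and mu give exactly the same multinomial parameters. The two scenarios and their variable names are invented for illustration, intervals are assumed to have equal length, and exact fractions are used so the equality is not blurred by floating-point rounding.

```python
from fractions import Fraction


def interval_probabilities(C, mu):
    """Multinomial parameters for equal-length intervals.

    C[j][k] is the copy number of interval j in population k, and mu[k] is
    the fraction of cells from population k.
    """
    dna = [sum(c * m for c, m in zip(row, mu)) for row in C]
    total = sum(dna)
    return [d / total for d in dna]


# Columns are (normal, tumor); rows are two equal-length intervals.
# Scenario A: tumor in 1/2 of cells with a one-copy deletion of interval 2.
C_deletion = [[2, 2], [2, 1]]
mu_deletion = [Fraction(1, 2), Fraction(1, 2)]

# Scenario B: tumor in 1/3 of cells with a two-copy gain of interval 1.
C_amplification = [[2, 4], [2, 2]]
mu_amplification = [Fraction(2, 3), Fraction(1, 3)]

p_deletion = interval_probabilities(C_deletion, mu_deletion)
p_amplification = interval_probabilities(C_amplification, mu_amplification)

print(p_deletion)       # [Fraction(4, 7), Fraction(3, 7)]
print(p_amplification)  # [Fraction(4, 7), Fraction(3, 7)]
print(p_deletion == p_amplification)  # True
```

Read depth alone cannot separate these two reconstructions, since both predict the same distribution of reads over intervals, so any choice between them has to come from additional constraints or data.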
And so what I'm showing you here is a histogram of the purity values returned for all 12 solutions. The important thing to note is that the bulk of them are much lower than the true purity of 63%. We also noticed there's a parameter in ABSOLUTE that allows you to set the maximum amount of DNA that can be predicted as subclonal. By default, this is set to 5%. So we tried increasing that to see if we'd get better solutions. And while it returns many more solutions, some of which are near the true value, the most likely solutions are still much lower. So this is really demonstrating that it's important to explicitly consider multiple subclonal populations, in particular when trying to infer tumor purity.
So we've continued to do some work on THetA, and I want to talk to you about three improvements that we've made recently. The first is we've done a lot to improve the optimization when we consider multiple tumor subpopulations. In fact, it's over a thousand times faster now than it used to be. We've also extended THetA to be applicable to both whole genome and low-pass sequencing data. And we've further done some refinements that allow us to measure copy number aberrations at a more precise level, allowing us to do better analysis of highly rearranged or highly segmented genomes. I don't have time to talk about all three of these, so I just want to show you some examples that really highlight the second two points here.
So first off, we applied THetA to a number of whole exome samples. For comparison, we're comparing the purity values inferred by THetA to purity values that were reported by the original ABSOLUTE paper using SNP array data. For most of these, we find that our estimates of purity are quite similar. So let's take a look at one of these samples where both algorithms return similar purity.
If we apply THetA to whole genome data, again, we get a similar purity value. However, in contrast to ABSOLUTE, THetA was able to infer several subclonal aberrations here, deletions on chromosomes 3 and 13. So we wanted to do some further analysis to make sure these were real. We looked at read depth information for the whole genome data. What we did is we partitioned the genome into 50 kb intervals, and what you see here is the histogram of these read depth ratios. In this diagram, different peaks correspond to different copy numbers in different populations of cells in the sample. So for example, this big peak here corresponds to normal copy regions. We see amplifications and deletions that exist in all cells. But then we also see this other peak in between deletions and normal copy, which directly corresponds to these subclonal deletions that we predicted.
There are also some samples where ABSOLUTE appears to either underestimate purity or in fact fails to predict purity at all due to high subclonality. This is one example of such a sample, where THetA is able to infer information about the different subpopulations. As I mentioned, we've also applied THetA to low-pass data. So here is a reconstruction for a pretty rearranged breast cancer genome. We've looked at several other breast cancers with this as well.
So moving on to my last point. We've done some things that allow THetA to be applied to really highly rearranged genomes. This is one example of a lung cancer genome where we infer two different populations in 50% and 18% of cells. And we wanted to do some further analysis for this genome, so we created something that we call a virtual SNP array. What we do here is we identify heterozygous germline SNPs from the matched normal, and we want to look at the observed allele frequency for the minor allele, or the B allele, in the tumor genome.
And so let's consider here chromosome six. Chromosome six we predict to have normal copy, so we expect that it should contain both copies of these alleles. So we'd expect that these B allele frequencies should be centered right around 0.5, which is in fact exactly what we see. Now let's look at a different chromosome. Let's consider chromosome three here, where we've predicted it to be deleted in all tumor cells in the population. So one copy of the allele has been deleted, either the maternal or the paternal. We'd expect that these B allele frequencies should be close to either zero or one, which, when we look at that, is exactly what we see. Then for any region that we've predicted to be a subclonal deletion, where it's only deleted in a fraction of the tumor cells, we expect that these B allele frequencies should be not as far away from 0.5, but still shifted. And that's exactly what we see for some of these different subclonal aberrations that we've predicted. So this is just an example where we've used B allele frequencies as a way to validate our estimates.
We're also working on some things now where we want to incorporate B allele frequencies into the model. I briefly mentioned that there's an identifiability issue, and so THetA might return two equally likely reconstructions when it uses just read depth information. This is an example of a glioblastoma genome where that happened. So we've devised a probabilistic model which uses the reconstructions output by THetA in order to model the B allele frequencies, which we can then use to select the most likely solution. In this case, we end up selecting the solution that is mostly diploid, where we see some of these characteristic rearrangements from glioblastoma, like an amplification of chromosome seven or a deletion of chromosome 10.
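The B allele frequency reasoning above can be sketched with a small calculation. This is a simplified illustration, not the probabilistic model mentioned in the talk: the function name is hypothetical, it assumes a heterozygous germline SNP with one copy of each allele in unaffected cells, and it assumes exactly one copy is lost in a given fraction of all cells in the sample, with no noise or mapping bias.

```python
def expected_baf_after_deletion(deleted_fraction):
    """Expected B allele frequencies at a heterozygous germline SNP.

    `deleted_fraction` is the fraction of all cells in the sample (normal
    cells included) that have lost one copy of the region. Returns a pair
    (low, high): the frequency if the B allele was lost, and the frequency
    if the other allele was lost.
    """
    total_copies = 2 - deleted_fraction  # average copies per cell
    low = (1 - deleted_fraction) / total_copies
    high = 1 / total_copies
    return low, high


# Normal copy: both values sit at 0.5.
print(expected_baf_after_deletion(0.0))

# Using the 50% and 18% cell fractions from the lung example, and assuming
# for illustration a clonal deletion (all tumor cells, 68% of the sample)
# versus a subclonal deletion present only in the 18% population.
for label, fraction in (("clonal", 0.68), ("subclonal", 0.18)):
    low, high = expected_baf_after_deletion(fraction)
    print(label, round(low, 2), round(high, 2))
```

Under these assumptions the clonal deletion gives bands near 0.24 and 0.76, while the subclonal one gives bands near 0.45 and 0.55, closer to 0.5 but still separated. With normal cells in the sample, the bands move toward zero and one only as the deleted fraction approaches one.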
And so going forward, we're looking at doing more with this, and we really want to incorporate B allele frequencies and SNVs directly into the THetA model. So in summary, I've described to you THetA, our algorithm for inferring tumor purity and different tumor subpopulations. And I've introduced several improvements that we've made to THetA that allow us to apply it to whole genome data, including low-pass data, as well as whole exome data. And so with that, I would like to thank my advisor, Ben Raphael; Greta, who has been working on some of these new improvements to THetA; Ahmed, who was involved in the original development; and the rest of the people from the Raphael research group. Thank you.
Time for one quick question.
Oh, sorry, very nice talk. Have you tried to incorporate mutation data into THetA? Because this is basically copy number changes. So, yeah. Because there are some tumor types, basically, that are very quiet.
Yeah, so at this point, we are only using the read depth information, and that's something that we want to do in the future. So if there are different populations that are only distinguished by SNVs, right now THetA wouldn't be able to distinguish those. It's a great question.
Sorry, have you thought about any strategies to validate the conclusions you get?
I mean, so we've used the B allele frequencies at this point as our means of validation. At this point, that's not part of the THetA model, so it's using different data in order to validate the reconstructions that we've done.
But I don't think that can really validate the clonal structures that you're arriving at.
Yeah, I mean, I think that's a hard question in general and probably requires, you know, maybe doing some amount of lab work at that point. But yeah, I think it's definitely a hard question, and it's something that, you know, all the algorithms in this field have a problem with. How do you actually validate your predictions?
You talked about the thousand-fold improvement in the algorithm. What did that contribute to?
What did that what?
What did you do to get the thousand-fold?
Yeah, yeah, so, figure out how to do this in one sentence. So basically, we have to enumerate the different C matrices that are possible solutions. And we were able to algorithmically define a subset of them that guarantees that we obtain the optimal solution. We have new ways of enumerating those, and so rather than checking many more of them, we're checking many fewer of them, exponentially fewer. I can tell you more in detail.
Thank you, Layla, very nice. Our next presentation is from Mike Gatza on the genomic characterization of invasive lobular breast cancer. Mike.