to talk about uncovering the pseudo-subclonal structure of tumor samples from copy number variation in next-generation sequencing data. Thank you. I'd first like to thank the organizers for granting me such a wonderful opportunity to share with you some of the work I have been doing since I joined the Gabor Marth lab as a new graduate student.

As you are probably all aware, a tumor sample is always mixed with a certain amount of the surrounding normal tissue, so when you sequence the sample, the reads are really a mixture of DNA from both the tumor and the normal. We started this project to estimate the normal mixture ratio in any sequencing data from the copy number information you can extract from the BAM file, which is pretty straightforward to do. You count the read depths, in this specific example in a 10 kb moving window, across the tumor BAM file, and you count the same thing in the paired normal. The ratio between the tumor and normal read depths then captures the somatic copy number variants occurring in the tumor sample.

If you plot a histogram of this ratio, and if you assume the sample being called tumor is a pure, homogeneous sample with some heterozygous deletion events, you would expect, aside from the large peak around one corresponding to copy-number-neutral regions, a peak around 0.5 corresponding to the heterozygous deletions. What is interesting is that we actually observed the peak at 0.7, meaning the peak has been pushed toward where the copy-number-neutral ratio resides. Since a cell cannot have a non-integer copy number, the way to interpret this is to think of the original sample as a mixture of a tumor clone and a contaminating normal clone. Based on a simple linear combination you can solve for the contamination ratio; in this example the tumor purity comes out around 80%, which shifts the line up toward the neutral ratio, as observed here. Once you locate the peaks and identify where copy number two and copy number one fall, shown as the black and blue lines on these slides, you can calculate the contamination ratio; using that ratio together with the location of the copy-number-neutral line, you can predict where copy numbers three and four will appear in the histogram, and that fits the data extremely well. If you overlay these lines on the ratio plotted against chromosome location, you can see that they explain some of the observed segments perfectly.

We were really excited about these results, so we applied the same technique to all the other chromosomes from the same sample; what I showed was only chromosome 19 of a specific TCGA sample. And we observed something quite odd: different chromosomes appear to have different tumor purities, and the values tend to cluster together, with large differences between clusters but very small differences within a cluster. Now, this can't happen, because cells can only mix at a cellular level. How can different chromosomes have different contamination ratios, unless there is more than one subclone in the tumor and a hierarchical structure among them? That's where tumor heterogeneity comes in.
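To make the arithmetic concrete, here is a minimal sketch of that linear mixture model, assuming the ratio is normalized so the copy-number-neutral peak sits at 1. The function names and the example numbers are mine, not the speaker's implementation, and under this simple normalization the numbers won't exactly reproduce the figures quoted in the talk.

```python
def purity_from_peaks(r_cn2, r_cn1):
    """Tumor purity implied by the copy-number-2 and copy-number-1 peak
    locations in the tumor/normal read-depth-ratio histogram.

    Mixture model: a segment at tumor copy number c in a sample with
    purity p has ratio r(c) = r_cn2 * (p*c + (1-p)*2) / 2, because the
    contaminating normal cells stay diploid. Solving r(1) = r_cn1 for p:
    """
    return 2.0 * (r_cn2 - r_cn1) / r_cn2

def ratio_for_copy_number(c, purity, r_cn2=1.0):
    """Predict where the histogram peak for tumor copy number c falls,
    e.g. the copy-number-3 and copy-number-4 lines overlaid on the plot."""
    return r_cn2 * (purity * c + (1.0 - purity) * 2.0) / 2.0

# Example: neutral peak at 1.0, heterozygous-deletion peak at 0.6,
# which under this model implies 80% tumor purity.
p = purity_from_peaks(r_cn2=1.0, r_cn1=0.6)                      # -> 0.8
print(ratio_for_copy_number(3, p), ratio_for_copy_number(4, p))  # -> 1.4 1.8
```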
We model this as a hierarchical subclone structure, where the most prevalent mutation appears in all the cells deemed tumor, leaving out, say, 20% that is the normal contamination, while a less represented mutation exists only in a subset of the tumor subclones. At the locations carrying that minor mutation, the cells that lack it act as if they were normal, and that is why you observe different tumor purities on different chromosomes. The same logic applies to even more minor copy number events.

So we asked whether we could reconstruct the entire structure algorithmically, and the approach we came up with works as follows: attack the problem bottom-up, looking for the most diverged subclone first, and then work your way up the tree. For example, if you have observed the ratio data shown with the white lines, you initialize a model consisting of only the normal genotype, presumably 100% copy number two. Treating that as your tumor sample, you can predict where the ratio would be, and granted, it is going to be one everywhere. You then calculate the differences between the model and the actual data, and you get numbers representing contamination ratios; in this case they differ from each other. So you break the normal clone into two subclones: the right leaf still represents the normal clone, while the left leaf contains the mutations, snapped to integer copy numbers, that explain the differences observed in the actual data, along with the corresponding tumor purity.

One thing to note is that the most diverged cell carries both the more prevalent and the less prevalent mutations, so when you initialize the copy number profile of the smaller tumor subclone, you have to account for all the differences seen so far. With this model you can update your predicted values, and as you can see, one of the segments is now explained perfectly. You then update the numbers and do the same thing again, except that each subsequent time you always break the normal subclone, to account for the part that has not yet been explained, until the data fits the model well. We devised a score for this, which is simply the sum of the absolute differences between the model and the actual data, and the whole procedure is iterated until the score no longer improves. If you trace back the leaf nodes, you get exactly the structure from my previous slides, which is what we learn with this algorithm.

However, it is important to keep in mind that, given one set of ratio data, there is a chance that more than one actual biological structure corresponds to the same observation, for example the model on the left versus the model on the right. We don't think it is possible to distinguish between these two models from the copy number ratio data alone. The algorithm chooses to produce the model on the right simply because we believe it is a better representation of the actual biological process.
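As a rough illustration of this greedy loop, here is a simplified sketch in Python. It uses the same score (sum of absolute residuals) and the same stopping rule, but the splitting heuristic, the function names, and the way the new subclone's integer copy numbers are chosen are my simplifications for illustration, not the implementation described in the talk (in particular, this sketch does not make the new leaf inherit all earlier events).

```python
import numpy as np

def predict(clones, n):
    """Predicted tumor/normal depth ratio from a clone mixture.
    clones: list of (fraction, integer copy-number array); normal is cn 2."""
    r = np.zeros(n)
    for frac, cn in clones:
        r += frac * cn
    return r / 2.0

def score(obs, clones):
    """Sum of absolute differences between observed and predicted ratios."""
    return np.abs(obs - predict(clones, len(obs))).sum()

def fit_subclones(obs, max_clones=5):
    """Greedy bottom-up reconstruction (simplified sketch).

    Start from a single 100% normal clone; each round splits the normal
    leaf into (new subclone, smaller normal leaf). The new subclone's
    copy numbers are snapped to integers chosen to absorb the residual,
    and iteration stops once the score no longer improves."""
    n = len(obs)
    clones = [(1.0, np.full(n, 2))]                # root: pure normal
    best = score(obs, clones)
    while len(clones) <= max_clones:
        resid = obs - predict(clones, n)           # part not yet explained
        # size the fraction to explain the largest residual as a one-copy event
        frac = float(np.clip(2 * np.abs(resid).max(), 1e-3, clones[-1][0]))
        # snap each segment's implied copy-number change to the nearest integer
        cn_new = np.round(2 + 2 * resid / frac).astype(int).clip(min=0)
        trial = clones[:-1] + [(frac, cn_new),
                               (clones[-1][0] - frac, np.full(n, 2))]
        if score(obs, trial) >= best - 1e-6:       # no improvement: stop
            break
        clones, best = trial, score(obs, trial)
    return clones
```

With `obs` as a NumPy array of per-segment ratios, the returned list of (fraction, copy-number profile) pairs corresponds to the leaf nodes of the tree described above.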
And as many of you mentioned when talking to me yesterday during the poster session, this looks strikingly like what you would expect with a cancer stem cell, which accumulates mutations while, along the way, producing the cells that constitute the entire tumor.

We tested the method with simulations. If you start with a mixture of 40% normal and 60% tumor, with the tumor copy number profile shown here, the method predicts 42% normal and 58% tumor, which is fairly close. We incorporated a small error term that accounts for the noise you observe in actual sequencing data, which is why the numbers are slightly off, but all in all it does a pretty good job. If you change the simulation profile to include a third tumor subclone, the results are comparable: 20% for one subclone and 57% for the other, against 23% normal.

We then applied the method to some actual TCGA data. This is the structure the algorithm produces from the data I showed you earlier in the per-chromosome figure, and it looks pretty close to the one we worked out by hand; this is specifically ovarian cancer. Here is another example from glioblastoma, where you can see much higher heterogeneity than in the ovarian cancer, and there are a bunch of additional glioblastoma examples.

Just to give you a taste of what the program's output looks like: if you zoom in, the purple line, I don't know if you can see it, represents the observed ratio values subject to genome segmentation, so it essentially shows the somatic copy number events in your sample, and the dark green line is what the final model predicts; they look very close. There are even segments where you only see the green line, which just means the green line is perfectly overlapping the purple one.

In conclusion, the method can simultaneously estimate both the normal cell contamination ratio and a measure of tumor heterogeneity, and I think both will be very important for any downstream analysis. The algorithm is also pretty fast: in the worst case, if you have as many distinct copy number states as chromosome locations investigated, you need at most that many iterations to explain the entire data set. The actual speed-limiting step is counting the read depth: it takes roughly a day and a half to count the read depth of a whole genome sequencing data set at 40x median coverage, and roughly two minutes to come up with the subclone structure. The information can be used as a prior for any downstream analysis; for example, if you know there is a certain percentage of normal contamination, you might require less evidence to call a variant in your tumor sample. The method is also designed to be independent of the CNV caller: in this case I implemented my own very rudimentary, simple CNV caller, but if you have a favorite CNV caller you can just plug it in, since the method takes the segmented genome as input. Finally, the model it produces is a biologically motivated one, but there are definitely other possibilities; it represents a class of possible combinations consistent with the copy number data alone.
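For anyone who wants to reproduce this kind of sanity check, a hypothetical simulation in the same spirit is easy to set up. The window count, event coordinates, and noise level below are made up for illustration; only the 60/40 mixture and the idea of an added error term come from the talk.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical mixture like the one in the talk: 40% normal (diploid)
# plus 60% tumor carrying a few integer copy-number events.
n_windows = 5000
tumor_cn = np.full(n_windows, 2.0)
tumor_cn[500:1200] = 1.0          # heterozygous deletion
tumor_cn[3000:3600] = 3.0         # single-copy gain

purity = 0.60
mixture_cn = purity * tumor_cn + (1.0 - purity) * 2.0
obs_ratio = mixture_cn / 2.0 + rng.normal(0.0, 0.03, n_windows)  # noise term

# obs_ratio can now be handed to a subclone-reconstruction routine, such as
# the fit_subclones sketch above, and the recovered clone fractions compared
# against the known 60/40 ground truth.
```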
So that brings us to future directions. One thing I didn't say here, but which might be helpful, is to incorporate sequence data to try to differentiate between the cases of non-overlapping and overlapping mutations. Another thing we want to do is validation: since we are a bioinformatics lab, we are looking for collaborations with wet-lab groups to do some validation work. Also, the examples I showed were tested on whole genome sequencing data, and the next step is to test, and probably modify, the algorithm to work with capture data. And with that, I'd like to thank everybody in the lab; I'm pretty new to this lab and everybody has shown a tremendous amount of support for my work. I'd also like to thank TCGA for the opportunity both to get my hands on some excellent data and to present my work to the community. Thank you.

Time for questions; I'll start out with one. Very exciting, very nice work. How are we going to cope with representing the uncertainty? You gave a good example where there were a couple of different interpretations of the data, and I think we're struggling with that at Santa Cruz as we do similar work. We don't want to present one machine solution as the absolute truth when we're actually quite uncertain as to how the data should be interpreted. Do you have any thoughts on that?

Well, I have thought about that. Even in the example where I was breaking the tree structure, most of the time there will be different ways to break the left node and the right node that explain the data equally well, and I think there may actually be no way to tell which one is correct from the copy number data alone.

Absolutely, that's the point. We will have uncertain situations where there's no one solution that we're confident in; there could be other equally good solutions. So how can we communicate that fact to the biological community?

We should first of all make people aware of it. And then, most of the time, I don't think it is going to matter. For example, a location with a homozygous deletion in only 50% of the cells might be equivalent to a heterozygous deletion in all of the cells (0.5 x 0 + 0.5 x 2 = 1 copy on average, the same as 1.0 x 1): if you break the genome into pieces and apply whatever sequencing technology we use nowadays, there might not be a difference at the DNA level.

Right, but there's a very important difference at the biological level, so we need to know that.

Yeah, we might need to incorporate other data then.

All right, so it's an open issue. Any other questions?

Actually, I have a question related to David's. That concerned alternative scenarios, but suppose there's only one scenario: do you think you could also provide a confidence interval for the percentages of the different tumor subclones?

Yes, that is actually something I want to work on. Currently the method is based on a linear model, but it might need to be changed to incorporate a statistical solution, for example for estimating those percentages. I think the confidence will matter more for determining the actual integer copy number in a subclone than for the ratio itself.
So I guess that's where the confidence estimate is going to play an important role. Okay, let's thank our speaker again.