Good afternoon, TCGA. My name is Scott Carter. I'm going to be talking today about a lot of work that I did for my PhD thesis, much of which is very relevant to TCGA. It's going to have three parts. I'm going to talk briefly about a computational method that we developed to infer tumor purity, ploidy, and absolute allelic copy numbers directly from allelic copy number data, so either SNP array or sequencing data, and then I'm going to talk about some applications of that, both to design genomic experiments and to actually interpret the data and hopefully learn some new biology. One of the caveats of a big characterization effort like TCGA is that we're always analyzing data generated from a heterogeneous population of cells rather than from individual tumor nuclei. As you can see on my slide, you can think of a tumor as having at least two big populations: tumor cells and contaminating normal cells. We typically think of the normal cells as having a diploid genome, whereas the tumor cells can have an almost arbitrarily complex genome, with a very aberrant karyotype and very altered ploidy. These two attributes, purity and ploidy, dramatically affect the copy number signal you get when you mash these cells up and either run the resulting DNA on a SNP array or sequence it. So we were motivated to develop a method that could deconvolve at least these two populations and give you integer estimates of the allelic copy numbers fixed in the cancer clone, along with the ploidy and the purity of the tumor itself. This slide gives an overview of the inference methodology.
In the top left you can see the estimates collapsed from all the individual heterozygous markers in a given individual, which give you what we call homolog-specific copy ratios: essentially allelic copy ratios smoothed across the genome. We infer two distinct copy ratio values for every locus in the genome, one for each of the two homologous chromosomes. The challenge is to take this as input and work out the mapping from these discrete-seeming levels in the data to actual integer allelic copy numbers. On the bottom there are three distinct solutions, each of which is a mapping of the peaks you see in the data to integer allelic copy numbers. On the top right of the slide you can see that each of those three solutions corresponds to a different combination of tumor purity and ploidy. So the question is which of these three possibilities is the correct one, and in general, unfortunately, it's impossible to solve this problem from the copy data alone without bringing in some external information, because, as you can see, the blue and the green solutions fit the data equally well. To break ties like this we developed what we call karyotype models, which are bootstrapped up from thousands of tumor samples that we analyzed and also draw on some cytological data. They are essentially mixture models of recurrent cancer karyotypes, and they're specific to the tumor type you're analyzing, which makes them more sensitive, although that's not strictly necessary. Using these karyotype models you can often break the tie and get a good prediction of the true solution without having to resort to cytological or FACS techniques to learn the true purity and ploidy values.
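The ambiguity described above can be illustrated with a minimal sketch. It assumes the usual two-population mixing model (tumor cells at some purity and average ploidy, normal cells diploid) applied to total copy ratio; this is a simplification for illustration, not necessarily the exact formulation used in the talk. The function name and the example purity and ploidy values are hypothetical.

```python
import math


def expected_copy_ratio(tumor_copies, purity, ploidy):
    """Expected total copy ratio of a segment in a tumor/normal mixture.

    Assumes normal cells carry 2 copies everywhere; `tumor_copies` is the
    integer copy number of the segment in the cancer clone.
    """
    # Average number of copies per cell across the whole mixed sample.
    avg_copies_in_sample = purity * ploidy + 2.0 * (1.0 - purity)
    return (purity * tumor_copies + 2.0 * (1.0 - purity)) / avg_copies_in_sample


# Candidate A: 50% pure, diploid tumor, peaks assigned to 0, 1, 2 copies.
solution_a = [expected_copy_ratio(q, purity=0.5, ploidy=2.0) for q in (0, 1, 2)]

# Candidate B: 2/3 pure, triploid tumor, same peaks assigned to 1, 2, 3 copies.
solution_b = [expected_copy_ratio(q, purity=2 / 3, ploidy=3.0) for q in (1, 2, 3)]

# Both candidates predict peaks at the same copy ratios (0.5, 0.75, 1.0).
print(solution_a)
print(solution_b)
```

Because both candidates place the peaks in the same positions, the copy ratios alone cannot distinguish them, which is why some external prior such as the karyotype models is needed.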
One more caveat is that almost no tumor I've analyzed is truly clonal, even in terms of copy number. If you look at the black arrows here, there appears to be a subclonal gain of chromosome 2, which our method explicitly models as subclonal. The trick is not to overfit these subclonal copy alterations by proposing a much more complex karyotype than is truly warranted. To validate this we've done several experiments. The purity validation comes from mixing cancer and normal cell lines, running the mixtures on SNP arrays, and getting back more or less the correct mixing fraction. The ploidy analysis was done on some of the TCGA ovarian cancer primaries, where we actually went and did FACS on them to estimate ploidy, and you can see it correlates quite well with our estimates, although a few samples seem to have been misclassified. Before I move on, I just want to say that we've been using this method at the Broad for a couple of years to select only the highest-purity samples for whole-genome sequencing, to ensure that we have good power to detect alterations in those expensive experiments. That leads into my next point, which is about how purity and ploidy affect sequencing experiments. There's a strong relationship between purity and copy number. I think Gaddy has talked about this a lot: the higher the local copy number, the deeper you need to sequence to detect a mutation present at one copy per cell in that sample. Similarly, and it seems intuitive, the lower the purity of the sample, the deeper you need to sequence to adequately find mutations. Those relationships are illustrated on the slide.
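The dependence of sequencing depth on purity and local copy number can be sketched as follows. This is a rough illustration under stated assumptions, not the power calculation used in the talk: reads are treated as binomial draws, sequencing error is ignored, normal cells are assumed diploid, and the `min_alt_reads` calling threshold is a hypothetical parameter.

```python
from math import comb


def expected_allelic_fraction(purity, local_copy_number,
                              multiplicity=1.0, cell_fraction=1.0):
    """Expected fraction of reads carrying a somatic mutation.

    `multiplicity` is mutant copies per mutated cancer cell; `cell_fraction`
    is the fraction of cancer cells carrying the mutation.
    """
    mutant_dna = purity * multiplicity * cell_fraction
    total_dna = purity * local_copy_number + 2.0 * (1.0 - purity)
    return mutant_dna / total_dna


def detection_power(depth, allelic_fraction, min_alt_reads=3):
    """P(at least `min_alt_reads` mutant reads) under a binomial model."""
    p_too_few = sum(
        comb(depth, k)
        * allelic_fraction ** k
        * (1.0 - allelic_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


# Same depth, one mutant copy per cell: power falls as purity drops,
# as local copy number rises, or when the mutation is subclonal.
for purity, copies, cell_fraction in [(0.9, 2, 1.0), (0.3, 2, 1.0),
                                      (0.3, 6, 1.0), (0.9, 2, 0.2)]:
    af = expected_allelic_fraction(purity, copies, cell_fraction=cell_fraction)
    print(purity, copies, cell_fraction, round(detection_power(30, af), 3))
```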
Perhaps most interesting is the far right of the slide: if you want to be able to detect subclonal events at, say, 20% cell fraction, then you need to sequence fairly deep, but on the other hand the numbers you're seeing down there are pretty readily attained in a lot of the whole-exome experiments we're currently doing. Now I'm going to talk about rescaling the allelic fraction estimates that people always talk about into estimates of what we call multiplicity, which is to say the number of mutant alleles per cancer cell. This is data from the TCGA ovarian cancer whole-exome sequencing on Illumina. On the left we just combined all of the roughly 30,000 mutations that we detected, and the allelic fraction distribution doesn't seem to have a lot of structure in it. It's kind of a smear, and that's because there are a lot of copy number alterations in this tumor type and, perhaps more importantly, because the samples have very different purities, which totally obscures these allelic fractions. On the other hand, using ABSOLUTE you now know the integer allelic copy numbers at every position in the genome, and you know the tumor purity. With those and the allelic fraction you can rescale the raw allelic fractions into what we call multiplicity estimates, which you're seeing on the bottom right of this plot. There's a nice peak at 1.0, which is to say that the modal point estimate is about one copy of the mutant allele per cancer cell for the clonal cases, but you also see an additional peak on the left of this distribution, colored pink. These, we surmise, are subclonal mutations in these tumors. We think the distribution we're seeing here is mostly consistent with neutral evolution, so it's unlike the AML data we saw earlier today, which had discrete subclones at fairly high multiplicities or allelic fractions.
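The rescaling from allelic fraction to multiplicity can be written down in a few lines. This is a minimal sketch that assumes normal cells are diploid and uses point estimates only; the talk's method presumably handles uncertainty in the read counts, which is not modeled here. The function name and the example values are illustrative.

```python
def multiplicity(allelic_fraction, purity, local_copy_number):
    """Average number of mutant alleles per cancer cell.

    Divides the mutant share of the sample's DNA at this locus by the
    fraction of cells that are cancer cells.
    """
    copies_per_cell_in_sample = purity * local_copy_number + 2.0 * (1.0 - purity)
    return allelic_fraction * copies_per_cell_in_sample / purity


# Two clonal, single-copy mutations in samples of different purity have
# different raw allelic fractions but the same multiplicity of 1.0.
print(multiplicity(0.25, purity=0.5, local_copy_number=2))
print(multiplicity(0.40, purity=0.8, local_copy_number=2))

# A multiplicity well below 1 suggests a mutation present in only some cells.
print(multiplicity(0.05, purity=0.5, local_copy_number=2))
```

Pooling raw allelic fractions across samples of different purity smears these cases together, whereas the rescaled values line up at 1.0 for clonal single-copy mutations.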
This plot, I think, was important for us just to prove to ourselves that these subclonal alterations are likely to be real. The idea is that if these subclonal alterations were germline contamination or machine noise, they would have a different fingerprint, in the sense of a different mutant allele spectrum: the frequency of C-to-T or C-to-A substitutions, and so on, would be very different. But in fact, as you can see on the bottom left, they're very similar, which gave us some confidence. In contrast, if you look at the plot comparing the mutant allele spectrum of tumor mutations to germline SNPs, it looks very different. So this is a kind of fingerprinting, and it shows that the subclonal and clonal mutations in ovarian cancer have the same fingerprint. The next thing we did was to see whether we could use this multiplicity estimate to classify the mutated genes. We took the 15 genes that we discussed in the ovarian paper and tried to understand which were frequently homozygous, which is to say there were zero copies of the wild-type allele remaining in the cancer clone. As you can see, the top tumor suppressors, like p53, NF1, and BRCA2, have a large proportion of their mutations rendering the gene homozygous, meaning no wild-type copies are left, whereas a lot of the oncogenes do not have this property. In addition, we were able to see that p53 mutations were often present at two or more copies per cancer cell in ovarian cancer. 60% of the mutations in p53 were amplified in this sense, and this led us to believe that the p53 mutation is likely to be a very early event in these cells, since very few other genes in the genome have this kind of recurrent fraction of amplified mutant alleles.
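The two labels used above, homozygous (no wild-type copies left) and amplified (two or more mutant copies per cancer cell), follow directly from a multiplicity estimate and the local integer copy number. The sketch below is an illustration for clonal mutations only; rounding the multiplicity to the nearest integer is an assumption made here for simplicity, and the function name is hypothetical.

```python
def classify_clonal_mutation(multiplicity, local_copy_number):
    """Label a clonal mutation from its multiplicity and local copy number."""
    # A clonal mutation has at least one mutant copy in every cancer cell.
    mutant_copies = max(1, round(multiplicity))
    wild_type_copies = max(0, local_copy_number - mutant_copies)
    return {
        "mutant_copies": mutant_copies,
        "wild_type_copies": wild_type_copies,
        "homozygous": wild_type_copies == 0,  # no wild-type allele remains
        "amplified": mutant_copies >= 2,      # mutant allele itself duplicated
    }


# One mutant copy at a locus reduced to a single copy: homozygous.
print(classify_clonal_mutation(1.0, local_copy_number=1))
# Two mutant copies and nothing else: homozygous and amplified.
print(classify_clonal_mutation(2.1, local_copy_number=2))
# One mutant copy among three: heterozygous, not amplified.
print(classify_clonal_mutation(1.0, local_copy_number=3))
```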
The last thing I'm going to talk about is inferring genome doublings in human cancer development from these absolute allelic copy number estimates. Genome doubling has been widely speculated about for a long time, and if you look at cytological data, like on the top right here, there's clearly a bimodal distribution of ploidy by, say, FACS or spectral karyotyping. So it stands to reason that a lot of these tumors pass through a tetraploid intermediate, for example. On the bottom right I'm showing the distribution of ploidy estimates we got for various tumor types using ABSOLUTE, and you can see that it qualitatively recapitulates the bimodal ploidy distribution. But the additional information that you get from absolute allelic copy number data allows you to look at an individual tumor sample and make a more precise determination of whether or not it went through a genome doubling. Here I'm visualizing the low-copy homologs on the left and the high-copy homologs on the right for the same samples. The samples are sorted by ploidy, with high ploidy at the top and low ploidy at the bottom. This distribution recapitulates the bimodal ploidy distribution that you saw across all these cancers earlier. More interestingly, I think, right at the inflection point of the ploidy distribution you see a transition from low-homolog copy numbers of zero and one to low-homolog copy numbers of zero and two, and similarly on the right side you go from one and two to two and four. So this is even more precise evidence of genome doubling. We formalized this with a statistical simulation that gives p-values, but I think it's actually pretty obvious when you look at the allelic copy data.
Now we can take a large set of 3,000 copy profiles that we've analyzed with this method and try to characterize how common genome doublings are across human cancer. As you can see on the top right, it varies between cancers. For example, in the TCGA ovarian cancer data about 60 percent of the samples went through at least one genome doubling event during their evolution, whereas things like ALL and MPD don't usually have genome doublings, and a small proportion of the GBM samples from TCGA, maybe 20 percent, did genome double. On the bottom right I hope you can appreciate that it's not as simple as thresholding by ploidy. You really want to use the allelic copy number data and make thresholds in that space rather than on total ploidy, because the peak of the genome-doubled samples is at a ploidy of about three, which means that there are probably losses that occur prior to genome doubling and, I think, even more losses that occur following the genome doubling event.
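To make the point about thresholding in allelic copy number space concrete, here is a crude heuristic sketch. It is not the simulation-based test mentioned in the talk: it simply asks whether most of the genome has its higher-copy homolog at two or more copies. The function names, the 0.5 cutoff, and the toy segments are all assumptions for illustration.

```python
def ploidy(segments):
    """Length-weighted average total copy number.

    Each segment is (length, low_homolog_copies, high_homolog_copies).
    """
    total_length = sum(length for length, _, _ in segments)
    return sum(length * (low + high) for length, low, high in segments) / total_length


def fraction_high_homolog_doubled(segments, min_copies=2):
    """Fraction of the genome whose higher-copy homolog has >= min_copies."""
    total_length = sum(length for length, _, _ in segments)
    doubled_length = sum(
        length for length, low, high in segments if max(low, high) >= min_copies
    )
    return doubled_length / total_length


def looks_genome_doubled(segments, min_fraction=0.5):
    """Heuristic call made on allelic copy numbers rather than total ploidy."""
    return fraction_high_homolog_doubled(segments) > min_fraction


# Toy profiles (made-up lengths and copy numbers, not real data).
non_doubled = [(100, 1, 1), (50, 0, 1), (30, 1, 2)]
doubled_with_losses = [(100, 2, 2), (50, 0, 2), (30, 1, 2)]

print(ploidy(non_doubled), looks_genome_doubled(non_doubled))
print(ploidy(doubled_with_losses), looks_genome_doubled(doubled_with_losses))
```

In the second toy profile the ploidy lands near three rather than four because of losses, yet the high-copy homolog sits at two copies almost everywhere, which is the pattern a ploidy threshold alone can miss.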
Speaking to that idea of which happens first: if you look at the patterns of LOH in ovarian cancer by chromosome arm, you see that for a given arm the frequency of LOH in doubled samples and in non-doubled samples is nearly identical. I think that's reasonably good evidence that these aneuploidies, the chromosome-arm-level aneuploidies, tend to occur early on in ovarian cancer, specifically these LOH events but also amplification events, as seen on the bottom here. This is actually a general feature of the many different human cancers that we analyzed. In general, genome-doubled tumors have many more somatic copy number alterations. This is a log-log plot of somatic copy number alteration length versus frequency, and you can see the straight line fitting this power-law model. Interestingly, the slopes of the lines for each of the genome-doubling groups are very similar, suggesting that somewhat similar mechanisms govern these events, but you have more DNA, perhaps, so the rate of generating genome alterations is higher.
In ovarian cancer we were able to show that the number of mutations, for example clonal heterozygous mutations on the top, increases as a function of genome doubling. On the other hand, in the same bar plot, if you divide by ploidy that effect totally goes away, which is to say that the mutation rate per base is probably the same across these samples, and what you're seeing is just the effect of more DNA at risk of being mutated. Clonal homozygous mutations, on the other hand, tend to decrease as a function of genome doubling. As you might expect, after you double the genome it becomes significantly harder to create a homozygous copy alteration or mutation, and you can see on the bottom that homozygous deletions in ovarian cancer tend to go down as a function of genome doublings.
Interestingly, of the 15 NF1 mutations that I saw in this particular set of 214 ovarian cancers, 13 occurred in the non-genome-doubled set, in which case they're all homozygous. That suggests that you're getting specific selection on the recessive inactivation of NF1, as expected for a tumor suppressor. But what's even more interesting is that we didn't observe any amplified mutations in NF1 in the genome-doubled samples, which means that it's not that an NF1 mutation can happen early and then become duplicated by genome doubling, as happens with p53. It's more that an NF1 mutation seems to commit you to the non-genome-doubled trajectory of evolution in ovarian cancer.
Finally, just a few associations with the clinical data. The patient age at diagnosis tended to increase as a function of genome doublings, which was significant. I'm not quite sure what that means, but it's interesting and it might have some interesting relation to telomere biology. And there is a small but significant association between time to recurrence and genome doubling in ovarian cancer. So that's it. I'd like to thank Gaddy Getz and Matthew Meyerson and all my colleagues at the Broad and all of TCGA. Thanks very much.
Thank you, Scott. Any questions? Well, perhaps I'd like to start with a question. I actually had two questions, but let me start with this one. I'm curious if you could comment on why you see a power law between the number of these events and their length.
I think there's a paper currently in press, I believe it's Fudenberg et al. I'm sorry, I may be getting the name wrong. I think it's supposed to be related to the fractal globule three-dimensional structure of the human genome, which would predict that the contact length distribution would be power-law distributed, or close to it.
Interesting, okay. My other question: I didn't quite understand why the mutant allele spectrum would be different if the samples were contaminated.
Right. In general, germline variation always vastly exceeds somatic mutation rates, so you'd be swamped with germline variants, and they have a very different mutation spectrum.
Okay, thank you.
Cheers.
Okay, thanks again.