Thanks, Phong, for doing an excellent job with the lab. That was the first time he's done that, so I think he did a very good job. Okay, so now we're switching gears: instead of looking at macro-scale events in the genome, we're looking at the very highest-resolution events, somatic mutations. I wanted to start, though, by thinking about cancer and how it progresses, and putting cancers into the context of evolution. It will become clear over the course of the lecture why this is important.
The clonal evolution theory of tumor cell populations was first proposed in 1976, so quite some time ago, and it really casts cancer into the rubric of phylogenetics, or evolutionary theory. The theory predicts several features of cancer. One is that tumors will change over time. If we think about the unit of selection here as the cell, a cell will acquire new mutations upon cell division, and if a mutation confers a phenotypic advantage on that cell, it might outgrow its neighbors and expand. So we can leverage the concepts of evolutionary theory to reconstruct or explain our observations of cancer. The key aspect of this is the acquisition of new mutations over time, especially in the context of a therapeutic selective pressure: with the imposition of a drug, mutations that confer resistance to that drug will likely confer a selective advantage on the populations that carry them. So for concepts like drug resistance, evolutionary theory is critically important to understanding how resistance emerges. And the bottom line, as we discussed earlier this morning, is that tumors will be composed of heterogeneous clonal populations. To make sense of the sequencing of a bulk tumor, one cannot ignore this fact, and in many cases considering the population structure helps interpret the biology of the sample under study.
So in light of the two lectures, is there one type?
Yeah, so we did discuss a little bit of that, in the sense that it's probably too much for the cell to bear if two types of DNA repair mechanisms are deficient. If mismatch repair is deficient as well as, for example, homologous recombination, which repairs double-strand breaks, that's going to be a deleterious situation for the cell. In fact, I think drugs like PARP inhibitors exploit that. So defects in multiple DNA repair pathways are selected against, and the fitness of those cells is very low. It can be advantageous for the tumor to have one type of DNA repair aberrated, but not multiple.
So the notion this evokes is that we might have a tumor composed of multiple populations that looks like this, maybe at time point one. If we could monitor these populations over time, what we might see, potentially through the imposition of an evolutionary bottleneck with therapy, is the emergence of a population that is really quite rare in the primary tumor but becomes dominant over time. So the fabric of this tumor looks quite different at the end point than it did at the starting point. In particular, what we're interested in is phenotype. We've talked about the concept of a driver this morning, and another way to think about that is that a driver mutation is a mutation that alters phenotype. Evolutionary selection operates on phenotypes, not genotypes. So given that we may have differing genotypes, we can ask whether they alter phenotypic behaviors and, more importantly from a clinical perspective, how this relates to treatment response, progression and metastasis. This is now generally accepted in the field as being a major feature of cancers. For example, in this review paper by Bert Vogelstein, one can have several different types of heterogeneity.
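As a rough illustration of the bottleneck idea, here is a minimal Python sketch of two clones competing under therapy. The clone names, starting fractions and growth factors are all made-up assumptions chosen only to show how a rare resistant clone can come to dominate; it is a toy deterministic model, not a description of any real tumor.

```python
def simulate_clones(fractions, growth_factors, generations):
    """Toy deterministic model: each generation every clone is multiplied by
    its own growth factor, then fractions are renormalised to sum to 1."""
    current = dict(fractions)
    history = [current]
    for _ in range(generations):
        grown = {clone: frac * growth_factors[clone] for clone, frac in current.items()}
        total = sum(grown.values())
        current = {clone: value / total for clone, value in grown.items()}
        history.append(current)
    return history


# Hypothetical tumor at time point one: a dominant drug-sensitive clone
# and a rare drug-resistant clone.
start = {"sensitive": 0.99, "resistant": 0.01}

# Hypothetical per-generation growth factors while the drug is applied:
# the sensitive clone shrinks, the resistant clone keeps expanding.
on_therapy = {"sensitive": 0.5, "resistant": 1.2}

history = simulate_clones(start, on_therapy, generations=10)
for t in (0, 5, 10):
    print(t, {clone: round(frac, 3) for clone, frac in history[t].items()})
```

With these assumed numbers the resistant clone goes from 1% of cells to the large majority within ten generations, which is the qualitative picture described above.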
So within a sample, one cancer can be composed of multiple clones. Those clones might have different metastatic potential, or be selected for in different tumor microenvironments when they spread. And of course, it's the same type of concept that leads to dramatic interpatient variability as well. Because each patient's tumor undergoes its own evolutionary trajectory, it's not terribly surprising that we don't see large overlaps in mutation content from one patient to another. There may be a few genes that are actually the phenotypic drivers, and the rest of the mutations are maybe specific to that cancer or just benign passengers. So these are concepts that we really must come to grips with.
Work carried out by Charlie Swanton and others has revealed that evolution also happens in space, so anatomic space. This is a project that undertook sequencing of metastatic lesions at the time of diagnosis and profiled what mutations were present in each of these samples. This is one patient with multiple metastatic nodules; each row here represents one of these samples, each column represents a mutation, and a gray box means that the mutation was present in that particular sample. So there's a block of mutations here that's shared amongst multiple samples, and this can really be viewed as the ancestral clone. This is part of the initial clonal expansion that led to the tumor. But then you can see that there are many sets of mutations here that are specific to particular samples. These are probably mutations that arose later in evolution and were maybe very specifically selected for in the particular microenvironment of these tumors. So this is quite an important concept to think about when we're sequencing a bulk tumor. One question is how representative that one sample is of the entire mutation spectrum in the patient.
And then specifically, if we find, for example, a targetable feature only in this sample, we can tell that the drug is very likely to be ineffective in the rest of the samples. So knowing the diversity and the distribution of mutations across anatomic sites gives rich biological information and in some cases clinical application as well.
So driver versus passenger, I think we've covered that, and I think I've said everything I need to say about it. The concept of temporal drivers is really quite important. We can have driver mutations that initiate the neoplastic transformation. Classic examples would be P53 loss or a KRAS codon 12 mutation in, for example, pancreatic cancer. Then we can have driver mutations that confer metastatic potential. These are not necessarily mutations that initiate the tumor, but may be acquired later and allow those cells to evade cell contact inhibition and things like that, and then to spread through the vasculature, the lymph system or other cavities within the body. And certainly there are driver mutations that can confer chemotherapy resistance. A classic example is the T790M mutation in EGFR in lung cancer, where patients who have been administered EGFR inhibitors often develop resistance mediated by acquisition of this new mutation, and the clone that carries the mutation expands in the presence of anti-EGFR therapy. So that is a predictable driver mutation, but it is not present at dominant levels in the initial tumor.
Okay, good. So given that we can characterize a lot of biology from inference and interpretation of mutations, there have been, as you well know by this point, massive efforts in cancer genome sequencing to improve cancer biology knowledge and identify targets for therapies. There's this classic paper written by Hanahan and Weinberg and published in Cell in 2000.
There's been a 10-year anniversary version of this paper in 2010. How many people have seen this diagram before? Okay. Anyone interested in cancer biology should have seen it. It describes the characteristics that tumors share. But one thing it sort of leaves out is the underlying mechanism by which these features are acquired, and sequencing the genomes of cancers can really reveal that. So we can ask what genetic abnormalities underpin the ability of these tumor cells to achieve these phenotypes. Given the evolutionary context that I described, we can ask how these genetic abnormalities change over time. And from a pathway perspective, we can ask what genes or pathways are altered due to somatic genome aberrations.
I don't need to dwell on this, but major initiatives, with billions of dollars of investment, are being put into sequencing large-scale sets of tumors. The TCGA phase one is complete and probably contains about 5,000 cases, if not more. The ICGC initial phase is now underway and has three more years, and 25,000 cases are being proposed in the ICGC project. A lot of the informatics support is headquartered here at the OICR. So these are huge investments, and that's just the major consortia. Major centers across the world are now investing in both research studies and clinical use of sequencing to inform therapeutic options for patients. So it's not inconceivable that well over a million individuals will have their tumors sequenced in the next few years.
So here is one of the results of the TCGA, and I would stress that without doing unbiased analysis, some of these findings would never have been put forward. What this shows is a summary of the TCGA.
The columns here are the 12 major subtypes that were included in what's called the pan-cancer analysis. The rows are the genes, and the heat encoded in this matrix shows you the prevalence of mutations in that particular gene in that particular cancer type. So naturally, as you might expect, here's P53, this row right here. You can see that this is the most frequently mutated gene across human cancers. It crosses tumor types and is really the kind of boss gene, and we knew about this before. Nonetheless, its distribution amongst the major cancer subtypes maybe hadn't been completely appreciated. Then we see subtype-specific alterations. Loss of von Hippel-Lindau, VHL, is specific to kidney cancer, so that's a disease that's defined by von Hippel-Lindau loss. And we see other mutations, such as APC in colorectal cancer; I think on the order of 60% of colorectal cancers have APC mutations.
But some of the surprising findings were the following. I've indicated here two biological themes that can be drawn out of the TCGA. The first is that we see an enrichment for mutations in histone modification and epigenetic regulation. This was previously underappreciated as a mechanism for tumorigenesis. Epigenetic deregulation is now well accepted as a major driving force for tumorigenesis, in the same way that DNA repair, for example, has been known as a mechanism for tumorigenesis. Even more surprising, perhaps, is that mutations in the splicing machinery, which you'd expect to have a global or deleterious impact on the cell, have been shown in certain tumors to occur more often than you'd expect by chance. So I would stress again that these types of unbiased surveys do yield important discoveries, and the only way to achieve this type of result is to sequence the whole genomes of large numbers of cases. Okay.
So some of the cancer genome sequencing work is directly applied in the clinic. Traditionally there are well-known markers. For example, the BCR-ABL translocation in CML is the target of Gleevec. Then for diagnostics, as I mentioned before, in high-grade serous ovarian cancer, estimates range from 95 to 100% of cases having a loss-of-function P53 mutation. So if it doesn't have a P53 mutation, it's probably not high-grade serous ovarian cancer. Then there are companion diagnostics for targeted therapy, and I'll expand a little on that: for example, EGFR mutations for anti-EGFR TKIs in lung cancer, and certainly BRAF V600E in melanoma. There's a nice website here that lists some of the targets for personalized or targeted therapies for which inhibitors have been developed against mutations. And I already mentioned the secondary mutations in anti-EGFR-resistant tumors, so these are markers of resistance to therapy. Also, KRAS codon 12 mutation is a resistance marker in colorectal cancers that are treated with cetuximab. So these are mutations that can emerge in the presence of therapy, and often they're recurrent and predictable.
Okay, so let's now look in detail at the anatomy of somatic point mutations. I think we can all grasp this quite easily, but just so that we're all on the same page: we may have a sequence in the normal cell that looks something like this, and in the tumor cell we get a G to T substitution that results in a change in the protein. This is a real example of a P53 mutation that results in a loss-of-function, or truncating, mutation in that particular case. So these are the types of things that we're talking about. There are different classes of point mutations to know about. The first is a missense mutation, which can be defined as a single-base substitution altering the amino acid sequence of the protein, so it results in an amino acid substitution.
There are silent mutations, also called synonymous mutations, which are likewise single-base substitutions but do not change the amino acid sequence of the protein. Interestingly enough, there's a lot of work being undertaken on these right now. The general assumption is that silent mutations are benign and don't really have an impact, but there's mounting evidence that a fair number of them actually do have an impact, by creating cryptic splice sites or affecting regulation of transcription. You will often hear of nonsense mutations, which are synonymous with truncating mutations. These are single-base substitutions introducing a premature stop codon in the coding sequence. And then we have frameshift mutations, which are small deletions or insertions. A single base or a small number of bases is inserted or deleted, which changes the reading frame of the open reading frame; that results in a premature stop codon and is functionally equivalent to a truncating mutation. So when I say loss-of-function mutation, it's typically associated with a truncation or a frameshift, or it can be a copy number deletion, for example. That's another way to have a loss of function.
Okay, so a classic example of a missense mutation is one that I was involved in discovering, which is the missense mutation of FOXL2 in a rare subtype of ovarian cancer, work I did with David Huntsman. What we found is that this is the quintessential pathognomonic, or disease-defining, mutation. Every single case has a somatic mutation at this location in the FOXL2 gene. So it now defines the disease, is being used as a diagnostic in different countries, and is a clear separator of ambiguously diagnosed granulosa cell tumors. Sequencing for the mutation is very easy because it's a single base, and we can design assays to look for this specific allele.
It's the diagnostic definition for this disease. So this is really the quintessential hotspot-type mutation. Other examples that are much more common are in PI3-kinase. PI3-kinase drives PI3K/AKT signaling, one of the most commonly aberrated pathways in cancer, and it's often driven by a mutation in one of two hotspots in the PI3-kinase protein. This is just a diagram pulled out of cBioPortal. Have you been looking at cBioPortal? Okay, so this is a nice website that essentially contains the TCGA data and other datasets, and you can generate plots like this, which show the prevalence of mutations distributed across a protein. You can see that the mutations in PI3-kinase cluster in these two regions, first in the PI3-kinase domain and then towards the terminal end of the protein here. So these are well-known hotspots where mutations accumulate. Other well-known examples are KRAS codon 12 mutations in colorectal and pancreatic cancers, as well as BRAF V600 mutations in melanoma. I'll expand on those in a minute. If we look at these mutations zoomed in, this just shows you exactly what the reading frames are and how you can predict the effect. This is the amino acid that's being changed right here, and it can be hit in multiple ways, through substitution at any one of these positions in the genome. And this shows where the PI3-kinase gene fits into PI3K/AKT signaling in this KEGG pathway diagram. It sits here, and so can have a dramatic downstream effect on AKT, which drives signaling in a number of different directions. The mechanism of action here is that the phosphorylation capacity of PI3-kinase is changed by the presence of those mutations.
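To make the point about one amino acid being hit in multiple ways concrete, here is a minimal Python sketch that enumerates every single-base substitution of a codon and labels each with the point-mutation classes defined earlier (silent, missense, nonsense). The function names are made up for illustration, the example codons are chosen by me rather than taken from the slides, and frameshifts are out of scope because they are not substitutions.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitution(ref_codon, alt_codon):
    """Label a single-codon change as silent, nonsense or missense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    # Everything else changes the amino acid (this simple sketch also
    # lumps stop-loss changes in here).
    return "missense"


def single_base_substitutions(codon):
    """Yield (alt_codon, alt_amino_acid, class) for all 9 one-base changes."""
    for position, base in product(range(3), BASES):
        if base != codon[position]:
            alt = codon[:position] + base + codon[position + 1:]
            yield alt, CODON_TABLE[alt], classify_substitution(codon, alt)


# GGT encodes glycine; it is the codon usually cited for KRAS codon 12.
for alt, aa, label in single_base_substitutions("GGT"):
    print("GGT ->", alt, aa, label)

# A G to T change that creates a stop codon, as in the truncating example.
print(classify_substitution("GAG", "TAG"))
```

For the glycine codon, changes at the first two positions alter the amino acid while changes at the third position are silent, which is why a single residue can be mutated through several distinct genomic substitutions.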
So these are hotspot mutations, characterized typically by missense mutations clustering in small regions of the protein. By way of contrast, tumor suppressor loss-of-function mutations tend to distribute widely across the protein. Here's an example from a study that was carried out in ovarian clear cell carcinomas and other endometriosis-associated ovarian carcinomas. What we found is that approximately half the cases have loss-of-function mutations in a gene called ARID1A. ARID1A is part of the SWI/SNF complex, which is involved in chromatin modification and regulation, so it's one of this class of genes involved in epigenetic regulation. In this disease, like I said, about 50% of cases harbour mutations, and ARID1A is actually quite commonly mutated in other cancers as well. The pattern here is nicely depicted: the mutations are distributed almost uniformly across the protein. That's just to say that there are many ways in which you can knock out a protein. In the other case, PI3-kinase, there are only specific ways in which one can modulate phosphorylation activity, and that's why the mutations cluster in those regions. Here's a summary of that from Vogelstein, showing again PI3-kinase with mutations piling up in the kinase domain and the helical domain.
And then another example is the isocitrate dehydrogenase gene. This is another very surprising finding from unbiased sequencing. This is a metabolic gene. Metabolism gained fashion in the 1980s as a way of trying to understand cancers, but then it kind of went out of favour in the era of discovering new oncogenes. Then, with the advent of sequencing, these mutations were discovered in glioblastoma and also in AML. The first discoveries were made in the 2007-2008 era.
And already there are clinical trials testing inhibitors against this mutation, so that's a very rapid discovery cycle. This is a really interesting mutation that is characteristic of the hotspot oncogenic driver type. To contrast that, here I've put down two additional tumor suppressors, RB1 and von Hippel-Lindau, and you can see that the mutations are distributed all across the protein.
Okay, so what good are mutations? Well, one, they help us understand biology. But two, they really identify targets for therapy. A number of drugs have been approved by regulatory bodies, so FDA, Health Canada, CE mark, et cetera, to target these particular mutations. I've just given some examples. These are on-label indications, meaning that the drugs cannot be prescribed without knowing the presence or absence of the mutation. Okay, so a drug cannot be administered under the FDA or Health Canada without those conditions being met. So for metastatic melanomas that are BRAF V600E positive, one can prescribe vemurafenib, which is an inhibitor developed by Plexxikon, I think, and now marketed under the brand name Zelboraf. It's proven to be effective in unresectable melanoma. For EGFR, exon 19 deletions or this particular amino acid substitution are an indication for prescription of erlotinib in EGFR-expressing locally advanced non-small cell lung cancer. This has been shown to be effective, but naturally, and this is the evolutionary process at play, it selects for resistance mutations. A tumor carrying the T790M mutation is essentially inert to these EGFR-targeting drugs. So that's another example. And then for KRAS, this particular case is a contraindication.
In EGFR-expressing metastatic colorectal cancer, the mutation status of KRAS has to be wild type in order for patients to receive cetuximab. Because KRAS mutation is a resistance mechanism to cetuximab, the label indication is that these cases have to be KRAS wild type. So that's another example.
So let's look at how this can actually impact clinical response. This is the major paper published in the New England Journal of Medicine in 2010, Chapman et al., showing what's called a waterfall plot, which shows essentially the growth differential of tumors. Each line in this plot represents a patient; lines below the zero mark mean there was a response, and lines above the zero mark mean there was growth on the drug. You can see that with the BRAF inhibitor, most patients showed some degree of response, whereas with standard chemotherapy, which here is an alkylating agent, a lot of tumors didn't respond at all and far fewer cases exhibited a response.
Yes. Absolutely, yeah. That's really the focus of a lot of the next phase of these large-scale studies: to work with clinical trial samples where everything is controlled, we know which treatments were administered, we know the treatment arms, and we can then compare the genomic characteristics of patients who respond versus those who don't. Taking the samples as they are in, for example, TCGA or ICGC, it's kind of a mishmash where the treatment is not controlled and the data is not really available. So the next phase of these studies is really directed towards that. And a lot of smaller labs are doing these more focused studies. In my lab, for example, we're investigating the differences between the genomes of platinum-refractory high-grade serous cancers and those with long-term survivorship. And there are differences.
So we can start to get at mechanisms of why some cases are sensitive to platinum-based therapies and systemic chemotherapy-type drugs. The signals aren't obvious. We don't see something like a T790M mutation, but they're there.
Do you think that's just typical of these older therapies, being so much more broad?
Yeah, I think so. The disappointing part about targeted therapies is that they universally select for resistance. All we're doing is selecting for resistant clones. And this is where new modalities like immunotherapy, for example, are really quite exciting, because they can actually contend with the evolutionary capacity of cancers.
Good. So this is the sad story here. Just the presence of a mutation, of course, doesn't necessarily mean that one can use that drug. A lot of people are quite excited about the idea of off-label indications: taking a drug that, let's say, was approved for melanoma on the basis of BRAF V600E and saying, well, I have a colorectal cancer with a BRAF V600E, so I should use that same therapy. Unfortunately, what that ignores is that those are different cell contexts. Colorectal cancers express EGFR and melanoma cells do not. So what happens is that you administer the BRAF inhibitor and what you end up with is an EGFR-expressing colorectal cancer that does not respond. What this paper shows, from our friend Radu's department, is that a combination of both EGFR inhibition and BRAF inhibition is necessary for response in colorectal cells with BRAF V600E, and either therapy alone results in no response. Okay, so cell context is important.
All right, so moving beyond single genes: I've shown you the real kingpins among the genes that we know about in cancer, but what I want to focus on now is the genome. What can the genome tell us about different cancers, and how can it be used to stratify cases? We can think about two concepts.
One is the mutation rate, so just the abundance of mutations in a given cancer. The second is mutational signatures. Mutational signatures are essentially a description of the distribution of the types of nucleotide substitutions that we observe in particular cancers. What this paper from the Broad showed is that different tumor types exhibit very different mutational signatures. This is the doughnut or bagel plot, as they called it, where the distance out from the center of the circle shows the number of mutations per case, each dot is a particular tumor, and arrayed around the circle are particular substitution types. What's shown here is a cluster of cases that have a predominance of C to A mutations, and these are characteristic of lung cancers. So can anybody make a guess as to why lung cancers, and really only lung cancers, are associated with these C to A mutations? Yeah, smoking induces this substitution in DNA. Similarly, melanomas are characterized by a very large number of mutations, predominantly C to T mutations, and not surprisingly C to T is the signature of UV damage to DNA. Melanomas have high prevalence in places like Australia, where the ozone layer was unfortunately eroded and protection against UV radiation was reduced. So this is an example of the types of signatures that one can actually use. Irrespective of gene content, we can learn a lot about the underlying mutational mechanism that potentially gave rise to these particular tumors. And then we might have cases like AMLs. As you can see here, they're characterized by very few mutations overall. Acute myeloid leukemia is a very aggressive cancer, but it's characterized by very few mutations, so it's quite interesting from that perspective.
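As a hedged sketch of what tabulating such a substitution spectrum can look like, the Python below collapses each substitution onto the pyrimidine strand (a common convention, under which a G to T change is counted together with C to A) and reports the fraction of mutations in each of the six classes. The mutation list is invented toy data and the function names are made up for illustration; this is not the method used in the paper being discussed.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def substitution_class(ref, alt):
    """Collapse a substitution onto the pyrimidine (C or T) reference base,
    so that G>T on one strand is counted as C>A on the other."""
    if ref in "AG":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def spectrum(mutations):
    """Return the fraction of mutations falling in each of the six classes."""
    counts = Counter(substitution_class(ref, alt) for ref, alt in mutations)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in CLASSES}


# Invented toy data: (reference base, tumor base) for six somatic SNVs.
toy_mutations = [("C", "A"), ("G", "T"), ("C", "T"),
                 ("G", "A"), ("C", "A"), ("T", "C")]

for cls, fraction in spectrum(toy_mutations).items():
    print(cls, round(fraction, 2))
```

In this toy list half the mutations fall in the C>A class, the kind of skew described above for lung cancers, while the total number of mutations per case is the separate mutation-rate axis.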
This is a different summary of similar data from the TCGA pan-cancer analysis, showing that both the abundance of mutations and the types of mutations, as depicted by these signatures, are highly variable across different human cancers. So this can really be used to learn something about what's going on in those cancers. Even within particular cancers, the genomic properties of mutations are really important. This is the endometrial carcinoma, or uterine carcinoma, TCGA paper. This, by the way, I think is my favorite TCGA paper. A lot of them are a little bit intellectually vapid, if you will. This one is fantastic. It shows that there are major subgroups of endometrial cancer that are characterized in a number of different ways. This is a group of ultra-hypermutated cases with huge numbers of mutations. This is a log scale here, and you can see that these cases have orders of magnitude more mutations than the other subgroups. Their mutation signature, shown in the stacked bar plot here, is also very different from the rest of the cancers. This hypermutation mechanism is hitting all kinds of substitutions, whereas the other cases have an enrichment for specific types of substitution. These ultramutated cases typically don't have copy number alterations, as we discussed earlier, whereas this group over here has a much lower number of mutations but a high incidence of copy number change. These are the high-grade cancers that are very much analogous to high-grade serous ovarian cancers. So you can see that there is just a dramatic difference in genomic characteristics, and not surprisingly, there are dramatic differences in outcomes. The high-grade cases, as you might expect, have much poorer outcome.
Whereas these POLE hypermutated cases almost all do extremely well. This has been validated several times over now: in these POLE-mutated cases, ultra-hypermutation really is an indicator of very good prognosis. Okay?
Okay, so let's look at the statistical considerations. We've really talked about this in the context of copy number, but it's worth going over them again. We have the tumor-normal admixture problem. We have intratumoral heterogeneity, or clonal diversity. We have genomic instability. And the experimental design that's engineered for capturing somatically acquired mutations, those present in the tumor and not the normal, necessitates doing two sequencing reactions: you sequence the normal and the tumor. That presents an opportunity for new analytic methods, and we'll go over some of these approaches. The first step, as you've already gone over, is aligning these millions of short reads to a reference genome. When you do that, you might end up with something like this. Once we align, we can start looking at where the mismatches are, and places where we have recurrent mismatches are good candidates for the presence of mutations. For the alignment process, there's a huge number of methods, some of which are listed here; I've probably omitted some as well. But that's a well-oiled activity in the computational biology space now. So once we have alignments, then what we can do is... yes?
Yeah, so that's definitely the case. In particular, for looking at SNVs, what you really want to be careful about is making sure that indels are accounted for when aligning. A lot of the short-read aligners don't handle that very well, so what a lot of groups do is what's called local realignment at indels.
That turns out to make a big difference in terms of the accuracy of SNV calls, because if you don't align a gapped read correctly, it gives a false positive. I'll show you an example of that. So the aligners that handle that more accurately are better for SNV calling. It may not make a difference if, for example, you're binning across 1 kb regions; there you just need to know that the read aligns there, and it doesn't need to have the precise gap in the right place. So, yeah, it does make a difference, sure.
Okay, so this is what allelic count data might look like. Here's an example of a normal genome. Here's the reference. We align and we see that there are some positions that have variants. We can collapse this down into two numeric vectors, which tell us the number of reads that match the reference and the total number of reads. For example, here's a locus that matches the reference every time, so we have six and six. Here's one that has a homozygous SNP, so zero reads match the reference, and here you have what looks like a heterozygous SNP, where half the reads match the reference. When we're looking for somatic mutations, this is essentially what we're doing: we overlay the tumor, and we can see that the red locus, which in the normal had all the reads matching the reference, looks like it has a variant in the tumor, with half the tumor reads showing this substitution, A to C. This locus is conserved in the tumor: we see it as a homozygous SNP in the normal, and it's also there in the tumor. And this one here is a heterozygous SNP in the normal, and it also looks like a heterozygous SNP in the tumor. So what we want to do here is isolate the red locus from the blue loci.
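The two count vectors described above can be sketched in a few lines of Python. This is only a naive threshold rule on made-up counts that mirror the slide (six reads per locus), with hypothetical cutoffs picked for illustration; it is not the statistical model used by real somatic callers, and it ignores all of the sequencing artifacts that matter in practice.

```python
def variant_fraction(ref_count, depth):
    """Fraction of reads at a locus that do NOT match the reference."""
    return 1.0 - ref_count / depth


def classify_locus(normal_ref, normal_depth, tumor_ref, tumor_depth,
                   max_normal_vaf=0.05, min_tumor_vaf=0.2):
    """Naive rule with hypothetical thresholds:
    - variant reads in the normal        -> germline (a SNP, a 'blue' locus)
    - variant reads only in the tumor    -> somatic  (a 'red' locus)
    - otherwise                          -> reference
    """
    if variant_fraction(normal_ref, normal_depth) > max_normal_vaf:
        return "germline"
    if variant_fraction(tumor_ref, tumor_depth) >= min_tumor_vaf:
        return "somatic"
    return "reference"


# Made-up allelic counts for four loci: reads matching the reference and
# total reads, in the normal and in the tumor.
normal_ref   = [6, 0, 3, 6]
normal_depth = [6, 6, 6, 6]
tumor_ref    = [3, 0, 3, 6]
tumor_depth  = [6, 6, 6, 6]

calls = [classify_locus(nr, nd, tr, td)
         for nr, nd, tr, td in zip(normal_ref, normal_depth, tumor_ref, tumor_depth)]
print(calls)
```

The first locus is the red one: clean in the normal, half variant reads in the tumor. The second and third are the homozygous and heterozygous SNPs that show up in both samples.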
So the red locus are the things that we're looking for. And we can do this with various types of statistical models. I won't go into the details of this, but essentially there's a fairly good set of tools now that can accurately pull out these red type of somatic mutations across the whole genome in a reasonable computational time. So when we first started this work, it's worthwhile going through the exercise that we really actually restricted ourselves to these allelic distributions and say, well, we can actually write down a very elegant statistical model that will distinguish all these different classes. And from a theoretical perspective, we can simulate data and we get perfect results. It's fantastic. We can have this nice model and it works really well. In practice, unfortunately, the realities of the sequencing technology render a lot of these predictions as false positives. So when we first started getting engaged in this work, we were looking at approximately 3,000 mutations and we revalidated these, we re-sequenced these to see if we could recover them again. And interestingly enough, only a third of them turned out to be actually real. And so we wanted to try to learn something from this. So we started digging into what was the reason for why these mutations were not validated. And some of the artifacts that were due to, for example, misalignment, these would be reads that would align equally well to some other place in the genome and got misplaced at this particular locus. And so you can see there's a signal here that looks like a mutation, but in fact, it's just due to misalignment. In this one here, we have a gap here that's an indel. And what that is doing is that in these particular reads here, the gap is not properly inserted. And so what that creates is a series of mismatches that are artifactual. In fact, the gap would create a different position for these reads. 
And then that signal, just by looking at this particular locus, would be rendered back to what it should be, which is no change here. Then the base-calling software of Illumina machines has some uncertainty associated with it. And so essentially what happens is that there's an image taken at each stage. I'm sure John went over this. And then there's a base call that's made based on the color that's emitted from a particular cluster on the flow cell. And that is a noisy process. And so sometimes it's very certain what that call should be. And other times there's noise in that distribution, so there could be multiple possible bases. And so we can calibrate the quality score of a base. And in many cases some of the false positives were due to just low base quality, because the base call was actually incorrect. So those are sequencing errors. In some cases we saw examples where all the mutant reads were sequenced only in one direction. And that's a result of an optical PCR problem. That's well known. It's called strand bias. And then in some cases we saw revalidation results that were just kind of mystifying. So in this case here you have very clean signals that couldn't be explained by things like base quality, presence of an indel, strand bias or anything like that. But it just doesn't validate. So there's something interesting going on: here's what looks like a really good signal, but it turns out to be a false positive. So what is going on there? On the flip side you have true positive examples that maybe have no business being true positive examples. Here's one that has very few reads that support the mutation. We called it and it turned out to be real. And so this is probably due to a very minor population in the cancer that harbors these mutations. So maybe this has a cellular prevalence of somewhere around 10%. And so we need to be sensitive to these types of signals. How do you pick up those types of signals? Yeah, so there really is a trade-off.
There's always going to be a sensitivity-specificity trade-off. What you hope to do is capture things like this while at the same time ignoring the noise in the system and not being subject to false positives. So I'll explain some of the ways that we can try to do that. So here's another one that's very, very rare. Only two reads there show the signal. Okay, so given this set of... Yeah, say that again? In such cases, do you look up the minor allele frequency? Oh, well that is what we're calculating, the minor allele frequency. Or the variant. It's the counts of the variant. And so that count level really has to be high enough to have the statistical power to resolve it. And so that's... I'll show you some work that we've done to try to get to that. So how can we leverage this information to be able to address these various comments? Well, there are a number of different measurements that one can extract from the data. And so at a given locus, one can compute many different quantitative metrics. So these include things like mapping quality, base quality, the strand bias, the actual number of mutant alleles versus wild type, et cetera, et cetera. Homopolymer runs, the presence of indels, many, many different features. So we undertook a study to see if we could learn something, leveraging machine learning techniques, from the measurements that we had taken. This is work that was led by Jiarui Ding in the lab. And so what Jiarui did is he took these 3,000 mutations, computed 106 different features from each one of these mutations, some of which were jointly computed from the tumor and normal, and then tried to see if we could separate the wheat from the chaff, so to speak.
And just a general exploration of this with principal component analysis of the data suggests that when you project this onto three dimensions, in fact, the black points, which are the somatic mutations, actually can separate quite nicely from the germline and the wild type positions, and the germline and wild type can also be separated. So we had some confidence that this would actually work, and so we used a random forest-based approach to learn a classifier that weighted all the different features appropriately and to see if we could increase sensitivity and specificity. And so just looking at accuracy metrics in a cross-validation study, we could show that we really... This is an AUC plot to show that we could dramatically improve the accuracy of calling through this method that took into account all these different features on top of the allele counts that I showed earlier. So allele counts are elegant in the sense that if everything's working correctly, then we can really create nice models, because there are really good probabilistic distributions that one can leverage for count data. But as it turns out, the artifacts end up contributing more signal than we would like, and so they need to be modeled as well. And so with this result, we were actually able to characterize what the major contributors to false positives are. And the first is that strand bias was a major contributor. Then there's a known pattern which has cropped up, which is that sequences involving GGT trinucleotides often get read as GGG, and that's just a machine artifact. We had misalignments due to repetitive sequence, so repetitive sequence is a particularly nasty feature of the genome that results in misalignments. Then we had another group of mutations with low base quality. And then this one is really quite interesting in that this is the profile, if you will, of the true somatic changes.
And this group of mutations here had all the similar characteristics of the somatic mutations, but in fact had just weak evidence of the variant in the normal. So this is a stochastic sampling error whereby... So whenever we sequence these fragments, it's like reaching into a bag and there's a mixture of alleles. And you pull out and you get what you get. So we've all had a case where you're playing Yahtzee or something and somebody gets a couple of Yahtzees in a row. That's pretty unusual, but it can happen. So that's essentially statistical tricks being played there and you just get unlucky and don't sample the germline alleles in the normal. So that's what that group is all about. So that's sort of an overview of some of the things that we need to consider from an analytical point of view when detecting somatic mutations. And there are a number of tools now that are really robust and available and are used at very large scale, like in the TCGA or ICGC, that work well, and these problems are largely a thing of the past, but I wanted to just draw your attention to them. These are important considerations, and I think you'll look in the lab at certain examples where mutations that may be called are actually probably not real. And so you have to pay attention to that. So I'm going to skip over this and skip over this and I'm going to skip over that. So some of the available tools for SNVs: a really useful package is SAMtools. That's a useful thing for extracting some of these features that I'm talking about. There's a really good set of libraries and I'm sure you've already used it. Yeah, we've already used it. So that's good. The Broad Institute has this Genome Analysis Toolkit called the GATK. It's another reasonable framework for extracting these features of mutations.
I wouldn't recommend using it for somatic mutation calling; it's really designed for things like the 1000 Genomes Project, et cetera, for normal human variation or for clinical congenital abnormalities on normal DNA. So of the available tools for somatic mutation detection, probably the most popular is called MuTect, out of the Broad Institute. And you can see that it has a nice likelihood function computed on the allele counts, and then essentially what they do is they push their high-quality or high-probability mutations through a series of filters that account for these various features that I've highlighted. And so it's kind of like a likelihood test followed by filtration. And it claims to be sensitive down to an allele fraction of about 5%. So obviously you had a question about that. And so through those filters, then one can actually start separating the signal from the noise. And that's always a function of coverage. So the higher the sensitivity one wants, if we want to be sensitive to, let's say, 1%, 30x coverage obviously isn't going to cut it, right? Because you'd be missing that allele more often than not. But if you get up to 500x or 1,000x then you can start being confident at a 1% allele frequency. Another very popular tool is called Strelka. And this is actually from Illumina. And so it actually has something very similar in the sense that it tries to filter reads, and it has a feature though that's really quite important, which is a realignment of indel locations. And this makes the algorithm quite slow, but certainly for indels, in our lab we use this in our production platform for calling indels. And then our mutation caller is called MutationSeq. And it's a standalone Python package and it actually has a visualization component that's developed by Sydney Nielsen in the lab that accompanies it.
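The point about coverage can be put into numbers with a small binomial calculation. This is a minimal sketch, assuming reads are independent draws, ignoring sequencing error, and using a made-up rule that at least three variant reads are needed before a site would be considered; none of these choices come from MuTect or Strelka.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_variant_reads=3):
    """P(at least min_variant_reads variant reads) under Binomial(depth, allele_fraction).

    min_variant_reads=3 is an illustrative assumption, not a caller default.
    """
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1.0 - p_fewer


# Chance of sampling enough variant reads for a 1% allele at several depths.
for depth in (30, 100, 500, 1000):
    print(depth, round(detection_probability(depth, 0.01), 4))
```

Under these assumptions a 1% allele is almost never sampled three times at 30x, while at 1,000x it is sampled three or more times nearly always, which is the intuition behind going to very deep coverage for rare alleles.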
And so from a whole genome library, you can quickly summarize, for example, allele ratio distributions across the genome. You can get a sense for where to maybe draw a threshold for calling. So it's a probabilistic model that outputs a probability. You can see here that there's a little inflection point on the left, and somewhere around here is probably where the real mutations lie. So you can draw a threshold there and that would output the high-quality mutations. Part of the visualization package is to plot the mutation signature automatically, and I think Phong is going to show you how to do that outside of this package in R Markdown. And so this is actually quite convenient. So MutationSeq provides the trinucleotide context of these signatures as a native output. One of the things that we've used it for, and this is a painful story for Phong, is to do some QC analysis. So it's well known that oxidation of DNA during sonication in the library construction process can induce substitutions, C to A substitutions. And so this is particularly problematic because what happens is that these mutations actually get written into the DNA. So they're real, but they're an artifact of the library construction process. And so what we found in a couple of lymphomas that we've been working with in Phong's project is that we actually had a massive overabundance of C to A mutations that were essentially drowning out the signal. And this is due to this oxidative problem during sonication. So it's a sad tale, but at the same time it shows that you can use these visualization tools to do some QC on the data. And I recently had an experience with a collaboration outside of our center. The collaborator sent us data that was sequenced at their center. And the agreement was that we'd do some mutational analysis for them and do some interpretation on their project. And it was a series of very precious samples that were very hard to collect. And what we got back was data that looked like this.
And so it was really a sad tale. But it's much better than not knowing that this is the case and then over-interpreting the data. So doing some QC steps on these distributions is really quite informative. And what you can see here is that this set of mutations corresponds very cleanly with very low prevalence alleles. So these are just likely not mutations that are real at all. And you can see there's another density cloud up here that is probably representing the real mutations. These ones down here are probably all just artifacts. So it's just a cautionary tale there. The format for variant calling is VCF. And this is kind of what we call a community standard. It's definitely not a technical standard, but it is sort of widely adopted. Unfortunately, it has a very flexible format. So you can't really call it a standard at all. But nonetheless, it's what people use. And you know, the computer scientists in the room shudder at this. But this is what's used in the community and you should get used to knowing what VCF is. So it has two components. It has a header, which shows essentially some metadata about what's gone into creating the output. And you can see there's this INFO tag, which tells you essentially what fields are located in the INFO column. And so here for MutationSeq, we have probability, which is PR. And then there's a little description that says, okay, this is a probability of somatic mutation. Then we have another field called TR. And it's a number and it tells you the count of tumor reads with the reference allele, and similarly for the normal, and then you have the trinucleotide context. And so this is the metadata at the head of the file, and then you get into the real data. It looks something like this. There's the chromosome and the position on the chromosome. You can have an ID, and sometimes this is the rsID of a SNP. That's really what this ID is for.
Of course, for somatic mutations, we hope that they don't have rsIDs in dbSNP. And then you have the reference base, the alternate base, you have a quality score. You can have a filter which says, does it pass this given filter? And then you have this INFO field. And so the INFO field is actually a semicolon-delimited set of fields that correspond to those INFO tags in the header. So here's the probability of that being a somatic mutation. And you can see here that that's a very low probability. And then the counts here are given. So this is tumor reference, tumor alternate, normal reference, normal alternate. This is the trinucleotide context. And then the number of indels in the surrounding vicinity is also given. Okay, so I think you're going to explore VCF in the lab. And so, you know, you should get used to looking at these types of files. They're quite useful. And in some ways, they're nice because they're text files. You can actually grep through them. You can pull out just specific rows, et cetera, like that. Or you can get fancy and use tools like sed and awk to filter these files if, let's say, you want to find all mutations with a given probability or higher. So you can use Unix tools to pull those out. And there are also some tools that have been developed specifically for manipulating and working with VCF. I think there's a package called VCFtools that is quite popular and you can use that. The other attractive thing about VCF is that a number of the downstream annotation tools that you'll learn about tomorrow actually take VCF as input. So, for example, this is quite useless information if you don't know the gene content of the mutations, right? This is just pulling out coordinates of the genome and saying there's a mutation at this position. Of course, we all want to know what gene is there and what amino acid substitution corresponds to that mutation.
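As an alternative to grep, sed, and awk, the same kind of filter can be sketched in a few lines of plain Python. This is only an illustration: PR and TR are the tags named in the lecture, while TA, NR, NA, and TC are assumed names for the tumor alternate count, normal counts, and trinucleotide context, and the two records below are invented, not real calls.

```python
def parse_info(info_field):
    """Split a semicolon-delimited INFO string into a dict of tag -> value."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flag-style tags carry no value
    return info


def high_probability_records(vcf_lines, min_probability=0.9):
    """Yield (chrom, pos, ref, alt, info) for records whose PR tag passes the cutoff."""
    for line in vcf_lines:
        if line.startswith("#"):  # skip "##" metadata and the "#CHROM" column header
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, pos, _id, ref, alt, _qual, _filter, info_field = fields[:8]
        info = parse_info(info_field)
        if "PR" in info and float(info["PR"]) >= min_probability:
            yield chrom, int(pos), ref, alt, info


# Hypothetical records; tag names other than PR and TR are assumptions.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tC\t.\tPASS\tPR=0.97;TR=20;TA=18;NR=35;NA=0;TC=TAG",
    "1\t67890\t.\tG\tT\t.\tPASS\tPR=0.02;TR=40;TA=2;NR=38;NA=0;TC=CGA",
]
hits = list(high_probability_records(example))
for chrom, pos, ref, alt, info in hits:
    print(chrom, pos, ref, alt, info["PR"])
```

For anything beyond quick filtering, a dedicated VCF library or VCFtools is the safer route, since real files can carry multi-allelic records and per-sample columns that this sketch ignores.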
And so you can use tools like SnpEff or ANNOVAR, things like that, that can take in these files and output annotated files. So here's just a list of some tools and their associated websites. And then there are a number of visualization tools. You're already familiar with IGV. And that's an exercise that you're going to do in the lab in Sakhanu. And then here are a series of annotation tools. There's MutationAssessor, ANNOVAR and SnpEff, three nice annotation tools that are out there in the community that people use and that have been used in major publications. And I think I won't say more about that because you're going to learn about that in detail tomorrow, right? Okay, good. So any questions at this point? That was a little dry, I know, but now we're getting to the good stuff. There are a lot of databases and tumor repositories where we don't... is it not worth anyone's time to try and do variant calling on ANAC or RDC samples where there's just no normal control? So I think you can think about it from a signal-to-noise ratio perspective. So the germline polymorphism rate is somewhere in the range of one in 1,000 in any given individual, right? And that results in on the order of two to three million SNPs in any one individual's genome. So you and I would differ at two to three million different positions, okay? Between a tumor and a matched normal, the mutation rate is somewhere on the order of one in a million. And so in any given tumor-normal comparison, you may end up with around three to 10,000 mutations. So the problem with doing variant detection in only the tumor samples is that of course those germline polymorphisms will be there as well. And so you're looking at at least a 10 to 1 signal in the germline that's going to squash the somatic signal. What about filtering for known mutations in cancer genes using the COSMIC database? Right, okay. So in that case, it would be useless to sequence the whole genome.
So what you should do in that case, and this is what's typically done, is you can design a panel. Let's say you're just looking at on-label drug indications. You want to find KRAS codon 12, BRAF V600E, PI3 kinase hotspots, et cetera. So then you design a panel that looks only at those locations, because you know that if those mutations are in the germline, that's embryonic lethal. Those embryos will never develop. And so those are unambiguous changes. But it would be a complete waste of money to sequence the whole genome when you can do a panel for 200 bucks. Mm-hmm. Well, that's a good question. So the question is what QC steps do you take? In variant calling, it's subtly different because the resolution is at the single nucleotide level. In copy number calling, because you're looking at maybe 1 kb windows, there actually are patterns you can extract, correlations with the GC content, for example, that one can normalize for. At the single nucleotide level, the best you can do is, for example, remove optical PCR duplicates. That's really the important step. And so there are tools for preprocessing that are part of the Picard package. Michelle, have you done any preprocessing? Have you guys done any preprocessing work on BAMs? So using R&D or Picard tools or anything like that? We covered that? No, OK. Yeah, that? OK. Yeah, maybe that's something to consider for next time. But essentially, there are well-known steps that you can take. And I think the Broad has a best practices workflow to go through. And essentially what it involves is doing alignment, local realignment for the indels, removing optical PCR duplicates, and then essentially it forks. In an example, maybe because of the high GC content, the read depth in that region would be lower. Yeah. So calling the heterozygous SNP would be difficult because you can't add more content there, right? I mean, so...
No, but you can normalize your weighted calling function so that 3 in a read depth region of 10 would be considered sufficient to call a heterozygous SNP, but in a read depth of 40, 3 won't work. Yeah, and that's where the probabilistic models come into play. So one can leverage the binomial distribution, for example. It takes exactly that into account, right? So the binomial distribution is like coin flipping. And the more times you flip a coin, the more confident you can be in whether the coin is biased or not. So let's say that you're trying to find out whether the coin has a skew towards heads or tails. You need to flip it a large number of times in order to establish that. So if you only have three flips, you probably wouldn't be able to establish that, right? And so then the probability density function of the binomial, when there are few observations, is always low. And it gets higher with more confidence. So that accounting for variance in depth is encoded in the probabilistic models, in the binomial distribution and also in MuTect too, which is the log odds distribution that is published there in the slides. Trimming? Yeah. Yeah, so good question. So certainly for panels or for PCR-based amplicon work, trimming of the data becomes important because often you read into either the primer or you read into the adapters. And then also sometimes the quality of the base calling will tail off towards the end of the reads. And so a lot of people do some trimming of reads that removes those low quality regions. Yeah, that's a good point. And I think there are a number of tools now out there that can work with FASTQ data to do quality control and preprocessing of the data even prior to alignment. Yes? Yeah, recommendations on whether to do that.
I've usually been doing trimming before alignment. I don't have hard data on that to see what results in a better practice, but intuitively it would make sense to trim before. Okay, good. Good questions. So in the last part then, we'll return to this idea of trying to do some advanced work in clonal evolution, working with mutations this time. So it's well established that cost-effective interrogation of the whole genome of a cancer is now there. That's really the biggest marketing or selling point of this technology. But perhaps what's less well appreciated is the digital nature of the technology. And what that allows for is very precise estimates of allelic prevalence, if you remember back to the definition from this morning. And so through either capture or through PCR, one can actually measure very deeply at a particular locus to get precise allelic abundance measurements. So in our pool of DNA, we might have fragments like this, and some small proportion, maybe 10% of the alleles, contain a particular mutation. And so when we sequence very deeply, then that proportion is reflected in the reads that are output from our sequencing reaction. And so what we see here is that some small proportion of the data actually harbors that particular mutation. We can count that very precisely. So again, it's a switch. Sanger sequencing is the analog way of doing things and this is a digital representation of the mixture. And that is very powerful because it allows us to start thinking about how we can deconvolute mixtures of different clonal populations in bulk sequencing, et cetera. And so this has been leveraged in a number of different experimental designs. And so from a single bulk sequence, of course we don't know what the composition of that bulk sample is, but we can try to infer that from the allelic prevalence measurements that we're taking across different mutations.
And this has been shown in a couple of papers, namely an exploration of triple negative breast cancers that was published a couple of years ago. And then we can think about how those allelic measurements change over time or in space to get some idea of whether selection is operating or whether clonal expansions are happening over time. And so I gave that example of how the clonal composition of a tumor can really change over time in the presence of a therapeutic intervention, and also across spatial samples. What's depicted here is an ovarian cancer with multiple intraperitoneal metastases and also lymph node metastases as well. And so we can compare the relative abundance of specific alleles across these different spaces. And then the field is moving rapidly into single nucleus sequencing, and we've been doing a lot of that activity as well, and I'll explain a little bit of our progress on that front. So this is a nice sort of schematic of what we're actually doing here. We're sequencing a mixed population and then we get a digital representation of those alleles. So at a given position we might have, let's say, two reads here that harbor this particular red mutation, and it's proportional to the mixture that's in our bulk population. And so just to reiterate, we have these two definitions, allelic prevalence and cellular prevalence. And so allelic prevalence is the proportion of reads with a variant and cellular prevalence is the proportion of cells with a variant. And it's very important to note that these two things are not equivalent, and the reason is because of the concept of genotype. Similarly to what I showed in the copy number space, the mutations can have a mutation genotype as well. And so that results in a problem where one has to sort of deconvolute the prevalence of the mutation and its genotype at the same time.
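The difference between allelic and cellular prevalence can be seen with a small worked calculation. This is a minimal sketch of a commonly used simplified relationship, not the actual PyClone model: it assumes every tumor cell has the same total copy number at the locus, that mutated cells all carry the same number of mutant copies, and that normal cells are diploid. All parameter names are made up for illustration.

```python
def expected_allelic_prevalence(
    cellular_prevalence,
    tumor_content,
    mutant_copies=1,
    tumor_copy_number=2,
    normal_copy_number=2,
):
    """Expected fraction of reads carrying the variant in a bulk sample.

    cellular_prevalence: fraction of tumor cells that carry the mutation.
    tumor_content: fraction of all cells in the sample that are tumor cells.
    mutant_copies: mutated copies per mutated cell (the mutation genotype).
    """
    variant_alleles = tumor_content * cellular_prevalence * mutant_copies
    total_alleles = (
        tumor_content * tumor_copy_number
        + (1.0 - tumor_content) * normal_copy_number
    )
    return variant_alleles / total_alleles


# Two different scenarios in a pure tumor sample that give the same read fraction:
# a heterozygous mutation in half the cells ...
het_in_half = expected_allelic_prevalence(0.5, 1.0, mutant_copies=1)
# ... versus a homozygous mutation in a quarter of the cells.
hom_in_quarter = expected_allelic_prevalence(0.25, 1.0, mutant_copies=2)
print(het_in_half, hom_in_quarter)
```

Both scenarios give the same expected read fraction even though the cellular prevalences differ by a factor of two, which is why the prevalence and the genotype have to be inferred together.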
And so we've developed some models around that, namely a method called PyClone, which was developed by Andy Roth in the lab, and what PyClone can do is cluster mutations according to their cellular prevalence. And so this is just an example of a systematically designed experiment. We took two cell line populations. We examined a panel of mutations that fell into three different classes. We had mutations that were shared in both cell lines, and then mutations that were specific to one and mutations that were specific to the other. And then what we did is we mixed those cells at known proportions to see if we could recover the cellular prevalence of those mutations over time. Or not over time, but over that series. And essentially, if you order the mixtures by prevalence, it simulates the idea that there's a clonal expansion happening over time. And so as the prevalence of one cell line in the mixture goes up, that simulates a clonal expansion, and that comes at the expense of the other one. And so what's nice about this idea is that we know exactly what the expected distribution should look like, and that's what's shown here. So these are the sets of mutations that are specific to, we'll call it the red cell line, and then we have these blue ones, which are sets of mutations that are specific to the blue cell line, and then the green mutations are mutations that are shared in both. And so of course the mutations that are shared in both are going to be at 100% in all the experiments. And I'm showing time here schematically, but in fact these are actually mixing proportions on the x-axis. And so with our statistical model, PyClone, the dotted line is the ground truth and the overlaid line is the inferred prevalence, and you can see that the model is doing very, very well. And the nice thing about this is that it really simulates this idea of an expanding clone and a clone going extinct.
So this would be simulating a clone that's expanding, this would be simulating a clone that's being extinguished. And the important thing to note here is that when we have lines that cross like this, when we have sets of mutations that are decreasing in prevalence and sets of mutations that are increasing in prevalence, those by definition can't be in the same cells. So those really mark clones that are distinct from each other. Those are mutually exclusive mutations. And so I'll show you how that comes into play in a real study in a minute. So this isn't really time. This is actually just mixing proportions. But it's a simulation of time. Correct, correct. Yeah, that's right, that's right. So we engaged, again this is work with Sam Aparicio, in trying to understand population dynamics of cells using temporal sampling. And this is now looking at time series and trying to examine these populations as they're changing over time. So Sam had been generating breast cancer patient-derived xenografts, and what these are is tumor material extracted from patients and then implanted into immunodeficient mice and allowed to grow. And so again, thinking from an evolutionary perspective, in the patient you have the microenvironment, you have the immune system that has some pressure on the cancer, and then we're taking that tumor out of that selective pressure and putting it into a mouse that doesn't have an immune system and is also a new host. So you can imagine that the selective pressures are quite different. So even though these models are actually used for a lot of drug efficacy type programs, you can imagine that there might be quite a dramatic change in selective pressure when you take that patient tumor and put it into a mouse. So we wanted to investigate the degree to which this happens.
And so what we did is we took a series of 15 PDXs and we compared the clonal composition of the tumor to serial passages of the xenografts and tracked the cellular prevalence of these mutations over time to see if there were changes. And in almost every case we saw some degree of dynamics in clones. And so one extreme example is this one here, where the first time point is the tumor and then we have the xenografts shown in red, and you can see that there's a dramatic expansion of a very minor clone that's present in the tumor that expands to be almost dominant in the xenograft, and that comes at the expense of the dominant clone that was in the tumor, which essentially shrinks to nothing. And so this is a really dramatic case where on engraftment there is a massive amount of evolutionary dynamics happening. And then we could ask, does that stabilize over time? In this case it looks like it's fairly stable, although there's an emergence of a clone here in passage X4. But the most extreme example of ongoing dynamics after engraftment was this case here, and you can see that there's really a dramatic expansion, I think I've got a zoomed version of this, of this green clone over time that was essentially very, very rare in the tumor but then came to dominate the xenograft after five serial passages, and again this comes at the expense of other clones that actually are extinguished. So you can see that this looks somewhat reminiscent of our systematic experiment where we did the mixtures and oh, can't see. Okay, there we go. Something's funky with my animations. But anyways, so this expansion of this green clone comes at the expense of this red clone here, yes? Are there other genetic changes there? Yeah, that's right. So in fact these cases that are underlined are the cases that we actually did whole genome sequencing on. So we didn't just look at mutations that were present in the tumor, we looked at mutations that were specific to the xenograft as well.
And there are indeed new mutations arising, or becoming detectable, as we go. The mutations in the green cluster are the perfect example of that. We just sampled these, but there are 15 mutations that are characteristic of mutations that are essentially new in the X5 passage. So it's difficult to know how to interpret their selective capacity.
I should mention that this study was all done without any kind of drug intervention. The next phase of the project is to do systematic drug intervention on these xenografts to determine whether there are common features that are selected out after putting these populations through a bottleneck like that. But we just wanted to establish a baseline. And the dynamics were enough to think about, first of all, developing methodology to track clones. That in and of itself is a major contribution, because now we can actually systematically look at population dynamics in time series in patients by using this type of method. So patients that have local relapses or serial relapses, for example; there aren't that many that we can get, but we do have some. And so we can start asking these questions: on therapy, what clones emerge relative to the primary tumor?
Okay, so... Okay, I'm getting the evil eye. I'm right at time, so I'm going to spend two more minutes here. The remarkable thing about this study is that these dynamics were actually reproducible. We created biological replicates of these xenografts and the same clones emerged repeatedly. So there is something deterministic about the genotypes of these clones, which suggests that they really do have some fitness advantage over their neighbors.
So the last thing I wanted to discuss is... Okay, so this is a xenograft system. How relevant is it to patients? How relevant is this concept to patients? Well, there really is an exciting field emerging in cell-free circulating tumor DNA.
You may have heard about this, but essentially these techniques of measuring allelic prevalence can be leveraged by measuring these mutations in blood, or in plasma. Tumors do apoptose and they shed their DNA, and often that DNA can be picked up in the circulation. So the idea of tracking a mutation in blood is quite attractive because it's a non-invasive liquid biopsy for patients. There's a huge field now that's growing around this idea of targeting particular mutations and looking at their abundance in plasma as a measure of tumor burden. Here you can see, for example, monitoring of a KRAS mutation. Over time a patient's plasma was examined and the allele fraction of that particular mutation was measured, and one can see here that it starts to go up, and this is really an indicator of increased tumor burden. This patient is relapsing, and it can be predicted from the plasma.
And in a different paper this group showed that in fact it can be used to really dissociate different clones. Here this is a breast cancer patient, and what's very interesting is that one can measure both a P53 mutation and a PI3 kinase mutation. On paclitaxel you can see that the abundance of the clone that harbors the PI3 kinase mutation drops almost to zero. Unfortunately it comes back again, but for that period of time the PI3 kinase clone essentially is extinguished. And again, this is not using tumor tissue; this is using plasma from blood samples.
So that's a good question; I don't have a good answer for you. I think there are very rigorous SOPs that are established, and one of the things that's really important is to spin down the plasma within something like an hour, because the half-life of that DNA is pretty short. So there are some pretty stringent SOPs that need to be followed in order to extract the signal.
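To make the plasma measurement described above concrete, here is a minimal Python sketch of computing a mutation's allele fraction from read counts at successive blood draws and flagging a rise from its lowest point. This is an illustration only, not the method of the papers shown in the lecture. The read counts, the sampling days, and the `min_increase` threshold are hypothetical.

```python
def allele_fraction(alt_reads, total_reads):
    """Fraction of reads at the locus that carry the mutant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def is_rising(fractions, min_increase=0.01):
    """True if the latest allele fraction exceeds the lowest earlier
    value by more than min_increase (an arbitrary example threshold)."""
    if len(fractions) < 2:
        return False
    nadir = min(fractions[:-1])
    return fractions[-1] - nadir > min_increase


# Hypothetical plasma samples: (day, mutant reads, total reads at locus).
samples = [(0, 120, 2000), (30, 20, 2000), (60, 4, 2000), (90, 60, 2000)]

fractions = [allele_fraction(alt, total) for _, alt, total in samples]
for (day, _, _), frac in zip(samples, fractions):
    print(f"day {day}: allele fraction {frac:.3f}")

print("rising from nadir:", is_rising(fractions))
```

In this made-up series the fraction falls after day 0 and climbs again by day 90, which is the kind of pattern the KRAS example on the slide is pointing at.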
It doesn't uniformly work; in some cases it just doesn't work at all, so it's highly variable in efficacy.
You actually did a lot on most of the DNA.
It's not perfectly isolated, that's for sure. There are some platforms that enrich for the mutant alleles, so then you don't get accurate relative counts but you get absolute counts, because the targets are enriched and then the rest is thrown away. Boreal Genomics, for example, is a platform that uses that technology.
Okay, so I think with that I'll close, but I just want to read this concluding thought, because I think it's quite relevant to all the things that we've been talking about, and this is from that seminal Science paper from Peter Nowell in '76. He says the acquired genetic instability and associated selection process, most readily recognized cytogenetically, results in advanced human malignancies being highly individual karyotypically and biologically, and hence each patient's cancer may require individual specific therapy, and even this may be thwarted by emergence of genetically variant sublines resistant to treatment. And then, more research should be directed towards understanding and controlling the evolutionary process in tumors before it reaches the late stage usually seen in clinical cancer.
So this is really quite a prescient statement, because back then we didn't have the measurement technologies that we have now. What is very clear is that measuring these mutations and casting them in an evolutionary context is incredibly illuminating as to what the dynamics of cancer actually are. It's clear that concepts like population genetics and trying to keep evolution in check are going to be a dominant area of research going forward. This is why, as I said before, methods like immunotherapy are really quite attractive, because they essentially leverage the body's own immune system to battle the arms race between acquisition of new mutations and keeping immune surveillance in check. Of course that also has side effects; you don't want to overstimulate the immune system, because we all know the problems with autoimmunity, etc. But there are some very promising approaches; combination drug therapy, for example, is another approach that one could use to try to keep the evolution in check. I think this is a very exciting area of research and really is probably key to eliminating progression in cancer.
So with that I should acknowledge a large number of people, a couple of whom are actually in the room here. Phong and Andrew and other grad students in my lab are constantly educating me, and it's a great privilege to work with people like that. In particular I acknowledge Sam Aparicio and David Huntsman, who are my close colleagues; a lot of the ideas that I presented today are a result of lots of conversations with the two of them. My work is funded by a large number of organizations as well. So I think I'll stop there. Thank you for your attention, and enjoy the rest of the workshop.