So, what I'm going to talk about now is a different but related topic: finding somatic mutations in cancer genome sequencing data. And this is just an example of what that looks like. This is actually real data. These are four cases with the same somatic mutation in a gene called FOXL2 in a rare form of ovarian cancer. And this is a situation where we have a single nucleotide change that induces an amino acid change and essentially, we think, changes the function of the protein. So, before we get there, I think it's important to revisit the idea of cancer as an evolutionary process. This is a slide from Mike Stratton that I think is really quite elegant, and it shows the main concept, that cancer arises from the acquisition of abnormal genomes. So, we start with normal cells, and over time they accumulate mutations. These mutations are just represented with little glyphs inside the cell. We all walk around with somatic mutations that we weren't born with, and some of them confer a selective growth advantage to the cell. That's what this star here represents: the acquisition of a driver mutation. And as I mentioned, really loosely defined, a driver mutation is something that changes the phenotype of the cell and that can get selected for in the evolutionary history of the tumor. And so, we get the formation of a tumor and increased, rapid proliferation, which can then create even more mutations, and mutations accrue over the lifetime of a cancer. Okay. So just bear in mind that this is a time-dependent process, that tumor cells change over time, and that they are governed by the principles of Darwinian evolution and selection. So, that's concept number one that we need to bear in mind.
So, just so we're on the same page, this is really simplistic and cartoony, but I want to illustrate what we're talking about when we talk about a point mutation. This is an actual mutation in TP53, a tumor suppressor gene. Here's just a fragment of sequence from the normal cell, and at this residue right here, we have a substitution of a G to a T in the tumor cell. What this does is induce a stop codon in the tumor cell, and the protein is lost. This is a tumor suppressor gene, and genetic defects in this gene allow cells to evade programmed cell death and DNA repair. So, I just want to return to this slide and go over why mutations are important to discover, not only in the context of biology but in the clinical context. There are a few point mutations that are targetable by specific drugs, and some of the genes that harbor those are EGFR, PI3 kinase, BRAF, and KRAS, and I'll get into some of the details around these.
So, how many people have seen this iconic diagram before? Maybe it's already been shown in this workshop. This is from a classic paper from just over 10 years ago by Doug Hanahan and Bob Weinberg. It's really a review paper, but they described cancer as having a few hallmarks, and these hallmarks are essentially biological characteristics or phenotypes that allow a tumor cell to exist as a tumor cell. They evade programmed cell death, that is, apoptosis. They have self-sufficiency in growth signals. They have insensitivity to anti-growth signals, so they become independent from signals that are directing them to stop growing. They have the ability to invade different tissues and spread. And they have limitless replicative potential and sustained angiogenesis, which is essentially generating a blood supply to those tumor cells.
So, this is all very nicely and eloquently described, but what was not known at the time, and what we are still trying to discover, is essentially what genetic abnormalities underpin the ability of tumor cells to achieve these oncogenic properties, and then also how this changes over time and what genetic abnormalities drive that process. And finally, the specific genes that are involved in generating this phenotype are relatively unknown. So, there's huge potential for uncovering how this happens from the perspective of the genome because, as I showed before, cancer is a disease of acquisition of abnormal genomes. And so, to understand its properties at the most fundamental level, we need to look at the genome itself.
So, returning to the discussion of drivers versus passengers, and again, these are loose definitions: driver mutations can essentially be considered as altering the phenotype at the level of the cell in a way that is selected for in the evolutionary history of the tumor. And this may be visible in a population of tumors as a result of convergent evolution. Convergent evolution just means that in different individuals, the same evolutionary path is taken. There are examples of this in nature, in species, and the same thing happens in given tumor types. Something about the properties of certain cells and microenvironments, the anatomical position in the body, and the type of cells that acquire mutations results in recurrent mutations in a population. So, the same gene will be affected in multiple different individuals with the same tumor type. By contrast, passenger mutations are benign and essentially do not alter the phenotype of the cell. These are really stochastically, randomly induced, and they're likely infrequent in a population.
And so, recurrence of a particular mutation in a population of tumors is one metric: if you see a gene mutated in 5, 10, 20 percent of tumors, that gives you some indication that it might be an important gene to look at. So, drivers can take a couple of different forms. You can have driver mutations that are a gain of function, a loss of function, or a switch of function. And it's hard to read here, but these can accrue at different points in the evolutionary history of the tumor. I think we've discussed this somewhere along the way. So, we have oncogenic initiating mutations, such as mutations in KRAS, BRAF, EGFR, PI3 kinase, et cetera, and mutations in tumor suppressors such as PTEN and BRCA1. And then you can have drivers that confer metastatic potential. Here are a couple of examples of genes that harbor mutations that confer metastatic initiation, metastatic progression, and then even virulence there. And the other concept that is important is that we go from drivers that initiate neoplastic transformation, to drivers that confer metastatic potential, and then to driver mutations that can confer chemotherapeutic resistance. So, I mentioned in the morning lecture that there's a targeted inhibitor against the BCR-ABL translocation in CML. That's good, and the targeted agent is often quite effective, but eventually tumor cells can acquire resistance to that targeted agent, and they do that through mutations in the ABL gene. And so here are just a couple of examples of mutations that are really not prevalent in the primary tumor, but are acquired and selected for in the presence of a chemotherapeutic agent. They allow the cell to become resistant to the kinase inhibitors that knock down the original primary tumor. So I just wanted to show an example. This is not in your notes, so this is just two extra slides here.
So mutations, depending on whether they're oncogenic or loss-of-function, tumor suppressor type mutations, have certain patterns. Here's the PI3 kinase gene. What's shown here is the prevalence of PI3 kinase mutations as a function of amino acid position. So the x-axis is the amino acid position, here are some domains, and the y-axis is just the frequency at which a mutation is observed in a given population. And you can see that there are really two positions that are disproportionately represented. That's because a mutation here changes the function of the protein in such a way that it drives the oncogenic potential of the downstream signaling. Both of these hotspots do that. And so this is what's called an activating or oncogenic mutation. Other examples are KRAS codon 12. Usually codon 12 or codon 13 in KRAS, in pancreatic and in other tumors, is the location that's mutated. And if you see a mutation there, chances are that's going to be the driver oncogenic mutation. And the same with BRAF V600E. About 70% of melanomas harbor this very specific mutation. At amino acid 600 of the BRAF gene, you have this substitution, V to E. And this is a targetable mutation as well. Okay, so that's the pattern of oncogenic mutation. When looking at a panel of tumors, if you see the same codon hit multiple times, that could be an indication that it's an important hotspot mutation to look at.
The tumor suppressor pattern is very different. What's shown here is from a paper that my group was involved in, published in the New England Journal of Medicine. David Huntsman will probably talk about this tomorrow as well. These are mutations in a gene called ARID1A. It's involved in the SWI/SNF chromatin remodeling complex.
And we think this is a tumor suppressor because we have a pattern of mutation that spans the whole gene. Most of these are inactivating: stop codon mutations or frameshifting indels that will result in a truncated protein. And so in this case, what's getting selected for is loss of the protein. It doesn't matter how you get there. Whereas in the oncogene case, what's being selected for is a very specific phenotype that's driven by mutations in particular amino acids or particular parts of the protein. So these have very different patterns, and one can leverage this. If you see a gene that's recurrently mutated and the majority of mutations are stop codon mutations or frameshifting insertions and deletions, then one can start to infer that it's a loss-of-function or tumor suppressor gene. The same pattern is exhibited in genes like TP53 and BRCA1 and BRCA2.
Okay. So let's dig into this idea of cancer as an evolutionary process. This has been thought about for quite some time. 35 years ago, this paper by Peter Nowell was published. It's the same guy that found the Philadelphia chromosome in CML. And what he showed here is really that. I mean, what does this look like to you? Anyone who's done ecology or evolution type studies? So what is it? Yeah, it's a phylogenetic tree, right? It's an evolutionary tree. And that's exactly what it is. And so he... With the bottom end of it. Sorry? With the bottom end of it. With the bottom end of it. That's right. Yeah, sure. And so what he was describing is essentially the process of tumorigenesis and progression as an evolutionary process, where we start with a normal cell and we acquire the tumor-initiating event here. And then through stochastic acquisition of different mutations, those mutations get propagated forward. And some clones just die and don't make it out. That's what these ones are.
So these would be cells that acquire a new mutation but don't get selected for. But then some go through this process of being selected for, and then further branching and further differentiation. One of the key concepts here is that early events are propagated throughout the tree. All the cells at this stage will harbor these initial early mutations because they get carried forward, generally speaking. But the clones at this stage will have their own unique mutations that differentiate them. And so these would be rather rare in the cellular population of a tumor, but the mutations that happen early will be quite frequent. So that's a concept we need to think about. And that's really illustrated in this schematic. So let's start to think now of a tumor as a collection of distinct populations of cells. This is a cartoon where the tumor has 12 cells. Half of the cells have this particular genotype, which harbors three mutations: A, B, and C. Then we have this little population that has just mutations A and B. And then we have this population that has mutations A and D. When we look at the abundance in terms of how many cells harbor each mutation, you can see that A is present in all the cells, so it has what we call a clonal frequency of 1.0. B is in two thirds. C is in half of them. And D is only in a third. Okay, does that make sense? Okay. So the fundamental questions that arise from this type of situation are: do clonal genotypes actually drive different phenotypic behaviors? We don't really know the answer to that on a large scale. And how does this relate to treatment response, progression, and metastasis? So if I have a clonally diverse tumor at the outset, in the primary stage, is that going to have some predictive effect on whether a patient is going to respond well or poorly to a given therapy?
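To make the 12-cell cartoon concrete, here is a minimal sketch of the clonal frequency calculation. The population sizes (6, 2, and 4 cells) are an assumption inferred from the fractions quoted above, and the function name `clonal_frequencies` is just illustrative.

```python
from collections import Counter
from fractions import Fraction

# The 12-cell cartoon tumor: each cell is the set of mutations it carries.
# Population sizes are inferred from the quoted fractions (assumption).
cells = (
    [frozenset("ABC")] * 6    # half of the cells carry A, B and C
    + [frozenset("AB")] * 2   # small population with only A and B
    + [frozenset("AD")] * 4   # population with A and D
)


def clonal_frequencies(cells):
    """Return the fraction of cells that carry each mutation."""
    counts = Counter(mutation for cell in cells for mutation in cell)
    return {m: Fraction(n, len(cells)) for m, n in sorted(counts.items())}


freqs = clonal_frequencies(cells)
for mutation, freq in freqs.items():
    print(mutation, freq, round(float(freq), 2))
```

The early mutation A sits at the root of the tree, so it shows up in every cell, while D, which arose later on one branch, is only in a third of them.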
These are questions whose answers are as yet unknown. And the question is, how can we measure this? You're right. It's C. C, because it's in six cells here. Okay, so the advent of next-gen sequencing has made these types of questions accessible to us now. And not necessarily related to clonality, although in some ways it is, but one level above that, just in trying to find recurrently mutated genes, there have been lots of real successes. This is a very partial list. Every week, if you open up Nature, you'll probably see a new paper on a different tumor type describing a novel cancer gene that was not previously known to be implicated in that disease. These are just a few examples. Insights into tumor evolution: we and others have shown that sequencing technology, and I'll explain how we use this to model tumor evolution, is yielding great new insights. We're getting insights into genomic architectures of cancer. You looked at rearrangement data yesterday. We're finding new processes like chromothripsis that not only drive the biology, but have some sort of clinical prognostic significance in terms of what the architectures look like. And all this is driving towards refined mutational landscapes. The mutational landscape of a tumor is basically the comprehensive set of mutations that govern the biology of that particular tumor. And we're really coming close to achieving this goal. It's really quite remarkable. A lot of this is being driven by two massive initiatives: the TCGA, The Cancer Genome Atlas project, from which I mentioned a couple of papers, and the ICGC, which OICR plays a major role in. So, you've probably already gone over this, so I'm not going to spend too much time on how the assay actually works. But essentially, where we are right now is that we have the 500 gigabase run, so literally 150 times haploid coverage of the human genome is achievable now in 10 days. This is absolutely astounding.
A genome at 30x coverage costs about $5,000 right now. So think about the effort that went into the Human Genome Project, which was still being refined but largely complete in 2003. That was more than a decade of work. Several thousand people were involved, and more than a billion dollars was poured into it. And what these machines can do is several times the capacity of that project in terms of data generation, not interpretation and analysis, but data generation, in a matter of days, for a small fraction of the cost. Okay, so you've seen all that, heard all that probably from John the person. So here's a concept that I want to make sure we all understand. One of the biggest innovations of next-gen sequencing is the concept of digital allele counting. The Human Genome Project was done with capillary-based sequencing, which essentially averages the alleles in a given mix, and you get an aggregate, soft representation of whether an allele is present. In this type of sequencing, we're approaching single-molecule sequencing. What that means is, say we have a mixture, so this is my DNA pool that I'm sequencing. Within this pool, 30% of my DNA fragments harbor a particular mutation. And that's roughly proportional, let's say, to the percentage of cells that harbor that mutation. When I do my sequencing and I get my reads out, those mutations will be present in roughly the same proportions as they are in the pool. And moreover, if we sequence this very, very deeply, we have the sensitivity to detect alleles that are relatively infrequent, even down to 1% of the allelic fraction, or even below 1%. And that has tremendous implications for how we can interpret mutations. And I'll elaborate on that.
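The link between depth and sensitivity can be sketched with a simple binomial model. This is an illustration under stated assumptions, not the method used in the lab: each read is assumed to carry the variant independently with probability equal to the allelic fraction, sequencing errors are ignored, and `MIN_VARIANT_READS` is a hypothetical evidence threshold.

```python
from math import comb


def prob_at_least(k, depth, allele_fraction):
    """P(at least k variant reads) when each of `depth` reads independently
    carries the variant with probability `allele_fraction`."""
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


MIN_VARIANT_READS = 3  # hypothetical threshold, not a value from the talk

for depth in (30, 100, 1000):
    for fraction in (0.30, 0.01):
        expected = depth * fraction  # expected number of variant reads
        p = prob_at_least(MIN_VARIANT_READS, depth, fraction)
        print(
            f"depth={depth:5d} fraction={fraction:.2f} "
            f"expected variant reads={expected:6.1f} "
            f"P(>= {MIN_VARIANT_READS} variant reads)={p:.3f}"
        )
```

Under this model the 30% allele is easy to see at any of these depths, while the 1% allele only becomes reliably visible once the depth is in the hundreds to thousands.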
And then I just wanted to talk about a general workflow for how we can interpret mutations. We start with unaligned reads. We do some alignments, which you've covered. And then when we get the aligned reads, we can start to do some inference and make some predictions about what's going on in the biology of the tumor. So we can predict some single nucleotide variants or somatic mutations. And then usually what we try to do is validate these. It's still an imperfect science to get true somatic mutations from this data. We're getting much, much better at it, but validation is still pretty important. There are three possible outcomes for each event that we're trying to validate. It can be a confirmed somatic mutation. It can end up being a germline polymorphism that we missed because we didn't see it in the normal, let's say. Or it could just be a false positive due to machine noise or alignment artifacts. And this is a major problem, and we're going to revisit the sources of these false positives throughout the day. And finally, what we want to do in the end is establish some sort of clinical relevance or biological significance for the things that we confirm. What is the functional significance of a confirmed somatic mutation? So that's generally the workflow that we like to think about when sequencing a tumor or a set of tumors.
Okay. So now we get into the fun stuff: statistical considerations for modeling these allelic distributions. I'm going to sound like a broken record, but I think this is important to say again. Cancer genomes have specific properties that warrant specialized analytical strategies. We have the tumor-normal admixture problem. We have the intratumoral heterogeneity problem, and I'll show how this is a problem with respect to modeling allelic distributions. Genomic instability plays a major role here.
So you saw how copy number changes can skew the expected allelic distributions, and we need to be able to account for that. And then finally, the experimental design to capture somatic mutations necessitates sequencing both normal germline DNA and tumor DNA. So, generally speaking, we have a pair of samples for each tumor that we want to look at. And that creates new opportunities and challenges in terms of how we deal with the data. The first step is to align all these sequence reads to a reference sequence. When we do that, we get something that looks like this. It's like assembling a giant jigsaw puzzle, and there are lots of tools to do that. But the key thing here is that I just want to show what the data actually looks like. I think Gavin showed something very similar, where the red bases here are putative variants and the black bases are positions that match the reference genome. And so we can make use of this and interpret these allelic counts using statistical models.
So for somatic mutations, I just want to go over a couple of methods that we've developed in my lab. These were really driven by some of the problems that we encountered when we first started looking at the data a few years ago. Let's just say we're dealing with exome data, but you could easily put genome data here. The key thing is that we look at the tumor and the normal data simultaneously, so they both go as inputs into these models. The JointSNVMix model is a statistical model that jointly models the tumor and normal allelic counts, and I'll show you graphically what that means. The key concept here is that we borrow statistical strength to better detect germline polymorphisms. So you can imagine that in sequencing a tumor and a normal genome, as I mentioned, there are copy number germline polymorphisms that show up in the tumor.
And a very analogous problem exists in the single nucleotide world as well. Because most tumor projects are geared towards finding somatic mutations, that signal tends to get lost in a sea of germline polymorphisms. And so it becomes very advantageous to try to capture as many of those germline polymorphisms as possible. So we've developed some models that borrow statistical strength across these two samples to better detect germline polymorphisms. So that's one source of noise, and this is what I call biological noise: germline polymorphisms. The other type of noise in the data is machine noise. These are false positive variants that are induced by artifacts in the machine. And so we developed a machine learning-based classifier, trained on a large dataset, to better detect these machine artifacts. The key thing is that both methods improve sensitivity and specificity of mutation detection compared to independent or standard methods. And this is an example where the specific case of cancer biology, which has driven the experimental design of a tumor-normal pair, has necessitated, or you could say given the opportunity for, very specific methods for cancer. These are two examples that we've developed. Can I ask a quick question? Yeah. So when it comes to the germline mutations, I'm here basically to answer why mutations... Absolutely, absolutely. So familial cancer, inherited disease, warrants different strategies for sure. What I'm talking about today is sporadic cancers that don't have an inherited component. But it's a very different experimental design, a different question altogether. Yeah, that's right. So what you'd want to do with inherited diseases is sequence related family members, affected people and unaffected people, and start to narrow down by doing linkage analysis and things like that.
So that's a very different process than this type of analysis. So here what I'm showing is this idea of using joint genotypes of samples based on paired sequence data. The major use case is tumor-normal pairs, and that's what's shown here. What are the possibilities? What's shown on the right axis here is the percentage of reads that would match the reference, or the wild type. And what's shown on the left is the percentage of reads that are wild type for the normal. In this little landscape, this corner here would represent wild type. If that's the case, you'd have no evidence of a variant at all. So that's just what we call wild type. If most of the reads are wild type in the normal but variant in the tumor, then this part of the curve lights up, and that's the somatic mutation part of the curve. That's what we want to try to detect. You'll note that a large proportion of the landscape here is dominated by germline. This is just where you have any kind of signal that is even weakly represented in both the normal and the tumor. Then we want to call that a germline polymorphism. So we want to create a model that can actually look at this. And then the last case is where we have variant in the normal but, as Gavin showed, the tumor is at the extreme, so either all wild type in the tumor or all variant in the tumor. This is the LOH, the loss of heterozygosity, case. So this is just at a given position. If the reads that stacked up in the tumor were all wild type and the reads that stacked up in the normal were all wild type, then you'd want to classify that position as a wild type position. Wild type would be... No variant there. Can you compare those to the reference? Both are compared to the reference. So let's get into that. That's right. That's right. Yeah. So that's shown here actually. So here you have the reference.
And you can take the normal reads and align them to the reference. Essentially you convert this alignment matrix into what we call allelic count vectors. This one captures the number of reads that match the reference at each position, and this one is the total number of reads at that position. So here you have a position where all seven reads are variant. And here you have a position where seven reads pile up, and four reads are variant and three are reference. And then you do the same thing for the tumor. So you can see this position here, shown in red. In the normal, all the reads match the reference, and so they are non-variant, or wild type. But in the tumor we have good evidence: there are three reads here that are variant. And this is a very strong indication that it's a somatic mutation. We have a very clean signal for wild type in the normal, but a strong signal for variant in the tumor. And that's a good somatic mutation. These other two are germline polymorphisms. Of course, since the tumor cell evolves from the normal cell, it carries forward the germline polymorphisms. And so these positions are manifest as variants in both the tumor and the normal. That's why we want to capture these as germline alleles, and capture this one as a somatic change. So if we look at a table of the probabilities of these joint genotypes, with the normal along here and the tumor along here, we can see that for this position we can assign a probability that the genotype is wild type in the normal and variant in the tumor. And that's just a joint probability. So just to illustrate this again, the problem here is that the genotypes are highly correlated. You have germline polymorphisms that are present in the tumor sample.
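The two steps just described, turning aligned reads into allelic count vectors and then filling in a table of joint genotype probabilities, can be sketched as follows. This is a deliberately simplified illustration and not the JointSNVMix model: the reads are gap-free toy strings, the expected reference fractions in `REF_FRACTION` are hypothetical fixed values, and the default prior is flat. With a flat prior the two samples are effectively called independently; borrowing strength across the pair would come from a prior or learned parameters that encode how correlated the tumor and normal genotypes are.

```python
from math import comb


def allelic_counts(reference, reads):
    """Return (reference_matches, depth) per position.
    reads: list of (start, sequence) pairs, already aligned without gaps."""
    matches = [0] * len(reference)
    depth = [0] * len(reference)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            depth[pos] += 1
            if base == reference[pos]:
                matches[pos] += 1
    return matches, depth


# Hypothetical expected fraction of reference reads for each genotype:
# AA = wild type, AB = heterozygous variant, BB = homozygous variant.
REF_FRACTION = {"AA": 0.99, "AB": 0.5, "BB": 0.01}


def likelihood(ref_count, depth, genotype):
    """Binomial probability of the observed reference count."""
    p = REF_FRACTION[genotype]
    return comb(depth, ref_count) * p**ref_count * (1 - p) ** (depth - ref_count)


def joint_genotype_posterior(normal, tumor, prior=None):
    """normal, tumor: (ref_count, depth). Returns {(g_normal, g_tumor): prob}."""
    genotypes = list(REF_FRACTION)
    if prior is None:  # flat prior over the nine joint genotypes (assumption)
        prior = {(gn, gt): 1 / 9 for gn in genotypes for gt in genotypes}
    weights = {
        (gn, gt): prior[(gn, gt)] * likelihood(*normal, gn) * likelihood(*tumor, gt)
        for gn in genotypes
        for gt in genotypes
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Toy data: seven clean reads in the normal, three variant reads in the tumor.
reference = "ACGTAC"
normal_reads = [(0, "ACGTAC")] * 7
tumor_reads = [(0, "ACGTAC")] * 4 + [(0, "ACTTAC")] * 3

n_match, n_depth = allelic_counts(reference, normal_reads)
t_match, t_depth = allelic_counts(reference, tumor_reads)

pos = 2  # the position where the tumor reads disagree with the reference
posterior = joint_genotype_posterior(
    (n_match[pos], n_depth[pos]), (t_match[pos], t_depth[pos])
)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy case the most probable cell of the table is wild type in the normal and heterozygous variant in the tumor, which is the somatic corner of the landscape.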
And our solution is to borrow statistical strength to capture these signals and better focus on what is different between the tumor and the normal. So I think, to save time, I'm not going to go into detail on this model. You can ask me about it later, but the key concept is that we can borrow strength to infer the joint genotype from both data sets. And we have some metrics that we published to show that this does better than the standard methodology. So that's good. That's great. So we do a pretty good job at reducing this possibility, the confirmed germline, but we're still left with a number of false positives. And so what's contributing to that? Why do we get these false positive predictions? There are a couple of reasons. This is data shown in IGV, where on the top we have the tumor reads. Each one of these bars represents a read that is aligned to the genomic position that's indicated by the red bar here. So this is where we're looking. What's colored here is a nucleotide that is a reference mismatch. So this is a mismatch, and that's what's shown here. And if you look at this, you would say, wow, okay, I've got some signal in the tumor here. It looks like there's some variant there. And I look in the normal and it looks pretty clean. There's not much action going on there. So this is probably a somatic mutation. But this turns out to be just a misalignment. These reads are misaligned. They shouldn't be placed here. And when they are aligned here, it creates the illusion of a variant. So this is a misalignment artifact. Why didn't the misalignment occur in the normal? Yeah, that's a good question. You can see that some of these reads have additional errors that may have caused the misalignment to this position.
And that might be inter-experimental variability: in those particular reads, the experiment for the tumor created some machine-induced abnormalities that didn't happen when the normal was sequenced, for example. So this is something to watch out for. So let's look at what insertions and deletions do. Oh, sorry. Just go back. Oh, yeah. So, well, it's true. You would say that this is a misalignment in the tumor, right? Visually, how do you know that's the case? So it doesn't match the one below? Yeah, so what we did is we took some of these reads and used a more sensitive alignment tool. And it turns out that they could be placed elsewhere in the genome with just one mismatch, for example. Yeah. So that's how we know. And so here's an example with insertions and deletions. When you have a deletion, it's shown with this kind of black bar. So these are reads that have this deletion. In this case, we just happened to come across a few reads here, and you can see that there's all kinds of noise in the data. But there are these reads here that have this gap and really should have quite a longer gap. What that does is create this false call of a variant in the tumor again. And the normal just wasn't susceptible to that. Again, this is probably just due to stochastic fragment selection in the library construction that led us to this position. So if you weren't browsing this data, if you were just looking at it in a computational way, you'd count up the alleles here and say, look, I've got three out of, I don't know, 20 or so reads that look like there's a variant in the tumor. And then I have 50 reads in the normal that don't show any sign. That's probably a good indication of a somatic mutation.
But if you look in the surrounding neighborhood, there are all kinds of funky things going on due to insertions and, in this case, deletions. And that's causing, again, misalignments. These reads probably need a bigger gap, and that would cause a gap to be opened and push the alignment of these nucleotides somewhere else. Okay. Yeah. As a general rule, can we say that any SNVs close by, even those that are highlighted? Generally speaking, that tends to be the case. Yeah. But can that be more crazy? Okay. Yeah. These are actually variants that didn't validate. So here's one where, you can barely see it, but there's a whisper of a variant. One thing I haven't really gone over is that the base call is actually a probabilistic entity. The machine produces not just the call as to what the nucleotide is, but a vector of four values: what's the probability that it's an A, a C, a G, or a T? And what IGV does is shade the mismatch in terms of intensity according to the quality of that base call. So, for example, here we just have faint whispers of a variant, but these may have been just above the threshold cutoff for our quality. When we're dealing with discrete counts, what we usually do is establish a threshold and apply that. We've actually since gone and modeled this probabilistic base calling, which does a much better job. But still, what most people do is have some sort of cutoff that says, okay, what's my cutoff of base quality to actually call a nucleotide a nucleotide? And when we do that, these bases get admitted into the analysis space. And again, this is only present in the tumor, and so it creates the illusion of a somatic mutation.
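The thresholding step described above can be sketched like this. The Phred relationship between quality score and error probability is the standard one; the base calls, the quality values, and the cutoff of 20 are all made up for illustration, and the soft-weighted count is only one simple alternative to a hard cutoff, not the probabilistic model developed in the lab.

```python
def phred_to_error_prob(q):
    """Standard Phred scale: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def count_variant_reads(bases, qualities, variant_base, min_quality):
    """Hard threshold: a variant base counts only if its quality >= min_quality."""
    return sum(
        1
        for base, q in zip(bases, qualities)
        if base == variant_base and q >= min_quality
    )


def weighted_variant_reads(bases, qualities, variant_base):
    """Soft alternative: weight each variant base by P(base call is correct)."""
    return sum(
        1 - phred_to_error_prob(q)
        for base, q in zip(bases, qualities)
        if base == variant_base
    )


# Hypothetical pileup at one tumor position; the reference base is G.
bases = list("GGGTGTGGGT")
qualities = [38, 40, 37, 21, 39, 20, 36, 40, 35, 8]

MIN_BASE_QUALITY = 20  # hypothetical cutoff

hard_count = count_variant_reads(bases, qualities, "T", MIN_BASE_QUALITY)
soft_count = weighted_variant_reads(bases, qualities, "T")
print(hard_count, round(soft_count, 2))
```

Two of the three T calls sit right at the cutoff and are admitted as full variant reads, which is exactly how borderline bases end up looking like evidence for a somatic mutation.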
What's very likely is that these are just machine errors, miscalls of the bases. And so you can imagine, if you're producing 500 billion base calls per run on an Illumina machine, there are going to be some errors. And this is an example of that. Okay. Another thing to watch out for is when all the variant reads are from the same strand. What IGV does is it puts a little notch on each read so you can tell the direction of sequencing. And if all the reads that harbor the variant are in the same direction, the chances are very high that that's just an artifact of the PCR. So the PCR gets stuck in a kind of stutter step and is disproportionately amplifying that particular fragment. And so that's just a machine-induced artifact. And these are all examples that don't occur in the normal. And that's just really by random chance. I mean, you asked that question. When you're dealing with the space of a genome, these are just a couple of examples, and even 100 events out of 3 billion end up coming forward to the prediction space because they look like they're interesting events. But this is something that can be modeled as well, and I'll get to that. So here's another example of an artifact. Sorry, this one is flipped around. So this is the tumor and this is the normal now. And this one is kind of unexplainable. So why are we not able to validate this one? There's no sign of poor base quality. There's no sign of an indel. There's no sign of strand bias. But yet this one doesn't validate. So what's going on here? We don't really know. This one's a mystery. Okay. And so this could be that the validation assay didn't work. That's another possibility. The validation assay is imperfect as well. And so this would be one that, you know, you may want to try again. But, yeah.
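The same-strand check described above can be sketched in a few lines. This assumes each variant-supporting read has already been labeled '+' or '-' for its sequencing direction; the function name and the minimum read count are hypothetical.

```python
def all_variant_reads_same_strand(variant_read_strands, min_reads=2):
    """Flag a candidate whose variant-supporting reads all share one strand.

    variant_read_strands: iterable of '+' or '-' for each read carrying the variant.
    min_reads is an assumption: with a single read the check is uninformative.
    """
    strands = list(variant_read_strands)
    if len(strands) < min_reads:
        return False
    return len(set(strands)) == 1


print(all_variant_reads_same_strand(["+", "+", "+"]))  # suspicious pattern
print(all_variant_reads_same_strand(["+", "-", "+"]))  # support on both strands
```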
Well, percentage. So what we tend to do, instead of deleting these, is rather, I'll explain how we handle that. So, well, what we'd expect is yes, the same extraction. Yeah, same extraction. That would be true. If you did a re-extraction from a different part of the tumor or a different section, then it's very possible you wouldn't see it. How do you validate it? Oh yeah. No, no, this is all new. These are all new. How do you validate it? So, I'll get to that. You basically can design PCR primers around your variant of interest, generate an amplicon and re-sequence using different techniques. So you can use Sanger or you can use targeted next-gen sequencing to get very deep coverage of that. Okay. So here are some true positive examples. This one is a real mutation, validated. Here you just have a few percent of the reads that harbor this particular variant. This is one that validated. So here's one that also validated, and you have a very weak representation in the tumor. Okay. Only two out of, say, 35 reads have this particular variant. And so we want to be able to capture this because it's a true somatic mutation, and it affects the amino acid sequence of a protein. And this is just to illustrate that this is a very challenging problem, especially in the context of tumor heterogeneity. So this might be present in 10% of cells or less. And so the chances of finding it in a 30x genome or a 50x exome are pretty small. And so you want to be able to leverage this. And this is where the joint modeling also helps. If you have a very clean signal in the normal, less signal is required in the tumor to call a variant. Okay. So how can we deal with this?
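The point about low-prevalence mutations can be made with simple binomial arithmetic. As a sketch under stated assumptions: if a heterozygous mutation sits in 10% of cells in a diploid region of a pure tumor sample, roughly 5% of reads carry it, and the number of variant reads at a given depth is approximately binomial. The function name and the requirement of three supporting reads are assumptions for illustration.

```python
from math import comb


def prob_at_least(k, depth, read_fraction):
    """P(X >= k) for X ~ Binomial(depth, read_fraction).

    depth: number of reads covering the position.
    read_fraction: expected fraction of reads carrying the variant.
    """
    p_fewer = sum(
        comb(depth, i) * read_fraction ** i * (1 - read_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


# Assumed scenario: 5% of reads carry the variant, caller needs >= 3 such reads.
for depth in (30, 50, 1000):
    print(depth, round(prob_at_least(3, depth, 0.05), 3))
```

Running the loop shows how the chance of seeing enough variant reads grows with depth, which is the motivation for the deep targeted re-sequencing described for validation.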
Well, so we set out to try to cope with all these problems in a unified framework, because nobody will be able to go through and look at each mutation in IGV, because you literally get thousands of mutations per tumor sample that one sequences. And so how do you cope with this? We need an effective, intelligent way to cope with all these issues. So what we tried to do is enumerate all the characteristics that we can extract from the data at a given position. And all that is really captured in the alignment data. So from there we can get base quality. We can get how well the read aligns to the given position. We can get where in the read the variant is located. We can get the strand bias, or the strandedness, of a particular read that harbors a variant. And we can get proximity to an indel. So we can compute all of this. And so we enumerated literally 106 features: 40 features each from the tumor and normal, and then 26 features that we were able to extract from both the tumor and normal data together, so those are some sort of aggregate feature. And they really comprise the things I talked about: base quality, mapping quality, homopolymer run (when you have the same base repeated many, many times, that can create problems for the machine, so we can model that), strand bias, etc. And this concept is implemented in the GATK for single samples. But what we did is we used the tumor and normal data simultaneously. And so what this shows is that, taking all this 106-dimensional space and collapsing it down using principal component analysis, you can see that the data points that represent true somatic mutations are really separable from those that are germline or wild type. And so this gave us some confidence that we should be able to create a machine learning classifier that can do this.
So we set out to do this, and found that we really dramatically increased accuracy in our mutation calling by taking into account all these different features. So this is an ROC curve. How many people have seen an ROC curve before? Okay, a few. And so what's plotted here on the X axis is the false positive rate, or one minus the specificity. And then on the Y axis is the sensitivity, or true positive rate. And you want to be up here in the top left corner. And so what's shown here is that, when we used features to call the data in a cross-validation classifier structure, what really mattered is that we used all these different features. And the type of classifier, whether random forest, Bayesian additive regression trees, support vector machines or logistic regression, made little difference. But all these feature-based classifiers outperform the simpler models that didn't take into account all these features or used ad hoc thresholded methods to actually call mutations. And so this was all trained on exome data: exome capture data, sequenced by Illumina. And we had a ground truth test set from whole genome shotgun data from a completely different platform, the SOLiD platform. And we found that essentially the same trend was there too. So this machine learning classifier that we developed is actually really quite robust, and it's unequivocal that taking into account all these different features, from both the tumor and the normal data, really dramatically improves somatic mutation detection. So the lesson is really that you want to be able to screen out germline polymorphisms, and you want to be able to screen out all the artifacts that are induced by the machine. And there are many, many of them. Yeah. How do you label your data? So we validated them all.
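For readers who have not seen an ROC curve built, here is a minimal sketch that turns classifier scores and validated labels into (false positive rate, true positive rate) points. The scores and labels are made up for illustration; this is not the evaluation code from the study.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per distinct score threshold, high to low.

    scores: classifier score per candidate (higher means more likely somatic).
    labels: 1 for validated somatic, 0 otherwise.
    Assumes at least one positive and one negative label are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data: a classifier that ranks both true mutations above both artifacts.
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

A curve that reaches a true positive rate of 1.0 while the false positive rate is still 0.0, as in this toy case, is the top-left-corner behavior described above.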
So we called them with a naive method, and then went back in and did the validation experiment as I described. So we designed amplicons around each prediction. And then if the variant was seen again, above a statistical threshold for the tumor, and there was strong evidence for there not being a variant in the normal, then we can call that a somatic mutation. So there might be bias because you're using a label that might be wrong, and then you use that validation? No, well, the possible outcomes are as follows. You take the naive way of just ignoring features. Okay. Call all the things that look like a somatic mutation, including all those artifacts that I showed. Okay. So then you have three possible outcomes. You have a confirmed somatic, you have a germline, or you have a false positive. Okay. And so that's what we used. Then we actually used binary classifiers, so we just looked at the somatic versus the other two. And then a cross-validation routine. These are 3,000 data points, so we were able to use tenfold cross-validation to train the classifiers, and had a held-out test set, and then saw how well we did on the test set. The thing that I'm not sure about is, how did you decide about the ground truth of the labels, like those three labels? The outcome of the validation experiment. So we sequence using exome capture. Okay. That gives us some set of allelic counts that we use to then just call something as somatic or not. Okay. Then we only use the predicted somatic ones. So we took all the predicted somatic calls just from the allelic counts, ignoring all the other stuff. Okay. We took all those positions and revalidated them using the technique that I described. Some of those were confirmed somatic. Some of those were not. That establishes the label. Then we go back in and use the features at the same positions, and see if that helps.
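The tenfold cross-validation mentioned above can be sketched as an index split. This is a generic illustration with standard-library code only; the function name, the shuffling seed and the round-robin fold assignment are assumptions, and any classifier could be trained on each training split.

```python
import random


def kfold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Every item lands in exactly one test fold, so a classifier is never
    scored on a position it was trained on.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 3,000 labeled positions, as in the talk: each fold holds out 300 of them.
for train, test in kfold_indices(3000):
    print(len(train), len(test))
```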
And to avoid circularity, we use the cross-validation. And because we have 3,000 positions, we could use tenfold cross-validation. Is that clear? Okay. Okay. So the other thing we noticed, and this is quite important: we took all the resulting false positives, and we found that there are essentially six different groups that yield false positives. There are six different classes of false positives that can be extracted from the data. So what's shown here is just a heat map. Each column here is a feature, and each row in this matrix is a mutation. And so we clustered the features to try to group the mutations into different classes. The first group is really dominated by strand bias and unequal mapping qualities in both the tumor and the normal, and there we had basically low confidence in terms of the genotypes. Then we had a group two, which was really dominated by strand bias and this very interesting sequencing error. So we looked at essentially the local context and found that a large number of variants, this is a huge number here, all had the following property. You had what should have been a GGT, so a trinucleotide that ended with a T, and the GGT was being read as a GGG by the machine. And so this is just a systematic artifact that is experiment specific, because it wasn't always in the normal as well. Otherwise we wouldn't have called those in the first place. And so this is just a property of the Illumina machines: when they encounter a GGT, they often misread it as a GGG. So something to be aware of. Yeah. Well, in fact, what's interesting is that this phenomenon exists in different platforms as well, including Sanger, or something similar.
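A check for the GGT-read-as-GGG context described above might look like the sketch below. It is only an illustration: the function name is hypothetical, positions are 0-based, and it looks at the forward reference strand only, so the reverse-complement form of the same artifact would need separate handling.

```python
def is_ggt_to_ggg_context(reference, position, alt_base):
    """Return True if a variant replaces the T of a reference GGT with G.

    reference: reference sequence as a string.
    position: 0-based index of the variant base within reference.
    alt_base: the base observed in the reads.
    Forward strand only (an assumption made to keep the sketch short).
    """
    if position < 2:
        return False  # not enough upstream sequence to form a trinucleotide
    context = reference[position - 2:position + 1].upper()
    return context == "GGT" and alt_base.upper() == "G"


print(is_ggt_to_ggg_context("ACGGTCA", 4, "G"))  # T of GGT called as G
print(is_ggt_to_ggg_context("ACGGTCA", 4, "A"))  # different substitution
```

In a feature-based classifier, a flag like this would be one more column in the feature matrix rather than a hard filter.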
Yeah, similar. Yeah. But you're right. Different platforms will have different characteristics. And so what we're working on now is actually developing this for different platforms as well. Then we had another group that was essentially due to misalignments because of repetitive sequence. When we looked at that, those were low mappability regions. And then we had a group that had the same error, but with low base quality. And finally, we had this group that was really interesting. You can see this is what somatic mutations look like in terms of their feature profile, if you will. So these are the features that light up in the somatic mutations. And you can see that this group has very similar properties. It's closely related to that. But it has a distinguishing feature of having very weak signals for the variant in the normal. So there's a hint of a germline polymorphism there that's captured when you look at the data together, but would be essentially miscalled if you were to look at just one case or the other. Okay. So that's the story in terms of somatic mutations. It's not an easy process. I review grants all the time that say, well, I'm just going to sequence 10 tumor-normal genomes, and I'm going to find the mutations. But it's not an easy process. We're still at the stage where we're learning about the imperfections of the data, what biases the machines introduce, and understanding the pervasiveness of germline polymorphisms. And so I would encourage everyone who's doing a sequencing experiment to just bear these things in mind. It's an imperfect science. It's not sequence and you shall find. It's sequence, predict, validate. Okay. I have a question. Here, I think you used 50 triple negative breast cancers.
So if you increase this from 50 to 500, how would you expect the result to change? And then the second thing: what would change if you didn't focus on a subgroup, like if you look at all breast cancer, or if you focus on an even smaller subgroup based on the new findings? Yeah. So this is really designed to get at not the biological variability, but the machine variability. And so we've applied this framework now in different tumor types, and it works very, very well. In fact, we now basically have a whole pipeline that has incorporated this at the Genome Sciences Centre. And so basically every tumor-normal pair that goes through the genome center there uses this to call somatic mutations. And we found that the validation rate is very, very high. Before we started this, we were getting huge false positive rates. And now, when we go back and revalidate, the validation rate is very, very high. So I think it's not a tumor-type-specific phenomenon. Because, I mean, I can imagine that if you mix these triple negative patients with ER positive patients, then you'll maybe get completely different mutations. And then maybe you want to go through the filter. Yeah, but that's biological variability. This is only concerned with machine variability, machine noise. So it should be completely immune to what you're talking about. And then how about the sample size? So this sample size is based on number of events, right? So yes, the 3,000 mutations were found from 50 triple negative breast cancers. But what matters here is actually the number of events that we're looking at. So if we were to look at only 100 mutations, let's say one from each of 100 different tumors, that would be underpowered to actually learn all these features.
But if we were to look at, let's say, a thousand or 5,000 mutations from one tumor, that's probably going to be enough. What matters is the number of data points, not where they came from. The advantage of extra cases is that they come from different sequencing runs. So the inter-experimental variability of the machines themselves might be captured better with more cases. I'm going to get into how we applied some of these tools in a recent study. And this is joint with Sam Aparicio of the BC Cancer Agency. So all this machinery was really developed for the purpose of looking at tumors and answering questions about tumor biology. We decided to focus on triple negative breast cancers. This is a molecular subtype of breast cancer, and it's really defined by what it isn't. So it does not express the clinical biomarkers. I talked about HER2 when I talked about ER this morning. These are tumors that don't express either of those markers. And so, as a consequence, there's no targeted therapy. It's the most aggressive type of breast cancer, affecting about 12 to 15% of breast cancer patients. It has a very poor prognosis, and affects younger women disproportionately. And really the mutations that cause this cancer are largely unknown. So we set out to accumulate a fairly large cohort. As I mentioned, these are quite rare. And so we collected 140 tumor samples from patients in the UK and Canada. And a major point is that all these tumors were resected early in the clinical course. So basically at the time of diagnosis, surgery was indicated and the tumor was removed. And that's the time point in the clinical course at which these samples were generated. So we sequenced 65 tumor-normal exome pairs. We generated copy number arrays and we also generated RNA-seq. But I'm going to focus on the genome part.
And the major goals of this were really to identify all the somatic mutations that occurred in these tumors, and to re-sequence the mutations to obtain the mutational abundance. This is about that digital allele frequency counting that I was talking about earlier. And what we wanted to do with this data is try to infer the properties or the characteristics of clonal evolution. So, you know, what was the clonal composition of these tumors at diagnosis? And then what can we learn from that about the mutational and clonal evolution spectrum within this disease? So the first observation is that these cancers varied quite widely in their mutational load. This is the number of non-synonymous protein coding changes that we observed for each case. Each bar here represents the number of mutations for a given tumor. And we saw this wide range of mutations, all the way down to two cases where we didn't find any. Zero. So we went from 200 to zero and pretty much everything in between. So that was quite interesting. And this is not at all related to tumor cellularity. You can imagine that if you had a sample where most of the cells were normal, you'd have a hard time trying to find a mutation. But this is unrelated to that. And then the other thing that we noticed that I think is relevant here is that even some of these cases at the lower end of the spectrum in terms of mutations had abundant copy number changes. So what this curve shows is essentially the percentage of the genome that was altered by somatic copy number changes. So in this case, you can see that this tumor had a single mutation, but a large percentage of its genome was altered by copy number changes. This is exome sequence? Exome sequence. Okay. Oh, sure. Yeah, sure.
So I think a couple of things. There are probably mutations outside the exome that are driver mutations. So, for example, in transcription factor binding sites or promoter regulatory regions, or in non-coding RNAs, that affect the biology of the tumor and get selected for. That without a doubt is probably the case. The other thing that's missing here is the epigenome. So maybe a large part of the variation in phenotype of the cancer cells is driven by epigenetics and not necessarily genetics. So this is just looking at the genome and it's probably missing out on certain components. It could also be that these are just seen at different phases of their evolutionary history. So it could be that these cancers have not evolved as much, and these cancers have had time to accumulate more mutations. Yeah, they're all high grade. Yeah. That's right. And this one in particular is interesting because it has a mutation in a mismatch repair protein. So amongst the 200 genes that are mutated here, there's a mismatch repair protein. So what's likely the case is that that tumor is able to accumulate mutations in a hypermutator phenotype type of fashion. So it accumulated a lot more mutations than others because of the specific defect in the mismatch repair protein. Okay. So it's interesting that they had different mutational loads, but we wanted to look at the gene content of those mutations. And so what's plotted here is essentially, again, a matrix where you have the cases on the bottom, and then the genes on the left. And a box in this matrix indicates that a particular case had a mutation in a particular gene. And so the p53 gene is by far the most abundantly mutated gene in this cancer type.
So just over 50% of cases had mutations in p53, and that was known. And then we see other tumor suppressors, such as PTEN and RB1, mutated quite frequently. And then we saw mutations in PI3 kinase, which is also known. So these four were really well known before. But then we had a list of mutations that was really quite surprising: infrequent but not singleton mutations, singleton meaning just in one case. And we wanted to try to look at, okay, is there a biological signal in these genes that are rarely mutated? And so we did some pathway analysis and found that a lot of these genes were in focal adhesion, integrin signaling, extracellular matrix and actin cytoskeleton genes. And all these biological processes relate to cell shape and motility. And so the clear signal that emerged from this is that somehow these tumors are acquiring mutations in these pathways that govern cell shape and focal adhesion. So that may confer metastatic potential on those particular tumors. And then the other thing that we noticed that is, I think, particularly important here: I mentioned that this is a disease for which there's no real targeted therapy. But in 20% of the cases we found mutations in the EGFR kinase domain or the ERBB2 kinase domain, and we even had one case of the BRAF V600E mutation. So this is the targeted mutation that's indicated in melanoma and has a targeted agent against it. No one would think to give that targeted agent to a breast cancer patient, but maybe, because of the presence of this mutation, that agent could be applied in this case to that particular patient. We have no idea whether that has toxicity levels that are different in melanoma or breast cancers.
All that stuff remains unanswered and has to be really determined in the context of clinical trials. But at least there are hints here that for at least 20% of the cases there are potentially actionable mutations, where an oncologist could prescribe, off-label, a drug that is already available and FDA approved, and potentially treat those patients in that context. So this gives a lot of motivation to sequence patients in the context of their clinical care. Yeah. That's right, yeah. That's right, or a copy number, that's right. So these ones didn't have any; they had mutations in other genes, but just not in this selection. Yeah, yeah. Okay. So now I think this will get at your question about how we validated, and I wanted to just talk about this deep sequencing experiment. So what we did for every single mutation, and in total we found about 2,500, is that we designed primers around the mutation, generated amplicons and then sequenced very, very deeply. So actually the median redundancy around each mutation was 20,000-fold. And that was to really try to estimate what percentage of cells harbor each mutation. And you go back to that slide that I showed, where you have your pool of DNA and some portion of the DNA fragments have a mutation, and you want to try to recapitulate that and measure that according to this. So the one problem with just taking this data is that, as we found out this morning, copy number changes also alter the allelic frequency of a particular locus. And so really the allelic abundance or allelic frequency that we can get from sequencing a mutation is actually derived from a mixture of many different things. So it's the copy number, it's the heterogeneity, it's the normal contamination. And so here's what copy number change or loss of heterozygosity does to the allelic frequency.
So here's the simplest situation, where you have a diploid genome, okay? You have two copies and one of the copies acquires a mutation. If all your cells are exactly the same, so you have a pure sample, then when you do this experiment of deeply sequencing, 50% of the reads should actually harbor the mutation, okay? On average. So you should get a very clear representation of that. And so if you were to take, let's say, a diploid cell line and look at a heterozygous SNP, chances are the allelic frequency when you deeply sequence it would be somewhere around 50%. If you have this situation that is then followed by a loss of the wild type allele, so the non-mutated allele, that should result in the allelic frequency at that particular locus being around 100%. All the alleles should have the mutation. And similarly, if you then have copy-neutral LOH, this situation where this allele gets duplicated, you'd also have an allelic frequency of 100%. So the confounding thing is that you could have the following situation, where you have copy-neutral LOH that happens first, followed by a mutation, and that will yield a 50% allelic ratio. So the presence of a mutation within a copy-neutral LOH region doesn't necessarily mean that you're going to get 100%. You could get 50%. It depends on the order of events. Does that make sense? Okay. And then we can also imagine the scenario where, let's say, 50% of the cells have this particular genotype. That would also yield a 50% allelic frequency. So we have all these competing explanations for what could give rise to our observation. And what we try to do is actually deconvolute that with a probabilistic model. And we call this model PyClone. What it takes into account is the allelic abundance measurement, the copy number state and the LOH state, and then it tries to essentially leverage the concept that mutations probably accrue in waves of clonal expansion and selection.
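The scenarios above all follow from one piece of arithmetic: the expected fraction of reads carrying the mutation is the number of mutant allele copies divided by the total number of allele copies in the sample. The sketch below works through that forward calculation for a simple two-population mixture. It is not PyClone, which solves the harder inverse problem; the function and parameter names are assumptions for illustration.

```python
def expected_allele_fraction(cell_fraction, mutant_copies, total_copies,
                             other_copies=2):
    """Expected fraction of reads carrying a mutation in a two-population mix.

    cell_fraction: fraction of cells in the sample that carry the mutation.
    mutant_copies: copies of the locus carrying the mutation in those cells.
    total_copies: total copies of the locus in those cells.
    other_copies: copies of the locus in the remaining, unmutated cells
                  (assumed diploid by default).
    """
    mutant = cell_fraction * mutant_copies
    total = cell_fraction * total_copies + (1 - cell_fraction) * other_copies
    return mutant / total


scenarios = {
    "heterozygous, diploid, all cells": (1.0, 1, 2),
    "mutation, then loss of wild type allele": (1.0, 1, 1),
    "mutation, then copy-neutral LOH": (1.0, 2, 2),
    "copy-neutral LOH, then mutation": (1.0, 1, 2),
    "homozygous mutation in half the cells": (0.5, 2, 2),
}
for name, args in scenarios.items():
    print(name, expected_allele_fraction(*args))
```

Several different histories give the same number, which is why the allelic frequency alone cannot identify the fraction of cells carrying a mutation.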
And so there should be groups of mutations that have approximately the same clonal frequency. So we try to take advantage of that. And that's the fundamental assumption in the model. And this is what we got. So what's shown here are six plots for six different cases. On the x-axis here is the clonal frequency, so the estimated percentage of cells that harbor the mutation, from one down to zero. And then each one of these rows represents a mutation. What these little distributions show is essentially the posterior distribution over the clonal frequency of a particular mutation. So this is a probabilistic model using Markov chain Monte Carlo. And so you don't just get a point estimate. You get the distribution around what the actual estimate is. And that's what's shown here. And so what we found is really quite interesting. So here's an example: in the top row are three cases where essentially most of the mutations were around the same frequency. So this would be a tumor that is not very clonally diverse, a clonally homogeneous tumor where most of the mutations were just in one or two groups. And the groups are just colored with different colors. And in the bottom row, I show three examples of tumors that had much more clonal diversity, whereby we had some mutations that were quite frequent, but other mutations that were quite rare. If you think back to Peter Nowell's tree, where the early mutations get propagated forward throughout the evolutionary process, and late mutations are probably only harbored in a few cells, that's what we think is being recapitulated here. And so we had a fair number of tumors that exhibited patterns like this. And you have to remember that these are all resected basically at diagnosis. And so in terms of their clinical course, they're all just taken out right away.
And at the time of diagnosis, what's very clear is that when they're first diagnosed, they're already quite varied in terms of their evolutionary diversity. And this is something that was not known at all before. And the implication of this is that these tumors are currently treated with the same treatment protocol. So they're treated uniformly, but they're clearly all different. And that's a problem if we want to make some headway in this disease. So the last thing I really want to talk about is this idea of temporal inference of mutations. So now I'm showing one of these matrices again. And instead of just plotting the type of variant, I'm plotting the estimated clonal frequency, where the dark boxes represent high clonal frequency and the lighter boxes represent low clonal frequency. And what's interesting here is that we can ask questions about which mutations are the earliest events and which mutations occur later in the evolutionary history of the tumor. And that has some implications for how the biology of the tumor develops. And so, quite gratifyingly, most of the p53 mutations were clonally dominant, or had high clonal frequency. And the implication there is that they happen early. They're one of the first tumor-initiating events. And that's very consistent with what we know about p53. I think somebody asked about this: is there a phenotype where essentially the gatekeeper is lost? And that's really probably what's going on here: this is a gatekeeper gene that's lost early, and that allows the accumulation of additional mutations. However, that wasn't the case for all the tumors. We actually had some where p53 looks like it occurred later on. And that's quite interesting as well. So it's not a hard rule that p53 mutations are the earliest event in every cancer. The other thing that, absolutely. Yeah, that's right. Yeah, that's right.
So then I've also shown the genes involved in focal adhesion and integrin signaling here. And so it was natural to ask the question from a pathway perspective. Are there pathways that appear to be mutated late in the evolutionary history of the tumor, and are there pathways that appear to be mutated early in the evolutionary history? And the results of this are shown here in this enrichment map. And what this is, is projecting genes onto pathways. You're going to do this exercise, I think, tomorrow. Gary mentioned that he might actually try to work with this data, but I'm not sure if you got that far. But anyways, you'll make a diagram like this tomorrow. No, Friday. Oh, Friday. Tomorrow and Friday you're going to make a diagram. It's very exciting. So what we did here is we took the distribution of clonal frequency, go back here. So we took this distribution here and essentially we asked of each pathway whether it had the same shape of distribution. And if it didn't have the same shape, if it was dominated by early events, then we colored it red. And if it was dominated by late events, then we colored it lighter. And so what came out is essentially these fundamental tumor-initiating types of pathways, like cell cycle checkpoint and RB tumor suppressor checkpoint. The fundamental cancer pathways, tumor suppressor, PI3 kinase, p53 feedback, cell cycle checkpoint, all of these fundamental processes that would actually create a tumor, those happen early. The pathways associated with focal adhesion and integrin signaling, where is it, here, they tended to happen late. So the mutations that were found in those pathways had lower clonal frequencies, statistically speaking. And so the inference there is that there's a progression of biological events. First you need to abrogate your cell cycle checkpoint to become a tumor.
And then in order to spread and invade, or become independent from your neighbors, you need to abrogate focal adhesion and integrin signaling pathways. But these mutations may not be in the same clone? That's absolutely true. Yeah, that's true. They could be. So the only way to answer that is with single cell sequencing. And then you know the mutation combination that you've got. Exactly. So ultimately selection operates at the single cell level and probably operates on the whole genotype. And so that's where we're actually now trying to validate all this with single cell sequencing to see which mutations happen together and which ones are exclusive. Okay. So that's all I want to say on that paper. And that was published last month and is available online. So in summary then, triple negative breast cancer is a disease defined by exclusion, with no targeted treatment strategies. We found actionable mutations in about 20% of cases. And I just want to make the point that mutations can tell us much more than just the genes they affect. So we tend to think of mutations as these kind of binary entities. But if we think of the aggregate profile of mutations, we saw that these cancers exhibited heterogeneous mutational load, heterogeneous mutational profiles in terms of their gene content, and even heterogeneous clonal population structures and evolutionary stage. And so we've taken a group that is defined as a kind of homogeneous entity and shown that it has huge variation and shouldn't be classed as one disease. Likely this will have to be dealt with in terms of individual therapeutic options. Okay. So that's the kind of applied work. And I'll just finish now by going over some of the available tools for mutation finding, since we want to actually do this in the lab. So one of the more popular tools is SAMtools.
And this is a suite of tools for working with alignment files that are represented in the community-standard SAM or BAM format. So did you work with SAMtools yesterday? Yes? No? Yes? Maybe so? Yes? Okay. Okay, BWA and SAMtools. Good. This is a nice piece of software because it's fast and memory efficient and it can be compiled anywhere, on your Windows machine or on your Linux machine or on your Mac. The other very popular tool is the Genome Analysis Toolkit, GATK, from the Broad Institute. This is implemented in Java. And I think you're going to go over, yes, Gavin's going to go over GATK in the lab. There's an emerging, and I do say emerging, standard in the community for how to represent variants. And the format that's being adopted is this VCF format. Are you going to go over VCF format? Gavin? Okay. Last year we did. Sorry? They're going to go over it there. Okay, so you'll explore VCF format. This is the Variant Call Format. And the strength here is that a lot of the features that we used in that MutationSeq study were actually derived from the VCF format, and you can look at how those features are encoded. If you look at the dates of these papers, these are the only three papers that deal specifically with the somatic mutation detection problem as far as I know, with more coming out. But as of right now, or as of a few months ago, there are really three, and they were all published almost within weeks of each other, actually. And so the community is only just getting to grips with this. We're only in the very first stages of trying to model somatic mutations from tumor-normal data. And these are some of the tools that are available. So two from my lab, one from WashU. And then the Broad Institute has their own caller that you can get by request.
And the Sanger Institute has now, for the first time, in their paper on breast cancer that just appeared about 10 days ago, referred to their internal code base as well, their platform for calling somatic mutations. So for visualization, IGV is nice. And I showed you examples of that. And then annotation, so that's one thing I didn't really go over. So how do you go from a genomic coordinate to knowing whether your mutation actually affects the protein, whether it's a synonymous variant, a non-coding variant, or a non-synonymous variant? There's a very nice system developed by Chris Sander's lab at Sloan Kettering called Mutation Assessor that we use probably more than they want us to, but we use it anyway. And this is a nice way of projecting genomic coordinates with specific nucleotide substitutions onto amino acids and putting them in protein context. And I think we're going to do that in the lab too. So one thing we didn't cover, and we're probably not going to cover in depth, but you certainly need to be made aware of it, is insertions and deletions. So how do we deal with those? Small insertions and deletions are often harbored by tumor suppressor genes and cause frameshifts in translation and loss of protein expression. So p53, BRCA, all those tumor suppressor type genes harbor these indels, and they're important to find. What is very clear is that indels in short read data are much harder to detect than single nucleotide variants. And so the literature is just really catching up with this. What's very likely is that they'll require more sensitive and specific aligners. So BWA is probably just not going to cut it for indels. We need to use a more principled method that can actually open a gap when necessary in the alignment. But there are some tools available. And one strategy is actually de novo assembly. That should actually be the preferred strategy, but of course that's computationally intensive and difficult.
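As a rough illustration of the annotation question raised above (does a substitution change the protein, and does an indel shift the reading frame?), here is a minimal Python sketch. It is not how Mutation Assessor or any production annotator works: it assumes you already know the affected codon and the position within it, it uses the standard genetic code, and the function names `classify_snv` and `is_frameshift` are hypothetical names chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon, pos_in_codon, alt_base):
    """Classify a single-base substitution within one coding codon.

    pos_in_codon is 0, 1, or 2. Returns "synonymous", "nonsense",
    "stop_lost", or "missense".
    """
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # substitution introduces a premature stop codon
    if ref_aa == "*":
        return "stop_lost"  # substitution removes an existing stop codon
    return "missense"


def is_frameshift(ref_allele, alt_allele):
    """A coding indel shifts the reading frame unless its net length is a multiple of 3."""
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


print(classify_snv("GAG", 0, "T"))  # GAG (Glu) becomes TAG (stop)
print(is_frameshift("AT", "A"))     # 1-base deletion
print(is_frameshift("ACGT", "A"))   # 3-base deletion keeps the frame
```

The first call mirrors the kind of G-to-T change that creates a stop codon in a tumor suppressor. A real annotator also has to map the genomic coordinate to the right transcript, strand, and exon, which this sketch deliberately leaves out.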
So I just wanted to also say now, and I'm just going to wrap up, that it's really important to understand the sources of artifacts in the data. I'll reemphasize that the biology of cancer is complex. We're still in an era where few tools exist specifically for cancer data. We're trying to ameliorate that situation, and slowly but surely we're getting there. And all these new experimental designs, including single cell genomics of cancer, are going to raise brand new analytical problems. And so they represent new statistical challenges that need to be addressed if we're really to fully define the cancer mutational landscapes. Ultimately, what I think will emerge from sequencing individual tumors is potentially not just using archival samples, but actually potentially using this in the clinical context under the guidance of an oncologist. And I think this is apropos in the context of what Peter Nowell concluded in his paper in 1976. So just bear with me and I'll just read this. So: "The acquired genetic instability and associated selection process, most readily recognized cytogenetically, results in advanced human malignancies being highly individual karyotypically and biologically." So individual is the key word there. "Hence, each patient's cancer may require individual specific therapy, and even this may be thwarted by emergence of a genetically variant subline resistant to the treatment. More research should be directed toward understanding and controlling the evolutionary process in tumors before it reaches the late stage usually seen in clinical cancer." Well, what's really interesting is that he was talking about this without having the tools that we have today. And we have at our disposal the ability to now, in a cost-effective way, apply sequencing technologies to individual patients to get at some of these questions. And I would suggest that the road to personal therapy will be paved with mutations. This is what we need.
We need to look at each cancer in its mutational context. But we really have a lot of work to do to identify targetable, actionable mutations and understand how drugs drive selection in different tissue microenvironments and cell types. So I'll just leave you with those thoughts and thank a whole number of people. Specifically, Sam and David were really great close colleagues and inspired a lot of this work. All the people in my lab, including Gavin, contributed to the studies that I presented today. And also the courageous patients who donated their tumor specimens to research. We certainly couldn't do the work without them. And then finally, just a note that I am recruiting people. And so if you like some of the work that you've seen in today's presentation, just give me a call or send me an email. And it'd be great to chat about possibilities. And I think at this point, I'll turn it over. Are we going to take a break? Okay, so we'll take a break till 3:30. And thanks a lot for your attention and enjoy the rest of the workshop.