Okay, so this afternoon we'll switch gears and talk about a different type of genomic aberration, one that occurs at the very precise nucleotide level, and that is somatic point mutations. So here's just an example of a mutation in a rare form of ovarian cancer where you have a substitution here, a C to G substitution, and this is essentially the pathogenic mutation in this particular disease, okay? So we'll talk about mutations like this. So you've probably seen this slide already in the context of this workshop. It's a very well-known slide that's probably overused, but it essentially illustrates the point that cancer is derived from the acquisition of mutations in normal cells. And so here's the fertilized egg, the very first cell, and then throughout the history of this cell there may be an accumulation of mutations in the genome, and that's just depicted by these little glyphs here inside the nucleus of the cell. And at some point there is an acquisition of a mutation, denoted by a star here, that will change the phenotype of that cell and ultimately endow the cell with a growth and proliferative advantage. That results in a clonal expansion. So that cell will then go on to expand and proliferate and replicate itself. And this process actually continues. So once growth and proliferation have been acquired, DNA replication will happen more frequently and the likelihood of additional mutations being acquired will increase. And so at some point the rate at which mutations are accrued in the cell will increase, and that can lead to additional phenotypic changes including, for example, acquisition of chemotherapeutic resistance and other phenotypes. So what you actually substitute for these glyphs can be a point mutation, it could be a copy number change, it could be a rearrangement, and so these are all generalized in the sense that these are genomic aberrations derived from normal cells.
So with respect to individual mutations, this is just so we're all aligned here. This is the particular type of variation that we're interested in for this section of the workshop. And this is, for example, a single nucleotide change. There's a G to T mutation. This is actually a real mutation in a gene called p53 in a breast cancer, and this one single nucleotide change is likely the driver event in this particular tumor. It causes a truncated protein and will result in p53 not being expressed in this particular tumor. And so in addition to the copy number changes I talked about in the last lecture, there are a number of point mutations in oncogenes such as KRAS and BRAF, and kinase domain mutations in EGFR and KIT, that are targetable. And so KRAS and BRAF testing are now pretty much routine in these types of diseases. So for BRAF, a large percentage of melanomas carry a V600E mutation. It's the very same mutation in the portion of melanomas that harbor it, and there's a targeted therapy against that particular mutation that can inhibit the oncoprotein that is produced. And so knowing the mutational profile can be very beneficial in terms of selecting targeted therapy for patients, and also in just understanding the biological properties of disease and identifying new targets for additional therapeutic strategies. Actually, before I leave this topic: with the availability of sequencing technology the way it is right now, there are now large efforts underway across the world, in many different academic and also commercial labs, to develop targeted panels that can sequence for the presence of just these mutations. So, for example, it can be done very inexpensively and in a high-throughput manner to profile just a very small subset of the genome for which there are actionable mutations.
So these are mutations that an oncologist could actually potentially do something about: take a drug that's FDA approved and used in other indications and administer that drug to a particular patient. And this is now being adopted across a lot of genetics labs all over the world. And so as we learn more about the mutational landscapes of cancer, and which mutations are potentially oncogenic and have therapies against them, this will only become more and more applicable to the course of clinical care. So of course the best way to be able to identify mutations is through sequencing the genomes. And so this is a picture that probably should be familiar to most of you. It's from the classic Hanahan and Weinberg review paper on the hallmarks of cancer, and there's actually been an update for the 10-year anniversary of this. But the point is that all cancers harbor certain biological characteristics. And the question is what genetic abnormalities actually underpin these characteristics, and so what genes or pathways are disrupted due to these somatic genome aberrations. And again these can be copy number changes, rearrangements, point mutations. And so there have been, as you've been exposed to by now, incredible efforts and levels of investment put forward into sequencing cancer genomes in a number of different contexts, from large population-based studies to understanding mechanisms of chemotherapeutic resistance. And huge efforts of international scope have been placed on really trying to understand these properties from the perspective of what mutations exist within different cancers. So before we dig into the actual analysis of data, I wanted to just take a step back for a minute and think about how it is that tumors come to be the way they are. So that graphic that I showed at the beginning from Mike Stratton's paper, which shows a sort of linear trajectory, ignores some fundamental properties of how tumors arise.
And that's really about the notion of Darwinian evolution, in the sense of considering the tumor cell, or the clonal population of cells that share the same phenotype, as a unit of selection. And this was originally articulated by Peter Nowell in Science in 1976. And he cast this problem into the context of phylogenetic evolution, and he proposed this clonal evolution theory of tumor cell populations. And what that theory predicts is that, first of all, tumors will change over time and through anatomical space, and acquisition of mutations will eventually lead to phenotypic changes that will confer selective advantages and clonal expansions. What that means is, if you consider this phylogenetic tree here, this encodes the structure by which cell populations in a tumor are related to each other through genetic abnormalities. And this has certain properties to it. And the first is that clones that are at the root of the tree will propagate their genetic abnormalities down to the branches and the leaves of the tree. So these cells here will inherit the genetic abnormalities from this clone here. So we start from the normal, then we have some sort of genetic event that transforms this normal cell into a malignant cell. So this cell harbors all the genetic variation that exists in this normal cell but has acquired something new. And that will result in a clonal expansion. And then similarly, we can have this branching process so that, when we reconstruct the tumor from its phylogenetic history, we get something that looks like this. And that will also result in clones that may acquire an aberration that is not selected for. So it will be deleterious to that clone, and that clone will just basically die out and won't expand. And so the end result of this is that tumors will be composed of clonal populations with different underlying phenotypes. And so we shouldn't be under any illusions that tumors that we're sequencing or studying are homogeneous entities.
They're mixtures of cell populations. And that has lots of consequences for how we might interpret the data going forward. So another way to look at this is that we might have a population of cells that looks something like this. So let's say we have this brown population here that is characterized by mutations A, B and C. And we call the mutational profile of a clone its clonal genotype. So one of these cells may then acquire a mutation. Let's call it mutation D. And that will result in a clonal expansion. So we get this population of orange cells. And then one of these cells may acquire a mutation G that will result in a clonal expansion that produces the green cells. And so this has certain consequences when we look at what we call the mutational prevalence, or what are here called mutation frequencies, in this population. And the idea here is that these mutations A, B and C, the earliest events that create this tumor, will be present in all cells. Because again, the earliest mutations are propagated forward in the evolutionary history, and so they will be present everywhere. So these will have mutation frequencies of near one. So all cells harbor mutations A, B and C. Now we know that D is acquired later. And so all but a few cells, you'll notice that here it's 17 of the 21 cells, so not the original cells here, will harbor mutation D. Because if you trace this tree, then this is pretty high up. But then some of the mutations, like G for example, will have been acquired late and only be present in a subset of cells. So this is an important concept that we'll get to later on in the lecture, in that the prevalence of mutations in the population gives you some indication of where in the evolutionary history those mutations arose. And we can use that to try to reconstruct this mixture of cells.
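To make the prevalence idea concrete, here is a minimal Python sketch of the brown, orange and green populations. The clone names and mutations follow the example above. The split of the 17 D-carrying cells into 11 orange and 6 green is an assumption made purely for illustration, since only the 17-of-21 figure is given.

```python
from collections import Counter

# Clonal genotypes from the example: each clone inherits its parent's mutations.
# Cell counts for orange and green are ASSUMED; only "17 of 21 carry D" is given.
clones = {
    "brown":  {"mutations": {"A", "B", "C"},           "cells": 4},
    "orange": {"mutations": {"A", "B", "C", "D"},      "cells": 11},
    "green":  {"mutations": {"A", "B", "C", "D", "G"}, "cells": 6},
}


def mutation_prevalence(clones):
    """Fraction of all cells that carry each mutation."""
    total_cells = sum(clone["cells"] for clone in clones.values())
    carriers = Counter()
    for clone in clones.values():
        for mutation in clone["mutations"]:
            carriers[mutation] += clone["cells"]
    return {m: carriers[m] / total_cells for m in sorted(carriers)}


prevalence = mutation_prevalence(clones)
for mutation, fraction in prevalence.items():
    print(f"{mutation}: {fraction:.2f}")
```

Early mutations (A, B, C) come out at a prevalence of 1, D at 17/21, and the late mutation G at whatever fraction the green clone makes up, which is the ordering the tree predicts.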
Of course we don't know this mixture, and we want to try to infer something about this mixture from sequencing data, and I'll get to that a little bit later on. So again, we have these different populations, and the key question that we can ask of these populations is: do these clonal genotypes drive different phenotypic behavior? So how do the green cells versus the orange cells, for example, respond to a particular drug intervention? Are all the cells equally susceptible? Are some resistant and others not? And will the orange cells, for example, expand under the pressure of a drug selection, or even in different microenvironments, so in different parts of the anatomy or even in different micro parts of the anatomy? And so really a fundamental question is how this phenomenon, which we know exists, relates to treatment response, progression and metastasis. So have we visited the concept of driver versus passenger mutations? I talked a little bit about it in the previous lecture, but have we? Okay. So this is a difficult topic because there are lots of different views of what is a driver mutation and what's a passenger mutation. So the analogy here is, of course, you have a bus and you have one person at the wheel driving the bus, and that person is actually in charge. And so the direction that the bus goes depends on that person alone. And the passengers are, of course, just along for the ride and they actually have no say in the matter. So the way I look at this is that I think of a driver mutation as a mutation that alters the phenotype at the level of the cell and that actually creates a selective advantage for that cell, such that when Darwinian selection is operating that cell can expand and there will be a clonal expansion as a result of this particular driver. So that's one definition. The passengers are more or less stochastically induced, so there's some sort of random mutagenic process.
For example, the classic example of a passenger mutation is a synonymous mutation. So that's a mutation in a gene that doesn't alter the protein, and so there's no way for selection to actually operate on it because it doesn't change the underlying biochemistry. And so this would just be along for the ride, so to speak. So those are the two sort of fundamental concepts: the driver mutations that really alter the phenotype and create a selective advantage, and the passenger mutations that just accrue as a result of compromised mismatch repair, or are just left unchecked because they have neither a deleterious nor an advantageous effect on the cell. So driver mutations can have a number of different properties to them. They can be what we call a gain of function, a loss of function, or a switch of function. And then we can think of driver mutations as being tumorigenic, or initiating the neoplastic transformation. And examples of this are loss of function mutations in p53, or KRAS codon 12 mutations, which are activating mutations that really change the function. These are considered amongst the earliest events in a tumor's evolutionary history. So this has some sort of implication in that these mutations are really required to create the malignant phenotype to begin with. And so these are really important early on in the evolutionary history. We also know that driver mutations can confer metastatic potential. These mutations here may have to do with potentially acquiring contact independence from your neighbors, for example. So cells that can actually exist without being in a tightly cohesive matrix of cells. And so this might be something that would not create a tumor but may be required for that tumor to spread. So there is a temporal aspect to this. And then finally, when there's intervention, there may be a mutation that has nothing to do with tumorigenic properties or metastatic potential, but it does confer some sort of resistance to a drug.
And maybe there's a mutation that activates a pump that allows the cells to pump out a drug, and therefore those cells become resistant to the drug and will therefore expand. And again, since these mutations are occurring late, they will carry forward all the mutations that have been accrued prior to that. And so those cells will still have the malignant properties, and that will allow for expansion. So this is just a nice review that talks about different mutations that are involved in these different processes. And so I encourage you to review that. Okay, so let's talk about these oncogenic mutations, or activating mutations. These are very much gain of function. And so one classic example is the PI3 kinase gene. There are two hotspots: one clustered around amino acid 545, in the helical domain, and this one at 1047, in the kinase domain. And so many tumors exhibit mutations at these exact amino acids, but there are other mutations spread throughout the gene as well. But this is a really nice canonical example of an activating oncogenic mutation. Other examples are KRAS codon 12, which I've already mentioned, and BRAF V600E in melanoma and also in colorectal cancers. So for these particular mutations, it's very much identical. It's the same amino acid, the same substitution, that happens in different individuals. So this is an example of what we call convergent evolution. So we know that these individuals are not related by any other mechanism. And so these are mutations that just happen independently of each other in different patients. But what gets selected for is these particular mutations, to drive a clonal expansion and create the malignant phenotype. So the key thing is that these patterns will be localized in clusters around the particular amino acid or the protein domain that's being affected. And I'll show you the contrast to that. Melissa has a nice paper talking about this in PPP2R1A mutations as well.
So by contrast, we have the tumor suppressor, or loss of function, pattern. And you can see here that this is a protein called ARID1A, and this is involved in chromatin remodeling. And you can see here that there are mutations that are spread throughout the protein. And the characteristic of this is that most of these aberrations will either induce a premature stop codon, so they're nonsense mutations, or they're frameshifting insertions and deletions. They disrupt the reading frame of the protein, so they have a similar effect to a nonsense mutation. Or sometimes these proteins are affected by deletions, such as the homozygous deletions that I showed in the previous module. So this is a classic profile of a tumor suppressor. p53 shows a profile like this, and other tumor suppressors such as PTEN and RB1 will show profiles like this. And also BRCA1 and BRCA2. So that's a stark contrast to the clustered mutations that we see in the oncoproteins such as BRAF and KRAS. Okay, so let's just talk about this concept of digital sequencing for a minute. So one of the properties of next-generation sequencing, in addition to its cost-effectiveness, in the sense that we can cover the whole genome relatively inexpensively in both time and money, is that we get a digital representation of the DNA mixture that we're sequencing. And so here, just schematically, we have some sort of soup of DNA that we've extracted from our population of cells. And some proportion of those cells will harbor mutations. So here's this mutation G that may be represented at about 30% of the alleles, let's say. So then we can create a library from this mixture of DNA and then we can sequence it. And when we take those reads and we align them to the genome, we can see that there's a certain proportion of reads that harbor that mutation. And this will be proportional to what was present in the initial sample.
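The proportionality between the sample and the reads can be sketched with a small simulation. This is a toy model in Python, not a description of any real library preparation: it assumes each read is an independent draw from the DNA mixture, and the function name `simulate_variant_reads` and the chosen depths are illustrative.

```python
import random


def simulate_variant_reads(allele_fraction, depth, seed=0):
    """Count reads carrying the variant when each read independently
    samples a variant allele with probability `allele_fraction`."""
    rng = random.Random(seed)
    return sum(rng.random() < allele_fraction for _ in range(depth))


# Mutation G present at about 30% of alleles, sequenced at increasing depth.
for depth in (30, 1000, 100_000):
    variant_reads = simulate_variant_reads(0.30, depth)
    print(depth, variant_reads, round(variant_reads / depth, 3))
```

Under this assumption the observed read fraction scatters around 0.30 at shallow depth and tightens toward it as depth grows, which is why deep targeted sequencing gives a precise estimate of allelic abundance.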
And the advantage of the digital technology is that we can sequence very deeply in a targeted way and get a very precise representation of the allelic abundance of a particular mutation. So even if it's occurring in less than 1% of cells, we can still resolve that, because what we have is essentially individual-molecule sequencing. So that's just a concept I want to carry forward to the rest of the lecture. Okay, any questions so far? It's the after-lunch lecture, so I see tired eyes, but I don't want to see any closed eyes. Asking questions is a good way to stay awake. Okay, so let's talk about statistical considerations for modeling these allelic distributions. So I've already talked about this, but it's worth re-emphasizing. Okay, so again, we have tumor-normal admixture in these samples. We've talked about intratumoral heterogeneity. So with respect to modeling specific alleles, the copy number changes that we talked about in the last lab have some pretty important implications for when we measure the precise allelic abundance of particular mutations. These can also be influenced by copy number changes. And so we'll talk about a fairly advanced topic later on that tries to address this phenomenon when inferring properties of mutations. And then the other really unique thing about the cancer space here is that the experimental design to capture mutations really requires the simultaneous sequencing of a matched normal. And so that immediately changes the analytical strategy to look at these libraries. So, as I said, we know that they're highly correlated, and so we should take advantage of that. All right. So what actually happens when we sequence a genome is that we might get millions or billions of these read fragments, these sequence reads, that come off the machine. And when we start this process and the data come off, have you talked about FASTQ format? Yes. Okay, so pretend that each one of these is a FASTQ record.
We have no idea where these reads align to the genome. So the first step is basically taking a reference sequence and trying to put these reads in some sort of semblance of order. And you can consider this as a giant jigsaw puzzle. It sounds like you did some work with alignments yesterday. And the idea here is that once we do this alignment, and there are numerous approaches to this, the biological variation starts to illuminate itself. And so we can look at each read and ask whether this particular nucleotide matches the reference at that particular position. And so we can see there are some examples here, like this one, this A here, which does not match the reference. The reference has a C. But there's really only one read that has a mismatch there. And so this might be due to a sequencing error, or it might be a real mutation present in only a rare clone. Then we have these other locations that have essentially reproducible variations. So here we have three reads with a T where the reference has an A. And so we might be quite confident that this particular location harbors a mutation. We have three independent observations of that mutation in the reads that we sequenced. And then similarly, this one here would be an example where six out of seven reads harbor a mutation. So what could be the interpretation, for example, of this column versus this column? Why would only half the reads have a mutation here and almost all the reads have a mutation here? Possibly, yeah. So let's just project this onto proportions. So here we have three out of six, or 50%. And here we have six out of seven, or almost 100%. Right, okay, good. So you could have a heterozygous mutation and a homozygous mutation. So here both alleles are affected, so you have biallelic inactivation or a biallelic mutation. And here you have just a single heterozygous mutation. Could be just a SNP.
We'll get to that later. But let's say this is the tumor. And then what about this one here? I don't know why the C is highlighted. It shouldn't be. But this one here, this could be, as I said, due to a number of different factors. It could be a sequencing error. It could be a misaligned read. It could be a number of different things. Okay, good. Can we have a gap in the sequence to make the C align with a column? Here? Yeah, if we had a gap before the C, the C would align with a full column. Well, then it would push the rest of it to be misaligned. I think really this C is meant to be here. It should be this base that's highlighted and not the C. Yeah, it's just a mistake. I've had this slide for five years. This is the first time I've noticed this. Yeah. That's right. It's been a clonal expansion. Okay, so let's talk about this experimental design of a tumor exome and a normal exome. So this could be a genome or an exome, it doesn't matter. But we have representation of a tumor and a normal. And so we've developed a couple of strategies for this. And at the risk of being a little bit narcissistic, I don't want to focus really too much on my own tools, but just to use those to illustrate the concepts that I think are important in an analysis of this data. And so I'll go through these two methods just to illustrate the concepts that are important. Do I have data in response? Well, naturally. So let's talk about the data now from a tumor-normal pair. And this comes from Andy's paper, which I'd encourage you to read. So here's an example of just a small segment of the genome, with the normal data and the tumor data. So now here we expand this concept. So let's say we just look at this column here. So this is very much like that other column that I showed, where you have half the reads in both the normal and the tumor that harbor a particular variation. And so when we see the variation in both the normal and the tumor, that's an indication of a germline polymorphism.
So that's shared between all the cells, and that's because this is probably present in the very earliest zygote that is created, the very first cell. So this is a heterozygous shared mutation. And then here we have a mutation, or a variation, that is homozygous in both the tumor and the normal. And so this is just a case where the maternal and paternal alleles are the same and they are both different from the reference. And then we have this red column here, and that's a locus where the normal indicates no source of variation and the tumor has half the reads showing a variation. And so we can project all this down into a very compact representation of the data. So here we have the nucleotide level and the actual bases. But in essence, we can reduce this to a binary problem: does a read match the reference or not? And that's where this projection onto counts comes in. And so a represents the number of reads that match the reference at each position, and then d is the number of reads covering that position. And this depth will be variable across the genome. And I don't know if you've talked about experimental designs or depth of coverage and things like that. Yeah, you've gone over that, sort of. So depth of coverage here, I mean, this is obviously quite shallow. Usually we try to achieve at least 30x coverage, but this is just for illustrative purposes. Here it's somewhere around six. And we can do that for the normal and we can do that for the tumor. And so we get basically these four vectors that form the input data to our statistical model. And the concept here is that we try to take each one of these columns and assign it a particular biological class. And so here you have a germline heterozygous. This sort of goes into the red here, so sorry if you can't read this. Germline heterozygous, germline homozygous, and somatic heterozygous here.
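The projection onto counts can be sketched in a few lines of Python. The three pileup columns below are made-up toy data chosen to mimic the three cases on the slide (germline heterozygous, germline homozygous, somatic heterozygous) at a depth of six. The function name `count_vectors` and the representation of a pileup column as a string of read bases are assumptions for illustration.

```python
def count_vectors(reference, columns):
    """Reduce pileup columns to counts.

    columns[i] is the string of read bases observed at position i.
    Returns (a, d): reads matching the reference, and total depth.
    """
    a = [sum(base == ref for base in col) for ref, col in zip(reference, columns)]
    d = [len(col) for col in columns]
    return a, d


# Toy data (hypothetical): three positions, six reads each.
reference = "ACG"
normal_columns = ["AATTAT", "GGGGGG", "GGGGGG"]
tumor_columns = ["ATATTA", "GGGGGG", "GTGTTG"]

a_normal, d_normal = count_vectors(reference, normal_columns)
a_tumor, d_tumor = count_vectors(reference, tumor_columns)

print("normal:", a_normal, d_normal)
print("tumor: ", a_tumor, d_tumor)
```

At the first position about half the reads match the reference in both samples, at the second none do in either sample, and at the third only the tumor shows non-reference reads. Those four vectors are all that a count-based model sees.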
So we try to take these count vectors and assign them to these biological classes. Sounds a bit similar to the copy number idea, right? So we take some sort of signal and we try to assign some biology to it. So if we just focus on this red column here, we can encode this in a nice probabilistic model, and then we can look at the probability of each of the nine possible combinations of joint normal and tumor genotypes. And so there's a good indication from the signal here that this has a high probability of being AA in the normal and AB in the tumor, or a heterozygous somatic mutation. Is that clear to everyone? So if you look at this column here, then most of the probability mass should be on BB in the normal and BB in the tumor. So if we were to compute this matrix for this particular position, then this would probably be somewhere close to 1.0 for that particular part of the matrix. So the problem we try to solve here is that we know that the genotypes of the tumor and the normal are highly correlated. And as I said, the mutation rate for cancers may be somewhere on the order of one in 10,000 nucleotides, maybe less. And so the polymorphisms tend to really dominate. And so we need to have a way to distinguish the germline events from the somatic events in a rigorous statistical fashion. And so this model here, called JointSNVMix, which was developed by Andy, is a solution to this problem. And so here the input data is a normal BAM file and a tumor BAM file. And without going into the details of this, essentially this model allows the data to be considered simultaneously, and that allows for a borrowing of statistical strength across the samples. And that confers some measurable improvements in accuracy, particularly the ability to isolate germline polymorphisms and distinguish them from somatic mutations. So I would encourage you to read that if that's of interest to you.
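The nine-way joint genotype matrix can be illustrated with a deliberately simplified Python sketch. This is not the JointSNVMix model: it assumes fixed, hand-picked binomial parameters rather than fitted ones, and the prior weights in `joint_prior` are hypothetical values chosen only to express that shared genotypes are far more common than somatic changes.

```python
from itertools import product
from math import comb

GENOTYPES = ("AA", "AB", "BB")

# ASSUMED probability that a read matches the reference under each genotype
# (0.99 and 0.01 allow for roughly 1% sequencing error; not fitted values).
P_REF = {"AA": 0.99, "AB": 0.50, "BB": 0.01}


def binom_pmf(k, n, p):
    """Probability of k reference-matching reads out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def joint_prior(g_normal, g_tumor):
    """Hypothetical unnormalized prior over (normal, tumor) genotype pairs."""
    if g_normal == g_tumor:
        return 1.0 if g_normal == "AA" else 1e-3  # reference, or germline variant
    if g_normal == "AA":
        return 1e-4  # somatic mutation
    return 1e-6  # everything else, e.g. loss of heterozygosity


def joint_posterior(a_n, d_n, a_t, d_t):
    """Posterior over the nine joint genotypes given reference counts and depths."""
    weights = {
        (gn, gt): joint_prior(gn, gt)
        * binom_pmf(a_n, d_n, P_REF[gn])
        * binom_pmf(a_t, d_t, P_REF[gt])
        for gn, gt in product(GENOTYPES, repeat=2)
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Red column at 30x: normal all reference, tumor half reference.
posterior = joint_posterior(a_n=30, d_n=30, a_t=15, d_t=30)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 4))
```

Under these assumptions the (AA, AB) cell takes nearly all the probability mass at 30x. With the same pattern at only 6x depth, `joint_posterior(6, 6, 3, 6)` still ranks (AA, AB) first but with much less certainty, which is one reason depth matters for somatic calls.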
And I think that's the tool that Andy's going to have you go through in the lab, to really examine in a hands-on way what a germline polymorphism looks like and the tool's ability to distinguish that from a somatic mutation. So we might have some predictions, and so I've gone through this process of raw data, aligned reads, predicting variants. And then we might want to have a protocol for validation, and this is quite important because, despite dramatic improvements in prediction ability, when you're talking about biology or making a claim about a particular mutation, first of all you want to be able to demonstrate that it's real, and second of all that it's somatic. And so there are still quite a few instances of false positives, and sometimes what happens is that the sequencer may not adequately sample the variation in the normal, and so there's an illusion of a somatic mutation, but then upon confirmation that somatic mutation actually turns out to be a germline polymorphism. So this validation step is still really a critical step on the path to actually getting knowledge and clinical relevance. So let's just focus on this false positive idea. So let's look at different artifacts that might induce false positives. So you should be familiar with IGV by now. So you've read sequence data into IGV, so you know what this means, that's good. And so what I'm showing here is an example where this is the normal on the bottom and this is the tumor. And so if you just look at this particular position, you can see that the normal is relatively devoid of variation and the tumor looks like it has some variants in here. And now this is a prediction that was made that did not confirm. And the reason is that these reads are all misaligned. So these reads can basically be aligned somewhere else, with an almost identical score.
And so the alignment process can introduce artifacts into the problem. So you have reads that don't belong where they're assigned, and that can create the illusion of a mutation. So insertions and deletions wreak havoc on this data. This is a real, serious problem that exists, especially with the Ion Torrent data, and I'm sure you've experienced that. And so think about this problem. You've got 100 base pair reads at the most, really, for your whole genome these days. Some of the targeted sequencing can get longer reads, and there are other platforms that can get longer reads. But on a standard HiSeq 2500 run of a genome, you get 100 base pair reads. And you have to align those 100 base pair reads to a 3 billion letter space. And so that's quite an enormous task, especially in a context where the genome is quite repetitive and has lots of places where there might be ambiguous alignments. And then if you throw on top of it the fact that some of these reads will harbor, or be sequenced across, microinsertions or deletions, that makes the problem that much harder. And so what tends to happen is that the aligners, just to be tractable, have certain heuristics associated with them. So these would be less than 20. Yeah, it's just arbitrary, but small. Yeah, so beyond 20 there's basically no hope. To sequence an insertion or deletion that's 50 nucleotides in a 100 base pair read and expect that read to align properly is probably very difficult. Mate pairs help. But I think the field has kind of zeroed in on the 1 to 20 range as something that may be tractable with 100 base pair reads. But I still think we have a massive false negative rate with respect to these microinsertions and deletions. But the reason I raise this here is that this creates an illusion, because we tend to look at these positions in isolation. Of course we don't trawl through the genome in IGV and see the context.
We run an algorithm that runs across the whole space of the genome and treats each position independently of the next. But of course there's some sort of context that's involved here, and so here are a couple of reads that have this insertion or deletion. It's probably misaligned, in the sense that the gap is not long enough, and because the aligner has forced this read into this conformation, that has induced this illusion of a somatic change here in the tumor that, just by chance, hasn't happened in the normal. And it could be because the insertion or deletion is potentially tumor specific, but in this case there's actually evidence that it's there in the normal as well. It's just that in the tumor data the aligner has produced a different type of alignment that has created this artifact here. So this doesn't exist in the biology, and it's simply an artifact of the insertion or deletion. So that's something to really watch out for. So another thing that can yield false positives is low base quality. IGV will actually encode the strength of the color of the mismatch according to the quality of the base call. So what comes off the sequencer is actually not discrete. In your FASTQ file, the last line of each record encodes a quality metric that's associated with each particular nucleotide. And so the base callers only produce the result for the most likely base, but that might have a very low probability, so there may be quite a bit of uncertainty as to what that base actually is. And so when you visualize that, you can barely see it here, but these are very faint representations of a mismatch. And so these might be just above the threshold that you would normally discount, so these would just be low quality bases that would give, again, the illusion of a mutation there that doesn't exist.
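How the quality line relates to base-call uncertainty can be sketched in Python. This assumes the common Phred+33 ASCII encoding of FASTQ qualities; older files may use a different offset. The function names, the toy pileup and the Q20 cutoff are illustrative choices, not taken from any particular caller.

```python
def phred_to_error_prob(quality_char, offset=33):
    """Convert one FASTQ quality character to the probability that the
    base call is wrong, assuming Phred+33 encoding: p = 10 ** (-Q / 10)."""
    q = ord(quality_char) - offset
    return 10 ** (-q / 10)


def supported_mismatches(ref_base, bases, quals, min_q=20, offset=33):
    """Count mismatching bases whose quality is at least `min_q`."""
    return sum(
        1
        for base, qual in zip(bases, quals)
        if base != ref_base and ord(qual) - offset >= min_q
    )


# 'I' encodes Q40 (error about 1 in 10,000); '#' encodes Q2 (mostly noise).
print(phred_to_error_prob("I"), phred_to_error_prob("#"))

# Toy pileup column (hypothetical): three mismatches, only one at high quality.
print(supported_mismatches("A", "ATTTA", "II##I"))
```

In this toy column, two of the three T calls sit at Q2, so a quality-aware count leaves a single supporting read. That corresponds to the faint mismatches in the IGV view.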
One of the most common causes of artifact is that we see mutations sequenced all in the same direction. These reads all have some directionality associated with them, which represents the strandedness of how they were sequenced. Typically, if all the variants are sequenced in the same direction and you don't get reads supporting them in the other direction, that's most likely due to an artifact in the optical or PCR process of the sequencers. This is a pervasive and major confounding effect in mutation analysis, so if all the reads are sequenced in one direction, you should watch out for it. Presumably there's an example of this in the lab? No. Okay, all right. Well, this is collectively known as strand bias, and most of the mutation callers now will account for it, but that's only recent: really only in the last year have a number of mutation callers emerged that account for this phenomenon. Usually you can see this by inspection. I guarantee that if you have encountered sequence data, you'll have looked at a mutation and thought, aha, I've got a mutation, and then you go look at it and everything is in the same direction and you say, no, I don't have a mutation. That's hopefully what you do. Melissa's done that. Okay.
And then I think we still have some unknown sources here. I've reversed the tumor and normal for this one, unfortunately. This is an example where everything looks good: there's no strand bias, the base quality is good, the alignment scores are good, there's no indel present. Everything looks fantastic, but this one doesn't verify, so there's something else going on here that we don't know. This is just a way to introduce you to the fact that it goes far beyond modeling allele counts, and one has to be very wary of numerous sources of artifacts that exist in the data and could lead one astray when interpreting mutations. Okay, so let's not dwell on the
negatives, and now let's talk about positives. Here are some true positive examples of what it should look like. Here's an example of a mutation that I think would be quite difficult to detect, but it's a real mutation that we were able to verify. By the way, all these examples are real, chosen from work that I've done; these aren't made up. So here's a mutation that we were able to call, and it validates, but it's only present in a small proportion of reads. This would be a true example, but one that's quite challenging to call. Here's one that's even more difficult: the G here is the variant, and it may be present in, I would say, less than 5% of reads, but it's there and it's real. So what could lead to a signal like this? Okay, right, exactly: there might be just a few cells in the population that actually harbor that mutation, and because of the digital nature of this technology they will be represented in this way. You'll notice that there are no reads in the normal that have that particular variant.
Yes, so there will be some false negative rate, without question. At some point the sensitivity of the algorithm will converge with the noise model in the system, and the signal will be indistinguishable from noise. There is some threshold there, and ultimately a lot of the variant callers and somatic mutation callers have some sort of confidence estimate associated with the call. Then it's about you, as an experimentalist: what's your tolerance for false positives? Is there a cost model associated with that? Is it really bad if you miss a subclonal mutation?
In that case your tolerance for false positives is going to have to go up. Or is it really bad if you are polluted with false positives, and you really only want to look at the crème de la crème of the mutations that you think are present in all cells, for example? In that case you can ratchet up your threshold so that your specificity is high. That's a trade-off you have to decide on as an experimentalist.
Related to that, how do you typically verify, and how do you choose which variants if you have a large number of them? Yeah, so, I think tomorrow; there are interpretation layers that go on top of this. This is really just about signal processing, trying to find mutations at the genomic coordinate level. But of course you can use those genomic coordinates to determine which mutations lie in protein coding regions and which induce an amino acid change, for example. Then you can look at the specific amino acid change and decide whether it's a significant change in terms of what it might do to the charge and polarity of that amino acid, whether it induces some sort of conformational change affecting drug docking in a protein that we know is targetable by a drug, et cetera. Typically, when doing population level studies, you may want to look at mutations that occur in more than one sample, so they're recurrent in the same gene or at the same position. To validate, there are a number of different procedures. What WashU typically does, this is Elaine Mardis's group at the Genome Institute, and I think... has Obi arrived yet?
Friday. So you can ask Obi about this, but typically what they do is sequence the whole genome, predict their variants, and then do custom capture on the positions that show interesting variation. They've designed probes that essentially tile the whole genome; they subselect a set of probes, capture that material again, and resequence it. If the variant is there again, it's considered validated. But there are other, much more targeted ways. You can design primers around a particular variant of interest, generate an amplicon, and throw it on a MiSeq. Or you can do Sanger validation, for example, which is very inexpensive and high throughput but may not be sensitive to mutations like this that are subclonal. So there are a lot of different strategies to validate.
Yes? Do you use the reference genome for assembly purposes? It seems like you're comparing to the normal, so why also compare to the reference? Yeah, right. The normal is not assembled, though; it's just sequenced in the same way. The way to tractably make the comparison is to take the normal reads and align them to the reference, take the tumor reads and align them to the reference, and then compare. Another way to do it is to assemble both genomes and then look at the differences in the assembled genomes, but I would say that's still an expanding field, and I don't think we have super reliable ways of assembling whole genomes, especially human-scale genomes. Bacterial genomes and other smaller genomes I think we can do; at the whole human genome scale it gets quite challenging.
Yes? If you're doing a medium to large scale validation experiment, do you feel that sequencing technologies are maturing to the point where you don't really have to do an orthogonal assay? Right now, if you're comparing something, you have to run everything on Ion Torrent to get rid of the platform-specific biases. There will be platform-specific biases; I think that's
inevitable, so we'll never eliminate those. But I think we have matured to a point where we can reliably tell true mutations from false ones, and some of that is due to advances in the methodology for calling mutations in the first place. Our latest validation experiments show almost 99% validation rates for the mutations we're calling, whereas four or five years ago that might have been around 30%. So we've learned from our experience and can now detect mutations much more reliably. That probably induces some false negative rate that's unquantified at this point, but at least for the mutations we find, we can confirm almost all of them. There are very few that don't confirm; the very high confidence ones are almost all real.
Okay, actually this is a nice picture that shows that. This is data from a study that we did: 3,000 mutations taken from a spectrum of triple negative breast cancers. We initially called these mutations using just the allele data, and of course that doesn't account for all the artifacts I just showed you. So we started to account for those artifacts by looking at different features of the data, and when I say features, I mean things like strand bias, presence of an indel, base quality, and mapping quality; I think at least some of those will be covered in the lab. When we account for those, we can start to very nicely separate the true somatic mutations from the false positives and germline variants. That's shown here: we took each of the 3,000 mutations, calculated 100 different features on each one, and projected them using principal components analysis. You can see that the somatic mutations separate quite nicely in this 3D space from the wild type positions (these are false positives) and germline variants, which are shown in red. So this gave us some indication that
we could probably train a classifier using machine learning techniques to distinguish these black dots from the others. We went ahead and did that and showed really good performance in terms of accuracy. This is an ROC curve showing the performance, in a cross-validation scheme, of this multi-feature classifier, called MutationSeq, and we're able to outperform standard methods quite considerably. The other thing to note is that these different curves represent different classification schemes, and the point is that the classification schemes performed about equally well. The most important thing was that we considered all these features: in addition to allele counts, we need to consider things like strand bias, presence of an indel, et cetera, and that dramatically improves performance. These are some standard tools, like GATK and SAMtools, that we compared against.
Then, zooming in on the elbow of the curve here: quite gratifyingly, we were able to take the model trained on one platform, Illumina, and apply it to SOLiD, and it performed remarkably well. The initial training data was exome Illumina data, and we then projected this onto SOLiD whole genome data and got reasonable performance. So accounting for these features is quite important and does translate across platforms. It's not perfect; you'll notice that the curve isn't up here like it was before, but it's much better than the other tools. So I think I'll just... what's the time?
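As one illustration of what a single feature of this kind might look like, here is a minimal Python sketch that scores strand bias with a two-sided Fisher's exact test on reference and variant read counts by strand. This is an assumed, generic formulation, not the feature definition used in the classifier described above; the function names and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def strand_bias_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Small p-value: variant reads are distributed across strands unlike reference reads."""
    return fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)


# Hypothetical pileups: variant reads on both strands vs. only on the forward strand.
balanced = strand_bias_pvalue(20, 20, 5, 5)
one_sided = strand_bias_pvalue(20, 20, 10, 0)
print(balanced, one_sided)
```

A p-value like this would be one column among many in a feature vector; how it is thresholded or combined with other features is left to the classifier.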
Okay, so then what we did is we took all the mutations that were classed as false positives, and we used these features to see if we could group them into different classes. They fell out quite nicely into groups that could explain the reason for the false positive predictions in the first place. We had misalignments due to repetitive sequence, we had strand bias, and we had a very specific GGT to GGG sequencing error. This is where the sequence context is quite important. It's been reported several times now that the sequencer reads these nucleotides as GGG when they really should be GGT, so you might see, for example, T to G substitutions that are quite nicely represented in the data, but it turns out this is just a sequencing error. Why that is, partly, is that there's something in the chemistry that has a bias towards this. It's actually ubiquitous: multiple different platforms have shown this particular error. So that's just something to watch out for. If you have a huge enrichment for T to G substitutions in your cancer genome, you may want to look into that; it's quite important. This was a pattern that we were able to pull out. There may be other trinucleotide combinations that have this problem, but the majority of what I've seen reported is this one. Okay, here's a group with low base quality and also the sequencing error and also strand bias. So basically we're able to categorize sets of variants, and we can explain the error according to these different properties in the data. That's really the point that we're trying to make here. Yes, technical artifacts.
Okay, so now I want to shift gears a little bit. We've talked about copy number changes and we've talked about mutations, and I want to explain how the two intersect and show how copy number changes affect the allelic distributions of mutations. So here what I'm showing is a similar plot to
what I showed before. This is a tumor genome that we've studied quite well; it was originally published in 2009. This tumor harbors a high level amplification of this arm of chromosome 19, and you can see that it results in an allelic split of the heterozygous polymorphisms here. What's interesting is that this region harbors a large number of mutations, and this speaks to the sensitivity of methods. I showed you a model before that assumes essentially three different genotypes for a tumor: AA, AB, and BB. But in a similar way to what I showed for allele-specific copy number changes, the mutational genotype can be affected. You can imagine that if you have four copies, you could have any combination of alleles in the mutational genotype, for example AAAB or AABB, and when we account for that and adjust the distributions, we can really increase the sensitivity of the model. Let me show you that here. We employed this extension of the genotype state space and compared it to the standard approach, and in this genome we found 200 non-synonymous protein coding changes that were unique to the method that allows the genotype to expand according to the copy number. Out of those, we were able to confirm 24 somatic mutations in this genome that were undetectable by standard methods and that now appeared under the new model. An important point is that the original analysis of this tumor, published in Nature in 2009, yielded about 30 mutations, so with this reanalysis we almost doubled the number of non-synonymous mutations in the genome. I'll skip over this; I've said the main points. The point is that the copy number architecture of the genome will distort the ability to find mutations in those regions, and it's important to consider that and bear that
in mind. We talked about false negative rates: false negatives arise for all kinds of reasons, some of which are subclonal mutations, and others are copy number changes that might bury the signal that's present in the data.
Okay, so a summary so far. We talked about binomial mixture models and JointSNVMix, a robust probabilistic framework for modeling allele counts. We talked about joint inference on tumor and normal pairs, we've talked about artifacts in the data, and we've talked about how copy number changes influence allelic distributions. Any questions so far?
That's a good question. This model that I showed you really requires a priori knowledge of what the landscape looks like. For targeted sequencing, the best way is to run an array on the same sample, so you have some notion of what the copy number architecture looks like, and then you can apply that in the context of targeted sequencing. It's difficult because with targeted sequencing, especially if you're looking at parts of a gene or even a gene in isolation, a copy number alteration would typically span a larger region than that. So you just don't know from individual mutations whether they sit in a copy number change or not; you really need a global picture of what the chromosome architecture looks like. Yeah, and even then you can't distinguish between a subclonal mutation and a mutation whose allele ratio is skewed by copy number. It could be subclonal, or it could just be an AAAB type of mutation; we can't distinguish those. But for targeted sequencing, [question partly inaudible] would be less of an issue? Right, sure.
So, available tools. SAMtools is quite an important suite of tools to get familiar with. There's GATK, which I think you have probably touched on already. I wanted to talk about VCF format. Have
you gone over the formats at all? Not VCF? A little bit. Okay, so we'll touch on it in the lab, but essentially VCF has become, for better or worse, the standard representation of variants in the field. VCF stands for Variant Call Format, and there's a specification that I've listed at this URL. Essentially it has a chromosome, a position, an ID, the reference base, the alternate base, and then some information about the particular call of interest. These last few fields are generally free form, so it's a format, but it's not really a format; that's why it's a little bit dubious. Somebody's nodding their head in the back there; you've got experience with this. But it is the community-accepted standard way of representing variants, and a lot of tools assume it. The annotation tools in particular, which you might visit in later lectures and labs, will assume your variants are in VCF, so it's important to get to know it. I'm not going to dwell on it, but you can encode a fair amount of information in VCF.
I've also listed here a number of tools that have been developed specifically for the somatic mutation context. I have to say this literature has grown considerably in the last, I would say, two years. When did we publish our paper? End of 2011, early 2012, so just over a year ago. Now there are probably six or seven reasonably good somatic mutation callers, but 18 months ago there were zero, so the field has matured quite a bit. I think Andy's was the first on the scene. Yeah, right, just around the same time.
Okay, so visualization tools: you'll go through IGV in the lab. And then what do we do with mutations once we have predicted them? I'm not going to spend much time on this except to list these tools here. There's a nice tool called MutationAssessor that comes out of
Chris Sander's lab at Sloan Kettering. It allows one to profile protein coding mutations and to assess the impact of an amino acid substitution in many different contexts. The way this works is that the group took all the pathogenic variants known in the literature to be disease causing, so they went through OMIM, and they classified pathogenic mutations known to exist in both the cancer and germline settings. They looked at the properties of those mutations in the context of their amino acid substitutions, the protein structures, and their evolutionary conservation across species, and then developed a classifier that can score an arbitrary amino acid substitution in the context of what we already know. So this tool takes a particular variant and gives it a score for its potential impact. There's also a nice web interface, which you'll find at this URL, that allows you to visualize the mutation in the context of protein structure. You can see where on the 3D structure of the protein the mutation occurs, so if it's in a binding domain or in some sort of pocket, you can see that, and you can compare different mutations across the protein sequence and see whether they cluster together in three-dimensional space. This is a valuable tool that I quite like, although it's not perfect: for example, the PI3 kinase hotspot mutations are classed as low functional impact. So it's not perfect, but it's pretty good. This is another tool, called ANNOVAR, that I put up here because I think it's going to be used in the next sections. I don't have direct experience with this one.
Okay, so how else can we interpret mutations? A lot of people are talking about TCGA data. This is data from the endometrial paper that was published, when was it, last month? Very recent. Yep,
updated my slides. What this shows is a cancer, essentially uterine cancer, with a really rich mutational landscape. This is in contrast to, for example, the neuroblastoma that I was talking about earlier, where the mutational landscape is essentially barren; there are almost no mutations to even discuss in those studies. This one, on the other hand, is rich with mutations. What's shown here are mutations that are highly recurrent in the population, so many different cases had a mutation, and that's what's shown on the y-axis. You see a lot of the familiar players that we've already talked about: PTEN, p53, PI3 kinase, ARID1A, which I've mentioned. KRAS is there. PPP2R1A is there, and I keep calling this the Melissa gene, but it's not really Melissa's gene, it's everyone's gene; Melissa just knows a lot about it. So these are genes that are highly recurrently mutated in the population. If we put our evolutionary hats on again, this is convergent evolution happening: all these people are unrelated, but they have all developed uterine cancers, and a large number of them have mutations in the same genes. So there's some sort of phenotype that gets selected for when these mutations have accrued.
These stars here represent something important. What we can do in population studies is ask the question: is my gene of interest mutated more frequently than I would expect by chance? What "by chance" means is that we take into account the background mutation rate, that is, how many mutations this tumor has, and the length of the gene, and ask how many mutations we would expect in the gene of interest given that background rate. Then we ask whether the foreground, which is what I observe, is higher than what I'd expect by chance given that background distribution. That relates to this tool here, which is called
MuSiC, Mutational Significance in Cancer; that's essentially what that tool calculates. The genes with stars here are those that are significantly mutated in the population according to those calculations. This is something you'll see over and over again in the TCGA papers. They love this, for better or for worse, and every single TCGA paper will have a figure showing these significantly mutated genes. Okay, any questions on that?
So that's one way of interpreting, and it helps when you have a rich mutational landscape and 500 tumors that you've sequenced. We may not have that situation. This is the other extreme: an n equals 1 experiment, where this group sequenced different metastases from the same individual patient. These are samples from distant metastases where the original primary was a renal cancer. During surgery, maybe during primary surgical staging, these different biopsies were acquired and harvested, and then each one of them was sequenced. Each row here represents one of the regions that was sequenced, and each column represents a specific mutation that was found. The grey boxes represent the presence of a mutation in that particular sample, and the blue boxes represent the absence of the mutation in that sample. You can see that the profiles are quite different. And thank you very much; I've been speaking too much, you guys have got to ask more questions.
Okay, so here what we have is the ancestral clone. These are mutations that are essentially shared everywhere, which is why I call it the ancestral clone: this would be a representation of the mutational profile of the clone that underwent the ancestral expansion. And then we start to get quite significant deviation
from this mutational profile, to the point where we have mutations that are specific to certain regions. This region has only these mutations, so there's what we'd call a descendant clone that shares the mutations from the ancestral clone but has acquired its own set of mutations, distinct from the other regions. This is a very detailed look at one tumor, and what we can see is dramatic divergence in the mutational profiles of different regions of the tumor. It's really a landmark paper that I think is quite important. The point is that in this paper they describe mutations that are regionally isolated and that will have an impact on therapeutic response. That explains why you might have a partial response in the tumor, or an intrinsic resistance to therapy right off the bat. So this is a kind of mid-level resolution between the single cell data that I showed in the previous lecture and the sort of one-sample examination that is typically associated with this type of analysis. It gives you a better resolution on what the entire mutational landscape looks like in a tumor. I have to direct you to the paper; I can't remember.
Okay, so let's now talk about a different concept. We talk a lot about genes: we project mutations onto genes, we look at the coding regions, and we say we have non-synonymous mutations. So we focus on that one percent of the genome. But of course in whole genome sequencing experiments we have a rich set of mutations, often tens of thousands that we can predict, and we can leverage that data to learn something about the biology of the tumor. What this paper describes is a set of so-called mutational processes in 21 breast cancers. By sequencing the whole genomes of these 21 cancers, they noticed that there were specific substitution patterns that fall out,
and so there's generally an enrichment in C to T mutations. Then, looking at the trinucleotide context, that is, which base comes before the C and which base comes after it, they were able to sub-classify these tumors into different groupings. They found five different signatures, and what's really quite striking is that when they calculated the signatures for each of these patients, the patients clustered into BRCA1 and 2 wild type breast cancers, patients that don't have either somatic or germline BRCA1 or 2 mutations, and a different class of tumors with BRCA1 or 2 germline mutations. To be fair, this was a supervised analysis, where the study design was really to try to understand the mutational mechanisms in BRCA1 and 2 germline patients: the Angelina Jolies of the world, who didn't have cancer yet, but her mother and her aunt did. These are BRCA1 and 2 carriers, and they have a significantly different mutational signature in terms of which nucleotides are substituted. That suggests a mutational mechanism associated with the BRCA abnormality. The speculation in the paper is that these are due to cytosine deaminating enzymes called the APOBEC proteins, and that this presents a potential therapeutic vulnerability in these cancers: if you drive that process in those tumors, then maybe those cancers will mutate themselves to death and the cells will die. So this is really quite nice, and it's gaining prominence in the field. As we get more and more whole genomes, we're gaining better insight into the mutational mechanisms that exist in these cancers. There's a paper emerging presently from the Broad that shows the spectrum across 300 different cancers and shows that they can really be classed into different tumor types according to their mutational spectra. So, for example, melanomas that are induced through environmental insult due to UV exposure have a particular mutational
signature; lung cancers that are associated with nicotine and tobacco smoke have a particular mutational signature, et cetera. So there's some really nice biology that can be extracted from just looking at the mutational signatures, and this is independent of the gene content, so to speak. Okay, all right, I'm getting close to the end here. Any questions on this? This is a relatively new development in the field, so let's get some comments.
Okay, so let's talk a little bit about clonal evolution again, and we can wrap up with this concept. What this shows is that we can look at the cellular prevalence of mutations, and we can do this on a temporal axis. This represents a follicular lymphoma that's been sequenced, where we have a number of mutations, and then a secondary biopsy from the same patient after relapse. So this patient has been treated, had an initial response, and came back some time later with a relapse. We've done some work on sequencing the genomes of these pairs, and we can then compare the mutational profiles and, using deep sequencing, the abundance of clones in the different biopsies. Here's a clone, for example, centered around a cellular prevalence of about 60%, meaning about 60% of cells harbor this particular mutation in the primary, or first, biopsy, and in the second biopsy it's completely absent. The inference is that that's a clone that was probably extinguished by therapy, so it doesn't exist anymore. But on the other axis we see a set of mutations that is completely absent in the primary biopsy but seems to have expanded, or grown up, in the relapse biopsy. We know we're sequencing the same tumor because a large number of mutations are shared between the primary and the relapse, but it's these guys here that may indicate something about resistance. So this really illustrates quite nicely the
temporal element of how clones shift over time, and that we are sequencing mixtures of cells: depending on the selection pressure, there's a dynamic nature to the composition of cells over time. I was going to illustrate this with a breast cancer study, but I think I'm out of time, so I'll wrap up this section; you can look at the slides and read the paper that I've outlined here.
I just want to end with a couple of thoughts. I hope I've convinced you today that there really are enormous statistical challenges going forward, and as new technologies emerge, new challenges follow. For mutation calling, we are still trying to understand the sources of artifacts. We're getting much better at this, but there are all kinds of factors, like base calling, alignment, and positional error rates, that can affect the calls. The biology of cancer is exceedingly complex. I was complaining earlier that few tools exist specifically for cancer data. That is improving now: most of the tools that are emerging have considered, for example, copy number changes and tumor-normal admixture, and tools specifically to assess mutational heterogeneity are coming online. Andy and I have worked on a problem to really try to quantify the degree of clonal diversity in a tumor sample, and we're working on getting that work published. I think what will emerge in the next little while is that single cell genomics is going to become quite an important field for understanding the clonal diversity that exists in tumors, and specifically for understanding which clonal genotypes confer resistance to drugs and what gets selected for or extinguished in the presence of a drug. Then there's sequencing cell-free circulating tumor DNA. This is quite an exciting field that I think is going to emerge as probably one of the more exciting developments in tumor monitoring in the last, probably, couple of
decades. What this is, again leveraging the deep digital technology of these next-generation devices, is that one can isolate cell-free DNA that's been shed from tumors into the circulation, and through experimental protocols isolate that DNA and sequence specific mutations, and the allelic prevalence of those mutations can be an indicator of tumor burden. So through non-invasive methods one can follow a patient throughout their chemotherapy cycle and follow them afterwards, and by measuring the allelic abundance of these mutations get some indication of whether there's a response or whether there's potentially a relapse coming. A few of the early papers have demonstrated quite convincingly that it's better than circulating tumor cells, or better than imaging, for example, at predicting relapse, and that it can predict a relapse up to 11 months in advance of imaging techniques, which are currently the state of the art. So you can imagine that it's all about capturing these things early, and so this is really promising technology that involves sequencing and its interpretation. And then again, the last example I showed revolves around evolutionary dynamics, understanding how clonal selection is operating in different contexts. So I would say that all these problems really represent new statistical challenges that will need to be addressed, and we really want to make advances in that area and use appropriate statistical techniques in order to maximize the biology.
So I just want to finish, then, with a quote from Peter Nowell's paper, and just permit me to read this here. He says: the acquired genetic instability and associated selection process, most readily recognized cytogenetically, results in advanced human malignancies being highly individual karyotypically and biologically. Hence, each patient's cancer may require individual specific therapy, and even this may be thwarted by emergence of genetically variant sublines resistant to the treatment, and research should be
directed towards understanding and controlling the evolutionary process in tumors before it reaches the late stage usually seen in clinical cancer. So I think this is really an amazingly prescient paragraph from 1976, and the thing is that he, of course, didn't have access to these incredible devices that we have access to now. I think there's actually really great hope that, for example, through tumor monitoring, through ctDNA and other mutational profiling techniques, we can actually get a handle on this. And I think that the road to personalized therapy, of course, will be paved with mutations, and so knowing the mutational content of a particular tumor can really help in the therapeutic process. There's a lot of work left to be done in terms of identifying targetable and actionable mutations and really understanding how drugs drive selection and resistance in different microenvironments in the tissue. So I'd like to conclude there and just thank you for your attention for the day, and there are a couple of lectures, and I'm happy to take any questions. Thank you.
It's an amazing thing to know absolutely everything, but what do you think the stopping point is? How many cells do you actually have to test?
Right, so I think we're in a phase right now of exploration and discovery, and so we don't yet know. For example, we're only learning about single mutations as potentially conferring resistance, but we have no idea about mutations co-occurring in the same cells and how that actually has an impact on drug resistance or drug selection. So just knowing combinations of mutations that might have some sort of selective advantage would be, I think, quite advantageous, and we're just in early days there. So there'll be, I think, quite a bit of exploratory work, but eventually there'll be some knowledge gained from that, and the exploratory work at that level will probably stop, and then it can be applied, hopefully, in
terms of these sorts of diagnostic tests in patients. So I think for the foreseeable future, the next decade at least, we're still in this exploratory phase where we can drill down at finer and finer levels of detail to really understand the biological properties that mutations are conferring. And ultimately, as you know, the acid test for this is in functional models, really trying to take these mutations that we discover, introduce them into animal models, test their effects, and see what they do. So I would say that the discovery phase is still relatively early, even though the TCGA is now coming to full fruition. But they haven't, for example, yet looked in depth at multiple samples from the same individual, they haven't looked at the single-cell level, they haven't looked at the dynamics of pre- and post-treatment, and all of that. So there's a lot of work to be done, and we're still in the early phases, I would say, of really trying to understand the evolutionary properties of tumors.
What about targeted sequencing? Is that something that's productive at this point?
Yes, I think there is a subset of very well characterized mutations that can potentially indicate a therapy that can be administered, and that work is being done all over the place. As I said, it's becoming routine; these are becoming clinically certified tests that can now be administered to patients, and big centers like MD Anderson, the Mayo Clinic, the Dana-Farber in Boston, and even here in Toronto the OICR and the OCI have a clinical trial ongoing to do targeted sequencing. In our own center, patients are being looked at in this way, and these are basically so-called last-hope patients: they've failed all kinds of standard therapies, or they have primary tumors of unknown origin, and so the oncologist doesn't know what to do with them anyway. So they've been enrolled in this project, and between 15 and 20% have yielded a genetic abnormality that suggests a particular therapy or intervention. In those cases there's been reduction in tumor size. It's early days to know whether that improves quality of life and how much that prolongs survival and all of that, but at least in the very early pilot projects, all of the indications are highly favorable that more information is better.
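As a rough illustration of the kind of tally described in that answer (what fraction of enrolled patients carry at least one mutation on a targeted panel), here is a minimal Python sketch. Everything in it is assumed for illustration: the panel entries, the patient identifiers, and the mutation calls are made up and are not from the trial or any clinical panel. Only BRAF V600E is actually named in the talk.

```python
# Hypothetical targeted panel: (gene, protein change) pairs treated as actionable.
# Illustrative entries only, not a clinical panel; BRAF V600E is the one
# mutation named in the talk.
ACTIONABLE = {("BRAF", "V600E"), ("EGFR", "L858R"), ("KRAS", "G12C")}


def actionable_hits(calls, panel=ACTIONABLE):
    """Return the patient's mutation calls that appear on the panel, sorted."""
    return sorted(set(calls) & panel)


def fraction_with_hit(cohort, panel=ACTIONABLE):
    """Fraction of patients in the cohort with at least one panel mutation."""
    if not cohort:
        return 0.0
    n_hit = sum(1 for calls in cohort.values() if actionable_hits(calls, panel))
    return n_hit / len(cohort)


# Made-up cohort: patient id -> list of (gene, protein change) calls.
cohort = {
    "patient_01": [("TP53", "R175H"), ("BRAF", "V600E")],
    "patient_02": [("TP53", "R248Q")],
    "patient_03": [],
    "patient_04": [("EGFR", "L858R"), ("PIK3CA", "H1047R")],
    "patient_05": [("KRAS", "G13D")],
}

for patient, calls in cohort.items():
    print(patient, actionable_hits(calls))
print("fraction with an actionable hit:", fraction_with_hit(cohort))
```

In this toy cohort two of five patients have a panel hit. A real pipeline would work from variant call files and a curated, clinically validated panel rather than hand-typed tuples.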