Okay, so today we're going to talk about finding somatic mutations in cancer genome sequence data. And I'd like to begin with a borrowed slide from Mike Stratton that really describes how cells evolve over time and how they accumulate mutations somewhat stochastically over their lifespan. And occasionally, if one of those mutations randomly happens to hit a so-called driver gene, it can initiate tumorigenesis and the cell can become malignant. So these cells, over time, undergo a process where their genomes change. So really the cancer arises from the endogenous cell itself. And this is something that is really quite important to understand: mutations just accumulate over time. And as I said yesterday, if we all lived long enough, we'd likely all develop cancer of some kind. So tumors undergo a process of Darwinian evolution under selection, where the individual is not an individual organism, but rather the individual cell. So you can really imagine this process whereby cells' genomes change, they undergo selection, and those cells whose genomes give them a proliferative advantage will be selected for throughout their life cycle. So really what we're after is to find mutations such as this mutation here in P53. This is an actual example from some data that I've been working with, and this we believe is a tumor-initiating event. And so what we're focusing on today, as opposed to yesterday, are single-nucleotide changes, so right at the base-pair resolution of the genome. Yesterday we looked at larger changes, more than a kilobase, let's say, and today we're going to look at actual single-nucleotide changes. So in this case P53 is what's called a tumor suppressor gene, and defects or mutations in P53 can allow cells to evade programmed cell death and DNA repair. So I'm going to show this slide again that I showed yesterday, and now I'll ask you to focus on the last component of this table.
And these are genes, BRAF, KRAS, PI3 kinase, EGFR, where mutations are actually actionable by clinicians. So in KRAS there's a particular mutation, at codon 12, for which there are targeted therapies. And in melanoma there's BRAF, at codon 600, and there are RAF inhibitors for that particular mutation. So it's very advantageous to try to find mutations in cancers, because eventually we can develop therapies against them. And especially if they're driver mutations, then that's really a major goal of what a lot of groups are doing right now internationally. So if the way to find mutations is by sequencing genomes, then why should we want to sequence cancer genomes? Well, this figure here is from the classic paper from Hanahan and Weinberg that discusses the six hallmarks of cancer. I don't know if you've seen this figure in this workshop already. Yes, you have? Okay, great. So, and I'm not surprised by this by the way, what wasn't really covered in this paper is what genetic abnormalities underpin the ability of tumor cells to achieve these oncogenic properties. And moreover, how do these genetic abnormalities change over the natural history of a tumor? And what genes and pathways are actually disrupted due to somatic genome aberrations? These are all fundamental questions that underpin the mechanisms by which tumors acquire their malignant state. So there have now been just a really astounding number of early successes with next-generation sequencing applied to cancer. I would argue that more than any other field, cancer has benefited the most from sequencing technology, because it is a disease of the genome. That's what cancer is. It's a disease of altered genomes. And in order to understand cancers, we need to understand the genomes. The way to understand genomes is by sequencing them. So there have been a number of recent discoveries.
New cancer genes, such as FoxL2, ARID1A, EZH2; IDH1 mutations in glioblastoma were discovered recently; PPP2R1A, and that whole complex has now been implicated in numerous cancers. And these are some studies that I've been involved in. There are many others as well. We've also seen insights into tumor evolution. So really for the first time, we can now study how genomes change over time. So we can look at samples from a primary tumor and, say, a distant metastasis in the ensuing years, or a recurrence. And we can compare those genomes and see how similar or different they are, and how those genomes may have changed under the selection pressure of therapy. This is something that was inconceivable even five or six years ago. We've also gained enormous insights into the genomic architectures of cancer. So yesterday we looked at copy number changes and rearrangements, I believe, as well. And really these two types of architectural changes define the structure of the genomes. What we've learned to appreciate by sequencing is that the genomes of many epithelial cancers are far more rearranged and far more abnormal than we could have even imagined. And so this has led to enormous new insights into the architectures of the genomes of cancer. And really what all this leads to is what I'm going to call the redefined mutational landscape. So the mutational landscape of a tumor is really the full complement of mutations in a given tumor. It describes what the tumor is. It defines it with a fingerprint. And really, up until now there are about 100,000 mutations or so stored in COSMIC. It's growing exponentially. And I predict that in the next three to five years we'll see an increase of at least one, maybe two orders of magnitude in the number of mutations found through this technology.
And so groups like the International Cancer Genome Consortium and The Cancer Genome Atlas are really poised to completely rewrite the book on how we understand cancers from a mutational perspective. So this is a very exciting time to be in this field. So, a brief look at sequencing technology; you've probably seen this already. But essentially the way this works is that next-generation sequencing, or high-throughput sequencing if you will, approaches single-molecule sequencing. DNA fragments are anchored to molecules on a solid surface, and basically each of these molecules is copied in situ by PCR to amplify what we call the template. And then nucleotides are added one at a time in cycles, and the nucleotide that is added, in a massively parallel way, is read by image analysis. And really this way of doing an in situ type of PCR on a chip, on a flow cell, as Illumina does it, has really revolutionized the way that sequencing can be done in a massively parallel fashion. And the throughput these days is quite staggering. Five billion sequence reads at 100-base-pair run lengths yields about 500 gigabases of sequence, over 150-fold haploid coverage, in a matter of 10 days. So just to put this in perspective, I'm sure John went over this already, but this dwarfs the scale of the Human Genome Project in just a matter of days, and for $5,000. Pretty amazing. Okay. So in addition to cost and throughput, one of the real advantages for cancer is that because it approximates single-molecule sequencing, we can actually read the proportion of alleles in the sample in a digital fashion. So say, for example, we have a tumor sample where 30% of the cells harbor a mutation. When we do our sequencing, approximately 30% of the reads will harbor that mutation. And so we can get really precise measurements of the abundance of mutations in our given samples.
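This digital counting can be sketched very simply; the function name here is just illustrative, not from any particular pipeline.

```python
def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a position that carry the variant allele.

    Because each read comes from a single molecule from a single cell,
    this fraction approximates the abundance of the mutation in the sample.
    """
    if total_reads == 0:
        raise ValueError("no coverage at this position")
    return alt_reads / total_reads

# e.g. 150 variant-supporting reads out of 500 total reads at a position
print(variant_allele_fraction(150, 500))  # 0.3
```

The deeper the coverage at a position, the more precise this estimate becomes, which is what makes the deep targeted sequencing described later in the talk so powerful.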
And it allows us to get at issues like mutational heterogeneity in a way that we could never have done before. So it's the digital nature of how this technology works that gives it an additional advantage beyond cost and throughput. We'll talk about this a little later on as well. Okay. So here's a really, really simple and schematized workflow for how we might go about discovering mutations. We obtain reads in an unaligned form, and we use some sort of software to align the reads to the human genome. So when we get the data, we don't know where these reads actually come from in the genome. This is basically a schematic of a paired end. I think you've probably gone over all this stuff, so I'll be quick about it. So we go from unaligned reads to aligned reads. Once we have our aligned reads, we try to find the features in our genome of interest that are different from the reference. And so these are inferences or predictions at this point. And for single-nucleotide variants, what we usually do, depending on the experimental design, is screen out known polymorphisms in a cancer setting, for reasons similar to what we discussed yesterday. And then, to try to pick the low-hanging fruit, we will likely screen out synonymous or non-coding mutations, because they're a little bit more difficult to interpret. And so then what we're left with is a set of non-synonymous changes that affect the protein-coding sequence of the gene they're contained in, okay? And then what's really important is to initiate validation. So currently, I don't think the field in general really understands all of the artifacts that this technology produces. Just like any other genomics technology, and you've now gone through a few of them, gene expression microarrays, high-throughput genotyping arrays, sequencing is no exception: there are machine-based artifacts.
So anytime you engage in a high-throughput type of activity that's generating a lot of data, there are almost certainly artifacts, and sequencing is no exception. So you shouldn't be under any illusions that you're going to take your tumor sample and your matched normal control, sequence both, and easily find all the mutations; that's not the case. So validation is quite important. We need to use some sort of orthogonal experiment to confirm the presence of a prediction. And usually there are three outcomes. A particular variant of interest could be confirmed as somatic. Often we have the illusion of what looks like a somatic mutation, but it turns out to be a germline polymorphism upon validation. And then there are false positives, which are just artifacts that give the illusion of a mutation but are actually not variation at all. So then once we have gone through a validation process, the task is really to try to assess some sort of clinical or functional relevance for the mutations that we found. And one way to do that is just by assessing recurrence. So how often is a gene mutated in a larger population? Taking the small number of genes that have emerged from our validation exercise, we can do a more targeted, focused analysis on a larger set of patients to try to infer clinical relevance. And then ultimately what we want to do, for a new mutation, is figure out exactly what it's doing biochemically. And that's a much longer process that involves model systems and in vivo experiments to try to understand, when you induce a mutation, what actually happens to the biochemistry of those cells. So that, in a very simplified way, is what mutational discovery experiments usually involve. Okay, any questions on that? Yes?
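The filtering steps in this workflow can be sketched as a small pipeline; the record fields, gene names, and positions here are hypothetical, just to make the steps concrete.

```python
# Hypothetical variant record: (gene, position, effect, in_polymorphism_db)
predictions = [
    ("TP53",   7578406, "missense",   False),
    ("GENE_X", 1234567, "synonymous", False),  # dropped: harder to interpret
    ("GENE_Y", 2345678, "missense",   True),   # dropped: known polymorphism
]

def candidate_mutations(variants):
    """Screen out known polymorphisms and synonymous/non-coding changes,
    leaving non-synonymous candidates to carry forward into validation."""
    return [v for v in variants
            if not v[3] and v[2] not in ("synonymous", "non-coding")]

# Validation (an orthogonal experiment) then sorts each surviving candidate
# into one of three outcomes: confirmed somatic, germline polymorphism, or
# artifact (false positive).
print(candidate_mutations(predictions))  # only the TP53 missense survives
```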
So what if during your validation experiment you find SNPs that were not included in your predictions? So that's kind of challenging. Then you'd have to revalidate those in a different way. Usually the validation experiments are very targeted, so you're asking the question: of my list of predicted events, how many can I reconfirm in the validation experiment? It does happen that we discover something de novo in a validation experiment. But from a pure experimental design perspective, it's probably not principled to then pull that out, because it's not comparable to how the others were found in the first place. Okay, so I'm just going to talk about some use-case examples of some published work that I've been involved in, just to illustrate some of these points. So I work in both ovarian cancer and breast cancer. And when this technology first came about, it was of great interest to a colleague of mine, David Huntsman. And so we embarked on a project to try to look at the different subtypes of ovarian cancer. And the reason why this is important is that ovarian cancer is really not one disease. It's made up of several different diseases that differ in many regards. This slide depicts immunohistochemistry, protein expression of a panel of markers. And you can see that these four different subtypes differ quite a bit in terms of their protein expression profiles. And moreover, they have distinct epidemiological and genetic risk factors, different precursor lesions, different biomarker profiles, and completely different clinical behavior. And the real issue when we embarked on this is that at that time, and really still today, the biomarker studies and treatment protocols were not subtype-specific. So we were trying to use a broad brushstroke to look at what are really very distinct variations of a disease. And you couldn't even call it one disease. It's a collection of different diseases.
And so we asked the question: do ovarian carcinomas feature subtype-specific mutations that can be developed as biomarkers for early diagnosis, or as therapeutic targets? And at this time, whole genome DNA sequencing was a little bit infeasible to do at a population level. So we actually embarked on an experiment to sequence the transcriptomes of a number of cases. And we looked at these five subtypes here, with high-grade serous being the most common, and then granulosa cell tumors, these are the two that I'll highlight, being the most rare, and actually phenotypically distinct as well. So the result of this was the discovery of a recurrent mutation in a gene called FoxL2. And now, just to illustrate some of the steps that we went through to actually find this: we sequenced 15 different cases. Four were granulosa cell tumors, another five were clear cell cancers of the ovary, another four were endometrioid carcinomas, and then there were a couple of different cell lines. And what we found, using the schematic workflow that I described earlier, was about 300 to 500 non-synonymous variants in each case. And at this time, this was really the first time we'd encountered any data like this. And we were just kind of baffled: what do we do with that? That's a lot of events in each tumor, and how are we going to make sense of this? So we decided to be extremely stringent and really focus our energy on the original hypothesis, which was looking for subtype-specific and recurrent mutations. So we took the perspective of the granulosa cell tumors and we asked how many variants were present in three or more of the granulosa cell tumors. And that led us to filtering the list down to 29 positions. And then we asked, okay, of those, how many were unique to granulosa cell tumors? How many were subtype-specific? And that again filtered the list down to an extremely small list of five.
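That two-step filter, recurrence within the subtype and then absence from all other subtypes, can be sketched like this; the case names and variant labels are made up for illustration.

```python
from collections import Counter

# Hypothetical calls: case -> set of variant positions found in that case.
gct = {"GCT1": {"A", "B"}, "GCT2": {"A", "B", "C"},
       "GCT3": {"A"},      "GCT4": {"A", "D"}}
other_subtypes = {"CCC1": {"B"}, "EC1": {"D"}}

def recurrent_and_specific(subtype_calls, other_calls, min_cases=3):
    """Variants seen in >= min_cases of the subtype and in no other subtype."""
    counts = Counter(v for calls in subtype_calls.values() for v in calls)
    recurrent = {v for v, n in counts.items() if n >= min_cases}
    seen_elsewhere = set().union(*other_calls.values())
    return recurrent - seen_elsewhere

# "A" appears in all four granulosa cell tumors and in no other subtype
print(recurrent_and_specific(gct, other_subtypes))  # {'A'}
```

Stringent filters like this trade sensitivity for interpretability: they shrink hundreds of variants per case down to a handful that can be inspected and validated by hand.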
And then upon inspection, and you'll be doing a little bit of this in the lab, we realized that three of these were actually just artifacts. So that left us with only two, and we carried those forward for validation. And one of them did not validate. And what we were left with is the following mutation. So this is at position 402 of the transcript of a gene called FoxL2. It's a transcription factor. And the mutation induces a cysteine-to-tryptophan amino acid change, which is really kind of a massive amino acid transformation in terms of biochemistry. And the remarkable thing is that all four of the granulosa cell tumors had exactly the same point mutation in the genome. And this was not found in any of the other tumors we looked at, nor was it found in any of the polymorphism databases that existed. So we got quite excited about this. And then we looked at the matched normal DNA and confirmed that indeed this mutation was somatic as well. So once we had that, we executed the next step, which was to look in a larger cohort of tumors. And so David, my colleague, has been quite generous with donating samples over the years. Yes? What if we hadn't found the mutation in the genome, so it was only there in the transcriptome? Well, we actually have found variants like that, which are transcriptome-specific. And there's a phenomenon called RNA editing, a known phenomenon whereby, post-transcriptionally, the message gets modified. And there has been a lot of speculation as to what these RNA edits do. Some speculation is that in certain organisms it's used for immune evasion. So in trypanosomes, for example, they undergo significant editing of their transcriptomes in order to evade the immune system of the host. In a tumor that we sequenced, and I'll talk about it in a few minutes, we actually determined that some of these edits can induce non-synonymous changes in the protein-coding sequence. And so it really has the potential to alter function.
And so that's actually a potentially new class of variation in cancer that we just don't really understand. There have been some studies now emerging, although I think there's some controversy in the literature about some of the recent studies on RNA editing and how well they were actually executed. But nonetheless, it does seem to be a phenomenon in normal cells as well. And it potentially is a way to regulate expression of proteins, by changing three-prime UTRs, for example, so that miRNAs can't bind there. But that's all fairly speculative at this point. Anyway, had we not found it in the genome, we'd have been a bit perplexed, but ultimately we probably would have called it an RNA edit, and it would have made a different story altogether. Yeah, so then moving on, we collected a set of granulosa cell tumors from around the world. This is a pretty rare disease, so we had to call in favors internationally. And what we found is that of the 89 tumors that we were able to collect, 86 of them had the exact same change. Okay, so this is a probably almost unprecedented level of recurrence in a tumor. And then we looked at the same position in 800 other cancers and never found it again. So it's a highly recurrent, highly specific change that essentially defines the disease. And histologically this disease is on a sort of a spectrum, whereby a pathologist has to make a call based on their intuition. So they look down the microscope, look at the morphology, and say, oh, it kind of looks like this, so I'll call it a granulosa cell tumor. Now we have a precise molecular test, based on the genome, that defines what this disease is. Okay, so we went from a disease that could be difficult to diagnose by histology, and this finding provides a diagnostic and a novel target for therapeutics. And the other point I want to make is that FoxL2 was not implicated in cancer at all prior to our study. So it wasn't even on the radar screen.
Okay, so this is really where sequencing shines. And I didn't mention this before, but a lot of the genes that we know about so far have really been found by targeted analysis or targeted experiments. So an investigator will have a gene of interest and maybe sequence just that gene; because of their prior work they might have a prior assumption that this is an important cancer gene, and they may find mutations. It's a very low-throughput and somewhat biased way of doing mutation discovery. What the sequencing technology allows us to do is look at the whole genome in an unbiased fashion and make discoveries like this in a gene that maybe wasn't on the radar screen. Yeah. Is this the type of ovarian cancer which is also implicated with inherited BRCA? No, this is not. No, this is the much rarer subtype; it's a sex cord stromal tumor, it's non-epithelial. What you're thinking about are high-grade serous cancers; about 20% of those are BRCA1 inherited, okay. Is the prognosis for this type just as bad as for that tumor? It's actually not. But what happens to patients with these tumors is that they have consistent recurrences, and eventually patients will undergo maybe 15 years of repeated surgeries to extract the tumors. There's no targeted chemotherapy that will eradicate the disease, and eventually, not to be grim about it, but basically patients will usually succumb to having had too many surgeries. Okay, so we then focused our attention on a slightly more common but still rare subtype, called clear cell carcinoma of the ovary, or endometriosis-associated carcinoma, whereby endometriotic cysts can lead to tumors. And so we applied the same type of technique, and we found mutations in a gene called ARID1A. And what was quite different about this is that we found truncating mutations. So these are mutations that induce a premature stop codon. Did we go over that stuff at all? Missense, nonsense mutations?
Yes? Yeah, okay, so we did. Okay, so these are quite different in nature. These are truncating mutations that induce a premature stop codon. What we found, in contrast to the granulosa cell tumors, where the exact same position was recurrent, is mutations that were spread throughout the gene. And if you think about the consequences of that: you could think of the granulosa cell tumors as having a very specific change, and that change is probably having some sort of very, very specific function. So we can maybe call that a gain of function or a switch of function. In this case, we saw stop codons peppered throughout the gene, and that's really a loss of function. So the hypothesis there is that it doesn't really matter where you induce a stop codon; as long as you do that, those transcripts will get degraded and proteins won't get made. And so on the top, we're showing the discovery cohort, where we have 15 tumors. We found that seven of them harbor a stop codon mutation or an insertion or deletion in this gene. And then we went to look at the extension cohort, and we found that there were stop codon mutations peppered all throughout the protein, and it was highly recurrent, at about 46% in clear cell carcinomas. Yeah, sure. So I have to admit that this was really done by scanning spreadsheets. We just noticed that, look, there are a lot of stop codon mutations in this gene, let's check it out. So it was more of a hunch. I wouldn't claim that there was a real systematic approach to finding this. We were just browsing the data that we had generated, which was already a very, very filtered-down list. So that allowed us to browse it without looking at 100,000 sites; we could look at fewer than a few hundred sites. And that's actually doable for a human, with your brain, just looking through this. So that's how this was found.
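The contrast between the two patterns, one recurrent hotspot (FoxL2-like, suggesting gain or switch of function) versus truncating mutations scattered across the gene (ARID1A-like, suggesting loss of function), can be expressed as a crude heuristic. The thresholds and type labels below are illustrative assumptions, not rules from the study.

```python
def mutation_pattern(mutations):
    """Classify a gene's mutation list as hotspot-like or loss-of-function-like.

    Each mutation is a (position, type) pair; type labels like "missense",
    "nonsense", "frameshift" are assumed here, not a real ontology.
    """
    positions = {pos for pos, _ in mutations}
    truncating = [m for m in mutations if m[1] in ("nonsense", "frameshift")]
    if len(mutations) >= 3 and len(positions) == 1:
        return "hotspot (possible gain/switch of function)"
    if len(positions) > 1 and len(truncating) > len(mutations) / 2:
        return "dispersed truncating (possible loss of function)"
    return "unclear"

foxl2_like = [(402, "missense")] * 4                       # same site, 4 cases
arid1a_like = [(p, "nonsense") for p in (121, 455, 1301)]  # spread across gene
print(mutation_pattern(foxl2_like))
print(mutation_pattern(arid1a_like))
```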
I wish I could claim that there was a beautiful algorithm that discovered this, but no. But the human brain is sometimes better than computers. Sometimes, yeah. I did say sometimes. Okay, so then what we did, and David carried this forward, is we wanted to look at hundreds of tumors. And we found that the mutation status actually correlated with immunohistochemistry, which is essentially a measurement of protein expression of the gene. So when there was a mutation, we often saw loss of the protein expression. And so we were able to look at many, many tumors in this way, and essentially determine that around 50% of these cancers harbored mutations in this gene. And the other thing about this gene is that it's involved in a chromatin remodeling complex, which is an important complex for genome stability. And what we found is that this was highly recurrent, and that has now prompted the next steps, which are to do whole genome sequencing of 50 of these carcinomas and their matched normals. And really the question there is: we found the mutation in 50% of the cases, but what's happening in the other 50%? That's underway now, and we're engaged in that study. So to sum up some of the ovarian cancer research: sequencing and robust analysis has revealed these subtype-specific mutations. And this is what the hypothesis originally was, that these diseases are distinct and probably harbor subtype-specific mutations. And by sequencing we've actually shown this. FoxL2 provides a diagnostic tool for granulosa cell tumors of the ovary, and ARID1A provides a novel tumor suppressor and implicates the SWI/SNF complex as an important biochemical pathway. And the other thing that's really important is that these two studies illustrate two types of mutations that I think we need to be cognizant of.
One is where you're targeting a very specific part of the protein and where there's a gain of function. And that's usually very recurrent, even by amino acid position, where a similar substitution is observed in many, many cases. In contrast, ARID1A is a loss of function, where you'll see truncating mutations or indels throughout the protein. And that usually indicates that it may be a tumor suppressor. P53, for example, is very similar to ARID1A in this regard. In fact, it's been shown now, just through personal communications, from sequencing a very large number of cell lines, that ARID1A is likely the second most frequently mutated gene in cancers. And what's remarkable about this is that it was not on the radar screen before sequencing technology came to be. All right, yes. Okay, so this table says that in clear cell carcinomas, 46% of the cases have the mutation. In other endometriosis-associated cancers such as endometrioid carcinoma, 30% of the cases have the mutation. But in high-grade serous, which is the most common subtype, none of them had it. They're all ovarian cancer subtypes? That's right. Sorry, I didn't explain that well. Okay. Okay, so now switching to breast cancer. So around the same time that we were doing all the work on ovarian cancer, I was engaged in a project to sequence a lobular breast cancer. And this was quite fortuitous, because at the time, when the sequencing technology was just coming online at the Genome Sciences Centre in Vancouver, we had found this tumor. This is work with Sam Aparicio, led by Sam. And he had identified a tumor for which we had a pleural effusion metastasis that had emerged nine years after the original primary tumor. But we had the primary tumor sample in the tumor bank already. And so we were able to ask some pretty unique questions here.
And first of all, we were able to ask, and this had never been done before at this point: what mutations were actually present in the metastatic tumor? Could we use genome sequencing technology to fully define the mutational landscape of this tumor? And then we asked how many new mutations or aberrations arose over time. So we could compare the profile of the metastatic tumor to its matched primary from the same individual, from nine years earlier. And then we asked as well: can this digital allelic abundance counting be used to detect heterogeneity in samples? So these are the major questions, and I'll just illustrate how we went about trying to address them. So by today's standards this was maybe not so staggering, but even about three years ago, this was an amazing amount of data we generated. We generated three billion paired-end reads, or 120 gigabases of aligned sequence of the genome, and also about six gigabases of aligned sequence of the transcriptome. So we also sequenced the transcriptome of this tumor. And the results of this were, again following a similar workflow, about 1,500 novel non-synonymous variants, and then we engaged in a pretty comprehensive validation exercise. So we took those 1,500 or so and we wanted to actually validate every single position and check its somatic status. So this is validation by Sanger sequencing. And we were able to Sanger sequence about 1,100 of these; about 450 or so were confirmed as variants, and 32 were confirmed as somatic. So in this experimental design, which I don't advocate, but we couldn't afford the proper experimental design at the time, we only sequenced the tumor. We did not sequence the matched normal. We had the DNA, but we chose to look at the normal DNA in a targeted way rather than a comprehensive way. At that time, that was the most cost-effective way.
Things have changed to a point now where that's no longer the most cost-effective way; it's much more cost-effective to sequence matched pairs. And you can see why: of the 450 or so confirmed variants, only 32 were somatic. Sequencing the matched normal up front would have removed most of those germline variants. So after arriving at these 32 somatic changes, we asked our questions by comparing them to the primary tumor DNA. And what was really remarkable is that only five of these mutations were present in the primary tumor at high allelic abundance. So these are mutations from the metastatic tumor; only five were present in the primary at high abundance, where the majority of cells harbored the mutation. Interestingly, six of these were present at very low frequencies. They were detectable in the primary, but at low frequencies. And this is significant because it suggests that only a portion of the cells in the primary tumor at the time of diagnosis harbored these mutations. And so that was very good evidence of mutational heterogeneity at the time of diagnosis. The primary tumor was already heterogeneous, made up of different clonal populations. And that has significant implications for how to study these tumors in the first place. I think we talked about intratumoral heterogeneity a lot. And then, remarkably, 19 of these mutations were just not there in the primary tumor at all. So there had been significant mutational evolution over the course of the life history of this tumor. And one of the sobering results of this is that we took these 32 mutations, or a subset of them, and looked at the same positions in 192 other cancers. And the only recurrent variants we found were in three cases with nearby mutations in ERBB2, which we talked about, and some truncating mutations in a gene called HAUS3, but only in two cases.
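Bucketing each metastasis mutation by its allele fraction in the primary, as in the three classes just described (present at high abundance, present at low frequency, absent), might look like this. The cutoffs are illustrative assumptions, not the values used in the study.

```python
def classify_in_primary(primary_vaf, high=0.20, detect=0.01):
    """Bucket a metastasis mutation by its allele fraction in the primary.

    high:   fraction above which we call it dominant in the primary
    detect: fraction below which we consider it undetected
    (both cutoffs are assumed here purely for illustration)
    """
    if primary_vaf >= high:
        return "dominant in primary"
    if primary_vaf >= detect:
        return "minor subclone in primary"
    return "not detected in primary"

# Allele fractions measured in the primary for five metastasis mutations
vafs = [0.45, 0.30, 0.13, 0.02, 0.004]
print([classify_in_primary(v) for v in vafs])
```

Mutations in the middle bucket are the interesting ones for evolution: they mark subclones that existed at diagnosis and later expanded in the metastasis.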
And the rest of them were not seen again. So that suggests that within breast cancer, mutational profiles are extremely, extremely heterogeneous at the mutational level: there's heterogeneity within individual tumors, and there's heterogeneity between patients as well. Okay. What was also notable is that most of the genes had, once again, never been seen before in any cancer, and certainly not in breast cancers. So again, it illustrates the power of this unbiased whole genome sequencing approach to reveal new potential cancer genes. Okay. So I think I'm going to skip ahead at this point. Do you consider those mutations to be drivers in that case? Well, we think we know what the driver mutation is in this one, because we found a mutation in a gene called PALB2, and that's one of the genes that was mutated in the primary tumor. And that is a very nice way to try to get at driver mutations: by doing this sort of matched analysis, whereby the early events should be present. So if you look at a metastasis and then ask which mutations were present in the primary tumor, the driver events are probably among those present in the primary. So it drastically reduces the candidate drivers in the analysis. We go from 32 mutations down to five candidates. And one of those candidates is PALB2, which is partner and localizer of BRCA2. And so that's likely what we think is going on there. When you're doing that kind of a study, do you look at the germline variants at all? I mean, potentially that person has something in the germline that predates all those other mutations in the primary. Yeah, so in this study, that's very difficult, because it's an N-equals-one experiment. So it's hard to ascribe any kind of susceptibility from one patient.
But what will likely emerge in various circles is that there are now large-scale efforts underway to sequence families with hereditary breast cancer but without BRCA1 or BRCA2, without the known markers for hereditary breast cancer; approximately 50% of hereditary breast cancer is unexplained. And so there are huge efforts underway now to do things like studying 1,000 different patients and patient families and sequencing their whole genomes to try to find, in an epidemiological way, the susceptibility risk factors. Yes, yeah, that's a great question. We don't suggest that it's related because the mutations are the same, but we didn't find zero overlap. We found five mutations that overlap. So can you quantify, when you do this type of paired analysis, how many cells in the primary? Yeah, okay, so I thought I'd gloss over this, but since you asked the question, I'll go back to it. So this is where this technology, I think, can really shine. What we did, in a very targeted way, is design amplicons by PCRing just the mutations of interest. So we designed primers around our 32 somatic mutations and amplified that DNA up. Then we pooled it all together and sequenced it on one lane of Illumina. And you can imagine that with the throughput of this technology, this produced an incredible amount of redundancy. Almost 10,000-fold in some cases, whereby we could look at how many reads, or what proportion of reads, aligning to the position of interest contained the mutation. And so we can use some statistics to answer whether the mutation was there or not. But more importantly, we can look at the actual frequencies of the mutation. So here I'm showing the depth of the primary. This is the number of reads we were able to obtain at that position in the primary tumor, and then the proportion of those reads that harbor the mutation. So here you have 50% down to about 25%.
And we deem these what we call dominant mutations; these are present in the majority of cells. And then in the metastatic tumor, you can see what proportion of cells harbor the mutation as well. So these were somewhat comparable. Then we move to the next class, where we have somewhat dominant mutations in the metastatic tumor, ranging from about 28% to upwards of 60%. And really, these were present in the primary tumor, but at much lower frequencies. So you can see here that only 13% of the alleles that we sequenced actually harbor the mutation, down to almost sub-1%. So this is the precision that we can get. We can look at mutations in a deep sequencing and targeted fashion, whereby you're getting multiple tens of thousands of reads aligning to a particular position, and we can start to get at even sub-1% presence of alleles. Wouldn't you say that those are the most important priority ones for the metastasis? Because they were low frequency in the primary, and they were enriched in the next step. Absolutely. So that's the other side of the question: what's driving the metastasis. And these have obviously been selected for over time, because they were present in a small number of cells, and then they were selected for to become dominant mutations in the metastasis. And why do you think that one was the most important one for the primary? Because it was already present at high abundance in the primary, and because of its function, we know what it does. That's our hypothesis for this particular tumor. So my question, looking at this data, is: do you know in the primary if all these mutations are actually present in the same cell? Or is the metastasis still composed of multiple cells that launched from the primary tumor? Otherwise, you would expect the frequencies to be the same. Yeah, that's right. So there is some variation in the frequency in the metastatic tumor as well. And it's almost certainly the case that it's not uniform. It's probably heterogeneous as well.
And then finally, there's the last group here, which is this list of mutations where the levels were not above the detection limit of statistical noise, or the background error rate. And so these were basically deemed as not present at all. All right. The denominator is the depth here. So basically, what this is, is the number of variant alleles over this column here, which is the number of total reads aligned to the position. So let me just go back; basically it's like this. In this particular example, the depth would be the total number of reads here, and the ratio would be this number of variants, in this case the A's, which looks like about 80% or 90%, over the total number of reads. Okay, so then just to summarize. In this project, there were 32 novel somatic mutations revealed in the metastatic tumor. Approximately 28 of the genes were not known to be mutated in cancer before we embarked on this. Comparison with the primary tumor revealed significant mutational evolution and intratumoral heterogeneity. I didn't get into this, but comparison of the transcriptome to the genome revealed that there was widespread RNA editing; to get back to Francis's question, we found a lot of variants, and confirmed them, that were present only in the transcriptome and not in the genome. And this just raises a lot of questions. What are these edits doing? Are they specific to cancer? We don't know. Do they have any kind of function in cancer? We don't know. It's something that we're following up and engaged in heavily at this point. Is that the same? Actually, they are the same type of numbers, yeah. We've reported that. We looked at the non-synonymous changes in a targeted way, but there are many, many edits in UTRs that are detected with very clear signals.
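To make the "above background error" idea concrete, here is a minimal sketch of how one might test whether the variant reads at a deeply sequenced position exceed what sequencing error alone would produce. The depth, the 1% error rate, and the significance threshold below are illustrative assumptions, not the values used in this study.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that background
    sequencing error alone produces k or more variant reads out of n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical deep-amplicon position: 500x depth, 15 variant reads,
# and an assumed 1% background error rate (illustrative numbers only).
p_value = prob_at_least(15, 500, 0.01)
present = p_value < 0.001   # error alone is very unlikely to explain the reads
```

With tens of thousands of reads per position, the same tail probability separates genuine sub-1% alleles from machine noise, which is what makes the frequency estimates in the table meaningful.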
So what I want to emphasize as well, and we'll discuss some of the details of this, is that although I didn't stress it in the more biology-focused part of the lecture, bioinformatics is the discovery engine in these cancer genome sequencing projects. That's how the discoveries are made. They're made in the computer. We validate in the lab, but the initial discoveries and initial insights are made through computational analysis. And I think what's also becoming very clear in this field is that if you're going to study cancer genomes, and you're all here for a cancer genomics workshop, you have to know about sequencing. And to understand sequencing, you have to be adept computationally. It's the only way to do it. The microscope of today for cancer genomics is the computers in front of you, or the computers you're about to see in the tour at the OICR, which are much bigger machines, I guess. But nonetheless, this is the way it's done, and this is the way it's going to be. So if you're going to be in this field, becoming adept computationally, or at least partnering with people who are computationally adept, is a requirement. So then, just to follow up: currently what we're doing in the breast cancer space is sequencing 100 tumor-normal pairs of triple negative breast cancer. It's the most aggressive form of breast cancer and accounts for about 15% of breast cancers. We have a submitted manuscript under review for that. Okay. So what's our schedule? We're going to 10:30 before we break? Okay. Good. So is everyone ready for some statistics now? Yeah? All right. So now that we've discussed the motivation and some of the success stories behind sequencing, let's get into some of the nitty-gritty of how to actually analyze these genomes.
So I mentioned this yesterday, but I'll mention it again: cancer genomes have specific properties that warrant specialized analytical strategies. Okay, so the first property is the tumor-normal admixture problem. Tumor DNA is often contaminated with DNA from non-malignant cells, and this may dilute important biological signals. Intratumoral heterogeneity: we've just seen that cancer is often a mosaic of cellular populations that are genomically distinct. This must be considered. As you saw yesterday, many tumors undergo genomic instability, so a large number of copy number changes, loss of heterozygosity, and genomic rearrangements, all of which you looked at yesterday, will distort the expected allelic distributions. And then finally, the experimental design to capture somatically acquired mutations is quite different from most epidemiological studies that are studying nominally normal genomes or looking for Mendelian inheritance types of diseases. The data are generated from a pair of DNA samples from the same patient, usually a tumor sample and a normal sample. In some cases now, many groups are looking at multiple samples from the same individual, same tumor, or looking at time series. And Robert's engaged in a project like that, in fact, looking at how genomes evolve. So the experimental designs are very, very different, and the statistical tests and the analysis that go into looking at those questions need to take that into account. Okay, so when you do a sequencing experiment, you get data back, and it looks something like this. It's just a bunch of sequences, and you have no idea where they come from. Well, it doesn't quite look like this; it's a little bit more organized, in a nice flat file, usually a text file of some kind. But for all intents and purposes, it's like this. You don't know where these reads come from.
So thankfully, we have the reference human genome, which was generated by the Human Genome Project. And essentially, the first task is to take this set of reads and align them. And so we get something like that: a much more organized and ordered representation of the data, whereby we have some idea, mostly, of where these reads originated from in the genome. So what I've done here is colored, once we have alignments... actually, I should talk about alignment first. So how do we do this? It's essentially like a giant jigsaw puzzle. What you're trying to do is assemble all these reads based on a scaffold that's provided by the reference sequence. And there's a whole host of tools that have been developed. One of the earlier robust methods was MAQ. Its next iteration is BWA. But there are other tools, SHRiMP and MOSAIK, that have different properties. And SHRiMP was developed in Michael Brudno's lab here in Toronto. So there are a number of tools available for alignment. And they all essentially have something in common, whereby they chop up the reads and do string matching against the genome with some sort of mismatch tolerance, so that single nucleotide variants can be captured. And what we can do once we have alignments is look at where we have variations in the genome compared to the reference. That's where these red nucleotides here are highlighted. So if we look at that, we can take that matrix of sequences, and we can compress it down into what I'm going to call allele counts. So I mentioned that this is digital sequencing, whereby we can actually count the number of alleles. And what we want to try to do is find these positions in the genome where there are variations compared to the reference.
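The compression from a column of aligned bases to allele counts can be sketched like this; the pileup string and reference base are toy inputs, not real data.

```python
from collections import Counter

def allele_counts(pileup_bases, ref_base):
    """Compress the bases observed at one genomic position into
    (reference count, non-reference count)."""
    counts = Counter(b.upper() for b in pileup_bases)
    ref = counts[ref_base]
    alt = sum(n for b, n in counts.items() if b != ref_base)
    return ref, alt

# Toy pileup at a position whose reference base is 'C': two reads carry 'A'.
ref, alt = allele_counts("CCCACCACC", "C")   # -> (7, 2)
```

Doing this at every covered position produces exactly the two vectors described next: reference counts on top, non-reference counts below.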
And so if we compress that down into two vectors, where the top vector represents the number of reads that match the reference and the bottom vector is the number of reads that have a mismatch, then we get a representation like this. And we want to be able to ask this question at all three billion positions. Here, with just a small fixed number of positions, you can look at it and say, okay, I know where the variants are, no problem, right? You can visualize that. But you're not going to be able to do that for the whole genome. So you need a principled way to ask the question: given all my alignments and my allele counts, how can I find variations? So we embarked on a project whereby we wanted to model these alleles. We'll use a little bit of mathematical notation now, whereby we let a_i be the number of reference reads at position i. I'm going to build up, on the right, what's called a probabilistic graphical model. And you don't need to know all the details about that, except that it's a nice way of representing a statistical model. So we have two variables that we can represent, and n_i is the total number of reads at the given position. These are shaded quantities, and that's because they're known; they're observed. These are things that we observe. Now, we have some unobserved things that we want to actually ask about. We want to ask, in a very similar representation to the genotyping that we looked at yesterday, but instead of major and minor allele, we have reference and non-reference alleles, okay? So I'm going to use lowercase a and b to represent that. So it's very similar in nature; it's just that it's not major/minor, it's reference/non-reference. Okay, so we have three possible genotypes. Let's assume a diploid state. We talked about non-diploid states, and we'll get into that with sequencing as well.
But for the time being, let's assume a diploid state. And we can ask the question at each position: which one of three possible genotypes is most likely to have given rise to that data? Okay, and that's what we're trying to infer here. That's the major question. What we can do is, in a probabilistic way, induce what we call a prior over this set of genotypes. And we can say, well, most of the positions, we know, are going to be homozygous for the reference. So most of the positions in the genome will actually be AA. Does that make sense? Because humans are mostly the same, and tumors are mostly the same as their normal DNA. And then we can set the rest of the probabilities on the other two accordingly, okay? And then we have a parameter. This parameter, mu, represents what we call the genotype-specific parameter of a binomial distribution. Binomial distributions are great for modeling things like coin flips, where you have two possible outcomes, okay? So you can imagine that there might be three different coins that we're trying to model here. We're trying to guess the bias of each of these coins. And so you might have a coin that mostly comes up heads, okay? And that will be, with some exceptions, the coin that represents the AA genotype, okay? It's going to come up mostly reference. And then you're going to have a coin that is, let's say, balanced, 50-50. And what's that going to represent? That's AB, a heterozygous position, right? We have half and half. And then you have another coin that represents homozygous for the variant, and that will be highly biased towards the variant alleles. Okay, and those precise quantities are what we're going to try to infer as well. So this is an unshaded quantity, and we're going to try to infer that. So this is published; this is a tool called SNVMix that was published in Bioinformatics in 2010.
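As a rough sketch of this kind of binomial-mixture inference (the priors, coin biases, and toy counts below are illustrative assumptions, not SNVMix's published values, and the EM loop with fixed mixture weights is a simplification of the actual fitting):

```python
from math import comb

GENOTYPES = ["AA", "AB", "BB"]
PRIOR = {"AA": 0.995, "AB": 0.004, "BB": 0.001}  # most sites match the reference

def binom(a, n, mu):
    """Binomial likelihood of a reference reads out of n with coin bias mu."""
    return comb(n, a) * mu**a * (1 - mu)**(n - a)

def genotype_posterior(a, n, mu):
    """Posterior over the three diploid genotypes given a ref reads out of n."""
    joint = {g: PRIOR[g] * binom(a, n, mu[g]) for g in GENOTYPES}
    z = sum(joint.values())
    return {g: p / z for g, p in joint.items()}

def fit_mu(counts, mu, iters=100):
    """Crude EM re-estimation of the genotype-specific coin biases from
    (ref, depth) counts: the 'estimate the parameters from data' step."""
    for _ in range(iters):
        resp = [genotype_posterior(a, n, mu) for a, n in counts]   # E-step
        for g in GENOTYPES:                                        # M-step
            den = sum(r[g] * n for r, (a, n) in zip(resp, counts))
            if den > 0:
                mu[g] = sum(r[g] * a for r, (a, n) in zip(resp, counts)) / den
    return mu

# Toy admixed-tumor data: heterozygous sites drift toward ~60% reference reads.
counts = [(59, 60), (58, 60), (36, 60), (37, 60), (35, 60), (1, 60)]
mu = fit_mu(counts, {"AA": 0.99, "AB": 0.5, "BB": 0.01})
post = genotype_posterior(36, 60, mu)
call = max(post, key=post.get)   # heterozygous AB despite the skewed counts
```

Note how the fitted "AB coin" lands near 0.6 reference rather than 0.5, which is exactly the admixture skew discussed next.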
And really, what we showed in this paper is that, especially in the cancer context, it's very important to estimate these parameters from the data, because of the skew we saw due to tumor-normal admixture and due to intratumoral heterogeneity; if you estimate the parameters from the data, you get much more accurate results. There are other tools, such as the one you're going to use today, GATK, which is really designed for normal human genomes, that use fixed parameters rather than estimated parameters. So they fix, for example, the AB coin at 0.5. But what we know from cancer studies is that a somatic mutation will likely occur in only a proportion of cells, and so maybe the most you ever get is something like 0.4. And that makes a big difference in the inference at these positions, okay? All right. So that was how we model alleles for a single sample. However, as I mentioned, most experimental designs in tumors induce a tumor-normal pair design. And ultimately, what we want to do is identify what we call joint genotypes of samples based on paired analysis. Some use cases are, for example, tumor-normal pairs, primary-metastatic pairs, or DNA-RNA pairs. There are many different configurations of a paired type of analysis. And really, the goal in a tumor-normal setting is, again, to separate germline variants, which should be present in both the tumor and the normal, from somatic mutations, which are tumor-specific signals. Does that make sense? Yeah? Okay. So if you just plot probabilities, this is sort of what we hope to see. What we have here is an increasing number of reference reads for the normal on the left and the tumor on the right. It's just a heat map encoding, where red indicates high values and blue indicates low values, and there's a spectrum in between. And so you'd expect that when you have mostly reference reads in both the tumor and the normal, that would induce a wild-type genotype.
So there's no change there at all, okay? Now, where you have mostly reference reads in the normal, but you have variation in the tumor, across here, okay? Then that would be evidence of a somatic mutation. Okay, does that make sense? This part's important. Okay? And then I'll just go over here for a second. This is the part of the landscape that would be representative of loss of heterozygosity. Here you have heterozygous positions in the normal, but you have relatively homozygous positions, either reference or non-reference, over here in the tumor. And so this part of the landscape is indicative of loss of heterozygosity, in a very similar way to what we described yesterday. And then finally, you'll notice that most of this landscape is actually dominated by germline variants. And this is important, because often with these sequencing technologies you may have, for example, very weak signals in one or the other sample. And this is especially a problem when the signal is very weak in the normal. So if you were to induce some sort of threshold and say, okay, I'm going to analyze the normal and analyze the tumor independently, and in the normal you have just a very weak signal of a variant, you threshold it out and you don't count it. But then in the tumor, you have a nice signal of a variant. That will induce the illusion of a somatic mutation, because you'll have called the variant in the tumor but not in the normal. And so the analysis that I'm about to describe tries to get past this by doing what we call simultaneous, or joint, inference. And it should be more sensitive to shared signals, because they can borrow statistical strength. And you don't need to know the underlying mathematical model, but essentially this is what it looks like. It's an extension of the original SNVMix, whereby the genotype is now a joint genotype.
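The joint genotype space, and the way its regions map onto the wildtype/somatic/LOH/germline landscape just described, can be sketched as follows (the labels are a simplified reading of the heat map, not the model's actual probabilistic output):

```python
from itertools import product

GENOTYPES = ["AA", "AB", "BB"]
# Joint genotype space: cross product of normal x tumor diploid genotypes.
JOINT = list(product(GENOTYPES, GENOTYPES))   # 9 states: ("AA","AA"), ...

def classify(normal, tumor):
    """Label a joint genotype the way the tumor/normal landscape is read."""
    if normal == "AA" and tumor == "AA":
        return "wildtype"
    if normal == "AA" and tumor in ("AB", "BB"):
        return "somatic"     # variant signal in the tumor only
    if normal == "AB" and tumor in ("AA", "BB"):
        return "LOH"         # heterozygous normal, homozygous tumor
    return "germline"        # shared variant signal (simplified catch-all)

labels = {jg: classify(*jg) for jg in JOINT}
# e.g. labels[("AA", "AB")] == "somatic"
```

Joint inference assigns a posterior over all nine states at once, which is what lets a weak variant signal in the normal pull a position back toward the germline states instead of faking a somatic call.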
So we actually have nine possible genotypes, because you have three in the normal and three in the tumor, and we just take the cross product of that. So we have AA-AA, AA-AB, AA-BB, et cetera. And the variants of interest, of course, are of the AA-AB and AA-BB variety, where you're AA in the normal and you're variant in the tumor. So that's really what we're after. And what we've shown in this work, just looking at the right figure here in the interest of time, is that with this joint SNVMix model, in red, when we look at the top candidate somatic mutations in a number of different tumors, the proportion of those variants that are in polymorphism databases is much lower than if we do, for example, independent analysis. And that suggests to us that we're much better able to trap germline alleles, which would be present in polymorphism databases, than if we were to just do independent analysis in the first place. So this is an example of this very specialized, cancer-focused analysis that only arises in this type of cancer setting, and where there are clear advantages to trying to take advantage of the experimental design that we're working with. It's just by way of illustration of how one can actually do a lot better by developing specialized analytical strategies for the cancer setting. Okay, so here's another specialized analysis in the cancer setting. We saw yesterday how copy number changes can influence allelic distributions. And you're familiar now with these types of plots. This is from the same breast cancer that I was talking about earlier. Here you see, for chromosome 19, the normal copy number from the array; we did an array on this as well. And then you see the B allele fraction plots on the bottom. You can see that it's nicely heterozygous throughout the chromosome.
Then, looking at the tumor DNA, you can see that there's a distinct allele-specific copy number change here that's really skewing the proportion away from heterozygosity. This is the same data acquired from the sequence data; we can infer copy number from the sequence data, and we touched on that yesterday. And then, looking at the allele counts from the digital processing that I mentioned just before, you can see that the very same phenomenon exists, although it's arguably much clearer in the sequencing data than it is in the array. So how can we take advantage of this? Well, basically, we borrowed the same type of ideas from the array work, whereby copy number changes will induce additional genotypes. So we extended SNVMix to go from the diploid state to multiple copy number states. And by doing that, we found somatic mutations that we were not able to find in the original analysis. So it made the algorithm much more sensitive, without losing specificity, by considering copy number changes. Okay, so this is the very same idea, and without dwelling on the details, basically an amplification induces an additional set of genotypes. And then we can model, instead of the three coins that we had before, five or six coins, as the case may be, depending on the copy number change. And the mutations that we found were all of this variety here, where we had AAAB, for example, where the mutation was happening on a very small proportion of the alleles. It was basically undetectable by the other methods, and by extending the method, we were able to find them. So again, not to dwell, but basically we found 24 additional somatic mutations upon reanalysis of the same genome. This is the genome where we had those 32 somatic mutations in the breast cancer. And by extending the analysis to account for copy number changes, we actually found an additional 24. So we almost doubled the number of mutations.
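The extra "coins" induced by a copy number change can be enumerated like this; the error adjustment for the homozygous states is an illustrative assumption, not the extended model's actual parameterization.

```python
def extended_genotypes(copy_number, error=0.01):
    """Genotype states and reference-read 'coin biases' for a copy number.
    At copy number 4 you get AAAA, AAAB, AABB, ABBB, BBBB instead of the
    three diploid coins; homozygous states are error-adjusted (illustrative)."""
    states = {}
    for b in range(copy_number + 1):            # b = number of variant copies
        name = "A" * (copy_number - b) + "B" * b
        mu = (copy_number - b) / copy_number    # expected fraction of ref reads
        states[name] = min(max(mu, error), 1 - error)
    return states

tetraploid = extended_genotypes(4)
# An AAAB site: the mutation sits on only 1 of 4 copies, so only ~25% of
# reads carry the variant, which a diploid AA/AB/BB model can easily miss.
```

This is why the AAAB-style mutations were invisible to the diploid analysis: their allele fraction sits between the diploid AA and AB coins, where neither genotype fits well.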
And this is a similar table to what I showed earlier. We did deep sequencing to validate these, and indeed these were in proportions that were quite a bit lower than what we observed for the other mutations. And you can see that in the normal, basically, most of these are just not present at all. And so these were deemed somatic. All right, so in terms of statistics: we use binomial mixture models fit to the data in a robust probabilistic framework for modeling allelic counts. These probabilistic graphical models, as I showed, are quite extensible and flexible. And so there's a real advantage to working from first principles when developing these analytical models, because they can be extended quite easily. And what I've shown is that joint inference of tumor-normal pair data results in increased specificity when predicting somatic mutations. And finally, copy number changes can influence allelic distributions, and we can take advantage of that to increase sensitivity to mutations. So I'm seeing a lot of tired looks, so I think at this point we're going to take a little break and we'll come back. Oh, is it early? Okay, well, we can soldier on then. So one last comment before we take a break: again, the data that is produced by these machines is difficult to handle, and innovation in mathematical and statistical models, as well as in software tools and the application of software tools, is what is going to discover the mutations. That's what it is; it's computation. There's no other way to look at this data. So don't be under any illusions that you can sequence three billion base pairs and then browse it while you're sitting in your living room. It's not going to happen. The way to look at this is computationally.
The field is immature to the point where there aren't really great GUI tools yet, where you can sit and do point-and-click analysis. They're coming, and there are some services, some companies that have now emerged, where you can just deposit your data and they'll analyze it for you on the cloud, for example. But that way, you're somewhat detached from your data, and you're depending on someone else to do the analysis, and you don't always know what they did. And it's your science. You should know what's going on at every level. And so it's going to be difficult, but at least you're here, so you're obviously all expressing an interest in trying to learn analysis. And I think that is what this science is becoming. It's becoming a quantitative science, and learning how to do analysis is critical to it. So with that, we'll take a break. Okay, so continuing on where we left off. I now want to move beyond allelic counts, because allelic counts only tell part of the story. A very important part of the analysis of sequence data is to look at technical artifacts. And technical artifacts induce many false predictions. Here's an example. This is an IGV plot of a tumor on top and a normal on the bottom. And it looks suspiciously like a somatic mutation: you've got variants in the tumor, and you have no variants in the normal. However, this turned out to be false, due to these reads being misaligned to this location. So this is an artifact that we need to watch out for. Insertions and deletions: here's a region where there's a deletion in the actual reads, or in the sample itself. Likely a somatic deletion, although we see some evidence of it in the normal down here. And what this is doing is not allowing the reads to align properly.
And again, it's creating the illusion of variants that are really just due to the aligner not being able to cope properly with the presence of the deletion. So this is a phenomenon that happens quite a bit. The GATK toolkit that you're going to use in the lab has a way to compensate for this by doing local realignment around indels. It helps a little bit, definitely. So here's an example; you can't see it very well, but in the tumor, we have the presence of some reads. And if we use base quality thresholding... so I should mention that while the data is digital, in the sense that we can do allele counting, in fact the bases that are called by the machine have an associated probability with them. So it's not just "it is this base"; it's "this base with a certain probability". And those probabilities are called base qualities. And so sometimes what we do is threshold away the low-quality bases. In this particular case, there are variant bases that are just above the threshold that we used. So these are still poor quality, but they're above our threshold, and so they get counted. They're most likely sequencing artifacts and cannot really be trusted, but they sort of leak through, and we called this a variant, but it turned out to be false. Here's one that's quite important. It's called the strand bias problem, and it's induced by a PCR artifact. I actually don't fully understand how it happens in PCR, a strand-biased PCR artifact. What ends up happening is that often we get a read, or a fragment, that gets duplicated many times, and it may contain a sequencing error as well. And so it creates the illusion of a variant, but it can be caught, because all the variant reads are in the same orientation, okay? And so that's called a strand bias effect. We'll look at that in the lab as well.
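One common way to flag strand bias (not necessarily the exact test used in any particular caller) is Fisher's exact test on the 2x2 table of allele by strand: if reference reads are balanced across strands but every variant read comes from one strand, the table is significantly skewed. The read counts below are invented for illustration.

```python
from math import comb

def fisher_exact_p(fwd_ref, rev_ref, fwd_alt, rev_alt):
    """Two-sided Fisher's exact test on the 2x2 strand-by-allele table."""
    n = fwd_ref + rev_ref + fwd_alt + rev_alt
    row1 = fwd_ref + fwd_alt      # forward-strand reads
    col1 = fwd_ref + rev_ref      # reference reads
    def table_p(a):               # hypergeometric P of a table with top-left a
        return comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)
    observed = table_p(fwd_ref)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(p for a in range(lo, hi + 1)
               if (p := table_p(a)) <= observed + 1e-12)

# All 12 variant reads on the forward strand, reference reads balanced:
p = fisher_exact_p(fwd_ref=30, rev_ref=28, fwd_alt=12, rev_alt=0)
biased = p < 0.01   # strongly strand-biased, so the variant is suspect
```

A genuine heterozygous variant should show variant reads on both strands in roughly the same proportion as the reference reads do.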
So that's another artifact to look out for. Here's one more. This one actually has the plots reversed: the bottom is the tumor and the top one is the normal. This one looks beautiful. It's very clean. There's no evidence of base quality problems, alignment problems, indels, no strand bias, but it's not real. I don't even know what's going on with this one. So the point I'm trying to stress here is that allelic counts are nice because they can model the allelic abundance in the sample. However, those counts are confounded by many, many different technical artifacts that are induced by the machines, induced by alignment, and induced by properties of the genome as well. In fact, there are certain parts of the genome where it's just very difficult to call variants, simply because they're highly repetitive and it's hard to align reads there. And Mark and I were talking the other day about how he's seeing some errors in exome capture platforms, whereby a lot of miscalled variants are just in certain parts of the probe, or the tail end of the probe, where you get mishybridization of fragments to the probe that's trying to pull them down in the capture experiment. So hopefully I've convinced you by this point that you will not simply do DNA extraction from the tumor, DNA extraction from the normal, and get a nice list of somatic mutations. It's not simple; it's a very complex problem that a lot of effort has gone into. And anecdotally, I can tell you about conversations with leaders of the TCGA analysis, the American consortium. They asked several groups, from Baylor, from the Broad, from Wash U, from Berkeley, to take a set of data and call somatic mutations. The same set of data, each using their preferred method to call somatic mutations. And then they came back at a subsequent meeting and compared the results, and the overlaps were really quite tiny.
So that's kind of disconcerting and troubling, but it gives people like me strong motivation to try to improve methods for calling these somatic mutations. So it's a non-trivial exercise, and it's still, I would say, in its infancy. However, that doesn't mean we should be daunted by that. We should just keep soldiering on, because obviously there are discoveries to be made, and we're making discoveries as we go. It's just that it's difficult to do this in a very systematic way. This is why validation is still extremely important at this point. Okay, so here are some examples of true positives. These are real examples, and I'll switch back again, so the tumor is on the top and the normal is on the bottom. In this case, we have a very small proportion of reads that contain a variant, and none in the normal. So maybe somebody can elaborate, or just guess, as to what's going on here. What could you speculate is happening with this variant? Look at the number of reads there, and then look at the number of reads that contain a variant. This has been a running theme throughout the whole thing. We've talked about it from day one. What's going on? I've mentioned it about five times. Maybe six. Yeah, low frequency. So this is probably representative of a mutation that's only present in a small proportion of cells, okay? And these are actually quite hard to detect. You can imagine that there might be 25 reads here, 50 reads, let's call it 50 reads, and only three of them represent the mutation. And that's barely above the level of noise in the machine as well. And so you have to look at that signal in the context of some of the other artifacts we were looking at, and really try to pull out this signal from all the other noisy things that we've looked at. So that's quite challenging. Here's one that's very similar. Here you just have two reads that have a variant, okay?
And these are real; we've validated these and we've deep sequenced them, okay? So given all these problems, what are the solutions? How do we go beyond allelic counts? There are a couple of nice software tools that have been developed for analysis of 1000 Genomes data, and I should stress the caveat that neither of the tools I'm about to describe has been focused on cancer studies at all. However, they're quite effective and robust, they have huge user communities, and so there are a lot of engineers behind them making them nice, robust software tools. The first is SAMtools, and I've given some URLs for where to learn more about SAMtools. Essentially it's a suite of tools for working with alignment files in the so-called community standard SAM or BAM format. It allows you, in a UNIX environment, to manipulate these huge alignment files in a very efficient way. They're fast and memory efficient, and we use SAMtools in our lab multiple times daily. The other tool, which is quite complementary but has a lot of the same features, is the Genome Analysis Toolkit, or GATK as people call it now. There's a nice paper that describes GATK. It works with the cloud if you want to work on the cloud, it's implemented in Java, and this is actually what we're going to use in the lab. The other thing I wanted to introduce is this format called the VCF, or Variant Call Format. This is, again, a community standard, but it will likely, I think, become a real de facto standard for variant calling, simply because of the communities developing it, namely the Broad, who are involved in a huge number of projects and are really driving the development of this format along with the 1000 Genomes people. So in the lab, I would really encourage you to follow this URL here: understanding the UnifiedGenotyper's VCF format. It's really complex.
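To give a flavor of what a VCF record carries, here is a minimal sketch of pulling caller-computed features out of one data line. The line itself is fabricated; field names like DP (depth), MQ (mapping quality), and FS (strand-bias score) follow VCF conventions, but the values are made up.

```python
# Sketch: extracting the fixed columns and INFO features from one VCF
# data line, using only the standard library. The record is fabricated.

line = "17\t7578406\t.\tC\tT\t228.0\tPASS\tDP=52;MQ=58.2;FS=1.7"

def parse_vcf_line(line):
    """Split a VCF data line into its fixed columns plus an INFO dict."""
    chrom, pos, _id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "qual": float(qual), "filter": filt, "info": info_fields}

rec = parse_vcf_line(line)
print(rec["info"]["DP"])  # read depth at the site, as a string
```

In practice you would use an established parser rather than hand-splitting, but this shows where those base quality, mapping quality, and strand-bias annotations actually live in the format.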
There's quite a lot of information there, but it's something that would require a bit of independent study. I'm not going to go through every single field, but at least I'll point you to the resource and you can look at it and learn about it yourself. We'll explore this VCF format in the lab. The real strength of this is that it computes many features about the data, which can be used to remove poor-quality predictions. So this really moves beyond allelic counts and gives you all the contextual information about base quality and mapping quality and strand bias and the presence of indels; all those things that I touched upon with examples are computed in this VCF format by the GATK tools. So given that, we tried to take advantage of it and see what we could learn from some of our own data in this field. I mentioned that we've been sequencing these triple negative breast cancers, and we took the approach of being very liberal with our variant calling, or somatic mutation calling. So we actually tried to validate 3000 positions in these 50 tumor-normal pairs of breast cancer exomes. The results were basically that we were able to revalidate about 1000, and the remaining 2000 were either wild type, so we never saw the variant again, or they were germline, in which case they had just been missed in the normal in the original exome data. So we had 1000 true positives and 2000 false positives. And so, being a computer scientist with experience in machine learning, I thought it might be nice to see what we could learn from this data computationally: extract all the features that we can using SAMtools and GATK, and then, using machine learning techniques, learn a classifier that can distinguish between these true positives and false positives. So we embarked on this.
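A toy version of that idea is sketched below, under some loud assumptions: just two fabricated features (one mimicking mapping quality, one a strand-bias score) stand in for the many SAMtools/GATK features, the training data is simulated rather than validated calls, and plain logistic regression by stochastic gradient descent stands in for whatever classifier one actually prefers.

```python
import math
import random

random.seed(0)

# Fabricated training data: (features, label). Label 1 = validated somatic,
# label 0 = false positive. True somatics are simulated with higher mapping
# quality and lower strand bias than artifacts.
data = ([([random.gauss(55, 5), random.gauss(2, 1)], 1) for _ in range(100)]
        + [([random.gauss(35, 5), random.gauss(8, 2)], 0) for _ in range(100)])

# Standardize each feature so gradient descent behaves.
cols = list(zip(*(x for x, _ in data)))
means = [sum(c) / len(c) for c in cols]
stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
        for c, m in zip(cols, means)]

def scale(x):
    return [(xi - m) / s for xi, m, s in zip(x, means, stds)]

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, scale(x))) + b
    z = max(-30.0, min(30.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression fit by stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(50):
    for x, y in data:
        g = predict(w, b, x) - y
        sx = scale(x)
        w = [wi - lr * g * xi for wi, xi in zip(w, sx)]
        b -= lr * g

accuracy = sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

The point is not the particular model; it's that once each candidate site comes with a vector of caller-computed features and a validated label, separating true from false positives becomes a standard supervised learning problem.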
And this is my PhD student, Jerry Ding, who's been leading this. And really, the results are quite beautiful. So here it shows a principal component analysis that beautifully separates the two classes when we look at features. So now, not just looking at allelic counts but at all these features computed by GATK and SAMtools, we can beautifully separate the somatic mutations from the non-somatic mutations. And basically, on an independent validation test set that also has ground truth associated with it, we get very nice accuracy metrics when applying this classifier to new data. So the point is that we can take advantage of all these features, and even though there is a massive amount of systematic artifact in the data, we can use sophisticated computer science and machine learning techniques to learn, in an automated way, how to distinguish true positive from false positive mutations. Good. So moving on to visualization tools. We've talked about IGV, and often, as I've shown, visual inspection can reveal obvious artifacts as well. So if you've got a small study and you're just looking at a few tumors, and you get, let's say, a list of 100 variants, it's often tractable to just browse them all. Spend the day, spend two days, and just browse them. You get familiar with the data and learn what a real mutation looks like compared to a false mutation. And that's what we're going to do in the lab: give you examples of false positives and examples of true positives, and you're going to actually pull them up in IGV, have a look, and see for yourself what the differences are. So, annotation tools. One thing we need to do is go from genomic positions, which are just coordinates on a chromosome, and contextualize those positions in terms of genes. And one tool that does this very nicely is a tool called Mutation Assessor.
It's developed at the Sloan Kettering Institute in New York, and there's a paper, a tool, and a web interface, so we'll use that in the lab as well. Basically, you go from genome position information to protein functional impact predictions, whereby you can see precisely where a variant sits on the protein. You can look at that in a 3D conformation; you can see where in the 3D conformation of the protein the mutation is affecting the amino acid. You can look at multiple sequence alignments to see how well conserved that particular amino acid is in the context of evolution. And obviously, if it's a highly conserved site that's mutated, the probability of functional impact is much higher, because you're disrupting something that's been selected for over time. So it's a very nice tool that I highly recommend you check out. And again, we collaborate with this group; Carrie, how many queries have we done over the years? Probably 500,000 queries. Okay, so I'm just going to wrap up with some future considerations. A question I often get asked, and I'm sure John gets asked this as well, is how much sequence do you need? What's the coverage that you need to sequence a tumor genome? Well, here's a kind of scary picture, and yet another scary picture, about intratumoral heterogeneity. This is a project that I'm leading whereby we're looking at different parts of the same tumor. We're extracting different samples from within the same tumor mass to measure precisely the extent of mutational heterogeneity within a given tumor. And so here are two cases. These are two individuals with four samples each. And what we did is we executed the mutation calling based on our classifier that I mentioned.
And then we clustered the data, just with the four samples, to show how much of the mutational profile is shared amongst those four samples and how much is unique, or not shared, amongst the four. What you can see here is that in both cases, approximately a third of the mutations are shared. Only a third. So taking one sample, you may be getting only slightly more than a third of the mutations in a given tumor. That's a bit scary. So one solution is to sequence deeper, because you may have some cells from different clones that are just rare in different parts of the tumor, and you may be able to catch them by sequencing deeply. But what I suspect will happen is that we'll go beyond this paradigm of one sample per tumor and really start to engage in multiple samples per tumor, in time and in space. So we can do time series type experiments and follow patients over time, but we can also look at distant metastases, and there are papers that have looked at patients in this way using sequencing. Not from a single nucleotide perspective, but certainly from an architecture perspective, using some of the tools that you used yesterday. So this is a bit of a sobering picture, but it also shows you how much there is to be learned about the genomes of cancers by sequencing, and we can design experiments to quantify, really for the first time, the extent of heterogeneity within a given tumor. And I think that's quite exciting, in that we can actually do this now. To give you an idea of how accessible this is: about a year ago, I wrote a grant in which I proposed to do five samples per tumor using exome capture technology. The grant was funded and is just getting initiated now. But now, one year later, I can do 10 samples of whole genomes for the same price as what I costed out for doing five samples of exomes just a year ago.
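The shared-versus-private comparison above reduces to simple set operations once each region's mutations are called. The sketch below uses fabricated mutation identifiers; in practice each would be a genomic coordinate plus the base change.

```python
# Sketch: quantifying intratumoral heterogeneity from multi-region calls.
# Mutation IDs and region names are fabricated for illustration.

samples = {
    "region1": {"m1", "m2", "m3", "m4", "m7"},
    "region2": {"m1", "m2", "m3", "m5"},
    "region3": {"m1", "m2", "m3", "m6", "m8"},
    "region4": {"m1", "m2", "m3", "m9"},
}

all_muts = set.union(*samples.values())       # everything seen anywhere
shared = set.intersection(*samples.values())  # present in every region

print(f"{len(shared)}/{len(all_muts)} mutations shared across all regions")
```

Here 3 of 9 mutations are ubiquitous, mirroring the roughly one-third shared fraction in the two cases above; any single region would also report private mutations that the other regions miss entirely.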
So this is going to get cheaper and cheaper and cheaper. We're approaching the $1,000 genome; right now we're sitting at about a $5,000 genome. And so these experiments are going to become commoditized, and basically even small labs will be able to engage in them. But you have to be prepared that it's going to create an enormous amount of data analysis. And that's both exciting and daunting at the same time. So, okay. Then finally, some statistical challenges for the future. I mentioned some of these artifacts, and we've really only begun to understand them. So what's going on in base calling? We've recently noticed a very systematic error where you have a pattern of nucleotides, and whenever you see that pattern of nucleotides, there's a sequencing error throughout. Okay, and it's really clean. It's beautiful. It looks like a perfect bona fide variant, but it's actually a sequencing error. So these things are going to get discovered over time, and we have to be ready to take advantage of that. Again, I'll emphasize that the biology of cancer is very complex, and few tools exist specifically for cancer data. My MO is to try to fix this, and we're working extensively on it. So all of these properties, copy number alterations, mutational heterogeneity, and tumor-normal admixture, will determine the allelic distributions observed in the data. And finally, we're going to have integration problems with multiple views of a tumor and multiple sampling. So beyond the tumor-normal pair paradigm, we have the DNA-RNA paradigm, where we might see RNA edits or allele-specific expression; pre- and post-treatment pairs, and some people in the room here are engaged in that type of work; and multiple samplings of the same tumor. All these problems are extremely interesting and important, and they also represent new statistical challenges that need to be addressed in order to fully define cancer mutational landscapes, as I described earlier today.
So with that, I'll conclude this lecture component, but before I do, I'd be remiss not to acknowledge a number of people who have contributed to all the work that I've presented today. Most notably, Sam Aparicio and David Huntsman are my former postdoc advisors and now colleagues. They really, I think, have been bold leaders in this field, adopted sequencing technology in a cancer context in the Canadian setting very early on, really led the way, and have had many early successes. So they deserve a lot of credit for that. I'd also like to acknowledge Gavin Ha and Jerry Ding, my PhD students, who worked a lot on developing some of the lab content for both yesterday and today. And then I have a number of other grad students, Andrew McPherson, Anna Cresan, and Andrew Roth, who have all done some of the work that you've seen in the slides. My collaborator on the intratumoral heterogeneity project is Jessica McAlpine, who's a gynecologic surgeon at the BC Cancer Agency. Finally, Marco Marra and Martin Hirst, who worked extremely hard to get a functioning sequencing pipeline going at the Genome Sciences Centre and have produced an incredible amount of good quality data for our experiments. And that's a non-trivial exercise, getting the data generation pipeline to a point where we're actually getting good quality data. So with that, I'd also like to thank the courageous patients who donated their tumor specimens to research. Obviously, none of this work would be possible without people signing consent forms and saying, yes, I give you my permission to use my tumor samples for research. None of this is possible without them. Okay, so that's where I'll conclude the lecture component, and we'll move to the lab now.