Good morning everyone. I'm very excited to give this talk as a past student of the Bioinformatics.ca workshops. I took this class as a graduate student in the very early days of next-generation sequencing and microarrays. It's changed a lot since then, so I'm going to go right back to basics, what is genomics, and then start to go into some actual data examples. This is at a very high level, to set the table for all the in-depth talks and workshops we're going to have for the rest of the week. The learning objectives are focused on cancer genome analysis: what goes wrong in the cancer genome, how can it go wrong, and how can we detect it. I also want you to think carefully about the different bioinformatic approaches to detecting, in some cases, the exact same type of cancer genome aberration. There are many ways to find mutations and many ways to find rearrangements. I'm not going to go into intense detail on any of these; there are specific lectures just on these topics. But I want to give you a broad overview of how this bioinformatics enterprise comes together. The first 60 minutes are going to be that overview piece of the talk, and the last 30 minutes are a case study. This is a published report, which you may have seen in the slides, probably the first example of using cancer genomics to manage a patient case. We'll go through that almost journal-club style, to put what I've shown in the first part of the talk into practice, in this case guiding patient management. So with that in mind, I want to start with this basic idea that we are all made of cells, including cancer. Cancer is really normal cells gone wrong. Each of these cells has DNA molecules in it, and the enterprise of genomics is to understand the DNA within these cells. What about that DNA makes cells cells?
How does DNA turn to RNA turn to protein? All of these are measurable right down to the molecular level today, and this week is all about measuring and interpreting those patterns. So I wanted to start with an example of my own human genome. This is one of my blood cells dropped onto a slide and stained, and this is where genomics really started: actually looking at DNA down the microscope. In the early days of cytogenetics, the first step was literally cut and paste with scissors and glue to make a paste-up like this, where you take this mess of chromosomes, line them up, and start looking by eye for differences in banding patterns between these chromosomes. This is still done today, especially when looking for very large structural rearrangements. Cancer genomics and the advent of next-generation sequencing have let us look at the exact same sources of cancer genome variation, large deletions and rearrangements, but now at even greater resolution, right down to the base-pair level. The point of showing one of my normal healthy cells is to compare it to a cancer genome. My first take-home point is that cancer is a disease of the genome. You can tell by eye the many ways that a cancer genome has gone wrong. In this case, there's been a genome doubling event, so there are twice as many chromosomes as there should be. If you look at chromosome 9, half the chromosomes are longer than the others, so the 9p arm has been deleted on the shorter copies. There are rearrangements tucked in here. You can see there has been genome doubling and then shedding of chromosomes: some of the chromosomes are at copy three or copy two, even though most of them are at copy four. Our goal, especially in clinical cancer genomics, is to find these and then to do something about them, or to recommend that something be done about them.
This is still the central challenge in cancer genomics today. All happy cells are alike, and each unhappy cell is unhappy in its own way; the unhappy cells in this case are cancer cells. I have shown here a Circos plot. You've probably seen these a lot in cancer genomics papers and presentations. It's an attempt to take that side-by-side karyotype I showed earlier and make sense of it on a single page. This is actually very similar data. Around the outside ring are the chromosomes, so instead of showing them side by side, they're now ordered end to end around the outside of the circle. The inset here with the letters is annotating all the ways that DNA can be modified specifically in cancer cells. We're going to have lectures on almost all of these topics: point mutations, copy number alterations, structural rearrangements where one piece of a chromosome is now stuck to another one, and then getting into specific genes and regulatory elements. There's a whole workshop on mapping these genes into networks and trying to understand how they interact with each other and how they map to specific pathways. What I've actually shown here are all these arcs. These arcs can be used to map many different things. In this case, on Circos.ca, which is Martin Krzywinski's site, he's used them to show all the portions of the genome that look like another portion. So these aren't even rearrangements. These are genomic segments that are very difficult to understand and very difficult to map, simply because the sequences are identical. Even today, although the human genome sequence has been stated to be finished for a very long time, we get a new build every five or six years or so. As we learn more about it, we get better at mapping these sequences and at putting these As, Cs, Ts, and Gs back into this karyotypic form.
So that's the background, the molecular complexity of cancer. It seems quite daunting at first, but first of all, we know a lot about cancer biology already, and we have very powerful computational techniques to detect and integrate all these data. I'm going to stick with the biology a little bit and talk about how different cancers can vary in their level or type of cancer genome variation. The first concept I wanted to show is a famous figure from a review by Mike Stratton, from a while ago now, illustrating that cancer cells accumulate somatic alterations over time. This concept of somatic versus germline is key when you're talking about cancer genomics. Germline variants are the ones you're born with, so they are in your normal, healthy cells, the cells that make up most of your body. Somatic alterations are the new hits that occur in cancer cells. The point of this slide is to show the life cycle of a cancer cell right from the germline, a fertilized egg, over time through gestation and infancy, and you can see that even very early on you start to get hits. There's a key here, although the color didn't really come out, but you start to accumulate what are called passenger mutations, which are just variants we don't necessarily understand. Often they don't have any functional impact. You can see that you acquire more and more such mutations over time, especially as you're exposed to additional environmental or external influences. And in this case, there is the ultimate acquisition of a driver mutation, shown here as a star. Driver mutations are DNA changes that confer a new ability onto these cells, for example a mutation in a driver oncogene or a loss-of-function mutation in a tumor suppressor.
And it's these types of mutations that start to drive the accumulation of additional mutations. Mike Stratton has annotated here in red the mutator phenotype and the acquisition of additional mutations, particularly in response to, or because of, treatment, particularly chemotherapy. So the take-home message on this slide is that mutation frequency depends on cancer type. Pediatric cancers and a lot of the hematologic malignancies have a very low mutation rate; they're occurring much earlier in the life cycle of these cells. But certainly external forces can drive the piling up of additional mutations. This next slide says it best. The way to read this plot is that the different cancer types are along the bottom and the y-axis is the number of mutations, on a log scale. As I said, down here on the left are the low-mutation-rate pediatric cancers. Each dot is a single cancer. Thanks to The Cancer Genome Atlas and other efforts that you're going to hear about in other workshops, you can see this diversity of mutation rate. In general, low-mutation-rate cancers cluster together and high-mutation-rate cancers cluster together. The outlier, the most mutated cancer, is melanoma. These are sun-exposed cells, and they have a very strong sun exposure signature: thousands and thousands of C-to-T mutations. Lung cancer as well has a very specific mutation type and thousands of these mutations. But the bulk of the cancer types are here in the middle: not necessarily high, not necessarily very low, but very unlucky in that a mutation has occurred in the wrong gene at the wrong time, resulting in cancer development. I also like to show this plot for the outliers. The one I like to show is neuroblastoma, here. This is a pediatric cancer of the developing nervous system, often found in the adrenal gland.
But I like to show this, if I can pick this up correctly, this very high mutation rate tumor. This is actually the only neuroblastoma with two hits in a DNA repair gene. So even though this was a pediatric cancer, it had two mutations such that the cell lost the ability to repair its own DNA, and that tumor had as many mutations as the lung cancer of a lifetime smoker. It's a case where you can start to find molecular patterns that explain global features of a cancer genome. And this sums it all up. This is a famous figure by Hanahan and Weinberg, updated fairly recently in Cell, trying to pin down how we map all these mutations to specific functional abilities. Ultimately, tumor cells are normal cells gone wrong, and tumor cells have acquired abnormal abilities by co-opting things that cells can already do very well. I'll go around this circle very quickly. If you're going to do cancer genomics, you basically must read Hallmarks of Cancer and its reboot. At the end of the day, they've boiled it down to relatively few concepts. Resisting cell death: cells being able to persist over time, not dying when they're supposed to. Sustaining proliferative signaling: being able to talk to other cancer cells and drive growth of a tumor mass. Evading growth suppressors: normal, healthy cells should be talking to one another through cell-cell signaling, and in this case cancer cells avoid or don't even see these growth suppressors. Activating invasion and metastasis: normal cells, especially if you look at a pathology slide, will be nicely ordered, whereas in cancer these cells are very disorganized. Enabling replicative immortality: cells should have a finite number of times that they are able to replicate.
A lot of tumors have lost this limit; they've become immortalized. And inducing angiogenesis: driving growth of normal cells to feed the tumor cells, promoting invasion of blood vessels into a tumor to deliver the oxygen and nutrients that tumor cells need to grow. It's one thing to say these tumors have thousands and thousands of mutations, but as we look at hundreds of these tumors, we start to see recurrent patterns, with mutations showing up in the same genes over and over. This is a very simplified diagram from Genetics in Medicine, which is a seminal textbook in genetics in general. It drives home this concept of a single cell having activating mutations in specific genes, termed oncogenes, and loss-of-function mutations in tumor suppressors. There are a few examples here. This can result in larger bulk tumors, and then additional mutations, not necessarily gain of function or loss of function, in additional genes sculpt the cellular makeup of these tumors. I'm going to return to this idea of subclonal heterogeneity in just a moment. I believe Sohrab Shah has a whole lecture on exactly this topic. Clinically, there's been some success in targeting these specific highly recurrent mutations. You're going to hear this concept of targeted therapies that can exploit the presence of specific mutations, which illustrates why we need to know about these mutations to guide patient care. Here are two of the most famous examples. The first is lung adenocarcinoma with activating mutations of the epidermal growth factor receptor, EGFR. These are not even highly recurrent; they are relatively rare mutations, but enriched in non-smoking females of Asian descent. This is still a mystery. No one really knows why this group specifically appears to be enriched for this type of mutation. But you can see these very dramatic responses.
When you know that these mutations exist, you know these tumors are being driven by mutations in this specific gene, and therefore these patients are treated with a specific inhibitor that shuts down the protein encoded by the gene. There's a very similar story here in melanoma with an activating mutation of BRAF. You can see this poor fellow here on the left, absolutely riddled with metastatic disease. All of these tumors have activating BRAF mutations. You can see on the right a very dramatic response, with metastases melting away, which was very exciting five, six, seven years ago. Unfortunately, we've come to appreciate that resistance is really inevitable in this case. These tumors are very plastic. The model of a single mutation that can be hit by a single drug has not held up very well over time, because these tumors are growing and there's selection in response to therapy. Ultimately, especially with these highly targeted therapies, another clone eventually grows out or there's a secondary resistance mutation. To get at this problem, there have been high-throughput bioinformatic analyses that have tried to work out what the actual mutations are. In addition to the single EGFR mutation or the single BRAF mutation, are there ways that we can, first of all, look across cancer types for EGFR and BRAF mutations outside of the two diseases I showed, but also think about combination therapies? This is a very nice review of thousands of tumors sequenced by The Cancer Genome Atlas and analyzed by Chris Sander's group, then in New York. The data were absolutely gigantic, but it was summed up in this heat map, where each row is a gene that encodes a protein and each column is a cancer type. This paper had three main take-home messages.
First, druggable alterations cut across cancer types. BRAF mutations are not restricted to melanoma; we've seen them at lower frequencies in other cancer types, so the same mechanisms are getting repurposed in other cancers. Second, combination therapies may be effective in tumors with compound pathway alterations. Here they've mapped each of these sets of proteins to specific pathways, and they saw that specific pathways are co-mutated quite frequently in specific tumors. So there's potential here for combination therapies, and there are ongoing trials that try to detect multiple genetic alterations and then drug them both. And third, perhaps a little ambitious at the time, 50% of tumors had at least two disrupted druggable targets across this large set. This is a bit of vocabulary that's going to come up over and over, this idea of druggable and actionable. These definitions are not set in stone at all. I would term as actionable these types of mutations where, if you have this mutation, you get a drug immediately. For these types of analyses, it's been more any mutation in any of these proteins that map to a pathway, which for clinical genetics is a bit of a lower bar for actionability. I've talked about the tumor as a homogeneous mass up to this point, but the real challenge, especially in the resistance arena, is that even within a single tumor mass, different cells may have different mutations. Even more insidious is the tumor mass that has a very small subpopulation that is already resistant before you even start treatment. This is the case where you have an activating EGFR mutation that you can detect using conventional methods, but also a very rare population that is already resistant to these types of drugs.
This is very nicely described in this review here, but this ball really sums it up: a single large tumor mass, in this case with three different clones already, up here in the blue. This is particularly important as we start to look at and analyze these tumors over time. So here's a cartoon from, pardon? That's right, from WashU, trying to show the life cycle of this tumor over time, beginning with relatively few mutations. These are the initiating mutations at the very beginning of the tumor. Over time, you can see the acquisition of these additional subclones; in this case, there's a yellow clone and a purple clone. This patient is then treated with chemotherapy, and you can see selection among these clones. The purple clone is completely responsive to chemotherapy. But then this very rare persistent clone is the one that grows out, acquires a large number of additional mutations, and ultimately this patient succumbs. The challenge here is not necessarily to chase all these mutations over time, which is a bit of the current model, but to have an appreciation for this global subclonal structure of cancer and to attempt to treat the tumor population more holistically. This is particularly challenging when you have not just local heterogeneity but metastatic heterogeneity as well. You have the primary tumor at the very beginning, and subclones are then acquired. Specific subclones travel to different metastases, and this is one theory as to why we see differential response of a metastatic tumor versus the primary tumor: these may be, not completely, but largely divergent at the molecular level. This brings me to the applied clinical question: what are the reasons for molecular testing in cancer? Specifically, they are around treatment, and understanding drug resistance and metabolism.
I'm going to talk very briefly about inherited cancer syndromes, that is, individuals who are born with one of these single hits already, just waiting for the second hit for a tumor to develop. And then prognosis overall. Which mutations are actionable, and which mutations increase your risk but don't necessarily immediately lead to cancer? This is the enterprise of molecular genetics and molecular pathology today. This is all great at the global level, looking at thousands of tumors, aggregating them together, and mapping them to pathways, but for single patients the question is: what are the targets in my cancer, and what has gone wrong? What's the sequence, what's the structure, what's the function, are there external contributors such as viral integration, and what can be done about them? So are there any broad questions about the cancer genome and what can go wrong in general? If not, I'll push on to the specific topic of this workshop, which is applications of next-generation sequencing to cancer. I've grossly simplified how we make next-generation sequencing data today into four steps. Cells are lysed to extract genomic DNA or RNA. We then make a library of DNA fragments; a library is just fragmented DNA with adapters that make the fragments compatible with our next-generation sequencers. This then goes on a sequencing device. There are many vendors of next-generation sequencing machines. The most famous is Illumina, and most of the data I'm going to show today comes from Illumina. And then computational analysis is what all of you are going to do once these data are made. Next-generation sequencing data literally is As, Cs, Ts, and Gs in a text file. This is real data from an Illumina sequencer: just a massive text file of bases and a paired list of qualities. You can see every single base has a certain quality. This makes no sense at all, certainly to humans.
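As a rough illustration of that bases-plus-qualities text, here is a minimal Python sketch that reads four-line FASTQ records and decodes each quality character into a Phred score. It assumes the common Phred+33 encoding. The names `FastqRecord`, `parse_fastq`, and `error_probability`, and the example read itself, are made up for this sketch and do not come from the talk or from any particular tool.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ entries (assumes Phred+33 encoding)."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, _plus, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or len(bases) != len(quals):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        # Each quality character encodes a Phred score as ASCII code minus the offset.
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quals])


def error_probability(phred: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical single read: four confident bases and one low-quality N.
example = ["@read1", "ACGTN", "+", "IIII#"]
record = next(parse_fastq(example))
```

Here `record.qualities` holds one integer per base, and `error_probability` turns a score into the chance that the base call is wrong, which is the quantity downstream tools weigh when deciding whether a mismatch is believable.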
One lane on an Illumina sequencer makes millions of these; one lane will make 600 million lines just like this. And the goal of computational analysis is to take each of these reads and compare them to the human genome reference. You can see that even with a single step, taking reads and mapping them to what is basically the product of the Human Genome Project, a lot of the reads start to look the same. This is a region of the human genome with no variation at all. All the reads that map to this region are exactly the same. So don't go looking for a mutation here; there shouldn't be one. It's just to illustrate the stack of thousands, tens of thousands, or even hundreds of thousands of reads, all supporting the lack of mutation in this region. These read alignments are very powerful. They are the starting point for a lot of the bioinformatic analysis that you will all be doing. This is a cartoon of the real data I just showed you, to illustrate that multiple types of cancer genome variation can be inferred from these sequencing read alignments. Exactly like I just showed you, here's the human genome reference along the top, and here's the cartoon of all the DNA sequencing reads. Specifically in cancer, we're interested in point mutations, which are sequence differences. Insertions and deletions, or indels, are where two or three, up to 50, bases are inserted or missing. Homozygous deletions: we infer the absence of data as a deletion, because there's no DNA there to be sequenced. Hemizygous deletions: half as much data as we'd expect, because one of two copies has been deleted. Copy number gain: you have more data than you'd expect, or at least more than your reference set. Translocations: you have a piece of one chromosome rearranged and stuck to another. Half the read corresponds, in this case, to chromosome 1, and the blue end of the read maps to chromosome 5. So we can start to resolve these breakpoints.
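The depth-based inferences just described (no data, half the data, more data than expected) can be sketched as a simple ratio test against a reference sample. This is only a toy under stated assumptions: the tumor is treated as pure and the reference as diploid, both depths are assumed to be already normalized to the same total sequencing yield, and the cutoffs are illustrative midpoints chosen for this example rather than values from the talk. The function name `classify_copy_state` is hypothetical.

```python
def classify_copy_state(tumor_depth: float, reference_depth: float) -> str:
    """Label a region by its tumor/reference read-depth ratio.

    Expected ratios for a pure tumor against a diploid reference:
    0 copies -> 0.0, 1 copy -> 0.5, 2 copies -> 1.0, 3 copies -> 1.5.
    The cutoffs below are illustrative assumptions, not calibrated values.
    """
    if reference_depth <= 0:
        raise ValueError("reference depth must be positive")
    ratio = tumor_depth / reference_depth
    if ratio < 0.1:
        return "homozygous deletion"   # essentially no reads at all
    if ratio < 0.75:
        return "hemizygous deletion"   # about half the expected reads
    if ratio <= 1.25:
        return "copy neutral"
    return "copy number gain"          # more reads than the reference


# Hypothetical mean depths per region: (tumor, reference)
regions = {"regionA": (0, 30), "regionB": (15, 30), "regionC": (30, 30), "regionD": (45, 30)}
calls = {name: classify_copy_state(t, r) for name, (t, r) in regions.items()}
```

Real copy number callers do considerably more than this, for example segmenting along the genome and correcting for tumor purity, ploidy, and GC bias, so the sketch only shows the core idea of reading copy number off read depth.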
And ultimately, after you finish an alignment to the human genome reference, there are lots of other references out there. Actually, I just read the draft genome. It was just out a couple of weeks ago. In this case, we actually align a lot of these reads to viruses and bacteria, trying to get at what other pieces of DNA are in the tissues that we're analyzing. I've put in red some of my favorite tools for detecting all these sources of cancer genome variation. There are many others, even just for alignment. There's a Wikipedia article on alignment software that lists, I think, 50 or 60 programs just to do that first step. A lot of bioinformatics is trying out two or three of these tools and picking the one that's best for your specific application. So here's a list of DNA sequencing approaches to cancer, specifically whole genome and exome sequencing, targeted gene sequencing, variant genotyping, and epigenome. I've binned these conceptually into DNA sequencing; these are all modifications to the DNA sequence. Following on with the central dogma, there is RNA sequencing, both microRNA and whole transcriptome sequencing. And then finally, we can have a readout of protein modifications as well, in the form of mapping protein-DNA interactions, things like ChIP-seq and RIP-seq, and epigenetic mapping such as ATAC-seq, looking for open versus closed chromatin regions and trying to get at where nucleosomes are positioned. There's a whole workshop specifically on epigenetic analysis, so this workshop is going to focus primarily on the DNA and RNA work. I'm going to go through most of these and show you what real data look like in each of these different configurations.
The only modification to that simple four-step process is how the DNA and the RNA are prepared, and whether a specific subset of the library is isolated, either by PCR or by hybrid capture, which are the big two ways to focus the sequencing a little bit more. Here's a case where we have matched whole genome sequencing, whole exome sequencing, and RNA sequencing. This is a snapshot of a very zoomed-out view of the Integrative Genomics Viewer. The way to read this is that each tiny little gray tick is a sequencing read, and you can see just on this slide alone we have thousands, probably tens of thousands. So the point is not to appreciate individual reads, but to appreciate where these reads are mapping. In this case, we have the KRAS gene, so here's the promoter and all the exons down along the bottom. I want you to appreciate, for whole genome sequencing, the absolute sea of data. There's very little bias in where these reads map: end to end, exon, intron, promoter, it really doesn't matter. It's a semi-random distribution of reads. It's not truly random, due to bias in the library construction, but in general there are lots of reads everywhere. For whole exome sequencing, there's a library step up front, a modification where you use DNA or RNA probes specific to protein-coding regions. This is the 1% of the genome that we think we understand, or where a lot of the work has been done to date. In this case, you can see the reads map only to exons. All these little read stacks map specifically to the protein-coding regions of KRAS. There is a little bit of background, sort of a 0.1x or 0.01x whole genome in the background.
That pull-down isn't perfect, so we'll often see these off-target reads, and some clever bioinformatic tools make use of those reads as well. RNA sequencing is a little bit different. You extract total RNA, but you let the cell do the work. The cell is the one that's splicing out the introns, so the reads are wherever the cell has placed them. With RNA sequencing, since introns are completely spliced out, exons are brought right next to each other, so we can tell which exons are spliced or attached to one another. You can see all the reads here are mapping specifically to exons, just like they were in exome sequencing. You can see this exon in the exome sequencing is just as well covered as all the others, but in RNA-seq that exon is not used very often. In fact, it's often spliced out, and exon one is more often being attached to one of these other exons further upstream. So it's a biological readout using the exact same sequencing technology. Yes? Are these horizontal lines indicating that the read is mapping to two different exons because they're spliced together? Exactly. It's basically a gapped alignment. I'll zoom into some of these later so you can see exactly what that breakpoint looks like. Exome and genome sequencing are held up as comprehensive; we also see the term comprehensive genome analysis in papers. This is really not the case. It's a bit of a misnomer. The way to read this is that every single exon in the genome is on the y-axis, each basically a few pixels thick, and along the bottom we have nearly 300 neuroblastoma cases that I analyzed as a postdoc. The color is the coverage of each of these exons. On the previous page I showed exons that had coverage.
You can see, especially in whole exome sequencing, that there are a lot of exons that don't have coverage at all. In this study, as few as 80 or even 70 percent of exons are covered in some cases; the rest have flat zero coverage. I've also included some very deep 60x genomes here, and you can see that even there, 5 percent of what is termed whole genome sequencing has no coverage whatsoever. That may be errors in the reference, where that sequence isn't there to be sequenced. It may be copy number variation; these may be real deletions. And there's still sequencing bias in the machines that we use to sequence DNA today. GC-rich and GC-poor regions, which is the point of this track here on the left, continue to be very difficult for the enzymes and the library construction methods we currently use to generate DNA fragments for sequencing. Transcriptome sequencing is a little bit different in that the coverage is completely dependent on expression. Genes with high expression get lots of coverage; genes with low expression get lower coverage. The point of this slide is to show how you may be able to use this to classify, in this case, a breast cancer patient. The way to read this is that I've plotted every single exon here on the x-axis and the expression level on the y-axis. This is a very stereotypical shape for RNA sequencing data. There are all these extraordinarily highly expressed genes, ribosomal proteins and housekeeping genes, and then genes that are basically flat, off. In this case, there are three molecules used to classify breast cancer, all with very low expression. This is on a log scale again, just as before. The point of this plot was that we didn't know the type of breast cancer for this patient. There were three molecules that we were interested in to classify the tumor as either ER positive or ER negative.
So you can see that what we clinically call ER positive is not an ER-positive breast cancer expressing ESR1 at an insane level. It's a slight uptick relative to other genes. Here's the ER negative, which looks like very little expression. This patient is much more consistent with the second case here. What we're calling positive and negative is not really as binary as positive or negative; these are all continuous variables in expression space. I mentioned earlier a little bit about epigenetic modifications as well. We're not going to touch on this too much in this workshop, but I thought I should include it as a source of cancer genome variation. The point here is that epigenetic modifications can regulate gene expression through the addition of methyl groups on specific DNA bases, most famously methylcytosine. This slide shows a different way to look at, in this case, bisulfite-treated DNA. Cs that have methyl groups don't react with this bisulfite chemical, whereas unmethylated Cs do react. The way to read this out by next-generation sequencing is that unmethylated Cs turn into Ts, so we can see these as piles of red bases, or sorry, the other way around: unmethylated Cs become Ts, and we see these as piles of blue. In this case, this is an unmethylated promoter, so MLH1 is expressed. And in this cancer, there's methylation of this promoter, which shuts down expression of this specific gene. Newer technology now lets us do this genome-wide and look at methylation patterns at every single promoter of every single gene. There's a whole workshop specifically focused on these types of modifications. This slide got a little bit scrambled, but this is just a summary of what I've talked about: how these data types can be confirmatory and complementary.
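To make the bisulfite readout described a moment ago concrete, here is a small Python sketch of how a methylation level could be estimated at one reference cytosine: reads that still show a C are counted as methylated (protected from conversion), and reads that show a T are counted as unmethylated (converted). It assumes all reads come from the strand where the reference base is C and that conversion was complete. The function name and the example pileups are hypothetical.

```python
def methylation_fraction(read_bases: str) -> float:
    """Estimate the methylated fraction at a single reference C.

    read_bases: the base each aligned read shows at that position.
    C = protected from bisulfite, so methylated; T = converted, so unmethylated.
    Other bases (sequencing errors, N) are ignored.
    """
    bases = read_bases.upper()
    methylated = bases.count("C")
    unmethylated = bases.count("T")
    informative = methylated + unmethylated
    if informative == 0:
        raise ValueError("no informative reads at this position")
    return methylated / informative


# Hypothetical pileups at one promoter CpG
unmethylated_promoter = methylation_fraction("TTTTTTTTTT")  # every C converted to T
methylated_promoter = methylation_fraction("CCCCCCCTCC")    # most Cs protected
```

A fraction near zero corresponds to the unmethylated, expressed promoter in the example, and a fraction near one to the methylated, silenced promoter; intermediate values are expected in mixed cell populations.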
In this case, using genome, exome, RNA-seq, or methyl-seq, which is the readout for epigenetic analysis, and really what you can use specific technologies for, and really the power in generating multiple different types of genome data from the exact same samples. For example, if you're interested in sequence mutations, you can actually use genome, exome, or RNA sequencing data, as well as methyl-seq if you're lucky enough not to have the mutation at a methylated C. So I'm going to go briefly into all these various sources of genome variation, and just show data examples of what a mutation looks like, what a copy number variation looks like, and so on. So beginning first with somatic mutations. Just as a benchmark, I wanted to show what natural variation in the human genome looks like. So these are reads aligned to a germline polymorphism. In this case, half the reads support the variant, and half the reads don't. We have relatively shallow coverage, only 19 reads. Half the reads map to a C, and half the reads map to a T. This is exactly consistent with having two copies of the gene, two copies of the chromosome: half the reads map to the reference base, and half the reads map to the variant base. And this is really exactly the type of variation we're looking for in cancer as well, specifically variants that are not necessarily polymorphisms, but are somatic changes, really changes that are specific to the tumor itself. So finding these mutations in high quality, undegraded DNA is relatively straightforward. You can certainly do it by eye, just by looking at the reads here. Really putting this into practice, especially applying it to tumor biopsies, is confounded by several features of the cancer genome. The first challenge is the variable DNA quality and quantity from tissues that are routinely used for diagnosis. 
These are 12 tissues all from the same hospital, all from the same cancer type. These are all lung tumors, but you can really see quite a bit of variability just in the size and quality of the DNA that's coming out of these specimens. In this case, this is a DNA ladder. This is the ideal case, nice high molecular weight DNA, very easy to work with in the laboratory. What you can see is that as you go from left to right, this DNA gets increasingly degraded until the far right, number 12, actually has no DNA at all. This is a metastasis to a bone that was then decalcified. The DNA was completely destroyed. There's absolutely nothing there to analyze. The other challenge is that tumors are a mix of cancer and normal cells, and you may hear the terms purity and tumor content. Tumors are really a mix of the cancer cells themselves, but also their environment. A lung tumor cell is actually surrounded by normal lung cells. In this case, this is actually a lung tumor metastasized to a lymph node, and I've circled all the tumor cells. In fact, with the tumor content here, you're lucky if it's 50%. So as we're extracting DNA from these samples, half of our DNA is from normal cells, and the other half is coming from tumor cells. So this is really a challenge in cancer genome analysis, where probably half of your data is not going to tell you very much about the cancer genome itself. As I alluded to earlier, tumors can have multiple genome copies. In this case, you may hear the term ploidy. This is just a count of how many sets of chromosomes are in the cancer cell. In this case, the average ploidy is just under four. So there's been this genome doubling event. There are roughly four copies of every chromosome, but then there has been some chromosomal shedding. So the ploidy here is slightly less than that. As a result of these two features, deep coverage is needed to detect mutations in low purity or high ploidy tumors. 
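A small worked example may help show why low purity and high ploidy both push the mutation signal down. The sketch below uses the standard mixture arithmetic for the expected fraction of reads carrying a mutation; it is an illustration under stated assumptions, not a method from the talk. The function name and parameters are made up, normal cells are assumed to carry two copies of the locus, and the mutation is assumed to be present in every tumor cell.

```python
def expected_vaf(purity, tumor_copies, mutant_copies=1, normal_copies=2):
    """Expected variant allele fraction for a clonal somatic mutation.

    purity        : fraction of cells in the sample that are tumor cells (0..1)
    tumor_copies  : total copies of the locus in each tumor cell
    mutant_copies : how many of those copies carry the mutation
    normal_copies : copies of the locus in each normal cell (assumed 2)
    """
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1.0 - purity) * normal_copies
    return purity * mutant_copies / (tumor_alleles + normal_alleles)


# Pure diploid sample with a heterozygous change: half the reads.
print(expected_vaf(purity=1.0, tumor_copies=2))

# 50% tumor content, four copies in the tumor, mutation on one copy.
print(expected_vaf(purity=0.5, tumor_copies=4))
```

With 50% tumor content and four copies, a mutation on a single copy is expected in only about one read in six, compared with one in two for the germline polymorphism shown earlier, so the same number of reads gives far less evidence.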
This is the exact same tumor analyzed in two different ways, once by relatively shallow whole genome sequencing, and once by sort of average whole exome sequencing. For The Cancer Genome Atlas, the average coverage targeted was 100x. This is slightly deeper than that. We can really see the very scant read coverage of this mutation here in the exome sequence. There's no hint whatsoever in the whole genome sequence. And this is really where, as you're designing your research project, your assay, you're trying to think about what assay makes the most sense. Do you want whole genome sequencing, where you're going to have coverage throughout the entire genome, or do you want to go very, very deep on protein coding regions, where you may be able to interpret those variants more directly? The other benefit of deep DNA sequencing, specifically whole exome sequencing in this example, is that we can start to infer tumor purity, ploidy, and subclonal structure. So the way to read this plot: this is the output of Sequenza, which is just a piece of software that makes these plots for you from whole exome sequencing data. The y-axis here is the genome. So each little bin is a chromosome. And the summary down here is really the copy number state. So here in chromosome one, most of chromosome one is intact, apart from one end of it. In this case, chromosome two, you have two copies of one allele, the red allele, and one copy of blue, so you have a single copy gain, and so on. So this is really just illustrating that you can infer allele-specific copy number across the entire genome, even though you don't have whole genome sequencing. The other ability of this algorithm is to really start to look at these specific patterns and infer the tumor content and the average ploidy. It spits out a number: in this case, 61% of your cells are tumor cells, and the average ploidy across the genome is 2.5. 
So in general, there are more gains than losses. That solution is just down here in the corner, but you can actually see there are lots of other potential solutions. You could have a very pure sample with no copy number alterations, or a very, very impure sample with highly amplified chromosomes. And this is really where data visualization of these algorithms is absolutely crucial, because it's very valuable to just use your human eye to critically evaluate what the other potential solutions to the formula could have been in this case. And the third output is from a program out of Sohrab Shah's group, really trying to map these mutations to specific subclones. The way to read this plot is that each stripe is a single mutation, so in this case you have hundreds of mutations mapped to a superclone, really the dominant clone in the tumor population. Then you have the steps, so this second collection of mutations at lower allele fraction. And then there's a very, very minor subclone down here, around 20%, supported by 10 to 20 mutations. So this is really trying to directly read out subclonal structure from that original sequencing read alignment. So it's one thing to find mutations. The other challenge is to first of all verify that these mutations were correctly called, that they're actually there in your data. And I didn't want this to be a completely Illumina sequencing talk, so I did want to show some Pacific Biosciences data as well. So down here on the bottom is the Illumina data we used for our mutation discovery, a point mutation and an insertion, or a deletion rather. 
And we just used PCR primers to sequence the exact same region. Certainly by eye, the PacBio data looks a little scary. It's packed full of insertions and deletions, but the real mutation is really quite well supported by hundreds of reads, both the point mutation and the deletion itself, even though there's this high background error mode of deletions. And this is really where you need to tailor your algorithm specifically to the data type that you're generating. So there are mutation callers specifically for PacBio sequencing and there are many mutation callers specifically for Illumina data as well. DNA sequencers make errors, as I just showed using that PacBio slide. The challenge is really just to tell the difference between an error and a real base pair change. And I'm just showing on this plot here the effect on mutation detection sensitivity as coverage increases. So exome sequencing is around 100x today, and certainly we're pushing this to 500 to 1000x. It really just comes down to money: coverage equals money. Sequencers are generating more and more data over time, so certainly the amount of sequencing data you get for the same amount of money is improving over time. Exome sequencing is around 100x. In our clinical lab, we target 500x. To really get down to subclonality, we're using these targeted methods to get to thousands, tens of thousands, in some cases hundreds of thousands of x coverage. And then circulating tumor DNA, which I want to talk about next, is really trying to get down to this very, very sensitive level, really getting very close to the background sequencer error rate. And this is particularly challenging in circulating tumor DNA, where this is sort of the ultimate low-purity tumor sample, where almost all of the DNA, all the cells in the bloodstream, are normal, non-cancerous cells. This is really where we need very, very deep sequencing to start to pick up these very rare alleles, on the order of 0.05%. 
And this is really why I wanted to show this really extreme use of next generation sequencing data. Cell-free DNA dissolved in blood is derived from many normal cells and just a few tumor cells. And when I say cell-free DNA, I'm talking about the DNA that's dissolved in the liquid portion of the blood. There are also circulating tumor cells that are actually picked up in the cellular portion as well, also at very, very low abundance, requiring very, very sensitive methods. And I bring up circulating tumor DNA because, although sequencing costs have fallen, they haven't fallen as fast as we need them to in order to do routine whole genome sequencing from blood, especially at the depths needed to pick up these very, very rare alleles. So I do really want to put side by side two methods that we use to really focus these data. So I showed the whole genome data, where we just had coverage wall-to-wall, and the whole exome data, where we're really focused on specific exons. And I wanted to show here two other approaches. One is highly targeted PCR sequencing, where you just put PCR primers down, amplify the exact same chunk of DNA thousands of times, and then just sequence that region over and over. You can see that with highly targeted DNA sequencing, hundreds of thousands of x is possible. And to contrast that with hybrid capture, this is really the technology we use for exome sequencing more routinely. Certainly all the coverage focuses primarily on the regions of interest. Here are the two exons, and all the reads map primarily to exons. But you can see you still have a lot of off-target with hybrid capture. Really you have this nice normal distribution around an exon. But if you really want that depth focused specifically on exons, you're really wasting coverage, because you're sequencing to the left and the right of your regions of interest. 
In my lab, we're very focused on using hybrid capture anyway, because we're interested not just in point mutations, but also structural rearrangements, and in trying to cover broader regions of the genome rather than one or two specific mutation hotspots. This is particularly important for circulating tumor DNA analysis, where you want a very highly quantitative method. In this case, this paper from Max Diehn's group at Stanford is really just illustrating a very tight correlation between tumor volume and increased concentrations of circulating tumor DNA. You just get more circulating tumor DNA from patients with larger tumors or metastatic disease. You can really start to use this technology to monitor shifts not just in circulating tumor DNA, but also in tumor size over time. In this plot, in red is a fusion that was monitored by targeted sequencing, and the tumor itself was monitored by imaging. You can really see the quantity of circulating tumor DNA and the tumor volume tracking very closely over time. Again, this is using very targeted, very deep next generation sequencing to pick up the very few reads that supported the fusion in this case. So that was a little bit on mutations. These are the smallest genetic changes that we can currently measure by next generation sequencing, and I wanted to step back to the larger copy number alterations and structural alterations. This is a figure you've likely seen in several introductory cancer genomics talks. It's really just illustrating a technology called SKY, where each chromosome gets its own color. So in a healthy normal cell, you'd see pairs of yellow, pairs of red, just like the original karyotype I showed. In this case, you can see this genome is completely scrambled. Different pieces of the chromosomes are all rearranged and stuck to one another. This is really a major bioinformatics challenge. 
It's just detecting which rearrangements occur and then mapping specific rearrangements to oncogenes specifically. So copy number variation in cancer has been appreciated for a very long time, certainly at the chromosome level, and with the microarrays that were used in the previous generation just to pick up gains and losses of DNA. The way to read this plot: instead, I put the chromosomes on the y-axis. In this case, here's a microarray side by side with a sequencing-based copy number profile, and really you can contrast the tumor, where the gains are red and the losses are blue, to the normal. It's not that the normal has no gains and losses. There are certainly polymorphic copy number gains and copy number losses that are seen across the population, but certainly not to the extent of this tumor sample, both by microarray and by sequencing. Clinically, or in circulating tumor DNA, we typically use these very, very small targeted panels. These still have value as sources of copy number variation. In this case, here's a small targeted panel that really looks at very few genes, but you can see even just by eye that EGFR here has much, much more coverage than the other genes on the same chromosome. In this case, it's an EGFR-amplified lung cancer. So even by looking at small targeted panels, we're still able to infer gene-level gain and loss. These copy number alterations become increasingly rich as we add additional probes, certainly going to exome and genome sequencing. The resolution of these copy number alterations becomes even more clear. In this case, here's a very clean gain of EGFR just on its own, and the rest of the chromosomes are really just intact. You can also, even from these targeted panels, start to at least act like a cytogeneticist. In this case, all of the p-arm is completely lost. The q-arm is actually gained. This is actually indicative of an isochromosome, where you've lost one piece of the chromosome and the other piece has then been duplicated. 
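The idea of reading gene-level gain and loss off panel coverage, as described above for EGFR, can be sketched as a log2 ratio of tumor to normal depth per gene. This is a minimal illustration and not the method used for the slides. The function name, the gene labels other than EGFR, and all depth values are made up, the median is used as a crude normalization for sequencing depth, and every gene is assumed to have nonzero coverage in both samples.

```python
import math
from statistics import median


def log2_copy_ratios(tumor_depth, normal_depth):
    """Per-gene log2(tumor / normal) after median-normalizing each sample.

    Values near 0 suggest no change, positive values a gain and
    negative values a loss, relative to the typical gene on the panel.
    """
    genes = sorted(tumor_depth)
    tumor_median = median(tumor_depth[g] for g in genes)
    normal_median = median(normal_depth[g] for g in genes)
    ratios = {}
    for gene in genes:
        tumor_rel = tumor_depth[gene] / tumor_median
        normal_rel = normal_depth[gene] / normal_median
        ratios[gene] = math.log2(tumor_rel / normal_rel)
    return ratios


# Made-up mean depths for a tiny hypothetical panel.
tumor = {"EGFR": 800, "GENE_A": 100, "GENE_B": 100, "GENE_C": 110}
normal = {"EGFR": 100, "GENE_A": 100, "GENE_B": 100, "GENE_C": 100}

for gene, value in log2_copy_ratios(tumor, normal).items():
    print(gene, round(value, 2))
```

In a real sample the ratio is dampened by normal cell contamination, so the same purity and ploidy considerations discussed earlier apply when turning a log ratio into an absolute copy number.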
So you have what appears to be a regularly sized chromosome, but the one arm is actually composed of genetic material from the other. And this is actually diagnostic of specific cancer types. One benefit of whole genome sequencing and RNA sequencing is that the exact break points of these rearrangements can be detected by examining reads that span this region. So this is an example in lung cancer where there's an activating deletion in the C-terminal of EGFR. You can see the wild type reads just span the region as they normally would, but where you have a break point, these reads actually break and then continue on the other side. And this actually goes back to your question earlier about the exon, where you sort of have these gapped alignments, really these reads where a portion of the read could be thousands of bases or even megabases away from the other part of the read. And this is really where you need aligners that know that this can happen and then have the ability to break your read across what is a very large piece of the genome in your reference genome, bringing together sequences that are actually right next to each other in your real cancer. Yeah, it really depends on the aligner that's being used. Certainly one or two bases are not going to cut it. I actually like to determine where these break points are using reads with good read support on the other side, and then fish for these little edge cases to further support the break point. So just for discovery, you really want nice robust coverage on either side, 30, 40, 50 base pairs. But that's not to say that a five-mer is not of any use if you already know that the break point is there from other reads. So you can do a sort of, almost a local realignment, or, I was about to say de novo assembly, which isn't quite right, sort of a guided assembly from your other reads that are already there. Yeah. That's right. And then the gray are the wild type reads. 
So in this case, you actually have no deletion-supporting reads in the middle, in that region. All the reads that are in the deletion region are wild type, so either from the non-deleted allele or from normal cells. So here's what real data looks like. So here are two famous translocations, EWS-FLI1 and PML-RARA. So the way to read this is that gray are all the reads that properly map to the reference. And the colored reads are actually all mismatches, because they actually don't match to this region. They actually match to the translocation partner. And this really illustrates how clean next generation sequencing data can be, because right to the base, all of these reads are supporting the exact same break point. Even though they're definitely different DNA molecules, you can really read in the normal sequence and find the break point. And since in this case it's a reciprocal translocation, you have reads on the other side of the break point where all the bases do map to this region, you then hit the break point, and then you have all these mismatches that actually map to the other translocation partner. And the same story here, really right to the base there. Actually this example is quite nice, because there's actually an untemplated addition. You can see there's actually an overlap of mismatches. So when the rearrangement happened, the cell attempted to repair it by putting in additional bases, none of which map to either of these two genes. These are actually just nonsense sequences that were put in during the process of DNA repair. Yes, exactly. So in these IGV screenshots, all the bases that map to the reference are gray, so you don't see any variation, and color means they don't map to this region. And the reason they don't map to this region is because they actually map to the translocation partner. And this can be extraordinarily complex. 
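The observation that many independent reads all stop matching at exactly the same base can be turned into a simple count. The sketch below is one way to do that from alignment start positions and CIGAR strings, where the mismatching tail of a read is recorded by the aligner as a soft clip (`S`). It is an illustration and not the approach used for these screenshots. The function names and the example reads are made up, positions are assumed to be 1-based as in SAM, and hard clips are not handled.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def clip_breakpoints(pos, cigar):
    """Return candidate break points implied by soft clips on one read.

    pos   : 1-based leftmost aligned reference position
    cigar : CIGAR string, e.g. "60M40S"
    Each break point is (reference_position, side_of_read_that_is_clipped).
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    breakpoints = []
    if ops and ops[0][1] == "S":
        breakpoints.append((pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(length for length, op in ops if op in REF_CONSUMING)
        breakpoints.append((pos + ref_len - 1, "right"))
    return breakpoints


def breakpoint_support(reads):
    """Count how many reads support each candidate break point."""
    counts = Counter()
    for pos, cigar in reads:
        counts.update(clip_breakpoints(pos, cigar))
    return counts


# Made-up reads: two different molecules clipped at the same base, one clean read.
reads = [(1000, "60M40S"), (1010, "50M50S"), (1020, "100M")]
print(breakpoint_support(reads).most_common())
```

A position supported by many clipped reads that start at different places is much more convincing than one supported by a single read, which is the same reasoning as wanting good coverage on either side of the break point.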
This is a figure from a prostate cancer paper, really just illustrating a four-way translocation between TMPRSS2, ERG, an intergenic region, and THRAP3, and really the postdoc who did this actually just went through and manually mapped all of these by hand. There are now algorithms to do this. There's no evidence whatsoever of these rearrangements in the normal data, but you can really see how basically these chromosomes had this catastrophic event and then were stitched back together completely out of order. This actually brought together the two genes, resulting in a fusion protein that frequently drives prostate cancer. And I did want to touch on this concept of highly rearranged versus chromosomally quiet genomes. So in this case here, this is a Circos plot, just like I showed in the very early slide. The way to read this is that around the outside are all the chromosomes, and all the arcs on the inside in this depiction are rearrangements of those chromosomes. So in this case, these are three lung cancers along the top, medium, medium, and highly rearranged tumors. Contrast those to pediatric cancers, in this case neuroblastoma. No rearrangements, one rearrangement. There's actually an interesting report here of chromothripsis, where there are really very, very few rearrangements across the entire genome, apart from one chromosome that was massively rearranged. So there's a catastrophic event specifically of a single chromosome that was then repaired. So you do have a chromosome, but you have a high level of copy number gains, copy number losses, and abnormal break points specifically focused on a single chromosome. Really going back to: if it can go wrong, it really does at some point in some tumor somewhere. I did want to touch on transcriptome sequencing as well. So what I've talked about so far is just changes in chromosomes and DNA. These of course are read out into RNA. 
RNA sequencing is perhaps most powerful in telling you what exons cells are actually using. So even if you have a mutation there, you do want to know whether it's being expressed and what the functional effect is. This is an example of a colorectal cancer cell line that's treated with 5-fluorouracil. So the way to read this: here are all the exons of the UMPS gene, and we're interested in this specific exon, exon 2 here. You can see the normal cell uses exon 1. It uses an exon 1-2 junction. It never uses an exon 1-3 junction. So it always goes 1, 2, 3, 4, 5, 6. It never jumps from 1 to 3, or it's very, very rare. And then 2 to 3, and the rest of the gene is really intact. When you treat the cell line with 5-fluorouracil, what actually starts to happen is it starts to skip exon 2. So it's still using exon 1. You do see exon 1-2 junctions, but you also see exon 1-3 junctions, actually at the same level as 1-2. So what's actually happening here is that exon 1, rather than reading directly into exon 2, is skipping completely into exons 3, 4, 5, 6. In this case, this confers resistance to 5-fluorouracil. This is something that's not at all evident from DNA sequencing. The DNA is all intact. There are no mutations in this region. Only by RNA sequencing could you have appreciated this shift in which exons are being transcribed and used by this cell. It's very similar to that KRAS example I showed, where you have a decrease of expression of just one exon amongst the entire gene. So the way these patterns are read out is just by coverage. If you have additional reads on additional exons, this corresponds to expression of that exon. I believe there's a workshop specifically on gene expression profiling. You'll learn how to make heat maps like these, where really the heat in next generation sequencing is the number of reads, often corrected for some other feature of the gene, such as gene length or GC content. 
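The exon 2 skipping described above can be summarized as a single number from junction read counts, often called percent spliced in. The sketch below is an illustration under simple assumptions, not the analysis behind the slide: the function name and the counts are made up, and inclusion is taken as the average of the two junctions that flank the exon.

```python
def percent_spliced_in(upstream_junction, downstream_junction, skip_junction):
    """Fraction of transcripts that include the middle exon.

    upstream_junction   : reads spanning exon 1 to exon 2
    downstream_junction : reads spanning exon 2 to exon 3
    skip_junction       : reads spanning exon 1 directly to exon 3
    Returns None when there are no informative reads.
    """
    inclusion = (upstream_junction + downstream_junction) / 2.0
    total = inclusion + skip_junction
    if total == 0:
        return None
    return inclusion / total


# Made-up counts: untreated cells never skip, treated cells skip about half the time.
print(percent_spliced_in(50, 50, 0))
print(percent_spliced_in(40, 40, 40))
```

Seeing the 1-3 junction at the same level as the 1-2 junction corresponds to a value of about 0.5 here, meaning roughly half of the transcripts leave exon 2 out.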
The whole point of this is to look for specific genes that are expressed or have more reads in certain cell types versus others, and begin to cluster these, especially across a large number of tissues or a large number of tumors, often in the context of some clinical question, such as response or resistance to therapy. The other powerful use of RNA sequencing data is really just to use it to compare transcriptional patterns across not just cancer types, but also across normal tissues as well. So this is actually clustering of data from normal tissues from the Genotype-Tissue Expression project, or GTEx. You can see here the samples are colored by the annotation, not by the clustering. So in this case, the brains have all clustered together. You can see all the purple labels are together, all the bloods clustered together, the muscles clustered together. And what we were interested in in this case is a very unusual lung tumor, seen almost exclusively in women, that actually clustered very closely with muscle. And actually, as we dove into this even further, it specifically clustered with uterine smooth muscle. So this gives some hint at the potential cell of origin for this unusual cancer type. This also has implications for trying to get at tumors of unknown origin, where there's a metastasis, but there's really no hint of what the primary tumor may be. These types of analysis can then become very powerful because with RNA sequencing, since you are literally just counting reads, you can really start to compare these expression profiles across centers, across labs, and actually potentially across the world as well. I mean, all this data comes from a large project in the United States. So in this case, yes, this is unsupervised clustering, and there's going to be a whole workshop on different approaches and methods specifically for clustering gene expression data. But in this case, this is differential gene expression. 
I mentioned earlier aligning reads to non-human genome references. This is one example where we were able to pick out pathogens that we didn't necessarily expect were there. In this case, this is a Circos plot, not of the human genome, but of the Epstein-Barr virus. So in this case, we aligned all the RNA sequencing reads to the human genome reference, but we saw that the percentage of reads that were unmapped was unusually high for this case. So we took all those reads, mapped them to a database of viruses, and we were really almost able to reconstruct the Epstein-Barr transcriptome. And we subsequently stained for this specifically back in the tumor, and only the tumor cells were positive. So in this case, we were uncovering, actually changing the diagnosis of this case, specifically diagnosing a lymphoepithelioma-like carcinoma, a far rarer lung cancer than the one we were interested in, which was lung adenocarcinoma in this case. So I'm just coming up on an hour here. We just have a few more slides left, specifically on that early part in my cancer timeline plot, really getting into germline cancer genetics, or you may hear the term hereditary cancer syndromes. And the point of this slide is really to not necessarily focus specifically on cancer-derived fragments, but also to look at the normal sequence data as well. And it's particularly important in the context of a second hit. So in this case, this is a famous tumor suppressor, the retinoblastoma gene, and in this cartoon, there's actually a patient who's born already with a single loss of function mutation. And this is particularly important in terms of somatic genetics. Really, this individual is just waiting for that second hit. 
They've already lost one of their two copies of a tumor suppressor, and given the many different ways that the cancer genome can go wrong, there are actually several ways that that second copy of RB1 can then be lost: a local event, just like a second mutation, somatic recombination, just duplication of that allele, deletion, or in the cancer, just complete chromosome loss. So you already have loss of one allele, then simply loss of the second allele. Now you have those two hits of a tumor suppressor that can then drive cancer. And here's an example in medulloblastoma, where there's a child with a heterozygous deletion in SUFU. So it's a tumor suppressor gene specifically associated with this brain cancer. So you can see half the reads have the deletion, half the reads don't. This is just a normal blood sample. You can see here in the tumor, both in the microarray and the sequencing data, this huge deletion: all of chromosome 10 is lost. And in the resulting data in the tumor, every single read has the deletion due to the loss of this chromosome. So really, there's no source of intact reads whatsoever. Every single read is coming from a single allele in this case. So it's one thing to find variants. The real challenge in both research and clinical genetics is to really use annotation to start to guide interpretation. The real point is to actually start to link the host of cancer genome variants to specific, ideally clinical, action. This is really the act of annotation and interpretation. This is really what differentiates labs in general, because there are so many different ways to both look at the genome and then interpret it. So how do we interpret variants en masse? And this is really where there aren't necessarily strict guidelines. 
Currently, this is a relatively manual approach where you're really looking at clinical indication and history, and locus-specific and internal variant databases. A lot of clinical labs have PhD scientists who log into these databases, look up specific variants, ask whether other labs have seen this before, read papers, and use the many, many bioinformatic predictors that will attempt to score whether a mutation is pathogenic or has a functional effect. There's a host of splicing prediction tools: if there's a mutation near a splice site, is it predicted to affect splicing? This is enormously time consuming, trying to go variant by variant, and this is really where bioinformatics has a huge additional value. And this is really where I wanted to bring up command line versus manual annotation. All of those databases are indexed, they're something we can query, and they're all searchable. You can search them yourself by hand. There are tools such as Alamut that attempt to show all these annotations side by side. We're also very good at making enormous tables. So one such tool, Oncotator, will query all of these databases. Every line can be a different variant. We really start to attach information to all of these variants to attempt to make sense of them. There are labs all around the world that are specifically looking at these variants in their specific research programs, and these become very, very valuable when you start to stick all this information together and look at it much more holistically. So I think more traditionally we've done this manual curation, trying to look at variants one by one. But certainly this is the age of bioinformatics, where we really can start to look at virtually every database available to us and then use machine learning techniques to start to pick out specific patterns and to start to link the observation of variation to actual function. 
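The "enormous table" style of annotation described above, with one line per variant and one column per database, boils down to a keyed lookup. The sketch below shows only that shape. It is not how Oncotator or any other real tool works internally, and everything in it is hypothetical: the function name, the database names, the gene and variant labels, and the values are placeholders invented for illustration.

```python
def annotate(variants, databases):
    """Attach one column per database to each variant.

    variants  : list of dicts with at least "gene" and "protein_change"
    databases : {database_name: {(gene, protein_change): annotation}}
    A variant missing from a database gets None in that column.
    """
    rows = []
    for variant in variants:
        key = (variant["gene"], variant["protein_change"])
        row = dict(variant)  # copy so the input is left untouched
        for name, table in databases.items():
            row[name] = table.get(key)
        rows.append(row)
    return rows


# Placeholder variants and databases, invented for this example.
variants = [
    {"gene": "GENE1", "protein_change": "p.X100Y"},
    {"gene": "GENE2", "protein_change": "p.Z5W"},
]
databases = {
    "internal_db_seen_before": {("GENE1", "p.X100Y"): True},
    "predictor_score": {("GENE1", "p.X100Y"): 0.9, ("GENE2", "p.Z5W"): 0.1},
}

for row in annotate(variants, databases):
    print(row)
```

The design point is that each database stays a separate column, so a missing entry is visible as a gap rather than silently dropped, and adding another source is just one more lookup table.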
So in clinical labs today, this is a table from the American College of Medical Genetics, really just trying to say here are the rules, here is the way that you should interpret a variant. There are relatively few databases that you should be looking at. Paper reading is a big part of it. This is published, down there at the bottom. But there's no way every lab is going to be able to look at and interpret every single variant. And the only real way to start putting cancer genome variation and interpretation into the clinic is to start to share these interpretations. We're all reading the same papers over and over in parallel. We're really missing an opportunity to start to share and really understand what these variants are actually doing. So the rest of today is really about what databases are out there, and how we can go about mapping these variants to specific databases and to specific function. In the clinical genetics lab, this is really the ultimate goal at the end of the day. How do we report and share these results? So here are three mutations in three famous cancer genes, with a large block of text to describe only one of them. You can see this is really not going to scale very well for whole exome and whole genome sequencing. If you have a list of 100 such variants, a lot of them are not necessarily functionally relevant or clinically actionable, clinically actionable meaning you would go on to a specific clinical trial or get a specific treatment. That being said, we do want to know what other variants have been seen by other labs, and really what this interpretation looks like at scale across many, many centers. So one such project to really get these types of groups talking to each other is called AACR. This is really an effort to get these centers here on the right to share all of their molecular data. 
Princess Margaret is one of these, really trying to build the bioinformatics infrastructure so we all query the same databases, and to make sure that all of our data are interoperable with one another right down to individual variants. We want to make sure that we're either interpreting variants in the same way, or at least that we can have a dialogue around why we think specific variants should be actionable or should not. So my last content slide is really trying to get at what the molecular report of the future is. We're trying to move away from this black text on a white page depiction of the genome, and trying to get at not just a snapshot in time, but how these genomes and variants can actually change over time, especially given the huge molecular complexity of copy number, rearrangement, point mutation, germline variation and so on. So the way to read this: this is cBioPortal, which I believe you're going to be spending a lot of time on during the workshop. In this case this is a single patient who had four tumors taken over time, one, two, three, four, these little yellow circles. The lines are the clinical interventions, and all the genomics data is down here. So there are four tracks, one for each tumor, for the copy number profile. You can see they're largely related, but different tumors do have different copy number alterations. And then one, two, three, four, each little green mark being a point mutation. You can see down here the three mutations that are shared across all four of them, but you can see certain mutations come and go depending on the subclone that was present or whether it was a metastatic versus primary site. So that's it for content from me. Are there any questions before we go quickly into the case study?
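The shared-versus-private reading of the four mutation tracks can be sketched with plain set operations. This is only an illustration of the idea, not how cBioPortal computes anything. The function name `shared_and_private`, the tumor labels, and the mutation identifiers are hypothetical placeholders, not data from the patient shown on the slide.

```python
from typing import Dict, Iterable, Set, Tuple


def shared_and_private(
    mutations_by_tumor: Dict[str, Iterable[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return mutations seen in every tumor, and mutations unique to each tumor."""
    tumor_sets = {tumor: set(muts) for tumor, muts in mutations_by_tumor.items()}
    # Shared: present in every sample from this patient.
    shared = set.intersection(*tumor_sets.values())
    private = {}
    for tumor, muts in tumor_sets.items():
        # Everything observed in any of the other samples.
        others = set().union(*(m for t, m in tumor_sets.items() if t != tumor))
        private[tumor] = muts - others
    return shared, private


# Placeholder identifiers standing in for point mutations in four serial tumors.
tumors = {
    "tumor_1": ["m1", "m2", "m3", "m4"],
    "tumor_2": ["m1", "m2", "m3"],
    "tumor_3": ["m1", "m2", "m3", "m5"],
    "tumor_4": ["m1", "m2", "m3", "m5", "m6"],
}
shared, private = shared_and_private(tumors)
print(sorted(shared))
print({tumor: sorted(muts) for tumor, muts in private.items()})
```

In this toy input, m5 is neither shared nor private: it appears in two of the four samples, which is the "come and go" pattern a subclone or a site-specific lesion would produce.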
So we have very powerful computational tools to add annotations, and I think my point here is that we can make these enormous columns, but it's very easy to lose the context of a lot of these data. I find this is a bit of a balance between the act of annotation and the act of interpretation, and that's where this interactive approach can be very powerful. You have the conservation across species right next to the protein domains, next to all the other genetic variation that has been seen, next to all the somatic mutations that have been reported in other cases, whereas a lot of this context and relationship between databases can often be lost in these large data tables. So we're trying to think about methods that give you this rich context of annotation but also give you the breadth of annotation across the large number of databases that are currently available. The UCSC Genome Browser is another excellent resource for this, where you can load hundreds of tracks and put all these tracks right next to each other, just so you do have this data context around each of your variants.
So this is published, you can read it. It's actually been a while now, back in 2010, so about six years. This is a case we saw at the BC Cancer Agency, a rather unusual tumor with really no treatment options at that point. We performed several large-scale genome analyses to try to guide this patient's management past standard of care, and then in response to the treatment that was recommended by genome analysis. So a 78-year-old man, fit and active, presented in August with throat discomfort and had a mass at the base of the tongue. There was really no obvious reason for cancer: non-smoker, non-drinker, cancer in an elderly man.
PET scan and subsequent biopsies: the way to read the PET scan is that this is basically a highly metabolically active tumor. By pathology this was scored as a papillary adenocarcinoma, so likely a salivary gland tumor, relatively rare. Surgery and follow-up pathology: the mass was taken out, a 1.5 centimeter mass, and the lymph nodes were involved, so three of the 21 neck nodes that were removed had metastatic disease. He had radiation therapy, good quality of life, returned to work, but then numerous small metastases in both lungs. So a very serious development, not something that you're going to cure necessarily by radiation or surgery. So the question for him was what next. Standard of care has not served him well, and he already has metastatic disease. There's an EGFR trial opening, so one of the targeted therapies I mentioned earlier was just opening up at the cancer agency, and the pathologist scored this tumor as expressing EGFR, so at least there was a target to drug. So for treatment in this case he went on erlotinib, an EGFR inhibitor. There was no molecular data at all, so we didn't know if the tumor was EGFR mutant or not; this was relatively early days. In this case he had a six week trial of an EGFR inhibitor. We knew EGFR was expressed. All the pulmonary nodules grew while on erlotinib, so all these lung metastases are growing and this drug is definitely not working. The largest lesion grew from 1.5 to 2.1 centimeters, and they discontinued that drug. So really, what's next? Right now the oncologist is thinking palliative care, really starting to run out of options. And I'm going to keep this little timeline going: initially presented, surgical resection, had radiation, then lung metastases, EGFR expressing, and erlotinib failed quite quickly. So in keeping with the theme of this lecture, what are the targets in our case? Having exhausted standard of care, they turned to the Genome Sciences Centre. They had REB approval of course, a large team got together
and approved a protocol for this one case, and the patient consented to full genomic sequencing and analysis, with the understanding that no one had really done this before and that any treatment options suggested would be novel. So a fresh frozen biopsy was taken specifically for RNA sequencing. Here's the mass, here in the lung, and here's the fine needle aspirate. It's a mixed bag of cells, just like we discussed earlier. On pathologist review they scored it as 80%, so relatively high tumor content. We're confronting this problem of tumor purity: in this case, 80% of our reads are going to come from tumor and 20% from normal. And we also had formalin-fixed paraffin-embedded DNA, that highly degraded material I showed you earlier. So this analysis actually uncovered relatively few mutations: in this case, two mutations in tumor suppressors, p53 and the RB gene, retinoblastoma, as well as two genes that we really could not comment on. The mutations in these genes were not well understood, and they continue not to be well understood. So two novel mutations of unknown significance. This is really the bane of cancer genome analysis in general: we always find hundreds of mutations that we've never seen before. And we confirmed all of these by a secondary method. I showed you PacBio sequencing earlier; in this case we used Sanger sequencing, and all four mutations were confirmed. Here's a Circos plot of the copy number alterations, so just like I showed you earlier, additional data or missing data, gains and losses. In this case it nominated relatively few, at least interpretable, copy number alterations: loss of SMAD4 with small deletions inside of it; p53 loss of heterozygosity, so that's consistent with the mutation, you have one mutation in p53 and loss of the other allele; amplification of a gene including the MAP kinase domain; loss of PTEN, this is a tumor suppressor as well; and high
level amplification of the RET gene, so an oncogene. We were also able to explain the immunohistochemistry result: in this case there is an EGFR amplification, resulting in increased expression, but no mutation here. So it's perhaps not surprising that the EGFR inhibitor didn't work; the tumor didn't have the exact mutation that would have suggested that treatment would be successful. Loss of RB is also a... never mind. Just like we do for mutations, we want to confirm that the copy number alterations were correct. I mentioned loss of PTEN on the previous slide; it was also confirmed by FISH, so in this case we're only seeing one probe where we should have two, and focal amplification of RET as well. So this tumor is really being driven by relatively few molecular alterations. We also did RNA sequencing for this case. We didn't have GTEx or thousands of other tumors at the time; we just had whatever the genome center had on hand, so in this case it was comparing this salivary gland tumor to 50 other tumors and the matched blood sample. It could be much more sophisticated nowadays in terms of properly matching your tumor to your cell of origin. Regardless, even with all its warts, we were able to confirm the deletion of SMAD4: expression is very, very low, it's deleted and downregulated. RET is in the top five expressed genes, again very complementary to the copy number alterations, so really illustrating the value of having two different genomic views of the same cancer, copy number change and increased gene expression. And PTEN: there's a single copy loss and it was significantly underexpressed, again the copy number nicely mirroring the gene expression values. And so they mapped all of these to relatively few pathways. The way to read this plot is up here on the left, but basically stars are copy number alterations, red being gain, green being loss, and then diamonds depict expression level, and so you can see RET up
here at the very, very top. We're really just trying to map that list of variants to the pathway. So RET is at the very top, driving expression of this pathway, and you can see the downstream genes are also expressed. There's PTEN loss, so PTEN is the tumor suppressor of this pathway, and you can see it's also lost. So a lot of these data, this list of alterations, is really having us converge on a relatively singular pathway. The point of this analysis is really to recommend treatment, so at the end of the day they presented a list of four treatments based on these target alterations here on the left. Upregulation of that pathway, amplification of RET, and loss of the suppressor of that pathway led to this interpretation, specifically saying this host of targets may be affected by these drugs. At the end of the day the oncologist chose the first one, both for the fact that it's a relatively dirty drug, it actually hits a large number of the candidates in our pathway, but also for its very low toxicity. This patient is already very sick, so there's this balance between perfectly matching the molecular profile and the clinical reality. And amazingly it actually worked: within four weeks there's a 22% decrease in the tumor. So the way to read this, here's the time down here along the bottom. The biopsy was taken here, it took a month to do the analysis, so treatment actually started a month later. Looking at the nodules, even in the time it took to do the genome analysis you're having an increase of 27%, but then they shrink back down very quickly on this targeted therapy, and the patient actually stabilized for seven months. The sunitinib dose was reduced due to side effects, and repeated scans showed new nodules. As I foreshadowed earlier on with these targeted therapies, resistance is almost always inevitable, so in this case the existing lung metastases began to grow, so the oncologist just moved to the next treatment
on the list, a switch to sorafenib and sulindac. Again some disease stability: the disease actually stabilized quite quickly, and that continued quite robustly for three months, so really using even the existing molecular data to continue to steer treatment decisions in this single case. Unfortunately, again returning to the targeted therapy problem, recurrent disease did come back again after seven months. In this case it was recurrent disease at the base of the tongue, so even though there was surgical resection that tumor has now come back, and there's a new skin nodule as well. He's really doing quite badly, with deteriorating quality of life. So what changed? What is new in this cancer genome specifically in response to treatment? Is it subclonal, was there a new mutation induced by treatment, what changed? So in this case there was a biopsy of the skin metastasis, and it had many more mutations than the original case. Most of these are in genes that are very difficult to interpret. Certainly this looks like an interesting list, maybe not the olfactory receptor, but they look functionally relevant. Unfortunately no one had really ever seen these mutations before, and this is also an enormous red herring in cancer genomics. The simple presence of a mutation in a gene is not necessarily enough to call it actionable. It's really the ability to link a specific mutation in a specific case to either a 3D structure or one of these well annotated databases, rather than just saying the simple presence of a mutation is functional, which is definitely not true. In this case there is no evidence of these in the pre-treatment biopsy, even at low frequency. So even using those deep sequencing methods, PCR or hybrid capture, there was no shred of even one of these mutations in that original case. These appear to be new mutations. Mapping these to the exact same pathway, you can appreciate how much redder this pathway is: you can see almost every single member
in this pathway is now either upregulated or copy number altered, and there appears to be this new side arm through AKT signaling. So now AKT is highly expressed and copy number amplified, whereas it wasn't before. There's really been this molecular evolution of the tumor in response to treatment, either selection of a specific clone and then outgrowth, or new mutations. So at this point there's a real mixed bag of molecular targets, and they were considering a cocktail of targeted drugs. The big danger here is that this is very untested: you're not going to give four or five drugs that have never been tested together for cross-reactivity or side effects. And could we have detected these mutations pre-treatment? None of these mutations were evident before treatment started. Could resistance have been modeled and monitored? Would serial biopsies have helped there? Would a blood test have helped us detect this early recurrence and start to monitor these patients a little more formally over time? Unfortunately this patient was very, very sick right at the end and ultimately died. But you can really see the timeline of this case, going from where, in the absence of molecular data, all hope would have been lost here in August, to an extension of life of well over a year by using molecular data. Unfortunately the second biopsy, of the recurrence, showed just a host of molecular alterations that could not have been targeted in parallel, really illustrating this need to move this molecular profiling activity much earlier in the timeline for these patients.
So I'll leave it there. This is basically the coffee break, but I'll take questions now, what we can. Yes? The very first... you mean the formalin-fixed paraffin-embedded tumor? That was used for the DNA analysis, and then the fresh tissue was used for the RNA sequencing analysis.
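Going back to the 80% tumor content estimate in the case study, the effect of purity on what you expect to see in the reads can be written down as a small calculation. This is a simplified textbook-style model, not the analysis used in the published case: it assumes the mutation is clonal (present in every tumor cell), that normal cells carry two unmutated copies, and that reads sample alleles in proportion to copy number. The function name `expected_vaf` and its parameters are illustrative.

```python
def expected_vaf(
    purity: float,
    mutant_copies: int = 1,
    tumor_copies: int = 2,
    normal_copies: int = 2,
) -> float:
    """Expected variant allele fraction for a clonal somatic mutation.

    purity        fraction of cells in the sample that are tumor (0 < purity <= 1)
    mutant_copies copies of the locus per tumor cell that carry the mutation
    tumor_copies  total copies of the locus per tumor cell
    normal_copies total copies of the locus per normal cell (none mutated)
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if not 0 <= mutant_copies <= tumor_copies:
        raise ValueError("mutant_copies must be between 0 and tumor_copies")
    mutant_alleles = purity * mutant_copies
    all_alleles = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_alleles / all_alleles


# Heterozygous mutation in a diploid region at 80% purity.
print(round(expected_vaf(0.8), 3))
# Same mutation after loss of the other allele (one copy left, and it is mutant).
print(round(expected_vaf(0.8, mutant_copies=1, tumor_copies=1), 3))
```

Under these assumptions a heterozygous mutation at 80% purity is expected in about 40% of reads, and a mutation with loss of the other allele, like the p53 pattern described in the talk, in about two thirds of reads. Real data will scatter around these values with sequencing depth and subclonality.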
So there is this problem of using this very, very old diagnostic block to manage a patient today, which is really how a lot of molecular profiling is done currently. Say that again? Yes, exactly, we didn't have the RNA sequencing data from the initial case. If we were to do it again today, you probably could do an RNA sequencing analysis from FFPE, but no, not at the time, for sure. Yeah, and trials like this are being thought about, to sort of chase, I think the Elaine Mardis figure really shows it best, these clonal tides over time. I think the real challenge is trying to monitor whether what appears in the published figure is actually what is going on, especially for patients like this. Yes, I mean those are trials that are being talked about. It's very difficult to biopsy a patient every month, so I think having some molecular readout to link the clinical treatment to what's actually happening in the genome is really what's needed to inform those trials. But yeah, I mean that's certainly a model for some trials. Is this kind of case study routinely conducted in your hospital? Clinical whole genome sequencing and RNA sequencing is not, currently. Certainly there are large research protocols that have some flavor of this, and large molecular panels are being stood up at many of the hospitals. Certainly through Project GENIE, one of the major goals is to do this on large gene panels and then to share the data, so we don't come up with lists of mutations like this that we can't interpret. And really the BC Cancer Agency is the only group I currently know of that's doing integrated genome and RNA sequencing under a clinical protocol. But certainly that will come as costs fall and interpretation becomes more routine.
I think the big barrier here is not necessarily the generation of the data, but taking those lists, taking those copy number alterations and linking them back to treatments quickly, because the other challenge, especially with cancer management, is you can't have a six-month genome analysis or interpretation. You need something that can be reported back, not in real time, but in weeks, not months or years. What makes CML different? Does it get long-term durability? Yeah, so you do see secondary mutations. You do, but not in my view for mutations, right? It's a bit of a mystery. Exactly right. It's targeted therapy. It works. I don't know why certain cancer types do have these very long durable responses. Maybe it's immune-related. Something I didn't really talk about very much is the advent of immunotherapy that doesn't target tumor cells but actually targets latent immune cells to reactivate them, causing them to target the tumor. This may be part of the story here. Yeah, I mean this is a big open question in our field. Yeah, I guess the question is what is the residual disease and can you find it? I mean, there's also this risk of a highly malignant clone that's being kept in check by this bulk, less aggressive tumor as well. So yeah, these are tricky trials to run. Yeah, I don't have a direct answer for it. You were talking about CML. Leukemia is very different from solid tumors. Solid tumors are hard. Leukemia is liquid, right? So the drugs have a very easy way to get to it. Well, that's why most leukemias get nearly cured by chemotherapy. Yeah, and I suspect it actually is the background molecular footprint of the cell of origin for these two cancer types. And I think that footprint is out there. We've analyzed thousands of solid tumors and heme malignancies, but there's no direct path showing that CML is special, necessarily. Yeah.
It may just be much simpler from a cancer cell perspective. It seems there's a very strong single driver, rather than the cell trying to exploit multiple properties, whereas a solid tumor has multiple kinases. Yeah, it may also be time of detection as well, although CML can persist for a long time. Yeah, this is good coffee break talk. I don't have a great bioinformatics solution to this. Okay, well, great.