So, my talk today is going to be relatively introductory, specifically focusing on what the cancer genome is and how we use genomics and bioinformatics to measure it and extract, in my case, mostly actionable or clinically relevant variation. But I think this is much more generalizable to all sources of cancer genome variation. So my talk is really going to focus on what the cancer genome is and what sources of cancer genome variation can be measured by next generation sequencing, and hopefully set you up for a lot of the other workshops, which are going to go into much more detail in specific analysis areas. So the learning objectives of the module are really to think about how cancer genomes are different from normal genomes, from matched normal controls or across tissues; to understand different bioinformatic approaches to detect different types of changes in the cancer genome; and to learn about different sequencing approaches. This is probably the question I'm asked most frequently as a collaborator. What type of genomics should we do? Should it be DNA-based? Is it RNA-based? How and why do we tailor our genomic methods to ask specific scientific questions? And then I'm going to highlight how this is really starting to be used to impact patient care. And the last third of the talk is actually a patient case, an N-of-1 analysis we published at BC Cancer Agency some years ago, but it's still quite relevant today. I'll go into detail on just one patient's genomic journey with cancer. So I'm going to start with the absolute fundamentals. We are all made of cells. All genome analysis really has to start with this concept in mind. We're analyzing cells, and are we looking at single cells or an average of cells? Is it tissue? Is it a cancer mass? Really all of the thinking around genomics follows how we've been thinking about tissues and cancer basically since the discovery of cells.
We're all starting with this single unit: of course, brain cells, fat cells. And really any type of cell can essentially give rise to a cancer. We're really trying to figure out what makes cancer cells different from normal cells. So I like to start with the most personal of slides. This is literally one of my blood cells dropped onto a slide. In this case, this is all the chromosomal content of a blood cell. You can see when you just drop a cell, you basically get a scattering of chromosomes, assuming you've captured the cell at the right point in the cell cycle. In the old-fashioned days, genomics was literally cutting and pasting pictures like this to make paste-ups. So, you know, chromosomes don't naturally look like this. They look like this when you drop them onto a glass slide. And we would order these up and we would, by eye down the microscope, try to compare banding patterns across all these chromosomes. So in this case, nothing abnormal as far as I know, but it would be this very manual, cell-by-cell effort to look at changes in banding patterns that reflect changes in DNA. Very labor intensive, but still clinically informative. All of what we know about genetics really started here in the land of cytogenetics. And the reason I show you this normal genome is really for comparison with a cancer genome. So here's just one example. This is a brain cancer. You can tell just by eye, you don't have to be a cytogeneticist, that there are lots of additional chromosomes. There are missing pieces of chromosomes. Look at chromosome nine. You know, two of the chromosomes are missing huge portions of DNA. We have these deletions. There are insertions. Some of these chromosomes are longer than they're supposed to be. Some are rearranged and strapped to one another. And this is really still the fundamental complexity in cancer genomics. How do we measure and capture and interpret all of the cancer genome variation that's happening in each cancer type?
And this is actually just one cell. So it only compounds when you start to look at hundreds, thousands, millions of cells within a tumor. And then it squares when you start to look at these shifts in the tumor genome over time. And these are all themes that we're going to come back to throughout the course of the workshop. So it's a bioinformatics course. We of course aren't going to be looking at chromosomes or cytogenetics. We're going to be using next generation sequencing primarily. So this is a quote my previous supervisor always held up, and it still holds quite well today: all happy cells are alike; each unhappy cell is unhappy in its own way. This is a Circos plot. You see these a lot in The Cancer Genome Atlas and other publications. The way to read these is the exact same conceptual idea. The chromosomes, instead of being lined up side by side, are ordered as a ring around the outside. And what this plot shows is actually relationships amongst chromosomes. This isn't actually a cancer genome. This is just showing which regions of which chromosomes are similar to one another. And this little zoom-in here really highlights all the ways cancer genomes can differ from one another. Point mutations, the most famous: just single base pair changes in sequence. Copy number alterations: that's like that chromosome nine example, missing pieces of DNA, additional chromosomes, additional copies of DNA. Structural rearrangements, where we've strapped pieces of two individual chromosomes together in an abnormal way. This is a way to activate oncogenes or inactivate tumor suppressors, genes or regulatory elements, pathways.
These are all things we're of course going to talk about over and over, but the whole point of the slide is really this bird's eye view of the cancer genome: it is not a single type of cancer genome variation, but really a cooperation of many different changes, especially in very highly scrambled genomes like that brain cancer I just showed you. The other concept is not just looking at genomics at one point in time. It really is this concept of shifts in the cancer genome over time, not only as cancer develops, but also during treatment. This very famous figure is from a great review out of the Sanger Institute, and it really brings home a lot of the challenges and some context for the acquisition of mutations and structural alterations over time. Right from a fertilized egg, there is the slow acquisition of point mutations, often benign variation, just DNA damage being semi-repaired but retained in the genome, until the eventual acquisition of a driver mutation; they've marked those with a star here. Then the eventual acquisition of specific hallmarks of cancer as the result of mutations in the wrong place at the wrong time during cellular development. And then of course additional mutations introduced during the course of treatment. Over here, with chemotherapy, you get these absolutely shredded genomes in very, very late stage disease. And this plot really calibrates us as to the number of somatic alterations we'd expect over time and what we actually see in practice. Pediatric cancers are basically acquiring driver mutations much, much earlier in the life cycle of cancer. These are very low mutation rate cancers, driven by very different genetic mechanisms than what we see in very late cancers.
And it's actually quite informative to take knowledge from pediatric cancers and apply it to adult cancers and then back again, because even though the mutation rate is low, the oncogenic mechanisms you see in pediatric cancer are often used or co-opted in late stage tumors as well. And this is a very data rich plot. This was published out of the Broad Institute not too long ago. Each dot on here is a single tumor. The y-axis is the number of mutations per tumor. And the cancer types are basically sorted by their median mutation rates. You can see down here the baby rattles and blood drops. These are all the low mutation rate cancers. They most often arise earlier. But you can see even within each cancer type this huge amount of variation, even within pediatric cancers. I've actually highlighted two outliers. Here in neuroblastoma: very, very high mutation rate. This one tumor actually has just as many mutations as a late stage lung cancer. This is only a five-year-old child. There's no way that they've been smoking for life, of course. There must be some other mechanism for a high mutation phenotype in this case. And same story over here in this melanoma. Very, very low mutation rate, actually on par with a pediatric cancer. So even though this cancer looks like melanoma down the microscope, there's something fundamentally different about it. And I highlighted these because I know the answer, because we did next generation sequencing analysis. This neuroblastoma is actually the only pediatric cancer in that cohort with two hits in a DNA repair gene. So even though it's normally a low mutation rate cancer, it has basically lost its ability to repair its own DNA. So it has just piled up mutations over time, giving it a very distinct mutational phenotype from the other neuroblastomas in that cohort. And it really makes sense. Same story down here. This is actually the only melanoma that did not have a UV-induced mutation signature.
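As a minimal sketch of what "mutation signature" means computationally, the snippet below tallies single-base substitutions into the six conventional classes by collapsing each change onto the pyrimidine strand. It is an illustration only, not the analysis used for the cohort in the talk; the function names and the toy input are hypothetical. UV-associated melanomas are typically dominated by C>T changes, so a tumor whose C>T fraction is low would stand out in a summary like this.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def substitution_class(ref, alt):
    """Collapse a single-base change onto the pyrimidine strand (ref C or T)."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # A purine reference is reported as the change on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def spectrum(mutations):
    """Return the fraction of mutations in each of the six classes.

    `mutations` is an iterable of (ref_base, alt_base) pairs.
    """
    counts = Counter(substitution_class(ref, alt) for ref, alt in mutations)
    total = sum(counts.values())
    if total == 0:
        return {cls: 0.0 for cls in CLASSES}
    return {cls: counts[cls] / total for cls in CLASSES}


if __name__ == "__main__":
    # Hypothetical toy input, not real patient data.
    toy = [("C", "T"), ("G", "A"), ("T", "G"), ("C", "A")]
    for cls, fraction in spectrum(toy).items():
        print(cls, round(fraction, 2))
```

Published signature analyses usually go further and include the bases flanking each mutation (96 trinucleotide classes); the six-class version is just the simplest form of the same idea.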
So this is actually a spontaneous melanoma. It has the exact same cell of origin, it still came from a melanocyte, but it has a very, very low mutation rate. It just happened to hit the wrong gene at the wrong time in the wrong cell. And the way to read this yellow plot down here at the bottom is that you have all these different types of mutations. And this low outlier is this very, very thin, probably one pixel thick, mutation signature that's completely different from all the other melanomas. It's not hypermutated. It's still a cancer of a melanocyte. Just the mutational spectrum is completely different. You really wouldn't know that without bioinformatic analysis. This is a very classic slide. Of course, everyone should read the Hanahan and Weinberg hallmarks of cancer paper and the more recent reboot in Cell. The take-home point is that tumor cells basically acquire abnormal abilities and they co-opt normal behavior. So these are basically all the hallmarks. You should, of course, just read this paper. But the point is that these are pathways that are active in normal cells, and cancer is just turning on those pathways in the wrong cell at the wrong time by altering the genome, through cancer genome variation. And this really takes that bird's eye view down to specific, very famous cancer genes, really this idea of mutations specifically in oncogenes and tumor suppressors. And actually, this is still a fundamental goal of cancer genome analysis: trying not just to measure and count mutations, copy number alterations, et cetera, but actually mapping these to specific pathways and specific functions. And the goal here is really to differentiate mutations in drivers from a background mutation rate.
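To make the idea of separating drivers from a background mutation rate concrete, here is a deliberately simplified sketch. It assumes a single uniform background rate per base per tumor and asks how surprising the observed mutation count in one gene is under a Poisson model. This is not how any published tool works in detail (real methods model gene-specific covariates and mutation context); the function names and the numbers in the example are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Adequate for illustration; very small tail probabilities lose
    precision because of the subtraction from 1.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def gene_excess_pvalue(observed, gene_length_bp, n_tumors, background_rate):
    """Probability of seeing at least `observed` mutations in a gene by chance.

    `background_rate` is an assumed uniform rate in mutations per base
    per tumor.
    """
    expected = background_rate * gene_length_bp * n_tumors
    return poisson_sf(observed, expected)


if __name__ == "__main__":
    # Hypothetical numbers: a 1,500 bp coding region across 500 tumors,
    # with one mutation per megabase as the assumed background.
    print(gene_excess_pvalue(30, 1500, 500, 1e-6))  # far above background
    print(gene_excess_pvalue(1, 1500, 500, 1e-6))   # consistent with background
```

In a real cohort the same test would be run for every gene, so the resulting p-values would also need multiple-testing correction.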
Melanoma and lung cancer are actually great examples where you pile up thousands, tens of thousands of mutations, but really only a very small core set of those actually matter: activating mutations in oncogenes, loss of function mutations in tumor suppressors, and then additional mutations that enable tumor progression and evasion, or I guess acquisition of treatment resistance. Probably one of the most famous methods to do this is an algorithm called MutSig. So if you read one of these cancer genome analysis papers, you'll invariably come across one of these statistical models that basically looks at the background mutation rate across large cohorts of tumors and looks for genes that are mutated above that background rate. I believe you're going to have an entire workshop on this type of theme. Certainly my interest is trying to take what we see and count in the cancer genome and really try to do something about it. So here are two of the most famous examples. These are specific therapies that target activated oncogenes. In this case, lung adenocarcinomas with activating mutations of EGFR. There's now a drug, or actually several drugs, against the epidermal growth factor receptor. So if we have a tumor that we know has a mutation in the gene encoding this protein, clinicians will essentially deliver an inhibitor against that protein. Very, very similar here in melanoma. I don't know if you can see this; okay, good. Here's a patient with widely metastatic disease. The majority of those nodules have an activating mutation in BRAF. And you can see how effective these treatments can actually be. You're basically clearing out all these metastases. The major downside with targeted therapies is that, due to the plasticity of the cancer genome, resistance inevitably arises.
So this will buy you 12, 18 months, maybe two years of efficacy, but ultimately the cancer cells will come up with a resistance mechanism to avoid these drugs. I'll go into this in a little more detail, especially in the case report at the end. So I just showed these two examples. This is a big theme in precision medicine and targeted therapies in general. This is actually a meta-analysis of The Cancer Genome Atlas data. So The Cancer Genome Atlas is this huge effort to sequence at least 500 exemplar tumors of basically every cancer type. And this paper basically took all those somatic mutations and tried to map them to specific pathways, and specifically to pathways that had drugs against them. So I don't expect you to read this entire heat map, but the take-home messages are down here on the right. First off, druggable alterations cut across cancer types. So this is really that theme of the same oncogenic mechanisms being used across cancers. They really saw specifically the same mutations across different cancers. This is a big emerging theme in clinical trials these days: not necessarily saying you have a lung cancer trial or a breast cancer trial, but rather a trial that's guided by knowledge of molecular alterations. The other take-home message is vertical: combination therapies may be effective. So the way to read this, these are all the genes and these are all the cancer types. In this case, within a cancer type, you can actually have multiple alterations in two distinct druggable pathways. And this is the second theme in clinical trials, this concept of combination therapies. If we have knowledge of which pathways are altered in the cancer genome, attack both of those pathways simultaneously.
And number three, which I think was a little controversial at the time but may actually turn out to be true, is that 50% of tumors have at least two disrupted druggable pathways. In practice, the ability to map patients to clinical trials based on their molecular profile has hovered just under 10%. So this 50% may be a little bit of overinterpretation, but I think this is really where we're trying to go: measure the cancer genome and then act on it. The huge fundamental challenge with cancer and tumors in general is that tumors are not one cell or one homogeneous mass. They're rather a mixture of cancer cells, and actually not just cancer cells, but all the other infiltrating cells that make up normal tissues. There are blood vessels, there are fibroblasts. What this figure actually shows are all the different types of cancer clones that can be present within a tumor. So in this case, there are three clones: red, yellow, and blue. And we can see these macroscopically using fluorescence in situ hybridization or karyotyping, like I showed on the very first slide. And this presents a fundamental challenge because when we perform traditional cancer genomics, we're taking masses like these, grinding them all up, and looking at the average of their DNA and the average of their RNA. And this presents a major challenge in that the actionable mutation, a mutation that we think is driving these cancers, may actually only be present in one of the three subclones. And this is an active area of research these days: inferring the subclonal structure from bulk genomic data. And most recently, there's been the advent of single cell genomic technologies where, rather than grinding up these tumors and looking at an average, single cells can actually be encapsulated in oil droplets. And then we can start to do analysis literally at the single cell level.
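The averaging problem described here can be made concrete with a commonly used back-of-the-envelope relationship between tumor purity, the fraction of cancer cells carrying a mutation, and the variant allele fraction (VAF) expected in bulk sequencing. The sketch below is an illustration under stated assumptions (a fixed local copy number in tumor cells, diploid normal cells, no sequencing error); the function and parameter names are hypothetical.

```python
def expected_vaf(purity, cancer_cell_fraction,
                 tumor_copies=2, mutated_copies=1, normal_copies=2):
    """Expected variant allele fraction of a somatic mutation in a bulk sample.

    purity               fraction of cells in the sample that are cancer cells
    cancer_cell_fraction fraction of cancer cells carrying the mutation
    tumor_copies         total copies of the locus in each cancer cell
    mutated_copies       copies carrying the mutation in a mutated cell
    normal_copies        copies of the locus in each normal cell
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0.0 <= cancer_cell_fraction <= 1.0:
        raise ValueError("cancer_cell_fraction must be between 0 and 1")
    mutant_alleles = purity * cancer_cell_fraction * mutated_copies
    total_alleles = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_alleles / total_alleles


if __name__ == "__main__":
    # A clonal heterozygous mutation in a pure, diploid tumor.
    print(expected_vaf(1.0, 1.0))
    # The same mutation confined to one of three equally sized subclones,
    # in a sample that is 60% tumor cells (hypothetical numbers).
    print(expected_vaf(0.6, 1 / 3))
```

Under these assumptions a mutation in one of three equal subclones shows up in roughly one read in ten, which is why subclonal drivers are easy to miss at modest sequencing depth.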
So that's sort of the big transition point in our field these days: moving from bulk, average-based genomics into the single cell landscape. All of the work that I'm going to talk about today is really going to focus on traditional cancer genomics, which is basically taking a tumor, grinding it up, and then analyzing the DNA and RNA from all the cells as an aggregate. The reason subclonal structure is important is that cancer subclones can respond differently to therapy. So this is a now very famous figure, certainly very relevant today, in which you have an activating mutation very, very early on. In this case, this is leukemia. The model here is that you have a mutation in one of the very early progenitor cells of the leukemia, so you basically acquire these very fundamental driving mutations within an early clone. And this actually gives it the ability to acquire additional somatic alterations. So you have this enhanced chromosomal or genetic instability. And you can see here the acquisition of multiple clones. They've all been colored differently and they're all present at different frequencies over time. In this model, this is where the genome analysis was done. So in this case, they saw four distinct subclones, and they're all defined by different driver mutations, all of which have different color codes here. And the model here is that this patient then got chemotherapy, which contracted the clonal structure. So it actually killed off some clones; certainly the purple clone didn't even survive. It was very susceptible to chemotherapy in this case. But you can see this clone still survives. There's actually a mixture of the yellow and the orange clone. So even though the original clone is gone, you still have minimal residual disease, or in some cases still active, resistant disease.
And this is really the biggest enemy, certainly in medical oncology: these resistant clones that persist, acquire additional mutations, and can then become really quite aggressive. And the big challenge with cancer genomics is being able to measure these shifts in clonal structure and cancer genome variation more frequently over time. And this is really just drilling in even further. Not only do you see subclones within a tumor mass, you actually see subclonal structure across sites: when a patient has multiple metastases or multiple tumor sites within their body, you can reconstruct the clonal structure using bioinformatic analysis. So even if we have the primary tumor, it's not like the entire tumor is necessarily going to travel to another metastatic site. It may actually be a subclone that travels and forms a secondary metastasis. And another subclone that arises in the secondary metastasis can then travel and form a tertiary metastasis. So this becomes extraordinarily complex as we start to look at multiple metastatic sites, as these plastic cancer cells move and travel and establish themselves throughout the body. So I just want to step back a little bit and talk about why we would do this. Yes, it's extraordinarily complex, but there are still good reasons to do molecular testing for cancer. It's something that's certainly ongoing at most hospitals today. Treatment: that's the most obvious, really looking for actionable, druggable mutated oncogenes. Monitoring for treatment resistance: really trying to get at whether there are new mutations that are acquired over time, and whether treatment has to be switched on the basis of knowledge of new mutations. Hereditary cancer syndromes: I'll talk a little bit about this, where individuals are really born with their first hit. So they basically have a high predisposition to cancer.
So I'll talk about this concept of two hits leading to cancer. And prognosis, which I'll probably speak the least about. So this is all great at the population level. We have tens of thousands of cancer genome data points out there. We can literally download them off the internet. Some of you will be downloading them during the workshops this week. But for an individual patient, the question is: what are the targets in my cancer? For an N-of-1, for a single patient, it's not so much about whether there's a highly recurrent mutation in a specific cancer type. Where does the cancer differ from the matched normal? Is it in sequence? Is it in structure? Is it in function? Is there an external influence? Is it virally associated? And not just what is the list, but also what can be done about it. So this brings me to the meat and potatoes of the talk. I'd like to pause here in case anyone has any questions about fundamental cancer genomics, anything that can go wrong in DNA. Yeah, all right. Just a quick clarification: when you say subclones, is it a subpopulation? Is that right, or is that different from a subclone? I think conceptually, yes. I would basically say a subclone is a population. I like subclone because it implies that it's related to a progenitor cell at some point in time. I will typically use subpopulation when I'm talking about cancer and other cells, for example if there's an immune cell population as well. So I like subclone because it communicates this hierarchical progression of cancer. That's basically the method in which it was. It's a specific subset of the sensitivity. Exactly. With the implication that they're all genetically related to some initiating cancer, which is not to say that people don't get two cancers at once. That certainly happens for patients who have a germline predisposition. Yeah, great question. Any other points of clarification?
Okay, so we'll move on much more into the next generation sequencing piece. I'm not going to go into how the machines work; I'm really going to focus on what we do with data, what data looks like, and how we analyze it. But I did want to start here. I mean, you should of course know where your data comes from, certainly as a bioinformatician. So I wanted to show this very boiled down, general DNA sequencing workflow, and this really has not changed. We've had several versions of next generation sequencers, but the fundamental concept is very straightforward. Extract DNA or RNA as needed for the specific question you're trying to answer. Make a library of DNA fragments, and a library is basically DNA that's compatible with the next generation sequencer. So essentially you've taken the exact same DNA and just put little adapters on the ends that let the next generation sequencer read your DNA. That's all a library is. It's called a library because it's basically a big collection of DNA fragments. Once you have these adapters on the ends, the fragments are loaded onto your next generation sequencing device. There are many different technologies to do this. I showed a picture of an Illumina sequencer. That's certainly the most prevalent. That's what I use in my lab. And then computational analysis. And this sort of moves from most standard to least standard. Extracting DNA is relatively routine in most labs. Making libraries, putting adapters on, is custom to your machine. Different labs will use different next generation sequencing devices. And as you'll see this week, there are many ways to analyze data. And this is really the field of bioinformatics. So this is literally what you get from a next generation sequencer. This is just a text file. You can open it in Notepad on your Windows computer. And you'll just get a massive list of As, Cs, Ts, and Gs.
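For readers who want to look at such a file programmatically rather than in a text editor, the sketch below reads the widely used FASTQ layout, in which each record spans four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. It assumes the common Phred+33 quality encoding, which may not match older instruments, and the function name is hypothetical.

```python
import io


def read_fastq(handle, phred_offset=33):
    """Yield (read_name, sequence, quality_scores) from a FASTQ text stream.

    Assumes four lines per record and Phred+33 quality encoding by default.
    """
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield header[1:], sequence, scores


if __name__ == "__main__":
    # A single made-up record; real files hold many millions of these.
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for name, bases, scores in read_fastq(example):
        print(name, bases, scores)
```

Because the function is a generator that reads one record at a time, it does not need to hold the whole file in memory, which is the practical reason a plain text editor struggles with these files while streaming tools do not.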
And then a massive list of very strange characters that basically communicate the quality of your data. So each line here is a piece of DNA. This is a 75 base pair read. This is 25 of them. On one HiSeq 2500, you get 600 million of these, so that's actually one eighth of the total output. So just massive amounts of data. These are huge text files. If you do open one in Notepad, it completely breaks. We need more sophisticated text handling and bioinformatic methods to read these data. But this is literally the starting point. Anyone can open and look at As, Cs, Ts, and Gs. So you're going to hear this recurring theme of a pipeline. And a pipeline is basically just a list of computer commands that operate on data and put it into a new format, often with new annotation. So in this case, I've shown an oil pipeline. It's conceptually exactly the same: way off in the distance you put in your raw DNA sequence, which is this, and it basically goes through all these segments of pipe. You can see all these pipes have little connectors. It's exactly the same thing. We have little connectors between different computer programs. In the variant calling pipeline, you put raw DNA sequence in, it goes through many computer programs, and it comes out the other end as a text file of variants that you can then look at as a scientist and interpret. The hard part is building the pipeline. So that's what all these little segments are here. Here's an oil pipeline in progress. Each of these is a computer program, often made by a different lab. And one of the fundamental roles of a bioinformatician is to understand these tools at a fundamental level and put them together in an intelligent way. This is a very standard approach to calling variants: taking the DNA reads and aligning them to the human genome reference, then doing some preprocessing to make sure that alignment is accurate. Then calculation of QC metrics: is the DNA you put in making sense?
If it's a targeted panel, are all your reads in the targets that you would have expected? Then variant calling. And then I put interpretation and clinical report in gray. These are by far the most manual steps, certainly today. Certainly we've written software to help us interpret and report these data. But there's really not a nice piece of oil pipeline, a module that we can use to interpret and report these genomes. So I'm going to go into a little more detail on the first four steps and then show you some examples of cancer genome variation and how to interpret it. Oh yeah, so that was the conceptual idea, all the oil pipeline parts. This is what the pieces really look like. So certainly if you're going to be in the DNA or variant calling space, the very first place you should look is the Genome Analysis Toolkit best practices. This is from the Broad Institute, one of the very large genomics centers in the world. And they've really laid out how to do genomics their way, how to call variants using their software, which is called the Genome Analysis Toolkit, GATK. And each of these is a little pipeline module. So you take raw reads, you map them, you do realignment. Basically exactly like I just said, but this is a more formalized way of communicating the specific tools that are used to get you to analysis-ready reads. So you can see that, going from the original FASTA or FASTQ, the original DNA sequence data, there are many steps before you even have reads that are ready to be analyzed. And then these can be fed into variant callers. You can call different types of variation and ultimately get down to a list of variants that you can then interpret. So this is a framework for analyzing germline genomes. And for cancer genome analysis, we of course have our own pieces of software tailored specifically for cancer. And here are some of the tools there.
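One of the QC questions raised above, whether the reads from a targeted panel land in the regions they were supposed to, reduces to a simple interval-overlap calculation. The sketch below is a minimal illustration that works on plain tuples rather than real alignment files; it assumes half-open coordinates and a small number of targets, and the function names and coordinates are hypothetical.

```python
def overlaps(read_start, read_end, target_start, target_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return read_start < target_end and target_start < read_end


def on_target_fraction(reads, targets):
    """Fraction of aligned reads overlapping any target interval.

    `reads` and `targets` are lists of (chromosome, start, end) tuples.
    Each read is compared against every target on its chromosome, which is
    fine for a sketch but slow for large panels.
    """
    if not reads:
        return 0.0
    targets_by_chrom = {}
    for chrom, start, end in targets:
        targets_by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in reads:
        candidates = targets_by_chrom.get(chrom, [])
        if any(overlaps(start, end, t_start, t_end) for t_start, t_end in candidates):
            hits += 1
    return hits / len(reads)


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    panel = [("chr7", 100, 200)]
    aligned = [("chr7", 90, 165), ("chr7", 200, 275),
               ("chr1", 100, 175), ("chr7", 150, 225)]
    print(on_target_fraction(aligned, panel))
```

A low on-target fraction for a capture-based panel is one signal that something went wrong upstream, before any variants are called.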
So I've just overlaid their list of tools with some of the ones we use locally to call SNPs, indels, copy number variations, et cetera. And I think I'll probably just leave that there. There are many ways to call variants, and here's some of the software to do it. So this is the exact same data I showed on the previous slide, but now it's been aligned to the human genome reference. This is a snapshot from the Integrative Genomics Viewer. It's a great way to look at reads that have been compared to a human genome reference. So I said in that pipeline, the first step is alignment. That's basically taking all the As, Cs, Ts, and Gs and comparing them to the human genome reference, which literally anyone can download. It's only three gigabytes. It's something you can click on the link and download today. The way to read this plot: here's the chromosome along the top. And the red is this little piece of the chromosome where we've zoomed in. So we're relatively close to the centromere here. Here's the axis. So how many bases are we looking at? We're looking at 92 bases of sequence. On the previous slide, I showed you that this was a 75 base pair read. So of course, the reads aren't stretching across the whole region. These are the specific genome coordinates. This is coverage. So in this case, fairly even coverage across this 92 base pair segment. In white is the human genome reference. That does not change. You choose your human genome reference and do your entire project against that reference. There's been a shift relatively recently from hg19, human genome version 19, to a relatively new version, GRCh38. This is Genome Reference Consortium build 38. So definitely if you're inheriting data from other projects, my first question is always what genome build was used, because that will actually change the coordinate system that you're using to map and interpret variants. In this case, this is all hg19, the older build.
And in gray are all of the reads. In this case, here's the reference along the top. And I intentionally chose a region that doesn't have any variation. All these reads support exactly the same sequence. You can pick out by eye little sequencing errors here and there. But in general, all the reads look exactly the same. So this is what DNA looks like from the equivalent of my normal chromosomes I showed on the first slide. And there are many ways to do that alignment. Here are just some listed down here at the bottom. But basically you put your sequence file in, your reference genome in, run your module, and it gives you a file like this. This is called a BAM file, a binary alignment file. And this primary alignment is really the starting point for almost all bioinformatics, certainly in cancer genomics. And really the role is to operate on the alignments to infer multiple types of cancer genome variation. So this is a cartoon of the type of data I just showed you. Here's the human genome reference along the top, conceptually exactly the same as the primary data. Those reads I just showed you are, of course, just tiled across the human genome reference shown along the top. In this case we're looking for differences. So in this little cartoon, here's a single base pair change that's different from the reference. The reference is an A. A single base pair change like this is called a point mutation. An indel is an insertion or a deletion; in this case what they've shown is a missing base or multiple bases. So a very, very small type of cancer genome variation can still have a very large impact on the resultant protein that's translated. The absence of data is actually quite informative. Remember that chromosome 9 that was missing half of its genetic material? That actually shows up in our DNA sequencing data as no data at all. So we infer those as deletions. A hemizygous deletion shows up as half as much data as we'd expect.
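The depth-based reasoning here (no reads suggests both copies are gone, half the expected reads suggests one copy is gone, and by the same logic extra reads suggest extra copies) can be sketched as a log2 ratio of tumor to matched-normal coverage. The thresholds below are arbitrary illustrative values, not calibrated cutoffs, and the sketch assumes the two depths are already normalized for total sequencing output and that the sample is mostly tumor; the function names are hypothetical.

```python
import math


def log2_ratio(tumor_depth, normal_depth, pseudocount=0.5):
    """log2 of tumor over normal read depth for one genomic segment.

    The pseudocount avoids taking the logarithm of zero when a segment has
    no reads. Depths are assumed to be normalized for library size already.
    """
    return math.log2((tumor_depth + pseudocount) / (normal_depth + pseudocount))


def classify_segment(tumor_depth, normal_depth,
                     deep_loss=-3.0, loss=-0.5, gain=0.5):
    """Label a segment from its depth ratio using illustrative thresholds."""
    ratio = log2_ratio(tumor_depth, normal_depth)
    if ratio <= deep_loss:
        return "homozygous deletion"
    if ratio <= loss:
        return "hemizygous deletion"
    if ratio >= gain:
        return "gain"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical mean depths for four segments in a tumor/normal pair.
    for tumor, normal in [(0, 30), (15, 30), (30, 30), (60, 30)]:
        print(tumor, normal, classify_segment(tumor, normal))
```

In practice, contamination by normal cells pulls every ratio toward zero, so real copy number callers estimate tumor purity and ploidy rather than relying on fixed thresholds like these.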
Chromosomal gain: if you've doubled your genome, or if you've gotten four or five copies of a specific genomic region, that manifests as more data. So we infer those as gains. And translocation breakpoints. In this case the human genome reference for chromosome 1 is here, 5 is here, and these reads are actually a hybrid of those two chromosomes. The left read is on chromosome 1. The second read on the same DNA fragment is on chromosome 5. We infer this as a rearrangement. So chromosomes 1 and 5 have been brought abnormally together in the DNA of the cancer that we're analyzing. And the leftovers are quite interesting as well. We align everything to a human genome reference, and there's invariably a set of reads that don't align there. Sometimes these are still human. The reference just doesn't capture those sequences. But these may actually map somewhere else. There's a host of virus-associated cancers, HBV, EBV. That DNA or RNA comes along for the ride. We see it in our DNA sequencing data and we can read those out as pathogens. And down here in red are the various tools that you can use to call these types of variation.
Yes? What is the gray bit in each read?
So the gray is missing data. It's just to draw your attention here. Gray just means we've looked at that region in the human genome reference, but there are no reads there, which is very unusual. If we do whole genome sequencing, we'd expect reads to cover the reference completely. In this case, there's no data. Oh, yeah. So these are all DNA fragments. And the way common next-generation sequencing works is we'll just read the ends of the DNA and then we'll map the ends back to the fragment. Sometimes a fragment is bigger than our sequencing reads. I showed you an example of a 75 base pair read. But if our DNA happens to be 300 bases, that means we're not sequencing the entire chunk of DNA. Yeah. So that's actually a great question.
Depending on the way your next-generation sequencing service is configured, you might not read across the entire DNA fragment. Any other questions on this slide? This is actually a great place to pause.
Yeah? Is that what people refer to as paired-end? So this is paired-end sequencing?
Yeah, exactly. So this improves mapping, and it lets you pick up translocations like this. This is one configuration of a next-generation sequencing experiment.
Maybe I have a question about the hemizygous deletions. Typically you wouldn't expect both of these to be lost, right? Wouldn't that be more common than the total deletion?
Yeah, absolutely. Specific genes are commonly homozygously deleted, meaning both copies are deleted. More frequent is a single copy deletion, like the hemizygous deletion, compounded by a point mutation. And I'll come back to this idea of genes being hit by two different types of cancer genome variation. But yes, you're absolutely right. This is much more frequent than losing both complete copies of the gene. Great. Any other questions or comments? Okay, great.
This is a great review. It's seven or eight years old, but it's still completely relevant today. It has a huge table with a list of software, which is probably less relevant, but definitely all the concepts and ideas hold. This is still a figure I show in almost every talk, really just to teach the concepts of how to apply next-generation sequencing specifically to cancer.
Okay, so I did want to break down some of the terminology around how to configure your genomics experiment, moving from big to small. There are really three main focuses for cancer genomics that I'm going to talk about in this talk: DNA focused, RNA focused, and protein focused. All of these can be read out by next-generation sequencing. We're going to talk a fair amount about whole genome sequencing.
So this is just taking DNA and sequencing it, with no selection whatsoever. Then whole exome sequencing, where you introduce a laboratory step that essentially isolates only the DNA that maps to genes. It's only 2% of the genome. You're actually throwing out 98% of the genome, but it's the 2% that we know the most about, at least today. Targeted gene sequencing is what we most commonly use in clinical diagnostics due to cost, because the volume of patients is much, much higher. This is not whole exome. This is now zooming in on a handful of candidate genes. At Princess Margaret, that's between 10 and 500 genes, out of the 18,000 potential protein coding genes. Epigenome modifications: in this case you introduce a different laboratory step to reveal modifications to specific cytosines. So basically looking for regulatory modifications to DNA that aren't traditionally read out as A, C, T or G. There's actually a fifth base, actually multiple additional bases, but the main ones that are clinically used are methylcytosines.
The second category is RNA sequencing. This is where instead of extracting DNA from cells, we're extracting RNA, so the transcriptionally active part of cells, and sequencing that. And mRNA sequencing is conceptually very similar to exome sequencing, introducing a laboratory step to only look at protein coding or polyadenylated transcripts. I'm not going to be talking as much about protein-focused sequencing, but there are additional laboratory methods where you focus your reads specifically on regions that are wrapped around a protein or have a DNA-protein interaction. There's a very rich field of epigenome mapping, looking at the positioning of histones and nucleosomes within DNA.
So I did want to show some raw data. I'm going to try to show raw data as often as I can to get some of these concepts across. In this case, this is the exact same sample profiled by three different genomic configurations.
Whole genome sequencing, whole exome sequencing, and RNA sequencing. And this is focused on one gene, the KRAS gene, a very famous cancer gene that's frequently mutated across cancer types. This is an IGV snapshot. I've trimmed off the chromosomes, so you don't see them at the top and bottom, and I've zoomed way, way out. So instead of looking at just 92 base pairs like with those gray reads earlier, now we're looking at 49,000 bases. Each of these tiny gray ticks is a sequencing read, and the gene is down here on the bottom. I don't know how well that reproduced, but the exons are these thick vertical lines, and the introns are these little intervening segments.
And you can tell that with whole genome sequencing you just get an unselected sea of data. Basically you get wall-to-wall coverage, but for the same price, relatively low coverage compared to other more targeted methods. In this case, we've fragmented the genome, we've sequenced it in an unbiased way, and you get coverage everywhere. Now you have knowledge of intronic variation, promoter variation, deep regulatory variation, regions of the genome we don't necessarily understand yet. It's a great area for research, and there's lots to look at there. And you can see, just by eye on the slide, that most of the genome is not exon, it's not protein coding. These little colored marks at the top are all differences compared to the reference. So there's a lot going on in the non-coding region.
Exome sequencing is really focusing our attention, focusing our reads, specifically on exons, which we understand a little bit more, certainly than the intronic and non-coding regions. In this case you can see all the sequencing reads are piling up specifically where we want them to. We've introduced this laboratory modification, and you can see the reads are mapping specifically only to exons.
You basically have no coverage whatsoever, or very, very little coverage, outside of exonic regions. You still get variation. You can see that within exons, of course, we have mutations and normal polymorphisms. And RNA sequencing is really the great uniter of these two concepts, because we let the cell do the selection for us. In this case, the reads are still piling up only in the exonic regions. But you can see these lines here. These are bridging reads: since RNA splices out the introns, we can actually tell which exons are linked to which exons, because a read might start in one exon and then continue in another exon. If we have paired-end sequencing, you can tell the two ends of the fragment map to two different exons. So these lines here aren't necessarily sequencing reads, they're just showing a relationship between exons. That's really not apparent in exome or whole genome sequencing.
Yes? Could you just comment on the "times" notation, where we have 40 times?
Yeah, so the "times" here is, on average at a single base, how many reads you have at that position. And this is really a function of cost. If you order a genome, this is typically what you need to call a variant. We want 40 reads, and if we're looking for a variant that's on one of the two chromosomes, we'd see 20 of the reads have the variant and 20 won't. And then there's a lot of bioinformatic analysis to try to call variants at lower coverage. Or certainly, an interest in my lab is to try to call subclonal variants using much, much deeper coverage. So trying to trust three or four reads out of hundreds or thousands. And I'll come back to some data examples, specifically around variant calling and shallow versus deep data.
Yeah? Maybe you're going to come back to this later, but there are actually still quite a few reads in the exome sequencing mapping between the exons. So these are not all mapping errors, I suppose, or are they?
Yeah, so it's basically the most common way to do... Oh, sorry. So the question is that just by eye, and certainly in practice, there are several reads still mapping to the intronic regions, even though we're doing exome sequencing. Did I capture that right? And this is absolutely true. This is because the laboratory modification to focus on exons is not perfect. It doesn't capture only exons. Some genomic DNA comes through into the sequencing library. So it's an enrichment for exons, but it's not a perfect selection. That capture is typically 70 to 90% effective, so you're inevitably going to have these other sequences come through. And those sequences can be quite helpful for calibrating your detection of copy number variants and for detecting viruses. So even though you haven't necessarily put baits for viral DNA into your exome design, you can still do a bioinformatic analysis of these off-target reads and still detect real biology. There's a great paper by Stransky et al. on head and neck cancer where, even though they did exome sequencing, they were still able to call out which tumors were virally associated and which ones weren't. So yeah, these reads are still useful bioinformatically.
Okay. And that leads really nicely into this next point. It's called whole genome and whole exome sequencing. Sometimes I call this slide "hole", like H-O-L-E, exome sequencing, because there are coverage holes in the genome and in the exome. These methods really don't measure absolutely everything. This is from one of my own papers, so I don't feel so bad highlighting it. The way to read this plot: on the y-axis, each tiny, one-pixel-thick line is an exon, and in white is our ability to call a mutation in those exons. So if a tiny tick is completely black, that means the exon had no coverage.
We're just not able to call mutations there. And if it's completely white, like most of the exome, that means we're completely confident in our ability to call mutations. In this case, you can see that in general, for 80% of the exome we have enough coverage to call variants. But that capture method is not 100% effective, and there are some exons that we're just not able to capture and sequence effectively. This comes down to the method we use to pull down the exons, a method called hybrid capture. It does not work well at extremes of DNA sequence content: a very high number of Gs and Cs, or a very high number of As and Ts. So it has trouble with these regions. That's what this plot is meant to convey.
But even whole genome sequencing, so these are the tumors marked here in yellow, still has this fundamental problem of regions of the genome that don't necessarily have reads. I showed that example where the whole region is covered, but that's not actually true for the entire genome. There are still portions of the genome where, due to our inability to map to an inaccurate reference, or due to normal variation between people that is just not very well captured by the human genome reference, we just don't see reads.
The other point of this slide is that it shows variability across samples. This speaks to the need to do very rigorous quality control and comparison of coverage across all of your samples as a first step. Some of these samples are very poorly covered.
In this case, the data were from a completely different sequencing platform, so there was a bit of a technical difference there as well. But it also complicates the comparison of mutations across all of your samples, because if there's an exon of interest that's only covered in a fraction of your patients, you may not be able to call a very accurate mutation frequency at that site.
Transcriptome sequencing is a little bit different, because coverage is not a function of whether the DNA is there or not, but depends on the functionality of the cell. Is the cell expressing RNA, and can we then capture that RNA and sequence it? And I bring it up here. Here's the expression level. Here's an exemplar patient. We want to compare this patient's RNA transcriptional profile to two profiles that we knew, specifically in breast cancer that's expressing a protein called ER, the estrogen receptor. In this case, we want to know what the difference is between this patient's tumor and a reference. We had multiple isoforms or multiple genes specifically associated with ER, and we want to compare the ranked expression of those genes to our reference here. This is really not coming up well. Maybe I'll return to this in a future slide. I think the point here is that just looking at coverage of specific genes allows you to infer the expression level of that gene. So transcripts or genes with very high coverage are expressed at a very high level. Transcripts or genes that have no coverage are expressed at a very low level or are not expressed. And the goal of transcriptional analysis is to compare these relative coverage levels across transcripts and infer specific cancer types or drill into specific genes, as I try to do on this slide. But I'm going to replace this for next year. I don't think it's doing a great job of that.
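As an editorial aside, the idea of reading expression out of relative coverage can be made concrete with a small sketch. It uses transcripts per million (TPM), a common normalization that the talk does not name, so treat the choice as an assumption. The function name `tpm`, the read counts, and the gene lengths are all invented for illustration; `ESR1` is the gene that encodes the estrogen receptor.

```python
def tpm(read_counts, lengths_bp):
    """Convert per-gene read counts to transcripts per million (TPM)."""
    # Reads per kilobase: longer genes collect more reads at equal expression.
    rpk = {gene: read_counts[gene] / (lengths_bp[gene] / 1000) for gene in read_counts}
    # Rescale so the values for one sample sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}

# Invented numbers for one sample.
counts = {"ESR1": 500, "GAPDH": 2000, "SILENT_GENE": 0}
lengths = {"ESR1": 5000, "GAPDH": 1000, "SILENT_GENE": 2000}

for gene, value in tpm(counts, lengths).items():
    print(gene, round(value, 1))
```

A gene with no reads comes out at zero, which matches the point above: no coverage means very low or absent expression, not missing DNA.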
The fourth source of cancer genome variation, which I'm not going to talk about too much on this slide, is epigenetic modification, specifically methylation of promoter sequences. This relies on a laboratory modification up front that allows you to unmask or look at modifications, specifically methylcytosines. And I bring up this example because it's clinically relevant. In this case, we want to compare the methylation status of the promoter of this DNA repair gene, MLH1. This is important in colorectal cancer and in an inherited form of this disease called Lynch syndrome. The point is that now I've turned on a mode in IGV that lets you look at whether Cs are modified or not. We want to compare against normal tissue, and here's the promoter sequence. You really don't need a bioinformatics tool to pull this out. You can just do it by eye, at least looking at one promoter. There's this unmethylated promoter sequence. In this case, the gene is expressed and turned on. And you can see in this endometrial carcinoma there's a methylated promoter, and not just a single base, but almost all of the bases in the tumor are methylated, and this gene is now not expressed.
A bioinformatics strategy we'll often employ is to look at the methylation status of promoters and then look at matched RNA sequencing data to see whether methylation in that promoter corresponds to a change in expression. It very often does. Actually, in this case, we often use a proteomic assay. We'll see lack of protein expression, run a methylation assay, and then we'll often see methylation of a promoter. There's not necessarily a mutation. There's not necessarily a copy number change. But there's a modification at the DNA base level that results in lack of expression of this protein.
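As an editorial aside, the by-eye comparison of a methylated and an unmethylated promoter can be sketched numerically. This is a minimal sketch under stated assumptions: it assumes a bisulfite-style assay (not named in the talk), where a methylated cytosine is still read as C and an unmethylated one is read as T. The function names, the 0.5 threshold, and the read counts are all hypothetical.

```python
def methylation_fraction(c_reads, t_reads):
    """Fraction of reads supporting methylation at one CpG (None if uncovered)."""
    total = c_reads + t_reads
    return c_reads / total if total else None

def promoter_status(cpg_counts, threshold=0.5):
    """Summarize a promoter from (C reads, T reads) pairs, one pair per CpG."""
    levels = [methylation_fraction(c, t) for c, t in cpg_counts]
    covered = [level for level in levels if level is not None]
    if not covered:
        return "no coverage"
    mean_level = sum(covered) / len(covered)
    # The threshold is an arbitrary illustration, not a clinical cut-off.
    return "methylated" if mean_level >= threshold else "unmethylated"

# Invented counts at three CpGs in one promoter.
normal_tissue = [(1, 29), (0, 30), (2, 28)]
tumor = [(27, 3), (30, 0), (25, 5)]
print(promoter_status(normal_tissue), promoter_status(tumor))
```

A call like this would then be checked against matched RNA or protein data, as described above, before concluding the gene is silenced.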
And that's really the point of this part of the talk: these data types are confirmatory and complementary. And this is a great cheat sheet for exams and that sort of thing. What types of cancer genome variation can you read out with each method? Whole genome sequencing, where you get wall-to-wall coverage. Whole exome sequencing, where you're focusing your coverage specifically on the exome. RNA sequencing, where you're using relative coverage to read out the expression level of specific genes. And then methylation sequencing, to look at types of cancer genome variation that aren't read out very well by exome and genome sequencing, specifically epigenetic modifications to the DNA that change the ability of transcription factors to bind and therefore influence expression.
I bring up low purity and subclonal mutations, and I've only put the checkbox here in whole exome sequencing because of the depth afforded by the much cheaper strategy of only focusing on genes. This is starting to change as whole genome sequencing gets cheaper and cheaper. We can sequence these genomes to a much greater depth than we could have for the same price with whole exome sequencing. So eventually I expect we're going to be able to put a checkbox in this column as well, just because the cost will have fallen to the level where we can get hundreds or even thousands of x coverage across the whole genome.
Okay, so I'm going to move into some of these specific types of cancer genome variation in a little more detail. Are there any more questions about this slide, data integration, or types of cancer genome analysis to start?
Yes? Just a question about what you just said. If you can do a genome instead of whole exome sequencing, would you always go for the genome sequencing for the sake of more information?
It depends on your question. For me, I have an interest in subclonal analysis.
For that I will always want more depth. So for the same unit of dollars I will almost always go with the deeper sequence, unless my focus is specifically on non-coding regulatory variation or other regions of the genome. So it's definitely not an "always". I won't always use one of these two methods. It really has to be tailored to the question at hand. If my interest is in clonal mutations and I want to know everything about clonal variation in cancer, then yes, whole genome sequencing is really the place to go. RNA sequencing I'm starting to use more and more as we become more confident in our ability to call variation and to interpret functional gene expression categories from RNA. In fact, I probably use these four methods more or less to the same degree, depending on the budget and the scientific question at hand. So yeah, I don't think it's an "always" question. Any other questions before I move on?
Okay, so let's talk about somatic mutations. We're talking about the smallest possible unit of variation, a single change or a local change in the DNA sequence. I wanted to show some data examples again, and I want to differentiate between the concepts of germline variation and somatic variation. Germline variation is the variation we're all born with. These are SNPs, single-nucleotide polymorphisms, single base changes in DNA that we all have and share to some degree. And our real challenge with cancer genome analysis is to differentiate germline variation from somatic variation. Germline variants are detectable at low coverage. As you saw in my original karyotype, there are two chromosomes. If you have a variant on one and not on the other, you'd expect half the data to have the variant and half the data not to. That's what I've shown here.
Relatively low coverage, just under 20x, and at this one position, half the DNA sequencing reads have a C and half have a T. This is a very easy type of genome variation to find. If we're doing this across a whole exome or a whole genome, we will always run a matched normal, because there are thousands, hundreds of thousands of single nucleotide polymorphisms. The human genome reference databases and polymorphism databases are not perfect. So we always like to have that germline control to compare variants like this to what we see in the cancer. So these are the easiest types of variation to find, in an easy-to-map region. How am I doing for time? Okay, great.
The challenge with cancer genome analysis, especially trying to put this into practice in the hospital, is that the tissues we get for diagnosis are really not ideal. This is a gel from my very first paper ever. These are all lung cancer patients, all from the BC Cancer Agency, so the same hospital. The y-axis is the DNA fragment size. And you can tell, even though it's the same cancer type, the same department, the same hospital, there's this very variable level of DNA degradation. Up here, very high molecular weight, very intact DNA. Down here, highly fragmented DNA. Some of these blocks may be of different sizes, and protocols may have changed over time. This one is actually a metastasis to bone. It was decalcified, which completely destroyed the DNA, so it's not usable for genome analysis. That's a big confounder coming into the experiment.
The other challenge, as I alluded to earlier, is that tumors are a mix of cancer and normal cells. You may hear this referred to or written about as purity or tumor content. This is a lung tumor metastasis to a lymph node. I've circled all the tumor cells, and all the others are non-tumor cells. So we're doing a next-generation sequencing experiment.
Half our data are not coming from cancer, and this is a calibration that our bioinformatic methods have to take into account. Not every single read is coming from a tumor cell. And as I showed in that earlier slide, tumors can have multiple copies of the genome. This idea is called ploidy. A normal cell should have a ploidy of two, two copies of every chromosome. This is probably a ploidy of four or just under, maybe 3.8 or something like that. This changes one of the fundamental assumptions of many mutation callers, which may assume that you only have two copies of each chromosome. In a case like this, where you have five copies, that balance of alleles, rather than 50-50, is now one in five, one in three, one in four. It's different across every chromosome. So you need a caller that is ploidy aware, or that is able to compensate for this expected landscape of chromosomal instability.
And this goes back to that exome versus genome question. This is the same sample read out by whole genome sequencing and whole exome sequencing. So a relatively shallow genome and, for the time, a relatively deep exome sequence. In the genome, there's not even a single read that supports the presence of that mutation. There's no hint that the mutation is even there. In the exome, only six of the 139 reads support that mutation. This was validated by another, more sensitive method. So this mutation is definitely there in the tumor sample, but it may have been missed by the shallower assay due to the confounders I just talked about: variations in DNA quality, shifts in tumor ploidy, or a very, very low tumor content sample. And this is why it's so critical to have pathology review right up front, before even embarking on a next-generation sequencing experiment. Just so you can calibrate your thinking: is this a tissue sample that has few tumor cells?
Does six out of 140 make sense with what is known about the cells on the slide? Or is this a very exciting subclonal mutation? If we are certain that the tissue sample being analyzed is 100% tumor, these low allele fraction mutations may come from a subclonal or very rare cancer cell population. With exome or genome-wide data, we can start to infer tumor purity, ploidy, and subclonal structure by modeling these combinations of allele fraction, pathology-estimated tumor content, and chromosome copy number as read out by coverage.
This is one of my favorite tools, called Sequenza. We applied this to a whole exome data set. These are the chromosomes along the top. Here we're looking at the B allele frequency, so looking at germline variation and how it may change on a background of chromosomal instability. And here, changes in depth: chromosomes that are gained have more depth than chromosomes that are deleted or don't have a copy number variation. By combining these two pieces of data, you can start to model specific shifts or changes in the chromosomal landscape. You can see there's an extra copy of chromosome 2, and two extra copies of chromosome 3. In a normal blood cell, you would just get a flat line of a red and a blue, one copy of every allele. So you can tell just by eye how chromosomally complex this tumor is. These types of data can be modeled to fit different tumor contents or different ploidies. You could have very, very few tumor cells, but they have lots of copies of the genome, so you may still get a lot of tumor DNA because the tumor has duplicated its genome.
And you can start to model the subclonal structure of these mutations using tools like PyClone and PhyloWGS. In this case, these are mutations that have very high allele fraction, where maybe half the reads have the mutation and half the reads do not.
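As an editorial aside, the question "does six out of about 140 reads make sense for this tissue?" can be explored with the standard mixing formula for expected allele fraction. This is a minimal sketch, not the model used by the tools named above; the function name `expected_vaf` and its parameters are illustrative, and it assumes the normal cells carry two unmutated copies of the locus.

```python
def expected_vaf(purity, tumor_copies, mutated_copies=1,
                 cancer_cell_fraction=1.0, normal_copies=2):
    """Expected fraction of reads carrying a somatic mutation at one locus."""
    # DNA copies that carry the mutation, per average cell in the sample.
    mutant_dna = purity * cancer_cell_fraction * mutated_copies
    # All DNA copies at the locus: tumor cells plus contaminating normal cells.
    total_dna = purity * tumor_copies + (1 - purity) * normal_copies
    return mutant_dna / total_dna

observed = 6 / 139  # the exome example from the talk
print("observed", round(observed, 3))
for purity in (1.0, 0.5, 0.1):
    print("purity", purity, "->", round(expected_vaf(purity, tumor_copies=2), 3))
```

Under these assumptions, an allele fraction near 4% is about what a clonal heterozygous mutation would give in a sample with a little under 10% tumor cells. It would point toward a subclone only if pathology review says the tumor content is much higher than that.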
A subclonal population where maybe 75% of the reads have the mutation. And down here, we get these very, very low coverage reads. As we start to look at 10, 15, maybe 100 of these low allele fraction mutations, we start to observe that they're all supported at the same frequency. And we can start to build these into subclonal models.
So everything I've shown so far is nice, clean Illumina sequencing data. We really haven't seen a lot of sequencing errors. But I did want to comment on other next-generation sequencing technologies and their value in complementing Illumina data. This is an example of using a PacBio-based approach, a totally different way to sequence DNA. I'm just using it to validate variants that were found by Illumina sequencing. You can see down here there are two types of variation, a point mutation and, in this case, an indel. Two bases have been deleted, so you put a little gap in here. And then we use this other, totally different next-generation sequencing method to read out the exact same variants. The reason we didn't use this technology right at the beginning is that it has a very high background sequencing error rate. You can tell that by eye. There are all these purple marks, which are insertions, and all these little deletions that we don't see in the other sequencing data. But you can see the real variation shines through. And this is another bioinformatics challenge in working with these other data: correcting for, and essentially normalizing out, this background error rate that's specific to the technology.
The other challenge in designing a next-generation sequencing experiment is how deep you need to go. What are you looking for? Are you looking for mutations or copy number alterations? Are you doing clinical sequencing with suboptimal clinical samples, where the percentage of cells that are tumor cells may be lower than you'd expect in a research sample?
And this is really a plot of the allele fraction detectable at a specific coverage. In this case, looking for relatively low allele fraction mutations at 10% is typically done in a 100x genome. This is a cheat sheet for how deep you need to go for a specific allele fraction. I'm going to speak a little bit about circulating tumor DNA later in the talk, specifically sequencing very, very deep, in this case 25,000x, to find very, very low allele fraction mutations down below 1%. So it's this function of needing increased depth to find rare variation.
Oh, good. So I have a slide right here. This is an active area of research in cancer genomics: using next-generation sequencing to read out mutations, copy number alterations, and rearrangements non-invasively, using a blood test rather than a biopsy of the primary cancer cells. So specifically looking at this concept of cell-free DNA. You take a blood sample and spin it in a centrifuge, and you get three layers: a plasma layer, a white blood cell layer, and a red blood cell layer down here at the bottom. In this straw-colored plasma layer is DNA that's being shed by cells in the body. Almost all those cells, of course, are normal cells, but tumor cells are turning over as well, and they're also shedding their DNA into blood. And a challenge for circulating tumor DNA analysis is to detect these very, very low allele fraction mutations on this background of primarily normal DNA. I showed you that challenging example of a lung tumor metastasis to a lymph node. That was 50-50. This is more like 1% or 0.1% tumor DNA versus 99% normal DNA in that sample.
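As an editorial aside, the link between depth and detectable allele fraction can be sketched with a simple binomial model. This is a minimal sketch and not the calculation behind the plot in the talk: it assumes each read independently carries the variant with probability equal to the allele fraction, ignores sequencing error, and treats "at least three supporting reads" as the detection rule. The function name and the threshold are illustrative assumptions.

```python
from math import comb

def detection_probability(depth, allele_fraction, min_alt_reads=3):
    """Probability of seeing at least `min_alt_reads` variant reads at one site."""
    # Sum the binomial probabilities of seeing too few variant reads,
    # then take the complement. Keep min_alt_reads small.
    p_too_few = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1 - p_too_few

for depth, fraction in [(40, 0.5), (100, 0.10), (100, 0.001), (25_000, 0.001)]:
    print(depth, fraction, round(detection_probability(depth, fraction), 4))
```

A real caller also has to separate true variant reads from sequencing errors, so the depth needed in practice is higher than this idealized model suggests.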
Certainly for cell-free DNA analysis, we're attacking this using two methods, a PCR-based method and a hybrid-capture-based method. Hybrid capture is what we've typically used for whole exome sequencing, where you put a DNA or an RNA bait into the tube that hybridizes to your sequences of interest and pulls them down. As you can see, there's a lot of off-target sequence. Most of our coverage is where we put our bait, but you can see coverage to either side of the region of interest. With PCR-based methods, basically you put two primers down and you only amplify that region of interest. You have very focused data, very, very deep. The challenge here is you don't have knowledge of molecular diversity, so you don't know how many distinct DNA molecules are supporting your mutation. With capture, you can have this nice tiling of sequencing reads like I showed you in the previous mutation example. In the PCR-based approach, all of your reads start and end at the exact same point. So while it can be very, very deep, there's this challenge of not knowing whether you just have one DNA molecule that's been amplified over and over, giving you a false sense of a negative or a very high frequency positive.
The excitement around cell-free DNA is that it's highly quantitative. When configuring the next-generation sequencing assay, typically we will do targeted panels so that for the same input, the same coverage, the same budget, we can sequence very, very deeply. This gives us very high sensitivity, so we can start to get down into the picograms per mL range. And the excitement really comes from the highly quantitative nature of cell-free DNA versus the size of the tumors. You can see that as the concentration of cell-free DNA goes up, this is tightly correlated with the size of the tumor or the number of tumors that a patient may have.
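As an editorial aside, the molecular-diversity concern raised above for PCR-based assays can be illustrated with a toy example. This is a minimal sketch under a strong assumption: reads whose fragments start and end at the same coordinates are treated as copies of one original molecule. The function name and the read tuples (start, end, base observed at the site of interest) are invented.

```python
def distinct_supporting_molecules(reads, alt_base):
    """Count distinct fragments (by start and end) that carry the variant base."""
    fragments = {(start, end) for start, end, base in reads if base == alt_base}
    return len(fragments)

# Hybrid capture: fragments tile the region with different start and end points.
capture_reads = [(100, 265, "T"), (112, 290, "T"), (131, 301, "T"), (140, 310, "C")]
# PCR amplicon: every read starts and ends at the primers.
amplicon_reads = [(120, 220, "T")] * 3 + [(120, 220, "C")]

print(distinct_supporting_molecules(capture_reads, "T"))
print(distinct_supporting_molecules(amplicon_reads, "T"))
```

Both data sets have three variant reads, but only the capture data show that they came from three different molecules. With the amplicon data this proxy cannot tell one amplified molecule from many, which is the limitation described above.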
And this actually becomes even more powerful as we start to look at the cancer genome or the allele fraction over time, especially alongside clinical data, where we can start to see tumor shrinkage and decreases in cell-free DNA concentration; with growth of tumor, higher allele fractions; and of course what we all want to see is the tumor volume crash to something very, very small, where we can no longer detect mutations that support the presence of tumor-derived cell-free DNA either. So that was all I wanted to say on point mutations. Are there any comments on the smallest variation in the genome? Okay, I'll move on to somatic copy number alterations and rearrangements. And this is really what we're seeing at the karyotypic level, at the chromosomal level, those large gains and losses. This is just another example of a very complex chromosome. This is a technique called spectral karyotyping, where each chromosome is painted with an individual color. So a normal cell here should have two chromosomes of each color. You can tell in this tumor this is absolutely not the case. This chromosome here is made up of three or four different chromosomes, all rearranged and mixed together. This piece from one chromosome is attached to another, and chromosome sizes are all very different. This genomic complexity, of course, is all in the sequencing data as well, and we're very interested in where all these breakpoints occur as we transition from one chromosome to another within a single genetic unit, which is still this new chromosome that the cancer cell has constructed for itself. Of course, gains and losses are evident from the sequence data I showed you in that cartoon earlier. This is an example of microarray data and exome data. The way to read this: reds are gains, blues are losses, and this is just the number of reads that map to each region.
So this is a MYCN-amplified neuroblastoma, very famous for 50 or more copies of the exact same gene, very unusual, certainly compared to a normal cell, and you can see that it's just glowing red. You have many, many more copies of that specific region compared to everywhere else in the exome or in the genome. Blue is missing copies of DNA, of course confirmed by microarray, and you can tell that's actually quite different from the normal exome. We're really just not seeing anything beyond normal copy number variation in the matched normal. So that was an exome or microarray, so a genome-wide readout. This can also be read out on small targeted panels as well. So this is actually data from a clinical panel. Each dot is a single exon that's on that panel, so you can see it's a very, very small panel. But just looking at the number of reads at each exon can read out copy number variation. So this is an EGFR-amplified lung cancer. We know that by FISH: they were basically able to stain for epidermal growth factor receptor and count how many copies each cell has of that gene. Sequencing data basically reads out the exact same result. Yes, this tumor has a gain of EGFR, one copy of RET, two-copy loss of PTEN. This is a tumor suppressor, this is an oncogene, but really we're able to use simple coverage data to read out copy number variation using the exact same data that was originally generated only to find mutations. So you can really start to repurpose and reuse and combine not just your data, but other publicly available data sets. And this just illustrates the value of a large, albeit not whole exome, data set; this is our larger clinical panel. This is a 183-gene panel; actually I think this was run at Dana-Farber. Each dot is an individual exon, and you can just tell by eye that some genes have very, very high coverage, consistent with copy number alteration.
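Reading copy number out of per-exon read counts, as described above, can be sketched as a log2 ratio of tumor to matched-normal coverage. This is a minimal illustration and not the method used for the panels in the talk: the exon labels, the read counts, the pseudocount, and the gain and loss cutoffs are all hypothetical, and the median-centering step assumes most exons on the panel are copy-neutral.

```python
import math
from statistics import median


def log2_copy_ratios(tumor_counts, normal_counts, pseudocount=0.5):
    """Per-exon log2(tumor / normal) read-count ratio, centered on the median.

    Median centering is a crude correction for differing library sizes and
    assumes most exons are copy-neutral.
    """
    raw = {
        exon: math.log2(
            (tumor_counts[exon] + pseudocount) / (normal_counts[exon] + pseudocount)
        )
        for exon in tumor_counts
    }
    center = median(raw.values())
    return {exon: value - center for exon, value in raw.items()}


def call_copy_state(log2_ratio, gain_cutoff=0.58, loss_cutoff=-1.0):
    """Label an exon using illustrative cutoffs (not clinically validated)."""
    if log2_ratio >= gain_cutoff:
        return "gain"
    if log2_ratio <= loss_cutoff:
        return "loss"
    return "neutral"


# Hypothetical read counts for five exons on a small panel.
tumor = {"EGFR_ex19": 800, "RET_ex11": 100, "TP53_ex5": 100, "KRAS_ex2": 100, "PTEN_ex5": 0}
normal = {"EGFR_ex19": 100, "RET_ex11": 100, "TP53_ex5": 100, "KRAS_ex2": 100, "PTEN_ex5": 100}

if __name__ == "__main__":
    for exon, ratio in log2_copy_ratios(tumor, normal).items():
        print(f"{exon}: log2 ratio {ratio:+.2f} -> {call_copy_state(ratio)}")
```

With these made-up counts the EGFR exon stands out as a strong gain and the PTEN exon as a loss, while the other exons sit near zero. Tumor purity, GC bias, and capture efficiency all distort real ratios, so production tools model far more than this sketch does.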
And you can also start to call more complex structural alterations. This is a chromosomal alteration called an isochromosome, where one arm is deleted, so the p-arm is deleted, and it's actually replaced by the q-arm. So it manifests as a loss of half the chromosome and then a gain of the other half. The beauty of next-generation sequencing is not just to say, yes, there's a gain, but really to do some very, very fine mapping within a gene. So this is an example from lung cancer again, looking at EGFR. In this case, there are very specific activating mutations within the region that encodes the C-terminus of EGFR. Here are the exons. Here are the reads that map with a wild-type alignment. So they're basically the fragments, 300 base pairs; read one and read two are mapping within 300 base pairs of each other. In the tumor, you can see these deletion alignments, where you have a read that starts on one end of the breakpoint and then hops, just like we would see in RNA, but we wouldn't expect to see that in DNA unless something abnormal had happened. So in this case, we actually have no tumor reads landing specifically in the deletion region. The normal cells do; they're mapping there in a normal way. In this case, there's a large deletion, and we can really tell right to the base precisely where this deletion has happened. This is actually an in-frame deletion of two exons that activates EGFR. And this is really what it looks like in practice. These are two very famous rearrangements, and you can tell in this case, this part of the read is gray. That means it matches the human genome reference sequence, but then it turns into this rainbow color. This is basically saying there's a huge pile of mutations here. This portion of the read is not mapping to the reference correctly. In fact, these aren't mutations. This is actually a sequence that maps to the other gene instead.
So we've actually read in from the right-hand side, hit the breakpoint, and then that little rainbow part of the read actually maps to a completely different part of the genome. And there are many bioinformatics algorithms that are looking for these translocations or gene fusions now, using both DNA and RNA. Same story down here: precisely to the base, we know where that translocation has occurred. A big challenge with these bioinformatic methods these days is a very high false positive rate, because often these little mismapping reads will actually map to multiple other regions in the genome. One of the big challenges with the human genome in general is the presence of homologous regions, or pieces of genes that have been duplicated and are now stored somewhere else in the genome. And this really misleads mappers, because the mapping algorithms really can't tell where that read should sit in the genome. And often the mapper will flip a coin, give the read a very low mapping quality, but put it in one of two potential places. And as a result, you'll often have this huge list of putative translocations that will actually turn out to be just mapping artifacts. It's really a problem with the very first step in that pipeline, that alignment step. The other challenge is that these rearrangements are not always beautifully one-to-one, just one gene to one gene. This is a publication from prostate cancer, which really shows these nefarious multi-chromosome, multi-gene rearrangements. In this case, how do we read this? ERG is joined to TMPRSS2. This part of TMPRSS2 is now joined to this unannotated region. This piece of THRAP3 is now strapped back to ERG. So at some point during the cancer's development, these three, or maybe four, chromosomes have all come near each other within the cell. And there's been a very complex rearrangement step.
So Mike Berger, the first author of this, really just sat down and did this by eye with pen and paper to figure out what reads were going where. This is a very open bioinformatic challenge, certainly, to map these multi-chromosomal rearrangements. Very complex events. In the case of prostate cancer, they're hitting two of the foundational rearrangements that are seen in a large number of prostate tumors. It's something you can read by eye as a human and work out. There's really no evidence of it whatsoever in the normal, really speaking to the power and value of having matched normals for almost all the sequencing that we would do. And I just wanted to step back and get to this concept: just like in mutation space, where we had high mutators, mid mutators, and low mutators, some tumors are also highly rearranged. Some are quiet and some have no rearrangements. In this case, this is lung cancer from the exact same study: highly structurally complex, sort of medium, and low. Here's our neuroblastoma study. We just did not see rearrangements in the RNA from neuroblastoma. There's actually another paper from around the same time, specifically looking at a very focused region of the genome, in this case just one chromosome that had essentially been blown apart and then rearranged, essentially reconstructed. This is a very complex chromosomal event called chromothripsis. And this is also an area of bioinformatic development. How do we detect and call chromothripsis in a robust way? In this case, they actually found this in 18% of tumors, something that we just did not see at all using whole exome sequencing. So this really speaks to the value of whole genome sequencing to find these alterations right to the base pair and start to learn something about cancer biology, in this case a recurrently rearranged chromosome in pediatric cancer. Transcriptome sequencing, so yes, this will be much better than that ER example.
Instead of looking just at the DNA, we're focusing specifically on the expressed portion of the genome, the functional portion, looking at reads that map not just to genes, but to the exons and the elements within genes. So this is actually an example looking at resistance of a colorectal cancer cell line to 5-fluorouracil, a relatively toxic chemotherapy. In this case, we're interested in the expression level not just of this UMPS gene, but of specific exons within that gene. So the way you read this: we would expect exon one to attach to exon two, to attach to exon three, and so forth, and therefore we'd expect very specific junctions. Exon one has a junction to two, two has a junction to three, three has a junction to four, and so on. And that's exactly what we see in an untreated cell line. So here exon one has good read coverage. There's a one-two junction, and very poor coverage of an abnormal exon one-three junction. So in an untreated cell, this gene is basically using its exons in order, so we don't see one-three. We do see exon two. We see a natural two-three junction, exon three, and so on and so forth. Specifically under treatment conditions, this cell line now starts to skip exon two. So instead of having this nice ordered usage, it's actually skipping from one specifically to three, and that's evidenced by the increase in the number of reads mapping to this abnormal one-three junction. So even though it's whole transcriptome sequencing, you can really zoom in right to the exon level of a specific candidate gene, in this case a gene associated with resistance to 5-fluorouracil, and look at the functional consequences, really how these cells adapt to treatment over time. And actually, none of this is really evident in the DNA whatsoever.
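The junction-reading logic described above (ordered one-two and two-three junctions versus an abnormal one-three junction) can be summarized as a skipping fraction. This is a sketch under stated assumptions: the junction read counts are invented, the function name is hypothetical, and averaging the two inclusion junctions is one common convention rather than anything specified in the talk.

```python
def skipping_fraction(junction_reads, upstream, skipped, downstream):
    """Fraction of junction evidence supporting skipping of one exon.

    `junction_reads` maps (from_exon, to_exon) pairs to spliced-read counts.
    The two inclusion junctions are averaged so that inclusion and skipping
    are each represented by a single junction's worth of reads.
    """
    inclusion = (
        junction_reads.get((upstream, skipped), 0)
        + junction_reads.get((skipped, downstream), 0)
    ) / 2
    skipping = junction_reads.get((upstream, downstream), 0)
    total = inclusion + skipping
    return skipping / total if total else None


# Hypothetical spliced-read counts for exons 1 to 3 of one gene.
untreated = {(1, 2): 120, (2, 3): 110, (1, 3): 3}
treated = {(1, 2): 40, (2, 3): 35, (1, 3): 90}

if __name__ == "__main__":
    print("untreated:", skipping_fraction(untreated, 1, 2, 3))
    print("treated:  ", skipping_fraction(treated, 1, 2, 3))
```

A value near zero corresponds to the ordered exon usage seen in the untreated line, and a value that rises under treatment corresponds to the increase in one-three junction reads. The function returns None when no junction reads exist, since a gene that is not expressed gives no evidence either way.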
It's not like there's a variant affecting a splice site or a copy number change. This is really just the cell starting to use an alternative exon. And I bring this up because RNA-Seq, of course, shows us things that are not evident in the DNA. We want to know how the cells are using the DNA that they have. Of course, RNA-Seq enables expression profiling and subtyping. So by essentially looking at a huge table of cells or samples by genes, we can now start to look for patterns: what genes are used by certain samples or certain tumors and not used by other tumors, and ask questions like, is it up in group A, down in group B? This class of genes is always down in group B, and so on and so forth. There are many algorithms to cluster and group these expression profiles. To go back to the question of cell lineage, these are increasingly being used to look at single-cell sequencing data, to say not only how these cells cluster together, but how they are related to each other as cancer develops or evolves over time. The other benefit of using large publicly available transcriptome data is really to tell you more about cancers, especially cancers that you don't understand very well. In this case, these were four lung tumor mets of unknown tissue of origin. And we had all of this normal sequencing data from an effort called GTEx. This is the Genotype-Tissue Expression project, which is essentially just generating RNA-Seq data from thousands of normal tissues, often from the same individuals. So these are car accident victims, people with normal healthy organs who have died for non-disease reasons. This group is sequencing all those tissues and then putting all that data out freely available. So each little mark here is a different tissue from a different person. So lots of brain samples, lots of lung samples.
And a very powerful technique here is to basically take your tumor data and cluster it with normal, really providing insights into what the specific cell of origin is in these patients. As mentioned earlier, unbiased sequencing is very powerful for finding things you didn't necessarily expect. In this case, this is an Epstein-Barr virus we found in what was diagnosed as a lung adenocarcinoma. This is data that actually came through in the RNA sequencing data set. We were actually able to reconstruct the entire EBV transcriptome using RNA sequencing data. This turned out to be a lymphoepithelioma-like carcinoma, not an adenocarcinoma as it was originally reported. And then we validated this by staining specifically for EBV, and we saw that only the tumor cells were positive for EBV transcripts. So that's transcriptome sequencing. I wanted to talk a little bit about where all these cells come from, specifically germline cancer genetics or hereditary cancer syndromes. This is really this concept of being born with your second hit first. So I've shown over and over point mutations and copy number alterations that are hitting genes of interest. In this case, nearly every hereditary cancer syndrome is actually associated with loss of a tumor suppressor. For tumor suppressors, of course, we have two copies, one on each chromosome, and in cancer we often see that tumors have lost both copies of a single tumor suppressor, therefore allowing specific pathways to be overexpressed. Patients who have a hereditary cancer syndrome are actually born with only one copy of a tumor suppressor, or a loss-of-function mutation in one of their copies. And now they're just waiting for a second hit. And that second hit can take many different forms. So here's a patient with predisposition to retinoblastoma, a rare eye cancer.
In this case, they've been born with a loss-of-function mutation in the retinoblastoma gene, and this is the constitutional genotype. So they still have a single remaining copy of the RB gene, and now they're basically waiting for that second hit to occur and essentially lead to cancer. And that can be a local event, so a gene conversion event, a recombination, a deletion of the other allele, another mutation, or complete chromosome loss. I mean, this is really the whole point of the talk: the many ways that the cancer genome can be altered to have the same functional effect. In this case, that's deleting the remaining allele and therefore having these two hits in a fundamental tumor suppressor, in this case for retinoblastoma. And here's how that looks in practice. This is actually an example from medulloblastoma. This is the patient's matched control, so just the blood sample. In this case, half the reads have a loss-of-function mutation, just a single base deletion in SUFU, one of the tumor suppressors important in medulloblastoma. I showed you this earlier: there was a huge deletion on chromosome 10, so in the tumor they basically have deleted a copy of chromosome 10. And if you look at the DNA sequences in the tumor, every single remaining copy is the mutant allele. So they started with two copies, a point-mutated copy and the constitutional, intact copy. There's been this chromosomal deletion of the remaining intact copy, leaving them with a single copy that has also lost function, so essentially no SUFU activity in the tumor. So a lot of this talk has been on methods, how to find cancer genome variation: how do we find a mutation, measure a copy number alteration, find a rearrangement. But really the point of research, and certainly clinical management, is to actually do something about it.
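The SUFU example above, with about half the reads mutant in blood and essentially all remaining reads mutant in the tumor, can be expressed as a comparison of variant allele fractions. This is a minimal sketch, assuming hypothetical read counts and illustrative cutoffs: the heterozygous range, the 0.85 tumor cutoff, and the minimum depth of 20 are arbitrary choices made here, and normal-cell contamination in a real tumor sample would pull the tumor fraction below 1.

```python
def vaf(ref_reads, alt_reads):
    """Variant allele fraction, or None when the site has no coverage."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


def looks_like_second_hit(normal, tumor, het_range=(0.35, 0.65),
                          loh_cutoff=0.85, min_depth=20):
    """Flag a germline heterozygous variant whose intact allele is lost in tumor.

    `normal` and `tumor` are (ref_reads, alt_reads) tuples at the same site.
    Returns None when either sample is too shallow to tell real loss of the
    other allele apart from simply failing to sample it.
    """
    if sum(normal) < min_depth or sum(tumor) < min_depth:
        return None
    normal_vaf, tumor_vaf = vaf(*normal), vaf(*tumor)
    germline_het = het_range[0] <= normal_vaf <= het_range[1]
    return germline_het and tumor_vaf >= loh_cutoff


if __name__ == "__main__":
    # Hypothetical counts: blood is about half mutant, tumor nearly all mutant.
    print(looks_like_second_hit(normal=(48, 52), tumor=(3, 97)))
    # Heterozygous in both samples: no evidence of a second hit.
    print(looks_like_second_hit(normal=(48, 52), tumor=(50, 50)))
    # Too few reads to make a call.
    print(looks_like_second_hit(normal=(4, 5), tumor=(0, 9)))
```

The depth check reflects the same concern raised about allele dropout: with only a handful of reads, never seeing the intact allele is weak evidence that it is actually gone.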
So once we find variation, the key task is really to start to annotate these variants, to tell us something about them. Is it a variant that's been seen before? Is it associated with cancer? Is it in a cancer gene? Is it an oncogene? Is it a tumor suppressor? And that's really this activity of annotation, interpretation, and reporting, regardless of the tool that you use to find the variant. So how do we interpret variants en masse? There are many guideline publications on how to interpret a variant. Each lab sort of has its own current way to interpret these variants. And they look at a host of databases and publications specifically tailored to the reason for doing the test. So I used to do a lot of testing for hereditary hearing loss. We looked at hearing loss genes and databases and literature associated with hearing loss. Now I do a lot of cancer work, so now we're much more biased towards the cancer-specific databases. But this is really just a list of the types of databases and interpretation data sources that are out there to help us link the presence of a variant to knowledge of function. And this is actually the number one reason, certainly in my group, that we almost always pair our DNA readout with our RNA readout, because the RNA really gives us a head start in trying to link function to the effect of a DNA change. We see a deletion and a decrease in expression. We now have this association between a change in the cancer genome and the function as it's read out in RNA. So there are a couple of ways to do this. Certainly historically we've done this really by hand. This is actually a famous tool.
It's used by a lot of clinical labs: Alamut, interactive desktop software that shows you your gene of interest, shows other databases, shows how your transcript is conserved across species, what other isoforms are at play. It really tries to integrate all these databases into one interactive place so a human can sit and interpret these data. This actually works quite well in clinical labs, where the panels are quite small and you really can look at individual variants. And there are large international efforts now to share those interpretations across labs, ClinVar being one of the more famous ones. In bioinformatic labs we're more likely to use high-throughput annotation. This is an example from Oncotator. It has no problem looking at hundreds of databases and giving you all that information. So Oncotator will take a genome coordinate and it'll spit out, actually, probably more than 250 columns now, all from different databases, different data sources, results from TCGA. It has no problem with providing annotations. Where it falls down a little bit is giving you some way to synthesize that data together. And this is really the act of science today: to look at all those annotations and start to think about what these variants actually do at the end of the day. I mentioned MutSig earlier, trying to use statistical frequency to look at genes that are mutated more than we'd expect by chance. Certainly looking at cancer hotspots is very valuable, where cancers are just mutating the exact same base over and over and over, but really this is tailored specifically to your research project. And I've just listed some annotators down here. There are many pieces of software that can look up your variant in a database. There are fewer pieces of software that start to integrate all these together and really classify your variant as functionally important, of unknown significance, or benign.
So this is just an example of one of these standards and guidelines, in this case from the American College of Medical Genetics and Genomics. How do you put a variant into one of these categories? Is it benign? Is it strongly benign? Is it supporting benign? Is it very strongly associated with disease? And this table actually goes on for a couple of pages as to all the databases you need to look at to essentially do a variant interpretation. This is very focused on germline genetics. The other challenge, once we've looked at all of our favorite databases and come up with a classification for our variants, certainly in the hospital, is how do we report and share these results, especially to non-geneticists, non-genomicists, non-bioinformaticians. How do we distill this down into something that clinicians can essentially pick up and act upon? In this case, this is actually a real report. There are three mutations here. For each there's a paragraph describing the databases that we looked at to interpret that variant and to come up with the class of mutation. Is it really a functionally relevant mutation? And this actually works quite well for targeted panels. There are relatively few mutations. You can read three paragraphs in a relatively short period of time. This method really does not scale at all for whole exome sequencing, certainly not for whole genome sequencing, and more innovative reporting methods are really needed to allow us to integrate and look across what can be very long lists of potentially important or actionable variation. One way to do this, rather than having every lab interpret their own variants in their own ways, is to get these variants out into public databases and allow us to share the results of our cancer genome analyses. Not just variant detection, but variant interpretation. This is one example, AACR GENIE. This is now one of the databases I look at as part of the variant interpretation activity.
Really the goal here was to take all the variant calls from clinical genomic testing centers across North America and Europe and to put them all into one database. So you can basically go to aacr.org/genie and bring up a browser for genomic data, in this case they use cBioPortal, and you can say, give me all the most frequently mutated genes in breast cancer on clinical panels across a series of hospitals. So it's mouse and keyboard, it's very easy to interact with. And there's also a nice API and a download option, so bioinformaticians can download and use these data as well. And this is one example of what that data really looks like. A potential molecular report of the future, just trying to communicate as much data with as little ink or as few pixels as possible. In this case this is a dashboard just from a single patient. Each of these little balls is a different tumor, and this is a timeline along the top. So you can see there are two tumors profiled at time zero, a series of clinical interventions, and then these two other tumors taken at the end. Unfortunately the patient died here at the end of the study. In this case here are all the copy number alterations. So thousands of data points from four tumors are all being compared, in this case by eye. We can of course do this formally using bioinformatic analysis, and then here's a little list of all the mutations that are shared and not shared. And there are some examples. In this case here's a mutation that's unique to tumor four that's really not seen in the other three, really a distinct mutation specific to that tumor, and then there's a browsable table down at the bottom. So we're really trying to put as much data into a dashboard from as many different types of cancer genome variation as possible. And I think this is still a bit of an unsolved problem.
There are many ways to communicate these data, and often we need to tailor these specifically to the community that's going to be consuming and interpreting these data long term. Okay, so I think that's pretty much it for the content, and I have until 11. Is that right? Good. Okay. I have four. Okay. So we'll take four minutes to go through this last case study. It's published, so don't feel like you're missing out on too much. You can actually... the first author on this paper is our speaker this Friday. Oh, right. Yeah. So this example has basically become a massive program at the BC Cancer Agency. So this is POG zero, the very first Personalized OncoGenomics program patient that this group did. So yeah, Steve can probably speak to this better than me. So I'll try to roar through this in four minutes or so, which actually is probably doable. Caucasian man, unusual cancer, really no obvious risk factors for his cancer. At the time he just basically had a mass at the base of his tongue and presented with throat discomfort. PET scan: lymph nodes are lighting up as well as the primary tumor. Pathology of course looked at this and diagnosed it as, I hope I have it down here, basically as a salivary tumor, so this poorly differentiated mucinous adenocarcinoma. Lymph nodes were positive. They delivered radiation to the site; good quality of life, returned to work. And then the clinicians certainly started to get worried when they started to see numerous metastases in the lung, relatively small.
And at that time there was an EGFR inhibitor trial ongoing at BC Cancer Agency. There was very weak staining for EGFR, but still evidence of some expression in the tissue, which basically qualified him for an EGFR inhibitor trial. Six-week trial of erlotinib: didn't work at all. All those pulmonary nodules grew while he was on the EGFR inhibitor. One of them really grew, into a two-centimeter mass in his lung. They discontinued it right away, and now they're talking palliative care. What's next? And this is where the clinician, Janessa Laskin, reached out to Marco Marra at the genome centre to come up with options. How can we interpret this cancer genome? So the question was, what are the targets in our case? It's great to look at all these cancer genomes, and this is a bit of an unusual salivary gland tumor. REB, of course, convened, and the patient consented to full genome and RNA sequencing analysis. Fresh biopsy specifically for RNA-seq; this is very important because of course these tumors change over time, as I hope I convinced you earlier. Pathology review: there's cancer there, nice high tumor content, great for RNA sequencing. Here's the biggest mass, the one everyone was most worried about. In this case there's a relatively short list of mutations: a mutation in the tumor suppressor p53, loss of function of RB1, and, very interesting, loss of RB1 is actually associated with resistance to an EGFR inhibitor. This potentially explains why the EGFR therapy didn't work. Knowledge of this variant up front would have been very important, to not waste those six weeks that he was on the EGFR inhibitor. Here's the Circos plot looking at copy number alterations. Quite a bit of copy number action, so low mutation rate, high copy number alteration: amplification of EGFR, perhaps associated with expression of EGFR, loss of PTEN, amplification of MAPK3, and loss of other tumor suppressors.
So again, we're just reading out a huge source of cancer genome variation in a single assay. FISH confirmation: we read it out by genomics and confirmed it by microscope, and they all confirmed here. Looking at gene expression profiling, looking specifically at genes of interest: in this case there wasn't a lot of public RNA sequencing data at the time, so they compared the tumor sequence of the patient to a compendium of 50 tumors and his matched blood, really the best that was available. Now GTEx is here, and there are much better databases to compare these data against. And the most interesting one here was RET. It wasn't mutated, but it was amplified and was actually the most highly expressed oncogene across the compendium and within that tumor, so it was in the top 5% of all expressed genes. RET's very exciting because it's a druggable target. In this case here's the pathway, basically mapping all the copy number alterations and the gene expression into one place. The challenge here is how do we present all of this data in one dashboard that you can interpret. This framework is actually still being used at the BC Cancer Agency today to map cancer genome alterations to pathways, and from pathways to drugs. In this case you can see this pathway here is activated, lots of red. A short list was presented to the oncologist: RET amplification and other hits in that pathway could be acted upon by these four drugs. In this case it really worked: 22% decrease after four weeks. You can see it took a month to do the genomics; in that time the tumor is growing, and then treatment is delivered and the tumor starts to shrink right back down. Stabilization for four months. They were talking palliative care before genome analysis; in this case the treatment appears to be working, stabilized for seven months, and then the lung mets began to grow.
So he was on one treatment; in this case the lung mets began to grow, so he was switched to a combination of the two other drugs on that table I showed earlier, and actually again the disease stabilized within three weeks and that continued for another three months. So now he's bought himself another 10 months, actually with good quality of life; these are all targeted therapies with minimal side effects. The bad news is, as I foreshadowed earlier, resistance to targeted therapies is almost inevitable. Recurrent disease after seven months: new skin nodule, recurrent disease in the tongue, deteriorating quality of life. What has changed in the cancer genome? Of course we do the same thing: biopsy those tissues, sequence them. Totally new list of mutations. We saw all the mutations that were in the original primary; now we're seeing lots of new mutations for which there's really no evidence in that pre-treatment biopsy at all, even at very, very low frequency. Mapping the RNA and the copy number data to the pathway, you can see this slide is just full of red now. This pathway is activated, as is this parallel pathway as well. So even though we're drugging RET, the treatment's happening up here, the cancer is basically responding, or being selected for cells that overexpress and really drive both of these pathways to a very high level. And this is really where it was very challenging. Do you now use a cocktail of drugs that hits multiple members of the pathways? This is very clinically difficult, to deliver multiple drugs at the same time. Could we have modeled this resistance over time in a more fine-grained way to try to catch the resistance early? And this is really my last slide here, sort of the timeline. Unfortunately he was really just too sick to go on an additional treatment strategy at that time. And here's sort of the lifecycle of the entire project. So really a very exciting proof of principle at the time; it really did bring good quality of life at the time, and we really saw treatment linked to genomics. This is still
a big challenge today: how do we really deliver cancer genome analysis to guide patient care? Good, and that's my last slide. Are there questions?
Yes, so this one was made completely by hand. I know there is software; several groups have pathway-building software. Cytoscape is one that I use personally, and Gary Bader's lab has made some. There's not really one that takes in rearrangement, copy number, point mutation, gene expression, and methylation and integrates it all into one chart. I mean, this is sort of the dream software that I think we could certainly use. Maybe one of you will write it. But I think this is really what's needed now. We're very good at generating and measuring the cancer genome; we're not as good at this step, which is, you know, there are huge teams that make slides just like this.
Yeah. So I guess the question is: how do you tell the difference between loss of heterozygosity and a technical problem where you just didn't sequence the other allele? And really the solution to that is coverage: measure that site as frequently as you can, so you have confidence, or a statistical probability, that you have measured the other allele to some level of stringency. This becomes challenging especially in RNA sequencing data, where transcripts are not necessarily expressed at a high level and you might not see that second allele. So depth is one way to get at it. I don't know if I have a great suggestion for anything else.
Yeah, and there the allele dropout is actually happening in the droplet. I mean, that transcript is just never even turned into sequenceable DNA. Technical and biological replicates are the way to get there.
Yeah? "Oh, you briefly mentioned that you did some research in hearing loss." Yeah. "Did that involve cholesteatomas or any other..." No, so this is non-syndromic hearing loss. It's an autosomal recessive disease in general, so we're looking for compound inheritance of two loss-of-function
variants in children who are deaf but whose parents can both hear. The closest I've gotten is schwannoma, so, like, tumors on nerves.
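As a rough illustration of the coverage answer given earlier in the Q&A (telling loss of heterozygosity apart from simply not sequencing the other allele), here is a minimal Python sketch. It assumes each read samples the two alleles independently with a fixed fraction, which is a simplification: it ignores allele-specific expression, mapping bias, and the droplet-level dropout mentioned in the discussion. The function names and the example thresholds are hypothetical and do not come from the talk.

```python
import math


def prob_allele_missed(depth: int, allele_fraction: float = 0.5) -> float:
    """Probability that none of `depth` independent reads carries the allele.

    `allele_fraction` is the assumed fraction of reads expected to carry the
    allele (0.5 for a balanced heterozygous site in DNA).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if not 0.0 < allele_fraction < 1.0:
        raise ValueError("allele_fraction must be strictly between 0 and 1")
    return (1.0 - allele_fraction) ** depth


def min_depth(allele_fraction: float, max_miss_prob: float) -> int:
    """Smallest depth at which the chance of missing the allele is <= max_miss_prob."""
    if not 0.0 < allele_fraction < 1.0:
        raise ValueError("allele_fraction must be strictly between 0 and 1")
    if not 0.0 < max_miss_prob < 1.0:
        raise ValueError("max_miss_prob must be strictly between 0 and 1")
    return math.ceil(math.log(max_miss_prob) / math.log(1.0 - allele_fraction))


if __name__ == "__main__":
    # Balanced heterozygous site: how many reads before a miss is under 1%?
    print(min_depth(0.5, 0.01))
    # Allele present in only 10% of reads (for example, a weakly expressed transcript).
    print(min_depth(0.1, 0.01))
    # Chance of seeing only one allele at 10x depth on a balanced site.
    print(prob_allele_missed(10, 0.5))
```

Under this simple model the required depth grows quickly as the allele fraction drops, which is one way to see why the problem is harder in RNA sequencing data, where the second allele may be expressed at a low level.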