Great. Thanks. So, I'm Trevor Pugh. I'm the genomics director here at OICR and a senior scientist at Princess Margaret Cancer Centre. My role has always been right in the middle of basic research and clinical translation, so both my lab and the operation here at OICR sit right at that interface, where we try to take research findings and put them into practice as much as possible. The other hat I wear: I'm a board-certified molecular geneticist. Really, it's the best-kept secret in academic research. Basically, it's a career path for PhD scientists to get board certification and essentially sign out molecular pathology reports through pathology departments. And, in fact, it's on my CV: I'm also a graduate of the Canadian Bioinformatics Workshops. I took the same program when I was a graduate student at the BC Cancer Agency. That skill set has really served me well, especially going through the clinical genomics world, where it's increasingly needed, especially to move from single genes to whole genome sequencing. So, that's sort of a plug for that. But certainly, CBW really got me started. So, my role in this workshop is not really to go right down to the scripts and code and pipelines, but to give a high-level overview of cancer genomics. I've tried to fill this talk with as many data examples as possible, and links out to landmark papers in the field, to really illustrate how the cancer genome goes wrong and how we have used bioinformatics and computational biology to measure and interpret it. The way the whole talk is set out, the first 45 minutes to an hour is a little more didactic: here's the cancer genome, here's what goes wrong, here's how we measure it computationally. But the last 45 minutes, maybe 30 minutes, is a case report.
Basically, it's one of the first patients ever to get whole genome and RNA sequencing, and how we went through, at the Cancer Agency at that time, to interpret that genome and essentially manage that patient over a year and a half. So, these are the learning objectives of this module. How are cancer genomes different from healthy or normal genomes? How can we use bioinformatics to find those differences? What are the many approaches? You can tell there are lots of ways, lots of tools, to do things in bioinformatics. So, I'm really not going to obsess over individual tools that do certain things, but rather cover the broad-level concepts as to how specific algorithms are useful for specific cancer types or specific clinical questions. And the last part, really, is that case report: how do we use genomics to understand or guide patient care? So, any questions on that flyover overview, background, that kind of thing? I'm happy to hear questions as we go. Actually, I know several of you in the audience, so that's good to see. So, I really want to make a point: we're all made of cells. And if you have a cell, it can become cancerous. So, really think of cancer as starting from a single healthy cell. I'm asked all the time by family, what about this cancer, what about that cancer type? There are really as many cancer types as there are cell types. Genetically, this is where you can be very specific around types of cancer and drivers of cancer. So, I really wanted to start the whole talk with: what does a normal genome look like? This is what one of my blood cells looks like. If you drop it on a glass slide at the right point in the cell cycle, you get what is called a chromosome spread. Essentially, these are all the chromosomes from one cell.
And the old-fashioned way, karyotyping, cytogenetics, the clinical enterprise, the lab enterprise, is literally scissors and glue: you do a paste-up, taking all these chromosomes and lining them up side by side, purely looking at the banding patterns. So, this is what a paste-up of that chromosome spread looks like. Essentially, you're looking for commonalities or differences. There shouldn't necessarily be any in this normal blood cell. But the point of showing you this is to compare and contrast this chromosome spread to cancer. And you can tell by eye there are way too many chromosomes here. Some chromosomes are missing pieces. Some chromosomes are in IP or IP is missing here. There are translocations. There's a huge list of ways that the cancer genome can be altered, and you can see them even just by eye, looking at just the chromosomes. And, of course, this just becomes more and more complicated as you start to look at base-level resolution. Hiding in here are all the base pair changes, mutations, and other, often passenger, alterations. And a real challenge, especially for a computational biologist, is not just making the list, but also linking that list to clinical action. So, this is a quote that my previous supervisor loved as it pertains to cancer: all happy families, or all happy cells, are alike; each unhappy cell is unhappy in its own way. Really, this idea of precision medicine came from the idea that the DNA in these cells can be altered in a different way in each one. They converge on relatively few biological pathways, relatively speaking, but the way those pathways are dysregulated is going to be unique to each cell and each cancer type. You'll probably see a lot of these: this is a Circos diagram. This is another way to show chromosomes.
So, in this case, instead of the paste-up I showed earlier, all the chromosomes are now ringed around the outside. And what's annotated, actually by Martin Krzywinski, who runs the software program that makes these types of plots, is, in this inset, all the different ways that the cancer genome can be altered: mutations, rearrangements, epigenetic changes, all the things you're going to learn over the next week. The other aspect in cancer genomics is the temporal aspect. Cancer genomes change over time as cells divide. There's a great review here by Stratton, Campbell, and Futreal, really showing how these mutations accumulate, even in pediatric cancers; in fact, some of these kids are actually born with cancer. So there's not a cancer diagnosis yet, but at some point they've acquired these passenger mutations. In the case of adult cancer, it's essentially waiting for the accumulation of mutations until you get a mutation in the wrong spot, in the wrong cell, at the wrong time. Those are referred to as passenger mutations, or driver mutations, rather, shown here as a star. And then you essentially see this temporal evolution again: acquisition of additional somatic driver alterations, eventual diagnosis, and even the act of treatment can actually introduce additional mutations. The most famous of these is temozolomide, used for brain cancer. The whole point of that treatment is to just bombard the cancer genome with so many mutations that those cells eventually become unstable and die. It actually does not extend life to a huge degree, at least in brain cancer, but it's one therapeutic strategy, and completely measurable using genomics. So, just some key points on this slide. Mutation frequency depends on cancer type.
The pediatric cancers, as they haven't had the opportunity to pile up these mutations, are often the low mutation rate tumors. The high mutation rate tumors are especially the ones that have external environmental exposure: sun and melanoma, smoking and lung cancer. And if you look at tumors post-temozolomide, they're just peppered with hundreds of thousands, if not millions, of mutations in the most extreme cases. Here's a plot, the first data plot I have in the talk. This is actually a great pan-cancer analysis from Gad Getz's group, led by Mike Lawrence at the Broad Institute. In this case, each dot is an individual cancer, many of these coming from The Cancer Genome Atlas project. The way to read this: the y-axis is mutation rate, the x-axis is cancer type. The low mutators, the pediatric cancers, are down here; the high mutators, the environmentally exposed cancers, are at the other extreme. But the other important point on this plot is the huge dynamic range within an individual cancer type. While you can say that in general pediatric cancers have low mutation rates, I've circled this one here, this neuroblastoma. This case is very interesting. It actually has just as many mutations as a late-stage lung cancer. And the reason for that was, this was the only neuroblastoma that had two mutations, one in each copy of MLH1, a DNA repair gene. So there's a direct link between loss of the ability of that cell to repair its DNA and just piling up additional mutations. If you look at the cells, it looked the same; it looked like neuroblastoma at low-level resolution. But you could see all these mutations, and we see a molecular cause for it. The opposite story here: here's melanoma, very famous for a very high mutation rate.
These tumors actually respond very well to drugs that reactivate the immune system, because there are lots of potential antigens to be recognized in these tumors. This one down here actually has only as many mutations as pediatric cancers, which is very unusual. In this case, this is actually a tumor on the bottom of the patient's foot. No sun exposure whatsoever. Just extraordinarily bad luck: a driver mutation in a melanocyte that resulted in cancer. Actually, if you really squint, the way to read this bottom panel is, here are all the different types of mutations. There's one that's actually quite different: it's this little blue sliver. It does not have any of the sun exposure signature like all of the others. All the yellow mutations here are essentially induced by UV exposure. So these tumors differ both in mutation counts and in mutation types. It really all hangs together. Another famous slide, just in general: where are these mutations landing? In general, tumor cells acquire abnormal activities. Actually, they're not inventing new abilities. They're taking abilities from other normal cells and turning them on in cells where they shouldn't be active. These are termed the hallmarks of cancer. I'm not really going to belabor what these are. Essentially, you can bin a lot of these driver events into not that many biological processes. Pretty much every cancer type has co-opted or undermined these molecular processes, almost always due to a somatic alteration: a mutation, a rearrangement, a copy number change, or often a collection of all of these at once. The other goal and challenge, certainly for bioinformatics, and a question I often get, is: what is a driver mutation, and how do you differentiate it? The way this was done classically, during The Cancer Genome Atlas project, was to grind up and sequence 500 examples of every single cancer type and look for mutations that we saw over and over and over.
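To make that recurrence idea concrete, here is a minimal Python sketch. It is not the TCGA method itself: it simply counts, for each gene, the fraction of tumors in a cohort carrying at least one mutation, and keeps the genes above a threshold. The function name `recurrent_genes`, the `min_fraction` cutoff, and the toy cohort are all assumptions made up for illustration. Real driver analyses also account for things like gene length and background mutation rate, which this sketch ignores.

```python
from collections import defaultdict


def recurrent_genes(mutations, n_samples, min_fraction=0.05):
    """Return {gene: fraction of samples mutated} for recurrently mutated genes.

    mutations: iterable of (sample_id, gene) pairs, one per somatic mutation.
    n_samples: total number of tumors sequenced in the cohort.
    min_fraction: hypothetical cutoff for calling a gene "recurrent".
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        # A set counts each tumor once per gene, however many hits it has.
        samples_by_gene[gene].add(sample_id)

    return {
        gene: len(samples) / n_samples
        for gene, samples in samples_by_gene.items()
        if len(samples) / n_samples >= min_fraction
    }


# Toy cohort of 4 tumors (made-up data, not from any real study).
toy_cohort = [
    ("T1", "BRAF"), ("T2", "BRAF"), ("T3", "BRAF"),
    ("T1", "TTN"), ("T1", "TTN"),  # two hits in the same tumor count once
    ("T4", "NRAS"),
]

hits = recurrent_genes(toy_cohort, n_samples=4, min_fraction=0.5)
print(hits)  # {'BRAF': 0.75}
```

In this toy cohort only BRAF passes the cutoff; TTN is hit twice, but in a single tumor, so it does not count as recurrent.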
It actually turns out that specific cells of origin all depend on different oncogenic, or regular normal, signaling pathways, and therefore they're getting mutations that often cluster in those pathways specifically. I'm just showing here those oncogenic mutations that are essentially targeting a core set of biological functions relevant to that cell type, and this is just an example. Actually, this is a great textbook in general, Thompson & Thompson. It's sort of the Bible, certainly for the medical genetics field, filled with a lot of these types of concepts. In this case, it's this idea of a single cell activating oncogenes; these are sort of like the gas for a car, how you drive the cell cycle or cell growth. Then turning off the brakes: for tumor suppressor genes, we have two chromosomes, just like I showed on the earlier slide, so you need two mutations to knock these out, one in each copy. And then additional accumulation of mutations over time, again undermining, either turning on or turning off, normal biological processes. And this is actually still a challenge. TCGA did a great job of defining the high-frequency recurrent mutations across all cancer types in TCGA. But as we continue to test thousands or tens of thousands of tumors, we're continuing to identify even more infrequently mutated, but still biologically meaningful, somatic mutations across virtually all cancer types. The context of this talk is really not just finding mutations, but how do we do something about them. So I'm just showing two of the most important examples here. Really, this is the concept of targeted therapies. And this was really a change around the early 2000s, 2004: the understanding that yes, there are driver mutations specifically in cancer, but also the idea that if we know there's a pathway that's turned on abnormally, we can select a therapy that can then inhibit or block activation of that pathway.
So, two examples here. Adenocarcinoma, with what are now essentially standard-of-care activating mutations of EGFR. These mutations activate the epidermal growth factor receptor pathway, so the therapeutic strategy is pretty obvious: just block EGFR and block EGFR signaling. And this paper here really shows very dramatic responses to inhibition of EGFR, but only in patients with those specific mutations. Same story here: metastatic melanoma, in this case patients really just filled with multiple metastatic lesions, and then very, very dramatic responses to vemurafenib when those tumors have BRAF V600E mutations. The problem with this approach, as I hope I've made obvious, is that there's a lot of genomic complexity to these tumors, and this therapeutic strategy is just hitting one of potentially hundreds of other driver mutations or other pathways active in these cells. So using these targeted therapies in a single-agent, monotherapy approach almost inevitably results in resistance. It's very, very rare that any of these single targeted agents results in a cure. Patients certainly derive a very strong survival benefit, but the path forward is really the need to identify multiple oncogenic pathways active at once and hit them simultaneously or very close in time. And one piece of data to support that is this idea that tumors almost always, or at least the majority of tumors, have mutations in multiple targetable pathways. So this is a big bioinformatic exercise published in Nature Genetics. The way to read this plot: each column is a different cancer type. You'll get to know these little four-letter codes, the codes that TCGA has used: kidney cancer, lung adenocarcinoma, etc. So that's how you read the columns. Each row is a different driver mutation or tumor suppressor. And there were three main points from this paper. First, druggable alterations cut across cancer types. In this case we have three mutations being seen in multiple tumors.
BRAF mutations are another good example. Actually, the reason I like this figure is that they've mapped all the genes to specific oncogenic pathways. So the second story is that it's very common within a cancer type to have two or more pathways implicated. In this case, these two cancer types first of all share mutations in two pathways, but two pathways are also being mutated in individual tumors. And three, at least half of tumors have at least two disrupted, immediately actionable, immediately targetable, druggable pathways. The other challenge, and actually the main reason for resistance, is that I've talked so far about cancers, or tumors, as if each is one amorphous blob of cells, and it's definitely not that. Tumors are actually a colony, or microenvironment, of cells. It's really well illustrated here in this review: in this case you start with one cell, but as those cells divide, just like we saw over time, new oncogenic mutations or new alterations occur in daughter cells, and then those daughter cells give rise to additional subclones. So in this cartoon there are actually three subclones active, and this is a major challenge for molecular testing, because you may have your driver mutation, EGFR, BRAF, what have you, over here, but it may actually be absent in other subclones, or these cells may no longer be dependent specifically on having those pathways activated.
And this is really one potential model for why monotherapies may not be effective long-term: we may be effective against the single clone here, but additional clones may actually expand, in some cases to fill the niche that's been left by subclone one. And we know that at the single-cell level, because if you take a FISH probe and probe for large deletions or copy number alterations, you can even see by eye in specific cells that some have the gains and some don't, some have deletions and some don't. And this is really the challenge in genomics: how do we map all the driver mutations not just to tumors, but also to subclones and ultimately single cells? Technology is just now coming online that's actually enabling this idea of precision medicine at single-cell resolution, essentially looking at the DNA of the cells individually. So I foresee a single-cell genomics workshop, if there isn't one already, being in the works in the next couple of years. Here's a conceptual diagram of what that can look like over time. A single cell here, driven by driver mutations and loss of tumor suppressors, a relatively short list here, and then the expansion of that highly heterogeneous mass. Here are all the different subclonal populations. Some share mutations, so this clone actually has two yellow; they arose from a single clone, so you have a subclone of a subclone, and so on. In this case, the patient was treated with chemotherapy, which did a very good job de-bulking the tumor in general, so it completely killed off the purple subclone, but unfortunately grey, yellow and orange survived. And really the idea, especially upon relapse, is: what are the remaining drivers, what are the features of the relapsed disease, and how can you hit that with a second targeted or combination therapy? In this case here, this patient ultimately relapsed with multiple new driver mutations unique to a totally new clone that actually was not even apparent at diagnosis. So this illustrates a few reasons, or a few
concepts. One, the power of longitudinal sampling of these tumors over time, and often the disconnect between the diagnostic sample and what you actually see at relapse, and really the need to have access to tumor material, or in some cases cell-free DNA, as close to the time of clinical decision as possible. Here's another example, from Peter Campbell again, looking at multiple metastatic tumors from the same patient and really showing how a tumor can move around the body and, as it metastasizes, found its own microenvironment with its own subclonal structure. So here's the primary tumor, a metastasis hopping out here, additional subclones, a metastasis seeding additional metastases. There's also some really nice work from Sohrab Shah, where he showed even these tertiary metastases can go back and reseed the other metastatic sites. So really some great molecular complexity, all of it measurable and potentially actionable, depending on the types of mutations that we're encountering within these subclones. So, here's the list: why would we do molecular testing for cancer? I've talked a lot about targeted therapies and treatments: understand the drivers, understand the pathways implicated. Measuring drug resistance, especially in the monotherapy case: when do we see resistance, and how do you direct the patient towards a different targeted therapeutic strategy? I'm not going to talk a lot about this, but certainly the germline space, inherited cancer syndromes: these are patients who are born with a cancer predisposition variant and are essentially waiting for loss of that remaining allele. And prognosis: here's a cancer type, is it high risk or not? Is it a cancer of unknown primary? Can molecular testing help refine a specific cell of origin for unusual cancers? So that's the general overview. Does anyone have any questions about the reasons for molecular testing, genomics in general? OK, seeing none, I'm going to really push into... yeah, Roman? Yeah, so Roman's comment was: we can find mutations,
how do we show that there's actually interaction, or I guess even interdependency? This is really the boom now: being able to link a genomic readout to a functional genomic readout. Because I would argue that, certainly in clinical lab testing, right now it's find a mutation, report it back. This is really where I think research has to happen now: how do you model multiple mutations or multiple pathways active in single tumors, and then how do you come up with therapeutic strategies against those? I think there are a few bioinformatics algorithms that do that, but to me you essentially need a lab experiment to show that yes, these two axes are critical. One experiment we're doing with Peter Dirks now is we sequence cell lines in a dish, define the clones using single-cell sequencing, and then predict drugs that will work on one clone but not the other. So we'll try drug A, we'll try drug B, we'll try A plus B, and see if it kills everything. That's still super research; it's certainly not clinical grade, but it's one conceptual way to take a measurement of somatic mutation and map that to function. But yeah, it's a lab experiment. Yeah, good question. Anyone else? OK, so I'm now going to move from the flyover, here is the landscape of all cancers, to the individual patient level. For the patient, it really doesn't matter: as I showed in that neuroblastoma and melanoma example, it's great to know that this type of cancer has a high or low mutation rate, but really the questions for an individual patient are: What are the targets in my cancer? Is this specific n-of-1 tumor high mutation or low? Does it have mutations or predisposing polymorphisms? What is the copy number landscape for an individual person? Are viruses or bacteria involved? And what can be done about them? So really the point is always to make a list, but also to interpret what that list actually means. Just running the pipeline is not the end point. So I'm going to talk a little bit about next generation
sequencing. This is essentially the machine that we use to measure cancer genomes currently. The workflow I've just broken into four steps. Take a tumor; extract DNA and RNA; make a library, where a library is essentially DNA that's compatible with being read on the next device, which is the next generation sequencer. So essentially you take the DNA or the RNA and put on these little synthetic adapters that make them readable by the sequencing device. And then computational analysis, which is essentially tailored to either the clinical or the scientific question. So I'll just show some data examples. This is what comes off an Illumina sequencer. It's literally just a text document; you can open it in Notepad if you want to. Don't, but you can. Each row is a read: As, Cs, Ts and Gs. I've just shown 25 here. On a HiSeq 2500, which is now an increasingly obsolete machine, one lane generates 600 million of these reads, so you get massive numbers of reads. This is not meant to be human readable, of course; you need computational analysis to make sense of just the raw reads. So I'm sure you've heard this concept of a pipeline, essentially something that takes that raw data and transforms it into an output with meaning. The idea here, at least for us, is to take that raw DNA sequence in, run it through a series of pieces of software, and really step one is that list of variation. Pipeline is actually a pretty good analogy: all the pieces have to fit together perfectly, without leaks. They're relatively modular. So really the idea is taking reads and comparing them to the genome, which results in a new data file; that data file can now be read by a variant caller; those variants can now be read by a variant filter. Essentially, it's this idea of software being interoperable, really around file formats and data standards. The generic pipeline, at least for DNA genomics, is: alignment, where you take the reads I just showed you and compare them to the human genome reference. Then, how do you QC, or rather, how do you tidy
up that alignment? We use aligners that are very fast, and sometimes they make very minor mistakes, so step 2 fixes that somewhat. Step 3: how do you assess the quality of those alignments? Is your sample any good? Did you even put DNA in? Does it look like RNA? Is it human, if you put human in? QC is very, very important. Then calling variants. I put these four in black because they're more or less automated; they are here at OICR. The last two steps are really not. They're really the enterprise of computational biology and the act of variant interpretation. These really still need humans to go through, not read by read, but variant by variant, to make sense of an individual case, and we'll go through all of these during my talk. By far the most famous and widely used pipeline is the Genome Analysis Toolkit. This is a piece of software, actually an analysis framework, from the Broad Institute: essentially the idea of moving from raw reads through alignment, a comparison to the human genome reference, then that QC step I talked about, realigning around hard-to-align regions, fixing some of the quality of the bases that came off the sequencer, calling variants, and essentially doing interpretation at the end. The only change to this slide is that I've put up some of the tools we use for cancer. GATK was actually built more for germline analysis, and has now of course been used by the cancer community as well. The reason we need cancer-specific tools is the types of cytogenetic events that we know happen in cancer. We can't guarantee that cancers have two copies of every chromosome. I just showed you at the beginning that that doesn't hold. They double their genomes all the time; you get these awkward, like, triploid states where you double and you lose a chromosome. We need callers that know that that can happen. The other challenge with tumors is they're not 100% cancer cells; they're always a mixture of cancer cells and stroma, blood vessels, immune cells.
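As a rough illustration of why a cancer-aware caller cannot assume a variant shows up in half the reads, here is a small Python sketch of the expected variant allele fraction given tumor purity and copy number. The function name `expected_vaf` and its parameters are hypothetical, and the model is a deliberate simplification: one clonal somatic mutation, contaminating normal cells with two unmutated copies, and no sequencing or capture bias.

```python
def expected_vaf(purity, tumor_copies, mutant_copies, normal_copies=2):
    """Expected fraction of reads carrying a somatic variant.

    purity: fraction of cells in the sample that are cancer cells (0 to 1).
    tumor_copies: total copies of the locus in each cancer cell.
    mutant_copies: how many of those copies carry the variant.
    normal_copies: copies in the contaminating normal cells (assumed unmutated).
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if tumor_copies < 1 or not 0 <= mutant_copies <= tumor_copies:
        raise ValueError("need tumor_copies >= 1 and 0 <= mutant_copies <= tumor_copies")

    mutant_dna = purity * mutant_copies
    total_dna = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_dna / total_dna


# Pure diploid tumor, heterozygous mutation: the germline-style expectation.
print(f"{expected_vaf(1.0, 2, 1):.3f}")  # 0.500
# Same mutation, but only 40% of the cells are cancer cells.
print(f"{expected_vaf(0.4, 2, 1):.3f}")  # 0.200
# 40% purity and a triploid locus with one mutant copy.
print(f"{expected_vaf(0.4, 3, 1):.3f}")  # 0.167
```

Under these assumptions the same heterozygous mutation drops from 50% of reads to well under 20% once normal contamination and an extra chromosome copy are taken into account.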
We need callers that know that that can happen. You can't assume that half of your reads will always have a variant and half of your reads always won't. You need cancer-aware algorithms, and this is really where cancer bioinformatics diverges from conventional, I guess I'd call it germline, genomics. Step one is to take that... yes? Great, so the question is, is that the reason you always have to run the paired normal? Yes. Especially as your panels increase in size, as you look at more and more genes, it's essentially mandatory. So for anything exome and above, I would say always a paired normal. Even for panels of 200 to 500 genes, we always run into a variant that looks interesting, but if we sequence the normal, it happens to be there. So yeah, it's really worth the cost, if only in everyone's salaries trying to figure out weird and rare variants, to just sequence the normal and settle the question. Maybe for a 5-gene panel I could probably go through 5 genes manually, bioinformatically. But yes, I'd definitely encourage you to build that into your grants and budgets: sequence the normal. So here's what step one looks like. It's actually the same data I showed on the earlier slide, just aligned to the human genome reference. This is IGV, which I use pretty much every day to look at next generation sequencing reads. So, the way to read it: here's the chromosome along the top; here's the region I've zoomed into, and I've zoomed in extraordinarily close, just into 92 bases. Here's the coverage; this is a histogram of how many reads aligned. The next track is the human genome reference. You can literally just download this off the internet; I usually go to the UCSC Genome Browser, and there's literally a link to download the reference. It's only 3 or 4 gigs, so it's not too large. And then in gray are all of my reads. So essentially the question for cancer genomics is: how do our reads differ from the reference, and how do they differ from a matched normal? I've intentionally chosen this case because there shouldn't be any variation in this region. I think there are
one or two; sharp-eyed students found one years ago. It's almost always a sequencing artifact, especially in a sea of non-variation, but there are some technical questions around what is a sequencing artifact versus what is an extraordinarily rare point mutation. I'm not going to drill into that too much in this talk. The other quirk is that there are many, many, many ways to make this slide. There's a whole Wikipedia article on short-read sequence alignment. On this slide we use BWA; it's by far the most widely used, because it's extraordinarily fast for the quality of the alignment. But there are lots of other tools, some of them you pay for, some of them not. It's probably more than 50 now. So there are lots of ways to transform your data from raw, which is FASTQ, the file format we use at least for data coming off the Illumina sequencer, into a BAM file, which is essentially a binary alignment file. And the nice thing about this IGV software is that if you have data in BAM format, it will let you read and look around at these reads. This is actually a cartoon of what I just showed you, so it's the same sort of layout. The reference genome is here along the top, in this case two reference chromosomes, chromosome 1 and chromosome 5, and here are all the reads I just showed you. This review has actually aged really well; these are basically the same types of cancer genome variation that we still look at to this day. In this case, you're looking for reads that have a difference, here a bunch of Cs where there should be something else: point mutations. Indels: very, very small stretches, I think the formal definition now is up to 50 base pairs, so a deletion or insertion of a very, very small portion of DNA. Contrast that with very, very large deletions, the sort of things that you could have seen cytogenetically. We actually infer those from missing data: if you try to sequence a region and your cell just doesn't have that DNA,
because it's deleted that chromosome, that actually comes through in the data as a lack of reads. In this case, you have half as much data in that region as you'd expect; sorry, this is a two-copy deletion, and this is a one-of-two-copy deletion. So with a single-copy deletion, you have half as much data as you'd expect. If you have a copy number gain, you have more chromosomes than, say, a normal diploid cell; we see that as an additional number of reads, and we infer that as a gain. And a translocation breakpoint: this is where one end of the DNA fragment aligns to one chromosome, and if you sequence the other end of that DNA fragment, it aligns to a totally different chromosome. This is where there's been a breakage of chromosomes, and the cell has tried to repair it and incorrectly put chromosomes together. This often puts a driver gene under the control of an inappropriate regulatory element, so it's a very powerful way to upregulate or turn on a driver. And since it's DNA, if you're doing agnostic sequencing, where you just grind up cells and sequence them, other DNA or RNA actually comes through as well. If your tumor happens to be associated with a virus, EBV, HPV, that DNA sequence of course comes through. You need a special algorithm to take those reads and align them not just against the human reference but against other non-human references as well. And this is more for test purposes, I don't think it's part of this workshop, but really here are the different ways we've thought about DNA, RNA and protein sequencing. Whole genome sequencing: no selection at all, just take all the material from a cell and sequence it. Whole exome sequencing: this is essentially a laboratory protocol modification where you use a hybrid selection lab protocol to pull out only the protein-coding genes. These are only 1% of the whole human genome, but they are the part of the human genome that encodes proteins, and they're certainly the most well studied. So in terms of
linking anything to action, almost everything can be captured by whole exome sequencing. That's becoming less true now as regulatory elements and regulatory mutations are being discovered, but today, right this second, whole exome sequencing captures a lot of that actionability. Targeted gene sequencing: this is done more for cost, so really narrowing in specifically on genes that get you onto a clinical trial or on very high-frequency sites of mutation. Targeted variant genotyping: even more specific, so we're only looking at one single mutation. If you really only cared about one site, there's no need to sequence the whole genome, just sequence that one spot. And epigenome modifications: this is actually a type of change to DNA, the most famous being 5-methylcytosine. Essentially that's invisible to conventional sequencing, so you have to do a chemical treatment of the DNA to essentially unmask or mark specific regions that have this epigenomic modification. RNA sequencing: conceptually the exact same thing, a very similar pathway to what I just showed. Instead of extracting DNA from cells, you extract RNA and run it through a very similar pipeline. In that case we're not necessarily as interested in mutations, but we are interested in quantification. So if a cell is expressing a gene at a very, very high level, we get many, many reads for that highly expressed gene. Same sort of thing: if a gene is not expressed, we shouldn't see any reads from those genes. And I'm not going to talk about proteins at all; this is really going to be more or less a DNA and RNA talk. But this sort of field is really starting to rise, certainly even in the clinical profiling world, especially as it intersects with DNA sequencing and RNA sequencing. So I'm going to focus mostly on DNA and RNA. I want to show you some real data. This is the same sample using three strategies: whole genome sequencing, exome sequencing, RNA sequencing. This is an IGV screenshot just like I showed you earlier. This time we're zoomed way
out: instead of 92 base pairs, we're looking at 49,000 base pairs. You can see with whole genome sequencing we have coverage across essentially this entire region, and these 49,000 base pairs essentially correspond to the single KRAS gene, a very famous cancer gene with very well understood hotspot mutations. Specifically in KRAS, in this case, you can see whole genome sequencing gives nice, relatively even coverage across the entire gene. Exome sequencing: there's that laboratory modification, so you only get reads on exons. You can see down here, here are the regions that encode protein, and that's the only place you get reads. So you're completely blind to what's happening in the intronic sequences: maybe interesting regulatory elements, structural changes, all sorts of things. You're not going to see them by exome sequencing. And RNA sequencing: you let the cell do the work. The cell is the one that's splicing out the introns, so again you don't actually get reads within the intronic sequences. What you do get are these relationships between exons. This is actually very powerful because you can actually see where a read starts in one exon and ends in another, so if an exon is not being used, or it's being used in an abnormal way, RNA sequencing will actually let you look at the structure of how that RNA is being made. I'm sure you've heard the terms whole genome and whole exome sequencing. Whole does not mean 100%. This is actually a figure from a paper I worked on as a postdoc. This is actually a set of 240-odd neuroblastomas. The way to read this plot: each tiny, tiny slice here is a sample, and the color is our ability to find mutations. So one means great coverage, we're very well powered to find mutations, but down in black, or actually even at 50%, essentially the coverage is relatively low or we're just not able to confidently map reads to those specific regions of the genome. So you can see by far most of the exome is white, we're pretty good at calling mutations, but there is sort of this band down here
where that ability to call mutations breaks down, either due to problems in mapping reads to the whole genome or, most frequently actually in this project, there were large-scale deletions which actually resulted in no coverage in those regions, which therefore had low coverage. So it wasn't necessarily a problem with whole genome sequencing; it was actually a feature of the cancer genome. You can't sequence DNA that's not there. But there was sort of this band in black: basically we could never find mutations in any samples, and even using relatively deep genomes at the time, we still had this inaccessible portion of the genome. And these are still challenging to sequence with short-read sequencing technologies today. These are repetitive regions, the telomeres, the centromeres, regions of the genome that have been duplicated, so we just can't tell whether a read goes to the original copy or to a duplicated part of the genome. There are long-read technologies that do allow you to access this region of the genome, there are new phased, linked-read technologies, but really, whole does not mean 100%. Francis? Yeah, so the exome will change depending on the company you buy the baits from. So yeah, I try not to use whole too often. It's sort of entered the lexicon, but yeah, whole is really not 100%. The definition of the exome will actually change depending on who designed the bait set. Is it one reference transcript? Is it every isoform we've ever seen associated with those genes? Is it the exome including UTRs or not? Especially as bioinformaticians, this is extraordinarily important to know, especially as you start to take data from multiple sources, many of which will use different designs of the exome, and also over time, so newer exomes will include genes that may not have even been discovered on the earlier bait sets. The exomes have also been overly annotated over time, so there are some exomes that are absolutely huge, like 80 megabases, and actually the later versions, even from the same company, have now shrunk down to
70 or 60 megabases. So it's really very important to understand, especially when you get exome data, how did they define the exome? Because in general the genes are highly overlapping, but they're not 100% consistent and certainly not the same over time. Yeah, so sorry, yeah, so baits: in this case, the way most exome sequencing is done is essentially you have a synthetic piece of DNA based on the human genome reference, you tether it to a magnetic bead, and essentially use a magnet to pull out only DNA that has stuck to that bait. And since it's synthetic, you can design that to be whatever you want, and that is the reason for the change in the design of the exome. Yeah, so you may hear a few terms in exome sequencing: baits, probes, targets. They all essentially mean the same thing, just to pull out a specific region of interest. I mentioned RNA sequencing. In this case, for RNA, of course, the dynamic range is much different than for DNA. DNA should be 1 copy, 2 copies, 3 copies, 4 and above. RNA sequencing is much more continuous; cells can turn RNA on and off to a variety of levels. I try to make this plot for almost all the RNA-seq data I look at. I just basically take all the genes or all the exons and I sort them by how many reads mapped to them. You should basically always get this S-curve: very, very highly expressed, or off or very, very low expression, but actually a lot of the biological action is in sort of this middle region. I've just shown this example here. We have this patient, a patient with breast cancer. There are three very important genes for subtyping breast cancer, and if you basically color all the exons that correspond to these genes, you can see they're all sort of in this tail trending to zero. And then as a reference set, here are two totally different patients, one who's ER positive, one who's ER negative, and you can see the ER-positive breast cancer is basically expressing ESR1 at a relatively high level, certainly compared to the negative, and the negative
basically looks a lot more like this patient. So really a very simple example of how you can use a reference set to put your gene, or your gene expression level, in context with other genes from similar tumors of the same type. The other data type, not nearly as widely used as exome, genome, or RNA sequencing, is bisulfite sequencing, or really what's being increasingly termed methyl-seq. So in this case you essentially treat your DNA with a chemical. If your C base, your cytosine base, is protected by a chemical modification, it doesn't get changed, but if you don't have that modification, this chemical has the ability to then change that C to a T. So here's what that data actually looks like. Otherwise it's exactly the same workflow as for whole genome sequencing; you just do that chemical modification at the start. In this case we're looking specifically at MLH1, an important DNA repair gene I talked about in the context of neuroblastoma. Here are the exons along the bottom. In this case we're very interested in this first exon because it has this little regulatory region, and that regulatory region is under the control of promoter methylation. So essentially what we're looking for is: if that stretch of DNA is essentially painted, if all the C's have a methyl group, it shuts down expression of this gene, and if those methyl groups are removed or are not present, that then allows the gene to be transcribed. So this is what it looks like in a normal tissue. It's unmethylated, so all these C's don't have the protecting methyl group. Essentially we know that this normal tissue normally expresses MLH1, so that normal healthy cell is able to repair its DNA. Compare that to endometrial cancer here: all of the C's specifically in that region are completely methylated, so all the C's stay a C, they're protected from that chemical, and essentially we can tell. In this case this is actually a targeted diagnostic test only focused on the promoter. You can see even by eye
though, that really that promoter is methylated and essentially MLH1 is not active specifically in this tumor. The reason you do this test in this tumor: this specific type is actually famous for frequently having loss of DNA mismatch repair, and this is one technology that allows you to look not just at that one gene but at a whole set of genes associated with MMR. This is also often sort of a handy checklist, especially as you're thinking of how do you design a scientific experiment. Here are all the different types of genome assays I've just talked about, and here are all the ways that the cancer genome can go wrong. I'm just going to spend a little bit of time showing data examples of what does a mutation look like, what does a copy number variant look like, in specific types of data. Are there any sort of questions up to this point, just as we get to calling, before we go into detail on each of these variant types? Okay, great. Okay, so I just want to start with somatic mutations. I really encourage you not to just trust the output of your callers. It's actually pretty straightforward, especially after this workshop, to get raw reads, align them, and call mutations, but you really absolutely must go that extra mile to actually look at the underlying reads that support those variant calls, especially if there's something biologically out of place. So it's very unusual, for example, to get three mutations in the same gene unless you're looking at all these subclonal variants. So often, especially if I see three mutations side by side, I always open up the reads right away and ask the quick questions: are those mutations next to each other? Are they all in the same read? Should they actually have been treated as one very complex mutation? It's very hard to write a little script that can go out and do this for you for every single edge case for a mutation caller. It's very powerful just to open up the reads and look at them in IGV. So here's what a real mutation or real variant looks like. So,
same sort of thing: coverage along the top here, reads, sort of a read stack, and we're looking for positions that look like this. In this case half the reads have a C, half the reads have a T. In this case, the allele balance; you'll actually hear a few words that refer to allele balance. Variant allele frequency, mutation frequency is another one, mutant allele fraction: these all mean the same thing, what fraction of your reads have a non-reference base. So in this case the allele balance, or VAF as I would probably normally say, is 0.47 or 47%. There are a few, I guess, aspects of cancer tissue that can cause that VAF to drift away from 50%. So the reason this is so close to 50% is if you have a perfectly pure cancer cell population, you have two copies of each allele, one of them is mutant, one of them is not, therefore half of your DNA should have the mutation. So in theory, in sort of a perfect sample, certainly for a germline variant where all the chromosomes are present in two copies, essentially you should have an allele balance very, very close to 50%. Just by sampling error you can get up to 60, maybe 65%, or down to 40, maybe 35%, but in general all of your variants in a totally pure diploid cell population should be at 50%. This almost never happens with a cancer sample, for all those other reasons: the genome doubles, you have copy number changes, you have non-cancerous cells. The DNA we get, certainly in clinical labs, is very poor. So the priority, certainly in pathology departments, is to collect and preserve tissue for histology over everything else, which is obviously appropriate; that's how you actually get the cancer diagnosis. So essentially you get a tumor sample, it's essentially preserved in formalin and embedded in a wax block. This has the benefit that you can actually store these tissues at room temperature; you can then take thin sections and look at them under a microscope. Unfortunately, as you can see on this slide, it results in short, damaged DNA. So in this case this is a DNA ladder, so this
is high molecular weight DNA that's more or less intact; this is highly fragmented. It's actually from one of my first papers ever, but the lesson has still held: even from formalin-fixed, paraffin-embedded tissues, you can in general still get high molecular weight DNA. But I ordered these samples essentially from good to bad. In fact lane 12 here, this was a metastasis to bone. That tissue was essentially decalcified, and that decalcification process completely destroyed the DNA, so that one's pretty much unusable. But in general, genomic techniques have actually gotten better and better at starting to use these highly fragmented, low quality, but still pretty usable sources of DNA. But this is really the number one question, especially if I get a sample I've never seen before, or the quality control is not as good, or I'm not finding mutations: it's always very helpful to go back and understand what is the history of that tissue prior to sequencing. Yeah, so Roman's question was really around: is it improvements in the lab processing of the tissue, and I guess maybe the library processing as well, or the analysis? It's probably all of those, but I think the library construction is what has improved the most. How do you take DNA like this and make a high quality library that represents all the DNA fragments that were in the original sample? I think that's the layer that's improved the most. The bioinformatics techniques, certainly for quality control, have gotten better, knowing that this is an issue. But still, I mean, the number one thing I'm looking for in a new kit that makes that library, so that second step in that workflow, is how do you effectively put those adapters onto every piece of DNA, because it's very hard to capture especially this DNA that's highly fragmented down here. So if anyone wants to make millions of dollars: a better ligase to put adapters on, or some alternative to that. It's still a big unsolved problem. So you're doing very, very well if half of the DNA that you
extract actually makes it into your next-generation sequencing library. It's still a very challenging lab problem. The other challenge is tumors are not 100% cancer cells. So you'll notice I always try to distinguish between tumors and cancer. Tumors to me are all the cells in a mass, and cancer cells are the specific cells that are cancer within the tumor microenvironment. So I'm just showing here, this is actually a planning diagram for a laser capture microdissection. This is actually a lung adenocarcinoma metastasized to a lymph node. In this case we've sort of gone through and circled out all the tumor, but you can see all the intervening stroma and other types of cells that are in here as well. And in this case, essentially for every read of cancer-derived DNA we sequence, there's a second read corresponding to normal DNA. And this is really why we have to have very, very deep sequencing. We need 100x, probably in many cases more than 100x coverage, just because half of those reads are just not going to have any chance of having any somatic mutations at all. There are sort of two buzzwords to describe this concept: tumor purity or tumor content. Both of these mean the same thing: what fraction of your tumor cells are truly cancer versus non-cancerous, genetically or from a slide like this. So from histology, essentially this is what pathologists do, you can look and classify cells. You can't really do it using DNA sequencing. They're all diploid in general, they all look very similar to the matched normal, so it's very hard using DNA sequencing. RNA sequencing has the challenge that if you just take a sample like this and grind it up, you're now looking at a mixture of gene expression profiles, cancer and non-cancer. There are actually whole algorithms now that rely on that fact, because you can look specifically at transcripts from known immune cells and for the presence of immune cells. Really the only path, as far as I know, to actually do this is to get down to
single cell RNA sequencing, to essentially digest a tissue like this into free-floating single cells and sequence them one by one. Genomically, that's really the only true way to really pinpoint exactly the cellular composition. Flow cytometry is actually another good way. So there are ways to do it, but just using conventional genomics, which is cut and extract, it's very, very difficult. So we talked about tumor purity and tumor content. The other... Sure. So read depth is just how many reads do you have at a position. So the read depth here is 19x. So in this case this variant is very easy to find because, of the 19, roughly half are going to have a C and half are going to have a T. The challenge, and I actually have a slide on this later, will be: what if only half of your cells are tumor? So now we're looking not at ten C's, we're actually looking at five C's, five out of twenty. And if this was actually only 10x coverage, now instead of five we're looking at two. So as your coverage goes down, it gets harder and harder to find these variants. I have a slide on this a little bit later, but yeah, coverage and read depth, both of these mean the same thing: how tall is your read stack at a given position. And actually you can see in the whole genome there is some fluctuation to it at each position. That's why we always report read depth or coverage as an average or median, because the standard deviation can actually be pretty wide. Actually, a great example here in the exome: these two exons are right next to each other, and this one probably has twice as many reads as the other, purely for technical reasons in the lab.
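The arithmetic in that read-depth example (ten of twenty reads in a pure sample, five of twenty at 50% tumor content, about two at 10x) can be sketched in a few lines of Python. This is a minimal illustration, not something from the talk: the function names, the simple purity and copy-number model, and the threshold of three supporting reads are all assumptions chosen for the example, and real variant callers use much more elaborate error models.

```python
from math import comb


def expected_vaf(purity, mutant_copies=1, tumor_copies=2, normal_copies=2):
    """Expected fraction of reads carrying a somatic variant at one position.

    purity: fraction of cells in the sample that are cancer cells (0 to 1).
    mutant_copies: copies of the locus carrying the variant per cancer cell.
    tumor_copies / normal_copies: total copies of the locus per cancer cell
    and per normal cell (assumed values, diploid by default).
    """
    mutant_dna = purity * mutant_copies
    total_dna = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_dna / total_dna


def prob_at_least(depth, vaf, min_alt_reads):
    """Binomial probability of seeing at least min_alt_reads variant reads."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Hypothetical calling rule: require 3 or more supporting reads.
    for purity in (1.0, 0.5):
        vaf = expected_vaf(purity)
        for depth in (20, 10):
            print(
                f"purity={purity:.0%} depth={depth}x "
                f"expected variant reads={vaf * depth:.1f} "
                f"P(>=3 variant reads)={prob_at_least(depth, vaf, 3):.3f}"
            )
```

Running it prints the expected number of variant reads and the chance of clearing the assumed three-read threshold for each purity and depth, which is one way to see why lower tumor content or lower coverage makes a variant harder to find.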
The other confounder, I showed the slide earlier, is this idea of ploidy or tumor copies. So in this case, if you're looking for a mutation on chromosome 7, if it's only on one of the five chromosomes, this is going to be a very challenging mutation to find, because first of all not all tumor cells are cancer cells, and even within the cancer you're looking at one copy in five. So really this is where finding these extraordinarily rare mutations, especially if they're on an individual chromosome that's been amplified, is very difficult. On the other hand, if you had a mutation on chromosome 7 before it was amplified, those mutations are very easy to find, because one tumor cell is putting in five copies of that mutant. So it's sort of a bit of a double-edged sword, depending on the type of variation that you're looking to find. Just to your point here, here's that sort of idea of coverage. This is actually the same sample sequenced two different ways. At the time, conventional whole genome sequencing, so I've zoomed way out so all the reads look very small: in this case 15x coverage, no hint whatsoever of a mutation. Contrast that to the same sample using exome sequencing. This is an important cancer gene, and we actually found six of the 139 reads actually did have the mutation. And this is purely sampling: sequencing this 15 times is just not enough to have a chance to find even one read. Actually, in this case it was sort of a combination of the worst possible, so a perfect storm of low tumor content and a copy number gain that's not linked to the mutation at this position, so a very, very difficult mutation to find, hence the need for increased read depth. The other benefit of whole genome... Yeah, yeah, exactly. So reads equal dollars. So essentially the question, especially in designing an experiment, is where do you want to spend your budget? Is it going extraordinarily deep in relatively few regions? This is why
clinical labs really use the targeted panels, because with these clinical samples, for the same budget, they can go very, very deep on relatively few targets. And yeah, this is where I'm a big proponent of comprehensive testing as part of clinical care, because that actually amplifies research as well, and I'm hoping the case report really gets at this point as well. There's also the problem of what is the clinical significance of an extraordinarily rare, you know, three in, or six in, a hundred and thirty-nine read subclone. It depends on the tissue. Yes, it really depends on your question. So usually if the experimental design needs detection of variants at one in a hundred, then I would back-calculate from there. So a good example is cell-free DNA. If I want to find a piece of tumor DNA that's circulating in blood, let's say for the experiment I need sensitivity down to one percent, then you can back-calculate: okay, I need one percent, therefore I need this depth to have a chance to even find one in a hundred, because you actually have to sequence at least a hundred. There's sort of a statistical model; you can sequence a little bit more or a little bit less depending on your desired sensitivity. And then in what genes do you need that sensitivity? Is it literally just in one spot? Is it a big panel? Do you need the entire genome? Sensitivity down to one percent across that is extraordinarily expensive. So there's really no rule. There are sort of benchmarks that have been used. So for The Cancer Genome Atlas they targeted a hundred-x median coverage across the genome. Most genome studies out of the POG program in Vancouver, where they're clinically acting on genome sequence, require 80x coverage across the whole genome, which gives them a certain sensitivity. Nothing's ever perfect, you don't have infinite budget. There's really no law; it really has to be tied to the science question first and then back-calculated. This is one example of what
can be done with whole exome sequencing. So in this case this is a 250x exome sequencing experiment we did to infer the tumor content and the ploidy, and to look at some subclonal point mutations as well. So here's a tool, it's called Sequenza. Each little dot here is a point in the genome that had a germline SNP. So we wanted to say, are there differences in copy number across the entire genome? So if we just focus here, here's the coverage normalized to a reference. So if this was just a normal blood cell, all these tracks would perfectly line up at one. You can see they obviously don't. There's a copy number gain here, there's a deletion here, a big, huge copy gain of all of chromosome 12. This tool essentially infers from this data what the likely tumor content and what the likely ploidy is. So basically, using this algorithm and using this pattern of copy number gains and losses, it says it's likely that 61% of your DNA comes from tumor cells, and those tumor cells have more gains than losses. So a tumor ploidy of 2 means there are no gains or losses. In this case there is a good portion of the genome that is gained. It's not a complete genome doubling, that would give you a ploidy of 4; in this case it's likely just a diploid, and then these very large gains of chromosomes likely brought that average up. You can see there are actually multiple solutions from this algorithm, so I've just shown the most likely one here. But this is really why it's very important to work especially with the pathology team that's working with the tissue, to really give you insight into the tumor type. What is the likely ploidy? Do these genomes always double themselves? Because maybe then you want to change your model for tumor content or tumor purity. And also just looking at the slide: it says 61%, but are there 10% tumor cells, or is the whole thing just an effaced lymph node that's 100% tumor cells? That really has to go into the interpretation of this bioinformatics result. A second tool here is called PyClone. In this case each
little stripe here is an individual mutation. This is the calibrated cellular frequency, so all these mutations here essentially are in 100% of the tumor cells, with the assumption that the tumor content is 61%. So there's sort of some calibration of that allele balance, variant allele frequency, whatever you want to call it, and essentially this algorithm bins them into clonal mutations, those are mutations that are in every cancer cell, and subclonal mutations. And then actually the only reason we saw these extremely rare mutations is because we sequenced these to very high depths, so we actually have the likelihood to pick up these variants that only have allele fractions around 20%. Yeah, so you know, I'm sure you've seen these beautiful phylogenetic trees. They assume that all the copy number variants or mutations at the same allele frequency belong to the same cell or population, but you don't actually know that. So the assumption here is all these mutations that have the same VAF, at least as published, belong in the same cell or the same population. You can't actually know that without sequencing an individual cell, and almost definitely, you know, especially towards the end here where these are starting to tail off, there might actually be an additional clone. So these clonal phylogeny reconstruction methods are really very challenging, and actually Paul Boutros and Quaid Morris wrote a really great paper where they essentially compared all these different reconstruction methods, and they often gave quite different, in some cases diverging, results. So it's really not a solved problem, certainly from bulk data, and this is where I've become such a fan of single cell sequencing approaches, because you can really be very quantitative around what subclones are where and present at what frequency, because this is purely a statistical inference based on the allele fraction. Yeah, you really need
some gold standards. And it actually turns out those algorithms are very good in the authors' area of disease expertise. So often you really want to work, in myeloma for example, with a group that knows what driver mutations are relevant most frequently in that disease and what are the true clonal founder mutations; the same across cancer types. And ultimately, do the experiment yourself: take two cell lines and put them together, or take two samples from the same patient over time, and look at how those phylogenetic trees change between the two time points. You really need some gold reference standards. It's just very hard to show up and just run these tools on your own data. They really take a lot of care and disease-specific knowledge around how complex that clonal heterogeneity should be. Yeah, it's not sort of a plug-and-chug kind of tool. I've talked exclusively about Illumina sequencing, and I'm going to continue to do that, but I did want to talk about some other technologies, one being PacBio sequencing. This system actually has a huge benefit: instead of little tiny 50 to 150 base pair reads, this can generate reads of 10,000-plus base pairs. So I talked earlier about that challenge of the unreadable portion of the genome. PacBio and nanopore, there are actually several technologies that can essentially allow you to sequence huge chunks of DNA, with the downside that they have a very high artifact rate. So you can see just by eye there's a lot going on in these reads. The mutation of interest is right here. Here's the Illumina sequencing: nice and clean, very, very low error rate, very obvious where the mutations are, both point mutations and indels. But don't discount these technologies. Obviously real variants are coming through, and essentially bioinformatics techniques are needed to normalize out the background sequencer error from other next-generation sequencing technologies, and they're very powerful, certainly, as a cross-validation. These
long-read technologies are especially powerful, certainly, for pathogen sequencing, complete bacterial genomes. The challenge with clinical samples is, you saw earlier on that slide, all the DNA comes pre-fragmented, so it's really impossible; you don't even have a fragment long enough to sequence tens of thousands of base pairs. So it's been really very challenging to apply these long-read sequencers to clinical samples. This is really where you want to think about where your samples are coming from and what you know about them: is the DNA even long enough for the type of downstream experiment that you're going to do? I'm going to continue talking a little bit about depth as well, so just to sort of double down on that: mutation detection sensitivity is dictated by your depth or your coverage, and it's limited by your sequencer and your background sequencer error. So this is essentially a plot from a paper we published a couple of years ago, essentially looking at the allele fraction detectable against your total overall coverage. So for exome sequencing in TCGA, we usually sequence to 100x coverage. That gives pretty good detection, with a certain probability, down to around 10%, maybe a little bit lower, to be very confident. In the clinical lab we have a 555-gene panel that's sequenced to 1,000x coverage; that gives you pretty robust mutation calling down to 1%. But certainly if you want to get to highly complex mixtures like circulating tumor DNA, you're really looking at 10,000 to 25,000x coverage, and this is sort of the rationale, at least for this study, to sequence to a very, very high depth, because we want to detect these very, very low allele fraction variants, essentially diluted by normal DNA in the bloodstream. So essentially the concept is exactly the same. Cell-free DNA is very attractive: essentially, as tumor cells die, they shed their DNA into the bloodstream, and we can use all the concepts I've just talked about to find those very, very rare mutations in this
basically, in DNA extracted from the liquid portion of blood. So I'm not going to talk very much about circulating tumor cells, but specifically the application here of DNA that's been shed and diluted into that liquid portion. We showed in multiple myeloma this actually worked very well, in fact almost better than bone marrow aspirates, because we're able to find not just mutations from the primary tumor itself but also additional mutations from other masses present within that patient. The other benefit of blood, of course: you can take it much more frequently than you can a solid tumor biopsy. This is what the data look like. There are actually two big methods out there. One is PCR based, so you put a primer on either side of your exon of interest, in this case, and sequence them. The method we used, and I talked about on the previous slide, was based on this CAPP-Seq paper; Scott Bratman is actually now here in Toronto. A hybrid capture panel basically pulls down DNA sitting on your exons. The downside with hybrid capture is you actually also pull down all the DNA sequences immediately adjacent, so while most of your coverage is right on top of your exon of interest, you do have this sort of off-target problem. Reads are dollars, so this is slightly more expensive to get coverage specifically on your target of interest, but essentially the data look the same. You're looking for mutations, you're looking for very low allele fraction reads amongst a whole pile of normal reads. Scott, while he was a postdoc at Stanford, showed very nice correlation between the size of the tumor and the amount of cell-free DNA circulating in blood. It was still very hard to find very early-stage disease at that point, but certainly as tumor burden grew, you had a good ability to find tumor-derived DNA. And as tumors grow and shrink during a clinical trial, there's very tight correlation between cell-free DNA and tumor volume measured radiographically. So as tumors shrink, you have less cell-free DNA, and so on and so forth. Really the
ability to do some pretty innovative monitoring and tracking of a patient's tumor burden, as cells turning over dump their DNA into blood and we can track them. You don't. So this is the other big challenge, at least with mutations. In this case you have a single mass that's coming and going. We certainly had that problem in multiple myeloma: we had one aspirate, and we saw the same mutation over and over in the blood. We could not find it in that iliac crest aspirate, but there must have been another mass somewhere else in the body that was putting that DNA into blood. This is starting to change a little bit with a new antibody-based method. So Daniel De Carvalho's lab has invented an antibody-based method against 5-methylcytosine; it pulls down all methylated DNA from blood. Very powerful to look at cell of origin, but you can't really tell the difference between a metastatic site and a primary site, because it's the same cell of origin, just moved to a different part of the body. Just a few brief comments on somatic copy number alterations; we've talked about this a lot. This is actually a relatively old technology called SKY, where essentially you have a probe against each different region of the genome and it gives you a different color. But you can see really how complicated some of these cancer-derived chromosomes can be as they become rearranged, or have increased copy, or essentially stick themselves together in different combinations. Essentially the whole way to find copy number variation is by coverage, so as you have more reads you can see how much copy number gain there is. And the point of this slide is really showing the same results from old microarray technology, exome sequencing, and whole genome sequencing. There are actually really robust copy number profiles available even from targeted methods. Here's an example from a clinical panel. In this case, not that many exons baited in EGFR, but all of them have extraordinarily high coverage, so even from a targeted panel you
can actually extract relatively robust copy number profiles purely by looking at the number of reads aligning to individual genes. In this case PTEN is totally deleted, even on the small panel. PTEN is much larger than this, but they only baited 5 of the exons: no reads whatsoever from all 5 of them, and you infer that as a copy number change. And on the larger panels, basically you just fill in more dots, so as you move from a targeted panel to exome or whole genome you essentially get more data to support the evidence of a copy number gain or loss or amplification. This essentially looks like this. So in this case, when you have a breakpoint, your read hits the breakpoint, but instead of continuing on like it would in normal DNA it essentially jumps the breakpoint, so you get this sort of read that has a very unusual mapping structure. In this case all the deletions were different, they had different breakpoints in these tumors, but they always took out the exact same exons in the C-terminal domain. So really what we saw by looking at large numbers of tumors is essentially you have all these sort of raggedy breakpoints, but you always took out the same functional units, and that was really a key to discovering the importance of deletions specifically in the C-terminal region of EGFR. This is what real data looks like. So you have reads that very nicely align to the reference, and then they look totally different. The reason they look totally different is because they belong to the partner gene in the fusion. So essentially there are numerous algorithms now that go through and look for these translocation breakpoints. Unfortunately the false positive rate is very high with these algorithms, so there's really an additional need to validate these algorithms but also to essentially cross-validate. So often what we'll do in my lab is we'll run two fusion callers and then look at the overlap between the two methods. These can be extraordinarily complex, so Mike Berger, when he wrote this prostate cancer paper,
essentially on a huge whiteboard mapped out all the ways that these genes all linked to each other. So he was able to show by hand the TMPRSS2-ERG fusion, but also TMPRSS2 to this middle-of-nowhere region that was fused to threat 3, and that was fused back to ERG. There's no bioinformatics algorithm in the world that knew that this could happen and then could walk through all these breakpoints, so at some point it is pen and paper work. Knowing this now, of course, people have written algorithms like this, but cancer does bizarre and crazy things, and there's an obsession in this case to really just go after a pretty simple fusion, which is just this TMPRSS2-ERG fusion. Very important in prostate cancer, but there's this whole constellation of other changes happening around this one oncogenic event. Just like mutations, you can have very quiet rearrangement genomes, extraordinarily rearranged genomes, and sort of everywhere in between. This is a very interesting case from neuroblastoma: nothing going on except for this shattering of chromosome 5. So this is an event called chromothripsis. Essentially chromosome 5 blew itself apart and then was stitched back together, resulting in hyper-rearrangement only of that single chromosome. Very important in prostate cancer; hasn't been seen very much since in pediatric cancer, but if it can happen in one cancer type it can happen in another. I think I have until 10:30, is that right, Michelle?
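The cross-validation idea mentioned above, running two fusion callers and keeping only the overlap, can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual pipeline: the function names `normalize` and `consensus_fusions`, the input format (plain gene-pair tuples), and every gene pair other than TMPRSS2-ERG are hypothetical placeholders. It also assumes that matching on gene names alone, ignoring orientation and exact breakpoints, is good enough for a first pass.

```python
def normalize(call):
    """Return an order-independent key for a (gene_a, gene_b) fusion call.

    Assumption: two callers may report the same fusion with the partners
    in a different order or letter case, so we upper-case and sort them.
    """
    gene_a, gene_b = call
    return tuple(sorted((gene_a.upper(), gene_b.upper())))


def consensus_fusions(calls_a, calls_b):
    """Keep only the fusions reported by both callers."""
    keys_a = {normalize(call) for call in calls_a}
    keys_b = {normalize(call) for call in calls_b}
    return sorted(keys_a & keys_b)


# Hypothetical output from two callers; only TMPRSS2-ERG is a real fusion,
# the GENE* pairs are made-up placeholders.
caller_a = [("TMPRSS2", "ERG"), ("GENE1", "GENE2"), ("GENE3", "GENE4")]
caller_b = [("ERG", "TMPRSS2"), ("GENE5", "GENE6"), ("GENE3", "GENE4")]

consensus = consensus_fusions(caller_a, caller_b)
print(consensus)
```

Sorting the partner names throws away which gene is the 5' partner, so a real comparison would probably also check orientation and breakpoint coordinates within some tolerance. The intersection only reduces false positives; as the whiteboard example shows, complex chained rearrangements can still need manual review.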
Oh good, okay. I thought I was unusually behind. Okay, so that's sort of the DNA space: copy number changes, rearrangements, fusions, things that happen in DNA in general. Okay, so the run-out of the talk is really in sort of the RNA and the functional space. I talked earlier about really the power of RNA to find relationships or linkages between exons. So in this case, this is a published paper basically looking at this gene that's important to resistance to 5-fluorouracil. What they did is they basically had this cell line called MIP101, and then they treated that MIP cell line with 5-fluorouracil, and the question was what happens to this gene at the transcriptional level. They knew from DNA sequencing it wasn't mutated, there wasn't a copy number change, there was no structural alteration, but this gene has really been linked to resistance to 5-fluorouracil. What on earth is going on? So what they did here was they did whole transcriptome sequencing, looked at the reads that mapped to UMPS, and basically counted the number of reads that mapped to each exon but also to each exon-exon boundary. So focus on the blue line first. Essentially, reads align to exon one, and you should have reads on the junction from one to two, like they did. Exon one to three is unusual; you shouldn't be skipping from one to three, and that does not happen in the normal cell line. Normal amount of exon two, the two-three junction is intact, all the pieces make sense in untreated UMPS. Totally different when you treat it with 5-fluorouracil, and basically what happens is it starts to use this unusual exon one to exon three junction, so essentially it's completely skipping exon two here. And you can see essentially the levels downstream on the gene are pretty much the same. Really the only change upon 5-fluorouracil treatment is the use of this exon one to exon three junction and a slight depletion of the use of exon two, like you'd expect, consistent with that skipping event. Totally invisible using DNA
sequencing, something you really could only see looking at the functional readout of the cell using RNA sequencing. The other, more conventional use of RNA sequencing is gene expression profiling. I think there's a whole session on this. Essentially the idea is taking a huge matrix of genes by samples and looking for differences in gene usage between two states. States can be pre-treatment and post-treatment, tumor A, tumor B; however you want to dream up a way to compare samples, you can do that using RNA sequencing. The other challenge is how many genes are you looking at to be different between two states. Is it 100, is it 5,000, is it 2,000? Essentially that's a power question and a sort of statistical design question. I think you're going to be covering that in an additional workshop as well. The other use, at least as I've used RNA sequencing, is trying to get at tissue of origin. So in this case we looked at this very unusual pediatric lung cancer. It almost never happens in kids, and it actually only occurs in women, a very strange feature. So we really want to know where these cells are coming from. Is it truly a lung cancer? Is it coming from somewhere else? We took RNA sequencing data from that tumor and compared it to a whole host of healthy normals. So all the colored data here are individual tissues sequenced by GTEx, the Genotype-Tissue Expression project, and their entire project was just to grind up normal tissues and sequence them. Not for cancer necessarily, but it turned out to be a great reference resource for us. So here are all different parts of the brain from multiple donors; they all clustered together like you'd expect. Same with lung, skin, heart, everything. Heart and just bulk muscle are clustering nearer each other than brain, so all the biology kind of makes sense as well. And then, surprising to us, these lung tumors actually all clustered specifically with muscle, smooth muscle specifically, and it actually turns out that the tumor of
origin for these, or the cell of origin for these, is actually the endometrial lining during development. That explains the female-only source of these tumors: essentially they leave the endometrium, populate the lung, and become this very unusual lung sarcoma. So extraordinarily rare, a totally new hypothesis or explanation for the potential cell of origin, and it kicked off a whole other series of lab experiments to validate that finding. But it really shows the power of using public RNA sequencing data to augment our own research, because really we only had five of our own locally generated data sets there, augmented by probably ten thousand plus normal reference tissues now through GTEx. Yes, it's called GTEx, so the website is gtexportal.org. It's grown much more since I did this project, and you can essentially search for any tissue type. You can also subset by age. It's a very interactive portal, it's very helpful. I talked a bit about non-human sequences earlier. In this case we took all the reads from an RNA sequencing experiment and aligned them against a database of basically all known pathogens curated by NCBI. And in this case, totally unexpected, this is a lung adenocarcinoma project, we were able to reconstruct the entire EBV genome, Epstein-Barr virus. Very unusual in lung cancer, especially for a lung adenocarcinoma diagnosis. That spurred a validation experiment where we did in situ hybridization of EBV transcripts. They were truly tumor specific, so all the normal surrounding cells were negative for EBV. This actually turns out to be this very unusual lymphoepithelioma-like lung cancer. Essentially lung adenocarcinoma was not the correct diagnosis; it was actually this more rare entity. Something that just fell out of the RNA sequencing experiment, not something that's necessarily obvious by eye. I did talk a little bit about RNA sequencing. There are many technologies to do this; you can get a limiting
dilution, or one droplet per cell. All the concepts we've just talked about completely hold at the single cell level. Depth matters. RNA sequencing is basically counting the number of reads aligned to each gene. So really there are many technological ways to get to genomics at single cell resolution, but all the ideas we've just talked about are basically exactly the same. All those concepts, genome doubling, subclones coming and going, all that is now directly measurable with single cell sequencing. This is one technology, 10x Genomics. In this case there's basically a microfluidic chip. You have your cells, and you have beads that contain reagents, and they get encapsulated in an oil droplet. You then burst the cell and you deliver those cell-specific barcodes to the RNA for that individual cell. So that gives you barcoded cDNA, and each of those barcodes is different, so you can actually tell each cell apart from one another. You then burst all the oil droplets and sequence all the molecules in your mixture, but now you have the benefit that you can look at all those molecular barcodes and map the RNA back to the individual cells. You can still analyze this like conventional RNA sequencing data: you can just ignore the barcodes and treat it as a bulk. And then you have the additional ability to go much deeper into individual cells or individual populations. This is what the data looks like. So each dot is an RNA sequencing experiment, so it's a single cell, and essentially, just like I showed on that heat map plot, you can now start to cluster these by transcriptional similarity. And then for the biology, rather than looking at the entire bulk population, you can start to drill into individual populations. So if you circle out this population here, the top most expressed genes are all related to the T cell receptor complex, only expressed by T cells, so essentially we infer that these are the T cell populations. In this case a
neutrophil population, macrophage population; you can essentially march through each of these populations using marker genes from previously published RNA-seq data sets, just like I showed with GTEx earlier. But now the challenge with GTEx is it's bulk tissue, and now we have a need for single cell reference sets as well. You can do this in cancer. So this is actually the same data, just colored in a different way. This is colored by its transcriptional identity; this is colored by who donated the data. So in this case this is a patient with multiple myeloma: two huge clusters that are not in the normals. And for the two healthy donors, you can actually see all the single cells that are intermingling, but neither of those healthy donors has either of these two populations seen in the cancer patient. So of course the punchline here is these are actually two very large cancer clones that are present and transcriptionally distinct in the same patient. And the science question here is actually what happens to these populations over time while they're on the clinical trial. The corollary question was not just what happens to the tumor cell populations but also what happens to the healthy normal cells as well. Oh, so t-SNE is essentially a graphical representation of relatedness between gene expression profiles. It's not perfect, but it's sort of a spatial representation. How do I describe this? It's basically showing the relatedness between different gene expression profiles. Because if you think of all the different combinations, there's a huge table of 20,000 genes by however many cells we sequenced, so a heat map is virtually unreadable, because you're looking at all the relationships of all those cells and all those genes. This is basically a method to compress all that data into a 2D representation. And there are actually new methods now that have replaced t-SNE plots. The method we use more now is called UMAP, which is
essentially the distance between two dots is actually a measure of similarity. But essentially it's a data visualization technique for similarity between two cells. I really think we need a whole workshop on single cell. Michelle's around; there's probably like an entire week you could spend on the informatics of single cell analysis. Yes, oh good, Francis is still here. Yeah, I definitely did not do the t-SNE plot justice, but we'll all have to be ultra experts in this very soon. The other sort of call-out to a reference set, in terms of a single cell reference, is the Human Cell Atlas. This is a large international consortium to, just like GTEx, capture all single cell data from around the world and put it for free on the internet without a password. This actually has had some interesting patient consent issues, like how do you consent a patient to put all their single cell data online, but it's also enormously enabling, because now you can do a tumor sequencing experiment and then compare all those cells to a large bank of healthy reference cells as well. This is also a great source of funding, so the founder of Facebook, Mark Zuckerberg, and his wife Priscilla Chan are basically funding pilot projects in unusual cell types. There's a cancer pilot that we're active in, really starting to build a lot of the infrastructure and reference sets needed to do single cell sequencing. And I suspect within five years, maybe even sooner, we'll start to see some of the first clinical single cell readouts, just because you really get this amazing view, like we did here, of two cancer clones directly, and the ability to start mapping these to potential treatments as well. Okay, so I'm getting on to the last 30 minutes. I'll just give a quick overview. Everything we've talked about in cancer genomics so far is influenced by the germline as well. So going back to that same Thompson & Thompson textbook, there's really this idea of needing two hits in a
tumor suppressor. So essentially these are the brakes that stop cells from cycling. There's actually a large portion of cancer patients, upwards of 10% according to some papers, where patients are actually born with loss of one of two copies of a tumor suppressor gene. And what we find when we sequence their tumors is there's almost always a second hit. So roughly half of those tumors have a second hit in DNA. I suspect, although we don't have data yet, that the other half are likely epigenetic modifications or structural changes. But what's been really striking is the manner of that second hit is different every time: sometimes it's a mutation, sometimes it's a deletion, sometimes a rearrangement, sometimes methylation. And there's really a need, especially in the hereditary cancer space, for techniques that look at multiple types of second hits, especially when you know patients are already predisposed by one inherited variant. And here's actually a good example from a pediatric cancer study. In this case here are their normal sequencing reads; this is just from a blood sample. You can see by eye here roughly half the reads have this deletion and the other half are all intact. So their normal cells are normal; they still have one copy, which essentially allows normal cell development. When you sequence the tumor, first of all we saw a huge deletion, all of chromosome 10. So they basically used to have two copies, a mutant and a normal; they deleted the normal copy, and then when you sequence the tumor, every single read, apart from maybe a few from contaminating normal cells, has that mutation. So it's really about being able to compare both the germline state and the tumor state, but also actually using the germline to sort of guide interpretation of the tumor genome. So just to get us into the case report: I've talked a lot about finding stuff, making lists of variants, but the really challenging, certainly least automated part is still that act of genome interpretation. So it's
one thing to call all these parts, but this is still the area where we need certainly better software and better access to reference sets to start helping us interpret specifically what a mutation that we've never seen before may be doing within those cells. And that's really this act of how do you annotate a variant, how do you interpret a variant, and then ultimately how do you report it out to guide patient care. There are many ways to do it. There are whole guidelines, from the American College of Medical Genetics, the College of American Pathologists; there are many guidelines for how to interpret a variant. They work for a good fraction of them, but there's really no one-stop-shop piece of software, as you can tell from this long list, that essentially will bin a mutation as a driver or not a driver. There are a few tools to do it. There's one tool called Oncotator, which will pump out a huge list of all the variants and all the annotations, the things that are known about that variant. We also use Variant Effect Predictor; this is actually the one I use more often now. There's desktop software that basically puts all that data into the context of all the other mutations around it. This is really still a pretty manual process, basically going through, coming up with a hypothesis, and sometimes an additional experiment, around what we think a somatic mutation, copy number change, or fusion is actually doing. Here's a link to one of the guidelines. So here's ACMG; basically their goal is to bin, in that case, a germline variant as benign or pathogenic, and what's the strength of the evidence for pathogenic. So this is a really great paper just to read through how people are thinking about interpretation of a variant, and all the many different aspects of what is known about a variant, to try to give that variant meaning. The other challenge: okay, let's say we've figured out what we think a mutation is doing. Here's a real clinical report, in this case for this mutation. It's not overly convincing,
it's sort of what we often call a variant of unknown significance. It hasn't been seen very often. It's sort of a big piece of text that describes what we think we know about the variant. The downside here is this really does not scale very well. So if you do a whole lot of sequencing and you've got 40 of these things, are you going to write them all up? Probably not. It's also a communication problem. No one, especially when you're running a clinic, wants to just get a book of all your variant interpretations. Really we need more dynamic and much clearer ways to communicate scientific meaning for specific variants that we've uncovered. One effort to do this is an AACR project. Essentially it's encouraging clinical labs from around the world to put their molecular results, but also the variant interpretation and clinical data, all into one place, so we're not interpreting the same variants over and over, but also coming up with practices for how to interpret and share variants. The other tool I'm very keen on is cBioPortal. This is a tool that was built for The Cancer Genome Atlas just to house all the data, but it's actually become very powerful now as we start to interpret individual cases. So in this case this is sort of the dashboard for a single patient from diagnosis: all of their treatments, subsequent tumors, looking at the copy number changes, the mutations, and then all the specific details about the mutations, including the allele frequency over time. It's really putting a huge wealth of data into an interactive website, so you can really start to look at these data and go through that interpretive process for a single case. And that's what we're going to do now: essentially go through how a single case was managed, now almost 10 years ago, published if you want to read it in great detail. Any questions on the background before we go into this case here? Okay, so initial presentation. This is all just a single patient, so everything we've talked about is basically context for
interpreting this single case. So, a 78-year-old man, fit and active, no real reason to have cancer, no family history. It basically started off with throat discomfort. Examination actually found a pretty large mass at the base of the tongue, so this is pretty big for a head and neck cancer. Non-smoker, non-drinker, no real reason for him to have this tumor, but this is sort of how he arrived. Step one is a PET-CT scan and a biopsy. So we're always starting with the tissue first, just to get a diagnosis and some understanding of what we're looking at here. In this case the PET was positive for the tongue mass itself and also positive for the draining lymph nodes as well, so essentially this is already poor prognosis: the tumor has actually already started to metastasize, or at least move to nearby nodes. This is usually a test for the pathologist. This is a very rare salivary gland tumor, so this unusual adenocarcinoma, a very rare tumor, with no obvious clinical course or obvious treatment necessarily for these beyond surgery. So in his case he had surgery, so laser dissection of both the tumor and the draining nodes. The primary, as I said, was this micropapillary adenocarcinoma with micropapillary and mucinous features. 3 of 21 of the nodes: they took out multiple nodes and 3 of them were positive, so this tumor is mobile. In head and neck it's very common to have surgery and radiation, so he got radiation therapy in February. Pretty good quality of life for 4 months, but then, most concerningly for him, since there's already a hint that the tumor is mobile, he returned with numerous small metastases in both lungs. This is very serious. You can't do surgery, you can't pick out each of these nodules individually, so really trying to think of what systemic therapy might be appropriate for him. So the thinking was a clinical trial at that point: metastatic disease, both lungs were involved. In this case they were looking for a clinical trial. This is the age where EGFR mutations had just been described. There are EGFR clinical trials, EGFR inhibitors rather, at
the BC Cancer Agency available. So they asked the question, is EGFR expressed? And it was, kind of. So what we're looking for here are brown spots, so some tumor cells are expressing EGFR. So what treatments are possible? Targeted EGFR therapy is maybe warranted. There wasn't sort of an EGFR mutation service at that time, so really going purely on expression, he was put on an EGFR inhibitor, in this case erlotinib, and it really didn't work at all. So EGFR expression alone was not enough to respond to this. In this case all the lung mets grew while on the drug, the largest lesion really grew, and this is really looking pretty serious. They got him off that drug right away. There's no reason for it not to work, but now they're really thinking palliative care. What next? It's such a rare tumor, there's no obvious molecular test to order for these tumors. So here's sort of his timeline down here, and we're really thinking, erlotinib has failed, what do we do next? So this goes back to that other slide: it's great to have a landscape of tumors, but what are the targets in an individual case? So, having exhausted standard of care, they turned to the GSC for new leads. They had a special REB, and this is really something I do for every single project as people come and bring samples: what have the patients consented to? Can the data be shared? Can they be analyzed in a specific way? In this case they had an N-of-1 REB meeting for the whole genome and RNA sequencing, with the goal to nominate potential treatments. So it's not a diagnostic test, but they really made sure the REB and the patient were on board, and the patient consented to full genomic sequencing. It was really made clear to him that it's not a diagnostic test; essentially this is an N-of-1 research project. I mean, Vancouver has now done nearly a thousand cases, but this is essentially POG zero, so the very first case that they ever did. I don't know if it's scalable in its current form. I think that interpretive process really needs to
be quasi-automated, or it just needs a big skilled team to work in a coordinated way around it. So I think actually it is, it's just got to be at big, big scale for it to be real. I think not for every patient, especially in today's landscape, but I do think now there are many molecular profiling programs like this that sequence thousands of patients, and now there are just blanket protocols to do this. That was almost 10 years ago, and the whole culture and landscape has really changed. So really a central REB to do large high-throughput profiling and interpretation, it's being done already. I think it's still a challenge for communicating to the patient how it's shared, so it's opened up all these new interesting nuances, not so much for the individual patient but for data sharing more broadly. Yeah, that would be great. Okay, so erlotinib has failed; let's try comprehensive profiling. So in this case a fresh frozen biopsy was taken specifically for RNA sequencing. Here's the CT scan: pretty big mass. Pathology review again, very high tumor content. The pathologist reviewed all the cells: 80% tumor content, great. So we know the gene expression profiles will be highly enriched for cancer cells, good news. And we also had this formalin-fixed paraffin-embedded DNA for the whole genome sequencing work. So here's what the whole genome sequencing analysis found. P53, a very famous tumor suppressor, mutated in half of all tumors, so that's not surprising. RB loss-of-function mutation, very interesting in this patient's case, because loss of RB results in EGFR inhibitor resistance. So had we known this mutation was there previously, he would have never gone on that erlotinib trial, and this is now actually an excluder for patients to go on those trials. It's really the need to get profiling early. And of course, like you're always going to get with broad profiling, mutations in two genes I've never seen before. I don't know if I've seen them since. Unknown significance. Very
helpful as you start to pile up thousands of these patients, because you can ask the question, do we see these genes mutated at an appreciable frequency or not? Not overly relevant to drugs, at least at that point in time. And essentially all four of these mutations were confirmed by a different sequencing method. I showed PacBio earlier; we used Sanger sequencing for this study, essentially showing that these mutations were truly there in the DNA. So essentially not a lot there therapeutically beyond explaining the treatment failure. So it's a mutation that introduces an artificial stop codon, so essentially the transcript tells the protein basically not to continue beyond that site. So it stops at 234; it should be much, much longer for RB. And we infer that as loss of function, because you've lost half or some large chunk of the protein. Just like before, here's a Circos plot: here are the chromosomes around the outside, and in this case this little track here colors copy number changes, so greens are losses, reds are gains. So you can see EGFR amplification, perhaps linked to the overexpression they saw by IHC. So it's overexpressed, but it also had this undermining RB mutation. Loss of PTEN is also associated with resistance to EGFR inhibitors, so you have this double whammy: an EGFR inhibitor was not ever going to work. Very interesting amplification of RET, actually highly, highly focal, very hard to see here. But focal amplifications are very interesting because the thinking is they're focused on relatively few genes. RET is also interesting as a druggable target. MAPK, so that pathway is likely overexpressed. Loss of P53, so in this case there's a mutation and a deletion of P53. Helpful for understanding biology, but there's not really a direct clinical path, at least at that point, for P53-mutant tumors. There's one now. Just like for mutations, they used FISH to confirm those copy number changes, so loss of PTEN, and that focal RET amplification: many, many copies of
RET in these cells, really showing the genomics being cross-validated by a cell-based method. And then the gene expression: how do we compare RNA sequencing? Especially at that time, we didn't have GTEx, and there weren't a lot of reference sets. So essentially at the GSC they just took a mixed bag of 50 tumors, and they actually happened to have a blood sample, and they just said, what is unusual in this RNA sequencing profile? Are there genes that are abnormally expressed to an extraordinarily high level that could be acted upon or could be inhibited therapeutically? So in this case they saw a deletion of SMAD4, totally consistent with underexpression specifically of that gene; made sense. RET had that focal amplification, and it was enormously high: across that entire compendium of tumors it was 34 times higher than the next highest specimen, and it was the most highly expressed oncogene. Very interesting in this case, because this is something that could potentially be acted upon. And PTEN was deleted, and it was one of the lowest across the entire compendium. So it's really very powerful having that reference set and mapping the individual patient's profile to that reference. And the GSC's now become famous for communicating biology using these pathway diagrams. So essentially here's the RET pathway annotated with copy number changes and gene expression changes: RET both amplified and overexpressed, resulting in overexpression of the downstream pathway, as you'd expect. And then not a lot of action necessarily here in this parallel pathway beyond loss and downregulation of PTEN. So PTEN is a negative regulator of this pathway, so clearly this pathway is very important to this specific tumor. It's really only evident because you've measured mutation, copy number, and expression for every member in this pathway, so you're not just dependent on a single gene test. You're able to see the whole constellation of things that are essentially converging on this one pathway. So here's the list. So
here's sort of the biology over here: upregulation of MAPK, RET is important, so on and so forth. And what drugs do these map to? In this case, they actually came up with four potential drugs. It's a total hypothesis, the hypothesis being that MAPK and RET are the drivers of this tumor. To test it, there are basically four potential treatments, and really it was up to the oncologist. Here it's really using genomics as a tool to help guide clinical care, and ultimately it was up to her to choose.
So, genome sequencing and RNA sequencing still don't happen overnight. This took a whole month. So, here's basically the day the biopsy was taken: two masses, pretty large. It took a whole month, and the tumor continued to grow, from 22 to 27 and 24 to 28. From that list, the oncologist essentially chose sunitinib, mostly because the worst side effect reported for sunitinib at the time was just a skin rash, and that was felt to be tolerable in the context of a research protocol. So, it started on October 29th, and actually it worked dramatically well. Four weeks later, significant shrinkage of this tumor, actually of both lesions: 27 to 22 and 28 down to 21. And it actually stabilized for seven months. There was a relatively robust response. There were some side effects, so they reduced the dose and continued to have stabilization of disease. Most importantly, no new nodules, and the nodules weren't growing, at least during those seven months.
But as I foreshadowed on one of the first slides of the entire talk, monotherapy works for only a short period of time; there's really the likelihood of resistance. Exactly the same story here. You sort of get the stabilization, you're beating back the tumor, but then these mets begin to grow again, either due to subclonal selection or some shift in the existing clones. There are many possible molecular ways that this could have happened. But since they did have other options, they went on to switch to the combination of two of the other drugs that
were on that list, and again the disease stabilized. Hitting those pathways from a different angle resulted in disease stabilization, and again another continuation for three months. Like, you know the punchline: really, cancers are dynamic and changing over time. The exact same story here: recurrent disease again after those seven months and, most concerningly, recurrent disease at the base of the tongue. So, essentially the primary has come back. Potentially either metastatic disease had traveled back to that permissive microenvironmental niche or, potentially more likely, there was just residual disease remaining after surgery. Essentially that tumor has now grown at the tongue, there's a new met, so a skin nodule is now growing, and there are progressive and new metastases in the lung. So, really, by forcing these tumors through the bottleneck, they've essentially become resistant to inhibition of the RET pathway. Quality of life is getting worse. Of course, the question is what changed and, again, what can you do about it?
So, there's a large neck mass. This is now very easily accessible disease, so you can essentially biopsy that site. You can see many more mutations than we saw the first time. So, we saw all the mutations we saw previously, plus these additional mutations, none of which is really a hot-shot clinical trial target. There's not really a kinase that you could go after. We'd never seen these mutations before, and none of these are evident in the pre-treatment biopsy. The pre-treatment biopsy was sequenced to gigantic depth, but there's really not even one read supporting the presence of these mutations. Doing the exact same analysis I showed earlier, you can see there's a lot more red on this slide. So, essentially the tumor's way to overcome this treatment was just to amp up RET signaling and to turn up this parallel pathway as well. So, PTEN is still lost, but you can see overexpression of additional pathway members, but also this parallel pathway responding, due to increased copy
number state but also increased expression level as well. And this is really where it gets extraordinarily experimental: do you now cocktail drugs against all these pathway members? The disease is very advanced and, of course, there's a major risk of adverse side effects. You can't just go grabbing three or four drugs and deliver them to patients. Could we have found these resistance mechanisms pre-treatment? Perhaps in today's environment, could cell-free DNA have been used to detect resistance earlier than it was? And could we have actually modeled and monitored these over time? So, especially with additional longitudinal data, there's a real potential now to start learning from metastatic disease and then go back to the primaries and look for mechanisms of resistance early on. And unfortunately he actually passed away soon after that; the quality of life just continued to decline. Here's sort of his entire timeline. So, essentially this bought him just over an additional year of good quality of life. But really, the problem that still stands today is that it's very challenging to interpret, deliver and act upon a clinical genome and RNA-seq.
I was wondering if there's really something about the model that we need in some kind of...
Yeah, I mean, this is certainly one path for the single-cell approach as well: doing some of that pre- and post-treatment modeling, both of the microenvironment and of the tumor clones. I think this is sort of the next big data boom, especially as comprehensive data become available on more and more tumors. But yeah, when I say modeling, there are many ways: there are patient-derived xenografts, there are cell lines, there are brain tumor stem cells. I think certainly the infrastructure is much better now at getting serial samples like this, and this is now where those experiments can be done with real patient-derived material.
Yeah, I mean, there's been a lot of that work. I haven't seen a prospective clinical trial that's shown this beautiful, can't-miss kind of test, but it's certainly an area of intense research, for sure. And that's pretty much the end of my talk.
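As a rough illustration of the reference-set comparison from the case report (asking how one patient's expression of a gene ranks against a compendium of other tumors), here is a minimal Python sketch. The gene names come from the talk, but the expression values, the function name `compare_to_reference` and the choice of summaries (fold over the highest reference specimen, fraction of specimens below the patient) are assumptions made for illustration, not the GSC's actual pipeline.

```python
from bisect import bisect_left


def compare_to_reference(patient_value, reference_values):
    """Summarize one patient's expression of a gene against a reference set.

    Returns a tuple of:
      - fold difference relative to the highest reference specimen
      - fraction of reference specimens with strictly lower expression
    """
    if not reference_values:
        raise ValueError("reference set is empty")
    ordered = sorted(reference_values)
    highest = ordered[-1]
    # Guard against division by zero when the gene is silent in every reference.
    fold_vs_highest = patient_value / highest if highest > 0 else float("inf")
    # bisect_left counts the reference values strictly below the patient's value.
    fraction_below = bisect_left(ordered, patient_value) / len(ordered)
    return fold_vs_highest, fraction_below


# Made-up expression values in arbitrary units, for illustration only.
reference = {
    "RET": [0.5, 1.0, 2.0, 4.0],
    "PTEN": [8.0, 10.0, 12.0, 16.0],
}
patient = {"RET": 136.0, "PTEN": 2.0}

for gene, value in patient.items():
    fold, below = compare_to_reference(value, reference[gene])
    print(f"{gene}: {fold:.2f}x the highest reference, above {below:.0%} of specimens")
```

A real analysis would usually normalize and log-transform the expression values first, and would use a much larger and better matched reference set; the sketch skips both steps to keep the ranking idea visible.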