So good morning, everyone. Thank you all for coming out this morning to what is the final lecture in our series, Current Topics in Genome Analysis, which is organized by Dr. Andy Baxevanis and Dr. Tyra Wolfsberg. It's a tremendous honor for us to have as our final speaker today Dr. Elaine Mardis, who comes to us from Washington University in St. Louis, where she is currently the Robert E. and Louise F. Dunn Distinguished Professor of Medicine, Co-director of the Genome Institute, and Director of Technology Development at the Genome Institute. Dr. Mardis earned her bachelor's degree in zoology and her PhD in chemistry and biochemistry, both from the University of Oklahoma. In her early career, she worked as a senior research scientist at Bio-Rad Laboratories in California before joining WashU in 1993. Both the breadth and the depth of Dr. Mardis' accomplishments are truly outstanding, and she should really serve as an inspiration to all of us. As Director of Technology Development at the Genome Institute at WashU, Dr. Mardis was a key player and a thought leader in creating the sequencing methods and automation pipelines that really operationalized the Human Genome Project. She then went on to orchestrate the institute's efforts to explore next-generation and third-generation sequencing technologies and translate them into production sequencing. Dr. Mardis, as many of you know, has been a major leader in several large federally funded research initiatives, including The Cancer Genome Atlas project, the Human Microbiome Project, and the 1000 Genomes Project. She's also been a driving force in sequencing the genomes of the mouse, chicken, platypus, rhesus macaque, orangutan, and zebra finch. In recognition of her seminal contributions to science, Dr. Mardis has received numerous research awards. In 2011, she was named a Distinguished Alumna of the University of Oklahoma College of Arts and Sciences.
For her seminal contributions to cancer research, Dr. Mardis received the 2010 Scripps Translational Research Award. And just earlier this year, in Thomson Reuters' report on the world's most influential scientific minds, Dr. Mardis was named as one of the most highly cited, or what they refer to as the hottest, researchers in the world, which they describe as the thought leaders of today, individuals whose research is blazing new frontiers and shaping tomorrow's world. And so we're incredibly privileged to have Dr. Mardis here with us this morning. Please join me in giving her a very warm welcome. Thanks so much, Stephanie, for the kind introduction, and thanks to everyone for coming. I'm honored to be here again and to be asked to speak about next-generation sequencing technologies, which I'll do for the majority of the time. And then I thought it would be of particular interest to take a deep dive right at the end of the talk on how we're applying these technologies to the pursuit of cancer genomics translation, so really beginning to move away from discovery into making an impact on patients' lives today. I have no relevant financial relationships with commercial interests to disclose, just to get the perfunctory announcements out of the way and move on to the hot topic of next-generation and third-generation sequencing. So what I'll do today is take you through the basics of next-gen, followed by the basics of third-gen sequencing, and then, as I said, we'll take a deep dive on cancer genomics. I really want to start at the common core of next-generation DNA sequencing instrumentation, because there are a lot of common aspects to how this all shakes out in the laboratory setting in terms of getting DNA prepared for sequencing and then generating the sequence data itself.
So all of the next-generation platforms that I'll talk about today require a library construction event at the very beginning of the process that is fairly simple compared to old-style Sanger sequencing. This is really characterized by just some simple molecular biology steps that involve amplification or ligation with custom linkers or adapters. So synthetic DNAs that correspond to the platform on which the sequencing will take place are the first step in making a true library for next-generation sequencing, and I'll show the details of this in subsequent slides to really illustrate the point. The second step that follows library construction is a true amplification process that occurs on a solid surface. This surface can be either a bead or a flat flow cell surface, but in every case, regardless of the type of surface, the fundamental principle is that those same synthetic DNAs that went into constructing the library are also covalently linked to the surface on which the amplification will take place. So one question you might be asking yourself is, why do we need amplification on the surface? And the easy answer is that we need plenty of signal to get an accurate read on the DNA sequencing to follow. Amplification is a quick and easy way to represent the same sequence multiple times, so that as the sequencing takes place on each individual DNA in that library, you generate plenty of signal and you get an accurate DNA sequence that comes out the other end of the sequencing process, if you will. So that's the reason for amplification, and we'll talk more about that as well in subsequent slides. The next step is really the one that differentiates next-gen from old-style Sanger sequencing, where you truly had a decoupling: first the library was constructed and then sequenced, and the sequencing reactions were subsequently separated and analyzed.
So there were two distinct steps in Sanger sequencing: the sequencing itself, the molecular biology, followed by the readout of the data. In next-gen sequencing, everything happens in a step-by-step manner, so it's a truly integrated data production and data detection that all happens in a stepwise fashion. I'll describe this for the different sequencing platforms, but they basically all use the same premise, which is that the nucleotide base incorporation on each amplified library fragment is determined, and then subsequent steps occur to determine each sequential base as you go along. And really, this differentiating factor, that you sequence and detect in lockstep on a next-gen sequencing platform, is conducive to the other description for next-generation sequencing, which is massively parallel. What this means is that since you're coupling together, or integrating, the data production and the data detection, you can actually do this times hundreds of thousands to hundreds of millions of reactions all at the same time, which basically reduces down to the XY coordinate that represents each particular library fragment, times all of the library fragments that are being sequenced together. So this is why people commonly refer to, and I actually prefer, the term massively parallel sequencing, because it really is an accurate reflection of what's going on inside the sequencer. One other aspect of next-gen sequencing comes into play when we want to quantitate DNA: copy number, for example, can be very accurately quantitated by sequencing DNA, especially from whole-genome data. RNA-seq, for example, where the RNA is first converted to a cDNA, then turned into a library, with all these steps following, can also be very accurately quantitated with respect to individual genes and the level to which they're expressed, mainly because this is a truly digital read type.
The digital nature of the data comes from the fact that each amplified fragment originally was one fragment in that library, and so you have this one-to-one correspondence. So if I have two copies of a particular part of a chromosome, let's say the HER2 locus, I get the equivalent read depth when I sequence that genome for diploidy, but if, in the case of a HER2-amplified breast cancer, I have three, four, or five copies, I actually have a digital equivalent of that depth at the HER2 locus, and I can quantitate the extent to which the copy number is amplified. So that's just an example of that. Now, the last part I'll get into is data analysis, and that's really the fact that most next-gen sequencing devices provide much shorter read lengths than those Sanger capillary reads of old, which were in the neighborhood of 600 to 800 bases for Sanger, whereas most next-gen boxes deliver somewhere in the neighborhood of 100 to 400 base pairs of data. So this really confounds analysis by quite a bit. The data sets are quite large, and we're in a position of needing to map those back to a reference genome for the purposes of interpretation, rather than assembling them, which is what we used to do with Sanger data, but more on that in a minute. So let's look now at the detailed library construction steps. They're shown up here, and I've already walked you through them, but what I want to point out in this particular slide is that in many places I've denoted that there are PCR steps occurring at each of these particular points in the process. So after we start with high molecular weight genomic DNA that gets sheared by sound waves down to, let's say, 200 to 500 base pair pieces, the ends are polished using some molecular biology, and this is the first step that I just talked about, where you're ligating on these synthetic DNA adapters and then amplifying the library fragments by just a bit with a few PCR cycles.
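The digital copy-number logic just described can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not a production caller: real pipelines correct for GC bias, mappability, and tumor purity, and the depth numbers below are invented.

```python
# Toy illustration of "digital" copy-number estimation from read depth.
# Assumption: the genome-wide average depth represents the diploid
# (2-copy) baseline; real callers also correct for GC bias,
# mappability, and tumor purity.

def copy_number(locus_depth, genome_wide_depth, normal_ploidy=2):
    """Estimate copies at a locus from its depth relative to the
    genome-wide average."""
    return normal_ploidy * locus_depth / genome_wide_depth

# A normal diploid locus: depth matches the genome-wide average.
print(copy_number(30.0, 30.0))  # -> 2.0

# A HER2-amplified tumor locus at 2.5x the average depth: ~5 copies.
print(copy_number(75.0, 30.0))  # -> 5.0
```

The one-to-one correspondence between original library fragments and reads is what makes this simple ratio meaningful.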
There's also a size fractionation that takes place so that we can get very precise size fractions. This is often important in whole-genome studies where we also want to detect structural variation, like translocations or inversions of chromosomes, where you really need very precise size fractions, and this also requires some amplification just to bring up the amount of DNA after you've isolated the fraction or fractions of interest. We quantitate these libraries very precisely before they go on to the amplification process, and that's mainly to get the right amount of sequence data coming out of the amplification process itself. Now, amplification is shown simply here, and I'll have a more detailed figure in a minute, but this is also a form of PCR. So the amplification is enzymatic and is the source of some biases, as I'll describe on the next slide. The last bit then shows the sequencing approach, where in this particular example, which is akin to what's done on the Illumina sequencer, you release one end by chemical breakage of the covalent bond and denature away the companion strands. So you have single strands that then get primed with the first sequencing primer. You can see it here, and it's sequencing down towards the surface of the chip. And then, in a second step, you can regenerate these clusters by just another amplification process, release the other end through a different chemistry, prime with the second primer, and sequence. So this is now delivering what we typically refer to as paired-end reads. Namely, you've got a fragment of about 500 base pairs, you're generating about 100 or so bases from each end, and now you can move forward to accurately placing that onto the reference genome of the organism you're sequencing. So just to follow on a little bit, you may be wondering why I noted all those PCR steps. I want to talk about that just for a minute.
So while PCR, polymerase chain reaction if you're not familiar with the acronym, is an effective vehicle for amplifying DNA, there are lots of problems, also listed here, that creep in as a consequence of using PCR enzymatic amplification. For example, you often get preferential amplification, often referred to as jackpotting. This means some of the library fragments preferentially amplify through the PCR, and they'll be overrepresented when you do the alignment back to the genome. This turns out to be fairly easy to find, if you will, using a bioinformatic filter that runs after the alignment occurs. These are typically referred to as duplicate reads, because that's really and truly what they are: they have the exact same start and stop alignments, and there are algorithms that can go through and effectively deduplicate, or remove all but one representative of, that sequence from the library of sequence fragments. This is particularly a challenge, as we'll talk about at the end of the talk, for low-input DNA, which is common in a clinical setting because you have very, very little tissue from which to derive DNA and then do your sequencing. What that means is that you have a lack of complexity in the DNA molecules that are represented, just because there are very few, and this absolutely favors this jackpotting event. So when we sequence DNA from clinical samples, we're often very concerned about duplicate reads, and we try to minimize PCR as much as possible. The other problem with PCR is that you can get false positive artifacts. If these happen early in the PCR cycles that I showed in the library construction phase, that can be a problem, because once an artifact is in the population of fragments, it amplifies over and over again and begins to look real, as opposed to being the false positive that it is.
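The deduplication idea, removing all but one read per identical start/stop alignment, can be sketched as below. This is a minimal illustration with made-up read names and coordinates; real tools such as Picard MarkDuplicates also handle read pairs, orientation, and base qualities.

```python
# Minimal sketch of PCR-duplicate removal: reads whose alignments share
# the same (chromosome, start, stop, strand) are collapsed to a single
# representative. Read names and coordinates below are made up.

def deduplicate(alignments):
    seen = set()
    unique = []
    for read_name, chrom, start, stop, strand in alignments:
        key = (chrom, start, stop, strand)
        if key not in seen:  # keep only the first representative
            seen.add(key)
            unique.append(read_name)
    return unique

reads = [
    ("frag1", "chr17", 41196312, 41196412, "+"),
    ("frag2", "chr17", 41196312, 41196412, "+"),  # jackpotted duplicate
    ("frag3", "chr17", 41196500, 41196600, "-"),
]
print(deduplicate(reads))  # -> ['frag1', 'frag3']
```

With low-input clinical DNA, a large fraction of `reads` would collapse this way, which is exactly the loss of library complexity described above.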
If it occurs in later cycles, then typically it's drowned out by the other, correctly copied fragments, and it's not a problem. And then cluster formation, as I mentioned already, is a type of PCR, often referred to as bridge amplification, but it does introduce biases, in that it amplifies fragments with high or low G+C content more poorly than fragments with a uniform distribution of A, C, G, and T. Reduced coverage at these loci is often the result. This was a problem we pointed out in one of the first whole-genome sequencing papers, where we resequenced the C. elegans genome way back when; I think it was published in early 2008. It's something that's improved a bit over time, but it's still a problem in terms of bias on the Illumina sequencer. One other brief word here about a sub-genome approach. So early on in next-gen sequencing, we could really only do two things. One was to amplify a bunch of PCR products, combine them together, and sequence them, or you could go to the other extreme, which would be whole-genome sequencing. So we really needed a way to partition the genome so that we could sequence less and focus on the parts of the genome, the human genome in particular, that we understood the best. That would be the exome. If you're not familiar with that term, you could define the exome as the exons of every known coding gene in the genome. This is about 1.5% of the three billion base pairs of the human genome. Round about the end of 2008 and early 2009, several methods were introduced that showed us the way to pull out the exome selectively from whole-genome libraries, and that's through a process described here called hybrid capture. Hybrid capture is a very straightforward thing to do. You essentially design synthetic probes that correspond to all of the exons that you're interested in. It could be the whole exome, or it could be a subset of the exome, for example, all kinases.
That's commonly referred to as targeted capture. In hybrid capture, when you design the probes, these probes actually have biotin moieties on them. These are the little blue dots shown in this figure over here to your right. By combining the biotinylated probes with an already prepared whole-genome library and then hybridizing under specific conditions over about a three-day time period, you can get hybrids to form between the whole-genome library fragments and the specific probes when those library fragments contain part of the exon that you're interested in capturing. A subsequent step then takes advantage of the biotinylated probes by combining them with streptavidin magnetic beads. You apply a magnetic force, illustrated by the horseshoe magnet here; it doesn't look like that in real life. And you pull down these hybrid fragments selectively, thereby washing away the remainder of the genome that you don't care to sequence. These fragments can then be released quite readily, just by a denaturation reaction, off of the magnetic beads and the probes that are still attached to them by virtue of the strength of the streptavidin-biotin bond. So you can go directly to sequencing at this point, because, as you can see by the little red tips here on the DNA fragments, these were already a sequence-ready library. It's just now been reduced in complexity by that hybrid capture step. We often use both exome sequencing and custom capture reagents, these targeted capture approaches, in our work to subset the human genome and other genomes that we're interested in. So one question you might be asking is, is there a lower limit? How low can you go below the exome and still get a reasonable yield, because you're really now beginning to reduce down below that one and a half percent? And the answer is, there is a price to pay the lower you go.
That price is typically referred to as off-target effects, which means that you start to get more and more of the sequences that you're not interested in, because you're trying to subset the genome down really, really low. The lower threshold for targeted capture is probably somewhere in the neighborhood of about 200 kb, below which you're really going to pay a price in terms of off-target effects, therefore spending a lot of your sequencing dollars on parts of the genome that you don't care about, just by virtue of spurious hybridization. So one way to get around this, which has been used for really small gene sets, is multiplex PCR. This is again just getting a bunch of primers that amplify out the regions of the genome you care most about and that also behave well together, if you will, in terms of similar Tms for hybridization. Commonly, you will subset multiplex PCR primer sets according to the GC content of the regions that you're after, so that all the high-GC regions get amplified under specific conditions together, et cetera. So there is a little bit of optimization required for this type of approach, but there are now commercial multiplex PCR sets available that can help you avoid that pain and suffering and get straight to generating data. And this is just the idea behind multiplex PCR: you choose your genes of interest, and they go into a tube with a small amount of DNA. So clinically, this is a very attractive approach, because you can use about five nanograms or even less of DNA, amplify out the regions that you want, create the library, as I've said earlier, by a specific ligation, and off you go to sequencing. So let's talk a little bit now about the specifics of massively parallel sequencing. I'll first talk about Illumina and then follow up with the second platform, the Ion Torrent.
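The GC-content grouping of multiplex PCR primers described above can be sketched as follows. The primer sequences and the 40%/60% pool thresholds are invented for illustration; real panel design also balances melting temperature, primer-dimer potential, and amplicon length.

```python
# Sketch of pooling multiplex PCR primers by GC content so that primers
# with similar melting behavior amplify together. Sequences and the
# 40%/60% thresholds are illustrative assumptions.

def gc_fraction(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def pool_by_gc(primers, low=0.40, high=0.60):
    """Partition primers into low-, mid-, and high-GC pools."""
    pools = {"low": [], "mid": [], "high": []}
    for name, seq in primers.items():
        gc = gc_fraction(seq)
        if gc < low:
            pools["low"].append(name)
        elif gc <= high:
            pools["mid"].append(name)
        else:
            pools["high"].append(name)
    return pools

primers = {
    "exon1_F": "ATATTTAGCATTTAAGC",  # AT-rich
    "exon2_F": "GACTGACTGACTGACT",   # balanced
    "exon3_F": "GCGGCCGCGGGCCGC",    # GC-rich
}
print(pool_by_gc(primers))
# -> {'low': ['exon1_F'], 'mid': ['exon2_F'], 'high': ['exon3_F']}
```

Each pool can then be cycled under its own annealing conditions, which is the optimization the talk alludes to.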
So this is now a more illustrative diagram, if you will, straight off of the Illumina website, of how the cluster amplification process occurs, and I won't dwell on it other than to say that this is how you could imagine the cluster looks after the amplification cycles are all done and before one end is released for hybridization to the sequencing primer. During the sequencing process, a single cluster might look like this as it's being scanned by the optics of the instrument, but in reality what you get is a very closely packed, almost star-field, image of clusters. And really, the key to the increasing capacity on Illumina sequencers has been two things. One is the ability to group those clusters tighter and tighter together, yet still get single-cluster resolution at the point of deciding which sequence is coming from which cluster. And secondly, if you think about the flow cell that you're putting these DNAs onto and amplifying, it's a three-dimensional flow cell. So there's a top surface and a bottom surface, both of which are decorated with the oligos and both of which can have clusters amplified on them. So one of the tricks in the Illumina sequencers is, when you're doing the scanning cycle, which I'll talk about in a minute, to scan the bottom surface of the flow cell, shift the focus of the optics just a little, and then scan the top surface of the flow cell. You essentially double your capacity by doing that simple, probably not simple, but that simple refocusing of the optics itself. So how does the sequencing actually work? It's really fairly straightforward. It's shown here, where these individual circles are the different labeled nucleotides that are supplied by the fluidics of the instrument into the flow cell. So this is when the sequencing starts. You've got your sequencing primer shown here in purple, and the very first nucleotide that's been incorporated is this T against the A in the template.
And it's really a series, as I said earlier, of stepwise events: incorporate the nucleotide, detect it with the optics of the sequencing instrument, and then go through a series of steps to essentially regenerate the three-prime end. So you can see here that the nucleotides supplied have this three-prime chemical block in place, and that prevents a second nucleotide from being incorporated after the first one, until you prepare the strand by going through this de-blocking step. The other step that's very necessary is cleavage of the fluor. So you can see here that the sunburst in purple is a fluor, and there's a cleavage site here that in this subsequent step will remove the fluorescent groups. They get washed away by the fluidics, and of course the reason for that is that you don't want fluors hanging around from the previous incorporation step, because they'll interfere with the fluorescence wavelength being detected in the subsequent steps. So it's really a series of steps where the nucleotide that's incorporated is excited by the optics of the instrument, the emission wavelength is recorded, and that's specific for A, C, G, or T. And on you go with the rest of this. Now, one thing you might be wondering about is, why is this sequence finite? Why do I have to stop at 100 or 150 bases? If you're wondering that, it's a really good question, and the answer is both simple and complex. The simple answer is that it's all about signal to noise. Well, where is the source of noise coming from? Because we've got all these hundreds of thousands of fragments, and they're all reporting the fluor that they incorporated. It's not just this one; this is obviously oversimplified. So we've got lots of signal; where is the noise coming from? Well, the noise comes from a phrase that you should always remember, which is: chemistry is never 100%. So let's talk about that for just a second. Chemistry is never 100%.
So these nucleotides that get added in should look like this, but a small proportion of them might not. So where can things go wrong? Well, one thing that can go wrong is that you don't actually have a blocking group here on the three-prime end, because chemistry is never 100%. In those cases, when that nucleotide gets incorporated into this fragment, another nucleotide can come right in, because the polymerase is very good at its job. Now, chances are that that next nucleotide will have the blocking group, and so things stop, but that strand is now out of phase with the rest of the strands in the cluster, and therefore, when the next incorporation cycle comes along, it's one ahead of everybody else. And it's not just going to be that one strand, because chemistry is never 100%. So you can see that in all the clusters on the flow cell there's a proportional probability of incorporating a nucleotide that's not properly blocked at the three-prime end. Another possibility is that, because the cleavage chemistry won't always work, you either might not get the fluorescent group removed, so it'll continue to interfere by contributing noise in subsequent cycles, or you might not actually get the three-prime block removed. That fragment now falls out of the running, if you will. It can't be extended any further, so it's not going to contribute signal anymore. It also won't contribute noise, to be clear. But these are some of the sources of signal-to-noise degradation that ultimately limit the point at which you're no longer getting sufficient signal to accurately call the nucleotide that is properly incorporated.
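The "chemistry is never 100%" argument can be made quantitative with a toy model: if a fixed fraction of strands falls out of phase (or out of the running) each cycle, the in-phase signal decays geometrically with cycle number. The 99.5% per-cycle efficiency below is an illustrative assumption, not a vendor specification.

```python
# Toy model of why read length is finite: with per-cycle efficiency e,
# the fraction of cluster strands still perfectly in phase after n
# cycles is roughly e**n, so signal purity decays geometrically.
# e = 0.995 is an illustrative assumption, not a vendor spec.

def in_phase_fraction(cycles, efficiency=0.995):
    return efficiency ** cycles

for n in (50, 100, 150, 300):
    print(n, round(in_phase_fraction(n), 3))
```

Even at this generous efficiency, by a few hundred cycles well under half the strands are still reporting the correct base in lockstep, which is the signal-to-noise wall the talk describes.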
So just to finish with the Illumina platforms, here's a figure from their website that shows what's summarized below by the remarks, and these are not my remarks, but rather I took a poll across the sequencing technology cognoscenti for their impressions of the Illumina platforms. In general, we see that this is a platform with high accuracy; the predominant error type is substitution, and typically you're in the range of about 0.1 to 0.2% error rate on a per-read basis, so for each read, the forward and the reverse read. There is a range of capacity and throughput, as illustrated across the series of boxes up here: the MiSeq is sort of the desktop sequencer, and the HiSeq X is this titan, thousand-dollar-genome box that was recently announced and is now starting to populate large-scale sequencing providers. There are longer read lengths available on some platforms, like the MiSeq, which will do 2 x 300 base pair reads, but in general most of the Illumina sequencers are still in the 100 to 150 base per-end read length range, for the reasons that we just talked about. And because of the challenges of data analysis, which have already been mentioned and which I'll talk about in more detail in a minute, these providers have improved their software pipelines and downstream analytical capabilities, and are now offering some cloud computing options for users that don't have the desire to put together large compute farms to analyze the data. Okay, so let's switch gears now to a different type of sequencing, which is the Ion Torrent platform. It's illustrated by these figures shown here, which again are from the company's website, so I'll encourage you, if you're interested in any of these technologies in particular, to go to the company websites, because they have fancy animations and things that I can't do on slides, or don't want to, so it's perhaps much more explanatory than even I can provide.
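The ~0.1 to 0.2% per-read error rates quoted above are often expressed as Phred quality scores, and the conversion is simple:

```python
import math

# Phred quality score: Q = -10 * log10(p_error), so a 0.1% error rate
# corresponds to Q30 and a 1% error rate to Q20.

def phred_from_error(p_error):
    return -10 * math.log10(p_error)

def error_from_phred(q):
    return 10 ** (-q / 10)

print(round(phred_from_error(0.001)))  # 0.1% error -> 30 (Q30)
print(round(phred_from_error(0.002)))  # 0.2% error -> 27 (~Q27)
print(error_from_phred(20))            # Q20 -> 1% error
```

This is the same scale used later for the per-base and mapping quality scores discussed in the data analysis section.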
This is a unique approach to sequencing because it's truly label-free, so it uses native nucleotides for sequencing and a very unique form of sequence detection, which is shown by the chemistry here. This illustrates that when you're putting in a nucleotide using the polymerase, relative to the template of course, so the C is now going in against this G, one of the byproducts of nucleotide synthesis, sorry, of chain growth rather, is the release of hydrogen ions. And it's a proportional release, so that if there were, for example, three Gs here in a row, I would generate proportionally more hydrogen ions, because the native nucleotide is going to get added in in triplicate, not just once. Okay, and how do we know that that hydrogen is being produced? Well, we have sort of an old-fashioned device here on the silicon wafer part of the sequencing chip, which is a pH meter that's unique to each one of the wells in which a bead might be sitting as it goes through the sequencing process. So this approach uses a bead-based amplification, where the surface of the bead is covalently decorated with the same adapters, or primers, that we've used for the library. So you can see here now amplified fragments that are primed and ready for sequencing. This series of steps is very similar to what I just described, with the exception that, because we're not using fluorescent labels, or labels of any kind here, the nucleotide flows are one at a time. So it's A, let's say, followed by C, followed by G, followed by T. So this native nucleotide gets washed across the surface of the chip, on which there are many wells like this, most of which are occupied by a bead. The diffusion process brings in the nucleotide, and if that's the right nucleotide to add at that cycle, according to the template sequence of course, it will get added in, hydrogen ions will be released, and the amount of hydrogen, or the pH change, will be detected and turned into an electronic signal that registers with the software, which knows which nucleotide is being washed over. So you go through a series of four steps for incorporation. The pH is monitored, of course, individually for each one of these wells and recorded according to the XY coordinate for that particular well. And then at the end of this, you get a readout that looks something like this, if you like to look at data, where you can see that, for example, this peak right here is quite a bit higher than the others, and so on and so forth. And I should point out that there is a way of registering the height of these peaks. There's a sequence at the beginning of the adapter that's single-nucleotide only, so you get the representative height for G, A, T, and C, and the software cues off of that for quantitating the peaks thereafter, from the sequence that you're trying to obtain, and so this baselines everything for you. So let's look at the platforms for this approach. There are two. The Personal Genome Machine has been around the longest; we have one of these in our laboratory. There are three different-size sequencing chips available, depending upon how much sequence data you want to generate. The runs are quite rapid, and the read lengths can be as high as 400 base pairs. These are not paired-end reads, so this is just a single priming event followed by extension and data collection up to a stopping point. The larger-throughput device is the Proton. This is currently doing exomes, I think, and aiming for whole genomes. There are preparatory modules associated with both of these instruments that take care of some of the initial amplification steps on the bead, which occur not through a bridge amplification like I showed for the Illumina, but rather require encapsulating the bead, the library fragments that are going to be amplified on each bead, and PCR reagents, including enzymes, into single micelles in an oil emulsion, which then allows everything to be PCR-cycled en masse. And that's where you get the amplification that's required for
the signal strength as I talked about at the beginning and so just the characteristics of this platform because you're supplying rather one nucleotide at a time this has an inherently low substitution rate you don't detect something that's not there because only a single nucleotide is being added in insertion deletion is really the key error type in this sequencing and that's because there's a proportionality that exists only for a certain number of nucleotides typically up to five to six nucleotides of the same sequence five or six Gs in a row can be accurately detected and then above that the proportionality is lost so you do end up getting insertion deletion errors as a result around what are called homopolymer runs those runs of the same nucleotide I've already talked about peridon reads this is relatively inexpensive sequencing mainly because it's using native nucleotides and the data production turnaround is relatively fast and again they're improving their computational workflows for data analysis of different types okay so let's talk a little bit about data analysis because this is as I mentioned in the beginning one of the more challenging aspects I'm not gonna take a deep dive on this but just to sort of roughly reflect what the challenges are especially when you're dealing with a genome that's as large as the human genome which will be my exemplar so you know the goal of using science including in genomics is that if you could just have your sequencer and you could generate all of these data then the next step would be to have this beautiful figure part C for the publication that's going to the high impact journal of your choice but of course it's not that easy and sequence data alignment is really the crucial first step which allows me to put a plug in for genome references many of which we've generated in our own laboratory through NHGRI funding because these are really critical pieces in the data analysis of next gen sequencing so just to give a 
pictorial example: if this is sort of the human genome, the cover on the box of the puzzle you're trying to put together so you can generate that beautiful figure for your paper, these are all the short-read data that you have to actually try and make sense out of. And of course the challenge here is that you can easily find the pieces with unique features, those are probably many of the genes in the human genome, but figuring out where everything else goes is really the harder part of the equation. And with the genome being about 48% repetitive, this turns out to be reasonably difficult. Part of the problem, of course, is that because there's so much repetition in the genome, you can get reads that look like they probably belong in multiple places, where the real challenge is mapping each read back accurately to where it came from, so you can properly assign any mutations that you might identify in that sequence. One of the ways that we've gotten around this from a bioinformatics standpoint is that we have mapping quality scores that illustrate the certainty of mapping that read to that particular spot in the genome. So where you have a multiple mapping, as illustrated here for a given sequence read, you can go with the highest quality score to assure that you've placed it in the right spot. The other aspect that can save us, in terms of certainty of placement, is paired-end reads, because oftentimes, while you'll have one read that sits in a repetitive sequence, the opposite read, or the companion read, may actually properly align in a unique sequence, and therefore you can give a higher certainty to the placement of that read using the paired-end read mapping approach. So once you have your reads aligned properly to the genome, what do you need to do to get a good, accurate sequence evaluation? Well, there are a series of steps here; I won't dwell on them for long. First of all, you have to identify where your duplicates are. We talked
about those as a result of PCR. We correct any local misalignments; this is particularly for identifying small insertion-deletion events, a few bases that are added or deleted, which are the hardest things to find. And then we recalculate the quality scores and call SNPs, single nucleotide polymorphisms, for the first pass. Why do we do this? Well, it allows us to do what we call evaluating coverage, and coverage is the name of the game here. If you don't have adequate coverage, you don't have enough oversampling of the genome, essentially, to prove to yourself that any variant you identify is actually correct. So the more coverage the better, but of course more coverage costs money, so you have to find a balance between those two where you have high confidence but you haven't killed your budget. There are various ways of evaluating coverage. For example, if we have SNPs that were called from a SNP array, where we took the same genomic DNA, applied it to a SNP array, and called SNPs, we can actually do a cross comparison: what are the SNP calls from next-gen sequencing, what are the SNP calls at the same loci from the array, and to what percent are they concordant with one another? The higher the concordance, the better your coverage is, and the more certain you can be going on to downstream analytical steps, with the notion that you've got the right coverage to be confident about anything that follows. Another thing you can do is look at the data. For people who've been sequencing as long as I have, there's a real comfort in that visual examination of the data. And even though next-gen is a fundamentally different data type than Sanger, which has these beautiful colored peaks, or, as old as I am, I used to go back to pulling up the autoradiogram and slapping it back on the light box, but we won't go there, there are tools now like IGV to actually look at your data and look at the quality of the data and so on. I'll illustrate these in a minute. And then, because we tend
to do things in very large numbers at the Genome Institute, as do other large-scale centers, we also have bulk tools that allow us, for a huge data set, to just sort of say: how did we do across the spectrum of coverage? For us this is a program called RoughCov that I'll show you. And then, once this is all said and done, you go off to analyze the data in a multitude of ways. I'm not gonna spend time on that today, because it's a course in itself, not just a single lecture. So here's IGV. This is a program that's available from the Broad Institute at this URL, and we use it a lot in examining sequence data by eye. You can get whole-chromosome views, zoom down to see more detail in the region you're interested in, and even, as in this illustration, look at the single-nucleotide, single-read level to really see the depth of coverage that you have, the quality that's ascribed to each nucleotide, et cetera, where low-quality base calls are faint or semi-transparent, just to illustrate that they have lower confidence, whereas these C's are very high-confidence calls, as you can see from the close-up. Here's another look, now comparing, as one of the things that we do, the tumor to the normal from an individual, where you can see there's great evidence here that this is a mutation that's somatic in nature, so it's truly unique to the tumor DNA and not the normal DNA for this individual. And then this is just a look at RoughCov, and here's the site for this on our website, now looking across a bulk number of samples, many of which, if you're looking carefully at the notations here, are from formalin-fixed, paraffin-embedded tumors. This is showing the percent coverage at different coverage levels according to the key, where what we wanna see is everything green or better, at or above 80%, for example. This is how much data you've generated, with a look at uniqueness versus duplicates, and then this is just how much you've actually enriched across the
regions of the genome that you're actually interested in sequencing. And so these bulk tools can give us a quick look at the quality of a data set, again, before we move on to analysis. One of the things that we've spent a lot of time doing is putting together a somatic variant discovery pipeline, and I'm just using this as an example of how you can daisy-chain together different analytical programs to take you from the original read set to, ultimately, what you care most about, which is the analyzed data. So just for the purposes of illustration: when we sequence a tumor, we always sequence the matched normal, as I illustrated earlier, so that's the input. The alignment initially is to the human reference genome; we align all the tumor reads as a separate build, as we call it, and all of the normal reads as a separate build, and then the comparison begins. So we have a variety of algorithms, after the first read alignment, for discovering truly tumor-unique somatic point mutations as well as indels, where these are now single nucleotide variants, as opposed to single nucleotide polymorphisms, which would be in the germline or the constitutional DNA. We can detect structural variants, as I alluded to earlier; often in cancer, translocations and inversions fuse together genes that are known to be drivers in oncogenesis, so we absolutely want to detect these from whole genome data. And then, as I alluded to earlier with my HER2 example, we can get very precise quantitation and boundaries on copy number alterations in the genome. We do apply filters to these; they're sometimes very sophisticated statistical filters where we remove sources of known false positivity. For example, I detect a variant, but every read that's showing me that variant shows it at the end of the read, where, as we talked about earlier, the quality of the data gets poorer because of signal to noise.
I can easily throw that out as being a false positive, because if they're all at the ends of the reads they're likely not true positives, just based on experience and the validation exercises that we've gone through. And then for these structural variants, really the best way, again, is to look at the data, and we have tools to look at the read support for a translocation, for example, where one end of the read pair maps to one chromosome and the opposite end to another chromosome, and that really gives us good support, visually, that there's actually a translocation that's occurred there, and so on and so forth. We then use the annotation of the human genome reference to annotate the variants and really tell us: is this an amino-acid-changing mutation? Is this something that's going to alter splicing, for example? Is there a fusion gene that I'm predicting from this translocation? Et cetera. And then we finally get to that desired result that I talked about earlier, which is the beautiful representation of our tumor genome in all of its glory, with the chromosomes here as colored blocks. This is a Circos plot, as we call them. All of the words written to the outside are the known genes in the genome that are altered by mutation. Copy number is this gray area in the middle, and then all of these arcs across the center, for example, are translocations that involve two chromosomes. The little blips are typically inversions or deletions and so on. So this is really the Cliff Notes version of a cancer genome, if you will, and it takes an enormous amount of work to really get to that point. Now, just to finish up, a lot of what I'll talk about here at the end is the transition of cancer genomics into the clinic.
And one of the things that we've come up against in terms of this translation is that we really need to understand our sources of both false negativity and false positivity. In the research setting, we actually care more about false positives, quite frankly, because we wanna accurately represent to the world those true mutations that we've identified in the course of sequencing many cancer genomes, for example. So we have lots and lots of knowledge about what causes false positives, and I've already alluded to one of these, which is the variant being called only at the end of the read; there are others, and we can design statistical filters to eliminate them, as I've already told you. But most false negatives are actually due to a lack of coverage, and in the clinical setting you actually worry more about false negatives, because you don't wanna miss something, which makes total sense, right? It's just that we have to now build in new filters to examine where our coverage is actually too low, and note those sites, so that we understand the areas where we're going to be getting false negatives. And really, what has allowed us to come up with these statistical filters is that next-gen sequencing has been changing so much over the past six years, in terms of improvements obviously, but changing nonetheless, that we've had to go back constantly and validate. So each time we get a set of mutations, we design new probes, hybrid capture is used to pull those regions back out of the genome, and we sequence them again to really verify what's a false positive and what's a true positive, and that's allowed us to come up with these notions of where to remove false positivity.
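To make those two failure modes concrete, here's a minimal Python sketch, not our production pipeline, just an illustration with made-up thresholds, of how a candidate somatic site might be flagged as a likely false positive (variant support only near read ends) or a possible false negative (coverage too low to call confidently):

```python
def read_end_fraction(supporting_offsets, read_len, margin=10):
    """Fraction of variant-supporting reads in which the variant base
    sits within `margin` bp of either read end, where base quality
    tends to degrade due to signal-to-noise."""
    near_end = sum(1 for off in supporting_offsets
                   if off < margin or off >= read_len - margin)
    return near_end / len(supporting_offsets)

def classify_site(supporting_offsets, read_len, depth,
                  min_depth=20, end_frac_cutoff=0.9):
    """Crude triage of one candidate somatic variant site.
    Thresholds here are illustrative, not clinical criteria."""
    if depth < min_depth:
        return "possible false negative (insufficient coverage)"
    if read_end_fraction(supporting_offsets, read_len) >= end_frac_cutoff:
        return "likely false positive (variant only near read ends)"
    return "pass"
```

For example, a site at 40x depth whose only variant-supporting reads carry the variant at offsets 95 to 99 of a 100 bp read would be flagged as a likely false positive, while a site covered by only 8 reads would be flagged as territory for false negatives regardless of what is called there.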
Okay, I'm gonna shift gears now to third-gen sequencers, and this is really a variation on a theme, because unlike all the things that I've already told you about, the third-gen sequencer shown here, the PacBio instrument, which has been commercially available for about four years now, is a completely different paradigm from what I just talked about. So this is true single-molecule sequencing, as opposed to sequencing a cluster of amplified molecules, which is what next-gen sequencing really does, if you wanna put a fine point on it. In the PacBio system, the library preparation actually looks quite similar, if you recall, to those steps that I already talked about for next-gen sequencers. So we shear the DNA, we polish the ends, we ligate on these specific adapters, called SMRTbells in the PacBio system, and then we anneal the sequencing primer to the portion of the SMRTbell where it's complementary. Now, unlike next-gen sequencing, the next series of steps is highly unique. We first bind each of these library fragments to a specific DNA polymerase. So we incubate them together, get the DNA polymerase to bind onto a single molecule, and then we load this entire mixture onto the surface of this little device here, which is called a SMRT Cell. This is the sequencing mechanism in the PacBio instrument, and it's about as big in diameter as the tip of your little finger, but hidden in this SMRT Cell, where the eye can't see, are 150,000 zero-mode waveguides. What's a zero-mode waveguide? It's basically a sequencing well that this DNA polymerase, complexed to your library fragment, can nestle down into for the sequencing reaction itself. And that's really illustrated here. Hopefully these aren't too dark to see, but again, these are just shots from the PacBio website, so if you can't see them here you can look at them there. What we wanna do, in each one of these 150,000 zero-mode waveguides, or as many as possible, because this is never 100% loading, is this:
You wanna have a DNA polymerase complex come down and attach to the bottom of the zero-mode waveguide. Well, what's the function of that zero-mode waveguide? It's actually to precisely pinpoint the active site of that DNA polymerase for the machine optics that are gonna detect the sequencing reaction happening, in real time, in the active site of that polymerase. So what happens is you provide, and the instrument does this, these labeled nucleotides, and they sample in and out of the polymerase active site, much as they do in the cell. They get detected when they've dwelled long enough in that active site to be incorporated, and they're read out according to the fluorescence wavelength that's emitted: A, C, G or T. So there's a specific label for each nucleotide. It gets detected by the optics of the instrument as it enters the active site of the polymerase and dwells there for a sufficient period of time. And this is real-time sequencing because you don't take specific periodic snapshots; rather, the optics, the camera, the instrument watches each one of these 150,000 zero-mode waveguides for a continuous period of time, which is called a movie, and it essentially collects data from what's going on in the active site of each one of the polymerases all the time during the duration of that movie. So any nucleotide that samples in and stays for long enough to get incorporated gets its fluorescence detected. What's happening to the fluorophore? It's actually on the polyphosphate, so when the nucleotide's incorporated it diffuses away and doesn't stay around to interfere with the subsequent cycles of incorporation. And of course that's critically important. Why? Because keep in mind we're looking at a single DNA polymerase operating on a single strand of DNA in real time, so you've got to be exquisitely sensitive to detect that fluorescence and pick up the information.
So one of the things that's unique about this type of sequencing, as you might have guessed, is that the sequencing read lengths are now extraordinarily long. As opposed to the next-gen sequencers, with improvements to the chemistry and improvements to the library prep, some of the details of which are shown here, where we're isolating longer and longer fragments in our preparatory library construction process, we're actually now able to extend the time of the movie generation and collect quite long reads. And here's just some real data comparing the previous chemistry to the new chemistry. I won't go into the details of this; it's all available on their website. But look at these read lengths: we're now looking at reads that extend out to 25,000 to 30,000 nucleotides at a time. Now, this is not all perfect, right? As I mentioned earlier and alluded to, single-molecule detection is really hard; hopefully you got that point. So there is a high error rate associated with this type of sequencing, somewhere around a 15% error rate. So 15 bases out of 100, a totally random error rate; there's no rhyme or reason to it, although the sources of error are pretty easy to pinpoint. But the bottom line is that coverage, again, is your friend: if you cover enough of the genome with these long reads, you can actually correct the random errors to the point of ending up with about a 0.01% error rate. That's very, very low in the aggregate, but not from the single reads. And so that's really the trick to using these data. And we are using them increasingly as these read lengths grow, because they're really good for connecting small bits of chromosomes, for example, that we haven't been able to orient before. The chicken genome is a perfect example of this. If you're not familiar with chickens, and most people aren't, they have mini chromosomes.
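That coverage-versus-error tradeoff is easy to check with a back-of-the-envelope calculation. The sketch below is a simplification, it treats errors as independent per read and asks only how often erroneous reads could outvote correct ones at a site, but it shows how a ~15% random per-read error rate collapses under a simple majority-vote consensus as coverage grows:

```python
from math import comb

def consensus_error(coverage, per_read_error=0.15):
    """Probability that at least half of the reads covering a site are
    wrong (a conservative bound on majority-vote consensus error),
    assuming independent, randomly placed per-read errors."""
    threshold = (coverage + 1) // 2  # reads that must err to outvote the rest
    return sum(comb(coverage, k)
               * per_read_error ** k
               * (1 - per_read_error) ** (coverage - k)
               for k in range(threshold, coverage + 1))

# At 10x coverage the consensus can still be wrong about 1% of the time;
# by 30x this bound drops below 0.01%.
```

This is only a toy model, real consensus callers also weigh base qualities and handle indels, but it captures why random errors, unlike systematic ones, are so forgiving once coverage is deep enough.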
So some large chromosomes akin to human size, but also lots of these mini chromosomes that are very, very hard to stitch together, until this technology came along. So again, this is just exemplary data that we've generated for the chicken, showing that when you really enrich for large fragments on this BluePippin device, which is used for the fragment separation, you can push the read length out quite a bit. And I don't have data for this, but I have colleagues in the business who have reported read lengths now, with some of these new approaches, in excess of 50,000 nucleotides. So you're really getting quite long, even to the point of DNA maybe being unstable above these lengths without some specific care and feeding. One of the things that we're also using this technology for, just to finish up, is improving the human reference genome. Because you can generate sequence reads now of 50,000 or so bases, you can actually take entire human BACs that represent difficult regions of the genome and, rather than breaking them into 2 kb subclones as we used to do and trying to put them all back together again, you can sequence end to end across fosmids, for example, with 30 to 50 kb inserts, and major portions of human BACs, and then do assembly with the PacBio reads. So this is just a lot of data about a bunch of BACs that we're sequencing across difficult regions of the human genome that aren't properly finished. And then we're doing comparative assemblies with just random human genome data to try and improve the assembly of the genome overall. And one of the ways that we're doing this is actually sequencing from a unique anomaly that's often identified in obstetrics, which is the hydatidiform mole.
So if you're not familiar with this terminology, this is a rare event where an enucleate egg, an egg without a nucleus, is released from the ovary and gets fertilized by a sperm, and develops to a certain stage in the uterus before it's removed surgically. So this hydatidiform mole represents a true haploid human genome: one copy, the sperm that fertilized the egg. And so there are some, not many, but a few cell lines that have been produced from these hydatidiform moles, and we're actively sequencing them now on the PacBio sequencer, trying to achieve about 60-fold coverage across the human genome, to really begin to understand, without the complications of diploidy, which you have with most human genomes, how to stitch these difficult regions of the genome together. So this is work that's ongoing in our laboratory right now, as we're sequencing a new genome, CHM13, another hydatidiform mole cell line. So more on that to come. And then, just to finish up, this is a comparison of the tiling path versus a long-read assembly that we were able to obtain using PacBio for a specific segment of the genome with these approaches on the hydatidiform mole. So just one last word about sequencing, and then I'll finish with my little vignette about cancer genomics, and that's nanopore sequencing. This is the next type of sequencing that's on the horizon. It sounds a little weird to say that it's on the horizon, because actually, if you look in PubMed, the earliest report of nanopore sequencing, namely pulling a DNA strand through a nanopore, is about 19 years old. So this is an idea that's been around for a while, which ought to give you an idea that it's really, really hard to make it actually work.
So this is just one example, which gives an idea of how this could work: you have your nanopore here, and you have, for example, an exonuclease perched at the top of the nanopore, and when it grabs a strand of DNA, as it does, cutting off one base at a time, you may be able to pull those nucleotides through and somehow distinguish A from C from G from T. So that's one possible approach. The other approach is just having an enzyme here that works to separate the double strands and translocate the single strand through the pore, okay? So that's another approach that could be used. And again, here, the challenge is twofold. One is uniform pores, because you wanna run multiple pores, not just one at a time, ideally, so that you have throughput, and the less uniform these pores are, the more differential the readout you get at each one. And then there's the readout itself: what's the signal? How are you detecting these? Typically, that's a charge differential: if you have a differential on either side of this pseudo-membrane shown here, when the DNA translocates through, there should be some abrupt change in the charge ratio across the two halves of the membrane, and that change should correspond to the identity of a nucleotide, for example. So in practice, there is a device, not yet commercially available but in testing, called the Oxford Nanopore. This uses the latter approach that I showed you. And the idea here is that you have this little thumb drive; this is obviously a prototype, because it's got some lab tape on it. The idea is that you put your DNA fragments in through here, they essentially get pulled through the pores, the readout comes in through the USB 3 port of your laptop, and you can read out the sequence based on that. So these are out in testing in certain laboratories, and there are just some early reports where it looks like the error rate is, I would say, quite high at this point in time:
north of a 30% error rate, and I think 15% alone is pretty hard to deal with from an algorithmic standpoint. So I think this needs some refinement before it really sees the light of day. But it's an interesting new approach: truly reagentless sequencing. This is just DNA in, sequence out; there's no reagent here other than the device on which the sequencing is happening. Okay, so I just wanna spend the last few minutes talking about how this all coalesces to really change science and, ultimately, maybe the practice of medicine. So it's a little bit of a forward look, but these are things that we're actively working on now. We've known for some time, since the early 1900s really, before we knew that there was a genome, that there were hints of something fundamentally different about the DNA in cancer cells. And this is one of my scientific heroes, Janet Rowley, who really sat down with a microscope in the early 1970s and started looking at cancer chromosomes. She devised several preparatory methods that made these much clearer to look at. And this is one of the figures from one of our early papers showing the t(15;17) translocation that is now diagnostic for a specific subtype of acute myeloid leukemia known as APL, or acute promyelocytic leukemia. And so really her studies, as well as several others, began to lay the foundation that when cancer occurs, by looking at the chromosomes in the cells you can see physical differences. But now that we can sequence whole genomes of cancer, we can actually begin to understand these translocations at the A, C, G and T level, whereas with her microscope she could see just the gross result of the translocation. So I often say that the next-gen sequencers are really just a new form of microscope, if you will, where we have resolution down to the single nucleotide.
And if you're aware of the cancer genomics field, there has been a lot of work in cancer genomics just over the past five to six years. This display is reflective of the work that's gone on in our laboratory, but this is now an international effort to categorize cancer genomes across multiple tumor types. You can see in this particular display, which is a few months out of date, that we've now sequenced in excess of 2,700 whole genomes from over 1,000 cancer patients across different subtypes: AML, breast cancer, et cetera. And now a very large fraction of our work has been in the pediatric cancer setting as well, where a collaboration with St. Jude Children's Research Hospital has sequenced, to date, over 750 pediatric cancer cases. So this is really a scalable enterprise, and this is true discovery: what are the genomic roots of cancer, and how can we tease them out using next-gen approaches such as whole genome sequencing and exome sequencing? And again, this is an international exercise. Within the US, we have The Cancer Genome Atlas, which has been jointly funded by the National Cancer Institute and the National Human Genome Research Institute. It'll wrap up round about next year, but because we've now sequenced through almost 20 adult cancer types across multiple different types of assays, not just DNA mutation and copy number but also RNA, methylation, protein data, et cetera, we're now beginning to coalesce around the commonalities and differences across cancer types. So this is just a recent publication of this so-called pan-cancer approach, which really tells us that cancer is a disease of the omes. The genome is important, to be sure, but there are things that we can detect only at the RNA level, or only at the methylation level, and combined together they really begin to tell us about the biology of human cancer as opposed to human health.
And so this is, I think, a really foundational set of data, if you will, that now sets the stage for translation and making a difference in cancer patients' lives. So let me just talk briefly about what we're doing at our center, because I think it's maybe a bit different from most places, which tend to pick known cancer genes, put together a targeted hybrid capture set, and just look at those genes in particular, which can be very informative but ultimately is not as comprehensive as I think we need. What we're taking is a combined and integrated approach that uses whole genome sequencing of the tumor and the normal for each patient. This really gives us, as I've already talked about, the full breadth of alterations that are unique to the cancer genome, and it will also tell us about any known constitutional predisposition in specific genes for these patients. Exome sequencing is important for two reasons. One is a standalone analysis of tumor versus normal exomes: we can get most of the sites that we've already detected in the whole genome and really have that interplay that says, hey, you really got it right, because you detected it in both data sets. And so that's an important validation approach. The other thing this gives us is great depth, because typically exomes are sequenced at about 100-fold or higher. Combining the whole genome coverage with the exome coverage now gives us great depth at these sites and tells us a lot about what we know is true in cancer, which is that not all cancer cells are created equal in terms of their mutational profile. So there's so-called heterogeneity in the cancer genome, even in a single tumor mass, and this can really be identified through the deep coverage analysis that you get out of the combined exome and whole genome. Lastly, and perhaps I could argue most importantly, doing the transcriptome of the tumor cells is fundamentally important. Why?
Well, first of all, it tells us about genes that are overexpressed that we might not detect from just sequencing DNA. So, for example, there's a new transcription factor binding site but we haven't detected it, or there's a change in methylation but we haven't detected it, but the downstream consequence is that the gene is overexpressed, and that may be pathogenic; I'll show you an example of that in a minute. We also know that even though we can often detect lots of mutations in cancer genomes, only about 40 to 50% of the genes that carry mutations are expressing those mutations. So if you're going to drug a mutation, you really wanna know that it's being expressed at the level of RNA, and RNA sequencing alone will tell you that. And then lastly, I've already alluded to gene fusions. The t(15;17) that I showed you earlier from Janet Rowley's work fuses two genes together, PML with RAR-alpha, and this gene fusion is sufficient for causing acute promyelocytic leukemia. Detecting that at the level of structural variant analysis in whole genomes, which is the only place it can be done at the DNA level, still carries a high false positive rate. So if we can identify the fusion gene in the RNA-seq data, that gives us a nice validation that the structural variant fusion we're predicting is actually being expressed. So there's a huge interplay and integration in these data that needs to take place. And at the end of the day, what we really wanna do is identify gene-drug interactions that may indicate, for that particular patient, a key drug that they should be taking to help alleviate their tumor burden. Now, the analysis, as I've already alluded to, is very complicated. It really takes a team to put together an analytical approach and all of the downstream decision support tools to really make this fly in the clinical setting. So this is my dream team, Obi and Malachi Griffith. They have the same last name. They kinda have the same faces; one's hairier than the other.
Same eyes, yes, they're identical twins. So it's nice to have a dream team that works this closely together, lives together, et cetera, et cetera. These guys have a personal commitment to cancer treatment, because their mother died of breast cancer when they were 18 years old. So they've developed this system, and it's complicated; I'm not gonna delve into the details, but suffice it to say that all of these green boxes are the things that you can find through the approach that I just walked you through. So there's a lot of information. How do we make sense out of all of it? Well, there are a variety of key steps that have to be followed; they're here in the center. First of all, for all the altered genes, whether it's from DNA or RNA or both, we have to really understand functionally what the consequence is, and that's a hard step. So we're really putting together some decision support tools to help with that. We're also putting together decision support tools for the next part, which is: what are the activating mutations? Because you can really only apply a drug therapy to things that are activating; if the gene's been knocked out, which happens in tumor suppressors, for example, it's not really a good drug target. And then layering this information onto pathways within the cell is critically important. Why? Because cancer's not a disease of genes, it's a disease of pathways that are activated and aberrant and cause a disruption in the normal division and growth cycles of the cell. So by layering these mutations onto pathways, we not only understand the pathway that's activated, but we can also strategically identify the best place to drug that pathway. And coming to those strategic decisions is not easy. We've generated this drug-gene interaction database, which I'll show you in just a second; it combines information from a lot of different sources that are essentially not meant to talk to each other.
And that's part of the complicating factor when it comes to making decisions. So the drug-gene interaction database helps to interpret where best to drug, what drugs are available, what clinical trials may be available for that patient to go on to, negative indications, et cetera, and rolls this all out into a clinically actionable events list that we often refer to as the report. So here's just briefly some of the decision support tools that we're curating to come up with tools that anyone in this enterprise can use, and there are many people entering into this. This is the database of canonical mutations, which is not yet ready for release, but close. It's gonna give a curated database of mutations that have a demonstrated association with cancer. So that should be coming out soon. And this is DGIdb, which I already alluded to. Here's the URL. It's been released, it's been published, and we just recently updated it, so it's a bit more sophisticated. All you do at DGIdb is type in the genes that you're interested in, set a few parameters that are shown here in terms of the databases you wanna look at, whether you want anti-neoplastic drugs only, et cetera, and push the green button. And then this is the search interface that results for this particular query, where all I'm showing is ABL1 kinase, which is involved in CML. You can see there are multiple drugs available, all inhibitors in this particular screenshot. And now here's a link to the database source. In this case, these are all from My Cancer Genome. So by clicking on any one of these drug links, you go to My Cancer Genome and get more information from that data source about that drug, about clinical trials, et cetera, and so on. So really DGIdb is a clearinghouse for information that helps to link drugs to mutations and genes, and is really just meant to simplify the search for information that a clinician might come up against. So how have we used this type of an approach?
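The clearinghouse idea described above, merging drug-gene interaction claims from sources that weren't built to talk to each other, can be sketched as a tiny in-memory lookup. This is not the DGIdb API; the source names here are hypothetical, though the ABL1 inhibitors shown are real CML drugs:

```python
# Toy sketch of a drug-gene interaction clearinghouse in the spirit of DGIdb.
# "SourceA"/"SourceB" and their records are illustrative, not real DGIdb data.

sources = {
    "SourceA": [("ABL1", "imatinib", "inhibitor")],
    "SourceB": [("ABL1", "dasatinib", "inhibitor"),
                ("EGFR", "erlotinib", "inhibitor")],
}

def query(genes, interaction_type=None):
    """Return all interactions for the requested genes, tagged by source."""
    hits = []
    for source, records in sources.items():
        for gene, drug, itype in records:
            if gene in genes and (interaction_type is None or itype == interaction_type):
                hits.append({"gene": gene, "drug": drug,
                             "type": itype, "source": source})
    return hits

for hit in query({"ABL1"}):
    print(hit["drug"], "<-", hit["source"])
```

The value of the real thing is exactly this merge step done at scale, with each hit linking back to its originating source (My Cancer Genome, in the screenshot described).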
Well, we had an early success that was reported here in the Journal of the American Medical Association. I won't take too long to talk about it because it's already in the peer-reviewed literature, but this is just an example where whole genome sequencing was able to solve a diagnostic dilemma for this patient, who presented from the pathology examination of her leukemia cells with what appeared to be acute promyelocytic leukemia, that form of leukemia that I talked about earlier, but upon cytogenetic examination of her chromosomes, there was no evidence for that t(15;17) translocation. However, by whole genome sequencing, what we were able to show is that physically a large piece of chromosome 15 had inserted into chromosome 17, recapitulating the PML-RARA fusion, much like the t(15;17) does in 90-plus percent of APL patients, and resulting in the disease that she had. So based on this analysis and subsequent CLIA lab verification, because this was done in the research setting, this patient was able to go to consolidation therapy with all-trans retinoic acid, the standard of care for acute promyelocytic leukemia, and she's alive and well to this day, because most patients with APL experience about a 94% cure rate with chemotherapy and all-trans retinoic acid consolidation. Another success that we had is here in the New York Times, not my favorite peer-reviewed journal, but we're still in the process of fully investigating this example, which is based around the acute lymphocytic leukemia second relapse of my colleague and friend, Lukas Wartman, who's shown here in this cancer story from Gina Kolata in the New York Times a couple of years ago. In Lukas's case, RNA was really the bellwether for the driver in his disease. So by sequencing RNA, what we were able to show, unlike the nothing that we got from the whole genome sequencing of Lukas's cancer and normal, was that FLT3 was extraordinarily overexpressed in his tumor cells.
We didn't really understand the significance of this until we looked in the literature. This reference from Blood shows that even in patients that are moving towards B-cell ALL, there's an increase in FLT3 expression. So this looked like a putative driver. More importantly, a search of DGIdb showed that there was a good possible inhibitor, which is sunitinib, at the time and still to this day not approved by the FDA for acute lymphocytic leukemia, but he was able to get the drug through a compassionate use appeal to Pfizer, took the drug, was put into full remission by taking the drug, and was able to get an unrelated stem cell transplant and is alive and well to this day because of this intervention. So that's some of what we're doing. We continue to sequence cancer patients' genomes now through a coordinated effort at our institute, sponsored by our medical school, that tries to also bring in more clinicians for the purposes of education, which is a critically important aspect of understanding genomics that I haven't talked about. I just wanna finish with this one little, I think interesting, other approach that we're using, and then I'll open for questions. So in the targeted therapy arena, much like the sunitinib that you just heard about, patients are receiving great relief from their tumor burdens, but often patients come back with what's called acquired resistance, which means that the cancer cells basically invent around the blockade. And one of the ways that we're now looking at cancer in a different way is, for those patients who have now passed through acquired resistance, to perhaps take a different look at their tumor genomes to design a specific and highly personalized vaccine-based approach that might invoke their immune system into helping them battle their progressive disease. So this is just the paradigm that we're following. We're starting with melanoma in this setting that I'll describe.
In melanoma, you have multiple cutaneous lesions in the metastatic setting that can be readily sampled through the skin. Here we're studying tumor versus germline DNA by exome sequencing. We identify somatic mutations, but we don't worry about all those parameters that I showed you in the earlier diagram. Here what we do is we first check RNA to verify which of these mutations are being expressed. And with the high mutational load in melanoma, because of UV, you have to do this. But more importantly, we also want to understand the highly expressed RNAs, because those are likely to also be highly expressed proteins. Why do we care about that? Because these are the targets that we want to examine in terms of their immunogenicity for that patient. To do that, we need another piece of data, which is the HLA class I type for the patient. This is a readily obtained clinical-grade assay, although you can also derive it from whole genome sequencing. And then we put all this information through this algorithm called NetMHC. So we translate the peptides for the mutated genes to give the wild-type peptide and the mutated peptide. We put in the information about the class I type for the patient, and then NetMHC returns to us a prioritized list of those most immunogenic peptides that are highly unique to that patient's tumor. So unlike broad-spectrum immunotherapy, here we're really going after the tumor-specific molecules, and we hope by virtue of that to have a reduced amount of side effects, because we're not impacting the normal immune system, as it were. Now, would we ever trust an algorithm to give us exactly the right answer? No. So we also do a series of downstream tests using an apheresis sample from that patient to look, for example, at existing T cell memory and any T cell lysis that we might be able to illustrate in a dish. This helps us to reprioritize that list, and then we take the top eight to 10 mutant peptides and we move them into the vaccine setting.
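The "translate the peptides for the mutated genes" step above starts with enumerating every short peptide window that overlaps the mutated residue; each window is then a candidate for MHC class I binding prediction with a tool like NetMHC. A minimal sketch of that enumeration for 9-mers, using a made-up protein sequence (the scoring itself is not shown):

```python
# Sketch: enumerate candidate neoantigen peptides around a missense mutation.
# Every 9-mer of the mutated protein containing the altered residue is a
# candidate for class I binding prediction; the protein here is hypothetical.

def mutant_peptides(protein, pos, alt_aa, k=9):
    """All k-mers of the mutated protein that overlap position `pos` (0-based)."""
    mutated = protein[:pos] + alt_aa + protein[pos + 1:]
    windows = []
    for start in range(max(0, pos - k + 1), min(pos, len(mutated) - k) + 1):
        windows.append(mutated[start:start + k])
    return windows

# Hypothetical 20-residue protein with a missense change at position 10 (V -> L)
protein = "MSTAPKRQWEVDFGHIKLNP"
peps = mutant_peptides(protein, 10, "L")
print(len(peps))  # 9 windows of length 9 contain the mutated residue
```

The same windows built from the wild-type sequence give the paired wild-type peptides, so the predictor can prioritize peptides where the mutation itself creates or strengthens binding.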
For melanoma, as shown here, what we're doing is removing dendritic cells from the patient and conditioning them with GMP-quality peptides that correspond to those eight to 10 at the top of our list. And then the dendritic cell vaccine is infused back into the patient. So if you're not familiar with the immune system, dendritic cells, once conditioned with these peptides, in the body will present them to T cells, and it's only dendritic cells that can evoke T cell memory. So if there's an existing T cell memory to those tumor-specific epitopes, it will be elicited by the dendritic vaccine, or in principle anyway. This is all of course in testing. And so this is the paradigm that we're using, and just to be clear, this is really happening. We have a five-patient FDA-approved IND that's ongoing. As you can see from the progress here so far, patients one through three have already received their vaccines. They're being monitored in two ways: imaging, which is conventional, and then by blood draws, where we're looking at whether we're eliciting T cell memory for any one of those eight to 10 peptides that we've used for the vaccine. And then patients four and five we're just getting ready to go on. I'll have a meeting when I get back next week to work on that transition into patients four and five. So it's too early to say whether we're being successful with this approach, but I think it's a really exciting new way of using genomics to really inform vaccine development. To be clear, dendritic cells aren't the only vaccine platform. It's just what we're using here. And there are other groups that are now pursuing a similar approach. So I think this is now beginning to introduce a new set of potential answers for cancer patients, all stemming from the work that we've been doing in discovery genomics for the past few years. So I'll just finish here by thanking the group back at the Genome Institute. They're all listed.
Also special thanks to my clinical collaborators across multiple oncology targets. Gerald Linette and Beatrice Carreno in particular are the two that we're working with on that last trial in melanoma therapy. And I also wanna thank a couple of my buddies from the genomics sector in analysis who provided the slides for the analytical aspects of the human genome. So thanks for your attention. There's about 10 minutes left and I'm happy to answer any questions, or if you're shy, feel free to come up afterwards and I can answer questions one-on-one as well. Thanks so much. Questions, questions, questions. Now it's all abundantly clear. Daphne. Yeah. So it's a good question. So Daphne spoke softly, so I'll repeat the question, and that was regarding my comment about the 40 to 50% of genes that are mutated but not expressed in the tumor cell. Could those have perhaps been important early in the progression of the cancer but then switched off for other reasons? And it's a great question. The answer is probably, or maybe, possibly, but part of the problem is of course that we have little to no ability to capture that moment in time. So it's a downside of the way that we're doing things nowadays, which is that we're really isolated to whenever that patient was biopsied; that's the sort of look that we have at the tumor, in isolation. Getting progression events, which would be pre-cancerous lesions, for example, early tumor samples, advanced tumor samples, is incredibly difficult. Probably in the leukemia setting it's a little bit easier than in the solid tumor setting, to be clear, but it's a very difficult thing to do. So mostly people have tried to do this with mouse models, often reproducing the mutations that we know are drivers and then resecting tumors at different stages as they develop in the mouse. So you can get some insights there, I think, but the comprehensiveness hasn't been there.
So RNA has always been this hard target that we've even waited to pursue, because the analytical spectrum that you can get out of RNA-seq data is myriad and complicated. So I think those studies will come, but they really haven't been done to that level of detail at this point in time. But it's entirely possible that those mutations were important early on, yeah, sure. Sure, so the question is: in the clinical setting, often, or really always, pathology puts biopsy samples into formalin and then into paraffin, and this is for preservation of cell structure and proteins, but as I may have alluded to, it does pretty horrible things to DNA and RNA. So how do we deal with that in the setting that I described at the end of the talk? I think there are two things to keep in mind. In the clinical setting we're often wanting to get the most recent example of that patient's tumor. So sequencing their primary tumor when they're metastatic is sometimes not a terribly good idea, mainly because, depending upon the treatments that they've gone through, broad-spectrum chemotherapy that damages DNA, radiation which we know damages DNA, what comes out of that in the metastatic setting may be fundamentally changed and different from the initial primary tumor. So that's one thing. The second thing is that formalin, while it doesn't typically introduce DNA backbone breaks right away, does do that over the age of the block, so the older the block is, the more difficult it is to get high-quality sequence. And there are some artifacts that are characteristic of formalin damage that we can actually subtract away, so those are well known now. I think the biggest challenge of formalin-fixed material is just how intact the DNA still is, and if it's a recent biopsy, even if it's gone into FFPE, typically the intactness is still very much there, so we're not as afraid of that as we were.
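The "artifacts characteristic of formalin damage that we can subtract away" are, classically, cytosine deamination events that show up as low-allele-fraction C>T (or G>A on the opposite strand) variant calls. A minimal sketch of that subtraction, with an illustrative VAF cutoff rather than a calibrated one:

```python
# Sketch: flag likely formalin (FFPE) artifacts among variant calls.
# Cytosine deamination in fixed tissue produces low-allele-fraction C>T
# (or G>A) calls; the 10% VAF cutoff here is illustrative only.

def is_likely_ffpe_artifact(ref, alt, vaf, vaf_cutoff=0.10):
    """True if the call matches the deamination signature at low frequency."""
    deamination = (ref, alt) in {("C", "T"), ("G", "A")}
    return deamination and vaf < vaf_cutoff

calls = [("C", "T", 0.04), ("C", "T", 0.45), ("A", "G", 0.05)]
kept = [c for c in calls if not is_likely_ffpe_artifact(*c)]
print(len(kept))  # 2: only the low-VAF C>T call is filtered out
```

Real FFPE filters weigh strand bias, read position, and block age as well, but the signature being subtracted is this one.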
I think the bigger challenge is still that, because we're in a struggle with conventional pathology and we wanna add in the value of genomics, which I think is ultimately quite valuable for most patients, if there's precious little material, then who gets what? And it's always gonna go to pathology because of the standard of care, which I understand, but then we're often left with precious little or none to do genomics. So that's really the challenge there, even more than formalin fixation and paraffin embedding. There's another question here. In the vaccine approach, what's typically the length of the peptides in the vaccine? Yeah, so that would be a nice-to-have. So the question is how long are the peptides that we're using in our vaccine-based approach. Typically they're on the order of nine to 11 amino acids, so fairly short. I alluded to other vaccine platforms. There is a camp in the cancer vaccine forums that likes the idea of a long peptide as a vaccine, so a cocktail of peptides that are in the neighborhood of 20-plus amino acids. Those get very expensive very quickly, especially keeping in mind that these are going into a human being, so they're GMP, so they have to go through rigorous QC and that sort of thing. But yeah, our vaccines are in the nine to 11 amino acid range. Yes, at the mic. Yeah, I don't think the mic is live, but in any case I can hear you. Great talk, Elaine. Thank you. I wanted to know what next-gen sequencing is doing in regards to modifications of DNA such as methylation, which we now know of course is important for pathophysiological processes as well as normal development. Yeah, so there are sort of two camps.
This is regarding methylation of DNA, which is a very common chemical modification and actually takes many forms and flavors, so methylation has some nuances to it. But methylation writ large we're now beginning to approach genome-wide by just doing a bisulfite conversion, where C residues that carry a methyl group are protected from bisulfite conversion, non-methylated C's are converted to T, and then that readout comes with the alignment back to the genome. Fundamental questions for whole genome bisulfite are: what's the right coverage, right? So if we want to get down to single-nucleotide CpGs and whether they're methylated, even if they're in an island, then we're going to have to get sufficient coverage to get good granularity at single-base resolution. So we're still struggling a little bit with that aspect. To date, most studies are reduced representation bisulfite, so you're only looking at a subset of the genome, which keeps it cheap and allows you to do higher coverage, but it's kind of like the exome versus whole genome argument: what are you missing? So we're really going for the whole genome approach and trying to sort out that coverage aspect. Five-methyl-C is also, sorry, five-hydroxymethyl-C is also an interesting target, because it has the opposite effect from generic methylation, but the number of those residues is precious few compared to just plain methyl. So for five-hydroxymethyl there are kits to actually convert and detect, but there the coverage required looks like it's even higher, so there we might be doing a reduced representation, just an exome-capture-type approach, where we're targeting only the known methylated islands in the genome and sequencing just those with the hydroxymethyl conversion. And now there are, I think, kits for five-formyl-C and maybe other flavors of methyl that are coming out, so it's becoming very sophisticated.
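The bisulfite readout described above reduces to a simple count at each cytosine: reads still showing C were protected (methylated), reads showing T were converted (unmethylated), so the methylation level is C / (C + T). A minimal sketch with made-up pileup data:

```python
# Sketch: per-CpG methylation level from bisulfite-converted reads.
# After bisulfite treatment, unmethylated C reads out as T while methylated C
# stays C, so the methylation level at a cytosine is C / (C + T) over reads.

def methylation_level(bases_at_cpg):
    """bases_at_cpg: observed bases from aligned bisulfite reads at one CpG."""
    c = bases_at_cpg.count("C")
    t = bases_at_cpg.count("T")
    if c + t == 0:
        return None  # no informative coverage at this site
    return c / (c + t)

# 10x coverage: 7 reads stayed C (protected = methylated), 3 converted to T
print(methylation_level(list("CCCCCCCTTT")))  # 0.7
```

The coverage question in the talk is visible right here: at 10x, the estimate only resolves methylation fractions in steps of 10%, which is why single-base-resolution whole-genome calls need substantially deeper sequencing.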
Personally I feel like, while methylation is probably interesting, ultimately what I'm gauging, maybe as a surrogate for that, is the RNA-seq alterations, but it's not a perfect or complete picture, so we'd like to do everything, of course, where everything is limited more by your budget than by your imagination. I should also point out, although it's not at human scale just yet, there are aspects of methylation that can be evaluated from the PacBio data, the single-molecule sequencing data, where the dwell time for methylated residues, again speaking generically about methylated residues, is different from un-methylated, and that over multiple samplings can actually be teased out by different algorithmic treatment of the data. That's mostly been looked at for bacterial genomes, and there they have, even more than human, wild and different types of methyl modifications, so it's probably very interesting in that setting for bacteriologists, but it hasn't really scaled to human scale just yet. So that's kind of what's going on in the subset of epigenetics that is methylation; it's fascinating but complicated. Yeah, anybody else? Yes. Oh yeah, we have done a lot of that. I neglected to mention it, I apologize, but those fusion transcripts that we're very keenly interested in in cancer, we've done a lot of evaluation of the fusion transcripts full-length on PacBio, and they're actually extraordinarily easy to pick up. Yeah, so we've done a lot of that, for example, in prostate cancer, where fusions are the drivers as near as we can tell, and the TMPRSS2-ERG and other ERG fusions are key to discover. So yeah, it's a great platform for doing that. You had a question, ma'am? Sorry? Yeah, right.
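The dwell-time idea mentioned above is usually framed as an interpulse duration (IPD) ratio: the polymerase pauses longer at a modified template base, and averaging over many passes against an unmodified control makes the signal detectable. A toy sketch of that comparison, with made-up timings and an illustrative, uncalibrated threshold:

```python
# Sketch: kinetic detection of methylation from single-molecule sequencing.
# The interpulse duration (IPD) at a modified base runs longer than at an
# unmodified one; averaged over many passes, an elevated IPD ratio flags the
# site. Both the timings and the 2.0 threshold here are illustrative only.

def ipd_ratio(sample_ipds, control_ipds):
    """Mean sample IPD divided by mean control IPD at the same position."""
    return (sum(sample_ipds) / len(sample_ipds)) / \
           (sum(control_ipds) / len(control_ipds))

def looks_methylated(sample_ipds, control_ipds, threshold=2.0):
    return ipd_ratio(sample_ipds, control_ipds) >= threshold

control = [0.5, 0.6, 0.5, 0.6]        # unmodified template passes (made up)
modified_site = [1.3, 1.5, 1.2, 1.4]  # polymerase dwells longer here
print(looks_methylated(modified_site, control))  # True
```

This is why multiple samplings matter: any single pass is too noisy, and it is the averaging across passes that lets the algorithmic treatment tease the modification out.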
It's both, so it really just depends on what's in the template. So what I was trying to illustrate there is, if you have a single G, you get X amount of hydrogen ions released according to how many fragments actually incorporate that G, because keep in mind it's a population of fragments, not just a single one. If you have three or four Gs that are in a row, because you have native nucleotides, so no blocking groups or anything like I described for Illumina, all four of those Gs will get incorporated just like that, and four times the amount of hydrogen ions will be produced as a result. So it's an immediacy, really, of incorporation. Polymerases work quite quickly to incorporate nucleotides, and however many there are in the template, that's however many the polymerase will incorporate in that system, and the resulting outflow of hydrogen ions is correlated to how many nucleotides got incorporated. So that's really the secret in the sauce, if you will, for determining what the nucleotide sequence is: gauging that level of hydrogen ion release, where there is an upper limit, as I pointed out, that'll limit your accuracy in those regions. Does that make sense? Okay, anybody else?
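The single-G versus homopolymer behavior described above can be sketched as a toy flow-space decoder: each flow presents one nucleotide, and the hydrogen-ion signal scales with the number of template bases incorporated in that flow. The flow order and signal values here are invented for illustration, and real base callers do far more signal correction than this rounding step:

```python
# Sketch: decoding idealized semiconductor-sequencing flow signals.
# Each flow offers one nucleotide; the hydrogen-ion signal scales with how
# many template bases in a row incorporate it, so signal ~ homopolymer length.

FLOW_ORDER = "TACG"  # a simple repeating flow cycle (illustrative)

def decode_flows(signals, flow_order=FLOW_ORDER):
    """Round each flow's normalized signal to an integer incorporation count."""
    seq = []
    for i, signal in enumerate(signals):
        n = round(signal)  # accuracy degrades for long homopolymers
        seq.append(flow_order[i % len(flow_order)] * n)
    return "".join(seq)

# Template GGGA: no incorporation on T, A, C flows, a 3x signal on the G flow,
# then a 1x signal on the next A flow.
print(decode_flows([0.02, 0.0, 0.05, 2.94, 0.0, 1.1]))  # "GGGA"
```

The upper limit she mentions falls out of this model directly: as homopolymers get longer, the gap between, say, a 7x and an 8x signal shrinks relative to noise, so the rounding step starts making errors.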
Yes, okay, yes, yes and yes. So where does Nanopore fit best is the question, in terms of different types of sequencing. I mean, all things being equal, and assuming, you know, that it will improve over time as other types of sequencing have, I think that there are broad applications for Nanopore, basically because if you can take a laptop and a, you know, thumb-drive-looking sort of device, I mean, you could even be sequencing in the depths of the jungle if that's what interests you, assuming you can find a place to plug your laptop in sooner or later. But seriously, my impression of Nanopore is that, much like PacBio, the read lengths should be long, all things being equal, so if you have the right preparatory method to get that DNA ready for the chip, then you load it in and off it goes. So PacBio, because of the long reads, is being used a lot, and I didn't mention it because it's not my area, but it's being used a lot in plant sequencing, where the genomes are large, sometimes even larger than human, but highly repetitive, and repetitive over very long stretches, whereas human repeats tend to be sort of choppy unless they're big segmental duplications. So it's really important, I think, for ag-bio, and you see a lot of ag places really interested in it, and already using PacBio, and there have been some nice reports about wheat, for example, which has a large genome, being sequenced by PacBio, et cetera. So I think it applies across the scale. And then, as I just mentioned in response to the other question regarding methylation, obviously in bacterial systems this can be very informative data, so clinical microbiology may be interested in this as well, and there are some early indications that even in the Nanopore setting you may be able to detect different methylation residues on the DNA as it's translocating through the pore; that would be similar in that regard to what I described for PacBio. But for those of you who have seen the PacBio instruments, they are physically quite large, so it's not something that at this day and age reduces down to sequencing in the depths of the jungle. So maybe there are some portability aspects to the Nanopore that could be really important for that type of remote sequencing, as well as other applications. This isn't something I think about a lot, but think about, for example, forensic sites and that sort of thing, where the immediacy of preparation and sequencing might be incredibly important for the output. So yeah, open up your imagination, because I think it's gonna be sort of all available, given enough electricity and enough compute cycles to actually make sense out of the data. Okay, we should probably wrap up, I think, but I really wanna thank people for coming, and hopefully it's been informative, and again, if you're shy, come up and I'll answer some questions here. Thanks so much.