Please join me in welcoming today's speaker, Dr. Elaine Martis. All right, thanks so much, Andy. I always feel so much better about myself after I hear your introductions. It's fantastic to be here, and thank you for joining me today to talk about my first love, which is sequencing technologies. And so I think we should just get going without hesitation here and jump right in. Now, I do have one conflict to announce and to keep in mind throughout the presentation, which is that I am on the supervisory board for QIAGEN, which is a company based over in Europe. OK, so let's just jump right in, I think, and talk about what I like to refer to as massively parallel sequencing. You'll also hear me slip up sometimes and talk about next generation sequencing. They're the same thing. And I just want to walk you through the basics behind this, then move forward into some newer sequencing technologies and finish up with just a little vignette on how we're using all of these technologies and the associated bioinformatics pipelines, and really beginning to make inroads in changing the course of outcomes for cancer patients. So I'll just leave you with that little teaser, and you can hopefully enjoy it at the end. So as Andy said beautifully in his introduction, massively parallel sequencing and next generation sequencing have really transformed biomedical inquiry. You can see this output-per-instrument-run figure shown here, from a little perspective that I wrote for Nature in 2011, cited at the bottom of the slide, that really shows the magnificent jump in the amount of sequence data that we could generate with the advent of next generation sequencing devices between 2004 and 2006. But above and beyond just the sequence output, which has continued to climb in a radical way, as I'll talk about in just a moment, there are really other procedural aspects of next generation sequencing that have freed us from some of the old ways and really contribute to this overall acceleration in our ability to generate sequencing data. So just for purposes of illustration, you can see sequencing as I learned it back in the day, where we did a lot of bacterial work with subcloning, plating, and DNA preparation of individual subclones. And then, importantly, we did a separate set of sequencing reactions on those subclones, followed by a separate electrophoresis and detection step. So we were really decoupling the molecular biology of sequencing from the actual sequence data generation. The contrast here is remarkable if you look in panel B, which illustrates the stepwise process for next generation sequencing: it starts with just standard fragmentation of DNA, which is done using sound waves or other shearing-based approaches. We do some repair on the ends of the DNA and put adapters onto the ends with a ligase enzyme, attach these to some kind of a surface, amplify those in situ, and then proceed, importantly, to a combined molecular biology and sequencing detection step. So rather than the separate process up here in old-style Sanger sequencing, next generation or massively parallel sequencing does everything together at the same time. So let me just illustrate the differences in a little bit more detail in terms of how massively parallel DNA sequencing works. First of all, as I already alluded to, you have to create a library to do sequencing on. But actually, library generation is a very rapid process. It can be completed even by a high school senior with a reasonable attention to detail.
My daughter is a testament to this back in the day, and it can be done in the period of an afternoon, or maybe a day if you're not in a super big hurry. So this library approach is really just using these custom linkers or adapters, as they're called. They're attached to the ends of the DNA with a ligase enzyme, as I mentioned. And over time, the genesis of different library kits has led to kits with much more efficient ligation procedures. This is important for low-input DNA, which we'll talk a little bit about. And this is really now a wide-open field for additional commercial development in terms of improving library kits for next-gen sequencing platforms. The other aspect, which I showed in that mini figure at the bottom of the last slide, is that we do need an amplification step for these resulting adapter-ligated fragments. So why is that? Well, unlike some of the technologies I'll talk about in a few minutes, this is sequencing of a population of fragments. So we require amplification of each one of these library fragments so that the downstream molecular biology and detection actually works. In other words, the instruments that I'll be talking about here for the next few minutes aren't sufficiently sensitive to sequence from a single molecule. Rather, they need the amplification of that molecule into multiple copies in order to generate sufficient signal to be seen by the imaging optics or other detectors that are on the sequencing instrument. So the way this amplification is accomplished is by enzymology. First, there's an attachment to a solid surface. This can be either a bead, which is round, of course, spherical, or a flat, silica-derivatized surface. And this just depends on the different technologies, as I'll illustrate in a few minutes. And the way that this attachment happens is that the surface of the bead or the flat glass is covalently derivatized with these complementary adapters. And so they're available there for the hybridization of those library fragments, followed by an enzymatic amplification. So it's a really straightforward way to attach these library fragments onto the surface and get them amplified up so that you can see them. The next step is really this combined molecular biology of sequencing, often referred to as sequencing by synthesis or SBS, with the detection of the nucleotide base or bases that have been incorporated in the molecular biology reaction. And so this is a stepwise process, as you'll see from the illustrations that follow, where you provide the substrates for sequencing, let the sequencing reaction happen, and then detect it as a subsequent step in the process. And really the thing that distinguishes massively parallel sequencing from Sanger sequencing, as I've already alluded to, is that we're not sequencing 96 reactions at a time, which was the maximum per-machine throughput in the past for electrophoresis and detection approaches. Rather, because we can use multiple beads or decorate the surface of this flat glass with hundreds of thousands of these library fragments, we can literally generate the DNA sequence needed to sequence an entire organism's genome in a single run of the instrument. So you're really talking about massively parallel sequencing as being hundreds of thousands to hundreds of millions of reactions that are detected in the stepwise process on the instrument, all happening at the same time.
So the throughput acceleration, as you saw from that graph, is extraordinary over what we used to be able to do with Sanger, where we would just buy more and more machines the more sequencing we needed to do. And then lastly, and I'll talk about this a little bit, keep in mind that because each one of these amplified fragments starts as an original single molecule, you're really getting digital read count information. What that means is that if we have, for example, a portion of the human genome that's amplified, so multiple copies beyond the normal ploidy are there, a great example would be HER2 amplification in specific subtypes of breast cancer, you can literally go in and count the fold amplification of that locus in the genome relative to diploid regions and really understand with exquisite detail the number of extra copies that are there, and actually, in whole genome sequencing, the boundaries, so the exact start and stop positions on that chromosome where the amplification has occurred. So this is incredibly powerful. And going to RNA, you can also calculate very exact digital expression values for given genes from RNA sequencing data, which starts with RNA, converts it to DNA, and goes through very similar processes for the sequencing as I'll describe for DNA here in just a moment. Lastly, and we'll get to this important point in just a moment when it comes to talking about bioinformatics, or the analysis of sequencing data, unlike those Sanger sequencers of the past, one of the downsides or confounders of massively parallel sequencing is that the overall read length, the number of bases that you sequence from any DNA fragment, is actually quite a bit shorter than we're used to seeing from conventional Sanger sequencing. So typically, back in the day, Sanger read lengths were on the order of about 800 base pairs, let's just say for the sake of argument, whereas most massively parallel sequencers will give you round about 100 to 200, maybe 300 base pairs. So significantly less. And when we're talking about analyzing data from a whole human genome, this can actually lead to some significant consequences in the analysis of that data. Okay, so let's get deep in the weeds a little bit on the molecular biology steps and other aspects of massively parallel sequencing. So I mentioned the need for constructing a library prior to sequencing. As I talked about already, we fragment the DNA into smaller pieces, starting from high molecular weight, isolated genomic DNA, for example. There are a variety of different steps that are listed here, enzymatic steps that are a workup to this adapter ligation, which is important for the purposes of the subsequent amplification of fragments and for the sequencing as well. And so these are really just the stepwise processes. We also can now include in this adapter a so-called DNA barcode, which is a stretch of eight or more nucleotides with a defined sequence. What that means is that we can ultimately take fragments from different libraries, mix them together into an equimolar pool, sequence those all together with this increasing throughput that I've told you about, and actually generate data from a multitude of different individuals all at the same time, and then deconvolute that pool using the DNA barcode information once we have the sequence data available to us.
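To make that deconvolution step concrete, here is a minimal Python sketch, purely illustrative and with invented barcodes and reads, of how pooled reads can be binned back to their source libraries using the barcode read. This is an illustration of the idea, not any particular demultiplexing tool.

```python
# A minimal sketch (illustrative only, with invented barcode sequences) of
# demultiplexing: each read carries an index read that sampled the DNA barcode,
# and reads are binned back to the library / individual they came from.
from collections import defaultdict

barcode_to_sample = {       # hypothetical 8-base barcodes -> sample names
    "ACGTACGT": "patient_01",
    "TTGCAATG": "patient_02",
    "GGATCCAA": "patient_03",
}

pooled_reads = [            # (index read, genomic read) -- made-up data
    ("ACGTACGT", "TTAGGCAT..."),
    ("GGATCCAA", "CCGTAAGG..."),
    ("ACGTACGT", "GGCATTAC..."),
    ("TTGCAATA", "ATATCGCG..."),   # unknown barcode, goes to "undetermined"
]

bins = defaultdict(list)
for index_read, genomic_read in pooled_reads:
    sample = barcode_to_sample.get(index_read, "undetermined")
    bins[sample].append(genomic_read)

# A real demultiplexer would usually tolerate a mismatch or two in the barcode.
for sample, reads in bins.items():
    print(sample, len(reads), "reads")
```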
So this just adds another read that samples the DNA barcode information to identify the library that the fragment came from. Once we have these libraries created, we do a quantitation. That tells us the dilution of the library that should go onto the sequencer or into the pool. And then we either proceed directly to whole genome sequencing, or we can perform exome sequencing or specific gene hybrid capture approaches, which I'll tell you about just next. So I want to talk first about this amplification of the library that's required, which uses PCR and the adapter sequences to increase the number of fragments that are in the sequencing population. And the reason for mentioning this is because there are some confounders here that you have to know about in terms of the downstream data analysis. So let me walk you through this. Obviously PCR has been with us for quite some time now, since the mid-80s, and it's a very effective way of amplifying DNA, but there are some downsides to it. In massively parallel library construction, because we're doing PCR after the adapter ligation, you can actually get preferential amplification. This is sometimes referred to as jackpotting. And what this means is that smaller fragments just tend to amplify better, and so you may get an over-representation of those fragments in the read population, which leads to duplication. And the problem here is that if you also incorporate, for example, a polymerase error early on in the PCR amplification, the multiple copies that come from duplicate reads can make that error masquerade as a true variant when you go downstream to analyzing the data. So we need to be aware that jackpotting can occur. This used to be a problem, but we now have good algorithms that essentially look for the exact same start and stop sites among reads aligned onto the genome and eliminate all but one representative copy of those duplicate reads. So this is algorithmic now, whereas we used to have to do very careful examination of the aligned reads back in the day. And I think about jackpotting a lot in my work because in cancer samples, you often receive a very, very tiny amount of tissue from which DNA can be extracted. And of course, the less DNA you have to put into a library, the more of a problem this becomes. And with formalin fixation and paraffin embedding, often the DNA is also fragmented into small pieces even before the fragmentation step, and smaller pieces actually lead to an increase in jackpotting and duplicate reads. So this is a real concern for the data analysis piece. I've already kind of alluded to this in my previous comments about jackpotting, but we can get some false positive artifacts because PCR is an enzymatic process and it can introduce errors. It's not perfect. And if an error occurs in early PCR cycles, it will appear as a true variant. And then lastly, I'll talk about cluster formation, or amplification: getting these fragments onto the solid surface that you're gonna do the amplification on is actually done by a type of PCR. And this can introduce bias in amplifying high or low G plus C content fragments, since there's no guarantee of the base content of any given fragment. Some will actually amplify better or worse depending upon the percentage GC. And as a result, you may not actually detect these fragments from the library as well as something that has a more balanced ATGC composition. So these are all considerations that come from the use of PCR.
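To illustrate the duplicate-removal idea just described, here is a toy Python sketch that collapses aligned reads sharing identical start and stop coordinates down to one representative. Real duplicate-marking tools do considerably more; the reads, coordinates, and quality values below are hypothetical.

```python
# A toy version (assumed, not any particular tool's algorithm) of duplicate
# marking: aligned reads sharing the exact same start and stop coordinates are
# collapsed to a single representative, keeping the highest-quality copy.
from collections import defaultdict

reads = [
    # (read_id, chrom, start, end, mean_base_quality) -- hypothetical values
    ("r1", "chr17", 41_200_100, 41_200_250, 36.1),
    ("r2", "chr17", 41_200_100, 41_200_250, 34.8),  # likely PCR duplicate of r1
    ("r3", "chr17", 41_200_180, 41_200_330, 35.5),
]

by_position = defaultdict(list)
for read in reads:
    read_id, chrom, start, end, qual = read
    by_position[(chrom, start, end)].append(read)

kept, duplicates = [], []
for group in by_position.values():
    group.sort(key=lambda r: r[4], reverse=True)  # best quality copy first
    kept.append(group[0])
    duplicates.extend(group[1:])

print("kept:", [r[0] for r in kept])
print("flagged as duplicates:", [r[0] for r in duplicates])
```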
The trend over time now has actually been to make the DNA sequencing instruments more sensitive, so that you have to do fewer cycles of PCR in the library amplification and solid surface amplification processes. And this can lead to a reduction in jackpotting and also better representation of high-GC and low-GC fragments. But I would argue that it's still not perfect just yet. So these are things to be aware of. Now, I alluded to hybrid capture, and just historically, when we first had next generation sequencing instruments, whole genome sequencing was sort of the only option. But around about 2009, several groups developed this approach in various forms and flavors. And it gives us an opportunity to use DNA:DNA or RNA:DNA hybridization kinetics to subset the genome down to just the regions of the genome that we care about. So this is often referred to as exome sequencing, where the probes that are used in hybrid capture correspond to all of the genes annotated in the reference genome. But of course, you don't have to look at all the genes. You can actually look at a subset of the exome, such as all kinases, for example, and generate custom hybrid capture probe sets to do this. So how does this work? Well, like everything, we start from whole genome sequencing libraries. But in this case, the way the subsetting happens is that we combine the whole genome library first with this capture reagent, which really consists of just synthetic probes that are synthesized to correspond to the exons of the genes that you're interested in. And the secret sauce here are these blue circles that are shown on the surface of the probes, which reflect the presence of biotinylated nucleotides that are part of those probe sequences. So in mixing together the synthetic probes with the whole genome library under appropriate conditions for DNA:DNA or DNA:RNA hybridization, sometimes these probes can be RNA instead of DNA, you effect a hybridization between the DNA fragments in the library and the corresponding sequences of the probes, which represent the regions that you wanna focus on in your sequencing. The secret sauce biotin comes into play when you now actually wanna isolate out the hybrid-captured fragments of your library. And this is done by mixing the whole hybridization mixture with streptavidin-linked magnetic beads. And of course, biotin binds very tightly to streptavidin. So then by applying this horseshoe magnet, or some facsimile thereof, you can selectively pull down the hybridized fragments from your library and throw away all of the regions of the genome that you don't wanna sequence. This isn't perfect, so we do several washes to release additional spuriously hybridized fragments. And then we can simply denature away the library fragments from the capture probes. The probes stay with the magnetic beads, and the hybridized library fragments that have been selected out float free in solution and can then be amplified and sequenced after they're quantitated. So this has actually reached a fairly high degree of art again, where we can take these barcoded DNA libraries, make an equimolar pool, and just go against one aliquot of the exome capture reagent for all of the library molecules from all of the individuals that we wanna sequence.
Sequence those out on a single lane of the sequencer, for example, and then deconvolute the data into different pots that correspond to the DNA barcodes, and indeed to the different individuals that have been sequenced. So this is now very, very high throughput, where we can combine 12 to 14 individuals from an exome capture into a single lane of a high throughput sequencer and get that information very, very rapidly. And then just to finish, as I mentioned, we don't have to do the whole exome; we can design custom capture reagents just for specific loci or genes of interest and only study those, from the standpoint of winnowing down the whole genome to the regions we care most about. Now, that's not a perfect or infinite winnowing. So just to be clear, there's a sort of lower limit, around three or four hundred kilobases of targeted sequence, below which hybrid capture stops being an efficient way to sequence. Below that, for purposes of sequencing efficiency and decreasing the amount of spurious hybridization or off-target effects, you really need to go to something that's more like this. So back to our old friend PCR: we now have ways to design primers for PCR amplification across different loci in the genome to amplify out small numbers of genes in their entirety, take these multiplex PCR products, turn those into a library, and sequence those directly. And this is really best for very small regions of the genome, as I said, below about three or four hundred kilobases, where you don't wanna pay a price in off-target sequencing effects. This is also not a perfect approach because, of course, it's hard to come up with PCR primers that all play well together in the same PCR amplification, with GC bias and those sorts of things coming into play. But actually, most manufacturers of these primer sets now have the ability to either sell you something that's already configured and well tested in terms of giving good representation, or you can work with the manufacturers to design a custom set of multiplex PCR amplicon primer sets as well. Okay, so back to the actual sequencing reaction. Now that we've decided to do whole genome sequencing, an exome or a subset, or a multiplex PCR, how does the actual sequencing reaction work? And since I'm a chemist by training, this is the part that I really enjoy talking about, so hopefully you'll indulge me here. In this illustration, what we're gonna be focused on is really the Illumina sequencing process in particular. So let me walk you through this. Now, earlier I said that we have this flat silica surface that has adapters covalently attached to it, and those sequences correspond to the adapter sequences on our library. Okay, so once we've got our library quantitated, the instrument will introduce the library fragments onto the surface, and you'll get hybridization of individual fragments under the appropriate conditions. And of course, the reason for carefully diluting these fragments is so that you get the right distribution of these amplified clusters across the surface of the flow cell that's then gonna be viewable by the instrument optics. What follows next are a series of amplification steps which I referred to earlier as bridge amplification.
The reason for that is that in the course of this amplification, the free end of the library fragment finds its complement down on the surface of the chip, and then a polymerase, by a stepwise process, builds up increasing numbers of fragments in situ for each one of these hundreds of millions of library fragments that are down on the surface of your chip. So at the end of this bridge amplification cycle, you might end up with a cluster of fragments that looks like this, sort of on the order of 100,000 or so copies of the exact same molecule. And if you image this cluster, it would look like this sort of bright dot. And if you look at a bunch of clusters that are all together in one small area of that chip, they would look a little bit like a star field. And indeed, the oldest versions of the software for this type of sequencing were really derived from individuals who had previously been studying deep space images. So it's a little bit like deconvoluting that, where you have to identify the cluster and then isolate its signal from all other adjacent clusters as best as possible so that you get the truest set of signals coming out of it. Now, we don't sequence this amplified cluster just like this. We actually have to go through a series of steps that releases one free end of all of the molecules in the cluster. There's just a single representative shown here. But on that freed end, we then hybridize a sequencing primer that corresponds to part of the adapter sequence. And this is pointing now down towards the surface of the chip. And as you'll see here in this blow-up of the fragment to be sequenced, we can then get a polymerase molecule to recognize that DNA-DNA hybrid. And now, with the inclusion of sequencing substrates such as these labeled deoxynucleotides, we can start our sequencing process. So this is now the amplified fragment shown in isolation, but imagine that there are hundreds of thousands of copies of this in the cluster. They've all been hybridized by this very specific primer sequence here, and we've got at its end a free three prime hydroxyl for the polymerase to begin adding on nucleotides. In the Illumina process, these nucleotides are very specialized, as they are in other platforms, but in particular they have two attributes that are shown here. One is that they have a fluorophore that's specific for the identity of the nucleotide. So A fluoresces at a different wavelength than C, G, and T. The second thing that's specialized about them is that the three prime hydroxyl group is blocked with a chemical blocker. The reason for this is that in each of the sequencing by synthesis steps for the Illumina process, you ideally want to add in just a single nucleotide at a time, detect it with the optics, and then remove the block from the three prime end so that you can bring in the next nucleotide, G in this example, and cleave the fluorophore, so that when this new G nucleotide gets imaged by the optics of the instrument, there's no leftover residual T fluorescence to interfere with the identification that it is in fact a G that's been incorporated. So ideally, what we would end up with at the end of these two cleavage steps is a free three prime hydroxyl and the absence of a fluorescent group where there was one, so that the next step of incorporation can be successfully detected.
Now, this is a point where you might be asking yourself, well, this sounds really great, why can't we just sequence this entire fragment, make the fragments even longer than 300 or 400 base pairs, and then get really, really long reads out of this technology so our lives would be simpler? I would love it. The limitation here is signal to noise. Okay, so two things contribute to that. One, chemistry is never 100%, so although you try to cleave all of these fluorophores off, there will be some residual fluorescence that remains, and that will interfere with subsequent imaging cycles. Those fluorophores might get cleaved in later cycles, but in the meantime they are there to interfere, and the cleavage, as I pointed out, is never 100%. Similarly, there may actually be the absence of a blocking group on some of the nucleotides, so rather than just incorporating the T in this first cycle, I might actually incorporate a T without the blocker, and then a G can come in right away, because everything is supplied at once in this type of sequencing, and then I would get a set of fragments that are so-called out of phase. That means they're now sequencing one nucleotide ahead of everybody else in the population, and over time this is an increasing phenomenon. So what happens over increasing cycles of incorporation with this approach is that noise increases, and at some point it becomes equal to the signal that's being produced by all of these fragments being sequenced in the cluster, and so you begin to lose the ability to define with high accuracy which nucleotide just got incorporated into the fragment. This gets worse over time. The first Illumina Solexa sequencers that we used back in the day in 2007 had read lengths of about 25 base pairs. The current read lengths are now 150 base pairs, so there has been an improvement over time in the read lengths that are available. And similarly, after we go through one set of sequencing like I've just showed you, coming from this end of the adapter, we can now go through some additional amplification cycles, release the other end through a different chemical cleavage, prime it with a different primer, and sequence the opposite end of the fragment. So this is so-called paired end sequencing, where we can now collect 150 base pairs from each end of the fragment, and those pairs, as I'll talk about in a minute, can map back to the genome of interest. Now, just a couple of last slides on Illumina. Their overall approach has changed. Whereas they used to just have these empty lanes on the silica-derivatized surface shown here, now they're actually patterned into these little pits on the surface of the flow cell, so each lane consists of hundreds of millions of these in a very defined order, sort of like a honeycomb. This is a so-called patterned flow cell, and what this allows you to do is pack the clusters very, very closely together and also to not have to find the clusters; you know where they are, essentially, based on how that flow cell sits in the instrument and the fact that this set of patterned pits on the surface is a very uniform array.
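As a rough illustration of why phasing limits read length, here is a small simulation in Python. The per-cycle phasing and pre-phasing rates are assumed values chosen for illustration, not figures from any instrument's specifications, and the cluster size is scaled down for speed.

```python
# A back-of-the-envelope simulation (my own illustration, not Illumina's model)
# of why reads can't be arbitrarily long: if a small fraction of molecules in a
# cluster falls out of phase every cycle, the in-phase signal decays toward the
# background produced by the out-of-phase molecules.
import random

random.seed(0)
cluster_size = 10_000       # molecules per cluster (scaled down from ~100,000 for speed)
p_phasing = 0.004           # chance a molecule fails to extend in a cycle (assumed)
p_prephasing = 0.002        # chance a molecule extends twice in a cycle (assumed)

positions = [0] * cluster_size   # bases incorporated so far by each molecule
for cycle in range(1, 301):
    for i in range(cluster_size):
        r = random.random()
        if r < p_phasing:
            continue                                      # lagging: no incorporation
        positions[i] += 2 if r > 1 - p_prephasing else 1  # leading vs. normal extension
    if cycle in (25, 100, 150, 200, 300):
        in_phase = sum(1 for p in positions if p == cycle) / cluster_size
        print(f"cycle {cycle:3d}: {in_phase:.1%} of molecules still in phase")
```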
I mean, so now what we get, and this is a highly idealized shot from their website, is that this is where the amplification reaction takes place, and in the best possible world, all of the regions around this particular portion of the patterned flow cell are entirely clean, and so you get a very clean, distinct signature from it and from all of its companion wells as well. Okay, and then just to finish with Illumina, this is just a shot from their website to show you one thing, which is that you can sequence a little or you can sequence a lot. To cut to the chase, this is their highest throughput instrument, the HiSeq X, which can sequence on the order of, I think it's like, 12 human genomes in a 24 hour period, so it's very, very high throughput, and this is more like the desktop sequencer. And if you talk to people in the field about Illumina's strengths and weaknesses, you'll find that the accuracy of the sequencing is high, so less than 1% error rate collectively on both reads. There's a range, as you can see, of capacity and throughput. Some of these platforms, the MiSeq in particular, have relatively long read lengths; you can get 300 base pair paired-end reads from it, so that improves part of the problem as well, but it's a lower throughput sequencer, so you can't sequence a whole human genome on it. And then there are some improvements that have been coming along over time, including the ability to do cloud computing, which we'll talk about in a minute. Now let me shift gears to a different type of sequencer than the fluorescence-based Illumina sequencer, just for the sake of completeness. This is using a different idea, which is the fact that when you incorporate a nucleotide into a growing chain that's being sequenced, there's actually a release of hydrogen ions. So this is using that release of hydrogen ions, and the resultant change in pH, to detect when and how many nucleotides have been incorporated in the sequencing reaction, and it is offered in the form of the Ion Torrent sequencer, which is available commercially as well. The idea here is bead-based amplification. You can see the round bead here with the derivatized surface having these adapters to which the library molecules are individually amplified, so the best case scenario is that each bead represents multiple copies, again, of the same library fragment. This is done in an emulsion PCR approach, where you mix together and make micelles that contain, in the best case scenario, a single bead, a single library fragment, and all of the PCR amplification reagents that are necessary, going through PCR-type cycles to decorate the surface of this bead with copies of the library fragment that you would like to sequence. These beads are then loaded onto a chip, which is this sort of idealized structure here, and it consists of two parts: the upper part, where the bead sits and the nucleotides flow across, is really the molecular biology part of the action, if you will, and the lower part is really just a very miniaturized pH meter that senses the release of these hydrogen ions during flows of different nucleotides and registers the corresponding amount of signal to tell you which nucleotide was incorporated, based on which one is flowing across at the time, and how many of those were incorporated. Well, how does it get to how many? Now, these are native nucleotides, so they have no fluorescent groups, no modifications whatsoever, so keep that in mind. So if you have a string of A's, for example, in your template that's being sequenced, you
can incorporate as many T's as needed to correspond to the number of A's that are there. The way that we discern what got incorporated, and how many, is based on a key sequence, which is shown here. This is on the adapter itself that's used to make the library, and it is the point at which the sequencing begins. And so when we flow through a defined set of the four nucleotides, we will get a signal from each one of these incorporations that's equivalent to a single base worth of pH change, if you will, and that sets the standard for what a single nucleotide incorporation looks like. So that's the key sequence, which then forms the template off of which all subsequent incorporations are gauged. So in the case where you have four A nucleotides in a row, you'll have a signal that's approximately four times greater in terms of the pH change compared to a single nucleotide incorporation, and the software can go through after the run and evaluate this, and the resulting sequence comes out and is available for downstream interpretation. You can kind of see this idealized here, where we have the key sequence and then multiple incorporations, some of which spike to higher degrees of signal, of pH change, than others. And so this is really the way that the sequencing takes place in the Ion Torrent system. This has two platforms available: again, more of a desktop sequencer, called the PGM, and then a larger throughput sequencer called the Proton, and you can see the different attributes here. Just to point out, this is not paired-end sequencing, but because of the read lengths here, up to about 400 base pairs, you just sequence from a single end, so there's not a read pairing that happens with this type of an approach, but the read length is longer. And if you look across the various attributes of the sequencing platform, we really like to use this oftentimes to counter-check the Illumina sequencer, because it has a very low substitution error rate: since each nucleotide type flows one at a time with a wash in between, you almost never see a substitution error where the wrong nucleotide got incorporated, because of the way that reagent flow works. However, because of the relative pH change that I just talked about, when you get above a certain number of the same nucleotide occurring in a row, you can actually lose the linearity of response, and so this shows up as insertion-deletion errors in a single read. However, an averaging approach with multiple reads of coverage across that region of a homopolymer run can usually get you to the right answer, so it's a consensus accuracy approach that's really needed in this type of sequencing. And this is a relatively inexpensive, fast turnaround platform for data production. So as I said, we typically use this in the lab for focused sets like the multiplex PCR that I talked about earlier, and also for data checking of variants that we wanna proceed with from our Illumina sequencing pipeline.
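Here is a much-simplified Python sketch of the flow-space idea just described: the key sequence calibrates the signal of a single incorporation, and each subsequent flow signal is divided by that and rounded to estimate homopolymer length. The flow order, key signals, and template signals are invented for illustration; the real base caller is considerably more sophisticated.

```python
# A simplified sketch (assumed, not the vendor's actual base caller) of
# flow-space base calling: the known key sequence calibrates what a one-base
# incorporation looks like, then each later flow signal is divided by that and
# rounded to an integer number of incorporated bases.

flow_order = "TACG"                       # repeated nucleotide flow cycle (assumed order)
# Hypothetical raw pH-change signals for the key-sequence flows, each of which
# corresponds to a single-base incorporation.
key_flows = [1.02, 0.97, 1.05, 0.99]
one_mer = sum(key_flows) / len(key_flows) # calibration: signal per single base

raw_signals = [0.02, 2.05, 0.01, 0.98, 3.9, 0.03, 1.1, 0.05]  # made-up template flows

sequence = []
for i, signal in enumerate(raw_signals):
    base = flow_order[i % len(flow_order)]
    n_incorporated = round(signal / one_mer)      # homopolymer length estimate
    sequence.append(base * n_incorporated)

print("called sequence:", "".join(sequence))
# Long homopolymers are where this rounding loses linearity, which is why
# consensus across multiple overlapping reads is needed for accurate indel calls.
```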
Okay, let's talk just a little bit about bioinformatics. I don't wanna get too deep in the weeds here, but I want to give you an appreciation of the challenges of short read sequencing. I'll focus my attention here on the human genome, which is the one that I study the most, but you can substitute any reference genome in place of the human genome in these slides. Really, what we're doing here now with these short read technologies, unlike back in the day with Sanger long reads, is not an assembly of the sequencing reads, where we try to match up long stretches of similar nucleotide sequence and build a contig, a fragment of long sequence, over time. Here, with short read sequencing, and especially with genomes as complicated as the human genome, you actually have to align reads onto the reference sequence rather than assemble reads. The smaller you get in terms of genome size, so viruses and some simple bacterial genomes, you actually can do assembly, but for large, complicated genomes it's really not practical. So alignment of reads to the human or other reference sequences is really the first step before you go and identify where the variants exist. And in the context of using paired-end reads, we can actually identify, for example, a chromosomal translocation, where a group of reads on one end maps to one chromosome and the other end of the fragments maps to another chromosome, thereby identifying that something has gone on there to marry up those two chromosomal segments together. And I alluded to RNA sequencing earlier. RNA sequencing data goes through the exact same process of alignment followed by downstream interpretation. So there are no differences, and in fact, across all the things you can do with omic data, alignment to the reference genome is really always the first step. So just think about it like this: the human genome is large, three billion base pairs, with lots of repeats, about 48% repetitive. So it looks a little bit like this jigsaw puzzle, where all the repeat space is the grass and the sky, and the tree is the genes that are interspersed into all of this. So here are all of your short read sequence data, for which you have to figure out where the original came from when you made your library, and a lot of the pieces look a lot like each other. So it's sort of difficult to figure out exactly where they go, but when you find this tree, you can accurately place that with a pretty high degree of certainty. And so how do we deal with this sort of confusion about mapping where, and how accurately? That's really because we've been able to come up with a variety of statistical measures of certainty: if a read can map here or here, our best possible guess comes from a variety of mapping scores that tell us the read is most likely to map here as opposed to the other places. This has been tremendously enhanced, I should point out, by paired-end sequencing, because oftentimes you can get a read end that maps into a repetitive sequence, but as long as the other read from the other end of the fragment maps out into unique sequence, you increase the certainty of mapping for that particular fragment in the genome. So once we have the reads aligned, then we have to go through a series of steps that I won't spend a lot of time on, but just given the sheer number of them, I want to let you know that this is an important aspect before you actually go into variant detection.
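Before listing those steps, one quick note on the mapping scores just mentioned: mapping confidence is typically reported as a Phred-scaled mapping quality, as in the SAM format, where MAPQ equals minus ten times the log10 of the probability that the placement is wrong. A tiny converter, purely to make those numbers concrete:

```python
# Phred-scaled mapping quality: MAPQ = -10 * log10(P(placement is wrong)).
import math

def mapq_to_error_probability(mapq: int) -> float:
    return 10 ** (-mapq / 10)

def error_probability_to_mapq(p_wrong: float) -> float:
    return -10 * math.log10(p_wrong)

for mapq in (0, 10, 30, 60):
    print(f"MAPQ {mapq:2d} -> P(misplaced) = {mapq_to_error_probability(mapq):.4f}")

# A read that maps equally well to two repeat copies has P(wrong) ~ 0.5,
# i.e. MAPQ ~ 3 -- which is why a uniquely mapped mate in a read pair helps.
print(f"two equally good placements -> MAPQ ~ {error_probability_to_mapq(0.5):.0f}")
```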
So if you're interested in structural variation or other aspects of read pairing, you have to go find the read pairs that are properly mapped at the distance you expect based on your average library fragment size, and the ones that are not, if you then wanna go on to identify structural variations, as I talked about. You also wanna mark and eliminate the duplicate reads; this is done through an algorithmic approach. Correct any local misalignments, which is just getting rid of sequences that are aligned improperly. Calculate quality scores, and then finally go through a variant detection process, which is again an algorithmic look at how well the sequence of your target maps back to the reference. This allows you to identify single nucleotide differences between your sequence and the reference sequence, and on average in the human genome, you see about three million of these per individual sequenced. We then need to evaluate coverage, because coverage is everything. Coverage really refers to the number of times that you've oversampled that genome: how deep is the sequencing on any given region of the genome? And what this really goes to, ultimately, is your certainty of identifying a variant that's real, as opposed to something that simply doesn't have enough coverage to support that the variant is accurate. And so one of the things that we do in terms of evaluating coverage is to compare the SNPs that we've identified from our next-gen coverage to SNPs that come from array data, like a genotyping array. If you have a high concordance there, you probably have sufficient coverage on your genome to look at those data downstream in an interpretive way. We can also compare SNPs from tumor to normal in the cancer sequencing realm, where we've sequenced both the cancer genome and the normal genome, and identify the number that are shared between those two genomes, because of course the constitutional SNPs of any individual will come through in that tumor genome as well as the somatic variants that are unique to the tumor itself. We can also look at the data; this is comforting to people like me who over the years were used to looking at autoradiograms and then chromatograms on a computer screen, and now you're sort of like, what do I look at? But as I'll show you in a minute, there's a nice viewer that we often use for what we call manual inspection of sequencing data, and we can also generate tools to give us information about coverage, as I'll show. And when all of these things check out, then and only then can we finally analyze the data to interpret the variants that we find there. So here's just a quick look at IGV, which is the commonly used tool across next-gen sequencing laboratories to look at coverage and other aspects. You can get a whole chromosome view, or you can zoom in to specific areas. You can see here in the gray bars all of the coverage that's resulted in this area, and you can even get down to the single nucleotide level to identify the clear presence of a variant compared back to the human reference genome. And here's another IGV shot just showing what I'd normally do, which is look at the normal coverage and the tumor coverage, where you're clearly identifying a somatic variant here that's unique to the tumor genome itself. And then lastly, just to give a plug for a tool that we've come up with, there's a very long list here, but my slides are available, or there's a URL here.
This is just a tool that takes the bulk of a bunch of capture data that I talked about earlier and really compares the coverage levels according to a variety of color-coded depth levels here, and looks at the breadth of sequencing coverage and also the amount of enrichment that you've been able to achieve. And these bulk tools are really necessary in high throughput sequencing so that you can rapidly evaluate whether these are data that now need to go downstream to subsequent analysis, or back to the sequencing queue to generate additional coverage if the coverage levels are inadequate. One of the things I get asked a lot is, well, what's better, whole genome sequencing or exome sequencing? I guess my typical answer is, it depends on what you wanna do with the data, so I won't try to bias you one way or the other but just give you some facts and figures here. So this is just looking at the amount of sequencing data: an exome is about six gigabases, whole genome much larger, and this can increase when you're sequencing a cancer genome, because more coverage is always better there. You can see obviously the different target spaces, and this number varies depending upon the exome reagent that you're using, so this is an averaged amount. Mapping rates are higher for exome than whole genome, because whole genome is harder to map due to all the repetitive sequences that I mentioned already. Duplication rates tend to be a little higher for exome sequencing; it's less of a unique set of molecules. And these are the kinds of coverages that we typically achieve, although they can be higher if you want; it's really just a matter of economics. How much of the coding regions, or CDS, is covered at greater than 10X on average? You get about this much for exome, significantly higher for whole genome, and that's just because where the probes hybridize in the genome is sometimes differential, and you may get better coverage just by sequencing across the whole genome. And then really important is what you can get from exome sequencing, which is good point mutation and indel calling, but not so much resolution on copy number, and it's really difficult, though not impossible, to call structural variants, whereas pretty much everything is available from whole genome, which is intuitive. The most challenging thing there is calling structural variants, just because there's a very high false positive rate to this type of an approach. And then lastly, what do you worry about? Well, it depends on whether you're in a clinical setting or in a research setting. In a clinical setting, we worry about false negatives because we don't wanna miss anything, and most of these are actually due to lack of coverage, so careful examination, or a post-variant filtering approach that flags regions lacking the coverage needed to detect mutations we might otherwise have missed, is important. False positives are more important, maybe, in the research space, and these are some of the sources of false positivity, but we've actually found over time that going back and revisiting these sites can lead us to filters that remove the sources of false positivity, such as variants that are only called on one strand, or right at the end of the read, where your signal and noise are starting to approach each other.
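As a toy example of the kinds of filters just mentioned, here is a small Python sketch that rejects candidate variants supported by only one strand, or supported only near the ends of reads. The thresholds and supporting-read lists are hypothetical, not values from any production pipeline.

```python
# A toy post-filter (my own illustration of the filters just described, not the
# speaker's production pipeline): reject candidate variants supported by only
# one strand, or supported only near read ends where signal-to-noise is worst.

def passes_filters(supporting_reads, read_length=150, end_margin=10, min_per_strand=2):
    """supporting_reads: list of (strand, offset_of_variant_within_read)."""
    plus = sum(1 for strand, _ in supporting_reads if strand == "+")
    minus = sum(1 for strand, _ in supporting_reads if strand == "-")
    interior = sum(
        1 for _, offset in supporting_reads
        if end_margin <= offset <= read_length - end_margin
    )
    return plus >= min_per_strand and minus >= min_per_strand and interior > 0

good_candidate = [("+", 75), ("+", 40), ("-", 90), ("-", 120)]
strand_biased = [("+", 75), ("+", 40), ("+", 90)]            # one strand only
end_artifact = [("+", 2), ("-", 148), ("+", 149), ("-", 1)]  # all near read ends

for name, reads in [("good", good_candidate), ("strand-biased", strand_biased),
                    ("read-end artifact", end_artifact)]:
    print(name, "->", "keep" if passes_filters(reads) else "filter out")
```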
So I'll end here with the bioinformatics diatribe, but I just want you to appreciate, in a microcosm, all of the factors that go into this and why incorporating massively parallel sequencing, especially in the clinical space, is actually fairly problematic: you have to not only understand and appreciate all of these factors, you actually have to build into your sequence analysis pipelines the ability to deal with all of them, so that the result you get out is as high quality and high certainty as possible, because often therapeutic decisions and other types of clinical decisions are being made off of this. And so one of the current trends in the marketplace is to package together, for clinical utility, multiple systems, like this QIAGEN system shown here, that allow you to produce the DNA or RNA, produce the library, do the sequencing, and then have all of the bioinformatics and analysis packaged for you at the end of it. So it's a sample-to-insight, as they call it, type of solution. This is just one example where you have modules for all of these things, and then you have onboard analysis and interpretation software for clinical utility of sequencing data. Okay, I want to turn my attention to third generation sequencers and focus really on single molecule detection now, as opposed to the bulk molecule detection that we've been talking about. And what you're staring at here is the so-called SMRT Cell of the PacBio sequencer, which is used for real-time sequencing of single DNA molecules. This is the surface of the SMRT Cell that is the action part of the operation here, and it consists of about 150,000 so-called zero-mode waveguides (ZMWs), little isolated pockets that individual DNA molecules and polymerases can fit down into and can be sequenced and watched in real time as the sequencing reaction occurs. So how does this work? First of all, you make a DNA-polymerase complex, and this gets immobilized down at the bottom of the ZMW, where the bottom is specifically derivatized; the sides of the ZMW are not. And so on average, what ends up happening is that this polymerase is sitting right at the bottom of the zero-mode waveguide. Well, why do we want it there? Because we want to add in now fluorescently labeled nucleotides that the polymerase can incorporate relative to the strand that it's glommed onto, and these fluorescent nucleotides are evaluated as they come into the active site of the polymerase, which is in the viewing area of the optics of the instrument, which is focused on each one of these 150,000 zero-mode waveguides. What happens when you get an incorporation, of course, is that the nucleotide sits for long enough in the active site of the polymerase for the fluorescence to be excited by the excitation light coming into the bottom of the ZMW.
You get a fluorescent readout that's captured by the optics and the detection system of the instrument, and the phosphate group, which carries the fluorophore, is cleaved during incorporation, so it diffuses away, and now you're ready for the next incoming nucleotide to sit in that active site long enough to also excite its fluorescence and get a readout. And so you get a base-by-base readout that occurs based on the fluorescent emission wavelength of each of the nucleotides, which are specifically labeled, and this is a process that's occurring in parallel in all of the ZMWs that contain a polymerase with a DNA fragment that's properly primed. What you're essentially doing with this device is taking a movie of all of these ZMWs over a defined period of time, during which the data is accumulated. I'm not gonna walk through this extensive workflow in any way, shape, or form, but just to give you an appreciation of what it takes to process the library: some of these steps are very similar to what we've already talked about for massively parallel sequencing instruments, but the big difference here is that you actually mix the polymerase, the primer, and the library fragments together and then apply that onto the sequencing instrument, and as you can see here, we collect data over on the order of four to six hours of collection time from the zero-mode waveguides in the SMRT Cell. So this is a real departure from what we've been talking about, for a variety of reasons. First of all, the idea here is to start with, again, high molecular weight genomic DNA, but we actually shear it to very long fragment sizes, about 30 to 50, even up to 80, kilobases. The reason why is that during that four to six hour run, from individual fragments that go into the library prep that are this long, we can literally generate sequence reads that are in excess of 30,000 base pairs. The average is about 15,000, so as you can see, this is diametrically opposed to all of that short read sequencing we were just talking about. So we've worked out methods for consistent shearing of DNA to these long fragment lengths. As you can imagine, DNA that long is reasonably fragile, so there are a variety of devices that we use to maintain its stability and also to make the library as high quality as possible. And this does take a pretty good amount of DNA. So to sequence a whole human genome, as I'll talk about in a minute, is a considerable investment that really, at this point in time, is not compatible with an individual core biopsy from a cancer sample, for example. Here we're talking about sequencing from cell lines, where you can generate lots and lots of DNA to get the kind of coverage that we need, 60-fold or higher, to sequence the human genome. This is just an example of some of the read lengths that are attainable from some recent examples, where you can see the mean read lengths are sort of in the 13,000 to 15,000 base pair range, and some of these bins go extraordinarily high. And the reason for pointing this out is that now we're moving from a realm where we need to align reads back to a human reference genome to one where we come up with algorithms that can assemble these reads into portions of entire human chromosomes. And this is of course really important for a variety of reasons.
One that I'll talk about today is that we're trying to use these long read technologies not only to improve the existing human genome reference, but also to produce additional high quality human genomes, to spread out the knowledge about diversity in the human genome across different populations of the world and also to really understand the unique content in genomes that you can't get by simply aligning reads back to a fixed reference. So this is just a shot from our website that talks about our reference genomes improvement project, which is funded by NHGRI, and it shows you that we plan to produce gold reference genomes and have already produced some platinum reference genomes. The difference here is that the platinum genomes are haploid genomes. They come from an abnormality called a hydatidiform mole, where an enucleated egg is fertilized by a sperm, and this grows out to a certain stage and then gets turned into a cell line. So those are haploid, high quality human genomes. The gold genomes will be from diploid individuals, as you can see, across different populations of the world. The plan is sort of outlined here, and this is again linkable from our website at the URL which I just showed. It starts from PacBio sequencing reads and a de novo assembly of those reads, which will be contiguous across long stretches but not perfect, and then, using a different technology called Bionano, which makes maps of the genome, we can actually get an accurate representation of how good our PacBio assembly is and how big the gaps are that we still have left to fill. In some cases, as I'll show you, we're also using the PacBio sequencer to sequence from bacterial artificial chromosomes made from these same cell lines. We fill in the gaps using these data and come up with a high quality gold reference genome that's highly contiguous, to the extent possible, across the chromosomes. So this is just an example of the first gold genome that we've produced. This is from a Yoruban individual in the 1000 Genomes Project, and you can see all kinds of metrics here for the quality of the assembly, but our biggest contig is 20 million base pairs of assembled sequence data. That's pretty remarkable when you stop and think about it. On average the contigs are about six million base pairs, and I'll show in just a second how contiguity like that is usually boiled down to a single summary number. We can align this, as I said, to the maps that are created using the Bionano, which takes similarly long pieces of DNA, does a restriction digest, calculates the restriction fragment sizes, and maps those back. And when you compare, in this particular example there's a conflict between the PacBio assembly and the Bionano data, which you can see is now resolved by alignment of these BAC reads using the PacBio long read assembly approach here. We can resolve this very complex region of the genome, which involves a segmental duplication. These are historically the hardest parts of the human genome to actually finish to high quality and contiguity, but here we've actually done it. So these are the different genomes, the source or origin, and the level of coverage that's planned, and you can see a current snapshot from just a couple of weeks ago of where we're at with producing these data, which of course will be available for use from NCBI. Okay, I just want to finish up here with a couple of new technologies to mention.
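A quick aside before those new technologies, for anyone who wants that single-number handle on contiguity: assemblies like this are usually summarized by the N50, the contig length at which half of the total assembled bases sit in contigs at least that long. A minimal sketch, with made-up contig lengths rather than the slide's actual values:

```python
# N50: sort contigs from longest to shortest and walk down until half of the
# total assembled bases have been accumulated; the contig length at that point
# is the N50. Contig lengths below are invented for illustration.

def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

contigs = [20_000_000, 12_000_000, 9_000_000, 6_000_000, 6_000_000,
           4_000_000, 2_500_000, 1_000_000, 500_000]
print(f"total assembled: {sum(contigs) / 1e6:.1f} Mb, N50: {n50(contigs) / 1e6:.1f} Mb")
```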
This is 10X Genomics, a company that's really aimed at high quality contiguity, but using a different approach than long read technology. So how does this work? Well, the idea here is to generate these little segments, starting with a long piece of DNA similar to what we would be using for a PacBio library. You combine this in an isothermal incubation in micelles, similar to the emulsion amplification approach that we use for Ion Torrent, so oil-and-buffer micelles, where you have these little molecular barcodes that sit down on the surface of this long molecule and then get extended for a certain period of time. You then turn these into fully sequenceable libraries by ligating adapters onto the ends, with a molecular barcode at one end and the adapter at the other end. You can sequence and analyze these, and then, using bioinformatics, take these finished sequence reads and combine them back, using a linked read approach, into a full contig, similar to what I just showed from the assembly of PacBio reads. So this is getting long range information from short read technology, and indeed this uses the Illumina platform for the readout. You're starting with this gel bead-in-emulsion, or GEMs, approach that I just showed you, to take long molecules into a micelle partition, amplify off of these different molecular barcodes, and then use these in the sequencing library that gets read out; the barcodes then identify those individual short sequence reads as having come from the original long fragment that was isolated in that micelle. And so using this approach, you can actually generate very long contiguity by just matching up the barcodes from the short reads and linking them together using specific algorithms. So there are lots of things that you can do with this. I see that this didn't animate properly, so sorry about that; just go to the website for this approach and it will show you the different things that you can do, including long range haplotyping and then getting information, like I just talked about, from diploid de novo assemblies, using the Supernova assembler that's been created by the 10X Genomics crew. Last but not least, I'll just talk about Oxford Nanopore sequencing briefly to give you an update. This is a protein-based nanopore which is meant to pull a DNA fragment through, using a variety of mechanisms. Each nanopore is linked to an application-specific integrated circuit that collects data and basically fits the data to a model of what each nucleotide combination looks like inside of that pore in order to call the base sequence. And so during the run, you basically get a little output that looks like this, which reflects the translocation of the DNA sequence through the pore, and you can also sequence these reads twice, one direction and then the other, to get higher accuracy information. The read lengths from this are variable, so there's no set read length; it's really just sequence until you have the amount of sequence information that you need for the coverage that you need. And as I mentioned, this data collection really is based on the electrical current differential across the membrane that the pore is sitting in. So as the DNA translocates, different combinations of nucleotides give differential changes in the electrical current, and then this can be fit back to a model of all possible multimers. This has been evolving over time.
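To give a flavor of what fitting the data to a model of multimers means, here is a highly simplified Python sketch, with an invented current table, in which each observed current segment is matched to the k-mer with the closest expected level; Oxford Nanopore's actual base callers use far more sophisticated statistical models than this.

```python
# A highly simplified illustration (not the vendor's actual base caller) of the
# idea just described: each short run of nucleotides in the pore has a
# characteristic current level, and observed current is matched against a table
# of expected levels. All values here are invented.

kmer_model = {    # hypothetical expected current (arbitrary units) per 3-mer
    "AAA": 95.0, "AAC": 88.5, "ACG": 72.3, "CGT": 60.1,
    "GTT": 78.9, "TTG": 83.4, "TGA": 69.7,
}

observed_currents = [94.6, 88.9, 72.0, 60.5, 79.2]   # made-up signal segments

called_kmers = []
for level in observed_currents:
    # pick the k-mer whose expected current is closest to the observed level
    best = min(kmer_model, key=lambda k: abs(kmer_model[k] - level))
    called_kmers.append(best)

print("called k-mer path:", " -> ".join(called_kmers))
# A real decoder also enforces that consecutive k-mers overlap by k-1 bases
# (e.g. AAA -> AAC -> ACG), typically with an HMM or a neural network.
```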
The error rates initially were quite high, but with improvements in pores and software, the newest iterations of this type of approach have error rates of around 10% for the dual read, where you sample the sequence twice, and about 20% if you only sequence through the molecule once. That's probably in the same realm, at this point in time, as the error rate on PacBio sequencing, which I apologize for forgetting to mention earlier. This is what the device looks like. It's small, about the size of a stick drive that you might put in your computer's USB port, and it actually connects to the computer via USB. And this is the actual sequencing device here, the flow cell, which contains the array of nanopores that the DNA translocates through; the data are collected, fed into the associated computer, and then analyzed after the run is completed. And in a promised next version, the company will basically put together a lot of these nanopore devices into a very large, compute-cluster-looking device that's shown here, which has the formidable name of PromethION.

Okay, so I'm going to finish up with three more slides before we open for questions, to give you a feel for where things are going with the application of next generation sequencing in the clinical cancer care of patients. This refers to what I call immunogenomics. In the past, people who know a lot more about immunity and cancer than I do, Thierry Boon, Hans Schreiber and others, predicted that because tumor genomes carry specific mutations, as we've talked about, that produce proteins that are mutated and therefore have a different sequence, these proteins might actually look different to the immune system if you could somehow tell the patient's immune system about them. This could happen through some sort of vaccine-mediated approach or otherwise that could alert the immune system to the presence of these abnormal cells and lead to their destruction. The problem in the past was that identifying these neoantigens, the proteins or peptides that look most different to the patient's immune system, was extraordinarily difficult. And as I'll illustrate in a few quick slides, this has largely been overcome by next generation sequencing and bioinformatic analysis. So in the immunogenomics realm, for identifying these neoantigens, we have three sources of data. We have exome sequencing from cancer and normal tissue to identify cancer-unique peptides. We can also obtain from next-gen sequencing the HLA haplotypes of the individual: what are their specific HLA molecules that will bind these cancer-unique peptides and present them to the immune system? And then there are the RNA sequencing data, because most of the samples we'll be talking about carry very high mutation loads, and not every DNA mutation falls in an expressed gene or results in an expressed mutation. We combine all three of these data types in an algorithmic approach that compares the binding of the altered, mutant peptides to the corresponding wild-type peptides and gives us the ones that look most different to the immune system based just on binding to the MHC. We call these tumor-specific mutant antigens, or TSMAs, also known as neoantigens, and this lets us describe the cancer's neoantigen load.
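For a feel of what that data integration step can look like, here is a heavily simplified sketch in Python. It assumes you already have, for each candidate mutation, the mutant and wild-type peptides, predicted MHC binding affinities from some external predictor, and RNA expression of the source gene; the class names and thresholds are illustrative assumptions, not the values used in our published pipeline.

```python
# Heavily simplified sketch of neoantigen candidate selection: keep mutant
# peptides that bind MHC well, bind markedly better than their wild-type
# counterparts, and come from expressed genes. Thresholds are placeholders.

from dataclasses import dataclass

@dataclass
class Candidate:
    gene: str
    mut_peptide: str
    wt_peptide: str
    mut_affinity_nm: float   # predicted MHC binding of the mutant peptide (lower = tighter)
    wt_affinity_nm: float    # predicted MHC binding of the wild-type peptide
    expression_tpm: float    # RNA-seq expression of the source gene

def select_neoantigens(candidates, max_mut_affinity=500.0,
                       min_fold_change=2.0, min_expression=1.0):
    selected = []
    for c in candidates:
        binds = c.mut_affinity_nm <= max_mut_affinity
        # "Looks different" to the immune system: the mutant peptide binds
        # markedly better than the corresponding wild-type peptide.
        differential = c.wt_affinity_nm / c.mut_affinity_nm >= min_fold_change
        expressed = c.expression_tpm >= min_expression
        if binds and differential and expressed:
            selected.append(c)
    return selected

if __name__ == "__main__":
    toy = [
        Candidate("GENE_A", "SLYNTVATL", "SLYNTVATF", 45.0, 900.0, 12.3),
        Candidate("GENE_B", "KLPDLYTEL", "KLPDLYTEV", 600.0, 650.0, 8.0),
    ]
    for c in select_neoantigens(toy):
        print(c.gene, c.mut_peptide)
```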
Just for those of you in the bioinformatics realm, we have a pipeline available on GitHub for doing everything that I just described, including the RNA-based filtering and coverage-based filtering, and that was just published in January of this year. So why do we care about this? Well, there's evidence from the medical literature that patients with the highest mutation rates, and therefore the highest neoantigen loads, are the ones most responsive to a remarkable new class of drugs commonly referred to as checkpoint blockade therapy. These are antibody-based drugs that release the brakes on the immune system and allow T cells to infiltrate, identify, and selectively kill cancer cells in many cases. These are just two publications from the literature showing what I just told you, which is that the response curves look dramatically different in patients with high versus low mutation loads. Here's another study, from the Hopkins group, looking at patients who are either mismatch repair proficient, in red, or mismatch repair deficient, where you have a germline susceptibility, or sometimes a somatic tendency, to develop many, many more mutations. We refer to these people as ultramutators, whereas microsatellite-stable genomes tend not to respond to these checkpoint inhibitor drugs. So this is an important thing to be able to monitor from the genomics of a tumor, because people with high neoantigen loads may be suitable for checkpoint blockade therapies, many of which are now approved by the FDA; just last week the Genentech drug atezolizumab was approved in bladder cancer, which is another high-mutation-load tumor. So this is just a quick case history, because I love N-of-ones and I try to present one every time I come here to give this lecture; it always has to be a different one, and this is the most recent. Let me walk you through it quickly. This is a male patient who, before this glioblastoma, presented with a history of colon polyps; he's much too young to be having colon polyps, I think we would all agree. The GBM was diagnosed when he started having seizures while on vacation, it was removed at UCSF, and part of the sample from this tumor was shared with us for the genomic studies I'll tell you about. He received, as do most GBM patients, a drug called temozolomide, which is an alkylating chemotherapy often given in the context of radiation therapy for GBM. But when he was back in St. Louis, he was diagnosed with a spinal metastasis in the cervical vertebrae, identified through some additional sequelae and symptoms he was experiencing. This went to Foundation Medicine, which does commercial testing of cancer tissues, and came back with practically every gene on their cancer panel having some sort of mutation, which should tell you something. It did tell the oncologist something, and so he sent a blood sample for a colorectal cancer gene panel, and this individual has a mutation in a gene called polymerase epsilon, which is important in DNA repair processes in the normal cell; this is a known variant that causes defects in polymerase epsilon activity. So based on all of this, and on the clinical evidence I just showed you, he was assigned to take an anti-PD-1 checkpoint immunotherapy called pembrolizumab. Curiously, just a couple of weeks after he started the course of anti-PD-1 therapy, a second spinal metastasis was identified.
So there was some concern that he was resistant to this drug, but actually, when we went back to the initial MRIs, the radiologists, not me, determined that the second metastasis had probably just been physically too small to pick up accurately at the time. It was also removed, and we studied all of these samples, compared to the individual's blood normal, using exome sequencing and also some immunohistochemistry, which I'll show you in a minute. So this is the clonal evolution over time, admittedly not much time, across the different presentations of this individual's tumors. You can see a very complex tumor across the spectrum of all the variants that were identified, and this is the clonality plot showing that there's a founder clone set of mutations and then three detectable additional subclones present in the disease. In the spinal metastasis, post-temozolomide, you now see much more complex disease, and we know this is a phenomenon associated with alkylating chemotherapy, so very complex disease. And now look at the winnowing influence that the anti-PD-1 checkpoint blockade has had on this individual's tumor that was removed from the second spinal metastasis: it's now much simpler from a genomic standpoint, and it shows the impact of the drug. Why is this important? These are antibody-based therapies, and as I mentioned, there has long been speculation that because they're antibodies and not small molecules they wouldn't cross the blood-brain barrier; I think this is evidence to the contrary. If you need more evidence, you can look at the immunohistochemistry staining for different immune molecules in the T-cell repertoire, which is absolutely not there in the first metastasis, so that's post-temozolomide but before the use of anti-PD-1. And once we go through a couple of cycles of anti-PD-1, the removed second metastasis is now full of immune infiltrate, which is indicative of why this subsequent tumor has a significantly different genomic context than the previous one removed from the spine. So this patient remains without evidence of disease so far on the pembrolizumab therapeutic, and it raises another very interesting question medically, which is that if patients like this with constitutional polymerase epsilon mutations are identified early, they may actually benefit from so-called prophylactic therapy in the checkpoint blockade realm, where it helps them stave off developing tumors because of this activation of the immune system against the tumors, as you see here. That's a formal possibility, but it hasn't been tested, obviously. I should also point out that as we were evaluating this patient, a report of two sibling pairs who also had germline constitutional DNA repair defects and had developed brain cancers came out in the JCO from a group in Toronto, describing their response to checkpoint blockade therapy. So between their experience and this individual's, we now have two pieces of evidence that this approach may be suitable for patients in the neurological cancer space. And then a recent MRI indicated that a remaining lesion that had been left behind during the initial neurosurgery to remove this very large GBM was actually diminishing over time on additional imaging. So he's clearly responding in that area of the brain as well. I think this is an exciting new area. We're also developing personalized vaccines, which I didn't have time to talk about but which you can read about.
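Stepping back to those clonality plots for a moment, and again just for the bioinformatics folks: here is a bare-bones sketch in Python of the kind of variant-allele-frequency clustering that underlies such plots, a founder cluster plus subclones. Real analyses use proper statistical mixture models and correct for tumor purity and copy number; the VAF values and the simple 1D k-means here are illustrative assumptions, not the actual analysis performed on these tumors.

```python
# Bare-bones sketch: cluster variant allele frequencies (VAFs) from tumor
# sequencing into a founder cluster plus subclones using a simple 1D k-means.
# VAF values below are invented for illustration.

import random

def kmeans_1d(values, k, iters=100, seed=0):
    random.seed(seed)
    centers = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

if __name__ == "__main__":
    # Illustrative VAFs: a founder clone near 0.45 and two smaller subclones.
    vafs = ([0.44, 0.46, 0.45, 0.47, 0.43] +
            [0.22, 0.21, 0.24, 0.23] +
            [0.08, 0.09, 0.07])
    centers, clusters = kmeans_1d(vafs, k=3)
    for center, members in sorted(zip(centers, clusters), reverse=True):
        print(f"cluster at VAF ~{center:.2f} with {len(members)} variants")
```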
Coming back to those personalized vaccines: we published a manuscript in Science in 2015 describing our first-in-human melanoma trial, where we used the same approach but then used the neoantigens we identified to design a personalized vaccine for patients with advanced melanoma, and we're expanding this out into other tumor types as well. So that's just a little vignette on how genomics is now really having an impact in the clinical setting for the care of cancer patients. There are obviously multiple other examples from people's work, but I think this is particularly exciting. So I'll finish up by thanking my colleagues at the Genome Institute, including Rick Wilson, who's the director of our institute. I also want to give a nod to my very important collaborators who provided the samples I just talked about, and to Bob Schreiber, who taught me everything I know about immunology, which still isn't much, but he's absolutely the expert and I really value his collaboration. I also want to thank all of the individuals who contributed slides on the different components of the technologies I talked about. And lastly, with respect to that last vignette, I want to acknowledge the patients and families who contribute valuable samples so that I can present exciting information like this to you all. Thanks for your attention.