Okay. All right. Good morning, everyone, and welcome to week six of this Current Topics Lecture Series. Before I introduce today's guest, just a brief program note to remind you that there is no Current Topics lecture next week, February 29th. Instead, next Wednesday morning I'd strongly encourage all of you to attend NHGRI's 10th Annual Trent Lecture, which will be held at 10 a.m. in the Kirschstein Auditorium over in the Natcher Building. This year's Trent Lecture will be given by Bert Vogelstein of the Johns Hopkins University School of Medicine, and his lecture is entitled "Cancer Genomes and Their Implications for Basic and Applied Research." Of course, those of you working on questions in cancer research already know that Dr. Vogelstein was the first researcher to elucidate the molecular basis of a common human cancer, and that his work on colorectal cancer forms the paradigm for much of the work being done in modern cancer research. If you haven't heard him before, Bert is a fantastic speaker who brings a unique worldview to the field of cancer genomics, and the themes he'll touch upon during his lecture will dovetail nicely with many of the themes we're addressing over the 13 weeks of this Current Topics Lecture Series. Please keep in mind that Dr. Vogelstein's lecture will not be videocast and will not be videotaped, so please mark your calendars for next week; I hope to see many of you over in the Natcher auditorium for Bert's lecture.

So today, it's my great pleasure to introduce to you Dr. Elaine Mardis, who is a professor of genetics and molecular microbiology and the co-director of The Genome Institute at the Washington University School of Medicine. Dr. Mardis' involvement in the field of genomics dates back to the beginning of human genome sequencing, when, in 1993, she joined the Genome Institute at Wash U as its director of technology development. In that role, she helped create methods and automation pipelines that were critical for sequencing the human genome. If you think back to Dr. Green's lecture during the first week of this course, you'll recall the technological challenges faced in the early days of the Human Genome Project by those who sequenced each and every one of the chromosomes that comprise the human genome. Dr. Mardis was really one of the key players and thought leaders who helped figure out how to best approach and operationalize such a huge biological and technological problem. In her current role as co-director, she orchestrates the Genome Institute's efforts to explore next-generation and third-generation sequencing technologies, some of which you'll hear about today, with the goal of transitioning these technologies into a production sequencing environment that can serve as a very strong foundation for addressing really important questions in genetics and genomics. Dr. Mardis also has a very strong research interest in the application of DNA sequencing approaches to the characterization of cancer genomes, with a particular focus on facilitating the translation of basic science discoveries about human disease into the clinical setting. Her work and contributions to the field of genomics have been recognized by numerous organizations; most recently, in 2010, she was awarded the Scripps Translational Research Award for her work on cancer genomics.
And in 2011, she was named a distinguished alumna of the University of Oklahoma College of Arts and Sciences. I'm very pleased that Elaine could join us today and be part of this series, presenting her perspective on next-generation sequencing technologies. So with that, please join me in welcoming today's speaker, Dr. Elaine Mardis.

That was great. Well, good morning, everybody, and thanks again, Andy, for the kind invitation to be here and to provide an educational lecture for you on next-gen sequencing technologies. I believe I'm supposed to flash the next slide, which is not working for some reason, to show that I have nothing to disclose. Am I doing this wrong? I think I have a pretty good amount of practice at this. There we go. So this is my disclosure slide of no relevant financial relationships.

What I'm going to do today is basically tell you all about next-gen and third-generation sequencing instruments. I have a bit of a laundry list for you here, excuse me, to consider, and then I'll spend the last portion of the talk giving you a probably somewhat limited, but hopefully good, broad-brush perspective on all of the ways these technologies are now being used to really transform the biological research enterprise. There's so much going on that it could never be comprehensive, or we would be here for hours, and neither you nor I want that. So I'll try to give you a few salient features, give you some references from our own work, and mention the work of others as I go along, assuming I remember to do that.

You may be familiar with a very nice issue of Nature that came out toward the beginning of last year. Featured in it was the roadmap for the National Human Genome Research Institute's next five years, and I was also very honored to have been asked at that point in time to provide a perspective piece, the reference for which is at the bottom here, looking back over the trajectory of improvements in DNA sequencing technology across the 10 years since the human genome sequence was completed. This is just a figure from that paper that gives a timeline over that 10-year period, and I'll talk about some of the highlights along this timeline as I go through the work. It also reflects that, from around the time we finished the genome and moving forward, there has been this explosion in the ability to produce sequence data, as you can see: inflecting upward around 2005 with the introduction of the first next-generation sequencing instruments from 454, and continuing upward with recent announcements, which I'll briefly mention, made in late January about sequencing technologies that will now take us to the point of sequencing an entire human genome in essentially an overnight time period. So this is a very radical transformation over a very short period of time, and it has already had a tremendous amount of impact on biological research; I'll try to give you a flavor for that. But if you don't remember anything else, just remember that the cost of data production (not the cost of data analysis, but the cost of data production) has fallen dramatically.
So if you look at capillary-based sequencing technology around the time we finished finishing the human genome in 2004: if you went back to that capillary sequencer from ABI and ran enough DNA sequence to satisfy coverage on the 3-billion-base-pair human genome, you would be talking about a $15 million price tag. Most of us in this room, including myself, can't afford that if we wanted our genomes sequenced. Maybe some of you can, and if so, congratulations. But you would have wanted to wait, because in a very short time period, a transition of about six or seven years, that cost has fallen dramatically to around $10,000, perhaps moving toward the mythical $1,000 figure in this calendar year; we'll see. And the time to produce that data wouldn't be weeks and months and years, depending upon how many capillary sequencers you had; literally today, on this aluminum box shown just for illustration, you can sequence six human genome equivalents in about a 10- or 11-day period. So you and five of your friends can pool your money together and get your genomes sequenced very rapidly.

So what are the basics behind all of these next-generation sequencing platforms? For years and years, all we had to choose from, basically, was the capillary sequencer from Applied Biosystems. Now, as I'll try to illustrate for you, there's a crazy wealth of riches in terms of all of the sequencing platforms that are available. That's the good news. The bad news is that for people who don't live, sleep, and breathe this like I do, questions may arise about the exact right technology for the application you have in mind, and I'll try to shed a little light on that later in the talk. But let's first take some time pacing through the basics of how these things work, how they do what they do, and what they turn out in terms of the data produced.

Now, each and every one of the manufacturers of these sequencing instruments would like you to think that their instrument is highly unique and capable and poised above all of the others available in the commercial space. Of course, as a skeptical scientist, you won't believe that, and that would be wise. What I want to walk you through first are all of the ways these sequencers are actually the same, because there are a lot of similarities. I do this to set the stage for then pacing through each one and telling you how they're unique. But keep these cross-platform similarities in mind, because they give you a fundamental basis for understanding how the platforms work.

All of the shared attributes are listed here. First, we'll start with the fact that making libraries, for those of you who may in the past have made clone-based libraries for capillary sequencing, is now faster, easier, and cheaper than ever. There's no need to run through an E. coli intermediate; there's no need to do cloning. It's a very straightforward process that begins with random fragmentation of the starting DNA you're interested in sequencing. If, for example, your samples are just PCR products, then there's no need for fragmentation, and you can go forward to the next step, which is ligation of these fragments with custom linkers, or adapters, to make a library. And as you'll see with each one of these technologies, library construction is basically the same approach.
Each instrument has its own specific and unique adapters, as you might guess, but nonetheless the overall process is exactly the same and highly concordant. So instead of spending a week producing a subclone library, which you then pick, amplify in E. coli, and isolate all of the DNA from in the old-style process, with this process you essentially, in a day's time, fragment the DNA, add on the adapters by ligation, do some purification and amplification steps, quantitate the library, and you're ready to go. The whole process literally takes less than a day and costs, in our hands, on the order of $100 to $150 to complete, to the point where you now have hundreds of millions of DNA fragments ready for next-gen sequencing.

I mentioned that there's library amplification in these processes. Depending upon the platform, it takes place on some sort of solid surface, either a bead or a glass surface, and I'll show you the differences between the sequencers on that point. But the net impact is the same: you're taking these unique fragments and, starting from one fragment, amplifying it up to multiple copies. Sometimes when I present this lecture, I ask students why they think we do that. I won't do that to you; I'll just tell you the answer, which is that for all but the single-molecule sequencers, which I'll mention as we go through them, you must do library amplification in order to see the signal coming back from the sequencing reaction itself. So most of these sequencers start with single molecules, which are amplified in place, either on a bead or on glass, and then sequenced; to see the sequencing reaction going on in real time, you actually have to do that amplification step. That's not a bad thing in most cases. However, with any type of enzymatic amplification, you're always going to get some amount of bias and some amount of what's called duplication, or jackpotting, where some of the library fragments preferentially amplify and you get more of those sequences than of others. We have ways to adjust for, or ameliorate, that in our processes; a small code sketch of the basic idea appears just below.

Okay, and then on to the sequencing reactions themselves. For most of the technologies I'll talk about today, there is a direct, step-by-step detection of each nucleotide base incorporated during the sequencing reaction. Commonly these approaches are referred to as sequencing by synthesis, if they use a polymerase, or sequencing by ligation, but either way they occur in a direct, step-by-step fashion. So again, let's harken back to the days of old and capillary sequencing. If you're familiar with this, what happened procedurally was that you sequenced all of your DNA fragments, perhaps 96 or 384 at a time, and then, after the sequencing reaction was over, applied them to a sequencer, where they were separated by electrophoresis and the fragments were detected, by radioactivity if you're as old as I am, or by fluorescence if you're younger than I am. In contrast, in next-generation sequencing, everything happens together at the same time. Sequencing and detection happen in a step-by-step fashion, so you essentially don't decouple the sequencing reaction from the detection of the sequencing reaction.
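To make the jackpotting point from above concrete, here's a minimal sketch, in Python, of how duplicate reads can be flagged after alignment. Everything in it (the read dictionaries, field names, and coordinates) is hypothetical, and production pipelines use dedicated tools for this step, but the core idea is just the one stated: reads whose aligned coordinates, and their mates' coordinates, are identical most likely came from the same original library fragment.

```python
from collections import defaultdict

def mark_duplicates(alignments):
    """Flag probable PCR duplicates: alignments sharing the same
    chromosome, strand, start coordinate, and mate coordinate most
    likely derive from one library fragment that was preferentially
    amplified ("jackpotted")."""
    seen = defaultdict(list)
    for aln in alignments:
        key = (aln["chrom"], aln["strand"], aln["start"], aln.get("mate_start"))
        seen[key].append(aln)
    unique, duplicates = [], []
    for group in seen.values():
        # Keep one representative per position, e.g. the best-quality read.
        group.sort(key=lambda a: a.get("mapq", 0), reverse=True)
        unique.append(group[0])
        duplicates.extend(group[1:])
    return unique, duplicates

# Hypothetical toy input: three reads, two of which are duplicates.
reads = [
    {"chrom": "chr1", "strand": "+", "start": 1000, "mate_start": 1300, "mapq": 60},
    {"chrom": "chr1", "strand": "+", "start": 1000, "mate_start": 1300, "mapq": 37},
    {"chrom": "chr1", "strand": "+", "start": 2000, "mate_start": 2290, "mapq": 60},
]
kept, dups = mark_duplicates(reads)
print(len(kept), "unique,", len(dups), "duplicate")  # -> 2 unique, 1 duplicate
```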
And this leads to another name, which I actually quite prefer over "next-generation sequencing" or some-generation sequencing, because it more accurately represents what's going on in these instruments, as I hope you'll come to appreciate: you're performing hundreds of thousands to hundreds of millions of sequencing reactions all at the same time. So the term often applied to these technologies is massively parallel sequencing, which is exactly what you're doing. You're sequencing everything together, simultaneously performing an imaging step to detect what happened, and then moving on to the next base-incorporation step, over and over again, until you generate your full sequence run.

Now, a consequence of doing next-generation sequencing, with a couple of exceptions that I'll point out, is that in general these reads are shorter than capillary reads. There are a number of reasons for this, but it mainly comes down to one word: signal versus noise. I guess that's three words, technically, although I think of them as one. What you're always battling in this detection game is the signal-to-noise ratio, and in most of these technologies there's some cost to pay that ultimately limits the read lengths. I'll give you the specifics for each platform, but just consider, in principle, that these are going to be shorter read lengths. And I mentioned earlier the contrast between the cost to generate data and the cost to analyze data. This is where push comes to shove, because there's a toll exacted by the fact that you can produce lots of short reads, and then you have to go and analyze them. I'll point out the reasons why that becomes much more difficult, and why the bioinformatics overhead, the analysis overhead, is still quite expensive for us.

I already alluded to this, but I'll talk about it again a little later on, when I show some examples of how we're actually exploiting the fact that these are digital reads. Each read of a massively parallel sequencer originates from one fragment in the library, even though it's amplified. What that means is that you can literally apply counting-based methods to the analysis of these data. As I'll show, that can tell you, for example, how many of the tumor cells that collectively provided DNA for a tumor genome actually contain each of the mutations you've detected; you can get down to that level of sensitivity. You can also look at the number of counts for a given messenger RNA, for example, and examine quantitative aspects of sequencing as we've never been able to do before. This is a tremendously exciting application space for next-generation sequencing, and I'll try to give you a feel for it; a toy example of the counting idea follows.
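Here's a toy illustration, with made-up read counts, of that digital counting idea: because each read represents one sampled molecule, the fraction of reads carrying a mutation estimates the fraction of molecules, and in turn of cells, that carry it. The doubling from allele fraction to cell fraction assumes a heterozygous mutation at a copy-number-neutral site, which a real analysis would have to verify.

```python
def variant_allele_fraction(ref_count, alt_count):
    """Digital counting: each read is one sampled molecule, so the
    fraction of reads carrying a mutation estimates the fraction of
    DNA molecules that carry it."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0

# Hypothetical pileup at one somatic site: 120 reference reads, 30 mutant.
vaf = variant_allele_fraction(120, 30)
print(f"VAF = {vaf:.2f}")           # 0.20
print(f"~{2 * vaf:.0%} of cells")   # ~40% of cells, assuming the mutation is
                                    # heterozygous at a diploid locus with no
                                    # copy-number change at this position
```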
And then, lastly, one of the newest abilities to come on board for these sequencing instruments is the use of what we refer to as paired-end reads. Most of these technologies started out by priming a sequencing reaction, extending off of a single primer for a certain read length, and that was it: you got a single fragment read. In most cases that was pretty darn good, and we learned to work with it. But over time, what emerged was the ability to sample not one but both ends of the fragment in the library. Namely, each technology applies a different adapter to each end of the DNA fragment going into the library, and you can exploit that by using one primer for one adapter and, in a second round of sequencing, a second primer for the second adapter, effectively collecting data from both ends of the fragment. You also understand, based on the size of the fragments that went into the library you made, let's say 300 base pairs, that now that you have 100 bases from one end and 100 from the other, you can align those back to a reference genome and expect them to align about 300 base pairs apart from one another. And when that doesn't work out, when the reads map further apart or closer together, or maybe even to two separate chromosomes, you can actually use that information to make sense of the genome you're sequencing. I'll give you a few examples of this from our work later on, and a simplified sketch of the logic right after this passage.

Paired-end reads have all kinds of other advantages that I've listed here. There's also a bit of a nuance to paired-end reads that I want to spend a little time on, because it is a major point of misunderstanding in the literature, and manufacturers, quite frankly, will try to trip you up with this one. Sequence can be derived from both ends of the library fragments, as I just mentioned, but there are basically two kinds of read pairs, and they go by different names. In my vocabulary, true paired ends mean that you have a linear fragment, typically, as I said, on the order of three to five hundred base pairs, and you literally sequence both ends using two different primer-extension steps in two separate reactions. That's paired-end reads. The second type of read pair you can generate on a next-gen sequencer is a so-called mate pair, and the nuance here is that rather than using two separate adapters, you circularize a large DNA fragment, typically greater than a kilobase in length (3, 8, and 20 kb libraries are typically made), around a common single adapter. By circularizing the fragment you generate mate pairs in which the two ends of the DNA come together. You go through a second step to remove the extraneous DNA, the part of the circle you don't care about, and then you either use a single read across the adapter to sequence the DNA or two separate end reads to give you the sequences at either end of those fragments. The advantage of mate pairs is that you can stretch out to much longer distances across the DNA you're interested in sequencing and hopefully better understand the long-range structure of that DNA. The downside of mate pairs, as opposed to paired-end reads, is that because DNA circularization is inherently not very efficient, large amounts of DNA, typically several micrograms, are required for each library constructed. That just goes to the inefficiency of mate pairing.

In general, whether they're mate pairs or paired-end reads, read pairs offer advantages for sequencing, especially when the genome is, like the human genome, large and complex. The reason is that you can more accurately place a read pair on the genome than you can a single-ended read, mainly because, as long as both reads don't fall into a repetitive structure, you can anchor one with certainty even if the other doesn't anchor with a high degree of what we call mapping quality. A read might place at multiple positions in the genome, but as long as its companion read places at exactly one position when you go back and align the reads to the reference, you can identify where exactly that read came from. The net result is that you can use more reads in your ultimate analysis than you might with single-end reads alone, and this provides a huge advantage in the economy of sequencing as well.
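To make that read-pair logic concrete, here's a deliberately simplified classifier along the lines of what tools like BreakDancer, which comes up near the end of the talk, formalize statistically. The coordinates, the 300-base-pair expected insert, and the tolerance are all hypothetical, and a real caller demands clusters of discordant pairs, never a single pair, before calling anything.

```python
def classify_read_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                       expected_insert=300, tolerance=100):
    """Very simplified read-pair classification: concordant pairs map
    the expected distance apart in opposite orientations; everything
    else hints at a structural variant."""
    if chrom1 != chrom2:
        return "inter-chromosomal translocation candidate"
    if strand1 == strand2:
        return "inversion candidate"          # mates in the same orientation
    span = abs(pos2 - pos1)
    if span > expected_insert + tolerance:
        return "deletion candidate"           # mates map too far apart
    if span < expected_insert - tolerance:
        return "insertion candidate"          # mates map too close together
    return "concordant"

print(classify_read_pair("chr1", 5000, "+", "chr1", 5290, "-"))  # concordant
print(classify_read_pair("chr1", 5000, "+", "chr1", 9000, "-"))  # deletion candidate
print(classify_read_pair("chr1", 5000, "+", "chr9", 700, "-"))   # translocation candidate
```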
Okay, so that's the introductory tour of similarities, differences, and basic terminology. Now what I want to do is use those concepts to walk you through each of the different approaches, and keep those similarities in mind as we also examine the differences.

So this was, as I mentioned, one of the first technologies to come to us for next-generation sequencing, around late 2004 or early 2005. It's a massively parallel version of a sequencing approach you might have heard of called pyrosequencing, which basically uses the emission of light to register the incorporation of nucleotides into a growing DNA strand. The 454 approach to constructing a library is exactly as I walked you through: we do a random fragmentation step, so you can see the DNA that was intact here is now in small pieces, with the adapters ligated onto either end; this is actually a denatured molecule, which is a precursor to the amplification step. So you take your single-stranded, adapter-ligated library, where the yellow portions are the DNAs you care about and the different colors at the ends are the platform-specific adapters, and you do a step called emulsion PCR. This is the amplification step I talked about earlier, and it's a really unique way of doing it: all of the PCR occurs in an emulsion of oil and reagents that are in the aqueous phase. If you can see this well enough, what we try to achieve in emulsion PCR is a micelle, the sort of clear area in the center of the picture, that contains a bead, shown here, for amplification. The little squiggles on the surface of the bead represent the sequences complementary to the adapter that you ligated on in your library-construction step, and for each amplification, ideally, we have only a single library fragment molecule in association with the bead. What you can't see in this aqueous micelle, of course, are the basic building blocks of DNA, as well as the DNA polymerases, that are going to effect the amplification of that single fragment on the surface of the bead during the PCR temperature cycling. And of course we aren't doing just one bead at a time; we're doing several million beads at a time, in a single microtube, in this oil-and-water emulsion called emulsion PCR.

Okay, so you go through the PCR cycling steps, and what you end up with is this micelle containing a bead with lots and lots of copies of the original single fragment on its surface. You then go through a series of steps, which I won't show, to break the emulsion, separating oil from water so you can extract the beads, free of the oil and ready for deposition into the sequencing plate used by the 454 sequencer. That process is shown here. For the 454, the PicoTiterPlate is literally the glass structure that serves as the flow cell, and a diffusion-mediated process occurs on the upper surface of this PicoTiterPlate.
So we're depositing these DNA-containing beads, cleaned up from our emulsion PCR, down into the wells using a centrifugation step, and these wells are about the right size so that just one bead fits. They won't all ultimately be filled, but most will be filled with a single bead that's going to provide a sequencing reaction. The upper surface of the flow cell, as I mentioned, is where the reagent flow occurs: we flow reagents across, allowing them to diffuse in and out of the wells, to drive the sequencing process. Meanwhile, the other side of the PicoTiterPlate is the business side for imaging. This is optically flat, optically clear glass that sits right up against a very, very high-sensitivity CCD camera, which is literally going to record the light flashes from about a million sequencing reactions as they all occur in lockstep.

To do this pyrosequencing reaction, we also need some helper beads, the little brown beads that are added in; they nestle down around the larger bead carrying the DNA. Their purpose is that they're linked to one of two enzymes, sulfurylase and luciferase, that effect the sequencing readout, as I'll describe in just a moment. They need to be down in the mix so that all of the reactions can take place and the light can be produced when a base is incorporated.

So let's look quickly at the sequencing-by-synthesis steps on the 454. We're going to imagine we have one of our DNA capture beads here, with one of the millions of copies hanging off of it imagined right here, and this large gray blob is the DNA polymerase, now seated at the annealed primer and ready to go. We then add in the first nucleotide, and this is a T. The first four nucleotides in this process are always the same, because they are determined by the sequencing adapter, and as you'll see there's one A, one C, one T, and one G. This is the so-called key sequence, which tells the downstream interpretation software what a single-nucleotide incorporation looks like. Why is that important? Because what we're flowing across the surface of the flow cell are native nucleotides, so where you run into, say, four A's in a row, all four of those A's are going to get incorporated at once. There's no stopping A by A by A; all four go together, and the output of light is effectively four times as high as for one nucleotide. You wouldn't know that if you didn't have the key sequence at the beginning for your software to use to gauge all of the other incorporation cycles.

So when this T gets incorporated, what happens? Well, we all know this basic polymerase biology: a pyrophosphate moiety is released, and it goes through a series of downstream reactions catalyzed by those enzymes on the helper beads, with light as the output. That light is detected by the CCD camera, which knows the positions of all the wells that emitted during the first four key-sequence flows, and it records that, cycle by cycle by cycle. We run these cycles for several hundred iterations to generate the read lengths obtained from the sequencing instrument; a toy version of how those light signals get converted into base calls follows.
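As an illustration of that interpretation step, here's a sketch of how flow signals might be converted to homopolymer calls; the signal values and flow order are made up, and a real 454 pipeline does considerably more normalization and noise modeling than this.

```python
def call_flow_values(flow_signals, key_signal):
    """454-style flowgram interpretation sketch: the light emitted in
    each nucleotide flow is proportional to the number of identical
    bases incorporated, so dividing by the single-base key signal and
    rounding gives the homopolymer length for that flow."""
    return [round(signal / key_signal) for signal in flow_signals]

# Hypothetical signals: the key sequence calibrates one incorporation
# at ~100 units, so a later flow reading ~400 implies four bases at once.
key_signal = 100.0
flows       = [98.0, 103.0, 0.4, 396.0, 101.0]   # made-up flow readings
nucleotides = ["T", "A", "C", "G", "T"]          # made-up flow order

sequence = ""
for n, base in zip(call_flow_values(flows, key_signal), nucleotides):
    sequence += base * n
print(sequence)  # -> "TAGGGGT"
```

You can also see from this why long homopolymers are the weak spot: the rounding step has to distinguish, say, a six-fold from a seven-fold signal, and the noise grows with the homopolymer length.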
I present for you here a trajectory of improvements in the 454 instrumentation since its introduction in 2005. You can see there have been increases in read length, such that with this latest FLX+, which we've been testing in our laboratory over the last couple of months, you can now get close to Sanger capillary read lengths out of this technology: about 650 to 700 bases, with about one gigabase of data per run, and the run takes on the order of about 20 hours to complete, so it's still an overnight run. The error rate is about one percent, so you get 99% accuracy out of any given read, and we know that when there is an inaccuracy, it's typically an insertion-deletion type of error. These typically occur at homopolymer runs, like the four-A stretch I pointed out to you: where a run exceeds six or seven nucleotides of the same identity in a row, you basically max out the detection on the CCD camera, and you can no longer make that correlation back to the key sequence I was talking about earlier. So that's a deficiency, but you can typically make it up with what we call coverage, which means you never sequence through just once; you have multiple molecules that include that homopolymer run, and the more you sequence it, the more certain you are whether it's six versus seven nucleotides of the same type, for example. So that's one way to get around the insertion-deletion error model here. The other advantage of this platform, which I think I've pointed out here, is that it's a great platform for targeted validation where you're looking for single-nucleotide changes, because with the nucleotides flowed one at a time, you almost never see a substitution error. The substitution error rate is extraordinarily low, so if you're looking for a specific base in a PCR product or whatever, you can almost always detect that it's there, without worrying that you're seeing some sort of platform-specific error.

Okay, let's shift gears now to the Illumina platform. This was roughly the second platform introduced, originally marketed by Solexa. Do you have a question? [Question from the audience, inaudible.] Yes, that strand has been amplified; there's a 3' hydroxyl available for extension, if you will, and it goes and goes.

Okay, so the Illumina technology. Again, note the similarities with what we've already been discussing. DNA is fragmented; here we blunt the ends, because they tend to be ragged (we did that in the 454 process too, I just didn't point it out); we then prepare the ends and add an A overhang, all enzymatic steps that take place in quick succession. You ligate on the adapters, utilizing that A overhang to get them ligated, then do a quick cleanup step, and a sizing step if you're interested in very definitive fragment sizes from the library, which we usually are, just for uniformity's sake, and you're good to go. This is a very straightforward process, as I mentioned earlier, and in our laboratory, because of the need to sequence thousands of samples at a time, in some cases for very directed sequencing projects such as case-control cohorts, we've automated it to a very large extent, to where we can produce on the order of 1,000 Illumina libraries on a weekly basis with just a technician and a fleet of small, inexpensive pipetting robots. So this is very automatable and works very well.
In Illumina sequencing, the amplification of the library fragments occurs on the surface of a glass flow cell. Each of these technologies has its own nomenclature for the device that does the sequencing; in Illumina's case it's the flow cell, and again you can see very shared characteristics here. The surface of the flow cell is decorated with the same types of adapters that are put onto the Illumina library fragments, and this provides a point of hybridization for the single-stranded fragments. The amplification step is what's called a bridge amplification, where the DNA molecule bends over and encounters a complementary second-end primer, and the polymerase essentially makes multiple copies in one place. The result, several hundred million times over on the surface of the flow cell, is this collection of fragments, which is called a cluster. When you image a cluster during sequencing, it looks like this very bright little dot here, and when you image a portion or all of the flow cell, it begins to look like this star field, if you will: as the incorporated fragments are scanned by the laser, they emit light at a particular frequency, and that's detected by the camera coincident with the laser scanning. And I should point out that another process takes place when you go to read two of the paired-end read: you essentially wash away all the strands you've already synthesized, go through another round of amplification, change the chemistry to release the other strand for sequencing, and effectively copy that strand in the other direction from the way the first was copied in the initial go-round.

Okay. So how does the sequencing chemistry work? This is fundamentally different from the 454 chemistry I showed you, in a couple of ways. First of all, we supply all four nucleotides at each step of the stepwise process, and the way these nucleotides are designed is very specific: we can have all four in the mix at once because each one, A, C, G, and T, has its own unique fluor, so each reports at a specific wavelength when scanned by the laser, and you get the base's identity back just from the wavelength interpreted by the instrument's camera. In addition, at the 3' end, where normally you would have a hydroxyl available for the next base incorporation, there's a chemical block in place, and that block doesn't allow you to incorporate another nucleotide until you've gone through the detection step and then a de-blocking step that removes it and restores a hydroxyl, ready for the next round of incorporation. In a subsequent step, the fluorescent group is also cleaved off, because it has a labile connection here; removing the fluorescence means you don't get any background noise, if you will, in the next step of incorporation.
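Conceptually, the per-cycle readout reduces to picking the brightest of four channels for each cluster. Here's a toy sketch with invented intensity values; real base callers also correct for optical crosstalk between the dyes and for the phasing effects discussed next, so treat this as the cartoon version.

```python
import math

def call_base(intensities):
    """Toy four-channel base call for one cluster at one cycle: each
    nucleotide carries its own fluor, so the brightest channel names
    the base.  A crude quality score reflects how cleanly the winner
    beats the runner-up."""
    channels = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (base, best), (_, second) = channels[0], channels[1]
    purity = best / (best + second) if best + second else 0.0
    qual = int(-10 * math.log10(max(1.0 - purity, 1e-4)))
    return base, qual

# Hypothetical cluster intensities for one cycle.
print(call_base({"A": 1480.0, "C": 90.0, "G": 120.0, "T": 60.0}))  # ('A', 12): clean call
print(call_base({"A": 700.0, "C": 620.0, "G": 80.0, "T": 50.0}))   # ('A', 3): ambiguous call
```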
Now, what I just described is absolutely not true 100% of the time, because of one fundamental rule of life: if you don't walk out of here today with anything else, you have to understand that chemistry is never 100%. You probably learned that in college, and it's also true here. So two things can go wrong in particular. The block may not be there, because it wasn't synthesized correctly; some proportion of the molecules in the mix won't have a block, and those will incorporate, let's say for the sake of argument, two nucleotides instead of one in a single step. That puts the strands that incorporated two nucleotides instead of one so-called out of phase with the rest of the molecules in that cluster, and since this happens repeatedly throughout the 100 cycles of a read, you accumulate noise from molecules that are out of phase with the others. This is the source that limits read length on this particular instrument. Of course, the other thing that can go wrong here is twofold again. Once more because chemistry is never 100%, a nucleotide might not have a fluor on it, so you can't detect the molecule that's been incorporated; but this is the beauty of having several million copies, because that's only one, or two, if you will, out of several million, so it's not necessarily a bad thing. But if that cleavage step doesn't occur, the worse outcome is that the leftover fluor interferes with the signal coming from the next go-round. Again, these are cumulative processes: they can happen many times over the course of the sequencing run, they produce noise, that noise can cause errors, and it ultimately limits the read length as well. So always be skeptical about how well each of these steps works, because there are places where they fall apart. The little simulation below gives a feel for how quickly out-of-phase molecules accumulate.
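Here's a tiny simulation of that accumulation, under an assumed (made-up) 0.4% per-cycle rate of missing blocks; the exact numbers don't matter, but the compounding does.

```python
import random

def simulate_phasing(cycles=100, strands=10000, p_no_block=0.004):
    """Toy phasing simulation, assuming a hypothetical 0.4% per-cycle
    chance that a strand's 3' block is missing and it jumps one base
    ahead of the rest of its cluster.  The in-phase fraction decays
    every cycle, which is one reason read lengths are limited."""
    ahead = [0] * strands           # how many bases each strand is ahead
    in_phase = []
    for _ in range(cycles):
        for i in range(strands):
            if random.random() < p_no_block:
                ahead[i] += 1
        in_phase.append(sum(1 for a in ahead if a == 0) / strands)
    return in_phase

frac = simulate_phasing()
for c in (1, 25, 50, 100):
    print(f"cycle {c:3d}: {frac[c - 1]:.1%} of strands still in phase")
# With these assumed numbers, only about two-thirds of the strands remain
# in phase by cycle 100, so the cluster's signal gets progressively muddier.
```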
Okay. Illumina has been, I think, in my opinion, a pretty remarkable company, also just in terms of the amount of data produced by these instruments. I don't have a comprehensive listing here, but the early Solexa instrument basically produced about a billion base pairs of single-ended reads per run. The newer iteration, once Illumina took over, was the GA IIx. Then, in early 2010, we encountered the HiSeq 2000, which ran two flow cells coincident with one another and produced on the order of about 200 gigabases per run in about an eight-day period. And the most recent version of the HiSeq chemistry, which was announced, released rather, in July of this past year, yields the equivalent of six genomes per run of about 10 to 11 days, as I mentioned. So this is a remarkable jump up in the quantity, and actually also the quality, of the data that results, and in our laboratory this is the primary sequencing instrument we're using. There's also a newer instrument called the MiSeq, which I'll talk about in a little bit; it's a personal genome sequencer, if you will, a desktop instrument at much lower scale, as you can see from the numbers here, but I'll get to that in just a moment. And the last thing to mention on the Illumina technology: as I said, there were some recent announcements in January about improvements to the different sequencing platforms, and Illumina's announcement was the HiSeq 2500, which is basically just a modification of the HiSeq instrument (in fact, you'll be able to upgrade the instrument itself). It will produce less data, about 120 gigabase pairs per run, but the run requires only about 25 hours to complete, so now you're talking about the rough equivalent of covering a human genome in a one-day period, in terms of generating the data. We don't have those instruments yet, though we're beginning to look at data from them; I should point out that nobody has them yet, so they're somewhat vaporware at this particular point in time. We shall see.

The error rate? Yes, thank you, I forgot to mention that. The error rate has been improving pretty dramatically over time. Originally we were at around a 1% error rate on this platform when we first started working with it; the recent version 3 chemistry I mentioned is down around 0.3% error on the reads. We're also seeing much better coverage across G+C content: in the past, very high G+C sequences, 95% and higher, did not represent well in the Illumina data set, and that's now been addressed by some changes to the chemistry that we saw with the version 3 release. That has improved coverage overall on the genome as well, which has been a relief, because it was pretty easy to see; we published a paper in 2008 that actually showed you couldn't see these sequences.

Okay, so the third large sequencing technology for human sequencing and other applications is from the company Life Technologies. This is a different beast: sequencing by ligation, versus what we've been talking about so far, which is sequencing by polymerase. We use a custom adapter library, as I mentioned, and this is also an emulsion-PCR-based sequencing instrument; Life Technologies actually has some nice modular equipment you can buy to facilitate the emulsion PCR steps, because those steps are pretty manual if you don't have that instrumentation, and subject to some errors and failure points, and those modules seem to help. So this is sequencing by ligation. The bottom line is that we have fluorescent probes, about nine bases in length, with very defined content, and we also prime from a common primer once we have these emulsion PCR beads to sequence from. So rather than sequential rounds of base incorporation, this is sequential rounds of ligation: a probe is ligated to the primer, a detection step takes place, there's some enzymatic trimming of the n-mer that was added on, and then a second round of ligation follows, et cetera. We also go through sequential rounds of primers: when your first sequencing primer sits down on the adapter at the n-0 position, you sequence bases 5, 10, 15, 20, and so on; in the second go-round, your primer sits down at n-1 and you sequence bases 4, 9, and so on. So it's sequential and sequential, if you will. The beauty of this approach is that, in the design of the ligated n-mers, you effectively have the first two bases identified, with a specific known sequence corresponding to the fluorescent group on that particular n-mer. So why is that important?
Well, what it means is that you're effectively sequencing every base two times, because those first two nucleotides are fixed and their sequence is known from the fluorescent group that's there. Have a look at this particular little diagram (and if you can't see these well, on your slides or on mine, for any of these approaches, I would suggest very strongly going to the manufacturers' websites, because they have extraordinary, sometimes animated, visual aids to help you understand the unique attributes of their sequencing processes). Effectively, by interrogating two bases per cycle on each incorporation, you end up overlapping the bases read from the fragment, and so you sample every base twice. I like to use a common analogy for this: when you write a check, you write in the number of dollars and cents you want to pay, and then you write it out a second time in longhand, so there's that ability to cross-compare, if you will, to make sure you've gotten it right. And this yields an extraordinarily high, or rather extraordinarily low, substitution error rate for this technology. I'm going to have to change that analogy, because not many people write checks anymore, but so far it's still working, I guess.

So SOLiD is the name of the platform. These are the two most recent versions, with the 5500xl introduced just last year, and these are some of the attributes, as you can see here. The error rate is extraordinarily low; there's this front-end automation; and a six-lane FlowChip actually allows you to use some lanes and not others, which ameliorates the need to load everything up at once if you don't have enough samples, et cetera. And the very high accuracy, as I mentioned. They're also introducing some new primer chemistries, which I'm not sure are actually out yet, that would push the accuracy even higher, so that could be, I think, very interesting. To give you a feel for how two-base encoding works in practice, a small decoding sketch follows.
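Here's a small sketch of the color-space idea. With A, C, G, and T numbered 0 through 3, the commonly described encoding assigns each adjacent base pair the bitwise XOR of the two base codes, so a known first base lets you decode an entire read; the example sequence is arbitrary. It also shows the flip side: in raw color space, a single miscalled color corrupts every downstream base, which is why the double interrogation from the offset primer rounds matters so much.

```python
BASES = "ACGT"

def encode_colors(seq):
    """Sequence -> list of colors, one per adjacent base pair, using
    the XOR of the two bases' 2-bit codes."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Known first base + colors -> base sequence.  Note how one wrong
    color would shift every downstream base, which is why color space
    demands its own error-aware alignment."""
    codes = [BASES.index(first_base)]
    for c in colors:
        codes.append(codes[-1] ^ c)
    return "".join(BASES[c] for c in codes)

colors = encode_colors("ATGGCA")
print(colors)                      # -> [3, 1, 0, 3, 1]
print(decode_colors("A", colors))  # -> "ATGGCA"
```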
Okay, so let's shift gears now, away from the next-generation, massively parallel instruments, to what are commonly referred to by some as third-generation sequencers. More than anything, this label just denotes a time point, around the beginning of last year, when these sequencers started to hit either early access or the marketplace. These include the Pacific Biosciences sequencer, the first true single-molecule sequencing instrument I'm going to talk about, which basically marries nanotechnology with molecular biology (we'll see what the specific attributes are for that system); the Ion Torrent system, a variation on pyrosequencing that instead of detecting light detects changes in pH, because not only is pyrophosphate released when you incorporate a base, but a hydrogen ion is too, so you can effectively monitor, base by base, whether incorporation has happened by monitoring pH with a little modified semiconductor apparatus; and the MiSeq, which I've already mentioned, really just a scaled-down version of the HiSeq with very great similarities in process, chemistry, et cetera. These all offer some shared attributes as well: faster run times, as we'll talk about; lower cost per run; a reduced amount of data generated relative to the second-gen, or next-gen, platforms I talked about; and also, as some are touting, the potential to address genetic questions in a clinical setting because of the low cost and speed with which results can be returned from these instruments. I probably won't talk about that, but during Q&A we can address it if you're interested.

So I put these in with the other systems just to place them in context, along the lines of what I just talked about: the different detection methodologies and how the libraries are made. You can see here, for the Pacific Biosciences instrument, that we don't really require any of the concerted amplification steps, although in the true sense of full disclosure there are some PCR steps involved as you make the libraries with the specific adapters. And the run times, as you can see, are quite low: 45 minutes, two hours, and on the order of about 19 hours for the MiSeq platform.

So let's go through these step by step. I put the PacBio first because it was the first third-gen instrument we received in our laboratory, and you can see, again, very common shared steps relative to the next-gen platforms. The sample prep requires shearing, polishing of the ends, and ligation of a specific adapter called a SMRTbell; I'll show you in the next slide exactly what that is. SMRTbell is obviously just a marketing name, but it's a clever little adaptation. Then sequencing-primer annealing takes place, you actually bind the DNA polymerase to the library molecules first, and then you introduce them onto the surface of the SMRT Cell, which is essentially a little nano-device containing on the order of about 150,000 so-called zero-mode waveguides (I'll tell you what those are in just a moment). You effectively image half of the chip at a time, so you look at the possibility of sequencing about 75,000 single-molecule reactions; then the imaging switches, the chip physically moves on the platform, and you image the other 75,000 zero-mode waveguides. That's one run of the SMRT Cell, which is what this little device is called; this just shows that you're sequencing first one half of the SMRT Cell and then the second half.

Now, the reason it says "movies" here is that this is a true real-time sequencer. What you're doing in this instrument is this: you have the DNA complexed with a DNA polymerase, you provide it with fluorescent nucleotides, and once the complex nestles down at the bottom of a zero-mode waveguide, you use the camera, optics, and laser system to effectively watch every one of those DNA polymerases in real time as it accepts fluorescent nucleotides into the active site, which is where the optics are focused. A nucleotide samples in; if it's not the right one, it floats out too quickly to be detected, in the ideal world; if it is, it sits there long enough to be detected, gets incorporated, the fluorescent moiety diffuses away in the process of incorporation, the strand translocates, and you watch the next nucleotide float in. This is the source of errors, et cetera. The difficulty of single-molecule sequencing, as opposed to amplified-molecule sequencing, is that you have one shot to get it right, and there are a variety of error sources highly unique to single molecules that don't occur with amplified molecules: dark bases; nucleotides that sample the active site long enough to be detected but are not the right nucleotide and don't get incorporated; and multiple nucleotides incorporated too quickly to distinguish the individual pulses that result.
The SMRTbell is shown here, and you can see the source of the name: it's essentially just a DNA lollipop, if you will, that gets adapted onto the ends of double-stranded fragments and gets primed with a primer, so the sequence is known. The adapter is complementary at its ends but not in its middle, so it forms that open, single-stranded portion of the molecule; when a denaturing influence such as sodium hydroxide is applied, the molecule opens up and becomes a circle. The beauty of this is that for very short fragments complexed with the DNA polymerase, in a 40-or-so-minute run time you can travel around the circle multiple times, sampling both the Watson and Crick strands and going through the adapter each time. Then, in bioinformatics space, you can align all of those reads and end up with a much higher-quality consensus sequence for that short fragment. So that's one application of the PacBio. If you want to run the sequencing out longer, you take these VLRs, as we call them, very long fragments, also with the SMRTbell adapter, and you essentially just sequence as long as you can during that 45-minute movie. If you're wondering, the read length here is really limited by the data storage capacity for the movies, because they're actually very, very large; the movies themselves never get stored, because that wouldn't work storage-wise, but they get converted very quickly into a down-sampled data file that the instrument software then operates on to do the base calling.

So, as you'll see in the next slide, these can be very, very long reads, as we've experienced them. We're really out into outer space in terms of read length compared to everything I've shown you so far, and compared to capillary sequencers: with the latest chemistries, we're looking at read lengths that average about 3,500 base pairs for these VLR libraries, and you can see that some are actually very, very long, though those are the outliers in this distribution curve, if you will. And here's some exemplary data, down here, from a 45-minute movie, where you're looking at about 8,000 nucleotides in length at the extreme of that distribution curve, and you can see there were also some failed sequences. That all sounds great, but because of the sensitivity of single-molecule sequencing to errors, as I just mentioned, the error rate on this platform is still quite high, about 15 percent: 15 out of 100 bases are incorrect, and most of those errors, confoundingly, are actually insertion errors. This makes alignment of the reads back to a known template or genome reference difficult to do, and it also complicates the ability to assemble these reads, but I'll show you some of the ways we're trying to address that here in just a minute or two. One of those ways, the circular-consensus idea I just described, is sketched below.
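Here's a toy version of that consensus idea, with invented subreads that are pre-aligned and of equal length for simplicity; real circular-consensus analysis must align the passes first, precisely because the dominant errors are insertions.

```python
from collections import Counter

def circular_consensus(subreads):
    """Toy circular-consensus sketch: with a SMRTbell, a short insert
    is read many times in one movie, and a simple per-column majority
    vote across the passes can turn a high-error instrument read into
    a far more accurate consensus."""
    consensus = []
    for column in zip(*subreads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Hypothetical passes over the same insert, each with scattered errors.
passes = [
    "ACGTTAGC",
    "ACCTTAGC",   # error at position 3
    "ACGTTGGC",   # error at position 6
    "ACGTAAGC",   # error at position 5
    "ACGTTAGC",
]
print(circular_consensus(passes))  # -> "ACGTTAGC"
```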
Shifting now to the Ion Torrent: this was released last year, and it's the pyrosequencing-style approach that doesn't use light but rather the release of hydrogen ions. Again, this is very similar to pyrosequencing; the person behind the 454 pyrosequencing instrument is also the person behind the Ion Torrent, so you won't be surprised. This is bead-based and amplification-based: you're sitting in a well here on a semiconductor chip, basically on top of a sensor plate, and when the hydrogen-ion release occurs at each flow of a nucleotide (this is the same sort of scheme as the 454 approach), the released hydrogen impacts the detector there, which senses the change in pH. One potential advantage is that the linear dynamic range of a pH meter is much better than that of a camera, so you could have better sensitivity to long homopolymers. We haven't really seen that yet, but in fairness to the instrument, these are very early days. We should also see a lower substitution error rate on this instrument as well. It has been on a trajectory of improvements over time since we first received the instrument at commercial release in, well, it was actually early this past year, in 2011, and these chips are successively bumping up the yield per run of the sequencer; keep in mind this is about a two- to three-hour run time. We're just now experimenting with these 318 chips, where the read length is 200 base pairs; still not paired-end reads, but they're working on that. So this is a 200-base-pair read from a fragment, and they're hoping in this calendar year to go up to 400 base pairs. I should also point out that Ion Torrent, Life Technologies, is one of the other companies that announced a new version of its instrument at the JP Morgan meeting in January: the Ion Proton, which, using the same technology and a somewhat larger chip with more wells, et cetera, will move you to this mythical thousand-dollar genome in a 24-hour period on a completely different instrument that's relatively low cost. Again, vaporware; we have to make sure it actually comes to fruition, but they're projecting that toward the end of 2012 this would actually be available for coverage of a whole human genome: a thousand dollars, a 24-hour period.

This just shows some of the work we've been doing on this platform. You can see the different chips listed here; these are all bacterial genomes we're working with, Enterococcus faecalis or Escherichia coli. These are different emulsion approaches, manual versus the now scaled-down modules that were essentially the same as those developed for the SOLiD technology I mentioned, and some enrichment approaches that can be automated as well. And of course you'll immediately see that kicking in the automation bumps up the output of this sequencer inordinately, so it's probably worth the money for the automation to get that kind of yield increase; just my personal opinion.

Now, last week was the Advances in Genome Biology and Technology meeting, which I've organized for 13 years now, down in Florida, and which is really the showcase for sequencing technology. At that meeting we had a late-breaking abstract from this company, Oxford Nanopore, on their nanopore sequencing device, which should, if it again is real, revolutionize everything I've talked about today. We have to be skeptical, however; we are scientists. This company could use two different processes for sequencing through nanopores, and it is not a trivial technological feat, I might point out: many have been pursuing nanopore sequencing at the academic level for well over 15 years now, with no tangible commercial success having resulted. So if these guys can do it, they're really bucking the trend, but we'll see. There are two flavors, if you will, being proposed by this company. The one that wasn't talked about at the meeting is this exonuclease-aided sequencing, where you can see a lipid bilayer here, a nanopore with a sensor of some sort, and an exonuclease poised right at the top, which would, of course, routinely and uniformly cleave off the DNA bases of the strand shown here, and an electrical field would suck them through the pore and detect them one at a time, in a neat and orderly fashion. And if you can detect the sarcasm in my voice, you can imagine the ways in which that might go wrong.
That remains to be seen, because I don't think biology is always that neat and orderly, for starters, but I am a skeptic, as you can probably tell by now. The type of sequencing that was talked about as being near-term, with a device available in the second half of 2012 that looks, for all intents and purposes, like the data stick I loaded my talk from today, is this pore-translocation sequencing approach. Here you have a double-stranded molecule held by an enzyme, probably a helicase, that essentially renders it single-stranded and preferentially feeds one strand down through the pore, with a tether on the other end to make it behave. One of the biggest problems with nanopores is that the DNA looks nice and straight in these beautiful pictures, but of course its tendency to form secondary structure, or even base-stacking interactions, sometimes puts the kibosh on it actually going through the pore in a nice, orderly fashion. These guys have apparently solved that problem: they're reporting read lengths of hundreds of thousands of bases, having sequenced through lambda in its entirety, for example; all highly purified DNAs, I might add. So it remains to be seen. And their big version of this (you should check out their website if you're intrigued) looks, for all intents and purposes, like a server rack in a data center; this would be the consumable you put into each one of those server modules, if you will, to perform large-scale sequencing, with two to eight thousand nanopores per membrane going into it. You add on your DNA, you push the button, and you walk away; the DNA complexes with the enzyme, they find the nanopores, and the sequencing begins. I should point out that this is all detection by current fluctuation: as the DNA translocates through the pore, the current fluctuation differs depending upon the base triplet occupying the pore at the time. This group has studied that extensively for every triplet, and they supposedly have a hidden Markov model of what each triplet looks like, so you can infer the DNA base sequence using that hidden-Markov-model base caller. A toy version of the idea appears below.
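As a cartoon of that inference, here's a greedy decoder over a made-up table of per-triplet current levels. A real base caller treats the levels as noisy emissions and runs a proper HMM decode, such as Viterbi, over them; this sketch just shows the key structural constraint, namely that successive triplets overlap by two bases.

```python
import itertools

BASES = "ACGT"
# Hypothetical current-level table: one made-up level per triplet.
LEVELS = {"".join(t): 40 + 1.37 * i
          for i, t in enumerate(itertools.product(BASES, repeat=3))}

def decode(currents):
    """Greedy nearest-level decoding: the first triplet is the one
    closest to the first current reading; each subsequent triplet must
    begin with the previous triplet's last two bases, because the
    strand ratchets through the pore one base at a time."""
    seq = min(LEVELS, key=lambda t: abs(LEVELS[t] - currents[0]))
    for level in currents[1:]:
        candidates = [seq[-2:] + b for b in BASES]
        best = min(candidates, key=lambda t: abs(LEVELS[t] - level))
        seq += best[-1]
    return seq

# Simulated observation: true levels plus a little noise, far below
# the (made-up) spacing between adjacent levels.
true_seq = "ACGTAC"
observed = [LEVELS[true_seq[i:i + 3]] + 0.2 for i in range(len(true_seq) - 2)]
print(decode(observed))  # -> "ACGTAC"
```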
Okay, so that's that. Let me spend just the last five or ten minutes here talking about some applications of next-gen sequencing. I've made a perfunctory, probably incomplete list here, and I would refer you to an older review that I wrote, and also to the Nature paper I mentioned at the beginning of the talk, which is a little more up to date. And not just to toot my own horn — there are lots of reviews out there on what people are doing with next-gen sequencing; I just didn't list them all on the slide, but they're pretty easy to find. So let's talk through a few of these examples, and I'll show you some papers from the literature on our work.

One of the things that we've been doing a lot, and really pioneered in many ways, is whole-genome sequencing. These slides just show the Illumina and SOLiD 5500 instruments, because these are really the high-throughput, here-today whole human genome sequencing instruments on the marketplace. The stepwise progression to generate a whole genome is actually pretty simple — it wasn't when we first started trying to do this, mind you, but I think we've got it down to an art at this point, one that's highly reproducible, and as you'll see in a minute, we do hundreds of whole genomes a year. You prepare the paired-end libraries using the processes I've walked you through, and you produce paired-end data to about 30-fold coverage — about 100 gigabases of data per 3-gigabase haploid human genome is sufficient. You can go deeper, but this is an economic decision in some ways, because this is of course the most expensive application for the human genome. Then you use computer programs to align the read-pair sequences back onto that human reference sequence we generated in the early 2000s, and we use different algorithms, as I'll show you, to discover variants genome-wide of all types. We first published the initial description of a whole-genome tumor-normal comparison in 2008, in this Nature paper with Tim Ley, our AML collaborator, using — back then — the Solexa technology: 32-base unpaired reads, a gigabase at a time. It took us about 90 runs total on six sequencers to produce the full equivalent of the tumor genome. At the time, we couldn't get any funding to do it, so we went to a private donor, who contributed a million dollars to the project — which, when all was said and done, we figured probably cost us about 1.6 million, because at that point none of the bioinformatics I'll tell you about had ever been developed. We've "kind of scaled this up" since then, I say tongue in cheek. These are the numbers of tumor genomes that have been sequenced — a fairly out-of-date look; you can see an up-to-date one if you check our website. Keep in mind that each one of these cases reflects, at minimum, a tumor-normal pair sequenced by whole-genome methods. You can see that most of our work has been in AML and breast cancer, and we also have a very large body of work now entering the publication phase with the St. Jude pediatric cancer project, which just reported its first three papers — two in Nature and one in Nature Genetics — within the last three weeks. So this is now ramping up to the point where we have over 500 whole genomes sequenced.

As you can imagine, sequencing lots of genomes exacts a higher toll than just taking each one and finding all the somatic mutations. So we've developed software, available for download through our website or SourceForge, called MuSiC, which allows you to take all this information from genome sequencing and start to make sense of it across different types of functionality: significantly mutated genes, pathway-based analysis, mutation-rate analysis, looking out to databases such as COSMIC and OMIM to flag previously identified mutations, and also taking in any clinical data that might be available for those samples to do clinical correlation analyses against the mutations that are identified. So that's MuSiC. Most recently, we've moved forward exploiting the digital nature of the technology in a paper published in Nature on January 12th of this year, looking at patients at their initial presentation of AML and at their relapse, and basically showing that, unlike what was thought before, this is often an oligoclonal presentation of the disease. I mentioned earlier that we can map each one of these tumor mutations into a specific subset of the tumor cells that originally contributed DNA — showing, for this patient, that four subclones were originally present in her tumor. The effect of chemotherapy is to winnow away most, but not all, of those subclones; they acquire additional mutations, often through the DNA-damaging impact of chemotherapy, and come out on the other side as a flourishing, usually monoclonal, presentation that kills the patient.
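As an aside on the mechanics: that "digital" subclone analysis boils down to grouping somatic mutations by variant allele frequency (VAF) and reading each cluster as a population of tumor cells. Here is a toy sketch of the idea using a simple one-dimensional k-means — the counts are invented, and this is not the actual pipeline from the relapse paper.

```python
# Toy subclone analysis: cluster somatic-mutation VAFs and interpret each
# cluster as a subclone. Invented data; a 1-D k-means, not the real pipeline.
import random

def kmeans_1d(values, k, iters=50, seed=1):
    random.seed(seed)
    centers = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical VAFs from deep readcounts at somatic sites in one tumor sample.
# Founding-clone mutations sit near 0.5 (heterozygous, in every cell);
# subclonal mutations sit lower, in proportion to the cells carrying them.
vafs = ([random.gauss(0.48, 0.03) for _ in range(40)] +   # founding clone
        [random.gauss(0.30, 0.03) for _ in range(25)] +   # subclone A
        [random.gauss(0.12, 0.02) for _ in range(15)])    # subclone B
centers, clusters = kmeans_1d(vafs, k=3)
for c, members in sorted(zip(centers, clusters), reverse=True):
    # ~2 * VAF approximates the fraction of cells carrying the cluster,
    # assuming heterozygous, copy-number-neutral sites.
    print(f"cluster at VAF {c:.2f}: {len(members)} mutations, "
          f"~{min(2 * c, 1.0):.0%} of cells")
```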
We've also spent a lot of time developing software for read-pair analysis for structural variant detection. We've known since the 1970s that lots of translocations occur in cancer, for example, and this is one way of getting at them, using that read-pairing information I told you about earlier: you look specifically at how, and in what orientations, the read pairs map, and you can use this BreakDancer algorithm to interpret different types of structural variants, including deletions, insertions, inversions on the chromosome, and intra- and inter-chromosomal translocations. This is now quite widely used software, by many groups, for this type of interpretation. One of the things we often want to do once we've identified a structural variant is to understand the base sequence at that variant, so that we understand exactly where it took place, down to the nucleotide; you need an assembly algorithm to do that. This is just one example from our work, called TIGRA-SV, which allows us to assemble many, but not all, structural variants down to single-nucleotide resolution.

And here's an example of how we used that in a clinical case of a patient whose genome was sequenced. This patient presented, at pathologic examination, with acute promyelocytic leukemia, usually identified under the microscope, by cytogenetics, as a translocation between chromosomes 15 and 17. However, cytogenetics on this patient did not allow her to take the standard of care, which is all-trans retinoic acid consolidation, because her cytogenetics showed that she did not have the 15;17 translocation. So she was referred in to us, and we decided to do sequencing on her genome, because this was a very important consideration, as you can imagine: her referral to us was for stem cell transplant. That's because the cytogenetic examination did reveal complex cytogenetics, making her a high-risk patient, and transplant is the standard of care for high-risk patients. Nonetheless, it's very expensive — about half a million dollars, unless there are complications, and then it's more — with associated mortality and morbidity. If we could determine whether the 15;17 translocation was really there, we could allow her to avoid stem cell transplant and go back instead onto the normal paradigm of care. So this is now trying to augment conventional pathology and cytogenetics with an intermediary that involves whole-genome sequencing. Basically, what we found is shown here: rather than the translocation, this patient actually harbored an insertional mutation, if you will — a portion of chromosome 15 containing the PML gene had inserted physically into chromosome 17, producing the net PML-RARA fusion that is the hallmark of the normal 15;17 translocation. So the mechanism was different, but the net result was exactly the same. Using TIGRA, we were actually able to assemble this break and this break down to nucleotide resolution, to predict that the open reading frames of the proteins were conserved here, but not in the other fusion insertion products. This was reported in the Journal of the American Medical Association in the middle of last year, and you can see a diagram here from the paper that shows you the net result of her insertional change.
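To give a feel for the read-pair logic behind callers like BreakDancer described above, here is a schematic sketch: pairs whose mapped orientation or spacing deviates from the library's insert-size expectation flag a candidate structural variant. The thresholds and field layout are illustrative assumptions, not BreakDancer's actual implementation.

```python
# Schematic read-pair classification for structural variant detection.
# Assumed insert-size model and field names; not BreakDancer's real code.
from dataclasses import dataclass

MEAN_INSERT, STDEV = 400, 40          # assumed library insert-size distribution
MAX_NORMAL = MEAN_INSERT + 3 * STDEV
MIN_NORMAL = MEAN_INSERT - 3 * STDEV

@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str

def classify(rp: ReadPair) -> str:
    """Interpret one read pair the way a read-pair SV caller would."""
    if rp.chrom1 != rp.chrom2:
        return "inter-chromosomal translocation"
    span = abs(rp.pos2 - rp.pos1)
    if rp.strand1 == rp.strand2:
        # Both ends on the same strand: one end fell in an inverted segment.
        return "inversion"
    if span > MAX_NORMAL:
        # Ends map too far apart: sequence between them was deleted.
        return "deletion"
    if span < MIN_NORMAL:
        # Ends map too close: extra sequence was inserted between them.
        return "insertion"
    return "concordant (no SV signal)"

print(classify(ReadPair("chr15", 100000, "+", "chr17", 500000, "-")))
print(classify(ReadPair("chr1", 100000, "+", "chr1", 101200, "-")))
```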
So, just a few other illustrations here of using next-gen and third-gen sequencing. I mentioned the very long reads available from the PacBio. This is showing an Illumina assembly — all of these colored short reads — and we're spanning a gap in the assembly, where we lacked coverage, using some very long PacBio reads that were able to reach across that gap. This just shows the power of contiguating assemblies with the very long read technologies. There are actually now methods, published from Mike Schatz's lab at Cold Spring Harbor, where you can use the high quality of the Illumina sequence to improve the quality of the PacBio reads once you have them aligned. We're also using the Ion Torrent sequencer that I mentioned earlier, as are many people, to do rapid genotyping — specifically, small PCR products that can be quickly sequenced in a two-hour time frame and analyzed for the presence of mutations — and the company recently announced some defined-content oligos that will let you generate amplicons and then sequence them on this platform; that's also available for the MiSeq.

One last note here on hybrid capture technology. This is an offshoot that many people prefer to whole-genome sequencing; it came along a little later, but it's been advancing very rapidly. It's really just a very clever approach where you begin with your standard whole-genome library, but instead of sequencing it you take it through a hybrid capture step with a biotinylated probe set — in this example, the biotinylated probes represent most of the human exome, namely all of the exons present in the known genes of the human genome. You complex the library with the capture probes during a hybridization step, which gives you a biotinylated probe paired with a captured DNA molecule. Those can be separated away from everything that didn't capture using streptavidin magnetic beads that bind the biotin: apply a magnet, wash everything else away, and then release the bound fragments, which are already adapted and ready for whatever next-gen sequencing approach you'd like. So this is a way of downsampling complex genomes to look only at the genic content, and it has now been exploited beyond the exome with custom capture reagents that can be synthesized by these same manufacturers, so you can do the same approach over and over again. Here's an example from our work on a really hard target: Merkel cell polyomavirus, a very small genome, about 5 kb. This is a virus that inserts into the human genome, but you can't routinely go in by PCR and amplify it out, because it deletes parts of its own genome in specific but unknown places that are not uniform, and the site of integration is not known either. We published this last year in the Journal of Molecular Diagnostics with some of our pathology colleagues, looking at four different cases in this initial set, out of formalin-fixed, paraffin-embedded tissue. We were actually able to show the differences in the genomes as they were captured, relative to a full-length Merkel polyomavirus reference, and we were also able to use a bioinformatic approach called SLOPE to look at the paired reads and identify the exact junction fragment in the human genome where these viruses had inserted, predict which genes were being interrupted by them, and so on.
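Since capture experiments like these are usually judged by how much of the sequencing actually lands in the targeted regions, here is a small sketch of computing an on-target rate against a set of target intervals; the interval format, the bisect-based lookup, and all the positions are invented simplifications.

```python
# Minimal on-target rate calculation for a hybrid-capture experiment.
# Targets and read positions are invented; real pipelines work from BAM/BED.
import bisect

# Targets as sorted, non-overlapping (start, end) intervals per chromosome.
targets = {
    "chr1": [(1000, 1200), (5000, 5300), (9000, 9100)],
}
# Precompute interval start positions per chromosome for binary search.
starts = {c: [s for s, _ in ivals] for c, ivals in targets.items()}

def on_target(chrom, pos):
    """True if an aligned read position falls inside a capture target."""
    ivals = targets.get(chrom)
    if not ivals:
        return False
    i = bisect.bisect_right(starts[chrom], pos) - 1   # rightmost start <= pos
    return i >= 0 and ivals[i][0] <= pos < ivals[i][1]

# Hypothetical aligned read positions (chrom, pos) from the captured library.
reads = [("chr1", 1050), ("chr1", 4000), ("chr1", 5100),
         ("chr1", 9050), ("chr2", 100)]
hits = sum(on_target(c, p) for c, p in reads)
print(f"on-target rate: {hits}/{len(reads)} = {hits / len(reads):.0%}")
```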
Then just one last word on RNA sequencing, which turns out to be pretty important. If you take an RNA isolate, you can treat it in a multitude of different ways, some of which are shown here, adapter-ligate the resulting fragments, and perform next-generation sequencing. The analysis usually starts, like whole-genome sequencing, with alignment of the reads to a reference database and some discovery efforts, but there is a multitude of different types of analysis one can do on RNA, as opposed to DNA, because so many things happen to RNA: looking at expression levels; looking at novel splicing events, where exons are added or deleted; looking at allelic bias, for example, where one allele is preferentially transcribed over the other; and, in cancer, looking for known fusion transcripts in a cell population. All of these are possible using the right bioinformatics. I would say this is one of the most intriguing and tricky analysis problems we currently face, and we're working very, very hard on it. Just a couple of quick examples from our tumor sequencing — you can expect anything to happen with RNA. Here's a tumor-normal pair predicting a mutation, but when we look in the RNA-seq data, to very high depth, this gene isn't even expressed in the tissue at all; this is from the tumor. Here's an example of very allele-specific expression, where only the wild-type allele is being expressed, in spite of a very high prevalence of mutant reads in the tumor genome. And here's an example of a splice-site mutation showing alternative splicing that results from an acceptor-site mutation detected in the whole-genome data, verified by the mapping of RNA-seq reads across this region and by the links between mate pairs, which show the splice site being missed by the transcription machinery, and so on.
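For the allelic-bias example just mentioned, the underlying test can be as simple as asking whether the reads over a heterozygous site split roughly 50/50 between the two alleles. A minimal sketch, with invented counts and an exact two-sided binomial test written out from scratch:

```python
# Allele-specific expression check at one heterozygous site: under the null,
# RNA-seq reads split 50/50 between alleles. Counts below are hypothetical.
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    no more likely than the observed count under the null."""
    pk = comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1)
               if comb(n, i) * p**i * (1 - p)**(n - i) <= pk + 1e-12)

# Hypothetical RNA-seq read counts over a heterozygous site: 88 reads carry
# the wild-type allele, 12 the mutant, despite a ~50% mutant VAF in the DNA.
ref_reads, alt_reads = 88, 12
n = ref_reads + alt_reads
p = binom_two_sided(alt_reads, n)
print(f"{alt_reads}/{n} mutant reads, two-sided p = {p:.2e}")
# A tiny p-value says the mutant allele is transcribed far less than expected,
# the kind of allele-specific expression seen in the tumor example above.
```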
Another way this technology is being used is not just for human sequencing, but for looking at microbes. There's been a very large project funded by the NIH Common Fund, the Human Microbiome Project. I won't go into the details for you today, except to say that our medical center has been a collection site for healthy volunteers sampled across multiple body sites. Sequencing then takes place — you can do whole-genome sequencing or 16S ribosomal RNA sequencing — and the bottom-line question is: who's there, and how can we determine that by examining DNA sequencing data? This, as you can imagine, presents huge challenges in terms of bioinformatic interpretation, and it has been a major cornerstone of a lot of bioinformatic development. Here are just a couple of quick examples of what you can do with microbial sequencing. Looking at the stability of the virome over time, for multiple body sites sampled on different individuals, you can identify the viral types that are shared and similar between a person's first clinic visit and their second visit, according to these different body sites. This is one of the ways we can monitor healthy individuals and how their microbiome changes over time, and it gives us a beautiful baseline for understanding the impact of microbiome changes when a diseased individual, if you will, comes into clinic. I really think we did this project right by looking at healthy volunteers first and only now moving into disease. Here's just one example of that — which I think doesn't show up particularly well, but it shows the combined power of next-gen sequencing: all 400 of these MRSA genomes were sequenced on essentially one run of an Illumina sequencer, and then, using phylogenetic analysis, you can see that most of the 400 — these ones in green — conform to the common ST8 strain of MRSA, but there is a variety here that group together phylogenetically, with sequence differences that distinguish them from the ST8 subtype; they are also distinguished when we look specifically at the MLST subtyping.

So, I'm running out of time, and I won't have time to cover this last bit, but we are working very hard on envisioning a clinical sequencing pipeline, using some of the human analysis tools I've told you about today, for individuals who consent to the return of information about targetable therapies to help treat their cancers. We have some examples already published. I mentioned the JAMA paper: that individual with acute promyelocytic leukemia is now in remission, two and a half years after we sequenced her genome; she was treated with standard induction and consolidation therapy and is healthy, alive, and back at work today. We've also sequenced a number of patients now, many of them metastatic, including the patient shown here, who has HER2-positive disease. By sequencing her genome, we can detect the extreme amplification of HER2; by sequencing her transcriptome, we can show that this transcript is wildly overexpressed relative to ER-positive patients; and we can also do, if you will, conventional pathology with our RNA sequence data, showing that she's also PR-negative and ER-negative. Interestingly, this patient was already known to be HER2-positive; what we were looking for were some potential therapeutic options. You can see a similar picture here on chromosome 6: amplification in the extreme on the DNA, overexpression in the extreme on the RNA, and this turns out to be a druggable target — a histone deacetylase, which should respond to vorinostat — and her oncologist now has this at the ready should she progress out of her currently stable brain metastasis, which was her last data point in the clinic. We've done this for additional metastatic patients, and this is just to illustrate that, going through the genome, we can actually predict a large number of potential targets for each one of these patients. We return this information to the oncologist; it's then up to them to decide what's the best option for the patient. But the bottom line is that most of the prescribed therapies are going to be so-called off-label. I think this presents a challenge for the clinical paradigm, but one that we should start facing up to in the near term. And these are just some other examples of off-label drugs, now for estrogen receptor-positive disease, that were identified from sequencing through 50 patients in a clinical cohort trial from the American College of Surgeons Oncology Group, which we're just now getting ready to publish.
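The amplification calls described above — HER2, the chromosome 6 locus — rest on a simple readout: binned read depth in tumor versus normal. A bare-bones sketch with invented depths; the bin size and threshold are illustrative assumptions rather than any production pipeline's values.

```python
# Toy copy-number readout: compare binned read depth in tumor vs. normal and
# report bins with an extreme log2 ratio. All numbers are invented.
from math import log2

BIN_SIZE = 10_000          # genomic bin width in bp (assumed)
AMP_THRESHOLD = 2.0        # log2(tumor/normal) above this => amplification

# Hypothetical per-bin read depths around a gene locus (e.g. an ERBB2/HER2-like
# amplicon): the middle bins are massively amplified in the tumor.
normal_depth = [30, 31, 29, 30, 32, 30]
tumor_depth  = [31, 30, 420, 450, 435, 29]

for i, (t, n) in enumerate(zip(tumor_depth, normal_depth)):
    ratio = log2(t / n)
    call = "AMPLIFIED" if ratio >= AMP_THRESHOLD else "neutral"
    start = i * BIN_SIZE
    print(f"bin {start}-{start + BIN_SIZE}: log2(T/N) = {ratio:+.2f}  {call}")
```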
So I want to thank you for your attention and just leave you with these conclusions. Hopefully I've convinced you that these approaches are revolutionizing biological research. The earliest impacts have been on cancer genomics and metagenomics, but there are many other types of impact that I haven't been able to cover, just because it would be overwhelming. I also really want to emphasize — and this will make Andy happy — the extreme need for bioinformatics-based analytical approaches. This really is still a big challenge; it's getting better, but it's not quite there yet, and it is the most expensive part of the sequencing. So when people say sequencing is cheap, they mean one thing: generating the sequence data. I would say that the analysis is still the most expensive and complicated part. In that context, we're now looking at integrating across multiple data types for cancer patients, as I illustrated, and I think the clinical applications of these technologies are pressing — they're real, they're happening — and they really up the ante, if you will, on needing good bioinformatics, not just for analysis but for what I would call interpretation of the data, so that the physician ordering the test gets the maximum benefit back from having ordered it, to the ultimate benefit of the patient. I think that's the biggest challenge in front of us right now. I'll finish, then, by saying thanks to all of my colleagues back at the Genome Institute, and to our clinical collaborators, only a few of whom are listed here, without whom we could not do our work. I'll take any final questions in the remaining minutes. Thanks for your attention. And yes, you're welcome to come up and ask questions if you'd like, because I know people have to get on with the rest of their day. Thanks.