All right, good morning everybody. Let's go ahead and get started. First, a few course announcements, and those of you who are on the mailing list already know all of this. Since we had to cancel last week's lecture, we've made a couple of adjustments to the course schedule. At the end of the course we're now going to be tacking lectures on; we're going to skip the week of March 30th, because that's both spring break and the first day of Passover, and the lecture by Paul Meltzer on large-scale expression analysis, which was originally scheduled for today, will take place instead on April 6th. You can get the updated copy of the syllabus on the course's website. So even though we've got mountains of snow outside, we need to talk about mountains of sequence data today. And with that, it's my pleasure to introduce you to today's lecturer, Dr. Elliott Margulies. Elliott is an investigator in the Genome Technology Branch at NHGRI. His research program is focused on developing bioinformatic approaches for identifying and characterizing regions of the human genome that are evolutionarily conserved across multiple species. You've hopefully started to develop an appreciation for why this is important over the past several lectures, particularly from Dr. Green's introductory lecture in the first week, since these kinds of approaches allow us to identify important parts of the genome that code for genes or function as regulatory elements. His research program, like those of many of the lecturers in this course, employs both laboratory-based and computationally based approaches, with the goal of finding evolutionarily conserved sequences and determining their functional significance. Dr. Margulies has played a key role in the ENCODE Consortium, the effort aimed at compiling a comprehensive encyclopedia of all of the functional elements in the human genome, which Dr. Green described to you in week one. Elliott is also at the forefront of developing new methodologies intended to capitalize on next-generation sequencing technologies, positioning us to better address key questions both in genomic biology and in clinical medicine. His lecture today will be devoted to next-generation sequencing technologies, during which he'll provide a nice survey of the various platforms that are currently available and give you some examples of how these technologies can be used in practice. So please join me in welcoming today's lecturer, Dr. Elliott Margulies.

Thank you, Andy, for that kind introduction. Also, thank you, Paul Meltzer, for being so kind as to shift your schedule so that I could be fit in one week later. And thank you to the crowd, who are probably itching to get out of your houses and into a nice, fun lecture all about new sequencing technologies.
So I should start by saying this is actually an incredibly difficult lecture to put together, because normally, when you give lectures year after year, you look at your slides, adjust a couple here and there, and move on. A year or so ago I had devoted half of my current-topics lecture to new sequencing technologies; virtually all of those slides are now obsolete. One of the many things I would like to emphasize today is how rapidly this technology, and this field, is moving. What I would like to get across to you is not so much data points; sometimes people get focused on how many bases you get out of a particular technology, or what the most cost-effective way of doing something is. What I'd rather get across are the concepts: the way these methods work, the benefits of using one method over another, the rationale behind these methods, and how they can be employed, and hopefully give you an idea of the trajectory of where these methods are going. Just to give you an idea of how rapidly this field is moving: between last week, when I submitted my slides, and this week, I actually had to update some things for late-breaking information I found out over the last few days. I think part of this rapid pace right now is that there's a meeting happening next week related to new sequencing technology, so all of these companies are rapidly trying to get their latest, greatest technology out the door just in time for that meeting.

Okay, so here's a very brief overview of what we're going to talk about over the next hour-plus that we have together: I'll give you some background on why we actually want to sequence DNA, then we'll go over the technologies, and then get across some of the applications. So it's pretty much split into those three high-level areas. Okay, so we're starting a little philosophically here: why sequence DNA? It's an important question to answer, because we're talking about sequencing gigabases of DNA very cheaply and very fast, and why are we trying to do this?
Well, the first reason, as was mentioned before, is comparative genomics: DNA is the fundamental unit of heredity. This is something we've learned in basic biology. Obviously, getting the genome sequence of an organism is the first step in understanding it, and it creates the foundation on which to build all the biology we want to learn about a particular organism. A particular interest of mine is comparative sequence analysis: being able to compare all of these interesting organisms to each other to find regions of genomes that are candidates to be functional. These new sequencing technologies are starting to allow us to sequence many, many more genomes, to advance the field of comparative genomics and the algorithms we can develop there. But there are many other reasons to sequence DNA than just to understand a particular genome. A great example that I heard Eric Lander give equates DNA sequencing to personal computing: 20 or 30 years ago, when a computer was designed, it was designed for a very specific task. Now a personal computer is used for everything from a calculator to managing your photos, and this was never envisioned when we designed computers 30 or 40 years ago. Sequencing is becoming the same thing: a general-purpose tool to identify functional sequences and characterize genomes.

So one of the other ways we're now using sequencing is to look for variation among different organisms, or among different individuals of the same organism. One brief example I'll give here, which we'll actually talk about towards the end of my talk, is finding the differences between a tumor and the normal genome in an individual, looking for what are called somatic mutations. This has become very powerful, especially when you have a reference sequence like the human genome with which to align other human genomes; there, these new sequencing technologies become tremendously powerful. The other area I'll touch on a little as well is what we now call counting experiments, and a great review of these types of methods was written by Barbara Wold and Rick Myers, listed over here. These methods have become very popular: the so-called ChIP-seq, a way of finding the parts of the genome that your favorite protein binds to; something called RNA-seq, which is a way of accurately quantitating gene expression by sequencing the RNA of a particular cell or tissue; and also things like methyl-seq, which looks at the methylation status of various parts of the genome. I believe that Laura Elnitski, in her talk, will probably be touching on a lot of these areas in much more detail than I will; here, I'll focus more on how the technology can capture these bits of information. Okay, so that's why we would want to sequence DNA.
This is just a brief overview, going back to Eric Green's lecture, of the history of DNA sequencing. The bottom line is that over a hundred years ago some fundamental things were identified about DNA: first of all, the discovery of the molecule itself, then the understanding that it is the unit of heredity. Then, in the late '70s, Fred Sanger came up with the concept of using dideoxynucleotides in a very clever way, and developed the method that really popularized, and made it possible to commercialize, DNA sequencing, much like, I don't know if I'm allowed to name companies, but much like Qiagen has popularized the ability to do a miniprep for DNA. This is now a very canned thing that we can do, and many times people don't even know the chemistry behind what's going on when they're sequencing DNA anymore. What really happened from the late '70s to the late '90s was a steady improvement of the dideoxy sequencing method: automating it more, moving away from slab gels to capillaries, with every part of the process being automated and refined, the nucleotides, the fluorophores put on those nucleotides, the polymerases being used. All of this was built around dideoxy sequencing, and in the mid-to-late '90s the Human Genome Project began to really ramp up and create major factories for genome sequencing using this method. It plateaued in the early 2000s in terms of our ability to do more sequencing; we certainly got great at commercializing and factorizing all of this, but there came a point of no return where, if you built more machines, you weren't actually getting more efficiency. And this is where some of these new sequencing technologies completely transformed things.

So we coasted for a while with dideoxy sequencing, and really the main machine at the front of all of this was the Applied Biosystems 3730. You could stick, I think they call it a hotel, a stack of 96-well or 384-well plates into it, and over the course of several days you could just walk away while this machine churned away generating sequence. So you had factories of these machines sitting in various places around the world, creating enormous amounts of DNA sequence that was then analyzed, and we got really good at understanding the chromatograms from these machines. But the real change came maybe four or five years ago, with the advent of what people call next-generation sequencing technologies. I try to stay away from that term, even though the title of my talk is "next generation"; I like to call them new sequencing technologies, because there's always going to be something new. We're now in another really rapid growth phase, with the ability to sequence a lot more DNA faster and cheaper, and the rest of my talk is aimed at describing these new methods to you. So here are what I would consider the three main players today, and there are more players coming onto the scene that I'll introduce you to in a little bit. The first next-generation sequencing platform was a machine from a company called 454, which then got bought by Roche, so now it's the Roche 454 pyrosequencing machine.
It's this cutesy little machine with a little monitor on top of it, and I'll talk about it in a little bit. The next player to come onto the field was originally from a company called Solexa, which was then bought by Illumina. This thing is called the Genome Analyzer; they're now up to the Genome Analyzer IIx, for those who are counting. The mainstay behind it, which I'll describe in detail, is its chemistry: a reversible terminator chemistry, very different from dideoxy sequencing. And the other player on the field is from a company called Applied Biosystems, which got bought by Life Technologies, so you can see that everybody's buying everybody to try and get their share of this new sequencing technologies field. This one uses an interesting ligation-based extension to do sequencing. So we'll talk a lot about these three platforms, but I'll also touch on some of the newer sequencing technologies, one of which is from a company called Helicos. I should mention that while these first machines are sometimes described as single-molecule sequencing, what they really sequence are clonally amplified copies of single molecules, because detecting the fluorescence off of a true single molecule is very difficult. That's the real difference with these two newer machines: they truly are sequencing single molecules. I'll actually describe a little of how the Pacific Biosciences machine works; I won't touch on the details behind the Helicos method, but it, too, is true single-molecule sequencing.

Okay. A review that came out last month, and is already outdated, is this one; it's actually a great review from Michael Metzker at Baylor College of Medicine on new sequencing technologies. He outlines a lot of the different methods I'll be talking about, and I've actually borrowed some of his figures to nicely describe some of the methods being used here. When I say it's outdated: there are a couple of tables in there that I'll point to where, again, everybody wants to know the throughput of a machine, how many gigabases can I sequence, and his table is already outdated. I'll show you some details of that, but it definitely gives you an idea of where things are going. I guess this is a good time to say that it's very difficult to tease out objectively what each of these machines can do. You can't simply go to one of these companies' websites,
because they'll tell you what they think their machine can do a few months from now, without really telling you that it's a few months from now. Some are better than others about this, and I'll try to point out the differences when you try to make an apples-to-apples comparison between one machine and another. For example, one machine may run one flow cell while another machine runs two flow cells at the same time; the second machine might claim it produces twice as much data, but it also produces it in twice as much time, at twice the cost. So it's a little difficult to get down to that apples-to-apples comparison, and I'll try to tease that out for you through the rest of this talk.

Okay, so one way I like to look at these three main methods of sequencing is highlighted in this graph, and there are certainly trade-offs. What I've shown over here is, it's not hypothetical, but it's not drawn to scale, I guess, is the right way to put it: throughput, which is measured in kilobases, megabases, or gigabases, tens or hundreds of megabases, or single, tens, or hundreds of gigabases, so we're really talking about orders of magnitude. Along the x-axis is the read length. Some of these technologies, the Illumina platform, the AB (or Life Technologies) SOLiD machine, and the Helicos machine, produce relatively short reads, on the order of 25 to 50 bases, maybe up to a hundred or even a hundred and fifty bases, but not much longer than that; however, they can produce a massive amount of this so-called short-read sequence data. Then you've got the pyrosequencing machine from 454/Roche, which produces reads on the order of three to four hundred bases, and they'll tell you they're coming out with something even longer than that soon; it produces on the order of hundreds of megabases of sequence, in a matter of hours, actually. And then you also have your standard Applied Biosystems (I guess Life Technologies owns these guys now, too) capillary-based 3730, which can produce read lengths out to seven or eight hundred bases, but on the order of kilobases of sequence at a time. Now, I mentioned this Pacific Biosciences machine; it could potentially be a game changer. I think it might fall somewhere way out here, getting the benefit of kilobases in read length, at least for some of the reads that come off of it, while producing hundreds of gigabases of sequence at a phenomenal rate. But I don't believe you can actually buy one of those machines yet, whereas you can buy all of these other types of machines right now; I think you can even buy the Helicos machine right now as well. And again, my throughput axis is not just how much sequence you can generate, but how much per some standard unit of time or cost. So these are the mainstays of these technologies, and we'll now go through each of the three main technologies in the rest of my talk.

Okay, so here's something that's common to all of these different methods; there are little things that differ, but I wanted to separate this out from the three methods I'll highlight in detail: how do you prepare a library of DNA to sequence?
So this, again, whether it's a single molecule or a clonally amplified version of a single molecule, starts with taking your DNA, whatever that might be, and shearing it up into smaller pieces. All of these methods require pieces of DNA much smaller than the size of a genome, on the order of hundreds of bases in length. Then you add some sort of proprietary (or not-so-proprietary anymore) adapters onto the ends of the short pieces of DNA you want to sequence, and then you select for molecules that have an A adapter on one end and a B adapter on the other, since the two ends need different adapters; each of the methods has its own trick for doing this. Then, and this again is what differs between the methods, you attach these adapter-ligated sequences to some sort of solid surface. That could be a glass flow cell, which I'll show you in a little bit, or it might be some sort of bead carrying little sequences complementary to one of these adapters. And then you use whatever method to actually look at the molecules you've stuck onto that solid surface, sequence them, and obviously analyze all of them.

Okay. Now I wanted to talk a little about the ability to generate what are called paired-end reads. These are two reads that are physically linked to each other, some distance apart, and for various techniques it's useful to know this information: you might have 50 or 100 bases over here, 50 or 100 bases over there, and then some fixed spacer in between where you don't know the sequence. That can be very helpful, especially for genome assembly methods. The challenge is that, except for the Illumina-based technology, which inherently lets you sequence both ends of a fragment that's somewhere on the order of 400 bases in length, the other two methods require experimental tricks to bring two ends of DNA that are normally far apart very close to each other. What this slide does is describe some of those tricks that can be used to bring two pieces of DNA close together for sequencing; the slide was actually made by two people in my lab, Andrew Young and Hatija. This is how it works: you shear up your genomic DNA into fragments of the size at which you want this pairing information, and then one method of getting the ends close to each other is to use an adapter that contains the recognition sequence of a restriction enzyme called EcoP15I.
This is one of those Type IIS restriction enzymes that recognizes a site over here but actually cuts some 20 or 25 bases away from its recognition site. You create circularized molecules that contain your fragment plus this adapter, which carries the EcoP15I sites and also has some biotin on the inside of it. You then digest the circularized molecules with EcoP15I, and using the biotin you can capture out molecules that carry a little bit of sequence from one end of a 2 to 3 kb fragment and a little bit of sequence from the other end of that fragment, but are themselves only on the order of a couple hundred bases. So this can now be sequenced either from one end and then the other on, say, the Illumina platform, or, as the other platforms do, with one single read straight through the entire thing: you throw out the adapter sequence in the middle, and you get the second read over here. Informatically, you have to tease this apart for both the 454 and the SOLiD-based methods. So this highlights it again; this part is a bit Illumina-centric, since Illumina happens to be the method I use the most, but I'm trying to be unbiased in my presentation of these three methods and to highlight the advantages of each. On the Illumina platform you can sequence in from this side, then redo the reaction and sequence in again from the other side, and take advantage of that; but you can't do this for anything more than about 400 bases. So if you want larger-insert libraries, you have to do tricks like this. There are other ways of doing this now as well, where you sequence in 27 bases and then redo the sequence on the other side for 27 bases, or again, what the other methods do is just sequence straight through. Either way, you now have the identities of the end of this fragment and the end of this fragment, which were originally two to three kb away from each other.
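Since the 454 and SOLiD flavors of this trick sequence straight through the junction, the two mates have to be recovered informatically. Here is a minimal sketch of that splitting step; the linker sequence and the minimum mate length are invented placeholders, not values from any vendor's protocol, and a real pipeline would also tolerate mismatches in the linker rather than requiring an exact match.

```python
def split_mate_pair(read, linker, min_mate_len=20):
    """Split a single read at a known internal linker sequence,
    returning the two flanking mates, or None if the linker isn't
    found or either mate is too short to map uniquely."""
    pos = read.find(linker)  # exact match only; real tools allow mismatches
    if pos == -1:
        return None
    left = read[:pos]
    right = read[pos + len(linker):]
    if len(left) < min_mate_len or len(right) < min_mate_len:
        return None
    return left, right

# Toy example: two 25-base mates flanking a made-up linker.
LINKER = "GTTGGAACCGAAAGGGTTTGAATT"  # hypothetical junction adapter
read = "A" * 25 + LINKER + "C" * 25
print(split_mate_pair(read, LINKER))  # ('AAAA...', 'CCCC...')
```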
Well, there are ways where you can Identify which sample comes from which Which read comes from which sample By adding on kind of an index of a short piece of DNA and each method does this a little bit differently The way that the Illumina based method works is you add in these these adapters your sequence of interest is over here So you would kind of do read one and then instead of doing the second paired end read you actually your read two is on the same strand But in a different location That is a key you actually sequence six bases and each of these indices Would represent a different sample that you have added one of these indices on so you can kind of decombolute Which read comes from which sample after you've pulled all of these samples together And then you could go and re-synthesize your second strand and do read two and I'll describe some of this in a little bit more detail But I wanted to point out that this is now Useful although you have to weigh the the advantages and disadvantages of just simply doing many samples and Overkill on the number of reads and bases that you would get for each sample Versus the upfront cost of spending extra time indexing your libraries because you still have to make one library per sample And each of the technologies are working on ways of pooling these things at an earlier stage So you don't have to treat each sample separately until a certain point Okay, so we're gonna move on to some new sequencing platforms Okay Starting to wonder if this is the right talk. I think this is the right version. Okay, so let's talk about the 454 sequencing platform This is a I Got a little bit taken back because I thought I had a different slide in here. So bear with me for a second Okay, this is the the article that came out describing that the 454 Sequencing technology again the machine is over here so you can read in detail this method this method is fundamentally based on pyro sequencing and The way that this works is you actually create I Create your library like I described it in the few slides beforehand and you end up annealing single-stranded DNA To these little styrofoam beads that have complementary DNA molecules on them And what do you end up trying to create or they call this an emulsification of a it's kind of an oil water mix And you try and get one bead and one DNA molecule in one Water bubble and so you can imagine trying to get the ratios of your beads and your DNA molecules And the right amount of oil and water together and you mix this up in a certain way And hopefully you get some proportion of little bubbles that have this in it and inside of these beads are also attached to it the reagents necessary to do a Little kind of PCR inside to clonally amplify that single molecule that then attaches to various other sites Within this styrofoam bead and then you can kind of break this off And then you have a bead that's got hopefully one molecule that's been clonally amplified all over it and then these beads Okay, right so each bubble contains different fragment and then each of these beads kind of goes into what they call a pico tighter plate So you hopefully are getting One of these little beads into one of these little wells and then these little wells get kind of packed with enzymes that allow You to do the the sequencing reaction at the end so rather than dealing with kind of a 96 well plate You're dealing with kind of this pico tighter plate that then gets imaged by a by a camera and however many you know you can kind of Gauge the throughput 
Okay, so we're going to move on to the sequencing platforms themselves. Okay. I'm starting to wonder if this is the right talk; I think this is the right version. Okay, so let's talk about the 454 sequencing platform. I was a little taken aback because I thought I had a different slide in here, so bear with me for a second. Okay, this is the article that came out describing the 454 sequencing technology; again, the machine is over here, and you can read about the method in detail there. This method is fundamentally based on pyrosequencing, and the way it works is that you create your library the way I described a few slides back, and you end up annealing single-stranded DNA to little Sepharose beads that carry complementary DNA molecules. You then create what they call an emulsion, a kind of oil-and-water mix, and you try to get one bead and one DNA molecule into each little water bubble. So you can imagine trying to get the ratios of your beads, your DNA molecules, and the right amounts of oil and water together; you mix this up in a certain way, and hopefully some proportion of the bubbles end up with exactly that inside. Also inside these bubbles are the reagents necessary to run a little PCR, to clonally amplify that single molecule, which then attaches to various other sites on the bead. Then you can break the emulsion, and you have a bead with, hopefully, one molecule clonally amplified all over it. Okay, right, so each bubble contains a different fragment, and each of these beads then goes into what they call a PicoTiterPlate. You hope to get one of these little beads into each of the little wells, and the wells then get packed with the enzymes that let you do the sequencing reaction at the end. So rather than dealing with a 96-well plate, you're dealing with this PicoTiterPlate, which gets imaged by a camera, and you can gauge the throughput by how many of these beads, each created from a single molecule, you can get stuck into individual wells. There are various places in this process you can optimize, to try to get as much sequence out of it as you possibly can. So again, the idea is that rather than the old capillary-based way of dealing with 96 reads per run, you're dealing with hundreds of thousands of reads per run. And this slide is just showing you the actual physical hardware that these PicoTiterPlates are designed for.

Okay, so this is the pyrosequencing reaction, which works a little differently than the dideoxy method; the reaction is described over here. The idea is that you flow in, say, your A nucleotide and let it do its thing, and what's measured is an emission of light, where the amount of light emitted is proportional to the number of A's, C's, G's, or T's in a row that could be extended; you can only extend with the nucleotide that's currently flowed in. That's highlighted a little on this slide, which I think comes from a nice review of pyrosequencing. You read these, they're called flowgrams, a little differently than the chromatograms you're used to seeing. If you flow in an A, it might light up to one unit of A, so you know you have just one A there. But say you then flow in G, and it lights up four times as much as that A did; then you know there are actually four G's there, not one. Then you add in a T and get a measure of one unit of T. You add in a C, and maybe there's nothing there, so no C comes next; you add in the A again, and there's nothing there; then you add in the G, and it looks like there are two G's, and so forth. So you're constantly adding A, G, T, C, A, G, T, C, and measuring, by the amount of light, how many A's, C's, G's, or T's were extended. You can imagine that if you have long homopolymers, this becomes quite a challenge: maybe you can jump nicely between one, two, three, or four G's, but beyond that it becomes very difficult to accurately measure whether you have ten G's or eleven G's, and so forth. Sometimes this isn't so important, say in trying to find SNPs, but it can certainly become a challenge, and this is obviously one of the main areas 454 is working on: calling these homopolymers more accurately.
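Here is a minimal sketch of how a flowgram becomes a base call: round each flow's light intensity to a whole number of incorporations and emit that many bases. The flow order and intensities are invented to mirror the example above, and real 454 base calling models the signal much more carefully, which is exactly why long homopolymers are the hard case.

```python
def call_flowgram(flow_order, intensities):
    """Convert per-flow light intensities into bases by rounding
    each intensity to an integer homopolymer length."""
    seq = []
    for base, signal in zip(flow_order, intensities):
        n = int(round(signal))  # 0 means that base wasn't next
        seq.append(base * n)
    return "".join(seq)

# Invented flows: one A, four Gs, one T, no C, no A, two Gs.
flow_order = "AGTCAG"
intensities = [1.02, 3.91, 1.10, 0.07, 0.05, 2.13]
print(call_flowgram(flow_order, intensities))  # AGGGGTGG
```

The trouble is that ten G's and eleven G's differ by only about ten percent in signal, so rounding a noisy intensity is exactly where homopolymer errors creep in.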
This is what the data actually look like coming off a 454 instrument. They've colored them in different ways, so you can see how they draw their one-mers, two-mers, three-mers, and so forth; all of this comes off the machine and is converted into sequence. So, to briefly summarize the 454 technology: the runtime is roughly eight hours, and it produces on the order of hundreds of megabases of sequence for several thousand dollars; I think the number is somewhere in the ten-thousand-dollar range for one of these runs. Right now, I think the read lengths are in the three-to-four-hundred-base range, and this is one of the most mature of these new sequencing technologies; it was one of the first to reach the market that was different from a dideoxy sequencing platform. Again, the homopolymers can be an issue. You can use this method for many different types of applications, particularly de novo sequencing, because the longer read lengths allow you to bridge repetitive regions and make better assemblies. But the specific numbers up here are the least important part; what I want to impress upon you are the orders of magnitude. It's not taking days to do this run, it's taking hours; we're generating hundreds of megabases of sequence, not hundreds of gigabases; and we're getting read lengths in the hundreds of base pairs, not in the tens of base pairs. That's the take-home message of what a 454 instrument can do for you. It's nice that by the next day you have hundreds of megabases of DNA sequence, generated in a matter of hours, and it's at a mature level where things work most of the time; the kits are quite nice, and it's not a Tinkertoy type of instrument anymore, it's very hardened and quite a nice instrument.

This is just to highlight some of the early applications: they were able to actually sequence Neanderthal DNA using this method. I actually think this guy looks a little like Kelsey Grammer; not this one, but this one definitely does. Yeah, Kelsey Grammer with a mullet, that's why. Okay, the other thing that was done with a 454 instrument was generating Jim Watson's genome sequence, and this became, I shouldn't say political, just a nice news article and hubbub, handing somebody his genome on a DVD. Watson was one of the first people to publicly announce and make his genome publicly available. The only thing he didn't want made available was the locus that can show predisposition to Alzheimer's disease, because he had relatives in his family who had had the disease and he didn't want to know.
He's already in his 80s, I believe, and he didn't want to know about that locus. You can actually go to a Cold Spring Harbor website and use a genome browser to view Jim Watson's personal genome, and we'll talk a little more about this later in my talk. Okay, so this is my child, and this is being used as a way of waking you up a little, because we've still got a long talk ahead of us, and I am a doting father and wanted to see what she looked like on this big screen as well. Thank you, thank you.

Okay, so now we're moving on to the Illumina Genome Analyzer, formerly known as Solexa; I've actually got a picture of this instrument from when it was a Solexa instrument, but it's now owned by Illumina. My slides are not quite in the order I think they are in my head, but we'll eventually get to all of them, so I apologize a little. This method works as follows. You do the sample prep, again the way I described it, with an A adapter on one side and a B adapter on the other, and then you attach these single molecules to a solid surface carrying oligos complementary to one or the other adapter, and you create what are called clusters from these single molecules. This is a cute animation: a molecule attaches over here and goes through what's called bridge amplification, and the "bridge" refers to the fact that the other PCR primer is itself attached to the surface of the solid support, what's called a flow cell, so the molecule arches over to form a bridge. You create the replica of the sequence, the original gets washed away, and you go through rounds of this clonal amplification, creating clusters of approximately a thousand molecules each, separated out across the flow cell. Then you go through the reversible-terminator sequencing chemistry and, obviously, various bioinformatics analyses.

I want to go through this reversible terminator in a little more detail, because it's subtly but importantly different; it's the whole reason this machine actually works, it's the proprietary part behind all of this, and it's really the main reason the machine has such high throughput. You start with your DNA molecule and your known adapter sequence, you anneal a primer, and in the reaction you've got your polymerase and all the A's, C's, G's, and T's, all the different nucleotides, each with a different color. But these nucleotides are modified in such a way that the polymerase can only extend by one base; you cannot extend any more than one base, and that is the key to their chemistry working.
You can then remove all the unincorporated nucleotides and use a laser to measure whether that base was an A, C, G, or T, because each of the nucleotides has a different fluorophore attached to it, and you detect that signal. You can then reverse the termination, and this is, again, way cool; I don't know how it actually works, but it works, and it allows extension again by one and only one base. You then strike it with the laser again, detect what that next base was, and so forth, continuing down the line, and you're able to do this for tens of bases at a time. So you're sequencing a single base at a time, but you're doing it for literally millions of molecules at once. Now, each company will tell you they can do one thing better than another. Here, essentially, you've got all four nucleotides in one reaction, and of course they don't have those problems with homopolymers; you can just keep going, so that isn't an issue. Of course, there can be other, subtler issues with sequencing on the Illumina-based platform. I don't think I have the data showing here, but sometimes extreme biases in GC content can affect the yield of DNA that you get out of these machines. It's relatively minor, although it can be substantial on a genome-wide scale. But these are the things being worked on: just as 454 is working on improving its homopolymer identification, Illumina is working on minimizing this GC bias.
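In spirit, the per-cycle base call is just asking which of the four fluorophore channels is brightest for a given cluster. Here is a cartoon of that idea; the intensities are invented, and real Illumina base callers additionally correct for cross-talk between dyes and for phasing, and estimate a quality score for every call.

```python
def call_cluster(cycle_intensities):
    """Call one base per cycle for a single cluster, given a list of
    per-cycle fluorophore intensities (dict of channel -> signal)."""
    return "".join(max(channels, key=channels.get)
                   for channels in cycle_intensities)

# Toy three-cycle read for one cluster (invented numbers).
cluster = [
    {"A": 812, "C": 40, "G": 55, "T": 31},  # cycle 1 -> A
    {"A": 60, "C": 33, "G": 905, "T": 48},  # cycle 2 -> G
    {"A": 51, "C": 777, "G": 42, "T": 39},  # cycle 3 -> C
]
print(call_cluster(cluster))  # AGC
```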
So you've got this thing called a flow cell It's got eight lanes in it So you can run up to eight samples before you do where you're indexing and then right now They have kind of reagents that allow you to index up to 12 samples at a time I think they're going up to 96 at a time and it really is a glorified kind of two microscope slides that are sandwiched together by these grooves that are shown here and then kind of the machine attaches at the beginning side and the other side to flow liquids through This what they call a flow cell and there's like a I think it's made of acrylamide or something some sort of matrix inside solid matrix That they attach the the different oligos to that that allow the single molecules to attach to it They have a separate machine over here that in my mind looks like a futuristic flash Gordon like machine That the flow cell attaches to and and this is kind of like a glorified PCR machine where the PCR is and liquid handler So you throw in your reagents you put your flow cell attached to this and it's sitting on top of a Peltier unit which can heat and cool the The flow cell as it's flowing different reagents through to amplify these these clusters this flow cell then goes on to this machine Which is the actual sequencer and this sequencer is really a glorified microscope and liquid handler So it's got a stage in here that you attach the microscope to that can move around with an objective in a camera and a laser Inside of it and then reagents are flowed through this and one cycle is sequenced at a time and then images are captured and really the The bottleneck in generating a lot of sequence Or the in the time to generate a lot of sequence is in imaging We can only image so fast in with today's hardware So what I want to impress upon is that the old-school way of doing things has been and the tried and true way of doing things Has been with diodeoxy saying or sequencing and we generate these chromatographs We're changing the fundamental unit of what is the primary data And the primary data are now these pretty little spots like looking up at a sky and where each of these Individual spots represent a different clonally amplified molecule and you sequence these in parallel So you can imagine you need to take hundreds of pictures like this per flow cell Because it's something like I think today. It's like a hundred tiles per lane and there's eight lanes. So Actually, I think it's now 120 tiles per lane. 
So you've got like 960 Images that you need to take times four because you need to do an ACG and a T image And then you have to do the bioinformatics to sandwich all these together This is kind of a beautified taking the ACG and T and colorizing each one differently So you can see that different molecules would have an ACG or T at a certain base length So this is just kind of a close-up kind of showing you how you would actually read this for two different Clusters, so I've circled this upper cluster over here and this lower cluster over here This is the same spot on a In an image at done at different cycles, so at cycle one you can read what the color is I'm actually red green color blind So it's difficult for me to see the difference between the T's I know or I think it's the C's that are red as well So I can't tell the difference, but thankfully the computer can and most of you can so you kind of read what color This is and then you go through the single base extension again And you can read what what color this base is and then you do a single base extension again You do what this color is and then you can read out at each cycle What the what the base position is so you can imagine the bioinformatics? For lack of a better term we'll call them headaches that are involved in stacking You know 960 images for ACG and T for cycle one and then for ACG and T for cycle two You're generating terabytes of images that need to be stacked on top of each other So for each image you want to localize and find the same same cluster So because you don't want to be jumping from one cluster to another and and then calling these bases And this is what the what the Illumina machine does and this is how it's able to sequence Millions of million tens of millions or hundreds of millions of molecules at the same time The catch is that you're getting 50 a hundred base reads out of this You're not getting the three to four hundred base reads that you can from a from the four five four pyro sequencing instrument Okay, so again, this is this is the already outdated slide of numbers But we got to throw them up for relative terms anyway So you're generating nowadays you can put up to kind of in the order of twenty four to thirty six million molecules I think that's per lane So each lane can so over eight lanes you can yield, you know over a quarter of a billion molecules And you're producing data kind of at a rate of kind of five to six gigabases a day So to generate kind of a hundred base paired and reads this is a ten-day run So this is a substantial amount of time and of course, you know, you you're investing that amount of time You don't want anything to go wrong in in that length of time But you know, you can you can gauge this instrument Towards the experiment that you want to do sometimes all you're after are lots of different counting experiment molecules So kind of 35 or 36 bases is enough to find a unique position in a genome for the most part But other times you want to do a variation type of experiment and the yield that you get out of one run is more important to you And there you might want to extend that out to all the way to a hundred bases And again, maybe getting the paired end data is something that you want to do as well So this would add add time But you can kind of with today's chemistry that's that's commercially available You can kind of generate up to 50 gigabases of sequence in one run, which I think is pretty impressive when I gave this talk I think a year and a half ago. 
When I gave this talk, I think, a year and a half ago, we were at a gigabase, maybe two gigabases, so again, that gives you an idea of how rapidly these technologies are moving. But wait, they've just released a new instrument, right? So everything I've told you is now being transitioned, and I just want to give you some highlights. I think this instrument was literally just released, like yesterday, and you can now buy this machine. They're still going to support the Genome Analyzer, but for really super-high throughput they've changed things up a bit and created a new machine they call the HiSeq 2000. The nice thing, I guess, is that it's still based on the fundamental chemistry I just described; that isn't changing. What they've done is build a Ferrari of optics and imaging inside this thing, so that they can image faster, better, at higher density, and over more surface area, to generate lots of data. They're also able to run two flow cells at the same time, so while one is being imaged, the other can be doing chemistry; some other technologies, the SOLiD for instance, take advantage of this as well. So, the ways this machine differs, and I just want to briefly touch on them: you can run two flow cells at the same time, and the flow cells are bigger. The other cool trick they take advantage of is that not only are you imaging a larger surface area, you actually generate clusters on both sides of the flow cell; they're able to focus on the top surface of the flow cell and scan it, then refocus on the bottom surface and scan that, so right there they've doubled their capacity. Each method tries to take advantage of the little tricks inherent to its own approach, and the main improvements here are in the hardware of the new instrument. So they're trying to release this thing, and we're already in mid-February, at 100 to 125 gigabases per flow cell. As I'll describe a little later, that's actually enough to analyze one whole genome when you already have a reference sequence against which to align, and we'll go over why that's the amount of sequence you'd want to get. The runtime they now quote, rather than 10 days for one flow cell, is eight days for two flow cells, so their data rate has gone from five or six gigabases a day to upwards of 25 gigabases a day. You're essentially getting two whole genomes in an eight-day period, which is kind of neat, and of course, the way they're building this hardware, they're trying to scale it up, so this is probably just the beginning of their foray into making things faster, better, and cheaper.
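Why would 100 to 125 gigabases be "one whole genome" when the human genome itself is only about three gigabases? Because for confident resequencing you want every position covered many times over. A quick sanity check, where the roughly 30-fold depth target is a commonly quoted rule of thumb rather than a number from this lecture:

```python
genome_size_gb = 3.1    # haploid human genome, approximately
target_coverage = 30    # typical depth for confident variant calling

gb_needed = genome_size_gb * target_coverage
print(gb_needed, "Gb")  # ~93 Gb, i.e. about one 100-125 Gb flow cell
```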
Again, the reason I'm showing you this is not to advertise the next latest-and-greatest, but to impress upon you how rapidly this technology is moving. And where this really becomes a nightmare is in the informatics; I shouldn't say entirely, but that's one of the areas, because we as informaticians start to understand the nuances of a certain data type, and we got used to understanding dideoxy sequencing data really, really well, and now we're being asked to completely change pipelines and transform things over months rather than years. This is causing all sorts of headaches. You can imagine places and institutions that are running dozens of these machines at a time, and how quickly they need to refine and change their pipelines as well; this is a big challenge.

Okay, so let's talk about the third main machine that's out there. This is the machine they call SOLiD; I always forget these acronyms, but the "L" is for ligation. It was previously owned by Applied Biosystems, now Life Technologies; this is all rapidly changing. This is what the machine looks like, and it actually has a mini Linux cluster underneath it to handle all the images being produced. Each platform has to figure out, in its own slightly different way, how to deal with processing the terabytes of raw data coming off so rapidly, and that's one of the major challenges all of these methods face, not just this one. So, this is actually quite challenging; I had to relearn it for this lecture, because it's very confusing to me, but one of the things they highlight is this two-base encoding technology. What happens is that they generate a set of probes in which the first two bases are known. They're only using four different colors, so there are several different two-base combinations that are blue, and likewise for green and so forth, but they know what they are. Then there are five random bases over here, and I think they use universal, inosine-like bases for the last positions, so they've got this little, what is it, one, two, three, four, five, six, seven, eight, an 8-mer. They create all of these different probes, and what happens in this two-base encoding is that what you read off the machine are colors that each represent two bases at a time. Now, the key to decoding all this: this is the first part of knowing how to decode the bases from what they call color space. If you're reading a red, you can say, okay, if the first base is an A, then the second base is a T; if the first base is a C, then the second base is a G; and so on. So a given sequence of colors could correspond to up to four different base sequences, and the key is that you need to know the identity of the first base to be able to deconvolute the sequence from the colors. There are various bits of lingo here showing how this two-base encoding can be advantageous, especially for variant detection, because a valid SNP has to produce a specific pair of color-change events, not just one. With other methods, if there's a base change, you only have an idea of whether it's an error based on the quality of that read; there's no other error-checking validation built in. The way they pitch this two-base encoding and color-space analysis is that you have to see a corresponding second color change to call a valid variant. So here's how this works in practice.
Again, it's very similar in template preparation to the 454 method, using those oil-and-water emulsions where you try to get one molecule and one bead into each water droplet in the oil mix. What happens in round one is that you anneal a primer that overlaps a position where you know the base, and then you add in that mix of probes, the ones with the first two bases known and random bases after that, and ligation occurs. When the ligation happens, you can excite the probe and it stays there, whereas all the others get washed away; so, I take it back, it isn't the ligation event itself that causes the emission, the ligation is what keeps the probe there so it can be excited, and then you measure which fluorophore got incorporated. You then cleave off the end of the probe and continue down, doing this for, let's say, seven or so cycles. So what you end up getting in this first round are the bases at the first two positions, then you skip three bases and get the next two, then skip three more and get two more. That's what you've got after the first round of sequencing. But then you strip this away and repeat the whole thing with a primer that sits at n minus one, one base back from the first, and do the whole sequencing bit again to get two-base reads that overlap the first set; each position ends up getting sequenced twice, and you build the whole read up this way. You then get these color-space reads, and you can use their alignment algorithms to align the color-space reads to a color-space genome and figure out where all the reads go. And here's the idea I was describing: if you have a SNP in a particular region, it's great that half the reads have one color and half the reads have another color, but with this two-base encoding you also have to see the corresponding change at the adjacent position; something like this, by contrast, would be considered an error, because there's no corresponding color change at the position afterwards. So that's the advantage of this color-space way of doing things. You can imagine it's a bit of a headache to deal with in your raw data, so it's not as plug-and-play as we'd like; you can't just say, I've been running a bunch of Genome Analyzers and, oh wait, now I want to run the same sample on these SOLiD machines as well. It's a completely different way of dealing with the bioinformatics.
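To see why you need that first base to decode color space, here is a sketch built on the standard two-base encoding matrix, in which each of the four colors stands for four different dinucleotides; I believe this matches the published SOLiD scheme, but treat the table as illustrative. Note in the second call how one changed color alters every downstream base, which is why an isolated color change looks like a sequencing error, whereas a true SNP shows up as two adjacent, consistent color changes.

```python
# Two-base encoding: color = CODE[first_base][second_base].
CODE = {
    "A": {"A": 0, "C": 1, "G": 2, "T": 3},
    "C": {"C": 0, "A": 1, "T": 2, "G": 3},
    "G": {"G": 0, "T": 1, "A": 2, "C": 3},
    "T": {"T": 0, "G": 1, "C": 2, "A": 3},
}
# Invert it: given the previous base and a color, get the next base.
DECODE = {prev: {color: nxt for nxt, color in row.items()}
          for prev, row in CODE.items()}

def decode(first_base, colors):
    """Decode a color-space read into bases, anchored on a known first base."""
    seq, prev = [first_base], first_base
    for c in colors:
        prev = DECODE[prev][c]
        seq.append(prev)
    return "".join(seq)

print(decode("T", [3, 0, 1, 2]))  # TAACT
# Flip the second color and everything downstream decodes differently:
print(decode("T", [3, 1, 1, 2]))  # TACAG -- why lone color changes
                                  # are treated as errors, not SNPs
```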
So it's quite a challenge; unfortunately, you end up committing to one platform or another and making a substantial investment in the bioinformatics, in learning how to do one method or the other. So, the already-outdated slide of numbers. Here's the trickery we have to be careful about: currently, I think they're producing somewhere on the order of 30 to 50 gigabases per run, but this is a two-flow-cell system, so that's on the order of 15 to 25 gigabases per flow cell. They're getting about 50-base read lengths, and I believe that's roughly where they see their read lengths staying. A very crude comparison between a SOLiD and an Illumina would be that the Illumina gets probably slightly longer read lengths, though not by orders of magnitude, while the SOLiD probably gets slightly more molecules sequenced per run. The runtimes are a little longer than the Genome Analyzer's: either seven days for a single read or 14 days for a paired-end read. And again, the reason all of this is outdated is that last week they announced the SOLiD 4, rather than the SOLiD 3, and what they're now claiming, and I don't know whether it's available now or about to be, is on the order of a hundred gigabases across two flow cells, so again about 50 gigabases per flow cell. The relative costs are about the same per flow cell, if you think about it, between an Illumina and a SOLiD type of platform. And, I got this straight from their website, they say it's expandable to 300 gigabases in the future with some system upgrade they're calling HQX or whatever; there are various lingos to the bit. So this is apparently what that machine is going to look like.

Okay, so this is the outdated table. It's phenomenal to me that a publication from January of 2010 can already have an outdated table in it, but it still gives you the rough relative picture of the pros and cons, and of what you might want to use these different platforms for. The 454 instrument is listed over here, and the Illumina GA IIx over here; remember, they're quoting 18 to 35 gigabases, and I think we're now at more like 25 to 50 gigabases, and AB is quoted at 30 to 50, where it's more like 50 to 100. These machines are all roughly on the order of a two-bedroom apartment in Bethesda, I think half a million dollars. The table also notes how each works: remember, this is the pyrosequencing, this is the reversible-terminator chemistry, and this is the ligation-based two-base encoding. Okay, and some other methods that I'm actually not going to talk about today are also listed in this table. One of them is actually an open-source version of a SOLiD-like instrument.
They call it the Polonator, and it comes out of George Church's lab. You build it, or rather you buy the hardware, but then you purchase or make your own reagents and run it yourself. It has relatively short reads, but it can be quite useful for small-scale projects, or for scaling out without being wed to a particular commercial company. And then there's this HeliScope instrument down here, which is true single-molecule sequencing; maybe you could get yourself a three- or four-bedroom place in Bethesda for that kind of money.

All right, so let me quickly touch on what's coming in the near future, which is very different from the others. The reason I didn't take the time to describe those other two methods is that they're relatively similar to one of the three methods I covered in detail; this Pacific Biosciences instrument, though, is something we've been watching for several years. They're claiming they'll have a beta, early-access instrument available in the next several months, and I think they're starting to generate data in-house for various collaborators. The basic idea behind it is, I think, quite cool. They call it SMRT technology, which stands for single-molecule real-time, and they've got cool names for everything: they've got these itsy-bitsy wells that I think they call zero-mode waveguides. It sounds like we're very much in the future here, but I'm sure it's got some physics property to it that means something I don't quite understand. What they've done is attach a polymerase to the base of one of these ZMWs, and they also have nucleotides with a fluorophore attached; but the fluorophore isn't attached to the nucleotide base itself, it's attached to the phosphate backbone, the triphosphate component. So as a single molecule gets extended by this polymerase, which is attached to the solid surface over here, the fluorophore gets excited because it's brought into close proximity to a laser on the other side of where the polymerase is attached, and you read off the fluorescence in real time as the molecule is being synthesized through the polymerase. I think this is a close-up over here: the single molecule goes through, and on the other side is presumably the laser that's trying to capture the emission from a single molecule being extended. You end up sequencing the DNA at, I think, something on the order of 10 bases a second as these molecules come through, and of course the idea is that if you can capture the fluorescence off of thousands, or maybe up to millions, of these individual zero-mode waveguides at a time, you can really start to generate tons of data in a way that's a total game changer compared to all the other methods I've just described.
that's a total game changer compared to all the other methods I just described. I think they're saying that some of the reads coming through can now be up to three kilobases in length. Of course, there's a wide distribution of read lengths, because you've randomly sheared your DNA, and I think one of the issues they're dealing with, from talking with some of my colleagues, is that the laser ends up frying this little polymerase, so each one only lasts so long, and some last longer than others. So this is kind of cool, and you can imagine they're starting out producing a data rate that's comparable to this new HiSeq machine, and I'm sure they're going to go upwards from there; but it sequences very differently from these other instruments, so I wanted to touch on it and point it out to everybody. Okay. Something I'd like to touch on just a little, but not in detail, because to me that would be a whole other lecture series, is the fact that these machines are producing orders of magnitude more data, and that requires you to be extremely nice to your IT folks. These are two of our IT and bioinformatics folks standing over what is now an antiquated analysis machine for one of these Genome Analyzers; he's looking big and strong, able to conquer all sorts of bioinformatics challenges. What I wanted to touch on here is that the data volumes are massive. On a per-human-genome basis, all of these instruments end up producing somewhere on the order of fifteen-ish terabytes of raw data; that's a back-of-the-envelope calculation. And until very recently we actually had to capture and store those 15 terabytes and process them down to some usable form, which ended up being on the order of a hundred gigabytes of processed data. What all of these companies are trying to do is pretend that first part doesn't exist: analyze the raw data in real time, on the fly, so that what comes out the end of the instrument is really the processed data, and call that the new raw data. The problem is that we don't know which bits of information are the important ones to store. We used to know that really well, and we still know it really well, with the capillary-based methods: we know the chromatograms, we know what a good-quality peak and a bad-quality peak look like, so we can throw away the chromatograms very soon after analyzing them, because we have a very accurate quality score that measures how accurate each base call is. We're still learning that with all of these newer instruments. There are quality scores, but the bioinformaticians don't necessarily trust them as much as we would like to. So there's a whole world of bioinformatics trying to figure out the best way of going from 15 terabytes down to a hundred gigabytes of data.
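As an aside on what those quality scores actually encode: the capillary world standardized on Phred scaling, where a score Q maps to a base-call error probability of 10^(-Q/10), and the newer platforms report scores on the same scale, even if we trust them less. A minimal Python sketch:

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Phred scaling: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):g}")
# Q20 means a 1-in-100 error rate; Q30, 1 in 1000, is the usual
# shorthand for a confidently called base.
```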
A lot of people are starting to think about ways of using the cloud and cloud computing, and this has a lot of advantages. But, you know, it used to be that any biologist who picked up a learn-to-program-in-Perl book over the course of a few months, myself included, became a card-carrying bioinformatician and could do all sorts of cool things. We're getting to the point where, for some of this, you really need a hardcore computer scientist on your team, because there are languages and ways of engineering systems that require much more sophistication than a fly-by-night bioinformatician, as I'll put it, can manage. The challenge with this concept of cloud computing is as follows. In what I'll call the old-school days, in the physics world, they knew what data they were working with: the particle accelerators generate tons of data, but they process it on the fly and know exactly which bits they need, so they end up with a small amount of data on which to compute all sorts of things. What they did was take the data and move it to wherever the very large computer cluster lived, and that was the mode of working for a while. But now we're getting to the point where the data are so huge that trying to shove terabytes and terabytes up into this cloud is one of the biggest bottlenecks, and we're starting to think about maybe moving the compute over to where the data are being generated. Maybe you have a few massive centers around the world generating tons and tons of data, and you move your compute clusters over there and make them a kind of open, shared resource. So if the Broad Institute generated the 500 genomes you want to analyze, rather than sucking the data from them and moving it to your local cluster, the Biowulf cluster over here, you log on to their systems and do it there. I'm talking hypothetically, but that's the general concept of what we're thinking about, and the challenge. And hopefully we'll get to a point where the data aren't so huge: we'll be generating data for thousands of genomes but capturing only the important bits of what we need, not hundreds of gigabytes or even terabytes. Okay, we're nearing the end. This is her, seven days old, on top of what used to be my baby, my espresso machine; hopefully you don't need coffee to stay with me. I would now like to finish up by talking about some applications of all these different technologies. Okay, so the first one I wanted to touch on, which I mentioned earlier, is this concept of counting-based experiments, and I want to quickly highlight how this works. Let's say you're a biologist studying your favorite protein, and you've raised this perfect antibody that can highlight which parts of the genome your protein is binding to. You can use that in a chromatin IP experiment. Let's say these little orange bits represent those parts of the genome you're after: where is your protein binding?
What you do is shear your DNA into small fragments in a way that has cross-linked all the proteins to the DNA. You then purify those fragments, and I think Laura Elnitski will be going into detail on a lot of these types of experiments, but let's go through it briefly just to highlight how it can be used on these sequencing platforms. You isolate only the fragments that have your protein bound, using, say, your favorite antibody or the special antibody you created, and now you can sequence the ends of all this DNA to figure out which bits of DNA you actually enriched for. The idea is that you do this for many different fragments, and at a particular place in the genome you can align the resulting reads. You might only be getting the identity of a short bit at the end of each fragment, but the fragments pile up on each other such that the only thing in common among all of them is the place where your protein bound. So you can imagine generating these peaks that show where your protein is binding to DNA. Early on, way back in 2007, one of our own colleagues, Keji Zhao, showed that this type of analysis was very comparable to what used to be done, the ChIP-chip experiment, where you took that immunoprecipitated DNA and hybridized it to a microarray that had tiled parts of the genome on it. So now you could, perhaps more accurately and faster, identify these binding locations, and this became dubbed the ChIP-seq method rather than the ChIP-chip method. Another group also showed early on that it correlates well with histone modifications: whether it was ChIP-chip or ChIP-seq, you see similar profiles.
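To make the counting idea concrete, here is a toy Python sketch of the pile-up logic: aligned read intervals are summed into a per-base coverage track, and stretches above a simple depth threshold are reported as candidate peaks. This is only the core intuition; real ChIP-seq peak callers also model fragment size, background noise, and control samples, and every name and number here is made up for illustration.

```python
# Toy ChIP-seq pile-up: count read depth at each base, then report
# maximal runs of positions whose depth reaches a simple threshold.
def call_peaks(read_intervals, chrom_length, min_depth=5):
    coverage = [0] * chrom_length
    for start, end in read_intervals:              # 0-based, end-exclusive
        for pos in range(start, min(end, chrom_length)):
            coverage[pos] += 1

    peaks, peak_start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and peak_start is None:
            peak_start = pos                       # open a peak
        elif depth < min_depth and peak_start is not None:
            peaks.append((peak_start, pos))        # close it
            peak_start = None
    if peak_start is not None:
        peaks.append((peak_start, chrom_length))
    return peaks

# Five overlapping reads whose only shared stretch is the binding site:
reads = [(100, 136), (105, 141), (110, 146), (112, 148), (118, 154)]
print(call_peaks(reads, chrom_length=300, min_depth=4))  # [(112, 141)]
```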
But where these methods are really becoming advantageous, and I'm going to shift here now, is in the world of whole-genome sequencing, where you can take a whole human genome and, in a matter of days or weeks, generate enough data to identify virtually all of the variants between that genome and the reference genome. Just to highlight the first two of what are being dubbed personal genomes: I mentioned Jim Watson's genome, and Craig Venter's genome has also now been made publicly available; in fact, that was the Celera assembly generated back when the Human Genome Project was producing its draft genome. So in current genome analyses, people often talk about subtracting out the differences already identified in Watson and Venter, because we know those two, though others are becoming available as well. In fact, another table in the Metzker review I mentioned earlier lists a bunch of the whole genomes available to date, what platform they were sequenced on, and roughly how much we think it cost to sequence them. It ranges from on the order of 70 million dollars for Venter's genome, done on the 3730s, down to tens of thousands of dollars for one done on the Helicos machine, and a lot of these machines today are getting into the tens-of-thousands-of-dollars-per-genome range. So, this is what I wanted to talk about: what does "whole genome" really mean? This paper, on which I believe David Bentley, who's now at Illumina, was one of the lead authors, really set the stage for the idea that we need about 30x base-wise alignment coverage, depth of coverage on average across the genome, to call something a whole genome, and I'll show you on the next slide why that's the case. Since roughly 80-ish percent of the good data you generate will align to the genome, that works out to 90 gigabases of aligned sequence, or about 120 gigabases of passing-filter data, which is also something on the order of 600 million paired-end, hundred-base reads. And note that we're realigning it back to the reference sequence: when people talk about whole-genome sequencing now, they usually don't mean generating an assembly, as we did for the human genome or for other comparative genomics projects. It means generating roughly 30x base-wise coverage with one of these new sequencing platforms and aligning it back to the human reference sequence. That is what we mean by whole genome. And why do we say 30x? Well, in this figure from the Bentley paper, the x-axis is average read depth, five, ten, fifteen, twenty, all the way out to thirty or thirty-five x, and the y-axis is the number of known SNPs identified. Very quickly you can accurately identify the homozygous SNPs; the challenge is in identifying virtually all of the heterozygous SNPs. To get enough observations to know accurately that there's a heterozygous call at a particular position, you need more than just a few reads, and so the curve for finding all the SNPs plateaus right around 30x coverage. The other panel just shows how the data become less discordant out at around 30x; I won't go into the details of these figures, but you can certainly read up on them in the manuscript. That's where the 30x figure comes from. Now, of course, that is for your average diploid genome.
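To see why heterozygous sites drive that 30x requirement, here is a minimal Python sketch under deliberately simple assumptions: each read samples one of the two alleles with probability one half, there is no sequencing error, and coverage is uniform; the three-supporting-reads threshold is an illustrative calling rule of mine, not the paper's. Even in this idealized model, the chance of seeing enough reads from both alleles only flattens out as depth grows.

```python
# P(a heterozygous site shows the alternate allele in >= min_alt_reads
# of `depth` overlapping reads), with each read drawing an allele at
# random. Real data are harsher: error, bias, and uneven coverage.
from math import comb

def prob_het_detectable(depth: int, min_alt_reads: int = 3) -> float:
    return sum(comb(depth, k) * 0.5 ** depth
               for k in range(min_alt_reads, depth + 1))

for depth in (5, 10, 15, 20, 30):
    print(f"{depth}x: P(>= 3 alt reads) = {prob_het_detectable(depth):.4f}")
# 5x: 0.5000   10x: 0.9453   15x: 0.9963   20x: 0.9998   30x: ~1.0
```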
What happens if you actually start sequencing cancer genomes? They might not be diploid in many places. And this is really where I'm going now, toward the future. There have been some recent publications in which people sequence both what they believe is the normal genome of an individual and, after purifying the cancer cells from a particular tumor, the genome of the tumor. Now you have a really close match to that tumor: not just the differences in that individual, like trying to subtract out Venter and Watson, but a close idea of which differences in that tumor may have caused it to grow, which can really shed light on tumor biology. So, the idea behind that type of experiment; and I should mention this is one of the first such papers, from Tim Ley and the group out at the WashU genome center, which Elaine Mardis and Rick Wilson run, and the first done on an Illumina platform. The idea is that you take a normal sample and a tumor sample and sequence them, which they did on these Genome Analyzers, so that you have essentially two whole genomes, and then you compare them and look for the tumor-specific variation. This is impressive, because this publication appeared, I believe, in November of 2008 (I'm forgetting dates, I apologize), so about a year ago. And it took them, I think this is telling, 98 runs for the tumor sample and about 34 Illumina runs for the skin sample. You can imagine how much time it took to generate these data; I believe it was on the order of a year. They were then able to identify all the variants in the tumor genome, shown in red over here, and one of the comparisons they made was: here's the Venter genome, here's the Watson genome, how do they overlap? These are the numbers, but you're still left with a large number of variants, over 1.7 million, that are specific to that individual's tumor genome. So what really counts is being able to do the comparison.
This is a flowchart taken from their paper. You take the 3.8 million variants identified in the tumor, and the first thing you do is subtract out all of the variants that were also identified in the normal sample. So now, rather than dealing with the 1.7 million that are not in Venter and Watson, you're down to sixty-three thousand that are tumor-specific. They then go through finding the ones that are novel, not in databases like dbSNP, and then rule out things in non-genic regions, because people find it rather challenging to study non-coding DNA; we know a lot about coding sequence. So all of that sequencing boils down to about 241 variants, and after looking at synonymous and nonsynonymous changes they end up with about eight validated ones. I think this really shows how much effort it took two years ago, and what they got out of it, because they didn't know where to look, was eight validated somatic variants. Now, again, the field is moving so rapidly that today we could do this on the order of two runs, and we're working on ways of getting through this type of variant analysis in a more routine way. It isn't routine yet, but that's what's coming, and I believe that the next time I give this lecture I won't be talking so much about the technologies, but about how we run these kinds of pipelines in a more automated way, and about the patterns of variation we can find across all of the genomes we're now able to sequence.
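The funnel itself is just set arithmetic, so here is a minimal Python sketch of it; the variants are made-up tuples, and the dbSNP and genic-region sets are hypothetical stand-ins for what would really be database and annotation lookups.

```python
# Tumor-vs-normal subtraction funnel: drop germline variants, drop known
# SNPs, keep what falls in genic regions. Purely illustrative data.
def somatic_candidates(tumor_variants, normal_variants, dbsnp, genic_loci):
    candidates = set(tumor_variants) - set(normal_variants)   # drop germline
    candidates -= dbsnp                                       # drop known SNPs
    return {v for v in candidates if v[:2] in genic_loci}     # keep genic ones

tumor  = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")}
normal = {("chr1", 100, "A")}          # shared with tumor: germline
dbsnp  = {("chr2", 50, "G")}           # already catalogued
genic  = {("chr1", 200)}               # (chrom, pos) pairs inside genes
print(somatic_candidates(tumor, normal, dbsnp, genic))
# {('chr1', 200, 'T')}
```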
Another example I wanted to highlight, because it gives you an idea of the trajectory: this one came out, I believe (I don't have the date on it, unfortunately), in late 2009, so a year later. They were now able to do the sequencing, I don't have the raw data showing here, in about a dozen runs, I believe. So down from a hundred runs, to thirty-something, to a dozen, and today we're at about two to three runs; things are really on a downward trajectory. One of the things they did, by comparing this melanoma tumor cell line to a matched normal cell line, was start to define what it means to be a somatic variant. What they dubbed a somatic variant required three high-quality reads in the tumor identifying the variant, together with a minimum of 10x coverage in the normal and no evidence of the variant there. This is a relatively crude way of doing things, and they went through some validation to show it was a reasonable approach, but you can still imagine how crude it might be, and we really need to start defining the basis of how to accurately call somatic variants, because I think this is going to be a big way of unleashing tumor biology. The nice thing about doing a tumor and a normal on the same platform is that the systematic biases and errors inherent to a platform, an alignment method, and so forth largely wash out; you might identify certain variants in both samples, but you might not care, because what you're really after are the variants specific to your tumor sample. Great, so I actually did have this in here; okay, so here's the validation I was mentioning. They identified something on the order of thirty thousand somatic variants. Forty-two of 48 previously known somatic variants were recovered, and if you go back and look at another, I think, five of those, the chromatogram actually does show evidence for them after all; the last one is still inconclusive. (This was in a talk I heard about these data.) They were also able to pick, I think, four hundred and seventy of the calls at random and verify that 452 of those 470, I believe by Sanger sequencing, were in fact somatic variants. So they report roughly an eighty-eight percent sensitivity with a three percent false positive rate. The other neat thing about doing this on a melanoma sample is that the mutation profile really reflected the C-to-T conversions known to come from UV damage, which was a cool thing they showed in this paper.
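That crude somatic-calling rule is easy to state in code, so here is a minimal sketch of it; the function and field names are mine, not the paper's, and real callers weigh base qualities, mapping qualities, and allele fractions rather than bare thresholds.

```python
# The rule described above: >= 3 high-quality tumor reads supporting the
# variant, >= 10x depth in the normal, and zero supporting reads there.
def is_somatic(tumor_alt_hq_reads: int,
               normal_depth: int,
               normal_alt_reads: int) -> bool:
    return (tumor_alt_hq_reads >= 3
            and normal_depth >= 10
            and normal_alt_reads == 0)

print(is_somatic(tumor_alt_hq_reads=5, normal_depth=25, normal_alt_reads=0))
# True
print(is_somatic(tumor_alt_hq_reads=5, normal_depth=8, normal_alt_reads=0))
# False: too little normal coverage to rule out a germline variant
```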
Other things you can do, which I won't go into detail on here: these types of approaches can also look at translocations and copy-number variants. Especially when you have the pairing information, you can identify breakpoints in various genomes, along with amplifications and deletions, and if you correlate those with various genes you can obviously uncover tons of information. This is a nice plot, from a program that's available, that tries to summarize a genome's worth of data: copy-number variants, with the SNPs along the outside and the translocations along the inside. It's a nifty little program. But obviously where people are going now is not looking at one tumor-normal pair, but comparing dozens, maybe hundreds or thousands, of tumor-normal pairs, or thousands of tumors, to identify the variants that tend to be common to many tumors of the same type, or perhaps across many different tumor types. I really believe this whole-genome sequencing approach is going to shed light in ways we probably didn't anticipate. So, I've got a final few slides here, so please bear with me, to highlight a couple of consortia that are starting to sequence lots of genomes. One group is called The Cancer Genome Atlas; it's nice that the acronym is TCGA, the four bases, a lot of people like to be nifty that way. They're trying to do a lot of tumor sequencing and make the data publicly available. Another group calls itself the 1000 Genomes consortium; they're trying to accurately sequence over a thousand genomes to identify variants occurring at lower and lower allele frequencies within the human population, and this is going to be a very useful resource for comparison with other new genomes. The other thing I'd like to throw out at you is this: maybe all of this technology we're trying to learn in-house, buying all of these machines, is going to be supplanted by it simply becoming easier to do this as a service. I throw that out as a hypothesis; I'm not necessarily advocating it. But there is a company I wanted to make you aware of. They call themselves Complete Genomics, and I think for on the order of five thousand dollars a genome (you do, of course, need to do more than one genome at a time), you send them the sample and they send you back the genome sequence. What they have is a proprietary technology in-house; they've built essentially an industrial-sized genome center to accurately and rapidly sequence genomes and provide you with the variants. Because you might not be as excited about the technology, or about advancing the methodologies, which is really where this field is right now, in a method-development stage, trying to refine these tools; you might just be after the variants, and you don't want to invest in a half-million or million-dollar sequencing machine and do all of this yourself. Maybe it's worth your while to just spend tens of thousands of dollars sequencing a number of genomes and get the results back as a service. They've actually recently released some data generated from their proprietary technology.
I think it was three human genomes that were made available, and two of those, I believe, have now been sequenced on multiple other technologies as well, so you can download these data and check them out yourself. And actually, I appreciate what they need to do, but their little legal disclaimer is kind of funny when they're trying to tell you how accurate their genomes are. I think it reads: the human genome sequence data are preliminary and may contain errors. Well, of course we know that, but obviously they have to say it on their web page; I found it kind of amusing alongside the message that this is all great and accurate. But keep your eye out for this company, Complete Genomics; it's certainly in the field, along with Illumina and AB and Roche and Life Technologies now, of all these companies trying to make genome sequencing a commodity tool. So I'd like to end with these final couple of ideas. We're approaching a time when it might be realistic to say that sequencing a genomic DNA sample and identifying the variants you carry could help diagnose a disease, and, I think more importantly, could help figure out how you will respond to certain drugs and which drugs you might respond to better. This is, I think, one of the main reasons we're pursuing a lot of these whole-genome sequencing efforts, trying to translate this to the clinic, and I think it's going to make that translation happen in unprecedented ways. This concept of a designer prescription, tailored to what your alleles are and how you're going to respond, is, I think, going to tremendously benefit the medical and public health communities. And with that I will end. The AV guy made me end with this slide rather than a blank slide, so I will leave it up and take any questions you might have. Thank you very much.