I have about 25 minutes of material in 15 minutes, so I'm going to talk fast. This is not intended to be a review of all competing and emerging sequencing technologies, but just to give you the state of the art, if you will, talk a little bit about what we can do with this at present, so you can think a little bit about the whole genome versus exome discussion, and then try to give you some additional thoughts about how this might play forward. So of course, the ability to sequence a human genome has come a long, long way just in the last eight years. With the old technology it was possible in 2004 to sequence a human genome, but you needed about $15 million in the bank account, and you would need to wait a few years. Of course, the next-gen technology has completely revolutionized not only what we can do, but the questions we can ask by sequencing. The Illumina technology is arguably the best that's out there right now for these tasks. The cost is actually probably dropping even faster than what I have here. I think if you come to St. Louis with 100 nanograms of your DNA, within a week or two you'll leave with a hard drive with all the data on there, all the variants called, probably for somewhere around $3,500 inclusive. We've applied this now to a number of problems. One of the first big successes was in the study of the cancer genome. The slide just represents a genome that we sequenced, the first cancer genome, back in 2007, 2008, when the technology was known as Solexa. Even with the short 32 base pair reads and no paired-end sequencing, it worked. We were actually able to do the tumor-normal genome comparison for this cancer patient and find the somatic mutations that are present only in the tumor. It was just a small number, about 10. We've gone on now to sequence thousands of cancer genomes at our center and the other centers that are funded by NHGRI, as well as around the world.
I think what we've really gotten good at, besides just data production for much less money in a much shorter time period, is actually the analysis of the human genome. Part of the way that we solve this problem is to reduce complexity as much as possible. A system that we devised a few years ago is to simply divide the genome into tiers. Tier one is what you might call the exome. This is about 1.3% of the human genome sequence, and it's all of the coding regions and other genes that are well annotated. There's another 8.5%, tier two, which is the region of the genome that's conserved across all 28 mammals that have been sequenced to good depth. Another 41%, tier three, is non-repetitive, leaving about 50% that is repetitive. So most of the analysis obviously focuses on the red and orange areas. As we sequence across multiple participants in a particular study, be it somatic or germline, quite often we will see what look to be interesting variants or mutations in these other tiers. There's no annotation in terms of the function of those. It's simply the fact that we see them popping up in genome after genome that gives us some notion that those might be interesting. There are a tremendous number of software tools that have been developed for looking at both germline variants and somatic mutations. This just shows you a pipeline that's used at our center, a number of different software tools that are used to identify single nucleotide variants and insertions and deletions. Additional software is used to find structural variation that sort of falls in the black hole between what you might find with sequencing and what you might find with, say, cytogenetics or SNP arrays. There are other modes of these pipelines that can then be applied to a large cohort study, and we often use the same annotation rules that we developed for the cancer genomes. All of these tools are of course available from our website as well as from SourceForge.
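The tiering idea above can be sketched in a few lines of code. This is a minimal illustration only: the interval coordinates and the four-tier fallthrough below are invented placeholders, not the center's actual annotation files or pipeline.

```python
# Sketch of the tier system: assign each variant position to the first
# tier whose intervals contain it. All coordinates here are hypothetical
# stand-ins for real BED-style annotation intervals.

TIERS = {
    1: [(100, 200), (500, 650)],   # hypothetical coding/exome intervals
    2: [(300, 400)],               # hypothetical conserved-region intervals
    3: [(700, 900)],               # hypothetical non-repetitive intervals
}

def assign_tier(pos, tiers=TIERS):
    """Return the lowest tier whose intervals cover pos; 4 = repetitive/other."""
    for tier in sorted(tiers):
        if any(start <= pos < end for start, end in tiers[tier]):
            return tier
    return 4

print([assign_tier(p) for p in (150, 350, 800, 950)])  # [1, 2, 3, 4]
```

A real implementation would use interval trees over genome-wide BED files rather than linear scans, but the prioritization logic is the same.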
This also can be fit into a much broader pipeline that takes whatever clinical information is available. So again, features from genome or exome sequences over here on the left merge into this major pipeline, where we bring in functional annotation from many different databases, and as we move through various filters and start to consider candidate genes and pathways, we can bring in the appropriate drug-gene interactions and end up with something called clinically actionable events. This is work largely of a fellow in our center by the name of Malachi Griffith. This is a very powerful approach to the study of various cohorts. You're looking here at the results from whole genome sequencing of 200 AML patients that's been done in St. Louis, and you're basically just seeing the most commonly mutated genes in these particular cancers. One of the things that we've been able to do, by bringing together excellent clinical phenotype data, is to start to identify, shown by asterisks here, a small number of genes that help us get some idea, when the biopsy is first done and the patient is assessed, by simply sequencing a few genes, of which patients might have more high-risk, aggressive disease and should thus be treated more aggressively, and which may have a much longer event-free survival. We're able to use the genome sequencing data to delve into some pretty amazing things that happen in some of these cancer genomes. This is just the results from a study that we did about two years ago. A patient came into the clinic. The cytogeneticist saw one thing and recommended a bone marrow transplant. The pathologist said no, I think if we simply put her on all-trans retinoic acid she'll be fine. The oncologist was part of our AML genomics group and said, can we just sequence her genome and see what's going on here?
And what we found is that a small region from chromosome 15 had replaced a five base pair region on chromosome 17, resulting in a fusion of the PML and RAR alpha genes, which is common in leukemia but typically by a translocation that can be seen by the cytogeneticist. She was treated with ATRA, went into remission, and is still doing well two years later. So being able to pick up this fine structure is key, and this is something we wouldn't have uncovered if we had simply done exome sequencing for this particular case. One of the other key features of next generation sequencing is that it's digital in nature. The old capillary sequences were essentially analog signals, because you were sequencing a population of molecules. Every sequence read that we get on an Illumina machine derives from a single molecule. We amplify along the way, but the truth is nonetheless that it is digital. So we can count reads within a sequencing experiment, and as shown here, by counting reads and assigning allele frequencies to the mutations that we find, we can get some understanding of the heterogeneity that's present in a tumor sample. So what we've actually done here is sequence three genomes: the normal from this patient, the de novo tumor that was found at the initial biopsy, and then, over here on this axis, we're showing results from the genome sequence of a relapse tumor that was taken about 12 months after the patient's initial diagnosis. So you see here, these in gray are all of the somatic events that are present in both the de novo tumor as well as in the relapse, and then we have additional events here, shown in purple, which are present in the de novo tumor but are lost in the relapse tumor.
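The "digital" counting described above boils down to a simple ratio. As a sketch, with made-up read counts rather than data from the study:

```python
# Variant allele frequency (VAF) from read counts at one site: the count
# of alt-supporting reads over total reads. Counts below are invented.

def vaf(ref_reads, alt_reads):
    """Variant allele frequency from digital read counts at one site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no coverage at this site")
    return alt_reads / total

# A heterozygous variant in a pure diploid sample sits near 0.5; a somatic
# mutation carried by a ~5% subclone sits near 0.025 (one of two alleles
# mutated, in 5% of cells).
print(vaf(50, 50))   # 0.5
print(vaf(195, 5))   # 0.025
```

Plotting such VAFs for the de novo tumor against the relapse tumor is what produces the gray and purple clusters the slide shows.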
We also have some other clusters of mutations that are enriched in the relapse, and then we have a group over here that are not seen in the primary tumor; they're only present in the relapse. And we hypothesized that there's a temporal nature here, so we might be able to suss out a bit, even though we're working with only three snapshots instead of a movie, the evolution of this relapse tumor. And we were able to build a model such as shown here: this is the patient's tumor at the initial biopsy, here's a second biopsy that was taken when the patient was in remission after chemotherapy, and then this is the result of a biopsy taken at the point of relapse. You can see the most dominant clone present at diagnosis was completely eradicated by chemotherapy, and the tumor that essentially then expanded with additional mutations, and is responsible for the relapse, was present at only 5% in that initial tumor sample. So these are the kinds of things that we can find, besides just simply finding mutations or interesting variants in genes and so forth. What are additional opportunities in large cohorts? There are basically three options for sequencing. That's a bit of a simplification, but I think it probably works for the purposes of this conversation. First, targeted sequencing, where we make use of this technology that emerged in the last few years, hybrid capture: we simply make probes to all of the regions of the genome, or regions of genes, that we want to sequence. We do hybridization, and we use some method of extraction, magnetic beads or so forth, to pull those targeted regions away from the remainder of the genome. So this works well: we can simply make a list of candidate genes that we'd like to sequence for a particular disease or phenotype, or perhaps regions of interest, these might be GWAS peaks, and then simply capture and sequence these in a large collection of samples.
We can expand this to include all genes, and this of course is exome sequencing. There are a number now of commercially available reagents, probes that approach the exome, and there are many definitions of the exome that try to ideally catch all CCDS exons and other selected RNA genes. And then lastly, of course, there's whole genome sequencing, where we simply go after the whole thing. So whole genome or exome: a couple of pros and cons are shown on this slide. Exome sequencing of course costs way less than whole genome, about a sixth of the cost at present, because there's still a lot of biochemistry that needs to happen before you ever get to the sequencing machines. The analysis is greatly simplified: you're typically operating on only about 60 megabases of the genome. Thus you can sequence more samples, so more study power, perhaps, with an exome versus whole genome approach, and you're essentially getting at the low-hanging fruit. You're looking at the annotated regions of the genome, and hence it's going to be a little easier to tie that to the biology. Whole genome sequencing has the advantage of also sampling all non-exonic variants, the tier two and tier three that I mentioned a bit earlier, and there's good evidence, of course, that these may play a role in various human diseases. Whole genome sequencing also resolves the fine structure around deleted genes and exons, similar to what I showed you earlier for the one leukemia case that we sequenced. This is also a good way to understand what's happened if a particular gene, or a region of a gene, is deleted. With exome sequencing in this case, you would simply get a negative report. The whole genome will also cover exons that are not covered, or poorly covered, by the capture reagents.
One of the things to keep in mind is that you're trying to do hybridization and fit it all into one universal condition, where you have, you know, various GC contents and so forth aiming to be captured, and there are a lot of good tricks that have been employed. So these reagents are good, but there are still genes and exons that they do not pick up. And lastly, whole genome sequencing will allow you to resolve structural variation, again similar to what I showed you, cryptic translocations and so forth. You can pick up copy number variation, some of which is missed by SNP arrays and so forth. Well, how well do the exome reagents work? They're pretty good, and I've listed a number of them over here. We tend to use this one from NimbleGen. There's one from Agilent here. Illumina now has their own exome reagent, and you can see in parentheses the amount of the genome that each of these capture reagents targets. You can see how Illumina did, or sorry, NimbleGen did a nice job of evolving, adding more content to their reagent here. We compare these all relative to something that we call WU space, which is our in-house definition of the exome. This is about 47 megabases and includes all CDS exons, RNA annotations from the various databases, 38,000 gene names, 120,000 transcript names, etc. All right, and you can see that even for the best of these reagents, we're still targeting only about 70 to 75% of what we would really ideally like to capture. And these continue to evolve. They will add content; quite often we will spike in additional probes for regions that we know are poorly covered, and so forth. Another thing to keep in mind is that when we sequence exomes, or when we sequence just selected regions of interest, we're really underusing the Illumina capacity. The HiSeq 2000 now generates about 300-plus gigabases of sequence data per flow cell, and we run two flow cells every time we turn on the instrument.
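The "70 to 75%" remark is just the reagent footprint measured against the 47-megabase in-house definition. A quick sketch of that arithmetic, with illustrative overlap figures rather than measured values:

```python
# Share of an ideal exome definition (~47 Mb in the talk) that a capture
# reagent actually targets. The per-reagent overlap numbers are invented
# stand-ins, not published footprints.

IDEAL_EXOME_MB = 47.0

def fraction_of_ideal(overlap_mb, ideal_mb=IDEAL_EXOME_MB):
    """Fraction of the ideal exome definition covered by a reagent."""
    return overlap_mb / ideal_mb

for name, overlap_mb in [("reagent_a", 33.0), ("reagent_b", 35.0)]:
    print(f"{name}: {fraction_of_ideal(overlap_mb):.0%}")
```

Overlaps in the low-to-mid 30s of megabases land in the 70-75% range quoted, which is why spiking in probes for known gaps moves the needle.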
So a lot of people have devised these indexing or barcoding schemes that have allowed us to come up with really nice pooling and multiplexing methods to make better use of the technology. So we simply make libraries. The black region here indicates the target sequence. We have two library adapters, one over here in red, one over here in purple. The purple one in this case has a small green tag of six bases. And we can take two different DNA samples and two of these adapters with different six base pair tags, and we can make libraries so we have DNA fragments such as this, with all these unique regions in the middle. And then we can go through all of the steps before sequencing. The key thing here is that after we've constructed libraries, we can actually pool these samples, and this is going to make all of the work, as well as all the DNA sequencing, much easier and much more economical. We then demultiplex using software tools, align to the reference sequence, and execute variant detection. So after that demultiplexing is done, because of our little barcode tags we know which sequences came from which individual DNA samples, and we can look at the various regions where we might have variants showing up and then, again, know which individuals these derive from. We can take this to an extreme, where we simply start with a 96-well plate of these indexed DNA libraries, use our capture probes, do sequencing, and then end up with the data. Going with 96 samples per lane, and there are eight lanes on each of the two flow cells that we use, we get very nice coverage across genes, very uniform on-target results, and very high enrichment factors. We've now used these in St. Louis for quite a number of projects.
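The demultiplexing step described above can be sketched as a toy routine. The barcodes and reads below are invented, and real pipelines also tolerate index mismatches and do quality trimming, which this sketch omits.

```python
# Toy demultiplexer: each pooled read begins with a 6-base index tag;
# route the remainder of the read to the sample that tag encodes.

BARCODES = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}  # hypothetical tags

def demultiplex(reads, barcodes=BARCODES, tag_len=6):
    """Split pooled reads into per-sample bins keyed by their index tag."""
    bins = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        if tag in barcodes:
            bins[barcodes[tag]].append(insert)
        else:
            unassigned.append(read)
    return bins, unassigned

bins, leftover = demultiplex(["ACGTACGGGTTT", "TGCATGAAACCC", "NNNNNNGGGCCC"])
print(bins)      # {'sample_1': ['GGGTTT'], 'sample_2': ['AAACCC']}
print(leftover)  # ['NNNNNNGGGCCC']
```

After binning, each sample's reads go through alignment and variant detection separately, which is how variants trace back to individuals.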
This is all targeted sequencing, sub-exome, where you can see we've started out, say, in metabolic syndromes with a target region of about a quarter megabase, up to this cleft lip project where we've targeted about six and a half megabases of the genome, and we've done these in fairly large numbers of samples. This works well. This is just one really good result from the metabolic syndromes, where we found a number of variants in our targeted sequencing that all correlated nicely with this gene here. So, just in terms of giving you some thoughts about costs and how we might utilize these various approaches in a large-scale sequencing project, let's just start with an arbitrary budget of $10 million, and this is focused only on data production. Targeted sequencing, done the way that I explained on previous slides, for a targeted set of up to four megabases, or a targeted set of between four and eight megabases: you could see that, at a cost of about $200 per DNA sample, you're looking at being able to use that sort of funding to sequence 50,000 individual DNA samples. It falls off a little bit when the target region expands a bit. Exome sequencing uses commercial reagents with about a 60 megabase target. This has actually fallen below $1,000 per exome; it's closer to about $700 per exome now, and this is again using indexing, where we can sequence five individual DNAs in a single lane on the Illumina flow cell. And then whole genome sequencing: typically we're right around 30x coverage, although we tend to dial this in depending exactly on the questions asked and the nature of the DNA samples. But at a conservative $5,000 per genome, this budget extends to 2,000 genomes. These costs include library production, capture reagents, reagents for sequence production, data processing and storage, which is actually one of the more expensive components of the process, and an initial pass of variant detection.
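The budget arithmetic above is straightforward to check. The function name is mine; the per-sample costs mirror the figures quoted in the talk ($200 targeted, roughly $700 exome, a conservative $5,000 whole genome).

```python
# Samples affordable under a fixed data-production budget at a given
# per-sample cost, using the talk's arbitrary $10M budget.

BUDGET = 10_000_000  # the arbitrary $10M data-production budget

def samples_affordable(cost_per_sample, budget=BUDGET):
    """Whole number of samples a budget covers at one per-sample cost."""
    return budget // cost_per_sample

print(samples_affordable(200))    # 50000 targeted samples
print(samples_affordable(700))    # 14285 exomes
print(samples_affordable(5000))   # 2000 whole genomes
```

This reproduces the 50,000-sample targeted figure and the 2,000-genome figure quoted above, and shows why a cheaper per-sample assay buys so much more study power for the same money.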
My costs here don't include higher-level analysis or validation. And I will stop there. Do we have about five minutes? Oh, I could have talked slower. Rick, how many samples can you multiplex at a time? So how many patients can you run at the same time? So the best results so far have been, we've pushed it up to 96 per lane, and it's simply a question of, you've heard about these multiplexing schemes for a long time, with great claims of having four or five thousand barcodes. Yeah, the problem is that it's just like the hybrid capture business: they don't all perform uniformly, and you need that to get a good result. So 96 is the upper limit right now, I would say. Yeah, Rick, could you speculate on what you think the next two years will bring with respect to decreasing the percentage of both the exome and the whole genome that we miss? As you suggested, 75 percent of the exome is really caught by what we have now with high enough quality. And then similarly, we miss what, 10 to 15 percent of a whole genome due to all sorts of things. And many are worried that a lot of interesting things are embedded in those particular places. And so the question of revisitation: do you do it once? Are a lot of these people going to potentially need to be sequenced a second time with a supplementary technology, or something, so that at some point we can really say we have thoroughly sequenced them, as opposed to just sort of sequencing them? Yeah, it's a great question, Steve. Exome: incrementally better. Again, I think there's an opportunity for, you know, boosting performance with spike-in probes. So, like I said, we just started thinking really a lot about Alzheimer's disease sequencing. But one of the things that we discovered is that APP has exons that are basically not covered by some of the exome reagents. So we'll spike it, and we'll make sure that we boost those up for those missing exons. We've done this for cancer, right? p53, first couple of exons poorly covered.
So how much of that happens at the company, versus how much of it happens among the folks that are really driving the projects? I think that's an important question that's yet to be answered. I think for whole genome, the problem is, there are two things. I mean, we have to get better at the reference. People at NHGRI hear this from me all the time. And we're working on that, but we need to work harder, and we need to have more resources in the mix. That's my, you know, funding comment for the night. You're talking to us about it. Yes, I know. But I think the tougher problem is resolving sequences that are very similar when we only have 100 base pair reads. All right, so we continue to bang on that. We expect that reads will get a little longer. But, so, for example, think about MHC genes, right? We don't resolve all of those, because they're very similar. If we had a 400 base pair read, it would be a different ballgame. So do I see the Illumina technology going to 400 bases? Probably not, not in the next couple of years. Do I see it extending maybe 50 bases in the next couple of years? Good chance of that. So that was a good question about the coverage. But what about cost? Five years from now, will we be doing whole exome sequencing, or will it just seem like there's no need to limit yourself? Well, there's what I think and what I hope. You know what I hope, right? Because, you know, having seen the things that we've seen in these whole genomes that we've done for cancer, we find all kinds of stuff that we're not going to get with exomes. And so we want to push for more. How much will costs come down? I don't know. I mean, there's a little bit of an artificial ingredient right now because of the different commercial players. So for example, there are lots of reports of Illumina doing whole genome sequencing in house for as little as $2,000. Right? Why is that?
It's because there's another company up the coast that's sequencing for cheaper and cheaper. So those $2,000 costs are true, but they're subsidized. And they're probably not going to continue. So what will happen with cost? I think costs will, again, incrementally creep down over the next couple of years, just like exome reagents, and our ability to better resolve what we see in exome or whole genome sequence data will creep up incrementally. You know, it's nice to think about a new technology popping onto the scene, Oxford Nanopore, you know, or suddenly somebody like PacBio has a wonderful breakthrough, and all the problems then go away. But there are a lot of things that I think have to happen there. So I think I'm more optimistic about the incremental improvements over the next year and a half than I am about that big bang thing happening. Rick, let me ask you a question about somatic variation, which you've thought more about than most, but outside of cancer. I mean, the people in this room are coming together with large sample sets, and we use the word cohorts, and often in cohort studies we have a sampling structure: we follow these people over time. We typically think of it as sequencing and then adding clinical information over time. I wonder if we need to think beyond that, of looking at genomic variation over time, and thinking about the impact of genomic variation over time on health. But again, my question is going beyond cancer. Please comment on that. Well, I think the only place that I've seen that so far is in cancer, right? And it's been very powerful. You know, the other place that we've done this, besides the relapse tumors in leukemia, is the metastatic tumors in solid tumors, in some cases where we've had multiple mets going to different organs. And you essentially end up building a phylogenetic tree for five or six different genomes from the same patient. So it works well. It's pretty powerful.
And I don't see any reason why you couldn't extend that to simply, you know, temporal sequencing of individuals in a cohort. Yeah. Great. Thank you very much.