Well, thank you, Eric, for that wonderful overview of where we are today in the world of genomics and sequencing. As he said, I'm going to be talking about the how-tos of whole exome sequencing. It was quite amazing, when we put this course together and announced it on the web, how many requests we had to attend this series of lectures. The storm threw a little wrench in the works, I think, but luckily this is all being recorded, so people can view it live now or later.

So, I'm Jim Mullikin, director of the NIH Intramural Sequencing Center. I was appointed about a month ago; prior to that, I was acting director for about two years. NISC itself has been in existence since 1997; Eric put it together and headed it until I became acting director. I'm also head of the Comparative Genomics Unit.

In today's talk I want to focus on the technical details, the hows, but let's first ask why. With the turnout we have for this course, there's clearly great interest, so I'll highlight a few of the reasons, which Les Biesecker and I laid out in an opinion piece you'll find in Genome Biology as the argument for whole exome sequencing; the argument for whole genome was made by Kevin Shianna from the Duke Center for Human Genome Variation. So why are people interested in whole exome data? It focuses on the part of the genome that we understand best, the exons of genes: we know that certain changes in those regions of a gene can change the way the protein works. Exomes are also ideal for helping us understand high-penetrance allelic variation and its relationship to phenotype, and you'll see that in some of the later talks today on applications of this technology. And getting down to one of the fundamental reasons why the exome is a good first step in the process: whole exome sequencing currently costs right around one-sixth as much as whole genome sequencing. Comparing apples to apples, what it costs within one center to do either a genome or an exome, that ratio has stayed pretty constant at about one-sixth, which is quite substantial. Another huge impact is that an exome is only one-fifteenth the data: if storage devices cost you money and you want to store the data a long time, fifteen times more data is a lot more storage. There's also machine time: it takes fifteen times longer to generate a whole genome on the same machine than a whole exome, so you tie up your machines far longer if you're only doing whole genomes. So there are quite a few reasons to stick with the whole exome for the near term, and maybe even for quite a bit longer.

Now I'm going to jump into more of the technical details, and I just want to say that the approaches and platforms I'll be describing in this presentation do not constitute an endorsement of any product or commercial entity; they are used solely to illustrate general principles. The NISC group occupies the top floor of the building you see here out in Rockville, the Twinbrook Research Building.
We moved there in 2004; prior to that we were at another location, but we've been in this facility since 2004. This is what the sequencing floor looked like in February 2010, about a year and a half ago. In this area here are six of the GAIIs we had at the time, two others are over at this position, and here are some 3730s. Then a year later we installed our first two production HiSeq machines, and we've been using those in conjunction with the other machines as well. As Eric was saying, the sequencing technologies are racing along, and you have to keep upgrading your technology as the field changes so rapidly.

Here's a comparison of the different platforms. This is not a timeline; it's just which machines produce how much data today. The 3730xl down here generates this amount of data per run, and this axis is a log scale, going from kilobases to terabases per instrument run. You can see that the different technologies have hugely different outputs; the HiSeq right now is generating about 600 gigabases of sequence per machine run. And if you convert that into cost, which is on this other scale, the cost per base is driving quite low for the newest technologies. There are advantages and disadvantages to all of them: turnaround time is very nice for these and longer for these, but the cost per base is higher.

At NISC we've been taking on a lot of different sequencing projects. You'll notice from the top bar, and the break in the graph, that we're actually well over a thousand exomes processed so far, but we're also doing a lot of other types of experiments with our machines, for various investigators and even other institutes.

Looking at the Illumina technology output by month: we started running these in production with two machines, then added more machines, and the technology improved, so we got greater output. This is in billions of reads per month. Then we added the HiSeqs and decreased the number of GAs, so we started to focus much more on running the HiSeqs. In the month of August we were able to hit about 35 billion reads; running in 100-base-read mode, that's about three and a half terabases per month that we can generate with our sequencing machines today.
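Just to make that arithmetic explicit, here's a quick back-of-the-envelope sketch in Python; the figures are the ones just quoted, and the byte-per-base storage factor is an assumption for illustration:

```python
# Back-of-the-envelope check on the throughput numbers just quoted.

reads_per_month = 35e9        # ~35 billion reads in the month of August
read_length = 100             # machines running in 100-base-read mode

bases_per_month = reads_per_month * read_length
print(f"{bases_per_month / 1e12:.1f} Tb per month")   # ~3.5 Tb

# At a rough (assumed) one byte per base once bases and qualities are
# compressed, that is on the order of terabytes of new data every month,
# which is why the storage systems described next fill up so quickly.
```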
With thousands of samples coming through, it's very important to be able to track all the data. As investigators come to us with projects and we take their samples in, we need to make sure we track them properly. So we have a laboratory information management system, based on the Cimarron software that we've been using for many years at NISC, and we've written a new system for managing these types of projects. Here we see a flow cell layout: on a HiSeq or a GA there are eight lanes, and this is just one of the operations, where you assign a sample to whichever lane you want it to go on at that stage.

There are a lot of computational needs for the data flowing off these machines. At NISC, for production processing of the data that comes from the six GAs and the three HiSeqs, we have a Linux cluster on that floor with 1,000 cores, about 250 of which are available for production operations. We've got a petabyte of disk, again about a quarter of that for production, and it gets eaten up very quickly, so we're always looking at how to expand those systems or compress the data in new ways to stretch how long the storage will last. Networking is also very important: you need high-bandwidth networking to handle the large data flows that come from these machines.

But today we want to hear about whole exome sequencing, so now I'm going to drill down on this particular aspect of our pipeline. In the exome sequencing pipeline I've highlighted the major topics I'm going to go through today. Quickly: once we get a sample in, it needs to be fragmented into shorter pieces so that we can make a library according to whatever protocol we're using; then the exome enrichment stage happens, which I'll detail; then the samples get loaded onto the sequencing machine and clusters are generated; sequencing and base calling is performed; and sequence read alignment, a critical stage prior to variant detection, is performed. So now I'll go through each of these steps.

As I said, first the sample is fragmented: we fragment the DNA into lengths of about 300 to 400 bases. Then the fragment ends are repaired and an A overhang is added. This is specifically the Illumina protocol, and the diagram comes from their website if you want to see it in more detail. They have adapters with a T overhang that ligate to the end-repaired fragments, and then we can select out the ones that conform properly to this layout just by amplifying the DNA with a few PCR cycles, five cycles at this stage.

Then we can enter the process of enriching for the exome portion of the genome. This example uses the Agilent technology that we've used quite a bit; we've used both the 38-megabase and the 50-megabase SureSelect capture kits, and it's basically the same approach. This part I've already described: fragmenting the DNA and making the initial library for the sequencing platform. Then you mix it together with their reagents and a biotinylated RNA bait library, designed against regions located on the exons of the genome. Once these have hybridized, you mix them with streptavidin-coated magnetic beads, and by holding the captured fragments with a magnet, you can wash away the unbound fraction; that's the enrichment process happening here. Then you can release the library from the beads by digesting the RNA, and you have the same library you started with, but enriched for just those regions targeted by whichever capture kit we were using. Then one more round of amplification occurs, typically about 10 PCR cycles, prior to having a library ready to load on a sequencing machine.

We've also used the Illumina TruSeq exome enrichment kit. It involves many of the same processes I've just described, except they use DNA baits instead of RNA baits, and their latest kit targets a little bit more of the genome. Another process that can be implemented along the way is to use a portion of the sequence to read an index, and there are 12 indices that you can use right now with the Illumina system.
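Those indices are what let the pipeline pull pooled samples back apart computationally later on, in the demultiplexing step I'll come back to. As a flavor of what that matching step does, here is a minimal sketch; the index sequences and the one-mismatch tolerance are illustrative assumptions, and the real CASAVA demultiplexer is more careful than this:

```python
# Minimal demultiplexing sketch: assign each read to a sample by its
# index read, tolerating one mismatch. Index sequences are invented.

samples = {
    "ACAGTG": "sample_1",
    "GCCAAT": "sample_2",
    "CAGATC": "sample_3",
}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign(index_read):
    """Return the unique sample within one mismatch, else None."""
    hits = [s for idx, s in samples.items() if hamming(index_read, idx) <= 1]
    return hits[0] if len(hits) == 1 else None   # ambiguous/unknown -> None

print(assign("ACAGTG"))   # sample_1 (exact match)
print(assign("ACAGTT"))   # sample_1 (one mismatch tolerated)
print(assign("TTTTTT"))   # None (no index close enough)
```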
You can pool up to six of those different tags at once before you do the exome enrichment, so you're actually enriching six libraries at once that have all been tagged uniquely. A critical point is that you need to balance these libraries correctly, so that each will be equally represented once you get the sequence back out. We use qPCR to quantitate the concentration of each library independently; then we can balance them, pool them together, and once they've been enriched, load them onto typically two lanes of the latest Illumina flow cell, where the next step of the Illumina process, cluster generation, occurs, diagrammed here. The DNA has special ends that attach to the flow cell I just showed you. Then this wonderful step of bridge amplification occurs, which amplifies each fragment into a cluster large enough to be imaged; once the clusters have been generated, you can get a good signal off each of them as the sequencing is performed, which I'll show next. Here the sequencing primer has been annealed, and as each base is added, we can see the fluorescence that comes from it, reading each base as the sequencing proceeds in a cyclical fashion. The different color fluorophores tell us which base is which, and that generates the base calls. The HiSeq flow cell currently has eight lanes and can generate up to 1.5 billion clusters, which converts to about 300 gigabases of sequence per flow cell.

Next is the sequencing and data processing, running up here through Illumina's whole CASAVA pipeline. We run the machine in the 2 x 100 paired-end read mode, and we also read the index. The system sees the index tag and demultiplexes the information, pulling your data back apart: since it was all pooled and put on one lane, it needs to be pulled back apart bioinformatically, and that's done in this stage called demultiplexing. Each of these data sets then refers back to the original sample it was associated with. The CASAVA pipeline includes an alignment stage called ELAND, and the reads are aligned to the human genome, or in another case the mouse genome: this also works if, say, you've captured the exome of a mouse, you can have it processed against the mouse genome, but the bulk of what we've done is against the human genome. So the alignment is performed. And to give you an idea of the amount of sequence: nominally, per sample, if the pool is properly balanced, we'll generate about 10 gigabases of sequence from the two lanes. There are six samples across those two lanes, so that's 60 gigabases of sequence from the two lanes, or 10 gigabases per sample, and this converts to about 100x coverage of the targeted regions of the exome.
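The arithmetic behind those numbers, as a small sketch; the on-target fraction here is an assumption I've chosen for illustration, which happens to reproduce the roughly 100x figure:

```python
# Coverage arithmetic for a pooled exome run, using the numbers above.

samples_per_pool = 6
gb_per_sample = 10                            # ~10 Gb of sequence per sample
total_gb = gb_per_sample * samples_per_pool
print(f"{total_gb} Gb total from the two lanes")          # 60 Gb

target_bases = 50e6     # roughly a 50-Mb capture design
on_target = 0.5         # assumed fraction of bases landing on target

mean_depth = gb_per_sample * 1e9 * on_target / target_bases
print(f"~{mean_depth:.0f}x mean depth over the targeted regions")  # ~100x
```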
To give you an idea of what the raw coverage looks like, this is a UCSC browser view of a gene on chromosome 2 with many exons. You'll see different patterns: here are isolated targeted regions, and you get fairly consistent coverage. This is one sample, and looking vertically, these are four different samples, so consistency is fairly nice across samples. There's more variability as you look along the different targeted regions, and that happens because, well, here they needed a lot of probes, so you get a lot of overlapping information in that region. If we zoom in on that particular cluster, you'll see that even in the zoomed-in view the coverage is fairly consistent across samples, but variable along one sample across the different targeted regions.

This leads to one of the reasons why, with exome sequencing, you can't just say: if it's only 2% of the genome, why isn't it just 2% of the sequence? The reason you need more sequence than that direct ratio is that in an ideal system, doing whole genome sequencing for example, you would get something like a Poisson distribution of reads across the genome: if you sequenced to about 60x coverage, you would get this kind of distribution, nicely covering every base. But because the affinities differ for the different probes, and sometimes probes sit close together as I showed earlier, some regions get very high depth of coverage and others get lower, because things didn't enrich quite as well. So you need to push more sequence through the system to make sure you cover that more difficult fraction. I just wanted you to keep in mind that this variability here translates into a broadening of this distribution, and thus an exome takes about one-fifteenth of a whole genome's sequence rather than the one-fiftieth you might expect from its size.

After we have the ELAND alignment, I'm going to talk about a specific refinement of the alignment that my group developed, called DIAG-CM. ELAND is part of the standard pipeline, and we can leverage that, because ELAND accurately places reads at the correct genomic location; but I'll show you a couple of images shortly where the fine-scale alignment isn't as good as we would like. So we use another aligner called cross_match. It's a Smith-Waterman aligner, a local aligner, and it does a very good job of spanning indels. Another thing we do here: because these are paired-end reads, if a read was never placed but its mate was, we throw the unaligned read into the same bin as its aligned mate. The cross_match step works in regions of the genome: it believes what ELAND says is the correct location, but then does a realignment, in localized 100-kb regions across the genome, of whatever reads fall in those bins. Here's what the alignment looked like from ELAND to start with: there's a six-base deletion in here, and it just couldn't quite capture that information, so it went quite wrong on either side. If you look after the cross_match improvement, everything looks very nice; you can clearly see there's a six-base deletion at this location in the genome. We developed this whole approach about two years ago, and other groups are pushing along quite rapidly with other alignment methods, so we always want to make sure that we're at least as good as the best in the field.
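cross_match itself is a production Smith-Waterman implementation; as a feel for why a local aligner handles an indel like that six-base deletion gracefully, here is a minimal textbook Smith-Waterman sketch with a linear gap penalty and toy scoring values, not cross_match's actual algorithm or parameters:

```python
# Minimal Smith-Waterman local alignment (linear gap penalty), showing
# how a local aligner spans a deletion cleanly. Toy scores throughout.

MATCH, MISMATCH, GAP = 2, -2, -1

def smith_waterman(ref, read):
    n, m = len(ref), len(read)
    # H[i][j] = best local alignment score ending at ref[:i], read[:j]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_ij = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i-1][j-1] + (MATCH if ref[i-1] == read[j-1] else MISMATCH)
            H[i][j] = max(0, diag, H[i-1][j] + GAP, H[i][j-1] + GAP)
            if H[i][j] > best:
                best, best_ij = H[i][j], (i, j)
    # Trace back from the best-scoring cell to rebuild the alignment.
    i, j = best_ij
    top, bot = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i-1][j-1] + (MATCH if ref[i-1] == read[j-1] else MISMATCH)
        if H[i][j] == diag:
            top.append(ref[i-1]); bot.append(read[j-1]); i -= 1; j -= 1
        elif H[i][j] == H[i-1][j] + GAP:
            top.append(ref[i-1]); bot.append("-"); i -= 1
        else:
            top.append("-"); bot.append(read[j-1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bot))

ref  = "GATTACAGGCTTGGCACCATGAACGT"   # reference with TTGGCA present
read = "GATTACAGGC" + "CCATGAACGT"    # read carrying a six-base deletion

aligned_ref, aligned_read = smith_waterman(ref, read)
print(aligned_ref)    # GATTACAGGCTTGGCACCATGAACGT
print(aligned_read)   # GATTACAGGC------CCATGAACGT
```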
And it turns out that to compare aligners, you need to know truth, and the only way to know truth is to simulate. So these are simulated data: this axis shows the percentage of reads correctly placed versus the simulated variant size, where zero means a single-nucleotide substitution, negative values are deletions of that size, and positive values are insertions. We simulated with the appropriate depth of coverage, like you saw in the previous slides, and we modeled the depth and the error models as well. The outcome is that the green line is ELAND, and because DIAG-CM throws in the unaligned mate, we get a slightly higher percentage aligned there, and we're doing as well as, if not a little better than, Novoalign. It was surprising that BWA didn't quite hold up as we would have thought: six-base deletions and insertions are more challenging for that aligner.

Next, after you have a good alignment of the reads to the genome, you need to convert that into variant calls; though it's more accurate to say that what you really want from the data are the genotype calls at every position. The autosomes are typically diploid, and you want to know what both alleles are at each position. This program, Most Probable Genotype (MPG), is a Bayesian genotype-calling package. It models the ten possible diploid genotypes at every single base: we look at every base in the genome and figure out what the call would be, even if it's homozygous reference, so we have a genotype call for every position where we have enough coverage. In this case you'll see a column of eleven A's and three T's at this position. If you were picking reads at random, out of 14 possible coin flips you'd expect 50-50; it isn't quite 50-50, but there are cases where you'll see a skew like this. The system gives a probability for each genotype given the data, and in this case the most probable genotype, AT, minus the score of the next most probable genotype, which is 14 lower on the log scale, gives an overall MPG score of 14 for this AT call. We've found empirically that scores of 10 or higher are generally good calls, but you can always raise that threshold if you need to. And again, we don't really look at depth of coverage to say a base is covered; we look at whether we have a good-quality MPG score at each base position in the genome, and then we know whether it's covered properly or not.

There's another part of the genome we need to take care of. I've already mentioned that autosomes are normally diploid in human samples, though in cancer samples that can be quite different. MPG is designed to call two alleles on the autosomes, and on the X chromosome if it's a female sample. For male samples we change the mode it works in, so that in those regions it attempts to call just one of the four nucleotides, rather than all ten possible genotypes for a diploid region, because you expect only a single allele. This applies to the non-pseudoautosomal regions of the X and Y chromosomes.

Another test we've done is to figure out, for a given kit, and this is the TruSeq kit, what the optimal amount of sequence to generate is for a given experiment. For the human genome, we had one data set with a lot of coverage, and we titrated it back in total gigabases from 19 right down to one. You can see this curve of coverage of all the coding exon bases as the amount of sequence increases, and typically we feel that right around five or six gigabases would be enough.
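Going back to the genotype-calling step for a moment: to make the scoring idea concrete, here is a toy Bayesian genotype scorer in the spirit of MPG. It is not the actual MPG model; real callers use per-base quality scores and better error models, so the toy score below comes out smaller than the MPG score of 14 quoted above, but the ranking of genotypes behaves the same way:

```python
import math
from itertools import combinations_with_replacement

# Toy diploid genotype scorer in the spirit of MPG: score every possible
# genotype against the pileup, report the best and its lead over the
# runner-up on a log10 scale. Flat error rate instead of real qualities.

ERR = 0.01                     # assumed uniform per-base error rate
GENOTYPES = ["".join(g) for g in combinations_with_replacement("ACGT", 2)]

def log10_likelihood(genotype, pileup):
    """log10 P(observed bases | genotype), reads sampling either allele."""
    ll = 0.0
    for obs in pileup:
        p = sum((1 - ERR) if allele == obs else ERR / 3
                for allele in genotype) / 2
        ll += math.log10(p)
    return ll

def call(pileup):
    scored = sorted(((log10_likelihood(g, pileup), g) for g in GENOTYPES),
                    reverse=True)
    (best_ll, best_g), (next_ll, _) = scored[0], scored[1]
    return best_g, best_ll - next_ll   # lead over the runner-up genotype

pileup = "A" * 11 + "T" * 3            # the 11-A / 3-T column from the talk
print(call(pileup))                    # ('AT', ~3.2): the het call wins
```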
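And on the titration question: under the idealized Poisson model mentioned earlier, you can compute how much of the target would reach a given depth for a given amount of sequence. A sketch, with an assumed on-target fraction; the point is that real capture coverage is broader than Poisson, which is exactly why more sequence is needed than this ideal model predicts:

```python
import math

# Idealized Poisson model behind titration curves like the one above.

target_bases = 62e6     # TruSeq targets about 62 Mb
on_target = 0.5         # assumed fraction of sequence landing on target

def frac_at_depth(total_gb, min_depth=10):
    """Fraction of target bases at >= min_depth if coverage were Poisson."""
    mean = total_gb * 1e9 * on_target / target_bases
    below = sum(math.exp(-mean) * mean**i / math.factorial(i)
                for i in range(min_depth))
    return 1.0 - below

for gb in (1, 3, 5, 10):
    mean = gb * 1e9 * on_target / target_bases
    print(f"{gb:>2} Gb -> mean {mean:5.1f}x, "
          f"{frac_at_depth(gb):6.1%} of target at >=10x")
```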
As I told you earlier, we're doing 10 gigabases of sequence per sample, but there's going to be variability in the balancing of the samples across those two lanes, so the ones on the lower end will probably still hit enough coverage, and the higher ones will definitely have enough. I should say that this is all of the UCSC coding exons; the TruSeq kit didn't design baits for every single one, so even their design isn't at 100%, but it's very high, as you can see here.

We've run three different capture kits. We started with the Agilent SureSelect 38-megabase capture and have run about 600 samples through our pipeline; this axis is months, up to September of this year, against the number of samples we've processed. Then we switched to the 50-megabase kit, because it's an improvement and we need to keep moving as the capture technologies advance, just as we do with the sequencing technologies. Then the Illumina TruSeq came along, and it allowed capture of pooled, indexed samples, so there were some efficiency savings there, and a little bit more of the genome was captured, so we've just recently switched to that kit as well.

Here's an example of a gene with different coverage designs. The top three tracks here are the baits designed by the different kits. This gene, NBL2, was not included at all in the 38-megabase capture; for the 50-megabase capture they added it, and TruSeq has it. This is all because the kits were designed at different time points, and everyone wants to do the best they can. TruSeq also added these other two genes out here that weren't covered by the 50-megabase capture. And it's very important to look at what actually gets covered: you can ask where the baits are, but what actually gets covered? You'll see here that with the 38-megabase capture, even though this region wasn't targeted, there's some off-target capture; those calls probably wouldn't be as trusted, because they weren't targeted. In this case the gene was targeted by both the 50-megabase capture and the TruSeq capture, and it's hard to see what's going on at this resolution, so zooming in on this end of the gene: yes, nothing is targeted in the 38; the 50 targets mostly the coding exons and a little of the UTR; but what actually comes through is very good coverage of this gene, even the UTR, from the TruSeq, and very good coverage of the exons from the Agilent 50-megabase capture.

Then you want to figure out what you get from all of this. I've told you a lot of the hows, but what comes out of the system? With this latest capture kit, this is a fairly typical example; sometimes you get more, and we don't get a whole lot less than this, so this is toward the lower end of typical for total bases with high-quality genotype calls. They're only targeting 62 megabases, but we're actually getting much more than that, almost double that amount of sequence with good-quality calls after this analysis. The single-nucleotide variants detected is this number here: 142,000 variants, both homozygous and heterozygous. If you look at just the heterozygous fraction, and consider only the autosomes of this sample, then, since an individual's autosomes carry essentially two genomes, one from the mother and one from the father, comparing those against each other should give what has typically been stated as one variant per thousand bases.
You'll see here that it's a little bit less than that. Why is that? It's because we're focusing on the coding regions of the genome, which are more highly conserved and don't vary as greatly. So this number comes in below the typical 0.001, at about 0.00076, a little less because we are targeting a much more conserved region of the genome. We don't get these values for the X or the Y because this is a male sample. But if we go and look at a female sample, we can again calculate this within-sample heterozygosity value, the same value as before, for a different individual; the X chromosome has a much lower overall heterozygosity, and there are population-history reasons for that, along with the fact that there are only three X chromosomes in the population for every four copies of each autosome. So this gives you an idea of the scale of the amount of data generated by this process.

If you're interested in seeing a few variants, this is one example of a heterozygous position, a single-nucleotide change at this position in the genome, and this is a heterozygous deletion in this individual. These can be reviewed; typically we just rely on the MPG calls for the data, but we can go back and look at the raw versions of the data and make sure things really do make sense.

So what does the coverage look like for the different kits, and even whole genome, if you want to compare to that? Here's the total input sequence for the different capture methods, and coverage is counted wherever we have MPG scores of 10 or higher. Looking at aligned sequence: for the SureSelect 38-megabase we have five gigabases of aligned sequence, 131x coverage; for the SureSelect 50-megabase, a little more sequence but a little less coverage, because they're targeting more; similarly here, a little more sequence, a little more targeted, 114x coverage. So the total aligned sequence is fairly similar for the different kits, and if you just look at the consensus coding sequence, the CCDS portion of the genome that all of these kits were designed around, we end up right around 90% coverage of the exome. When you look at whole genome, people think, oh, whole genome will save the day, because you'll get everything. Well, in this case it was slightly older sequencing technology, so this would probably be a little bit better today, but there are still reasons why you won't get complete coverage of the consensus coding sequence regions: there were GC biases, and coverage of the coding regions isn't quite as uniform as you would like to see. And if you broaden this even more widely to the UCSC coding genes, a much more inclusive set but still just coding sequence, the first kit was not designed for everything, so it's lower; the 50-megabase and 62-megabase captures are up there in the high 80s or near 90%; and whole genome is significantly less.

Another comparison we can do is to ask how accurate these genotype calls from whole exome sequencing are. Here again are the different capture kits, with whole genome shotgun below, and when you compare all the genotypes that overlap between each platform and the Illumina 1M genotyping chip, the concordance is at the 99.9% level across the board, so we're doing very well in that regard.
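As a footnote on that check, the concordance computation itself is simple. A minimal sketch with invented calls; a real comparison also has to reconcile strand and allele ordering between the chip and the sequence calls:

```python
# Minimal genotype-concordance sketch: compare calls at overlapping
# sites between two platforms. All calls below are invented.

exome_calls = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "AC"}
chip_calls  = {"rs1": "AG", "rs2": "CC", "rs3": "TC", "rs5": "GG"}

shared = exome_calls.keys() & chip_calls.keys()
agree = sum(sorted(exome_calls[s]) == sorted(chip_calls[s]) for s in shared)
print(f"{agree}/{len(shared)} concordant = {agree / len(shared):.1%}")
# 2/3 here with toy data; the real exome-vs-chip numbers were ~99.9%.
```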
So what are we applying this to? You're going to hear about this in the later talks today; we're applying it to many different projects. The Undiagnosed Diseases Program is one of them, with hundreds of samples; the ClinSeq project, which is over a thousand samples; and a variety of other PI-driven projects, and you'll hear from those investigators later today, for example on cancer or other rare diseases. To give you an idea of the throughput per year per type of machine, we can get about 200 exomes processed per GA, and about six times that for a HiSeq 2000. As I've already said, we're at near 90% or higher coverage of the consensus coding exon bases, and the accuracy of the genotype calls is also quite high.

So those are the areas I wanted to cover today in the exome sequencing pipeline. But now that we've generated these variants, over 100,000 per sample, and when you mix together, say, 100 samples, that gives rise to 600,000 or more variants, what do you do with them? How do you work with such a large data set? The next speaker, Dr. Jamie Teer, will address these steps: how to annotate and then work with these large data sets.

Finally, in closing, I'd like to thank all the people who have worked on this project at NISC: sequencing operations, headed up by Bob Blakesley, with Alice Young and all the lab staff who have worked on this; the many bioinformatics needs of this whole pipeline, headed up by Gerry Bouffard; and Linux support, Jesse Becker and Matt Lesko. In my research group, Nancy Hansen has pioneered a lot of the work that you've seen here today, along with Pedro Cruz and Praveen Cherukuri, who were heavily involved in the whole pipeline; they've recently left to go to other places and work on new things. And from the Biesecker lab, Jamie Teer, and it's Jamie Teer who will give the next talk. I'd like to thank you all for your attention. Yes, questions will take place at the end of this session, at 11:30 or so, I think. Thank you.