OK, welcome back. I hope everybody is properly caffeinated and snacked, and we will take it to the next step. So I am going to present sort of the phase three. You heard from Jim Mullikin some of the nuts and bolts of how the DNA actually goes into the instruments, how raw data are generated and undergo early analysis. You heard from Jamie about intermediate analysis and some user tools that people can use to actually begin to manipulate the data. And so what I wanted to present next are some general considerations. I come at this from the perspective of a practicing clinical human geneticist, and I think about what kinds of projects ought to go into this pipeline and how I'm going to look at the data. And so I will give you some examples of how those things work. What everybody's asking themselves is: what can I do with these tools that are so widely available and becoming so inexpensive? Which of my patients should I bring in to be sequenced using this technology? And which of the samples that I have previously collected might be appropriate for that? There's lots to choose from, and there are a lot of considerations that you need to take into account to decide which projects you want to pursue and how you will analyze them to answer the questions that you want to approach. OK, so there's sort of some good news and some bad news here, as always. Exomes are not magic, but they can work wonders and do things that were absolutely not possible before. You can do things with small families that you could not do with positional cloning in the past, and that is certainly an incredibly powerful option with exomes. You can, if you will, unstick stuck positional cloning projects, and I'll give you an example of what I mean by that. Basically, in the old days, if you were to land in a huge region and you weren't in the mood to sequence 300 or 400 genes using a capillary instrument, you can now approach that with exomes.
De novo dominant disorders, of course, were completely non-tractable using positional cloning because there was no mapping to be done. And there are some other circumstances that I'll talk about in my examples. Just this summer I attended the Gordon conference on genetics and genomics up in Rhode Island, and there was quite a bit of discussion about how often exome sequencing works for Mendelian disease projects. The consensus from a number of the speakers and presenters at that meeting, which was a bunch of people who are doing really good work in this area, is that it works about 30% to 40% of the time. So again, it's not magic, and there are reasons why your project may not work. Hopefully this course and my talk will help you figure out which ones are more likely to work so we can increase that fraction. Certainly, we're all subject to publication bias. Not too many people want to publish unsuccessful exome gene identification projects, so we don't get to see those — we don't get to see what can go wrong. But I'll try and give you, again, some background to think about that; there are lots of reasons for that to happen. The other thing to think about is that it is not necessarily enough these days to do a gene identification by exome sequencing to get yourself a good publication out of your efforts. Coupling your exome disease gene identification to additional genetic and functional data should help that a lot, and I'll give you some examples of that. OK, so what I'm going to talk about in this talk, from the perspective of a practicing physician who wants to understand what's going on in their patients, is what is in an exome and what is not — because you have to think about that. An exome is not the whole genome, and what are those differences? Then the differences of exome sequencing when compared to positional cloning — they each have strengths and weaknesses — and then some sort of how-to's. And I'm going to give examples of five projects. Some are from our lab.
Some are from other people's labs, just to give you a flavor of different types of projects: X-linked, recessive, dominant, sporadic de novo, and a mosaic case. So what is a whole exome sequence? This is somewhat of an unfortunate label, because it is not all that whole. What it is supposed to include — the definition of an exome — is the sequence of all exons in the genome. And there are some problems here. First, of course, is that we don't currently know what all the genes are in the genome. The second is that we do not necessarily know all the exons of all the genes that we do recognize to be there. Non-coding exons are not consistently targeted: some kits and some approaches do target those, some do not. Not all the exons that you do target are effectively captured. And if you do target and capture them, not all of the sequences that you generate can be aligned and have accurate bases called from them. That leaves you with a fraction of the exome that you're actually interrogating. So some of the things that are missing from a whole exome sequence — and this derives from the previous slide — are these. Some genes are just not there at all, and you'll see an example of how that tripped up some investigators in a study I'm going to present later. Some parts of some genes may not be there. Almost all of the non-genic control elements that control gene regulation are going to be missing from an exome sequence. Non-canonical, or what we sometimes call deep intronic, splicing elements that are important areas for mutations in genes may be missing. Structural DNA assessments — that is, small inversions or movements of pieces of DNA within your genome — are very poorly assessed by exome sequencing. So if your disease is caused by a mutation of that nature, you're in trouble. Copy number variation is currently not very tractable using exome sequence, though some work is being done in that area.
Mitochondrial DNA is mostly not targeted, but some information on that can be derived from an exome. Some microRNAs and other small RNA molecules are included in exome sequence, and some are not. So it's very important, if you're thinking that your disease or trait could be caused by one of those kinds of DNA elements, to think very hard about whether exome is the right approach or whether you should go to whole genome, as Jim discussed in the first talk. Okay, so let me set up the contrast — the pros and cons, strengths and weaknesses — of exome sequencing versus positional cloning. As I mentioned, I think in the second slide, with exome sequencing you can get away — and you'll see examples of this — with using very small families, and sometimes single individuals, to identify potentially causative mutations. As those of you know who practice positional cloning, it is nearly essential to have large families if you wanna do meiotic mapping, which is the first step of a positional cloning project. For exome sequencing in general, what you want is locus homogeneity. Locus heterogeneity is going to dramatically increase the number of samples that you're going to need to sequence to find the variants that you want to find, because in most cases you're going to be looking across samples for variation in genes shared amongst your different affected persons or samples. The more loci you have that cause your trait, the harder it will be to do that. Locus heterogeneity is not a huge issue for positional cloning because that's sorted out in the first step. When you're doing your meiotic mapping, you are localizing genes, and you can tell right away whether the family you're analyzing maps to a given locus or not, using linkage inclusion or exclusion. So this is an issue that is tricky for exome sequencing; it can cause some difficulties in positional cloning, but it's easily sorted out there.
Allelic heterogeneity in an exome project is a great thing, and that's because of the overall sensitivity, as has been shown by Jim Mullikin. If you have a 90% sensitivity for detecting any given part of the exome, and it happens that your trait is caused by a single mutation in a single gene, there is a 10% chance that no matter how hard you flog that exome, the variant will not be there. If you have allelic heterogeneity, that spreads your risk across multiple sequencing targets and increases the chance that you are going to see it. And for positional cloning, allelic heterogeneity is not a big issue. The issue of what I call the bird in the hand versus one in the bush is always one that is tricky to parse when you're doing exome sequencing. You are going to be faced, as you have seen from the previous speakers, by a list of variants that are the potential cause of the trait that you're studying. Your task at that juncture is to make a decision as to whether you should pursue some subset of those variants that have been identified, versus pursuing in more depth other variants that may not have been initially detected or may be problematic to detect in your samples. That's a trade-off and a critical decision juncture that you have to consider as you're moving through this process of gene identification in a pipeline. In positional cloning, that's not so much of an issue. It's an issue in exome sequencing because these gene candidates are gonna be, most likely, all over the genome. In positional cloning, you usually have a region of the genome nailed down and you know you need to interrogate just about everything that's in that region to be thorough, which is something you just can't do in an exome sequence. Phenocopies are not generally a big issue in exome sequencing, as you will see in some of the examples I'm gonna show.
That is, patients who appear to have the phenotype but don't have it for the underlying genetic reason that you think is the cause. Phenocopies can be a huge issue in meiotic mapping because they can lead you to erroneously exclude or include regions in your mapping project. So there are pros and cons, and you should think about your project, your phenotype, your trait, and decide how it fits in here and how you're gonna approach it. There was a lot of talk also at the Gordon conference about type I error, and the disparaging term that was thrown around at the meeting is that exome sequencing — or whole genome sequencing as well — has the potential of generating what are called "just-so stories." This is based on the Kipling book, where fanciful stories built on circular reasoning that mean absolutely nothing can be beautifully drawn up and be completely wrong. And that's really a manifestation of type I error. When you're interrogating 20,000 candidate genes for a trait, the prior probability of any one variant that you find being the cause is small, and you can be led down a garden path if you don't do the other supportive studies to buttress and validate those exome findings and carry them forward. And again, without meiotic mapping, which massively decreases your type I statistical error problem, you're gonna need additional sources of evidence for causation. In fact, I'll show you later how you can fold mapping and linkage data in to support your finding and reduce the probability of that error. Okay, the first example of the five kinds of traits I wanna talk about today — first because it's the easiest to do, because it's the smallest amount of the genome to interrogate — is an X-linked disorder.
It doesn't so much matter what the trait is, but its general nature is important, because the character of your trait will determine what filters you want to apply and how you think about the results of that filtering — that is, the results of that sequencing project. So this is an X-linked trait that causes defects in males: severe congenital anomalies with 100% lethality, which helped us a lot in the filtering. Also, the fact that it was ultra rare — there were only two known families affected with this trait — allows you to do good, powerful things with your filtering against controls, as Jamie described. Now in this particular case — and you always have to think about sample availability, and the quality and quantity of the DNA that you have — it turned out we didn't have any usable DNA on the boys for whole exome sequencing, or exome sequencing I should say, and so we sequenced the carriers. All right, so what do these numbers look like? Just for X, here are some of the raw starting numbers of what the initial data onslaught will look like that you need to deal with. So, the exome "capture" — and I put that in quotes because we got a fairly hard time from our reviewers; they thought we shouldn't call it an exome at all because it's only the exons of a single chromosome. It's all of the exons on X that are not in the pseudo-autosomal regions of the X chromosome, and the genes are defined as the UCSC annotated coding exons. So right there, again, you see a decrement of interrogation: from everything on X, we're starting to notch our way down. The sequences of the two samples were about 20 million reads off the Illumina instrument, and about two thirds to three quarters of a billion base pairs of sequence.
Of that sequence, just under half aligned back to the intended sequencing target of the experiment, which sounds low, but of course it's very good, because the coding region of X is a small fraction of all the input DNA that went into that experiment. So that is an incredibly strong enrichment. About half aligned to those targeted exons with high coverage, and as Jim mentioned, with that wide dynamic range of coverage that you see in exome sequencing, you need to go deep to ensure that you have a lot of bases covered by 10X or more because of that tail on the left side of the distribution. So we had about two million base pairs of sequence at 10X coverage, which in that case is a 77% hit rate. This is an old iteration; I think the hit rate would actually be higher now with the new versions of this kit. This is old technology, probably 18 months old — again, things change so fast here. And just as a note, I did mention that we are sequencing carrier females here, so you can use an autosomal base caller for calling your variants. But if you do an X-linked project and you're using males, you need to have a different caller, because the algorithm is different, of course, for calling variants in hemizygous males. Okay, so what did the filtering look like? It was based on a number of criteria — again, the biology and the genetics of the trait. Since we were sequencing carrier females, we said that the variant should be heterozygous. We knew that the trait was severe, so we could set a relatively high filtering threshold for the kinds of predicted changes in the genome that could cause this; we didn't think it was too likely that this would be caused by a mild missense variation. So we looked for non-synonymous variants, indels that would frameshift the open reading frame, and nonsense variants (splice variants should be on here also; that's a typo). And again, because this trait was so rare, we could implement relatively stringent filters for presence of the putative variant in controls.
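The filtering logic just described — heterozygous in the carrier females, a severe predicted consequence, and absent from dbSNP and controls — amounts to a simple predicate applied to each called variant. Here is a minimal sketch of that kind of filter; the record fields and variant IDs are hypothetical, not from any particular pipeline:

```python
# Sketch of the X-linked carrier filtering described above.
# Variant records are plain dicts; field names and IDs are invented for illustration.

SEVERE = {"nonsense", "frameshift_indel", "splice", "non_synonymous"}

def passes_xlinked_filter(variant, control_variant_ids):
    """Keep heterozygous, severe, control-absent variants in a carrier female."""
    return (
        variant["genotype"] == "het"                  # carrier mothers are heterozygous
        and variant["effect"] in SEVERE               # severe lethal trait: stringent effect filter
        and variant["in_dbsnp"] is False              # risky filter -- can discard true positives
        and variant["id"] not in control_variant_ids  # seen in unrelated disease exomes? exclude
    )

variants = [
    {"id": "chrX:1000A>T", "genotype": "het", "effect": "nonsense", "in_dbsnp": False},
    {"id": "chrX:2000G>C", "genotype": "hom", "effect": "non_synonymous", "in_dbsnp": False},
    {"id": "chrX:3000C>T", "genotype": "het", "effect": "synonymous", "in_dbsnp": False},
]
controls = {"chrX:9999G>A"}
survivors = [v["id"] for v in variants if passes_xlinked_filter(v, controls)]
print(survivors)  # only the heterozygous nonsense variant survives
```

In a real project each of these conditions would be relaxed one at a time if the stringent version leaves nothing standing, as discussed below.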
And we started out here with two samples with 350 substitutions, and this already excludes a number of changes — thousands of variants have already been excluded by implementing some of these filters: heterozygous, non-synonymous, absent in dbSNP (which, as Jamie mentioned, is a dangerous thing to do, and you have to be very careful about it; I'll have more to say about that in a few slides). We sequenced concurrently patients who had other disorders, and so we excluded variants found in those samples as well. And what you can see here is that this worked exceedingly well — this was an easy project. Family one had no substitution variants that met all of the filters, and it had one variant that was a frame-shifting indel that did meet the filters. Family two had a single variant that was a nonsense variant, and no variants that were frame-shifting, in all of the exons on the X chromosome. And it turned out that this hit, this hit, and this hit were in the same gene. So that's a very nice and tidy result, and this was an example of what I was talking about earlier — a stuck project. This was a project that had sat in the laboratory for about five or six years, because it had been mapped to a 25 to 30 megabase region of the X chromosome that included 200 or 300 candidate genes, and we just didn't have the wherewithal to sequence those using PCR and a 3130 capillary sequencing instrument. So it sat there until the X exome came along, and then we applied that tool, and this worked. Now this was a fortunate occurrence, because these relatively simple and relatively stringent filters that we implemented took us right to variants in a single gene. You could say: well, what if that wasn't the case, and you weren't quite so lucky — what could you do? And this is an example of what I was referring to earlier: you can actually use meiotic mapping, linkage, or haplotype data to exclude regions of the genome and reduce the number of variants you have to consider.
So if you use mapping data from a family like this, that took us to a region of the X chromosome and eliminated, for the various variant categories, between 60 and 80% of the variants from consideration, because they were excluded by linkage alone. The important thing to remember about this use of linkage is that it does not require those high thresholds that we used to apply for ab initio linkage mapping, i.e., a LOD score of 3.0 for an autosomal locus or 2.0 for an X locus. Whatever linkage you can extract from the family that you have will help eliminate variants. So any amount of cranking that down, if your first filters don't give you a hit, can be very, very helpful in pushing variants out of consideration. But you have to be careful: the old criteria that we used in positional cloning — whether you should allow a single recombinant to exclude a region or require two recombinants to exclude it — still apply, because you can erroneously push things in and out of consideration, especially if your disorder has phenocopies or ambiguous affection status among the patients that you're studying. Okay, in this particular case, because this was a gene of no known function with no assay available, the only thing we could look for as supportive evidence was expression. So we took this to some colleagues who were good at mouse expression analysis, cloned out the mouse gene, hybridized that to embryos, and showed that the gene was expressed in exactly the tissues one would predict based on the phenotype in the humans, which was nice. It's not the strongest supporting evidence. But again, if you're on the X chromosome, the probability of a type I error is much less, and so not quite as robust an evidence set is needed. Okay, next I wanna flip to autosomal recessive.
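Folding linkage in as a filter, as just described, amounts to discarding any variant that falls outside the interval(s) compatible with the family's haplotypes. A minimal sketch, with made-up coordinates and intervals:

```python
# Sketch: use a linkage-compatible interval to exclude variants from consideration.
# Coordinates and the interval are invented for illustration.

def in_linked_region(variant_pos, linked_intervals):
    """True if the variant falls inside any interval not excluded by recombinants."""
    return any(start <= variant_pos <= end for start, end in linked_intervals)

# Suppose haplotype analysis of the family leaves one candidate interval on X.
linked = [(48_000_000, 78_000_000)]

candidate_positions = [12_500_000, 52_300_000, 95_000_000]
kept = [p for p in candidate_positions if in_linked_region(p, linked)]
print(kept)  # variants outside the interval drop out of consideration
```

Note that, per the caution above, the intervals themselves should be drawn conservatively — a single apparent recombinant in a phenocopy or misassigned individual can wrongly shrink the list.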
And this is a case that we worked on with a colleague, Chuck Venditti, who's in our institute. This is a severe childhood-onset metabolic acidosis disorder. Chuck's group did a huge amount of work to exclude other known loci that can potentially cause this phenotype, so this was a very refined cohort of patients for whom all known causes had been excluded. And this was an exome project where we sequenced only three samples: an affected child and his two parents. So this single trio led us to variants — and again, this is an autosomal trait, so now we're looking across the genome. This gives you the kind of numbers that Jim and Jamie both started with, that 100,000 to 150,000 range of variants, and then we started implementing filters to crank the number down, because obviously that is completely intractable. So we could exclude a number of variants based on the most-probable-genotype calling algorithm that Jim described. Then we used the zygosity status of the samples to exclude variants — i.e., the parents should not have two variants in the gene; the affected should have at least two — and then the type of mutation that we want. This, again, is something that you will need to play with in your projects. Generally what we recommend is starting at a relatively stringent level to see what you can find, and then, if you don't find what you're looking for, relaxing your criteria one at a time, backing down your stringency to let more variants come through, and seeing if that shows what you're looking for. We did, again, a risky thing at the outset, which was to exclude variants that were in dbSNP. We also excluded variants that were homozygous in controls — that would be in our ClinSeq set, which is a set of adults who are presumably normal for this trait — and you will see in a few minutes how much trouble that can get you into.
We also excluded variants with an allele frequency of greater than 10%, reasoning that the carrier rate could not be anywhere near that high if this is a rare recessive disorder. That left us with a dozen genes in which the affected had two variants, with one present in each parent. Here is the list of genes, and then Jennifer Johnson in the group looked manually at this list and curated it, and discovered that ACSF3 was a gene predicted to have a mitochondrial leader sequence, and was a good candidate for that reason. All right, I wanna mention here that we also had to buttress the genetic data in this case. We had seven unrelated affected individuals for genetic confirmation of variants in the gene, and it's always a question, at the outset of a project like this, how many samples should go into the exome pipeline versus how many you should hold back for manual genotyping — PCR and 3130 capillary sequencing — post hoc. That's a decision that you can make based, of course, on cost considerations, because exomes are fairly expensive. But don't forget that technician and postdoc time is also a very expensive commodity, and so balancing that and doing the right number of exomes versus manual follow-up work is an important design consideration. Okay, whether and how to use dbSNP. As Jamie said, dbSNP is a helpful and potentially dangerous tool. It is a repository of genetic variation that is completely independent of any relationship of the variant to any particular disease, and individual variants in dbSNP can be pathologic. Some of those variants are going into dbSNP from disease studies and even from clinical pathology labs, and the cohorts that are in those studies may be very, very different from what you think they might be.
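The trio logic for a recessive trait — two qualifying variants in the same gene in the affected child, one contributed by each parent — can be sketched as a gene-level grouping step. Gene names, variant IDs, and the data layout here are all hypothetical:

```python
from collections import defaultdict

# Sketch of recessive trio filtering: keep genes where the affected child carries
# two qualifying variants, one inherited from each parent. All data are invented.
# (A homozygous child genotype would appear as one record in a real VCF; this
# sketch handles the compound-heterozygous case for simplicity.)

def recessive_candidate_genes(child_variants, mother_variant_ids, father_variant_ids):
    by_gene = defaultdict(list)
    for v in child_variants:
        by_gene[v["gene"]].append(v)
    candidates = []
    for gene, variants in by_gene.items():
        if len(variants) < 2:
            continue  # a recessive model needs two hits in the gene
        from_mother = any(v["id"] in mother_variant_ids for v in variants)
        from_father = any(v["id"] in father_variant_ids for v in variants)
        if from_mother and from_father:  # one allele contributed by each parent
            candidates.append(gene)
    return candidates

child = [
    {"gene": "GENE_A", "id": "varA1"},
    {"gene": "GENE_A", "id": "varA2"},
    {"gene": "GENE_B", "id": "varB1"},
]
mother = {"varA1"}
father = {"varA2", "varB1"}
print(recessive_candidate_genes(child, mother, father))  # ['GENE_A']
```

This is the step that, in the study above, reduced the genome-wide variant list to the dozen candidate genes that were then curated by hand.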
We had a situation recently where we were annotating genomes for cardiac conduction defects and found that over 2% of a dbSNP sample set had the same variant we were looking at. When we dug down deep into the dbSNP data, it became clear that that was a sample set of patients who were recruited for a pharmaceutical trial for cardiac conduction defects. So you have to be very careful about what you use, and it can be tedious to dig down to that level in dbSNP to find that out. We tend to rely more on our own internal control sets, because we know them better. So: your causative variant can be in dbSNP. And again, your filtering is going to be iterative, so you can try some dbSNP filtering early on, but don't hesitate to back off on that filtering if you don't find what you want, and use minor allele frequency cut-offs that are conservative. We generally shoot for five to ten times the estimated frequency of the disorder as a threshold, and say that if the frequency of a single variant is above that, it's unlikely to be causative. And remember cystic fibrosis, which I think Jim mentioned earlier — it's a good example. Cystic fibrosis has a very peculiar allele distribution, such that 70% of the alleles are a single allele. So you would want to apply that kind of frequency filter very, very carefully, because you wouldn't want to exclude a variant based on commonality when it might be the causative variant. Okay, I'll skip that. All right, so we used our ClinSeq control set to measure the carrier frequency of this candidate gene, ACSF3, that was identified in the study. And you can see here the raw data, exported from Jamie's VarSifter program into an Excel spreadsheet: homozygous reference, heterozygous, and homozygous non-reference.
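The five-to-ten-times rule of thumb can be turned into a concrete cutoff. For a recessive disorder with incidence q², Hardy-Weinberg gives a causative allele frequency of roughly q, so a conservative filter is some multiple of that. A small worked sketch (the incidence figure is illustrative):

```python
import math

# Sketch: derive a conservative minor-allele-frequency cutoff from disease incidence,
# using the 5-10x rule of thumb quoted above. Numbers are illustrative only.

def maf_cutoff_recessive(disease_incidence, multiplier=10):
    """Max plausible frequency for any single causative allele of a recessive trait."""
    allele_freq = math.sqrt(disease_incidence)  # Hardy-Weinberg: incidence = q^2
    return multiplier * allele_freq

# A disorder affecting roughly 1 in 1,000,000 births:
cutoff = maf_cutoff_recessive(1e-6)
print(f"exclude variants with MAF > {cutoff:.3f}")  # 10 * 0.001 = 0.010
```

As the cystic fibrosis example shows, a skewed allele distribution (one allele carrying 70% of the burden) is exactly why the multiplier should err on the generous side.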
And what you notice here is that in our so-called control set, we had a patient pop up who was homozygous for a variant that was found, and thought to be pathologic, in the patients with this methylmalonic acidemia. We went back and looked at this patient carefully, and it turns out she actually was symptomatic; when we evaluated her biochemically, she actually has this disease. So you have to be very careful — your own control sets may contain this disease. And this is one of the powers of exome and genome studies: you can find phenotypes that you didn't even know you ought to be looking for, and your preconceived notions about who's affected and who's not can be challenged by these data. So Chuck's group went on to do functional studies — nice enzymatic studies and localization studies — to confirm that this was the cause of this disorder. Okay, so that's recessive. Next, dominant. I'll give an example from two papers that appeared recently on a rare disorder that includes skeletal dysplasia and other anomalies. It was thought to be a great project because it has severe osteoporosis associated with it, and they were hopeful that the findings would be generalizable to more common forms of osteoporosis. But again, given that this is ultra rare and autosomal dominant, they could institute relatively stringent filters, and they also had several cases that were de novo. So this is another group that used essentially the same processes as NISC, including Agilent capture and Illumina sequencing. Three to four gigabases per sample were generated. They used one simplex case — that is, there was only one affected in the family — and one multiplex family proband. And so three samples, again, went into the sequencing instrument.
Their filtering criteria were relatively similar: non-synonymous, nonsense, splice, and insertion-deletions. They again instituted a relatively stringent dbSNP filter, as well as 1000 Genomes and their own controls: they said that because their disorder was so rare, and dominant, the variant should appear zero times in any of these control samples. And the only gene that was mutated in all three cases was NOTCH2. They went on to buttress that with genetic data and Sanger sequenced 12 kindreds; 11 out of the 12 had mutations in the same gene. Seven of them were simplex cases, and six of those had both parents available, and the mutations were confirmed as de novo using manual sequencing. Interestingly enough, this paper appeared in Nature Genetics quite recently and was published without functional data, I think because of its high priority as a skeletal phenotype. Fourth example: sporadic de novo disorders. Again, this is a genetic pattern that was essentially completely intractable with positional cloning. The first one that was done was Kabuki syndrome — or Kabuki make-up syndrome, as some people call this disorder. It's a dysmorphic syndrome with skeletal phenotypes and intellectual disability. It's quite rare, and most of the cases are simplex, or apparently de novo. A slightly different approach was taken here than at NISC, and they used a strategy that you might not expect: they did not exploit the de novo aspect of this disorder. What they did instead was sequence across a number of affected probands and look to see if they could find the variant that way. Reading between the lines, I'm guessing that they were concerned about the error rate of their platform leading to too many false positives for potential de novos. Their selection platform, which works quite well in addition to the other ones you've heard about today, was an on-array selection approach, followed by sequencing to relatively lower depth than we used.
This was one of the first studies, so they only went to 40X coverage of the mappable regions. Now this is a good example, and I really applaud these authors: they published a filtering strategy that they used that did not work. I thought this was illustrative. What they asked was, from their 10 probands, for any N of those 10, how many of them shared variants in genes with the given criteria? So here are the criteria: non-synonymous, splicing, indel, and nonsense variants; absence in dbSNP or 1000 Genomes; absence in their own controls — absent in both of those data sets. For genes with qualifying variants in at least one individual, you can see they had about 7,000 across those specimens. Requiring variants in the same gene in two or more individuals narrowed it down, and then successively three, four, five, six, seven, eight, nine. As they went across and down, of course, the numbers go down, because there's more and more filtering, and that left them with a single candidate — and it was not the cause of that disease. So it's a good example of how filtering can leave you at a dead end. You have to then take a step back and say: what's an alternative strategy? What are the potential explanations for not finding the gene, and how should I do it next? They did another strategy where they went back and did really careful phenotyping and rank-ordered their patients, using clinical experts to rank the patients from the most severe to the least severe, reasoning that they should be filtering based on the best, strongest, most clear-cut cases. You'd notice that on the previous slide there was no assessment of the functional consequence of the predicted change in the genome; they implemented that here as well, and did manual review of those data. And with the patients rank-ordered, they identified nonsense variants in this gene, MLL2, in four out of four of the highest-ranked — i.e., the most severe — cases, and then there were a couple of hits in the lower-ranked cases.
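The published strategy — for each N, count the genes carrying qualifying variants in at least N of the 10 probands — reduces to a per-gene tally of distinct affected individuals. A sketch with invented proband and gene labels:

```python
from collections import defaultdict

# Sketch of the shared-gene filter: count, per gene, how many probands carry a
# qualifying variant, then ask which genes are hit in at least N of them.
# Proband and gene labels are invented for illustration.

def genes_shared_by_at_least(proband_gene_hits, n):
    counts = defaultdict(set)
    for proband, genes in proband_gene_hits.items():
        for gene in genes:
            counts[gene].add(proband)  # track which probands hit each gene
    return {g for g, probands in counts.items() if len(probands) >= n}

hits = {
    "P1": {"GENE_X", "GENE_Y"},
    "P2": {"GENE_X"},
    "P3": {"GENE_X", "GENE_Y", "GENE_Z"},
}
print(sorted(genes_shared_by_at_least(hits, 3)))  # ['GENE_X']
print(sorted(genes_shared_by_at_least(hits, 2)))  # ['GENE_X', 'GENE_Y']
```

As the study shows, the survivor of this filter at N = 10 need not be causative: locus heterogeneity and capture gaps both break the "shared in all probands" assumption.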
So you can see why they failed with the first filtering strategy: they were trying to be too inclusive, and they had two problems. Number one, there is locus heterogeneity for this trait, so if you filter across all those variants, you're gonna stumble on that. And number two, they did not have great coverage of the genome in their original capture. They had 96% coverage overall, but they were missing a significant number of exons in this particular gene, and when they went back and did manual capillary sequencing, they found mutations in two of the cases that were missed by the next-gen sequencing. They then did manual testing of that candidate gene in 43 cases and found mutations in more than half of them, and went back and did the de novo check for those cases for which they had DNA samples from both parents: in 12 out of 12 cases, both parents were negative for the mutation, confirming that it was de novo. The mutations were very, very heterogeneous here — again, allelic heterogeneity is your friend. There was some locus heterogeneity here, which is one of the things that tripped them up in their earlier filtering. And here's an example: two groups were working on this disease at the same time, and the European group was gracious enough to make public the fact that they did not find the causative gene, because the exome targeting kit that they used did not include MLL2. So again, if your targeting kit doesn't include your gene, you can spend all the money you want on sequencing and you won't find the mutation. And again, no functional data were included here. Now I'll briefly go through a mosaic disorder, because it highlights some sampling issues. This is a disorder of asymmetric overgrowth that includes pigmentary lesions that follow the lines of Blaschko, and vascular malformations. It is never familial, and has been described in discordant monozygotic twins.
So the hypothesis was that this was a somatic and not a germline disorder, which explains the clinical observations but poses some challenges: the sequencing requires a different approach. Almost all sequencing is done using peripheral blood mononuclear cell DNA from affected patients, but here one might not want to do that. What and who do you sequence? If a disorder is mosaic, you might not want to be sequencing just any part of a patient; you might want to sequence and compare what's affected with what's unaffected. And fortunately for us, there was a pair of discordant monozygotic twins that we had access to, and the unaffected twin was a great control. What DNA? We sequenced DNA from skin biopsies and surgical specimens. The skin biopsies were collected from clinically affected and unaffected areas based on the clinical judgment of those taking care of the patients, along with tissues harvested in the operating room with a clinician standing at the surgeon's side designating specific tissues as affected or unaffected. No blood DNA was used in the sequencing study, because there was no known hematopoietic phenotype for this disorder, so the affection status of blood could not be determined. The filtering criteria here were pretty similar to the previous ones, although which tissues were compared was quite different. So we have the same set: non-synonymous, nonsense, splice-site, and indel variants, absent from dbSNP. Although, as Jamie showed you, that could have tripped us up: when TCGA started sequencing tumors, they found the same variant we did in some cancers and deposited it in dbSNP. That can lead to a false-negative result if you filter it out. In most of our affected-unaffected intrapatient pairs we had between 100 and 300 sequence differences in the exome data, and those had to be validated manually.
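The intrapatient comparison can be sketched as a set difference between the affected-tissue and unaffected-tissue call sets, followed by the same functional and dbSNP filters. This is a schematic illustration, not our actual pipeline; the variant-tuple representation and function names are invented for the example.

```python
def candidate_somatic_variants(affected_calls, unaffected_calls, dbsnp,
                               functional_classes=("nonsynonymous", "nonsense",
                                                   "splicing", "indel")):
    """Variants called in affected tissue but absent from the matched
    unaffected tissue of the same patient, restricted to functional
    classes and not already in dbSNP.

    affected_calls / unaffected_calls: dicts mapping a variant key
    (e.g. (chrom, pos, ref, alt)) to its predicted effect.
    """
    candidates = []
    for variant, effect in affected_calls.items():
        if variant in unaffected_calls:
            continue  # shared between tissues -> likely constitutional, not mosaic
        if effect not in functional_classes:
            continue
        if variant in dbsnp:
            continue  # caution: somatic hits deposited by TCGA can hide here
        candidates.append(variant)
    return sorted(candidates)
```

The dbSNP check is exactly where the TCGA deposit could have produced a false negative, so in practice one would review what was filtered at that step rather than discard it silently.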
One of them in fact persisted, and you can see here the distribution: the mutant allele in blue, the non-mutant allele in white. All of the patients have the same variant, at different frequencies in different tissues. And when you look across hundreds of specimens you can see a bias: the affected tissues had much higher levels of the mutation than the unaffected ones. Again, blood was borne out to be not a very fruitful place to look for this mutation. Following up with the patients: 29 out of 31 had the same mutation; the mutation was found more often in affected tissues than in unaffected tissues; peripheral blood was not positive; and the variant was absent in controls. It was also absent as a call in the 1000 Genomes Project. But you can dig deeper into the 1000 Genomes data and ask how many sequence reads include the variant you're interested in. When we did that, we found that a single sequence read out of roughly 30,000 reads at that position in 1000 Genomes did in fact carry it, but of course that base wasn't called, because it was below the base-calling stringency. And we had functional data showing that the protein was actually malfunctioning in those patients, as predicted by the mutation. Okay, so where are we in the big picture? I think you can see from these five examples that, across all the different genetic models of Mendelian traits in humans, you can use exome sequencing to delineate the molecular etiology. It doesn't always work, but it can work a lot of the time, and it can be a really fruitful way to get projects moving again. There's plenty of work to be done even if you take as your source just a small subset of inherited disease, the malformation syndromes in humans: there are 2,500 clinical entities in this textbook, and this online database of congenital malformation syndromes has 4,500 entities. Very few of these have a known genetic etiology, and for many, very little is known about the natural history or how to diagnose these patients.
So it's a huge challenge, from both the clinical and the basic science perspectives, to push through these and find their etiologies, and I'm sure all of you have other classes of phenotypes representing hundreds if not thousands more projects that need to be undertaken along these lines. Thinking beyond what we do here in a research environment, the exome and the genome are going to become clinical diagnostic tools. So I hope that as you all start to work with these data, you think creatively about how to solve these problems and answer these questions, and put these tools out there, so that they can be transformed from esoteric research tools into instruments of the clinical arena and we can improve clinical diagnosis going forward. In thinking about how you do a project, you have to think about the entire picture: what disorder am I studying, what are its genetic attributes, what are the clinical and phenotypic attributes that will let me determine which samples to sequence, what filters should I apply to the data that come off the instrument, and what genetic data do I need to buttress the variants that come out of the next-gen sequencing? And then you couple that to functional assays downstream to nail down the cause-and-effect relationship, so that we can have robust associations of genotype with the diseases we're studying and not write a bunch of just-so stories. Thank you very much for your time. We have one more talk, and then we'll take some questions.