Okay, so what I want to talk about is some of the experience we've had using our ClinSeq cohort, which is a cohort of nearly 1,000 people that we have recruited to the NIH Intramural Research Program to begin to develop infrastructure and approaches to acquiring, storing, analyzing, and, most importantly, returning research results to participants based on whole-genome interrogation. And of course, as researchers, what many of us are primarily interested in are the primary variants, that is, variants that are new discoveries of causes of disease and pathology in human beings. And so we are trying to move from patients through genomes to discover new associations of disease and genetics. However, as is the topic of this meeting, we're not looking to filter things for discovery, but in fact to filter things in order to present clinically actionable variation. And here there's a lot to be done, and how the filtering is done, I think, is quite different from the discovery filtering. And when we take a first pass at some genomes, or exomes, as you can see here, this is 572 exomes from the ClinSeq cohort. If you look at variant positions, and you limit those, mind you, to things that are nonsense, frameshift, and non-synonymous variants, you have in 572 people 180,000 variant positions. And this does not count how many people have those; it is just positions, regardless of how many patients have them. So obviously this is a huge number to begin to approach, and what we wanted to do is to say, okay, we have this huge data set. How do we begin to wrestle with these data and figure out how to get from these huge numbers down to things that we can actually talk to patients about? So you have to narrow and focus on some things, and what we selected initially was cancer susceptibility genes, because we reasoned that high-penetrance cancer susceptibility variation is readily clinically actionable.
And so we curated databases and the literature to identify genes; we identified 37 genes that we thought were candidates for this approach, and that takes your 181,000 variant positions down to a much more manageable, but still prodigious, 450. So then what did we do with those? We again applied a bunch of filters to try and get that down to something more manageable. Our first set of filters was what we call quality filters. That is, we didn't feel that we should spend a lot of time struggling with variants if we didn't even have good data to suggest that they were real sequence findings. And so we have our own internal quality metrics, and I'm sure many other groups have similar metrics, that we apply to push variants out. And I'd say overall what we're trying to do here is take large numbers of candidate variants and find valid reasons to push them out of consideration so that we can focus our brain power on the ones we should be focusing on. We next used, and this is a tricky one, frequency filters. We reasoned that if you are looking at high-penetrance, rare variation as a cause of Mendelian or Mendelian-like traits in humans, which we think are the most tractable kinds of results to return in a personalized medicine approach, these should by definition be rare, although that assumption can trip you up. And so we use our own internal frequency statistics to try and filter things out, as well as external filtering, which we started with dbSNP. That's challenging for a number of reasons, so we are migrating toward 1000 Genomes. But how you use those sets and what the cutoffs are will vary substantially depending on what trait you're looking at. We picked for cancer susceptibility a filter of 1% for our own cohort, because this cohort was not ascertained for cancer, although, again, I'll show you an example later where that kind of an assumption can trip you up. And we used a slightly different filter for dbSNP.
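The gene-panel, quality, and frequency filters just described can be sketched roughly as follows. This is a minimal illustration, not the actual ClinSeq pipeline: the record fields, the stand-in gene names, the quality score, and the exact cutoffs are all assumptions for the sake of the example (the talk gives only the 1% cohort cutoff).

```python
from dataclasses import dataclass

# Illustrative variant record; the field names are assumptions,
# not the actual ClinSeq data schema.
@dataclass
class Variant:
    gene: str
    consequence: str      # e.g. "nonsense", "frameshift", "missense"
    quality: float        # internal sequencing quality score (hypothetical scale)
    cohort_freq: float    # allele frequency within the cohort itself
    external_freq: float  # frequency in an external set such as dbSNP / 1000 Genomes

# Stand-in for the 37 curated cancer susceptibility genes.
CANCER_GENES = {"BRCA1", "BRCA2", "TP53"}

def passes_filters(v: Variant,
                   min_quality: float = 30.0,
                   cohort_cutoff: float = 0.01,
                   external_cutoff: float = 0.01) -> bool:
    """Keep a variant only if it is on the curated gene panel, passes the
    quality filter, and is rare both internally and externally."""
    return (v.gene in CANCER_GENES
            and v.quality >= min_quality
            and v.cohort_freq <= cohort_cutoff
            and v.external_freq <= external_cutoff)

candidates = [
    Variant("BRCA2", "nonsense", 50.0, 0.001, 0.0),   # rare, on panel: kept
    Variant("BRCA2", "missense", 50.0, 0.05, 0.04),   # too common: filtered out
    Variant("OTHER", "nonsense", 50.0, 0.0, 0.0),     # off panel: filtered out
]
kept = [v for v in candidates if passes_filters(v)]
```

The point of a sketch like this is the order of the reasoning: each filter is a cheap, defensible reason to push a variant out of consideration before any human curation effort is spent.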
So that helped us a little bit, taking the variation down from 450 or so to 330, and then we're left with other things that we have to consider for these variants. And of course we, like everyone else, consider lots of different kinds of data and information when we're trying to exclude or include variation: not only frequency in cases and frequency in controls, but things like functional data, presence with other mutations, segregation, bioinformatic predictive analyses, et cetera. So these are very heterogeneous, multi-dimensional data that vary in how they're weighted and how we think about them for every different phenotype, and often for many different variants. And so where do we go to get that? Well, one can be tempted to go to places where such data might already be aggregated and consolidated, such as the Human Gene Mutation Database (HGMD). So we have incorporated into our own genome viewer data acquired from a complete download of the HGMD dataset, which can be useful but has limitations. They have categorized variants based on what they think are causative or suspicious or polymorphisms, et cetera. And those categories can be useful for some initial considerations, although we have found that they can be pretty limiting, and there are a significant number of incorrect causal attributions in those datasets. So we cannot just use those as they are. The one thing that is very useful is that databases like HGMD do allow a quick linking out to the underlying primary data, which you can then access more quickly, so even that functionality is valuable. And we have found that reviewing the underlying primary data is often necessary to interpret variants. We have found more utility in locus-specific databases, and as you know, there are many of these for many genes, and there are sometimes many databases for the same gene, which poses another set of challenges.
But you can go into a locus-specific database and look at data, and we have actually written software that allows us to go through and scrape these databases, extract data from them, download it, and try to make generalizations from those data, again to try to make decisions about filtering. Some of the challenges there: you can see on the left here this little first column, which has these little characters, which are pathogenicity assessments. Those are either those of the primary literature that is cited or those of the database curator, and that second one is itself enormously variable. If you look carefully at some of these databases, what you will find is that the curator puts a question mark for every single variant, which is not terribly useful. Other curators actually make decisions and stratify variants; that is useful. And looking across those databases can be tricky. The counting can be very dicey here also, in that different databases have different standards for how things are counted. Here's an example of a variant that is reported three times in a locus-specific database; however, those are three members of the same family counted separately. So again, you have to be careful to filter these things to see what you're looking at. So we then put these through a group process of pulling down all of these data and looking at them manually, again as a model for what we want to do going forward to build semi-automated approaches. We then try to categorize variants, at least for the cancer study, based on a pathogenicity scale that was proposed by the International Agency for Research on Cancer (IARC).
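The family-counting pitfall described above, three database reports that are really one kindred, can be guarded against with a simple de-duplication step when aggregating scraped locus-specific-database records. The record keys here (`variant`, `family_id`) are hypothetical; real databases rarely expose kindred identifiers this cleanly, which is exactly why the counting is dicey.

```python
def independent_observations(reports):
    """Collapse locus-specific-database reports into counts of independent
    observations per variant, counting each family (kindred) only once.
    `reports` is a list of dicts with hypothetical keys "variant" and
    "family_id"."""
    seen = set()
    counts = {}
    for r in reports:
        key = (r["variant"], r["family_id"])
        if key in seen:          # same variant, same family: not independent
            continue
        seen.add(key)
        counts[r["variant"]] = counts.get(r["variant"], 0) + 1
    return counts

# Three reports of one variant, but two of them are from the same family.
reports = [
    {"variant": "c.100C>T", "family_id": "F1"},
    {"variant": "c.100C>T", "family_id": "F1"},
    {"variant": "c.100C>T", "family_id": "F2"},
]
```

Run on the example above, the naive count is three reports but only two independent observations, which is the distinction that matters for any frequency-based reasoning.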
And we use this in a somewhat informal, semi-quantitative way to basically recognize the fact that there is a large bell-shaped distribution of causal probabilities associated with variation, where most variants are going to be in that broad middle segment of uncertain pathogenicity, with a smaller number of variants being certainly benign, and a small number of variants being highly likely to be pathogenic. And when we take these variants and classify them according to this scoring system, we end up with a pretty substantial number of benign polymorphisms, a large number of variants of unknown significance, which would be classes 2, 3, and 4, and, importantly, a significant yield of things that are clinically important and potentially very useful for patients. So what we've learned by doing this, by putting our own feet to the fire, generating the data, and forcing ourselves to annotate it, is that, yes, we need to think about every variant, but we need to think about some much more than others. And there's a lot of filtering that can be done to take variants and push them aside as not worthy of large amounts of brain power, or at least not urgently or soon. So we think we can filter out variants that are highly likely to be benign and set thresholds for downstream evaluation that reflect our judgments about the disease biology; the medical reality, which matters to us because of what you can actually do about a finding; the genetics; and, of course, the patient attributes, what we're setting out to do. We then try to focus our energies on the variants that do need to be thought about, and we need to characterize each one of those very carefully. We actually think that there isn't a substitute for a knowledgeable geneticist looking at these variants and thinking about the primary data.
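The five-class IARC scale mentioned above maps a probability of pathogenicity onto discrete classes. The thresholds below are the ones published with the IARC proposal (Plon et al., 2008); since the speaker stresses that their own use of the scale is informal rather than quantitative, treat this as a sketch of the published scheme, not of the ClinSeq workflow.

```python
def iarc_class(p: float) -> int:
    """Map a probability of pathogenicity onto the five-class IARC scale.
    Thresholds follow the published proposal; real-world assignment is
    rarely this cleanly quantitative."""
    if p > 0.99:
        return 5   # definitely pathogenic
    if p >= 0.95:
        return 4   # likely pathogenic
    if p >= 0.05:
        return 3   # uncertain significance
    if p >= 0.001:
        return 2   # likely benign / not pathogenic
    return 1       # benign / not pathogenic
```

As in the talk, classes 2 through 4 are the broad middle of the bell curve that ends up reported as variants of unknown significance, while only class 5 (and cautiously class 4) supports a clear clinical report.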
We clearly need a lot of help in capturing and storing these judgments, because we don't want to be doing this more than once for the same variant, but we do want to be able to reassess things as the underlying data change over time. And we're committed to re-annotating every variant as the underlying data change, so we want to have the ability to do that in a semi-automated way. What is at stake here is that there is important stuff to be found. So in this set, I talked to you about how we filtered for cancer variants, and we find lots of variants that are of clinical significance. We found seven individuals who had variants at high-penetrance cancer susceptibility loci, as well as patients with familial hypercholesterolemia; a surprising number of patients with malignant hyperthermia variants, which is a startlingly high percentage compared to what is predicted; patients with hereditary neuropathy with liability to pressure palsies at a frequency of about 100 to 200 times the predicted population rate, so again, you have to be careful with what you assume your frequencies are; and patients with variants in potential cardiac dysrhythmia genes. So there's plenty to be found. We're struggling hard with these data, and tools to make this more efficient and effective would be fantastic. So I'll stop there and take any questions. Thank you.

Great. Thank you very much, Les. Time for a couple of questions. Okay. Could you introduce yourself, please?

Yeah. Elaine Lyon, University of Utah, ARUP Laboratories. You had one slide that talked about the probabilities, above 99% and so on. Could you put that back?

You bet. That's, yes, the IARC scale, yes. So the probability of being pathogenic.

My question there is, how do you come up with that probability?

Yes. And that's why I said we are using an informal iteration of this. This is not a quantitative assessment.
And basically what we're saying here is that a five is a variant where I can write a clear report that will say to the patient, our judgment is that this variant causes this disease. A four is where we would say we believe that this variant may cause this disease. And then everything below that we would call variants of unknown significance or completely benign. We use that in a way that maps onto the causal, strongly suspicious, and VUS categories that many of us use clinically, and we think that's what this group tried to do. They did it in a more quantitative way, using functional data, which has its own problems. But we just thought this is how we should think about this bell-shaped distribution of the causality of variants. But you're right, it is not truly quantitative how we're doing this.

Howard Lee, Johns Hopkins. Probably not so much a question as to amplify a point that you didn't go all the way to state outright, which is that, on, I think it was the final slide with the actual findings in your preliminary analysis, you said you found way more malignant hyperthermia and HNPP than expected. And I think that highlights the point that there will be more and more cases like hemochromatosis, where there's a genetic variant but penetrance may be much lower than ever thought. And that's going to require a lot of care before we bring these things into the clinical arena.

And the converse of that as well is that we may be identifying false positive variants, right? And that's what we have to really worry about, because it can be difficult to distinguish with these low-penetrance ones. And the more you look at the malignant hyperthermia literature, actually, I think that the penetrance estimates may be off quite a bit. But it's not certain, and that knife cuts both ways.
Well, there are also population structure issues relating to that, because if that cohort were from Wisconsin, where there's a founder for malignant hyperthermia, the population frequency of that particular condition in that particular population is going to be much higher than what you would expect.

And that's something that we also don't understand at a very fine level right now, these population variations that we will surely encounter as we roll this into the clinic. And that's what will be so wonderful about studies like this: we will finally have a more agnostic view, instead of these cohorts that were all ascertained for somewhat peculiar reasons, and we'll get these larger population incidences and begin to ascertain that in a much more unbiased way. Steve?

Steve Sherry, NCBI. Les, did you have much experience working with the bioinformatic tools and their conclusions, or their assertions about pathogenicity, such as SIFT or PolyPhen? And if that were pre-computed for the whole set, say, in dbSNP or other repositories, would that induce more false positives in your curations?

Yeah, so we have not so far found predictive algorithms for missense variants to be very useful. It's a tough thing, and as you know, the false positive and false negative rates are high. So if you're starting with a subject for whom you don't have a high prior probability that they have the phenotype, those probabilistic assessments, we're concerned, are going to get us into trouble. So we are starting by skimming for findings where we are highly likely to be correct in our pathogenicity assessment, and we're going to work down from those, but our initial feeling is that these tools are going to be tough to use.
So is it a corollary that the primary databases could maybe provide more value by working on frequency and the population context of variants rather than the pathogenicity interpretation? Because you used that as a filter.

We did; we feel, at least at this stage, more comfortable with that.

Okay, thank you very much, Les. Thanks, Matt. We're all set, and so next week-