…relating to publication about an algorithm for binning. Okay, well, thanks to the organizers and especially to Mark for inviting me to speak here. This is really exciting for me and a great opportunity, and I'm really happy to be here. So the talk is really about implementing a system of binning that I helped to propose along with Jim Evans, who's my colleague at UNC. And I'm gonna go through some of the background, because I think it's important to set a definition of what we call binning and utility and actionability and those kinds of things, and really define things up front. So the bottom line is that the goal here is to implement next generation sequencing, whether it's whole genome, whole exome, or targeted exome sequencing, as a clinical diagnostic test. That's really what we in our group are after. And so the question is, how do you find the clinically relevant variants? We've heard a lot about pharmacogenomics. We've heard some about GWAS risk SNPs. From the informatics point of view, those are easy to find. They're all labeled, they have names, they say rs-something-something-something, so you can find them very easily. The question for those is what do you do with them and which ones do you report, and so forth. I'm actually not gonna talk about that at all, because from our point of view, I think the rare diseases and the mutations that actually cause Mendelian disorders are the things that we're interested in, and I'm gonna pick up a little bit where Les left off earlier today, from a rare-variants point of view. So we envision that for a clinical diagnostic test based on a next-gen sequencing platform, we're gonna do three sweeps through the data. The first one's gonna be your diagnostic results. So this person is being sequenced because of multiple congenital anomalies; there's gonna be an analysis of the genes that are involved in those conditions.
We're gonna report back something that says, this is the diagnostic result. Or you could replace that with breast cancer or aortic aneurysm; whatever the phenotype is, there's gonna be some diagnostic analysis that's focused on that, and the goal there is gonna be to maximize sensitivity. And frankly, if we find variants of uncertain significance, those are gonna be things that may be relevant for the clinician and the patient to know about, and so those would be the sorts of things that, even if we don't understand whether that's the disease-causing variant, we're gonna report back, in the same way that standard diagnostic tests report back that kind of result. On the other hand, and I think this is where this work kind of impacts what people are talking about today, there's also gonna be a whole lot of incidental findings. I know that some people don't like the term incidental; I think it's descriptive. I think it captures the idea that these are findings that are not what you were looking for, but they're there nevertheless, and you can look for them if you choose to, or you can choose not to look for them, but they are there. From a clinical point of view, we need to maximize specificity, and I'm gonna come back to this with incidental findings. And then of course the third sweep that all of us can do with this data is to simply go back through it in a research context and see if we can discover new gene associations and so on, and I'm not gonna really get into that today. So one thing that I've really seen a lot of today is that context really matters when we're talking about these variants and what to do with them. If you have a symptomatic patient who's being assessed for something and you're looking for a diagnosis, you have to report the full range of variants, even those that you don't necessarily understand.
If it's a missense mutation in a BRCA gene in a patient with breast cancer, that should be reported, and it should be reported as a variant of uncertain significance. On the other hand, if you have an asymptomatic patient, or you have an incidental finding in a patient that's not what you were looking for, the prior probability that that person actually has a genetic disorder approaches zero. In other words, if we were to do population screening with whole genome sequencing in this room, there should be close to none, maybe one or two people in here, with a strongly hereditary disorder, and so all of us are gonna have millions of genetic variants, and the likelihood that any of those variants is actually relevant is essentially zero. So we need to maximize specificity, so that when we are providing information to the patient, we have a clinically relevant posterior probability. In other words, we think this variant is actually meaningful and it's predictive for you. So for the incidental analysis, again, we're gonna be identifying clinically relevant findings unrelated to the patient's presentation. The vast majority of the things that we find will be absolutely meaningless, and we have to ignore them. So we're gonna maximize specificity. We're not gonna report the VUSs. We need to set a high bar, so that the things we do report are medically actionable or have medical utility, so that when we report them to physicians and patients, there's a clear translation to what should happen next. And that's sort of what Howard's getting at: what should we do next with that variant?
So this has been alluded to, the commentary that I wrote with Jim and Muin, and really the idea of the binning came from Jim, and I've been involved a lot in developing it. The idea is that we need to have some type of structured analysis that we put together ahead of time, so that we know exactly how we're analyzing the genome, we know what we did, and we do it consistently. And if it changes over time, that's fine: you version things, and you know how they're changing over time, so that if you analyzed someone two years ago on version one of the binning scheme and now you're on version three of the binning scheme, you know that and you can reanalyze them. So our strategy is to do automated annotation. We first categorize genes according to clinical utility and the risk for harm. So when we say binning, we're binning genes, not variants. We're saying the BRCA1 gene should belong in this bin, because if there were a deleterious mutation in that gene, it would be either clinically actionable or not clinically actionable, and that's the decision that we have to make. And then we also have an a priori definition of what types of variants in those genes should be reported. So again, this gets back to the issue of the variants of uncertain significance versus the variants that we know are deleterious, and we have to be very confident about that call. So we do sort the variants that we find into these predetermined bins, and then we would only review or report those that we think are likely deleterious. In other words, we're really setting a high bar for what actually is gonna get back to the patient. Okay, so bin one in our lingo would be the clinically actionable genes, or genes with utility, in that if you find a deleterious mutation in that gene, there's a well-defined action that can be taken. Lynch syndrome is a good example: there's well-defined screening, there's cost-effectiveness research.
It's very clear that if you find someone with Lynch syndrome, you can do them a lot of good. Long QT syndrome, thoracic aneurysm syndromes, and essentially any of the diseases that would be on the newborn screening panel, which we feel are things we can do something about if we find them incidentally. Bin two in our lingo are the genes that are clinically valid but not necessarily directly actionable. So in other words, a gene that is clearly associated with a Mendelian disorder, but where there may not be a specific intervention that will help that person prevent morbidity or mortality. And I agree that there is some gray zone in terms of what utility means for these types of genes, but when we're talking about it, we're talking about what would that patient be able to do with the information. We subcategorize bin two into three groups. Bin two A we consider very low risk for causing any harm to people, such as the GWAS risk SNPs and the pharmacogenomic variants and so on. If somebody found out that they were a carrier for a particular CYP2D6 allele, it's probably not gonna cause them to go do anything drastic, and it's probably not gonna harm them in pretty much any way. There's a sort of middle category, bin two B, for medium risk of harm. Most Mendelian disorders would fit into here. You know, people will deal with information differently and be affected by it differently. We have bin two C being the conditions where there's a high risk for harm. The example given here is Huntington's, of course being a triplet repeat disorder that's not likely to be picked up on whole exome sequencing, as Gail was helpfully pointing out earlier. My favorite example for this category now is fatal familial insomnia, a horrible, horrible condition that I think many of us would think twice before we decided to learn whether we were gonna get. And then bin three for us are all genes for which there's no clear link to a genetic disorder. So, no known clinical significance.
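To make the bin taxonomy just described concrete, here is a minimal sketch of an a priori, versioned gene-binning table. The gene assignments, version string, and function names are invented for illustration; the talk does not describe the actual data structure used.

```python
# Hypothetical sketch of a versioned, per-gene binning table (assumptions:
# gene names and bin labels chosen for illustration only).

BINNING_VERSION = "v3"  # bumped whenever bin assignments or reporting rules change

# Binning is done per gene, ahead of time, based on whether a deleterious
# mutation in the gene would be clinically actionable and how risky the
# information is for the patient.
GENE_BINS = {
    "MLH1": "1",     # Lynch syndrome: well-defined, beneficial screening
    "BRCA1": "1",    # hereditary breast/ovarian cancer: actionable
    "CYP2D6": "2A",  # pharmacogenomic: very low risk of harm
    "HTT": "2C",     # Huntington disease: high risk of psychosocial harm
}

def bin_for_gene(gene):
    """Return the predetermined bin for a gene; genes with no known
    clinical significance fall into bin 3."""
    return GENE_BINS.get(gene, "3")
```

Versioning the table this way is what lets a sample analyzed under one version be reanalyzed later under another, as described above.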
One of the things that I regret about the commentary is that we didn't think through the issue of carrier status enough. I think if we could redo that commentary, we would have a separate category for carrier status, reproductive status, which I think is a clearly separate issue, and for young families would certainly have lots of usefulness, whether you call that actionability or utility. So we would consider there to be a separate category for carrier status, which could include recessive conditions that are in bin one, bin two, or bin three, whatever the carrier status happens to be. So essentially what we did since the time of the commentary was to take OMIM and use some computer scripting to basically take all the genes from OMIM that are strongly associated with a Mendelian disease, and what the inheritance pattern is for those genes, and then try to categorize those, or bin them, into those categories. It ended up being 2,016 genes. And really the final bin decision is a bit of a judgment call. There are some genes that I think lots of people would pretty well get consensus around: this gene ought to be in bin one, this gene ought to be in bin two C. Most genes probably are in bin two B. So bin one ended up being 161 genes, bin two B ended up being 1,798, and bin two C was 57 genes. I think that kind of reflects the fact that most genetic disorders don't have a clearly definable treatment or something that we can do to benefit the person, and most genetic disorders are not terribly worrisome in terms of their potential effect on the person's psychosocial wellbeing. So, as my PhD advisor said, theories are good, but data are better. So we took 80 genomes that were sequenced by Complete Genomics. 19 of them were families we sequenced at our institute for hereditary cancer susceptibility; 61 of them were from the publicly available genomes that Complete Genomics has on their website. We used 1000 Genomes Project allele frequency data.
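The OMIM scripting step just described might look something like the following sketch. The record fields ("gene", "association", "inheritance") are invented stand-ins; real OMIM data requires considerably more parsing and curation than this.

```python
# Illustrative sketch (invented record format) of selecting genes strongly
# associated with a Mendelian disease and recording each one's inheritance
# pattern, as a first step toward the ~2,016-gene binned list.

def select_mendelian_genes(records):
    """Keep genes with a confirmed gene-disease association and a known
    inheritance pattern; return {gene: inheritance}."""
    return {
        rec["gene"]: rec["inheritance"]
        for rec in records
        if rec["association"] == "confirmed" and rec["inheritance"] is not None
    }

example_records = [
    {"gene": "MLH1", "association": "confirmed", "inheritance": "AD"},
    {"gene": "CFTR", "association": "confirmed", "inheritance": "AR"},
    {"gene": "GENE_X", "association": "provisional", "inheritance": None},
]
```

The final bin assignment for each selected gene remains a judgment call, as the talk emphasizes; a script can only produce the candidate list.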
We used the Human Gene Mutation Database. A lot of this is gonna look a little bit similar to what Les presented earlier. So of course our expectation is that the likelihood of any of these people actually having a disease-causing Mendelian disorder is pretty low. We're gonna expect to find very few bin one or bin two findings per person, although most people will be heterozygous carriers for some recessive conditions, so we should find some positive findings. So essentially what happens is, if you take all the variants in those binned genes, we have about 13,000 per person in bin one, 175,000 per person in bin two B, and 9,200 in bin two C. Again, this is genome, not exome, so we've got a lot of intronic variants in there. If we just reduce by allele frequency, if we just say anything that's less than 5% or anything that's less than 1%, what does that do to our numbers? Less than 5% reduces the numbers 10-fold; less than 1% reduces them about 15-fold. Still, we've got a lot of variants that we really don't want to be looking at. So we then looked at what if you just took the protein-coding variants in those same allele frequency categories. And again, this is including missense and truncating (frameshift, nonsense, splice site). You still have way too many variants per person, over 100 variants per person at the 5% allele frequency. So clearly most of these must just be very innocuous, benign types of things. Now, if we just looked at the rare truncating variants, so these would be the frameshift, nonsense, and splice site variants, we get down to a tractable number. These people now have less than 10 variants per person. These are numbers that a human could then look at and say, well, what does that truncating mutation actually appear to do? So the question that came up when we were talking about this with our molecular lab colleagues was, what about the missense variants? You're just gonna ignore all those? We know that there are some disease-causing missense variants out there.
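The successive filters walked through above can be sketched as simple predicates. This is a minimal illustration, assuming invented variant fields ("af" for 1000 Genomes allele frequency, "consequence" for the annotated effect), not the actual pipeline code.

```python
# Sketch of the three filtering stages described in the talk: allele
# frequency alone, then protein-coding consequence, then rare truncating
# variants only. Field names are assumptions for illustration.

TRUNCATING = {"frameshift", "nonsense", "splice_site"}
CODING = TRUNCATING | {"missense"}

def is_rare(variant, max_af=0.05):
    """Allele-frequency filter alone: roughly a 10-fold reduction at <5%."""
    return variant["af"] < max_af

def is_rare_coding(variant, max_af=0.05):
    """Rare and protein-coding: still over 100 variants per person."""
    return is_rare(variant, max_af) and variant["consequence"] in CODING

def is_rare_truncating(variant, max_af=0.05):
    """Rare and truncating: down to a tractable, human-reviewable number."""
    return is_rare(variant, max_af) and variant["consequence"] in TRUNCATING
```

Each stage tightens the previous one, which is why the per-person counts drop from tens of thousands to fewer than ten.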
So this is where we incorporated the Human Gene Mutation Database, because of course we're gonna sacrifice sensitivity if we exclude those missense variants entirely. By doing that query, we were able to identify 871 unique variants, 771 of them missense. So the vast majority of what we identified through the Human Gene Mutation Database were indeed missense variants, so we were able to feel like we were recapturing some possibly disease-causing mutations. And the average was about 74 per person, which is still way too many. You should not have 74 disease-causing mutations per person. And this is exactly the range of numbers that was reported in the 1000 Genomes paper in Nature recently, so we know that we're doing the analysis correctly; it's just that there must be something going on with those variants in the Human Gene Mutation Database, as has been alluded to previously. And what was interesting is that there was surprisingly little overlap with the rare missense mutations we had found. We kind of thought that maybe these 100-some-odd variants per person might be the really important ones of those missense variants that we had seen. But in fact, about 80% of the disease-causing mutations identified per person have an allele frequency greater than 5%. And so this graph is a little bit confusing, but what it amounts to is a histogram. The allele frequency ranges are on the bottom (I shouldn't say bins), on the y-axis are the numbers of variants, the bar is the average number of variants per person within that allele frequency range, and the error bars are the standard deviation. So essentially you have a large number of variants per person with greater than 5% allele frequency. Those are clearly errors of some kind that are in that database. And that surprised us enough that we wanted to go back and see what is going on in the Human Gene Mutation Database.
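The sanity check behind that 80% figure can be expressed in a few lines. This sketch uses made-up numbers and an invented "af" field; it just illustrates the calculation, not the actual analysis code.

```python
# Sketch of the check described above: what fraction of the HGMD "DM"
# (disease mutation) variants observed in a person are actually common
# (>5% allele frequency) in the 1000 Genomes data? A high fraction
# suggests misannotation in the database.

def common_dm_fraction(dm_variants, af_cutoff=0.05):
    """Fraction of DM-annotated variants whose allele frequency exceeds
    the cutoff."""
    common = sum(1 for v in dm_variants if v["af"] > af_cutoff)
    return common / len(dm_variants)
```

Per the talk, this fraction came out around 0.8 for the per-person DM calls, which is what prompted the closer look at the database itself.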
And in fact, most of the DM mutations, the disease-causing mutations in the Human Gene Mutation Database, are indeed rare. This huge spike of about 75,000 variants is at zero: they don't have 1000 Genomes allele frequencies, so they're very rare, and that's good. The problem is that there are still too many that are common. And if you just change your scale on the y-axis, you can see that there is a sizable number, over 100 variants, that are labeled DM and are at 5% to 10% allele frequency, and so forth. And so this tells us that there are probably many misannotations in that database. And unfortunately, because they are prevalent in the population, these are the ones you're gonna find when you do your whole genome sequencing. The good news is that we can exclude those, and we can eventually edit those out of the database, and so I think we may actually be able to use this information with a little bit of extra curation. So the final binning algorithm that we use is to take a variant that's within a binned gene and less than 5% allele frequency, and either DM in HGMD (that's their disease mutation category in the Human Gene Mutation Database) or protein truncating. And what we ended up with were numbers that I think are getting fairly close to what we would expect in the general population. So, 1.5 bin one variants per person. Again, that's a little bit too high, but a very tractable number for a human to actually sit down and look at. So in this set of 80 genomes, there were about 1,391 variants that we basically just manually curated, to either remove the ones that looked like they were polymorphisms or VUSs, kind of in the same way that Les described earlier.
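The final binning rule as stated above fits in a single predicate. As before, the field names ("gene", "af", "hgmd_class", "consequence") are illustrative assumptions, not the pipeline's real schema.

```python
# Sketch of the final rule: flag a variant for human review if it falls in
# a binned gene, has <5% allele frequency, and is either annotated "DM"
# in HGMD or is protein truncating.

TRUNCATING = {"frameshift", "nonsense", "splice_site"}

def passes_final_filter(variant, binned_genes, max_af=0.05):
    """True if the variant meets all three conditions of the binning rule."""
    return (
        variant["gene"] in binned_genes
        and variant["af"] < max_af
        and (variant.get("hgmd_class") == "DM"
             or variant["consequence"] in TRUNCATING)
    )
```

Everything that passes this filter still goes to manual curation, as the next part of the talk describes; the filter only makes the review set tractable.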
You do your Google Scholar search, your OMIM, you look in a locus-specific database, try to assess the evidence for that variant being in the gene mutation database, or what the evidence would be for a truncating mutation in that gene, and then reassign or remove those variants from consideration. So after reviewing those, I removed about 50% of the variants, either for being clearly polymorphic, or because there was conflicting evidence in the literature, or there was enough evidence to say that the variant probably was not an actual disease-causing variant, and I moved about 5% of the variants into carrier status. And again, this is sort of a Goldilocks approach: I was trying not to be too harsh, trying not to be too lenient. Some people would probably be much more harsh and remove more variants, and so forth. The bottom line is that when you remove those variants, each of the bins lost variants. So these are stacked bar graphs showing what proportion of the variants were either in the same bin, in white, or moved to a different bin, in light gray, or carrier status, in dark gray, and then removed, in black. And so each of the bins lost some variants, either ones that were removed or were switched to carrier status. What was interesting to me was that more of the Human Gene Mutation Database variants were removed compared to the novel truncating ones. So in the end, after removing that 50% of the variants, we ended up with about 0.3 bin one variants per person, 2.6 bin two B variants per person, et cetera, and about 5.5 carrier variants per person. And this amounts to about eight to nine variants that you had to actually then go and confirm in the CLIA lab and see if they're actually real. Again, I think a lot of these may be sequencing artifacts. We're gonna find that many of the frameshift mutations are probably sequencing artifacts, and we'll remove those at that point.
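The bookkeeping behind those stacked bar graphs can be sketched as a simple tally over per-variant review decisions. The outcome labels and data here are invented for illustration; they just mirror the four categories the talk describes (same bin, moved, carrier status, removed).

```python
# Illustrative tally of manual-review outcomes: each curated variant either
# stays in its bin, moves to a different bin, moves to carrier status, or
# is removed. Outcome labels are assumptions for illustration.

from collections import Counter

def tally_review(decisions):
    """decisions: list of outcome labels, one per reviewed variant.
    Returns the proportion of variants in each outcome."""
    counts = Counter(decisions)
    total = len(decisions)
    return {outcome: n / total for outcome, n in counts.items()}
```

Per the talk, roughly 50% of reviewed variants were removed and about 5% moved to carrier status; a tally like this is all that's needed to produce the proportions plotted.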
There certainly are false positive reports in the literature, and as has been alluded to before, we're gonna learn a whole lot about what the actual penetrance of these conditions is as we get more and more of this type of data. And then, just as a final slide, I wanted to compare this to the paper from Bell et al. earlier in the year, looking at a little over 400 genes for carrier testing using next-gen sequencing. We get a very similar histogram of the number of carrier variants per person, right around the five-to-six range, with a spread from zero to 12. So I'm over my time, but I just wanna summarize a couple of points. I think this actually says that we can implement an automated strategy for identifying these clinically relevant incidental findings. It may be a lot harder on the diagnostic end, because you're gonna have to be looking at a lot more variants. What's nice to me is that this is a versionable process. It's written in code; the code gets updated; you change the version; you know what's different from one version to the next. With the incorporation of databases and so forth, it's gonna be very important to keep track of what was changed from one version to the next. I expect that there's gonna be vigorous debate over the binning structure that we proposed: which genes should be in which category, should we even have this many categories, should there be more categories or fewer? I think that as we learn more about medical genetics, we're gonna see some changes. Things that are in bin three right now are gonna get moved up to bin two or bin one. Things that are in bin two will get moved to bin one as we learn about what we can actually do about these conditions. And I alluded earlier to the SNPs and the GWAS polymorphisms. Those are all, to me, in bin two A, because very few of them are essentially directly actionable. They're contextually actionable.
If you have a person on a medication, they become actionable, or if there were a risk assessment that could be done for diabetes that could lead to beneficial treatments and so forth, those could be moved into bin one. I think we'll also refine the rules for reporting the variants. What we did here was a very rough draft of what the binning, or the variant reporting criteria, should be. We'll get much more nuanced as we learn more about the types of mutations that cause disease. And of course, with a clinical grade database like the one we were sort of pie-in-the-sky thinking about earlier today, replacing the gene mutation database or incorporating those other databases into the system, I think it'll be very valuable. All right, there are some other issues that are being addressed in terms of how we should counsel people and how we should talk about it. I think putting the next-gen results in bins allows us to talk to patients about them in categories that make sense to patients, and allows us to analyze them in that same way and return them in that same way. I'll stop there and take any questions.