So the first thing that we did when we came together was really to reject the charge. What we decided to do first was point out that we don't entirely agree with the basic premise that there really are such things as known variants and then other variants. We came to the view that all variants sit on a continuum of evidence connecting those variants to phenotypes. Taking that perspective, we decided to focus mostly on the kinds of questions researchers ask when they're asking themselves whether a set of variants is actually influencing the phenotype they're looking at. So what I'm going to do is walk through those considerations: what are some things to watch out for, and what resources are available for asking the kinds of questions that come up when you're actually doing a genetic study. Obviously one of the most important pieces of information is what already exists in terms of the phenotypes that have been associated with a particular variant. I'm going to make the points verbally; some of the resources relevant to each investigation are listed at the bottom, and I won't walk through those. You have to be careful that the evaluation is actually tied to a specific phenotype that is connected to the phenotype you're looking at. That's often a much harder thing to do than may be appreciated by people who aren't used to looking at clinical data; that's certainly been my impression in trying to do this kind of thing. It's also important to consider exactly how the phenotyping was carried out, and whether there is a consistent standard for how phenotypes are described.
One also wants access to the number and the frequency of observations of a variant, or a class of variants, in people with the phenotype. Number and frequency are listed separately to get at two ideas: replication of the findings, and the contribution to the total mutational load. You would contrast something seen again and again in patients with a very clearly defined phenotype, as with the Delta F508 mutation, against a variant that is only seen once. It's essential that efforts are made to ensure there aren't other obvious explanations for the phenotype being studied. So, in focusing on a particular variant, one would want to determine that the individuals don't have other mutations in that gene, and it's also important that they don't have known or suspected variants in other genes. That is an important evaluation to make, and it is surprising how often it's not done: you see reports that a mutation in a new gene for a condition may be causal, when there are genes already known for that condition that have not been well interrogated. It's striking how often that's missed. It's also worth noting that we have to be really careful about how one talks about negative evidence. As I've referred to already, the argument that "this must be my mutation because I don't see any other mutations in the exome data" is a very, very weak one. The number and frequency of the variant in people without the phenotype is of course a critical step, and it's already been said how important the EVS database, and other growing databases, are for this purpose. This is essential, but there are a number of things to be careful of and watch for.
One, of course, is the ethnic matching between the cases you're looking at and the controls. It is striking, and I think Mark made this point and others have, the extent to which lessons we've already learned very well are being ignored when we're looking at sequence data. One of them is precisely this: suddenly it's as if rare variants that have a function can't be more common in one ethnic or population group than another. Well, of course they can, and so you need to actually ask whether the variant is absent in the appropriate ethnic group. The other thing to watch for is that the early filtering approaches, which basically tossed out any variant seen in any of the databases, clearly can't be used for very much longer. We are going to start seeing, and in dbSNP already are, many pathogenic variants sitting in these control databases at some frequency, so simply chucking them out is certainly not the long-term solution. One also needs to think hard about how the numbers of cases and controls compare, getting at the relative risk from how the frequency of variants compares in cases and controls. There are again some resources available there, and I've made some comments about things one needs to watch out for in making those comparisons as the control databases grow. The other comment I'd make in that regard is that it is going to be essential, going forward, that we are able to get at phenotypic information for the individuals we're using as controls. A central priority for the field is to think about establishing large genome databases of individuals who are either phenotyped or, preferably, available for phenotyping.
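The case-versus-control frequency comparison described here can be sketched in a few lines. This is a minimal illustration, not any group's actual pipeline: the carrier counts and the pseudocount smoothing (which keeps the ratio finite when an ancestry-matched control set has zero carriers, the usual situation for rare variants) are my own assumptions.

```python
def carrier_frequency(carriers, total):
    """Fraction of individuals carrying at least one copy of the variant."""
    return carriers / total

def relative_risk(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Crude risk ratio of carrying the variant in cases vs. matched controls.
    A small pseudocount keeps the ratio finite when controls have zero
    carriers, which is common for rare variants."""
    case_freq = (case_carriers + 0.5) / (case_total + 1)
    ctrl_freq = (ctrl_carriers + 0.5) / (ctrl_total + 1)
    return case_freq / ctrl_freq

# Hypothetical counts: seen in 12 of 100 cases, 3 of 6000 matched controls.
rr = relative_risk(12, 100, 3, 6000)
```

A point the talk stresses: this comparison is only meaningful if cases and controls are drawn from the same population, since a rare functional variant can legitimately be common in one group and absent in another.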
One is always going to think about the existing genetic data supporting a variant or set of variants, and these come in various types. Segregation data, accompanied by appropriate statistical evaluation, are very comforting because that framework is well established: we know how to interpret segregation data, even if that's not always being done currently. You also need to make sure you actually know what allele you're talking about, and that can often be a challenge, most particularly when you're using the literature, but in general it can be a challenge. It's important to ask what types of mutations are most likely to be important from family history data and so on, including both de novo mutations and somatic mutations, and of course there is in fact a gray zone between those, and also to consider prior evidence on the penetrance of the variant. The mutational spectrum of a gene can often be relevant. I would make one warning comment there, which is that this is a term used with an incredible degree of casualness in some papers: evidence of independent signals of association with risk for different mutations in a gene gets turned into claims of a mutational spectrum in the gene. I would say that when one talks about a mutational spectrum, what you want to be talking about is variants that you have convinced yourself, for one reason or another, are likely to actually be causal. It's also often important to look closely at where the mutations occur within the gene; that can have quite a lot of relevance to interpretation.
For many, many Mendelian diseases the mutations are very, very clustered, and in fact for different diseases caused by mutation in the same gene you get different clustering patterns, so a very close look at that can often be very important; some resources for that are again at the bottom. Functional data: I'm going to say more about this at the end, but one unfortunate way in which functional data have been used is that they are sometimes mixed in with the genetic data in making an overall case, and I think one has to be quite careful about how one does that. Again, I'll say more about that in a minute. But thinking about functional data narrowly defined, you want to assess whether properly controlled experiments are actually possible, and whether you can do a functional assay of some kind, in vitro or in vivo, that you can clearly connect to the phenotype you are studying. This is a place where functional evaluation of variants really often does fall down. In some cases you can make a perfectly solid, convincing argument that you are looking at something relevant to the phenotype. That's clear, for example, in cases of enzymatic deficiency, where you can carry out an assay that quantitates the degree to which a mutation reduces the activity of an enzyme. There you really can make a pretty clear connection to the end phenotype, because the patient's phenotype is the deficient activity of the enzyme.
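One way to ask whether mutations in a gene are "very, very clustered" rather than scattered is a simple permutation test: place the same number of mutations uniformly along the gene many times and ask how often they cluster as tightly as the observed ones. This is a sketch under my own assumptions (a fixed window size, uniform placement as the null); real analyses would condition on mutability and gene structure.

```python
import random

def clustering_score(positions, window):
    """Largest number of mutations falling within any window of the
    given size (in coding-sequence coordinates)."""
    pos = sorted(positions)
    best = 0
    for i, p in enumerate(pos):
        j = i
        while j < len(pos) and pos[j] - p <= window:
            j += 1
        best = max(best, j - i)
    return best

def clustering_p_value(positions, gene_length, window, n_perm=2000, seed=0):
    """Permutation p-value: how often do uniformly placed mutations
    cluster at least as tightly as the observed ones?"""
    rng = random.Random(seed)
    observed = clustering_score(positions, window)
    hits = 0
    for _ in range(n_perm):
        sim = [rng.randrange(gene_length) for _ in positions]
        if clustering_score(sim, window) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical gene of 3000 coding bases with four of five mutations
# packed into a 12-base stretch:
p = clustering_p_value([100, 105, 110, 112, 2900], 3000, 50)
```

Different clustering patterns for different diseases caused by the same gene would show up as different high-scoring windows under this kind of test.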
I would like to contrast that situation with what has been done again and again in human association studies, where you take variants associated with something like depression and then do a functional assay showing, for example, that the variant has an effect on gene expression, and maybe add something like: okay, you're looking at a gene involved in dopamine or serotonin metabolism, and therefore we are connected to depression or schizophrenia. Those two applications of functional work are absolutely worlds apart. In the case of the enzyme, we are connected to the phenotype. In the case of a variant affecting the expression of some gene when you're studying depression or schizophrenia, it's appropriate to say we have no connection whatsoever to the phenotype; all you're showing is that the variant has some kind of function, which is a very different situation. Even where you're clearly connected to the phenotype, as with enzymatic deficiency, you still have challenges. For example, there are very well documented cases of so-called pseudo-deficiency, where you've got a low-activity allele for an enzyme but it doesn't appear to produce the condition. That may be because it's being modified by the genetic background, or because your assay of enzymatic activity is inaccurate in some way. One of the things generally considered in interpreting genetic evidence is the prediction of whether or not a variant is likely to be damaging in some sense.
Our overall view is that these kinds of predictions are an extremely poor guide to pathogenicity. One thing that maybe should be clarified is that they can actually be a better guide to whether a variant is deleterious in a population-genetic sense, but we really need to distinguish that from the question of whether a variant is influencing the particular pathology you're studying. Those are absolutely not the same thing: all of our genomes are stuffed full of deleterious variants, but it doesn't mean that those are all connected to condition X. So in that general role, these prediction programs are judged not very good at telling you whether a variant is pathogenic. That said, they are an entirely appropriate filter for deciding where to concentrate your energy. One way to view it: if you had two graduate students who were going to follow up a set of candidate variants in further samples or with functional work, and you divided the candidate variants into PolyPhen-benign and PolyPhen-damaging, you'd probably prefer to be the graduate student who got the damaging variants to work on. But that is not an argument that they're pathogenic; it's just an ordering for how you do the follow-up work to try to get other evidence that these variants are actually doing something. There are a lot of things you can consider in trying to assess pathophysiological plausibility, and these slides just outline some of the things you might weigh in making such arguments. I won't read through them, but it's probably worth saying that there isn't an overarching statistical structure one could use to build these kinds of considerations into a formal statistical assessment, at least not right now.
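The "two graduate students" point reduces to using predictions for triage, never as evidence. A sketch, with hypothetical variants and a made-up two-level prediction label, to make the distinction concrete: predicted-benign variants are not discarded, they are simply followed up later.

```python
# Hypothetical candidate variants with an impact prediction attached.
candidates = [
    {"variant": "chr1:g.12345A>T", "prediction": "damaging"},
    {"variant": "chr2:g.67890C>G", "prediction": "benign"},
    {"variant": "chr3:g.24680G>A", "prediction": "damaging"},
]

def triage(variants):
    """Order follow-up work: predicted-damaging variants first.
    This is prioritization only; the prediction is NOT evidence
    of pathogenicity, and nothing is thrown away."""
    order = {"damaging": 0, "benign": 1}
    return sorted(variants, key=lambda v: order.get(v["prediction"], 2))

queue = triage(candidates)
```

The design point is that `triage` returns a reordering of the same list; a filter that dropped the "benign" entries would be making exactly the mistake the talk warns against.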
We also had a lot of discussion about the importance of provenance, so that you could actually work out, for the different evaluations and assessments that have been made, who made them and on what basis. And we had some discussion about the importance of keeping track, in some usable way, of the evaluations that have been made for specific sets of variants, because one thing that's striking is that people are making careful evaluations of variants or classes of variants again and again in different centers, and it's very difficult to share that experience and information. Okay, so those are the things that we discussed as we went through, and I now want to zero in, at least from my own perspective, on some of the key points that emerge. Maybe the single most important one is the importance of integrating the data emerging from all the work going on to connect genetic variation to phenotypic variation. You could almost refer back to David Botstein's quip about arrays and ask: if we didn't have anything currently available for connecting variants to phenotypes, given the rate at which we're accruing data, how long would it take us to build up a good replacement for everything that's out there right now? If that rate were fast, and this is the question Russ was just asking too, then one really might think about establishing a model where there's a finite window during which we use all the historical data, after which it's replaced by some new structure: some kind of system for centralizing genotype-phenotype correlation data. I want to give one of my own recent experiences with the potential power of this.
We recently sequenced a group of children with undiagnosed genetic conditions, and we went through a whole series of evaluations, just as was described earlier, to decide whether or not we believed we had a likely genetic diagnosis for each and every one of those individuals. One of the examples was a case where a child had been tested for glycosylation disorder genes and had a presentation consistent with a glycosylation disorder, except for one unusual clinical feature: the child didn't make tears. When we did the sequencing, we found that the child was a compound heterozygote for knockout mutations in N-glycanase, which in fact had never been described as a glycosylation disorder gene but obviously is in the same pathway, in a pretty straightforward sense, as genes responsible for glycosylation disorders. As an aside, we actually went through a very long process, going all the way up to the dean of the medical school at Duke, to decide whether to communicate those results to the family. Everybody involved came to the view that we were not absolutely certain we had a genetic diagnosis, but we felt it was likely, and we felt it was appropriate to share that information with the family with the very carefully explained caveat that we were not certain, but that the research team and the clinical team felt it was a likely genetic diagnosis. Another interesting point is that part of this calculation, for us anyway, was the very strong feeling that the family could fully understand that uncertainty and would take it on board. In any event, we eventually communicated it to them. This family is very, very engaged and is writing a blog where people can read about this.
The relevant point here is that the family themselves have been looking for other similar cases, and they are in fact very, very good at it. They have been involved in identifying several similar cases, one of which is being genetically analyzed at Duke as we speak, and two others, written about on this blog, with exactly the same story: patients going in for glycosylation disorder gene testing but not making tears. The family describes the experience, and those individuals have now been tested, and sure enough they have mutations in the N-glycanase gene. So what is this telling you? It's telling you that a pattern you could prove once you combine these cases would emerge and jump right out if we had centralization. In this case it happened because the family did it, but if we want to make this systematic, what we need to do is make sure that the sequencing we did for this individual at Duke, showing the mutations of interest, exists in a database that also itemizes the distinctive clinical features, so that people can look through and say: here is a patient who came in for glycosylation disorders and doesn't make tears, and you could look for genetic similarities among all of those individuals. So what we clearly need, as a real priority for the field, is some kind of centralized database that would permit that. It really needs to be inter-institutional, because clearly there is tremendous benefit in numbers. It needs to house the genetic data, hopefully in a complete way, but at least all the variants that you can't discard as candidates for a given phenotype. It needs to do a careful and thoughtful job of phenotyping. And clearly this kind of centralized database is going to need control data, control data where you can access phenotypes.
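The query such a centralized database would support, find patients who share a distinctive clinical feature and a hit in the same gene, can be sketched as follows. The records, gene names, and features are hypothetical stand-ins, not the actual Duke data.

```python
# Hypothetical patient records: candidate genes plus distinctive
# clinical features, as the talk proposes the database should itemize.
patients = [
    {"id": "P1", "genes_hit": {"GENE_A", "GENE_B"}, "features": {"no tears"}},
    {"id": "P2", "genes_hit": {"GENE_A"}, "features": {"no tears", "seizures"}},
    {"id": "P3", "genes_hit": {"GENE_C"}, "features": {"seizures"}},
]

def match_by_gene_and_feature(patients, feature):
    """Group patients sharing a distinctive clinical feature by the
    genes in which each carries candidate mutations; genes hit in
    more than one matching patient jump out."""
    by_gene = {}
    for p in patients:
        if feature in p["features"]:
            for gene in p["genes_hit"]:
                by_gene.setdefault(gene, []).append(p["id"])
    return {g: ids for g, ids in by_gene.items() if len(ids) > 1}

shared = match_by_gene_and_feature(patients, "no tears")
```

In the N-glycanase story the family performed this join by hand through a blog; the argument is that an inter-institutional database would make the same pattern fall out automatically.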
As I alluded to earlier, one important question is how one would transition between the use of existing data and the use of such a new resource. A point to emphasize is that a further benefit of this centralization is that it would help make the statistical argumentation plain, because you could say: this is actually statistically convincing, considering all the kinds of analyses I might be able to make using these genetic data and these phenotypic data. That's rather difficult to do when you're plucking out the part of the prior evidence that's most conducive to telling your story. Another key question is what steps we might take to reduce the burden of obviously unfounded claims in the literature. That formulation is my own, so don't get mad at any of my colleagues for it. But one thing I really do object to is that there are a lot of papers out there where even a very casual reading makes perfectly clear that there was no reason for the claims to be made in the first place, and I think setting up a mechanism to have less of that in the literature would be really, really valuable. The final key point to consider is how to think about threading together different kinds of evidence. This particularly bothers me, in that a lot of papers try to integrate a little bit of weak evidence on the genetic side, weak evidence on the functional side, and a whole bunch of other weak evidence into a final strong story. I think it isn't the solution to just say that's disallowed, because of course we actually do have to use what we know about the biology in interpretation. But we need to think about some better way of handling it.
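One standard way a threaded argument can be made to "pass statistical muster" as a package is Fisher's method for combining independent p-values; the talk does not prescribe this method, so take it as one illustrative option. The catch, which is exactly the trap discussed above, is that each line of evidence must itself be a valid, independent test; combining several cherry-picked weak signals this way overstates the case.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom under the null, assuming
    the k tests are valid and independent. The survival function has
    a closed form for even degrees of freedom, used here."""
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

# Three individually weak, hypothetical lines of evidence:
combined = fisher_combined_p([0.04, 0.03, 0.05])
```

Note what the independence assumption rules out: reusing the same patients for the genetic signal and the functional follow-up, or selecting which evidence to combine after seeing the data.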
One of the things discussed earlier is that you could require people to state, at each step in the argument, whether that part of it is or isn't sufficient to make the case by itself. And if you don't have any particular step that's sufficient to make the case by itself, then you are obliged to explain how the entire story passes statistical muster. That's actually very difficult. It's not impossible, but it's very difficult. But if you are going to use that threaded argument, then you have to take on the job of explaining how, as a package, you pass statistical muster. So that's it. Oh, yes, there is one more point, just to throw out one thing to think about. Most of what we're talking about is really focused on trying to find the main effects of variants. This is the point that Russ made, and Mark is entirely right: if you're thinking about main effects of variants, the genome is clearly finite, it's appropriate to think about it as finite, and you can think about establishing clear statistical standards. When we're talking about interactions, and the fact that variants are clearly modified by the genetic background, you might as well consider the genome infinite. That is a problem that is really not defined at all, because we don't know how many variants might interact, and there are clearly not enough people on the planet to assess interactions among variants in a general sense. The only comment I'd make there is that, from my perspective, the only way we're really going to handle that is by taking on the biology, by asking biological questions about how the effects of mutations are modified. And that is it. Okay, lots of interesting points. David, take it. So I wanted to dive in on the issue of enzymatic proof.
I've spent a lot of my life dealing with patients with mutations in discrete enzymes, where we know where the enzyme is, and the literature is scattered. I have to go back to the excellent adrenoleukodystrophy example: the patient that was sequenced has this variant and, like many men in our lab, had no enzyme activity, so the patient had the correct phenotype. The paper that I think everybody read, which was sent out beforehand, was a nice gene-discovery paper where they found a variant in a gene. They proved that the gene didn't function, and the assumption they made was that the variant they found was the reason why the gene didn't function. I think we have to be very, very cautious about using the cells of an individual in whom we find a variant to prove that the variant they have is the cause, because the real cause may be in the non-coding region. It may be a splice acceptor or splice donor deep in the introns. So I would argue that the only way we can really prove that something disrupts an enzyme is probably by using a construct where just that variant is the only thing introduced. I totally agree. The point I was making was a little more along the lines of whether we can get from a functional assay to the phenotype you're studying in the patient. But no, I completely agree: if you're going to make a claim about causation for a variant, then you most certainly should show that that exact variant changes the activity of the enzyme. I completely agree with that. And, just responding to David, that's why we said a controlled experiment, right? The first thing you described was not a controlled experiment; what we were talking about was a controlled experiment, which does get to that exact point. So I just wanted to raise a thing.
You talked about SIFT scores and PolyPhen scores, and also how it was maybe useful to see a number of mutations clustered together within a domain. I was just curious: do you think that protein structure is at all useful for really assessing mutations? A lot of work goes into developing structures and using those structures to integrate information. Do you think that's useful, or has it been? So, the answer is yes, I think it probably is. It might be a little more useful for interpreting what the likely function is of mutations that you have securely implicated with the genetics, as opposed to telling you that you've got the causal variants, just because I think it's hard to construct a story out of that alone. But if you look at identified Mendelian mutations, they often have very clear relationships with the structures of proteins, and that shows you that the relationship is there. Yes, Shamil? So I just wanted to respond to Mark's point, because we spent years trying to develop methods based on protein structures, and our conclusion is that evolutionary information is more powerful. Structures do help, and there is good news: apparently, as John Moult published, if you look at a disease mutation database, you'll find that the majority of mutations impact stability, meaning that individual functional features are not that important, and potentially you can build unified methods for predicting changes in stability. But as soon as you open the literature on predicting delta delta G, the change in folding stability, the accuracy of the methods is not good enough. And if you'd like to develop a computational method that people can actually use, you cannot run weeks of molecular dynamics simulations. I tried a whole bunch of potentials for predicting delta delta G.
I believe the information is in the structure; I believe structures are useful, but we have no idea how to use them. We published not a single paper on it; it didn't work in our hands. We could maybe move this discussion to the next session, which I think will focus on a lot of these issues. So, I was actually interested in discussing the database question, because we really have three possibly separate databases being discussed. One is a database of known mutations with all of the associated evidence required to assess causality. The second is a database of reference individuals for whom we have complete genome data, preferably knowing that these individuals are without disease, or at least having some phenotype data; Gonzalo mentioned the ability to potentially recontact them for phenotyping. And the third is this database that's been raised a number of times, of both full sequence data and detailed phenotype data, including control phenotype data, that would allow you to look across other patients with similar sorts of phenotypes. I just wanted to get a handle from people in the room: how would these databases work together? Is the control database effectively the same as the disease database? Are these actually separate domains, or do we see this as one enterprise? I can take a shot at that. On the level of the variant database, I think that is a separate one, although connections between them would be useful. On the phenotype database, I think it would be useful to combine controls and patients together, in the sense that really every control is a patient and every patient is a control, depending on what you're looking at. You might bring a patient into the system for one reason, but then they become a very good control for another reason.
And so I would love to see an environment where we have access to controls, and some of the ClinSeq work that Les has done, following up patients who came in to see whether they actually had phenotypes you never knew about in the beginning, would be useful there. I also think that in a database of cases, having an interface to patients directly matters. David brought up this great example: patients are incredible advocates in finding themselves. One of the things we have found extremely challenging in finding other cases is having to go through a physician. We want to reach out to all these cases that we've tested that were negative, and recruit them into research studies, but to do that we have to ask the physician to contact the patient, and those physicians just don't have the time. Meanwhile the patients are sitting there saying: please find me, I want to be in these studies, I want to figure out what's going on. So if you can have a direct liaison, figure out whether those patients are willing to be part of studies, and maintain communication with them. That was one of the reasons we were proposing to use Patient Crossroads as a system to build this interface: they are very good at interfacing directly with patients, allowing communication, advertisement of studies, and driving the recruitment process, and I think that would be very powerful. I believe there's also the PatientsLikeMe website, so patients are clearly doing this linking themselves. And your point about every patient also being a control is a good one, because in the clinical setting we have the clinical phenotyping, whereas in many research databases we have very little information on the controls; maybe that's a place to start, at least in terms of putting a database together.
And I guess the other huge advantage of combining the two databases would be to make sure there was consistency of data quality, variant calling, and those types of approaches across the whole dataset. In terms of how to integrate the different databases, I don't see clearly how that would best be done in the short term. But in the long term, I think it makes sense to think about what we would really like to have, and one thing we talked about in the known variants group is that what we'd ultimately like is some kind of wiki-like environment where every base in the human genome is annotated with the phenotypes that have been seen associated with changes at that base. I think that's actually not an unimaginable thing, if we consider two observations. One is that everything you're going to see in a clinical genetics clinic that's driven by a single type of mutation is going to be seen again somewhere; start with that observation, which is almost certainly true. The second is that a lot of the patients are extremely motivated, which means they will do a lot of the work of actually getting the information into such a database. Then it starts to become imaginable that we would establish some kind of mechanism where we'd have annotations for what happens with any change at any of the three billion sites in the genome, and use that for a lot of our analyses. So you would see this as something that ties in with the known mutation databases? Actually, this is a database of everything. I mean, this is the human genome with every phenotype associated with every site. If one had such a database, and assuming a million rare variants in each of us, or whatever number you filter down to, every clinical quirk that I have would be tied to my genome and would show up.
And yet that wouldn't necessarily be causal, or even implicated, or even necessarily related other than by chance. So how do you filter those down? So that's a great point, and not being a clinician I can speak about this quite comfortably and glibly. The way we've tended to think about it is that for any patient you're looking at, there often are phenotypic hooks you might consider: features of the individual unusual enough that you think it likely they're connected to whatever the reason is that the person is in the clinic. In the example I gave, the obvious phenotypic hook for the child we were looking at was the absence of tears. That one's particularly obvious, and there are obviously a lot of things in gray zones, but you certainly can imagine concentrating on things that are unusual clinically, and that's what gets annotated as associated with the change observed at nucleotide position Y on chromosome whatever. I think we could build something like that, and I can tell you that in the work we're doing daily, having access to that would be transformational. Well, a gene might be involved in a disease while the individual variants may not be. For a lot of genes you simply see a lot of variants in them, so the claim that the gene is involved, yes, but for the individual variants it's still very difficult to say anything. So, just a comment on that. I agree, but I'd make two points. One is that this is actually yet another motivation for such a centralized database, because then you would have the information within it that would allow you to ask whether a particular change at a nucleotide site really and truly was associated with a particular distinctive phenotype.
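The wiki-like per-base environment described above can be sketched as a simple annotation store keyed by nucleotide change, where each entry carries its provenance (the "who made the assessment, on what basis" point from earlier) and independent recurrence across centers is what makes a site stand out. All sites, phenotypes, and sources below are hypothetical.

```python
# Per-site annotation store: (chromosome, position, alternate allele)
# mapped to the phenotypes observed with that change, with provenance.
annotations = {}

def annotate(chrom, pos, alt, phenotype, source):
    """Record an observed phenotype for a specific nucleotide change."""
    annotations.setdefault((chrom, pos, alt), []).append(
        {"phenotype": phenotype, "source": source}
    )

def recurrent_sites(min_reports=2):
    """Sites reported independently more than once; these are the
    ones that jump out from a genome-wide perspective."""
    return {site: entries for site, entries in annotations.items()
            if len(entries) >= min_reports}

# Two independent centers report the same change with the same
# distinctive presentation; a third site is a one-off.
annotate("chrN", 1000001, "T", "distinctive presentation X", "center 1")
annotate("chrN", 1000001, "T", "distinctive presentation X", "center 2")
annotate("chrM", 555, "G", "isolated finding", "center 3")
```

Filtering on phenotypic hooks, the unusual features discussed above, is what keeps such a store from drowning in every clinical quirk tied to every one of our million rare variants.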
And the other point that I would make is that it most emphatically does happen. Take, for example, the case where we found mutations in this one particular gene that causes a really unusual clinical presentation of alternating hemiplegia of childhood. You get the same exact mutation. The condition is one in a million. You get the same exact mutation at the same exact site in one third of all the cases. So we see this one particular nucleotide change 35 times, all presenting with this spectacularly unusual clinical presentation. What it does show you is that if those data were dumped into a centralized place, patterns like that, which are quite unequivocal from a genome-wide perspective, would indeed emerge. I mean, I do agree with that. I would like to emphasize a point that is shown in this slide. For those of us who have been dealing with Mendelian disorders for the last 30 years, we call it penetrance. And it is my impression that the causality of a variant is not a digital number; it's not either zero or one. Each variant is important in its environment, in its genomic environment and the allelic composition of its genomic environment. I'll give you the example of a very common disorder. In sickle cell disease, the variant in the sixth codon of the beta-globin gene was identified more than 30 years ago. Well, in sub-Saharan Africa, that variant gives you a very severe disorder, and in the Arab countries, Saudi Arabia, for example, it gives you a mild phenotype, because variants nearby in the hemoglobin F gene modify the phenotype. So in other words, the databases that exist today don't have this penetrance information. 
And I think it will be very interesting to have some kind of a score between zero and one for each pathogenic or likely pathogenic variant, reflecting its pathogenicity in different environments. And as Mark said in the beginning, I think we need to do an auditing of all the mutation databases and eliminate from the existing pathogenic variants all the ones that are probably not pathogenic, because that dramatically changes the priors per mutation. And the existing knowledge, I think, does not take into account the penetrance of the variants. Some people in the clinical arena see a haploinsufficiency, for example. I'll give you the example of TAR syndrome, a haploinsufficiency. Some people are affected and the parents are not affected, although they have the deletion. And the easy answer to this, seen in many, many patients, is that the allelic composition of the normal, non-deleted allele changes the phenotype. I do not know how to deal with this in the databases, or in the prior probabilities, where one needs to take into account the penetrance number. So my point goes a little back to Daniel's and to David's point about the database design. One design is where we have a listing of variants that are clearly linked in a way that we believe is causal, which is almost a very simple relational database that we're used to building. But another is the searchable idea, where a patient's phenotype is linked to their whole VCF file, so that we can put together these 30 individuals from around the whole world that might share this data. And I think that second one is not something we've built before. I think that's a very interesting idea and something that could really push us forward. 
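The context-dependent penetrance score suggested here, a number between zero and one rather than a digital pathogenic/benign call, could be sketched as follows. The function name and the carrier counts for the beta-globin example are invented for illustration, not real data:

```python
def penetrance(carriers):
    """Observed penetrance: the fraction of variant carriers who are affected,
    stratified by genomic or population context.

    `carriers` is a list of (context, affected) pairs; the result is a score
    between 0 and 1 per context, rather than a single digital call."""
    by_context = {}
    for context, affected in carriers:
        n, k = by_context.get(context, (0, 0))
        by_context[context] = (n + 1, k + (1 if affected else 0))
    return {c: k / n for c, (n, k) in by_context.items()}


# Made-up counts echoing the sickle cell example: the same variant, severe in
# one genomic/population context and mild in another.
obs = (
    [("sub-Saharan Africa", True)] * 9
    + [("sub-Saharan Africa", False)]
    + [("Saudi Arabia", True)] * 3
    + [("Saudi Arabia", False)] * 7
)
scores = penetrance(obs)
```

A mutation database carrying scores like these, rather than a single flag per variant, is what the speaker is asking for.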
I mean, I think we also may, for this particular application, the clinical sequence of a patient with all the detail needed, need to explore more creative ways of data sharing, because it's obviously invaluable that we as a community are able to search across phenotypes and rare variants and so forth. We may be able to, and I expect that we can, develop computational infrastructure that will allow us to do that without simply everybody depositing and exchanging the full genome sequence of individuals. And I think that's something we need to explore, because in a lot of cases we can't simply take one of our patients and then send it out to 55 collaborators around the world. And I believe there have been research networks, more on the clinical side, that have done those kinds of things, where they do have some kind of a structured format. The HMO research network is one that comes to mind. They have a data warehouse, as I understand it, that does allow fairly simple, straightforward kinds of queries. So it is possible to do that without actually having the data in your own place. Can I comment on that? This may be too limited, but one thing that has struck me in what we've been doing so far is that we often get down to, well, we think it's probably one of these five or six or seven mutations, but we don't know which one. And what I found striking is that it has now on several occasions happened that, for patients where we got down to five, six or seven, it later resolved to be one of those five, six or seven when new data appeared. And what that tells you is that the sharing aspect of it, for at least that part of it, is really super, super easy. It's, okay, here are some phenotypes, and I think it's probably one of these five, six or seven. It's not the rest of the genome. 
So there is some capacity to winnow it down to a set of, you know, possibilities. And it's worth emphasizing, I guess, that clinicians engage in exactly that activity in an informal setting, in the sense that they will come across a couple of genes that are potentially interesting in an exome and then reach out to their collaborators, scan those same genes across multiple patients, and pull out compelling evidence. This would be an opportunity to do that systematically, without requiring those kinds of informal network structures. I was just going to say that a small step towards making a database that would have a lot of information on every base in the genome would just be to require that when papers are published, every time they refer to anything on the genome, there are base pairs associated with those papers, so that you can browse the genome and just click on a base pair and see every single paper that touches that base pair. And that's actually very easy to do, and I think it would make an impact if we did it. So these last two comments, however, and maybe there's going to be a session on this later in the day, raise this issue of sharing and provenance. There are two situations that I think are different. One is motivated clinicians who are just trying to figure out, at least in that part of their life, what's going on with a family or a patient, and who are very motivated to share data and to find out what's going on, because the end result is being able to tell a family or a patient, this is what's going on. And so these informal networks that Daniel just referred to are based on this common cause. The other is when we talk about submitting everything associated with a paper in a research setting, where maybe it was even prospectively designed. It wasn't just somebody coming into the clinic. It was, hey, I think I'm going to study this disease. 
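The paper-to-base-pair linkage described here really is easy to build; at its core it is just an inverted index from genomic positions to publications. A minimal sketch, with hypothetical paper IDs and coordinates, might be:

```python
from collections import defaultdict

# Hypothetical paper-to-position index: when a paper is published, the base
# pairs it discusses are registered, so that a genome browser could list
# every paper touching any given position with one lookup.
paper_index = defaultdict(set)


def register_paper(paper_id, positions):
    """Associate a published paper with the genomic positions it refers to."""
    for chrom, pos in positions:
        paper_index[(chrom, pos)].add(paper_id)


def papers_at(chrom, pos):
    """All papers that touch this base pair, in a stable order."""
    return sorted(paper_index[(chrom, pos)])


# Illustrative, made-up registrations (the IDs and coordinates are placeholders):
register_paper("PMID:0000001", [("chr7", 117559590)])
register_paper("PMID:0000002", [("chr7", 117559590), ("chr7", 117559591)])
```

Clicking on the first position in a browser backed by this index would list both papers; the second position would list only the second.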
As we all know, the dynamics of data sharing change dramatically there, because now it's about our entire academic careers, which of course matter much less than the life of a patient. And so I think we have to be really careful in designing these databases, because that first motivation I think is very clean and clear, and I think the reason why Heidi or David would share data is transparent. But when we do a study on our favorite coagulopathy and then are asked to put the data in because that's the right thing to do, that's a different thing. And I'm not saying I'm against it. In fact, I'm probably for it, but I think we should acknowledge right away the different social setup in those two scenarios. So here's a question: do we have time for this kind of discussion, or is that going to be for a future day? No, I think we do have time here. I think here is the appropriate time. And I guess we talked a little bit last night, pursuant to that, about what the incentives are for doing this sort of thing. And I heard from at least several people around the table that are doing this kind of work that it makes their lives much, much easier if you actually have access to this information. And so maybe the small labs or the kind of one-offs don't realize that it would be advantageous to them, but having the larger groups begin this would seem to make perfect sense. And it sounds like to some degree you're already doing that on an individual level. Is that right, David and Heidi and Ewan? So how would you go about making that, if not only available, then larger and more useful to you? So I mean, I think there are a couple of things, conceptually, from the clinical lab side of the world that are given to us that may not be obvious to people that live in a research world. One of which is that the clinical data we have on patients, and the sequence data in our instance, is considered protected private health information. 
So there are certain legal obligations we have with the way we deal with that data. From a clinical lab point of view, we have obligations to protect the data. Our institution has, rather controversially, decided that the sequence data itself is health information, which, once again, limits our ability to share sequences. But I think also, from a practical point of view, I don't necessarily just want a variant file; I actually want the whole sequence data, because I think there's actually a lot more value in that. And so I think the reality is that practically this data best sits hosted locally, but in such a way that it can be centrally interrogated. So there needs to be a structure around both the genomic data and a structure around the phenotypic data. And if we think there are problems with disagreements in genotyping, phenotyping is just very difficult. I think some of what Lazar's done at ClinSeq is actually a pretty good example of ways to think about doing this. But the practical way I would see this working, and the reality, is that the clinical labs are actually probably going to be producing more genomic data than research labs over the next five years. We need to find a way in which we can tap into that, with consent from patients, with phenotypic information. And really, ideally, at our institution we're trying to pull, with parental or patient consent, the full phenotype from the medical record, so the blood pressures, the whole deal. So if you're interested in what causes hypertension, we can actually look at what the blood pressure is of the patients we've sequenced, as well as what medications they were on when they had that blood pressure. So that kind of data is really very useful, and I think there are ways in which people can then ask structured questions of the data. One simple question is: have you seen this variant before? 
Being able to ask that question is important, and if you've seen it, what phenotype is associated with it? That's a relatively simple level of question, but I think we want to move beyond that, though that would be huge in itself, to actually being able to ask bigger, real research questions about the patients, with their complete variants available. But I think the data for both the phenotype and the genotype has to be housed locally, with a central ability to interrogate the data, in ways that protect patient confidentiality to meet minimum legal standards, but also maintain the trust of the patients who would be willing to share. So just a comment on how the ISCA consortium, which has been collecting CNV data on patients through clinical laboratories, has handled this. They came up with an opt-out consent process, so the clinical labs will post on their requisition forms, their websites, and the final reports that they contribute de-identified data into a public database. The process is basically that if the lab has put this opt-out in place and the patient hasn't opted out, then the lab can send in the full cytogenomic array data set, which can be considered identifiable, into dbGaP to be centralized there, and then the variants individually get pushed into dbVar, where they can be seen by the public. As individual variants, they're not considered identifiable, and so they can flow there. And that's all done really without direct consent. And so that's been incredibly useful. They've collected over 30,000 cytogenomic array data sets through that process that are available. At the same time, there are some limitations in getting access to those data sets, and so the approach that we're now trying to take is to truly, fully consent patients for sharing their data in a biorepository that is much more accessible. 
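The locally housed, centrally interrogatable model being described could be sketched as a federated query: each lab keeps its own data and answers only the narrow question, so no full genomes are exchanged. All names and the example observation below are illustrative assumptions, not a real protocol:

```python
class LabNode:
    """Hypothetical participating lab: hosts its own de-identified data and
    answers only 'have you seen this variant, and with what phenotype?'"""

    def __init__(self, name, observations):
        # observations: {(chrom, pos, alt): [phenotype, ...]}
        self.name = name
        self._obs = observations

    def query(self, chrom, pos, alt):
        phenos = self._obs.get((chrom, pos, alt))
        return {
            "lab": self.name,
            "seen": phenos is not None,
            "phenotypes": phenos or [],
        }


def federated_query(labs, chrom, pos, alt):
    """Fan the question out to every participating lab and pool the answers.
    Only the narrow answer leaves each lab, never the underlying sequences."""
    return [lab.query(chrom, pos, alt) for lab in labs]


# Two labs with locally housed data; one has seen the variant of interest:
labs = [
    LabNode("lab1", {("chr9", 100, "A"): ["hypertension"]}),
    LabNode("lab2", {}),
]
results = federated_query(labs, "chr9", 100, "A")
```

In a real system each `LabNode` would sit behind the lab's own firewall and consent controls; the point of the sketch is only that the query interface, not the data, is what gets centralized.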
But it will be harder to go through that consent process with every patient. Still, I think if we can make an effort to establish what the consent process is, what the standards are to do that, and incentivize patients to be part of it, I think it will drive that process and we'll get a lot of benefit out of it. And I think it's absolutely doable. I mean, this is a solvable problem. Most of the rules are created not by patients. In consenting patients for whole genome sequencing in our facility, we had a long discussion, of course, as we all have, with ethics boards about how to do this and what the wording should be and what we should tell them. And there was great concern that lots of patients would not sign up. There hasn't been one patient who hasn't been interested, because they're not so interested in privacy. They're interested in the issue that they have. And if sharing de-identified data, or even identified data, will help their family, they absolutely sign up for it. So I think it's doable. Sign up for dbGaP. We haven't had any declines. That's not a word, is it? No? We can make up new words. We haven't had any declines for that, either. Yeah, I was really just gonna say, for us, the same thing. We've been asked many times, can you just put the sequences up on the web so that other families can look to see if they see anything similar? So I don't think there's any barrier from that side. And if there's no barrier from the side of the families that are participating in this research, there should be no barrier. We should find a way to accommodate that. One thing I was wondering, though: do we have any way of knowing what fraction of the clinical sequencing that's happening right now is sort of lost to the world? 
Because at Duke, if we do it under research and anybody writes to me, we can go and look at the genome sequences and say, okay, yes, we have a mutation in that genome, or no, we don't. But if a clinician sends out, like I said before, to get the sequencing done at Baylor, that sequence might as well be lost to the world. It might as well be lost to the world really right now, in the sense of a clinician saying, hey, I see a patient similar to the one that you studied back in 2010, do you see any mutations in gene X? There isn't a mechanism to do that. Now, I'm not dumping on Baylor; I'm just saying it's lost in that particular sense. And what I'm wondering is how urgent this problem is. Do we have a way of knowing what fraction of the sequencing that's being done is relatively interrogatable going forward, and what fraction is kind of gone in that way? Does anybody know? I can answer that one. So the policies on what's done with data are basically at the discretion of the lab director of the lab, and often are probably more largely controlled by the lawyers of the institution than they are by the lab director, if I'm honest. Is that a fair statement? So I would say, at our institution we have a policy of open availability. So if you have your genome sequenced, the hard drive of data is yours to take and do what you will with if you want to. I know Ambry have a policy where, if a qualified investigator has an appropriate IRB protocol, the data is available with patient or parental consent. Right now, as far as I'm aware, Baylor doesn't have a similar mechanism, but I know they are looking towards trying to get something, because there's been a lot of pressure. So in one sense, the data is not returnable back to the requesting institution. 
That's not to say that that data may not be interrogatable on a one-off basis if one asked Baylor or whoever, but you would have to ask them, and you would have to observe all of the niceties of using somebody's private data set. And that's the vast majority, because I think Baylor probably accounts for more than half of the clinical exomes right now. I don't know exactly, but that data's not lost. It still exists, and Baylor is required by law to keep it for a minimum of two years. So if there is a way of changing the way these things are done so that there are incentives, either clinically driven or patient driven, I think the reality is that most of the patients that don't have an answer are motivated to get the data shared. I've had one family that actually didn't get the genome sequenced because they were worried about privacy. I've had multiple patients who have said that they would not be comfortable with their data being shared with the federal government. And I mean, this is a significant problem, because it means that we can't share the data with ClinVar or through other consortia. And it is particularly our underserved populations, or relatively newly immigrant populations, that are less keen on sharing data. And that's a problem, because they're also the most underrepresented in the other databases. So just to add to that: when we were approaching clinical laboratories about data sharing with respect to variants, which is one step easier than sharing their genome sequences, almost all of the labs, with rare exception, were very willing to do this, but it takes resources to do it. It takes resources to go dig this data out of their systems and to physically submit it. 
And so I think we will gain a lot more traction in getting access to these data if we can assist the labs in making it easy, developing the infrastructure and the standards to do this. Some labs, now that we're getting into exome and genome sequencing, are just unsure what they're allowed to do: what kind of consent or opt-out is needed, what the regulations are, and what infrastructure they need to set up to interact with the patients and get that permission. And if we can set very good community standards for how to do that, assist them with that, and just make it easier, I think it's much more likely to happen. So I think that should be something we all strive to do. Just one refinement on a point you've made, David, which I've heard from a lot of patient advocates: if the patients or patient advocates say it's okay, why the hell can't we do it? And that is a reasonable thing to argue, but in fact it isn't how our human subjects regulations and laws are actually written. In fact, there isn't anything in the Belmont report that says if the patient says it's okay, whatever. The IRBs are actually structured to make decisions about the range of what is allowable before the patient has any say in that. And so I'm sympathetic with that, and we can say that we need to move toward that, and we need to try to convince the regulatory structure and our IRBs that things in that direction should be more okay than they sometimes are, but we can't just say, if the patient says it's okay, just do it. I just want to say, I think the idea of data sharing is a really good thing, but I also think it's the case that people don't fully appreciate, I think, all the significance of genomic data. And 10 years from now, people might have a very different view of, well, I've shared my genome, what does that mean? And I do think that there needs to be some, how should I say, sensitivity to that, and maybe a little caution. 
Great, so, oh, sorry. How are we doing? We're great, we have one minute and 31 seconds. Okay, cool. So I'm intrigued by this idea of rebuilding mutation databases from scratch using newly generated data, and the extent to which that could actually be done. But I'm interested in whether that would actually work for well-established genes. Typically, patients who have a very strong phenotypic presentation associated with a specific gene are not gonna have their exome sequenced, so those variants won't be deposited in such databases. So there is some need to integrate that type of targeted or historical data in addition to the exome data. Is that fair to say? Or would it actually be possible to totally reconstruct mutation databases from scratch over the space of the next five years? I would say, looking at the data from CFTR and the fact that they basically had to rebuild their database over the last year, I think honestly we're probably better off starting all over again. And it's not to say that you don't include the historical data; you just look at it differently and classify it better. So, using natural language processing to pull variants out of the literature in a new way, applying our current criteria to the classification of those variants, and starting the classifications from scratch, but that historical data is still useful. So I think our next talk is Greg Cooper, and this is the estimating impact group, which used to be, weren't you the ones called the functional variant group? And you not only rejected your charge, but your name too. That's right.