Yes. Okay, and Mike Pazin, who I'm also working on this with, is here as well. So yes, we've talked with some of you, and, okay, so this is the concept clearance. This is actually an extremely simple-minded justification. As Adam had discussed about the reason why we look at exomes rather than whole-genome sequence, obviously there's a cost issue. But the other reason is we don't know how to interpret variation in non-coding regions, which is a serious issue, and that's really pretty much the simple-minded justification for this whole effort. We have a whole genome. Jeff Schloss's program has been very effective at generating ways of sequencing the whole genome, and as I'll say a little bit later, we know that there's lots of stuff that affects phenotype and disease in the non-coding parts of the genome. So exomes are so 2010. So the question is, we know that many genes and variants are associated with disease. Which ones are actually causal? And as you know, function is complicated, causation is complicated. We'll talk a little bit about how we're going to get away from the word causal. But the point is, as you well know, and we've seen these things all the time, you get a region of the genome, and some of you even know how to interpret this sort of diagram, but you get a whole bunch of variants associated with each other and with the disease, and there's clearly something going on there. Some genetic variant really is mechanistically, pathogenically, causally related to disease. There's something that's really there that is contributing to the mechanism of how disease happens. But it's got a whole bunch of buddies along for the ride. And LD does exist. It's very non-trivial to figure out, okay, there is a region, there's something real in there; of the whole bunch of variants in that region, which is the variant or variants really causing the phenotypic effect? 
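The "buddies along for the ride" problem can be made concrete with a toy linkage-disequilibrium calculation. This is a minimal sketch using made-up 0/1 haplotypes, not data from any real locus; `r_squared` here is just the standard squared allelic correlation:

```python
# Toy illustration: a causal variant and a nearby "buddy" variant in strong
# LD carry nearly the same association signal, so association alone cannot
# separate them. Haplotypes are made-up 0/1 allele indicators at two sites.
def r_squared(hap_a, hap_b):
    """Squared correlation (r^2) between two 0/1 allele columns."""
    n = len(hap_a)
    pa = sum(hap_a) / n                    # allele frequency at site A
    pb = sum(hap_b) / n                    # allele frequency at site B
    pab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = pab - pa * pb                      # LD coefficient D
    return d * d / (pa * (1 - pa) * pb * (1 - pb))

causal = [1] * 10 + [0] * 10               # 20 made-up haplotypes
buddy = [1] * 9 + [0] * 11                 # matches `causal` on all but one
print(round(r_squared(causal, buddy), 2))  # → 0.82
```

With r² that high, the buddy variant shows nearly the same disease association as the causal one, which is exactly why narrowing a region down to the functional variant takes more than association statistics.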
So we know about the genetic code. Coding regions have the genetic code, so we understand much of what happens there. There can be a lot of oversimplifications here, but it's a good place to start: you know what the synonymous, non-synonymous, and stop-codon variants are. So in the coding regions, we have some good information that helps us interpret them. But the exome is only about 1.5% of the genome. If you only focus on the exonic regions, it's like looking under the lamp post for your keys. We know that non-coding DNA variants affect human diseases. There's a bunch of diseases, and there are many more. We know they affect drug response. The GWAS catalog is full of these associations, 90% or so of which are not in exons. We know, from both GWAS and from scans of the genome for natural selection, that a lot of these signatures are outside of protein-coding regions. So there's a lot of the genome that clearly has functional effects that's not in exonic regions. There are lots of interesting things that this sequence does. So that gets us to the concept: interpreting variation in human non-coding genomic regions using computational approaches with experimental validation. What we're trying to do... We're actually trying to address the really hardest questions here. As I said, function is complicated. There's sort of easier function and harder function. The easier function is things like looking for transcription factor binding sites. That's not trivial; that's an ENCODE-type project where you go through the genome and look for these elements. The question, though, is which of these elements actually affect organismal function? Many variants probably have zero effect on function. Adam likes the metaphor of a perfectly functioning door that goes nowhere: it can work at the molecular level but really not make a difference at the organismal level. 
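The point that coding variants are "easier" because of the genetic code can be sketched directly. This toy classifier uses only a handful of real codon assignments for illustration; an actual annotator would use the full 64-codon table and transcript context:

```python
# Why coding variants are the "lamp post": the genetic code lets you classify
# a substitution immediately. Only a few real codon assignments are included
# here; this is an illustration, not a complete annotation tool.
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",  # both encode glutamate
    "GTA": "Val",                # valine
    "TAA": "Stop",               # stop codon
}

def classify(ref_codon, alt_codon):
    """Label a codon substitution as synonymous, non-synonymous, or stop-gain."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == "Stop" and ref_aa != "Stop":
        return "stop-gain"
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"

print(classify("GAA", "GAG"))  # → synonymous
print(classify("GAA", "GTA"))  # → non-synonymous
print(classify("GAA", "TAA"))  # → stop-gain
```

Nothing comparable exists for a substitution in an enhancer or insulator, which is exactly the interpretability gap this concept targets.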
And so figuring out which variants actually cause organismal effects is a hard problem, and we thought it would be worthwhile to stimulate research in this area. Which is to say, we still need all those molecular studies; those are hugely important. But we thought we would focus on the harder problem. The other thing, of course, is that for computational approaches there's a huge base of data sets that's needed here. It would be great to get a lot more of these data, but that's a separate discussion. There certainly are data sets that already exist that can be used. So we want highly innovative computational approaches to identify or narrow the set of potential variants. Causality is a very hard problem, and we're trying to stimulate this area with the experimental validation. We're not trying to have groups absolutely prove that this variant causes that disease. But we want to narrow the set of variants to ones that potentially contain the causal variant. So I'm confused. Then what do you mean by experimental validation if it's not to show that the variant causes the phenotype of interest? Well, first off, we're talking about computational approaches, so computational predictions. But then we want to have some ground truth with experimental validation. There's a whole range of validation, from low-throughput, gold-plated validation that really does show that that variant causes that disease, but that's very expensive, to experimental validation that maybe doesn't completely prove it but gives you an indication that narrows the set of variants. That seems to be okay, or we're proposing that that would be okay. Just be careful with the language you're using about experimental validation that shows a variant causes the disease. The association is just a probability that the variant is associated with the disease. Absolutely. 
I can't believe that you'd have an experimental validation that would prove it was causative. You might prove that it changed the expression of that gene or something else, but the idea that you could make a direct connection to disease is not that simple. Oh, absolutely. And that gets exactly to Jill's question: at the highest level these are associations. Then when one is doing something else experimentally, there certainly are cases where the pathway has been worked out. So you start with associations. You narrow down a set of variants. You then do experimental work, clinical work, and you think, or you more or less prove, that that variant causes the phenotypic effect. It's not based on just a whole bunch of associations. So absolutely, we're trying to move beyond GWAS associations. We're trying to get to that middle ground where it's more than just associations, and yet it's not one variant, one huge research project. Because we want to eventually be able to use these methods to at least narrow down the set of variants that can then be studied experimentally in much more detail. Does that help? I think the phrase experimental validation was also left intentionally open, in hopes, in part, of stimulating good ideas in that area. And as written, it could go as far as what you described as the gold-plated experiment. Do we mean gold-standard experiment or gold-plated? Well, if it's a mouse, it's gold-plated as well. No, I'm just saying that that's the problem with the true recreation, where you generate... this was recently published, for example, for the coding-region variant in the EDAR gene, one of the areas that shows a strong signature of selection in Asian populations. The specific human amino acid variant that shows one of the strong associations was recreated in the mouse, an expensive experiment because it was a knock-in that made the switch at the very same residue in the endogenous gene. 
And they verified a whole series of animal phenotypes that were generated by that single amino acid change. Now, that's a very expensive whole-animal experiment. You're not going to be able to do very many of them if you try to do that for all of the interesting things that have come from GWAS. And there may be a variety of situations where you could have cell-culture models of phenotypes that are in vitro surrogates of things that you would like to be able to score in a whole animal. So there's a range of possibilities that I think vary in how expensive they are per variant, how whole-animal-like they are, and how surrogate they are. And all of those could be put together or proposed by investigators, in what will then be looked at to see what's the most compelling combination of prediction and some sort of experimental test of whether the predictions are finding things that have functional effects. Lisa? Sorry. Yeah, I was going to ask... this is a really important point, because I think this could be the poster child for what Jim said and David reiterated about the loop of trying to connect back to the biology of the disease, back to domains one and two. But it depends on what percentage of the effort goes into that. Some of it may be very high-throughput, but there might be good reason to do a significant number, and significant is the question related to budget, of those gold-plated, gold-standard experiments where warranted. So is there, in the concept clearance... I didn't see a kind of ratio of effort on the computational versus the validation? Yes. Well, we initially had suggested one, but the small group of council members we'd initially consulted about this said, don't put in a specific limit. It really depends on the expense of the method. And so there's no limit there. 
I mean, what you're talking about in a sense is validating the validation method: if you have some gold-standard methods that can validate that your bronze-standard methods actually work well, then you have sort of a few very expensive assays that will validate a larger number of less expensive ones. They may not be so expensive; you know, as mentioned by Eric, Cas9/CRISPR technology may be able to very rapidly create mice that have both alleles replaced; Rudolf Jaenisch had a beautiful paper just demonstrating that. Exactly. If that sort of proposal came in as an innovative way to test function, I think it would do great in this sort of RFA. Yeah. Yeah, I mean, I think what you want to get across, right, is that you want the community to hit the sweet spot, that you can't set the bar too high, but talking about experimental support or orthogonal types of support or validation seems to me to be what will provide the most coherent message. The other thing I would just throw in there is that I'm very much in support of looking at the non-coding regions, if for no other reason than the fact that so many important things seem to land in there, but I think it's very important to remember that we are still clueless as far as interpreting most variants in the coding regions too. Yeah. That issue is completely true for coding as well as non-coding. Yeah. You have the genetic code, and that's a help for a minority of changes, but it by no means gets us out of the woods. So I don't want people thinking that NHGRI thinks that we've solved that problem and now we're on to the next one. Absolutely. And actually we say, somewhere in here, okay, focus on non-coding variants for the reasons we discussed, but if some method gets you a region and there are some coding variants in there, that's fine. And many of the techniques will work regardless of whether variants are coding or not. And that's completely fine. 
You don't want people to forget that as they write these and think about this. That's right. So what we're not looking for are sort of improvements to ways of inferring the effect of a non-synonymous change on expression and function. Right. A non-synonymous change, you know, affects protein structure this way and therefore is more likely to be causal, or something like that. That's a real focus on coding variants. But as you said, there are certainly methods that are agnostic to whether a variant is coding or not, and those are completely acceptable. Do you want to? Yeah. To follow up on what Jim's saying, one thing we would like is that if people are going to follow up protein-coding variants, they should do it in an agnostic way. Some of the more interesting examples of non-coding variants were found because they were initially coding variants, and upon further study it turned out they were tag SNPs for a nearby non-coding variant. Okay. So which variants potentially affect organismal function? Sometimes this will show how the effect is brought about, or the genetic architecture, if you have things like gene-gene or gene-environment interaction. So we expect applications will include the computational approaches as well as the experimental validation of those approaches. We're not looking for large-scale production of functional data aside from the validation data, and we're not looking for things like databases or just aggregation of information on variants. There are a lot of data sets that are available to use. So the initiative focus is on genome-wide interpretation, rather than somebody saying, I have a very interesting region and I really want to study the variants in that region. What we're looking for is approaches that can be applied to a lot of data sets, so that you start with the entire genome, such as with GWAS. 
GWAS starts with the entire genome based on association and comes down to particular regions; it's not saying, I just want to look a priori at a particular region. It doesn't have to be GWAS; something like genome scans can also start with the whole genome and find regions. So I would suggest, I'm wondering if you could add to this concept the idea of having a coordinating center whose job it would be to run a contest, where you would provide variants to groups that say they've developed a method, and then have them all analyze those variants and see how they do. Sort of similar to what Brenner does with CAGI. Yeah, I was just thinking of CAGI. That's interesting. Let me get to one more. But then don't you need a gold standard to judge them? You would ask the coordinating center to try to develop such a gold standard, but it would have to be something that's not in the public domain so that they couldn't cheat. Exactly. That's an interesting idea. It's quite related to this. I'll also point out that, even though we want them to start with the whole genome and go down, different classes of variants may have different properties, so that for CNVs, say, or transcription factor binding sites or CpG islands, the signals of which variants are actually contributing to the organismal phenotype may differ according to the class of variant. So we're not trying to say you have to... Again, this is very hard and it's kind of early days, so we're not saying, here's a genome, give me all functional variants. So, Lisa, would you say that the driving idea behind this is to sort of flesh out the best computational methods that are out there? Somebody might be saying, well, I've got this theory that knowing something about the network structure would really help me predict which enhancers will be really important. And so I'm going to make some predictions, and then I think I can test that using this cellular phenotype, and I'll read it out and see. And so I viewed it that way, right? 
Sort of saying, okay, and then you would want to fund a portfolio of maybe a couple of network approaches, maybe somebody who says, what I really think is important is to take all the ENCODE data and put it through some prediction algorithm that's actually totally agnostic. It uses machine learning or something to make predictions. And then I'll run that through and see how well that does. What's the training set? Well, I mean, no, but you could imagine that a series of different approaches will be put forth, and then by having all these folks liaise with one another, you'd get some best practices, and maybe they'd even be sharing some of their gold-standard data, for example. I don't know, but it seems to me that that was sort of the... Or did I get that wrong? Yeah, no, that's a very good description. And of course, the reviewers will like to see some evidence that a method being proposed can actually work, and I'll get to that issue towards the end. We also figure that these groups will be meeting, like, once a year, exactly to exchange ideas and possibly validation data sets and approaches. Okay, and we want the methods to generalize beyond the specific data sets and diseases studied. So basically, the idea is that you start with the whole genome and go through a series of approaches. And this is just a very straightforward, simple-minded example: you have a whole genome, you do GWAS, you come down to regions, then you look at, say, transcription in cell types related to the disease, and that gets you down to certain variants, and then you use ENCODE and regulation and pathway and other data sets to get to a smaller set of variants. Other examples are things like, instead of starting with GWAS, you can start with a genome scan for natural selection, or chromatin... There's an example where, for chromatin structure, you have an indel that affects the open chromatin structure there. And so we already know examples where that... 
Where those indels actually change the chromatin structure and therefore affect, say, persistence of fetal hemoglobin for the thalassemias. So there are examples like that. You know, a very simple-minded thing is promoter binding: knowing which variants actually affect the promoter can help you interpret the variants. Epigenomic variability, so the variability itself, gives you a clue as to importance. So there's a set of types of things, and we certainly hope the applicants will be quite imaginative and come up with good methods for the computational approach and for the validation methods. There's a range of types of validation, as we discussed. They can use model organism data. The concept clearance says that we encourage innovation in methods for validation. I mean, what Joakar was saying about CRISPR methods, or those sorts of zinc-finger, very specific things, may be very nice validation and maybe not too expensive. That would be terrific. Okay, there are some other initiatives. NIGMS had an RFA that was related to figuring out everything you can about function of variants, both experimental and computational, and they included things like databases. They made about eight awards or so, only one or two of which are related to this at all. So they haven't solved the problem. Other institutes are also developing some experimental data sets for functional methods. Those will be good data sets to use. So the timeline we're talking about: we're talking about two rounds here, and we actually think this is quite important; it's partly getting at Carlos's point. Receipt dates in January 2014 and January 2015, so anybody who's ready to go can put in an application. But this is difficult, and there are a lot of moving parts here: they have to have the computational approaches, they have to have the experimental approaches. 
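The narrowing pipeline described earlier — whole genome, down to GWAS regions, down to variants in disease-relevant functional annotations such as open chromatin — can be sketched as successive filters. All chromosome names, positions, and annotations below are hypothetical illustrations, not any proposed tool's actual interface:

```python
# Hedged sketch of the genome-wide narrowing idea: keep only variants that
# fall in a GWAS-associated region AND in a functional annotation (e.g. an
# open-chromatin peak in a disease-relevant cell type). All data are made up.
def in_any(variant, regions):
    """True if the variant falls inside any (chrom, start, end) interval."""
    return any(c == variant["chrom"] and s <= variant["pos"] < e
               for c, s, e in regions)

def narrow_candidates(variants, gwas_regions, open_chromatin):
    """Successively filter a genome-wide variant list down to candidates."""
    in_gwas = [v for v in variants if in_any(v, gwas_regions)]
    return [v for v in in_gwas if in_any(v, open_chromatin)]

variants = [{"chrom": "2", "pos": 1500},   # in GWAS peak and open chromatin
            {"chrom": "2", "pos": 1800},   # in GWAS peak only
            {"chrom": "7", "pos": 300}]    # in neither
gwas_regions = [("2", 1000, 2000)]         # hypothetical association peak
open_chromatin = [("2", 1400, 1600)]       # hypothetical DNase-seq peak
print(narrow_candidates(variants, gwas_regions, open_chromatin))
# → [{'chrom': '2', 'pos': 1500}]
```

A real method would layer many such annotation sources, and might rank rather than hard-filter, but the shape of the computation — genome-wide input, progressively smaller candidate sets — is the one the concept asks for.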
It'd be really nice if they had some preliminary data showing that their computational approaches actually work a bit. By having a second receipt date a year later, we give groups some confidence to actually put in the work to put together the experimental and the computational sides. And so we think having two rounds defined ahead of time will actually help stimulate the field. It'll help put together collaborations. Those groups will have a chance to get some preliminary data and put in good applications. Because of the experimental side especially, we think these are reasonably large grants. Again, this is a very difficult topic, and we really want to have the validation in there, so it's not just association-based. So we figure $500K direct costs per year, making about five to six awards in each round. Okay, so actually, we're hoping we'll be able to start interpreting the non-coding part of the genome. Any other comments? Mike, did you want to say anything more? Any other comments on this? Yeah, first of all, thank you for this very thorough and clear presentation. And I'm very excited about this concept clearance. It's clearly a high-priority area. It's clearly one that really generates a lot of excitement. I mean, the council couldn't even let you finish your presentation without jumping in. And Carlos even started designing the proposal right there. I mean, it's that good. It is so important. It's really exciting. And I really encourage moving this forward to an RFA the way that you've got it set up. This issue about... so the experimental tests are vitally important, and they're going to be part of it. I hope you can come up with a shorter title that will still convey that. But using the term validation is tricky. And it does have specific meanings to a lot of people. 
To me, it means that you're going to take a conclusion that you've inferred from one set of data or one technology, and you're going to test it with an orthogonal technology. That's really validation. And this is back to Anthony's point: given that the initial idea is that something in a region is associated with and potentially causative of a disease, validation has a very high mark there. But there are other ways to define this. I can get a lot of mileage out of just experimental tests. I mean, it's a broader term. Or support. And the idea is to... Yeah, okay. That's the key. That sounds good, Jim. It's not all computational. It's great that it's computational, because it has to be genome-wide. And they also emphasized something that you did point out: we can see the epigenetic signals giving us strong information. But there are lots of them. There are actually many, many variables to bring in. And that's another thing I'm very excited about, because nobody knows how best to do this. We all kind of have a little intuition about how to go about it, and many people are already active in it, but we don't really know what the best way is. It has to be computational. It has to go genome-wide. But computation in the absence of experimental feedback is of limited use. So I think this could really work well. I also like the fact that this is set up to not over-engineer, not over-design the RFA, and to, I think, as Carlos said, let the community's best ideas really come to bear. So I think it's great. I also really like the two rounds, because not everybody's ready for prime time on this, but there's so much excitement. Give people a chance. Give many people a chance to try. We just wish there was more money to put into it. On that, I agree. And I'd be even content with plausibility. Biologic plausibility is something. No, even that may be going too far. 
Well, no, but I mean, that's how far off we are. So just some form of words that makes it clear that we want to move toward an understanding. Right. The actual RFA, of course, can have a discussion of this. So I take your point that validation may be too strong. Plausibility may be too weak. Support might be in the middle. But there will be a discussion of it, so hopefully people will understand what we're going for. Any other comments or discussion? If there are no other comments, we take a vote on concept clearance matters. Can we just have this friendly amendment about support versus validation? Yeah, sure. Would one of you like to state the amendment so that we have clarity? Just to change the title from experimental validation to experimental support. Yes. We can go with that. All right. So can I get a motion to accept? Thank you. And a second. All in favor? Any opposed? Thank you. Okay, thank you. Good discussion.