presentation by Adam Felsenfeld. I want to remind the council that before any Institute of NIH can publish a funding opportunity announcement that has a set-aside of funds attached to it, an RFA, for example, the underlying concept has to be approved in an open session. We always use the advisory council for concept clearance, so you guys know everything that's coming through the Institute. So Adam's going to give a presentation today. He's going to actually present five different activities that are all quite interrelated, and there's quite a bit of synergy among them. For that reason, we're proposing to take a single vote on the entire concept. And so this is for the human genome reference program. OK, Adam? All right. And thank you, Rudy, and good afternoon. Before I begin, I hope you've all had a chance to look at the council materials, the written brief summary of the program. So yes, the human genome reference, as you know, has its origin in the Human Genome Project. It serves as a coordinate system. And it is used for read mapping. And so essentially all human genome sequencing using short-read technology relies on it. And it's a critical utility for genomics. The human reference is currently maintained and improved by the Genome Reference Consortium. The Genome Reference Consortium also maintains mouse and zebrafish references. It is a collaboration of multiple groups who together maintain the genome reference. Support for the human reference comes from NHGRI to WashU and Nationwide Children's Hospital and others, and from the Wellcome Trust through Sanger and the European Bioinformatics Institute, and some from NCBI. NHGRI funding supports the maintenance and improvement of the human reference, including developing new reference versions, patches, responding to error reports, community outreach, and development of some tools for the use of the reference.
NHGRI also supports two additional awards that produce high-quality human genome assemblies to be added to the reference. And together all of these have been about $5 million total costs combined at their peak. One of these is on an extension. Some of the others have been supplemented after May, but are going to be winding up at the end of this year. So for more than a year, we've been hearing about needs for the reference that are not being fully met or ideas for improving the reference. That includes the need to better serve a growing and ever more diverse set of scientific users spanning basic and clinical, with various degrees of sophistication and expertise. There are challenges that different consortia have had in deciding which reference build to use, which update to use; that's a problem. We are collectively learning much more about human genomic variation. The reference needs to reflect this variation in order to avoid problems in downstream analyses that rely on the reference. Finally, including the additional variation raises a challenge for the basic means of representing the reference. This challenge needs to be addressed in a way that encourages downstream use of the information. Based on this input, we held a conference call with about 65 members of the community on March 1. The meeting report is available in the council materials and there's a link that's included in the notes to this slide. There are several high-level conclusions from the meeting. The first is that the reference does not adequately represent human variation, leading to a risk that downstream analyses may be biased due to failure of read mapping. More high-quality human genome assemblies from diverse populations should be added to the reference. The recommendation from that meeting was to start with 300 of them.
Another recommendation was that a linear representation of the reference is not adequate, especially as we incorporate more and more diverse high-quality genome assemblies into the reference. We will need better ways of representing all that information. The alternative assembly or alternative path information that we do actually have in the current reference is underused for a variety of reasons. And overall, there's an opportunity to develop the reference into what participants call the pan genome, which in its simplest sense is just a collection of sequences that are together used as a reference. In the context of the discussion today, I want to stretch for an ideal, and that ideal would be that this would constitute enough genomes with enough diversity to be as representative as possible and ideally to maximize the chance that any newly sequenced experimental or patient genome is readily and fully mappable to the reference. This program will comprise five different components, and they are based on those recommendations almost directly. So the first, a human genome reference center; second, high-quality reference genomes; third, research and development for reference representations; fourth, research and development for comprehensive genome sequencing; fifth, informatics tools for the pan genome. I'll go through each of these in turn. So the first is the human genome reference center. This first component will be a central group that works to maintain, update, release, and serve to the community the human genome reference. So they'll construct and maintain and release new reference versions. They'll receive and resolve error reports. They will promote and adopt state-of-the-art representations that include alternative haplotype information, and provide basic tools for the community's use of that information.
They will undertake community outreach and training and act as an aggregator of informatics tools created by the community for use of the reference. They will also serve as a coordinating center for the entire consortium. They will work with other program members to identify a framework for pan genome implementation. Not on the slide, this group will also work with the program and the larger scientific community to prioritize sample choice and develop quality standards for new high-quality genome assemblies. They will identify and respond to diverse community needs across the clinical and basic genomics communities. They will liaise with other resources that represent human genomic sequence and variation or that provide reference resources for human and other organisms. The Human Genome Reference Center is also expected to integrate fully with the Genome Reference Consortium and other international efforts that also have responsibility for providing references. We are proposing this activity as a cooperative agreement. I think that's appropriate given the complexity and the probable ongoing need for change and for making big decisions during the course of the program. We are proposing it at $2.5 million total cost per year for five years. We anticipate making one award. The second component will be high-quality genomes. This follows directly from the March conference call recommendation that we need to add more genomes. The exact right number was not stated, and I think that actually the exact right number and the constituent populations they come from is an interesting scientific question that needs to be addressed on an ongoing basis by the consortium, but the meeting participants thought that at least 300 would be needed to begin with. They will begin with sequencing diverse samples, starting with 1000 Genomes samples, which are both reasonably diverse and also, importantly, consented appropriately for this application.
They will coordinate consent and collection of new samples if that's needed during the course of the program. They will provide capacity to help resolve error reports to component one if needed. It is important to note that a number of things we do expect to change over the course of the program: the technologies will change, the cost of high-quality genomes will change, the quality requirements probably need some refinement, the optimal number of genomes is not known, and the needed diversity needs to be given quite a lot of thought. All that has to be assessed over the course of the program, and any of those things could change the funding requirements. But initially we expect that this component will require three and a half million dollars a year for five years, and we anticipate here also making a single award. This component will need to work closely with the others in this concept, especially around the issues listed in this bullet point, and it is proposed as a cooperative agreement. The third component is research and development for reference representations. During the March web meeting, participants strongly converged on the need for a reference that faithfully and usefully represents sequences from diverse haplotypes and populations. Currently the linear genome reference has been extended to include alternative paths that are added on top of the standard reference sequence to show some population-based variation, but there's a need for more advanced representations such as graph representations. So that's what these groups would do. We expect that they will use efficient, scalable methods, that the results will be open source, and that the product will facilitate use and downstream tool development. We want them to help set benchmarks and standards for pan genome representations.
Their products may well be adopted within the Human Genome Reference Consortium but also should be capable of standing on their own, and they should be based on common standards. We're proposing this effort at one and a quarter million dollars total cost per year for three years, two to four awards. We're aware that the number of grantees proposed here is small, the funding amount is also small, and that there may be those who are interested in this problem who might miss applying for the single receipt deadline. So we would also accept successful investigator-initiated applications in this area and incorporate them into the discussions of the consortium. The fourth element is research and development for comprehensive genome sequencing. March meeting participants agreed that the technologies available today are just adequate to the task; there are long-read technologies and synthetic long-read technologies. But they also stressed the need for additional technology development to address particularly difficult regions of the genome that can't be adequately resolved with methods that are available in the next year. So we propose to call for applications that seek new or extended technology for high-quality, contiguous, telomere-to-telomere, phased human genome sequence. We anticipate that that will include integration of new technologies with existing approaches. We prefer generalizable approaches to comprehensively sequence genomes in these applications, but we would accept ones that focus on specific difficult regions. And in the long term we're seeking applications that are striving for 10-fold improvement over the state of the art. This component clearly calls for long-term development.
It's unlikely to be ready to implement in the next five years, or at least in the next three years, but even so these grantees will need to understand what the quality requirements are for references and what difficulties different user classes experience, and so it's important to integrate them into the program. Unlike the other components of this concept, this one would be funded through the regular technology development program, so that's investigator-initiated R01s and R21s and SBIRs, with our interest made known through a guide notice. The final component is informatics tools for the pan genome. This is aimed at making the genome reference easily usable for a range of clinical and basic researchers. Participants in the March web meeting noted that existing tools for use of the reference can be difficult to use, and they can be difficult to find, especially ones that use existing haplotype variation information. The pan genome is expected to support new types of analyses and enable new kinds of questions to be addressed that previously couldn't be. So we're looking for proposals for analysis tools that are exemplars for use of emerging next-generation representations. We want diverse applications considered, so spanning clinical and basic across a range of specific uses and user communities. We want tools that are compatible across different reference representations and across different reference versions. We are looking for either completely new applications or ones that propose revisions or improvements of existing high-value tool kits. This is another situation where the number of people who could contribute to this is probably greater than we can fund with a single FOA at our proposed one and a quarter million dollars per year total costs, two to four awards. So applications in this area will also be appropriate as regular investigator-initiated applications, which could readily be integrated into the program as appropriate.
And finally, the whole program will benefit from coordination with this component. For example, the tool developers need to know about the details of the implementation of reference representations. The whole consortium needs to maintain an awareness of the uses and the users of the reference. And before I end and get to discussion, I want to thank Heidi for her work on components three and five and Mike for his work on component four. Carolyn, for her comments, my DGC colleagues, especially Lisa and Taylor Lynn and Terry Minoglio for a detailed set of comments on an earlier version. So with that, we go to discussion. Now, usually for concept clearances, we don't assign discussants, but I did talk to Carol a bit a couple of weeks ago about this and hope that she can start off the discussions. Yeah, and one of the things we talked about was the challenge of having the R&D for reference representations and the informatics for the pan genome run simultaneously, because you kind of need one before you can fully develop the other. But Adam, maybe you wanted to say a few words on your response to that discussion. Yeah, so thanks for reminding me. So you may remember earlier, I should just say, we already have $5 million identified for this in 2019. We would have to find an additional $5 million to bring it up to the total of $10 million in 2020. And that lag suggests a way of staging these. You'll also notice in the concept proposals that some of the tool development is a three-year grant, and that means we'll have a chance to renew the grants at some point. And also being able to accept R01s as part of the program and bring them into the program will afford another opportunity for the kind of nucleus to stabilize so people know what they're planning against.
Yeah, and one of the other things we discussed was the idea of how to integrate the R01s that might come in separate from this call into this whole program. And I think your approach to doing that will be appropriate, because there are gonna be people that are developing the informatics tools and maybe even some of the new ways of doing reference representations that may not be funded under the U mechanism. But as long as there's a way to bring those research projects into the fold, I think it will be better for the overall plan and the deliverables. Yeah, I agree. And I think that again, if there's some solid notion of what they're planning against, it will make that more productive. At the same time, we thought about a model where it was just all R01s, but I think we need a nucleus with some cooperative agreements, because with that you can ask for a commitment to collaborate very closely. Sharon? I think this is an extremely important topic and I agree with the language that there are these tools out there and people don't know how to use them. So I think, assuming this passes, when you write the actual announcements, it would be important to think about how you're distinguishing the third one from the fifth one, because I'm not really clear on that, because a lot of the tools are representation tools. So I think maybe some clarification. I could imagine people going, I'm not sure which of these to apply for, especially because there are multiple in both. So I think that could create some confusion. I do also think it'd be interesting whether there is going to be any language around, there's more and more exploration of whole genome sequencing in clinical settings, and whether there will be some exploration around tools for clinical reporting of genomes in addition to kind of the research setting.
Yeah, I mean, this is another reason why it's really important to bring R01s into this, because there are a couple of major topics like that that we want the tools to be able to cover, and if we can only fund two to four, we're limited. So yeah, we need to make that clear. Heidi, do you have a response about the differences between the reference representations and the informatics tools? Yeah. Partly what we're doing is trying to make a strong emphasis on open science engineering techniques. And so we expect these tools to be engineered so that they're modular and that they're interoperable and connectable in more powerful ways. I think we'll be requiring a higher level of engineering in how they're developed. So representation, it's a lower-level concept and it can be isolated; it can be the foundation for really, really diverse types of tools. And ideally that's how it would work. And then we envision applications being built on top of that that would solve the last-mile problem of going from that core representation to users, to open up this area in ways where it hasn't really been before. Does that make sense? It makes sense. I'm just putting on my kind of investigator hat and you can imagine people really trying to figure it out. I just think you're gonna have to have a very clear description about how each group of grants is gonna be reviewed. Because frankly, that's what people are gonna focus in on. And so I would just be pretty clear in the RFA on what are the clear components of review for people to figure out which one they should apply to. Because a lot of the representation is for end users. And so I mean, I understand the differences. I just think the wording's gonna have to be pretty clear or you're gonna get lots of emails from people trying to sort it out. Thank you. Jay, did you have a comment? Yeah, so I love the concept and on the whole I love the way you've structured it.
I think this is overall terrific. I just had a couple of suggestions or comments. So one would be in the tech dev category, right? So I mean, like you're saying, it easily can be latched on to the sequencing tech dev program, but it still feels like, in the spirit of what you're trying to do in this RFA, there's still a big gap here, which is that with long reads you can get at quote-unquote unmappable sequences, but we still can't do the same thing for epigenetic signatures like histone marks and methylation and so on and so forth. And so there are still pretty big gaps, and there isn't yet any cell line where we have even a basic set of histone marks across the full sequence. And obviously the technology is not there yet, but you could imagine lumping that in as another kind of activity in the fourth one. And then I don't know, you said it was aiming for 350, is that right? For the... Yeah, it's 300 to 350, it's hard to tell. And that... The thing, I mean, all the major primates are being done this way already, correct? Is that a... Yeah, so what's being done this way, the two applications that I mentioned already that have been supplemented, together they're working on what will ultimately be 15 or 16 new human assemblies. My suspicion is, I don't know for sure, my suspicion is some of those are gonna be higher quality than this particular application. They're humans or primates? Human. And then there's about five or six primate genomes, non-human primate genomes, in there. And yeah, so that's what's kind of... At least we're thinking about whether devoting some of those 350 to... It seems like getting a diverse... Obviously getting diverse humans, more diverse than 1000 Genomes, is something you've already thought about and are encouraging, but also knocking out a few primates while you're at it probably would help. And then if there's some rational way of thinking about that.
And then the other one would be, I know there have been discussions in ENCODE about how to get high-quality sequences for the major cell lines. And it seems like, once you've already built this capacity and the informatics and everything, you might as well knock out all the... Devote 5% of that capacity to clicking off all those major cell lines as well at that same quality, but... Yeah, I mean, we certainly could. How are the cell lines consented? Do you mean for inclusion in references or do you mean because it's just very high value for that project? I mean inclusion in references and because they're very high value. I assume that everything that ENCODE is doing has been consented enough that we're doing it all through. Okay, I'll check with my colleagues. Yeah. Any other comments or questions? Are there any application restrictions? Could one PI apply to all five of these? Yeah, I think they could. I mean, I don't see any reason to limit institutions to just one of these. The awarded multiple, that does apply. I don't think there's anything that restricts that from happening. Would that be advantageous to the program overall or not? There are levels of consideration here without applying it to this particular circumstance yet. We have the option of saying we wouldn't make two awards to the same institution. We could just say we could, but they would have to be demonstrably independent of each other and things like that. Do you, it sounds like you may have a concern with the idea of the same institution getting two of these. Is there any particular? Okay. The original question wasn't about an institution. It was PI, right? So I think that's what you were asked. Not that an institution couldn't submit, but could the same person submit? So in the high-quality sequencing that was already done that ended in 2018, what were the numbers of genomes that were done and what was the cost?
Right, so when the very high-quality genomes were started, the aim was not for maximum efficiency for utility in a reference. The aim was for perfection. As close to perfection as current methods would get. So those platinum genomes initially were over $100,000 each, as compared to, at the time that that was initiated, on the order of a couple thousand dollars or $3,000 for a short-read, good-quality assembly. Now that's of course much cheaper. The basis that I am using to estimate what component two needs is my current understanding of a reasonably high-quality reference that uses long reads. And that's about $20,000 to $50,000 total cost per genome. Depending on who you ask and depending on exactly what goes into it. So that's where I think things are. I do think for this there will need to be an initial discussion about specifications. There are discussions about specs going on in other programs. So the Genome 10K Vertebrate Genomes Project has already gone through them. They developed a specification for their references. So something like that will have to begin this program. You guys are getting soft. Any other questions for Adam? Okay, can I get a motion to approve the concept, and a second? All in favor? Any opposed? Any abstentions? Thank you very much and thank you Adam. Thank you. Jerry.