Well, good afternoon, everyone. Let me start off by welcoming you to the Francis Collins Genomics Reunion. Actually, it's wonderful: Francis's eyes lit up when he walked in the door, seeing all these colleagues that he's been working with for so many years. But more seriously, I wanted to give a little more specificity on what we're here to do at this trans-NIH workshop, which has been affectionately referred to as the data aggregation workshop, but has a more formal title that's shown here and on the materials that you were given in advance. Francis framed this with a series of issues and questions, but the challenges are the ones that I'm going to at least initially focus on. I think the major challenge at hand relates to data access. And in so many ways, it's because we're victims of our own success. What have we been so successful at? Well, we've become remarkably successful at data production, in particular DNA sequencing, genome sequencing. But I think we've also been highly successful at basically achieving one of our core values of open data release and data access. But that's a tough combination. If you're generating tons of data and you want everybody to access it and you want everybody releasing it, then if you don't have the systems infrastructure in place for everybody to be able to do something with all of that, you have a mismatch. And the mismatch that I hear about all the time, that Francis hears about all the time, is that we are in a situation of being data rich, but analysis poor.
And so what we have heard about in various venues, whether we go out and travel and talk to investigators, we go to meetings and hear this, or we hear this from advisory groups and so forth, is that there are great opportunities here, but the current systems are just cumbersome, and we're just too successful at this point without having a streamlined process and system in place that is going to allow us to achieve the full ability of having data available for lots of people to be able to analyze in the most effective and efficient way possible. So ideas that have come up and ideas that will be discussed in the next day and a half or so, of course, range from things such as building a research commons where individuals with appropriate approvals or certification would access all the data sets all at once. That's one possibility. Other possibilities would be centralized servers or data servers that would provide results and summaries to individual investigators, but not necessarily the underlying data. These different models will be discussed and debated, and each has various advantages and disadvantages. But the major challenge I think we're here to tackle has to do with data access. I think there are other challenges we want to contemplate. Obviously there are issues around approaches to variant calling, and clearly if we want to have pre-analyzed data available, we're going to have to harmonize our thinking around what approaches we're going to use for variant calling. Speaking of harmonization, there's going to be lots of discussion here, we expect, about harmonizing phenotypic and environmental data across different studies. This is a huge set of issues of great complexity. We're going to have to figure out what we're going to be able to do realistically and practically in the kinds of initiatives that one might launch coming out of a workshop like this. And of course we're going to have major computing discussions.
Just computing on very large data sets. This is just one component of sort of the classic big data problem that's being discussed in various ways in various settings in and around and outside of NIH. So these sets of challenges, and yet the exciting science that could be done that Francis framed, have led us to this workshop. So let me pause for a minute and just say a few things about this workshop. The first thing I want to point out is that this is not an NHGRI workshop. This is a trans-NIH workshop. We were asked to organize it, perhaps for obvious reasons, but I don't want you to think for a minute that we should be taking all the credit for it; by any means, we shouldn't. The reason Francis is here and the reason we were asked to do this is because there is a crying need for this across the NIH because, as Francis mentioned, virtually every institute and center is generating data of this type or else is supporting researchers who desperately want to utilize data of this type. And that is the reason why Francis provided funds for us to have this and also asked NHGRI along with an organizing committee to put it together. Specifically, Mike Boehnke and Wylie Burke are the co-chairs and we thank them for their willingness to co-chair this workshop. But David Altshuler and Paul Flicek had very important input to this and we thank them as well. There are 10 institutes and centers that have sent staff and are involved in this. Lisa Brooks and Adam Felsenfeld have done the micro-organizing in many ways necessary for this workshop. Teri Manolio did some early planning that led to this idea of having this workshop and then handed much of this over to Lisa and Adam for planning the actual workshop. But then here I list various other people from NIH institutes, including NHGRI and others, that have been involved in this, and we're very appreciative. But you can quickly see this is not just an NHGRI problem, it is truly a trans-NIH problem.
And also, let me just quickly thank, in terms of logistics, Nicholas Clem from the NHGRI staff and also Sandra Bloomberg from Capital Consulting, who helped deal with all the logistics associated with it. This workshop involves all of you. It's an all-star cast. I mean, I'm very appreciative and impressed with the people we've got to come here, in some cases from far distances. And long trips: Ewan tells me he has been up since 4 o'clock this morning, UK time, to make it. There are about 47 non-NIH people and about 40 NIH people who'll be popping in. And we deliberately aimed to get diverse expertise. So among the kinds of expertise that we wanted to include are ELSI and policy and, obviously, genomics and functional data, disease studies, GWAS and cohorts, database computation, data analysis, but also, relevant to this workshop and to a meeting we were at last week in Boston, drug development, individuals with that expertise. And we've included scientific publication. In other words, journal editors are here to help us because they play a role in some of this as well. So thanks again to all the participants and all of these individuals. What are the goals? What do we expect to try to accomplish? We want to discuss the scientific questions that analyses across these data sets could address. Francis gave us a few examples; there'll be others. Discuss the challenges to obtaining and analyzing across many data sets. We can see the motivation to do that. Discuss the options for dealing with these challenges, including costs and trade-offs. We want practical ideas. We want some sense of scale of what we might be contemplating doing. This is not a thought experiment. It may partly be a thought experiment, but this is actually something very practical. We could envision putting together a trans-NIH program to make this happen, still to be defined.
We're gonna have to have some sense of scale of what that's gonna look like, because the practical reality is that funding will have to be identified to make this a reality. And then of course we want you to recommend steps to address these challenges, again including some practical realities of what the scale might look like and what the phasing might look like, and from that we'll figure out how we might go about getting the money to make it a reality. What are the questions we are asking you to consider? How should we deal with data already collected? In other words, retrospective data. And then how should we, going forward, deal with data that is to be collected, prospective data? I imagine one would do something different if we know how it's gonna feed into some centralized system or some way of aggregating the data going forward. What options are the highest priorities? We can't do everything at once. We're gonna have to prioritize. What are the cost-benefit trade-offs? Again, we're gonna have to think about this. Money will be limiting, but we also want you to think audaciously to some extent so that we can think big, and then we'll have to scale it back as needed. What could be done now and what would require changes in policies or other work to be able to implement? So again, that's where the phasing and staging will come in. And we also wanna emphasize that we can't necessarily expect a 100% perfect solution for all of these very difficult problems. Even 90% solutions would be progress and would really open up the field in ways that Francis described and I think others of you can imagine. So again, we're looking for realistic practical solutions that don't have to be the perfect permanent solution, but wow, a lot certainly could be done to make those tens of thousands of genome sequences more readily accessible and analyzable to a larger fraction of the research community.
So with that, I just wanna end by offering you the ability to ask any questions either of Francis on behalf of NIH or of me specifically, either about what NHGRI is doing in this or about this workshop, before we start really getting to the data at hand, which I'll ask Adam Felsenfeld to come up and discuss. Lincoln. So I'll answer this first. I think the initial focus should be around genome sequencing data and associated data for genetic studies, and that would include potentially phenotypic data and so forth. I could imagine that could be the first foundation for going beyond that. I think, in my personal opinion, if you tried to bite all that off at once, we would break under our own weight. I think, well, epigenetics. To the extent that you want to broaden to phenotypes and environmental exposures and you happen to have that kind of data on the same individuals where you have whole genome and whole exome sequence and clinical information, it seems like it would be unfortunate not to figure out how to link it up, but I think the primary driver here is DNA. And epigenetics would be included in that. Epigenetic marks? Okay. Sure. But pretty soon you've got a lot of data there if you're not just talking about methylation but all those histone marks and transcription factor binding sites and woof. Other questions? Debbie? If you layered the DNA down, which is what you're suggesting, and then resources were provided over time at the NIH level with RFAs, et cetera, to put more and more data down in the database. I mean, starting with the DNA like you're saying is good, but thinking at the end, perhaps of the group, how to get more data sets integrated into the data that would have full access would be the way to, I think, look at that question.
No, I like that way of putting it, because clearly we don't want to encourage establishment of a database that is all sort of self-enclosed and cannot be layered on top of with additional information as it comes forth, because this ought to be a growing entity. It's the staging and phasing part that I was describing. Anyone else? Okay, yeah, Aravinda? It's not so much a question, but I wanted both of you to sort of emphasize the following. You said at the end that getting 90% of the way there would be great. Since I envision many of the problems that we'd have to solve, technical things aside, would be cultural, about people really being comfortable about sharing data and other matters, I would think even getting half the way would be great. And so if we can use that as an operating principle of what we can get done by the end of whatever year we count, say in one year's time, two years' time, three years' time, I think that could be tremendously helpful in people getting confidence that this thing is working towards this goal. I mean, as I said, less-than-perfect solutions would be wonderful. That was the point of the 90%, because you're right, even a 70% solution for some of these issues would be wonderful. But 1% would not be too different. Let's think highly of it. Yes, Debbie? I just think that focusing on what would be of the greatest utility to the NIH community as a whole, for example, I get approached every day for controls for exome sequencing, right? I mean, if we could focus down on sets that we could, I don't wanna put a percentage on this, but if we could focus down on sets that would be eminently usable by every member of the community, including myself, it would be great. And I think that should be a goal. I completely agree with you. I think Adam will start to paint that picture to some extent, maybe not, but I think he will.
Only because when you start to hear about data that is absolutely being generated, individuals like you, but lots of individuals, say, well, if all that was in the same place, it could be a control data set for the things I'm doing and so forth. So, absolutely. Well, people approach me all the time and expect that I can provide these controls and that they're in dbGaP, but putting together many different studies where they've generated control data sets that could be used by other people as controls because they have phenotypes was the kind of thing I was thinking about. And what I can tell you is in the past year and a half, NHGRI has had multiple workshops on multiple topics where that same request comes up, and it could be some of the most basic science topics or the most clinically applied. They all say the same thing. Where are these controls? Where are they centralized? Can we access them? Why is it not all in one place? Yes, David. So, Eric, one of the things that you put on your slide that I thought was really important was making a distinction between retrospective and going forward. Five years from now isn't that far away, and we're in a very different world than we were even five years ago. So I would just make the plea that we don't confuse those two things, because we may be quite limited in some of the things that we have retrospectively, but if in five years we have those same limitations, that's our fault. No, I completely agree. And the two numbers to be keeping in mind, I mean, there's a lot of data that has been generated and we'll spend some time figuring out how do we get our arms around it and pull it back in, but that has different connotations than all the data that's going to be generated in the next year or two, based on the surveys we've done of institutes and grants and so forth, and maybe those are gonna have slightly different solutions in the near term.
Okay, well thank you all and I will turn this over to Adam.