So I'll blow this up. I tried to make life easier by drafting the report already, or at least starting to. So here are the participants of our group. We also had a lively series of discussions around this, and I've hopefully, along with Scott's notes and Jeff's notes and a couple of iterations back and forth with Scott, captured what we were trying to do. So the goal, and I think Les kind of put it out there, and I like this, is to change the practice of medicine to a point where a whole genome sequence can be routinely ordered for a patient and used to improve their health care. So what would the community agree is a reasonable set of data they want to accurately interpret from a genome and be able to act on, so that it's useful for patient care? That's the fundamental broad question. But before WGS can be accepted as a legitimate clinical test, we first need to develop enough data to convince people that this needs to be in the clinic. I think the discussion we were just having illustrates those issues. So the needs, in order to accomplish this, are a set of standards and best practices. I don't know what happened to the numbers here, but: best practices for analyses, which may change over time; alignment issues; representation of the different major ethnic groups; de novo sequencing and assembly; accuracy of calls for SNPs, indels, CNVs, et cetera, so we know what we find with different search strategies and what to do with that information. Well-defined clinical phenotypes; we're calling these the Tiffany standards, you know, like the MIAME standards. Those of you who were in the Tiffany room yesterday will appreciate that it's a low bar we can accomplish: to have equivalent standards for phenotypes and genotype annotations. What is the baseline we need to have? We think that's an important question. And then, what are the annotation strategies to meet the Tiffany standards and beyond?
We need to layer different categories of data, SNP data, standard phenotypes from eMERGE and eMERGE-like programs, from EHRs into the genomic sequence. Easy to say; how do we do that? There are major issues to think about around that. Variants of unknown significance: what does that term actually mean? Perhaps it's the wrong term. Maybe it should be "undefined" or "not defined" or something else, because "variant of unknown significance" has a very different meaning depending on which area you come from, and I think this last discussion illustrates that. I think the clinical geneticists would have a different view around implementation of variants of unknown significance. Okay. So, mission critical: more and more labs are going to be rolling out whole genome sequencing and exome sequencing, so there is a critical need to define standards and meaningful clinical reports. Failure to do so is going to drive up costs. Here's a classic example: in the interim, there's going to be pressure to run expensive clinical tests to follow up genome-wide sequencing studies and exomes. So, for example, somebody shows up in your office, you're a GI doc, and they say they have a SNP for Crohn's disease and they're very worried about Crohn's disease. Are you going to end up doing a scope? How do we get around that being an issue? Because that's going to be a major pressure point. So the group came up with, under, I think, the initial leadership of David Ledbetter, a grand vision. We said 100,000; maybe that's the wrong number. But: sequencing 100,000 patients with detailed electronic medical records to build a comprehensive data set of variants, phenotype annotations, and critical information about incidental findings. This would be analogous to the initial goal of sequencing the human genome.
Therefore, there need to be some key pilot projects established to set specific needs and milestones for this project. A white paper could potentially be developed this year to begin the dialogue. Should it be more grand? Should it be all children with birth defects, or simply all children, or the entire human planet? We could decide what we want to do with that. So, considerations to think about as we go forward. Part of the question of getting whole genome sequencing into routine testing is figuring out how to fund it. Insurance, patients, and research are the mix of funding sources we now use. Sequencing costs will drop, and if the insurer or the patient pays for the sequence, it potentially becomes free for research. It's an interesting dynamic to think about. So we could use research dollars now to build the infrastructure to enable it. And pilot projects, such as we've just been discussing for disease-specific questions, would also fit into positioning the best approach for the grand vision. There's a list of pilot projects; I'll list them and then break them out. I don't know what happened to all the numbering, I guess I cut it off, but anyway. Pilot project one is analytical best practices. Two is a wet lab bake-off. Three is an improved reference set for clinical analysis. Four is establishing minimum standards for genomic and clinical phenotyping data. Five is working with NIST on developing standard genome types and creating a central repository. So let's break these down into potential goals and projects. Pilot project one: analytical best practices. The goal would be to develop a set of standards and software tools, analysis pipelines, for clinical analysis, and to do this now; many groups are in the process of doing this. So we thought ten genome sequences would be a great starting point. These would be already sequenced.
We then provide a set of reference genomes to compare to, the idea being that it's not about how many genomes you have, but how good your software is with the genomes that are available. Coverage, how much is covered, all those things would be standardized. There'd be a collection of defined clinical phenotypes: six genomes with known variants that cause disease, and four genomes that have been analyzed but where no cause of disease has been found. The exercise would be that interested groups get the ten genomes and the reference genomes and conduct their own analysis, Debbie's term of a bake-off. This has been done in other groups; the statistical geneticists have done things like this over the years. The strategy then would be to have a meeting to compare results and look at the lessons learned. What did these different software packages come up with? What did we learn from this? How do we deal with some fundamental questions, like: if you see a variant in one caller but not in the other, which one is real? These are fundamental questions we need to address. The deliverables would be a standard reference set for testing and benchmarking future new tools, and answers to some fundamental questions around the coverage needed. It could develop analytical guidelines around best practices for the validation of a clinical tool, and it may actually provide some potentially novel insights. The other goal is to have a wet lab bake-off. This is really to take a look at some issues around sequencing strategies. There are different views on how you should do that. Do you do short reads combined with long reads or clone reads? How deep do you go? These are fundamental questions that really need to be addressed. The idea then would be, again, to get ten genomes that have been consented. It would be ideal, I would argue,
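To make the bake-off mechanics concrete, here is a minimal Python sketch of the two comparisons described above: benchmarking a call set against the knowns, and asking which variants one caller sees but another doesn't. Reducing a variant to a (chrom, pos, ref, alt) tuple is an illustrative simplification, not a prescribed format, and all the names and example variants are hypothetical.

```python
# Hypothetical sketch: comparing two pipelines' variant calls against a
# known truth set, in the spirit of the proposed analytical bake-off.

def benchmark(calls, truth):
    """Return (sensitivity, precision) of a call set vs. a truth set."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)  # variants that were both called and real
    sensitivity = tp / len(truth) if truth else 0.0
    precision = tp / len(calls) if calls else 0.0
    return sensitivity, precision

def discordant(calls_a, calls_b):
    """Variants seen by one caller but not the other: 'which one is real?'"""
    a, b = set(calls_a), set(calls_b)
    return a - b, b - a

# Toy example: two callers and a small truth set (made-up coordinates)
truth = {("chr7", 117559590, "A", "G"), ("chr11", 5227002, "T", "A")}
caller1 = {("chr7", 117559590, "A", "G"), ("chr2", 100, "C", "T")}
caller2 = {("chr7", 117559590, "A", "G"), ("chr11", 5227002, "T", "A")}

only_1, only_2 = discordant(caller1, caller2)
sens, prec = benchmark(caller1, truth)  # caller1 finds 1 of 2 knowns
```

In a real bake-off each group would of course start from raw reads and report calls in an agreed exchange format; the point here is only that the meeting's comparisons reduce to set operations once call representations are standardized.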
Ideally it would be the same ten genomes that are done on the analytical side, but it doesn't have to be. The goal would be for interested groups to use their platform or strategy to analyze these same genomes. Fundamentally, I think this could be an annual meeting asking, where are you? Again, this comes back to what's been done in statistical genetics groups, where you say, hey, how good is your stuff? Deliverables would be coverage needs by platform. There are plenty of different technologies; we've talked about Illumina and Complete Genomics, but there's a large number of other platforms out there. This would provide real knowledge about how to make that comparison. You think your technology is fantastic? Great. How does it compare? And then again, look at best practices for sequence generation; the initial calling algorithms and so forth are again major issues we have to think about. The next one would be an improved reference set for clinical analysis. The goals here would be to create a better reference set of genomes and phenotypes; this is one of the pilot projects for the grand vision. The notion would be to have 500 genomes with detailed electronic health records. The ideal, and these are numbers that were just thrown out there, but I think at a level we can discuss, would be 100 from each of the five major continents, with the idea being that much of the reference data we need depends on where people are from. And if we're going to practice medicine better, we need to have those reference resources, both genotype and phenotype. We're going to have to understand the rare variants, the carriers, and the common known variants in each of these different subgroups.
And subsets. So the strategy: what else do we need from these 100 in each of the five major continents? Some known rare variants and carriers, and I didn't explain that well; the point is to have something where we know what there is to find. That gives the analytical groups an idea of whether they can actually find a CFTR carrier, a sickle cell carrier, or some of these others. We also think there should be some extremes of common clinical phenotypes. I picked blood glucose; it could be anything, I just threw that out there. If you had the extremes of the distribution, can you now find something around those? Potential sources of samples: eMERGE and eMERGE-like programs, George Church's Personal Genome Project, other existing cohorts, dbGaP. I don't want to specify, but the goal would be to do this relatively quickly and not spend the next five years trying to ascertain patients. I think we really have to be realistic: this is coming quickly, and the slower we are, the further behind we are. Does this need a grant mechanism? I don't know; it's probably big enough that we need to think about that. Again, there'd be an annual meeting to compare results and advance best practices. Deliverables are similar to what was said before, with the additional leverage of the different reference populations: standard reference sets for testing and benchmarking future new tools, the coverage needed, analytical guidelines for the different human populations, best practices for validation of a clinical tool, potentially novel insights, and knowledge about how to think about the grand vision. So when we say we need to sequence 100,000, is it really 100,000? What is that number? I think we need some baseline around that. Project four, and now the numbers come back: establish minimum standards for genomic and clinical phenotyping data.
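As an illustration of the "extremes of the distribution" idea, here is a toy Python sketch of selecting the tails of a quantitative trait such as blood glucose for a reference set. The cohort data, field names, and tail cutoff are entirely hypothetical.

```python
# Illustrative sketch: pick subjects at the low and high extremes of a
# quantitative phenotype for inclusion in a reference set.

def phenotype_extremes(cohort, trait, tail_fraction=0.10):
    """Return (low_tail_ids, high_tail_ids) for a quantitative trait."""
    ranked = sorted(cohort, key=lambda subj: subj[trait])
    k = max(1, int(len(ranked) * tail_fraction))  # subjects per tail
    low = [subj["id"] for subj in ranked[:k]]
    high = [subj["id"] for subj in ranked[-k:]]
    return low, high

# Toy cohort: fasting glucose in mg/dL (made-up values)
cohort = [
    {"id": "S1", "glucose": 72},
    {"id": "S2", "glucose": 95},
    {"id": "S3", "glucose": 101},
    {"id": "S4", "glucose": 88},
    {"id": "S5", "glucose": 180},
]
low, high = phenotype_extremes(cohort, "glucose", tail_fraction=0.2)
```

In practice the trait values would come from the electronic health record, which is exactly why the phenotype side of the reference set needs the same standardization as the genotype side.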
So here the goal would be to define minimum standards for data annotation of genomic and phenotype data. That sounds pretty easy, but try to come up with what the basic level of standards should be. This is something that was done for microarrays, and it really helped improve how we exchange knowledge. It really helps the programmers think about what information they have to make sure is being maintained. These are critical issues. So the needs would be a work group for genome annotation minimum standards and a work group for phenotype annotation minimum standards; I think those are two different cultures, and there needs to be some haggling there first. Then maybe a joint group to align the genotype and phenotype minimum standards is one potential approach. How would we do this? It could be done through meetings and/or conference calls. The deliverables would be some position papers, or a paper around this, as well as the beginnings of a variant catalog for clinical reference. There's ClinVar, and there's a variety of other sources, but it would be great if we could standardize those around these basic minimum standards so we know what we're sharing and how we're comparing. Number five is to work with NIST, the National Institute of Standards and Technology. These are the people that tell you how long a yard, a meter, an inch is, and they're working on standard genome types. We would like to make sure that whatever NIST establishes as a standard fits into what we think are the standards against which we want to make comparisons. That's just something that needs to be worked at; we don't know enough about this yet, but the goal would be to make sure that if we do this pilot or grand vision, the NIST standards are part of it at the end of the day. I think it would be a mistake not to do that.
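Purely as a thought experiment, a minimum-standard variant annotation record might be sketched like this in Python, in the spirit of the MIAME checklist for microarrays. The field set, names, and example values are hypothetical illustrations, not a proposed standard.

```python
# Hypothetical sketch of a minimum-standard variant annotation record.
# Every field here is an assumption for illustration; an actual standard
# would come out of the proposed work groups.

from dataclasses import dataclass, field

@dataclass
class VariantRecord:
    assembly: str          # reference build, e.g. "GRCh37"
    chrom: str
    pos: int               # 1-based position on the reference
    ref: str
    alt: str
    significance: str      # e.g. "pathogenic", "benign", "not defined"
    evidence: list = field(default_factory=list)   # supporting citations
    phenotypes: list = field(default_factory=list) # coded phenotype terms

    def validate(self):
        """Enforce the minimum standard: all required fields present."""
        required = [self.assembly, self.chrom, self.ref, self.alt,
                    self.significance]
        return all(required) and self.pos > 0

# Toy record with made-up evidence and phenotype strings
rec = VariantRecord("GRCh37", "chr7", 117559590, "CTT", "C",
                    "pathogenic", evidence=["example-citation"],
                    phenotypes=["cystic fibrosis"])
ok = rec.validate()
```

The value of a checklist like this is less the code than the agreement: once the required fields are fixed, every catalog, ClinVar included, can be compared on whether its records carry them.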
Create a central repository of whole genome sequences for the clinical labs to compare against. A goal would be a short-term solution: this group or a subset of it, and as Eric Topol said, he's willing to share his genomes. I think this is something we can do quickly if we want to. To make this happen, there would need to be data sharing agreements and MOUs establishing what that is, data security, data sites, how we would do this, and rules for data use. I think this could be done very, very quickly with a concerted group of dedicated individuals and wouldn't require a lot of resources. We need to share data somehow. Long-term, obviously, there's the pan-NIH project that Eric Green mentioned earlier today. I don't know the details, but that's likely a longer-term effort; I think we could jump-start some of these. So hopefully I've captured those discussions and distilled them into something you all agree with. Can I start with the people from the group? Did I capture that accurately? Okay. Questions? So, Howard, I just wondered if you or the group were aware of Hakon Hakonarson's program at CHOP, where they have 100,000 kids and 20,000 parents. They've got GWAS on all of them, and now they have 10 HiSeqs with BGI and they're going to sequence all of them; they're tooling up pretty fast. What about working with that resource to potentially accelerate this whole thing? Sure. Absolutely. That's why I didn't put a statement on what the populations need to be or where they come from; I think that's a discussion point for what the best sources would be. And the timing. You know, 10 HiSeqs is a lot, but 100,000 people is a lot.
So that actually leads into my point, which is that one of the lessons we learned in eMERGE is that all subjects and all electronic health records are not created equal. The oldest subjects have actually been informative for virtually all the phenotypes across the sites, whereas the younger people haven't had time to develop the adult-onset diseases, so different cohorts had different utilities. If you're going to focus on children, you're just not going to get those phenotypes. That's something to think about. The other thing is that electronic health records from places like Kaiser and HMOs, where they're capturing all the medications the patient has taken and all the diagnoses, are valuable in a way that other records are not. We didn't even propose the University of Washington; we used Group Health, because we see people for a couple of weeks and then we don't see them for five years, so we don't capture most of their health care. Finding the electronic health records that actually capture most of the health care, and in particular the prescription drugs, is much more informative. That's a consideration when you're marrying those things, which we learned in eMERGE. Yeah, Howard, so during the discussion yesterday we talked, I think, more about ensuring that the different groups doing this came up with the same answer; we talked less about whether that answer was correct. I'm hoping that during the validation process people would make sure that was also a high priority. That's why we have six knowns: so we can actually benchmark against what is known, and the four unknowns would then be for discovery potential, yes. Okay, and the second point I wanted to make is that bake-offs are good and needed.
I just want to make sure that a project like this would take into account that if you do bake-offs, particularly with early technologies, you could squash any new technology coming into the field, so we have to be very, very careful about that. One of the challenges, of course, is that all these technologies are changing so fast; by the time you implement something, the company has a new kit available. Moreover, some of these are relatively mature, but we have new technologies emerging, and we don't want to eliminate the possibility for those to come on. Can Debbie answer that? I think the goal was to take a look at what the drivers are of what we can and can't see, and I think that will also lead to the development of new technologies and an understanding of how to drive the field of clinical genomics forward. I mean, this is a big application. We also thought it was really important; we talked about what would make an ideal electronic medical record, say, and what kinds of things you would look at for the genotype that you could link. So all of these things would be more integrated than they are currently. And I think, Jeff, to your point, a bake-off connotes a winner and a loser, and that's not the point. The point is a competition with the goal of knowledge, not of declaring a winner and a loser, so we can modify that language. It's really about creating an environment where there can be comparisons, so that new technologies and new strategies can come to bear but there's a way to benchmark them. So I understand your point, and we can fix that from being a winner-loser structure. Yeah. And as we move forward, obviously, we would have a really good knowledge base to compare new technologies against. But part of the point of this design was that we wanted to create a system where anybody in the room, or beyond, could have a set of standards and a sort of best practices to refer to.
Because if somebody got their genome done on the Illumina platform in a health care system in California and then moved to Boston, would it be a different kind of sequence that they would get in Boston? Hopefully not, and we wouldn't have to repeat it. So if we standardized things, and it's going to be a bit of a moving target, but the whole process, and I like Howard's analogy with the MIAME system, would be a set system that we all agreed on: what kind of medical records data would go in when you did a sequence, how much coverage you would have at minimum, what kind of quality you would have, and, you know, at certain positions maybe you would have to be able to show that there was an internal reference that you called accurately. These are things we can do as a group to facilitate a uniform approach down the road. So this is another follow-up on the MIAME comment. I think it's very helpful that you are involving NIST in this process, and I'm hoping it won't be an afterthought but will be there right at the beginning. Because since the MIAME days, there is a new layer of technology, and there is no convergence between technologies and databases. I'll just add another reference along that line, which is the data.gov initiative: all sorts of databases are now being integrated together, and there are new toys to play with. The other thing we talked about, and I'm not sure it's gotten enough attention, is defining what a perfect phenotype looks like. We touched on the idea that we at Vanderbilt, Josh Denny specifically, have been doing PheWAS based on ICD-9 codes, and we think we probably need a better definition of what a reference phenome should look like. I don't think we got much further than that. But what elements would you want to see in a standard electronic medical record? What kinds of things would you like to know people have and didn't have?
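The "how much coverage would you have, minimum" question can be made concrete with a small sketch. This is a hypothetical Python illustration of checking a sample against a minimum-coverage standard; the interval representation of aligned reads and the depth threshold are assumptions for the example, not a proposed rule.

```python
# Illustrative sketch: does a sample meet a minimum-coverage standard?
# Reads are simplified to (start, end) half-open intervals on one contig.

def coverage_stats(reads, target_start, target_end, min_depth=30):
    """Return (mean_depth, fraction of target bases at >= min_depth)."""
    depth = [0] * (target_end - target_start)
    for start, end in reads:
        for pos in range(max(start, target_start), min(end, target_end)):
            depth[pos - target_start] += 1
    n = len(depth)
    mean_depth = sum(depth) / n
    frac_at_min = sum(d >= min_depth for d in depth) / n
    return mean_depth, frac_at_min

# Toy target of 10 bases with three overlapping reads
reads = [(0, 10), (0, 5), (5, 10)]
mean_depth, frac = coverage_stats(reads, 0, 10, min_depth=2)
```

A real standard would work from alignment files rather than toy intervals, but the uniform-approach idea is the same: every lab reports the same statistics, computed the same way, against the same thresholds.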
And the "didn't have" is sometimes better than the "have". What phenotypes tend to travel together, the sort of HapMap for the phenome, those kinds of issues probably need to be addressed as well if we're going to do phenotype correlations. I wasn't involved in the session, but we had kind of a dinner conversation about sequencing, looking at training people to help with the interpretation of all this data. It's an issue all the programs are running into: not having enough bioinformatics capacity. One suggestion that came up was to take genetic counselors and train them in this, and then we said, well, there aren't enough; genetic counselors are in short supply as it is. But all the genetic counseling programs do a good job of providing the genetics training, and their rate-limiting factor is that they don't have enough clinical sites. So if you created a companion track, where rather than doing clinical sites someone could do an informatics intensive using the same coursework, we thought that would be something to look at, to see if it's a viable possibility. Also, to take seriously, relating to the phenotyping issue: I think that a priori we're not going to be able to develop a perfect phenotype, for a variety of different reasons. One is that it will be hard to define; probably the more important reason is that the way phenotypes get entered is highly idiosyncratic, and it's very difficult to extract that. So I think we really need to take Les's points from his talk earlier very seriously: the idea that we shouldn't assume that when the phenotype goes in, that's what's going to remain there; that if we develop interesting associations, questions, variants, whatever, we have a mechanism by which our subjects are consented such that we can recontact them to enrich phenotypes in an iterative fashion.
I think that's the only way we're going to solve it, because we'll always want information that we don't have, no matter how much information we have on the front end. OK, I think this has been a great and helpful discussion, but we're running a little late and probably need to move on.