So this may be my only opportunity to use the gavel. I might take one of those home with me, see if it has any effect on my daughters. So we're back to the home stretch of the open session, and we're going to begin with a report on the genome sequencing program. This is actually something that was requested by the council during the council-initiated discussion in February. And it's going to be a team report. I think Adam, is Adam in the room? There's Adam. Okay. So Adam's going to lead off with his report on the GSP, and then Chris Wellington will step in as well. So. Yeah, thank you, Rudy. So good afternoon, everyone, and yes, I'm going to tag-team this with Chris. Just a quick update on the genome sequencing program. Some orientation: I know Eric talked a little bit about this in his director's report. The genome sequencing program has four major elements, together situated between biology of genomes and biology of disease on our current strategic scheme. They consist of the following four components. The Centers for Mendelian Genomics are aimed at discovering as many causal Mendelian variants as possible, and Chris will pick up on this in a few minutes. The Centers for Common Disease Genomics, or CCDGs, aim to develop a paradigm, for any common disease, for the comprehensive discovery of genes and other elements that affect risk: alleles both risk-raising and protective, coding and non-coding. They also have an aim to improve analytic methods and technology for all of them. There are also the GSP analysis centers. Most of the talk today is going to be about the first two components, but I have to include something about the GSP analysis centers, whose aim is to develop and apply analysis methods to improve our ability to identify variant associations and to help develop a set of common controls based on the data produced by the centers.
But one of the things that I don't think I anticipated as much when they started, and that they are doing, is that they sort of force the issue on several important areas, for example the provision, harmonization, and analysis of data across the consortia, and show what can be done early and reflect that back to the entire consortium. I should say the analysis centers are pretty diverse in what they do; their analyses include, for example, methods development to use population admixture information in analysis, and also the use of functional data to help boost power to recognize non-coding variants. There is also the coordinating center, which tracks progress, helps with data storage for the consortium, and helps lead and rationalize consortium policy development, so data access and publications, for example. They spearhead the common controls effort. They also help with logistics, and under that "et cetera" is a laundry list, as long as my arm, of things that they help with. The structure is fairly typical for this kind of consortium. It is a little bit baroque, but it's flexible and responsive to where the consortium wants to go. There are disease working groups; some of these I'm going to point out because I'll talk about them later today. These are the disease working groups for the CCDGs. There's the data pipeline analysis and standardization working group, the data flow working group, the methods working group, and, over in the CMGs, a number of working groups that maybe Chris will talk about. We are fortunate to have a number of sources of co-funding. Much of that is from NHLBI; there's a lot of collaboration with the Trans-Omics for Precision Medicine effort, or TOPMed. Some is from NIA, some from NIMH, and some from NEI. There's also a lot of what I call indirect co-funding from several sources.
That's when another entity is basically also interested in funding sequencing of the same phenotypes, and sometimes of the exact same cohorts, and it's great to be able to mix the data when we can, to do cross-analysis when we can. And just before I continue, I want to say that both the CMGs' and the CCDGs' work would be impossible without extensively characterized samples, for the CMGs at the patient and family level and for the CCDGs at the large cohort level, funded and worked on by many completely outside of the program. So, program timeline: the current iteration of the program was started in 2016. It ends at the end of 2019, roughly; you can see where we are here in May. Most of what we're going to talk about today for the Centers for Common Disease Genomics, the GSP analysis centers, and the GSP coordinating center, these either changed their mission or were completely new two years ago, so most of what I'm going to talk about later is about progress in the last two years. The Centers for Mendelian Genomics are in their second round with roughly the same mission, so that'll cover six years. So I'll leave it here for Chris. Thank you, Adam. So I'll give a brief update on progress from the CMGs. As Adam just said, we are currently in the second iteration of the program. The first phase was funded in late 2011 with three centers, one at Baylor-Hopkins, one at the University of Washington, and one at Yale. All three of those successfully re-competed for phase two, and we were also able to add a fourth at the Broad. And as Adam also mentioned, the coordinating center has been instrumental in the second phase since they came online. So back in 2011, the framework for launching the CMGs, the overall purpose, was really to see if it was feasible to do Mendelian gene discovery at large scale, and if so, to see if we could use that as a lens for insight into gene function and disease architecture.
The specific goals of the program were first to discover genes underlying Mendelian phenotypes, as many as possible; to develop strategies and tools for effective discovery; and then to disseminate the findings of the project and collaborate. So I'm just going to briefly walk through progress on each one of those. Again, when we launched this project, we weren't sure how this would work, so one of the important things was to define success. We actually defined two different tiers of discovery. The higher-confidence tier one discoveries are where there are multiple lines of evidence: either multiple families, or one family plus model organism data or functional data. Tier two is where the discovery was only in a single family. Starting in the first phase of the project, things were slow for the first year, but by the end there were over a thousand discoveries, which made us confident in renewing the program. Since the renewal, the discovery has continued apace. And yes, we see that discontinuity; that's actually an interesting topic on its own. Those were a couple of fairly large collections of samples with similar phenotypes that ended up resolving to a large number of underlying Mendelian causes, so interesting in its own right. You'll also notice that there are about 1,000 tier two discoveries at present, and obviously we'd like to be able to move these to tier one. These are rare diseases; more samples are hard to come by. So model organisms are a good approach for that, in some cases. On the number of discoveries: a discovery could be an association between a gene and a phenotype. If there are multiple genes underlying similar phenotypes, that could count as multiple discoveries. So this is not a count of unique genes or a count of unique phenotypes, but of combinations of the two. So model organisms are one approach. We don't directly fund the CMGs for that.
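The two evidence tiers described above amount to a simple classification rule. As a minimal sketch (the function and argument names here are hypothetical illustrations, not from any actual CMG pipeline):

```python
def discovery_tier(n_families, has_model_organism_data=False, has_functional_data=False):
    """Assign an evidence tier to a candidate gene-phenotype discovery.

    Tier 1 (higher confidence): multiple independent families, or a single
    family corroborated by model-organism or functional data.
    Tier 2: evidence from a single family only.
    """
    if n_families > 1:
        return 1
    if n_families == 1 and (has_model_organism_data or has_functional_data):
        return 1
    return 2

# Illustrative calls:
discovery_tier(3)                            # multiple families -> tier 1
discovery_tier(1, has_functional_data=True)  # one family + function -> tier 1
discovery_tier(1)                            # single family only -> tier 2
```

This also makes clear why model-organism collaborations matter to the program: corroborating mouse data is one of the routes by which a tier two discovery can be promoted to tier one.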
So actually, about a year and a half ago, they started a collaboration with KOMP, the Knockout Mouse Project, which aims to make a comprehensive public resource of mice with a null mutation in every gene. The CMGs share candidate variants with KOMP, and KOMP is able to prioritize some of the mouse orthologs. It ends up being a nice example of collaboration between two of our resources. At the same time, it's not that each of these discoveries is a simple gene that can be knocked out and recreated in the mouse in a super straightforward manner. So just a brief example of something that's a little more complicated. This was an example of two-locus inheritance involving SMAD6 in craniosynostosis, from the Yale center. Here they found a loss-of-function variant in SMAD6 strongly associated with the phenotype, but the penetrance was only about 60%, which would exclude it from Mendelian discovery in a number of analysis pipelines. But instead, looking at some of the GWAS data that was collected previously, they saw a risk allele near BMP2, and when looking at the combination of the two, they got full penetrance in subjects who had both the null mutations in SMAD6 and the risk allele near BMP2. So again, just an example of some of the slightly more nuanced discoveries that also come out of the CMGs. The second major goal is developing strategies and tools. Here, just a couple of very high-level things. One strategic point is the value of high-quality whole exome sequence for Mendelian discovery. Obviously, this is still less expensive than whole genome; at some point it won't be, but for now the CMGs have found this very useful. Another is that they have a number of strategies for doing analysis depending on sample availability, really recognizing that you can still do discovery even with a single case. Obviously, you'd rather have trios, maybe, or a larger number of cases; the CMGs have approaches for all of those.
A number of tools have come out of the CMGs; a couple of high-level ones are listed there. Of particular note, three of the Matchmaker Exchange nodes, Matchbox, MyGene2, and GeneMatcher, are all directly associated with CMGs. So finally, dissemination and collaboration. Obviously, everything I just talked about is freely available. The CMGs also share, on a pre-publication basis, the phenotypes that they're going to be working on, so others with samples can see that. The candidate variants that are identified, again, help others in the field. They offer some courses on analysis and other training opportunities. They've done some valuable patient-facing collaborations, engaging either support groups or social media. And they're members of the International Rare Disease Research Consortium. So now, just very briefly: you've seen where we are. What are we looking forward to in the rest of the program? A constant question that always comes up is how much remains to be done. I want to use two different lines of evidence here to suggest that we are nowhere near complete. The first is KOMP, which I mentioned earlier. With the careful phenotyping pipelines KOMP has, they actually see, usually, more than one phenotype in about 80% of their knockouts; that includes embryonic and early lethality as phenotypes. And at present, about 20% of human genes have been implicated in Mendelian phenotypes. You can argue around the edges of those numbers, but clearly there's a large gap there. And earlier, we mentioned that the rate of discoveries is continuing consistently, so we would also say on that basis it seems there's still much to be done. So it's probably not surprising to hear that the future plans, and what we heard from the external scientific panel at the in-person meeting of the genome sequencing program, were really first and foremost to continue what the CMGs are doing well: this effective, efficient discovery of variants underlying Mendelian disease.
We're also looking at a couple of other things, such as understanding and improving the solve rates. At present, about half of the phenotypes that come in are solved; it's a little tricky to define that. And so we're looking at adding whole genome sequence data and RNA-seq, to try to understand the value proposition there. There have been some success cases with both of those, but we don't have enough data to know if it's actually an efficient use of resources. We're also looking at what we can do by aggregating similar phenotypes across the CMGs. And finally, we want to keep an eye toward the bigger picture from when we launched the program, of using this as a lens to better understand human variation. For instance, you can imagine the impact of de novo variants; there are many trios sequenced under the CMGs, and some interesting potential there. And I mentioned an interaction between a rare and a common variant earlier. There are also some cases where we have Mendelian phenotypes that look a lot like common diseases, and we're looking at the contribution of Mendelian variation to that. So on that note, I'll turn it back over to Adam. So, back to the Centers for Common Disease Genomics. You saw that for Mendelian disease caused by rare variants of very strong effect, there are now very many examples of finding the responsible variants. The success rate at the CMGs is approaching 50 percent, and much of the time, I would say, we know what we're doing. But for common disease, I'm much less sure. I think a lot of the time we still don't know what we're doing, and I'll return to this point at the end of the presentation. First, I'm going to go through some basic progress, just some numbers. We are at about 58,000 genomes and 43,000 exomes, headed by the end of the program to about 100,000 whole genomes and about 125,000 or more whole exomes.
This is a little bit higher than what Eric showed in his slides, and that's because these numbers are actually capacity that is spoken for plus capacity that's not yet spoken for. Other consortium progress: we already had the first data freeze of 22,000 samples; that was last year. Freeze two starts in June, with all the whole genomes. In case you're not familiar with the jargon, a data freeze is just a convenient way to have a common data set to analyze across the consortium. These were all processed through the harmonized data processing pipeline that was developed in the first year. Just a brief look at ancestry. These are samples received; this is the best I could get right now. It's a little bit of a proxy for samples sequenced, but it's where we're headed. And the main point here is that, even including the green wedge of samples that we don't know the ancestry for, just over half the samples are of non-European ancestry. This is just breaking out some of the numbers in detail, and I'm not asking you to look at all the numbers, but I just want to make a couple of points. First of all, the sequencing is spread out over eight diseases. You'll actually see 10 different columns, but really there are eight different disease phenotypes here that are part of the consortium. And that means it's taking a while to build up sufficient numbers to be able to do analyses, and these analyses are just starting. The other point I want to make, and I'll show a simplified version of this in a second, is that there are three umbrella disease working groups: the immune-mediated, cardiovascular, and neuropsychiatric working groups.
The immune-mediated working group is thinking about and coordinating work on type 1 diabetes, asthma, and inflammatory bowel disease; the cardiovascular working group on early-onset coronary artery disease, early-onset atrial fibrillation, and hemorrhagic stroke; and the neuropsychiatric working group on epilepsy and autism. I just want to show a couple of vignettes, one from each of these working groups, that were presented at the recent meeting. So the immune-mediated working group presented work on type 1 diabetes. This was from about 3,000 samples from the Diabetes Genetics Consortium, which includes cases and controls of African, Hispanic, and Asian ancestry. There are 58 previously known GWAS loci in European ancestry samples. This study replicated some of those loci and found four new alleles at known loci. They also found that at several known loci, variants associated in European ancestry populations are not observed in African ancestry populations, with novel associations a little bit of a distance away from those. This is evidence for population differences. And the cardiovascular working group presented some results on early-onset MI, looking at cases from the CCDG effort and controls from the TOPMed effort. They did whole genome sequencing to ascertain polygenic risk scores for early MI, and about 17% of cases had high polygenic risk but no other distinguishing factors. So, for example, they didn't have evident bad lipids, or worse lipids than controls. They had risk equivalent to individuals with a rare, strong LDL receptor variant for familial hypercholesterolemia. I like this for a couple of reasons. First, it shows an application of the CCDG data that I certainly didn't anticipate at the outset of what I thought was a discovery effort. This is a little bit closer to the clinic, and I think that's cool. Second, it highlights a point about the importance of genomic architecture.
There's a strong monogenic component here and also a polygenic component of what we call the same disease. The neuropsychiatric working group, looking at about 12,000 quads from the Simons Foundation samples together with some existing data, found likely damaging exonic alleles in 124 previously known genes plus nine novel genes. These implicate synaptic activity and chromatin modification pathways, and they showed evidence for a role of non-coding variants and estimates of their contribution. So there are other kinds of analyses going on in the CCDGs, together with the analysis centers. We've seen a few that are the sort of expected kinds of variant discovery studies for each disease, but they also include analyses that can be done across datasets: for example, on basic genome biology, like structural variant discovery and ancestry and LD studies; work on study design improvements; methods development; methods comparisons; and development of secondary resources. There are a lot of other activities going on. As I told you, the standardized data processing pipelines across the CCDGs have been developed already, and that paper has just been submitted; I understand that's from the analysis and data flow working groups. There is an effort to try to aggregate the data with NHLBI's TOPMed. If we can do that, it will afford analysis of about 150,000 whole genomes. There are joint variant calling plans, both across all the CCDG samples and, if we can get the aggregation to go forward, together with TOPMed; that's, again, work from the analysis and data flow working groups. There's some thought of putting together an imputation server based on the data, and there's been coordination on data annotation and markup both within the Centers for Common Disease Genomics and together with TOPMed. And there have been two joint CCDG-TOPMed analysis meetings already, with another planned for this winter.
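The polygenic risk scores in the early-onset MI vignette above are, at their core, weighted sums of risk-allele dosages. A minimal sketch of that computation; the weights and genotypes here are made-up illustrative numbers, not values from the CCDG or TOPMed analyses:

```python
import numpy as np

def polygenic_risk_score(dosages, weights):
    """Compute per-individual polygenic risk scores.

    dosages: (n_individuals, n_variants) array of risk-allele counts (0, 1, or 2)
    weights: (n_variants,) per-allele effect sizes, e.g. GWAS log odds ratios
    Returns one score per individual: the weighted sum of dosages.
    """
    return np.asarray(dosages) @ np.asarray(weights)

# Illustrative only: two individuals genotyped at three variants.
dosages = np.array([[0, 1, 2],
                    [2, 2, 0]], dtype=float)
weights = np.array([0.10, 0.05, 0.20])
scores = polygenic_risk_score(dosages, weights)  # array([0.45, 0.30])
```

Real genome-wide scores of the kind discussed here use millions of variants and weights tuned on large GWAS panels, but the aggregation step is this same dot product, which is why individuals with no single large-effect variant can still land in a high-risk tail.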
So I want to return to something I said at the beginning here, and that is: for Mendelian disease, we know what we're doing. There's still a lot more to do, and of course, as Chris showed, there are some examples of very interesting biology that the Mendelian centers are getting into, and we have a lot of examples. But for common disease, we don't. I still think we're quite near the beginning. As you might know, the CCDGs have been working on what they call a goals, strategy, and plan document. The goals were really stated in the introduction, on another slide. The strategy part of that document has gotten quite exhaustive; I think people are just trying to cover every aspect of it, but I think it can be easily summarized by looking at it as a statement of the state of the art here. Some of that is reflected on the next slide, which is taken directly from the discussion about strategy at the meeting, and I just want to make a few points. So, rare variants: if you just look at this block, you can see that rare variant studies looking in coding regions are working. It's beginning to work, and it looks like it has room to grow, although, if you think about it, look at the large number of cases that are needed to find just a few rare variants in coding regions. And maybe that's not unexpected. But in contrast, there are still very few examples, in fact some people have asserted that there aren't any, of well-validated non-coding variants being identified in rare variant association studies. So how can we address those challenges? There are probably a number of ways. Clever study design and better analysis methods may help, especially for some diseases, or maybe some components of some diseases. That's important to consider.
But otherwise, we probably need much more functional information, much better functional annotation of the genome, to help with that. So based on all the discussions, the CCDG plans going forward are to maintain work on the current range of diseases and to continue exploration of different approaches: exomes to find coding variants in case-control studies; whole genomes in families, where there's a strong, maybe even de novo, component of the architecture; and still pursuing genomes in case-control studies, to begin to look at non-coding variation and to stimulate the field in terms of improved design, better analysis methods, and better technologies. The external scientific panel recommendations were that the CCDGs should not add a new project at this time, even though the one under consideration would add a new design, just for the sake of concentrating power on the existing studies; where there's environmental data, to add that analysis; to do some more systematic evaluation of methods, for example SV calling or developing polygenic risk scores; and to have cross-consortium collaborations, adding collaborations on polygenic risk scores (this is already going on) and phenotype harmonization. And finally, they recommended that we should increase outside collaborations beyond the current scope, to include other organizations and other consortia. There are way too many people to acknowledge; I can't put them all on even two slides, but there are the Centers for Mendelian Genomics, the Centers for Common Disease Genomics, and the three analysis centers. I want to especially thank our coordinating center and team sequence, and also the members of our external scientific panel. And I'll open it up for questions for either me or for Chris. Or maybe I'll ask Jonathan to start off, since he's one of our ESP members. Right. I think Adam and Chris covered it pretty well.
I think one of the things that the external panel is really looking forward to is that we're about halfway through, so the data is just getting there, particularly for the CCDGs. And so a lot of the analysis and the results aren't quite there yet, because the data is just getting there, because freeze two is just coming, right? Yeah, freeze two is next month. Yeah, it's next month. And there'll be a lot more that they can do with that, particularly with the analysis centers. They're gearing up with ideas and developing methods, and then they'll be able to apply them. So I think in general, you've seen the recommendations, and I think that we certainly feel reasonably good about how things are going. When you count the significant SNPs, that's based, I assume, on P-values, on Bonferroni-corrected P-values. What are the effect sizes? Yeah, so that was on the slide, and I can go back to it. And again, it's not my slide, so I can only comment so much on the details, but I'll go back. So for inflammatory bowel disease, these are moderate effect sizes, three-fold; same for cardiovascular disease; maybe a larger effect for schizophrenia; and then huge for autism. So I wouldn't read too much into the details of the slide. The reason I'm asking is because, in some of these, I assume that you really have a huge sample size and you've only had a few go over the significance threshold, which technically could be lowered. By adding more and more samples, you might get something significant, but now the effect size must be very, very small, and the question is, does it even matter at this point? Yeah, so that's the whole question that was anticipated, maybe not fully anticipated, at the beginning of the program.
And that was one of the reasons, one of the several reasons, to try, with a select number of diseases, to push as far as we could push. We were told by one council member at the time: you'll probably have to go farther into the realm of diminishing returns than you wanted to in order to really understand where you are. And we did have, maybe, the naive idea at the outset that we could bring enough sequencing power to this to actually be comprehensive about a few diseases, but the picture of the genomic architecture of disease has, I think, gotten quite a bit more complicated since then. But this is a very important question: what is the point of diminishing returns? When do you know you've hit it, and what does that really mean? Jonathan? Maybe I could address that a little bit. There's a difference between the effect size of the individual variant, which could be extremely rare but may have an extremely large effect, and the population attributable risk: how much does it mean to the population as a whole? And that's where you start getting into how far down you want to go, in terms of diminishing returns, at the population level. Certainly for the individual or the family that has that rare variant, it's very, very important. And it might open up new biology; you don't know. That's a balance that I think everybody is struggling with a little bit. Yeah, Jay? Well, I will say my bias would be that we may already be at that point, but I think it also depends on dividing it up into this dichotomy of whether we're doing this to explain disease risk through specific findings or whether we're doing it to get leads for biology. If we're doing it for risk, then, and you highlighted it a bit here, I thought the study from Sekar Kathiresan's group, which I think is still on bioRxiv, is one of the most exciting developments of the last couple of years.
And I think we should be paying a lot more attention to it. Maybe you guys already are, but I feel like the implications of that should be permeating through all of the genomic medicine aspects of this institute as quickly as possible, because the genomic medicine folks tend to really be focused on the rare-variant specific genes, and if this really means that we can predict an equivalent risk for a much larger fraction, that has pretty big implications. Yeah, so essentially the same thing was said by our ESP during the meeting in April. I guess another comment, following up on Jay's comment: if what you're interested in is leads, then Bonferroni correction at whatever it is, the 0.05 or 0.01 level, is incredibly conservative, and you could definitely use instead an FDR of 0.05, and then your sample sizes come way, way, way down, if what you want are leads. Just a comment. It comes to millions and millions of dollars. These are discussions that happen almost weekly, certainly when we're having the joint meetings; those discussions come up as to where we should go with it, and if you loosen things up and do more of the polygenic risk score kinds of things, where can you go. So these are ongoing discussions and debates, so it's a good point. This is coming from a bit of an aficionado who doesn't know the details of this area super well, but just looking at it from the outside, it seems like focusing resources on getting polygenic risk scores for non-white populations would be a super use of money in the near term, if that's at all a possibility with the cohorts that are available. So, Jay, we've just started to think about this again since April, since we saw that presented, and some of that has to do with how much of a shift it requires in analysis and also in data production, and we just haven't thought through that yet.
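The Bonferroni-versus-FDR point raised above can be made concrete. Bonferroni declares a hit only when p is at most alpha divided by the number of tests, while the Benjamini-Hochberg FDR procedure uses a rank-scaled threshold and typically rejects many more hypotheses at the same nominal level. A minimal sketch with made-up P-values:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha.

    Benjamini-Hochberg: sort p-values ascending, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject the k smallest.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Illustrative P-values only, not from any GSP analysis.
pvals = [0.001, 0.008, 0.012, 0.025, 0.03, 0.06, 0.3, 0.9]
bonferroni_hits = [i for i, p in enumerate(pvals) if p <= 0.05 / len(pvals)]
fdr_hits = benjamini_hochberg(pvals, alpha=0.05)
# bonferroni_hits -> [0]; fdr_hits -> [0, 1, 2, 3, 4]
```

With these eight toy P-values, Bonferroni (threshold 0.00625) keeps only one hit while BH at FDR 0.05 keeps five, which is the sense in which the looser criterion brings the required sample sizes "way down" when the goal is leads rather than confirmed risk variants.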
I don't know if you've done this calculation, but could you figure out, in terms of identifying new leads or new targets, the cost per target that you identify, either with the CMGs or with the common variants? And if you did use the CMGs for getting at some of the common diseases, are those phenotypes represented in the CMGs? So, the easy one first, I guess. For the CMGs, it's about 30 exomes that we sequence for each thing that's counted as a discovery; we said there's a little nuance to what a discovery is. So there it's relatively low. Whether it's representative on the phenotypes, that's something we tend to find out at the end. And my inclination is to resist doing that calculation too soon for the complex diseases, because right now the cost per discovery is going to be astronomical. All right. Is autism a Mendelian disorder that's very heterogeneous, or a complex disorder? Can I just answer yes? Well, the question goes to what we learn in going forward and selecting other disorders to study, because there could have been guesses made, I'm sure there were, about autism. Yeah, especially about that component of autism, and yes, in hindsight it does seem sometimes that there could have been guesses, but I think at the outset maybe not so much. Although I agree with you that it's clear there's overlap in the Mendelian mission and the CCDG mission, maybe with that kind of study. Was autism purely CCDG genome sequencing? It's purely CCDG, and part of it is a whole-genomes-in-families, quads design, and part of it is a cohort design, done at two different centers. Chris, Adam, thank you very much. Let's move along.