So, I'm going to talk primarily about annotation of whole genomes, how we use it, how ENCODE is obviously playing a role in those annotation exercises, and how we apply this to human disease. Whoops. We did so well yesterday; I was proud of us. Proud of you. Oh, thank you.

We all come to the podium with biases, so the first couple of slides are really to let you know my biases about what's happening in contemporary human genetics. The first is that the habit we all have of putting disease into two bins is a good administrative and organizational tool, but it does not reflect the real science: there is a continuum between rare Mendelian disease and common complex disease. We all know there are common diseases that have Mendelian forms, such as LDL receptor defects, familial combined hyperlipidemia, and many, many more examples. There are also examples where rare, presumed Mendelian diseases clearly do not have an underlying single-gene mutation. Since they are less well known, I'll give you a good example of that. Shown at the bottom here are burden counts within genes that have been related to Charcot-Marie-Tooth (CMT) disease. We have taken a very large cohort of CMT patients, and we can identify the subset, about a third, that carry known mutations; the other two thirds, we can't. So let's set aside that third and think about the patients with no obvious underlying mutation: do a traditional burden analysis, like we would do for a polygenic disease, and compare it to a control set matched for race and ethnicity. What we see is that the CMT patients, shown in red, have an increased burden of mutations, just as you would expect for a polygenic condition. So it's interesting to think that when we come to the table with traits we ascertained as Mendelian conditions, there are clearly examples in there that are not Mendelian.

Then we take some of those mutations over to an experimental system. This was work done with Nico Katsanis, taking each of these mutations in pairwise combinations. Here is the control fish up here, and in the combinations, you probably can't read the font, but those are two of these mutations, we see defects in neural development in the tail of this model system. That gives rise to a hypothesis that I think we need to consider more broadly: the role of oligogenic disease. And for those of you who are more computationally minded, even moving from one gene to two creates enormous problems in terms of power and in thinking about how two genes combine.

The other bias, one that I appreciate more and more looking forward and would like to integrate into my own work, is the relationship between the clinical enterprise and the research enterprise. This is probably based on my presence within the very large Texas Medical Center, but previously I've run my life in a divided manner, with research work on one side, clinical work on the other, and a firewall between them.
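To make the gene-level burden comparison concrete, here is a minimal sketch of a rare-variant burden test of cases against matched controls. The gene, the carrier counts, and the use of a one-sided Fisher's exact test are all illustrative assumptions, not the actual CMT analysis pipeline described in the talk.

```python
# Minimal sketch of a gene-level rare-variant burden test (illustrative only;
# Fisher's exact test stands in for whatever burden statistic was actually used).
from scipy.stats import fisher_exact

def burden_test(case_carriers, n_cases, control_carriers, n_controls):
    """Compare rare-variant carrier counts in cases vs. matched controls
    for one gene using a 2x2 one-sided Fisher's exact test."""
    table = [
        [case_carriers, n_cases - case_carriers],
        [control_carriers, n_controls - control_carriers],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Hypothetical counts for a single CMT-associated gene.
odds_ratio, p = burden_test(case_carriers=18, n_cases=400,
                            control_carriers=9, n_controls=800)
print(f"OR = {odds_ratio:.2f}, one-sided p = {p:.3g}")
```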
And I think moving forward, because of the nature of the science and the economics of where we're operating, we need to think about how the data we're generating can drive discovery; that's the part of this virtuous cycle we're very comfortable with. How that discovery can feed into translation, that's something we talk a lot about. But the step we haven't spent nearly enough time on is how we can use that translation to generate additional data to drive additional discovery. For those of us who work in large-scale projects, this is a cycle that will be very beneficial moving forward, helping to drive discovery and to overcome the cost of doing very large population science. But there are really two separate cultures that we need to bridge. Many people in the audience are used to working in the clinical culture, which, for those of you who don't know, is very physician driven. You may also be used to working in the research culture, which is very protocol driven. What we've got to do is bring these two into line, reconcile the research protocol with physician-guided principles, and also protect the privacy of the patients. For those of you with children, you'll recognize this as a Chutes and Ladders kind of game: it's constantly two steps forward and one step back. But again, if we can drive this in human genomics, it will be very beneficial.

By anybody's standards, though, these initiatives have been very successful. I didn't know quite how to present this; I call it a scorecard. Over the last five years, both in the Mendelian arena and in the complex disease arena, we've been enormously successful at identifying genes, and often identifying the variants within those genes, that contribute to disease. There's really no question about that part of it.

This paper I think is very interesting, and I would encourage you to look at it as a case series; there are not very many examples of case series in this area. It takes a series of 2,000 individuals who were referred as presumed Mendelian cases and reports the findings after exome sequencing. By and large, the take-home message for today is that the solve rate is about 25%: in about 25% of presumed Mendelian disease, we can do exome sequencing, sometimes whole genome sequencing, and we can, quote, solve the condition in front of us. If this were a typical dog-and-pony talk for 45 minutes, I would now go through several vignettes and stories of what these mutations look like. But this being an ENCODE workshop, what I'm going to do, very unusually, is talk about the white space out here. Let's talk about the 75% that we don't solve, and think about how we can increase the solve rate for that 75%. One thing of interest, coming after Aravind's talk: it's interesting to look at the diagnostic rate for neurologic versus non-neurologic diseases. In general, our diagnostic rate for neurologic diseases tends to be a little higher than for non-neurologic diseases. Whether that's an artifact of the databases we're working from or of the genetic architecture of disease, I really don't know. So what do we do for these unsolved cases?
Well, the first step is to move them from the clinical side of the enterprise over to the research side: make sure we have informed consent in place and make sure we can move the data from one side of that firewall to the other. For this group, one of the tools we're using is moving them from whole exome sequencing to whole genome sequencing. We've done this a number of times now, and I'll tell you a couple of stories. But before I do, it's worth thinking about what's allowing us to pursue whole genome sequencing at scale, and it's no secret that the technology keeps evolving. What's very nice in the field today is the ability to think about whole genome sequencing not just on tens of individuals, not even on hundreds of individuals, but on thousands and tens of thousands of individuals. We in the genome center at Baylor have brought up, and now have in production, the new fleet of HiSeq X Ten instruments, and I'll show you a little bit of data. We're pushing this through the various chemistries along with Illumina, comparing libraries, PCR-free versus Nano, and the final goal, really for the first time, is having an X Ten in a CAP/CLIA environment. That's our ultimate goal.

The black line here is the threshold we use as our goal: basically 90 gigabases per lane. These data are binned by time window, and they show you the variation; after a certain amount of burn-in, we're quite pleased with the performance of these instruments. For sensitivity and false discovery rate, I can show this in a variety of ways against array-based SNPs; this happens to be what we use internally as a truth set, HS1011. For most of you, that's just an ID; for some of us, it happens to be a friend of ours who has Charcot-Marie-Tooth himself. Here's the coverage: we have about 97% sensitivity for detecting variants within HS1011 and about a 2.5 to 3% false discovery rate, depending on the details.

Having this in place, though, requires a tremendous amount of technology infrastructure. It's not a matter of just buying the boxes and sitting them on the shelf; you need to totally upgrade the infrastructure to handle this amount of data. I'm not going to go through these two panels in any great detail; it looks like the Paris underground, but it's really a flow chart of moving data through the system. Where I am going to spend time is this yellow box: the annotation of those whole genomes, which in my opinion is not only how we're using ENCODE but is really key for moving these whole genome analyses forward. And wherever you're located, I'll let you know that we're working to build all of these tools in a cloud-based environment. Not because we absolutely have to have a cloud-based environment today, but because we're trying to build a sandbox where investigators throughout the community can use these tools regardless of where they are. The only case in which we absolutely need a cloud-based environment is for our very large epidemiologic studies.
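As a point of reference, here is a minimal sketch of how sensitivity and false discovery rate against a truth set like HS1011 could be computed from two variant call sets. The set-based matching on chromosome, position, and alleles, and the example variants, are illustrative assumptions, not the production validation pipeline.

```python
# Minimal sketch: sensitivity and false discovery rate of a call set versus a
# truth set (e.g., array-based genotypes for HS1011). Variants are represented
# as (chrom, pos, ref, alt) tuples; illustrative only.
def sensitivity_and_fdr(called_variants, truth_variants):
    called = set(called_variants)
    truth = set(truth_variants)
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    fdr = (len(called) - true_positives) / len(called) if called else 0.0
    return sensitivity, fdr

truth = {("chr17", 15134, "A", "G"), ("chr17", 15260, "C", "T")}   # hypothetical
calls = {("chr17", 15134, "A", "G"), ("chr17", 18002, "G", "A")}   # hypothetical
sens, fdr = sensitivity_and_fdr(calls, truth)
print(f"sensitivity = {sens:.2%}, FDR = {fdr:.2%}")
```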
With tens of thousands of individuals, doing that type of calling in a local compute environment turns out to be a horrendous task. I put this slide in after yesterday's discussion. It's interesting to think that as we move from exomes to whole genomes, we might assume we're plugging all the holes in the genome, H-O-L-E-S in this case. So what I've done here is take the GeneTests genes and look at every site in those genes that has at least 30X coverage. In our hands, this purple bar, let's just look at the panel toward the right, would be considered the gold standard: a whole exome design that we have really tailored, and put a lot of effort into, to get the very best coverage we can. Even there, if you can't read it, it's about 1,600 of the 1,821 genes, so we're not getting the complete 100% coverage we'd like. As a comparison, look at PCR-free on the X Ten, this bar here, and you can see that even though it's, quote, whole genome, there's still a portion of the genome that is under-covered, I'll use that word. One of the challenges we have as a group of genomicists is to think about how we can raise the quality and make a comprehensive analysis of human disease. The first thing we need is a comprehensive genome in front of us, so we can think about how to use that comprehensive genome to understand disease.

As we move from generating whole genome sequence to analyzing human phenotypes, the challenge we face, I think, is a signal-to-noise problem. This is a silly cartoon, but it lets us make a simple point that I think is critically important. The x-axis is the number of individuals; the left y-axis is the number of variants. It's interesting to think about array-based genotyping: whatever your favorite panel is, say it has a million SNPs, that number stays constant as you add individuals. But as you add individuals in the sequencing realm, the number of variants just keeps going up. In fact, having now seen a lot of sequencing, it's quite amazing to me how it continues to go up; at some point, my guess is that every site in the genome will be variable if you sequence enough individuals. Now switch to the right y-axis and think about the signal-to-noise ratio. With array-based genotyping, the signal goes up and the noise does not, so we crossed the threshold of discovery quite early, and that's basically what drove the success of the GWAS era. With sequencing, the slope of this line is much lower, because as you sequence more and more individuals, the signal does go up, but the noise goes up with it. So we need to think about ways to minimize the noise and bring out the signal in sequencing. The thesis I'd like to promote is that study design, annotation, and filtering are the ways to raise that signal-to-noise ratio, really by lowering the noise.

The annotation engine that we're developing for whole genomes off the X Ten is really the work of Simon White and Xiaoming Liu, whose pictures are shown here. I'm not going to go through this in great detail; a lot of what's on top, the protein-coding annotation, we know quite well.
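As an illustration of the kind of coverage accounting behind that slide, here is a minimal sketch that counts a gene as fully covered only if every base reaches 30X. The per-base depth input, the example genes and depths, and the "every base" completeness criterion are assumptions for illustration, not the actual analysis shown.

```python
# Minimal sketch: count genes in which every targeted base reaches >= 30X depth.
# A real pipeline would read per-base depths from samtools/mosdepth-style output
# rather than an in-memory dict; the numbers below are hypothetical.
MIN_DEPTH = 30

def fully_covered_genes(per_base_depth):
    """per_base_depth maps gene name -> list of read depths, one per base."""
    covered = []
    for gene, depths in per_base_depth.items():
        if depths and all(d >= MIN_DEPTH for d in depths):
            covered.append(gene)
    return covered

example = {
    "PMP22": [42, 38, 55, 61],   # every base >= 30X -> counted as covered
    "MFN2":  [33, 12, 47, 50],   # one base below 30X -> not counted
    "GJB1":  [31, 30, 36, 44],
}
covered = fully_covered_genes(example)
print(f"{len(covered)} of {len(example)} genes fully covered at >= {MIN_DEPTH}X")
```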
Clearly, we've then moved on to thinking about how we can do functional prediction in the non-coding regions. Some of these tools, such as CADD and FunSeq, were developed by people in this room. What we're doing is bringing these tools in and thinking about very rapid ways they can be applied: taking all nucleotides in a canonical genome, mutating each nucleotide to all possible alternative bases, recalculating a CADD score, just as an example, and then loading that so it's pre-computed and can be used in an analysis. Here are other motifs that are annotated, many of which come from ENCODE. Then, going back to the cloud-based environment, what we're heading toward is building this as a tool that would be available to others: you could upload genomes from your own studies, annotate them with this tool or other tools, and then bring the annotated genomes back to your environment. Our responsibility is then to keep these annotation libraries up to date.

Here is a story; this is a picture of Matthew Bainbridge, and this is Matthew's work. We identified ten families in which, after exome sequencing and a lot of analysis, we could not find a mutation responsible for the disease. We moved them from an exome environment and did whole genome sequencing in trios. I'm not going to go through all the solved cases, but I will tell you one story because it's interesting in this setting: a patient who had a neurodegenerative disease with iron accumulation in the brain. In this particular case, the mutation was near a gene, but it had characteristics such that it could not be detected by multiple tools using whole exome data. One of the things we don't appreciate about whole genomes compared to exomes is that they allow us to detect structural variants that just weren't possible with exome sequencing. Mutations that are small, we tend to pick up; mutations large enough to drive a bus through, we tend to pick up; but mutations of moderate size, this particular one is 168 bases long, are very, very difficult to detect in exomes and very difficult to interpret when you do detect them. In the whole genomes, this mutation was detected in this family; we're in the process of validating it, and it is likely responsible for disease in this particular child.

Moving then from Mendelian patients to more common complex diseases, this is work from the CHARGE consortium, and it's one of the things I'm working on very hard. Here is a list of large cohort studies, most of which are funded by the National Heart, Lung, and Blood Institute. In collaboration with NHLBI and NHGRI, what we would like to create is a community resource where literally tens of thousands of individuals have whole genome sequencing, gene expression by RNA-seq, methylation data, metabolomic data, miRNA-seq, and also a little bit of flow cytometry. If we can create this very deeply phenotyped resource, with phenotypes that are closer to the gene level, the hypothesis is that we would be able to, number one, drive gene discovery for those phenotypes that are closer to the gene level.
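To illustrate the pre-computation idea, here is a minimal sketch that enumerates every possible single-nucleotide substitution along a reference sequence and stores a score for each, so annotation at analysis time is just a lookup. The scoring function is a stand-in, not CADD or FunSeq, and the dictionary-based table is an assumption for illustration rather than the actual annotation engine.

```python
# Minimal sketch: pre-compute a functional score for every possible SNV along a
# reference sequence so that annotation becomes a constant-time lookup.
# `toy_score` is a placeholder; a real build would call CADD, FunSeq, etc.
BASES = "ACGT"

def toy_score(chrom, pos, ref, alt):
    # Stand-in for an expensive model; returns an arbitrary dummy value.
    return (hash((chrom, pos, ref, alt)) % 1000) / 1000.0

def precompute_snv_scores(chrom, ref_sequence, start=1):
    table = {}
    for offset, ref_base in enumerate(ref_sequence.upper()):
        pos = start + offset
        for alt in BASES:
            if alt != ref_base:
                table[(chrom, pos, ref_base, alt)] = toy_score(chrom, pos, ref_base, alt)
    return table

# Build the lookup for a tiny stretch of sequence, then query one variant.
scores = precompute_snv_scores("chr1", "GATTACA", start=1_000_001)
print(scores[("chr1", 1_000_002, "A", "C")])
```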
And then, number two, each and every one of these individuals is phenotyped for tens of thousands of phenotypes; if you can think of the phenotype, we try to measure it. And if all that data were available through something like dbGaP, it really could be a major driver of discovery. So what I'll show you is that we now have whole genome data on 12,971 individuals. We've called approximately, I can't remember the number exactly, I think the calling is complete on 5,700. This is the work of Fuli Yu. I put this slide in to indicate that we've moved beyond the old days; particularly back in the 1000 Genomes days, I remember the group being involved in a number of bake-offs where we would compare callers. The window we're in right now is that we run not all the callers, but multiple callers, on each of these sites, and then derive a consensus call set using algorithms developed against several gold standards. Fuli has developed this tool, goSNAP, which takes whole genomes off the X Ten, runs them through multiple callers, uses a couple of gold-standard genomes that we developed in a consensus call set algorithm, and then pushes those genomes through that algorithm to produce an ultimate consensus call set. I've thought about ways we could use multiple call sets in a genotype-phenotype relationship. I think it's intellectually stimulating, but my guess is it probably won't improve discovery, so I've stayed away from that. It's a question I often get.

We've also developed another tool. One of the problems in approaching whole genomes for a common disease is the horrendous computational task, which is actually quite tractable; the harder part is how to digest this information for your biologic and clinical colleagues. This is a visualization tool we call Lekesis, which again is publicly available. What's shown here is position, and this is a very small window of the genome; it starts around Boston and would probably end up around Miami if the whole thing were here. What we have, and it's just a silly cartoon, really, are p-values for common variants and burden test results for genes or protein-coding motifs. Then we annotate from multiple sources, including ENCODE, shown in these purplish boxes, depending on what you chose. On the bottom, you can have either coverage or numbers of SNPs, depending on how you want to toggle it. And finally, we run a sliding-window burden test across the genome. It turns out to be a valuable tool, because one thing we're all going to struggle with is the tractability of tens of thousands of genomes: how we make sense of them, and not just treat it as a fun computational exercise but really use it to drive biology, discovery, and clinical practice.

So I'm going to end there and thank everyone for their attention. I've tried to mention names: the virtuous cycle work, I spent a lot of time talking with Richard Gibbs; the clan genomics work on variant frequency and effect is with Richard and Jim Lupski. The clinical work at Baylor is in the WGL, now the Baylor Miraca genetics lab, with Christine Eng primarily. The Center for Mendelian Genomics is a collaboration with David Valle and Jim Lupski. The CHARGE consortium I've already mentioned.
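As a rough illustration of that sliding-window burden scan (with windows defined by variant count rather than physical distance, as discussed in the questions below), here is a minimal sketch. The window of 50 variants with a skip of 25 matches the talk, but the pooled carrier-count Fisher test is an illustrative stand-in for whatever collapsing test the real pipeline uses, and the input layout is assumed.

```python
# Minimal sketch of a sliding-window burden scan: windows of 50 variants,
# advancing 25 variants at a time, comparing pooled carrier counts in cases
# vs. controls. Fisher's exact test stands in for the production burden test.
from scipy.stats import fisher_exact

WINDOW, STEP = 50, 25

def sliding_window_burden(variants, n_cases, n_controls):
    """`variants` is a list of dicts sorted by position, each with
    'pos', 'case_carriers', and 'control_carriers' fields (assumed layout)."""
    results = []
    for start in range(0, len(variants) - WINDOW + 1, STEP):
        window = variants[start:start + WINDOW]
        case_hits = sum(v["case_carriers"] for v in window)
        control_hits = sum(v["control_carriers"] for v in window)
        # Pool observations across the window so counts stay non-negative.
        case_obs = n_cases * len(window)
        control_obs = n_controls * len(window)
        table = [[case_hits, case_obs - case_hits],
                 [control_hits, control_obs - control_hits]]
        _, p = fisher_exact(table, alternative="greater")
        results.append((window[0]["pos"], window[-1]["pos"], p))
    return results
```

A step of half the window size means adjacent windows overlap by 25 variants, so a signal sitting on a window boundary is still captured by the next window.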
The whole genome work in CHARGE is really led by Alanna Morrison. And I have a lot to thank in the annotation work: Simon White, Matthew Bainbridge, Xiaoming Liu, and Cassandra. So thank you, everyone, for your attention. Questions?

Do we have sound? Can somebody try? Ewan, we'll try; I'll repeat it. Go ahead.

In sequencing, surely that's what GATK's joint calling of samples tries to leverage in the opposite direction, in fact: that systematic noise can be detected and modeled, and then one cleans up the noise. I don't quite buy it. One has to be sophisticated about the calling, and maybe about joint calling at scale in these cases, but it really shouldn't be the case that the noise comes to dominate.

You're focusing on, quote, noise or error in the calling itself. I think the primary driver of noise in that graph, that schematic, is the chance association between rare variants and the phenotype. Because you have a fixed sample, there is going to be a lot of purely stochastic variation in the relationship between genotype and phenotype, and I think that's the primary noise driving the graph. It's not calling error or noise, though that's certainly involved. Yes, yes.

Well, part of the statistical model for SKAT and these burden tests is that you have to account for how those things happen. But I'm not somebody who believes that by restricting the data we have, we clean up our statistics. One should be able to produce better statistics with more data, even if at the extreme that means restricting what happens at the most extreme.

I agree with that sentiment. What I'm trying to point out is that in the GWAS era we got ourselves, intoxicated is the word I would use, with the idea that by increasing sample size we would always drive increased signal at the same rate. What I realize now is that by simply increasing sample size with exomes or genomes, the rate of increase of information is not as steep, because you're also increasing the number of variants you sequence that are not related to disease. That's the slope of the line. It's not that the line went down; it's that the slope didn't go up as fast. That's the point.

Oh, I'm sorry about that. Sorry, I don't want to repeat that. You were asking about the window on the genome and how we set the window size. It's a great question, and there are two issues we've grappled with. First, what is the metric of the window: is it physical distance, or is it number of variants? The data you saw use number of variants: we had a window size of 50 and a skip of 25. We've played with both of these quite a bit, and 50-25 is the one we're comfortable with because it's totally data-driven; the line itself just seems to be, quote-unquote, better behaved with those two numbers. But I would be the first to admit it's a bit arbitrary at this point.

Okay, good. All right. So our next speaker is Nancy Cox from, I have to get used to saying this, Vanderbilt University. And she's going to be speaking on genetically predicted endophenotypes: getting to the next level in understanding how genome variation drives disease.