So, our next speaker is Bill. Oh, thanks, Bill. Bill Lowe, and Bill is going to talk about identification of regulatory variation important for maternal metabolism during pregnancy.

Okay. Thank you, Eric. So, I'm going to be coming from a different perspective, as a person with clinical training who's interested in a specific phenotype, in this case maternal metabolism during pregnancy, and how we've used ENCODE data to try and, whoops, this is pretty sensitive, how we've used ENCODE data to try and move that forward. I'd also like to say, although I'm the person giving the talk right now, that this represents a collaboration between myself and Tim Reddy, who's at Duke and a member of the ENCODE consortium. And just a plug for NHGRI: Tim's and my collaboration began about three and a half years ago, when NHGRI arranged what could best be called a speed dating event between GENEVA and ENCODE. It was sort of a propitious time for both of us, and so it's worked out quite well. So, I think as we've heard, just looking at type 2 diabetes and a variety of related glycemic traits, as someone who's interested in the genetics of the disease, what we see is that, as we've developed larger and larger cohorts, there's a large number of genes that have been shown to be associated with the disease or the phenotypes. But I think it's fair to say that in most cases we don't understand the functional variation that's responsible for this, and in many cases we don't yet understand how the specific gene products contribute to the phenotype. So the phenotype that we've been studying is maternal metabolism during pregnancy. There are a number of changes in metabolism that occur during pregnancy to accommodate the growing fetus. Some of these are shown here.
First, there's a decrease in fasting glucose, which occurs despite an increase in insulin resistance, and that increase in insulin resistance, not surprisingly, is accompanied by increased fasting insulin levels, increased insulin secretion, as well as increased hepatic glucose production. So we did a GWAS using 4,500 mothers from four different ancestry groups, a population-based study looking at a variety of metabolic traits, and found a number of genes that demonstrated a genome-wide significant association with either fasting glucose, fasting C-peptide, or two-hour glucose levels. Several of these were genes that were known to be associated with glucose-related traits in non-gravid populations, but two of them were ones that had not previously been shown in non-gravid populations. And I'm going to talk a little bit about some of our studies with HKDC1 initially. So here's HKDC1, which is located on chromosome 10. I should mention this is called hexokinase domain containing 1, and it sits right next to hexokinase 1. And if you overlay it on the ENCODE data, we were happy to see that, in fact, it looks like an area that's relatively transcriptionally active. There's a lot of open chromatin in this region, as well as some enhancer marks. And then here is our lead SNP, but a number of SNPs in this region demonstrated association. And, in fact, you could demonstrate that this gene is also expressed in a number of different cell types, and that was consistent with our findings looking at normal human tissues that the gene was widely expressed. So the next question was, which of these variants account for the association that we're seeing, and what's the mechanism of the association?
So a graduate student in Tim's lab, Carl Guo, took it upon himself to, as you'll see in the next slide, divide this gene up into about 11 different regions based on the DNase I hypersensitivity sites that have been identified or characterized by ENCODE, and then synthesize DNA fragments that contained all the different haplotypes represented in each of those regions based upon 1000 Genomes data. You could then put those into luciferase reporter vectors and transfect those into cells to see which impacted gene expression. So you can see here the 11 regions that he tested over the HKDC1 locus. And in fact, what Carl showed was that there are, if you see here, four different SNPs from four of these different regions that impacted luciferase expression, presumably through an impact on gene expression. Each of these variants was in a region where there was a SNP that demonstrated genome-wide significant association, and probably more importantly, these SNPs also aligned with previously demonstrated eQTLs for HKDC1 in liver. Importantly also, and you can see down here in bold, each of the bold SNPs, which was associated with decreased gene expression, was also associated with higher two-hour glucose levels in the mother. So this led to the hypothesis that lower HKDC1 expression may in fact be associated with higher two-hour glucose levels, and in fact, we have some data in mouse to suggest that might be true. But I think you can appreciate that this is a relatively tedious and laborious approach to sort of marching your way through these regions. And also, as suggested by Eric, as you start to sequence these regions in greater numbers of people, you're going to find an increasing number of variants. So the question was, was there a faster way to move from causal variants to function in a high-throughput manner?
And could you develop new functional assays of gene expression, and specifically, could you use donor DNA for functional assays of gene expression? This would have the advantage of directly testing personal rare variants that are going to be identified through sequencing, as well as allowing haplotypes to be tested directly. So to address that, Tim developed an approach based on the STARR-seq approach, which I'm sure all of you are familiar with. So the idea is, as I'll show in the next slide, rather than testing a variety of fragments of DNA for enhancer activity per se, one could take the DNA capture libraries that were used for sequencing, take those fragments, and put them into the STARR-seq vector downstream of, in this case, GFP, transfect those into cells, and then, by doing RNA-seq, look at differences between the input and output libraries to see if there are variants that impact expression. So we've recently gotten the sequence data from our locus on chromosome 10, but we've not been able to test that. It has been tested on a related trait, fetal adiposity, which is another trait that's related to maternal glucose metabolism, and there's a locus on chromosome 3 that we've been studying. So again, now using the ENCODE data, Tim had sequenced across this region, choosing a variety of targets based upon areas of open chromatin, to look for variants. So then he could take these fragments, which are about 450 to 500 base pairs in length, clone them into the STARR-seq vector, and transfect them into cells.
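The comparison of input and output libraries described here can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the analysis actually used in the work presented: the function names, the pseudocount, and the read counts are all hypothetical. For each SNP it compares the alternate-to-reference allele odds in the expressed reporter RNA (output) with the odds in the plasmid library (input) and reports a log2 ratio. A value near zero suggests the two alleles drive similar reporter expression, and a large positive or negative value flags a candidate regulatory variant.

```python
import math


def allele_fraction(alt_reads: int, ref_reads: int, pseudocount: float = 0.5) -> float:
    """Fraction of reads carrying the alternate allele.

    A small pseudocount (an assumption of this sketch) keeps the
    estimate away from exactly 0 or 1 when one allele has no reads.
    """
    return (alt_reads + pseudocount) / (alt_reads + ref_reads + 2 * pseudocount)


def log2_allelic_effect(in_alt: int, in_ref: int, out_alt: int, out_ref: int) -> float:
    """log2 of (alt:ref odds in output RNA) / (alt:ref odds in input library)."""
    p_in = allele_fraction(in_alt, in_ref)
    p_out = allele_fraction(out_alt, out_ref)
    odds_in = p_in / (1 - p_in)
    odds_out = p_out / (1 - p_out)
    return math.log2(odds_out / odds_in)


# Hypothetical read counts: (input alt, input ref, output alt, output ref).
toy_counts = {
    "snp_A": (500, 500, 500, 500),  # output matches input
    "snp_B": (500, 500, 800, 200),  # alternate allele enriched in output
    "snp_C": (100, 900, 50, 950),   # alternate allele depleted in output
}

effects = {name: log2_allelic_effect(*counts) for name, counts in toy_counts.items()}

for name, effect in effects.items():
    print(f"{name}: log2 allelic effect = {effect:+.2f}")
```

In a real experiment one would also want replicate libraries and a statistical test on the counts before calling any SNP significant; the sketch shows only the effect-size calculation.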
So, you can transiently transfect these into a relevant cell type, and then you can do high-throughput sequencing of the expressed reporter gene to measure the allele-specific regulatory activity of each amplicon. What you can see here is there was a fairly good correlation between the targeted sequencing allele frequency and the frequency in the input library, and you'd anticipate that there would be some outliers that were fragments that contained variants that affect gene expression. So across this region, these data were generated initially using fragments from 95 individuals. There were 321 SNPs identified among those individuals, 283 of which were successfully assayed, and if you did replicate experiments, the concordance between experiments was quite high. Importantly, about 32% of these SNPs were rare, defined as a minor allele frequency less than 1%, and 29% of the assayed SNPs were also rare, suggesting that you weren't losing rare SNPs along the way. And then 27 common and 9 rare SNPs demonstrated significant fold changes in regulatory activity, some decreasing gene expression and others increasing gene expression, so you can see some reasonable changes. So obviously then, you have to validate these changes using a more standard assay. So these fragments were then cloned 5-prime to a luciferase gene, so previously they were 3-prime, now they're 5-prime, and then we looked to see their impact on expression. In each case, the allele with a higher level of expression in the initial assay also demonstrated a higher level of expression, or in this case increased luciferase expression, in this more standard assay. And then if you look at one of these alleles, rs4266144, in fact that is part of a TEAD4 binding site, and you can see that there's a G allele and a C allele. The allele was the allele with a higher level of expression.
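The rare-variant check mentioned above (that rare SNPs are not being lost between discovery and assay) comes down to comparing the fraction of SNPs below the 1% minor allele frequency cutoff in the two sets. The following is a small sketch of that comparison in Python. The minor allele frequencies in the toy lists are hypothetical, and the approximate counts are derived only from the rounded percentages quoted in the talk, so they should be read as estimates rather than reported values.

```python
RARE_MAF_THRESHOLD = 0.01  # "rare" defined in the talk as MAF below 1%


def is_rare(maf: float, threshold: float = RARE_MAF_THRESHOLD) -> bool:
    """True if the minor allele frequency falls below the rare cutoff."""
    return maf < threshold


def rare_fraction(mafs: list[float]) -> float:
    """Fraction of SNPs in a list that are rare."""
    return sum(is_rare(m) for m in mafs) / len(mafs)


# Hypothetical minor allele frequencies, for illustration only.
identified = [0.002, 0.005, 0.008, 0.009, 0.05, 0.12, 0.20, 0.31, 0.40, 0.45]
assayed = [0.002, 0.005, 0.008, 0.05, 0.12, 0.20, 0.31, 0.40]

print(f"rare among identified: {rare_fraction(identified):.1%}")
print(f"rare among assayed:    {rare_fraction(assayed):.1%}")

# Approximate counts implied by the rounded figures quoted in the talk:
# about 32% of 321 identified SNPs and 29% of 283 assayed SNPs.
approx_rare_identified = round(0.32 * 321)
approx_rare_assayed = round(0.29 * 283)
print(approx_rare_identified, approx_rare_assayed)
```

If the rare fraction dropped sharply between the two lists, that would suggest the assay was preferentially losing low-frequency alleles; similar fractions, as quoted in the talk, suggest it was not.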
You can see the ancestral allele appears to be the G allele, as reflected by other non-human primates. In fact the G allele is more common in humans as well, albeit it's about 50-50; it's a relatively common variant. I don't know why this turned into a black box, but the Neanderthal variant was a G allele as well. So this suggests that this allows us now to sort of hone in on this particular site as one that may be important for the phenotype that we're interested in. So just to conclude and get us a little bit of time back. So in terms of the gap that's being filled, I think one of the ideas is that developing, in a sort of disease-agnostic way, new technologies that allow for more rapid screening of functional variants is something that would be appropriate for NHGRI. Obviously, using those technologies to identify functional variation and determine the impact of those variants may be done more appropriately as it relates to specific phenotypes. But the technologies, I think, could be developed in a more disease-agnostic sort of way. So thinking about what other resources might be of value, I think one of the challenges you face as someone who's interested in the phenotypes is that many times getting relevant cell types for functional studies is a challenge. Developing resources, whether they're iPS-derived or primary cells or other types of cells, would help, so that we're not forever looking at cancer cell lines. Specifically in the diabetes world, beta cell lines that really recapitulate function are hard to come by; the cell lines are not necessarily that great. As for what other efforts, I think we've heard a lot about eQTLs, and I think we've heard a lot at this meeting as well about the idea that many of the eQTLs have been identified in basal states, and about understanding them in sort of a perturbed state.
For example, in my case, perhaps in the presence of insulin resistance, that would be of interest, and I'll conclude with that. Thanks. Oh, sorry, I just want to acknowledge Tim and his lab at Duke, who's been a real driver in all of this. Thanks.

Thinking about this is fascinating. I mean, this appears to be then very scalable, right? Meaning you essentially go through ENCODE to figure out which regions of the genome to clone, but I presume that you don't even need that, meaning you could use it as corroborating evidence.

Right, so in fact I think the idea would be that you could do the test for the variation and then, as you identify the variation, perhaps go back to ENCODE to see those fragments that contain, you know, potential variants of interest. You could go back to ENCODE and maybe prioritize those that you're going to follow up based upon how they overlay ENCODE data.

But I'm wondering about the detection ability. Meaning, I guess you're looking at many elements, so you could screen for thousands, tens of thousands, or millions of such elements.

Agreed, yes, absolutely. And this was done on a small scale with, you know, 95 individuals; actually, we're doing it on 800 at the current time.

It's sort of related. I mean, you can look at so many of them now, and there are multiple ways of doing it. They're grinding. A number of people have just saturation mutagenized areas too, even if the variants are not naturally occurring yet, to see what can't, you know, what may or may not be. And as crazy as that sounds, I'm not sure that we couldn't do it, because you literally can put tens or hundreds of thousands into these types of assays.

Right.

I'm not sure that we couldn't do that for every base pair, or maybe every conserved region or something like that, and so we'd know. Sean?
Yeah, I think, you know, you mentioned the issue of the cellular context, which I think is a theme in many things. But also, there are some caveats with these types of assays, right? And the caveat is that collectively, of the results from those experiments, about 20 to 30% are false positives, which means they're positive in the enhancer assay and, in the exact same cell type where they're testing something, it's not a DNase hypersensitive site, right? Which is a sine qua non for anything that has ever been found to be an enhancer. So you've got a 30% false positive rate, and you also have about a 60% false negative rate in cases where you can look at known enhancers and actually map them back to those assays. And that was part of the driving reason why those kinds of assays were abandoned in the late 1980s in favor of other cell-based contexts. So I think that you can learn a lot of things, and when they're positive and everything lines up, it's great. But I think that, you know, before contemplating a massive assault on the whole picture, one has to really understand very carefully what the properties of the assays are in the cells and everything, and kind of map that into the whole picture.

So I just want to echo that. I think that STARR-seq and similar protocols are incredibly powerful. But recognizing their limitations is also really important. And one of the things that I think, you know, to go to John's point, is that these are sites taken completely out of their context and being used on a transiently transfected vector. So you're basically taking chromatin out of the picture. And I just don't know to what extent it's safe to do so. Because they're being put within a transcribed region where the polymerase is cruising through, nucleosomes are not assembling.
And so, you know, for everything that ENCODE has invested in understanding how DNase hypersensitivity and how chromatin modifications can affect enhancers and regulatory regions, I think we don't want to invest too much in an assay that's effectively chromatin-free and out of context.

Just to say that I fully accept what John and Karen just said. But I think one thing that CRISPR opens up, thinking about what was presented late yesterday, is that there is a way of putting things in the endogenous context. So these assays need to meet each other. There are issues with going after non-coding stuff. But in terms of the technology area to invest in, going after the endogenous context with a lot of synthesized or natural sequence, I think, is a very important direction. There isn't a solution right there available. Just pull the trigger.

I'm sure there are studies that John knows that will support that 30% false positive rate. That's not the finding from all the studies. And there are studies where, for our negative controls, we rarely see activity in these assays. Now, that being said, there are clearly limitations to this assay. There are limitations to how you can interpret any assay. And you have to keep those in mind. But you really can get high throughput. And I think that what you're learning from these transient transfections, in fact, does extrapolate well as you look into more and more assays.

This is a very exciting prospect. And you really could ramp this up to genome-wide efficient coverage. I'm not sure we'll be the best population to do it in. But you could do this for p300. I mean, you could really hit this well.

I just want to slightly disagree with Karen. Yes, we can and we should get where we can do it intangibly. But you do chromatinize the small plasmids. We know that you can do it. John, we've done thousands and thousands of these and I've never seen a false positive. You certainly get false negatives, but we've never seen false positives.
And so I think you don't go in saying that this is recapitulating all biology, but it's so fast, so cheap. And even if you just get a hint, just being careful not to overinterpret it, it's extremely valuable. I mean, this has been going on since transient transfections were invented. You learn an enormous amount from them. You just have to be cautious not to think that it's the entire story. And I think when you get something, it says that base pair can have an effect on transcription, and whether it does in real life, you know, you may have to do more.

So I view it as a screen, right?

No, I didn't want to suggest that this was the end and the answer; it's really a first step, a very first step.