All right, so I was asked to address where the gaps are in our understanding of the functional landscape of human and mouse. The planners of this workshop actually gave me a list of questions that I was supposed to try to answer in this short time, and I'll try to touch on a fair number of them: what is the current status, and what new data production efforts should be the highest priority? What should we do in terms of validation and characterization? What future studies should be envisaged if not limited by technology? I like that one. What technological breakthroughs would be transformative? How would you prioritize? And what do you need to make the data interoperable? And I promise to stop on time — my timer, I can see it right there — because the others have lots to say. Happily, a lot of what I want to emphasize you've heard parts of already, but I do want to give you my perspective on these questions.

So first, the current state of mapping. This is not to scale, but I think the message is accurate. Think of this top layer as a two-dimensional matrix of cell types and features. This has been emphasized before, but let me say it again: there are some assays that really penetrate, and there are some cell types where you only know maybe two things about them — but you know two very valuable things. You get the DNase sensitivity, you get the RNA. You actually know a lot more about the tracheal epithelium than you did before. And, as has been emphasized a few times, there are a few cell types for which a lot is known. There's a lot of white in here. Some of it will have to stay white, because the factors are not expressed in those cells, but still, there's an awful lot that hasn't been done.

But that's really just the tip of the iceberg, and context dependency has been emphasized over and over. Imagine that same matrix repeated some number of times, in response to environmental stimuli, time of differentiation, or whatever. For every cell within each one of those matrices there is actually a relevant time course, and each one of those cells is a whole-genome dataset. This is a huge amount of data. Of course, no one really knows how many there are of any of these, but let's put in some numbers that are not crazy — some of these I cited, and some I just guessed at. If you multiply them all together — and it is multiplicative — just 2,000 cell types, 2,000 features, a bunch of conditions, and a bunch of time points comes to 800 million. Brute-force completion of the matrices is 800 million whole-genome experiments. That's why I think we're not going to get there. I think we can get something extremely valuable, but I don't think we'll get to completion.

So how do you get to something that's extremely valuable? Well, you can focus. This is just one particular system that I'm very, very interested in: the myeloid component of hematopoiesis, a bunch of different cell types. Thanks to advances that allow us to do RNA-seq and ATAC-seq in small numbers of cells, you can actually fill in the relevant matrix quite a lot. Everything that's colored is actually known — as of six months ago; the black is from ENCODE, the gray is from others. So focus will help you learn a lot. Of course, it won't cover every cell type, obviously. So that's the current status. What else is needed?
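To make that multiplicative estimate concrete, here is a back-of-the-envelope sketch. The cell-type and feature counts are the round numbers from the slide; the condition and time-point counts are assumptions chosen only so the product reproduces the 800 million total quoted above.

```python
# Back-of-the-envelope size of the full mapping matrix.
# cell_types and features are the talk's round numbers; conditions and
# time_points are assumed values chosen to reproduce the ~800 million
# figure quoted above.
cell_types  = 2_000   # distinct cell types
features    = 2_000   # assays, factors, and marks per cell type
conditions  = 20      # environmental stimuli, differentiation states (assumed)
time_points = 10      # samples along each relevant time course (assumed)

experiments = cell_types * features * conditions * time_points
print(f"{experiments:,} whole-genome datasets")  # 800,000,000
```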
Well, my number one need — and this is consonant with other things you've heard — is a 3D chromatin interaction map. This is just one example, showing that the alpha-globin enhancers actually interact all through a region that's clearly marked as regulatory by K27 acetylation. Jim Hughes used a capture approach to get high-resolution data; we saw some Hi-C data earlier at much lower resolution. We have to embrace a whole range of scales, try to understand that range, and try to hit many different cell types. And I think we've got to jump into the dynamics of these interaction maps, because there you're actually seeing regulation in progress. I wish I knew more about what the 4D Nucleome project is going to give us — lots of new data sets, right? — and I hope we get some clarity on that. But at least there's money going into this, it's very, very important, and we need to interface with it. And I do think that, at least for several cell types, this could benefit from a top-down, managed approach.

There are other things we need. I actually do think we should move toward completion — even though I don't think we'll ever get there, we will get to a useful point that I'll mention later. You need to map more factors in more cell types, and these really are serious limitations. Maybe antibodies will get us from the current 200 — let's say 300 — factors up to the 1,500 that are needed, but honestly, I doubt it. And a lot of these cells are pretty rare, so you really have to get down to smaller numbers of cells. Higher resolution would be very useful; I think you'll hear more about this from other speakers later. And you're hearing some calls for continuing with the top-down, managed approach that the current phases of ENCODE have had so far; I think there's also a lot of value in opening it up to community-driven projects.

And throughout all of this: dynamics. Dynamics is how you can really start to approach a clearer understanding of what you want to know. This is just a snapshot from a very well-studied system, erythroid maturation at the beta-globin gene complex. Looking across the time course, you see not only what was happening in the cells, but that the regulatory regions had the factor GATA1 bound early on, and later you can see the response: a million-fold increase in expression, a hundred-fold increase in binding. This is the sort of data we need throughout — in, I have to say, carefully chosen parts of that many-dimensional matrix. Hence I'm saying it should be driven by investigators who really understand these systems and can make the best case for where you're going to get the biggest bang for your buck.

Then another question I was asked: what validation and characterization of the functional elements do we need? Well, we need to take this seriously, and happily, you're hearing many speakers embrace it. I'd break it into two parts. There are some efforts that would benefit from being highly managed and closely coordinated — in fact, that's ongoing now in phase three of ENCODE. Obviously, you want to use the high-throughput genetic screens that have been developed recently. And we want to ensure that a certain fraction of the functional predictions are tested, so that you get an answer: how good were those predictions? You get that empirical p-value that Dana was asking about.
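As a minimal sketch of what that empirical answer could look like — assuming a hypothetical table of predictions, each tested in a high-throughput screen, and not any prescribed ENCODE pipeline — scoring the tested positive calls gives you their precision, and, anticipating the next point, scoring tested negative calls estimates what the predictions miss:

```python
# Minimal sketch: scoring tested predictions from a validation screen.
# `tested` is a hypothetical list of (predicted_element, assay_active) pairs.
def validation_summary(tested):
    pos = [active for predicted, active in tested if predicted]
    neg = [active for predicted, active in tested if not predicted]
    precision = sum(pos) / len(pos)  # how good were the positive calls?
    miss_rate = sum(neg) / len(neg)  # what fraction of "negatives" are active anyway?
    return precision, miss_rate

# Toy numbers: 200 tested positive predictions, 100 tested negatives.
tested = ([(True, True)] * 140 + [(True, False)] * 60
          + [(False, True)] * 15 + [(False, False)] * 85)
print(validation_summary(tested))  # (0.7, 0.15)
```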
Now, I think everybody knows — sure — that you're going to start with the positive predictions and test those. But I also want to put in a pitch: do it broadly, and put in some negative predictions, because you also need some orthogonal ascertainment of what you're missing. You like this? All right, we're getting some enthusiasm here. Again, I don't see this happening throughout the entire matrix, but we have to have some well-chosen systems to investigate very, very deeply, and that kind of information will extrapolate broadly.

Now, I also like the less tightly managed approaches, and there are all kinds of perturbations. Talk to anybody who's invested their career in a system or a locus — they're doing every kind of perturbation you can imagine and learning all kinds of things. So it's not just high-throughput enhancer assays; it's large-scale genetic engineering for loss of function, and other things. I think Will's going to talk some about that later. This is pressing the envelope of the NHGRI portfolio, but I think we have to embrace these. And Laurie's going to talk a lot more about this.

Why look at lots of different perturbations? Because we are working with a fairly limited vocabulary of regulatory elements, and you can tell there's much more heterogeneity from all kinds of evidence. Here are just two different assays for fairly sophisticated predictions of enhancers: in both assays some things work well, some don't, and there's a large dynamic range on these different axes. That tells you heterogeneity. If you dig into these enhancers, you see diverse combinations of transcription factors. If you take an unsupervised learning approach to find chromatin states, you don't come up with two, and you don't come up with four. In this particular paper we had 25, because we said that's the number we'll work with — it's bigger than that. So: expanding the vocabulary, and I hope we hear more about that. We know some popular assays right now, but there's no reason to think we've come anywhere close to exhausting the possibilities for functional assays. I would love to see periodic calls for proposals. NHGRI can come in as that catalyst, that facilitator: take the good assays and get them genome-wide. Then we can start to put large amounts of data together in a way that lets us learn systems better and also find things we didn't expect.

Evolution has come up a few times, so let me just emphasize: you always get a lot of power from interpreting comparisons. Many of us grew up in genomics lining up DNA sequences, and if you see motifs that are deeply conserved, it really does mean something — it tells you something important. You can interpret those alignments in evolutionary terms, and you can do the same thing for epigenomes. Comparative epigenomics really does tell you a lot. You can take the same transcription factor at the same locus in mouse and human, and the signals actually do line up quite well — they line up better than the underlying sequences do. One of the things that came out of the mouse ENCODE comparison was that there are categories of functional evolution revealed by this comparative epigenomics.
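As a hedged sketch of the kind of comparison involved — with hypothetical arrays standing in for real signal tracks projected through a whole-genome alignment, not the mouse ENCODE code itself — you can correlate a factor's occupancy over orthologous bins in the two species:

```python
# Sketch: concordance of a transcription factor's binding signal across
# orthologous bins in human and mouse. The inputs are hypothetical
# stand-ins for real signal tracks mapped through a genome alignment.
import numpy as np

def signal_concordance(human_signal, mouse_signal):
    """Pearson correlation of log-scaled signal over orthologous bins."""
    h = np.log1p(np.asarray(human_signal, dtype=float))
    m = np.log1p(np.asarray(mouse_signal, dtype=float))
    return float(np.corrcoef(h, m)[0, 1])

# Toy example: an orthologous locus with conserved occupancy.
human = [120, 5, 300, 8, 45, 2]
mouse = [100, 9, 250, 4, 60, 3]
print(signal_concordance(human, mouse))  # close to 1 when occupancy is conserved
```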
When I was younger, I really hoped that all functional regions would look like this: if you see a factor bound in, say, erythroblasts in mouse, then the orthologous sequence in human would have a similar signal. That does happen — a small minority of the time — and it marks some profoundly important regulatory elements. Work from Mike Snyder's lab and our lab showed that the function of these elements is not restricted to red cells: they're pleiotropic, they function in many other tissues. Then there's lineage specificity. That's not nature telling you an element isn't important; it's just nature telling you it's lineage specific. And we also see a lot of evidence — there's a beautiful paper from John's lab about this, and we covered it in this paper — that even if you don't see that feature in mouse, and you don't see it in that tissue in human, you can see that piece of DNA being used in another tissue. There's a lot of turnover in regulatory regions, but it's the same pieces of DNA being used over and over again. And maybe, as we fill in these matrices better and better, the dimensions will reduce — maybe this reuse could actually simplify what we're doing. But anyway, evolutionary comparison of both the genomes and the epigenomes is important.

What future studies could be envisaged if not limited by technology? This is really fun to think about, and I'm going to stick with one because it is so beautiful. If this is really happening, we've got to understand it better — I know it's happening, but we need to understand the mechanisms better: directed movement of genes in the nucleus. You really do see, during activation, genes moving from inactive chromatin territories to active ones. Some people call those places with abundant RNA polymerase II "transcription factories"; other people don't like that term, but there is this movement. You can also see co-localization of actively transcribed genes. This is some work from Peter Fraser's group: in all these pictures, the alpha-globin complex is in red and another erythroid locus is in green, and every time you see yellow, that's co-localization. There are other studies that show the dynamics of that movement. And here's another picture from Peter's work, just to give you this image: it may not be so much gene activation by recruitment of all of these factors, but rather these factors bringing your locus to where it can be highly transcribed.

So what directs that? Are there molecular locomotives? Is that part of the vocabulary we need to be thinking about? Is that what one type of enhancer does? Are there tracks that the gene follows to get from one place to another? We've got a lot of unexplained factor binding — maybe that's it. What determines how long you stay in that active zone? That's an old model, but a good one, for what some enhancers might be doing: just holding the polymerase there. And how much of this directed movement is actually generating all of this non-coding RNA that we don't know how to explain? That's something to try to get at if technology isn't a limitation — I don't have to come up with a way to do it, I just have to say it would be nice to know. And there are lots of other functional elements.
And there are some things we just don't know — I know I don't know these. Replication machinery: happily, the replication timing domains are coming out, and there's beautiful data people have been showing. But I don't think we know where the origins are, or what sequences determine them. We haven't said much about how cells remember what they were as they go through mitosis. You can certainly map factor binding through mitosis; you see some things that are stable through mitosis, and in this case, accessibility changes during mitosis. What's going on with those? What determines them? Is this a special type of bound site? I think it probably is, and I think it's saying, hey, we've got to turn these sites on really early. Recombination hotspots — I don't know nearly as much as I'd like to about those, or about matrix attachment regions, or others. There are lots of other features to try to get at.

Now, transformative technologies — this is a new question I was asked. Here I'm just going to join the chorus and say: yes, mapping binding profiles for very large numbers of transcription factors. If there's a better way to do it — if antibodies can do it, that's fine; I just don't think they're going to get us there — there are lots of approaches you can think of that are more efficient and will extrapolate to many different loci. And going down to smaller numbers of cells: to be able to look at multilineage progenitors is fantastic. But I do want to emphasize one point, because as you go to smaller numbers of cells, eventually you get down to one, right? When you go down to single cells, you haven't just got a new technology; you've got a revolution in your thinking. Because now you can really see the heterogeneity. This is just a picture from work from Tariq Enver — it's not genome-wide or anything, but it gives you the picture. Where you see colors here and there against the gray, that means just one of many cells was expressing. When I do these experiments, I'm always looking at a population. Heterogeneity, stochastic events — those binary decision trees we draw for differentiation could be all wrong. As an ensemble they're correct, but for following the path of any one cell they may not be so useful. So I want to emphasize that: this is a revolution in the way you think about things.

Next transformative technology: visualization. Everybody said we need better visualization, but nobody's shown it to us. So just picture this: instead of going through your browser page after page and trying to remember what each locus looked like, imagine this room is your genome. Every table is a chromosome, and you all are features. I can zoom around and really start to get a fuller understanding, actually look whole-genome, and let the brain do the work — your brain is really good at pattern finding, way better than some machine learning approaches. Money that brought in some new visualizations would be well invested. I want to go into a virtual-reality viewing environment and really dig into it.

All right, prioritizing needs. It's NIH, so you've got to go with disease relevance.
But man, if you can, it's new biological insights that drive things, and I like projects that dig into newly fertile ground on the enduring questions. That's what I would prioritize. OK, I've got 38 seconds here; let me get to what I still have to say. I've got two things.

First: what's needed to make the new data interoperable? It's all data coordination. We have to have all of these aspects, and we've heard them before: rapid data release, expert curation, uniform data processing, easy access for everybody. This is the heart of anything. Whatever you call the next phase, this is it — you've got to have a really strong DCC. But you don't have to do everything in a top-down, managed way; you can do it with community-driven projects, and it all still feeds into the DCC.

And this is the last point I wanted to make. This is roughly the current structure of ENCODE: you have a small number of data production centers, with the economies of scale and all that, and they really pump out the data, and the DCC keeps it. The blue background means there's a lot of crosstalk in here, a lot of coordination, and quarterly reports. That's one model. You could instead have a consortium that's less tightly coordinated: the systems get chosen by investigators proposing and reviewers evaluating, and there's a much larger number of production centers. They still have to adhere to the standards — they can help develop them, and the standards can evolve — it still all goes through the DCC, and you have to have good interfaces for the users. The lighter background means less managed; annual reports, perhaps.

OK, how do you know when you're complete? We can do predictive modeling. And I'm out of time — I might be run out of town after I say this — but you need a metric for when you're complete. Whether you go with the accuracy of predictive modeling or something else, there needs to be a measure. I just don't think the brute-force approach is going to work. Oh, yes, and one last point: once you know enough to identify the gaps, sometimes you find that a gap is actually filled by something very valuable. This is the ridge front in the Poconos that the Delaware River cut through a long, long time ago, and it is absolutely gorgeous. So gaps can be good. OK, I'm going to stop now. Now, Carol — do you want me to take some questions now, or wait? OK.

Can I ask a question about the structures? OK, good. All right. So I actually like your two structures, and I'm thinking more about the less controlled one. But could you think of some intermediate? How would you make sure the data can be managed by the DCC, and that the data quality holds up, so that even if the systems aren't coordinated, at the end the assays are at least comparable?

Right. So let's say there was a call for proposals, and it really followed the structure I put over there on the right. All the investigators would still have to buy into the fact that there are going to be data standards. Data standard number one is that there will be quality metrics applied to all of these data. Exactly what the threshold is for bringing data in could probably evolve. But let me also say, there have been some studies of this: community-generated data may have been predominantly poor early on, but it isn't anymore. There's a lot of really good stuff out there. And we need the numbers.
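As a minimal sketch of such a quality gate — with hypothetical metric names and thresholds, since the real standards would be set, and evolved, by the consortium — a submission would only be accepted if every metric clears its cutoff:

```python
# Minimal sketch of a DCC-style quality gate for submitted datasets.
# Metric names and cutoffs here are hypothetical placeholders; the
# real standards would be set, and evolved, by the consortium.
THRESHOLDS = {
    "read_depth": 20_000_000,       # minimum usable reads (assumed)
    "replicate_concordance": 0.80,  # e.g. correlation between replicates (assumed)
    "signal_to_noise": 5.0,         # e.g. enrichment over background (assumed)
}

def passes_quality_gate(metrics):
    """Accept a submission only if every metric meets its threshold."""
    failures = [name for name, cutoff in THRESHOLDS.items()
                if metrics.get(name, 0) < cutoff]
    return (len(failures) == 0), failures

ok, failed = passes_quality_gate(
    {"read_depth": 35_000_000, "replicate_concordance": 0.9, "signal_to_noise": 3.2})
print(ok, failed)  # False ['signal_to_noise']
```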
And the numbers can be — well, the metrics are established. Yeah. OK.

A truly quick, simple question: how many of the ENCODE cell lines have a good-quality whole genome sequenced? How many ENCODE cell lines have a high-quality genome sequence?

Thank you. I knew somebody knew the answer.

About four or five of them. They're messy — I mean, these are the cancer lines, MCF-7 and K562.

Did you have a follow-up to that? You must have had something in mind.

I was going to say, now that we have the X Ten and relatively cheap sequencing, would it be worthwhile to get the genomes of at least the cells that have the most assays?

OK, yes. So that would be another type of data. We agreed to that five years ago; we just haven't finished it.

Yeah, this is Mark. I just want to second that — people probably know I've always repeated this. Given the investment we have in the functional assays, it would be so valuable to have proper whole-genome sequences for some of those core cell lines. There's some history — we started doing it with the big centers and had to stop — but the history doesn't matter. I think ENCODE should do this this year. It could be done very quickly. I volunteer.

Great. So if you have questions for Ross, hold them; we'll do them at the end. And Laurie is going to talk next.