 So, our next speaker is Joe Ecker from the Salk Institute, and he's going to give an overview of the ENCODE PIs' vision for functional genomics. You've received a copy of this document by email and also in your folder. Thanks, Joe. Yeah. So, as Elise said, this is a document that's been in the works for a few months and not actually seen by all PIs yet, so it's somewhat of a work in progress. I wanted to acknowledge Brad Bernstein, David Gifford, Mike Snyder, John Stamatoyannopoulos, and Barbara Wold for their input, so this has already been discussed. It's part of our document as well, describing what the accomplishments were. I don't think this was pulled out by Mike, but it actually, I think, has a huge impact: training students and fellows, attending the phone calls every other week of the analysis group or whatever group you're part of, I think is real value. They get to hear the discussion, the criticisms, the data presentations. It's really a different, new generation of students, a new wave of doing science. Actually, I think it's a really important aspect of the project. As was mentioned, Mike went through the various ENCODE phases. The matrix is enormous, right? If you think about all cells, all assays, the matrix can be very large. There are some obvious things. I think the limitation in what we'll be able to do will be largely access and reagent based, obviously, access to certain cell types, but the community will have an input in terms of development of assays that reduce the numbers of cells, which I think could change the matrix. We can't fixate necessarily on what's possible to do now, but maybe on what might be possible with additional technology development. Really we have, I think, so far a catalog, not necessarily an encyclopedia. The definition here of a catalog is a collection of items, pictures, et cetera. We have maps, right? We have different tissues. We have maps. We also have some integration of that and the attempt to generate knowledge. But the encyclopedia, I think, and this is also represented in the document, is really a deeper level of information about a particular element. To get to that level, you need function. You really need to assess the function of some of these elements, and that's part of where I think the PIs feel the project should go. These are the challenges: despite all the progress that you've heard about, the task of identifying all the functional elements is really not fulfilled yet. There are more experiments to do; we actually haven't plateaued in finding different biochemical regions of interest, let alone assessing their function. There's an unexpected, I would say relatively unexpected, greater degree of diversity in the numbers and the molecular signatures that have been identified, and so that poses an enormous challenge for the next phase: to try and understand how to attack and prioritize which of these biochemical activities to address with function. And already you've heard that it has been useful, and will continue to be useful, when associated with genetic information. So the next phase of this project must leverage and integrate emerging technologies. You've heard about some, reducing the amount of sample required or imaging kinds of things, potentially, that you could take advantage of. 
 In any year you can't do everything, but to attack, for example, cell differentiation in a biological context: the time dimension so far has really been missing from the project, and that's certainly recognized by the PIs as an area that will be of interest. So perturbation and, to some extent, cell differentiation will potentially give us new ways of correlating element requirement and biochemical activity. So high throughput approaches for mapping genomic features are going to be complemented by new tools that will allow us to assess function, so genome engineering and systematic functional perturbation are on the radar of the PIs for the possible next phase. So essentially four layers of information for this ENCODE 2020 vision, from elements to function, are articulated. Layer one is completing the catalog of the elements. Layer two is connecting elements with their cognate genes; there's some fine-grained information in here, and you could lump some of these things together differently, but I think the PIs felt that connecting elements with their cognate genes is a challenge in and of itself that warranted being not just part of the catalog but actually another unique layer of information. Layer three is transforming this catalog of elements into the encyclopedia, where you actually have knowledge about a particular element in a particular cell type and its function in terms of regulating the expression of the gene, the elongation of the polymerase, et cetera, whatever activity it has, splicing, assisting other kinds of biochemical activities; that really will make this catalog into an encyclopedia. And then layer four is really to take this to yet another higher level and begin to address, and I think Rick and Mark will discuss some of this because it's more related to genetic variation, associating the variants with these elements and the impact on individual phenotypes and disease; that's another layer that we think is going to be essential. So completing the catalog of elements is going to require new cell and tissue types. I think if you look at the textbooks and the recent literature, you know, there are about 400 cell types recognized, but if you look at complex tissues like brain and you start to dissect those apart, which I'll briefly mention, there are likely to be many more. So the number is probably much higher than that, depending on how you define a cell subtype. New types of elements: the PIs believe, and I think there's reasonable expectation, that what we're pointing to in terms of what kinds of elements exist is based on looking back at the literature and saying there are enhancers, silencers, et cetera, but that doesn't leave any room for new element discovery. So there's lots of potential for discovery of new element types, especially when you layer functional experiments on top of this. And then we really haven't in this project gotten into condition-specific experiments so much, where either developmental programs (I think Aviv had some really nice papers) or transcriptional anticipation, for example, where a factor binds but is not needed until some later stage, will give us more information about elements that you can recognize biochemically but that don't seem to have a function at a particular stage, because you haven't seen the requirement for that element during development or during a stress response or other kind of perturbation. So, some features of what you heard in the links, but more focused on the assays that the ENCODE group is doing. So what's going to be required? A new generation of mapping and discovery tools. 
 So this is likely to help with the problem of cell diversity in terms of the numbers, so new technology should be part of the program. A feature of ENCODE is that the kinds of experiments that are done are typically beyond what an individual laboratory can do, and that's part of the interesting consortium activity. That is, there are many high throughput experiments that are done with high quality and low cost, so that provides the resource aspect to the project. So maintaining, actually increasing, the throughput is also a goal. This addresses what Paul said about cost: try to drive the cost down. The numbers of cells are clearly on the radar. It's challenging for some assays, but there have been improvements in the assays through some of the technology development groups as well as some of the production groups, pushing down the numbers of cells that will allow you to get meaningful information from these different compartments. And then one important aspect is that if assays are developed with lower numbers of cells, we don't want to erode the quality. It was mentioned, I think Dana mentioned, that the quality of the data is very high, and we don't want the drive to get to one cell to come without high quality for that information as well. So although you can say single cell, some of the single cell assays that have been developed don't really cover the genome, or only subsample a fraction of the information more or less randomly in each cell, so we have to be aware of that. The other thing, and this was mentioned also, so I'm sort of repeating things that have already been mentioned, is this idea of having a community-focused, not necessarily ENCODE-focused, data coordination center where any individual from the community can go. I think we're headed there with the DCC we have; it really is a tremendous resource. People I know at the Salk use it all the time, and you can go and find things, search things. But then equally, having a similar resource that integrates the other data from the other groups in a way that allows a seamless sort of interaction among high-level users. Okay, I don't want to have to go to all the different databases; I want to be able to pull down, from this tissue, this time, et cetera, all the raw data or the processed data. So this is another resource that we think would be really useful. And these two resources would cover high-level users down to individual investigators, and really the resources that are here are very high quality, but if the individual laboratory can't utilize them, or even the high-power user can't utilize them in an efficient way, that detracts from some of the value. So this next layer that has been discussed is really trying to associate these elements with their genes, as an important piece of information that we think is a separate activity, related to some of the other things that were mentioned by Mike, for example, but these are high throughput. So these kinds of interactions are genome-wide, long range, short range, and then there's testing some of these long-range interactions once they're identified, testing their functions. So some of the approaches for doing this, or what's been going on: there are better, maybe more computationally efficient, ways to do this, but there's activity correlation, physical interaction, and then functional assays where you perturb the genome; I'll go through each one of these. 
 So activity correlation is associating these biochemical activities with other features like promoter activity, so enhancer and promoter activity, that is, chromatin marks, the methylation on those, or other kinds of biochemical activities, and associating them together to identify cell-type- or tissue-type-selective events, where this particular enhancer has a biochemical activity in this cell, and so does an associated promoter, but not in that cell. So that's the kind of thing that ENCODE has been doing, and these sorts of maps are activity correlation maps built with different assays. The physical interactions are also beginning to emerge; they're certainly used in individual laboratories and are beginning to be included in some high throughput approaches, where we're directly linking promoter and distal elements. Some of the approaches are here. That doesn't mean we are limited to these assays; new assays that may be equally efficient or higher resolution might be possible, and understanding which of these assays in which conditions are going to be useful is also a potential technology development area. And this just shows an example of the kind of assay that can be carried out, though it really hasn't yet been implemented in high throughput: a Hi-C experiment where you're looking at domains of interaction across the chromosome, and this allows you to begin to associate long-distance regulatory elements; that's been useful already in some disease-focused applications. So there's also knockout, so obviously reverse genetics focused on elements; the functionality of elements is going to be very powerful, particularly with Cas9, CRISPR, or other ways of mutating elements, and this is being done in individual laboratories but really hasn't yet been implemented by ENCODE or, as far as I know, other large consortia, to test, for example, the requirements of motifs, either by altering a motif with a specific mutational event or by deleting an enhancer completely. This will allow us to begin to associate some of the biochemical activities with function. Where that function shows up in the genome is going to be challenging, because that element might affect the activity of polymerase, whether it's stalled or not, or what the initiation rate is, et cetera. So having assays that are downstream of these functional events, deletion events, is going to be important. So this is an experiment obviously that you can do in mouse: you can create crosses and identify deletions, you can create deletion events that are on one chromosome, so you have an enhancer associated with a particular SNP, by mutation or deletion, whatever the case. You can have haplotypes that are derived from individuals that have SNPs associated with promoters, and this can be done either by looking at natural populations, in mouse or people, or by creating them, and I think the point of where ENCODE wants to go is to begin to test these in cell models. So what kinds of models? The challenges that exist for trying to do these kinds of things start with where to look. If you delete an element, what will be the effect of that change? Is it going to be on transcription initiation, elongation, splicing, et cetera? So assays that are associated with, as I mentioned already, these kinds of functional events need to be created. 
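To make the activity-correlation approach described a moment ago a bit more concrete, here is a minimal sketch, with entirely hypothetical data and function names (this is not an ENCODE pipeline): a distal element is tentatively linked to a gene when the element's activity signal and the promoter's activity co-vary across many cell or tissue types.

```python
# Minimal sketch of "activity correlation": link a distal element to a gene
# whose promoter activity co-varies with the element across cell types.
# All data, names, and thresholds here are hypothetical.
import numpy as np

def correlate_element_to_promoters(element_signal, promoter_signals, min_r=0.7):
    """element_signal: element activity (e.g., DNase or H3K27ac) across N cell types.
    promoter_signals: dict of gene -> promoter activity across the same N cell types.
    Returns (gene, r) pairs with Pearson correlation >= min_r, best first."""
    hits = []
    for gene, prom in promoter_signals.items():
        r = np.corrcoef(element_signal, prom)[0, 1]
        if r >= min_r:
            hits.append((gene, r))
    return sorted(hits, key=lambda pair: -pair[1])

# Toy example: an enhancer active only in cell types 3 and 5 tracks GeneB.
enhancer = np.array([0.1, 0.2, 5.0, 0.3, 4.8])
promoters = {
    "GeneA": np.array([2.0, 2.1, 2.0, 1.9, 2.2]),  # constitutive, uncorrelated
    "GeneB": np.array([0.5, 0.4, 6.1, 0.6, 5.7]),  # co-varies with the enhancer
}
print(correlate_element_to_promoters(enhancer, promoters))
```

The same idea scales to genome-wide maps by repeating the correlation for every element against every promoter within some distance window, with appropriate multiple-testing control.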
 The likelihood that these events, you know, the deletion of a particular enhancer in one cell type, will have a phenotype may be low, because we are now beginning to realize there is a very strong cellular and genomic context for each of these elements. So deleting the element and looking in the appropriate biological setting for the requirement of that element is going to be key. So that choice of which regulatory element and which biological context is going to be critical, especially considering we would like to do this in high throughput. And then there are other kinds of events, epigenetic events, that might not be so obvious, where the presence of the element, as I already mentioned, could be a priming event required later, or a memory event, and I will give an example of this. So, cellular context as an example. I'll give an example from my group: we have been purifying cell populations from mouse frontal cortex using an affinity tag, the nuclear membrane tag developed by Jeremy Nathans' group, using the method Steve Henikoff's group developed, called INTACT, originally developed in Arabidopsis, to purify different cell populations. So we've purified and looked at epigenetic marks in two different inhibitory cell populations and in excitatory cell populations. And it really clarifies what's going on relative to, for example, the whole cortex. So we're just looking at an example of the data here, where you have whole cortex and you see some RNA expression of this particular GAT1 gene. If you then look at the three purified cell populations, in replicate here, excitatory, PV inhibitory, and VIP inhibitory neurons, you can see it's much more highly expressed. When you look at the methylation patterns, you can see that the inhibitory cell populations have a real signature here in the DNA methylation pattern, in the CG and also the non-CG context, that really doesn't exist in the cortex or even in purified all-neuron, NeuN-positive purified neurons. And you can see the same thing over here for highly specific VIP inhibitory expression and its correlation with the methylation patterns in the CG and non-CG contexts. So when you really identify, in fact, you can look at differentially methylated regions in these: if you just align them, looking at whole cortex and then at the sub-fractionated cell types, you can see patterns that are completely invisible in the whole cortex, obviously, or even in purified neuron populations. So cell-context-specific epigenetic marks really, for us, have shone a light on exactly what was happening in some of these regions, which we could never get from any of the other, less purified, populations. And in fact, even in these populations, which are of thousands of cells, you can see that there are subtypes in there when you actually go down and look at the reads. So there are also elements that indicate that there is a memory. For example, and I'll give another example from our lab (I don't have an example of priming, but there is a memory state), you can look at these regions that Bing Ren's group and our group defined, called DNA methylation valleys, or DMVs. They're very large, roughly 15-kilobase regions that are absent of DNA methylation, and they don't always overlap with CpG islands. And so you'll have a large region that, in the adult, then collapses in terms of the length of the region that's not methylated. And you see, so there's hyper-methylation. 
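As a toy illustration of why the purified populations matter for the methylation signatures just described, the following sketch (all numbers invented) shows how a whole-cortex measurement is roughly a mixture-weighted average of the underlying cell-type levels, so a fully unmethylated state in a minority inhibitory population can be nearly invisible in bulk.

```python
# Toy sketch (invented numbers): bulk tissue methylation is approximately a
# mixture-weighted average of the constituent cell types, so a signature that
# is dramatic in a purified minority population can be washed out in bulk.
import numpy as np

def weighted_mC_level(methylated_reads, total_reads):
    """Weighted methylation level over a region: sum(mC reads) / sum(all reads)."""
    return methylated_reads.sum() / total_reads.sum()

# Per-CpG read counts over one candidate region for three purified populations.
populations = {
    "excitatory":     (np.array([18, 20, 17]), np.array([20, 22, 20])),  # ~0.89 mCG
    "PV_inhibitory":  (np.array([1, 0, 2]),    np.array([20, 21, 19])),  # ~0.05 mCG
    "VIP_inhibitory": (np.array([2, 1, 1]),    np.array([18, 20, 22])),  # ~0.07 mCG
}
levels = {ct: weighted_mC_level(m, t) for ct, (m, t) in populations.items()}

# Whole cortex modeled as ~80% excitatory, 10% PV, 10% VIP: the bulk value sits
# near the excitatory level and hides the unmethylated inhibitory signature.
bulk = (0.8 * levels["excitatory"]
        + 0.1 * levels["PV_inhibitory"]
        + 0.1 * levels["VIP_inhibitory"])
print({ct: round(v, 2) for ct, v in levels.items()}, "bulk ~", round(bulk, 2))
```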
 And so that hyper-methylation is something where you say, well, that's quite interesting, what is that? And if you actually look then in the cell populations, I'll give an example over here for this particular transcription factor: it shows this hyper-methylation pattern in the PV inhibitory cells, in both the CG and non-CG contexts. And if you look at the literature for that transcription factor, what you find is that the precursors of the PV cells, which start in the medial ganglionic eminence and migrate out, have to express that transcription factor for a brief period of time, otherwise they end up in the striatum. And so we believe this hyper-methylation is a signature of when the gene had to be expressed, and then the methyltransferase had access to it after it was silenced. And you can go through, in fact, all of these differentially methylated valleys that have collapsed and find there's a very high enrichment for transcription factors that are involved in early development. And you can look at this list, go to the literature, find where there have been experiments that have looked at lineage tracing, and see that there's a memory of the expression of many of these factors early in development. So we can see these events, and sometimes they're predictive of what might happen, a sort of transcriptional anticipation, or they're a memory of things that happened in the past. So connecting these elements with their cognate genes will require the development of novel assays at large scale. I mentioned some that already exist, but additional assays are going to be needed to get even higher resolution for elements that are closer to promoters. Systematic experimental perturbations are going to be needed to test the function of some of these elements, whether it's anticipation or going back and testing some of these events earlier in development. These kinds of approaches are going to need more integration; integrating the function with these assays is going to require additional computational approaches, which haven't really been developed for these high-throughput deletion kinds of assays at this scale. And the challenge will obviously be how you get there, or whether it is possible to develop assays to look at function at this scale. And we believe this type of effort, the development and implementation of these functional assays, might be one area where ENCODE could make an impact, just as it has in the past in other areas. Some of the additional areas of interest are really transforming, now, the catalog of elements into an encyclopedia. And some of the ways to do that are going to require classifying these elements as enhancers, silencers, or other, whatever that might be, based on their function. So we want to know not only where the element is, but what it's doing and how it's doing it, to some extent, and identify all of these kinds of functional categories beyond what we have our prejudices about. So there may be elements, for example, that interact with other parts of the genome, that is, not just in cis but also in trans. Those have been very difficult to identify, and some of these kinds of approaches may allow you to do that. So that brings in the idea of really trying to interrogate the function of elements, the enhancers, silencers, elements that partition the genome, et cetera, with other kinds of high throughput assays. And some of these have already been developed, and there's a reference here. 
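One example of the kind of already-developed high-throughput functional assay alluded to here is a massively parallel reporter assay, where candidate elements drive barcoded reporters and activity is read out as RNA versus input-DNA barcode counts. The scoring sketch below is hypothetical and simplified; the element names, counts, and the normalization to a scrambled control are illustrative, not any group's actual analysis.

```python
# Hypothetical, simplified scoring for a massively parallel reporter assay:
# each candidate element drives a barcoded reporter, and activity is the
# log2 ratio of RNA to input-DNA barcode counts, normalized to a negative
# control. Names and counts are invented for illustration.
import math

def mpra_activity(rna_counts, dna_counts, ctrl="scrambled_ctrl", pseudocount=1.0):
    """Return per-element log2((RNA/DNA) / (RNA_ctrl/DNA_ctrl))."""
    def ratio(element):
        return (rna_counts.get(element, 0) + pseudocount) / (dna_counts[element] + pseudocount)
    ctrl_ratio = ratio(ctrl)
    return {e: math.log2(ratio(e) / ctrl_ratio) for e in dna_counts}

# Toy counts: element_A behaves like an active enhancer, element_B is inert.
dna = {"element_A": 1000, "element_B": 1200, "scrambled_ctrl": 1100}
rna = {"element_A": 5200, "element_B": 1150, "scrambled_ctrl": 1000}
print(mpra_activity(rna, dna))  # element_A scores ~2.5, element_B ~0
```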
 Others could be developed that could test some of these functions. And ultimately what we want to do is understand that this element functions in this context: what is the grammar of that? This somewhat overlaps with the computational program that has been initiated by NHGRI, but these would be experimental tests whose data the computational groups could access. That is, experimentally testing the grammar in different assay systems and then allowing the computational groups to also work on that data. And then finally, and this will probably be addressed more by Rick and Mark, is beginning to apply this information to the biology of disease, by taking the variants that are known and beginning to relate them to both the biochemical activities and these new functional assays, to come up with some level of confidence that a particular SNP is going to be useful to explore further. So, providing some high-confidence estimates of the causality of a particular SNP based on these additional assays and cell types as well as the functional assays that will be developed. So we think that ENCODE is positioned to make an enabling contribution in this area of functional genomics, that the high throughput approaches, biochemical and otherwise, that are being carried out will be complemented by new high throughput genome engineering approaches and perturbation systems, and that the coordinated action of the consortium and the generation of this kind of data at large scale will have an impact equal to the impact that, for example, Mike described for the last phases of ENCODE. So I'll stop there and take any questions that you might have. Yes, Eric? Joe, I enjoyed that. This is a council question, I think. You gave these layers one, two, three, four in a very logical order and a scientific progression. But now if you were forced to prioritize them: would building the encyclopedia with the existing catalog be more important than finding holes in the existing catalog and filling it with new data slash cell types? Which of those two do you think is the highest priority for NHGRI? Well, I don't know if I want to, I'd put myself in conflict maybe if I answer that. So I think we haven't fully fleshed out the catalog, that's clear. But we do need to begin to interpret what these elements are doing, right? And I think that was probably one of the biggest criticisms that ENCODE got; I don't think it was ready for prime time, basically, to begin to do that. We need to find the elements and assess their function. So I think those things have to go hand in hand. The challenge will be accessing the tissue, some of it. So I think it can go on in parallel. In fact, I think different groups have different expertise, and I think that it doesn't have to be either/or. I think that the technology groups working hand in hand with some of the production groups, maybe in a closer way than has been done in the past, might allow those to merge before the end of the project. Ewan? In the association of genes with their elements, I was surprised not to see imaging. Was that very deliberate? Is that because it's happening somewhere else? Or do you feel that technology is not really ready for this scale? I mean, was it just an omission? No, it wasn't an omission. There is, so the 4D Nucleome program will include that kind of imaging. 
 I think it might be a little early days for the high throughput nature of the kinds of activities that ENCODE is going to carry out. It's possible that it could come along in the next year. But I think it's more of a development phase, so it could be a technology development area. Several RFAs have gone out on this. I think there was one that was part of Roadmap that was for imaging of epigenetic variation in sort of real time. I think it's one of the most exciting areas, frankly. But I'm not sure it's necessarily implementable immediately, within the next year or so, the time frame that we're thinking about. Jay? So, I was trying to understand the limitation about single cells, right? So your argument for stopping at, I don't remember what the lower bound was there, but was it that with current single cell data you get a sparse but rather random representation of what's going on within a single cell? For some of the assays, yes; others, no. So I don't mean to say that it shouldn't be part of the project. I guess, I guess that if you want to have the kind of comprehensive set. So my point is that it's not a limitation, right? It's actually a strength, because you can just aggregate cells that are of the same type and get the same data that you would have from each of the subtypes. Yeah. And nonetheless also get the breakdown, right? Yeah, right. Well, depending on what you're, if you're going to recover 5%, then you have to do, you know, 20 cells. So sure, but that's fine, right? Because it's the same amount of sequencing. Yeah. You sparsely cover each cell. And, well, if you want to attack the problem of, you know. I should let, if you'd like to speak here. Yeah, go ahead, sorry. I would just say that, at a first level of approximation, that's exactly what happens in a bulk assay, except that you just aggregate the signal. So at the very least, in an ideal world, every read would be tagged for the cell it came from. Some of the assays are at, right, 500 or 1,000 cells, or they haven't reached 5,000. So you'd have to do 5,000 barcodes. If the sample prep cost is taken away, then sequencing-wise it's exactly the same cost. Yeah, I agree. I'm not arguing against single cell. It's just that it isn't there at the current cost of what these assays are, at least currently. Yeah, I think the critical thing is to understand it's not the sparsity that is the issue; it's the sample prep. Right, you have to combine, for some of these assays, 5,000 cells is needed, right, at this point, so. Right. Terrific. Can I ask a technical question on that last point? I mean, isn't sparsity an issue when the elements you're looking at are only a kb, or even smaller, or could be smaller? I would think that sparsity is a very big issue. Let me try and clarify. If you take an assay, even an incredibly noisy one, at the single cell level, and you aggregate the signal, you get back the signal that you get from the bulk. Well, that means that the sparsity was there all along; it was there in the bulk assay as well. It's just hidden by the aggregation. So the question is, do you aggregate post hoc or pre? That's all. But there are some biases, actually. 
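A toy simulation of the point being debated here, with invented numbers: reads from sparsely covered single cells, each tagged with its cell barcode, sum back to essentially the bulk profile, so the aggregation simply happens post hoc rather than in the tube (this sketch deliberately ignores the sample-prep biases raised next).

```python
# Toy simulation (invented numbers): sparse single-cell coverage, tagged with
# per-cell barcodes, aggregates back to essentially the bulk profile. The only
# difference modeled here is whether aggregation happens post hoc or in the
# tube; sample-prep biases are intentionally left out.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_bins = 5000, 200
true_profile = np.abs(np.sin(np.linspace(0, 6, n_bins))) + 0.1   # underlying signal

# Each cell covers only ~5% of bins, roughly at random, as if a handful of
# reads per cell were tagged with that cell's barcode.
per_cell = []
for _ in range(n_cells):
    covered = rng.random(n_bins) < 0.05
    per_cell.append(np.where(covered, rng.poisson(true_profile), 0))

aggregated = np.sum(per_cell, axis=0)                      # post hoc aggregation
bulk_like = rng.poisson(true_profile * n_cells * 0.05)     # what a bulk assay sees
print(round(np.corrcoef(aggregated, bulk_like)[0, 1], 3))  # close to 1
```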
 If you look at the data, for example, for single cell methylation data, it's always biased towards specific regions, over and over again in that cell. There's some bias relative to the bulk. No, no, you don't. When you have the bulk, you get a much better representation. There are some stochastic events that happen at that early stage such that you get the same thing over and over again. So it's not like you're just sub-sampling; it's biased sampling in those regions. Yeah, because the material, the bulk, in preparing the material, it's the bias, yeah. Well, anyway, if you look at some of that data, you'll see what I mean. Yeah. Yes, I think the proposal is great. But I think we all agree that it has to move from correlation to functional validation. And you listed in the very beginning a lot of technology development that is needed. I think you didn't specify, but I feel like we need to develop heavily on epigenome editing, in addition to genome editing, because you're going to correlate how the epigenome manipulation relates to the resulting function. And the second comment is, I think obviously we couldn't test all those elements; it's just impossible. So there are two questions. One is, how are those in vitro cell-generated data really going to play a role in the physiological condition in vivo? Second is, how do we bring in the perspective of evolution to really study them, right? So my comment, to hit both questions, is that you have to use model organisms, where you can use evolution to prioritize. I mean, it's very challenging bioinformatically, but if you can identify the top hits that are really well conserved, then that might be the way to attack the problem. I didn't specify if this was all human or mouse or anything; I don't think I had. I love mouse, so that's what I was talking about. And human stem cells. I think that's too vast a model. Our part of the ENCODE project is mouse. So I agree with you, but I did want to make this a little organism agnostic in the epigenetic part. Yes, I agree. I work on epigenetics. Epigenetic editing will be key. And there have been some RFAs, and there is some progress in that area as well; I didn't mean to eliminate that. I just thought, for the purposes of most of the elements that have been interrogated so far, that linking them to the variants is a priority. Yeah. So from my point of view, one of the biggest contributions of ENCODE to date is actually the technology, from the actual tech dev to more refined ideas of best practices and protocols and data analysis methods. And what you proposed here is, in large part, okay, let's look at all the technologies we already have and crank them. But I think the next... No, I'm sorry if I gave you that impression; I said new technologies, not necessarily only existing ones. Can you elaborate on some ideas of what you think the next wave of new technologies should do? No, no; why? That's for the community to decide. If I knew that, I'd put it in a proposal. There are many labs working on technologies that we haven't discussed here, so I think I'll just put it under the category of new technology: send your best ideas. Yeah. Can I ask the flip side, which is also often very useful for engineers and for computational scientists: what are the biggest challenges that you would want to see solved? 
 You know, if a fairy appeared and said you can choose one or two technologies that would be the most impactful for you, which one would it be? Imaging activities in individual cells over the life of an organism. I buy into that. Okay. If there are no other questions, I guess next up.