Yes, so we tried to synthesize the many threads of discussion that occurred over the past two days and set them up on a set of slides. When we went through this, one of the questions we stumbled on early was the question of what our ultimate goal is. Where would we want to be five years from now? From all that we've heard, there were two types of point of view, which I think were laid out yesterday in the talk Joe gave from the ENCODE PIs: the notion of continuing on a courageous path of cataloging, of measuring a lot of components and interactions and entities across ever-increasing numbers, which could be of cell types and conditions and stimulations and time courses and so on, versus the wish to reach a level of understanding about the functionality of some of the elements, with function meaning different things to different people, all the way from a very molecular, mechanistic notion of function to a phenotypic function.

I just wanted to jump in for a second and say, I don't know that we're saying it's an either/or. No, it's not necessarily an either/or, but these are two different dimensions. Yeah, because it was just layer one versus sort of layer two. I agree, they're layer one and layer two, but in a world that's finite, there's more of one or less of the other. These are two different dimensions.

And secondary to this, there was the specific question of how to relate to disease. So even at this level of goal, I don't think we have a strong definition of what that goal is. Various goals were put forth, but they somehow fit on maybe two dimensions: one axis is how much have we cataloged, and the other is, for any entity that we have cataloged, how much do we understand it and its interactions with the other entities.

The other major notion, which mostly evolved during the last couple of days, was this notion of generalizability: as Ross put it, the space of options is huge. And one can use analogies that I think are very convenient. If you think of genetic variation in humans, it's ever changing and shifting and expanding, because not only is it a question of frequencies, but people constantly die and people are constantly born and new mutations constantly arise. So it's an ever-shifting entity, yet we can bound it into something that we feel captures a great deal of what is going on and is quite manageable for many practical purposes. In the case of biological systems, because of the way in which they're organized, the way in which they evolve, and the functions that they perform, systems are used again and again. Cells are not all distinct from each other in radically capricious ways; they're related to each other through different types of shared functionalities, shared lineage, and shared entities. So this naive matrix, while huge, could likely be dimensionality-reduced in a very aggressive way. The only problem is that we actually don't know how to do this. So the question is how we would sample the cell types and the cell states and stimulations and perturbations and the right type of endophenotype related to sequence in a way that would give us the strongest power to generalize, impute, and so on. So we wouldn't have to measure the whole matrix, but we would know most of it even without measuring it.
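As a rough, purely illustrative sketch of that "measure a subset and impute the rest" idea (hypothetical data, rank, and observed fraction; not anything the consortium has adopted), low-rank matrix completion on a simulated cell-state-by-assay matrix might look like this:

```python
# Minimal sketch (simulated data): impute unmeasured entries of a
# cell-state x assay signal matrix by assuming it is approximately low rank.
import numpy as np

rng = np.random.default_rng(0)

# Simulate a "true" low-rank biology matrix: 300 cell states x 50 assays, rank 5.
n_states, n_assays, rank = 300, 50, 5
truth = rng.normal(size=(n_states, rank)) @ rng.normal(size=(rank, n_assays))

# Pretend we could only afford to measure 30% of the entries.
observed = rng.random(truth.shape) < 0.30
measured = np.where(observed, truth, 0.0)

# Hard-impute: alternate between filling missing entries with the current
# low-rank estimate and re-projecting onto the top-k singular subspace.
estimate = measured.copy()
for _ in range(100):
    u, s, vt = np.linalg.svd(estimate, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    estimate = np.where(observed, truth, low_rank)  # keep measured values fixed

rmse_missing = np.sqrt(np.mean((low_rank[~observed] - truth[~observed]) ** 2))
print(f"RMSE on unmeasured entries: {rmse_missing:.3f}")
```

Real genomic data would not be exactly low rank, of course; the point of the sketch is only that shared structure across cell types plus deliberate sampling can, in principle, let you recover much of a matrix you never fully measure.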
So within, I think, these two premises, there were several types of proposals. In terms of the types of proposed research, they fell into three categories, and again, it's not that whenever there are three or four things they are a zero-sum game, one against the other, but they are distinctly different from each other. One is that we could have many cell types with some characterization on them, with the idea that we still really don't know our cell types, and every time we add a cell type, we add a huge amount of information that is incredibly valuable. The second was what seemed to be a strong ambition and aspiration: if we could only one day take a piece of sequence and understand the function that it encodes, at least for gene regulation, at least for something like RNA levels or transcription or one of those endophenotypes. I think this relates to the technical problems of relating regulatory elements to the targets that they control, understanding quantitatively how they act on these targets, and taking a piece of sequence and inferring the expression that it controls in the particular context of interest. All of these would fall under this umbrella. And the third aspect, again related to the previous two, is the notion that we should be vetting the predicted functions of entities with appropriate perturbations, for example with genetics, under relevant stimulations, which could be different environments or different differentiation conditions, and not forgetting natural variation, which could be a great way of getting a lot of bang for the buck. So that's a third bucket or category of the type of research that could be conducted towards the ultimate goal.

There is the specific question of how to relate to disease. There are two types of models for that. The first is to really work with disease samples, either within themselves or compared to healthy samples. The second is to use the genetics of disease, which has already advanced substantially and is constantly advancing, in order to prioritize those variants that one would want to assay, for example for function, and to do it in the relevant cell type if that cell type is actually known and available.

All of these together open the question of what makes a good cell system. This did not come up in the discussion, but it should come up, I think, in this discussion: what makes a good system? While one way is to just advocate for a system you already know to be great, another is to try to abstract away and ask what has made particular systems useful. So obviously there is the issue of accessibility; one would want something that you can actually do something with. There are great cells that are hidden in early developmental transitions that we'll never put our hands on, or not in the near future, and so it's not clear that we can work with them. The ability to manipulate them, if we need to do manipulations; the ability to actually define them, because again, we need to actually hold them in our hands. If we can stimulate them or differentiate them in ways that we already know, then we can build on this knowledge so that our biology is relevant. We ideally would want some that are related to diseases, and, as I put there, add criteria here. I don't think we discussed it in the last two days, but it's an important item to discuss.

There is the question of what are the most informative assays or data. We heard pitches for different types of readouts.
For example, on the side of components, we've heard multiple times about RNAs, and there are different types of RNAs; mRNA is just one particular type. Should it be the one we focus on? A new, renewed focus on proteins, the value of chromatin marks, and so on; I should have added dot, dot, dot, many others there. There is the issue of molecular interactions. Transcription factors binding to their targets raised huge enthusiasm when, for a moment, we believed we had 800 validated antibodies, and then they were gone. But they will come. There are issues of the 3D organization of the genome. I would say one and two, components and molecular interactions, are things that are really in the wheelhouse of genomics. Imaging is not, but it seems to be a great emerging opportunity, including imaging coupled to sequencing-based readouts; that is something we are starting to see appear in the community, and it's an area to keep an eye on. And then there's the issue of what functional assays get coupled to what types of readouts. We had, I think, a vigorous discussion around reporter assays. We've heard how the yeast folks use readouts in effective ways, but there are ways in which yeast and mammals are more similar to each other than we sometimes think, and there are ways in which they are less so. So I think the readout is actually a difficult problem. There is the question of what's the right perturbation to do and how to conduct these perturbations, and we've heard everything from CRISPRing, to doing things in reporters, to putting things in the endogenous locus or not, all the flavors of genetics, but also things that are not genetic at all. And then I think there's a varying degree of tolerance to error, false negatives and false positives, around functional assays, in this area of what is acceptable in a primary screen versus a secondary screen versus a vetted result reported in a paper. It's something that maybe the folks in this room don't particularly spend a lot of time with, but there are communities that spend a lot of time on this particular question. And then, specifically on the question of perturbation, people raised the issue of physiological conditions, using more stimuli, more differentiation, and specifically more time courses. People repeatedly alluded to natural genetic variation increasing our power to detect important mechanisms, and to engineered variation, mostly through genetics, although at least one person referred to degrons, so it's not always going into the genome to do its thing.

Especially in the last session, I think the question of organization really came to the front. The current ENCODE, from my perspective, and I actually put in that term, I would say is a flagship project. There is a predefined notion of its contours, of what is there, what question is being asked with it, which cell types, which assays, and so on. It is a relatively limited community that is at the core of the activity. There is a good sense of concentration around this goal; that's a flagship. And nice papers come out, great resources, wonderful. I think what really has come up extensively is how one takes this model and maybe leverages it or extends it or combines it with something that is more community-open. One possibility was for the community to provide samples into the pipeline. The more one goes to specialty cell types, the more difficult it would be for a small community to maintain all the expertise needed in those systems, so that's one way.
It's also a great way of training and educating and building, I think, humongous goodwill in communities that don't really get yet what ENCODE gives them. And there's the question of community-provided data, which has to be vetted; there's a role for the DCC in this context. But people are collecting data, and they will only collect more data, to a large extent because of the effect and success of ENCODE. And then there is the question of where in all this lies the role of ENCODE as either the developer of new technologies or the scaler of technologies that are clearly ready for scaling but won't be scaled by the people who developed them. That includes everything from the actual technique to data standards and analytics, and also, again in the community-minded sense, not just being open to the biologist community but also being open to the technology community and shifting to new technologies as they arise within the time span of a project. I don't think that was discussed as much, but it's something that we think should be.

And within this context of community outreach and interaction, oh, something happened to the font, it's very small, I think there are very subtle things that we would have to consider: how to make ENCODE more accessible, how to leverage the training possibilities, how to make incoming new technologies compatible, how to make ourselves compatible with more sample-sparing situations, because you can't always specify conditions that would be very comfortable for a cell line or a big piece of tissue if you want to work with a large community of collaborators or contributors. What is the timeline for improvements? Is it a more or less managed activity? I think there are varying opinions about this. What are the key assays that one would have to keep in order to maintain continuity? Do we continue doing every assay that we've ever started with because it becomes a legacy we can't let go of, or do we switch to new assays? And how do we leverage not just small data sets but actually large-scale data sets that come from other resources? So, yeah, this was it, and the floor is open for questions; I'll bring back the slides if we need to go back to them.

Yeah, so we heard over and over again an imperative for developing new technologies, both methods and analytical technologies, and disseminating those to the community, and I'm not sure if we want to talk about whether it's NHGRI's role or a shared role for making the data accessible to many different communities. Now that may be on you; you've got wranglers and brokers that helped do that, but I want to make sure that those are recorded as things that we talked about and that need to get done. Good question. We agree. Yes.

So, by the way, that's an outstanding summary; I think it clearly appears to be more than the sum of the parts. I'm just going to come back to this last aspect. In introducing, and I forget whether it was Mike and/or Dan, the other sorts of related projects up there, perhaps it's subsumed under that, but beyond the analytical issue that you and Jeff brought up, there's a question of modeling here that I think should be intrinsic to ENCODE. We all know we're talking about interactions, but it's undefined, and so one aspect of this has been, or should be, to try to model and predict.
Otherwise, all of these accurate measurements, trying to reduce false positives, trying to make sure that the false negative rate is low, don't quite make any difference. So I would think that perhaps it's implicit; it has not played a huge role in ENCODE so far, and I can see why so far, but I think it should be a very important part going forward.

So do folks feel that the goal under "encyclopedia," per Joe, or "understanding," should be a predictive model? Has the time come? I think some of that is covered under Genomics of Gene Regulation, right? That has a modeling component and a data generation component related to the modeling. So I don't think it's been ignored; to some extent it's been its own thing. So it depends on what you mean by whether it's addressed. The scientific goal of Genomics of Gene Regulation is to develop approaches to make predictive gene regulation models from genomic data. However, the point of Genomics of Gene Regulation is not to say go through all of the ENCODE data and analyze it. Rather, it's to develop these techniques and do it in the particular systems the applicants proposed. So they could develop methods that could be used on ENCODE data, perhaps by ENCODE people, perhaps by outsiders, but we don't anticipate that as part of their projects they would plow through ENCODE data.

I just wanted to add that when we asked Mike this in the break, we specifically asked him whether this goal of taking a piece of sequence and predicting something, for example about expression, is within the scope of the current GGR projects as planned, and he said that's the aspirational goal, but their expectation when they set up the program, and the projects that I imagine are ongoing right now, does not take that on as defined like this. And that was definitely something that I heard time and again in this audience, including, you know, figures from reviews in that domain, and talk about how the enhancer relates to the sequence and the regulatory code and the regulatory logic and so on. So it's a question also to us: do we think that the time has come for that, or is it indeed premature?

So, right, I've been upstairs from Eric Davidson for 27 years, and I think you could probably, might be able to, redo what he did; ENCODE could do that a lot faster now, but it's a very different thing; it's focusing on one cell, essentially, and its lineage. So I don't think that's the way to go; I don't think there's enough data. I would rather see ENCODE, the factory, the production qualities, be used for data where you want to compare across different samples, right? So every cell type, if there are 400 or 1,000 cell types, then that's where we want the comparison; you're not going to get that out of individual labs. If you want to beat the crap out of an individual regulatory circuit and model it, that's not what ENCODE is going to be good at, sorry.

You know, but I don't think that's the modeling challenge. I mean, Eric Davidson has a very particular flavor of looking at regulatory networks and how the sea urchin embryo, you know, develops. I don't think that's the issue. The issue here is a model, a predictive model in really the statistical sense, that we have to be able to do it. If it's done well, then all of Eric Davidson's things can be tested.
But just as I think I heard a plea, a very well-made argument, from Karen that other kinds of assays will become relevant for ENCODE to become really mature, to understand the basic problem, I guess what I'm trying to argue, perhaps out of my ignorance of genomics of gene regulation, is that the time has come, if it's not here already, for that kind of modeling to be an intrinsic part, however you engineer how it enters ENCODE.

Because we brought it up with enthusiasm, I would say. The conclusion was maybe it's a time issue. What is the right time? To really have a concerted effort on modeling now, or is that a can we kick down the road? And we'd really like to know how people feel about it.

But for what you're talking about, modeling-wise, you know, we don't have dynamic information. Exactly, that's a problem. There are models correlating things, because some models, for example, make predictions of gene expression, so there is some modeling; we're trying to see if you can predict a transcript. So there's some amount of that going on. There's a lot of modeling like what Eric Davidson is doing, which is the temporal, you know, the temporal and systems aspect. Hidden Markov models or linear regressions or not? Yeah, yeah. Either way, models go deeper and deeper and get more accurate. I mean, we constantly discover new things, but I think that modeling needs to be an integral part of ENCODE, because you can't say, "I've collected enough data and now I can model." You need to see where your modeling is, how accurate it is, to figure out what ENCODE needs to collect in order to meet the modeling goal. Obviously, the modeling goal is a long-term goal, and you can't just collect the data you think is needed. You actually need to see: what should I model now, how accurate are my models now, where am I wrong and what am I missing, to figure out what the best assays are in the first place.

John, you said model. I think, just as a matter of principle, you should measure what you can measure easily and model the things that are going to be too numerous. So, for example, we're never going to be able to test all of the variants; clearly, that's something for which we need some kind of a model. We're never going to test every single one of the millions of regulatory elements, because we don't have access to the cellular contexts, but we need a model to learn the rules for how these things are paired up with their target genes. And we should not spend any time, or at least any funding, allowing people to try to model the things that we can measure easily. Okay? What's the point of modeling where, let's say, all the DNase hypersensitive sites are, and getting 50% of the answer, when you can just do a quick experiment and get the full answer? You can learn some things out of it, but as far as the systematic effort goes, again, the emphasis is to partition the measurement from the modeling, with modeling reserved for the things that really need to scale.

So I was just going to say that I do think that there has been some degree of statistical modeling already in ENCODE. I mean, I think one of the big achievements of, say, the 2012 paper was really seeing the degree to which the histone marks and transcription factors could predict gene expression. And it was really quite impressive, I think, to see that.
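As a minimal, hypothetical sketch of that kind of statistical modeling (simulated data, arbitrary feature count, and a simple ridge regression as a stand-in; not the specific models used in the 2012 analyses), predicting expression from promoter chromatin-mark signal could look like this:

```python
# Minimal sketch (simulated data): predict log gene expression from promoter
# chromatin-mark / TF-binding signals using ridge regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

n_genes, n_marks = 5000, 10          # hypothetical: 10 marks/factors per promoter
signal = rng.lognormal(mean=0.0, sigma=1.0, size=(n_genes, n_marks))
true_weights = rng.normal(size=n_marks)
# Simulated "expression": a linear function of log-signal plus noise.
expression = np.log1p(signal) @ true_weights + rng.normal(scale=0.5, size=n_genes)

X_train, X_test, y_train, y_test = train_test_split(
    np.log1p(signal), expression, test_size=0.2, random_state=0
)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```

The point of the sketch is just the workflow, fit a predictive model, hold out data, and report how much variance in expression the measured signals explain, which is the sense of "predictive model" being debated here.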
And I mean, it wasn't at all obvious until you actually saw all that data come together and saw those predictions that you would get that accuracy. And I think, with collecting very large data sets, you sometimes get very powerful statistics from them, and sometimes you don't know exactly what you're going to get. So I do think that there's been some of that, and I don't know if that should be an aim of ENCODE, but I do think that it's useful when you have these very big data sets to see how they interrelate and fit together. It gives you much more confidence that you actually understand the data set when you can sort of see how it all fits together. I also want to point out that one person's model is another person's imputation. I mean, what we're hearing about in relation, for instance, to the transcriptome imputation analyses, that's another type of modeling. All of these are different types of things. Which begs the question: what are we actually modeling in the first place?

Yeah, so just going back to one of the first slides, I very much hope that this is an encyclopedia, right? That's the goal, and not a catalog, and to some extent, you know, understanding function. And just to answer your comment, John, I see what you're saying, but at the same time, one of the reasons you model is to show that you can understand something, right? You can dissect it down to a model, which in turn means that you understand the underlying phenomenon at some greater level than just describing it, right?

No, I appreciate the power and the value of models. But the question is, is this something that is applied at a consortium level, or is this something that is applied at a community level, when quite honestly it's more of a crowdsourced type of answer a lot of the time? Because I think that's where our goal should be, to seed the models in some way. There are some places where models have to be coupled to high-throughput biology, where, you know, it makes sense at a consortium level. I'm not questioning the goal; I'm just asking whether that's something that should be conducted at the consortium level as well. There are also, kind of coming back to Donna's point, places where the choice of experiments is directed by what will most inform the model, right? And I think that should be an important goal here. I think with modeling, the contours are: there is an ambition; we need to understand the realm of possibility; the models are not done one-off, they need to be iterated in a certain way so they can be coupled to the data collection; and we need to actually understand what we are modeling, what it is that we're trying to do.

We have a lot of ground to cover, so we're going to leave models for other venues; we wanted to get a little bit more feedback on this community engagement slide, the one that we can't read. Many of you in your presentations brought up alternative models for going forward with new initiatives, and we wanted to make sure that this slide really captured that; it's a little bit more like a program. So does this capture the flavor, for those of you who did address this in your presentations? Is this the gist of the discussion? Does this capture it?
We heard about bringing in data from the community, making data available to the community in ways that reach really broadly, and the notion of bringing in samples from the community. There are a lot of little details associated with it that turn into big issues. How can we go about making this transparent to the investigators who are doing some of these things in their own labs? I think ENCODE, obviously everyone has heard of or knows ENCODE, but to some extent there's, I think, a moat in some regards; in terms of how to cross it, you know, where's the bridge? Who's going to build the bridge? That has to come from ENCODE, I would think, initially, to build the infrastructure for these inreach types of projects, and how can we begin to do that?

I mean, and I keep coming back to this, ENCODE behaves exemplarily, or very well, in the way that it's coordinated, its metadata standards. This means that other people who use those standards, and you're encouraged to use those standards when you submit through ArrayExpress and sometimes when you submit through GEO, are compatible; the metadata is compatible with the same standard. So there isn't some kind of them-and-us split of the standards happening worldwide; it's the other way around, everything's being coordinated. What I think is harder is having someone who tries to pull data together and consistently analyze it from both small and big labs. And what I think is a very good idea that's been suggested is more this business of the samples, you know, asking for cell lines and samples to come in. My experience over time is that when you try to coordinate data sets that haven't upfront been thought through as being coordinated, it's much harder to derive good aggregate information out of them. And that's been the experience many times. It's not that it's impossible, but it's just much, much harder. And the thing out of the last day that felt right to me was the idea of having not just the standards but really having community input for the samples.

One thing to add at this point: it's not just cells; it could be cells along with the specialized assays or phenotypes for that community, which could be very difficult for the ENCODE investigators to bring up. So that may be a parameter that needs to be discussed at the time projects are accepted or not.

I think the reality is a lot of these experiments are going to go on on the outside, and I think it would be a shame not to collect them. And if we do bring in that kind of process, it's a little harder work, but I still think there's value in doing that. But I do like the idea of having some structure laid out to be able to take samples, because the data will come out better; it will come out more uniform. We all remember the microarray data, right? Where things clustered by lab, not by sample. We saw exactly that in ENCODE with the early ChIP data, and we want to avoid that if we can. So that is the value, and I would endorse that as sort of the first option, but I do think it's probably going to go both ways.

Yeah, so Mike said a lot of what I wanted to say, but I just wanted to express one thing. It's a big effort right now, and maybe not worth it just for the data that's already out there.
But think of the bang for the buck that we would get if we figure out how to integrate everything that's collected out there, and maybe not everything, but everything that matches the data standard. So ENCODE would say: OK, if you collect your data to these specs, this way, with this protocol, we can integrate it. That means both the effort to figure out what those specs are, and then some type of pipeline for vetting. I think the community, crowd-sourced, will collect far more than ENCODE could ever collect on its own, and there would be value, given the size of the space. I completely agree. I also want to point out that what you integrate for matters, right? There are methods that actually are quite robust to differences in signal-to-noise ratios and data patterns and can handle integrating all these things; it's worse for other methods, like if you're trying to figure out differential binding, say, where you really do need the data to be as carefully controlled as possible. So it's going to be non-trivial how we take this data in, how we develop these pipelines, and at which point it gets shown to the end user and how. But I think that we just can't ignore all this data that's out there. I don't know if it's worth investing millions of dollars to curate what's already out there, but making sure that new data goes in would be important.

So with respect to these two different models, either taking in community samples and processing them or taking in community data, we need to think, I would say, of the trade-offs associated with both. Obviously the data that's already generated is there, so maybe there's a cost savings in that the data already exists. On the other hand, with respect to bringing in data, there are data collection biases, and especially we hear this with RNA, where it seems it's very different from one lab to the next and there are a lot of difficulties integrating them; we're hearing this with GTEx and within ENCODE and some other projects. So that would suggest that if one had the choice, we might be better off taking samples rather than data. But when it's done already, it's done.

Could we return to one of these aspects? Yeah, yeah, yeah. You're right. So I think on organization and on the overall goal we got a pretty decent view, and we heard a little bit about modeling. I would love to hear more from people about their notion of what they would want to see functionally happen. That's one of the strongest things we heard, I think, from the ENCODE PIs and also from the general community: that people would want to know what the elements are doing. Yet what that exactly means is still, at least in my brain, a little fuzzy, in terms of trying to reflect what people are thinking about this. Obviously I have my own ideas; I'd love to hear more about that. I think that's a big one, because it defines a lot of what the project actually looks like, eventually. Mike, then Dana, then whomever.

Everything up there, maybe it's in the third item but not explicit, because I would like to see some reference lines that really do the third item, where you do do a dynamic. It integrates a lot of the discussion I've seen over the last few days, where you could do some sort of developmental time course or stimulus, what have you. I don't think that's defined by this, and it probably shouldn't be defined by this, but it could be the most important. You would actually take that and do a deep dive, right? In the data sense it would be a reference.
It can be matched up with its variation; it can be matched up with all sorts of things. You would have all these other perturbations listed at the top, which is genetics. So I guess that's what I'd like to see; maybe it happens within a reference system. So is that like a GTEx on steroids? Yeah, basically; it would be complementary. Okay. I actually wanted to match that up more with the LINCS project. Yeah, it kind of crosses; it could, it should match up. It's kind of the cross between them, maybe. It should match, for example, with GTEx, and so how do you bring in that sort of information so that you could choose; you might choose, I'll just say, cardiomyocyte differentiation. You want to make sure the GTEx data will be there in a whole bunch of people, or maybe it's a marriage with GTEx, and then match the variation out in a whole bunch of different people where you can do cardiomyocyte differentiation, for example. And the same would be true for LINCS and things like this. And this is how you would leverage, I think, the products of a project at a much higher level than we're doing now, so you're really getting some information. I only modified the word "lines" into "cell types," so that people will... Yeah.

Dana and then Joe. So first Dana, then Joe. So I'm going to keep it at a very high level, but I want to add a thinking point. I think a lot of the reason that ENCODE is valuable is because we are doing it at high throughput and at large scale. And one of the problems with perturbations that we're encountering is that our perturbations are currently very limited, one by one, or not quite one by one, but not at the right scale. So I think one of the questions should be, given that there are so many interesting things you could possibly do, which ones can we scale up rather effectively, large-scale, genome-wide, or high-throughput functional assays? That is maybe more valuable than the one I would rather have but can only get one at a time.

So Joe, and some of the others had theirs. This doesn't work. I just want to echo what Mike said, that having cell systems solves many of the problems that we have. One is tissue accessibility: you could have a system, or otherwise you could develop assays that match the context. So the challenge is being able to have, rather than just cancer cell lines, which have been valuable, assays where, say, coming from predictions about whether or not particular enhancers are expressed at a certain time during cardiomyocyte development, you can then enter all the genetic information and assay it in the right context with a high-throughput assay of whatever flavor you would like to have. I think there's a practical reality as well: a lot of the tissues that you'd like to do assays in aren't going to be clonal transformed lines, so you're going to need to do some kind of differentiation protocol. You're also capturing the genetic variation. So if you choose those cell types correctly, think about the genetic variation, populations that are being studied, for example, where you have consented lines, you could then link other projects by virtue of the fact that it's the same consented individuals providing the fibroblasts that you then put into a differentiation protocol.
So I think the idea of having these cell systems proposed either by laboratories that are expert in them or by collaborators, etc., who are directly part of the project would solve a lot of problems and bring a lot of the projects together, if the coordination could happen before the project started, which has been a challenge in the past for a lot of the NHGRI-funded projects; but now we have an opportunity, through picking cell systems, to integrate these kinds of different projects. I'm not going to add anything much more substantive than that, but that was exactly the point of my talk last night: if you start with a model where you can take it from a single cell to a differentiation event to a disease model, I think, for example, if we can pick a few of these kinds of cell systems, we're going to be able to extrapolate to very generalizable principles, and especially those where you can actually test things in model organisms, I think, will also be key. And so I like that we're talking about cardiomyocytes; I think I'll write a grant on that. No, but I think this is what we have to think about: what's the right cell system that will allow us to actually move through all of these questions in a single system, from normal variation and normal development to a disease model to a model organism.

So while I'm a real advocate of doing time courses and looking at dynamics, one thing we might want to keep in mind as a word of caution is: what are you going to get out of this? You're going to get a time course of histone marks, RNA, maybe other factors, but maybe the reality of it is that everything will be demarcated by activation and repression, as you can define by, for example, RNA. So if you have genes that are activated, you're going to see changes in histone marks, so you're going to see the same thing all over again, but now in a time series, and I'm not sure you're going to gain anything that you don't already know about activation and couldn't see from RNA. All the genes are not regulated in the same way, and so you're going to learn about differences in how genes are regulated; for example, inducible genes may be regulated very differently than another set of genes, and you can look at RNA levels where you're going to see very different chromatin associations. But that also may lead you to more of the biochemical questions: what are the causes that are resulting in these changes in chromatin conformation that could not be read out at the RNA level? So I do think you learn a significant amount by looking at the data set. We looked at the salt response in yeast, which is pretty simple, and we thought we had the major factors, that when we knocked them out they would go away and the whole thing would be explainable, and almost none of it was explainable; it was a disaster; we knew so much less than we thought we knew. I think that will happen in these systems, and that's why you want to collect the data and do it, so you can really understand them at the level we want to. Humans should be more complex, so I think the challenge will be bigger.

I think we were hearing from Aravinda yesterday about cell-type-specific enhancers, and this is something that we would get beyond just simple gene regulation: knowing where the variation is in those regulatory elements and what's causal. I think that would be an extra thing that we would get out of that.

So let me pause for a moment and ask, if you looked at Aviv's very nice slides, are there high-level topics that are
not represented that should be? For the group that met and threw grapes at each other up here during lunch: did we miss something that should have been in this list of topics that emerged from this meeting? And don't use this to bring up new topics now, but are there generalities that came through the meeting that we missed? Can you flip through the high level, or just flip through the slides?

One thing to add that's already been mentioned: we are all moving into an era where the samples that are used, and the cells from sampled individuals that are used, need to be appropriately consented, so that we don't come back later to this issue of "we can't do it now"; it's a very important part. This may be old news to the genome community, but if we are to encourage others to add data, this is a very, very important thing to get across.

I'd add one point; this actually relates to this slide even though it's not on it: I actually think that, in relation to disease, ENCODE should very much be about understanding the healthy tissue and how variation in the elements causes disease. So I actually think that ENCODE should mostly stick to assaying and understanding healthy tissue, and then what goes wrong in disease, as a baseline for understanding disease.

This is the point about assays, on which I think there was a lot of discussion; it's not an exhaustive slide, and already the font was very small, and I think for our meeting it doesn't need to be exhaustive; I just wanted to point that out. And then there's the issue of perturbation and the different organizational models, which I think we'll discuss in further detail.

One thing that we heard a couple of times, which I think leads to this or speaks to this organizational strategy, is whether or not everything needs to be done by these large consortia. This seems to focus a lot on how the community interfaces with ENCODE, but I guess what I'm wondering is how ENCODE interfaces with ENCODE. So whether you have one big consortium, or you have looser ties among people, where you agree on a cell line and then people can kind of come together; what was the analogy, instead of quarterly reports maybe annual reports, and maybe a less top-down managed structure. Whether or not there's some room for that, especially in the avenues that are going to be more technique-driven or technique development, and to give space for things that we, sitting here today, can't anticipate, to still funnel into the pipeline. So I guess one thing that I don't see here is this idea of how the consortia themselves will be organized, and could there be a move from everything being kind of very top-down to more openness and flexibility in the structure of the projects?

I'll make just one comment; I think this is something NHGRI might be way better equipped than I am to comment on, and I actually don't even exactly know how it's run today. But the one tidbit, which is actually on the slide after this one, the one with a very small font, is that there is this notion of maybe porousness to new technologies, allowing more flexible switching as they come up, and that is an organizational principle that might be different than the way things have been done before. So if, you know, Will wakes up in the morning and comes up with some crazy new additional assay and you're really excited, you're not locked in by a contract. That's the only tidbit I would add; I think others would have much better answers.

I just want to comment on the current structure, which is fairly top-down, but there are investigator-initiated groups; the technology development
projects are basically R01s. We also are an open consortium, so we have other groups that apply for membership in the consortium, as long as they agree to abide by the data release policy and to really be actively involved, so there is the opportunity to bring new groups in. We do have a little bit of flexibility within the production groups to bring in new technologies; a couple of people are doing ATAC-seq, for example, and of course we do need to modify milestones to allow for that. So I think we're not completely locked in, where what we started in year one has to be the same in year four, but we don't have a lot of flexibility. I can imagine there being a mixed model where we do have some flexibility and on-ramps, but some of the other things that I'm hearing about, like bringing in samples or data from the community, I think are going to require a lot more management than we actually currently have.

I was more thinking of easier transitions between, let's say, the R01s that are for technique development and the consortia; let's say you sponsor a really productive R01 and they come up with a great technique, could there be a very easy pipeline to go from that into one of the consortia and immediately make that one of the sort of ENCODE-vetted assays? I think it's hard to get new money in the middle years; I think that's probably one of the limiting factors. I don't know if any of the data producers want to comment on the ability to bring in new assays.

Well, yeah, we feel locked in somewhat, but there is some room. You know, my own expectation, it's interesting, when ENCODE 2 came we were right at that switch between chips and sequencing, so we did keep an eye on it, and that one was a dramatic change that happened like that, so it was adopted quickly. Other things maybe not so fast; I assumed we'd all be switching to ChIP-exo, quite frankly, and that hasn't happened because it's been tougher. So I think things do get vetted and it does happen; it might be a little bit like an oil tanker, where it happens kind of slowly, but it does happen. Can John say something about this too?
Yeah, I think that the history of the project has been one of continuous development of new assays and implementation in parallel with the regular production assays, shifting over when it made sense. Mike indicated that the dramatic shift from microarrays to sequencing happened very quickly, and even then the format of the assays has kept changing; we have changed the format of the DNase I assay several times, it's undergoing another change now, and I think there is continuous implementation of that. Speaking from the data producers' point of view, I have never felt, and I'm sure Mike would say the same, constrained at all by NHGRI, as long as we are moving forward with our production milestones, to innovate as much as possible. And I think that's part of the picture, because ultimately those innovations have always resulted in reduced costs and higher throughput.

But I think that's within the specific goals and the specific general data types; it's not really talking about new technologies or new groups. One example is the ChIA-PET, which we are trying to integrate now, so we do have to justify this piece, what it's going to add to the other production, which is reasonable, and we also have to show it's really, really working too. Right, well, just that.

So there is a trade-off here that hasn't been addressed, and that is that switching to the most current technology can get better data, and that's usually a good thing, but what happens if somebody like Adam wants to look through the matrix and see what's there, and partway through the matrix the technology has changed and that experiment is no longer being done? Suppose somebody is interested in some disease; it says these are two important cell types, and for one, assays A, B, and C were done; for the other, assays X, Y, and Z were done; but they need to directly compare them. So we need to consider the continuity, which is important in one way, versus generating the best data that can be done today; that is an issue also.

I know I keep coming back to this, but I want to make sure that, like, every transcription factor gets its day in the sun, and the only fear I have about all of these other aspects is that they mean there's less emphasis on going out across all those transcription factors. I think we will regret not having a comprehensive view of every transcription factor in the human genome in one cell type somewhere, as part of item three, this "vet function with perturbations." I just feel that there's no way around it; there are some really hard miles to go, whether that means really pushing antibodies or making new technologies; there's something hard about that task, and I think we will regret it if we don't go through every transcription factor, and also molecular interactions.

What do you mean by going through every transcription factor? You're obviously thinking of something more than simply just doing ChIP-seq. I would be extremely happy with ChIP-seq on every transcription factor, and we're 200 in, out of 1,600. Yeah, and I would add to that the fact that if you have done it, in this particular case the value would be less in the data than in the research reagent; if you end up with antibodies that are vetted, it means you can apply them in many other contexts. Oh, if it's all tagging, then it's a whole different story, if you standardize that. There's a lot of discussion on the standardization of
antibodies: who chooses the right one, how do you make that choice? Still, I think at the end of the day, if you're really going to try to fundamentally understand or compare transcription factors one to another, I think we have to start talking about antibody-independent types of technologies for binding. I mean, I think that would be great, but I would be up for it even if one just had antibodies, with all their, you know, complications about epitopes and blah blah blah and stuff like that; I mean, they work now for 200, and one gets good information out of that.

Yeah, I just want to add, to solve that problem, for sure tagging is a strategy, and it's impossible to tag in every single cell type, but you can make it in, for example, mice, and then you basically have an infinite supply of any primary cells, and also, for any high-priority genetic manipulation you want to test, if you make a model, you have an available model. Also, you can cross them, which is different from cells, so you can really study the combinatorial effects. So we talk a lot about choosing different systems, but I don't think there's any conclusion about how much effort will be put into real models, which model, and what's going on with mouse ENCODE.

So I just wanted to comment on what kind of information you're thinking of getting out of the transcription factor binding site list. If you do it in a particular cell type, you'll get a particular series of binding sites; if you perturb the cell, you'll get a slightly different set of binding sites; and the same transcription factor in a different cell type will bind to something else. So the question is, what is the goal of that initial cataloging? Are you trying to find the motifs that the transcription factor binds to? What exactly are you trying to get out of it?

I can respond to that. For finding those motifs and that initial cataloging, there are obviously all the flaws that you mentioned, but just as we found it incredibly useful to know all the binding sites of the current 200 transcription factors, it's kind of one of those things: we would not refuse getting all 1,600 in equal detail, and I'm sure there will be bits of biology that drop out, but it wouldn't be the full answer. Let me just stress, it's not like having that somehow magically allows you to understand every cell type or what have you; it's a long way off; it's just a very important component, and interestingly enough, it's a countable component, so we know whether we've got to the end of that list or not, for some definition of the list. But that's fine.

Yeah, I just want to comment: it seems like this particular goal is something where there are a couple of other groups in the community that are cranking on this, like Tim Hughes and Jussi Taipale; I mean, they're just tagging factors, ChIPping them, getting the motifs, and they are doing them in the hundreds, right, for the zinc fingers. Okay, but what I'm getting at is that there's a limited number of players here; it's totally synergistic with ENCODE; we already know who they are; it seems like this is the kind of thing to bring everybody together on and just do it.

Yeah, just to re-emphasize the need for the TFs, and that they provide more than just motif information: yes, we have TF binding profiles, but we also have chromatin loops, we also have expression, and we can make inferences as to what the functions of some of these TFs are. So, for instance, for chromatin loops, we now know that ZNF143 is a chromatin-looping
factor, because we were able to use the ChIP-seq data from that factor, which was ChIPped probably just because an antibody happened to be good against it, and now we know that it's a chromatin-looping factor; we wouldn't have known otherwise. And I'm sure that from the list of remaining TFs we'll find additional functions that are yet to be discovered.

So, I guess, what about model organisms in general? Where do modENCODE and mouse ENCODE fit into people's thinking about the future of this? Was that on this slide? If it was, I missed it. Again, I think NHGRI carries a big part of the answer on this, on where modENCODE and the other model organism ENCODEs sit in this, but I would say that there is a component of model organism work that is really the model organism's ENCODE, and there's a component which is using the model organism as a piece of this ENCODE, and I think those are quite different. So when Lori goes into a fish and looks at a heart phenotype, if that's what you do, yes, you do, or when you engineer the right mouse model to get the right cell type out of it for John to look at some T-cell phenotype, or for Angela to look at exhausted T cells, that's a different type of application that's driven by the human side, not by the model organism side. Not to say that modENCODE is not important, but I don't think it's exactly the same thing. So I guess it depends on what we mean by modENCODE and mouse ENCODE, right?

So I would say I think we're still committed to the mouse and learning as much as we can about the genome; it kind of depends on what the next phase is going to look like in terms of continued data generation, but I do think that it's unlikely for us to bring in new model organisms and do a whole new complete cataloging of those. I think it does make sense to bring them in when the biology dictates that. Is mouse still part of the future of data generation? Currently it is, and we want to hear from the community about that, including in this discussion. Yes, absolutely. So, folks, write down "mouse." What did you say, Jeff? We have a certified mouse person.

I think this is very exciting; probably not in the immediate future, but there are so many well-characterized knockout mice, for example, all on a uniform genetic background, as was alluded to just a few minutes ago. You can take these at any developmental stage, you can cross them to anything you want if you happen to want to, and I could think of a zillion experiments to do using data that was generated from these animals. You can look at animals, you know, because of the characterization with the phenotyping consortium that's going on; you're going to be able to say, even though it's not a major phenotype, you know, the kidneys are affected, the heart is affected, and so on. So it provides guidance on what cell-type data you would use from the human to design your experiments in the mouse. And in addition to this, I'm not a computational person, but collaborators certainly use the data that's available in ENCODE as it is to direct interpretation of mouse experiments. For example, we have a Huntington's project looking at very early changes in gene expression in mouse models with repeat expansion in Huntington's, and you get changes well before there are any changes in disease; but when you look and compare with the differentially expressed genes in humans, there's considerable overlap, and with the transcription
regulatory networks it's the same set, or rather an overlapping set, of transcription factors. So I think having the ability to manipulate the mouse, a resource of extremely well-characterized targeted mutations, makes having the data specific to mouse a framework, as people have called it, to layer onto the interpretation; the value is going to be immense, just because you can manipulate these animals.

How much genomics is done inside KOMP, if at all? There it's simply knockouts. So people don't do, say, any genomic sampling? No. And it's the same question of what tissue; for example, you would love to have differential expression, but what tissue, what cells? It would multiply into the millions and millions. Can I just ask, the phenotyping is histological? Developmental when there's lethality; you have behavioral phenotyping; you have metabolomics, well, not really metabolomics but blood chemistry, so it's a basic profile; then of course behavior and morphology. It's pretty crude, but it's in-depth, and for as complicated as a mouse is and as complicated as animal husbandry is, it's very well standardized. Yes, and there's lacZ expression, during embryogenesis for example.

Can I just add that right now, with CRISPR, it's much, much cheaper; you can easily make a tagged mouse for a few thousand dollars, maybe two thousand dollars, and literally one lab can make hundreds of lines of knock-in or deletion mice in half a year. This is definitely something, and it's going to far outpace the characterization, which is much harder. The phenotyping is all in place, and KOMP is also switching to CRISPR now. Yes, they are. That suggests, for example, that if hypothetically there's a transcription-factor-focused activity, then one could take all the KOMP mice that are transcription factor knockouts and, based on their phenotypes, define a set of tissues of interest, or just go standardly, say, into spleen or into something simple like that, the equivalent of PBMCs for the mouse. That would be extra, but I'm saying the KOMP piece that just knocks them out exists without any extra work; no new lines have to be made, the phenotypes are done; the extra tagged ones would be a layer on top.

So let me just step back out, and then I'm going to step back in: I really think there's a huge number of opportunities still around understanding how chromatin interacts with gene expression in a more idealized way, from all sorts of models, and I put fly and worm right in there, as well as mice, for that kind of conceptual classification. Just to remind you of the work, both from the mouse ENCODE that was published and from Duncan Odom and Paul Flicek: when you ChIP the same transcription factor in multiple mammalian species, you see a very, very high amount of movement of the ChIP-seq peaks between mammals, and that's quite separate from protein-coding genes. So whatever gets built, when you're building that kind of catalog in mouse, you've really got to be cognizant of the fact that it's not going to be a simple one-to-one mapping of those sites. On the other hand, the ability to tag an organism and have every possible tissue come from it, and it's a mammal, and although that divergence is there, it's very clear that a lot of the cellular programs are very, very homologous, comparable between human and mouse, does give you something, and you can use that divergence to your advantage. All I'm saying here is that I think there are quite a lot of pros and cons on both
One simple point, although I think it's important: if one does settle on the systems, if one does go the iPS route, it would be advantageous to use lines where there's actually a lot of phenotyping behind the people from whom the lines were generated and where the data are all open. The point is that there are going to be a lot of lines from very well phenotyped people, and that would be a great way to start. They're openly consented, too.

So, are there any other comments for this general discussion? I think we're going to move to another, higher level of this discussion. Before you hear from NHGRI, we've asked, I guess, four council members here whether they want to make any comments about their thoughts on priorities, within what we've heard today and within NHGRI, and I'm going to call on Eric Boerwinkle first, who I think has to leave soon.

So I have several comments. First, I think it's important to spell out, maybe for everyone, what council does. It won't surprise you that, particularly in genomics, there are more great ideas than there are dollars, and so really what we do is help the institute grapple with that fundamental reality. As I see it, there are really only four options. You either cut the budgets of the items to fit within a fixed budget, and then you argue how far you can cut before the science is damaged so much that there's no need to do it in the first place. The other is to find partners from other institutes, and I think that's particularly important given the discussion of community engagement; that may be an avenue for NHGRI to find partners to help fund ENCODE, which, as I understand it, hasn't happened before. Probably the most painful choice is to kick the can down the road; I guess NIH institutes have learned a lesson from Congress, which is to just keep kicking the problems down the road and fund it in the next fiscal year or the next fiscal cycle. Ultimately, and painfully, the other choice is not to do it.

Then the question is how council prioritizes things. This is my opinion: it's basically how something fits into the medium-term vision. Council seems to really love the generation of resources, community resources, which I think is a great advantage for ENCODE. Again, if there are partners and there's cost sharing, that's another obvious advantage. I think the most important thing I see for the institute to grapple with, along, I hope, with the investigators and others in the community, is to fill out that first slide: what is the clear vision, mission, and goal of the next cycle of ENCODE? In 10 seconds, what is it? If it takes longer than 10 or 15 seconds, you probably haven't owned it yet; if you've got to give a half-hour speech to convince the audience, it's not going to work. I think an added challenge, one that I realized in helping to pull together this meeting, is to clearly answer the question of what is unique.
What's the unique opportunity for ENCODE going forward that's not already being covered by GTEx and LINCS and GGR and so on? Once again, be very clear about what is new and what's different for ENCODE. On the other hand, you don't want to make it so different that there are no synergies. I think the group, again with community engagement, needs to identify a path by which ENCODE clearly has its own mission and yet there are synergies either with other institutes or with similar projects. I can't emphasize enough, and this is totally my opinion, that having this clear resource-generation role, plus the idea that the next phase of ENCODE will have a lot more community engagement, will resonate quite positively. So that's my two cents, as a council member, about what I would be looking for and how I would try to predict the discussion around the table. I have to say, predicting council's discussions is probably a lot like predicting enhancer activity.

Yeah, I'll echo what Eric said, and also say that one of the things that has especially emerged, I think, from this discussion is how the vision and the projects we've talked about resonate relative to NHGRI's traditional role and mission in genome biology. A lot of the things that were talked about here, these concepts of endophenotypes with protein levels and proteomics and shotgun proteomics, are really traditionally outside the kinds of work that NHGRI has done in the past, and I think this is probably going to be one of the discussions we will have; we always have it at council, the two times I've been there, about how these projects fit in and how NHGRI is uniquely positioned to contribute to the community. I think that's in one way why resource projects fare very well at NHGRI: for the cost, the impact is quite high and has been quite high. But once you start rolling out from data generation into trying to use the data to understand biology, now you've got these gray areas where new technologies that aren't necessarily genomics technologies, and questions that may be more disease-focused than NHGRI has traditionally focused on, come to the foreground, and I think that's going to be a balance we're going to have to discuss going forward.

And Joe? I don't have a huge amount to add to this discussion; I think the points that were raised by Eric were right on the mark. I think the challenge is the basic discovery aspect, the basic biology. NHGRI council is committed to having genome biology be up front, and this is one of the projects that I think meets that criterion by providing a lot of basic information about the genome. And if other aspects can be linked to the downstream medical applications in the next phase, by having, for example, what Mike mentioned, or other samples that can bridge that gap just by their very nature, by the genetic variation that exists, I think you could even bring the council together on this issue. So that would be a way of combining both the basic and the more medical aspects of the project, and I think that area would be terrific for the next phase.

So I'm not even sure where to start here. I echo, I think, all the other council members' really on-target comments. I mean, it's clear this has been an extremely successful project, right?
Some of the virtues that I think have come up over and over are things like the high quality of the data, the ability to access it in an unfettered way, and the fact that it fits so neatly into NHGRI's mission, in part because it's not disease-focused and because it doesn't overlap with many of these other related efforts that are going on, this notion of staying disease-agnostic. And thinking about it, this is now going to be the fourth phase, right? It strikes me that it really needs to evolve in some significant way beyond what it is now, and I think a lot of great ideas came up for directions that might take. My personal bias is that, rather than the focus being on expanding the matrix beyond where it is, and I'm not saying that shouldn't be part of the activity, there should be a deeper focus on understanding the matrix that we have. You can argue about whether that properly falls within the ENCODE Consortium or whether it should take some other form, but we're clearly at the point where we have massive amounts of data on a lot of cell types, and in the coming year we'll be able to do genetic manipulations that were hard to imagine even a year and a half ago. It seems like a great opportunity to take advantage of this moment to really work towards a mechanistic understanding of the data we do have. That's at least my view, and I don't know what the best form for that is, but I think it should be squarely within NHGRI's mission.

Any comments on that before I turn it over to Jeff? I think one question, if we do move into the disease realm, is: doing one, is that useful? Doing six, is that more useful? Doing 20? Obviously we're not going to do 20, but what's the value of doing a little bit; are we going to learn anything from that?

What you raised, Jeff, and Mike's not here anymore, was not specifically focused on a disease but on the variation that exists in a population that will have disease, right? So, looking at lots, and I think we can go back to the cardiomyocytes, these individuals, both normal and some with a particular disease or variation, could provide a lot of information about what normal cardiomyocyte gene expression looks like, as well as provide candidates for disease. So if you pick the system right, and that is based on a lot of the biology that's out there for studying disease models, it fits hand in hand with the cell systems that are best enabled or developed, because those folks have spent a lot of time developing the cell system to study a particular biology, which is usually coupled to some disease phenotype. So it's not studying the disease per se, but studying the development of the cell type in the context of variation, some of which is disease.

Yeah, I think with regard to disease, one of the remarkable things about the last phase of ENCODE is that although the focus was generally on normal material, outside of the minority of cell lines, when you crossed that with the genetic data you suddenly got information on disease. So it doesn't mean that we're not working on disease. And I would say, regardless of what's done in the area of disease samples, because that is a huge can of worms, ENCODE certainly could do something to much more greatly enable the study of specific disease systems. For example, take the heart: it would be fantastic to have the different types of atrial cells and things like that.
In other words, we know what a lot of the different cells and tissues are that become diseased, and in many cases we don't penetrate those, and if we just sat there and waited for them to be generated ad hoc, for example by the community, it may not happen in a coordinated fashion. So I think that actually enabling certain disease areas could be an easy goal.

I would like to add a couple of things. One question, of course, is how deep your pockets are: how many diseases can you take on? But to follow up a bit on what John is saying, essentially, when we start looking at healthy donors, and we've looked at 200, the variation is enormous, much more than we had expected. So if you then start from one or two examples that you have and extrapolate what overlaps with GWAS SNPs and whatnot, then basically you are entering a very murky business. I think you have to realize, if you go into it, that your normal cohort might be several hundred, preferably more, and your disease cohort would probably be several thousand, in that order, and how deep you can go into that is one of the major questions, along with how easily you can get your material. You really need to study very large groups of people, and do all your enhancer mapping and so on, before you can make strong statements about which GWAS SNP is a causative SNP, because the variation is simply very, very high. At least in the blood case the variation is enormous, and you have all the exposures from the environment, so there the problem might be bigger than in other diseases, but I think the variation is enormous in normal samples already.

I just want to add to that point: I think if you choose systems where the target tissue is very clear, that can also make things less complicated. Obviously cardiovascular disease is a really complicated question, but when you start looking at diseases that particularly affect, for example, ventricular cardiomyocytes, and sorry, I keep going back to the heart because that's what I know, there are diseases that really target a specific cell type or tissue, and those probably will be better platforms, if it all falls into place, to think about.

Let me clarify: we're looking at very highly homogeneous cell types. We're not looking at blood as a whole; these are really very well defined cell types that you take out of blood and analyze, and even then the variation is enormous. So I don't know what that looks like; I only got half of it. If, on the other hand, you derive cells from the blood and then re-stimulate them ex vivo, things clean up substantially. I think things complicate extensively; I don't know whether people have done effective genetic associations in this way, for example.

But I think we're also saying that, if you're looking at variation, sure, there are millions of variants across the genome, but you can also take a focused approach. For example, I think there have been so many studies from people in this room that really clearly show that many variants associated with complex diseases or traits are actually enriched in some of these functional elements. So if we already reduce our search space that way, obviously we're not going to find everything, but it's a way to actually start to get some traction, and what we learn from those studies could be generalizable or extrapolated to the larger genome.
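To illustrate the search-space point, here is a minimal sketch, under stated assumptions, of testing whether a set of trait-associated variants falls inside annotated functional elements more often than randomly placed variants do. The coordinates, element calls, and chromosome length are toy placeholders, and a real analysis would match the permuted variants for allele frequency, linkage disequilibrium, and local SNP density rather than placing them uniformly.

```python
# Minimal sketch (toy data): empirical enrichment of GWAS variants in functional elements.
import random

elements = [(5_000, 6_000), (20_000, 21_500), (45_000, 45_800)]   # e.g. enhancer calls on one chromosome
gwas_hits = [5_500, 20_100, 21_000, 30_000, 45_700]               # associated variant positions
chrom_len = 60_000

def n_in_elements(positions):
    """Count positions that land inside any annotated element."""
    return sum(any(s <= p < e for s, e in elements) for p in positions)

observed = n_in_elements(gwas_hits)

# Null distribution: drop the same number of variants uniformly at random.
random.seed(0)
null = [n_in_elements([random.randrange(chrom_len) for _ in gwas_hits]) for _ in range(10_000)]
p_emp = (1 + sum(x >= observed for x in null)) / (1 + len(null))

print(f"{observed}/{len(gwas_hits)} hits in elements; empirical P = {p_emp:.4f}")
```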
I'm just wondering, based on this discussion, if you're going to study a bunch of different cardiac cells, do they all have to be from one person in order to be able to compare, or many different kinds of cells from many different patients? That's one of the values of having some characterization of cell types: cell types tell you where things are expressed, and if you have genetically associated genes, that gives you a clue about the cell of origin, or cells of origin; often it won't be just one. And I think this would be a great opportunity for collaboration between ENCODE and GTEx. GTEx has, for example, left-ventricle samples from, I think, quite a few individuals now, and you could do a lot of things with that.

Okay, thank you. I think we're going to turn it over to Jeff, and a few of us will help wrap things up.