I think Brenda really focused on the question of genetics, and especially genetic interactions, and how we could bring that modality, which has been used very successfully in yeast, into humans. Karen and Anjana, in two very different realms, raised the issue of what types of things we should be measuring that people have started on because of interesting basic biological questions: one around nascent RNA, and the other around not just methylated cytosines but the many, many derivatives that arise as the mark is erased — some of which need to be pushed technologically, and some of which might give subtler and more sensitive signal than we've seen before. And John focused on, you know, what's in it for the mom-and-pop shop, but raised a whole set of issues: from the assays that would be necessary, to how many cell types there are and whether we really need to worry about that, to how to incorporate data from non-consortium efforts, which is what we've been discussing right now, and the question of how we bring individual labs into the fold overall. So the floor is open. I think Donna had a hand up.

Well, I sort of said it already, but I think it gets reaffirmed after this session, where pretty much every speaker came up and said we need more technology, and we're already using some of the technologies that ENCODE generated — and all the questions of whether we should incorporate data from the community into the ENCODE data or not. I really think the biggest contribution of ENCODE to date is not the data so much as developing the technologies, the pipelines, the protocols, and the know-how of how to do this right, and making the protocols and the computational pipelines easily accessible to the community. There was a comparison that ENCODE generated as much ChIP-seq data as the entire biological community combined, but that is in part because of the head start. I think that in a couple of years, now that these methods are out there, it is going to grow much, much bigger, and most of the ChIP-seq is going to come from the community. So for one, we should make efforts to incorporate that data toward completion — of course with quality controls and standards, and requiring that the data was generated according to ENCODE's standards. But also, future ENCODE projects should really follow this model. There are too many technologies, as has been raised here, too many conditions, too many cell types. So what ENCODE should really focus on is doing what it did for ChIP-seq: this is how you do the technology, these are the best practices, this is the protocol, this is the way to do it at really high quality, here are the artifacts, this is how you analyze the data, this is how every single biological lab can now do it in their own system.

A question, though: can that be done without doing the actual data generation?

So I think the data generation is critical. But, you know, there was a big discussion yesterday about finishing or not finishing, and for some of the technologies where ENCODE has been the leader in making them ubiquitous, I don't think ENCODE should be the one finishing every single cell type with every technology — just taking the data matrix of the current ENCODE and completing it.
One model that worked really well for the immunologists — John actually showed their app in his slides — was the one used in the ImmGen project. There was a central source of funding for the final steps of data generation, which people wanted done in the most standardized way, and a good aggregation of immunology labs formed around that. Each lab was responsible for bringing in the samples, which actually required a lot of specialty knowledge of the system, and these were also labs considered real leaders in their field, so people really trusted that the cells were sorted the right way, from the right animal, at the right time, and so on. The labs that participated got barely anything — from zero funding to almost none. All they got out of it was that the data was actually collected on the samples they cared about, and I think it would be fair to say the immunology community is very, very happy with this resource.

Absolutely, and that's why I said: if you want information on reindeer, you should go to the reindeer herder, and the reindeer herder will probably appreciate having all that information. I guess we hit on a point — so Carol, then Mike, then you, and then John.

Another one of the types of standards I gleaned from what John just talked about: not just the standards for experimental generation and experimental design, but semantic standards for how we describe things. All these large-scale projects are developing really useful standards for the community, but they also need to adopt the standards that have been developed by the community, for things like gene names. These are simple things that aid the community's ability to leverage the data sets in effective ways. There are standards for what to call genes and proteins; both have been in place for decades, and they need to be adopted by these projects that are developing data sets meant to be integrated with all the other knowledge we have about those genes. NHGRI in particular has been a leader in semantic standards for biomedical ontologies, and in developing ontologies for describing functional knowledge, and those standards are also going to be critical going forward to making sure that projects like ENCODE really have the impact they should have. That's the kind of standard we haven't talked much about over the past day and a half, but I think it's one that's really essential.

So, yes, I would certainly like to bring data from the community into some central resource as well, and one could discuss what the best mechanism is. There was an interesting idea which hasn't come up here yet, though it's been discussed as part of ENCODE: whether a certain amount of the production capacity could be used to take on projects from outside individuals. I personally like that idea, because some people may not have the expertise, but, as you point out, they have the cells, and they have really great material, and they're better suited for that sort of thing than we are. To be able to make those marriages is very useful, and of course that model is already used by the DNA sequencing community and their production centers. So I do think there's value in it, and it would be nice to have capacity laid aside for those sorts of projects.
I was going to echo that, but I don't think this is a model-A-versus-model-B thing; in fact the best place is a mixture, with a scaffold. Just to take your species example: there may be a global effort that says, look, let's get a description of every species on the planet, and then we go to the reindeer herder for the rest. So you don't have to describe this as an either-or process. And what you just said, Mike, about cell acquisition in particular — that's something where there's a huge amount of distributed expertise. Whereas the things I'm quite keen on, like completing catalogs on some of these other axes — say, every transcription factor — that's almost a different process, around antibodies and everything else, and I think that's very valid and should be done. And a final point about standards and coordination, just to reassure people: I really think Mike Cherry, at the moment, in ENCODE 3, is doing an incredibly good job of coordinating the data, and of coordinating the ENCODE data with the worldwide data. He spends time at the EBI, he uses the Experimental Factor Ontology, and the ontologies the samples are described by are coordinated — therefore also coordinated with the Blueprint data and everything else. So there's a good thing happening there. That shouldn't at all inhibit the idea of having more specialized portals, which is what I think you were heading towards, and better coordination — there's way more that can be done — but I don't think we're starting from some kind of disorganized state. We're starting from a partially organized, well-intentioned state that needs encouragement toward further alignment and further exploitation. And I would really support new portals being set up to form communities, but not coupling those portals to data production, because I think that gets everything incorrect.

Can I just add a few words to what Ewan said? There is clearly a lot more than only ENCODE in this whole business. There are other consortia producing a lot of good data, maybe even as much data as ENCODE, and in the organization of IHEC we have made the standards, we have the ecosystem, we have all those working groups in which ENCODE is participating. So it's not as if you're the only ones in the world and then we'll see whether there are a few others. There is a centralized activity going on; there is a portal where all the data is — all the data from all the consortia, including the ENCODE data. So there is much more out there than this big-but-still-small world of ENCODE. Look outside at what is there, and I think there is much more than I hear in the comments here: there are metadata standards, there is standardization synchronized across ENCODE and Roadmap and Blueprint and all the others. It's there; start looking into it and start using it.

Mike? First, I'd echo what Hank just said: some of the data that's not in ENCODE is collected in other places, and this is very important and useful for everybody to know. And it's being done in a way where, at least among the large projects that are talking about this, everyone is trying very hard to coordinate their ontologies, their standards, and their metadata. Like Hank just said, it's a two-way street — a multi-way street — where all of the groups involved are trying to think of what's best for everybody.
At least from what I see, I don't see any one group saying "this is how everybody else has to do it." We were laughing on one of the IHEC calls that everybody's fine with a common standard, as long as it's "here's my standard — would you like to adopt it?" But we're all existing projects, right? So we're trying to coordinate this; it's not that simple, but I think we are making good progress at making it work.

And coming back to the idea of taking in samples from the community and processing them with some kind of core: one thing to consider with that model is that the consents would come from whoever generates the sample. So in this model, how important would it be for those samples to be consented as unrestricted, or open access, as opposed to controlled access? How important is it to spend the extra money on open access and possibly lose some samples?

My experience, again, is extremely limited, almost at the margins of one community — in this case the immunologists — but everything was open there, and no one has had any issues with it.

I don't think we should go anywhere near controlled access for this. I think most people just really want to see things done that they couldn't do themselves. With respect to human consent, people often aren't consented to share broadly, so I think it should be open access and properly consented. There are good groups trying to convert these closed resources — trying to convert GTEx, at least, and I am in that camp — toward being more open, because the data is just a hundred times more usable when it's open.

Absolutely. So I actually want to go on regarding the portal issue. First, I'd like to echo that Mike has been doing an excellent job with the portal. It's something he took on recently, and it's a lot of work to turn that much disorder into order, so I think there's a lot more coming in the near future. I do think, though, that making more portals is not just a matter of a web designer; it is actually a research question. How do we take these complex concepts and disseminate them — into a language, into a model, into a graphic that's more accessible? How to do that is a big and important research question in its own right.

With respect to integrating community data: if you look at data in general — the data we generate, and others' as well — there are basically two kinds. There is high-quality data, and there is everything else. You can analyze the everything else, you can sequence it deeply, you can do whatever you want, and you will never extract the same biological information that one extremely high-quality, high signal-to-noise data set will give you. And there are simple measures that can be applied to any data set, without requiring a replicate, that will give you a score — a number. Sometimes people are surprised when they see what their number is, because they've been staring at the browser thinking their peaks are great, and then that number measures them; sometimes it is great, and sometimes it's not. But that number exists. It's been systematically applied in Roadmap, it's applied in certain ENCODE centers, and it can be applied to any data set from the community to immediately tell somebody whether they are generating high-quality data or not, in many assays.
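To make that concrete: one example of the kind of single-data-set, replicate-free measure being described is the fraction of reads falling inside called peaks (a FRiP-style score). A minimal sketch, assuming read coordinates and non-overlapping peak intervals are already in hand as arrays; the toy data, function name, and the rough quality bar in the comment are illustrative, not the official ENCODE cutoffs.

```python
import numpy as np

def frip_score(read_starts, peaks):
    """Fraction of Reads in Peaks: a simple signal-to-noise score that
    needs no replicate.  read_starts is a 1-D array of read start
    coordinates (one chromosome, to keep the sketch short); peaks is an
    (n, 2) array of non-overlapping [start, end) intervals."""
    reads = np.sort(np.asarray(read_starts))
    in_peaks = 0
    for start, end in peaks:
        lo = np.searchsorted(reads, start, side="left")
        hi = np.searchsorted(reads, end, side="left")
        in_peaks += hi - lo                # reads landing inside this peak
    return in_peaks / reads.size

# Toy data set: 100,000 reads on a 1 Mb chromosome, 30% piled into 3 peaks
rng = np.random.default_rng(0)
background = rng.integers(0, 1_000_000, size=70_000)
signal = np.concatenate([rng.integers(s, s + 500, size=10_000)
                         for s in (100_000, 400_000, 800_000)])
peaks = np.array([[100_000, 100_500], [400_000, 400_500], [800_000, 800_500]])
print(f"FRiP = {frip_score(np.concatenate([background, signal]), peaks):.2f}")
# ~0.30 here; a data set dominated by background would score near zero
```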
So there are ways of scoring those things, and there are certain standards — about read depth and so on — that, once they're publicized (of course they're in the standards documents), could easily be made into a simple one-two-three that people in the community could follow. Then the challenge becomes the submission side, the metadata and so forth, but those are not insurmountable challenges.

Regarding bringing in community samples: I would say that right now, in our center — and we've done more cell types than I think the other centers have — all of our specialized cells have come from experts. They're generated by experts and provided to us, usually for free. They generate the cells, we generate the data, we make it public, and then they appreciate it and use it; that's something we've been doing for years. It has now entered a phase where the consents are being rigorously chased up — sometimes going back and re-consenting patients and changing consents — but it can totally be done, practically. And quite honestly, there's no reason why a certain percentage of the forward production capacity of ENCODE can't be formally allocated, in my opinion, to community-sourced cell samples. That would bring together the expertise in the community with the platforms that have been built, and I think everybody wins in that kind of model.

Because it's expensive to go fetch community data. The reason ENCODE data have been so useful is that they're high quality, they're consistent, and they're all in one place: you can go in there, pull it up, and see it all. But if you started adding up the dollars it would take to pull in a thousand data sets from the community — pull down the reads, remap them, get the metadata all straight — you're realistically talking about millions of dollars a year in informatics personnel. We've been through this; I can tell you it's a super expensive enterprise to do post hoc.

I think we got good clarity on that item — it definitely hit a nerve — but there were other questions raised during the talks. I want to turn back to the issue of genetics and genetic interactions that we started discussing with Brenda's talk, and to whether we should be measuring more RNA, of different types, than we have. I see Olga.

All right. My genetic-interactions point was sort of a question to Brenda, but really it was more of a comment. I know this is stepping back from my usual "we should be unbiased," but I wonder if one way we could incorporate genetic interactions specifically into the ENCODE question is to focus on proteins we think might affect the chromatin state as well as transcriptional regulation. Then you could have epigenetic readouts, and those proteins could be our perturbations in that big matrix that I think Ross and others have shown — I think it was along the bottom, across cell types. This would give us an area to focus on, so we're not doing thousands and thousands of factors by many cell types by many conditions, and it also maximizes the likelihood that we'll see some meaningful readout when we're looking at chromatin and other marks.

Yes — in terms of the interactions, which I love, but thinking about the DNA sites: we might want to consider looking at interactions of trans factors and cis sites. A simple experiment — the genetic test to show that a factor works through a site — would be to perturb both and show that there's no additivity. You have less than additivity, which, as Brenda pointed out, is a genetic interaction. That would keep it aligned with the goal of the ENCODE project and still make it more rigorous: we have the factors, you have the sites, and you could do a functional test — that is, a genetic test — and learn something. I'm not sure how to do that at throughput, but I'm thinking about it.
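Whatever the throughput question, that additivity test is easy to write down. A toy sketch, assuming a single quantitative readout (say, target-gene expression) for unperturbed cells, each single perturbation, and the double perturbation; the numbers and sign convention are illustrative only.

```python
def interaction_score(x_wt, x_a, x_b, x_ab):
    """Genetic-interaction (epistasis) score for a trans-factor /
    cis-site pair.  Under an additive null, the double perturbation
    should equal x_wt + (x_a - x_wt) + (x_b - x_wt); a nonzero return
    value means the two perturbations interact."""
    expected_additive = x_a + x_b - x_wt
    return x_ab - expected_additive

# A factor that acts entirely through this one site: perturbing either
# the factor or the site drops expression 100 -> 40, and perturbing both
# cannot drop it further -- the classic less-than-additive signature.
print(interaction_score(x_wt=100.0, x_a=40.0, x_b=40.0, x_ab=40.0))  # +60.0
```

A positive score here says the double perturbation is less severe than the additive expectation — exactly the pattern you would read as "the factor works through the site."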
I think Olga made the point that doing this on at least a number of transcription factors and other regulators would be the first thing to do. For the 800-million-element cube — or hypercube, or whatever it was on Ross's slide — the question is how to slice and dice it. If the goal is to look at transcriptionally proximal events, then readouts that are relevant to transcription and perturbations that are relevant to transcription have a natural focus, at least from a mechanism perspective.

Some of you may have thought about this much more deeply, but we keep assuming this is an infinite-sized space — we keep multiplying these things together — and the multiplication may simply not be how biology does it. If there are a limited number of states, and the same kinds of programs are used again and again, this might be much more tractable than anyone believes. So the question is: can that be tested?

I certainly think it is much more tractable. If you look at a model organism like yeast, where people have done a larger and more exhaustive set of experiments, you are not even remotely scratching the surface of the hypothetical space; you're orders and orders of magnitude more constrained. And if we look at the systems we do know, the data we see — that is why there is hope for imputation, per Nancy; that is why there is hope for generalization from sequence to expression, per various speakers here and people in the community. This is why, with the 800 million — you could go much bigger than that, because you could keep multiplying — of course the dimensionality reduces radically. The problem is that we don't know that mapping well, and the question is how to sample that space in a way that lets us know our bounds.

Adam, then Brenda, then Dana.

Yeah, I've been thinking about this issue of how to get started, and it seems to me that the 400 cell types are a starting point — I realize it's a lower bound. But if you start with some number of cell types that is much more exhaustive than what we have now, and you create a skeletal data set for all of those cell types — maybe a dozen or two dozen assays — it gives you a point you can jump off from for imputation, and it gives you an initial sense of the kind of dimension reduction that's possible. So I thought that would be an interesting thing to do, in addition to deep dives driven by community experts and so on; that might be a role for a more standardized, consortium-driven aspect of the project.
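A skeletal matrix like that is exactly what lets imputation get going. A minimal sketch of the idea only (not ChromImpute or any production method): fill a missing (cell type, assay) entry from the most correlated cell type that does have that assay, with similarity computed over the assays the two cell types share. The matrix values are made up.

```python
import numpy as np

def impute_missing(M):
    """Nearest-neighbor imputation on a cell-type x assay matrix.
    Rows are cell types, columns are assays, entries are summary
    signal levels, and NaN marks an experiment not yet done."""
    M = np.asarray(M, dtype=float)
    out = M.copy()
    for i in range(M.shape[0]):
        for j in np.where(np.isnan(M[i]))[0]:
            best, best_sim = None, -np.inf
            for k in range(M.shape[0]):
                if k == i or np.isnan(M[k, j]):
                    continue
                shared = ~np.isnan(M[i]) & ~np.isnan(M[k])
                if shared.sum() < 2:       # need >= 2 shared assays
                    continue
                sim = np.corrcoef(M[i, shared], M[k, shared])[0, 1]
                if sim > best_sim:
                    best, best_sim = k, sim
            if best is not None:
                out[i, j] = M[best, j]     # borrow from the closest cell type
    return out

nan = np.nan                               # 4 cell types x 5 assays
M = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [1.1, 2.1, 2.9, 4.2, 5.1],
     [9.0, 1.0, 7.0, 2.0, 3.0],
     [1.0, 2.2, 3.1, 4.1, nan]]
print(impute_missing(M)[3, 4])             # ~5: filled from the most similar row
```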
I would also point out that, in a fortuitous turn of events, computation has really gone in directions that work well with these kinds of data: streaming has really evolved, and so have matrix calculations. This is a domain where, completely irrespective of biology, computational techniques have advanced enormously, so I think the data will have its matching analytics.

Brenda? Yeah, I was just going to follow up on Olga's comment about potentially using transcription factors or regulators as a starting point for making the query gene list. That's a great idea, and I think you could also use additional information to help narrow it down pretty quickly. How many transcription factors in human — 1,500? So you want to narrow that down: any single-mutant perturbation that doesn't have a fitness defect, don't screen. You pick the ones with a fitness defect to start with, and so on. There are a lot of ways to narrow it down pretty quickly and start a pilot project.

I don't want to ruin the flow of conversation, but what is the readout? In mammalian cells, fitness might not —

Well, no, but you can do a pooled, genome-scale, CRISPR-based screen in that query cell line; that is fitness-based, but you can use any other readout.

Can you do an analysis of just the transcription factors in your data — do you learn anything? Because what you showed us was all cell biology, which is great, but it would be interesting to see what you learn about the transcription network.

Right. One thing, at least in yeast: a lot of transcription factors alone are not required under standard growth conditions, so you don't see much of a phenotype when you delete them under those conditions. You get more information by overexpressing them than by deleting them.

Jay? Yeah, I want to make two points. One is coming back to the point Aravinda made about dimensionality being finite: it's kind of the only way you solve the problem, right? The matrix is infinite unless you have some way of bounding it, so I think the question of how to actually quantify completeness using that as the definition, rather than a big matrix, is really important — although I'm not sure how one does it. The related point is thinking about the marginal value of each additional ENCODE cell type. Presumably that marginal value decreases at some rate, and if at some level there's a choice between doing additional cell types versus doing something like perturbations — where you knock out a TF, or do some time series, or whatever — my bias would be very much towards the perturbations, because that's a largely unexplored area where the marginal value of anything you do is likely to be much greater, in terms of biological insight, than simply adding cell type upon cell type.

So I've collected a few comments. First: for transcription factors, both knocking them down and then measuring the effects means more and more RNA-based assays — not just vanilla RNA-seq, but all these other assays that really tell you what these elements are doing, like GRO-seq and others. So I think that should be a really big part: expanding the RNA world.
Regarding transcription factors: to get a predictive map, to be able to predict sequence to function, we should not just do knockdowns but dial-downs, so we can get quantitative effects of transcription factors — and do the same for enhancers and for combinations. And regarding the 400 cell types: I think 400 is such an underestimate that it's ridiculous. Since RNA is cheap, one thing we can do to figure out which cell types we should query with more assays is RNA-seq on all the new cell types we're discovering — and 400 is an understatement — then use our current sequence-to-expression models to see how well we do using the closest cell type, or the cell type we believe is most similar, and see how far off we are. That could tell us which cell types would actually benefit from deeper epigenomic mapping.

I was going to follow up on these two comments. In a sense, what we want to do is complete this matrix, but we can't, so we want to complete it as best we can, and I do think it's good to think about this in terms of the amount of information we gain. There's a formal theory of information: you can actually ask how much more information you get from each new experiment, and — as came up last night — you can even think about how many dollars you're paying per bit, and try to maximize that. I really do think that's the right way to think about it: you're never going to get to completion, but you want the maximum value, in bits of new information, relative to how much you spend.

Yeah, with respect to the cell type question, we actually have data on this, and the data show quite clearly that there are not diminishing returns to adding more cell types — there are actually expanding, non-additive returns. This is most clearly seen in the GWAS data, where the finer you sub-segment the cell types and tissues, the stronger the signals you see. If you don't have the right cell type or sub-cell type, you will sometimes see no signal; when you have it, suddenly you have everything, and your ability to be confident in it also rests on all the other cell types in which you don't see the signal. So there are expanding effects, plus you get greater power to do things like coordinated activation of enhancers and their promoters — and continuing on like that is fairly cheap.

Regarding perturbation biology: nature has handed us a huge number of perturbations in well-characterized differentiation schemata — hematopoietic differentiation and the like, things you can do in culture — and there are experts in the community who do this all the time. I think there's tremendous value in doing time courses systematically to expose condition-specific elements. Right now we have no handle on this; it's a huge blind spot in ENCODE. There are very few condition-specific experiments, and we really have no idea how big the condition-specific compartment currently is. But there are lots of great biological systems that could be brought to bear on it.

I just want to expand on that. On the RNA point, I was going to say there has been a lot of RNA work in comparison. There's also, from an element perspective, the question of how many elements you discover as a function of cell lines, and it's amazing how many experiments have been done with CTCF that still don't saturate it: when you get out to something like 40 cell lines, it's barely starting to curve. So you do add more value, and you could work out a cost-effectiveness relationship.
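That cost-effectiveness idea, together with the dollars-per-bit comment a moment ago, can be put together in a few lines: fit a saturation curve to the discovery data, then ask what the next cell line buys per dollar. A sketch with invented numbers (these are not the real CTCF counts), an assumed Michaelis-Menten-style curve shape, and an assumed per-experiment cost.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturation(n, e_max, k):
    # cumulative distinct elements found after profiling n cell lines
    return e_max * n / (k + n)

# Invented discovery curve for cell lines 1..40 (NOT real CTCF data)
n_lines = np.arange(1, 41)
rng = np.random.default_rng(1)
observed = saturation(n_lines, 90_000, 12.0) * rng.normal(1, 0.02, n_lines.size)

(e_max, k), _ = curve_fit(saturation, n_lines, observed, p0=(50_000, 5.0))
print(f"estimated ceiling ~{e_max:,.0f} distinct elements (k = {k:.1f})")

cost_per_experiment = 5_000.0              # assumed, purely illustrative
for n in (10, 40, 100):                    # marginal value of experiment n+1
    gain = saturation(n + 1, e_max, k) - saturation(n, e_max, k)
    print(f"cell line {n + 1:>3}: ~{gain:6.0f} new elements, "
          f"{gain / cost_per_experiment:.2f} per dollar")
```

The same bookkeeping works with bits of information in place of element counts, which is the maximize-value-per-dollar framing suggested above.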
On the condition front, I think John's right that there hasn't been a lot done, but there has been some, because we did explore how many new elements you get for, say, STAT1 under two different conditions, or over a time course, and you do get more sites — not surprising if you've done those kinds of experiments; I think you get something like double the sites going from one time point to the next. So it does add value. But you also get kinds of information from a time-course experiment that you don't get from a pure element-counting perspective: you get an understanding that's quite different, you get to see how the process changes as a function of time, which is very, very interesting, and that's an added value beyond element counting. And in our particular case, comparing to natural genetic variation and things like eQTLs — eQTLs versus response QTLs are like two different worlds. So I'll just leave it that there is added value to these time courses.

Yeah, definitely — it's better to watch the movie than the still.

I want to say one more thing, in response to something Mark and I chatted about on the bus on the way over today. I think the cell types — a cell type catalog — are almost a set of basis vectors. If you have a good set of them in hand, you can take a complex tissue and deconvolve the signal. We know from the genomic studies people have done in complex tissues — most notably in peripheral blood mononuclear cells, for which there is simply the most data to date — that changes in cell type proportions dominate most of the signal, and that signal hides a lot of the other things we would want. It's actually a very interesting signal in itself, but if you think about gene regulation, it hides a lot of what you would want to see. So there is value in being methodical about it, in the right context — choosing the right systems, later on functionally validating, and so on — but it will likely be a powerful computational resource.
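The basis-vector picture maps directly onto standard non-negative deconvolution: given reference profiles for purified cell types, estimate the mixing proportions in a bulk sample. A minimal sketch with simulated profiles; real pipelines add marker selection and careful normalization, which this deliberately skips.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)

# Reference "basis vectors": rows = features (genes or regions),
# columns = purified cell types
n_features, n_celltypes = 500, 4
reference = rng.gamma(2.0, 1.0, size=(n_features, n_celltypes))

# Simulate a bulk tissue: a 60/25/10/5 mixture plus measurement noise
true_props = np.array([0.60, 0.25, 0.10, 0.05])
bulk = reference @ true_props + rng.normal(0, 0.05, n_features)

# Non-negative least squares recovers the cell-type proportions
weights, _ = nnls(reference, bulk)
print(np.round(weights / weights.sum(), 3))   # ~[0.6, 0.25, 0.1, 0.05]
```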
Olga, Adam, and somebody else was raising a hand before, whom I've forgotten by now — maybe.

So I completely agree that cell types are important, and I also agree that the perturbations — especially some of the important natural perturbations — are critical. But just to formalize what has been popping up in many of our comments: I think we want to figure out the general set of cell types people are interested in, do some sort of genome-wide experiment on them, and essentially find clusters of cell types, just to narrow things down. What we've seen is that you can certainly use a subset of chromatin marks to predict others, and you can sometimes use a subset of that subset to look at a different cell type and predict its behavior. That doesn't mean you can jump from a liver cell to a lung cancer cell, but perhaps if you look at two different epithelial cells and you have enough, say, expression information, then you can backtrack. So essentially I think we need to figure out our general universe of things — obviously there are 400-plus cell types, and who knows how you divide it, but we can do that — then which of those 400 we're most interested in from the perspective of the community, and then, on those, do some genome-scale experiment that lets us cluster them, so we can do perturbations on those key, you know, basis vectors in each cluster. And the same for the potential perturbations.
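One toy version of that cluster-then-perturb scheme, under assumptions of my own choosing (a cheap genome-wide readout for every cell type, correlation distance, average-linkage clustering, and a medoid as each cluster's representative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Simulated cheap assay: 12 cell types x 200 genes from 3 latent programs
rng = np.random.default_rng(3)
programs = rng.normal(size=(3, 200))
membership = np.repeat([0, 1, 2], 4)
profiles = programs[membership] + rng.normal(0, 0.3, size=(12, 200))

# Cluster cell types on correlation distance, cut into k = 3 clusters
dist = pdist(profiles, metric="correlation")
clusters = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")

# Pick one representative (the medoid) per cluster for the expensive assays
sq = squareform(dist)
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    medoid = members[np.argmin(sq[np.ix_(members, members)].sum(axis=1))]
    print(f"cluster {c}: cell types {members.tolist()} -> perturb #{medoid}")
```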
What I hear from this, and from multiple comments before — it's being called by different names, partly for technical reasons — is that if there is an overarching theme, it's this: how can we set up a project that leads to generalizable insights? So that from the cell types we study, we can generalize to other cell types; so that the techniques we use can be generalized to other systems; so that the model of, say, sequence to expression, or the regulatory code, or whatever people want to call it, can be developed on one system but then gives us a model we can apply to a new piece of sequence, from a new individual, in a new disease, and make a prediction there. That might be a good generalizing principle for bridging the gap between basic science, which is what this session was about, and disease, which is what the previous session was about. Aravinda, Adam, and then we have to cut, because we all have to go.

So, you know, it's coming out in part of the discussion — probably just from the way we are discussing it — that there's all this basic science being done that will eventually be applied to disease, but that's a two-way conversation. Most of us assume that disease is some kind of extreme of the normal biology we have already studied. Two kinds of epithelial cells may look very similar at the resting stage, but one is in the lung and one is in the gut, and we know their environments are quite different. So partly, I think the disease studies can and should inform at least some of the directions. I'm not saying they should be the arbiter — it should be unbiased — but they may be one bit of an arbiter for gaining information on where we might put more emphasis: for example, on the kinds of perturbations, or the kinds of cells; in some circumstances the number of cell types may matter much less than in others. And disease, and the availability of samples in disease, means that you have a human in vivo experiment — as unpleasant as it is to say that — and I think that's a very valuable fact.

Adam, you have the last word.

Yeah, I just want to make one general point about this tension between maximizing the information of the next experiment and systematically filling out the matrix. This is a natural tension that ENCODE has to deal with, and has dealt with already to some degree, but I'd like to caution against being too strongly focused on maximizing the marginal information of the next experiment. It prohibits you from filling in areas of the matrix that you don't a priori think will be informative but that may turn out to be, and it also restricts your generalizability, because there are then big holes in the matrix with no stake in the ground to get your bearings from. So there has to be some balance between these two things, and I think an element of the project should continue to focus on systematic, regular description of many cell types.

Thank you very much, everyone. We have a lunch break now, which ends at 1:30 — be back here at 1:30. Organizing committee: grab your lunch and come back here; no break for you.