So, our last, but definitely not least, speaker is Katie Pollard from UC San Francisco. Katie has done a lot of great work in comparative genomics, and her lab has contributed many useful tools for comparative genomics and for integrative analysis of large-scale data. Actually, I'm a big fan of Katie's work, not only because the science is exciting: the software from her lab is always quickly released, deposited to GitHub, and made available as Bioconductor packages. I think that's also the spirit of ENCODE, sharing the work with the whole community, and that's what good science is about, right? So today the title of Katie's talk is "Many Transcription Factors Recognize DNA Shape." So, Katie. Thanks a lot. It's great to be here today to talk about some very new work in my lab at the interface between structural biology and bioinformatics. Before digging into the main topic of the talk, though, I wanted to give a brief overview of some of the other regulatory genomics projects in my lab that motivated us to move toward this new project. The motivation for a collection of projects we've been working on, which use machine learning and statistics to model gene regulation at the level of individual regulatory elements and chromatin interactions, is to understand how non-coding variants could potentially produce phenotypes, as several of the speakers have mentioned. The main hypothesis we started with was that non-coding variants alter transcription factor sequence motifs in regulatory elements, and then alter expression of their target genes. And so to understand that, you first have to know where the regulatory elements are; those are the green boxes. You have to know whether a mutation would alter the function of that enhancer; I'm denoting that with the sequence logo there.
And then a big problem is mapping those to the right gene, because we know the closest gene is only right about 10% of the time. So we've been approaching this computationally, utilizing data from ENCODE and other public repositories that we're very grateful to have available to us, to try to shed some light, with computational models, on the experimental techniques like Hi-C and some of the other assays that you've been hearing about today and throughout the meeting. So the first project was to get a better handle on where the truly functional enhancers are in the human genome. We were motivated to do this because many regions of the genome carry one or two or even a handful of the marks of active enhancers, and they get called an active or a weak enhancer in a genome segmentation. But we saw, when we looked at them and tested them either in vivo or with massively parallel reporter assays, that many of them either don't drive gene expression very well or, when we look at Hi-C data, don't consistently loop to a particular gene. That suggests they're not really biologically active. So are some of these potential enhancers better than others? To attack this, we used a biological definition of an enhancer: that it can drive a reporter gene in a developing embryo. We focused on developmental enhancers because the assay works well there, but we're also doing this in various stem-cell-derived in vitro systems. We built models that would predict, or distinguish, the sequences that can drive enhancer activity from the ones that don't. And we were able to do much better than what you get if you just overlap or intersect individual ChIP-seq data sets. This is a performance curve: at a low false positive rate, we had a much higher true positive rate than the typical approach of just intersecting or overlapping data sets.
So with machine learning and the integration of massive amounts of data, thousands of data sets, some of which seemed irrelevant to the task at hand because they were from a cancer data set or something, we really could do better at predicting which candidate enhancers are biologically active. And then we go and test those either in vivo or in stem-cell-based systems; here, these are beating cardiomyocytes that we derived with colleagues from pluripotent stem cells. We see that these predictions validate at a very high rate, higher than if you just took a random subset of the things that might be an enhancer. So we think this biological definition is very important. The next question is whether a mutation in an enhancer has an effect or not on its function. A number of people have been thinking about this. Our contribution was fairly mathematical: to try to figure out what happens to the ability of transcription factors to bind when you induce a mutation in an enhancer, knowing that there can be turnover, especially over evolutionary time periods, but that even in the human genome there can be multiple compensatory mutations. We looked at the total effect on the enhancer activity, the cumulative regulatory potential. This turned out to be sort of an unsolved math problem, because the hits for binding are Bernoulli trials, and they're correlated because the sequences are related to each other. So we worked out a distribution to get a p-value for the net change in binding potential when you have multiple mutations in a region of the genome. And these data predicted very nicely changes of function, where you can see a restriction in the domain of expression for this enhancer in the mutant compared to the wild type. So the really hard problem, or at least I thought it was going to be really hard, would be to map these enhancers and their variants to genes.
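The net-change idea can be sketched numerically. In this toy Python example, the PWM, the sequences, and the Monte Carlo null are all illustrative assumptions, not the lab's published analytic distribution: we score a sequence's cumulative binding potential under a position weight matrix and estimate a p-value for the net change introduced by a set of substitutions.

```python
import random

# Toy log-odds PWM for a hypothetical 4-bp motif. The matrix, sequences,
# and the Monte Carlo null are illustrative assumptions, not the analytic
# distribution for correlated Bernoulli hits described in the talk.
PWM = {
    "A": [1.2, -0.5, -1.0, 0.8],
    "C": [-0.7, 1.0, -0.3, -1.2],
    "G": [-0.9, -0.8, 1.5, -0.4],
    "T": [0.3, -0.2, -1.1, 0.9],
}
WIDTH = 4

def total_binding_potential(seq):
    """Sum the PWM score over every window: a crude cumulative
    regulatory potential for the whole element."""
    return sum(
        sum(PWM[seq[i + j]][j] for j in range(WIDTH))
        for i in range(len(seq) - WIDTH + 1)
    )

def net_change_pvalue(ref, alt, n_perm=2000, seed=0):
    """Monte Carlo p-value for the net change in binding potential when
    `alt` carries one or more substitutions relative to `ref`. The null
    re-places the same number of mutations at random positions."""
    rng = random.Random(seed)
    observed = abs(total_binding_potential(alt) - total_binding_potential(ref))
    k = sum(a != b for a, b in zip(ref, alt))  # number of substitutions
    hits = 0
    for _ in range(n_perm):
        perm = list(ref)
        for pos in rng.sample(range(len(ref)), k):
            perm[pos] = rng.choice([b for b in "ACGT" if b != perm[pos]])
        delta = abs(total_binding_potential("".join(perm))
                    - total_binding_potential(ref))
        if delta >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

The permutation null is the simplest stand-in for the closed-form distribution the lab derived; it captures the key point that multiple mutations are judged by their cumulative effect, not one motif hit at a time.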
And I knew this would be hard because of mounting evidence that the closest gene, which is what many of us have been using for a long time, is not usually the target gene. The structure of this toy locus could look something like this, where this genetic association pointed to this region and this enhancer prediction pointed to maybe this variant, but we certainly wouldn't want to follow up gene C in the pathways that it's in, because the enhancer is actually looping over here to gene A, as we've been hearing about in other talks. So there is increasingly experimental data to look at this, and Bing told us about a massive amount that should be out soon with promoter capture, for example. But we thought it would be interesting to see if there's information in other data, in particular in the one-dimensional genomic signatures along the genome, that would basically allow us to do an in silico Hi-C and accurately predict Hi-C interactions. So if we look at expressed genes in a locus, and here we're inside of a TAD, so we're not trying to recapitulate the TAD structure but rather individual promoter-enhancer interactions within, say, a one or two megabase part of the genome, and we have lots of active enhancers and a number of active promoters: can we pair them up with each other accurately or not? And this is important for two reasons. One is that unless you do a lot of sequencing, the experimental procedures frequently still have low enough resolution, at least so far, that individual enhancers can't be deconvoluted: a 5, 10, 15, or 20 kb fragment could have multiple enhancers or multiple promoters on it. That will hopefully go away as the experimental techniques get better and become cheaper. The other motivation is just to understand what protein binding events, RNA binding events, sequence features, et cetera, predict these interactions. Can we learn a signature for looping chromatin?
And so even when we have very fine, high-resolution chromatin interactions, understanding the mechanism is still an interesting question. So we again used machine learning, and we showed, here in dotted lines, what happens with the closest gene or with all genes in an increasingly large window: at low false positive rates you almost never get the right gene, in the dotted lines across these different cell lines. But our algorithm, called TargetFinder, has very high accuracy, over 90% accuracy, or power to detect the true promoter-enhancer interactions within TADs. What I thought was most interesting was why we can do so well. It wasn't so much because of what was going on at the enhancer or at the promoter, because you can imagine that inside of a TAD there are many active enhancers and many active promoters, and they all look active. What was really interesting was the signature for when an interaction happens, saying that this enhancer loops to gene A and, say, not to gene B: the information is on the intervening chromatin, the piece that would be on the loop if there were a chromatin loop. And we found very different proteins bound to the loops than to non-looping pieces of chromatin. In particular, a lot of signatures of heterochromatin on the loop, and cohesin and CTCF within about 5 or 6 kb of the enhancer and promoter, but an absence of them on the loop. So if this loop were happening, there would be some here and some here, but not here in the middle. CTCF by itself wasn't particularly predictive, but if you combined it with various other proteins, including the cell-type-specific transcription factors, you can literally read the Hi-C data out just by looking at a handful of ChIP-seq data sets, on the order of eight to ten ChIP-seq data sets. So a few histone marks, and they're not the H3K27 acetylation or H3K4 methylation marks that you would use to predict the enhancers and promoters, but different ones.
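To make the "signature is on the loop" idea concrete, here is a small synthetic sketch. TargetFinder itself uses boosted trees on thousands of real ENCODE features; the feature names, the simulation, and the simple correlation ranking below are all illustrative assumptions. It builds a feature matrix for candidate enhancer-promoter pairs, places the predictive signal on the intervening-window features, and checks which region's features best predict the interaction label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a TargetFinder-style feature matrix: one row per
# candidate enhancer-promoter pair, with (fake) mean ChIP-seq signal for a
# few marks over three regions: the enhancer, the promoter, and the
# intervening "window" chromatin. All names and numbers are illustrative.
marks = ["CTCF", "RAD21", "H3K9me3", "cell_type_TF"]
regions = ["enhancer", "promoter", "window"]
feature_names = [f"{m}@{r}" for r in regions for m in marks]

n = 2000
X = rng.normal(size=(n, len(feature_names)))
w = np.zeros(len(feature_names))
w[-len(marks):] = [1.5, 1.2, 1.0, 0.8]  # put the signal on the window block
y = ((X @ w + rng.normal(scale=0.5, size=n)) > 0).astype(float)

# Rank features by absolute correlation with the interaction label.
# (TargetFinder uses gradient-boosted trees; correlation is a cheap proxy.)
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranked = [feature_names[j] for j in np.argsort(corr)[::-1]]
print("most informative features:", ranked[:4])
```

On this simulation, the top-ranked features are the `@window` ones, mirroring the talk's observation that the looping signature lives on the intervening chromatin rather than at the anchors.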
A few of those, some of the cohesin complex proteins, perhaps CTCF and some of its cofactors. And then you can literally read out a high-resolution Hi-C from that one-dimensional genomic signature with very high accuracy. So just to summarize this background or motivation section of the talk: we found that machine learning on biologically validated enhancers and interactions led to very interesting cell-type-specific predictions about gene regulation, and highlighted the importance of really looking at specific enhancers, the ones that are consistently looping to the same genes and that will validate when you do an assay, like a reporter assay, in an animal. Not all enhancer-like regions are really doing this. But to get to the second part of the talk: despite all these interesting observations, we found a lot of things that couldn't be explained by what I've described so far and by what most of us have been doing. One of them is functional variants that are outside of enhancers. Once we've done a really nice job of finding all the enhancers, why are we seeing some variants that aren't in those? And I'm using the term enhancer loosely; I'm also including repressive elements, potentially. So these are variants that are just not in anything we would call a regulatory element that would interact with a promoter and modulate expression of a gene; ones that would be, say, on the loop of the chromatin. Well, the TargetFinder project made a prediction about that. The fact that we saw this interesting and very predictive signature on the looping chromatin, different when an interaction was happening than when it wasn't, suggests that variants on the loop could be functional.
And so instead of focusing on this enhancer variant that may or may not be functional in this example, maybe I should be looking at this variant over here that, say, creates a CTCF binding site and now makes an interaction happen that wouldn't have happened before, because now there's a signature on this looping chromatin that prevents that interaction and maybe creates a loop here instead. And so, for a variety of different ENCODE cell types, we have predictions of regions on looping chromatin that aren't enhancers but that we think are modulating enhancer-promoter interactions and would be interesting to test. The low-throughput version of the approach, to really understand if this hypothesis holds, would be to go into a few loci and really test these with genome editing. If we see a signature, then we'd try to think about something that we could do genome-wide. So we're very excited about this. It's not the subject of the talk today, but it's a very big direction for us going forward. And it makes sense: we do see, for example, when you look in cancer cell lines, indels and structural variants that affect these kinds of binding sites for structural proteins. So what I do want to talk about for the rest of the time is another unexplained phenomenon. And that is variants that are in enhancers. So you have a sequence; you're pretty sure it functions as a regulatory element; it has a variant in it; but as far as you can tell, the variant doesn't disrupt a sequence motif. That could happen because it's for a protein that we don't have a good motif model for. But even when you do de novo sequence motif finding, you don't see anything that's enriched. So, besides the obvious explanation, that we just haven't learned all the sequence motifs yet, which I think is true, is there something else going on? And the reason we got really interested in this is that we analyzed the ChIP-seq peaks for about 250 ENCODE transcription factors.
We looked at the top 2,000 peaks, the ones that were most confidently called and really strongly bound by the transcription factor, and 23% of them don't contain a sequence motif. And that's not just that they don't contain the consensus sequence or don't match the PWM: when we do de novo motif calling, taking all those peaks and asking what's enriched, in case there's maybe some motif we just didn't know about, we learned some new motifs, and 23% don't even contain those new motifs. There's no enriched sequence, and that's using a very loose cutoff for calling a sequence enriched. So what's going on? Well, we became really interested in work by Remo Rohs, Richard Mann, Harmen Bussemaker, and some of their colleagues on the structure of DNA and on how protein-DNA binding is really a biophysical phenomenon; the idea that maybe a mutation like this is altering the shape of DNA. That may or may not affect a sequence motif, but a number of DNA-binding proteins are well known to really recognize the shape: things like the major and minor groove widths, the helical twist, the propeller twist, the roll. So maybe the mutations are affecting these biophysical features. The idea is motivated by work these other labs have done on sequence motifs. All the work in this field so far has focused on parts of the genome where there is a match to the sequence motif, and has shown that DNA shape can provide additional specificity, or can distinguish between, say, two transcription factors that have pretty similar sequence motifs. But we wanted to ask: what about this 23% of peaks that have no sequence motif? Is DNA shape maybe playing a role there, and maybe a very prominent role? So the idea is to develop an algorithm that does motif searching, just like you would for DNA sequence motifs, but searching for shape motifs.
And then to apply that to all the ENCODE data, predict shape motifs, and see where they occur, what they look like, and whether they are the same as the sequence motifs or different. This project was really enabled by a computer program from Remo Rohs's lab called DNAshape, where you can put in DNA sequences and they get translated into a vector of shape features. Anyone who's familiar with the biophysics of DNA will know what I'm talking about here; if you're not, trust me that these describe different important physical aspects of DNA, and they're encoded basically at the level of 5-mer sequences. So you take five base pairs, put them into the program, and get back a number for each of these shape features that tells you what the values are. Those have been derived by his lab using molecular dynamics simulations. Our work was to translate the whole genome into DNA shape and then to develop an algorithm to do motif hunting in the DNA shape realm. We implemented this with Gibbs sampling, which is the approach commonly used in de novo sequence motif finding. The criterion we're trying to minimize is the distance between these feature vectors; we used Euclidean distance in the first pass. So these are pairwise distances between your candidate instances of the motif. And once you learn, say, from a collection of believable binding sites, what shapes you think a particular transcription factor prefers, you can then score other binding sites and ask whether they contain the motif or not, or call hits. This is an example of what it would look like, for a particular shape feature such as roll. In most of the DNA, in gray here, you can see a huge variance in the values. Then, when you get inside of a motif, the variance is very low: instead of just a wiggly background, there's a very distinct signature of values. So this would be what we would call a motif.
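Here is a compact, self-contained sketch of motif finding in shape space. Everything in it is an illustrative assumption: the pentamer-to-roll lookup is a deterministic toy standing in for the molecular-dynamics-derived DNAshape tables, and in place of the Gibbs sampler described above it uses an exhaustive-seed greedy search over the same Euclidean-distance objective, so that the toy is deterministic.

```python
import math
import random

def roll_profile(seq):
    """Toy pentamer -> roll lookup (deterministic, made-up values; the
    real DNAshape tables come from molecular dynamics simulations)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    vals = []
    for i in range(len(seq) - 4):
        x = sum(code[b] * 4 ** j for j, b in enumerate(seq[i:i + 5]))
        vals.append((x % 97) / 97.0)
    return vals

def find_shape_motif(seqs, width=6):
    """Find one window per sequence whose roll profiles are mutually close
    in Euclidean distance. The real algorithm uses Gibbs sampling over this
    objective; the exhaustive-seed greedy search below is a deterministic
    stand-in for a toy example."""
    profiles = [roll_profile(s) for s in seqs]
    best_total, best_starts = float("inf"), None
    for s0 in range(len(profiles[0]) - width + 1):  # seed on sequence 0
        seed = profiles[0][s0:s0 + width]
        starts, total = [s0], 0.0
        for p in profiles[1:]:
            d, s = min((math.dist(seed, p[s:s + width]), s)
                       for s in range(len(p) - width + 1))
            starts.append(s)
            total += d
        if total < best_total:
            best_total, best_starts = total, starts
    return best_starts, best_total

# Demo: plant an identical 10-bp instance into five random A/C/T
# backgrounds and recover its position from shape values alone.
rng = random.Random(1)
planted, seqs = [], []
for _ in range(5):
    bg = "".join(rng.choice("ACT") for _ in range(60))
    pos = rng.randrange(0, 50)
    planted.append(pos)
    seqs.append(bg[:pos] + "G" * 10 + bg[pos:])
starts, total = find_shape_motif(seqs, width=6)
print(starts == planted, total)
```

The demo never looks at the letters when scoring, only at the numeric shape profile, which is the key move the talk describes: motif discovery operating entirely in shape space.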
And the null distribution is like this flanking region, where you get a big variance. We look at that and say, well, this variance is much smaller than you would expect. So we can call hits in a set of peaks that we didn't use to learn the motif model. And then, of course, you find a bunch of them, and the question is: could they happen just by chance? So we look in flanking, non-peak regions of the genome nearby with similar sequence content, call hits there as well, and then do an enrichment test. And what we found, across two ENCODE cell lines here and a number of different shape features, was that many transcription factors have a shape motif. They're very common. Most transcription factors have more than one, so they recognize both, say, the roll and the minor groove width. And they may preferentially use one in one cell type and the other in the other cell type, or they may use both. What was really cool was comparing this to what you get when you do sequence motif discovery. Those peaks that don't have sequence motifs have at least one shape motif, often at the peak center, just like you would expect for a sequence motif. And in about 25% of these ENCODE ChIP-seq peaks, there are both a sequence and a shape motif. So this raises the question: are they the same thing, or are they different and working together? Well, they can be similar. Here's an example for NRSF. This is its roll motif; it has this wiggly roll. And when you take the sequences that are hits for this shape motif and just look at what sequences you got in the genome, you can use those to build a logo. Discovering this used only the numerical values for roll, but I can ask whether there is a single, preferred sequence that gives this roll, or whether there are just a bunch of sequences that give the same roll.
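The enrichment test against matched flanking regions can be phrased as a one-sided Fisher's exact test. A minimal Python version follows; the counts in the example call are made up for illustration, and a real analysis would also match the controls on sequence content, as described above.

```python
from math import comb

def fisher_enrichment_pvalue(hits_peaks, n_peaks, hits_flank, n_flank):
    """One-sided Fisher's exact test (hypergeometric upper tail): the
    probability of seeing at least `hits_peaks` motif hits among the peak
    regions if hits were distributed at random across peaks and flanks."""
    total = n_peaks + n_flank
    k_total = hits_peaks + hits_flank
    p = 0.0
    for k in range(hits_peaks, min(n_peaks, k_total) + 1):
        p += comb(n_peaks, k) * comb(n_flank, k_total - k) / comb(total, k_total)
    return p

# Illustrative counts: 80 of 100 peaks contain the shape motif, versus
# 20 of 100 matched flanking controls.
print(fisher_enrichment_pvalue(80, 100, 20, 100))
```

With those toy counts the p-value is vanishingly small, which is the kind of evidence that lets you call a shape motif enriched in peaks rather than a chance pattern.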
And there could be a bunch that give the same roll, because there's not a one-to-one mapping between sequences and structural features; in fact, very different sequences can give the same roll. But in this case there are some consistent sequences, and they look very similar to the sequence motif that's in Factorbook, but perhaps with some flanking sequence that may provide additional specificity. Here's an example where they're similar but not the same; it's a refinement. This is for c-Fos. It's a propeller twist motif. Here, this ATTGG core from the sequence motif appears in the sequences that have this shape motif, but there's also this flanking GC-rich region that's necessary to give this pattern and that's not really picked up as part of the sequence motif. And this can go the other way, too, where the shape is, say, a core part of a bigger sequence motif. Or they can be totally different, which obviously suggests they're not occurring at the same positions in the genome, which I'll show you in a minute. So here's a helical twist motif for math. This is the sequence that gives it, so it's pretty specific, and it really doesn't match the sequence motif. Now, the shape motifs that are different from sequence motifs can be nearby. Here's an example that has no sequence specificity but has a pretty specific helical twist. It can't match the sequence motif, because it has no sequence specificity, but we find it very consistently three base pairs upstream of the sequence motif. Here's another example, for roll, and this is the sequence that drives it; it's totally different from the sequence motif, and it's not right next to it. It's 30 base pairs away, very consistent peaks of this shape motif 30 base pairs away, suggesting maybe a cofactor or a complex.
And the shape motifs can be different between transcription factors that have similar sequence motifs. So FOSL1 and ATF3 are both bZIP transcription factors. They have very similar sequence motifs, shown here, except they differ a little bit at this AG in the middle. And they prefer very different shape motifs: this one likes a helical twist with a certain pattern, and this one likes roll. They share a propeller twist motif that relates to some of the positions that are highly conserved in the sequence. Okay, so to wrap up: this is really new and we're barely getting started, so I have a lot of open questions, and you probably do too. One is to combine, in a more rigorous framework, this integration of sequence and shape motifs. Another is to start looking in different contexts. We focused on the top 2,000 ChIP-seq peaks because we wanted to be in situations where we were really sure the transcription factor was binding while we were benchmarking and getting started, but we want to start looking at overlapping peaks and at weaker peaks. Another project involves a collaboration with Benoit Bruneau's lab, with whom we recently published work on deleting transcription factors and showing that you get ectopic binding of their cofactors at other places in the genome, which drives aberrant gene expression. We couldn't find a really compelling sequence explanation for those extra binding sites that you gain in the knockout, and we want to look at the role of shape there. And then, finally, we're looking at whether DNA shape may provide some insight on cases where enhancers have conserved function over long evolutionary time periods without having conserved sequence; maybe they have conserved shape. And obviously, going back to the question I started with about non-coding mutations, we want to develop a scoring system for SNPs, just like we did for sequence motifs, to ask whether SNPs are predicted to change DNA shape or not.
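As a sketch of what a shape-based SNP score could look like, one can compare the local shape profile of the reference and alternate alleles. This is a hypothetical scoring scheme, not the lab's method, and it reuses a toy pentamer lookup in place of the real DNAshape tables.

```python
def roll_profile(seq):
    """Toy deterministic pentamer -> roll lookup standing in for the
    molecular-dynamics-derived DNAshape tables (illustrative values)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    vals = []
    for i in range(len(seq) - 4):
        x = sum(code[b] * 4 ** j for j, b in enumerate(seq[i:i + 5]))
        vals.append((x % 97) / 97.0)
    return vals

def snp_shape_delta(context, pos, alt):
    """Score a SNP by how much it perturbs the local roll profile.
    A substitution at `pos` changes the five pentamers overlapping it,
    so we compare the full profiles of the two alleles and report the
    largest per-position change."""
    ref_vals = roll_profile(context)
    alt_seq = context[:pos] + alt + context[pos + 1:]
    alt_vals = roll_profile(alt_seq)
    return max(abs(a - b) for a, b in zip(ref_vals, alt_vals))

# Hypothetical usage: a SNP in a 15-bp window of genomic context.
ctx = "ACGTACGTACGTACG"
print(snp_shape_delta(ctx, 7, "A"))
```

A score like this could be computed for every SNP and thresholded, analogous to scoring the change in a PWM match, but entirely in shape space.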
I want to specifically acknowledge Sean Whalen, who worked on TargetFinder, and Hassan Samee, who's leading this work on DNA shape. And I'm happy to take your questions. Thanks. Yeah, if I understood correctly, you were looking for structure in each of the DNA shape features independent of the other features, and I'm wondering why that's the natural way to treat the information, instead of trying to look at everything simultaneously. Yeah, that's a good question. That would be a nice extension, to try to look at them together. We've seen that factors will often have a very specific preference, say for roll, but not care at all about helical twist, and we were afraid that if we put them all together in a simple way, the noise would override the signal. If you did it in a good way, where you could have something sparse that would not look at the noisy part and would focus on the part that was specific, that would work. But for our first pass we did them individually and then combined them. I agree that some kind of integration, and eventually integration with the sequence motifs too, is where we're headed. Yeah, Zipeng? Really great results. Can you validate this using an MPRA kind of experiment? Say, cut out that piece that has no sequence motif but has the shape motif, to see if the TF actually binds. Yes. Or look into PBM or other in vitro binding data. Yeah, absolutely. The folks who developed some of the things we base this work on were looking at, for example, SELEX data, but they were always doing it in cases where there was also a sequence motif. So we would now need to extend that to this case, where we're considering them separately. But yes, we absolutely have to do some validation. We just started this project a couple weeks ago, so we haven't done that yet. Yeah. So Katie, I have a question. Now you have the sequence motif and you have the shape motif. Let's just scan the human genome. Yes.
And kind of just predict the ChIP-seq results much better, just based on the sequence. Right, yes. So we know you can kind of predict it from the sequence motifs, but there's a lot that you can't predict. The hypothesis is that that part would be explained by the shape motifs, and we're working on it, but we haven't done it yet. Cool, cool. Yeah. All right, let's thank Katie again. So now we will have the most important session of this three-day workshop. I think we have talked enough and shown you what we think is best for you; of course, we may have gotten that wrong. So it's time for you to tell us what we have done right, what we have done wrong, and how we can help make the ENCODE data and software more accessible to you. I also want to take this opportunity to ask all the ENCODE members, current or past, to please stand up. All the ENCODE members, please stand up. Bing, Eugene, stand up and wave to the people. So, just in case we haven't answered your questions enough, look around, see who stood up, and you can approach them and ask more questions about how we can help you. At the same time, Mike Pazin and Dan Gilchrist are here; they are the program officers who supervise the ENCODE project. So if we haven't done enough, talk with them; that can make the process faster, I think. So now I'll pass my mic to Mike. All right, everybody, I'm Dan Gilchrist, one of the program officers from NHGRI who works with ENCODE. I just want to say thank you all very much for coming here and for sticking around to the bitter end. Now we would really like to hear from you about what we can do to make the ENCODE resources most useful for your research. And one of the key tools we have to learn about what you want is a survey; please take some time to fill that out before you take off. If you fill it out while I'm talking, I will not be offended in the least.
A very big hand to our speakers, who came from far and wide, and sometimes just from across the hall, and gave some fantastic talks. I want to take a second to thank our hosts, the ENCODE DCC and everybody here at Stanford. I think they've done a fantastic job of keeping things running smoothly and having things really well organized, bringing in piano players and beer and all kinds of fabulous stuff. So thanks very much to the DCC, and especially to Jean Davidson, who has done a lot of the heavy lifting to make this such an excellent event. I also want to say thank you to all of the tutorial presenters. I know it's tough sometimes to get up here and present something that may be in some stage of experimental development and not know how well it's going to go. They were all very brave to get up here and share their work with us, and hopefully this has been a valuable introduction to many of these tools. And thanks also to the ENCODE outreach working group, who spent a lot of time planning this event, and especially Mike Cherry, Fong, and John Stam, who put in many hours working on this. So again, I'm going to plug this survey. I'm aware that this is not a survey monkey, this is a survey ape, but hopefully it gets your attention. Please fill this out before you leave; we really need to know what you think about ENCODE resources. There were a couple of goals for this meeting. One is to show you some of what ENCODE has to offer, and there's a lot that ENCODE has to offer. We don't want folks wandering around in the desert out there; we want to guide you to the resources that can help your research and move you in the right direction efficiently, hopefully more efficiently than the DC Metro system, which is what I have put up there. But we really want to direct you to what you need to make your research work.
But the second key purpose of this meeting is that we really need to hear from you about what we can do to make ENCODE resources better, so they're easier to access. That includes both the scientific resources, the data and the tools that the consortium is putting together, and also outreach activities like this one; we'd love to hear what you think about that. And I'll take two seconds to say why this feedback is so important. The ENCODE consortium is going to take what we hear from you here today and use it for planning purposes: to think about how we should focus our scientific efforts, how we should potentially refocus data production efforts, and how we can improve the presentation of the data and the resources to y'all. And I want to point out that we have our annual consortium meeting next week, and one of the kickoff sessions is going to be a discussion of what we learned from this meeting and from last year's ENCODE users meeting. We're going to take this feedback and really think about how we can put things out there that y'all can use. I also want to point out that NHGRI uses feedback from these meetings too. We use it to focus, again, the ENCODE production efforts; we use it to plan outreach efforts for ENCODE and other programs that NHGRI supports; and we use it to think about how we should shape resource projects like ENCODE, and other resource projects that NHGRI is involved in. So please tell us what you really think. Fill out the survey. Come talk to us; I think Mike, myself, and many of the ENCODE consortium members will be around after this is over. Feel free to email us and express your ideas. And then we'd really like to open this up to you: grab a microphone, or someone will hand you one, and tell us what you think.
So one thing we're interested in is: if we do another event like this, are there any sessions you found to be so fantastic that we would have to do a session like that again? This was great; I loved most of the sessions. One thing I was wondering is whether, in one of the tutorial-type sessions, you could show us how to download a data set from ENCODE and even do a simple analysis of it; maybe pull out the sequences, or some such exercise might be very helpful. Yeah, so, actually download a particular track, maybe the data set corresponding to a track, on our own computers, and do a simple analysis of some kind. I think it has been a much better meeting than the previous one in terms of the practical sessions; I think the program has been better adapted than the previous time, so I'm very happy. But would it be possible to include a practical on how to link enhancers with target genes? There have been three or four talks about that, like the previous one. So a session focusing on enhancer-target gene linkages. Yep. Hi, this is a fantastic meeting, thank you. What I got from it is mostly an overview of the types of data that ENCODE has, some uses, and the fantastic science people do with it. But I would really enjoy a more practical, hands-on type of meeting for specific uses, focused on specific types of analysis or specific types of data, and on how to really bring it back to the lab and to other people in the lab; basically more of a tutorial-workshop type of thing than an overview type. So, spending more time getting your hands dirty and working with the data?
Yes, and I also feel that there are a lot of things that can be done with ENCODE, some of which are relevant to what people do and some of which are much less relevant to what actually goes on in the lab. I think it would be useful to have satellite workshops, or several dedicated workshops throughout the year, or specific tutorials or YouTube videos online on how to do this, how to integrate that, what's possible for mouse, what's possible for human, and so on. I think those are really great points. A couple of things I'll bring up since you mentioned them. We do have satellite sessions at some meetings; we actually have two, one put on by the ENCODE DCC and one put on by Fong and Ting Long, that are both going to be at ASHG this fall in Vancouver. We had one at the previous ASHG meeting, and we're always interested in putting on more of those. So if there are meetings you think would be good venues for these, we'd be happy to hear about them. A second thing I'll mention is that we try to make tutorial materials available online, both through the ENCODE portal and through the NHGRI website. All of the videos from last year's users meeting are posted on the ENCODE portal along with slides, and the ones from this meeting will be too. But I think I'm also hearing from you that more focused tutorial materials would be helpful. Yeah, more hands-on. I have a follow-up question. When you say more practical and hands-on, would that be more "I have this problem, how would I solve it with ENCODE data," or "I want to do this particular process, what are the steps in it?"
For me specifically, there are two types of things I would really love. One is where you start with a specific data set, for example a list of genes or a list of regions. In our specific case, we have a resource of genetic variation in mice, we know a lot about the phenotypes of those mice, we have mapped a lot of their genes, and so on: what layers can we add, and how, step by step? The other, of course, is how we can integrate our own data, for example our hundreds of RNA-seq samples, and how to access the whole pipelines programmatically rather than through the portal. I understand that programmatic access and processing hundreds of samples would not be an ideal focus for most of the people here, but that is something I would be interested in as a specific tutorial. Another example: we had a workshop at UCLA given by GATK; those people go out and do that, and it was very useful and popular, basically because it was right there and accessible. Coming here from UCLA is not a big deal, unlike for some people who came much further, but in general I think it would be useful if that kind of thing were more available. So you're suggesting a meeting in LA next year?
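The programmatic-access point raised here can be sketched against the ENCODE portal's public JSON interface, which returns machine-readable search results when `format=json` is added to a query. This is a minimal illustration, not an official ENCODE client; the helper name and the particular filters are ours, chosen only as an example.

```python
import urllib.parse

BASE = "https://www.encodeproject.org"

def build_search_url(filters):
    """Build an ENCODE portal search URL that returns JSON instead of HTML."""
    params = dict(filters)
    params["format"] = "json"  # ask the portal for machine-readable output
    return BASE + "/search/?" + urllib.parse.urlencode(params)

# Illustrative filters: released RNA-seq experiments.
filters = {"type": "Experiment", "assay_title": "total RNA-seq", "status": "released"}
url = build_search_url(filters)
print(url)

# Fetching `url` with any HTTP client (e.g. urllib.request.urlopen) returns a
# JSON document whose "@graph" field lists the matching experiments; file
# objects within an experiment carry an "href" path for direct download.
```

Batch processing of hundreds of samples would then be a loop over the returned records rather than clicks through the portal, which is the kind of pragmatic access the questioner describes.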
Just a follow-up point, because I'm also at UCLA: I've had discussions with people, and there is interest in, for example, leading some type of tutorial on ENCODE at UCLA. I think this could be something more general; there are enough people with ENCODE expertise at a lot of different universities that they could lead such sessions, and what could facilitate this is a streamlined set of materials, covering lots of different topics about ENCODE, that somebody could take and present. I think we have some of that accumulating just from the workshops of the past few days. This is the first time I'm attending this meeting. My name is Raj, and I come from UCLA. First of all, I feel the composition of this meeting was perfect. You can either have workshops, or you can have a meeting with paper presentations where conceptual advances and things like that are discussed. This meeting, I thought, was extraordinarily balanced: we had hands-on sessions and then we had terrific talks. The session yesterday morning was, in my opinion, one of the best sessions of this meeting. And so, I don't know, I shouldn't say this definitively.
A dry tutorial session is significantly useful for practical purposes, but in order to reach out to the community and present ENCODE as a resource for everybody, not just the 100 or 200 people here, it is important to bring in questions; for example, Nancy's talk on the phenome was extraordinary in that way. I would like to see this meeting focus thematically on some of the problems in genome biology. There are millions of problems, and once you focus thematically on two or three problems in a particular meeting, I think you would attract people who are interested in addressing those questions and present ENCODE as a resource that may be used to initiate investigations in that particular area. For me, this meeting was extraordinarily useful because it introduced me to the potential of using ENCODE data. I have run a huge number of gel retardation assays in my life, and I still do them, but I think we need to graduate from that and recognize the potential that ENCODE provides. So I would keep tutorials, talks, and conceptual advances together in a meeting to make it more relevant and more interesting. And I want to thank you all for making it so comfortable and so easy using the computers; the people from ENCODE and from Stanford were an extraordinary help, keeping track of who was doing what. They were looking over your shoulder, and that was the best part of it: these people were there, looking over your shoulder, wanting to make sure that you were getting where you were going. This was a very balanced meeting, and I hope you will repeat it.
One thing I think might be interesting to include is a session where people in the community can talk about what sorts of ENCODE data they use most and what they think might be most valuable moving forward, particularly with the transition from ENCODE 3 to ENCODE 4. It might be nice to have input on that, on whether the data sets being produced are actually going to be useful to the community. There's definitely a balance between what's available for ENCODE PIs to do their own work and what's available for the community to use as a general resource, but having some information and input on that would be very valuable. Let me just clarify: you were asking what data sets are used the most by the community? Okay, I think my chair might have the answer. For example, how many times a data set has been downloaded, something like that. I really enjoyed the conference; I learned a ton. I have very little feedback, but two things that I think might be interesting. The first is that the lightning talks are so short that the most effective ones were by people who said, "Look here, I made this awesome figure, I used ENCODE, and here's how I did it." Those were the ones where I really got the point; when people tried to present all their research, it got kind of lost sometimes. So if the lightning sessions were almost like mini workshops, in the sense of "this is how I used ENCODE for one or two things," that might be effective. The other is that people who are very new to bioinformatics tend to like to use Galaxy. I don't know how hard it would be, but if the ENCODE pipelines could make their way into Galaxy, I think they would be pretty widely used. My comment is really about the second point there: what new resources could be created? Yesterday I talked a little bit about that.
For example, I work in a very specific field where I consider cancer as a hierarchy of cells, and I do much more functional studies, then try to use ENCODE or TCGA to see if they can predict the relevance in cancer; that is at least my field. But the problem with the available sources is that they are either cell lines, which sometimes don't mimic what really happens, or a biopsy or a pool of bulk cells, where those bulk cells can be a mixture of endothelial cells, lymphocytes, and real cancer cells. So I was wondering whether ENCODE, probably not ENCODE 4 or 5, but maybe ENCODE 10, will approach single-cell RNA sequencing. I know single-cell ChIP is impossible to do. But are you going in the direction of taking into consideration the hierarchy of tissues, to study stem cells and more differentiated cells, not only in the embryonic field but also somatic stem cells? I'm a little new to ENCODE. I think it would maybe be helpful to add something like a bigger picture of the ENCODE project, like where the centers are and how the tissues and samples are selected; that kind of thing may be very helpful to new users. The second thing is that, because histone modification marks and other elements are very tissue specific, I don't know whether you have any plan to add non-coding variant annotation by tissue type or cell type. Now, for non-coding variant annotation, we have several tools there, but right now I don't think they're tissue specific; it's maybe aggregate annotation. What I mean is tissue-specific annotation: if you select this tissue, I want to annotate a particular variant in that tissue. So, with the current tools, you give them a SNP and they can tell you whether it is an enhancer or not, and in what cell types.
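The tissue-specific annotation being asked for here reduces to an interval lookup: given per-tissue enhancer regions (e.g. BED files downloaded per cell type), report the tissues whose intervals contain the variant. The following is a minimal sketch, not an ENCODE tool; the interval data are invented for illustration.

```python
# enhancers[tissue] = list of (chrom, start, end) half-open genomic intervals,
# as one would get from parsing a per-tissue enhancer BED file.
# These coordinates are illustrative, not real ENCODE annotations.
enhancers = {
    "liver": [("chr1", 1000, 2000), ("chr2", 500, 900)],
    "brain": [("chr1", 1500, 2500)],
}

def tissues_with_enhancer(chrom, pos, enhancers):
    """Return the tissues whose enhancer intervals contain the variant position."""
    hits = []
    for tissue, intervals in enhancers.items():
        if any(c == chrom and start <= pos < end for c, start, end in intervals):
            hits.append(tissue)
    return sorted(hits)

print(tissues_with_enhancer("chr1", 1600, enhancers))  # ['brain', 'liver']
```

For genome-scale variant sets one would use an interval tree or a tool like bedtools rather than this linear scan, but the annotation logic is the same.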
And then there is a more recent version that can give you a sense of the specific tissue in which a variant might be functional. I just have a small suggestion: if you could provide a list of tools that use ENCODE data, I think that would be perfect. You have a list of publications associated with ENCODE data, right? So if you could have a list of tools. RegulomeDB, HaploReg, I think. Yeah. Like TargetFinder, those. Right. I might be wrong, but my impression is that on the ENCODE project site we might already have a list of tools; I think it should be linked from the download website. I'll double check it. I think it's there, yeah. If you go to the actual portal, under Materials and Methods you will see Software Tools, and you can go there. These are mostly tools that we use often in the pipelines or for QC metrics, but we can definitely add more tools that use ENCODE data there as well. Also, a lot of these tools contribute to what we call the Encyclopedia, so the Encyclopedia page has links to many of them. That can be expanded, or we could link out from there; that page is what Zhiping was trying to show. So Fung's element browser is linked off of there, and Factorbook is linked off of there. I think he's looking at that page. To make my suggestion specific: when I'm looking at a data type and I click Publications, I can see a list of publications; maybe you can create a filter called Tools, for whether a publication is related to a tool or is just a research article. Yeah, so under community publications we do list people who are developing software tools, but they're not necessarily tools that use ENCODE data; rather, they're publications where someone developed a software tool and possibly used ENCODE data to test it.
I think that's a fantastic suggestion. If you have a query where you type in a tool and certain data, and then, boom, there are ten publications using that tool and those data sets, that would be interesting. My question is about involvement: how can community scientists and researchers get involved in this fantastic project, long term? You are already doing that. Yeah, there are a few different ways people can get involved with ENCODE. We just went through a round of soliciting grant applications, and that has closed, but ENCODE has in the past been an open consortium, and it will likely continue to be so. That means people can join the ENCODE consortium without having ENCODE funding; on our website we have the process for doing that, or feel free to contact any of the NHGRI program staff. Other people get involved informally: they know people in the project, and they ask questions and make suggestions. I made an offer right after lunch yesterday: in the next round we are likely to be soliciting ideas for samples that the consortium might do, especially if people can actually procure the samples. We are also possibly interested in taking in community data. So if some of you have large data sets of transcriptomic or epigenomic data, and if we move ahead with that, we'd like to hear what you have and consider bringing it in. Those are some ways people can be involved. As Fong said, we have different outreach events, and we're trying to encourage people to speak up and say what's useful about the outreach events and about the project, so we know better how to do things. All right, so I hate doing surveys. I'm going to guess a lot of you hate doing surveys.
I hate looking at surveys, but they are incredibly useful, and one of the most useful things about them, if you're not used to receiving survey responses, is seeing the comments that pop up again and again. So please, even if you're thinking, "everybody's probably going to say this, why should I say it?": we are aware that we cater to people with lots of different backgrounds who do lots of different things, and it's enormously useful to get a sense of whether a hundred people asked for this versus one person asked for that. Obviously, if 10 people fill out the survey instead of 100, we don't see these patterns. So if you can, please do it. I've got one last question for the group: now that you've been to this meeting, if a colleague or friend asked, "Hey, this event is coming up, should I go?", would you say, "Yeah, go, it was a good experience," or "Don't waste your time"? I ask because this is a resource-allocation question: we put people's effort and money into this, and if it helps people, it's a good choice; if it doesn't, it's a bad one. So if a friend asked, would you say yes, you should go? Maybe? Or no, I didn't get that much out of it? Okay, thanks. Okay, so I'm the last thing here; I just wanted to give you a brief wrap-up. For me, speaking for the DCC, this has been a really great meeting, because our job is to distribute the data and to understand the use cases and things like that, and I think we've learned an awful lot about that. So it's great that it's important for you, but it's also very important for us. There were, I think, 230 attendees, people who actually checked in, not just the ones who paid and didn't check in. I think we had 150 people who finished the RNA-seq and ChIP-seq pipelines yesterday. So that's great; you've done it.
Hopefully all of you have done it. Let's see, we had 80 to 100 people who stuck around for the lightning talks; that was really tough after a 12-hour day, so everyone who gave a lightning talk, really great, you stimulated people to stay. Unfortunately, we had a lot of wine left over last night, and it's too late to share that now, but next time there's free wine, stick around, okay? So, the weather's nice today; I don't need to go into that. If you want to stay in touch with us, on encodeproject.org go under Help and Contacts, and there's a place you can subscribe to the ENCODE announce list so that you'll get regular updates. It's not a high-traffic list, but that's primarily where you'll learn about new updates and things. There's also an email address there for the help desk. Any comments you want to make to any of us go to the help desk; we try to answer within a few hours, if not a day, and the whole group sees them as well. So it's certainly easy to stay in contact with us. I did want to say one little thing from my perspective at the DCC. There is increased concern about transparency, reproducibility, and openness in data and research, and to me, addressing that is what ENCODE is all about. I don't know how many of the lab PIs are still here, but to me it's really impressive that they commit their normal research labs to doing things the way the consortium says to do them, as opposed to the way their postdoc wants to do them, which may be great research, but there have to be standards. The same goes for the computational folks, who work in defined ways, sharing all of the software and the data. So I think addressing transparency, openness, and standards is really what this is all about.
I think NHGRI created this to be a core nucleator of research in this area, and so they should be commended too, of course. They're funding it, including this outreach; they're putting in a significant amount of money for the outreach. You'll notice there are no sponsors here, no company saying, "Buy our stuff to do your assay better." That's because NIH wanted this to be neutral ground, and I think that's really great. Like I say, you have made this an important meeting with your questions and your interactions. It was really the outreach committee that did so much, as Dan mentioned. Everybody has talked about the DCC, but I do want to say one more time that I'm really proud of and really thankful for the DCC group, which did so much: all the wranglers, all the developers, everybody was here, and in particular some who haven't been recognized specifically. Edon and Jason were the mic runners; their regular job is as wranglers. All the wonderful pictures you've been seeing were done by Forrest, who is a UI guy. And of course Cricket and Seth, who did a fantastic job with the tutorials, particularly Seth, for standing up there and being so clear. So, I want to wish you safe travels home. We hope to do this again somewhere; like I said, last year was East Coast, this year West Coast, so we hope to see you again. Thank you.