So, I'd like to go ahead and get started with our first speaker, Mike Snyder. Mike is a modENCODE researcher. He is the Stanford Ascherman Professor and Chair of Genetics at Stanford, and one of his specialties is personalized medicine. Thanks, Mike. Great. Well, thanks very much for having me here and giving me a chance to tell you what we're up to in trying to look at regulatory information across different organisms. And I think the modENCODE consortium, along with ENCODE, has really provided some nice opportunities here. Let's see. So why would you want to compare regulatory information across organisms? Well, first, if you do see conserved elements, you know they're functional. And if they're functional, that can be quite useful to you. It implies something you might want to study or understand if it gets mutated. I think the other reason you want to compare regulatory information across organisms is that you would like to try and understand basic principles of design: what kinds of circuits lead to what kinds of outcomes, and then, of course, ultimately how you modulate those circuits and change outcomes, which is of high interest. And lastly, and there are probably many more reasons than these, the third is that we'd like to get a general understanding of how we're similar to and different from one another, both within a species and between species. I think many of you may be aware of this general question: just how much are we identical to one another? How much of the difference is due to differences in, say, amino acids or the genes we have? And how much is it due to the fact that we have the same genes but how they're regulated makes us different from one another? And along the lines of this last point, there is some controversy in the field about how much transcription factor binding sites are conserved or differ amongst different individuals.
And so we had a paper out along with Tonka and Odom in about 2007 suggesting that, in fact, there's extensive divergence of binding sites amongst different organisms. And the pilot phase of the ENCODE consortium had similar conclusions, but there have been other papers out there suggesting that, in fact, binding site information is incredibly conserved between different organisms or within the same species. And so there is this unresolved issue that's there as well. So what I'm going to talk about today is general characteristics. So I should back up and say that modENCODE, along with ENCODE, does provide a really nice opportunity, because there's a huge amount of data to really get at this question, although I have to admit there are still many more questions to be answered, as you'll see at the end. Okay, anyway, here's what I'm going to talk about in today's talk. So first I'm going to tell you a little bit about general transcription factor binding data. And this talk will really just focus on transcription factor binding information. Jason will talk more about the chromatin information, which is obviously another aspect of regulatory information. And then I'm going to give you a few vignettes, if you will, of some of the comparisons we've done. This is a work in progress. And some of those questions are: just how much are the binding sites of different transcription factors similar or different to one another? How much do the partners of transcription factors differ between different organisms? And even a fundamental question: do transcription factors like to bind the same types of genes in different organisms? And most of what I'm going to talk about is worm, fly, and human comparisons. But for reference, I'm going to sprinkle in some mouse ENCODE data as well. Okay, let's start with the data sets. As I say, there are really a lot of data sets out there now with transcription factor binding.
And since humans have been studied the most, by the most groups and for the most time, they have the most: there are 700 different data sets. For worm, there are 236; for flies, there are 102, but there are a lot of chromatin data sets there. Again, we'll talk about that. And for mouse, there are quite a few as well. These represent a number of different transcription factors, generally 50 to 168. This just shows that in general, there'll be a lot of transcription factor binding data on one or two cell lines, although for a number of factors, we can get them across different developmental stages or different conditions. So for example, here's human, and you can see that there's a lot of effort put into a few lines. But you will find a decent number of data sets, tens of data sets, across multiple lines, and the same is true for worms. Okay, and a little bit less for flies and mouse. Okay, so all of the data was processed using a common pipeline for mapping and for calling peaks, and then run through various quality control measures to make sure the data sets are of high quality. And at the end, we do come up with these data sets. There are some additional data sets that are put aside and don't make this analysis, which I'll tell you about. If we look within these data sets, how many so-called orthologous data sets are there? That is to say, where you have orthologous transcription factors, how many are in common? And there are zero between all four organisms, so there's not a huge amount. There are the most between human and mouse, not surprisingly, because they're the most closely related. There's a decent number between, say, human and worm, and human and fly, and so on and so forth. And again, that's what you would expect, because it's reflecting how close the organisms are to one another. And if you look at the transcription factors involved, in fact, they are about as similar as you'd expect.
So red means you're 80% or more identical in amino acid sequence. And so you can see most of the human-mouse orthologs are quite similar. And if you start picking the worm-human or the fly-human pairs, these are really quite divergent, with only one of them really being super close in terms of its amino acid sequence. So within this, the first thing we did was analyze all of these. And sure enough, there are these hot regions, which Mark told you about. I won't say a whole lot about them other than there are thousands of these things in the various organisms. And you can identify them by statistical overrepresentation of regions. And they do change across development. So that is one of the conclusions that does come up. And so, for example, there are only 212 of the hot regions in worms that are shared between the various stages that were looked at: embryo and the first, second, third, and fourth larval stages. So there's only a small number that are conserved; most are stage-specific. And they do change quickly. So for example, the embryonic ones are very specific to embryos. They will shift down. Many of them are lost as soon as you move to L1. And likewise, when you move from L3 to L4, you get a lot of stage-specific hot regions, okay? First question is, how conserved are the binding sites? Now, worms, flies, and humans are quite far apart, so you can't really get syntenic relationships there. But you can ask some general questions about this. Are they binding the same motifs? Are they binding in the same locations? So step one is to look at whether they're binding the same motifs. And this is the work of Pouya Kheradpour from Manolis's lab, who's done this very systematically across all of the different motifs. And just looking at his data set, over 80% of them seem to be identical or very similar between the different organisms, that is, between flies, worms, and humans. This is just one example here, Blimp-1. You can see it looks pretty similar between flies and humans.
But there are some interesting differences there. And here's one case, from a slide by Alan Boyle in our lab. So this is C. elegans ZAG-1 and its human homolog. And they actually have quite different binding sites, okay? And it's not simply a matter of our only happening to look at one stage. If you look at different stages, you'll see that you get the same binding site for worm across different stages. And so it's not really changing. But the bottom line then is that this factor does have different targets in the two different organisms, okay? All right, so binding sites are usually conserved, but not always. What about binding locations? Well, we thought we'd start with the simplest case. And if you compare orthologous factors between human and mouse, you can see whether they're binding near promoters or enhancers or what have you. So if you look at transcription factors, they fall into three classes, I would say. There are those that love to be around promoters. There are those that love to be around enhancers, which tend to be fairly distant in humans and mouse. And then there are those that like to be at both. Now the way we have the relationship set up here, this is the TSS and that's 20 kb away. You can plot in a cumulative fashion whether something likes to be closer or further away. If it's red in this region, it means it likes to be around the promoter. And if it's green, it means it likes to be distal, typically at an enhancer. So the bottom line is if you start comparing human and mouse, you'll discover that, in general, they do cluster, so these are the more proximal ones, and in general there's some trend towards this, but there was surprisingly a lot of variation. When one digs into this, you realize it's actually because we're comparing different cell types, and this is interesting in its own right.
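As an editorial aside, the cumulative distance-to-TSS view described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the talk: the function names, the 2 kb proximal window, and the 0.6 / 0.3 thresholds are all assumptions, and the distances are toy numbers.

```python
from bisect import bisect_right


def cumulative_fraction(distances_bp, cutoffs_bp):
    """Fraction of peaks whose absolute distance to the nearest TSS is <= each cutoff."""
    if not distances_bp:
        raise ValueError("need at least one peak distance")
    ordered = sorted(abs(d) for d in distances_bp)
    return [bisect_right(ordered, c) / len(ordered) for c in cutoffs_bp]


def classify_factor(distances_bp, proximal_bp=2000, high=0.6, low=0.3):
    """Label a factor by the share of its peaks within proximal_bp of a TSS.

    The 2 kb window and the 0.6 / 0.3 thresholds are illustrative assumptions.
    """
    share = cumulative_fraction(distances_bp, [proximal_bp])[0]
    if share >= high:
        return "promoter"
    if share <= low:
        return "distal"
    return "both"


# Toy distances (bp) from peak to nearest TSS; not real data.
toy = [100, -500, 1500, 15000]
curve = cumulative_fraction(toy, [1000, 2000, 20000])
label = classify_factor(toy)
```

Comparing the `curve` of a factor with that of its ortholog, ideally in matched cell types, is one simple way to ask whether the two prefer the same kind of location.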
So if you look within orthologous cell types, so for example, K562 and MEL cells, which are semi-orthologous, if you will, they're cell culture cells, you can see there's actually quite a bit of conservation in terms of location of these orthologous factors. So these factors, again, are near the promoter, and they're near the promoter and mouse, both mouse and human. Okay, so what that implies then is that these factors do vary a bit amongst different cell types. And in fact, we do know that, this is some work from Carlos Araya in their lab, and he's actually looked at the binding relationship of the different factors across different development and different, well, just different developmental stages in this case. And the bottom line is sometimes they're together, and sometimes very often, in fact, they're not together. We had seen that from other studies from Wei Zong as well. I should point out all this has done in collaboration with the team that I'll mention at the end. So the bottom line is for certain kinds of factors, you can see that they'll cluster based on, like these blue factors. These are different stages that like to be clustered together, but there'll be times when they're further apart, meaning they'll actually be binding at different locations and with different partners. Something I'll get to, in fact, in a minute. So just for completeness, we did in fact compare humans and worms. And again, the promoter loving factors for humans, sometimes they're promoter loving for worms, sometimes not so. And we varied this whether it's 20kb or 2kb, you get pretty much the same result. Okay, so the conclusion then is that these orthologous factors, they're often binding the same motifs, but they're not necessarily in the same general locations across different organisms. What about partners? Well, initially, so we want to see whether the different factors work together in different organisms, excuse me. And so first we'd set up a general clustering scheme. 
And in fact, we didn't see any relationship. That is, the co-associations in one organism did not match up with those in another. So then we said, well, let's define the context of this. So we decided to just look around promoter regions. So you look at a 2 kb region around the promoter, and you ask: are factors working together in these regions, and are those relationships conserved across different organisms? So again, we first set up co-association relationships in a 2 kb region around promoters. And it's somewhat of a standard clustering, if you will, not quite standard. And we looked within each species to define relationships. And then from that, we looked for those that are conserved across species. What I'm gonna show you is this slide here, which is the most complicated slide, I think, of the presentation. So just focus on the right for the moment. So we're gonna compare co-association of these different factors, if you will, with these different factors within an organism. And the only comparisons I'm making are the cases where we have orthologs between, in this case, worms and humans. Okay, so we're gonna compare relationships within an organism and then we're gonna ask which ones are the same between organisms. Is that clear? Okay, so let's look here. If you take this factor, EGL-38, and you ask who it likes to be associated with around promoters, you'll see it has, it turns out, three partners. PHA-4, there's a PHA-4 data set here and here, and it likes to be at both of those. Also these ones, EFL-1 and MDL-1, okay? So those are its partners, if you will, its co-association partners. You can then ask which of these are conserved in humans, and here's the ortholog of EGL-38. Unfortunately, it doesn't have the same name, so we should rename all these things, mind you. And PAX5 is the same as EGL-38. And what you notice is that when you look at its orthologs here, you don't see co-association, all right? So we do it again and get the same result.
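As an editorial aside, the "define partners within a species, then ask which are conserved" comparison can be sketched as below. This is a hedged illustration only: the talk describes a clustering approach, whereas this sketch swaps in a plain Jaccard overlap on promoter sets, and the factor names, the 0.5 cutoff, and the ortholog table are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets of bound promoters (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def partners(bound, min_score=0.5):
    """bound maps factor -> set of promoter ids with a peak in the promoter window.

    Two factors count as partners when their promoter sets overlap strongly.
    The 0.5 cutoff is an illustrative assumption.
    """
    result = {factor: set() for factor in bound}
    for f, g in combinations(bound, 2):
        if jaccard(bound[f], bound[g]) >= min_score:
            result[f].add(g)
            result[g].add(f)
    return result


def conserved_partners(partners_a, partners_b, orthologs):
    """For each species-A factor with an ortholog, keep partnerships seen in both species."""
    shared = {}
    for factor, its_partners in partners_a.items():
        if factor not in orthologs:
            continue
        mapped = {orthologs[p] for p in its_partners if p in orthologs}
        shared[factor] = mapped & partners_b.get(orthologs[factor], set())
    return shared


# Toy data with hypothetical factor names; not real binding data.
worm = {"w1": {1, 2, 3, 4}, "w2": {2, 3, 4}, "w3": {9}}
human = {"h1": {1, 2}, "h2": {7, 8}, "h3": {1, 2}}
orthologs = {"w1": "h1", "w2": "h2", "w3": "h3"}

worm_partners = partners(worm)
human_partners = partners(human)
conserved = conserved_partners(worm_partners, human_partners, orthologs)
```

In this toy example `w1` and `w2` are partners in worm, but their orthologs `h1` and `h2` are not partners in human, so nothing is conserved, which mirrors the kind of negative result described in the talk.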
ZAG-1, which I mentioned before, co-associates with BLMP-1. Its human homolog does not associate with that fellow there. And we do it again and get the same result. So we still didn't get any co-associations. So that didn't sound so good. So then Dan Xie and Alan Boyle said, well, wait a minute. Maybe we're asking too hard a question. We're asking these things to co-associate across all of the genome initially, then across all of the promoter regions. What if we start just asking in specific segments of the genome? And this makes a lot of sense, because it turns out different transcription factors work together with one another at different gene targets or different specific locations. And so when you do this across a genome or across a broad range of segments, you'll have a hard time finding these relationships. So they came up with the idea of using self-organizing maps, and this kind of approach was first started by Ali Mortazavi when he was in Barbara Wold's lab, and he's at UC Irvine. And so the idea is the following. You take 1 kb stretches, and within these 1 kb stretches, you look for enrichments of partners of factors, and they need to be multiple partners, across essentially the whole genome. You're looking for enrichments. And you then basically generate these neurons, like that big one shown there, which are statistically significant enrichments of these transcription factor co-associations. That's the general principle. Clear? OK. So first you do this with humans. And you say, let's look at all the 1 kb segments of humans and see which factors like to be co-associated. And here's the answer. Crystal clear, I know. But you look into these neurons, and you see things that make sense. So for example, in this neuron, it's all Pol III, nearly all Pol III; there are a few other things that belong there. They're all together, binding near one another, as you might expect for Pol III components. Same is true for some of these other things.
Here's an enhancer type area, which has all of these factors that like to be together. So you can start finding relationships that come up. This is the regulatory code within an organism. Again, makes a lot of sense. Now can we find similar relationships between organisms? And first we were thinking, well, this is going to be complicated. You've got to find all of these and compare it to all of another organisms. But then actually Alan and Dan came up with a clever idea. Let's just mix them all together. We'll take all of the worm 1KB segments, all of the human 1KB segments, all of the factor binding information, put it all in the same pot, and let itself organize. And if they're the same relationships, they should show up on the same neurons. That's the general scheme here. You take all the binding information here, all the binding information here, again, mouse, human, funny looking human, throw them together, let themselves organize, and see who belongs together. And then see, now if they're human specific ones, you only get human specific relationships in one of the neurons. If they're mouse ones, they'll be mouse neurons. And if they're mixed neurons, you'll get a mixture of both. And the ratio of mouse to human tells you how much is mouse and how much is human. We mix them together, and this is what we get. We get a lot of species-specific information, which there can be technical reasons for some of this. But the idea, this is a neuron with a lot of co-associations that are primarily mouse. The red ones are primarily human, but you do get plenty of these yellow and orange ones, which mean they're conserved across. These relationships are conserved across organisms. These are the ones that are conserved. You can see things here. And I don't know what a lot of these factors are, but I know this one makes a lot of sense. CTCF, one of its good partners, is RAD21 and SMC3. 
In this relationship, you can see it's shared in both humans and mouse because it's orange, yellowish orange. So in fact, that makes sense. And you can find others as well. And then there are these species-specific ones, so the green ones and the red ones presumably are species-specific, although there can be technical reasons for this. And so we have to look at that harder. So then we say, all right, we can find relationships between mouse and human. That's good. We should be able to do that. What about humans and worms? And lo and behold, it actually does work. So with humans and worms, again, you'll find these species-specific ones: green is worm in this case, and the species-specific relationships for humans are the red ones. But you find the yellow and orange ones again, and these would be shared relationships between humans and worms. So these kinds of relationships are shared across organisms. So this is one way you can tease this out. And what makes it special is it's not across the genome. It's at very specific locations. And now we need to do the GO analysis to see exactly where those locations are. We have done general GO analysis to see where the different orthologous factors bind at various gene locations. And what we've discovered is the same general message I'm giving you, which is sometimes information is conserved, and sometimes it's not. Let's see. So this is a case where these are the different worm transcription factors. These are the different GO categories. And the bottom line is you can see that some factors and clusters of factors, if you will, have certain functional relationships. You can do the same for fly and get the same result; only the color has changed. And then you can compare the two, between worms and flies. So here's a factor in worm. You compare it with what factors it has in common based on the GO categories of its targets. So are they binding the same kinds of genes?
And the answer is what I told you, sometimes yes and sometimes no. So this factor here, onc 52, has a very similar go relationship with these four factors here, two of which are in fact orthologs and two of which are not. So sometimes it shares with its ortholog and sometimes with other things. And the same is true for this relationship. And this is a complicated slide, but the general point is the one I just said. So in conclusion, we've compiled a massive set of ChIP-seq data sets, which are out there for everyone to use, and we're glad lots of people are using them. We can use this to look at regulatory variation within a species, and Mark talked about some of that yesterday, and there's actually quite a bit of analysis of this, and I think it's quite interesting, because regulatory information does change across developmental stages, which is something you can tease out from having worm and fly data. And then lastly, I can tell you across species that binding locations, co-associations in particular, and even go analysis can be different. The types of genes they're binding can be very different across different organisms between the various homologous orthologs. So regulatory information can be quite divergent. So for future directions, we need to understand what these rules are. We are seeing some conservation and we're seeing some that's not, and we want to understand what those basic principles are. And there's a lot of ways to attack this particular problem, even with the data sets we have. You can look at different kinds of relationships. One area that we're particularly interested is this idea of looking at things in terms of networks. How much can these principles be deciphered based on regulatory networks? And Mark mentioned this yesterday, how you can organize regulatory information into hierarchical networks. And then you can start looking at various principles of what regions are most interacting, what regions have certain properties. 
And one very interesting property related to conservation, either within an organism or between organisms, is that when you have these various layers, so this is the top, the middle manager, if you will, and the lower-level peons, if you will, in a regulatory network, basically it turns out that there's much more conservation at the top. That is, I should say, these are under negative selection at the top of the network, which is something that came out of the human ENCODE project. And of course, this is what we'd now like to look at in terms of worms and flies and see if that same general principle is true. And then you can map the conservation information I just mentioned about partners and other things in the context of where you are in these networks. So there are really lots of interesting ways to attack this problem. Okay, so I gave you the conclusions already. Here, in fact, are the people who did all the work; none of it was done by me. These are the people in my lab who did the work. Let's see. Well, I'll zip through a bunch of these. Yong Cheng, Dan Xie, Alan Boyle, Manoj Hariharan, Carlos Araya, and Philip Cayting. These are the analysis folks. These are the people who did a lot of the ChIP-seq, and Valerie, where are you? There you are. She's over here. We've had merged people. Wei is double-dipping here. Okay, we work a lot with Mark Gerstein's lab. There are folks who did the analysis part over here, at least some of them. Anshul Kundaje, of course, is the hero in all of this. He processes lots of data sets and is up here. I mentioned Pouya on the motif stuff and, of course, they work with Manolis Kellis right now. And then, for fly work, we have various people including Lijia Ma. We work a lot with Bob's lab, with Mihail Sarov and Tony Hyman. Lincoln Stein's group, of course, is essential, and we work a lot with Stuart Kim's lab. So I think I have time for some questions. Thanks.
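As an editorial aside, the mixed-species self-organizing map described in the talk, where segments from two organisms are pooled and each neuron is then colored by its species mix, can be sketched for the coloring step alone. The sketch assumes the map has already been trained elsewhere and that each 1 kb segment has been assigned to a neuron; the function names, the 0.8 cutoff, and the toy assignments are assumptions, not the method used for the talk.

```python
from collections import Counter, defaultdict


def species_fraction(assignments, species="human"):
    """assignments is an iterable of (neuron_id, species) pairs, one per segment.

    Returns neuron_id -> fraction of that neuron's segments from `species`.
    """
    counts = defaultdict(Counter)
    for neuron, segment_species in assignments:
        counts[neuron][segment_species] += 1
    return {n: c[species] / sum(c.values()) for n, c in counts.items()}


def neuron_label(fraction, specific=0.8):
    """Call a neuron species-specific or shared; the 0.8 cutoff is an assumption."""
    if fraction >= specific:
        return "human-specific"
    if fraction <= 1 - specific:
        return "other-specific"
    return "shared"


# Toy assignments of segments to neurons on a small grid; not real data.
toy = (
    [((0, 0), "human")] * 9 + [((0, 0), "mouse")]
    + [((0, 1), "human")] * 5 + [((0, 1), "mouse")] * 5
    + [((1, 1), "mouse")] * 4
)
fractions = species_fraction(toy)
labels = {neuron: neuron_label(f) for neuron, f in fractions.items()}
```

If the two organisms contribute very different numbers of segments, the raw fraction would need to be normalized by each species' total before labeling; that step is left out here for brevity.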
So, Mike, when you look at the hot spots that are constitutive, that are not developmentally different, do you find a DNA configuration that's inimical to nucleosome formation? Is there a binding characteristic? Sorry, do I find what? A DNA sequence that doesn't bend very well would be disinclined to form nucleosomes. Well, we haven't looked at the structural aspects. That's a great question. We should look at that. I don't know if Mark or anyone else has looked at that. We certainly have not. Yeah, to see whether it's got bent DNA or poly-A runs and that sort of thing. Yeah. It would be a great thing to do. Great suggestion. Yep. Thanks. Very interesting analyses. This isn't my real area of expertise, so this may be a naive question, but in the self-organizing map that you showed initially for humans, that huge cluster of neurons that was very dense with multiple factors and things, I mean, what was all that? I mean, the two you called out looked to have quite a few fewer members. I was just wondering, what was that huge, dense cluster? Yeah, so you can find, you mean that one region up above that was, well, we're on Jason's talk. I'm thinking of the very first picture you showed; it was black and white, and you called out two. Yeah, then there were these regions with lots of edges, basically. That whole cluster of many, many neurons. Yeah, this can look at all possible relationships, so it can look at pairwise relationships. I can't remember if we put a filter of three or more for this one. We've done this several ways. So it can look at all possible relationships of three or more, if you will, as well as four or more, as well as five or more, all the way up to 168 for the case of humans. Does that make sense?
So some of them will have lots of partners that could be working together, and the goal is to look for them being enriched in a 1 kb region over the genome as a whole, and I'm sure we didn't do any of the fancy genome structure correction stuff that Ben Brown does, but hopefully we're still very healthy here. I don't know, did that answer your question, though? So you're looking at all possible combinatorial relationships. That was great, Mike. I haven't been on the joint AWG calls in a little while, so it's great to see. That's right, all this was done in the last week anyway, so you wouldn't have seen it. Oh, okay, good. I'm a little concerned about one part, which is when you were showing the clusters of co-binding: in one of the groups that you called the enhancer group, there was RAD21, which is a cohesin subunit, and that's something that is associated with insulator elements, but it should be very broadly distributed across all sites that have. So I'm wondering about whether the list should be culled for things like that, or more importantly, why is that showing up? Why is something that should be more in Jason's area? But just more broadly distributed across all regions that are probably gonna be active in the genome. What's nice about this is that this is unbiased; this pulls these things out, and it is what it is, and it may not be what you wanna see. But having said that, we've looked a lot at RAD21 and CTCF, actually, and there are clearly the guys that are near insulators, and others can comment on this as well. We've looked at this a lot in humans, and clearly about a third of them are sitting right on enhancers and another third are around promoters and such. So they're not just at insulators; they're really at different locations. CTCF and RAD21, they're together nearly all the time. So in fact, there are different classes of CTCF/RAD21 sites, and I think that's what that data set's telling you. Can I comment on that as well?
So when we're looking at the chromatin state annotations and how transcription factors associate with them, we actually find RAD21 clustering with CTCF in a specific insulator state, and very rarely do we see it enriched in binding at promoters and enhancers. It's found there, but it's actually not enriched there. So most of the enrichment is actually found in insulator regions, as you would expect. Yeah, that might differ a little from some of our results. We might wanna talk about that later. But I think the point is that you can't use that data to make the conclusion that RAD21 is specifically associated with those sites. Oh, I hope I didn't say it was specific. All I'm saying is this is a cluster of things that are enriched together over the genome as a whole. So yeah, there are plenty of other RAD21 buddies out there too, okay. So my question, I don't know much about self-organizing maps, but my question is, so a 2 kb region is very different in a worm than in a human, right, because of the size of the intergenic sequence. So have you varied that across the different types of organisms and had them re-utilized? Yeah, so we've done a lot of that for the promoter-proximal part I showed you, and we've done a limited amount of that for the self-organizing map. So we can do it more extensively, and we will do it more extensively for the worm-human self-organizing map. We've done 500 bases and 1 kb for a number of the relationships. We haven't gone from 100, which we will do, up to much higher numbers, to see what relationships do hold and such. And in our experience with some of this, when we varied the windows, it didn't make a whole lot of difference, but it's something good to do. Yeah, Ross. Hi, my question is, so you made certain predictions, for example, from human to fly, which I care about.
And there have been many examples where we've put, where the communities put human transcription factors into flies and shown that they drive, at some level, transcription in reasonable tissues and so on. Have you compared that data to your predictions? So in some cases you have predictions that it shouldn't bind the same transcription factors, or that they use different cofactors and so on. Have you actually compared them to the data that sits there? And I guess, of course, in the future, you could actually make direct tests of those by throwing those back in and seeing if they can drive transcription in a reasonable place. Right, now great comment and question, or comment, I guess, question. So we can, I think we would do it, first, it suggests some obvious follow ups. We should see those that have different motifs, the prediction is they probably wouldn't rescue, or maybe we're not looking at the right cell types or conditions, that's a possibility. Unfortunately, those may not be published. What's that? The negatives may not have been published, unfortunately. Yeah, well, we should do that. You put it in a case where you get lots of positive results and you can publish it. And you can do, what was the other comment? Oh, so you can do obvious follow ups like that. Yeah, you would say from the co-association none of this stuff should work. I think it comes back to the point that was raised yesterday. We don't know which of these sites are functional or not. We just know these are binding relationships. So it's conceivable that some of the more functional ones are primarily driven by one or a limited number of conserved things. Jason, I'm flipping through your talk here. Anyway, so it's conceivable, two stories, all right. Anyway, it's, I'm not touching anything. I'm moving away from this. So it's conceivable, it would be nice to figure out which ones are really truly functional as far as driving genes, because that might help explain that relationship. 
And the other thing I think I could say is that small amounts of activity can rescue phenotypes, and you point out they're almost always partial rescue. I'm very familiar with the yeast experiment. And yes, they rescue, but they're always, they're various grades of rescue. And I think that might be what you're seeing, because the partner relationships aren't perfectly established. So that would be my prediction. Yes, I agree. But you make hard predictions, or fairly hard predictions, about ones that will work better or worse. Yeah, no, I think this, like any good experiment, it suggests more things to do. Okay.