So I'm going to talk about some work that we have done in the context of the main ENCODE project but also this parallel mouse ENCODE project. I think many of you are interested in mouse and have worked with the mouse data, and there's really a tremendous resource out there that has been created and is being added to. I'm going to start out with just a general background, because I've been hit up with a few questions about the relationship of regulatory DNA, DNase hypersensitive sites, et cetera. I'll give a brief background on the situation currently with the human genome and then go into a comparison between the human and mouse genomes. So the focus of the talk is going to be these regions of the genome which I'm generically going to refer to as regulatory DNA. They're generally short regions, a couple of hundred base pairs, to which transcription factors are complexed, and they encode all kinds of interesting activities which we sort of encompass under the broader rubric of functional elements. Part of the difficulty here, which some of you may have sensed in this meeting and which was actually referred to directly by a couple of the prior speakers, is the fact that we tend to lump these regions into a few categories. I mean, here we have our little kindergarten vocabulary that we inherited from the 1980s to describe them: promoters, enhancers, silencers, insulators, and of course that "et cetera" is not very big. It's only maybe one or two other things. And this is an enormous challenge right now, understanding exactly what all these elements do. So in a broader effort to map human regulatory DNA you've seen many different modalities for mapping and defining where the regions are and then annotating various properties of them. A very core property of these regions, no matter what they actually do, is related to the binding of transcription factors, and of course these factors cause a change in the chromatin structure.
They can recruit other modalities that can put modifications on the histones, et cetera. But really the cardinal property of these regions, no matter what they do, is that they have this altered chromatin structure that's accessible to nucleases, and the classic one here is DNase I. So if you put DNase I in the nucleus of a cell, it essentially very selectively cleaves the genome in these regions, in other words hundreds-fold more frequently than even the flanking regions. What this does is release these little tiny fragments of DNA, and you can capture these just by a size fractionation, sequence them, and map them to the genome. And if you go to any region here and start plotting what these sequences look like after you sequence 30 or 40 million reads genome-wide, this is what you get. And these peaks, this is what you see in the browser. These little peaks are the DNase hypersensitive sites in the case of these assays. Generically, all of these peaks here I'm going to refer to as the regulatory DNA compartment. Without going too much into this until later: inside these peaks, so far we're just looking at the density of cleavage. If you sequence more and more and more, you can actually start to see the structure of where these proteins are, and this is illustrated here just by depth of sequencing. This is where the proteins recognize the DNA, and you can now see the appearance of the transcription factor footprints where they're sitting on the genome, and these will come around later. And more broadly then, applying this across various cell types, you can map the promoters and the enhancers, and a good number to keep in your head is that there's roughly a range of about 100,000 to 250,000 DNase hypersensitive sites that you'll detect in any given cell type. So it's on average maybe about 150,000 elements, and that translates to about 1% of the genome that's in that state.
There's a huge amount of these data; you've seen various pieces of them in prior presentations. Collectively now, if you combine the ENCODE data and the data from the Roadmap, and actually there were some questions around this that I overheard in one of the breaks: eventually all of the data from the Roadmap is being consolidated with ENCODE as well to facilitate access, because many of you just want to go and say, I want this tissue, this cell type, for a data type, and so it'll all eventually be together. But in the case of the DNase I data there's over 400 cell and tissue types and also developmental states, and virtually all of it is from primary cells and tissues. So just to put some numbers on where things stand, currently it looks like the human genome encodes at least 4 million DNase hypersensitive sites. Virtually all of these show some degree of tissue or lineage selectivity, some of them fairly extreme. The vast majority of these elements are in the distal non-promoter compartment, but that still leaves a lot in the promoter compartment. So Tom was talking about the sort of 50,000 annotated genes; well, those genes have a lot of alternative promoters out there as well that are tissue specific. One of the things that's important to keep in mind when you're looking at the DNase I data is that you are looking at a generic snapshot of where regulatory features are encoded in a particular cell type. But you are also looking at a compartment that has a capacity for memory, in the sense that any given site that you see either is active in that cell type at that particular time, or is potentially active, primed for some type of activity, or is a remnant of prior activity that exists as a memory. And this is quite surprising, but it was actually a feature that was uncovered very early in the history of chromatin regulatory DNA analysis by Harold Weintraub and Mark Groudine.
And the consequence of this sort of memory compartment is that you can take DNase profiles and you can see stuff kind of turning on, off, on, on, off, etc. And if you take these things and you just kind of cluster them, you actually can organize cell and tissue types in a way that recapitulates the structures that we know exist and the fate decisions that we know were made very early in development. And so in this case, for partitioning of the blood, the endothelia, etc., all of the information and structure in these lineages here was actually created just by clustering fully differentiated cells. And the reason for this is that when you look forward in differentiation, so let's say I'm going from a stem cell to a hematopoietic progenitor and then on to some other hematopoietic lineages, there are two things going on. First of all, the size of the DNase I landscape, the number of hypersensitive sites, contracts as you're moving forward, and so that's the size of these circles. And the second thing is that there is this carryover or persistence of information. And so if we map this out a slightly different way in a few lineages, imagine you're starting out here with a compartment in embryonic stem cells. This is sort of the size of the hypersensitive site landscape. By the time you're at a hematopoietic progenitor, you have shrunk this down. So there's about 250,000 sites here. There's about 140,000 or so sites here. And at this point, about 40% of them are persistent or shared with ES cells. And then there's a bunch of things that are new. And then this happens all over again as you differentiate to each lineage. So here's about 100,000 sites in each of these differentiated cells. And about a third of them are persistent or shared with ES cells. Another bunch is persistent or shared with the progenitors. And then there's this compartment of about 50%, but it's a different 50% in each of these cell types, that is unique to that terminal branch.
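The bookkeeping behind these persistence fractions can be sketched with simple set overlaps. This is only an illustration under stated assumptions, not the pipeline used in the talk: it assumes each cell type's hypersensitive sites have already been reduced to comparable site identifiers, and the function name `partition_sites` and the toy identifiers are hypothetical.

```python
def partition_sites(es_sites, progenitor_sites, differentiated_sites):
    """Split a differentiated cell's sites by the earliest stage that shares them.

    All arguments are sets of site identifiers (hypothetical IDs; real data
    would need genomic interval overlap rather than exact matching).
    """
    if not differentiated_sites:
        raise ValueError("differentiated_sites must not be empty")
    shared_es = differentiated_sites & es_sites
    shared_progenitor_only = (differentiated_sites & progenitor_sites) - es_sites
    unique = differentiated_sites - es_sites - progenitor_sites
    total = len(differentiated_sites)
    return {
        "shared_with_es": len(shared_es) / total,
        "shared_with_progenitor_only": len(shared_progenitor_only) / total,
        "unique_to_lineage": len(unique) / total,
    }


# Toy example with made-up identifiers, not real data.
es = {"s1", "s2", "s3", "s4"}
progenitor = {"s2", "s3", "s5", "s6"}
differentiated = {"s2", "s5", "s7", "s8"}
fractions = partition_sites(es, progenitor, differentiated)
print(fractions)
```

The three fractions always sum to one, which matches the way the slide partitions each differentiated landscape into an ES-shared part, a progenitor-shared part, and a lineage-unique part.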
And we've done this sort of analysis now in several other systems, including dynamically differentiating systems like ES cells. And so the numbers to keep in mind are that about a third of the landscape that you see in fully differentiated cells is persistent from ES cells and is not associated with active gene expression. But if you count in the progenitors, we're estimating that another chunk, so roughly half of the compartment that's there, is persistent. It's some type of memory function. It's got a select group of transcription factors, in general a fairly big group, that are occupying it. So it means that the DNase I landscape, when you look at it, is not all just enhancers. It's a complex mix of cell states and a much richer compartment than just what's actively propagating expression at that time in that tissue. So there are also ways that we can now connect about a million of these four million elements with likely target genes by watching their co-activation with a target gene's promoter over hundreds of cell types. And finally, it's worth emphasizing the point that, in the individual cell types that we have looked at, and the more you can parse things the more this is the case, every individual cell type has hundreds to thousands of these elements that appear to be completely unique for that cell type. And as we complete the map, that number may shrink a little bit, but it's certainly not going to get below hundreds, or potentially even low thousands, for individual cell types. Okay, so going within those hypersensitive sites, our calculations now are that the genome encodes at least 20 million regulatory factor recognition sites, and that in each cell type there are roughly on the order of two to five million transcription factor footprints that we can detect by genomic footprinting if we sequence to completion.
And the average cell type uses a recognition lexicon of somewhere between 200 and 300 words. You get this by mining all of those footprints and then collapsing them, and you find that there's this lexicon of somewhere between 200 and 300 motifs. And then finally, in terms of the global lexicon of recognition sites for TFs, we, meaning the community, I think are now fairly rapidly closing in on a complete recognition lexicon by combining a variety of technologies to get there. I was just at a meeting last week where there was an update on this, and there's quite a lot of activity. And that's going to, I think, greatly improve things. It can be immediately imported into tools from ENCODE to improve annotations of the genome. But coming back to the general landscape that's out there: four million sites, roughly 150,000 combinatorially in any given cell type. But the big problem is that we have very little idea what most of these sites do. It's definitely not just enhancers, and I think we're doing ourselves a disservice by using that word and thinking about everything as enhancers. Because really, I think it's the case that most regulatory regions are likely to encode novel and complex activities that are going to take some time to sort out. So, for example, we stumbled into a set of elements, there's, I don't know, 30,000 or 40,000 of these in the genome. And what they do is they sort of park themselves near or partly on exons and loop them to the promoter region, where they seem to be modulating rates of alternative splicing. So that's just an example of a complex activity, but there are likely many more to be sorted out as we go forward. And finally, with respect to the transcription factor landscape, the fundamental challenge, I think, is becoming more clear: every regulatory region is built differently, and every transcription factor has to do its job in its local context, with local partners, in its local chromatin environment.
And the chromatin environment is incredibly important. And I think that, as we understand that better, there's a widening discrepancy between what's actually going on in the genome and what you can recreate in artificial assays. So now I want to turn to the question of where this all came from: how did the regulatory genome arise? And I'll divide the rest of the talk into two parts. In the first part, I'll talk about mouse and human regulatory regions defined by DNase hypersensitive sites and a bit about the evolutionary dynamics shaping that landscape. And in the second part, I'll talk about transcription factors and networks, particularly looking at the relationship between conservation of trans versus cis regulatory activity. So as I mentioned, in the mouse ENCODE project there's been a systematic effort, ongoing now, to define mouse regulatory DNA using various modalities, including the hypersensitive sites. At the time of some publications last year, actually at the time I made this slide, there were 44 cells and tissues. I think that's up into the 50s and 60s. And it's going to climb higher and get even more comprehensive and more finely detailed. But what I'll talk about today are about 1.3 million distinct sites that you can map across 44 cells and tissues. Again, we're looking at an average of about 150,000 DHSs per cell type. And I'll compare this with a set of around 3 million hypersensitive sites from 230 human cell and tissue types that we've integrated. So if you do this, you take the human sequences and the mouse sequences, and you align everything. So you've aligned the mouse sequence to the human, and basically you get a picture where, for a given orthologous region, you can identify both species-specific hypersensitive sites and ones that are shared between species. And there are various ways to do these alignments to make absolutely sure, by sort of doing mutual cross alignments, etc.
That you're dealing with the same pieces of DNA. But very globally, the picture you get is this: of this pool of about 1.3 million DHSs, around 40% do not align to the human genome. So this is something that is specific to mouse. And these are not throwaway elements. They cluster all around super important mouse genes. They turn on and off with them. They have all the properties of mouse regulatory elements. And in fact, many of them are elements that people have studied in mouse. They just don't align to the human genome. And you have two other compartments here. One of them, here in yellow, is sites that align to the human genome and are also hypersensitive sites in some human tissue out there. And then there's this group of about 24% that align to the human genome but are not hypersensitive in any tissue that we have seen so far. So, given the fact that that's a larger space, those have potentially been evolutionarily extinguished. So on the one hand, we can look directly at the mouse. But of course we can take these sequences and align them across all other animals. And what you find here is a general picture where this is the percentage of the mouse hypersensitive sites that are aligning, and then here's the evolutionary distance. And mapping this on here, what you basically get is that 75% of the DHS landscape is restricted to placental animals. And then if we map in the stuff here that's shared with human, you find that 50% of the non-aligning stuff is restricted to murids. Here's the stuff that aligns up to human. And basically what you find is that the vast majority of mouse and human regulatory DNA is specific to placental animals, although even in there there has been a fairly tremendous rearranging of the furniture. And so there are sort of two big-picture principles that emerge from this kind of rearranging.
One of them is that there appears to be extensive functional repurposing of DNA. And what I mean by that is that you have a situation where, if you look at these 1.3 million sites in the mouse, this is the fraction here that aligned, this chunk of about 33% that maps to the human and is hypersensitive in some tissue. About 21% overall, so this chunk right here in red, is hypersensitive in the same orthologous tissue in human and mouse. And this other chunk here has switched tissues. So for example, these are orthologous elements. Let's say here it's in the mouse and in the human, and in both it's in muscle, which is in orange. And this is a mouse brain element that's also on in brain in human, and maybe it has acquired expression in muscle. But this green chunk corresponds to things which are on in one tissue type in the mouse and then in a completely different one in the human. And so this appears to be a fairly frequent and prevalent phenomenon, that nature can kind of rearrange the furniture there pretty easily. And if you look inside these guys, what you find is that the mechanism of this is fairly straightforward, because in all of these cases there has been a turnover of the binding sites. So you have some sites that are conserved, and then you have other sites where there are novel binding sites that turn up. And so again, if you look in this global compartment of things that are hypersensitive in the human, only this fraction right here has a conserved transcription factor binding site. And there's actually a whole bunch of them that remarkably are in the same location in the genome, and we know that because we can align the DNA around there, but there has been such extensive switching around of the binding sites that there's basically no conservation of the recognition sequence; the entire thing could have turned over.
And if somebody asks how that could happen, I can answer that a little bit later, because it's sort of a biophysical playground there as well. And certainly the conserved binding sites are enriched in the sites that have conserved activity. So basically, again, just to throw some numbers out here, looking at the entirety of the mouse landscape, you've got 21% of that landscape that's shared with a corresponding orthologous human tissue. And 11% has a conserved binding site in it somewhere. Now, that's a fair amount of divergence. And given all of this divergence in the regulatory landscapes, you can basically ask what is maintaining functional conservation in mouse and human. I mean, mice and humans have the same basic body plan, physiological functions, lots of other things. So what is maintaining this? And here is one view from the global landscape picture, and I'll give you another view from the fine landscape picture. We have been treating conservation in a very, very sequence-centric manner and looking at it at individual spots in the genome. And what that does is it basically ignores the bigger picture of what might be going on. And what we find is that it's actually conservation of global regulatory content. So I take mouse and human, I take a given cell type, I take a given transcription factor, and I ask what fraction of the hypersensitive sites in that cell type, what fraction of the real estate, is devoted to recognition sites for that transcription factor. And I calculate that in the human and the mouse, for every factor, in two completely orthologous cell types. This is regulatory T cells in mouse and human. Each dot here is a transcription factor. And basically it's the case that the amount of the real estate that's devoted to each transcription factor is the same in mouse and human. You can do this in different tissue types, etc.
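The "real estate" comparison just described can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the analysis behind the slide: coordinates are half-open intervals on a single chromosome, motif hits are assumed to lie inside hypersensitive sites already, and the factor names, coordinates, and helper names (`covered_bp`, `real_estate_fraction`, `pearson`) are all made up for the example.

```python
from math import sqrt


def covered_bp(intervals):
    """Total bases covered by half-open (start, end) intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def real_estate_fraction(dhs_intervals, motif_sites):
    """Fraction of hypersensitive bases occupied by each factor's recognition sites."""
    dhs_bp = covered_bp(dhs_intervals)
    return {tf: covered_bp(sites) / dhs_bp for tf, sites in motif_sites.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)


# Toy, made-up coordinates for one "orthologous cell type" in each species.
human = real_estate_fraction(
    [(0, 100), (200, 300)],
    {"TF_A": [(10, 30)], "TF_B": [(50, 60), (55, 65)], "TF_C": [(210, 220)]},
)
mouse = real_estate_fraction(
    [(0, 50), (100, 150)],
    {"TF_A": [(5, 15)], "TF_B": [(20, 27)], "TF_C": [(110, 115)]},
)
factors = sorted(set(human) & set(mouse))
r = pearson([human[f] for f in factors], [mouse[f] for f in factors])
print(human, mouse, r)
```

Each factor contributes one point, its human fraction against its mouse fraction, which corresponds to one dot on the scatter plot; the correlation is one possible summary of how tightly the dots follow the diagonal.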
And so despite the poor conservation of individual binding sites, the overall proportion of this real estate is nearly constant for any given transcription factor between the two organisms. So just to recap this quickly: the regulatory DNA landscape has undergone wholesale rewiring during the mouse-human interval. Humans and mice share a core regulon that encodes cell identity and lineage programs, and I didn't actually go into this, but that core set is really enriched for the lineage regulatory factors. The regulatory landscape evolution involves two basic things. One is this extensive repurposing of elements from one tissue context to another. And the second one is that there's continuous re-evolution on the same ancestral DNA template. And finally, there is strict conservation of the proportion of regulatory DNA encoding binding sites for a transcription factor. So very quickly I just want to walk through the view from the ground up, in transcription factors and networks. And so here what we can do is push the data by deep sequencing down to the level where we can read individual transcription factor footprints. This enabled us, in 25 cell types, to map 8.6 million footprints. And so you can go in the genome and dial them up and see them for various cell types. A large fraction of these are very cell type selective. But overall, you have a situation where a little bit over 20% of the footprints are conserved positionally between human and mouse. And on top of that, what you have is conservation of the recognition repertoire. So if we go and mine these footprints, we derive a transcription factor lexicon just like we did for the human. And you do this and you get a set of 600 unique motif models. You can see which ones match databases, and you get about 240 that are not in databases. And you can compare this with the identical exercise in the human. And remarkably, these motifs line up.
So for the human and the mouse transcription factors, their effective recognition repertoire on the genome is practically identical. The mouse has some that the human doesn't, and the human has a few that the mouse doesn't. And the mouse ones are kind of selective for ES cells. But in general, very strong conservation. And then we can look at the circuitry, meaning how transcription factors are wired together. So transcription factors are obviously wired in a big network. They control downstream genes. But some of the most vital genes that transcription factors control are other transcription factors and themselves. And you can actually map those circuits by going into the data and looking at each transcription factor gene. And this is just a cartoon of a promoter, just to get the idea. You look at the transcription factors that are there in footprints. So these are the ones that are controlling this gene. And then this gene, of course, this transcription factor, can control these other transcription factor genes. So now you've got the basis of your network: a node is a transcription factor, and an edge is a connection between two transcription factor genes. And if you iterate this over and over, you get a network. You get these big hairballs, right? But unlike the normal hairballs, you can comb through these really nicely, because they contain very, very precise representations of known transcription factor relationships. And if you do this, again, in the mouse and systematically map this, you find that a large number of these transcription factor to transcription factor connections are cell selective. So for example, in brain, this transcription factor is controlling these other genes. In, let's say, retina, the same factor is controlling part of them and other genes. But the really critical thing is that now we can ask, what happened to these connections during evolution?
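The network construction just described, where footprints in a transcription factor gene's regulatory DNA become incoming edges, can be sketched in a few lines. This is a simplified illustration, not the published method: it assumes footprints have already been assigned to factors and to target genes, and the factor names, the wiring, and the function name `build_tf_network` are invented for the example.

```python
def build_tf_network(footprints_by_gene):
    """Return directed edges (regulator, target) between transcription factor genes.

    footprints_by_gene maps each TF gene to the factors footprinted in its
    regulatory DNA. A factor footprinted on its own gene yields a self-edge.
    """
    return {
        (regulator, target)
        for target, regulators in footprints_by_gene.items()
        for regulator in regulators
    }


# Made-up wiring for two hypothetical cell types.
brain = build_tf_network(
    {"TF_A": ["TF_A", "TF_B"], "TF_B": ["TF_A"], "TF_C": ["TF_B"]}
)
retina = build_tf_network(
    {"TF_A": ["TF_B"], "TF_B": ["TF_A", "TF_C"], "TF_C": ["TF_B"]}
)

shared_edges = brain & retina   # connections present in both cell types
brain_only = brain - retina     # cell-selective connections
retina_only = retina - brain
print(sorted(shared_edges), sorted(brain_only), sorted(retina_only))
```

Representing each network as a set of edges makes the cell-selectivity comparison a plain set difference, and the same representation would extend to comparing orthologous edges between species once the gene names are mapped.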
So if I have these transcription factors here in the mouse gene and these here in the human gene, I can identify instances where, let's say in this case, the same factor is present at the same position in human and mouse. In this case, the same factor is present, but it has moved. In this case, there is a human-specific one, and here is a mouse-specific one. And we can look at all of these different proportions. And when you do this, you find that here is this fraction that is positionally conserved. And then there is this excess of sites where the connection has been maintained through the innovation of a brand new binding site on the template. So this number goes from around 20% up to roughly 44, 45%. And then there is the level of global network architecture, looking at how the network is built and how network motifs are distributed. This is a figure from a paper of a couple of years ago showing that the human cell types, each line in here is a different human cell type, had virtually identical architecture in how these are utilized. You can make this computation over these mouse cell types and you get a very similar looking picture. In fact, it's not just similar. If you superimpose them, they're practically identical. So the global architecture of the human and mouse networks is extremely similar. And this also extends to the fine network architecture, where I can go into any one of these network motifs and ask what kinds of transcription factors I find in each location in human and mouse, and you basically again find the same thing: the same regulators like to sit in the same spots in the same kinds of networks between human and mouse. So just wrapping this up and stepping back from the genome, we can see where evolution here has really been acting. We talk about this figure of how much individual DNA bases are conserved, 5%, 10%, whatever you want. But when we look up at the level of footprints, there's about a 20% conservation.
That's on average across all cell types. When we look at the transcription factor connections, around 44% of those are conserved. And when we go, finally, to networks, we see that the human and mouse networks really look the same in many different ways. So with that, I'll wrap up and just highlight some great students and fellows and computational staff and all of our great collaborators that helped bring the mouse DNase I data forward. And obviously NHGRI's funding of the mouse ENCODE project. Thank you.
So a substantial amount of the mouse and human genomes is comprised of retroelements that certainly move around and can introduce regulatory elements. So how much of the lineage-specific regulatory sequence you're describing is embedded in retroviruses?
Yeah, so actually the retroelements are really a remarkable story. In the human genome, it turns out that there's a very substantial compartment of hypersensitive sites on retroelements, and they tend to be extremely cell-type selective. So actually Guillaume Bourque has a very nice paper on this, analyzing it very systematically. But when you look at the mouse and you look at what has happened between human and mouse, you have two things. Number one, you see that a large number of the innovated sequences that are different between human and mouse do comprise these retroelements. So there are really two classes of the new stuff: the stuff that's evolving on the unique DNA, and then the retroelements. And the second thing is that we see pervasive evidence of the phenomenon of transcription factor hopping, in the sense that the binding sites of many types of transcription factors have been disproportionately distributed around the genome through particular classes of retroelements. So that's actually an extremely important compartment and the source of a lot of regulatory diversity.
Hey John, regarding the enhancer repurposing, do you see any specific pairs of transcription factors and certain tissues, like the linkage?
Yeah, so I think that the issue here is that it appears to be extremely easy for nature to evolve new binding sites and to flip the specificity of an element from one tissue type to another, even through the innovation of just a couple of recognition sites for a lineage regulator, or maybe a different combination of sites. So in other words, that appears to be a very, very plastic thing. In some cases, we can even identify elements where a single new binding site has shown up and flipped its tissue. More often, though, there has been larger scale rearrangement, and the evolutionary time scale on which that is able to take place appears to be really short. And I think part of the issue is that we have been globally thinking for many years about regulatory elements as being super engineered, tightly put together things that are really precious, like digging up diamonds in a mine in South Africa, when in fact these things are being mass produced offshore somewhere. I mean, they're just continuously being produced by evolution.
Yeah, somehow I was also thinking of Chris Glass's talk from yesterday. Is it possible, like, one of the TFs is conserved and then just a new partner and then just new-
So what you're getting at there is the explanation for the phenomenon, which may seem paradoxical, that I have a single piece of DNA, and I know it's the same piece of DNA in human and mouse because I have enough bases to say so super confidently, but in fact I look between human and mouse and there's like none of the same transcription factors there. And the reason for that is cooperativity, and nucleosome-enforced cooperativity. So in order to have a piece of regulatory DNA, you have to have your factors, and they've got to get rid of a nucleosome.
And what that means is that making a change there is very, very easy compared to forming a new element, where lots of other sites have to come together. And so what happens is you can lose one, and then you can lose another, a different one, a different one. You lose four or five in a row, and it's still the site. The site's there every time, but now you've suddenly turned over the entire furniture in the house while the house is still there.
Thanks. It implies that it occurs sequentially, not as a wholesale turnover.
Yeah, I'm sorry, the question was, do you have to leave something there each time? So the answer is absolutely, because something has to be there to maintain the cooperativity. It's exactly what Chris showed, in the sense that you have these cooperative interactions, but everybody still has to be there in order for that thing to happen.
One last? Okay, well, yeah, you can find me afterwards.