Okay, so it's just me between you and the end of the day here. Normally, if this were a normal conference, you guys would all be liquored up, so this is hopefully going to be a little entertaining here at the end of the day. I want to thank Elise and all the organizers for the opportunity to talk to you. When I heard I could talk about new directions for ENCODE, I thought, wow, okay, I get to be a visionary. But all I came up with were things that were pretty related to what we were doing in the lab, because I guess that's what I know. So there's actually going to be some pretty concrete stuff, some preliminary data, and if I sound like I'm trying to sell you something, I really apologize for that. But I'd love to have feedback on some of these directions. So thank you. Okay, so I need an obligatory analogy for what ENCODE is. Right now I think quite often a lot of the data sets aren't so much an encyclopedia as an atlas, right? I can look at the map of the world and see political demarcations, or topography, or, you know, surface temperature. And then of course things like connections between different airports, right? That's sort of more like the Hi-C stuff. And I can look at all of this in a browser. So the first thing I want to ask is: are there other sorts of maps that we're missing, and what might be interesting in that direction? And then of course the other side of this is actually building the encyclopedia. So here, for example, is Brussels, which is a regulatory element in my analogy, and there's lots of interesting information here. This of course is from Wikipedia, which is crowdsourced. I think it's above my pay grade to determine whether or not that's a good model for doing ENCODE, but it might be something worth thinking about. 
And the other parts I'll be talking about are how to actually get at the functional relevance of these regulatory elements and understand what they're actually doing. So again, these projects are either going to provide new types of genomic maps or, to put it hyperbolically, hypothesis-less functional validation and a quantitative understanding of how those elements operate. Here's a quick outline. First I'm gonna talk about maybe a new method to try and get at chromatin secondary structure. Then, looking at single-cell regulatory information. Then I'm gonna pitch a quantitative biochemical investigation of DNA sequence, and how changes in sequence affect the structure and function of either encoded or binding macromolecules. And then, obligatorily maybe, I have to pitch something to do with CRISPRs, so we'll talk about high-throughput and combinatorial CRISPR screens that might get us, at least in a cell line context, some measure of functional validation. Okay, so that's four things in ten minutes, so we're gonna go fast. Here's chromatin structure at all length scales: one base, ten, a hundred, et cetera. I made this figure so it's obvious that there's a big gap right at the level of maybe kilobases. How is chromatin folded at the level of hundreds to thousands of bases? I think this is really a missing portion of our understanding of the topology of DNA, and it's the length scale of enhancers and transcription start sites and things like that. There are lots and lots of methods for looking at this linear, primary chromatin sequence, and there are all these tertiary methods. Of course, I would argue those are especially exciting: capture-based methods, Hi-C, and ChIA-PET, maybe pulling down Pol II and things, to really get at functional interactions. But as I've drawn it, there's this big gap. 
We've basically stolen an idea from Rydberg et al. from 1998, and as an aside, I think it's always fun to look at old papers and say, okay, now that I have sequencing, what should I do with this technique that already exists? The way this works is that a high-energy photon comes into chromatin, interacts with water, and generates a cluster of hydroxyl radicals that can cleave the backbone of DNA. If a single event cleaves two backbones of DNA, you can get a single-stranded fragment out, and the ends of that single-stranded fragment were within two nanometers of one another in the folded chromatin structure. So if we've got a bunch of different possible structures, these little bombs go off and generate single-stranded fragments, and different fragment distributions come out of different folded structures. Okay, so that's the theory, and this is what it actually looks like. This is what the gels look like; red here is the actual chromatin, and you can see interesting structure. When we sequence that, we recapitulate the structure of the gel, and here's a theoretical distribution of what these fragments should look like. We can map it back to, for example, CTCF sites. So this is a V-plot showing the nucleosomes, and these are fragments generated from the DNA as it wraps around a nucleosome. We can look at these fragments and then make a contact map with respect to a single nucleosome, and this is the theory we would expect from a crystal structure. Then we can start to say, okay, let's look at all fragments that have a start here and an end somewhere else: where do those ends map? And in the red, you actually see more and more map there. So we're actually starting to get near-crystallographic resolution, and we're starting to try to fold chromatin at this length scale. 
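The geometric intuition here — that a radical cluster joins two backbone positions that are close in space, so fragment lengths encode folding — can be sketched as a toy simulation. Everything below is illustrative, not the actual analysis pipeline; the 2 nm contact radius and 0.34 nm/bp rise are the only numbers taken from the talk and standard B-DNA geometry.

```python
# Toy sketch: given per-base 3D coordinates of a folded fiber (in nm), any
# pair of backbone positions within ~2 nm in space can be cleaved by a single
# radical cluster, releasing a single-stranded fragment whose length is the
# genomic separation of the two cut sites.
import math

def fragment_length_distribution(coords, contact_radius_nm=2.0):
    """coords: list of (x, y, z) per base. Returns {fragment_length: count}."""
    hist = {}
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            dz = coords[i][2] - coords[j][2]
            if math.sqrt(dx * dx + dy * dy + dz * dz) <= contact_radius_nm:
                length = j - i
                hist[length] = hist.get(length, 0) + 1
    return hist

# A straight fiber at 0.34 nm/bp can only yield short fragments; a folded
# fiber would add long-fragment peaks wherever distant bases come close.
straight = [(0.34 * i, 0.0, 0.0) for i in range(100)]
print(max(fragment_length_distribution(straight)))  # → 5 (bases): floor(2.0 / 0.34)
```

A folded conformation fed into the same function produces extra peaks at the genomic separations of the fold contacts, which is exactly why different structures give different fragment distributions.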
You can also see here, this is our insert distribution, and if we bin along heterochromatic or active transcription start sites, there are depletions at specific locations, suggesting, for example, that this two-nucleosome peak is enriched in heterochromatic regions and depleted at active transcription start sites. So we're getting some compaction information, and maybe one way of starting to fill in what I think is a gap in our understanding of chromatin. Okay, number two: understanding functional elements in single cells. There's been a lot of talk about this. There are methods like single-cell methylation; there's a recent single-cell Hi-C paper, really interesting, trying to understand how chromatin is folded in different ways in different individual cells; and there are lots of interesting techniques for doing droplets and doing RNA or chromatin analysis within droplets. We've been working on methods for single-cell ATAC-seq, which is a method for looking at open chromatin regions in small numbers of cells. So why single cells? Well, maybe it's obvious, but here's my other mapping analogy. If you want to look at all the boats that go from San Francisco to New York, they can either go through the Panama Canal or all the way around the tip of Tierra del Fuego. The average boat, though, goes through the middle of Brazil, which of course never happens, right? So that's a big problem: that's all of our data; it all effectively goes through the middle of Brazil. So here's what single-cell ATAC-seq looks like. Here's the Duke DNase data in GM12878, here's bulk ATAC-seq, and here's aggregated ATAC-seq from about 250 cells, and you can see high correlation. But of course, now we actually get to see the path of every boat, I guess. Every column here is a different individual cell, and what you can see are the reads that we're observing in those individual cells. 
What we can now do is look at correlations. This is extremely sparse data: every cell generally has either zero, one, or two reads at any specific locus, so it's much different from RNA-seq data, for example. The way we've been looking at this is looking at open chromatin regions that are correlated with transcription factor binding, asking whether peaks associated with specific transcription factors vary more or less than we would have expected by chance. Then we can look at that deviation here. This clustering is of K562s, H1ESCs, and GM12878s, and all of the cell types basically cluster out. But we can see variability associated with specific transcription factors: here are GATA motifs, H1ESCs, GMs, JUN factors, and NF-kappaB. So all of these cells cluster out, but even within these cell lines we see significant variability, in GATA motifs for K562s, for example, and in NF-kappaB we see variability in the GMs. So now we're starting to try to understand this. We can do this for lots of different cell types: GMs, K562s, EML, TF-1s, mESCs, HL-60s, and BJs. And of course all of this is enabled by the amazing data sets that have been generated by ENCODE, so we can look at variability in NF-kappaB, GATA factors, JUN/FOS, and NANOG, for example. The other thing we can ask is, are there correlations along the linear genome? For example, if I go across the genome, are there enrichments or depletions in individual cells as a function of, say, 25 peaks at a time? We see significant structure here. Then we can ask, are there regions along the genome that seem to vary together? If we look at that, we actually see correlations that are very similar to chromosome conformation capture. So we're recapitulating high-level, megabase-scale chromatin structure with the variability observations in single cells. So the single-cell analysis avoids this ensemble averaging. 
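The "vary more or less than expected by chance" idea can be sketched as a simplified deviation score: compare each cell's observed reads in motif-bearing peaks against the expectation from its sequencing depth and the aggregate accessibility. This is a toy version of that logic (names and the tiny matrix are hypothetical; the real analysis also uses background peak sets and permutations).

```python
# Toy deviation score for one TF motif in a sparse single-cell ATAC matrix.
# counts[cell][peak] holds read counts; motif_peaks is the set of peak
# indices that carry the motif of interest.
def motif_deviations(counts, motif_peaks):
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Aggregate accessibility per peak across all cells.
    peak_totals = [sum(counts[c][p] for c in range(n_cells)) for p in range(n_peaks)]
    grand_total = sum(peak_totals)
    # Fraction of all reads expected to fall in motif peaks by chance.
    motif_fraction = sum(peak_totals[p] for p in motif_peaks) / grand_total
    devs = []
    for c in range(n_cells):
        cell_total = sum(counts[c])
        expected = cell_total * motif_fraction  # depth-scaled expectation
        observed = sum(counts[c][p] for p in motif_peaks)
        devs.append((observed - expected) / expected if expected else 0.0)
    return devs

# Two cells, three peaks, motif in peak 0: cell 0 is enriched, cell 1 depleted.
print(motif_deviations([[2, 0, 0], [0, 2, 0]], {0}))  # → [1.0, -1.0]
```

The spread of these per-cell deviations across a population is what distinguishes a TF whose associated peaks fluctuate cell to cell from one whose peaks are uniformly open.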
It allows statistical correlations in regulatory state to provide evidence of functional linkage between elements. I think if we had more cells, we would be able to say that specific open chromatin regions are correlating together, which is a statistical argument for their being functionally linked. And in the future it would be wonderful to go after simultaneous single-cell regulatory information and RNA-seq from the same cell, which would couple the what and the why of the cellular variability. Okay, I don't have a lot of time left, but here are the other two things I want to talk about. This is a slide aimed maybe at NIH people: understanding a mechanism or establishing causality does not necessarily imply hypothesis-driven. Hypothesis may be a dirty word at NHGRI, but I think we can almost do hypothesis-less mechanistic studies to try to understand causation. Here's my little pitch for two ways of potentially going about that. So everyone's seen this; it's the tritest slide in genomics. But there is this functional genomics bottleneck: we have a really hard time, I would argue, linking sequence variation to structure and function changes in biomolecules. One of the things we're interested in doing is trying to do this in a very quantitative way, at extremely high throughput, actually on the sequencing instrument itself. It turns out that if you wanna do lots of different measurements across lots of different sequences, well, DNA is a combinatorial polymer, so you might wanna make combinatorial changes to DNA sequence and see how those affect the structure and function of things binding to the DNA or of the molecules it encodes. And this is a wonderful machine that allows you to do billions of measurements at a time on something that I can hold in my hand. 
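To make "combinatorial changes to DNA sequence" concrete: even the second-order neighborhood of a short sequence is large, which is why you want billions of measurements. A toy enumeration (purely illustrative) shows how the library size grows.

```python
# Toy enumeration of all double mutants of a reference DNA sequence: every
# pair of positions, every non-reference base at each. For a sequence of
# length n this gives C(n, 2) * 3 * 3 variants.
def double_mutants(seq, alphabet="ACGT"):
    muts = []
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            for a in alphabet:
                if a == seq[i]:
                    continue
                for b in alphabet:
                    if b == seq[j]:
                        continue
                    s = list(seq)
                    s[i], s[j] = a, b
                    muts.append("".join(s))
    return muts

print(len(double_mutants("ACGT")))  # → 54, i.e. C(4,2) * 9
```

For a 30-base element this is already C(30, 2) * 9 = 3,915 variants, and higher-order combinations explode from there, which is the scale argument for doing the biochemistry directly on a sequencer.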
So this is actually an idea that was pioneered by Chris Burge's lab, where you use the sequencer as a post hoc DNA array: you know the sequence of every cluster, and then you can add in DNA binding proteins, for example transcription factors, or even initiate an RNA polymerase, and ask about structure, or binding to RNA, or even the ability of RNA polymerase to initiate at a specific site, which gets more into understanding the requirements of transcription and how elements might be doing that. We label these binding elements; we can label elements that may bind to RNA or DNA. And here's what the experiment looks like; maybe I don't have time for this. Well, maybe I do, of course. Okay, so the experiment just looks like this: we flush out all the protein, we allow it to come in at higher and higher concentrations, we build up a binding curve across all of these structures, and we let it flow out. Every cluster has a different sequence, of course, so now we're getting huge amounts of thermodynamic and kinetic information across sequence space. Here's what that looks like: binding curves from individual clusters. We measure lots of clusters, so we get good measurements. We can reconstruct a comprehensive functional landscape for this molecule and actually understand how mutations, or combinations of mutations, in this case these are all double mutants, allow another RNA binding protein to come in. Okay. And it turns out pirating sequencing instruments is relatively cheap and easy. The GAII six years ago was maybe a $600,000 instrument, and now it's free. It's like a big paperweight; people give them away, it's really fun. You can get them, and if you open one up it looks like it was built by a grad student. There are actual structural cable ties in that thing that you can cut away and then have fun with. 
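The per-cluster analysis behind those binding curves is, at heart, fitting a single-site binding isotherm, fraction bound = c / (c + Kd), to fluorescence measured at each titration concentration. Here's a toy version of that fit (grid search over Kd stands in for a real nonlinear least-squares routine, and all numbers are illustrative).

```python
# Toy fit of a single-site binding isotherm to one cluster's titration data.
# fraction_bound(c) = c / (c + Kd); we pick the Kd on a grid that minimizes
# the sum of squared residuals.
def fit_kd(concentrations, fraction_bound, kd_grid):
    def sse(kd):
        return sum((f - c / (c + kd)) ** 2
                   for c, f in zip(concentrations, fraction_bound))
    return min(kd_grid, key=sse)

conc = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]     # protein concentrations, nM (illustrative)
true_kd = 2.0
data = [c / (c + true_kd) for c in conc]    # noiseless synthetic curve for the sketch
grid = [k / 10 for k in range(1, 101)]      # candidate Kd values, 0.1 to 10.0 nM
print(fit_kd(conc, data, grid))             # → 2.0
```

Repeating this fit over millions of clusters, each carrying a different sequence variant, is what turns the flow cell into a map from sequence space to thermodynamic parameters.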
And then what we can do is do our sequencing on a MiSeq, put the flow cell on the guts of this GAII, and build our own custom imaging station, which allows us to do these high-throughput quantitative biochemical methods, and we can clone this thing for really cheap. Multicolor capabilities allow for the possibility of binding multiple factors and understanding, for example, cooperativity in a very quantitative manner at specific loci. Quantitative biophysical models of complex molecular interactions across sequence space, within enhancers or whole genomes, or even trying to understand the machinery of how transcription itself starts, which I think links to Frank's questions. This also allows a predictive understanding of sequence perturbations, on DNA binding proteins potentially, but also on encoded macromolecules like RNA, a direct application to the interpretation of genomic variants. Okay, finally, no talk in this DNA age is complete without something to do with CRISPRs. So here is a quick pitch for maybe one way to do CRISPR-based high-throughput functional screens, in analogy to high-throughput shRNA screens, and again, this idea and the figures are courtesy of Mike Bassik. Back in the day you could make lots and lots of shRNAs, infect cells with them, let some be selected or unselected, and then determine which knockdowns of specific genes were important for the selected phenotype. For example, let's say I added ricin: which genes do I knock down that allow me to survive ricin, or to die faster in ricin? Suddenly I have a sense of which genes are important in this regulation. And then what I could do is pool, I can ligate these shRNAs together and do double knockdowns, and then understand whether those two genes were buffering or synergistic together, and build back networks of understanding. So this is a way of understanding which genes are important and, potentially, their interactions. 
But now we have CRISPRs. CRISPRs can be targeted effectively anywhere in the genome with an sgRNA guide, and people have also fused things to this Cas9 molecule, stuff like a KRAB domain to repress, or VP64 to activate, and effectively anything could be fused to Cas9: you can target anything that's gonna recruit anything else to it. Then you can make a library of different sgRNAs and do the same thing, and you can actually also do this combinatorially, where you have two sgRNAs, for example, and either knock out or change two specific locations, or knock out a region entirely by cutting on either side of it. And effectively this works. So what would this look like? Okay, right? Here's an experimental outline for understanding the functional relevance of, let's say, the enhancers of gene i of N genes. What we really need is some mapping between, say, growth rate and gene expression level. There are lots of different ways to potentially do this for hundreds of genes. So I can dial in gene expression under some stress condition and make a mapping between gene expression and growth rate, and of course I'm probing growth rate by sequencing all the sgRNAs before and after my screen. So this is the basal level. Let's say I use a CRISPR knockout: I knock this element out, I measure the growth rate, I infer gene expression, and now I have a quantitative metric for how this element affects gene expression. Then I can add my CRISPRi, a repressing protein, and that might change it some other way. I add a CRISPRa, I increase the gene expression level. And then I kind of got sick of animating, but the idea is you can also do combinatorial things. So now you can add multiple different elements, and this is something Aviv was talking about: there's a combinatorial logic to these elements that are regulating expression. 
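"Sequencing all the sgRNAs before and after my screen" boils down to a depth-normalized enrichment score per guide. This is a toy version of that calculation (all names and counts hypothetical; real pipelines add replicate handling and significance testing).

```python
# Toy sgRNA enrichment for a pooled screen: compare each guide's abundance
# before and after selection, normalized by library depth. Guides depleted
# after selection target elements needed to survive the selection; enriched
# guides target elements whose loss confers an advantage.
import math

def guide_enrichment(before, after, pseudocount=1.0):
    """before/after: dict of guide -> read count. Returns guide -> log2 fold change."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    scores = {}
    for guide in before:
        freq_before = (before[guide] + pseudocount) / total_before
        freq_after = (after.get(guide, 0) + pseudocount) / total_after
        scores[guide] = math.log2(freq_after / freq_before)
    return scores

# Guide 'enh1_g1' doubles in frequency; 'enh2_g1' drops out entirely.
scores = guide_enrichment({"enh1_g1": 100, "enh2_g1": 100},
                          {"enh1_g1": 200, "enh2_g1": 0})
print(scores["enh1_g1"] > 0, scores["enh2_g1"] < 0)  # → True True
```

Mapping these per-guide growth scores back through the gene's expression-to-growth-rate curve is what would convert a dropout screen readout into a quantitative effect of each element on expression.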
So a mapping from gene expression to growth defect is needed for hundreds of genes, but then we could do this CRISPRi/a and knockout screen targeted at every putative regulatory element, and at sub-elements within the elements. Parallelized combinations would really get us some understanding of what's going on in a predictive way: a high-throughput, quantitative picture, a comprehensive quantitative measure of elements and gene expression, and insight into their additive regulatory logic. Okay, I went over, I apologize, but that is it. Thanks to the people in my lab who generated some of this data, and to Howard, who's a collaborator. And you guys, thank you so much. Okay, I'm done, thanks. Thanks. Can we have everybody come back up, Laurie and Ross?