Thank you, Laura. That was very nice. And thanks to Eric for both staging this and running NISC for so long and being so successful. And congratulations to you and all of you. I've been involved with NISC, I guess I should say, in various collaborations, only one of which I'm going to mention today, which is the ENCODE project, but also full cDNA sequencing and really sequencing the human genome and various other sequencing projects. I do want to apologize to start off, especially speaking at Eric's symposium here, that I don't have any animation in my slides, I don't have any music or any of those things. It'll look poorer compared to what he normally does when he's invited to birthday parties. And then also I just want to comment that it really is true what he said in the introduction. I don't know, maybe when we started out to sequence the human genome almost 20 years ago, we thought that that's what we would do, but it really created a field. It's a discipline that came out of this. And we really are in a sequence-based world. I think we're really just barely starting to, and the talks so far, I absolutely agree with them. So I'm going to tell you about one area that I'm interested in and have spent some time in. In fact, this was really what I worked on when I was in graduate school, which was transcription control. And it's really striking to see what we did back then and actually still do now, where we look at one gene. There were probably, I don't know, 100 labs working on this one protein when I was a graduate student, a protein that binds to DNA and regulates transcription and DNA replication. And that was one gene and one protein. And now we are trying to look at all the genes and all the proteins at once, thanks to genomics.
So what I'm going to talk about really is mostly just the trans-acting part of this, the proteins that bind to DNA, and really only one type of protein, the type that binds to very specific sequences. There are chromatin-binding proteins that do this more globally. And then also we care about the cis-acting sequences, or the regulatory sequences to which they bind, and these are all over the genome as well. So these trans-acting components, the reason this is so striking is that we have so many. It's a little bit unfortunate. We dedicate about a tenth of our genome to regulating our genome, or at least in terms of DNA-binding proteins. Ones that regulate transcription, DNA replication, other things as well. And luckily, unlike the slide I showed you earlier, which was an electron micrograph, or related biochemical assays, which are one gene, one protein at a time, very, very tediously, with naked DNA, we now have assays that allow us to look at the interaction of proteins that bind to DNA inside living cells, and this is ChIP, or chromatin IP. And here's a diagram; I think probably most of you know it. You take cells, you cross-link very gently with a chemical agent, you cross-link the proteins to the DNA while the cells are still alive. You freeze them, you break them open, sonicate the DNA to break it into 500 base pair fragments, and then you use an antibody to immunoprecipitate the transcription factor, or the protein that you're interested in. And then you take what you've immunoprecipitated and analyze it in some way. You reverse the cross-links, remove the proteins, and all the junk supposedly is washed away. And what most people have done is something called ChIP-chip or ChIP-array, where you then take this DNA and hybridize it to an array. There are other ways to do this. You can use quantitative PCR just to check for fragments that you're interested in.
And what I'm going to show you is that, it was obvious I think from the beginning that you could use DNA sequencing, but the new sequencing technologies really helped to make that practical. And there's an excellent description of this in this textbook here that you can buy multiple copies of if you're so inclined. As for short-read sequencing, you've already heard about various versions of it, and I echo what Richard Gibbs said and others have said. These platforms are changing rapidly. They're already working reasonably well now, and I think we're going to see great improvements in them. And they all have one thing in common that differs from traditional sequencing (it's not so classical, but Sanger sequencing). Instead of cloning the fragments and addressing them one at a time, albeit with robots in the big genome centers, you just spread them on a surface, beads or a flat surface, and sequence them in situ. They're sort of single-molecule sequencing, though they're not really right now: they amplify in situ. Some of the new techniques are actually true single molecule. They're short reads. They tend to be 25, 35 base pairs or so. You can get them larger than that with some of the technologies. And the key here is lots and lots of reads, easy and cheap. And you want them to be accurate enough so that you can actually read them. All right. So I'm going to tell you about one transcription factor, or a couple of transcription factors, but mostly about one called NRSF or REST. It's an amazing transcription factor. NRSF stands for Neuron-Restrictive Silencer Factor. It's also called something else by another laboratory. David Anderson's and Gail Mandel's labs discovered this protein about 15 years ago. It's a zinc finger protein. It has eight zinc fingers. It's an unusual DNA-binding protein. And it recognizes a 21 base pair fragment.
That's really large for the cis-acting sequences that transcription factors bind to. Most of them are about 6 to 8, 10 at the most. REST is interesting because it's a repressor. It may act as an activator in some cases, but it's mostly a repressor that turns off neuron-specific genes in non-neuronal cells. But it also turns off those kinds of genes, at least some of them, in neurons as well. So it's involved in the whole process of neuronal maturation and differentiation and actually maintenance as well. And it works with cofactors, and we actually understand a little bit about its mechanism, but almost all of the work, until recently, was taking this transcription factor and looking at it binding to a few genes. So what we decided we wanted to do, and lots of others as well, for this transcription factor and all of the others really, is to identify all the binding sites in the human genome. And actually, why stop there? The mouse and other genomes as well. Determine which ones are occupied in different cell types and in different cell states. And we learn a lot about biology by being able to do that. And then, I actually may not have time to tell you this, but we then compare gene expression from those genes as well as the methylation status at the CpG-rich sequences and the chromatin state in those regions. And what we want, we want everything. And I think one of the lessons we've learned, especially from our big genome center colleagues, is that we should be really greedy about what we want. Because if you set your sights that high, you can often figure out how to do it. And really, you want everything. You want it to be comprehensive. You'd like it to be completely unbiased. And this is one of the big differences from using a microarray to look at the ChIP-immunoprecipitated material, or using real-time PCR. If you sequence it, in theory, you should see everything that's there.
You want it to be fast, accurate, cheap, et cetera. So we tried this, and "we", by the way, is my laboratory and Barbara Wold's laboratory. Barbara is at Caltech, and she's been studying transcription for longer than I have. And we just decided to try to test the immunoprecipitated materials with one of these ultra-fast or ultra-high-throughput sequencing platforms. And so we worked with Solexa. The idea is you do it exactly the same way. We add an amplification step after the cross-linking, size-select, and sequence it by Solexa. And the data look remarkable. Now, we're cheating a bit because this transcription factor is such an easy one to look at. It has such a large binding site. There's a fantastic monoclonal antibody that works for it. It doesn't work quite as well as this for everything, but in general we're seeing the same types of results. This is a browser shot of just a little tiny portion of the genome. Each one of these little red blocks is a sequence read that has been placed. So you sequence, by one of these fast techniques, the stuff that's immunoprecipitated. And then you take the reads, the ones that you can place uniquely on the genome, you place them computationally, and here's where they landed on this particular one, and you get a peak here. And that's a lot of data saying that something is binding there. It turns out you only need about 10 or so tags to be way, way above background and above a threshold that we set, probably a very conservative threshold. Okay. We found about 2,000 sites occupied by this protein in one particular cell type. We've now looked at this in multiple cell types. And this was only with about a million and a half reads.
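The counting idea described above (place the uniquely mappable reads, then look for locations where roughly 10 or more tags pile up) can be sketched in a few lines. This is only an illustration, not the peak caller used in the work described: the fixed 200 bp window, the function name `call_peaks`, and the toy read positions are all assumptions made for the example; only the threshold of about 10 tags comes from the talk.

```python
from collections import Counter

def call_peaks(read_starts, window=200, min_tags=10):
    """Count uniquely placed reads in fixed windows and keep dense windows.

    read_starts: iterable of (chrom, position) pairs, one per placed read.
    window:      window size in bp (hypothetical choice for this sketch).
    min_tags:    minimum number of reads for a window to be reported.
    Returns a sorted list of (chrom, window_start, tag_count).
    """
    counts = Counter(
        (chrom, (pos // window) * window) for chrom, pos in read_starts
    )
    return sorted(
        (chrom, start, n) for (chrom, start), n in counts.items() if n >= min_tags
    )

# Toy data: 12 reads piled up near chr1:1000, plus two scattered background reads.
reads = [("chr1", 1000 + i) for i in range(12)] + [("chr1", 50_000), ("chr2", 7_300)]
peaks = call_peaks(reads)
print(peaks)
```

A real analysis would also compare against the control sample and handle reads that straddle window boundaries; the sketch only shows the digital, count-and-threshold character of the readout.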
These platforms potentially can give you, from a run, about 40 million reads in theory, probably higher on some of them, and I suspect by next year it'll be 100 million; we hope so anyway. Okay. The thing that struck us about this is that the background is incredibly low. This is the unimmune or the mock immunoprecipitation here, and this is what the genome looks like. You get reads every once in a while. We have a few places in the genome where we do pile up reads where we think it's not a real binding site, and we have almost surely attributed those to misplaced reads because of low-copy repeats. Okay. So here's an example of the kind of discoveries you can make with this. This gene, NeuroD, was known to be regulated by NRSF from sort of classical molecular genetics kinds of experiments, and was known to be occupied, but people had tried very hard to find the binding site for it. They couldn't find it in the promoter. They looked and looked and looked, 100 kb upstream, and not 100 kb downstream, but they looked downstream as well as they could with the knowledge that they had of the binding site five or six years ago. Okay. And people tried scanning through the gene biochemically to look for binding sites and never found one. So we did this experiment, and right there, not really in the middle of the gene but in an exon of the gene, you have a binding site that clearly regulates the transcription of the gene. So we found lots of examples of these: many new sites that had never been observed, but even some in genes that people had studied where they didn't know where the binding would occur. Here's a type of thing that you can learn from this, and again this is where the agnostic and global nature of this kind of approach is important.
You don't have a preconceived hypothesis about what should be bound or what shouldn't be bound. You go in and you just see. And just even looking at a couple of cell types, we learn that there are a variety of transcription factors that are repressed. The genes themselves are repressed by NRSF, and these genes are involved in pancreatic beta cell development and maintenance. And sure enough, just from making the guesses from this type of finding, we go in and we find that they are indeed involved when you look at them on an individual basis. That's actually a new, surprising role for this protein, maybe not terribly shocking, but one that was not known before. Another thing that we learn from this, and we actually are getting a hint of this in other transcription factors as well, is that not all the sites that are occupied by a protein actually look like the consensus binding site. That's a common tool, to say, oh, here's the sequence. You can show biochemically maybe that it binds to this. But in fact there's some slop in the binding, and then occasionally, as I'll show you, you see binding, and there's no question that a site is occupied, but it has nothing to do with the NRSF canonical consensus site. But one thing we did learn is that there are a fair number of sites that look like half sites, and this is not shocking, I suppose: with eight fingers, maybe four or so of them are binding to a half site, with the rest of the site occupied but not in the same way as the whole site. But what was interesting is that actually about 15% of the sites that are bound don't look like either a half site or a whole site. And this is showing up in other transcription factors as well. It's possible that it recognizes two very different types of DNA sequences, but we suspect that it's actually binding on top of another protein that's bound to the DNA in that site.
And if we actually truly can get that kind of sandwiching from these experiments, I think we're going to be able to learn a lot about transcription. You can do the same kind of thing with the ChIP data that you do with gene expression data, where you cluster it and you try to get an idea, in a particular cell type or cell state, of which types of genes are, in this case, being occupied by the protein. This is really useful in the case where you have no idea what a transcription factor does. We know a lot about this one, and so much of this is confirmatory, but we now know a lot more genes, and genes that are involved in particular pathways, in particular cell types. And it does differ in different cell types, and you particularly see interesting changes in these patterns when you look at neurons. You still are repressing some of the genes in neurons. Every neuron doesn't express every neuronal gene, and we are learning a lot about the different types. The biggest problem we have here is you don't easily get cultured neurons in any mammalian system, so it's hard to do the real biology you'd like to do here. I think one of our challenges will be to figure out how to do this in tissues, and unfortunately the cross-linking makes it very difficult. You have to cross-link sort of evenly. If you overdo it or underdo it, it doesn't work, and so you really can't do it in a massive tissue. You have to do it with cells that are dispersed. This is, I think, a little bit of an obnoxious slide.
It's showing the performance of the technique, and it really has held up similarly with most transcription factors. You plot the fraction of true positives versus the fraction of false positives, and most curves for the other ChIP techniques look like this, and you struggle to find where to set your threshold so you're not getting too many false positives while not throwing away the good results that you have. At least with the way that we've been doing this, and we actually feel like we understand the parameters that make this important, you can really set your threshold where you're getting very few false positives and almost all your true positives. One of the other features, and this has held up, although not quite as strongly as it has for NRSF, is that if you do ChIP-chip or ChIP-array you basically are narrowing down a binding site to about 500 base pairs, maybe even a little bigger than that. You're not really sure. When you're looking at a sequence-specific binding protein, or really any DNA-binding protein, you'd like to know exactly where it's sitting on the DNA; that actually helps you think about the network of other proteins that are bound. Having studied these things biochemically originally, I like to think about the contacts, about how the protein actually lands on the DNA. And when you do ChIP-seq, when you sequence the ChIP products and you get enough reads, and especially the way that we made the products to put onto the sequencing machine, which is with shearing and serious size selection, you end up narrowing it down to a really, really tiny area, and in fact the deeper you sequence, the narrower it becomes. Some of you might know the technique SELEX. It's a method for taking a DNA-binding protein and selecting out, from a very, very large mixture of oligonucleotides, what it can bind to. We've done this a fair amount in my laboratory.
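The threshold trade-off described above, fraction of true positives against fraction of false positives, is what a receiver operating characteristic curve plots. A minimal sketch of how such points could be computed is below. The scores and the true/false labels are made up for illustration, and the function name `roc_points` is hypothetical; the sketch assumes the labelled set contains at least one true and one false site.

```python
def roc_points(scores, labels):
    """Return (false_positive_fraction, true_positive_fraction) pairs obtained
    by lowering the score threshold through each distinct score.

    scores: peak scores (for example tag counts), higher means more confident.
    labels: True for sites known to be real, False for known non-sites.
    Assumes both classes are present, so neither denominator is zero.
    """
    n_pos = sum(1 for is_true in labels if is_true)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every site whose score equals the current threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy example: four real sites and two false ones.
scores = [50, 40, 30, 12, 8, 5]
labels = [True, True, True, False, True, False]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"false positives {fpr:.2f}  true positives {tpr:.2f}")
```

A curve that climbs close to a true-positive fraction of 1 before the false-positive fraction moves off 0 is the situation described in the talk, where the threshold is easy to set.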
It's a tedious, difficult technique, and it does work. I think this is the new SELEX. You don't need to do it biochemically. I can't believe I'm saying that, but you can do it inside living cells, and the deeper you go on your reads, the more you can narrow that down. Arend Sidow and a postdoc in his lab, Anton Valouev, took the data that we got, and were helping develop the algorithms for doing the placement of the reads against the genome and making the peak calls, and came up with an obvious idea, I guess, that we had not thought of. If you look at the sequence reads, these are 20- to 35-base reads actually that we're doing, and you look at their directionality on the genome, and this has held up, this works on essentially every binding site: these blue ones are coming from this end, and the yellow ones are coming from this end going that way, and if you color-code them they actually home in on the very top of the binding site. So that actually serves two purposes: it helps to narrow it down, but it also gives you more confidence about the placements of the reads when it works this way. This is two binding sites for NRSF in another experiment. So we've done this on a bunch of transcription factors, and here are some of them. There's one called GA-binding protein that I'll speak of in a minute. Serum response factor is an important one for stress response as well. Some standard transcription factors, RNA polymerase II, other chromatin-binding proteins, and we'll be applying this to several hundred such proteins for the ENCODE project as we're starting the scale-up of that, funded by the NHGRI. One I will tell you a little bit about, that's one of my favorite ones, that we're just barely getting a hint of now, is FoxP2.
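As an aside on the strand-direction observation described above: reads on the forward strand start upstream of the bound site and reads on the reverse strand start downstream of it, so the two groups bracket the site. A very rough sketch of using that to estimate the site position follows. It is not the algorithm developed by the collaborators mentioned in the talk; taking the midpoint between the two medians, the function name `estimate_site`, and the coordinates are all assumptions chosen to keep the example short.

```python
from statistics import median

def estimate_site(reads):
    """Estimate a binding-site position from strand-specific read starts.

    reads: list of (five_prime_position, strand), strand being '+' or '-'.
    Forward-strand reads are expected upstream of the site and reverse-strand
    reads downstream, so the site is taken as the midpoint between the median
    start of each group. Returns None if either strand has no reads.
    """
    forward = [pos for pos, strand in reads if strand == "+"]
    reverse = [pos for pos, strand in reads if strand == "-"]
    if not forward or not reverse:
        return None
    return (median(forward) + median(reverse)) / 2

# Toy reads flanking a site near position 1000.
reads = [(900, "+"), (920, "+"), (950, "+"), (1050, "-"), (1080, "-"), (1100, "-")]
site = estimate_site(reads)
print(site)
```

The same check doubles as a sanity test on read placement: a pile of reads that does not split into an upstream forward group and a downstream reverse group is less likely to mark a real site.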
It's a forkhead transcription factor. This is in humans. It's a highly conserved transcription factor in all species, with very, very little sequence variation in it, and a mutation at one residue in a family (Tony Monaco and others discovered this and reported this years ago) causes the loss of language in humans. So it's one of the very few true behavioral genes. It doesn't look like it's a mechanical problem; it's a processing problem largely. We don't really understand it, but it's a transcription factor, and even though forkhead or Fox factors in general are a family and have sort of similar binding sites, the actual binding site for this one is not known, although we're starting to learn it. Only four base pairs differ between humans and chimp in this transcription factor. I don't think we will find the reason that chimps can't speak and we can by understanding this, but I'll bet you we will understand something about language from being able to study it. So Simone Martique, a graduate student in my lab, has been studying this. It's been a hard one. We didn't have good antibodies, and she had to make her own. After multiple tries she was able to do this, and we've now found about eight to nine hundred binding sites in one particular cell type that are occupied by this factor. It looks like it's primarily an activation factor, although in some cases we think it might be repressing. Interestingly, it binds to several places in its own promoter. Another student in my lab, Diane Schroeder, has looked at this whole fairly large region and found multiple promoters for the gene, and so it clearly is autoregulatory. That's not terribly surprising; it happens with a lot of transcription factors, not all. It also binds to the promoter of one of its family members, FoxP1. So we're just now starting to explore this and try to understand how the network of binding works. I'm going to tell you one more story, I think, and this relates a little bit to the cis-acting sequences.
So one thing we did several years ago: Nathan Trinklein and Shelley Force Aldred in my lab, and this was when the human genome sequence was first really getting finished, discovered that about eleven percent of our genes are arranged in a bidirectional way. That means pointing outwards, in terms of divergent transcription, where this distance is less than a kb. And in fact, if you look at the distribution, you may not be able to see this, most of them are about 100 or 150 bases apart, and that really surprised me, because that's not a whole lot of room. That's really about the size of a transcriptional promoter. I'm not going to show you a lot of the other data, but what we've done is a whole bunch of mutagenesis experiments to look at these, and they clearly share transcription elements within that 100 base pairs or so. One thing we've learned is that they're co-expressed. They're either both up or both down in a cell type for almost all of them, for 95 percent of them. They're not anti-regulated. This arrangement is conserved in other mammals as well. A new thing that we did is we collaborated with Zhiping Weng at Boston University and her student Jane Lin, where we looked for over-represented motifs within that 100 or 150 base pairs or so. There are about six or seven of them, but one in particular that really showed up as being highly over-represented is for a protein called GA-binding protein. This is a transcription factor that's involved in a whole lot of metabolic processes and many other processes. One single pinpoint thing that you can say about it is that it's an ETS-family transcription factor; it forms a tetramer and binds to a sequence that looks like this, though it's a little bit more extended than that. What Patrick Collins, a student in my lab, did was partly computation, but he really did a bunch of experiments to show that almost all the bidirectional promoters are bound by this protein.
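The definition of a bidirectional arrangement given above (two genes pointing away from each other with their start sites less than a kb apart) translates directly into a search over a gene annotation. The sketch below is an illustration under simplifying assumptions, not the analysis from the lab: it checks only neighbouring start sites on each chromosome, and the gene names, coordinates, and function name `bidirectional_pairs` are invented for the example.

```python
from collections import defaultdict

def bidirectional_pairs(genes, max_gap=1000):
    """Find divergently transcribed gene pairs with nearby start sites.

    genes:   list of (name, chrom, tss, strand), strand being '+' or '-'.
    max_gap: maximum distance between the two start sites, in bp.
    A pair qualifies when a minus-strand gene's start site lies just to the
    left of a plus-strand gene's start site, so the two point away from each
    other. Only adjacent start sites on a chromosome are compared.
    Returns a list of (minus_gene, plus_gene, distance).
    """
    by_chrom = defaultdict(list)
    for name, chrom, tss, strand in genes:
        by_chrom[chrom].append((tss, strand, name))
    pairs = []
    for sites in by_chrom.values():
        sites.sort()
        for (tss_a, strand_a, a), (tss_b, strand_b, b) in zip(sites, sites[1:]):
            if strand_a == "-" and strand_b == "+" and tss_b - tss_a < max_gap:
                pairs.append((a, b, tss_b - tss_a))
    return pairs

# Hypothetical annotation.
genes = [
    ("geneA", "chr1", 10_000, "-"),
    ("geneB", "chr1", 10_150, "+"),  # divergent from geneA, 150 bp apart
    ("geneC", "chr1", 50_000, "+"),
    ("geneD", "chr1", 50_400, "-"),  # faces geneC, so not a divergent pair
    ("geneE", "chr2", 5_000, "-"),
    ("geneF", "chr2", 9_000, "+"),   # divergent from geneE but 4 kb apart
]
pairs = bidirectional_pairs(genes)
print(pairs)
```

Collecting the distance for each pair is what would let you draw the distribution mentioned in the talk, where most pairs sit about 100 to 150 bases apart.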
So we're not going to say that it's the only protein that regulates bidirectional promoters, it's clearly not, but it really looks like it's one, and this is true in different cell types. The more cell types we look at, the truer this becomes, or the closer it becomes to 100%. And interestingly, some unidirectional promoters, ones where there's a promoter and a gene going in this direction with no obvious gene going in the other direction, are bound by this, and almost all of those that are bound by GABP actually have a transcript going in the other direction when we look in transfection experiments. The other thing that Patrick did is that he added the GA-binding protein site, this consensus site, into some of those unidirectional promoters, and now it makes those work in the opposite direction. So this is probably a very strong activator protein. It's like VP16, which some of the transcription folks might know: a very, very strong activator. We suspect that it just helps to grab RNA polymerase and its components to transcribe maybe a little bit more promiscuously than another promoter would. So, to get close to wrapping up here, if you think about what I just talked about with regard to using short-read, ultra-high-throughput sequencing methods, the key there is actually ultra-high throughput, ultra cheap. You can sequence with regular methods and do this; you just spend a fortune doing it. And so for a few hundred dollars or maybe a thousand dollars you can do a whole interactome for many of the transcription factors, and that's actually substantially cheaper than doing the ChIP-arrays or ChIP-chip, which you have to do often in triplicate or even more to get reliable or even partially reliable data.
If you think about that, that's a census, a sequence census method. You're counting. It's a digital readout, like some of the others referred to, and you could apply this to a lot of things. We've actually applied it not only to sequence-specific binding proteins but obviously to chromatin proteins. Eric Lander's group is doing this; many others are doing this now as well. It's telling us a lot about what the global structure of the genome looks like in a living cell. But why not use it to count RNAs? And so many groups, Rick Wilson's group does this, many others as well, use it instead of microarrays to figure out what RNAs are there. Barbara Wold's lab has been really trying to develop this as a way to look at alternative splices as well, and it's working really well. I don't have time to tell you, but we've applied this to a methylation method where we're looking globally at all the CpG islands at once to see which ones are methylated and which ones are not methylated in particular cells. And so again it's like a snapshot, and I'll bet you there are many others, or at least ones we haven't thought about, that are counting methods that could come in handy. So I'll end with thanks. A lot of this work was started by Nathan and Shelley when they were graduate students in the lab. ChIP-seq was developed with Barbara Wold's lab and her student Ali Mortazavi and my postdoc Dave Johnson, and then Zhiping and these others as well. I'd like to thank them. So I'll end there.
Questions from the audience. Rick, it seems to me that ChIP-seq still requires good antibodies. And where do we stand on that if we're really going to do any sort of comprehensive profiling?
So that's the problem with ChIP in general: you have to have an antibody that is specific for the protein that you're studying, and actually proving that it's specific is non-trivial.
NIH should, and I hope they will, set up a trans-NIH effort to make monoclonal antibodies for every single protein in the human genome. We've got to do it. It's crazy not to. NHGRI is taking a bold step by having us do, I think, a hundred and something of the factors ourselves. I'm not sure if the other groups will be doing those as well. But for 1,500 or so DNA-binding proteins we have a long way to go on that. It's surprising: monoclonals work better. In general the three or four best ones we've had have been monoclonals, and the advantage of that is that you can probably get your specificity, and they do immunoprecipitate well as long as you screen enough of them. We probably need two antibodies for each protein to get some sense of specificity as well. Any questions from the floor? If not...